Fine-Tuning LLMs For Testers: A Cost-Effective Alternative to GPT
The surge of Large Language Models (LLMs) like GPT has undoubtedly revolutionized the way we approach natural language understanding and generation, and the testing domain is no exception. However, as robust and powerful as these models are, their operational cost can be prohibitive, especially for a wide range of simple tasks such as user-session labeling, as we do in Gravity (E2E tests that cover what really matters to your users), where efficiency and speed are key. Fortunately, fine-tuning lightweight LLMs presents a promising, cost-effective alternative. In this article, we will explore why heavyweight models like GPT may be overkill for many of your testing use cases and how fine-tuning smaller, more economical models can yield excellent results without breaking the bank.
First, we’ll cover why fine-tuning is such a powerful tool and why it can be the right choice for your LLM needs: it improves a model’s ability to execute a specific testing task. Then we will show a real-life example we implemented at Smartesting.
The Cost Challenge of Using LLMs
Before we dive into the whys and hows, let’s address the elephant in the room: cost. GPT and similar LLMs require significant computational resources, not only for training but for inference as well. For repetitive tasks like labeling user sessions, where perhaps thousands or millions of instances must be processed daily, these costs compound dramatically. For instance, using one of the latest GPT models, GPT-4, to label website user sessions can cost up to $0.02 per request depending on the size of your sessions. For a small dataset of only a thousand sessions, that is roughly $20 to label everything. On the other hand, a lightweight LLM hosted on a cloud GPU instance will cost you less than $4 for an hour, with similar results.
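To put numbers on it, here is the back-of-the-envelope arithmetic (the GPU hourly rate is an assumed figure for a single-GPU cloud instance):

```python
# Back-of-the-envelope cost comparison using the figures from this article.
sessions = 1_000
gpt4_cost_per_request = 0.02   # up to $0.02 per GPT-4 labeling request
gpu_hourly_rate = 4.00         # assumed rate for a single-GPU cloud instance

print(f"GPT-4 labeling: ${sessions * gpt4_cost_per_request:.2f}")  # $20.00
print(f"Self-hosted lightweight LLM: ${gpu_hourly_rate:.2f} for an hour")
```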
Financial considerations are not the only ones that weigh heavily: the environmental impact of running such large-scale LLM operations is non-negligible. It is also our responsibility to seek more sustainable methods when viable alternatives exist.
What is Fine-Tuning?
LLMs are trained on large corpora of data drawn from sources such as Wikipedia pages, code repositories like GitHub, and Q&A sites like StackExchange. These corpora generally weigh several terabytes. Models like GPT-4 need days of training on expensive compute clusters to achieve good results. These trained models are what we call foundational models. A foundational model, often known as a “base model” or a “pre-trained model,” constitutes the fundamental architecture serving as the cornerstone for more specialized models. Typically characterized by their considerable size and extensive vocabulary, these models possess a broad understanding of language. However, further training is necessary to customize them for specific tasks. ChatGPT, for example, requires extra training on top of GPT (which is a foundational model) to learn to answer in a chat manner.
This extra training step is what we call fine-tuning. It is an additional round of training for the foundational model on a reduced dataset specific to a given task. To continue with the ChatGPT example, its fine-tuning dataset can be composed of questions asked by users paired with expected answers in a chat format.
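For illustration, such a fine-tuning dataset is often stored as a JSONL file with one prompt/answer pair per line. A hypothetical entry might look like this:

```json
{"prompt": "How do I reset my password?", "answer": "Go to Settings > Account, then click 'Reset password' and follow the email instructions."}
```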
Why Fine-Tune Smaller LLMs?
This brings us to the core proposition: fine-tuning lightweight LLMs. These smaller LLMs have come a long way and now offer an impressive balance between performance and cost-efficiency. With fine-tuning, we can tailor them to specific domains or tasks, thus maximizing their efficacy.
But what makes fine-tuning an attractive proposition compared to directly using big LLMs like GPT-4? Here are some compelling reasons:
- Faster Turnaround: Smaller models use less memory and compute resources, translating to faster training and inference times.
- Budget-Friendly: Reduced computational requirements lead to lower operating costs, making it sustainable to run these LLMs more frequently.
- Greener Footprint: Less energy consumption corresponds to a more environmentally friendly choice.
- Data Control: All the data (even sensitive data) stays locally under your control, making it easier to comply with data privacy policies.
Now comes the last question: how hard is it? The answer: not that hard! Let me show you a concrete example for testers.
Experience Report: Fine-Tuning Mistral-7B for User Session Labeling
The goal here is to fine-tune Mistral-7B, the LLM from the French company Mistral AI, so that it labels user sessions accurately; those labels later help classify sessions into user journeys. In Gravity, anonymized user sessions come in the form of a JSON file representing the different actions a user took on the website under test. The goal is to automatically generate a short business sentence describing the session in natural language, at least as well as GPT-4 would. In our experiments, the Mistral base model is often too wordy, and we would like to fine-tune it so that its answers are always sentences of 8 words or fewer. Even when we include this constraint in the prompt, it is frequently not enough. The fine-tuning goal is to fix this behavior.
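To make this concrete, here is a simplified, hypothetical session (the real Gravity format is richer) together with the kind of label we are aiming for:

```json
{
  "sessionId": "a1b2c3",
  "actions": [
    {"type": "click", "target": "a.login"},
    {"type": "fill", "target": "#email", "value": "***"},
    {"type": "fill", "target": "#password", "value": "***"},
    {"type": "click", "target": "button[type=submit]"},
    {"type": "click", "target": "a.order-history"}
  ]
}
```

Target label: “User logs in and checks order history” (7 words).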
Step 1: Select Your Model
The decision is based on your computational budget and the complexity of the task. For this case study, we decided to go with Mistral’s popular LLM, Mistral-7B. But this choice is entirely ours, and other models already trained on code, like Code Llama, could be a better alternative for other testers’ use cases.
Step 2: Preparing Your Dataset
Gather a diverse set of labeled data on which to fine-tune your LLM. If you’re starting from scratch, you may need to manually label a batch of sessions to create a training set and a test set. For our session-labeling use case, we took the easy way and gave our sessions to GPT-4 so it would create the labels for our training dataset. We obtained satisfying results with a dataset of only 300 sessions. However, this number can vary depending on the complexity of your task and the size of your LLM.
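Here is a minimal sketch of how such a bootstrap can look with the OpenAI Python client; the file names, prompt wording, and 80/20 split are illustrative assumptions, not our exact setup:

```python
# Bootstrap a labeled dataset by asking GPT-4 to label raw sessions,
# then split it into a training set and a test set.
import json
import random

from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY set

client = OpenAI()

def gpt4_label(session_json: str) -> str:
    """Ask GPT-4 for a short business sentence describing the session."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "Describe this user session in one business sentence "
                       f"of 8 words or fewer:\n{session_json}",
        }],
    )
    return response.choices[0].message.content.strip()

with open("sessions.jsonl") as f:
    sessions = [json.loads(line) for line in f]

labeled = [{"session": json.dumps(s), "label": gpt4_label(json.dumps(s))}
           for s in sessions]

# 80/20 train/test split with a fixed seed for reproducibility.
random.seed(42)
random.shuffle(labeled)
split = int(0.8 * len(labeled))
for name, rows in [("sessions_train.jsonl", labeled[:split]),
                   ("sessions_test.jsonl", labeled[split:])]:
    with open(name, "w") as f:
        f.writelines(json.dumps(row) + "\n" for row in rows)
```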
Step 3: Fine-Tuning
Time to get your hands dirty! Libraries such as Hugging Face’s Transformers make this step more accessible. Starting from the base model you chose, fine-tuning can take different forms, but the simplest is Parameter-Efficient Fine-Tuning (PEFT). Without going into too many details, PEFT focuses on the most important parameters to fine-tune and creates an adapter for your use case that sits on top of the base model. This makes fine-tuning quicker and more space-efficient, for almost no degradation in performance. You can find more information in the Hugging Face PEFT documentation.
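Below is a minimal sketch of what this looks like with Transformers and the PEFT library, using LoRA (a popular PEFT technique). The prompt template, dataset files, and hyperparameters are illustrative assumptions; a single GPU with enough memory for a 7B model in bfloat16 is assumed as well:

```python
# Minimal LoRA fine-tuning sketch with Hugging Face Transformers, PEFT and
# Datasets. Prompt template, file names and hyperparameters are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto")

# LoRA trains small low-rank adapter matrices while the 7B base stays frozen.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common choice
    task_type="CAUSAL_LM",
))
model.print_trainable_parameters()  # typically well under 1% of all parameters

def to_text(example):
    # Hypothetical prompt template: session JSON in, short label out.
    return {"text": f"Session:\n{example['session']}\n"
                    f"Label: {example['label']}{tokenizer.eos_token}"}

dataset = load_dataset("json", data_files="sessions_train.jsonl")["train"]
dataset = dataset.map(to_text)
dataset = dataset.map(
    lambda e: tokenizer(e["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mistral-7b-session-labels",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model.save_pretrained("mistral-7b-session-labels/adapter")  # adapter only
```

Because only the adapter weights are trained, the artifact you save at the end is typically a few dozen megabytes instead of a full multi-gigabyte 7B checkpoint.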
Step 4: Validation of the approach
After fine-tuning, validate your LLM’s performance on your separate test set. In our case, the first metric we checked was the cosine similarity between the GPT-4 labels and those from the fine-tuned model. The other main metric was the length of the labels. Indeed, our main goal was to shorten Mistral-7B’s answers. On our test set, the average label length went from 15 words to 7, as we managed to eliminate all the multi-sentence hallucinations.
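Here is a sketch of how these two metrics can be computed, using sentence-transformers for the embeddings (the encoder choice and the example labels are assumptions):

```python
# Sketch of the two validation metrics: cosine similarity between GPT-4
# reference labels and the fine-tuned model's labels, and average label
# length. The encoder and the example labels are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

reference = ["User logs in and checks order history",
             "User searches catalog and adds item to cart"]   # GPT-4 labels
predicted = ["User logs in to check order history",
             "User adds a catalog item to cart"]              # fine-tuned model

encoder = SentenceTransformer("all-MiniLM-L6-v2")
ref_emb = encoder.encode(reference, convert_to_tensor=True)
pred_emb = encoder.encode(predicted, convert_to_tensor=True)

# Diagonal of the similarity matrix = similarity of each matched pair.
similarities = util.cos_sim(ref_emb, pred_emb).diagonal()
print(f"mean cosine similarity: {similarities.mean().item():.3f}")

avg_len = sum(len(label.split()) for label in predicted) / len(predicted)
print(f"average label length: {avg_len:.1f} words")  # target: 8 or fewer
```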
Step 5: What’s Next?
Once satisfied with the LLM’s performance, integrate it into your software for real-time labeling. Monitor the model’s predictions regularly to catch any drift or degradation in performance and retrain as necessary.
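For the integration step, loading the saved adapter for inference only takes a few lines; this sketch reuses the hypothetical paths and prompt template from the fine-tuning step:

```python
# Sketch of real-time labeling with the saved LoRA adapter; paths and prompt
# template match the hypothetical ones used in the fine-tuning step.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "mistral-7b-session-labels/adapter")
model.eval()

def label_session(session_json: str) -> str:
    prompt = f"Session:\n{session_json}\nLabel:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    # Keep only the tokens generated after the prompt.
    generated = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()
```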
Practical Considerations and Best Practices for Testers
Here are some tips for fine-tuning and operationalizing your lightweight LLMs:
- Understand Your Data: Context is king. Ensure your model understands the specific jargon and nuances associated with your users.
- Optimize Your Resources: Adjust batch sizes and learning rates to find the sweet spot for your compute resources without sacrificing performance.
- Automate Retraining: Integrate continuous learning pipelines that allow your model to retrain on new data, staying up-to-date with user behaviors.
Conclusion: The Smart Way Forward
By now, it should be clear that while GPT and its larger counterparts offer incredible capabilities, they are not always the most practical solution, particularly for repetitive testing tasks like the user-session labeling in Gravity we have shown today. Instead, fine-tuning lightweight LLMs can strike that perfect balance between resource management and performance.
As we stand at the edge of organic changes in the way we test, smart choices like this will define how sustainably we progress. Let’s keep a watchful eye on advancements in model optimization and always remember that sometimes, less is more.