The Future of Software Testing: Harnessing Vision-Language Models
You’re probably familiar with LLMs – Large Language Models – the generative AI models used by OpenAI’s ChatGPT or Google’s Gemini. These LLM models have recently been enhanced with a so-called vision capability, which allows them to analyze images (e.g. diagrams, screenshots) combined with reasoning and text generation capabilities. We call these mixed models VLMs – Vision-Language Models.
For us AI-enhanced testers, this is a big step forward!
In my previous blog on GenAI prompting techniques, I mentioned the contribution of vision within VLMs to improving the responses of generative AIs to testing tasks such as test data generation. This new article focuses on the contribution of vision and AI generation, a significant step forward for software testing augmented by AI.
Vision-Language Models in Practice
Large language models are trained on large textual datasets to generate new text, such as test cases or scripts. Recently, these LLMs have integrated a vision module. These vision modules have evolved rapidly, reaching outstanding capabilities in the latest available versions of LLMs. For example GPT-4o, Claude-3.5, and Gemini-1.5, but also in LLMs under open source license, such as LLaVA.
The vision module adds the ability to interpret images into text. This allows the model query to be enriched by combining text with one or more images, such as screenshots or application specification diagrams.
Note that these vision modules have nothing to do with AI image generators such as OpenAI’s DALL-E or Midjourney. A vision module interprets an image and produces text, while image generators produce images from a textual query.
Let’s take an example to illustrate the contribution of vision to LLM. Here’s a screenshot of a demonstration eCommerce site – see https://spree-multi-vendor-demo.herokuapp.com/
Consider the following prompt for a generative AI with vision, including the previous screenshot added to the prompt:
“Here is a screenshot of an online store. What items are shown in the picture? Can I find a shirt? At what price?“
Here’s the answer with Anthropic’s Claude 3.5 Sonnet:
This image shows a screenshot of an online fashion store’s website. The page displays several clothing items for sale:
- Flounced Dress – $68.99
- Suede Biker Jacket – $95.95
- Skinny Pants – $58.99
- Checked Slim Fit Shirt – $48.99
- Long Sleeve Jumper – $74.95
- Elegant Flared Dress – $71.95
To answer your specific questions:
Yes, you can find a shirt in this image. The “Checked Slim Fit Shirt” is listed at a price of $48.99.
The page also includes filter options on the left side for color, length, size, and price. At the top, there’s a banner image featuring two models in black turtlenecks. And the website’s navigation menu includes options for Shop, Brands, Start Selling, and Contact Us.
Not bad, is it?
This response from the LLM shows the different capabilities of vision modules: identifying the different visual components of the image (here, for example, the photos of the items for sale), recognizing written text and, if necessary, translating it into the language of the request, combining these different elements to provide an understanding of the image thanks to the combination with the textual capabilities of the LLMs.
This opens up some very interesting prospects for testing activities. For example, we’ll be able to give context to the AI model by combining images and textual data:
- User stories with acceptance criteria and screen captures to generate test cases and test data;
- Functional requirements and diagrams describing application behavior for AI-driven test design;
- Anomaly reports from automated tests and screenshots at the time of anomaly detection for analysis and synthesis of problems encountered during testing;
- User story acceptance criteria and images of IH/M mock-ups of the application to analyze the completeness and correctness of these acceptance criteria;
- etc.
Testing with Vision-enhanced Generative AI
Generative AI has entered the daily lives of many testers, and the use cases are becoming clearer, helping to increase productivity and speed in software testing activities.
At Smartesting, we’ve been developing AI-based testing tools for years, and generative AI is a big step forward:
- for YEST, visual test design tools, to optimize and refactor a test suite;
- for Gravity, a testing platform focused on user journey analysis, to interpret user journeys and facilitate the analysis of usage coverage by tests;
- Also, through our generative AI for testing training, which is based on 8 practical workshops to develop the AI know-how of software testing professionals.
Generative AI boosts the productivity of test teams across a wide range of test activities, and these benefits are reinforced by the capabilities provided by these LLMs with vision modules. In the table below, we present four groups of test activities, listing examples of tasks benefiting from generative AI and an example of using the vision module.
Table: generative AI with vision for four groups of test activities.
Testing activities | Where GenAI can help | What the vision module can do |
Test Analysis | – Automate requirements analysis. – Identification of test conditions. | – Interpreting IH/M mockups combined with requirements / User Stories. |
Test Design | – Automated test scenario generation. – Optimize test scenarios for maximum coverage with minimum testing. | – Use application screenshots to generate test cases covering acceptance criteria and test conditions. |
Test Implementation | – Definition of test data required for manual testing- Generation of automated test scripts. | – Generate concrete test data from analyzing screenshots in pilot use or exploratory testing. |
Test Result Analysis | – Analyze test results to identify anomalies.- Synthesize different test results and provide trends. | – Visual analysis to compare expected results with screenshots of results obtained, facilitating anomaly detection. |
This table is not exhaustive, but it does give examples of how generative AI can be used with vision for testing, to gain in productivity and velocity in our day-to-day activities.
In the remainder of this article, we illustrate the combined power of LLM’s vision and generation capabilities on an example linked to test analysis: the generation of acceptance criteria from a User Story and a mock-up of the IH/M of the page concerned in the application under test.
Generate acceptance criteria for a given User Story by analyzing an IH/M mockup
This example takes as input an IH/M mockup and a User Story to ask the LLM to generate acceptance criteria.
The IH/M mockup used is for the payment page of an eCommerce site – Source
: https://ecommerceuxdesign.com/post/checkout-types-2 – August 2024
We used AI to help us formulate the acceptance criteria linked to our User Stories. Here’s a sample prompt for this task, with the previous image attached:
#Context
I am working on testing an eCommerce web application. The attached image is a mockup of the payment page. Your task is to help me create the acceptance criteria for the delivery options user story.
#Instructions
1- For the user story below, write a list of acceptance criteria linking the different aspects of the IH/M given in the attached image.
2- Check that the acceptance criteria you have created are complete and consistent with the attached image.
#Data
User Story :
As a customer, I want to be able to choose between different delivery options (standard, express, next business day) to receive my order according to my time and cost preferences.
Here is the LLM response (with Claude 3.5 Sonnet):
Based on the mockup image provided and the user story, here’s a list of acceptance criteria for the delivery options feature:
- The payment page should display a “SELECT SHIPPING OPTION” section.
- Three delivery options should be available for the customer to choose from:
- Standard Delivery
- Express Delivery
- Next Business Day
- Each delivery option should have an associated radio button for selection.
- The cost for each delivery option should be clearly displayed next to the option:
- Standard Delivery: $0.00 (Free)
- Express Delivery: $9.95
- Next Business Day: $29.95
- Only one delivery option can be selected at a time.
- The selected delivery option should be reflected in the order summary on the right side of the page under “Shipping”.
- When a delivery option is selected, the total order amount should be updated accordingly, including the shipping cost.
- The default selected option should be Standard Delivery (free).
- The delivery options should be presented after the shipping address section and before the payment method section.
- The layout and styling of the delivery options should be consistent with the rest of the form, using the same font, color scheme, and overall design.
- The delivery options section should be responsive and display correctly on various screen sizes.
- Selecting a different delivery option should not require a page reload; the order summary should update dynamically.
These acceptance criteria cover the main aspects of the delivery options feature as shown in the mockup image and described in the user story. They ensure that customers can choose between different delivery options based on their time and cost preferences, and that the selection is properly integrated into the overall checkout process.
Here are a few observations from this experience:
Prompt Writting
- The prompt used is structured in three parts Context / Instructions / Data. Structuring prompts for generative AI is a good practice that improves the output.
- The prompting technique used combines the generation request with a request for self-checking of the result by the LLM (cf. item 2 of the instruction given in the request) in order to increase its quality.
Results analysis:
LLM uses not only the analysis of the attached image but also its own knowledge of web application testing (see points 11 and 12 above).
It’s the power of LLM’s training on a large amount of data (including eCommerce sites) that complements the image analysis provided: LLM’s vision and generation capabilities combine to create a powerful tool for testing web applications.
The acceptance criteria
The acceptance criteria generated are coherent and fairly complete, but of course, they can be detailed or completed. The AI’s contribution here is to assist and make suggestions, not to replace the Test Analyst.
Do we have saved time with generative AI in this example.
This certainly depends on your level of business expertise. For me, the answer is clearly yes. It takes 3 minutes to create the query (for someone trained in generative AI and prompting techniques). And 10 seconds to get the AI result, which can be a starting point for further work on acceptance testing scenarios.
Choosing the LLM with a vision module:
As shown in the screenshot below, 3 generative AI models with vision modules were put into competition (i.e. processing the same query in parallel): Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro. In our example, the results are very close, and it’s hard to pick a winner. However an immediate benefit of this 3-LLM generation is that it allows the result of one LLM to be supplemented by the result of another LLM, with a gain in completeness. In addition, we’re seeing rapid progress on previous versions (Claude-3, GTP-4, and Gemini v1). This bodes well for future versions of these vision-enhanced LLMs, which will continue to improve both in their interpretation of the images provided and in their ability to reason in response to the query.
Conclusion
The availability of the vision module strengthens generative AI’s capabilities for our testing activities. We can now combine our textual and visual inputs to prompt a GenAI model. It’s a real plus, as it speeds up the creation of queries by giving the LLM precise contextual information by combining images and text.
This new ability to combine the vision and generation of today’s LLMs opens up many new use cases for testing activities, including test analysis, test design, test implementation, and anomaly report analysis. This list is far from exhaustive, as there are many more use cases to be discovered and experimented with.
However, generative AI can sometimes give incorrect and/or incomplete answers. To use it correctly, software testing professionals need to know how to use it, be trained, and use testing tools that seamlessly integrate generative AI. This is what we offer at Smartesting, both in our YEST and GRAVITY products, which seamlessly integrate AI services, and in our AI for Testing training, which aims to provide the necessary knowledge and know-how through hands-on practice.