
Mastering AI Prompting for Effective Software Testing

30.07.24

Increasing Productivity and Speed with Generative AI

Mastering AI Prompting has become indispensable! The expectations for generative AI in software testing primarily focus on gains in productivity and speed. The goal is to perform testing tasks faster and with less effort thanks to AI. These expectations are well illustrated by the annual “World Quality Report 2023,” from which the following figure is extracted: improvements in productivity and greater speed are highlighted by 65% and 53% of the survey respondents, respectively.

Source: World Quality Report 2023-24 – Page 29 https://www.opentext.com/resources/world-quality-report-2023-24

These results also align with the expectations expressed by participants in the training sessions “Accelerate Your Testing Processes with Generative AI” offered by Smartesting and its partners. Through practice, participants acquire the essential skills to productively use generative AI, master AI Prompting, and achieve the expected gains in productivity and speed.

To achieve these productivity and speed gains with generative AI, it is essential to learn how to prompt LLMs. This is what we present here, but acquiring the required skills also takes practice and feedback on that practice, as we provide during the training.

LLM Models: Rapid Progress Useful for Testing Activities

Generative AI can be used directly through an AI Chatbot via conversational prompts with a Large Language Model (LLM) or for predefined tasks through a testing tool that uses an LLM.

This article focuses on the first mode, i.e., querying an LLM for testing tasks, as this mode is directly accessible via available AI Chatbots and offers great flexibility for testers.

Since the release of ChatGPT by OpenAI in November 2022, the range of LLMs has significantly expanded, and their capabilities are rapidly progressing.

These advancements provide usage opportunities for software testing. Here are a few examples:

  • The vision module, present in the main current LLMs such as GPT-4, Claude-3, and Gemini-1.5, enables image analysis, for example screenshots of the application under test. We can thus combine textual data (user stories, acceptance criteria, bug reports, etc.) with screenshots (of the application, error messages, etc.) to give the LLM more precise input; a minimal API sketch follows this list.
  • The optimization of LLMs has improved, allowing faster and less costly processing of our queries at equal performance. Claude-3.5-Sonnet and GPT-4o-mini are examples of recent releases that are efficient, less resource-intensive, and faster in response time than previous versions.
  • The increased availability of LLMs under open-source or community licenses, such as Mixtral-8x22B or Llama-3, makes it easier to integrate generative AI on limited organizational computing resources and avoids costly subscription models.
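
To illustrate the first point, here is a minimal sketch of a text-plus-screenshot query, assuming the Anthropic Python SDK with an API key in the environment; the file name and message text are placeholders, and any vision-capable LLM API follows the same pattern:

import base64

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Encode the screenshot of the page under test (placeholder file name).
with open("shopping_cart.png", "rb") as f:
    screenshot = base64.b64encode(f.read()).decode("utf-8")

# Send the screenshot together with the textual test data in a single query.
response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png",
                        "data": screenshot}},
            {"type": "text",
             "text": "Generate Gherkin test scenarios for the following "
                     "User Story, using the attached screenshot for "
                     "concrete test data: ..."},
        ],
    }],
)
print(response.content[0].text)  # the generated scenarios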

With rapid performance improvements, new features, and enhanced accessibility, it is now up to us to exploit the capabilities of LLMs by mastering the best practices of querying – also known as “Prompt Engineering” or “Prompting.”

Learning to Prompt for Test Activities

Prompting techniques aim to design high-quality queries that guide LLM models to produce accurate and relevant outputs for the task at hand. This involves working on the structure, style, and content of the queries, and optimizing their length based on the test task to be performed by the LLM. Your results will depend on the LLM model used and your prompts.

Here are the techniques we review in this article, illustrated with test generation tasks:

  • Structured Prompt in 6 Parts: Role, Context, Instructions, Constraints, Format, Data;
  • Breaking Down Instructions into Steps to assist the model;
  • Asking the Model to Verify Its Results;
  • Meta-Prompting: Asking the model to evaluate, improve, or generate your prompt.

These techniques are simple to use but essential for achieving quality outcomes.


Implementation for Generating Test Cases for a User Story and Acceptance Criteria

Our test subject is the demo application of the open-source Spree Commerce platform, which provides a front-end/back-end structure for developing an online store. The User Story that serves as the basis for our test case generation is the management of the shopping cart, as shown in the screenshot below.

Shopping Cart Screenshot – obtained with the Spree Commerce demo application – https://spree-multi-vendor-demo.herokuapp.com/ – July 2024

Here is the User Story “Managing Cart Items” that we will be testing, along with its acceptance criteria:

As a customer of the e-commerce site,

I want to be able to change the quantity of items in my cart,

So that I can see the prices and the total of the cart update accordingly.

Acceptance Criteria:

  • The quantity of an item should be modifiable using the “-” and “+” buttons.
  • The update of an item’s quantity should be immediate.
  • The total per item should be updated immediately after changing the quantity (unit price * quantity).
  • The cart subtotal (sum of item totals) should be updated immediately after any change in the quantity of one or more items.
  • Each item line should have a trash icon to allow the item to be removed from the cart.
  • Removing an item from the cart should update the cart subtotal immediately.

Here is the structured prompt we are using:
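
(The wording below is an illustrative sketch of the prompt’s six parts, Role, Context, Instructions, Constraints, Format, and Data; it is not the verbatim text used in the experiment.)

#Role
You are an experienced software tester specialized in functional test design.

#Context
We are testing the shopping cart page of an e-commerce web application.

#Instructions
1. From the User Story and acceptance criteria below, identify the equivalence classes (valid and invalid), with a short reasoning for each.
2. Then generate test scenarios covering all acceptance criteria, using concrete test data taken from the attached screenshot.
3. Finally, verify that each acceptance criterion is covered by at least one scenario and report this self-verification.

#Constraints
Use only the information provided; do not invent features that are not stated in the acceptance criteria or visible in the screenshot.

#Format
Write the scenarios in Gherkin format (Feature, Background, Scenario, Scenario Outline).

#Data
<the User Story "Managing Cart Items", its acceptance criteria, and the screenshot of the cart page>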


Here are the results obtained with the LLM Claude-3.5-Sonnet (dated July 2024):

For this User Story “Managing Cart Items,” the analysis of equivalence classes and the corresponding test scenarios are as follows:

Equivalence Classes

1. Quantity Modification

  • Valid classes:
    • Positive integers (1, 2, 3, …)
    • Typical values: 1, 5, 10
    • Boundary values: 1 (minimum), maximum allowed (e.g., 99 or 999, depending on system limits)
  • Invalid classes:
    • Zero (0)
    • Negative integers (-1, -2, …)
    • Non-integer values (1.5, 2.3, …)
    • Extremely large values (e.g., 1000000, depending on system limits)
    Reasoning: The quantity should be a positive integer. Zero or negative quantities don’t make sense for a shopping cart, and non-integer values are not typically allowed for item quantities.

2. Price

  • Valid classes:
    • Positive numbers with two decimal places
    • Typical values: $10.00, $50.00, $100.00
    • Boundary values: $0.01 (minimum), maximum allowed (e.g., $9999.99, depending on system limits)
  • Invalid classes:
    • Negative numbers
    • Numbers with more than two decimal places
    Reasoning: Prices are typically positive numbers with up to two decimal places. Negative prices or prices with more precision are usually not valid in e-commerce systems.

3. Total per Item

  • Valid classes:
    • Positive numbers with two decimal places
    • Calculated as (unit price * quantity)
  • Invalid classes:
    • Any value that doesn’t match the calculation (unit price * quantity)
    Reasoning: The total per item should always be the result of multiplying the unit price by the quantity.

4. Subtotal

  • Valid classes:
    • Positive numbers with two decimal places
    • Calculated as the sum of all item totals
  • Invalid classes:
    • Any value that doesn’t match the sum of all item totals
    Reasoning: The subtotal should always be the sum of all item totals in the cart.

5. Item Removal

  • Valid classes:
    • Removing an existing item
  • Invalid classes:
    • Attempting to remove a non-existent item
    Reasoning: Users should only be able to remove items that are actually in the cart.

Now, let’s create the test scenarios in Gherkin format:

Feature: Cart Item Management

  Background:
    Given the user is on the shopping cart page
    And the shopping cart contains the following items:
      | Product       | Price  | Quantity | Total  |
      | Checked Shirt | $53.99 | 1        | $53.99 |
      | Polo T Shirt  | $36.99 | 2        | $73.98 |
    And the subtotal is $127.97

  Scenario: Increase item quantity
    When the user clicks the "+" button for the "Checked Shirt"
    Then the quantity for "Checked Shirt" should immediately update to 2
    And the total for "Checked Shirt" should immediately update to $107.98
    And the subtotal should immediately update to $181.96

  Scenario: Decrease item quantity
    When the user clicks the "-" button for the "Polo T Shirt"
    Then the quantity for "Polo T Shirt" should immediately update to 1
    And the total for "Polo T Shirt" should immediately update to $36.99
    And the subtotal should immediately update to $90.98

  Scenario: Remove item from cart
    When the user clicks the trash can icon for the "Checked Shirt"
    Then the "Checked Shirt" should be immediately removed from the cart
    And the subtotal should immediately update to $73.98

  Scenario: Attempt to decrease quantity below 1
    Given the quantity of "Checked Shirt" is 1
    When the user clicks the "-" button for the "Checked Shirt"
    Then the quantity for "Checked Shirt" should remain 1
    And the total for "Checked Shirt" should remain $53.99
    And the subtotal should remain $127.97


  Scenario Outline: Update quantity using multiple clicks
    When the user clicks the "<button>" button for the "<product>" <clicks> times
    Then the quantity for "<product>" should immediately update to <final_quantity>
    And the total for "<product>" should immediately update to $<final_total>
    And the subtotal should immediately update to $<final_subtotal>

    Examples:
      | product       | button | clicks | final_quantity | final_total | final_subtotal |
      | Checked Shirt | +      | 3      | 4              | 215.96      | 289.94         |
      | Polo T Shirt  | -      | 1      | 1              | 36.99       | 90.98          |
      | Polo T Shirt  | +      | 5      | 7              | 258.93      | 312.92         |

  Scenario: Verify immediate update of item quantity
    When the user clicks the "+" button for the "Checked Shirt"
    Then the quantity for "Checked Shirt" should update to 2 without page refresh

  Scenario: Verify immediate update of item total
    When the user clicks the "+" button for the "Checked Shirt"
    Then the total for "Checked Shirt" should update to $107.98 without page refresh

  Scenario: Verify immediate update of subtotal
    When the user clicks the "+" button for the "Checked Shirt"
    Then the subtotal should update to $181.96 without page refresh

  Scenario: Verify presence of trash can icon
    Then each item in the cart should have a trash can icon for removal

  Scenario: Attempt to add invalid quantity
    When the user attempts to enter "-1" as the quantity for "Checked Shirt"
    Then the quantity for "Checked Shirt" should remain 1
    And the total for "Checked Shirt" should remain $53.99
    And the subtotal should remain $127.97

  Scenario: Verify correct calculation of item total
    When the user changes the quantity of "Polo T Shirt" to 3
    Then the total for "Polo T Shirt" should be exactly $110.97

  Scenario: Verify correct calculation of subtotal
    Given the shopping cart contains the following items:
      | Product       | Price  | Quantity | Total   |
      | Checked Shirt | $53.99 | 2        | $107.98 |
      | Polo T Shirt  | $36.99 | 3        | $110.97 |
    Then the subtotal should be exactly $218.95

  Scenario: Remove all items from cart
    When the user removes all items from the cart
    Then the cart should be empty
    And the subtotal should be $0.00

  Scenario: Verify maximum quantity limit
    When the user attempts to set the quantity of "Checked Shirt" to 1000
    Then an error message should be displayed
    And the quantity for "Checked Shirt" should remain at the maximum allowed

  Scenario: Verify minimum quantity limit
    Given the quantity of "Checked Shirt" is 1
    When the user attempts to set the quantity of "Checked Shirt" to 0
    Then the quantity for "Checked Shirt" should remain 1
    And the total for "Checked Shirt" should remain $53.99
    And the subtotal should remain $127.97

These additional scenarios cover various aspects of the user story and acceptance criteria, including:

  1. Multiple quantity updates
  2. Immediate updates without page refresh
  3. Presence of trash can icons
  4. Handling of invalid inputs
  5. Correct calculation of item totals and subtotal
  6. Removing all items from the cart
  7. Testing maximum and minimum quantity limits

Self-verification:

  1. Modifying quantity using “-” and “+” buttons: Covered in multiple scenarios
  2. Immediate update of item quantity: Specifically covered in “Verify immediate update of item quantity” scenario
  3. Immediate update of item total: Specifically covered in “Verify immediate update of item total” scenario
  4. Immediate update of basket subtotal: Specifically covered in “Verify immediate update of subtotal” scenario
  5. Presence of trash icons on each item line: Covered in the “Verify presence of trash can icon” scenario

Analysis of the Results Obtained by the AI

Several observations emerge from the results obtained with the structured prompt and the provided data. We analyze several aspects of test quality below.

Test Data and Scenario Variability

We can see that adding the screenshot of the page under test (the cart) to the prompt allows the model to determine precise test data, referencing what is actually displayed. This is highly relevant and enables dynamic management of test-data variability: changing the screenshot to display different items in the cart is enough to obtain Gherkin scenarios with different data.

Writing Gherkin Scenarios and Completeness of Acceptance Criteria Coverage

Here are some observations from analyzing the Gherkin scenarios generated by the LLM:

  • Coverage of Acceptance Criteria: The coverage of the acceptance criteria is good and in line with what the prompt requests. Splitting the instructions into two steps, one for equivalence classes and the other for generating scenarios, is effective.
  • Calculation of Expected Results: Manual verification shows that the calculations of the expected results are correct for all test scenarios; the “Total” and “Subtotal” values used to establish the verdict are accurate (a quick worked check follows this list).
  • Use of the “Background” Clause: The “Background” in Gherkin defines the initial conditions common to all scenarios of a feature, avoiding repetition and improving readability. It sets up the basic test environment, allowing individual scenarios to focus solely on their specific actions and verifications. The Background generated by the LLM relies entirely on the screenshot, which allows for dynamic updates: providing another screenshot in the prompt data is enough for the LLM to regenerate it.
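
For example, the expected results of the first scenario can be checked in a few lines (our own verification in Python; the values come from the Background table):

# Check of the "Increase item quantity" scenario:
# "Checked Shirt" goes from quantity 1 to 2 at a unit price of $53.99,
# while "Polo T Shirt" keeps its total of $73.98.
unit_price = 53.99
new_total = round(unit_price * 2, 2)        # expected item total: 107.98
new_subtotal = round(new_total + 73.98, 2)  # expected cart subtotal: 181.96
assert new_total == 107.98 and new_subtotal == 181.96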

The result is seven correct, precise scenarios that are directly aligned with the provided data (User Story with its acceptance criteria, Screenshot).

This assessment is positive in terms of our productivity and speed expectations: the proposed prompt is generic, and the User Story and screenshot data are easy to obtain. The LLM produces the result in a few seconds, and verifying it takes a few minutes. This leaves room to iterate with the LLM, for example to request additional scenarios and expand the coverage of acceptance criteria if desired.
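
A follow-up prompt in the same conversation might look like this (illustrative wording, not taken from the original experiment):

#Instructions
Generate three additional test scenarios in Gherkin format for the User Story "Managing Cart Items", focusing on cases not yet covered by the scenarios above. Keep the same Background and the same output format.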


Meta-Prompting: Improving the Prompt by the LLM

Meta-prompting involves asking the LLM to improve an initial prompt. This technique saves time: the LLM’s suggestions help clarify and specify the instructions, reduce ambiguities, and identify and correct gaps, making it easier to refine the prompt through successive trials and obtain more relevant results.

Here is a meta-prompt for our previously discussed Gherkin scenario generation prompt:

#Instructions

Here is a prompt for generating test scenarios in Gherkin format. Analyze this prompt and suggest any possible improvements.

#Prompt to Analyze

<Copy and paste the prompt to be analyzed and improved>

With Claude-3.5-Sonnet, here is the result of this meta-prompt:

The prompt provided is generally well-structured and comprehensive. However, there are a few areas where it could be improved to make it even more effective:

  1. Clarity on Test Data Source:
    • Explicitly state that test data should be derived from both the user story/acceptance criteria and the provided screenshot.
    • Encourage the use of real product names, prices, and quantities from the screenshot in the test scenarios.
  2. Performance Considerations:
    • Add a note about including scenarios that test the responsiveness of updates (e.g., testing with a large number of items or rapid consecutive changes).
  3. Error Handling:
    • Include a point about creating scenarios for error conditions (e.g., network issues, server errors) and how the system should behave in these cases.
  4. Usability Testing:
    • Suggest including scenarios that test the user experience, such as keyboard navigation or accessibility features.
  5. Edge Cases:
    • Explicitly mention the need to consider edge cases, such as maximum allowed quantity or price precision issues.
  6. Mobile Responsiveness:
    • If applicable, add a note about creating scenarios to test the cart’s behavior on different device sizes.
  7. Internationalization:
    • If relevant, suggest including scenarios for different currency formats or language settings.
  8. Security Considerations:
    • Add a point about including scenarios that test for basic security issues, such as input validation.
  9. Background Context:
    • Suggest including a “Background” section in the Gherkin scenarios to set up the initial state of the cart.
  10. Consistency Checks:
    • Add a point about verifying consistency between different parts of the UI (e.g., individual item totals matching the subtotal).
  11. Output Format:
    • Here’s a suggested addition to the prompt:

#Additional Considerations
- Include scenarios for performance testing, error handling, and usability.
- Consider edge cases, mobile responsiveness, and internationalization if applicable.
- Include basic security testing scenarios.
- Use a "Background" section in your Gherkin scenarios to set up the initial cart state.
- Include scenarios that verify consistency across the UI.
- Provide your scenarios in a single Markdown code block.

By incorporating these suggestions, the prompt will guide the creation of more comprehensive and robust test scenarios.

Improved Version of the Prompt

Finally, the LLM produced an improved version of the prompt that incorporates all the suggested improvements, including clearer instructions on data sources, additional testing considerations, and specific output formatting requirements. This more comprehensive framework should result in more thorough and well-structured test scenarios that cover a wider range of use cases and potential issues.


Explanatory Note on the Choice of LLM and the Experiments Conducted

We chose Claude-3.5-Sonnet because this generative AI model performs well on testing activities that combine text and screenshots in a query. The version used is dated June 20, 2024. We ran the various queries through the Smartesting LLM portal, an internal portal with an educational purpose, used for learning AI prompting techniques. It provides access to more than 12 LLMs, selected among the most relevant for automating testing activities.

Conclusion

In this article, we illustrated the importance of learning how to master AI prompting to obtain good results with generative AI for software testing activities. The example we used is a common activity of designing and writing test scenarios in Gherkin format. For other testing tasks, such as analyzing User Stories, optimizing existing test cases, generating automated test scripts, and analyzing bug reports, the same observation applies: to achieve good results and effectively use generative AI, mastering prompting techniques is essential.

Mastering prompting techniques is accessible to all testers, and these techniques will be used increasingly as AI Chatbots become available in companies and their usage becomes part of testers’ daily routines to gain productivity and speed.

It is a skill acquired through training and practice. Learning and mastering AI prompting is one of the goals of the training course “Accelerate Your Testing Processes with Generative AI.” The training offers 8 workshops on software testing use cases, provides access to 12 different LLMs for practice, under open-source or commercial licenses, and presents the various prompting techniques useful for software testing, studied and experimented with during the workshops. Nearly two-thirds of the training is dedicated to guided practice, building the know-how needed to use generative AI properly in software testing.


Stay tuned!
