Key information

Manage AI variability – Run tests multiple times to account for GenAI’s non-determinism. Confidence comes from results like “98 out of 100 runs successful,” not from a single good answer.
Catch regressions early – Detect when new updates or configuration changes silently break flows that used to work.
Validate complex scenarios – Confirm that conversations, follow-ups, and side questions are handled as expected.
Scenario-based – Define realistic customer situations with goals, instructions, and success criteria.
Automated assurance – Replace manual QA reviews with repeatable, scalable testing.
Detailed reporting – Review conversation logs, success/failure breakdowns, and run history.
Current scope – Supports testing of knowledge-based answers, reasoning flows, and sales experience. Testing of processes in the Decision Engine is planned for future releases.
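The "98 out of 100 runs" idea can be illustrated with a small sketch. It is not AI Tester's implementation: `run_once` is a hypothetical stand-in for a single test execution, and the simulated 98% success probability is an assumption chosen only for illustration.

```python
import random
from typing import Callable


def pass_rate(run_once: Callable[[], bool], runs: int) -> float:
    """Run a test repeatedly and return the fraction of successful runs."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    successes = sum(1 for _ in range(runs) if run_once())
    return successes / runs


def make_simulated_run(success_probability: float, seed: int) -> Callable[[], bool]:
    """Hypothetical stand-in for one non-deterministic test execution."""
    rng = random.Random(seed)
    return lambda: rng.random() < success_probability


if __name__ == "__main__":
    # Assumed numbers: 100 runs against an agent that succeeds about 98% of the time.
    rate = pass_rate(make_simulated_run(0.98, seed=1), runs=100)
    print(f"{rate:.0%} of runs successful")
```

A single run only tells you that the agent can succeed. The pass rate over many runs tells you how often it does.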

⚠️
Early Access notice – AI Tester is in Early Access. Functionality may be limited, bugs may occur, and configuration is supported by Zowie’s Customer Success team.

Getting started

AI Tester is available in AI Agent > Testing.

Test case configuration

Each test case consists of several key components:

1. Basic information

Name: Give your test case a descriptive name (e.g., "return policy information")
Description: Provide context about what this test is validating (e.g., "You are a kind user named Alex, who wants to ask about the possibility of returning a product. You bought lipstick but the color is not as you expected. Ask follow-up questions about the payment methods.")

2. Termination conditions

Define when a test conversation should end by setting:

Timeout: Maximum time (in seconds) for the test to complete (e.g., 10 seconds)
Max iterations: Maximum number of conversation turns allowed (e.g., 3 iterations)

You can add multiple termination conditions with specific confidence threshold values. For the return policy example:

"User is informed about possibility of returning a lipstick" (Confidence threshold: 70%)
"User is informed about the payment methods details" (Confidence threshold: 50%)

📘
What are confidence thresholds?
In the context of GenAI testing, a confidence threshold represents the minimum acceptable level of certainty that a specific condition has been met.
For example, a 70% threshold means the system needs to be at least 70% confident that the agent provided the correct information. This accounts for the natural variability in AI responses while ensuring quality standards are met.
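As a rough sketch of how the termination settings could interact, the loop below stops on whichever comes first: the timeout, the maximum number of iterations, or the termination conditions reaching their thresholds. Everything here is hypothetical: `next_turn` and `score` stand in for the simulated conversation and the confidence scoring, and the rule that all conditions must be met before stopping is an assumption, not documented AI Tester behavior.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TerminationCondition:
    description: str
    threshold: float  # 0.0 to 1.0, e.g. 0.70 for a 70% confidence threshold


def run_conversation(
    next_turn: Callable[[int], str],
    score: Callable[[TerminationCondition, List[str]], float],
    conditions: List[TerminationCondition],
    timeout_seconds: float,
    max_iterations: int,
) -> str:
    """Return the reason the simulated conversation stopped."""
    transcript: List[str] = []
    deadline = time.monotonic() + timeout_seconds
    for turn in range(max_iterations):
        if time.monotonic() >= deadline:
            return "timeout"
        transcript.append(next_turn(turn))
        # Assumption: stop once every condition reaches its own threshold.
        if conditions and all(score(c, transcript) >= c.threshold for c in conditions):
            return "conditions_met"
    return "max_iterations"
```

With the return policy example, the two conditions would carry thresholds of 0.70 and 0.50, and the conversation would end early as soon as both are satisfied.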

3. Evaluation criteria

Set up the criteria that will determine if your test passes or fails:

Define specific conditions the agent's response must meet
Set a confidence threshold (0% to 100%) for each criterion. If the confidence is above the threshold, the criterion passes.
Add multiple evaluation criteria for comprehensive testing

For the return policy example, you might include:

"AI Agent informs about possibility of returning a lipstick" (Confidence threshold: 50%)
"AI Agent informs that a return should be initiated on https://website.com/return" (Confidence threshold: 70%)
"Agent is kind and straight to the point" (Confidence threshold: 30%)

For the test to pass, all evaluation criteria must pass.
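The pass rule can be sketched as follows. This is an illustration under assumptions, not the product's evaluation logic: the criterion keys and confidence scores are made up, and treating a score exactly equal to the threshold as passing is an assumed boundary choice.

```python
from typing import Dict, List


def failed_criteria(confidences: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the names of criteria whose confidence is below the threshold.

    A criterion with no reported confidence is treated as failed.
    """
    return [
        name
        for name, minimum in thresholds.items()
        if confidences.get(name, 0.0) < minimum
    ]


def all_criteria_pass(confidences: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """The test passes only if no criterion failed."""
    return not failed_criteria(confidences, thresholds)


if __name__ == "__main__":
    # Thresholds taken from the return policy example; the scores are hypothetical.
    thresholds = {"informs_about_return": 0.50, "gives_return_url": 0.70, "kind_and_concise": 0.30}
    scores = {"informs_about_return": 0.92, "gives_return_url": 0.64, "kind_and_concise": 0.80}
    print(all_criteria_pass(scores, thresholds), failed_criteria(scores, thresholds))
```

In this example the test fails because one criterion (the return URL) falls below its 70% threshold, even though the other two pass comfortably.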

Running tests

Tests can be executed in two ways:

Run single: Execute an individual test case
Run all: Execute all test cases in your suite

Test results

After running tests, you'll see:

Status indicators:
- ✅ Success (green) - Test passed
- ❌ Flaky (orange) - Test has inconsistent results
- ❌ Failed (red) - Test did not meet criteria
Timestamps: When each test was last run
Detailed logs: Click on any test run to view the complete conversation and evaluation results
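One way to think about the three statuses when a test is run several times is sketched below. The exact rule AI Tester uses to mark a test as flaky is not specified here, so the mapping (all runs pass, no runs pass, mixed results) is an assumption for illustration.

```python
from typing import Sequence


def classify(run_results: Sequence[bool]) -> str:
    """Map the outcomes of repeated runs of one test to a status label."""
    if not run_results:
        raise ValueError("at least one run result is required")
    if all(run_results):
        return "Success"
    if not any(run_results):
        return "Failed"
    # Assumption: any mix of passing and failing runs counts as inconsistent.
    return "Flaky"
```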

Managing test runs

The testing module provides several management features:

Run History: View all previous test executions with their results
Session Logs: Access detailed conversation logs for each test run
Bulk Operations: Set the number of parallel test runs for more thorough validation

Best practices

Start simple: Begin with basic test cases for your most critical knowledge responses
Be specific: Write clear, specific test descriptions and evaluation criteria
Regular testing: Run your test suite regularly, especially before making significant changes
Iterative improvement: Continuously refine your test cases based on real user interactions
Coverage: Aim to cover your most important use cases and edge scenarios

Example test case

Here's an example of a well-structured test case based on the return policy scenario:

Name: Return Policy Information

Description: You are a kind user named Alex, who wants to ask about the possibility of returning a product. You bought lipstick but the color is not as you expected. Ask follow-up questions about the payment methods. Then, ask for a URL with information about the return policy.

Termination Conditions:

Timeout: 30 seconds
Max iterations: 3
User is informed about possibility of returning a lipstick (Confidence threshold: 70%)
User is informed about the payment methods details (Confidence threshold: 70%)

Evaluation Criteria:

AI Agent informs about possibility of returning a lipstick (Confidence threshold: 70%)
AI Agent informs that a return should be initiated on https://example-website.com/return (Confidence threshold: 70%)
AI Agent is kind and straight to the point (Confidence threshold: 30%)
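For teams that keep test definitions under version control alongside the UI configuration, the same example could be written down as structured data. The sketch below is a hypothetical representation only: the `Condition` and `ScenarioSpec` types and their field names are invented for illustration and are not an AI Tester import or export format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Condition:
    text: str
    threshold_percent: int

    def __post_init__(self) -> None:
        # Confidence thresholds range from 0% to 100%.
        if not 0 <= self.threshold_percent <= 100:
            raise ValueError("threshold_percent must be between 0 and 100")


@dataclass(frozen=True)
class ScenarioSpec:
    name: str
    description: str
    timeout_seconds: int
    max_iterations: int
    termination_conditions: Tuple[Condition, ...]
    evaluation_criteria: Tuple[Condition, ...]


RETURN_POLICY = ScenarioSpec(
    name="Return Policy Information",
    description=(
        "You are a kind user named Alex, who wants to ask about the possibility "
        "of returning a product. You bought lipstick but the color is not as you "
        "expected. Ask follow-up questions about the payment methods. Then, ask "
        "for a URL with information about the return policy."
    ),
    timeout_seconds=30,
    max_iterations=3,
    termination_conditions=(
        Condition("User is informed about possibility of returning a lipstick", 70),
        Condition("User is informed about the payment methods details", 70),
    ),
    evaluation_criteria=(
        Condition("AI Agent informs about possibility of returning a lipstick", 70),
        Condition(
            "AI Agent informs that a return should be initiated on "
            "https://example-website.com/return",
            70,
        ),
        Condition("AI Agent is kind and straight to the point", 30),
    ),
)
```

Validating thresholds at construction time catches typos such as a threshold of 700 before the definition is ever entered into the tool.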