AI Tester
AI Tester is Zowie’s automated QA tool for AI Agents. It simulates real customer scenarios, executes them repeatedly, and validates outcomes against predefined criteria. This helps confirm that your AI Agent delivers accurate, high-quality responses, not just once but reliably at scale. AI Tester is part of Zowie’s Quality & Control Suite (alongside Supervisor and Staging) and is currently available in Early Access.
Key information
- Manage AI variability – Run tests multiple times to account for GenAI’s non-determinism. Confidence comes from results like “98 out of 100 runs successful,” not from a single good answer.
- Catch regressions early – Detect when new updates or configuration changes silently break flows that used to work.
- Validate complex scenarios – Confirm that conversations, follow-ups, and side questions are handled as expected.
- Scenario-based – Define realistic customer situations with goals, instructions, and success criteria.
- Automated assurance – Replace manual QA reviews with repeatable, scalable testing.
- Detailed reporting – Review conversation logs, success/failure breakdowns, and run history.
- Current scope – Supports testing of knowledge-based answers, reasoning flows, and sales experience. Testing of processes in the Decision Engine is planned for future releases.
Early Access notice – AI Tester is in Early Access. Functionality may be limited, bugs may occur, and configuration is supported by Zowie’s Customer Success team.
Getting started
AI Tester is available in AI Agent > Testing.
Test case configuration
Each test case consists of several key components:
1. Basic information
- Name: Give your test case a descriptive name (e.g., "return policy information")
- Description: Provide context about what this test is validating (e.g., "You are a kind user named Alex, who wants to ask about the possibility of returning a product. You bought lipstick but the color is not as you expected. Ask follow-up questions about the payment methods.")
2. Termination conditions
Define when a test conversation should end by setting:
- Timeout: Maximum time (in seconds) for the test to complete (e.g., 10 seconds)
- Max iterations: Maximum number of conversation turns allowed (e.g., 3 iterations)
You can add multiple termination conditions with specific confidence threshold values. For the return policy example:
- "User is informed about possibility of returning a lipstick" (Confidence threshold: 70%)
- "User is informed about the payment methods details" (Confidence threshold: 50%)
What are confidence thresholds?
In the context of GenAI testing, a confidence threshold represents the minimum acceptable level of certainty that a specific condition has been met.
For example, a 70% threshold means the system needs to be at least 70% confident that the agent provided the correct information. This accounts for the natural variability in AI responses while ensuring quality standards are met.
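The termination settings above can be sketched in Python as follows. This is only an illustration of the logic, not Zowie's implementation: the names `Condition`, `is_met`, and `stop_reason` are hypothetical, the confidence scores are assumed to come from whatever evaluates the conversation, and it is an assumption that a run ends early once every listed condition reaches its threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    description: str
    threshold: float  # 0.70 means a 70% confidence threshold


def is_met(confidence: float, condition: Condition) -> bool:
    # Assumption: "at least 70% confident" is read as confidence >= 0.70.
    return confidence >= condition.threshold


def stop_reason(
    elapsed_seconds: float,
    turns: int,
    confidences: dict[str, float],
    conditions: list[Condition],
    timeout_seconds: float,
    max_iterations: int,
) -> Optional[str]:
    """Return why the conversation should stop, or None to keep going."""
    if elapsed_seconds >= timeout_seconds:
        return "timeout"
    if turns >= max_iterations:
        return "max_iterations"
    # Assumption: the run ends early once every condition is met.
    if conditions and all(
        is_met(confidences.get(c.description, 0.0), c) for c in conditions
    ):
        return "conditions_met"
    return None


conditions = [
    Condition("User is informed about possibility of returning a lipstick", 0.70),
    Condition("User is informed about the payment methods details", 0.50),
]

# Illustrative scores only: the second condition is still below its threshold.
scores = {conditions[0].description: 0.82, conditions[1].description: 0.41}
print(stop_reason(4.0, 2, scores, conditions, timeout_seconds=10, max_iterations=3))
```

With these made-up scores the conversation would continue, because the payment-methods condition sits at 41% against a 50% threshold and neither the timeout nor the turn limit has been reached.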
3. Evaluation criteria
Set up the criteria that will determine if your test passes or fails:
- Define specific conditions the agent's response must meet
- Set a confidence threshold (0% to 100%) for each criterion. If the evaluated confidence is above the threshold, the criterion passes.
- Add multiple evaluation criteria for comprehensive testing
For the return policy example, you might include:
- "AI Agent informs about possibility of returning a lipstick" (Confidence threshold: 50%)
- "AI Agent informs that a return should be initiated on https://website.com/return" (Confidence threshold: 70%)
- "Agent is kind and straight to the point" (Confidence threshold: 30%)
For the test to pass, all Evaluation criteria must pass.
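To make the pass rule concrete, here is a minimal Python sketch of how per-criterion thresholds could combine into a single result. It is an assumption-laden illustration rather than AI Tester's actual code: `Criterion` and `evaluate` are hypothetical names, the confidence values are invented, and whether a score exactly equal to the threshold passes is not specified in this guide (the sketch treats it as passing).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    threshold: float  # 0.70 means a 70% confidence threshold


def evaluate(criteria: list[Criterion], confidences: dict[str, float]):
    """Return (test_passed, per-criterion results)."""
    results = {
        # A criterion with no score is treated as 0.0 confidence (assumption).
        c.description: confidences.get(c.description, 0.0) >= c.threshold
        for c in criteria
    }
    # The test passes only if every evaluation criterion passes.
    return all(results.values()), results


criteria = [
    Criterion("AI Agent informs about possibility of returning a lipstick", 0.50),
    Criterion(
        "AI Agent informs that a return should be initiated on "
        "https://website.com/return",
        0.70,
    ),
    Criterion("Agent is kind and straight to the point", 0.30),
]

# Invented scores for illustration: the second criterion misses its threshold.
confidences = {
    criteria[0].description: 0.91,
    criteria[1].description: 0.64,
    criteria[2].description: 0.55,
}

passed, results = evaluate(criteria, confidences)
print(passed)
```

In this example the test fails even though two of the three criteria pass, because the return-URL criterion scores 64% against a 70% threshold.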
Running tests
Tests can be executed in two ways:
- Run single: Execute an individual test case
- Run all: Execute all test cases in your suite
Test results
After running tests, you'll see:
- Status indicators:
  - ✅ Success (green) - Test passed
  - ❌ Flaky (orange) - Test has inconsistent results
  - ❌ Failed (red) - Test did not meet criteria
- Timestamps: When each test was last run
- Detailed logs: Click on any test run to view the complete conversation and evaluation results
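One way to think about the three statuses is as a summary of repeated runs of the same test. The Python sketch below shows that reading. The exact rules AI Tester uses are not documented here, so the cut-offs are assumptions: every run passing is Success, no run passing is Failed, and any mix is Flaky. `Status` and `classify` are hypothetical names.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    FLAKY = "flaky"
    FAILED = "failed"


def classify(run_outcomes: list[bool]) -> Status:
    """Summarize repeated runs of one test case into a single status."""
    if not run_outcomes:
        raise ValueError("at least one run is required")
    passed = sum(run_outcomes)
    if passed == len(run_outcomes):
        return Status.SUCCESS
    if passed == 0:
        return Status.FAILED
    # Some runs passed and some failed: results are inconsistent.
    return Status.FLAKY


print(classify([True] * 98 + [False] * 2).value)
```

Under these assumed rules, 98 passing runs out of 100 would be reported as flaky, which is a prompt to open the failing runs' logs rather than a verdict on the agent.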
Managing test runs
The testing module provides several management features:
- Run History: View all previous test executions with their results
- Session Logs: Access detailed conversation logs for each test run
- Bulk Operations: Set the number of parallel test runs for more thorough validation
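The idea behind running a test many times in parallel can be sketched in Python as below. This is not how AI Tester is invoked (tests are run from the UI): `pass_rate` is a hypothetical helper, and `fake_run` is a deterministic stand-in for one simulated conversation so that the example is reproducible.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def pass_rate(run_once: Callable[[int], bool], runs: int, parallel: int) -> float:
    """Run the same scenario `runs` times, at most `parallel` at a time."""
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        outcomes = list(pool.map(run_once, range(runs)))
    return sum(outcomes) / runs


def fake_run(index: int) -> bool:
    # Stand-in for one test run; fails on every 50th run to mimic variability.
    return index % 50 != 49


rate = pass_rate(fake_run, runs=100, parallel=5)
print(f"{rate:.0%} of runs passed")  # prints "98% of runs passed"
```

A higher run count gives a more trustworthy pass rate, while the parallelism setting only affects how quickly those runs complete.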
Best practices
- Start simple: Begin with basic test cases for your most critical knowledge responses
- Be specific: Write clear, specific test descriptions and evaluation criteria
- Regular testing: Run your test suite regularly, especially before making significant changes
- Iterative improvement: Continuously refine your test cases based on real user interactions
- Coverage: Aim to cover your most important use cases and edge scenarios
Example test case
Here's an example of a well-structured test case based on the return policy scenario:
Name: Return Policy Information
Description: You are a kind user named Alex, who wants to ask about the possibility of returning a product. You bought lipstick but the color is not as you expected. Ask follow-up questions about the payment methods. Then, ask for a URL with information about the return policy.
Termination Conditions:
- Timeout: 30 seconds
- Max iterations: 3
- User is informed about possibility of returning a lipstick (Confidence threshold: 70%)
- User is informed about the payment methods details (Confidence threshold: 70%)
Evaluation Criteria:
- AI Agent informs about possibility of returning a lipstick (Confidence threshold: 70%)
- AI Agent informs that a return should be initiated on https://example-website.com/return (Confidence threshold: 70%)
- AI Agent is kind and straight to the point (Confidence threshold: 30%)
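If you keep test case definitions alongside your own documentation or review them before entering them in the UI, the example above could be captured as structured data. The Python sketch below is purely illustrative: AI Tester is configured in the product, and the `Check` and `ScenarioSpec` classes and their field names are hypothetical, not an import format.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    description: str
    threshold_percent: int

    def __post_init__(self):
        # Thresholds are percentages between 0 and 100.
        if not 0 <= self.threshold_percent <= 100:
            raise ValueError(f"threshold out of range: {self.threshold_percent}")


@dataclass
class ScenarioSpec:
    name: str
    description: str
    timeout_seconds: int
    max_iterations: int
    termination_conditions: list[Check] = field(default_factory=list)
    evaluation_criteria: list[Check] = field(default_factory=list)


spec = ScenarioSpec(
    name="Return Policy Information",
    description=(
        "You are a kind user named Alex, who wants to ask about returning "
        "a product. You bought lipstick but the color is not as you expected."
    ),
    timeout_seconds=30,
    max_iterations=3,
    termination_conditions=[
        Check("User is informed about possibility of returning a lipstick", 70),
        Check("User is informed about the payment methods details", 70),
    ],
    evaluation_criteria=[
        Check("AI Agent informs about possibility of returning a lipstick", 70),
        Check(
            "AI Agent informs that a return should be initiated on "
            "https://example-website.com/return",
            70,
        ),
        Check("AI Agent is kind and straight to the point", 30),
    ],
)

print(len(spec.evaluation_criteria))
```

Writing the case down this way makes it easy to spot a missing threshold or an out-of-range value before the test is created.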