If you’ve worked in software testing, you’re used to crisp boundaries: you give an input, you expect a deterministic output, and you can assert that expected === actual. But what happens when the “system under test” isn’t a deterministic program — it’s another AI model?
That’s where things get strange.
Welcome to the world of agentic testing, where one AI acts as a tester, prompting and validating the responses of another AI. The challenge isn’t just about writing prompts or checking outputs.
It’s about dealing with non-determinism, subjective correctness, and the recursive feeling of agents testing agents. It’s a little like the movie Inception: one agent goes a layer deeper to test another, and you need to make sure you don’t get lost in the dream.
The Problem: Determinism Meets Probabilism
Traditional QA works because software is deterministic. A login function that checks username === "admin" will either pass or fail.
No ambiguity.
No uncertainty.
But ask an AI:
“What do you get when you multiply 6 × 7?”
You might get:
- 42
- 6 x 7 = 42
- The answer is Forty-two.
- The answer to the Great Question of Life, the Universe, and Everything. (If your LLM thinks you’re a Hitchhiker’s Guide to the Galaxy fan.)
All of these are correct, but none match exactly. If you wrote your assertion as answer === "42", the test could fail.
This is the biggest difficulty of testing LLMs: outputs are probabilistic. The tester agent has to decide what “correct enough” means.
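To make the mismatch concrete, here’s a tiny TypeScript sketch using the sample responses above, showing how a strict assertion falls apart:

```typescript
// Sample responses a model might return for "What is 6 × 7?"
const responses = ["42", "6 x 7 = 42", "The answer is Forty-two."];

// A traditional exact-match assertion only accepts one of them.
const passing = responses.filter((r) => r === "42");
console.log(passing.length); // prints 1: two perfectly good answers would "fail"
```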
The Core Challenge of Agentic Testing
When one AI tests another, three main challenges emerge:
- Prompting Correctly
The tester has to craft instructions in a way that the “AI under test” can respond to consistently. This is a meta-prompting problem: the tester isn’t answering the question itself, but guiding another model into a shape that can be validated.
- Handling Non-Determinism
Because the model can phrase answers differently, the tester must use fuzzy logic, normalization, or semantic matching rather than exact string equality. We don’t just have “1” and “0” anymore; we have everything in between!
- Validating Without Hallucinating
The tester AI can make mistakes too. If both the tester and the subject hallucinate, you may end up with a false positive (passing a wrong answer) or a false negative (failing a correct answer). Anchoring with ground truth is critical; otherwise it’s just the blind leading the blind.
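To see how those three roles fit together, here’s a minimal TypeScript skeleton of a tester loop. The interfaces and function names are illustrative assumptions, not a real framework:

```typescript
// Illustrative skeleton: the shapes and names here are assumptions, not a real library.
interface AiUnderTest {
  ask(prompt: string): Promise<string>; // the model being tested
}

interface Verdict {
  passed: boolean;
  rawAnswer: string;
  reason: string;
}

// The tester agent: craft the prompt, normalize the answer, anchor to ground truth.
async function runTest(
  subject: AiUnderTest,
  prompt: string,
  normalize: (raw: string) => string,
  groundTruth: string
): Promise<Verdict> {
  const rawAnswer = await subject.ask(prompt); // 1. prompting correctly
  const canonical = normalize(rawAnswer);      // 2. handling non-determinism
  const passed = canonical === groundTruth;    // 3. validating against ground truth
  return {
    passed,
    rawAnswer,
    reason: passed ? "matched ground truth" : `expected ${groundTruth}, got ${canonical}`,
  };
}
```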
A Simple Example: Math Chatbot
Let’s imagine we want to test a math-focused chatbot.
- Tester Agent Prompt:
“Ask the chatbot: What is 6 × 7? Record the answer and check if it equals 42.”
- AI Under Test:
“Six times seven equals forty-two.”
- Tester Agent Evaluation:
  - Strip words.
  - Convert text into numbers.
  - Compare against ground truth (42).
If the tester can normalize responses into a canonical form (a number in this case), it can handle variations and still evaluate correctness.
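Here’s roughly what that normalization might look like in TypeScript. It’s a rough sketch: the number-word list is deliberately tiny, and a real interpretation layer would be far more thorough:

```typescript
// Rough normalizer for the math example: turn free-text answers into a number.
const NUMBER_WORDS: Record<string, number> = {
  "forty-two": 42,
  "forty two": 42,
  "ten": 10,
};

function extractNumber(answer: string): number | null {
  const text = answer.toLowerCase();

  // 1. Prefer the last digit sequence in the text (answers tend to come last: "6 x 7 = 42").
  const digits = [...text.matchAll(/-?\d+(\.\d+)?/g)];
  if (digits.length > 0) return Number(digits[digits.length - 1][0]);

  // 2. Fall back to a lookup of spelled-out numbers.
  for (const [word, value] of Object.entries(NUMBER_WORDS)) {
    if (text.includes(word)) return value;
  }
  return null;
}

const groundTruth = 42;
console.log(extractNumber("Six times seven equals forty-two.") === groundTruth); // true
console.log(extractNumber("6 x 7 = 42") === groundTruth);                        // true
console.log(extractNumber("The answer is Forty-two.") === groundTruth);          // true
```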
Lesson: Even trivial tests require interpretation layers.
The “Inception” Problem
Here’s where it gets fun (and messy).
Let’s say you ask an AI tester to test a chatbot by prompting it with a question. But the tester itself is also an AI that uses reasoning chains. Now you have:
- Tester Agent: prompts and evaluates.
- AI Under Test: responds with an answer.
- Evaluator Logic: sometimes handled by the tester agent itself.
At two layers, things are already complex. At three, you risk an inception loop.
For example:
- Tester asks: “What’s 5 + 5?”
- AI Under Test says: “10.”
- Tester mistakenly interprets “10” as invalid because it expected “ten.”
The evaluator’s mistake introduces false negatives. You can’t blindly trust either side.
That’s why agentic testing requires anchors to reality — deterministic systems (calculators, APIs, reference datasets) that provide ground truth. Without them, you just have AIs agreeing or disagreeing with each other in a vacuum.
Like a disagreement between my neighbour and me over the concerning number of lawn gnomes I’m collecting (no, I won’t chill out, Fred!), there are three truths:
“His truth”, that the pile of gnomes is an eyesore and broken pieces keep ending up in his yard, cutting his feet constantly.
“My truth”, that there is no hobby more noble than saving gnomes.
And “The Truth”, that I probably have a couple too many.
Techniques for Reliable Agentic Testing
How can we make this practical? Here are a few strategies:
1. Use Structured Outputs
Instead of free-form answers, instruct the AI under test to respond in JSON:
{ "answer": 42 }
This reduces ambiguity and gives the tester an easier job.
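For example, a tester might ask for that shape explicitly and parse the reply defensively. This is a sketch: askChatbot is a stand-in for however you actually call the AI under test:

```typescript
// askChatbot is a placeholder for your real client call; here it returns a canned reply.
async function askChatbot(prompt: string): Promise<string> {
  return '{"answer": 42}';
}

async function getStructuredAnswer(question: string): Promise<number | null> {
  const prompt =
    `${question}\n` +
    'Respond with JSON only, in the shape {"answer": <number>}, and nothing else.';

  const raw = await askChatbot(prompt);
  try {
    const parsed = JSON.parse(raw);
    return typeof parsed.answer === "number" ? parsed.answer : null;
  } catch {
    // The model ignored the format instruction: treat that as a test failure, not a crash.
    return null;
  }
}

getStructuredAnswer("What is 6 x 7?").then(console.log); // 42
```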
2. Semantic Similarity
When free text is unavoidable, use embeddings or semantic similarity scoring. For instance, “The answer is forty-two” should be judged equivalent to “42.”
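Here’s a sketch of what that check could look like. The embed function is a stand-in for whichever embedding API you use, and the threshold is something you’d tune against labelled examples:

```typescript
// embed is a stand-in for your embedding provider's API (an assumption, not a real call).
declare function embed(text: string): Promise<number[]>;

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

async function semanticallyEqual(
  expected: string,
  actual: string,
  threshold = 0.85 // assumed cut-off; tune it on real data
): Promise<boolean> {
  const [e, a] = await Promise.all([embed(expected), embed(actual)]);
  return cosineSimilarity(e, a) >= threshold;
}

// e.g. await semanticallyEqual("42", "The answer is forty-two.")
```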
3. External Oracles
Always anchor results with deterministic tools where possible:
- Math → calculator API
- Dates/times → Date library
- Facts → trusted dataset
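As a sketch, the oracle for arithmetic can be as boring as local math. The point is that the verdict comes from something deterministic, not from another model:

```typescript
// Deterministic oracle for the math case: plain arithmetic, no hallucinations possible.
function mathOracle(a: number, b: number): number {
  return a * b;
}

// The tester compares the (normalized) model answer against the oracle, not against its own opinion.
function verifyAgainstOracle(modelAnswer: number, a: number, b: number): boolean {
  return modelAnswer === mathOracle(a, b);
}

console.log(verifyAgainstOracle(42, 6, 7)); // true
console.log(verifyAgainstOracle(41, 6, 7)); // false: the oracle decides, not another LLM
```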
4. Multiple Verifiers
Have more than one agent verify the same answer. If 3 out of 4 verifiers agree, you reduce single-agent error risk. You could use a different LLM provider or a different model to help co-verify results, but be aware of the pros and cons of each.
Similar to my mild gnome disagreement, my wife might side with me as she also enjoys gnomes, but my neighbour on the other side of my house has been secretly rage-smashing them across my lawn at night (which probably explains the shards scattered across all of our lawns).
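Mechanically, the voting itself is simple. Here’s a sketch where each verifier stands in for a call to a separate model or provider:

```typescript
// Each Verifier is a placeholder for a call to a different model or provider.
type Verifier = (question: string, answer: string) => Promise<boolean>;

async function majorityVerdict(
  verifiers: Verifier[],
  question: string,
  answer: string,
  requiredVotes: number // e.g. 3 of 4, as above
): Promise<boolean> {
  const votes = await Promise.all(verifiers.map((verify) => verify(question, answer)));
  const agreeing = votes.filter(Boolean).length;
  return agreeing >= requiredVotes;
}

// e.g. await majorityVerdict([verifierA, verifierB, verifierC, verifierD], "What is 6 x 7?", "42", 3)
```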
5. Tolerant Assertions
Replace strict equality with looser checks. Instead of answer === "42", use rules like:
- Contains “42”
- Matches number regex
- Within tolerance range (for numerical calculations)
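A tolerant check might bundle a few of those rules together. This is a rough sketch, with an arbitrary tolerance you’d adjust per domain:

```typescript
// Tolerant assertion: several looser checks instead of strict string equality.
function tolerantMatch(answer: string, expected: number, tolerance = 0.01): boolean {
  // Rule 1: the expected value appears verbatim somewhere in the answer.
  if (answer.includes(String(expected))) return true;

  // Rule 2: pull the last number out of the answer and compare within a tolerance.
  const numbers = answer.match(/-?\d+(\.\d+)?/g);
  if (numbers) {
    const value = Number(numbers[numbers.length - 1]);
    return Math.abs(value - expected) <= tolerance;
  }
  return false;
}

console.log(tolerantMatch("The answer is 42.", 42)); // true: contains "42"
console.log(tolerantMatch("Roughly 41.999", 42));    // true: within tolerance
console.log(tolerantMatch("Roughly 40", 42));        // false
```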
Why This Matters
This isn’t just an academic exercise. AI-powered apps are proliferating: chatbots, copilots, workflow agents. Many of them depend on models making the “right” decisions in response to prompts. Traditional QA doesn’t fit, because we can’t just assert a single exact string.
Agentic testing offers a scalable way to test AI systems by:
- Running through real-world scenarios automatically
- Handling messy, non-deterministic outputs
- Introducing verification layers that balance AI judgment with external ground truth
If done right, it can help answer questions like:
- Is my AI bot giving consistent answers?
- How does it behave across different phrasings of the same question?
- Is it improving or degrading with new versions?
- Am I right to be collecting so many gnomes?
Future Directions: Layers Within Layers
The real frontier of agentic testing isn’t just validating simple Q&A. It’s testing complex multi-step workflows.
Imagine:
- Tester agent prompts a customer-service AI to reset a password.
- That AI goes through a flow: authenticate → generate reset link → send email.
- Tester has to follow along and verify each step.
At this point, you’re multiple layers deep: an agent testing an agent performing a workflow powered by other services.
It’s pure inception. And it’s where the field is headed.
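What might the tester’s side of such a workflow look like? Here’s a sketch with hypothetical step names and checks; the important bit is that each check leans on observable evidence (an API response, a log entry, a test inbox) rather than the workflow agent’s own claims:

```typescript
// Hypothetical workflow verification: each step's check looks at deterministic evidence.
interface WorkflowStep {
  name: string;
  check: () => Promise<boolean>; // verifies evidence that the step actually happened
}

async function verifyWorkflow(steps: WorkflowStep[]): Promise<{ step: string; passed: boolean }[]> {
  const results: { step: string; passed: boolean }[] = [];
  for (const step of steps) {
    const passed = await step.check();
    results.push({ step: step.name, passed });
    if (!passed) break; // no point checking later steps if an earlier one failed
  }
  return results;
}

// e.g. (the check functions are placeholders):
// verifyWorkflow([
//   { name: "authenticate", check: checkAuthSucceeded },
//   { name: "generate reset link", check: checkResetLinkCreated },
//   { name: "send email", check: checkEmailLandedInTestInbox },
// ]);
```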
Closing Thoughts
Testing AI with AI is like navigating dreams within dreams. Each layer introduces noise, ambiguity, and risk. But it also offers power: you can scale testing across thousands of prompts, scenarios, and workflows in a way human testers never could.
The key is deciding where to anchor your reality: ground truth oracles, structured outputs, and multi-agent consensus.
Done right, agentic testing won’t just keep AIs in check. It could become the compass we use to navigate the new, layered landscape of AI-powered systems.