The Updated Framework for Voice AI Testing: What’s New in JKB v1.5
Voice AI is spreading fast. Customer service, delivery updates, appointment reminders, insurance claims: bots are taking calls everywhere. But while deployments multiply, testing practices haven’t kept pace.
Most teams still rely on methods built for apps or websites, where outputs are predictable and bugs are easier to spot. Voice AI doesn’t work that way. A single accent, a bit of background noise, or a misheard phrase can throw the entire exchange off track.
One small slip in speech recognition, language model, or text-to-speech often cascades into confusion. Traditional QA checklists aren’t designed for this, and spot-checks rarely catch the real issues.
That’s why we created our conversational voice AI testing framework, which we call the JKB Testing Framework. It provides a structured way to evaluate voice bots under real-world conditions, with parameters that reflect how people actually experience a call: clarity, logic, safety, sentiment, and compliance.
We detailed the development process for version 1 in this blog. With version 1.5, the framework is sharper and more practical.
It helps teams separate minor slips from major risks, focus review time where it matters, and apply consistent scoring across languages and industries.
Why Voice AI Needs a Different QA Framework
Voice AI doesn’t behave like traditional software. Each call passes through speech recognition, a language model, and text-to-speech. A small slip at any stage, like a misheard word or a confused intent, can throw the entire exchange off track.
Add real-world factors like accents, background noise, or code-switching, and the testing problem gets even harder.
Most teams still rely on random call checks or manual reviews, but that approach is slow and inconsistent. It often misses the rare but critical failures, such as missing consent prompts or unsafe responses, that matter most.
The JKB Testing Framework was designed to close that gap. It uses a two-phase process:
Quick Scan reviews a large sample at speed, flagging calls with signs of risk such as technical dropouts, incoherent replies, or frustrated users.
Deep Audit then digs into the flagged calls in detail, checking for issues like consent handling, skipped questions, context retention, and clarity.
Every call ends with a clear score: Pass, Acceptable, Fail, or Critical Fail. That way, teams can compare results across projects and track quality over time. This structure ensures QA efforts focus on the calls that matter most, while still giving consistent benchmarks for all conversations.
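For teams who want to see the shape of this process in code, here is a minimal sketch of the two-phase triage. The quick_scan and deep_audit helpers are hypothetical placeholders for your own review tooling, and calls are assumed to be simple records with an id field; only the control flow is taken from the framework.
    # Minimal sketch of the two-phase process. quick_scan() and deep_audit()
    # are hypothetical placeholders for your own review tooling; calls are
    # assumed to be dicts with an "id" field.

    FINAL_SCORES = ("Pass", "Acceptable", "Fail", "Critical Fail")

    def review_batch(sampled_calls, quick_scan, deep_audit):
        results = {}
        for call in sampled_calls:
            risk_flags = quick_scan(call)  # dropouts, incoherent replies, frustration, ...
            if risk_flags:
                # Flagged calls get the detailed review and one of FINAL_SCORES.
                results[call["id"]] = deep_audit(call)
            else:
                # Assumption: calls that clear Quick Scan are recorded as Pass.
                results[call["id"]] = "Pass"
        return results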
Version 1.5 builds on this foundation, sharpening parameters and making the process more practical for adopters.
What’s New in Version 1.5
Version 1.5 of the JKB Testing Framework builds on the original by tightening sampling rules, clarifying parameters, and simplifying scoring. These changes come directly from auditor feedback and real-world testing, making the framework both more rigorous and easier to use.
Refined Sampling Method
Earlier versions left sampling too open-ended. This new version introduces a three-bucket system that ensures risky calls aren’t overlooked:
Early hang-ups (<1 minute): Capture technical dropouts, wrong numbers, or instant user frustration.
Normal calls (1–4 minutes): Represent typical everyday interactions.
Long calls (>4 minutes): Surface looping issues, fatigue, or complex problem handling.
The recommended split is 40% early, 40% normal, 20% long (out of a sample of 300 from 1,000 total calls).
If one bucket has fewer calls than its quota, the shortage is redistributed in a set order (Early → Normal → Long). This prevents samples from skewing toward “easy” calls and ensures the riskiest buckets stay overrepresented.
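As a concrete illustration, the sketch below applies the three-bucket split and the Early → Normal → Long redistribution order to a list of calls. The 40/40/20 quotas and the duration cut-offs come from the rules above; the function names and the duration_s field are our own assumptions.
    # Sketch of the three-bucket sampling rule: 40% early, 40% normal, 20% long,
    # with any shortfall refilled in the set order Early -> Normal -> Long.

    def bucket(duration_s):
        if duration_s < 60:
            return "early"     # early hang-ups: under 1 minute
        if duration_s <= 240:
            return "normal"    # typical calls: 1-4 minutes
        return "long"          # long calls: over 4 minutes

    def allocate_sample(calls, sample_size=300, split=(0.4, 0.4, 0.2)):
        order = ("early", "normal", "long")
        groups = {name: [] for name in order}
        for call in calls:
            groups[bucket(call["duration_s"])].append(call)

        # First pass: take up to each bucket's quota.
        quotas = {name: round(sample_size * frac) for name, frac in zip(order, split)}
        sample = []
        for name in order:
            sample.extend(groups[name][:quotas[name]])

        # Second pass: if a bucket fell short, fill the gap from the other
        # buckets' leftovers, walking them in the set order.
        shortfall = sample_size - len(sample)
        for name in order:
            if shortfall <= 0:
                break
            take = groups[name][quotas[name]:][:shortfall]
            sample.extend(take)
            shortfall -= len(take)
        return sample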
Sharper Quick Scan Parameters
The Quick Scan phase now uses a clear 5-parameter set:
Compliance + Safe-AI: Check that consent prompts are present, no unsafe content is generated, and personal data isn’t requested without cause.
Data-Capture Quality: Confirm all mandatory fields (like ID, phone number, complaint type) are recorded accurately.
Technical Stability: Look for dead air, audio glitches, overlaps, or ASR failures that break the conversation.
Linguistic/Logic Fitness: Ensure replies are on-topic, coherent, and free of looping or repeated questions.
User Sentiment: Detect frustration, disengagement, or early hang-ups as signs of poor user experience.
This narrowed scope allows auditors to scan a large batch of calls quickly while still surfacing the most important red flags.
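One way to keep the Quick Scan consistent across auditors is to record the five parameters as a per-call checklist. A minimal sketch, assuming each parameter has already been reduced to a pass/fail flag by your own checks:
    # Sketch of a Quick Scan record: one flag per parameter, plus an
    # escalation rule. How each flag is determined is left to your tooling.

    from dataclasses import dataclass

    @dataclass
    class QuickScanResult:
        call_id: str
        compliance_safe_ai: bool        # consent present, no unsafe content or needless personal data
        data_capture_quality: bool      # mandatory fields (ID, phone number, complaint type) captured
        technical_stability: bool       # no dead air, glitches, overlaps, or ASR breakdowns
        linguistic_logic_fitness: bool  # on-topic, coherent, no looping or repeated questions
        user_sentiment: bool            # no frustration, disengagement, or early hang-up

        def needs_deep_audit(self) -> bool:
            # Assumption: any failed parameter sends the call to Deep Audit.
            return not all((
                self.compliance_safe_ai,
                self.data_capture_quality,
                self.technical_stability,
                self.linguistic_logic_fitness,
                self.user_sentiment,
            ))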
Expanded Deep Audit Layer
If a call fails or looks risky in the Quick Scan, it moves to Deep Audit. Version 1.5 expands this stage with more granular parameters, each with pass/fail rules:
Consent Handling: Prompt must appear early; if declined, the bot must stop.
Skipped Questions: No jumping past mandatory items in the script.
Interruptions: Tracks how often the bot cuts off the user or overlaps responses.
Context Retention: Checks if the bot remembers previous answers or re-asks questions unnecessarily.
Accent Handling: Evaluates how well the bot responds to regional or strong accents.
Colloquial Language: Looks at whether the bot uses natural, conversational phrasing instead of robotic or overly formal language.
Noise Handling: Measures whether background noise (like TV or traffic) disrupts recognition.
Speech Clarity: Reviews pronunciation, pacing, and delivery quality.
Each parameter is scored as Pass, Acceptable, Major Fail, or Critical Fail. For example, one skipped question might be marked Acceptable, while three skipped questions or missing consent would trigger a Critical Fail.
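To show how the pass/fail rules translate into scores, here is a sketch for two of the parameters. The skipped-question thresholds follow the example above; the Major Fail boundary for two skipped questions is our own assumption, as is the exact shape of the consent check.
    # Sketch of per-parameter Deep Audit scoring for two of the checks.

    def score_skipped_questions(num_skipped: int) -> str:
        if num_skipped == 0:
            return "Pass"
        if num_skipped == 1:
            return "Acceptable"     # one slip, per the example in the text
        if num_skipped == 2:
            return "Major Fail"     # assumption for the middle case
        return "Critical Fail"      # three or more mandatory items skipped

    def score_consent(prompt_shown_early: bool, user_declined: bool, bot_stopped: bool) -> str:
        if not prompt_shown_early:
            return "Critical Fail"  # missing consent is always critical
        if user_declined and not bot_stopped:
            return "Critical Fail"  # the bot must stop when consent is declined
        return "Pass"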
Simplified Final Scoring
To make results easier to interpret, every call now falls into one of four categories:
Pass (Good Call): Smooth, usable, with only minor slips.
Acceptable (Usable Call): Slightly messy but still functional.
Fail (Broken Call): Disrupted flow or missing data makes it unreliable.
Critical Fail (Unsafe Call): Any issue that breaches safety, compliance, or operational boundaries.
This consistency ensures that a “Fail” means the same thing across teams, languages, and projects, removing subjective judgment from the final score.
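As a sketch of how per-parameter scores might roll up into the final category, the function below applies one plausible set of cut-offs. The four categories come from the framework; the roll-up rules themselves are illustrative assumptions, not published thresholds.
    # Sketch of rolling per-parameter scores up into a final call verdict.
    # The cut-offs below are illustrative assumptions, not framework rules.

    def final_verdict(parameter_scores):
        if "Critical Fail" in parameter_scores:
            return "Critical Fail"   # any safety, compliance, or boundary breach
        if "Major Fail" in parameter_scores:
            return "Fail"            # disrupted flow or missing data
        if "Acceptable" in parameter_scores:
            return "Acceptable"      # slightly messy but still functional
        return "Pass"                # smooth, only minor slips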
How Adopters Can Use the Framework
The framework isn’t just for QA specialists; it creates a shared language across technical and business roles. Here’s how different teams benefit:
QA teams get a structured way to prioritize. Instead of combing through hundreds of clean calls, they can focus deep-audit hours on the 20–30% that show real risk.
Developers can trace recurring issues, like ASR consistently failing on regional accents or TTS mispronouncing certain words, without wading through subjective notes.
Compliance officers see immediate value. Safe-AI and consent checks are non-negotiable, and the framework makes them first-class parameters.
Product owners gain measurable quality benchmarks. Over time, they can track how changes to prompts, models, or vendors shift the ratio of Pass, Acceptable, and Fail calls.
A typical flow looks like this: out of 1,000 calls, 300 are sampled for Quick Scan. Around one-third of those (roughly 100) move into Deep Audit, where each is scored in detail. The outcome is a clear ratio of Pass, Acceptable, Fail, and Critical Fail calls, which is easy to track, easy to present, and easy to act on.
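For reporting, the outcome of a run can be summarised as a simple ratio of verdicts. A minimal sketch, assuming each reviewed call already carries its final category:
    # Sketch of turning a batch of final verdicts into the reported ratios.

    from collections import Counter

    def verdict_ratios(verdicts):
        counts = Counter(verdicts)
        total = len(verdicts) or 1
        return {name: counts.get(name, 0) / total
                for name in ("Pass", "Acceptable", "Fail", "Critical Fail")}

    # Hypothetical example for a 300-call sample:
    # verdict_ratios(["Pass"] * 220 + ["Acceptable"] * 50 + ["Fail"] * 25 + ["Critical Fail"] * 5)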
The main advantage is consistency. Whether the bot is deployed in healthcare, banking, or retail, teams can adapt the framework to domain-specific requirements while relying on the same backbone of parameters. It scales across languages, user types, and call volumes without losing clarity.
Why This Matters Beyond QA
Voice AI now goes far beyond simple tasks like booking tables or confirming deliveries. It is being used for insurance claims, prescription refills, banking transactions, and government services. In these areas, quality is not optional. It is directly tied to compliance, user trust, and safety.
The JKB v1.5 framework helps teams meet these higher stakes. It ensures consent is captured correctly, unsafe responses are flagged, and calls are scored against clear rules. This reduces the chance of harmful failures reaching real users.
Another benefit is consistency. The framework gives teams a standard way to measure and report call quality across different projects, industries, and vendors.
For adopters, this means more than efficiency. It means building voice systems that regulators, users, and internal teams can trust.
Next Steps
The JKB Testing Framework is not a fixed rulebook. It is a living standard that continues to improve with input from the people who use it. Version 1.5 is stronger because of the feedback we gathered, and future versions will depend on the same kind of collaboration.
If you are building or testing voice AI, the best way to get involved is simple:
Start using the framework in your own QA process.
Share edge cases, feedback, or improvements you discover along the way.
Join the effort to make this a shared industry standard.
You can access the full framework document and send comments or examples to Santosh at santosh@ekstep.org.
Every contribution helps refine the guidelines so they work better across domains, languages, and real-world scenarios.
Testing voice AI will never be easy. The conditions are too varied and the technology is always evolving. But with a shared playbook, it can be consistent, transparent, and trustworthy. JKB v1.5 is a step toward that goal, and your participation will help shape what comes next.

