Voice Agent Evaluation Benchmarks

Scorecard

Evaluate The Conversation, Not Just The Model

A voice agent is a chain of probabilistic components plus product behavior. Advanced evaluation keeps component metrics, end-to-end task metrics, and operational metrics visible at the same time.

Task Success

Completion rate, correct action rate, clarification quality, escalation precision, and recovery after wrong turns.

Question: Why is task success a stronger signal than transcript accuracy?

Transcript accuracy is only one dependency. The product succeeds when the user intent is understood, the right knowledge is found, the answer is grounded, the spoken output is usable, and risky actions are handled correctly.
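As a minimal sketch of two of the task-success metrics listed above, the snippet below computes completion rate and escalation precision from labeled conversations. The `TaskOutcome` record and its fields are hypothetical annotations, not a schema from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    """One labeled conversation; every field is a hypothetical annotation."""
    completed: bool             # the user's goal was reached
    escalated: bool             # the agent handed off to a human
    escalation_warranted: bool  # a reviewer judged that a handoff was needed


def completion_rate(outcomes):
    """Fraction of conversations where the user's goal was reached."""
    if not outcomes:
        return None
    return sum(o.completed for o in outcomes) / len(outcomes)


def escalation_precision(outcomes):
    """Of the conversations the agent escalated, the share that needed it."""
    escalated = [o for o in outcomes if o.escalated]
    if not escalated:
        return None  # undefined when nothing was escalated
    return sum(o.escalation_warranted for o in escalated) / len(escalated)


outcomes = [
    TaskOutcome(completed=True, escalated=False, escalation_warranted=False),
    TaskOutcome(completed=False, escalated=True, escalation_warranted=True),
    TaskOutcome(completed=True, escalated=True, escalation_warranted=False),
    TaskOutcome(completed=False, escalated=False, escalation_warranted=True),
]
print(completion_rate(outcomes))       # 0.5
print(escalation_precision(outcomes))  # 0.5
```

Neither number depends on transcript accuracy, which is why they can move independently of ASR word error rate.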

Grounding And RAG

Retrieval recall, reranker precision, citation support, unsupported claim rate, stale-document rate, and refusal quality.

Question: What RAG metric can hide a bad voice experience?

High retrieval recall can hide poor query understanding, overly long chunks, slow first audio, awkward summaries, missing caveats, or answers that are faithful to a document but not to the user's spoken intent.

Conversation Timing

First partial, endpointing delay, retrieval latency, first token, first audio byte, interruption handling, and full-turn latency.

Question: Which latency metrics deserve separate SLOs?

Set separate SLOs for first partial, endpointing delay, first audio byte, and full-turn latency. Users perceive waiting, interruption, and silence differently, and each metric points to a different subsystem.
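One way to keep these latencies separate is to give each its own budget and report violations per metric. The sketch below assumes p95 budgets in milliseconds; the metric names and the numbers in `SLO_P95_MS` are hypothetical placeholders, not recommended targets.

```python
import math

# Hypothetical p95 budgets in milliseconds; real targets depend on the product.
SLO_P95_MS = {
    "first_partial": 300,
    "endpointing": 700,
    "first_audio_byte": 1200,
    "full_turn": 4000,
}


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slo_violations(samples_ms, slos=SLO_P95_MS):
    """Return {metric: (observed_p95, budget)} for each metric over budget.

    A metric with no samples is reported as (None, budget) so that missing
    telemetry is visible instead of silently passing.
    """
    violations = {}
    for metric, budget in slos.items():
        samples = samples_ms.get(metric, [])
        if not samples:
            violations[metric] = (None, budget)
            continue
        observed = percentile(samples, 95)
        if observed > budget:
            violations[metric] = (observed, budget)
    return violations


samples = {
    "first_partial": [180, 220, 250, 260, 290],
    "endpointing": [400, 520, 610, 650, 900],
    "first_audio_byte": [800, 950, 1000, 1100, 1150],
    "full_turn": [2100, 2500, 2800, 3100, 3600],
}
print(slo_violations(samples))  # {'endpointing': (900, 700)}
```

In this synthetic data the full-turn latency is within budget while endpointing is not, so a single full-turn SLO would miss the problem.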

Safety And Operations

PII exposure, spoken prompt injection, tool confirmation, abuse rate, cost per resolved task, fallback rate, and review burden.

Question: What should never be the only safety signal?

Aggregate pass rate should not be the only safety signal. Track high-risk slices, tool calls, replay or impersonation attempts, sensitive entities, consent boundaries, and human-review escapes.
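The gap between an aggregate pass rate and a high-risk slice can be shown with a few lines of arithmetic. This is a sketch on synthetic results; the slice names are made up for illustration.

```python
from collections import defaultdict


def pass_rates_by_slice(results):
    """results: iterable of (slice_name, passed) pairs from a safety eval run."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, total]
    for slice_name, passed in results:
        totals[slice_name][0] += int(passed)
        totals[slice_name][1] += 1
    return {name: p / n for name, (p, n) in totals.items()}


# Synthetic run: 95 benign cases all pass; 5 tool-call injection cases, 2 pass.
results = (
    [("benign", True)] * 95
    + [("tool_call_injection", True)] * 2
    + [("tool_call_injection", False)] * 3
)

aggregate = sum(passed for _, passed in results) / len(results)
by_slice = pass_rates_by_slice(results)
print(aggregate)                        # 0.97
print(by_slice["tool_call_injection"])  # 0.4
```

A 97% aggregate looks healthy, yet the injection slice fails more often than it passes.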

Benchmark Design

Create Benchmarks That Survive Production Reality

A useful benchmark has clear scenarios, synthetic or sanitized audio fixtures, expected outcomes, slice labels, versioned rubrics, and rollout thresholds. Do not benchmark only clean text queries.

Benchmark 1: Spoken Manual Lookup

Users ask product-support questions that include model numbers, abbreviations, background noise, and follow-up corrections.

Hidden answer: benchmark fields and gates

Store a sanitized spoken query, ASR hypothesis, expected entity, expected document IDs, allowed answer facts, disallowed claims, latency budgets, and consented or coarse slice tags such as noise, device class, and domain term. Gate on entity accuracy, retrieval recall at k, grounded answer score, clarification behavior, and first audio byte.
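A minimal sketch of such a fixture and two of its gates might look like the following. The `ManualLookupCase` fields, the document IDs, and the normalization used for the entity check are illustrative assumptions; a real benchmark would likely need domain-specific entity matching.

```python
from dataclasses import dataclass, field


@dataclass
class ManualLookupCase:
    """One benchmark fixture; field names are illustrative, not a standard schema."""
    case_id: str
    audio_fixture: str      # path to sanitized or synthetic audio
    expected_entity: str    # for example, the model number the user said
    expected_doc_ids: set
    slice_tags: set = field(default_factory=set)


def recall_at_k(retrieved_doc_ids, expected_doc_ids, k):
    """Share of expected documents found in the top-k retrieved results."""
    if not expected_doc_ids:
        return None
    top_k = set(retrieved_doc_ids[:k])
    return len(top_k & set(expected_doc_ids)) / len(expected_doc_ids)


def entity_correct(asr_hypothesis, expected_entity):
    """Case- and whitespace-insensitive containment check for the key entity."""
    def normalize(text):
        return "".join(text.lower().split())
    return normalize(expected_entity) in normalize(asr_hypothesis)


case = ManualLookupCase(
    case_id="c1",
    audio_fixture="fixtures/c1.wav",
    expected_entity="XR-200",
    expected_doc_ids={"doc-12", "doc-40"},
    slice_tags={"noise"},
)
retrieved = ["doc-7", "doc-12", "doc-3", "doc-40"]
print(recall_at_k(retrieved, case.expected_doc_ids, k=3))                 # 0.5
print(entity_correct("my x r-200 keeps beeping", case.expected_entity))  # True
print(entity_correct("my xr 300 keeps beeping", case.expected_entity))   # False
```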

Benchmark 2: Multi-Turn Account Help

The agent must ask for clarification, remember safe context, avoid speaking private details, and escalate when identity is uncertain.

Hidden answer: what to measure

Measure state carryover, privacy-safe redaction, identity checks, escalation precision, tool-call confirmation, refusal wording, and transcript retention policy. Add adversarial cases where the user changes account, interrupts mid-confirmation, or asks the agent to ignore policy.

Benchmark 3: Speech-To-Speech Repair

The user interrupts the agent, corrects a misheard entity, and expects the next answer to reflect the correction.

Hidden answer: strong evaluation angle

Score barge-in detection, cancellation of stale TTS, state update, correction acknowledgement, retrieval refresh, and new answer grounding. A system can have good ASR and TTS while failing this benchmark because orchestration state is stale.
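The orchestration checks above can be scored from an ordered event trace. The sketch below assumes a trace of `(t_ms, name, payload)` tuples; the event names and the 300 ms cancellation budget are hypothetical, and acknowledgement and answer grounding are left out for brevity.

```python
def score_repair_trace(events, corrected_entity, max_cancel_ms=300):
    """Score one interruption-and-correction turn from an ordered event trace.

    events: list of (t_ms, name, payload) tuples. Event names and the
    default cancellation budget are assumptions made for this sketch.
    """
    def first_time(name, after=None, match=None):
        for t, event_name, payload in events:
            if event_name != name:
                continue
            if after is not None and t < after:
                continue
            if match is not None and payload != match:
                continue
            return t
        return None

    barge_in = first_time("barge_in_detected")
    cancel = update = refresh = None
    if barge_in is not None:
        cancel = first_time("tts_cancelled", after=barge_in)
        update = first_time("state_updated", after=barge_in, match=corrected_entity)
    if update is not None:
        # Retrieval only counts as refreshed if it starts after the state update.
        refresh = first_time("retrieval_started", after=update)

    return {
        "barge_in_detected": barge_in is not None,
        "stale_tts_cancelled": cancel is not None and cancel - barge_in <= max_cancel_ms,
        "state_updated": update is not None,
        "retrieval_refreshed": refresh is not None,
    }


good_trace = [
    (0, "tts_started", None),
    (1200, "barge_in_detected", None),
    (1350, "tts_cancelled", None),
    (2100, "state_updated", "XR-200"),
    (2150, "retrieval_started", "XR-200"),
]
stale_trace = [
    (0, "tts_started", None),
    (1200, "barge_in_detected", None),
    (1350, "tts_cancelled", None),
    (2150, "retrieval_started", "XR-300"),  # state never received the correction
]
print(score_repair_trace(good_trace, "XR-200"))
print(score_repair_trace(stale_trace, "XR-200"))
```

The second trace passes the audio-facing checks and fails only on state and retrieval, which is the stale-orchestration failure described above.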

Production Drills

Debug Evaluation Failures Like Incidents

Incident Drill 1: Offline Score Improved, Escalations Spiked

A new retriever improves offline grounding by 6%, but live human escalations rise 18% for support calls with serial numbers.

Hidden answer: first-hour triage

Slice by serial-number presence, ASR confidence, retrieval query rewrite, top-k disagreement, latency, clarification rate, and agent policy version. Compare old and new traces using redacted entities. Mitigate by rolling back the retriever for affected slices, adding lexical/entity fallback, or asking targeted clarification before escalation.
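The first slicing step can be sketched as a per-slice comparison of escalation rates between the old and new retriever. The trace fields (`has_serial`, `escalated`) and the counts are synthetic assumptions chosen to make the pattern visible; they are not the figures from the drill.

```python
def escalation_rate_by_slice(traces, slice_key):
    """traces: list of dicts with a boolean 'escalated' plus slice fields."""
    counts = {}
    for trace in traces:
        bucket = counts.setdefault(trace[slice_key], [0, 0])  # [escalated, total]
        bucket[0] += int(trace["escalated"])
        bucket[1] += 1
    return {value: esc / total for value, (esc, total) in counts.items()}


def slice_deltas(old_traces, new_traces, slice_key):
    """Escalation-rate change (new minus old) for slices present in both runs."""
    old = escalation_rate_by_slice(old_traces, slice_key)
    new = escalation_rate_by_slice(new_traces, slice_key)
    return {v: round(new[v] - old[v], 4) for v in old.keys() & new.keys()}


def make(has_serial, escalated, n):
    """Build n identical synthetic, already-redacted trace records."""
    return [{"has_serial": has_serial, "escalated": escalated} for _ in range(n)]


old_traces = make(True, True, 10) + make(True, False, 90) + make(False, True, 10) + make(False, False, 90)
new_traces = make(True, True, 30) + make(True, False, 70) + make(False, True, 10) + make(False, False, 90)

deltas = slice_deltas(old_traces, new_traces, "has_serial")
print(deltas[True])   # 0.2
print(deltas[False])  # 0.0
```

A delta concentrated in one slice supports a slice-specific rollback rather than a global one.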

Incident Drill 2: TTS Upgrade Lowers Cost, Users Interrupt More

A cheaper voice lowers serving cost by 25%, but barge-in and repeat request rates increase.

Hidden answer: cost and quality tradeoff

Inspect first audio byte, chunk cadence, intelligibility, prosody, speech rate, text segmentation, and whether users interrupt because the answer starts late or sounds uncertain. A strong answer prices the savings against completion loss and review cost, then proposes a slice-specific fallback or rollback gate.
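Pricing the savings against completion loss is simple arithmetic once the inputs are agreed. In the sketch below only the 25% cost reduction comes from the drill; the call volume, cost per call, completion drop, and cost per lost completion are hypothetical.

```python
def net_monthly_impact(calls, old_cost_per_call, cost_reduction,
                       completion_drop, cost_per_lost_completion,
                       extra_review_cost=0.0):
    """Savings minus the cost of lost completions and added review.

    A positive result means the cheaper voice still pays for itself
    under the stated assumptions.
    """
    savings = calls * old_cost_per_call * cost_reduction
    completion_loss = calls * completion_drop * cost_per_lost_completion
    return round(savings - completion_loss - extra_review_cost, 2)


impact = net_monthly_impact(
    calls=100_000,
    old_cost_per_call=0.04,        # assumed serving cost per call
    cost_reduction=0.25,           # from the drill: 25% cheaper
    completion_drop=0.02,          # assumed: 2 points fewer completed tasks
    cost_per_lost_completion=4.0,  # assumed cost of a retry or human handoff
)
print(impact)  # -7000.0
```

With these assumptions the 1,000 saved on serving is outweighed by 8,000 in lost completions, so the upgrade would need a fallback for the affected slices before it is a net win.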

Lab

Implement A Tiny Voice-Agent Eval Aggregator

Use synthetic aggregate records only. The function below illustrates how a release review can combine product, quality, latency, safety, and cost signals.

Coding Lab: Release Decision From Eval Records

Given slice-level metrics for a baseline and candidate, return continue, pause, or rollback with reasons.

Hidden answer: invariants, tests, and Python solution

Invariants: critical safety regressions override cost wins; every slice is checked independently; lower-is-better and higher-is-better metrics use opposite delta signs; missing launch-critical metrics and malformed metric values both pause the rollout; and thin slices need enough baseline and candidate examples before the comparison is trusted. Test clean wins, cost-only wins, safety regressions, latency regressions, missing slice metrics, non-finite telemetry, invalid budgets, and underpowered slices.

import math


def finite_number(value):
    """Return True for an int or float that is not a bool, NaN, or infinite."""
    return isinstance(value, (int, float)) and not isinstance(value, bool) and math.isfinite(value)


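def decide_voice_agent_release(baseline, candidate, budgets):
    """Return (decision, reasons) for a candidate release.

    baseline and candidate map slice name -> {"n": sample count, metric: value}.
    budgets maps slice name -> {"_min_count": int, metric: rule}, where each rule
    has "direction", "allowed_delta", and an optional "severity".
    The decision is "rollback", "pause", or "continue".
    """
    reasons = []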

    for slice_name, metric_budgets in budgets.items():
        old = baseline.get(slice_name)
        new = candidate.get(slice_name)
        if old is None or new is None:
            reasons.append(("pause", slice_name, "missing slice"))
            continue

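        min_count = metric_budgets.get("_min_count", 0)
        # bool is a subclass of int, so reject it explicitly, as for sample counts below.
        if isinstance(min_count, bool) or not isinstance(min_count, int) or min_count < 0:
            reasons.append(("pause", slice_name, "invalid min_count"))
            continue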
        old_n = old.get("n", 0)
        new_n = new.get("n", 0)
        if (
            not isinstance(old_n, int)
            or not isinstance(new_n, int)
            or isinstance(old_n, bool)
            or isinstance(new_n, bool)
            or old_n < 0
            or new_n < 0
        ):
            reasons.append(("pause", slice_name, "invalid sample count"))
            continue
        if old_n < min_count or new_n < min_count:
            reasons.append(("pause", slice_name, "underpowered slice"))
            continue

        for metric, rule in metric_budgets.items():
            if metric == "_min_count":
                continue
            if metric not in old or metric not in new:
                reasons.append(("pause", slice_name, f"missing {metric}"))
                continue
            if not isinstance(rule, dict) or "direction" not in rule or "allowed_delta" not in rule:
                reasons.append(("pause", slice_name, f"bad rule for {metric}"))
                continue

            direction = rule["direction"]
            allowed = rule["allowed_delta"]
            severity = rule.get("severity", "normal")
            if not finite_number(old[metric]) or not finite_number(new[metric]):
                reasons.append(("pause", slice_name, f"invalid {metric}"))
                continue
            if not finite_number(allowed) or allowed < 0:
                reasons.append(("pause", slice_name, f"bad budget for {metric}"))
                continue
            if direction not in {"lower_is_better", "higher_is_better"}:
                reasons.append(("pause", slice_name, f"bad direction for {metric}"))
                continue

            delta = new[metric] - old[metric]
            regressed = (
                direction == "lower_is_better" and delta > allowed
            ) or (
                direction == "higher_is_better" and -delta > allowed
            )
            if regressed:
                action = "rollback" if severity == "critical" else "pause"
                reasons.append((action, slice_name, metric, round(delta, 4)))

    if any(reason[0] == "rollback" for reason in reasons):
        return "rollback", reasons
    if reasons:
        return "pause", reasons
    return "continue", []
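For reference, one possible shape for the inputs is shown below. The slice name, metric names, and values are synthetic and purely illustrative; only the keys `n`, `_min_count`, `direction`, `allowed_delta`, and `severity` are taken from the function above.

```python
# Synthetic aggregate records for a single slice (all values are made up).
budgets = {
    "noisy_mobile": {
        "_min_count": 200,
        "task_completion": {"direction": "higher_is_better", "allowed_delta": 0.01},
        "first_audio_byte_ms_p95": {"direction": "lower_is_better", "allowed_delta": 50},
        "pii_exposure_rate": {
            "direction": "lower_is_better",
            "allowed_delta": 0.0,
            "severity": "critical",
        },
    }
}
baseline = {
    "noisy_mobile": {
        "n": 500,
        "task_completion": 0.82,
        "first_audio_byte_ms_p95": 900,
        "pii_exposure_rate": 0.001,
    }
}
candidate = {
    "noisy_mobile": {
        "n": 480,
        "task_completion": 0.79,
        "first_audio_byte_ms_p95": 880,
        "pii_exposure_rate": 0.001,
    }
}

# With the function above in scope:
# decision, reasons = decide_voice_agent_release(baseline, candidate, budgets)
```

Traced by hand against the function above, these inputs should give `"pause"` with the single reason `("pause", "noisy_mobile", "task_completion", -0.03)`: task completion falls by three points against a one-point budget, latency improves, and the critical PII metric is unchanged, so nothing triggers a rollback.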

Interview And Exam

Timed Prompts

Prompt 1: Design An Eval Plan For A Voice Banking Agent

The agent answers policy questions, verifies identity, and can start sensitive account workflows. What do you evaluate before launch?

Hidden answer: strong outline

Cover ASR entity accuracy, identity verification flow, spoken prompt injection, retrieval grounding, answer correctness, tool-call confirmation, escalation, consent, redaction, latency, cost, audit logs, rollback, and human-review burden. Include adversarial audio, interrupted confirmations, low-confidence ASR, and stale policy documents.

Prompt 2: Explain A Launch Gate To A Product Manager

The candidate model is cheaper and sounds better, but noisy-device task completion drops by 3%. What do you say?

Hidden answer: communication pattern

State the user impact, affected slice, confidence, business tradeoff, and recommended action. A strong answer does not hide behind a single aggregate score. It proposes a targeted rollback, extra eval data, or slice-specific fallback while preserving the cost win where quality stays inside budget.