Speech Evaluation, RAG, And Quality Systems

Evaluation Stack

Measure Product Quality By Layer

Audio-text systems fail across multiple boundaries. Keep separate metrics for the acoustic model, text model, retrieval layer, response generator, speech renderer, and end-to-end user task.

ASR And Transcript Quality

Track WER, CER, entity error rate, timestamp error, diarization error, partial churn, endpointing delay, and language ID accuracy.

Question: Why can lower WER hurt users?

WER weights every word equally. Users may care more about names, numbers, product codes, commands, timestamps, punctuation, and diarization. A model can reduce filler-word errors while making more business-critical entity mistakes.
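
The gap between WER and entity accuracy can be shown with a small sketch. The reference sentence, both hypotheses, and the substring-based entity check are illustrative assumptions, not a production scorer; a real pipeline would normally apply text normalization and entity alignment first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    ref_tokens = ref.lower().split()
    hyp_tokens = hyp.lower().split()
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def entity_error_rate(entities, hyp):
    """Fraction of reference entities missing from the hypothesis.

    Assumption: a plain substring check stands in for real entity alignment.
    """
    hyp_lower = hyp.lower()
    missed = sum(1 for e in entities if e.lower() not in hyp_lower)
    return missed / len(entities)


# Synthetic example: model A drops filler words, model B garbles the product code.
reference = "um so please reset the xr-17 battery now"
entities = ["xr-17"]
model_a = "please reset the xr-17 battery now"
model_b = "um so please reset the xr-70 battery now"

print(wer(reference, model_a), entity_error_rate(entities, model_a))  # 0.25 0.0
print(wer(reference, model_b), entity_error_rate(entities, model_b))  # 0.125 1.0
```

Model B wins on WER but misses the only entity the user cares about, which is why the two metrics are tracked separately.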

TTS And Speech Output

Track intelligibility, MOS or preference win rate, first audio byte, streaming stability, speaker similarity, prosody, safety, and interruption behavior.

Question: What should block a TTS release?

Block release if intelligibility drops, first audio byte exceeds the conversational SLO, voices become unstable, unsafe text is spoken, speaker identity controls regress, or a quality gain appears only on clean lab prompts and does not hold on real product text.

End-To-End Assistant Quality

Track task completion, correction rate, barge-in success, turn latency, hallucination rate, refusal quality, and recovery from interrupted speech.

Question: Why keep component and end-to-end metrics?

End-to-end metrics tell you whether users succeed, but they are hard to debug alone. Component metrics localize the failure. If task completion drops, separate ASR errors, retrieval misses, reasoning failures, TTS latency, and UI playback problems.

Slice Coverage

Report metrics by slice, using only approved aggregate or consented groupings, such as pronunciation variation, noise, channel, microphone, language, domain terms, speaking rate, long silence, overlap, and device class.

Question: What makes a slice launch-critical?

A slice is launch-critical when it maps to a major user group, safety risk, contractual requirement, revenue workflow, or known model weakness. Experienced engineers protect these slices with explicit regression budgets and rollback criteria while avoiding raw sensitive labels unless they are policy-approved and needed.
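
One way to make a regression budget explicit is a per-slice check like the sketch below. The slice names, error rates, and budget values are hypothetical; the point is only that each launch-critical slice carries its own allowed regression and that a violation is a named, reportable event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceBudget:
    name: str
    max_regression: float  # largest allowed absolute increase in error rate


def check_slices(baseline, candidate, budgets):
    """Return the names of slices whose error rate grew past the budget."""
    violations = []
    for budget in budgets:
        delta = candidate[budget.name] - baseline[budget.name]
        if delta > budget.max_regression:
            violations.append(budget.name)
    return violations


# Hypothetical error rates per slice for the current and candidate models.
baseline = {"noisy_car": 0.12, "product_codes": 0.05}
candidate = {"noisy_car": 0.11, "product_codes": 0.08}
budgets = [SliceBudget("noisy_car", 0.01), SliceBudget("product_codes", 0.01)]

violations = check_slices(baseline, candidate, budgets)
print(violations)  # ['product_codes']
```

A non-empty result would feed the rollback criteria: the candidate improves the noisy-car slice but regresses on product codes beyond what was agreed.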

RAG For Speech Products

Retrieval Changes The Failure Modes

Voice assistants often answer from docs, tickets, transcripts, or user memory. Retrieval improves grounding, but ASR errors and spoken ambiguity make retrieval evaluation different from text-only RAG.

Prompt 1: Design RAG For A Voice Support Agent

A customer asks spoken questions about a product manual. The system must answer by voice, cite the manual internally, and avoid exposing private transcript text in logs.

Hidden answer: advanced design outline

Use streaming ASR with confidence and timing metadata, normalize product terms, retrieve against chunked manual passages, rerank with query and conversation context, generate a grounded answer, and synthesize speech. Evaluate ASR entity accuracy, retrieval recall at k, answer faithfulness, citation support, latency, and redaction. Treat retrieved passages and transcript text as untrusted inputs: ignore instructions found inside evidence, refuse unsupported personal-data requests, and require answers to cite approved source chunks. Log trace IDs, versions, aggregate metrics, and privacy-safe error categories rather than raw private audio by default.
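
Retrieval recall at k, one of the metrics in the outline, can be computed as in the following sketch. The chunk IDs are invented for illustration, and the comparison between a clean query and a noisy ASR query is an assumed evaluation setup rather than a prescribed one.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of relevant chunks that appear in the top-k retrieved chunks."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when no chunk is relevant")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical manual chunk IDs for one spoken question.
relevant = ["manual-12", "manual-03"]
retrieved_from_clean_text = ["manual-12", "manual-03", "manual-40"]
retrieved_from_asr_text = ["manual-88", "manual-03", "manual-51"]

print(recall_at_k(retrieved_from_clean_text, relevant, k=3))  # 1.0
print(recall_at_k(retrieved_from_asr_text, relevant, k=3))    # 0.5
```

Scoring the same question with both the clean text and the ASR hypothesis shows how much recall is lost to transcription rather than to the retriever itself.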

Prompt 2: ASR Errors Break Retrieval

Users ask for "XR-17 battery calibration", but ASR often returns "ex are seventeen battery celebration". Retrieval misses the right manual page. What do you change?

Hidden answer: mitigation options

Add domain vocabulary biasing, pronunciation variants, entity normalization, fuzzy lexical retrieval, character n-gram features, audio-confidence-aware query rewriting, and retrieval evaluation with noisy ASR hypotheses. Bound rewrites to known product IDs or catalog terms, preserve the original ASR hypothesis for debugging, and keep a fallback clarification question when confidence is low or top passages disagree.
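
A bounded rewrite can be sketched with the standard-library `difflib` module. The catalog, its spoken variants, and the 0.85 threshold are all hypothetical; the sketch rewrites only spans that closely match a known product ID, keeps the original hypothesis, and leaves other errors (such as "celebration") for the fuzzy retrieval or clarification steps.

```python
from difflib import SequenceMatcher

# Hypothetical catalog: canonical product ID -> spoken forms seen in ASR output.
CATALOG_VARIANTS = {
    "XR-17": ["x r seventeen", "ex are seventeen", "xr seventeen"],
    "QT-200": ["q t two hundred", "cutie two hundred"],
}


def normalize_entities(hypothesis, threshold=0.85):
    """Replace the best-matching span with a catalog ID, if any is close enough."""
    tokens = hypothesis.lower().split()
    best = None  # (score, start, end, canonical_id)
    for canonical, variants in CATALOG_VARIANTS.items():
        for variant in variants:
            width = len(variant.split())
            for start in range(len(tokens) - width + 1):
                window = " ".join(tokens[start:start + width])
                score = SequenceMatcher(None, window, variant).ratio()
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, start, start + width, canonical)
    if best is None:
        # No catalog match: leave the text alone and let the caller decide
        # whether to ask a clarification question.
        return {"original": hypothesis, "rewritten": hypothesis, "matched_id": None}
    _, start, end, canonical = best
    rewritten = " ".join(tokens[:start] + [canonical] + tokens[end:])
    return {"original": hypothesis, "rewritten": rewritten, "matched_id": canonical}


result = normalize_entities("ex are seventeen battery celebration")
print(result["rewritten"])  # XR-17 battery celebration
```

Because rewrites can only produce IDs that exist in the catalog, a wrong rewrite is at least a real product, and the preserved `original` field keeps the raw hypothesis available for debugging.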

Judges

Use LLM Judges Carefully

Good Uses

Classify answer helpfulness, check claims against retrieved evidence, summarize error reports, and prioritize human review queues after calibration.

Question: How do you calibrate a judge?

Build a labeled set with consented or sanitized transcripts, retrieved passages, answer text, and any audio-specific human ratings needed for the task. Compare judge decisions to human labels, inspect disagreements, measure precision and recall by slice, lock the judge prompt and model version, hide system identity during grading, randomize pairwise order when comparing releases, and avoid using the judge as the only launch gate for high-risk behavior.
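
Measuring precision and recall by slice might look like the sketch below. The row format, the slice names, and the convention that a positive label means "answer flagged as unfaithful" are assumptions made for the example.

```python
from collections import defaultdict


def precision_recall_by_slice(rows):
    """Compare judge flags to human flags per slice.

    Each row is a dict with 'slice', 'human' (bool), and 'judge' (bool).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for row in rows:
        c = counts[row["slice"]]
        if row["judge"] and row["human"]:
            c["tp"] += 1
        elif row["judge"]:
            c["fp"] += 1
        elif row["human"]:
            c["fn"] += 1
    report = {}
    for name, c in counts.items():
        flagged = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        report[name] = {
            "precision": c["tp"] / flagged if flagged else None,
            "recall": c["tp"] / actual if actual else None,
        }
    return report


# Synthetic labels: the judge agrees with humans on clean audio, less so on noisy audio.
rows = [
    {"slice": "clean", "human": True, "judge": True},
    {"slice": "clean", "human": False, "judge": False},
    {"slice": "clean", "human": True, "judge": True},
    {"slice": "noisy", "human": True, "judge": False},
    {"slice": "noisy", "human": True, "judge": True},
    {"slice": "noisy", "human": False, "judge": True},
]
report = precision_recall_by_slice(rows)
print(report["clean"])  # {'precision': 1.0, 'recall': 1.0}
print(report["noisy"])  # {'precision': 0.5, 'recall': 0.5}
```

An aggregate score over these six rows would hide that the judge is only reliable on the clean slice, which is the reason to break calibration out by slice.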

Risky Uses

Replacing human labels for safety-critical cases, judging private content without retention controls, scoring speech quality with a text-only judge, or using a judge that shares the same blind spots as the generator.

Question: What should be logged for judge outputs?

Log judge model version, prompt version, rubric, score, reason category, request trace ID, and redacted inputs when allowed. Recheck calibration after judge model or prompt updates, and keep enough metadata to reproduce decisions without making the monitoring system a private-data sink.
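
As a rough illustration, a judge log record could carry the fields listed above and drop inputs unless policy allows them. The field names, the version strings, and the `allow_inputs` flag are hypothetical; a real system would take them from its own schema and retention policy.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeLogRecord:
    trace_id: str
    judge_model_version: str
    prompt_version: str
    rubric_id: str
    score: float
    reason_category: str
    redacted_inputs: Optional[str] = None  # already redacted upstream


def to_log_line(record, allow_inputs):
    """Serialize a record, omitting inputs when policy does not allow them."""
    payload = asdict(record)
    if not allow_inputs:
        payload["redacted_inputs"] = None
    return json.dumps(payload, sort_keys=True)


record = JudgeLogRecord(
    trace_id="trace-0001",
    judge_model_version="judge-v3",      # hypothetical version label
    prompt_version="faithfulness-p7",    # hypothetical prompt label
    rubric_id="faithfulness-1to5",
    score=2.0,
    reason_category="unsupported_claim",
    redacted_inputs="[NAME] asked about battery calibration",
)
line = to_log_line(record, allow_inputs=False)
```

Keeping the versions and rubric in every line is what makes a decision reproducible after the judge prompt or model changes, even when the inputs themselves are not retained.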

Incident Drills

Quality Debugging Prompts

Incident 1: RAG Grounding Scores Improved, Complaints Increased

Offline faithfulness improved after a retriever update, but users report that spoken answers are less useful.

Hidden answer: debug path

Check whether the eval set overrepresents written queries instead of ASR transcripts, whether retrieved chunks are too long for spoken answers, whether citations are correct but answers omit the user's actual intent, and whether TTS latency or prosody makes good text feel bad. Compare query ASR, retrieval recall, generator answer, TTS output, and user task completion on the same traces.

Incident 2: Human Review Cost Tripled

A new monitoring rule sends too many voice-agent sessions to human review. Accuracy is better, but the operations team is overloaded.

Hidden answer: strong response

Measure alert precision, duplicate alerts per session, slice concentration, confidence thresholds, and whether a cheap automatic classifier can triage low-risk cases. Keep high-risk review paths, but tune routing, batch similar failures, sample for trend detection, and report the cost-quality tradeoff explicitly.
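
The first two measurements, alert precision and duplicate alerts per session, can be sketched as follows. The alert record format and the `confirmed` field (set after human review) are assumptions for illustration.

```python
from collections import Counter


def review_queue_stats(alerts):
    """Summarize review load from alerts with 'session_id' and 'confirmed' keys."""
    total = len(alerts)
    confirmed = sum(1 for a in alerts if a["confirmed"])
    per_session = Counter(a["session_id"] for a in alerts)
    return {
        "alert_precision": confirmed / total if total else None,
        "alerts_per_session": total / len(per_session) if per_session else None,
        # Alerts beyond the first in a session are candidates for batching.
        "duplicate_alerts": sum(n - 1 for n in per_session.values()),
    }


# Synthetic queue: session s1 fired three times, only one alert was a real issue.
alerts = [
    {"session_id": "s1", "confirmed": True},
    {"session_id": "s1", "confirmed": False},
    {"session_id": "s1", "confirmed": False},
    {"session_id": "s2", "confirmed": True},
]
stats = review_queue_stats(alerts)
print(stats)
# {'alert_precision': 0.5, 'alerts_per_session': 2.0, 'duplicate_alerts': 2}
```

Here half of the review load comes from repeat alerts within one session, so deduplicating per session would cut cost without dropping any confirmed issue.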

Lab

Build A Speech RAG Eval Sheet

Create a small, privacy-safe evaluation table for a voice RAG system. Use synthetic or sanitized examples only.

Inputs: write the spoken request, expected ASR pitfalls, allowed sanitized transcript, and the ground-truth document IDs or note that evidence is intentionally absent.
Retrieval: record sanitized ASR text, alternate hypotheses when available, normalized query text, top-k results, recall against expected evidence, and reranker notes.
Answer: score correctness, faithfulness to retrieved evidence, missing caveats, unsupported-answer behavior, refusal quality, retrieved-instruction resistance, and spoken clarity.
Latency: measure ASR partial time, retrieval time, LLM first token, first audio byte, and final audio.
Decision: write pass, fail, or needs human review with the rollback or mitigation action.
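
One possible shape for the sheet is a typed row written out as CSV, as sketched below. The column names, the 1-5 faithfulness scale, and the example values are assumptions; the columns are a subset of the fields above and can be extended to cover the rest.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

DECISIONS = {"pass", "fail", "needs_human_review"}


@dataclass
class EvalRow:
    spoken_request: str
    expected_doc_ids: str     # semicolon-separated; empty if evidence is intentionally absent
    sanitized_asr_text: str
    normalized_query: str
    recall_at_k: float
    faithfulness: int         # hypothetical 1-5 rubric score
    first_audio_byte_ms: int
    decision: str
    action: str

    def __post_init__(self):
        if self.decision not in DECISIONS:
            raise ValueError(f"unknown decision: {self.decision}")


def to_csv(rows):
    """Render rows as CSV text with one header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(EvalRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Synthetic example row; no real user data.
sheet = to_csv([
    EvalRow(
        spoken_request="How do I calibrate the XR-17 battery?",
        expected_doc_ids="manual-12",
        sanitized_asr_text="ex are seventeen battery celebration",
        normalized_query="XR-17 battery calibration",
        recall_at_k=1.0,
        faithfulness=4,
        first_audio_byte_ms=620,
        decision="pass",
        action="ship; monitor entity error rate",
    )
])
```

Validating the decision column at construction time keeps every row tied to one of the three outcomes the lab asks for.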

Question: What makes this lab advanced?

It connects offline model metrics to production decisions. An experienced engineer does not stop at "the model improved"; they define the user task, separate ASR noise from retrieval misses and generation errors, protect private data, price the review burden, and state what would block rollout.