Evaluation Stack
Measure Product Quality By Layer
Audio-text systems can fail at any boundary between components.
Keep separate metrics for the acoustic model, text model, retrieval
layer, response generator, speech renderer, and the end-to-end user
task.
ASR And Transcript Quality
Track WER, CER, entity error rate, timestamp error, diarization error, partial churn, endpointing delay, and language ID accuracy.
Question: Why can lower WER hurt users?
WER weights every word equally. Users may care more about names,
numbers, product codes, commands, timestamps, punctuation, and
diarization. A model can reduce filler-word errors while making
more business-critical entity mistakes.
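The gap between WER and entity accuracy can be made concrete with a
small sketch. The sentences, the entity list, and the single-token
entity matching below are illustrative assumptions, not a standard
scoring tool. WER is computed as word-level edit distance divided by
the reference length.

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_tokens) + 1))
    for i, r in enumerate(ref_tokens, start=1):
        cur = [i]
        for j, h in enumerate(hyp_tokens, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edits divided by reference word count."""
    ref_tokens = reference.split()
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


def entity_error_rate(entities, hypothesis):
    """Share of reference entities missing from the hypothesis.

    Assumption: every entity is a single token and an exact match counts.
    """
    hyp_tokens = set(hypothesis.split())
    missed = sum(1 for e in entities if e not in hyp_tokens)
    return missed / len(entities)


# Hypothetical utterance and transcripts from two models.
reference = "um please send order 4417 to dana reyes uh today"
entities = ["4417", "dana", "reyes"]
model_a = "please send order 4417 to dana reyes today"        # drops fillers
model_b = "um please send order 4470 to dana reyes uh today"  # wrong number

for name, hyp in [("A", model_a), ("B", model_b)]:
    print(name, round(wer(reference, hyp), 2),
          round(entity_error_rate(entities, hyp), 2))
```

In this toy case model B has the lower WER (0.1 against 0.2) but gets
the order number wrong, so its entity error rate is higher (0.33
against 0.0).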
TTS And Speech Output
Track intelligibility, MOS or preference win rate, time to first audio byte, streaming stability, speaker similarity, prosody, safety, and interruption behavior.
Question: What should block a TTS release?
Block release if intelligibility drops, time to first audio byte
exceeds the conversational SLO, voices become unstable, unsafe text
is spoken, speaker identity controls regress, or a quality gain
appears only on clean lab prompts and does not hold on real product
text.
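These blocking conditions could be encoded as an automated gate. The
sketch below is one possible shape, not a prescribed implementation:
the TtsEval fields, the 300 ms SLO, the 0.01 tolerance, and the sample
numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TtsEval:
    intelligibility: float         # share of words recovered (listeners or ASR proxy)
    first_audio_ms: float          # time to first audio byte, e.g. p95
    unstable_rate: float           # share of utterances with glitches or voice drift
    unsafe_spoken: int             # unsafe texts that were rendered as speech
    speaker_similarity: float      # similarity to the intended voice
    win_rate_lab: float = 0.5      # preference win rate vs. baseline, lab prompts
    win_rate_product: float = 0.5  # preference win rate vs. baseline, product text


def release_blockers(cand, base, first_audio_slo_ms=300.0, tol=0.01):
    """Return the names of the blocking conditions the candidate trips."""
    blockers = []
    if cand.intelligibility < base.intelligibility - tol:
        blockers.append("intelligibility")
    if cand.first_audio_ms > first_audio_slo_ms:
        blockers.append("first_audio_latency")
    if cand.unstable_rate > base.unstable_rate + tol:
        blockers.append("voice_stability")
    if cand.unsafe_spoken > 0:
        blockers.append("unsafe_speech")
    if cand.speaker_similarity < base.speaker_similarity - tol:
        blockers.append("speaker_identity")
    # A gain on lab prompts that turns into a loss on product text.
    if cand.win_rate_lab > 0.5 and cand.win_rate_product < 0.5:
        blockers.append("lab_only_gain")
    return blockers


baseline = TtsEval(0.97, 240.0, 0.02, 0, 0.90)
candidate = TtsEval(0.975, 350.0, 0.02, 0, 0.91,
                    win_rate_lab=0.62, win_rate_product=0.46)
blockers = release_blockers(candidate, baseline)
print(blockers)  # ['first_audio_latency', 'lab_only_gain']
```

Returning the list of tripped conditions, rather than a single
pass/fail flag, keeps the reason for a blocked release visible in the
report.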
End-To-End Assistant Quality
Track task completion, correction rate, barge-in success, turn latency, hallucination rate, refusal quality, and recovery from interrupted speech.
Question: Why keep component and end-to-end metrics?
End-to-end metrics show whether users succeed, but they are hard
to debug on their own. Component metrics localize the failure. If
task completion drops, check separately for ASR errors, retrieval
misses, reasoning failures, TTS latency, and UI playback problems.
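One way to localize a drop in task completion is to attribute each
failed session to the first pipeline stage that reported a problem.
The sketch below assumes a hypothetical log format in which every
session carries a per-stage pass/fail flag; the stage names and the
sample sessions are illustrative only.

```python
from collections import Counter

# Stages in pipeline order; names are illustrative.
STAGES = ["asr", "retrieval", "reasoning", "tts_latency", "playback"]


def attribute_failures(sessions):
    """Count failed sessions by the earliest stage flagged as not OK."""
    counts = Counter()
    for session in sessions:
        if session["task_completed"]:
            continue
        first_bad = next(
            (stage for stage in STAGES if not session["stage_ok"][stage]),
            "unattributed",  # task failed but every component check passed
        )
        counts[first_bad] += 1
    return counts


def all_ok(**overrides):
    """Helper: stage flags that pass everywhere except the overrides."""
    flags = {stage: True for stage in STAGES}
    flags.update(overrides)
    return flags


sessions = [
    {"task_completed": True, "stage_ok": all_ok()},
    {"task_completed": False, "stage_ok": all_ok(asr=False, reasoning=False)},
    {"task_completed": False, "stage_ok": all_ok(retrieval=False)},
    {"task_completed": False, "stage_ok": all_ok(asr=False)},
    {"task_completed": False, "stage_ok": all_ok()},
]
failure_counts = attribute_failures(sessions)
print(failure_counts.most_common())
```

Attributing to the earliest failing stage is a simplification: an
upstream ASR error often causes the downstream retrieval or reasoning
failure, but not always. The "unattributed" bucket surfaces sessions
where the component checks miss the real problem.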
Slice Coverage
Report metrics by accent, noise, channel, microphone, language, domain terms, speaking rate, long silences, overlapping speech, and device class.
Question: What makes a slice launch-critical?
A slice is launch-critical when it maps to a major user group,
safety risk, contractual requirement, revenue workflow, or known
model weakness. Protect these slices with explicit regression
budgets and rollback criteria.
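A regression budget can be checked mechanically per slice. The sketch
below assumes each slice reports an error rate where lower is better;
the slice names, budgets, and numbers are hypothetical.

```python
def slices_over_budget(baseline, candidate, budgets, default_budget=0.02):
    """Return slices whose error rate regressed by more than their budget.

    baseline, candidate: dicts mapping slice name to error rate.
    budgets: per-slice allowed absolute regression; launch-critical
    slices would typically get a tighter budget than the default.
    """
    violations = {}
    for name, base_error in baseline.items():
        regression = candidate[name] - base_error
        if regression > budgets.get(name, default_budget):
            violations[name] = regression
    return violations


baseline = {"accented_english": 0.10, "car_noise": 0.15, "clean_headset": 0.05}
candidate = {"accented_english": 0.13, "car_noise": 0.155, "clean_headset": 0.04}
budgets = {"accented_english": 0.01}  # launch-critical: tighter budget

violations = slices_over_budget(baseline, candidate, budgets)
should_roll_back = bool(violations)
print(sorted(violations), should_roll_back)  # ['accented_english'] True
```

Here the aggregate could still look flat or better, since two of the
three slices hold or improve, yet the launch-critical slice exceeds
its budget and triggers the rollback criterion.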