Production Practice

Speech AI Debugging Casebook

Practice debugging speech AI systems the way ML engineers are evaluated: isolate user impact, prove or disprove hypotheses with slices, choose mitigations, and leave behind stronger release gates.

Debugging Method

Turn Incidents Into Hypothesis Tests

Production speech failures often cross model, product, client, and infrastructure boundaries. A strong answer keeps the first hour disciplined: protect users, define observable symptoms, compare against a control, and avoid exposing private audio or transcripts.

  1. Impact: quantify affected users, traffic share, SLO burn, regions, languages, clients, and model versions.
  2. Mitigation: freeze rollout, roll back, reduce concurrency, route a slice to fallback, or disable a risky feature flag.
  3. Hypotheses: list likely causes across data, model, decoder, client, serving, retrieval, and policy layers.
  4. Slices: compare canary versus control by noise, accent proxy, duration, query type, text length, tenant, and route.
  5. Prevention: add a launch gate, synthetic fixture, dashboard, alert, model-card note, or runbook step.

Question: What separates advanced debugging from metric chasing?

Production debugging connects each metric to a user-visible failure and a decision. It distinguishes correlation from causality, checks control cohorts, protects privacy, and chooses the smallest mitigation that stops harm while preserving evidence for root-cause analysis.
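
The control-cohort check can be sketched as a per-slice delta table. This is a minimal illustration on synthetic aggregates. The `error_rate` and `n` fields, the slice names, and the `min_samples` cutoff are assumptions made for the example, not part of any specific platform.

```python
def slice_deltas(control, canary, min_samples=200):
    """Return (slice, canary - control error-rate delta), largest regression first.

    control / canary: {slice_name: {"error_rate": float, "n": int}} (hypothetical shape).
    Slices missing from the canary or with too few samples on either side are skipped
    so that a noisy small cohort is not mistaken for a regression.
    """
    rows = []
    for name, ctrl in control.items():
        can = canary.get(name)
        if can is None:
            continue
        if min(ctrl["n"], can["n"]) < min_samples:
            continue
        rows.append((name, round(can["error_rate"] - ctrl["error_rate"], 4)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Synthetic aggregates only; no audio, transcripts, or user identifiers.
control = {
    "quiet": {"error_rate": 0.050, "n": 5000},
    "noisy_mobile": {"error_rate": 0.120, "n": 1200},
    "long_audio": {"error_rate": 0.080, "n": 90},
}
canary = {
    "quiet": {"error_rate": 0.048, "n": 5100},
    "noisy_mobile": {"error_rate": 0.190, "n": 1150},
    "long_audio": {"error_rate": 0.200, "n": 85},
}

deltas = slice_deltas(control, canary)
print(deltas)
```

On this toy data the noisy mobile slice ranks first, while the long-audio slice is excluded because both cohorts are below the sample cutoff. A real analysis would likely add confidence intervals before acting on a delta.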

Case Bank

Interview-Style Production Debugging Cases

Time-box each case to 20 minutes. Write impact, mitigation, slices, likely root causes, and the release gate you would add before opening the hidden answer.

Case 1: ASR Entity Recall Drops After Better WER

A new ASR model improves overall WER by 3 percent relative, but support calls now miss account numbers and medicine names more often. Product asks why the launch gate passed.

Hidden answer: strong diagnosis

WER hid a task-critical entity regression. Compare entity recall, slot edit rate, confidence calibration, escalation rate, and correction events by domain and noise slice. Check normalization, hotword biasing, decoder vocabulary, punctuation, and post-ASR extraction. The fix is not just another aggregate WER gate: add entity-weighted evals, domain fixtures, and rollback triggers for critical slots.
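
A small synthetic example shows how aggregate WER can improve while entity recall collapses. The sentences, entity lists, and the exact-phrase matching rule are invented for illustration. A production eval would likely use normalized slots and a labeled entity set.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(refs, hyps):
    """Corpus-level WER: total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def entity_recall(entities, hyps):
    """Fraction of reference entities that appear verbatim in the hypothesis."""
    hit = total = 0
    for ents, hyp in zip(entities, hyps):
        for ent in ents:
            total += 1
            hit += f" {ent} " in f" {hyp} "
    return hit / total


# Synthetic utterances only.
refs = [
    "please refill my metformin prescription today",
    "my account number is four seven two nine",
    "i would like to talk to an agent about my bill",
]
entities = [["metformin"], ["four seven two nine"], []]

old_hyps = [
    "please refill my metformin prescription to day",
    "my account number is four seven two nine",
    "i would like talk to a agent about the bill",
]
new_hyps = [
    "please refill my met forming prescription today",
    "my account number is four seven to nine",
    "i would like to talk to an agent about my bill",
]

old_wer, new_wer = wer(refs, old_hyps), wer(refs, new_hyps)
old_recall, new_recall = entity_recall(entities, old_hyps), entity_recall(entities, new_hyps)
print(old_wer, new_wer, old_recall, new_recall)
```

Here the new model makes fewer word errors overall but misses both critical entities, which is the pattern an aggregate WER gate cannot see.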

Case 2: Voice Agent Latency Is Flat But Users Interrupt More

p95 end-to-end latency is unchanged, yet barge-in events and abandoned turns increased after a speech-to-speech release.

Hidden answer: what to inspect

Inspect first useful token, first audio byte, TTS chunk cadence, silence before speech, prosody, interruption handling, client event ordering, and answer verbosity. A flat p95 can hide worse perceived latency if the system speaks later or says more before the useful answer. Add conversational metrics: time to first useful audio, token-to-audio lag, barge-in success, and turn completion.
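
One way to make perceived latency visible is to derive per-turn metrics from client event timestamps. The sketch below assumes a hypothetical event stream with the names `user_speech_end`, `first_audio_byte`, `first_useful_audio`, and `turn_end`. Real clients will use different event names, and marking the "first useful audio" moment usually needs its own annotation rule.

```python
def turn_metrics(events):
    """Compute perceived-latency metrics for one turn.

    events: list of (timestamp_ms, event_name) in any order (hypothetical schema).
    Only the first occurrence of each event name is used.
    """
    first = {}
    for ts, name in sorted(events):
        first.setdefault(name, ts)
    start = first["user_speech_end"]
    return {
        "time_to_first_audio_ms": first["first_audio_byte"] - start,
        "time_to_first_useful_audio_ms": first["first_useful_audio"] - start,
        "end_to_end_ms": first["turn_end"] - start,
    }


# Synthetic timelines: same end-to-end time, but the canary speaks a long preamble.
control_turn = [
    (0, "user_speech_end"),
    (400, "first_audio_byte"),
    (600, "first_useful_audio"),
    (3000, "turn_end"),
]
canary_turn = [
    (0, "user_speech_end"),
    (450, "first_audio_byte"),
    (1700, "first_useful_audio"),
    (3000, "turn_end"),
]

control_metrics = turn_metrics(control_turn)
canary_metrics = turn_metrics(canary_turn)
print(control_metrics)
print(canary_metrics)
```

Both turns report the same end-to-end latency, yet the canary delivers its useful content more than a second later, which is consistent with users interrupting more often.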

Case 3: RAG Grounding Fails Only On Spoken Queries

Text RAG evals pass, but the voice assistant gives stale or unrelated answers when users speak the same requests.

Hidden answer: likely failure chain

Spoken queries add ASR substitutions, punctuation loss, partial hypotheses, language-ID errors, and shorter conversational phrasing. Compare retrieval recall using clean text, final transcripts, noisy transcripts, and partial transcripts. Add spoken-query evals, ASR-noise augmentation, freshness filters, citation checks, and a fallback when retrieval confidence is low.
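
The four-way comparison can be sketched with a toy retriever. Everything here is an assumption for illustration: the token-overlap scorer stands in for whatever retriever is actually deployed, and the documents and transcript variants are synthetic.

```python
def retrieve(query, docs, k=1):
    """Rank documents by token overlap with the query; drop zero-overlap documents."""
    q_tokens = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q_tokens & set(text.lower().split()))
        if overlap > 0:
            scored.append((-overlap, doc_id))
    return [doc_id for _, doc_id in sorted(scored)[:k]]


def recall_at_k(eval_set, docs, condition, k=1):
    """Share of queries whose relevant document is in the top k for one condition."""
    hits = sum(item["relevant"] in retrieve(item[condition], docs, k) for item in eval_set)
    return hits / len(eval_set)


docs = {
    "d1": "reset your router by holding the power button",
    "d2": "current billing cycle and payment dates",
    "d3": "roaming charges for international travel",
}

# Each item holds the same request under four input conditions (all synthetic).
eval_set = [
    {
        "relevant": "d1",
        "clean_text": "how do i reset my router",
        "final_transcript": "how do i reset my rooter",
        "noisy_transcript": "how do i recent my writer",
        "partial_transcript": "how do i",
    },
    {
        "relevant": "d3",
        "clean_text": "what are roaming charges for travel",
        "final_transcript": "what are roaming charges for travel",
        "noisy_transcript": "what are roman charges for travel",
        "partial_transcript": "what are roaming",
    },
]

conditions = ["clean_text", "final_transcript", "noisy_transcript", "partial_transcript"]
results = {c: recall_at_k(eval_set, docs, c) for c in conditions}
print(results)
```

On this toy set, recall holds for clean text and final transcripts and halves for noisy and partial transcripts. That kind of gap points at the ASR-to-retrieval handoff rather than at the index itself.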

Case 4: Quantization Saves Cost But Breaks Safety Classifiers

A quantized routing model lowers CPU cost. After rollout, more spoken prompt-injection attempts reach tools that require caution.

Hidden answer: mitigation and prevention

Roll back or route sensitive traffic to the higher-precision model. Slice by prompt-injection class, ASR-noise level, language, tool type, threshold bucket, and false-negative examples from synthetic fixtures. Quantization must be evaluated on safety recall and calibration, not only average task accuracy or cost.
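
A per-slice safety recall check might look like the sketch below. The scores, slice names, the `score_fp` and `score_q` fields, and the 0.5 threshold are all synthetic assumptions. The point is only that recall is computed on positives per slice for both precisions at the same operating threshold.

```python
def recall_by_slice(examples, score_key, threshold):
    """Recall on positive (label == 1) examples, grouped by slice.

    examples: list of dicts with 'slice', 'label', and per-model score fields
    (hypothetical schema). A positive counts as caught when score >= threshold.
    """
    hits, totals = {}, {}
    for ex in examples:
        if ex["label"] != 1:
            continue
        name = ex["slice"]
        totals[name] = totals.get(name, 0) + 1
        hits[name] = hits.get(name, 0) + (ex[score_key] >= threshold)
    return {name: hits[name] / totals[name] for name in totals}


# Synthetic fixture scores for a higher-precision model and its quantized variant.
examples = [
    {"slice": "clean", "label": 1, "score_fp": 0.90, "score_q": 0.88},
    {"slice": "clean", "label": 1, "score_fp": 0.80, "score_q": 0.79},
    {"slice": "clean", "label": 0, "score_fp": 0.10, "score_q": 0.12},
    {"slice": "noisy_asr", "label": 1, "score_fp": 0.62, "score_q": 0.55},
    {"slice": "noisy_asr", "label": 1, "score_fp": 0.58, "score_q": 0.46},
    {"slice": "noisy_asr", "label": 0, "score_fp": 0.20, "score_q": 0.25},
]

fp_recall = recall_by_slice(examples, "score_fp", threshold=0.5)
q_recall = recall_by_slice(examples, "score_q", threshold=0.5)
print(fp_recall)
print(q_recall)
```

In this fixture the quantized scores shift only slightly, but one noisy-ASR positive slips under the threshold. Small calibration drift near the decision boundary is enough to change safety recall on a slice while clean-slice recall looks unchanged.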

Coding Lab

Incident Utility Functions

These small functions mirror the follow-up questions that come up in advanced practice. They use synthetic aggregate metrics and deliberately avoid private recordings, transcripts, or account data.

Lab 1: Find The First Bad Release

Given ordered release summaries and per-metric budgets, return the first release where any required metric exceeds its budget.

Hidden answer: invariant, mistakes, and Python solution

Invariant: all earlier releases are acceptable under the same budget rules. Common mistakes are treating missing metrics as pass, mixing lower-is-better and higher-is-better metrics, and reporting the worst release instead of the first bad one.

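def first_bad_release(releases, budgets):
    """Return (release_id, metric, value) for the first release that breaks a budget.

    releases must be ordered oldest to newest. Each budget rule has a "direction"
    ("max" for lower-is-better, "min" for higher-is-better) and a "limit".
    """
    for release in releases:
        metrics = release["metrics"]
        for metric, rule in budgets.items():
            # A missing required metric is a failure, never a silent pass.
            if metric not in metrics:
                return release["id"], metric, "missing"

            value = metrics[metric]
            if rule["direction"] == "max" and value > rule["limit"]:
                return release["id"], metric, value
            if rule["direction"] == "min" and value < rule["limit"]:
                return release["id"], metric, value

    # Every release stayed within budget.
    return None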

Lab 2: Pick A Speech Rollback Target

Given candidate model versions, choose the newest version that passes quality, latency, safety, and cost gates for a specific traffic slice.

Hidden answer: tests and Python solution

Test no passing candidate, missing slice data, exactly-on-threshold metrics, and a newer model that passes aggregate metrics but fails the requested slice. Rollback selection must be slice-aware.

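def _passes(value, rule):
    # A bare number is an upper limit (lower is better). A dict uses the Lab 1
    # shape {"direction": "max" | "min", "limit": x} so that higher-is-better
    # gates such as quality are not checked in the wrong direction.
    if isinstance(rule, dict):
        if rule["direction"] == "min":
            return value >= rule["limit"]
        return value <= rule["limit"]
    return value <= rule


def choose_rollback_target(versions, slice_name, gates):
    """Return the id of the newest version that passes every gate on slice_name.

    versions must be ordered oldest to newest. A value exactly on its limit passes.
    """
    for version in reversed(versions):
        metrics = version["slices"].get(slice_name)
        # No data for the requested slice means the version cannot be trusted for it.
        if metrics is None:
            continue

        # A missing gated metric fails the version rather than passing silently.
        if all(m in metrics and _passes(metrics[m], rule) for m, rule in gates.items()):
            return version["id"]

    return None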

Lab 3: Explain A Regression Without Private Data

Build a compact incident packet from aggregate slice metrics and a list of sanitized hypotheses.

Hidden answer: privacy-safe packet builder

The packet should contain aggregate deltas, affected slices, mitigation, hypotheses, next checks, and prevention actions. It should not include raw transcripts, audio clips, user identifiers, voiceprints, or unredacted support tickets.

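def build_incident_packet(incident_id, impact, mitigation, top_slices, hypotheses,
                          next_checks=(), prevention=()):
    # Inputs must already be aggregate and sanitized; this function only caps list
    # sizes and does not redact anything itself.
    return {
        "incident_id": incident_id,
        "impact": impact,
        "mitigation": mitigation,
        "top_regressed_slices": top_slices[:5],
        "hypotheses": hypotheses[:5],
        "next_checks": list(next_checks)[:5],
        "prevention": list(prevention)[:5],
        "privacy_note": "aggregate metrics only; no raw audio or transcripts",
    }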

Advanced Exam Prompts

Answer Out Loud, Then Check Yourself

Prompt 1: Design A Debugging Dashboard

Design the minimum dashboard for a shared ASR, LLM, retrieval, and TTS voice-agent platform. Include SLOs, slices, alerts, and rollback decisions.

Hidden answer: dashboard outline

Include traffic, error rate, queue age, first partial transcript, final transcript latency, WER/entity slices, retrieval recall, groundedness, first audio byte, chunk underruns, barge-in success, cost per minute, GPU utilization, and canary versus control comparisons. Alerts should map to actions: page, freeze rollout, route fallback, or roll back.
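
The alert-to-action mapping can be sketched as a small rule table. The metric names, limits, and the severity ordering of the four actions are assumptions chosen for the example, not recommended thresholds.

```python
# Assumed ordering from mildest to most severe response.
ACTIONS = ["page", "freeze rollout", "route fallback", "roll back"]

# Hypothetical rules: "max" means lower is better, "min" means higher is better.
RULES = [
    {"metric": "canary_error_rate_delta", "direction": "max", "limit": 0.01, "action": "roll back"},
    {"metric": "first_audio_byte_p95_ms", "direction": "max", "limit": 800, "action": "route fallback"},
    {"metric": "entity_recall", "direction": "min", "limit": 0.90, "action": "freeze rollout"},
    {"metric": "queue_age_p95_ms", "direction": "max", "limit": 500, "action": "page"},
]


def decide(metrics, rules=RULES):
    """Return (most severe action or None, list of fired (metric, value, action))."""
    fired = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is None:
            # A metric that stopped reporting is itself worth a page.
            fired.append((rule["metric"], "missing", "page"))
            continue
        if rule["direction"] == "max":
            breached = value > rule["limit"]
        else:
            breached = value < rule["limit"]
        if breached:
            fired.append((rule["metric"], value, rule["action"]))
    if not fired:
        return None, fired
    worst = max(fired, key=lambda item: ACTIONS.index(item[2]))
    return worst[2], fired


# Synthetic dashboard snapshot.
snapshot = {
    "canary_error_rate_delta": 0.002,
    "first_audio_byte_p95_ms": 950,
    "entity_recall": 0.87,
    "queue_age_p95_ms": 300,
}
action, fired = decide(snapshot)
print(action, fired)
```

Keeping the action next to the threshold makes each alert answer the question "what do we do when this fires", which is the property the dashboard outline asks for.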

Prompt 2: Defend A Rollback Decision

A model improves average quality but regresses a high-value noisy mobile slice. Leadership asks you to keep shipping. What do you say?

Hidden answer: strong response

State the user impact and business risk, show slice evidence, offer constrained alternatives, and define the data needed to proceed. A strong answer can propose model routing, narrower canary, extra labeling, or targeted fine-tuning, but it does not hide a known high-value regression behind aggregate improvement.