Speech AI Debugging Casebook

Debugging Method

Turn Incidents Into Hypothesis Tests

Production speech failures often cross model, product, client, and infrastructure boundaries. A strong answer keeps the first hour disciplined: protect users, define observable symptoms, compare against a control, and avoid exposing private audio or transcripts.

Impact: quantify affected users, traffic share, SLO burn, regions, languages, clients, and model versions.
Mitigation: freeze rollout, roll back, reduce concurrency, route a slice to fallback, or disable a risky feature flag.
Hypotheses: list likely causes across data, model, decoder, client, serving, retrieval, and policy layers.
Slices: compare canary versus control by noise, privacy-safe accent or locale proxies, duration, query type, text length, approved aggregate tenant or plan buckets, and route; do not infer or expose protected attributes from raw speech.
Prevention: add a launch gate, synthetic fixture, dashboard, alert, model-card note, or runbook step.
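
The slice step above can be sketched as a canary-versus-control comparison over aggregate counts. This is a minimal illustration, not a prescribed tool: the function name `slice_error_deltas`, the `errors`/`total` fields, and the `min_count` floor are all assumptions made for the example.

```python
def slice_error_deltas(canary, control, min_count=50):
    """Compare canary and control error rates per slice from aggregate counts.

    canary / control: dict of slice name -> {"errors": int, "total": int}
    (hypothetical shape). Returns (slice, control_rate, canary_rate, delta)
    tuples, largest regression first.
    """
    floor = max(min_count, 1)  # also guards against division by zero
    rows = []
    # Only slices present in both arms can be compared against a control.
    for name in sorted(set(canary) & set(control)):
        c, k = canary[name], control[name]
        # Skip low-count slices: they are noisy and may be identifying.
        if c["total"] < floor or k["total"] < floor:
            continue
        canary_rate = c["errors"] / c["total"]
        control_rate = k["errors"] / k["total"]
        rows.append((name, control_rate, canary_rate, canary_rate - control_rate))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows
```

A positive delta only flags a slice for investigation. It does not establish cause until the same slice is checked against the control cohort over the same time window.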

Question: What separates advanced debugging from metric chasing?

Advanced debugging connects each metric to a user-visible failure and to a decision, where metric chasing optimizes a number without asking what it protects. It distinguishes correlation from causality, checks control cohorts, protects privacy, and chooses the smallest mitigation that stops harm while preserving evidence for root-cause analysis.

Case Bank

Interview-Style Production Debugging Cases

Time-box each case to 20 minutes. Before opening the hidden answer, write down the impact, mitigation, slices, likely root causes, and the release gate you would add.

Case 1: ASR Entity Recall Drops After Better WER

A new ASR model improves overall WER by 3 percent relative, but support calls now miss account numbers and medicine names more often. Product asks why the launch gate passed.

Hidden answer: strong diagnosis

WER hid a task-critical entity regression. Compare entity recall, slot edit rate, confidence calibration, escalation rate, and correction events by domain and noise slice. Check normalization, hotword biasing, decoder vocabulary, punctuation, and post-ASR extraction. The fix is not just another aggregate WER gate: add entity-weighted evals, domain fixtures, and rollback triggers for critical slots.
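
One way to make the entity regression visible is to score entity recall on synthetic fixtures alongside WER. The sketch below assumes exact, whitespace-normalized phrase matching and a hypothetical fixture shape (`hypothesis`, `entities`). A real evaluation would also need the normalization rules for numbers and drug names that the case mentions.

```python
def normalize(text):
    """Lowercase and collapse whitespace; punctuation handling is out of scope here."""
    return " ".join(text.lower().split())


def entity_recall(examples):
    """Fraction of reference entities found verbatim in the ASR hypothesis.

    examples: iterable of {"hypothesis": str, "entities": [str, ...]}
    (hypothetical fixture shape, synthetic data only).
    """
    found = total = 0
    for example in examples:
        # Pad with spaces so matches respect word boundaries.
        hypothesis = f" {normalize(example['hypothesis'])} "
        for entity in example["entities"]:
            total += 1
            if f" {normalize(entity)} " in hypothesis:
                found += 1
    if total == 0:
        raise ValueError("no reference entities to score")
    return found / total


def critical_slot_gate(control_recall, canary_recall, max_drop=0.01):
    """Pass only if canary entity recall has not dropped by more than max_drop."""
    return (control_recall - canary_recall) <= max_drop
```

The `max_drop` value is a placeholder. The point of the design is that the gate compares a critical-slot metric against control, so an aggregate WER gain cannot mask the loss.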

Case 2: Voice Agent Latency Is Flat But Users Interrupt More

p95 end-to-end latency is unchanged, yet barge-in events and abandoned turns increased after a speech-to-speech release.

Hidden answer: what to inspect

Inspect first useful token, first audio byte, TTS chunk cadence, silence before speech, prosody, interruption handling, client event ordering, and answer verbosity. A flat p95 can hide worse perceived latency if the system speaks later or says more before the useful answer. Add conversational metrics: time to first useful audio, token-to-audio lag, barge-in success, and turn completion.
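
The conversational metrics named above can be derived from per-turn event timestamps. This is a rough sketch under assumed field names (`user_speech_end_ms`, `first_audio_byte_ms`, `first_useful_audio_ms`), and how "first useful audio" gets labeled is left open. It uses a nearest-rank percentile for simplicity.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def conversational_latency(turns, pct=95):
    """Summarize perceived latency relative to the end of user speech.

    turns: list of dicts with synthetic millisecond timestamps
    (field names are assumptions for this example).
    """
    first_audio = [
        t["first_audio_byte_ms"] - t["user_speech_end_ms"] for t in turns
    ]
    first_useful = [
        t["first_useful_audio_ms"] - t["user_speech_end_ms"] for t in turns
    ]
    return {
        "first_audio_byte_ms": percentile(first_audio, pct),
        "first_useful_audio_ms": percentile(first_useful, pct),
    }
```

Comparing the two numbers between canary and control shows whether the system starts speaking on time but delays the useful part of the answer, which a flat end-to-end p95 would not reveal.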

Case 3: RAG Grounding Fails Only On Spoken Queries

Text RAG evals pass, but the voice assistant gives stale or unrelated answers when users speak the same requests.

Hidden answer: likely failure chain

Spoken queries add ASR substitutions, punctuation loss, partial hypotheses, language-ID errors, and shorter conversational phrasing. Compare retrieval recall using clean text, final transcripts, noisy transcripts, and partial transcripts. Add spoken-query evals, ASR-noise augmentation, freshness filters, citation checks, and a fallback when retrieval confidence is low.
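
The comparison across clean text, final transcripts, noisy transcripts, and partial transcripts could look like the following recall@k sketch. The row shape (`variant`, `retrieved`, `relevant`) and the variant names are assumptions for illustration. The document IDs are synthetic.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of relevant documents that appear in the top k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant documents")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def recall_by_variant(rows, k=5):
    """Average recall@k per query variant.

    rows: list of {"variant": str, "retrieved": [ids], "relevant": [ids]},
    where each variant is the same request in a different form, such as
    "clean_text" or "final_transcript" (hypothetical labels).
    """
    totals = {}
    for row in rows:
        score = recall_at_k(row["retrieved"], row["relevant"], k)
        running_sum, count = totals.get(row["variant"], (0.0, 0))
        totals[row["variant"]] = (running_sum + score, count + 1)
    return {variant: s / n for variant, (s, n) in totals.items()}
```

A gap between the clean-text and final-transcript variants points at ASR errors, while a gap that appears only for partial transcripts points at when retrieval is triggered.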

Case 4: Quantization Saves Cost But Breaks Safety Classifiers

A quantized routing model lowers CPU cost. After rollout, more spoken prompt-injection attempts reach tools that require caution.

Hidden answer: mitigation and prevention

Roll back or route sensitive traffic to the higher-precision model. Slice by prompt-injection class, ASR-noise level, approved locale/language eval buckets, tool type, threshold bucket, and false-negative examples from synthetic or consented redacted fixtures. Quantization must be evaluated on safety recall and calibration, not only average task accuracy or cost.
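
A per-slice safety recall check might be sketched as below. It assumes each model emits a score where higher means "more likely an attack", that label 1 marks an attack in a synthetic or consented fixture set, and that both models are scored at the same threshold. The `max_drop` tolerance is a placeholder.

```python
def safety_recall(scores, labels, threshold):
    """Recall on attack examples (label 1) at score >= threshold."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    if not positives:
        raise ValueError("slice has no positive examples")
    caught = sum(1 for s in positives if s >= threshold)
    return caught / len(positives)


def recall_regressions(baseline, quantized, threshold, max_drop=0.02):
    """Return slices where the quantized model loses more than max_drop recall.

    baseline / quantized: dict of slice name -> {"scores": [...], "labels": [...]}
    (hypothetical shape). A slice missing from `quantized` raises KeyError on
    purpose, so an incomplete evaluation cannot pass silently.
    """
    regressions = {}
    for name, base in baseline.items():
        quant = quantized[name]
        base_recall = safety_recall(base["scores"], base["labels"], threshold)
        quant_recall = safety_recall(quant["scores"], quant["labels"], threshold)
        if base_recall - quant_recall > max_drop:
            regressions[name] = (base_recall, quant_recall)
    return regressions
```

Because quantization can shift score distributions, a fixed threshold may no longer sit at the same operating point. That is why the case asks for calibration checks in addition to recall.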

Coding Lab

Incident Utility Functions

These small functions mirror the follow-up questions that come up in advanced practice. They use synthetic aggregate metrics and deliberately avoid private recordings, transcripts, or account data.

Lab 1: Find The First Bad Release

Given ordered release summaries and per-metric budgets, return the first release where any required metric violates its budget: above a "max" limit, below a "min" limit, or missing altogether.

Hidden answer: invariant, mistakes, and Python solution

Invariant: all earlier releases are acceptable under the same budget rules. Common mistakes are treating missing metrics as pass, mixing lower-is-better and higher-is-better metrics, and reporting the worst release instead of the first bad one. Unknown budget directions should fail loudly so a malformed gate does not approve a risky release. Metric values and budget limits should also be finite numbers; NaN, infinity, booleans, or strings should fail the gate instead of silently passing a comparison.

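import math


def require_finite_number(value, metric):
    """Return value if it is a finite int or float; otherwise raise ValueError."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{metric} must be a finite number")
    if not math.isfinite(value):
        raise ValueError(f"{metric} must be a finite number")
    return value


def first_bad_release(releases, budgets):
    """Return (release_id, metric, value or "missing") for the first violation, else None."""
    # Validate every budget rule up front so a malformed gate fails loudly,
    # even if an earlier metric would have returned first.
    for metric, rule in budgets.items():
        if rule["direction"] not in {"max", "min"}:
            raise ValueError(f"unknown budget direction: {rule['direction']}")
        require_finite_number(rule["limit"], f"{metric} limit")

    # releases must already be in rollout order.
    for release in releases:
        metrics = release["metrics"]
        for metric, rule in budgets.items():
            # A missing required metric is a failure, never a pass.
            if metric not in metrics:
                return release["id"], metric, "missing"

            value = require_finite_number(metrics[metric], metric)
            # "max" budgets are lower-is-better; "min" budgets are higher-is-better.
            if rule["direction"] == "max" and value > rule["limit"]:
                return release["id"], metric, value
            if rule["direction"] == "min" and value < rule["limit"]:
                return release["id"], metric, value

    return None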

Lab 2: Pick A Speech Rollback Target

Given candidate model versions, choose the newest version that passes quality, latency, safety, and cost gates for a specific traffic slice. Each gate declares whether higher or lower values are better.

Hidden answer: tests and Python solution

Test no passing candidate, missing slice data, exactly-on-threshold metrics, and a newer model that passes aggregate metrics but fails the requested slice. Rollback selection must be slice-aware and must not treat quality, recall, or safety-recall gates as lower-is-better. Reject malformed metric values or thresholds instead of letting NaN, infinity, or booleans bypass a production rollback gate.

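import math


def require_finite_number(value, metric):
    """Return value if it is a finite int or float; otherwise raise ValueError."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{metric} must be a finite number")
    if not math.isfinite(value):
        raise ValueError(f"{metric} must be a finite number")
    return value


def choose_rollback_target(versions, slice_name, gates):
    """Return the id of the newest version passing every gate on slice_name, else None."""
    # Validate gates first so a malformed gate cannot be skipped
    # just because a version happens to be missing that metric.
    for metric, rule in gates.items():
        if rule["direction"] not in {"max", "min"}:
            raise ValueError(f"unknown gate direction: {rule['direction']}")
        require_finite_number(rule["limit"], f"{metric} limit")

    # versions is assumed to be ordered oldest to newest.
    for version in reversed(versions):
        metrics = version["slices"].get(slice_name)
        # No data for the requested slice: aggregate results do not qualify it.
        if metrics is None:
            continue

        ok = True
        for metric, rule in gates.items():
            if metric not in metrics:
                ok = False
                break

            value = require_finite_number(metrics[metric], metric)
            # Exactly-on-threshold values pass; only strict violations fail.
            if rule["direction"] == "max" and value > rule["limit"]:
                ok = False
                break
            if rule["direction"] == "min" and value < rule["limit"]:
                ok = False
                break

        if ok:
            return version["id"]

    return None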

Lab 3: Explain A Regression Without Private Data

Build a compact incident packet from aggregate slice metrics and a list of sanitized hypotheses.

Hidden answer: privacy-safe packet builder

The packet should contain aggregate deltas, affected slices, mitigation, hypotheses, next checks, and prevention actions. It should not include raw transcripts, audio clips, user identifiers, voiceprints, or unredacted support tickets. Rare slice labels should be bucketed before the packet leaves the incident room, so a one-off locale, tenant, or device combination cannot identify a user.

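def bucket_rare_slices(slices, min_count=3):
    """Merge slices with fewer than min_count samples into one unlabeled bucket."""
    safe = []
    rare_count = 0

    for item in slices:
        # A slice without a count is treated as rare and its label is dropped.
        count = item.get("count", 0)
        if count >= min_count:
            safe.append(dict(item))
        else:
            rare_count += count

    if rare_count:
        # Note: the merged count can itself be below min_count; suppress or
        # widen it further if your privacy policy requires that.
        safe.append({"slice": "rare_or_low_count", "count": rare_count})

    return safe


def build_incident_packet(incident_id, impact, mitigation, top_slices, hypotheses,
                          next_checks=(), prevention=()):
    """Build a shareable packet from aggregate, sanitized inputs only."""
    return {
        "incident_id": incident_id,
        "impact": impact,
        "mitigation": mitigation,
        # top_slices is assumed to be sorted with the largest regression first.
        "top_regressed_slices": bucket_rare_slices(top_slices[:5]),
        "hypotheses": hypotheses[:5],
        "next_checks": list(next_checks),
        "prevention": list(prevention),
        "privacy_note": "aggregate metrics only; no raw audio or transcripts",
    }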

Advanced Exam Prompts

Answer Out Loud, Then Check Yourself

Prompt 1: Design A Debugging Dashboard

Design the minimum dashboard for a shared ASR, LLM, retrieval, and TTS voice-agent platform. Include SLOs, slices, alerts, and rollback decisions.

Hidden answer: dashboard outline

Include traffic, error rate, queue age, first partial transcript, final transcript latency, approved aggregate WER/entity slices, retrieval recall, groundedness, first audio byte, chunk underruns, barge-in success, cost per minute, GPU utilization, and canary versus control comparisons. Alerts should map to actions: page, freeze rollout, route fallback, or roll back.
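
The idea that every alert maps to an action can be sketched as a small rule table. The severity ordering below (page, then freeze rollout, then route fallback, then roll back) and the rule fields are assumptions for this example. A missing metric is treated as a breach so a broken dashboard feed fails closed.

```python
# Assumed ordering from least to most disruptive action.
SEVERITY = {"none": 0, "page": 1, "freeze_rollout": 2, "route_fallback": 3, "roll_back": 4}


def decide_action(observed, rules):
    """Return (most severe triggered action, list of breached metrics).

    observed: dict of metric name -> aggregate value.
    rules: list of {"metric", "direction", "limit", "action"} (hypothetical shape).
    """
    chosen, breached = "none", []
    for rule in rules:
        if rule["action"] not in SEVERITY or rule["action"] == "none":
            raise ValueError(f"unknown alert action: {rule['action']}")
        if rule["direction"] not in {"max", "min"}:
            raise ValueError(f"unknown rule direction: {rule['direction']}")

        value = observed.get(rule["metric"])
        if value is None:
            breach = True  # missing data is treated as a breach, not a pass
        elif rule["direction"] == "max":
            breach = value > rule["limit"]
        else:
            breach = value < rule["limit"]

        if breach:
            breached.append(rule["metric"])
            if SEVERITY[rule["action"]] > SEVERITY[chosen]:
                chosen = rule["action"]
    return chosen, breached
```

Keeping the mapping in data rather than in dashboard annotations makes it reviewable, and it lets the same rules drive both alerts and rollback decisions.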

Prompt 2: Defend A Rollback Decision

A model improves average quality but regresses a high-value noisy mobile slice. Leadership asks you to keep shipping. What do you say?

Hidden answer: strong response

State the user impact and business risk, show slice evidence, offer constrained alternatives, and define the data needed to proceed. A strong answer can propose model routing, narrower canary, extra labeling, or targeted fine-tuning, but it does not hide a known high-value regression behind aggregate improvement.