Production Readiness

Speech ML/AI Production Incident Drills

Practice the production skill of debugging speech systems under pressure: stabilize users first, isolate the regression, preserve privacy, and turn the incident into stronger release gates.

First Hour

A Repeatable Triage Loop

Audio incidents rarely show up in a single metric. Partial latency, final quality, client buffering, VAD, punctuation, retrieval, and TTS playback can all fail together. A strong response keeps the user-impact track separate from the root-cause hunt.

Declare impact: affected products, tenants, languages, devices, model versions, traffic percentage, and SLO burn.
Stop harm: freeze rollout, route to fallback, lower concurrency, disable risky feature flags, or roll back by model version.
Slice evidence: compare baseline versus canary by language, noise, duration, endpoint, client, region, and prompt/config version.
Protect data: use aggregate metrics, hashes, synthetic repros, and consented examples; do not paste private audio or transcripts into tickets.
Close the loop: add a release gate, dashboard, alert, runbook step, or synthetic fixture that would have caught the failure earlier.

Question: What should you say when you do not know the root cause yet?

Say what is known about impact, what has been mitigated, what evidence is being sliced next, and when the next update will arrive. Avoid guessing. Advanced incident communication separates confirmed facts from hypotheses.
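The answer above can be turned into a fixed update template so that facts and hypotheses never share a section. This is a minimal sketch; the `StatusUpdate` class and its field names are assumptions made for illustration, not part of any incident tool.

```python
from dataclasses import dataclass, field


@dataclass
class StatusUpdate:
    """Hypothetical incident update that keeps facts and guesses apart."""
    impact: str
    mitigations: list = field(default_factory=list)
    confirmed: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)
    next_slices: list = field(default_factory=list)
    next_update: str = "unscheduled"

    def render(self):
        def section(title, items):
            # An empty section is stated explicitly instead of being dropped.
            body = [f"  - {item}" for item in items] or ["  - none yet"]
            return [f"{title}:"] + body

        lines = [f"Impact: {self.impact}"]
        lines += section("Mitigated", self.mitigations)
        lines += section("Confirmed", self.confirmed)
        lines += section("Hypotheses (unverified)", self.hypotheses)
        lines += section("Slicing next", self.next_slices)
        lines.append(f"Next update: {self.next_update}")
        return "\n".join(lines)
```

Printing an empty section as "none yet" makes it visible when nothing has been confirmed, which is the honest state early in most incidents.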

Scenario Bank

Six High-Yield Audio Incidents

Treat each prompt as a 25-minute interview exercise. Write the first mitigation, the likely slices, the owner map, and the prevention gate before opening the hidden answer.

Incident 1: Streaming ASR Partial Churn Spike

Users complain that live captions keep rewriting earlier words. Final WER is nearly unchanged, but support tickets tripled after a decoder update.

Hidden answer: triage and prevention

Mitigate by rolling back the decoder or raising partial-stability thresholds for affected routes. Slice by utterance duration, endpoint VAD state, beam size, language, device, and network jitter. Look for changed emission timing, punctuation instability, hotword biasing, or aggressive rescoring. Add partial churn, first stable token latency, and finalization delay to release gates; final WER alone is not enough for streaming UX.
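Partial churn needs a concrete definition before it can become a gate. The sketch below assumes one possible definition: the number of already-emitted tokens that a later partial rewrote, divided by the length of the final hypothesis. The function names and the token-list input format are assumptions; production systems may define churn differently.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two hypotheses."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def partial_churn(partials):
    """partials: token lists in emission order; the last one is the final.

    Returns rewritten tokens per final token (0.0 means fully stable).
    """
    rewritten = 0
    prev = []
    for hyp in partials:
        # Tokens in the previous partial past the shared prefix were rewritten.
        rewritten += len(prev) - common_prefix_len(prev, hyp)
        prev = hyp
    return rewritten / len(prev) if prev else 0.0
```

For example, the partials `["i"]`, `["i", "scream"]`, `["ice", "cream"]`, `["ice", "cream", "please"]` rewrite two tokens against a three-token final, so the churn is 2/3 even though the final transcript may be perfect.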

Incident 2: TTS First Audio Byte Regression

A new expressive voice wins offline preference tests but increases p95 first audio byte from 350 ms to 1.4 s during peak traffic.

Hidden answer: triage and prevention

Mitigate with fallback voice routing, shorter first sentence chunks, warm pools, or canary rollback. Slice text length, language, voice, region, vocoder path, cache hit rate, autoscaling cold starts, and queue depth. Prevention needs conversational latency gates, not only MOS or preference tests. Track first audio byte, chunk cadence, playback underruns, cost per generated minute, and abandonment.
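A conversational latency gate can be sketched as a check on p95 first audio byte against both an absolute budget and the baseline. The 500 ms budget and 1.2x ratio below are placeholder assumptions, not recommended values, and the function names are hypothetical.

```python
from math import ceil


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    index = max(0, ceil(0.95 * len(ordered)) - 1)
    return ordered[index]


def first_audio_byte_gate(baseline_ms, candidate_ms, max_p95_ms=500.0, max_ratio=1.2):
    """Return (passed, reasons) for a candidate voice's first-audio-byte latency."""
    base, cand = p95(baseline_ms), p95(candidate_ms)
    reasons = []
    if cand > max_p95_ms:
        reasons.append(f"p95 {cand:.0f} ms exceeds absolute budget {max_p95_ms:.0f} ms")
    if cand > base * max_ratio:
        reasons.append(f"p95 {cand:.0f} ms is more than {max_ratio:.1f}x baseline {base:.0f} ms")
    return not reasons, reasons
```

With the incident's numbers (a 350 ms baseline and a 1.4 s candidate), both checks fail. The gate only means something if the samples come from peak-traffic load tests, because the regression in this scenario appeared under peak load.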

Incident 3: Voice RAG Answers Become Ungrounded

A spoken support assistant starts giving confident but unsupported answers after a document-ingestion pipeline change.

Hidden answer: triage and prevention

Mitigate by pinning the previous index, disabling affected document namespaces, or forcing human handoff when retrieval confidence is weak. Slice by document version, embedding model, chunker, ASR error class, query language, and top-k overlap with the old index. Add retrieval regression tests from spoken queries, citation coverage, answer-grounding checks, and index lineage to the release gate.
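The top-k overlap slice can be computed per query from the document ids each index returns. The sketch below uses Jaccard overlap as one reasonable choice; the function names, the `k=5` default, and the 0.5 threshold are assumptions made for illustration.

```python
def topk_overlap(old_ids, new_ids, k=5):
    """Jaccard overlap between the top-k document ids of two indexes."""
    old, new = set(old_ids[:k]), set(new_ids[:k])
    union = old | new
    if not union:
        return 1.0  # both empty: nothing changed
    return len(old & new) / len(union)


def drifted_queries(old_results, new_results, k=5, min_overlap=0.5):
    """Return (overlap, query_id) pairs below the threshold, worst first."""
    flagged = []
    for query_id, old_ids in old_results.items():
        score = topk_overlap(old_ids, new_results.get(query_id, []), k)
        if score < min_overlap:
            flagged.append((score, query_id))
    return sorted(flagged)
```

Low overlap is a lead for investigation, not proof of a regression: a better chunker can also change the top-k. Pair it with grounding checks on the flagged queries.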

Incident 4: Cost Cut Hurts Noisy Accent Slice

A cheaper ASR model reduces GPU spend by 35 percent, but one accented noisy-call slice regresses badly while aggregate WER still passes.

Hidden answer: triage and prevention

Pause rollout or route the affected slice to the stronger model. Compare WER, entity error rate, confidence calibration, correction rate, and escalation rate by accent proxy, SNR bucket, duration, device, and domain. Use governed, consented labels or coarse operational slices; do not infer protected attributes from private audio just to debug the incident. A staff-level answer may keep the cheaper model behind a tiered model router, but only if slice quality and fairness gates are explicit. Replace aggregate-only cost wins with cost-quality frontier reviews.
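A small worked example shows how an aggregate can pass while a slice regresses. The error and word counts below are invented for illustration, and the slice names are hypothetical.

```python
def wer(errors, words):
    """Word error rate from error and reference-word counts."""
    return errors / words if words else 0.0


def aggregate_and_slices(counts):
    """counts maps slice name -> (errors, reference_words)."""
    total_errors = sum(errors for errors, _ in counts.values())
    total_words = sum(words for _, words in counts.values())
    per_slice = {name: wer(e, w) for name, (e, w) in counts.items()}
    return wer(total_errors, total_words), per_slice


# Hypothetical counts: the noisy slice is only 5 percent of reference words.
baseline = {"clean": (800, 9500), "noisy_accented": (60, 500)}
candidate = {"clean": (780, 9500), "noisy_accented": (100, 500)}

base_agg, base_slices = aggregate_and_slices(baseline)
cand_agg, cand_slices = aggregate_and_slices(candidate)
```

Here aggregate WER moves from 8.6 percent to 8.8 percent, which many budgets would accept, while the noisy accented slice moves from 12 percent to 20 percent. The small slice is hidden by its low traffic weight, which is why the gate must be evaluated per slice.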

Incident 5: Barge-In Stops Working

A speech-to-speech agent keeps talking over users after a client SDK update. ASR and TTS services appear healthy in isolation.

Hidden answer: triage and prevention

Mitigate by reverting the SDK, disabling full-duplex mode, or routing to push-to-talk for affected clients. Slice by SDK version, device, echo cancellation state, VAD endpointing, playback state, and event ordering between microphone frames and TTS cancellation. Prevention needs end-to-end turn-taking tests with overlapping audio, not only component health checks.
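The event-ordering slice can be checked mechanically from a client event log. The sketch below assumes a hypothetical log of `(timestamp_ms, event_name)` pairs with invented event names (`tts_start`, `tts_chunk_played`, `user_speech_start`, `tts_cancel`, `tts_end`) and a placeholder 300 ms cancellation budget; a real SDK will use different names.

```python
def barge_in_violations(events, max_cancel_delay_ms=300):
    """Find cases where TTS kept playing too long after the user started speaking.

    events: (timestamp_ms, name) pairs sorted by time.
    Returns (speech_start_ms, offending_chunk_ms) pairs, one per overlap.
    """
    violations = []
    tts_playing = False
    speech_start = None
    for ts, name in events:
        if name == "tts_start":
            tts_playing = True
        elif name in ("tts_cancel", "tts_end"):
            tts_playing = False
            speech_start = None
        elif name == "user_speech_start" and tts_playing:
            speech_start = ts
        elif name == "tts_chunk_played" and speech_start is not None:
            if ts - speech_start > max_cancel_delay_ms:
                violations.append((speech_start, ts))
                speech_start = None  # report each overlap once
    return violations
```

Run against synthetic logs with overlapping audio, this kind of check exercises the end-to-end path that component health checks miss.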

Incident 6: Silent Drift In Language Mix

Overall traffic is stable, but a regional launch quietly changes the language and code-switching mix. ASR quality complaints rise slowly.

Hidden answer: triage and prevention

Mitigate with slice-specific routing, contextual language hints, or rollback of the regional launch path if quality is below contract. Monitor aggregate feature drift, language-id distribution, OOV rate, correction rate, fallback rate, and human escalation. Use privacy-safe aggregates and governed samples. Prevention needs traffic-mix drift alerts tied to eval coverage, not just model metric dashboards.
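A traffic-mix drift alert tied to eval coverage can be sketched from aggregate language-id counts alone, so it stays privacy-safe. The sketch uses total variation distance as one simple drift measure; the function names and the 2 percent share threshold are assumptions.

```python
def normalize(counts):
    """Convert label counts into shares that sum to 1."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def total_variation(baseline_counts, current_counts):
    """Total variation distance between two label distributions (0 to 1)."""
    p, q = normalize(baseline_counts), normalize(current_counts)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


def uncovered_languages(current_counts, eval_languages, min_share=0.02):
    """Languages with meaningful traffic share but no evaluation set."""
    shares = normalize(current_counts)
    return sorted(
        label for label, share in shares.items()
        if share >= min_share and label not in eval_languages
    )
```

If baseline traffic is 90 percent `en` and 10 percent `es`, and current traffic is 70 percent `en`, 20 percent `es`, and 10 percent `hi`, the distance is 0.2 and `hi` is flagged as uncovered when the eval suite only contains `en` and `es`.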

Coding Follow-Ups

Small Utilities For Incident Rounds

These examples mirror practical interview follow-ups: compare slices, recommend rollback, and identify the most suspicious regression.

Lab 1: Rank Slice Regressions

Given baseline and candidate metrics by slice, return slices that exceed budgets, sorted by severity.

Hidden answer: invariant, tests, and Python solution

Invariant: each slice is compared only against its own baseline. Test missing baseline, missing metric, exactly-on-budget deltas, lower-is-better metrics like WER, and higher-is-better metrics like task success.

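def rank_regressions(baseline, candidate, budgets, tolerance=1e-9):
    # Each finding is (severity, slice_name, metric, delta); unknowns rank first.
    findings = []
    for slice_name, metrics in candidate.items():
        old = baseline.get(slice_name)
        if old is None:
            # No baseline to compare against: surface it rather than skip it.
            findings.append((float("inf"), slice_name, "missing_baseline", None))
            continue

        for metric, rule in budgets.items():
            if metric not in metrics or metric not in old:
                findings.append((float("inf"), slice_name, metric, "missing_metric"))
                continue

            delta = metrics[metric] - old[metric]
            allowed = rule["allowed_delta"]
            direction = rule["direction"]
            if direction == "lower_is_better":
                # e.g. WER: an increase beyond the budget is a regression.
                severity = delta - allowed
            elif direction == "higher_is_better":
                # e.g. task success: a drop beyond the budget is a regression.
                severity = -delta - allowed
            else:
                raise ValueError(f"unsupported direction for {metric}: {direction}")

            # The tolerance keeps exactly-on-budget deltas from tripping on float rounding.
            if severity > tolerance:
                findings.append((severity, slice_name, metric, delta))

    # Worst first; ties break by slice and metric name, never by the mixed-type last field.
    return sorted(findings, key=lambda f: (-f[0], f[1], f[2]))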

Lab 2: Rollback Recommendation

Convert SLO burn, critical slice regressions, and business wins into a simple recommendation: continue, pause, partial rollback, or rollback.

Hidden answer: invariant, tests, and Python solution

Invariant: user harm overrides aggregate wins. Test high SLO burn, one critical slice, many noncritical warnings, and a clean canary with cost or latency improvement.

def recommend_rollout(slo_burn, critical_regressions, warnings, aggregate_wins):
    if slo_burn >= 2.0:
        return "rollback", "SLO burn is too high"
    if critical_regressions:
        affected = sorted({item["slice"] for item in critical_regressions})
        if len(affected) <= 2:
            return "partial rollback", "route affected slices: " + ", ".join(affected)
        return "rollback", "critical regressions are broad"
    if len(warnings) >= 3:
        return "pause", "too many unresolved warning signals"
    if aggregate_wins:
        return "continue", "wins are acceptable and guardrails are clean"
    return "pause", "no clear win yet"

Lab 3: Build A Privacy-Safe Incident Packet

Given request records with optional transcript text, produce a packet with aggregate counts and no raw transcript content.

Hidden answer: invariant, tests, and Python solution

Invariant: the output contains counts, versions, and metric summaries only, with rare slices bucketed so a one-off language/device pair does not identify a user. Test that transcript, audio path, user id, and raw text fields never appear in the packet, that missing latency values do not crash the report, and that slice keys are stable strings suitable for JSON incident reports.

from collections import Counter, defaultdict
from math import ceil


def nearest_rank(values, percentile):
    # values must be sorted and non-empty; clamp so percentile=0 returns the minimum.
    index = max(0, ceil(percentile * len(values)) - 1)
    return values[index]


def incident_packet(records, min_slice_count=3):
    counts = Counter()
    raw_latency = defaultdict(list)
    versions = Counter()

    for row in records:
        language = row.get("language", "unknown")
        device = row.get("device", "unknown")
        key = f"language={language}|device={device}"
        counts[key] += 1
        latency_ms = row.get("latency_ms")
        if latency_ms is not None:
            # Skip absent or null latencies so sorting never sees None.
            raw_latency[key].append(latency_ms)
        if "model_version" in row:
            versions[row["model_version"]] += 1

    safe_counts = Counter()
    latency = defaultdict(list)
    for key, count in counts.items():
        safe_key = key if count >= min_slice_count else "language=rare|device=rare"
        safe_counts[safe_key] += count
        latency[safe_key].extend(raw_latency[key])

    latency_summary = {}
    for key, values in latency.items():
        if not values:
            continue
        values = sorted(values)
        latency_summary[key] = {
            "count": len(values),
            "p50_ms": nearest_rank(values, 0.50),
            "p95_ms": nearest_rank(values, 0.95),
        }

    return {
        "slice_counts": dict(safe_counts),
        "latency": latency_summary,
        "model_versions": dict(versions),
    }

Timed Practice

Advanced Practice Prompts

Prompt 1: Explain Rollback Versus Fix-Forward

In five minutes, explain when you would roll back an ASR model versus fix forward with a decoder, prompt, or router change.

Hidden answer: strong response outline

Roll back when user harm is active, SLO burn is high, privacy or safety is at risk, or the blast radius is not understood. Fix forward only when the failure is narrow, the mitigation is lower risk than rollback, validation is fast, and fallback remains ready. Mention model version flags, canary scope, eval slices, owner approval, and post-incident release-gate changes.
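The outline above reduces to a short decision rule. The sketch below encodes it with boolean inputs; the function and parameter names are assumptions, and a real decision would also weigh owner approval and canary scope.

```python
def rollback_or_fix_forward(*, user_harm_active, slo_burn_high, privacy_or_safety_risk,
                            blast_radius_understood, failure_is_narrow,
                            fix_is_lower_risk, validation_is_fast, fallback_ready):
    """Return (decision, reasons) following the rollback-first outline."""
    rollback_triggers = {
        "user harm is active": user_harm_active,
        "SLO burn is high": slo_burn_high,
        "privacy or safety is at risk": privacy_or_safety_risk,
        "blast radius is not understood": not blast_radius_understood,
    }
    reasons = [name for name, hit in rollback_triggers.items() if hit]
    if reasons:
        return "rollback", reasons

    # Fix-forward is allowed only when every precondition holds.
    preconditions = {
        "failure is narrow": failure_is_narrow,
        "fix is lower risk than rollback": fix_is_lower_risk,
        "validation is fast": validation_is_fast,
        "fallback remains ready": fallback_ready,
    }
    unmet = [name for name, ok in preconditions.items() if not ok]
    if unmet:
        return "rollback", [f"fix-forward precondition unmet: {name}" for name in unmet]
    return "fix forward", ["all fix-forward preconditions hold"]
```

Rollback is the default in this sketch: fix-forward has to earn its place by meeting every precondition, which matches the asymmetry in the outline.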

Prompt 2: Design The Dashboard

Name the dashboard panels you want for a production speech-to-speech assistant and explain which alert should page a human.

Hidden answer: strong response outline

Include turn latency, first partial latency, finalization delay, TTS first audio byte, barge-in success, ASR confidence, retrieval grounding, answer safety, fallback/handoff, cost per minute, error budget burn, and slice drift. Page on customer-impacting SLO burn, safety/privacy failures, broad fallback spikes, and sustained groundedness collapse; ticket lower-severity drift and cost trends.
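The page-versus-ticket split can be written as a small routing rule. The alert kind strings and flag names below are hypothetical; only the routing policy comes from the outline above.

```python
def route_alert(kind, customer_impacting=False, broad=False, sustained=False):
    """Return "page" or "ticket" for a hypothetical alert kind."""
    if kind in {"safety_failure", "privacy_failure"}:
        return "page"  # always page, regardless of scope
    if kind == "slo_burn" and customer_impacting:
        return "page"
    if kind == "fallback_spike" and broad:
        return "page"
    if kind == "groundedness_collapse" and sustained:
        return "page"
    # Lower-severity drift, cost trends, and anything unqualified become tickets.
    return "ticket"
```

The qualifiers matter as much as the alert kind: a brief, narrow fallback spike becomes a ticket, while the same signal across many tenants pages a human.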