Production Lab

Speech MLOps, Serving, And Incident Response

Speech ML/AI work is measured after launch. This lesson trains the production habits behind reliable ASR, TTS, embeddings, speech-to-speech, and audio classification systems.

Architecture

Reference Serving System

Use this as the baseline design in interviews and course projects. Replace parts only when a constraint forces a different choice.

  1. Ingress: authenticate requests, assign trace IDs, validate format, and reject unsafe payloads early.
  2. Preprocess: resample, normalize loudness, run VAD, chunk streams, and preserve timing metadata.
  3. Route: select CPU, GPU, local, batch, or streaming path by model type, latency SLO, and tenant policy.
  4. Infer: use bounded queues, dynamic batching, KV-cache or decoder state reuse, and timeout-aware cancellation.
  5. Postprocess: restore timestamps, punctuation, speaker labels, redaction, confidence, and downstream event schema.
  6. Observe: emit privacy-safe metrics, traces, eval samples, cost tags, model version, and data-slice labels.

Question: Why keep preprocessing versioned with the model?

Feature code is part of the model contract. A sample-rate change, mel-filter configuration change, text normalization rule, tokenizer update, or VAD threshold can change accuracy and latency even when weights are identical. A production model artifact should include weights, code commit, feature config, tokenizer, dependency lockfile, eval report, training data snapshot, and owner.
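The artifact contents listed above can be written down as a manifest that a release script checks for completeness. The following is a minimal sketch, not a standard schema: the class name `ModelArtifact`, the helper `missing_fields`, and every path or value in the example are hypothetical placeholders.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ModelArtifact:
    """Hypothetical release manifest; field names are illustrative only."""
    weights_uri: str
    code_commit: str
    feature_config: str
    tokenizer: str
    dependency_lockfile: str
    eval_report: str
    training_data_snapshot: str
    owner: str


def missing_fields(artifact: ModelArtifact) -> List[str]:
    """Return the names of manifest fields that are empty or whitespace-only."""
    return [f.name for f in fields(artifact) if not getattr(artifact, f.name).strip()]


# Placeholder values, not real paths or commits.
candidate = ModelArtifact(
    weights_uri="artifacts/asr/weights.bin",
    code_commit="abc1234",
    feature_config="configs/logmel_16k.yaml",
    tokenizer="artifacts/asr/tokenizer.json",
    dependency_lockfile="requirements.lock",
    eval_report="",  # deliberately left empty to show the check firing
    training_data_snapshot="snapshots/example",
    owner="speech-platform",
)
print(missing_fields(candidate))  # prints ['eval_report']
```

Freezing the dataclass keeps the manifest immutable once built, so a feature config cannot drift from the weights it shipped with.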

CI/CD

Promotion Gates For Audio Models

Code Gates

Unit tests, type checks, HTML/site checks, lint, reproducible environment build, and no private audio or transcript artifacts in Git.

Hidden answer: Minimum strong checklist

Include tests for feature shapes, normalization, decoding edge cases, model wrapper schema, retry behavior, timeout behavior, and a one-batch overfit sanity check. For this course repo, keep using HTML parsing, ASCII checks, git diff --check, and local server smoke tests.

Model Gates

Baseline comparison, slice metrics, regression budget, latency budget, memory budget, privacy review, and rollback plan.

Hidden answer: What should block promotion?

Block promotion if WER or entity error regresses on critical slices, TTS quality drops below the acceptance threshold, p95 or p99 latency breaks SLO, memory exceeds capacity, transcripts are logged unsafely, rollback is untested, or the eval set no longer represents production traffic.
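Some of these blocking conditions can be checked mechanically. The sketch below covers only the numeric ones (critical-slice WER, p95/p99 latency, memory) plus the rollback-tested flag. Unsafe logging and eval-set representativeness usually need human review and are left out. The function name, dictionary keys, and limits are assumptions made for illustration.

```python
def promotion_blockers(report, limits):
    """Return a list of reasons to block promotion; an empty list means no blocker found.

    `report` and `limits` are plain dicts with hypothetical keys (see example below).
    """
    blockers = []
    # WER deltas are candidate minus baseline, so a positive value is a regression.
    for slice_name, delta in sorted(report["critical_wer_delta"].items()):
        if delta > limits["max_wer_delta"]:
            blockers.append(f"wer_regression:{slice_name}")
    for pct in ("p95", "p99"):
        key = f"{pct}_latency_ms"
        if report[key] > limits[key]:
            blockers.append(f"latency_slo:{pct}")
    if report["peak_memory_gb"] > limits["memory_capacity_gb"]:
        blockers.append("memory_over_capacity")
    if not report["rollback_tested"]:
        blockers.append("rollback_untested")
    return blockers


# Illustrative numbers only.
report = {
    "critical_wer_delta": {"far_field": 0.012, "clean": -0.004},
    "p95_latency_ms": 410,
    "p99_latency_ms": 900,
    "peak_memory_gb": 14.0,
    "rollback_tested": False,
}
limits = {
    "max_wer_delta": 0.005,
    "p95_latency_ms": 450,
    "p99_latency_ms": 800,
    "memory_capacity_gb": 16.0,
}
print(promotion_blockers(report, limits))
```

In this example the clean slice improves, yet the far-field regression, the p99 breach, and the untested rollback each block promotion on their own.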

Monitoring

Metrics That Catch Real Failures

Question: How can drift be monitored without storing private audio?

Prefer aggregate and derived signals: duration buckets, language probabilities, acoustic quality summaries, confidence distributions, error reports from opted-in samples, redacted transcripts, synthetic canaries, and human review queues with explicit retention controls. Store enough to diagnose trends, not raw private content by default.
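One way to turn an aggregate signal such as a confidence histogram into a drift score is the population stability index (PSI). The lesson does not prescribe a statistic, so treat this as one possible choice. The sketch works on bucket counts only, so no audio or transcript needs to be stored. The bucket edges and counts are made up.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two histograms with identical buckets.

    Returns 0.0 for identical distributions and grows as they diverge.
    `eps` floors empty buckets so the logarithm stays defined.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must use the same buckets")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    if e_total <= 0 or a_total <= 0:
        raise ValueError("histograms must not be empty")
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score


# Hypothetical confidence buckets: [0-0.25), [0.25-0.5), [0.5-0.75), [0.75-1.0]
last_month = [50, 150, 300, 500]
this_week = [150, 250, 300, 300]
print(round(psi(last_month, this_week), 4))
```

Alert thresholds on a score like this are not universal. Calibrate them per signal against known-good weeks before paging anyone.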

Incident Drills

Practice First-Hour Debugging

Incident 1: Streaming ASR Partials Became Unstable

Users see words appear, disappear, and reappear for several seconds. Final WER is unchanged, but the product feels broken.

Hidden answer: Debug path

Check endpointing, chunk size, partial stabilization rules, decoder beam settings, language-model rescoring, VAD thresholds, network jitter, and whether the UI is rendering interim hypotheses as committed text. Add a metric for partial churn and compare old and new traces on the same audio fixtures.
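A partial-churn metric can be defined in several ways. The sketch below uses one simple definition, chosen here as an assumption: the number of words that were shown and later retracted, divided by the number of words newly shown, measured across the sequence of partial hypotheses for one utterance.

```python
def partial_churn(partials):
    """Ratio of retracted words to newly shown words across partial hypotheses.

    A word counts as retracted when it falls outside the common word prefix
    shared by one partial and the next. 0.0 means every partial only appended.
    """
    retracted = 0
    shown = 0
    prev = []
    for text in partials:
        words = text.split()
        common = 0
        for a, b in zip(prev, words):
            if a != b:
                break
            common += 1
        retracted += len(prev) - common   # words the user saw disappear
        shown += len(words) - common      # words that newly appeared
        prev = words
    return retracted / shown if shown else 0.0


stable = ["turn", "turn on", "turn on the", "turn on the light"]
flappy = ["turn", "turn on", "turn of the", "turn off the light"]
print(partial_churn(stable), round(partial_churn(flappy), 3))
```

Running the same fixtures through the old and new decoders and comparing this ratio gives a number to track even when final WER is unchanged.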

Incident 2: TTS First Audio Byte Doubled After A Voice Update

The new voice has better MOS in offline tests, but conversational latency crossed the product SLO.

Hidden answer: Mitigation options

Measure text normalization, acoustic model time, vocoder time, streaming chunk release, cold starts, and device placement. Mitigate with rollback, a canary pause, warm pools, a smaller voice, lower vocoder quality for long responses, sentence-level streaming, caching of common prompts, or splitting traffic by latency-sensitive use case. Do not hide latency regressions behind offline MOS.

Incident 3: GPU Cost Rose 40 Percent With No Traffic Increase

Requests per minute are flat, but the audio model serving bill jumped after a deployment.

Hidden answer: Cost diagnosis

Compare model version, average audio duration, batch fill rate, padding waste, beam size, max tokens, quantization, cache reuse, autoscaler target, replica count, retries, and failed requests. Check whether a fallback path silently moved from CPU to GPU or whether canary plus shadow traffic doubled inference.
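Batch fill rate and padding waste are two of the cheaper items on this list to quantify from serving logs. The sketch below assumes each batch is padded to its longest clip and that logs record per-request audio duration. Both are assumptions about the serving stack, and the function name is hypothetical.

```python
def batch_efficiency(batches, max_batch_size):
    """Compute fill rate and padding waste from logged batches.

    `batches` is a list of lists of audio durations in seconds.
    Assumes every clip in a batch is padded to that batch's longest clip.
    """
    useful_s = 0.0
    padded_s = 0.0
    requests = 0
    non_empty = 0
    for durations in batches:
        if not durations:
            continue
        non_empty += 1
        requests += len(durations)
        useful_s += sum(durations)
        padded_s += max(durations) * len(durations)
    if non_empty == 0 or padded_s == 0:
        return {"fill_rate": 0.0, "padding_waste": 0.0}
    return {
        "fill_rate": requests / (non_empty * max_batch_size),
        "padding_waste": 1.0 - useful_s / padded_s,
    }


# Illustrative log: one well-packed batch, one with a long outlier.
stats = batch_efficiency([[2.0, 2.0, 2.0, 2.0], [1.0, 9.0]], max_batch_size=4)
print(stats)
```

A drop in fill rate or a jump in padding waste after a deployment points at the batcher or the traffic mix instead of the model weights.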

Advanced Practice

Prompts That Test Production Judgment

Use these prompts for mock interviews and written exams. A strong answer should connect model behavior, product constraints, serving math, observability, rollout, and rollback.

Prompt 1: Design A Speech-To-Speech Assistant Rollout

You are launching a new cascaded ASR-LLM-TTS assistant for internal users. It must feel conversational, protect private audio, and stay under a fixed monthly GPU budget. Explain the rollout plan.

Hidden answer: strong structure

Start with SLOs: first partial transcript, first audio byte, turn completion, task success, cost per minute, and privacy retention. Launch behind flags with synthetic canaries, internal dogfood, small cohort canary, and rollback to the prior assistant. Monitor ASR WER slices, entity error, LLM groundedness, TTS first audio byte, queue time, GPU memory, retry rate, cancellation, and user correction rate. Keep raw audio out of logs by default, store redacted traces, and route high-risk or high-cost sessions to a fallback path.

Prompt 2: Debug A Drift Alert Without Raw Audio

After a microphone firmware update, ASR confidence drops and user corrections increase. Policy prevents retaining raw customer audio. What do you do in the first day?

Hidden answer: privacy-safe debugging plan

Compare aggregate acoustic features before and after the firmware: sample rate, clipping, silence ratio, loudness, SNR proxy, channel count, duration, VAD segments, language probabilities, confidence, and correction categories. Reproduce with public fixtures recorded through the same device path, synthetic noisy clips, and opted-in samples if policy allows. Mitigate with input normalization, firmware-specific routing, VAD threshold changes, or rollback of the firmware path. Update eval fixtures so the failure cannot return unnoticed.
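Several of these aggregates can be computed on-device or at ingress and logged as numbers, so the raw waveform is discarded immediately. The sketch below is deliberately crude: it works per sample, not per frame, assumes float samples in [-1.0, 1.0], and uses hypothetical thresholds.

```python
import math


def acoustic_summary(samples, clip_level=0.999, silence_level=0.01):
    """Summarize a clip as privacy-safe scalars; the samples are not retained.

    `samples` are floats in [-1.0, 1.0]. Thresholds are illustrative defaults.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("empty clip")
    clipped = sum(1 for s in samples if abs(s) >= clip_level)
    silent = sum(1 for s in samples if abs(s) < silence_level)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    return {
        "clipping_ratio": clipped / n,
        "silence_ratio": silent / n,
        "rms": rms,
        # Loudness relative to full scale; -inf for an all-zero clip.
        "rms_dbfs": 20.0 * math.log10(rms) if rms > 0 else float("-inf"),
    }


toy_clip = [0.0, 0.0, 0.5, -0.5, 1.0, -1.0, 0.5, -0.5]
summary = acoustic_summary(toy_clip)
print(summary["clipping_ratio"], summary["silence_ratio"], round(summary["rms_dbfs"], 2))
```

Comparing the distribution of these scalars before and after the firmware update, split by device model, narrows the search without touching customer audio.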

Prompt 3: Choose Between Smaller ASR And Larger ASR

A smaller ASR model is 45 percent cheaper and 30 percent faster but has worse WER on noisy far-field audio and domain terms. How do you decide whether to ship it?

Hidden answer: decision framework

Do not decide from global WER alone. Compare slice metrics, entity error, downstream task success, latency, cost, user segment, and recovery behavior. A likely strong answer is a cascade: route clean, short, high-confidence utterances to the smaller model and route noisy, long, low-confidence, or domain-heavy utterances to the larger model. Canary the router, cap blast radius, monitor routing mistakes, and keep a one-switch rollback to the baseline.
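The cascade described above reduces to a small routing rule. The following sketch assumes the router already has a duration, an SNR estimate, a confidence signal, and a domain-term flag for each utterance. How those signals are produced is out of scope here, and all names and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    duration_s: float
    snr_db: float
    confidence: float        # e.g. from a first pass or a lightweight estimator
    has_domain_terms: bool


def choose_asr_model(u, max_duration_s=15.0, min_snr_db=15.0, min_confidence=0.85):
    """Return (model_name, reasons); any reason sends the utterance to the large model."""
    reasons = []
    if u.duration_s > max_duration_s:
        reasons.append("long")
    if u.snr_db < min_snr_db:
        reasons.append("noisy")
    if u.confidence < min_confidence:
        reasons.append("low_confidence")
    if u.has_domain_terms:
        reasons.append("domain_heavy")
    return ("large" if reasons else "small", reasons)


print(choose_asr_model(Utterance(3.0, 25.0, 0.95, False)))
print(choose_asr_model(Utterance(4.0, 8.0, 0.90, True)))
```

Returning the reasons alongside the decision makes routing mistakes auditable: a dashboard can count how often each reason fires and how the routed traffic scores on each slice.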

Prompt 4: Explain CI/CD For A Model And Feature Change

A pull request changes log-mel feature extraction and updates ASR weights. What must pass before production?

Hidden answer: gate list

Require unit tests for feature shape, sample-rate handling, normalization, streaming chunk boundaries, and model wrapper schema. Require deterministic eval on public or approved fixtures, slice regression checks, latency and memory benchmarks, artifact versioning, dependency lockfile, model card, rollback plan, and privacy review. The deployment should stage through offline eval, shadow traffic, canary, and gradual rollout with automatic halt criteria.

Release Lab

Production Exercises

These drills combine interview-style reasoning with the practical work of shipping, monitoring, and rolling back speech models.

Exercise 1: Build A CI Gate For ASR Slice Regressions

You receive old and new metric maps by slice. Decide whether a pull request can merge when global WER improves but a protected domain slice regresses.

Hidden answer: strong checklist and Python sketch

A strong gate treats protected slices as first-class release criteria. Merge should be blocked or paused when a critical slice exceeds its regression budget, even if the aggregate metric improves. Include missing-slice handling so eval coverage cannot silently disappear.

def asr_slice_gate(baseline, candidate, budgets):
    """Return (decision, findings) for per-slice WER regression budgets."""
    findings = []
    for slice_name, budget in budgets.items():
        # A slice missing from either run means eval coverage changed: pause, do not pass.
        if slice_name not in baseline or slice_name not in candidate:
            findings.append(("pause", slice_name, "missing eval slice"))
            continue
        old_wer = baseline[slice_name]["wer"]
        new_wer = candidate[slice_name]["wer"]
        delta = new_wer - old_wer  # positive delta means the candidate is worse
        if delta > budget["max_wer_delta"]:
            findings.append(("block", slice_name, round(delta, 4)))

    # Any blocking regression outranks pauses; global WER is never consulted here.
    if any(item[0] == "block" for item in findings):
        return "block_merge", findings
    if findings:
        return "pause_for_eval", findings
    return "merge_allowed", []

Exercise 2: Design A Rollback Packet

Before launching a TTS model, write the minimum packet that lets an on-call engineer roll back safely at 2 AM without the model author.

Hidden answer: rollback packet contents

Include model version, feature flag name, owner, launch cohort, dashboard links, SLO thresholds, known risks, exact rollback command or UI path, data compatibility notes, cache behavior, verification query, customer-support message, and criteria for re-enabling. The packet should avoid private transcripts while still giving enough metrics and traces to make a decision.

Exercise 3: Capacity Plan For A Voice Agent Exam

Estimate whether a speech-to-speech agent can meet a 700 ms first response target with ASR, retrieval, LLM, and TTS stages.

Hidden answer: answer shape

Break the budget into capture/VAD, ASR partial, retrieval, first LLM tokens, TTS first audio byte, network, and client playback. Use streaming overlap instead of serial totals where possible: ASR partials can trigger early retrieval, the LLM can emit a short first clause, and TTS can stream before the full answer exists. State which stages need warm pools, which can batch, and which should bypass expensive models for low-risk turns.
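The difference between a serial total and an overlapped plan can be worked out as a small dependency graph. In the sketch below every stage duration is an invented number used only to show the arithmetic. Overlap is modeled by letting retrieval depend on the ASR partial while the LLM waits for both the final transcript and retrieval.

```python
def finish_times(stages):
    """Earliest finish time per stage, in milliseconds.

    `stages` maps name -> (duration_ms, [dependency names]) and must be
    listed in dependency order. A stage starts when all dependencies finish.
    """
    finish = {}
    for name, (duration_ms, deps) in stages.items():
        start = max((finish[d] for d in deps), default=0)
        finish[name] = start + duration_ms
    return finish


# Hypothetical durations; replace with measured p95 values per stage.
serial = finish_times({
    "capture_vad": (80, []),
    "asr_final": (200, ["capture_vad"]),
    "retrieval": (100, ["asr_final"]),
    "llm_first_clause": (180, ["retrieval"]),
    "tts_first_byte": (120, ["llm_first_clause"]),
    "network_playback": (50, ["tts_first_byte"]),
})
overlapped = finish_times({
    "capture_vad": (80, []),
    "asr_partial": (120, ["capture_vad"]),
    "asr_final": (200, ["capture_vad"]),
    "retrieval": (100, ["asr_partial"]),   # starts early, on the partial
    "llm_first_clause": (180, ["asr_final", "retrieval"]),
    "tts_first_byte": (120, ["llm_first_clause"]),
    "network_playback": (50, ["tts_first_byte"]),
})
print(serial["network_playback"], overlapped["network_playback"])  # prints 730 650
```

With these made-up numbers the serial plan misses a 700 ms target and the overlapped plan meets it with little headroom. That is the kind of statement an exam answer should make explicit.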

Exercise 4: Production Debugging Narrative

A canary improves offline WER by 2 percent, but live correction rate rises for Spanish-English code-switching users. Explain the first hour, the next day, and the prevention work.

Hidden answer: incident narrative

First hour: pause or roll back the canary for the affected slice, then compare request routing, language ID, contextual biasing, text normalization, and entity errors against the baseline. Next day: build a targeted eval slice, inspect redacted or approved examples, check label quality, and test routing to the prior model for mixed-language turns. Prevention: add code-switching gates, dashboard slices, launch alerts, and a release note requiring language-mix analysis for ASR changes.

Tradeoffs

Cost, Latency, And Quality Knobs

Batching

Raises throughput but can add queue delay. Good for offline ASR and embeddings, risky for conversational first-token latency.
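The queue delay that batching adds can be bounded with simple arithmetic. The sketch assumes evenly spaced arrivals and a batcher that flushes when the batch is full or a maximum wait expires. Real traffic is burstier, so treat the result as a rough estimate. The function name and numbers are illustrative.

```python
def worst_case_batch_wait_ms(batch_size, arrivals_per_s, max_wait_ms):
    """Approximate wait of the first request in a batch before the batch runs.

    Assumes evenly spaced arrivals; the batch flushes when full or when
    `max_wait_ms` expires, whichever comes first.
    """
    if arrivals_per_s <= 0:
        return max_wait_ms
    fill_ms = (batch_size - 1) / arrivals_per_s * 1000.0
    return min(fill_ms, max_wait_ms)


# Low traffic: the timeout, not the batch size, sets the added delay.
print(worst_case_batch_wait_ms(batch_size=8, arrivals_per_s=20, max_wait_ms=50))
# High traffic: the batch fills quickly and the added delay is small.
print(worst_case_batch_wait_ms(batch_size=8, arrivals_per_s=400, max_wait_ms=50))
```

For conversational paths this wait comes straight out of the first-token budget, which is why a short maximum wait or no batching at all is often the safer default there.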

Quantization

Reduces memory and can improve latency, but must be checked by slice because rare words, accents, and prosody can regress.

Cascading

Route easy requests to small models and hard requests to large models. Monitor routing mistakes and user-visible consistency.

Question: How should an experienced engineer present a tradeoff?

Name the product goal, the constraint, the options, the expected impact, the measurement plan, and the rollback path. Example: "Quantize the ASR encoder to reduce GPU memory by 35 percent, canary to 5 percent of English call-center traffic, block if entity error or p95 latency regresses, and keep the old model loaded for immediate rollback."