Production Speech MLOps And Incident Lab

Architecture

Reference Serving System

Use this as the baseline design in interviews and course projects. Replace parts only when a constraint forces a different choice.

Ingress: authenticate requests, assign trace IDs, validate format, and reject unsafe payloads early.
Preprocess: resample, normalize loudness, run VAD, chunk streams, and preserve timing metadata.
Route: select CPU, GPU, local, batch, or streaming path by model type, latency SLO, and tenant policy.
Infer: use bounded queues, dynamic batching, KV-cache or decoder state reuse, and timeout-aware cancellation.
Postprocess: restore timestamps and punctuation, attach speaker labels and confidence, apply redaction, and emit the downstream event schema.
Observe: emit privacy-safe metrics, redacted traces, approved eval samples, cost tags, model version, and aggregate slice tags.
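
The Infer stage's bounded queue and timeout-aware cancellation can be sketched as follows. This is a minimal illustration, not a reference implementation: `Request`, `BoundedInferenceQueue`, and the explicit `now` argument are hypothetical names chosen for the example, and a real server would also need locking and metrics.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    trace_id: str
    deadline: float  # absolute time in seconds after which the caller has given up


class BoundedInferenceQueue:
    """Hypothetical admission queue: rejects when full, drops expired work."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = deque()

    def offer(self, request, now):
        # Shed load at the door instead of letting queue delay grow unbounded.
        if len(self._items) >= self.max_size or request.deadline <= now:
            return False
        self._items.append(request)
        return True

    def next_batch(self, max_batch, now):
        # Requests whose deadline passed while queued are returned separately
        # so the caller can cancel them instead of spending compute on them.
        batch, expired = [], []
        while self._items and len(batch) < max_batch:
            request = self._items.popleft()
            (batch if request.deadline > now else expired).append(request)
        return batch, expired
```

Passing `now` explicitly keeps the sketch deterministic in tests; a production version would likely read a monotonic clock.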

Question: Why keep preprocessing versioned with the model?

Feature code is part of the model contract. A sample-rate change, mel-filter configuration change, text normalization rule, tokenizer update, or VAD threshold can change accuracy and latency even when weights are identical. A production model artifact should include weights, code commit, feature config, tokenizer, dependency lockfile, eval report, training data snapshot, and owner.
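
One way to make that artifact list enforceable is a manifest that refuses to be considered complete when any part is missing. The sketch below is an assumption about how such a manifest might look; the class name, field names, and `missing_fields` helper are hypothetical and simply mirror the list above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ModelArtifact:
    # Each field holds a reference (path, hash, or ID) to one part of the contract.
    weights: str
    code_commit: str
    feature_config: str
    tokenizer: str
    dependency_lockfile: str
    eval_report: str
    training_data_snapshot: str
    owner: str


def missing_fields(artifact):
    """Return names of manifest fields that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(artifact)
        if not str(getattr(artifact, f.name)).strip()
    ]
```

A release script could call `missing_fields` and stop the promotion when the returned list is non-empty.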

CI/CD

Promotion Gates For Audio Models

Code Gates

Unit tests, type checks, HTML/site checks, lint, reproducible environment build, and no private audio or transcript artifacts in Git.

Hidden answer: Minimum strong checklist

Include tests for feature shapes, normalization, decoding edge cases, model wrapper schema, retry behavior, timeout behavior, and a one-batch overfit sanity check. For this course repo, keep using HTML parsing, ASCII checks, git diff --check, and local server smoke tests.

Model Gates

Baseline comparison, slice metrics, regression budget, latency budget, memory budget, privacy review, and rollback plan.

Hidden answer: What should block promotion?

Block promotion if WER or entity error regresses on critical slices, TTS quality drops below the acceptance threshold, p95 or p99 latency breaks SLO, memory exceeds capacity, transcripts are logged unsafely, rollback is untested, or the eval set no longer represents production traffic.
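
The latency condition above can be checked mechanically. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions; `percentile` and `latency_gate` are hypothetical helper names, and the SLO values a team passes in are its own.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sample list; pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_gate(samples_ms, slo_ms):
    """Map each percentile in slo_ms to True when it is within its limit.

    slo_ms maps a percentile (for example 95) to the maximum allowed
    latency in milliseconds at that percentile.
    """
    return {
        pct: percentile(samples_ms, pct) <= limit
        for pct, limit in slo_ms.items()
    }
```

Any `False` value in the result would be a reason to block promotion under the rule above.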

Monitoring

Metrics That Catch Real Failures

Latency: queue time, preprocessing time, first partial token, final transcript, first audio byte, and end-to-end turn time.
Quality: WER, CER, entity error rate, timestamp error, diarization error rate, MOS proxy, and task completion.
Traffic: request rate, concurrent streams, chunk sizes, audio duration, retry rate, cancellation rate, and tenant mix.
Infrastructure: GPU memory, GPU utilization, CPU load, batch size, cache hit rate, cold starts, OOMs, and throttling.
Data health: sample rate distribution, clipping, silence ratio, SNR proxy, language mix, domain-term frequency, and missing metadata.

Question: How can drift be monitored without storing private audio?

Prefer aggregate and derived signals: duration buckets, language probabilities, acoustic quality summaries, confidence distributions, error reports from opted-in samples, redacted transcripts, synthetic canaries, and human review queues with explicit retention controls. Store enough to diagnose trends, and do not retain raw private content by default.
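
As an illustration of drift detection on aggregates only, the sketch below compares two bucketed histograms (for example, audio duration buckets) with the population stability index. PSI is a technique chosen for this example, not one the text prescribes; the function name and the `floor` parameter are assumptions, and alert thresholds would need tuning per signal.

```python
import math


def population_stability_index(baseline_counts, current_counts, floor=1e-6):
    """PSI over aligned histogram buckets; larger values mean more shift."""
    if len(baseline_counts) != len(current_counts):
        raise ValueError("bucket lists must align")
    base_total = sum(baseline_counts)
    cur_total = sum(current_counts)
    if base_total <= 0 or cur_total <= 0:
        raise ValueError("each histogram needs at least one observation")
    psi = 0.0
    for base, cur in zip(baseline_counts, current_counts):
        # The floor avoids log(0) when a bucket is empty in one window.
        p = max(base / base_total, floor)
        q = max(cur / cur_total, floor)
        psi += (q - p) * math.log(q / p)
    return psi
```

Only bucket counts cross the privacy boundary here, so no audio or transcript needs to be stored to compute the signal.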

Incident Drills

Practice First-Hour Debugging

Incident 1: Streaming ASR Partials Became Unstable

Users see words appear, disappear, and reappear for several seconds. Final WER is unchanged, but the product feels broken.

Hidden answer: Debug path

Check endpointing, chunk size, partial stabilization rules, decoder beam settings, language-model rescoring, VAD thresholds, network jitter, and whether the UI is rendering interim hypotheses as committed text. Add a metric for partial churn and compare old and new traces on the same audio fixtures.
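
A partial churn metric could be defined in several ways. The sketch below uses one assumed definition: the fraction of emitted words that a later partial retracts, measured by comparing each partial with the longest common word prefix of the one before it. The function name is hypothetical.

```python
def partial_churn(partials):
    """Fraction of emitted words that were later retracted across partials.

    partials is the ordered list of interim hypothesis strings for one
    utterance. Returns 0.0 when nothing was emitted.
    """
    emitted = 0
    retracted = 0
    previous = []
    for text in partials:
        words = text.split()
        # Count how many leading words survived unchanged from the last partial.
        stable = 0
        for old, new in zip(previous, words):
            if old != new:
                break
            stable += 1
        retracted += len(previous) - stable
        emitted += len(words) - stable
        previous = words
    return retracted / emitted if emitted else 0.0
```

Running this over the same audio fixtures for the old and new builds gives a single number to compare before reading individual traces.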

Incident 2: TTS First Audio Byte Doubled After A Voice Update

The new voice has better MOS in offline tests, but conversational latency crossed the product SLO.

Hidden answer: Mitigation options

Measure text normalization, acoustic model time, vocoder time, streaming chunk release, cold starts, and device placement. Mitigate with rollback, a canary pause, warm pools, a smaller voice, lower vocoder quality for long responses, sentence-level streaming, caching of common prompts, or traffic splitting by latency-sensitive use case. Do not hide latency regressions behind offline MOS.

Incident 3: GPU Cost Rose 40 Percent With No Traffic Increase

Requests per minute are flat, but the audio model serving bill jumped after a deployment.

Hidden answer: Cost diagnosis

Compare model version, average audio duration, batch fill rate, padding waste, beam size, max tokens, quantization, cache reuse, autoscaler target, replica count, retries, and failed requests. Check whether a fallback path silently moved from CPU to GPU or whether canary plus shadow traffic doubled inference.
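
Two of the signals above, padding waste and batch fill rate, are simple arithmetic over per-batch input lengths. The sketch below assumes each batch is padded to its longest item and that lengths are positive frame counts; `batch_efficiency` is a hypothetical helper name.

```python
def batch_efficiency(batches, max_batch_size):
    """Compute padding waste and batch fill rate.

    batches is a list of batches, each a list of per-request frame counts.
    Assumes every batch is padded to the length of its longest request.
    """
    non_empty = [b for b in batches if b]
    if not non_empty or max_batch_size <= 0:
        raise ValueError("need a non-empty batch and a positive batch size")
    used = sum(sum(b) for b in non_empty)
    padded = sum(max(b) * len(b) for b in non_empty)
    if padded == 0:
        raise ValueError("frame counts must be positive")
    filled = sum(len(b) for b in non_empty)
    slots = len(non_empty) * max_batch_size
    return {
        "padding_waste": 1 - used / padded,
        "batch_fill_rate": filled / slots,
    }
```

Comparing these two numbers before and after the deployment helps separate a batching regression from a change in traffic shape.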

Advanced Practice

Prompts That Test Production Judgment

Use these prompts for mock interviews and written exams. A strong answer should connect model behavior, product constraints, serving math, observability, rollout, and rollback.

Prompt 1: Design A Speech-To-Speech Assistant Rollout

You are launching a new cascaded ASR-LLM-TTS assistant for internal users. It must feel conversational, protect private audio, and stay under a fixed monthly GPU budget. Explain the rollout plan.

Hidden answer: strong structure

Start with SLOs: first partial transcript, first audio byte, turn completion, task success, cost per minute, and privacy retention. Launch behind flags with synthetic canaries, internal dogfood, small cohort canary, and rollback to the prior assistant. Monitor ASR WER slices, entity error, LLM groundedness, TTS first audio byte, queue time, GPU memory, retry rate, cancellation, and user correction rate. Keep raw audio out of logs by default, store redacted traces, and route high-risk or high-cost sessions to a fallback path.

Prompt 2: Debug A Drift Alert Without Raw Audio

After a microphone firmware update, ASR confidence drops and user corrections increase. Policy prevents retaining raw customer audio. What do you do in the first day?

Hidden answer: privacy-safe debugging plan

Compare aggregate acoustic features before and after the firmware: sample rate, clipping, silence ratio, loudness, SNR proxy, channel count, duration, VAD segments, language probabilities, confidence, and correction categories. Reproduce with public fixtures recorded through the same device path, synthetic noisy clips, and opted-in samples if policy allows. Mitigate with input normalization, firmware-specific routing, VAD threshold changes, or rollback of the firmware path. Update eval fixtures so the failure cannot return unnoticed.

Prompt 3: Choose Between Smaller ASR And Larger ASR

A smaller ASR model is 45 percent cheaper and 30 percent faster but has worse WER on noisy far-field audio and domain terms. How do you decide whether to ship it?

Hidden answer: decision framework

Do not decide from global WER alone. Compare slice metrics, entity error, downstream task success, latency, cost, approved operational context, and recovery behavior. A likely strong answer is a cascade: route clean, short, high-confidence utterances to the smaller model and route noisy, long, low-confidence, or domain-heavy utterances to the larger model. Canary the router, cap blast radius, monitor routing mistakes, and keep a one-switch rollback to the baseline.
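
The cascade described above might be sketched as a rule-based router. Everything specific here is an assumption for illustration: the `Utterance` fields, the `choose_model` name, and especially the threshold defaults, which are placeholders that would have to be set from slice evaluations.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    duration_s: float
    snr_db: float                  # SNR proxy computed in preprocessing
    small_model_confidence: float  # confidence from a cheap first pass
    domain_term_hits: int          # matches against a domain-term list


def choose_model(utterance, max_duration_s=15.0, min_snr_db=15.0, min_confidence=0.85):
    """Return (model_name, reasons); any reason escalates to the larger model."""
    reasons = []
    if utterance.duration_s > max_duration_s:
        reasons.append("long")
    if utterance.snr_db < min_snr_db:
        reasons.append("noisy")
    if utterance.small_model_confidence < min_confidence:
        reasons.append("low_confidence")
    if utterance.domain_term_hits > 0:
        reasons.append("domain_heavy")
    return ("large", reasons) if reasons else ("small", [])
```

Logging the returned reasons as aggregate counters gives the routing-mistake monitoring something concrete to slice on.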

Prompt 4: Explain CI/CD For A Model And Feature Change

A pull request changes log-mel feature extraction and updates ASR weights. What must pass before production?

Hidden answer: gate list

Require unit tests for feature shape, sample-rate handling, normalization, streaming chunk boundaries, and model wrapper schema. Require repeatable offline eval on public, synthetic, or approved retained fixtures, slice regression checks, latency and memory benchmarks, artifact versioning, dependency lockfile, model card, rollback plan, and privacy review. The deployment should stage through offline eval, policy-approved shadow traffic, canary, and gradual rollout with automatic halt criteria.

Release Lab

Production Exercises

These drills combine interview-style reasoning with the practical work of shipping, monitoring, and rolling back speech models.

Exercise 1: Build A CI Gate For ASR Slice Regressions

You receive old and new metric maps by slice. Decide whether a pull request can merge when global WER improves but a launch-critical approved eval slice regresses.

Hidden answer: strong checklist and Python sketch

A strong gate treats approved launch-critical slices as first-class release criteria. Merge should be blocked or paused when a critical slice exceeds its regression budget, even if the aggregate metric improves. Include invalid-budget, missing-slice, and missing-metric handling, plus range checks for WER values, so eval coverage and metric quality cannot silently disappear.

import math
from collections.abc import Mapping


def valid_rate(value):
    return (
        not isinstance(value, bool)
        and isinstance(value, (int, float))
        and math.isfinite(value)
        and 0 <= value <= 1
    )


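def asr_slice_gate(baseline, candidate, budgets):
    """Return (decision, findings) for a candidate ASR model.

    baseline and candidate map slice name -> {"wer": rate in 0..1}.
    budgets maps slice name -> {"max_wer_delta": allowed WER increase}.
    """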
    if not all(isinstance(group, Mapping) for group in (baseline, candidate, budgets)):
        return "pause_for_eval", [("pause", "all", "metric inputs must be maps")]

    findings = []
    for slice_name, budget in budgets.items():
        if not isinstance(budget, Mapping):
            findings.append(("pause", slice_name, "invalid budget entry"))
            continue
        max_delta = budget.get("max_wer_delta")
        if not valid_rate(max_delta):
            findings.append(("pause", slice_name, "invalid regression budget"))
            continue
        if slice_name not in baseline or slice_name not in candidate:
            findings.append(("pause", slice_name, "missing eval slice"))
            continue
        old_metrics = baseline[slice_name]
        new_metrics = candidate[slice_name]
        if not isinstance(old_metrics, Mapping) or not isinstance(new_metrics, Mapping):
            findings.append(("pause", slice_name, "invalid metric entry"))
            continue
        if "wer" not in old_metrics or "wer" not in new_metrics:
            findings.append(("pause", slice_name, "missing WER metric"))
            continue
        old_wer = old_metrics["wer"]
        new_wer = new_metrics["wer"]
        if not valid_rate(old_wer) or not valid_rate(new_wer):
            findings.append(("pause", slice_name, "WER outside 0..1 range"))
            continue
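        delta = new_wer - old_wer
        # Small tolerance so float rounding (for example 0.13 - 0.12) does
        # not block a candidate that lands exactly on its budget.
        if delta > max_delta + 1e-9:
            findings.append(("block", slice_name, round(delta, 4)))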

    if any(item[0] == "block" for item in findings):
        return "block_merge", findings
    if findings:
        return "pause_for_eval", findings
    return "merge_allowed", []

Exercise 2: Design A Rollback Packet

Before launching a TTS model, write the minimum packet that lets an on-call engineer roll back safely at 2 AM without the model author.

Hidden answer: rollback packet contents

Include model version, feature flag name, owner, launch cohort, dashboard links, SLO thresholds, known risks, exact rollback command or UI path, data compatibility notes, cache behavior, retention limits, verification query, customer-support message, and criteria for re-enabling. The packet should avoid private transcripts and raw audio while still giving enough aggregate metrics and redacted traces to make a decision.

Exercise 3: Capacity Plan For A Voice Agent Exam

Estimate whether a speech-to-speech agent can meet a 700 ms first response target with ASR, retrieval, LLM, and TTS stages.

Hidden answer: answer shape

Break the budget into capture/VAD, ASR partial, retrieval, first LLM tokens, TTS first audio byte, network, and client playback. Use streaming overlap instead of serial totals where possible: ASR partials can trigger early retrieval, the LLM can emit a short first clause, and TTS can stream before the full answer exists. State which stages need warm pools, which can batch, and which should bypass expensive models for low-risk turns.
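
The difference between a serial total and a streaming-overlap estimate can be shown with simple arithmetic. The model below is deliberately crude and is an assumption of this sketch: each stage declares how many milliseconds it runs concurrently with earlier stages, and the first-response estimate is the sum of the non-overlapped remainders. The stage timings in the example are made-up placeholders, not measurements.

```python
def first_response_ms(stages):
    """Return (serial_total, overlapped_estimate) in milliseconds.

    stages is a list of (name, duration_ms, overlap_ms) in pipeline order.
    overlap_ms is the part of the stage hidden behind earlier stages by
    streaming; it is clamped to the range 0..duration_ms.
    """
    serial = sum(duration for _, duration, _ in stages)
    overlapped = sum(
        duration - min(max(overlap, 0), duration)
        for _, duration, overlap in stages
    )
    return serial, overlapped


# Illustrative numbers only; replace with measured stage latencies.
example_stages = [
    ("capture_vad", 100, 0),
    ("asr_partial", 200, 100),
    ("retrieval", 120, 80),
    ("llm_first_tokens", 250, 0),
    ("tts_first_audio_byte", 150, 50),
    ("network_and_playback", 80, 0),
]
```

With these placeholder numbers the serial total is 900 ms and the overlapped estimate is 670 ms, which shows how the same stages can miss or meet a 700 ms target depending on how much streaming overlap is achieved.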

Exercise 4: Production Debugging Narrative

A canary improves offline WER by 2 percent, but live correction rate rises on an approved Spanish-English code-switching eval slice. Explain the first hour, the next day, and the prevention work.

Hidden answer: incident narrative

First hour: pause or roll back the canary for the affected approved aggregate slice, then compare request routing, language ID, contextual biasing, text normalization, and entity errors against the baseline. Next day: build a targeted eval slice, inspect redacted or approved examples, check label quality, and test routing mixed-language turns to the prior model. Prevention: add code-switching gates, dashboard slices, launch alerts, and a release note requiring language-mix analysis for ASR changes.

Tradeoffs

Cost, Latency, And Quality Knobs

Batching

Raises throughput but can add queue delay. Good for offline ASR and embeddings, risky for conversational first-token latency.

Quantization

Reduces memory and can improve latency, but must be checked on approved aggregate eval slices because rare words, language mix, and prosody can regress.

Cascading

Route easy requests to small models and hard requests to large models. Monitor routing mistakes and user-visible consistency.

Question: How should an experienced engineer present a tradeoff?

Name the product goal, the constraint, the options, the expected impact, the measurement plan, and the rollback path. Example: "Quantize the ASR encoder to reduce GPU memory by 35 percent, canary to 5 percent of English call-center traffic, block if entity error or p95 latency regresses, and keep the old model loaded for immediate rollback."