Production ML Track

Speech Serving Observability And SLOs

Turn ASR, TTS, speech-to-speech, and audio RAG systems into services you can launch, monitor, debug, and roll back with discipline.

Service Level Thinking

Define User-Visible Reliability Before Metrics

A speech system can look healthy on average while the conversation feels broken. Experienced engineers translate user pain into measurable quality, latency, availability, privacy, and cost budgets.

Streaming ASR SLO

Track first partial latency, partial churn, finalization delay, WER slices, entity errors, timeout rate, and cost per audio minute.
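WER slices are the least obvious item on that list to compute, so here is a minimal sketch. It assumes whitespace tokenization with no text normalization, and it pools errors over all reference words in a slice instead of averaging per-utterance WER. The names word_errors and wer_by_slice and the tuple layout are illustrative assumptions, not part of any library.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, reference word count)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] = edits needed to turn the reference prefix seen so far
    # into the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1], len(ref)


def wer_by_slice(utterances):
    """utterances: iterable of (slice_name, reference, hypothesis)."""
    totals = {}
    for slice_name, reference, hypothesis in utterances:
        errors, words = word_errors(reference, hypothesis)
        slice_errors, slice_words = totals.get(slice_name, (0, 0))
        totals[slice_name] = (slice_errors + errors, slice_words + words)
    # Slices with no reference words are skipped to avoid dividing by zero.
    return {
        name: errors / words
        for name, (errors, words) in totals.items()
        if words > 0
    }
```

Pooling keeps one long utterance from being outweighed by many short ones; whether that is the right weighting depends on the product.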

Hidden answer: minimum release gate

Require no critical slice WER regression, bounded entity error, p95 first partial under the product target, p99 finalization under the SLO, no increase in privacy-sensitive logging, and a versioned rollback switch. Aggregate WER alone is not enough.
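The gate above can be sketched as one function that returns every failed condition. This is an illustration under stated assumptions: the CanaryReport fields and all thresholds are hypothetical, and treating any positive WER delta as a regression ignores measurement noise that a real gate would need a tolerance for.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CanaryReport:
    # Hypothetical field names; map them to whatever your pipeline emits.
    slice_wer_delta: Dict[str, float]  # critical slice -> candidate WER minus baseline WER
    entity_error_rate: float
    p95_first_partial_ms: float
    p99_finalization_ms: float
    new_sensitive_log_fields: int
    rollback_version: Optional[str]


def release_gate_failures(report, max_entity_error,
                          first_partial_target_ms, finalization_slo_ms) -> List[str]:
    failures = []
    for name, delta in sorted(report.slice_wer_delta.items()):
        if delta > 0:  # assumption: no noise tolerance
            failures.append(f"slice WER regression: {name}")
    if report.entity_error_rate > max_entity_error:
        failures.append("entity error above bound")
    # "Under" is read strictly: a value equal to the target fails.
    if report.p95_first_partial_ms >= first_partial_target_ms:
        failures.append("p95 first partial not under target")
    if report.p99_finalization_ms >= finalization_slo_ms:
        failures.append("p99 finalization not under SLO")
    if report.new_sensitive_log_fields > 0:
        failures.append("privacy-sensitive logging increased")
    if not report.rollback_version:
        failures.append("no versioned rollback switch")
    return failures
```

Returning the full list instead of a boolean lets the release owner see every blocker in a single pass.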

Conversational TTS SLO

Track first audio byte, chunk cadence, underruns, MOS or preference score, interruption rate, abandonment, and cost per generated second.

Hidden answer: quality versus latency tradeoff

A better voice can still be a worse product if it delays the first audible response. Use a two-tier gate: conversational latency and playback stability must pass before preference wins can justify a rollout.
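A minimal sketch of that two-tier gate follows. The parameter names, the three return labels, and the idea of expressing the preference win as a single rate are assumptions made for illustration.

```python
def tts_rollout_decision(first_audio_ms, underrun_rate, preference_win_rate,
                         max_first_audio_ms, max_underrun_rate,
                         min_preference_win_rate):
    # Tier 1: latency and playback stability are hard gates.
    if first_audio_ms > max_first_audio_ms or underrun_rate > max_underrun_rate:
        return "block"
    # Tier 2: only now may a preference win justify the rollout.
    if preference_win_rate >= min_preference_win_rate:
        return "roll_out"
    return "hold"
```

Note that a very high preference score never rescues a tier 1 failure, which is the point of ordering the checks this way.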

Dashboards

Build Views Around Decisions

Dashboards should answer whether to continue, pause, roll back, route around a dependency, or start a data investigation.

  1. Executive view: SLO burn, error budget, active incidents, rollout stage, and user-visible impact.
  2. Model view: model version, language, acoustic condition, WER/CER, entity error, confidence, drift, and correction signals.
  3. Serving view: queue age, batch size, model time, preprocessing time, GPU memory, autoscaler lag, retries, and fallbacks.
  4. Privacy view: raw audio access, retention, redaction status, opt-in review counts, and synthetic fixture coverage.
  5. Cost view: cost per audio minute, cost per generated second, GPU utilization, cache hit rate, and waste from retries.

Question: Why separate model dashboards from serving dashboards?

Model metrics explain quality regressions; serving metrics explain resource and latency behavior. Mixing them hides the root cause. For example, worse p99 latency with stable model time points away from the model and toward queues, batching, routing, autoscaling, dependencies, or retries.
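The p99 example can be made concrete by comparing per-stage tail latency between two windows. This sketch assumes you already log per-request stage timings in milliseconds; the stage names and the nearest-rank percentile are illustrative choices, not a prescribed method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def stage_p99_shift(baseline, current):
    """baseline/current: stage name -> list of per-request milliseconds.

    Returns (stage, p99 change) pairs, largest increase first.
    """
    shifts = {}
    for stage in baseline:
        # Skip stages that are absent or empty in either window.
        if baseline[stage] and current.get(stage):
            shifts[stage] = percentile(current[stage], 99) - percentile(baseline[stage], 99)
    return sorted(shifts.items(), key=lambda item: item[1], reverse=True)
```

If the queue stage tops the list while model time is flat, the investigation starts in serving, not in the model.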

Coding Labs

Small Utilities For Production Decisions

These snippets are intentionally interview-sized. Each one trains the habit of turning vague operational symptoms into testable signals.

Lab 1: SLO Burn Rate

Given request counts and bad request counts in a window, compute error-budget burn rate relative to an allowed bad-request fraction.

Hidden answer: invariant, tests, and Python solution

Invariant: burn rate compares observed bad fraction to the allowed bad fraction. Test zero requests, no errors, exactly-on-budget errors, and a window that burns several times faster than budget.

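def burn_rate(total_requests, bad_requests, allowed_bad_fraction):
    """Return the observed bad fraction divided by the allowed bad fraction."""
    # Validate the budget first so a bad config fails even in an empty window.
    if allowed_bad_fraction <= 0:
        raise ValueError("allowed_bad_fraction must be positive")
    if total_requests == 0:
        # No traffic means no budget was spent in this window.
        return 0.0

    observed = bad_requests / total_requests
    return observed / allowed_bad_fraction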
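A burn rate is easier to act on once it is converted into time remaining. By definition, a burn rate of 1 spends the whole budget in exactly one SLO period, so a sustained rate of b spends it b times faster. The sketch below assumes the rate stays constant; the function name and parameters are hypothetical.

```python
def hours_to_exhaustion(burn, slo_period_hours, budget_remaining_fraction=1.0):
    """Hours until the error budget is gone at a constant burn rate."""
    if burn <= 0:
        # Nothing is being spent, so the budget never runs out.
        return float("inf")
    return slo_period_hours * budget_remaining_fraction / burn
```

For example, with a 720-hour period, a burn rate of 4 and half the budget left gives 90 hours.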

Lab 2: Detect Slice Drift

Given baseline and current aggregate feature summaries for audio slices, return slices whose relative change exceeds a threshold.

Hidden answer: invariant, tests, and Python solution

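Invariant: compute a relative change only when both baseline and current values exist and the baseline is nonzero. Report a slice that is missing from the baseline as new, and treat a zero baseline as a special case, so the code does not hide a new population. The loop walks current, so a slice that disappears from current is not reported. Test slices missing from the baseline, zero baseline, small changes, and large increases in silence or clipping.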

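def drifted_slices(baseline, current, threshold):
    """Return (slice, reason, value) tuples for slices that drifted."""
    drifted = []
    for name, now in current.items():
        if name not in baseline:
            # A slice with no baseline is a new population, not a ratio.
            drifted.append((name, "new_slice", now))
            continue

        before = baseline[name]
        if before == 0:
            # Relative change is undefined; flag any nonzero value instead.
            if now != 0:
                drifted.append((name, "zero_baseline", now))
            continue

        relative = abs(now - before) / abs(before)
        # Strict comparison: a change exactly at the threshold is not drift.
        if relative > threshold:
            drifted.append((name, "relative_change", relative))
    return drifted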
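Because drifted_slices iterates over current, a slice that exists in the baseline but has no current value is never reported. A complementary check, sketched here under the assumption that both inputs are plain dicts keyed by slice name, catches populations that went silent. The name vanished_slices is hypothetical.

```python
def vanished_slices(baseline, current):
    """Return baseline slice names that have no current value, sorted."""
    return sorted(name for name in baseline if name not in current)
```

A vanished slice often means a logging or routing change rather than a real shift in users, so it is worth alerting on separately.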

Lab 3: Rollback Recommendation

Given canary metrics and budgets, return whether a speech model rollout should continue, pause, or roll back.

Hidden answer: invariant, tests, and Python solution

Invariant: critical metrics override noncritical wins. Test a latency win with entity error regression, missing critical metrics, exactly-on-budget deltas, and a noncritical cost regression that should pause instead of roll back.

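def rollout_recommendation(metrics, budgets):
    """Return (action, findings) where action is continue, pause, or rollback."""
    findings = []
    for name, budget in budgets.items():
        if name not in metrics:
            # Missing data cannot justify a rollback, but it blocks progress.
            findings.append(("pause", name, "missing"))
            continue

        value = metrics[name]
        limit = budget["limit"]
        direction = budget["direction"]
        critical = budget.get("critical", False)
        if direction not in ("lower_is_better", "higher_is_better"):
            # A misspelled direction would otherwise never fail.
            raise ValueError(f"unknown direction for {name}: {direction!r}")
        # Strict comparisons: a value exactly on the limit still passes.
        failed = (
            direction == "lower_is_better" and value > limit
        ) or (
            direction == "higher_is_better" and value < limit
        )
        if failed:
            action = "rollback" if critical else "pause"
            findings.append((action, name, value))

    # Any critical failure overrides pauses and noncritical wins.
    if any(item[0] == "rollback" for item in findings):
        return "rollback", findings
    if findings:
        return "pause", findings
    return "continue", []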
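The function reads three keys from each budget: limit, direction, and an optional critical flag. The sketch below shows one possible shape for that config and a check you might run when the config is loaded. Every metric name and limit here is a made-up example, and budget_config_errors is a hypothetical helper.

```python
VALID_DIRECTIONS = {"lower_is_better", "higher_is_better"}

# Hypothetical metric names and limits, for illustration only.
EXAMPLE_BUDGETS = {
    "entity_error_rate": {"limit": 0.05, "direction": "lower_is_better", "critical": True},
    "p95_first_partial_ms": {"limit": 300.0, "direction": "lower_is_better", "critical": True},
    "cost_per_audio_minute": {"limit": 0.012, "direction": "lower_is_better"},
}


def budget_config_errors(budgets):
    """Return human-readable problems found in a budgets dict."""
    errors = []
    for name, budget in sorted(budgets.items()):
        if "limit" not in budget:
            errors.append(f"{name}: missing limit")
        if budget.get("direction") not in VALID_DIRECTIONS:
            errors.append(f"{name}: unknown direction {budget.get('direction')!r}")
    return errors
```

Checking the config at load time means a typo surfaces before a canary is running, not in the middle of a rollout decision.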

Interview Prompts

Advanced Production Questions

Prompt 1: A Canary Improves Average WER But Hurts One Region

The new ASR model improves aggregate WER by 4 percent, but a noisy mobile cohort in one region regresses by 11 percent. What do you do?

Hidden answer: strong response

Do not hide behind the aggregate win. Pause or exclude the affected cohort, verify sample size and label quality, inspect audio feature drift, compare entity error and correction rate, and define a slice gate before resuming. A strong answer protects harmed cohorts while preserving the ability to ship where the model is truly better.
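A slice gate of the kind described could look like the sketch below. It assumes regressions are measured as relative WER change and that a minimum utterance count stands in for a proper significance test. The slice names, thresholds, and function name are hypothetical.

```python
def slice_gate(slices, max_relative_regression, min_utterances):
    """slices: name -> (baseline_wer, candidate_wer, utterance_count).

    Returns (blocked, needs_data): slices that regress too much, and
    slices with too few utterances to judge.
    """
    blocked, needs_data = [], []
    for name, (baseline_wer, candidate_wer, count) in sorted(slices.items()):
        if count < min_utterances:
            needs_data.append(name)
            continue
        if baseline_wer <= 0:
            # Relative change is undefined; any new error blocks the slice.
            if candidate_wer > 0:
                blocked.append(name)
            continue
        relative = (candidate_wer - baseline_wer) / baseline_wer
        if relative > max_relative_regression:
            blocked.append(name)
    return blocked, needs_data
```

Blocked slices can be excluded from the rollout while the remaining traffic continues, and the needs_data list says where to collect more labeled audio before deciding.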

Prompt 2: GPU Cost Doubles Without A Quality Change

ASR quality and latency are stable, but GPU cost per audio minute doubles after a release. How do you investigate?

Hidden answer: root-cause map

Check model artifact size, precision, batch shape, request mix, VAD savings, retries, shadow traffic, autoscaler target, warm pool size, GPU utilization, cache behavior, and route weights. Mitigate with quota, route rollback, batch tuning, disabling shadow traffic, or reverting precision/config changes while preserving quality.
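One way to narrow that list quickly is to split cost per user audio minute into two factors: how much extra audio the fleet processes (retries, shadow traffic) and how much each processed minute costs (batching, precision, utilization, warm pool). The sketch assumes you can attribute GPU hours and audio minutes to these buckets; all names are hypothetical.

```python
def cost_breakdown(gpu_hours, price_per_gpu_hour, user_audio_minutes,
                   retried_audio_minutes, shadow_audio_minutes):
    if user_audio_minutes <= 0:
        raise ValueError("user_audio_minutes must be positive")
    total_cost = gpu_hours * price_per_gpu_hour
    processed = user_audio_minutes + retried_audio_minutes + shadow_audio_minutes
    return {
        # What the business pays per minute of real user audio.
        "cost_per_user_audio_minute": total_cost / user_audio_minutes,
        # Processed minutes per user minute; 1.0 means no retries or shadow load.
        "amplification": processed / user_audio_minutes,
        # Efficiency of the fleet on whatever it processes.
        "cost_per_processed_audio_minute": total_cost / processed,
    }
```

If amplification doubled while cost per processed minute held steady, look at retries, shadow traffic, and route weights. If amplification is flat and cost per processed minute doubled, look at batch shape, precision, autoscaler target, and warm pool size.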

Prompt 3: What Belongs In A Speech Postmortem?

An endpointing bug clipped the first word for quiet speakers for three hours. What should the postmortem contain?

Hidden answer: postmortem outline

Include timeline, detection gap, user impact by slice, root cause, why tests missed quiet speakers, rollback timeline, privacy-safe evidence, immediate remediation, new synthetic fixtures, release gate updates, owners, deadlines, and a follow-up check that proves the prevention work actually shipped.
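For the synthetic fixtures item, one cheap starting point is to attenuate existing test audio so the endpointer is exercised at low levels. The sketch assumes samples are floats in the range -1.0 to 1.0 and uses the standard amplitude relation of 20 dB per factor of ten. Attenuated audio is only an approximation of a genuinely quiet speaker, since it also scales down the background noise, and the function name is hypothetical.

```python
def attenuate(samples, gain_db):
    """Scale float PCM samples by gain_db decibels (negative is quieter)."""
    factor = 10 ** (gain_db / 20)
    return [sample * factor for sample in samples]
```

Running the same utterance through the pipeline at several gain levels and asserting that the first word survives gives the release gate a concrete check to point at.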