Speech Serving Observability And SLOs

Service Level Thinking

Define User-Visible Reliability Before Metrics

A speech system can look healthy on average while the conversation feels broken. Experienced engineers translate user pain into measurable quality, latency, availability, privacy, and cost budgets.

Streaming ASR SLO

Track first partial latency, partial churn, finalization delay, WER on approved aggregate slices, entity errors, timeout rate, and cost per audio minute.
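To make two of these signals concrete, here is a minimal sketch that computes first partial latency and a partial churn ratio from a stream of partial hypotheses. The churn definition (words revised divided by words emitted) is an assumption chosen for this example rather than a standard, and the function names and input shapes are hypothetical.

```python
def first_partial_latency_ms(speech_start_ms, partial_times_ms):
    """Delay from speech start to the earliest partial, or None if none arrived."""
    if not partial_times_ms:
        return None
    return min(partial_times_ms) - speech_start_ms


def partial_churn(partials):
    """Assumed definition: words later revised / words ever emitted.

    `partials` is the ordered list of partial transcripts for one utterance.
    A word counts as revised when it falls outside the common prefix shared
    with the next partial.
    """
    revised = 0
    emitted = 0
    previous = []
    for text in partials:
        words = text.split()
        common = 0
        for old, new in zip(previous, words):
            if old != new:
                break
            common += 1
        revised += len(previous) - common  # earlier words that did not survive
        emitted += len(words) - common     # words shown for the first time
        previous = words
    return revised / emitted if emitted else 0.0


stream = ["turn", "turn on", "turn off the", "turn off the light"]
print(first_partial_latency_ms(1000, [1320, 1500]))
print(partial_churn(stream))
```

In the sample stream, "on" is replaced by "off", so one of the five emitted words was revised.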

Hidden answer: minimum release gate

Require no critical slice WER regression, bounded entity error, p95 first partial under the product target, p99 finalization under the SLO, no increase in privacy-sensitive logging, and a versioned rollback switch. Aggregate WER alone is not enough.

Conversational TTS SLO

Track first audio byte, chunk cadence, underruns, MOS or preference score, interruption rate, abandonment, and cost per generated second.
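Underruns can be derived from chunk cadence alone. The sketch below is a simplified model, not a description of any particular player: it assumes playback starts when the first chunk arrives, that there is no jitter buffer, and that chunks are listed in arrival order. The function name and tuple layout are hypothetical.

```python
def count_underruns(chunks):
    """Count gaps where the player ran out of audio before the next chunk.

    `chunks` is a list of (arrival_ms, audio_duration_ms) in arrival order.
    """
    underruns = 0
    buffered_until = None  # time at which already-received audio finishes playing
    for arrival_ms, duration_ms in chunks:
        if buffered_until is None:
            start = arrival_ms  # first chunk starts playback immediately
        else:
            if arrival_ms > buffered_until:
                underruns += 1  # audible gap: the buffer emptied before arrival
            start = max(buffered_until, arrival_ms)
        buffered_until = start + duration_ms
    return underruns


# The third chunk arrives 50 ms after the buffered audio has finished.
print(count_underruns([(0, 200), (150, 200), (450, 200), (600, 200)]))
```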

Hidden answer: quality versus latency tradeoff

A better voice can still be a worse product if it delays the first audible response. Use a two-tier gate: conversational latency and playback stability must pass before preference wins can justify a rollout.
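One way to express the two-tier gate is sketched below. The metric keys, the returned labels, and the thresholds are illustrative assumptions; the only point is the ordering, where preference is not even consulted until latency and playback stability pass.

```python
def tts_rollout_gate(candidate, targets, min_preference_win):
    """Tier 1: latency and playback stability. Tier 2: preference win."""
    tier1_passed = (
        candidate["first_audio_ms_p95"] <= targets["first_audio_ms_p95"]
        and candidate["underrun_rate"] <= targets["underrun_rate"]
    )
    if not tier1_passed:
        # A preference win cannot rescue a voice that responds too slowly.
        return "blocked_latency_or_playback"
    if candidate["preference_win"] >= min_preference_win:
        return "eligible_for_rollout"
    return "no_quality_case"


targets = {"first_audio_ms_p95": 300, "underrun_rate": 0.01}
nicer_but_slow = {"first_audio_ms_p95": 420, "underrun_rate": 0.002, "preference_win": 0.15}
print(tts_rollout_gate(nicer_but_slow, targets, min_preference_win=0.05))
```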

Dashboards

Build Views Around Decisions

Dashboards should answer whether to continue, pause, roll back, route around a dependency, or start a data investigation.

Executive view: SLO burn, error budget, active incidents, rollout stage, and user-visible impact.
Model view: model version, language, acoustic condition, WER/CER on approved aggregate slices, entity error, confidence, drift, and correction signals.
Serving view: queue age, batch size, model time, preprocessing time, GPU memory, autoscaler lag, retries, and fallbacks.
Privacy view: raw audio access, retention, redaction status, opt-in review counts, and synthetic fixture coverage.
Cost view: cost per audio minute, cost per generated second, GPU utilization, cache hit rate, and waste from retries.

Question: Why separate model dashboards from serving dashboards?

Model metrics explain quality regressions; serving metrics explain resource and latency behavior. Mixing them hides the root cause. For example, if p99 latency worsens while model time stays stable, the cause is more likely in queues, batching, routing, autoscaling, dependencies, or retries than in the model.
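The reasoning in that example can be sketched as a per-stage comparison. The stage names, the 10 percent tolerance, and the function name are assumptions for illustration. Per-stage percentiles do not sum to the end-to-end percentile, so the output is a list of places to look, not an exact attribution.

```python
def latency_suspects(baseline_ms, current_ms, tolerance=0.10):
    """Return (stage, relative_growth) for stages that slowed beyond tolerance.

    Both inputs map a stage name to its p99 time in milliseconds.
    """
    suspects = []
    for stage, before in baseline_ms.items():
        now = current_ms.get(stage)
        if now is None or before <= 0:
            continue  # cannot compare; handle missing stages separately
        growth = (now - before) / before
        if growth > tolerance:
            suspects.append((stage, growth))
    # Largest relative growth first.
    return sorted(suspects, key=lambda item: item[1], reverse=True)


baseline = {"queue": 20, "preprocess": 10, "model": 80}
current = {"queue": 65, "preprocess": 10, "model": 82}
print(latency_suspects(baseline, current))
```

Here model time moved by only 2.5 percent, so the queue is the only stage flagged.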

Question: How should burn-rate alerts page the team?

Avoid paging from one noisy short window alone. For production speech services, pair a fast window with a longer confirmation window, then route slower budget burn to tickets. This matches the Google SRE Workbook guidance on multi-window, multi-burn-rate alerts and keeps transient audio spikes from becoming unnecessary incidents.

Source: Google SRE Workbook, Alerting on SLOs.
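A minimal sketch of that paging rule follows. The default thresholds of 14.4 for paging and 1.0 for ticketing follow the Workbook's worked example for a 30-day SLO and should be treated as placeholders to tune. The function names and the (total, bad) window tuples are assumptions for this example.

```python
def window_burn(total, bad, allowed_bad_fraction):
    """Burn rate for one window; an empty window spends no budget."""
    if total == 0:
        return 0.0
    return (bad / total) / allowed_bad_fraction


def alert_action(short, long, slow, allowed_bad_fraction,
                 page_burn=14.4, ticket_burn=1.0):
    """Each window argument is a (total_requests, bad_requests) pair."""
    short_burn = window_burn(*short, allowed_bad_fraction)
    long_burn = window_burn(*long, allowed_bad_fraction)
    # Page only when the short and long windows agree, so a brief spike
    # that has already ended does not wake anyone.
    if short_burn >= page_burn and long_burn >= page_burn:
        return "page"
    # Slow, sustained burn goes to a ticket instead of a page.
    if window_burn(*slow, allowed_bad_fraction) >= ticket_burn:
        return "ticket"
    return "none"


# Short spike only: the longer window does not confirm it.
print(alert_action((1000, 20), (12000, 24), (500000, 400), 0.001))
```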

Coding Labs

Small Utilities For Production Decisions

These snippets are intentionally interview-sized. Each one trains the habit of turning vague operational symptoms into testable signals.

Lab 1: SLO Burn Rate

Given request counts and bad request counts in a window, compute error-budget burn rate relative to an allowed bad-request fraction.

Hidden answer: invariant, tests, and Python solution

Invariant: burn rate compares observed bad fraction to the allowed bad fraction. Test zero requests, no errors, exactly-on-budget errors, invalid negative counts, bad counts above total, invalid budget fractions, and a window that burns several times faster than budget. The function below scores one window; a real pager should combine short and long windows before waking an owner.

def burn_rate(total_requests, bad_requests, allowed_bad_fraction):
    """Return the observed bad fraction divided by the allowed bad fraction.

    A result of 1.0 means the window spends budget exactly at the allowed pace.
    """
    if total_requests < 0 or bad_requests < 0:
        raise ValueError("request counts must be non-negative")
    if bad_requests > total_requests:
        raise ValueError("bad_requests cannot exceed total_requests")
    if allowed_bad_fraction <= 0 or allowed_bad_fraction > 1:
        raise ValueError("allowed_bad_fraction must be in (0, 1]")
    if total_requests == 0:
        # No traffic means no budget was spent in this window.
        return 0.0

    observed = bad_requests / total_requests
    return observed / allowed_bad_fraction

Lab 2: Detect Slice Drift

Given baseline and current aggregate feature summaries for audio slices, return slices whose relative change exceeds a threshold.

Hidden answer: invariant, tests, and Python solution

Invariant: report slices that appear or disappear, and compare a slice only when both baseline and current values exist. Treat a zero baseline as a special case so the code does not hide a new population. Test missing slices, zero baseline, invalid negative thresholds, small changes, and large increases in silence or clipping.

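def drifted_slices(baseline, current, threshold):
    """Return sorted (slice, reason, value) tuples for slices that drifted.

    baseline and current map a slice name to a numeric summary, such as a
    silence ratio. For "relative_change" the value is the relative change;
    for every other reason it is the raw summary that triggered the finding.
    """
    if threshold < 0:
        raise ValueError("threshold must be non-negative")

    drifted = []
    # Walk the union so slices present on only one side are still reported.
    for name in sorted(set(baseline) | set(current)):
        if name not in baseline:
            drifted.append((name, "new_slice", current[name]))
            continue
        if name not in current:
            drifted.append((name, "missing_current", baseline[name]))
            continue

        now = current[name]
        before = baseline[name]
        if before == 0:
            # Relative change is undefined; report any move away from zero.
            if now != 0:
                drifted.append((name, "zero_baseline", now))
            continue

        relative = abs(now - before) / abs(before)
        if relative > threshold:
            drifted.append((name, "relative_change", relative))
    return drifted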

Lab 3: Rollback Recommendation

Given canary metrics and budgets, return whether a speech model rollout should continue, pause, or roll back.

Hidden answer: invariant, tests, and Python solution

Invariant: critical metrics override noncritical wins. Test a latency win with entity error regression, missing critical metrics, an invalid direction string, exactly-on-budget deltas, and a noncritical cost regression that should pause instead of roll back.

def rollout_recommendation(metrics, budgets):
    """Return (action, findings); action is "continue", "pause", or "rollback".

    budgets maps a metric name to {"limit", "direction", optional "critical"}.
    A value exactly on its limit passes. A missing metric pauses the rollout,
    even when its budget is critical, because it cannot be evaluated.
    """
    findings = []
    for name, budget in budgets.items():
        if name not in metrics:
            findings.append(("pause", name, "missing"))
            continue

        value = metrics[name]
        limit = budget["limit"]
        direction = budget["direction"]
        if direction not in {"lower_is_better", "higher_is_better"}:
            raise ValueError(f"unknown direction for {name}: {direction}")

        critical = budget.get("critical", False)
        failed = (
            direction == "lower_is_better" and value > limit
        ) or (
            direction == "higher_is_better" and value < limit
        )
        if failed:
            action = "rollback" if critical else "pause"
            findings.append((action, name, value))

    # Any critical failure wins over pauses and over noncritical improvements.
    if any(item[0] == "rollback" for item in findings):
        return "rollback", findings
    if findings:
        return "pause", findings
    return "continue", []

Interview Prompts

Advanced Production Questions

Prompt 1: A Canary Improves Average WER But Hurts One Region

The new ASR model improves aggregate WER by 4 percent, but a noisy mobile cohort in one region, visible through consented or approved aggregate data, regresses by 11 percent. What do you do?

Hidden answer: strong response

Do not hide behind the aggregate win. Pause or exclude the affected cohort, verify sample size and label quality, inspect audio feature drift, compare entity error and correction rate, and define a slice gate using approved aggregate or consented tags before resuming. A strong answer protects harmed cohorts while preserving the ability to ship where the model is truly better.
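A slice gate of this kind could look like the sketch below. The 5 percent regression limit, the 500-sample minimum, the slice names, and the returned labels are all assumptions for illustration. Slices with too few samples are sent to review instead of being passed or blocked on noisy evidence.

```python
def slice_gate(baseline_wer, candidate_wer, sample_counts,
               max_relative_regression=0.05, min_samples=500):
    """Return a decision per slice: "ship", "blocked", or "needs_review"."""
    decisions = {}
    for name, before in baseline_wer.items():
        after = candidate_wer.get(name)
        samples = sample_counts.get(name, 0)
        if after is None or before <= 0 or samples < min_samples:
            decisions[name] = "needs_review"  # not enough evidence either way
            continue
        change = (after - before) / before  # positive means WER got worse
        decisions[name] = "blocked" if change > max_relative_regression else "ship"
    return decisions


baseline = {"all": 0.100, "region_x_noisy_mobile": 0.200, "tiny_slice": 0.150}
candidate = {"all": 0.096, "region_x_noisy_mobile": 0.222, "tiny_slice": 0.100}
counts = {"all": 50000, "region_x_noisy_mobile": 1200, "tiny_slice": 40}
print(slice_gate(baseline, candidate, counts))
```

With these made-up numbers the aggregate improves by 4 percent and ships, the regional cohort regresses by 11 percent and is blocked, and the small slice is held for review despite its apparent win.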

Prompt 2: GPU Cost Doubles Without A Quality Change

ASR quality and latency are stable, but GPU cost per audio minute doubles after a release. How do you investigate?

Hidden answer: root-cause map

Check model artifact size, precision, batch shape, request mix, VAD savings, retries, shadow traffic, autoscaler target, warm pool size, GPU utilization, cache behavior, and route weights. Mitigate with quota, route rollback, batch tuning, disabling shadow traffic, or reverting precision/config changes while preserving quality.
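A first step in that investigation is to split GPU spend by traffic source while keeping user audio minutes as the denominator. The sketch below assumes GPU seconds can be tagged by source; the source names, the price, and the function name are hypothetical.

```python
def cost_per_audio_minute(gpu_seconds_by_source, user_audio_minutes, gpu_price_per_hour):
    """Return cost per user audio minute, broken down by traffic source."""
    if user_audio_minutes <= 0:
        raise ValueError("user_audio_minutes must be positive")
    price_per_second = gpu_price_per_hour / 3600
    breakdown = {
        source: seconds * price_per_second / user_audio_minutes
        for source, seconds in gpu_seconds_by_source.items()
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


before = cost_per_audio_minute({"user": 3600}, 1000, 3.6)
after = cost_per_audio_minute(
    {"user": 3600, "retries": 360, "shadow": 3240}, 1000, 3.6
)
print(before["total"], after["total"])
```

In this made-up case the user-serving cost is unchanged and the doubling comes from shadow traffic and retries, which is why quality and latency look stable.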

Prompt 3: What Belongs In A Speech Postmortem?

An endpointing bug clipped the first word for quiet speakers for three hours. What should the postmortem contain?

Hidden answer: postmortem outline

Include timeline, detection gap, user impact by approved aggregate or consented slice, root cause, why tests missed quiet speakers, rollback timeline, privacy-safe evidence, immediate remediation, new synthetic fixtures, release gate updates, owners, deadlines, and a follow-up check that proves the prevention work actually shipped.