Model Serving On-Call Runbook

Operating Model

Debug The Serving Path As A Queueing System

Speech products hide multiple queues: audio capture, VAD, ASR chunking, LLM decoding, retrieval, TTS synthesis, network delivery, and client playback. Production on-call work starts by locating the queue that is growing, then deciding whether the safest fix is rollback, capacity, degradation, or traffic shaping.

Stabilize: protect users with rollback, traffic shed, fallback model, feature flag, or human handoff.
Localize: split symptoms by model, version, tenant, region, device, language, codec, request path, and high-risk cohort, using only approved aggregates or consented data for those cohorts.
Quantify: compare p50, p95, p99, error rate, queue depth, GPU utilization, cost per minute, and quality slices.
Hypothesize: map each metric change to a concrete failure mode such as KV-cache pressure, VAD drift, prompt bloat, or TTS warmup.
Recover: choose the smallest reversible action, record why it was safe, and add a prevention gate.

Question: Why is p95 latency often more useful than average latency during a voice incident?

Averages can hide a small but severe tail that ruins interactive conversations. Voice users notice delayed first partials, delayed first audio byte, and long finalization. p95 or p99 by slice reveals whether one model version, language, tenant, device, or route is creating the experience regression.
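A minimal sketch of this effect, using synthetic latencies. The `percentile` helper (nearest-rank method), the slice names, and all numbers are illustrative assumptions, not production data.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

# Synthetic first-partial latencies in ms, keyed by a hypothetical device slice.
latencies_by_slice = {
    "desktop": [300] * 100,
    "mobile": [300] * 90 + [2500] * 10,
}

# Blended view: every sample in one bucket.
blended = [ms for samples in latencies_by_slice.values() for ms in samples]
print("blended", "mean:", mean(blended), "p95:", percentile(blended, 95))

# Sliced view: the same samples, split by device.
for name, samples in latencies_by_slice.items():
    print(name, "mean:", mean(samples), "p95:", percentile(samples, 95))
```

With these numbers the blended mean is 410 ms and the blended p95 is still 300 ms, while the mobile slice has a p95 of 2500 ms. The tail only becomes visible once the percentile is computed per slice.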

Incident Drills

Four Realistic On-Call Scenarios

For each drill, write the first five dashboard queries, the immediate mitigation, the rollback trigger, and the prevention item before opening the hidden answer.

Drill 1: Streaming ASR Partial Latency Doubles

A streaming ASR release keeps final WER flat but first partial p95 moves from 420 ms to 930 ms for mobile users on noisy calls.

Hidden answer: strong triage outline

Slice by model version, device, codec, language, VAD state, chunk size, region, connection type, and high-risk cohort, using only approved aggregates or consented data for those cohorts. Check VAD false silence, audio chunk buffering, batching policy, GPU queue depth, and client upload cadence. Mitigate with canary rollback or reduced batch delay on interactive traffic. Add a release gate for first partial latency and partial churn by noisy/mobile slices, not only final WER.
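One possible shape for such a release gate, as a sketch. The function name `first_partial_gate`, the 10 percent regression budget, the slice names, and the desktop numbers are assumptions for illustration; only the 420 ms and 930 ms figures come from the drill.

```python
def first_partial_gate(baseline_p95_ms, candidate_p95_ms, max_regression_ratio=1.10):
    """Return the slices that fail the gate, with the reason for each failure."""
    failures = []
    for slice_name, baseline in baseline_p95_ms.items():
        if baseline <= 0:
            raise ValueError(f"{slice_name}: baseline p95 must be positive")
        candidate = candidate_p95_ms.get(slice_name)
        if candidate is None:
            # A slice with no candidate data cannot be declared safe.
            failures.append({"slice": slice_name, "reason": "missing_candidate_metric"})
            continue
        ratio = candidate / baseline
        if ratio > max_regression_ratio:
            failures.append({
                "slice": slice_name,
                "reason": "p95_regression",
                "ratio": round(ratio, 2),
            })
    return failures

# Synthetic first-partial p95 values in ms per slice.
baseline = {"mobile_noisy": 420, "desktop_quiet": 380}
candidate = {"mobile_noisy": 930, "desktop_quiet": 390}

failures = first_partial_gate(baseline, candidate)
print(failures)
```

Here the gate blocks the release on the `mobile_noisy` slice (ratio 2.21) even though the desktop slice and final WER look healthy.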

Drill 2: TTS First Audio Byte Regresses After A Voice Upgrade

A new neural voice improves preference scores but increases p95 first audio byte and callers interrupt or abandon more often.

Hidden answer: mitigation and prevention

Compare text normalization, sentence segmentation, acoustic model warmup, vocoder time, cache hit rate, region placement, and playback errors. Mitigate by rolling back the voice for interactive flows, using a short first segment, warming pools, or routing long-form synthesis to batch capacity. Prevention: add conversational latency, interruption rate, and abandonment gates alongside MOS/preference.

Drill 3: Shared LLM Serving Costs Spike Overnight

GPU spend rises 70 percent while request count rises only 8 percent. Voice-agent task success is unchanged, but p99 latency worsens.

Hidden answer: likely causes and controls

Look for prompt growth, retrieval returning too many documents, longer conversations, disabled caching, lower batch efficiency, speculative decoding fallback, model mix changes, and retry storms. Stabilize with token budgets, admission control, priority queues, fallback models, and retrieval caps. Add cost-per-success and tokens-per-turn gates to release reviews.
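A quick way to size the problem before picking a control is to separate spend growth from traffic growth. The sketch below uses the figures from this drill; the helper name `cost_per_request_change` is an assumption.

```python
def cost_per_request_change(spend_growth, request_growth):
    """Relative change in cost per request, given relative growth in spend and in requests."""
    if spend_growth <= -1 or request_growth <= -1:
        raise ValueError("growth must be greater than -100 percent")
    return (1 + spend_growth) / (1 + request_growth) - 1

# Spend up 70 percent, requests up 8 percent.
change = cost_per_request_change(0.70, 0.08)
print(f"cost per request changed by {change:.1%}")
```

The result is roughly 57 percent more spend per request, so the extra traffic explains only a small part of the spike. The remainder has to come from per-request causes such as longer prompts, more retrieved documents, lost caching, or retries.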

Drill 4: Spoken RAG Gives Ungrounded Answers For One Tenant

A support voice agent starts answering from stale policy documents for one enterprise tenant after an index refresh.

Hidden answer: investigation path

Check index version, document ACLs, embedding model version, chunking, freshness watermark, ASR entity substitutions, reranker logs, and prompt grounding instructions. Mitigate by pinning the tenant to the last good index or forcing human handoff for affected intents. Prevention: add per-tenant retrieval freshness and grounded answer evals with noisy ASR queries before index promotion.
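A per-tenant freshness check before index promotion could look like the sketch below. It assumes each tenant has a watermark expressed as epoch seconds of its most recent document update, both in the source of truth and in the candidate index. The function name, the one-hour lag budget, and the tenant names are hypothetical.

```python
def stale_tenants(index_watermarks, source_watermarks, max_lag_seconds=3600):
    """Return (tenant, lag_seconds) pairs for tenants whose index lags the source too far.

    A lag of None means the tenant has no watermark in the candidate index at all.
    """
    if max_lag_seconds < 0:
        raise ValueError("max_lag_seconds must be non-negative")
    stale = []
    for tenant, source_ts in sorted(source_watermarks.items()):
        index_ts = index_watermarks.get(tenant)
        if index_ts is None:
            stale.append((tenant, None))
            continue
        lag = source_ts - index_ts
        if lag > max_lag_seconds:
            stale.append((tenant, lag))
    return stale

# Synthetic watermarks in epoch seconds.
source = {"tenant_a": 1_700_000_000, "tenant_b": 1_700_000_000, "tenant_c": 1_700_000_000}
index = {"tenant_a": 1_699_999_400, "tenant_b": 1_699_900_000}

blocked = stale_tenants(index, source)
print(blocked)
```

Promotion would be blocked while `blocked` is non-empty. Here `tenant_b` lags by 100000 seconds and `tenant_c` is missing from the index, while `tenant_a` is within budget.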

Coding Labs

Small Utilities For Production Judgment

These examples use synthetic aggregate metrics. They are useful in interviews because they force you to encode invariants instead of giving vague operational advice.

Lab 1: Queue Saturation Detector

Given per-minute aggregate serving metrics with numeric minute indexes, flag minutes where queue growth and tail latency suggest overload even if error rate is low.

Hidden answer: invariant, test cases, and Python solution

Invariant: overload is a joint signal. High utilization alone can be healthy; queue growth plus tail latency means work is waiting. Sort or require monotonic telemetry before comparing adjacent minutes. Test flat queues, rising queues with low latency, high latency with no queue growth, missing minutes, duplicate minutes, and impossible negative counters.

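def detect_saturation(minutes, queue_growth_threshold=25, p95_threshold_ms=1200):
    """Flag minutes where queue depth grew and p95 latency met their thresholds.

    Each item in `minutes` is a dict with "minute", "queue_depth", and "p95_ms".
    Telemetry gaps are reported as separate alerts.
    """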
    if queue_growth_threshold < 0 or p95_threshold_ms < 0:
        raise ValueError("thresholds must be non-negative")

    alerts = []
    previous_minute = None
    previous_depth = None

    for point in sorted(minutes, key=lambda item: item["minute"]):
        minute = point["minute"]
        depth = point["queue_depth"]
        p95 = point["p95_ms"]
        if depth < 0 or p95 < 0:
            raise ValueError(f"minute {minute}: queue depth and p95 must be non-negative")
        if previous_depth is None:
            previous_minute = minute
            previous_depth = depth
            continue
        if minute == previous_minute:
            raise ValueError(f"duplicate telemetry for minute {minute}")

        if minute - previous_minute > 1:
            alerts.append({
                "minute": minute,
                "missing_minutes": minute - previous_minute - 1,
                "action": "check_telemetry_gap_before_scaling_decision",
            })

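        # Growth is measured against the previous observed minute, so after a
        # telemetry gap it covers more than one minute of change.
        growth = depth - previous_depth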
        if growth >= queue_growth_threshold and p95 >= p95_threshold_ms:
            alerts.append({
                "minute": minute,
                "queue_growth": growth,
                "p95_ms": p95,
                "action": "shed_batch_or_add_capacity",
            })
        previous_minute = minute
        previous_depth = depth

    return alerts

Lab 2: Cost Per Successful Voice Turn

Compute cost per successful turn by route and flag routes that are expensive without a matching task-success gain.

Hidden answer: common mistakes and Python solution

Common mistakes are dividing by all turns instead of successful turns, ignoring zero-success routes, and comparing cost without quality. Also reject impossible aggregate telemetry before making routing decisions. A strong answer treats cost as a constraint attached to user value, not as an isolated infrastructure metric.

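def flag_cost_regressions(routes, baseline_cost_per_success, min_success_gain=0.02):
    """Return routes that cost more per successful turn than the baseline
    without a success-rate gain of at least `min_success_gain`, most expensive first.
    """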
    if baseline_cost_per_success < 0:
        raise ValueError("baseline cost per success must be non-negative")

    flagged = []
    for name, metrics in routes.items():
        turns = metrics["turns"]
        task_success_rate = metrics["task_success_rate"]
        baseline_rate = metrics["baseline_task_success_rate"]
        total_cost = metrics["total_cost_usd"]

        if turns < 0 or total_cost < 0:
            raise ValueError(f"{name}: turns and total cost must be non-negative")
        if not 0 <= task_success_rate <= 1 or not 0 <= baseline_rate <= 1:
            raise ValueError(f"{name}: success rates must be between 0 and 1")

        successes = turns * task_success_rate
        if successes == 0:
            cost_per_success = float("inf")
        else:
            cost_per_success = total_cost / successes

        success_gain = task_success_rate - baseline_rate
        if cost_per_success > baseline_cost_per_success and success_gain < min_success_gain:
            flagged.append({
                "route": name,
                "cost_per_success": cost_per_success,
                "success_gain": success_gain,
                "action": "cap_tokens_or_route_to_cheaper_model",
            })

    return sorted(flagged, key=lambda item: item["cost_per_success"], reverse=True)

Lab 3: Rollback Candidate Selector

Given model versions and slice regressions, choose the version that should be rolled back first.

Hidden answer: invariant and Python solution

Invariant: a critical-slice regression can outweigh a larger aggregate win. Weight user-harm slices such as emergency intents, medical vocabulary, identity verification, or payment flows higher than generic traffic.

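def select_rollback_candidate(version_slices, critical_weights):
    """Return the version with the highest weighted regression score, or None if no version regressed."""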
    if not version_slices:
        raise ValueError("at least one model version is required")

    scores = {}
    for version, slices in version_slices.items():
        score = 0.0
        for slice_name, regression in slices.items():
            if regression < 0:
                raise ValueError(f"{version}/{slice_name}: regression must be non-negative")
            weight = critical_weights.get(slice_name, 1.0)
            if weight < 0:
                raise ValueError(f"{slice_name}: critical weight must be non-negative")
            # Negative regressions were rejected above, so no clamping is needed.
            score += regression * weight
        scores[version] = score

    version, score = max(scores.items(), key=lambda item: item[1])
    return None if score == 0 else version

Interview Prompts

Advanced Answers Should Name The Tradeoff

Prompt 1: Interactive Versus Batch Capacity

Why should an audio ML platform isolate interactive speech traffic from batch transcription and offline evaluation traffic?

Hidden answer

Interactive traffic has strict latency and deadline requirements, while batch work values throughput and cost efficiency. Shared queues let cheap batch work create tail-latency incidents for live conversations. Isolation enables priority scheduling, admission control, predictable SLOs, and independent scaling policies.
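The priority scheduling and admission control ideas can be sketched with a single heap. This is an illustration only: the class name, the two traffic classes, and the batch depth cap are assumptions, and a real platform would more likely isolate traffic with separate queues or capacity pools than with one shared structure.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1  # lower value is served first

class PriorityScheduler:
    """Serve interactive requests before batch ones and cap queued batch work."""

    def __init__(self, max_batch_depth):
        if max_batch_depth < 0:
            raise ValueError("max_batch_depth must be non-negative")
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a class
        self._batch_depth = 0
        self._max_batch_depth = max_batch_depth

    def submit(self, request_id, traffic_class):
        """Queue a request; return False if it was shed by admission control."""
        if traffic_class not in (INTERACTIVE, BATCH):
            raise ValueError(f"unknown traffic class: {traffic_class}")
        if traffic_class == BATCH:
            if self._batch_depth >= self._max_batch_depth:
                return False
            self._batch_depth += 1
        heapq.heappush(self._heap, (traffic_class, next(self._order), request_id))
        return True

    def next_request(self):
        """Return the next request id to serve, or None when the queue is empty."""
        if not self._heap:
            return None
        traffic_class, _, request_id = heapq.heappop(self._heap)
        if traffic_class == BATCH:
            self._batch_depth -= 1
        return request_id

scheduler = PriorityScheduler(max_batch_depth=2)
accepted = [
    scheduler.submit("batch-1", BATCH),
    scheduler.submit("batch-2", BATCH),
    scheduler.submit("batch-3", BATCH),       # shed: batch queue is full
    scheduler.submit("live-1", INTERACTIVE),  # arrives last, served first
]
served = [scheduler.next_request() for _ in range(4)]
print(accepted, served)
```

The interactive request jumps ahead of the batch work that arrived earlier, and the third batch request is rejected rather than allowed to lengthen the queue.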

Prompt 2: When To Degrade Instead Of Roll Back

Give an example where graceful degradation is better than rolling back a speech-to-speech system.

Hidden answer

If the LLM path is slow but ASR and account lookup are healthy, the system can degrade to shorter answers, fewer retrieved documents, a smaller model, or human handoff for high-risk intents. Rollback is better when a known release created user harm or quality regression; degradation is better when demand temporarily exceeds capacity and the reduced path still satisfies the product contract.

Prompt 3: Production Debugging Without Private Transcripts

How can you debug ASR quality regressions without storing private raw audio or transcripts in incident tickets?

Hidden answer

Use aggregate slice metrics, consented eval sets, synthetic fixtures, hashed identifiers, redacted examples, confusion classes, correction rates, endpointing statistics, and private review tools with access controls. Incident tickets should carry enough evidence to reproduce the failure class without exposing sensitive content.
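As a sketch of the hashed-identifier idea, the snippet below replaces caller IDs with a keyed hash and keeps only slice labels and confusion classes. The record fields, function names, and the 16-character truncation are assumptions for illustration, and the key shown is a placeholder; a real key would come from a secret store.

```python
import hashlib
import hmac

def pseudonymize(identifier, secret_key):
    """Keyed SHA-256 hash: the same caller maps to the same token without exposing the raw ID."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

def ticket_evidence(records, secret_key):
    """Keep only the fields that are safe for an incident ticket; drop everything else."""
    return [
        {
            "caller": pseudonymize(record["caller_id"], secret_key),
            "slice": record["slice"],
            "confusion_class": record["confusion_class"],
        }
        for record in records
    ]

# Synthetic records; the transcript field stands in for sensitive content.
records = [
    {"caller_id": "user-001", "slice": "mobile/es", "confusion_class": "digit_substitution",
     "transcript": "synthetic placeholder text"},
    {"caller_id": "user-001", "slice": "mobile/es", "confusion_class": "name_deletion",
     "transcript": "synthetic placeholder text"},
    {"caller_id": "user-002", "slice": "desktop/en", "confusion_class": "digit_substitution",
     "transcript": "synthetic placeholder text"},
]

evidence = ticket_evidence(records, secret_key=b"placeholder-key-for-illustration")
print(evidence)
```

The ticket can still show that two failures came from the same caller and which confusion classes recur, without carrying the raw identifier or the transcript.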