Speech Serving Scaling And Reliability Exam

Timed advanced practice for capacity planning, SLOs, rollout gates, cost-aware routing, rollback judgment, and production debugging across ASR, TTS, LLM, retrieval, and speech-to-speech serving.

Exam Setup

Think Like The Owner Of The Serving Plane

In an advanced practice session or launch review, correctness is not enough. You need to explain what traffic is isolated, which metrics gate a release, how you recover, and how cost and latency change when the system grows. Use the prompts below as a two-hour practice exam.

  1. Scope: define request paths, user promises, hard dependencies, and failure domains.
  2. Budget: split end-to-end latency into ASR first partial, final transcript, retrieval, LLM first token, TTS first audio byte, and playback.
  3. Scale: estimate peak QPS, concurrent streams, token rate, audio seconds, GPU memory, and warm pool needs.
  4. Guard: choose CI gates, offline evals, canary metrics, SLO alerts, and automatic rollback triggers.
  5. Operate: describe dashboards, runbooks, privacy-safe logs, drift checks, and post-incident prevention.

Question: What is the key difference between model accuracy readiness and serving readiness?

Accuracy readiness asks whether the model is good enough on expected slices. Serving readiness asks whether the whole product can deliver that quality within latency, availability, privacy, cost, rollback, and operational constraints under live traffic.
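
The budget step in the list above is mostly arithmetic, so it helps to write it down. The sketch below is illustrative only: the stage names follow the list, but every millisecond value is an assumption, and treating the stages as strictly serial is a simplification (per-stage p95 values do not add exactly, and streaming stages overlap in practice).

```python
# Hypothetical per-stage p95 budgets in milliseconds. The numbers are
# illustrative assumptions, not measurements from any real system.
STAGE_BUDGET_MS = {
    "asr_first_partial": 300,
    "final_transcript": 400,
    "retrieval": 150,
    "llm_first_token": 450,
    "tts_first_audio_byte": 300,
    "playback": 100,
}


def check_budget(stage_budget_ms, end_to_end_target_ms):
    """Return (total_ms, headroom_ms); negative headroom means over budget."""
    total = sum(stage_budget_ms.values())
    return total, end_to_end_target_ms - total


def over_budget_stages(observed_ms, stage_budget_ms):
    """Return the sorted names of stages whose observed latency exceeds budget."""
    return sorted(
        stage for stage, ms in observed_ms.items() if ms > stage_budget_ms[stage]
    )


total_ms, headroom_ms = check_budget(STAGE_BUDGET_MS, end_to_end_target_ms=1800)
slow = over_budget_stages(
    {"retrieval": 220, "llm_first_token": 400}, STAGE_BUDGET_MS
)
```

In a review, the useful part is the headroom number: it tells you how much any one stage can regress before the end-to-end promise breaks.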

Research Update

Full-Duplex Serving Changes The Reliability Problem

Full-duplex speech-to-speech systems are moving beyond simple listen-then-speak voice agents. The serving plane now has to support continuous user audio ingestion while assistant audio, control tokens, tool actions, and safety decisions may be emitted on the same clock.

Native Duplex SpeechLM

Moshi modeled user and assistant speech streams in parallel. BayLing-Duplex extends this direction by converting a turn-based GLM-4-Voice style backbone into a native full-duplex model using a small set of dialogue-state tokens, so listen, speak, and stop decisions become next-token prediction.

Micro-Turn Cascades

DuplexCascade shows that ASR-LLM-TTS stacks can still compete if they remove brittle VAD segmentation and stream chunk-wise micro-turns with conversational control tokens. This preserves strong text LLM reasoning while improving interruption and backchannel behavior.

Role And Voice Control

PersonaPlex adds role conditioning and voice control to duplex speech models. This matters for serving because product traffic may require different role prompts, cloned or selected voices, and persona-specific safety gates without losing low-latency behavior.

Action Streams

DuplexSLA points toward speech-language-action models where listening, speaking, planning, and tool calls share a synchronized timeline. Serving reliability must therefore gate tool actions and speech output together, not as separate post-turn steps.

Question: What new SLOs appear in full-duplex serving?

Add interruption success, false interruption rate, backchannel timing, assistant stop latency, overlap recovery, user barge-in latency, micro-turn freshness, action-stream safety latency, and speech/audio continuity. Keep classic p95/p99 stage latency, error budget burn, and cost-per-success, but slice them by duplex event type.
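
Two of these SLOs, assistant stop latency and false interruption rate, can be computed from a simple event log. The following is a minimal sketch under stated assumptions: the record keys (`barge_in_ms`, `assistant_stop_ms`, `user_was_speaking`) are hypothetical, and the percentile uses the nearest-rank method rather than any particular monitoring system's definition.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def duplex_slo_summary(interruptions):
    """Summarize assistant-stop events.

    Each record is a dict with hypothetical keys:
      barge_in_ms        time the stop was triggered
      assistant_stop_ms  time assistant audio actually stopped
      user_was_speaking  False when the stop was triggered by noise or echo
    """
    stop_latencies = [
        e["assistant_stop_ms"] - e["barge_in_ms"]
        for e in interruptions
        if e["user_was_speaking"]
    ]
    false_stops = sum(1 for e in interruptions if not e["user_was_speaking"])
    return {
        "assistant_stop_latency_p95_ms": (
            nearest_rank_percentile(stop_latencies, 95) if stop_latencies else None
        ),
        "false_interruption_rate": (
            false_stops / len(interruptions) if interruptions else 0.0
        ),
    }


sample = [
    {"barge_in_ms": 1000, "assistant_stop_ms": 1120, "user_was_speaking": True},
    {"barge_in_ms": 5000, "assistant_stop_ms": 5200, "user_was_speaking": True},
    {"barge_in_ms": 9000, "assistant_stop_ms": 9180, "user_was_speaking": True},
    {"barge_in_ms": 12000, "assistant_stop_ms": 12090, "user_was_speaking": False},
]
summary = duplex_slo_summary(sample)
```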

Question: When would you choose native duplex over a micro-turn cascade?

Choose native duplex when overlapping speech, natural backchannels, emotion/prosody preservation, and sub-second responsiveness dominate. Choose a micro-turn cascade when reasoning quality, debuggability, model replacement, tool integration, and controllable production behavior are more important than fully learned conversational timing.

System Design

Timed Advanced Prompts

Prompt 1: Scale A Real-Time Speech-To-Speech Assistant

Design the serving architecture for a bilingual speech-to-speech assistant that must support 20,000 concurrent calls during peak traffic. The product promise is first partial transcript under 500 ms p95 and first synthesized audio under 1.8 seconds p95.

Hidden answer: advanced design outline

Start with a streaming gateway, per-call session state, VAD, ASR stream workers, retrieval and policy services, LLM serving, TTS streaming, and playback telemetry. Isolate real-time traffic from batch jobs, reserve warm model pools, and use priority queues for interactive turns. Track concurrent streams, audio seconds per second, partial latency, finalization latency, token throughput, TTS first audio byte, error budget burn, and cost per successful turn. Roll out by language, region, device class, and tenant.
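
A strong answer backs the outline with capacity math. The sketch below sizes one stage (for example ASR stream workers) for the 20,000 concurrent calls in the prompt. The per-worker stream capacity, target utilization, and warm-spare fraction are all assumptions chosen for illustration; real values come from load tests on your own hardware.

```python
import math


def estimate_workers(
    concurrent_streams,
    streams_per_worker,
    target_utilization=0.7,
    warm_spare_fraction=0.1,
):
    """Estimate worker counts for one streaming stage.

    streams_per_worker, target_utilization, and warm_spare_fraction are
    hypothetical planning inputs, not measured values.
    """
    if streams_per_worker <= 0:
        raise ValueError("streams_per_worker must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")

    effective_per_worker = streams_per_worker * target_utilization
    serving = math.ceil(concurrent_streams / effective_per_worker)
    warm_spares = math.ceil(serving * warm_spare_fraction)
    return {
        # With real-time audio on every stream, ingest is roughly one audio
        # second per wall-clock second per stream.
        "audio_seconds_per_second": concurrent_streams,
        "serving": serving,
        "warm_spares": warm_spares,
        "total": serving + warm_spares,
    }


plan = estimate_workers(concurrent_streams=20_000, streams_per_worker=64)
```

With these assumed inputs the plan needs 447 serving workers plus 45 warm spares. Repeat the estimate per stage (ASR, LLM, TTS), since each has a different capacity unit: streams, tokens per second, or audio seconds.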

Prompt 2: Choose A Routing Policy For Cost And Quality

You have a large ASR model with better noisy-call WER and a smaller model with half the cost and lower latency. Design a routing policy for production.

Hidden answer: policy, gates, and failure modes

Route by observable aggregate-safe signals such as language, device, expected noise class, tenant tier, real-time requirement, and confidence from early chunks. Use the large model for hard slices, high-value calls, low confidence, or escalation. Gate the policy on WER/CER slices, first partial latency, abandonment, downstream task success, and cost per successful minute. Failure modes include biased routing, stale noise classifiers, retry loops, and silently routing rare accents to the wrong model.
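
One possible shape for such a policy is an ordered rule list that defaults to the safer model when a signal is missing. This is a sketch, not a recommended production policy: the field names (`tenant_tier`, `noise_class`, `early_chunk_confidence`) and the 0.85 threshold are hypothetical.

```python
def choose_asr_model(call, confidence_floor=0.85):
    """Return "large" or "small" for one call.

    `call` is a dict of routing signals with hypothetical keys; any
    threshold here is an illustrative assumption.
    """
    # High-value traffic always gets the stronger model.
    if call.get("tenant_tier") == "premium":
        return "large"
    # Hard acoustic slices go to the model with better noisy-call WER.
    if call.get("noise_class") == "high":
        return "large"
    # Missing or low early-chunk confidence falls back to the safer model,
    # so a stale or absent signal does not silently route to the small one.
    confidence = call.get("early_chunk_confidence")
    if confidence is None or confidence < confidence_floor:
        return "large"
    return "small"


easy_call = {"tenant_tier": "standard", "noise_class": "low", "early_chunk_confidence": 0.95}
noisy_call = {"tenant_tier": "standard", "noise_class": "high", "early_chunk_confidence": 0.95}
unknown_call = {"tenant_tier": "standard", "noise_class": "low"}
```

Keeping the rules explicit makes the failure modes auditable: you can log which rule fired and check per-slice routing share for bias.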

Prompt 3: Design CI/CD For A Multi-Model Voice Agent

ASR, retrieval, prompt templates, LLM, TTS, and safety classifiers can all ship independently. Design release gates and rollback strategy.

Hidden answer: release engineering answer

Give every artifact a version, owner, data lineage record, eval report, and rollback target. CI validates schemas, model cards, safety checks, privacy constraints, and reproducible eval inputs. Pre-prod gates cover noisy ASR queries, retrieval freshness, grounded answers, tool-call safety, TTS latency, and end-to-end spoken-task success. Canary gates compare slice metrics against the previous bundle, not only global averages. Rollback can pin one artifact, restore a bundle, disable a feature, or route to a safer fallback.
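
The "pin one artifact" rollback can be made concrete by treating a release as an immutable bundle of versions. The sketch below is one way to model that; the `Bundle` fields mirror the artifacts named in the prompt, and the version strings are made up for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Bundle:
    """One deployable set of artifact versions (hypothetical schema)."""
    asr: str
    retrieval: str
    prompt: str
    llm: str
    tts: str
    safety: str


def pin_artifact(current, last_good, artifact):
    """Roll back a single artifact to its last-good version.

    Every other artifact keeps its current version. An unknown artifact
    name raises AttributeError.
    """
    return replace(current, **{artifact: getattr(last_good, artifact)})


last_good = Bundle("asr-1", "ret-1", "prompt-1", "llm-1", "tts-1", "safety-1")
current = Bundle("asr-2", "ret-1", "prompt-2", "llm-1", "tts-2", "safety-1")

# Only TTS regressed, so pin it and keep the ASR and prompt upgrades.
pinned = pin_artifact(current, last_good, "tts")
```

Restoring the whole bundle is then just deploying `last_good`, which is why every bundle needs a recorded rollback target.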

Incident Response

First-Hour Debugging Drills

Drill 1: Tail Latency Regression With No Error Spike

After a canary reaches 25 percent, p99 end-to-end turn latency jumps from 3.8 seconds to 8.4 seconds, while error rate and average latency look normal.

Hidden answer: triage and rollback trigger

Split by stage latency, model version, route, language, region, request length, prompt tokens, retrieval document count, TTS voice, and GPU queue depth. Averages can stay flat when one slice has a bad tail. Check continuous batching, KV-cache eviction, prompt growth, cache misses, retry storms, and slow TTS warmup. Roll back or freeze the canary if p99 burn persists for the real-time SLO window, even without a 5xx spike.
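
The first triage step, splitting the tail by a dimension, is a small group-by. This sketch assumes each request record is a dict with a `latency_ms` field plus whatever slicing keys you log (the names here are hypothetical), and uses a nearest-rank p99.

```python
import math
from collections import defaultdict


def p99_by_slice(requests, key):
    """Group request records by `key` and return the nearest-rank p99 per group."""
    groups = defaultdict(list)
    for record in requests:
        groups[record[key]].append(record["latency_ms"])

    result = {}
    for name, latencies in groups.items():
        latencies.sort()
        rank = max(1, math.ceil(99 * len(latencies) / 100))
        result[name] = latencies[rank - 1]
    return result


requests = [
    {"region": "a", "latency_ms": 100},
    {"region": "a", "latency_ms": 110},
    {"region": "a", "latency_ms": 8000},
    {"region": "b", "latency_ms": 100},
    {"region": "b", "latency_ms": 120},
]
tails = p99_by_slice(requests, key="region")
```

Run the same function over each candidate dimension (model version, route, language, TTS voice) and look for the one where a single group owns the bad tail.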

Drill 2: Drift In Noisy Accent Slice

Weekly aggregate WER is flat, but support tickets mention worse recognition for noisy accented calls after a data refresh.

Hidden answer: slice investigation

Compare slice-level WER, entity error rate, endpointing errors, partial churn, and downstream task success before and after the refresh. Audit labeling mix, augmentation, sample weights, language detection, and test-set leakage. Mitigate by routing the slice to the last good model or raising confidence thresholds. Add active learning coverage checks, label audit gates, and drift monitors for the affected slice.
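
A small numeric example shows why flat aggregate WER proves little. The counts below are invented for illustration: the large clean slice improves slightly while the small noisy-accent slice gets worse, and the aggregate does not move. The 0.01 absolute threshold is also an assumption.

```python
def wer(errors, ref_words):
    """Word error rate as errors over reference words (0.0 for empty input)."""
    return errors / ref_words if ref_words else 0.0


def aggregate_wer(slices):
    """Pooled WER across all slices."""
    return wer(
        sum(s["errors"] for s in slices.values()),
        sum(s["ref_words"] for s in slices.values()),
    )


def slice_regressions(before, after, max_abs_increase=0.01):
    """Return {slice: wer_delta} for slices that regressed past the threshold."""
    flagged = {}
    for name in before.keys() & after.keys():
        delta = wer(**after[name]) - wer(**before[name])
        if delta > max_abs_increase:
            flagged[name] = delta
    return flagged


# Illustrative counts only.
before = {
    "clean": {"errors": 500, "ref_words": 10_000},
    "noisy_accent": {"errors": 150, "ref_words": 1_000},
}
after = {
    "clean": {"errors": 450, "ref_words": 10_000},
    "noisy_accent": {"errors": 200, "ref_words": 1_000},
}

flat_aggregate = aggregate_wer(before) == aggregate_wer(after)
regressed = slice_regressions(before, after)
```

Here the aggregate is identical before and after the refresh, yet the noisy-accent slice moves from 15 percent to 20 percent WER.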

Drill 3: Cost Spike During Normal Traffic

Daily spend rises 45 percent with no matching request growth. Task success is unchanged, but GPU efficiency has dropped.

Hidden answer: cost debugging checklist

Inspect tokens per turn, retrieved chunks, prompt template size, retries, context carryover, cache hit rate, model mix, batch occupancy, speculative decoding acceptance, and fallback frequency. Stabilize with token caps, retrieval caps, priority classes, fallback routing, and retry budgets. Prevention is a release gate on cost per successful turn and a dashboard that ties spend to quality, latency, and model version.
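
The release gate on cost per successful turn can be expressed in a few lines. This is a sketch with assumed inputs: the spend and turn counts are invented to match the 45 percent rise in the drill, and the 10 percent tolerance is a hypothetical threshold.

```python
def cost_per_successful_turn(spend_usd, successful_turns):
    """Spend divided by turns that completed the user's task."""
    if successful_turns <= 0:
        raise ValueError("need at least one successful turn")
    return spend_usd / successful_turns


def passes_cost_gate(baseline_cost, candidate_cost, max_ratio=1.10):
    """True when the candidate costs at most max_ratio times the baseline."""
    return candidate_cost <= baseline_cost * max_ratio


# Illustrative numbers: same successful turns, 45 percent more spend.
baseline = cost_per_successful_turn(spend_usd=1000.0, successful_turns=90_000)
candidate = cost_per_successful_turn(spend_usd=1450.0, successful_turns=90_000)
gate_ok = passes_cost_gate(baseline, candidate)
```

Dividing by successful turns rather than all turns matters: retries and fallbacks raise spend without raising the denominator, so the metric moves when either waste or quality changes.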

Coding

Production-Flavored Algorithm Drills

These are not replacements for the Blind 75 chapters. They show how the same invariants appear in serving infrastructure interviews.

Drill 1: Peak Concurrent Streams

Given call intervals [start_ms, end_ms], compute the maximum number of concurrent speech streams. Explain tests and common mistakes.

Hidden answer: invariant, tests, and Python solution

Invariant: a sweep-line count equals active streams after applying every event at a timestamp. When intervals are half-open, so a call ending at t does not overlap a call starting at t, process end events before start events at the same timestamp. Test empty input, touching intervals, nested intervals, identical starts, and one long call.

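def peak_concurrent_streams(intervals):
    """Return the peak number of simultaneously active [start_ms, end_ms) calls."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # call begins
        events.append((end, -1))    # call ends

    active = 0
    peak = 0
    # Sort by (timestamp, delta). At equal timestamps -1 sorts before +1,
    # so touching intervals are not counted as overlapping.
    for _, delta in sorted(events, key=lambda item: (item[0], item[1])):
        active += delta
        peak = max(peak, active)
    return peak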

Drill 2: Canary Rollback Gate

Given old and new aggregate metrics by slice, return the slices that should block promotion when latency, quality, or cost regresses.

Hidden answer: invariant, edge cases, and Python solution

Invariant: promotion is safe only if every protected slice stays within agreed thresholds. Test missing slices, improved quality with worse latency, low-volume slices, and zero old cost. A strong answer describes minimum sample rules instead of trusting tiny samples.

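def rollback_blockers(old, new, min_calls=200):
    """Return slices whose quality, latency, or cost regressed past fixed thresholds."""
    blockers = []
    for slice_name, current in new.items():
        baseline = old.get(slice_name)
        # Slices with no baseline or too few calls are skipped, not passed;
        # callers may want to report them separately. Slices present only in
        # `old` are never visited by this loop.
        if baseline is None or current["calls"] < min_calls:
            continue

        wer_delta = current["wer"] - baseline["wer"]
        p95_delta = current["p95_ms"] - baseline["p95_ms"]
        # Floor the baseline cost so a zero old cost cannot divide by zero.
        cost_ratio = current["cost_per_min"] / max(baseline["cost_per_min"], 0.01)

        reasons = []
        if wer_delta > 0.015:    # absolute WER increase
            reasons.append("quality")
        if p95_delta > 250:      # added p95 latency in milliseconds
            reasons.append("latency")
        if cost_ratio > 1.2:     # more than 20 percent cost growth
            reasons.append("cost")
        if reasons:
            blockers.append({"slice": slice_name, "reasons": reasons})
    return blockers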

Drill 3: Sliding Window Error Budget Burn

Given per-minute bad-request counts and total-request counts, flag windows where the error rate exceeds the SLO budget.

Hidden answer: strategy and Python solution

This is the same window invariant as Blind 75 substring problems: maintain exactly the aggregate state for the current window, update it when the right edge enters, and remove the left edge when it leaves. Test all-good traffic, all-bad traffic, zero traffic, and a single bad spike.

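def burn_windows(points, window_size=5, budget_rate=0.01):
    """Flag each full window of per-minute points whose error rate exceeds budget_rate."""
    bad = 0
    total = 0
    alerts = []

    for right, point in enumerate(points):
        # The right edge enters the window.
        bad += point["bad"]
        total += point["total"]

        # The left edge leaves once the window holds more than window_size points.
        if right >= window_size:
            left = points[right - window_size]
            bad -= left["bad"]
            total -= left["total"]

        # Only evaluate full windows, and skip windows with zero traffic.
        if right + 1 >= window_size and total > 0:
            rate = bad / total
            if rate > budget_rate:
                alerts.append({
                    "ending_minute": point["minute"],
                    "error_rate": rate,
                    "window_size": window_size,
                })
    return alerts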

Rubric

What A Strong Answer Must Include

Strong Signals

Clear SLOs, stage-level latency budgets, capacity math, slice-aware evaluation, privacy-safe observability, reversible rollouts, rollback targets, and cost-per-success reasoning.

Weak Signals

Only quoting global WER, ignoring tail latency, treating all traffic as one queue, omitting rollback, skipping data lineage, or proposing logs that expose private audio or transcripts.

Question: How should you close this exam in an interview?

Summarize the chosen architecture, name the highest-risk assumption, state the first metric you would watch after launch, and describe the exact rollback or degradation path if that metric fails.