Speech Serving Scaling and Reliability Exam

Exam Setup

Think Like The Owner Of The Serving Plane

In an advanced practice session or a launch review, model correctness is not enough. You need to explain what traffic is isolated, which metrics gate a release, how you recover, and how cost and latency change when the system grows. Use the prompts below as a two-hour practice exam.

Scope: define request paths, user promises, hard dependencies, and failure domains.
Budget: split end-to-end latency into ASR first partial, final transcript, retrieval, LLM first token, TTS first audio byte, and playback.
Scale: estimate peak QPS, concurrent streams, token rate, audio seconds, GPU memory, and warm pool needs.
Guard: choose CI gates, offline evals, canary metrics, SLO alerts, and automatic rollback triggers.
Operate: describe dashboards, runbooks, privacy-safe logs, drift checks, and post-incident prevention.
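
The Budget step can be made concrete with a small check that per-stage budgets fit under the end-to-end promise. This is a minimal sketch: the stage names follow the list above, but the function name `check_budget` and every millisecond value are illustrative assumptions, not targets from this exam. It also assumes the stages run serially; percentiles do not add exactly, so treat the sum as a planning heuristic and verify it against measured end-to-end latency.

```python
def check_budget(stage_budgets_ms, end_to_end_ms):
    """Return (fits, slack_ms) for a serial pipeline of stage budgets."""
    for stage, budget in stage_budgets_ms.items():
        if budget < 0:
            raise ValueError(f"{stage} budget must be non-negative")
    used = sum(stage_budgets_ms.values())
    return used <= end_to_end_ms, end_to_end_ms - used


# Hypothetical p95 stage budgets in milliseconds (assumed values).
budgets = {
    "asr_final_transcript": 400,
    "retrieval": 150,
    "llm_first_token": 450,
    "tts_first_audio_byte": 300,
    "playback_start": 100,
}
fits, slack = check_budget(budgets, end_to_end_ms=1800)
```

A negative slack is a signal to renegotiate one stage's budget before launch rather than to hope the tail behaves.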

Question: What is the key difference between model accuracy readiness and serving readiness?

Accuracy readiness asks whether the model is good enough on expected slices. Serving readiness asks whether the whole product can deliver that quality within latency, availability, privacy, cost, rollback, and operational constraints under live traffic.

Research Update

Full-Duplex Serving Changes The Reliability Problem

Full-duplex speech-to-speech systems are moving beyond simple listen-then-speak voice agents. The serving plane now has to support continuous user audio ingestion while assistant audio, control tokens, tool actions, and safety decisions may be emitted on the same clock.

Native Duplex SpeechLM

Moshi modeled user and assistant speech streams in parallel. BayLing-Duplex extends this direction by converting a turn-based, GLM-4-Voice-style backbone into a native full-duplex model using a few added dialogue-state tokens, so that listen, speak, and stop decisions become next-token prediction.

Micro-Turn Cascades

DuplexCascade shows that ASR-LLM-TTS stacks can still compete if they remove brittle VAD segmentation and stream chunk-wise micro-turns with conversational control tokens. This preserves strong text LLM reasoning while improving interruption and backchannel behavior.

Role And Voice Control

PersonaPlex adds role conditioning and voice control to duplex speech models. This matters for serving because product traffic may require different role prompts, cloned or selected voices, and persona-specific safety gates without losing low-latency behavior.

Action Streams

DuplexSLA points toward speech-language-action models where listening, speaking, planning, and tool calls share a synchronized timeline. Serving reliability must therefore gate tool actions and speech output together, not as separate post-turn steps.

Question: What new SLOs appear in full-duplex serving?

Add interruption success, false interruption rate, backchannel timing, assistant stop latency, overlap recovery, user barge-in latency, micro-turn freshness, action-stream safety latency, and speech/audio continuity. Keep classic p95/p99 stage latency, error budget burn, and cost-per-success, but slice them by duplex event type.
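
Two of these SLOs, interruption success and assistant stop latency, can be computed from barge-in events. The sketch below is illustrative only: the event keys `barge_in_ms` and `assistant_stop_ms`, the 500 ms success threshold, and both function names are assumptions, not part of any cited system.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def interruption_slos(events, max_stop_ms=500):
    """Summarize barge-in handling.

    Each event is a dict with hypothetical keys:
      barge_in_ms: when the user started talking over the assistant
      assistant_stop_ms: when assistant audio stopped, or None if it never did
    """
    stop_latencies = []
    for event in events:
        stopped = event["assistant_stop_ms"]
        if stopped is not None:
            stop_latencies.append(stopped - event["barge_in_ms"])
    # An interruption succeeds only if the assistant stopped quickly enough.
    successes = sum(1 for latency in stop_latencies if latency <= max_stop_ms)
    return {
        "interruption_success_rate": successes / len(events) if events else None,
        "stop_latency_p95_ms": (
            nearest_rank_percentile(stop_latencies, 95) if stop_latencies else None
        ),
    }
```

Events where the assistant never stopped count against the success rate but contribute no latency sample, so the two numbers should be read together.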

Primary references for this research update

Verify the serving claims against the primary sources: Moshi, BayLing-Duplex, DuplexCascade, PersonaPlex, and DuplexSLA. Treat performance and benchmark claims as fast-moving research evidence, not as production guarantees.

Question: When would you choose native duplex over a micro-turn cascade?

Choose native duplex when overlapping speech, natural backchannels, emotion/prosody preservation, and sub-second responsiveness dominate. Choose a micro-turn cascade when reasoning quality, debuggability, model replacement, tool integration, and controllable production behavior are more important than fully learned conversational timing.

System Design

Timed Advanced Prompts

Prompt 1: Scale A Real-Time Speech-To-Speech Assistant

Design the serving architecture for a bilingual speech-to-speech assistant that must support 20,000 concurrent calls during peak traffic. The product promise is first partial transcript under 500 ms p95 and first synthesized audio under 1.8 seconds p95.

Hidden answer: advanced design outline

Start with a streaming gateway, per-call session state, VAD, ASR stream workers, retrieval and policy services, LLM serving, TTS streaming, and playback telemetry. Isolate real-time traffic from batch jobs, reserve warm model pools, and use priority queues for interactive turns. Track concurrent streams, audio seconds per second, partial latency, finalization latency, token throughput, TTS first audio byte, error budget burn, and cost per successful turn. Roll out by language, region, device class, and tenant.
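
The capacity part of this outline usually starts with back-of-the-envelope pool sizing. The sketch below is a hedged example: `workers_needed`, the 64 streams per worker, and the 30 percent headroom are assumed values chosen for illustration, and real numbers must come from load tests on the actual models and hardware.

```python
import math


def workers_needed(peak_streams, streams_per_worker, headroom=0.3):
    """Workers required for peak load plus a fractional headroom."""
    if peak_streams < 0 or streams_per_worker <= 0 or headroom < 0:
        raise ValueError("invalid capacity inputs")
    return math.ceil(peak_streams * (1 + headroom) / streams_per_worker)


# 20,000 concurrent calls from the prompt; 64 streams per ASR worker is assumed.
asr_workers = workers_needed(20_000, streams_per_worker=64)
```

Repeat the same calculation per stage (ASR, LLM, TTS), because each stage has a different concurrency limit and a different warm pool cost.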

Prompt 2: Choose A Routing Policy For Cost And Quality

You have a large ASR model with better noisy-call WER and a smaller model with half the cost and lower latency. Design a routing policy for production.

Hidden answer: policy, gates, and failure modes

Route on observable signals that are safe to use in aggregate, such as language, device, expected noise class, tenant tier, real-time requirement, and confidence from early chunks. Use the large model for hard slices, high-value calls, low confidence, or escalation. Gate the policy on WER/CER slices, first partial latency, abandonment, downstream task success, and cost per successful minute. Failure modes include biased routing, stale noise classifiers, retry loops, and silently sending pronunciation-variation slices (visible only through approved aggregates or consented data) to the wrong model.
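
A routing policy of this kind can be expressed as a small, testable function. The following is a minimal sketch, not a recommended policy: the signal names (`escalated`, `tenant_tier`, `noise_class`, `early_confidence`) and the 0.80 confidence floor are hypothetical.

```python
def choose_asr_model(call, confidence_floor=0.80):
    """Return "large" or "small" for a call described by hypothetical signals."""
    # Escalations and high-value tenants always get the stronger model.
    if call.get("escalated") or call.get("tenant_tier") == "premium":
        return "large"
    # Hard acoustic conditions go to the model with better noisy-call WER.
    if call.get("noise_class") == "high":
        return "large"
    # Low confidence on early chunks triggers an upgrade mid-call.
    early_confidence = call.get("early_confidence")
    if early_confidence is not None and early_confidence < confidence_floor:
        return "large"
    return "small"
```

Keeping the policy a pure function of logged signals makes it possible to replay past traffic and measure how a threshold change shifts cost and slice-level WER before it ships.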

Prompt 3: Design CI/CD For A Multi-Model Voice Agent

ASR, retrieval, prompt templates, LLM, TTS, and safety classifiers can all ship independently. Design release gates and rollback strategy.

Hidden answer: release engineering answer

Give every artifact a version, owner, data lineage record, eval report, and rollback target. CI validates schemas, model cards, safety checks, privacy constraints, and reproducible eval inputs. Pre-prod gates cover noisy ASR queries, retrieval freshness, grounded answers, tool-call safety, TTS latency, and end-to-end spoken-task success. Canary gates compare slice metrics against the previous bundle, not only global averages. Rollback can pin one artifact, restore a bundle, disable a feature, or route to a safer fallback.
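
The "pin one artifact" rollback option can be sketched with an immutable bundle record. Everything here is illustrative: the `Artifact` fields, the `pin_artifact` helper, and the version strings are assumptions, and a real system would also carry owner, lineage, and eval report references.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Artifact:
    name: str
    version: str
    rollback_target: str


def pin_artifact(bundle, name):
    """Return a new bundle with one artifact reverted to its rollback target."""
    if name not in bundle:
        raise KeyError(f"unknown artifact: {name}")
    pinned = dict(bundle)  # copy so the deployed bundle record is not mutated
    current = bundle[name]
    pinned[name] = replace(current, version=current.rollback_target)
    return pinned


# Hypothetical bundle with made-up version strings.
bundle = {
    "asr": Artifact("asr", "2.4.0", "2.3.1"),
    "tts": Artifact("tts", "1.9.0", "1.8.2"),
}
rolled_back = pin_artifact(bundle, "asr")
```

Returning a new bundle instead of editing in place keeps the previous bundle available as an audit record and as a second rollback target.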

Incident Response

First-Hour Debugging Drills

Drill 1: Tail Latency Regression With No Error Spike

After a canary reaches 25 percent, p99 end-to-end turn latency jumps from 3.8 seconds to 8.4 seconds, while error rate and average latency look normal.

Hidden answer: triage and rollback trigger

Split by stage latency, model version, route, language, region, request length, prompt tokens, retrieval document count, TTS voice, and GPU queue depth. Averages can stay flat when one slice has a bad tail. Check continuous batching, KV-cache eviction, prompt growth, cache misses, retry storms, and slow TTS warmup. Roll back or freeze the canary if p99 burn persists for the real-time SLO window, even without a 5xx spike.
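
The first triage step, splitting the tail by slice, can be sketched as follows. This is a hedged example: `p99_by_slice` and the sample data are hypothetical, and it uses a simple nearest-rank p99 rather than whatever estimator a production metrics system provides.

```python
from collections import defaultdict


def p99_by_slice(samples):
    """Map slice name to nearest-rank p99 latency in milliseconds.

    samples is an iterable of (slice_name, latency_ms) pairs.
    """
    grouped = defaultdict(list)
    for slice_name, latency_ms in samples:
        grouped[slice_name].append(latency_ms)
    result = {}
    for slice_name, values in grouped.items():
        values.sort()
        rank = (99 * len(values) + 99) // 100  # integer ceiling of 0.99 * n
        result[slice_name] = values[rank - 1]
    return result


# Made-up data: one slice has a small number of very slow turns.
samples = [("en", 1000)] * 100 + [("es", 1000)] * 98 + [("es", 9000)] * 2
tails = p99_by_slice(samples)
```

In this made-up data the mean for the slow slice moves only from 1000 ms to 1160 ms while its p99 jumps to 9000 ms, which is the pattern the drill describes.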

Drill 2: Drift In A Consented Noisy Speech Slice

Weekly aggregate WER is flat, but after a data refresh, support tickets report worse recognition for a noisy pronunciation-variation slice that is tracked through approved aggregates or consented data.

Hidden answer: slice investigation

Compare privacy-approved slice-level WER, entity error rate, endpointing errors, partial churn, and downstream task success before and after the refresh. Audit labeling mix, augmentation, sample weights, language detection, consent or aggregation policy, and test-set leakage. Mitigate by routing the slice to the last good model or raising confidence thresholds. Add active learning coverage checks, label audit gates, and drift monitors for the affected slice, using only approved aggregate or consented data.
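
Slice-level WER for the before and after comparison can be computed with a word-level edit distance. This is a minimal sketch under simple assumptions: whitespace tokenization, no text normalization, and function names (`word_errors`, `slice_wer`) chosen for illustration.

```python
def word_errors(reference, hypothesis):
    """Word-level edit distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] is the distance between the ref words seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def slice_wer(pairs):
    """Corpus WER: total word errors divided by total reference words."""
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    if words == 0:
        raise ValueError("slice has no reference words")
    return errors / words
```

Run `slice_wer` on the same consented evaluation pairs with the model from before and after the refresh; a gap that is invisible in the weekly aggregate can be large within one slice.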

Drill 3: Cost Spike During Normal Traffic

Daily spend rises 45 percent with no matching request growth. Task success is unchanged, but GPU utilization is less efficient.

Hidden answer: cost debugging checklist

Inspect tokens per turn, retrieved chunks, prompt template size, retries, context carryover, cache hit rate, model mix, batch occupancy, speculative decoding acceptance, and fallback frequency. Stabilize with token caps, retrieval caps, priority classes, fallback routing, and retry budgets. Prevention is a release gate on cost per successful turn and a dashboard that ties spend to quality, latency, and model version.
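
The release gate on cost per successful turn reduces to a ratio that is easy to compute and compare across versions. The sketch below uses assumed names and a hypothetical 10 percent allowed increase; the spend and turn counts in the usage lines are made up.

```python
def cost_per_successful_turn(spend_usd, successful_turns):
    """Spend divided by the number of turns that completed the user's task."""
    if spend_usd < 0:
        raise ValueError("spend_usd must be non-negative")
    if successful_turns <= 0:
        raise ValueError("successful_turns must be positive")
    return spend_usd / successful_turns


def cost_gate_passes(baseline, candidate, max_increase=0.10):
    """True if the candidate's cost per success is within the allowed increase."""
    return candidate <= baseline * (1 + max_increase)


# Made-up figures: spend rises 45 percent while successful turns stay flat.
before = cost_per_successful_turn(1000.0, 40_000)
after = cost_per_successful_turn(1450.0, 40_000)
gate_ok = cost_gate_passes(before, after)
```

Dividing by successful turns rather than all turns keeps the metric honest when retries or fallbacks inflate request counts without improving outcomes.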

Coding

Production-Flavored Algorithm Drills

These drills keep the coding practice inside speech serving reliability: capacity, rollout gates, and SLO windows all depend on careful handling of telemetry edges.

Drill 1: Peak Concurrent Streams

Given call intervals [start_ms, end_ms], compute the maximum number of concurrent speech streams. Explain tests and common mistakes.

Hidden answer: invariant, tests, and Python solution

Invariant: after all events at a timestamp are applied, the sweep-line count equals the number of active streams. Treat intervals as half-open [start_ms, end_ms) call spans and validate that every call has positive duration before the sweep. Because the spans are half-open, a call ending at t does not overlap a call starting at t, so process end events before start events at equal timestamps. Test empty input, touching intervals, nested intervals, identical starts, zero-length intervals, inverted intervals, and one long call. Common mistakes include processing starts before ends at a tie, which overcounts touching calls, and silently accepting zero-length or inverted intervals.

def peak_concurrent_streams(intervals):
    events = []
    for start, end in intervals:
        if end <= start:
            raise ValueError("end_ms must be greater than start_ms")
        events.append((start, 1))
        events.append((end, -1))

    active = 0
    peak = 0
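    # Tuples sort by (timestamp, delta): at equal timestamps -1 sorts before +1,
    # so a call ending at t is closed before a call starting at t is opened.
    for _, delta in sorted(events):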
        active += delta
        peak = max(peak, active)
    return peak

Drill 2: Canary Rollback Gate

Given old and new aggregate metrics by slice, return the slices that should block promotion when latency, quality, or cost regresses.

Hidden answer: invariant, edge cases, and Python solution

Invariant: promotion is safe only if every launch-critical slice stays within agreed thresholds. Test missing slices, improved quality with worse latency, low-volume old or new slices, and zero old cost. A strong answer describes minimum sample rules instead of trusting tiny samples.

def rollback_blockers(old, new, min_calls=200):
    if min_calls <= 0:
        raise ValueError("min_calls must be positive")

    blockers = []
    for slice_name, current in new.items():
        baseline = old.get(slice_name)
        if baseline is None:
            continue

        if current["calls"] < 0 or baseline["calls"] < 0:
            raise ValueError("calls must be non-negative")
        if current["wer"] < 0 or baseline["wer"] < 0:
            raise ValueError("wer must be non-negative")
        if current["p95_ms"] < 0 or baseline["p95_ms"] < 0:
            raise ValueError("p95_ms must be non-negative")
        if current["cost_per_min"] < 0 or baseline["cost_per_min"] < 0:
            raise ValueError("cost_per_min must be non-negative")

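        # Minimum sample rule: skip slices too small to compare reliably.
        if current["calls"] < min_calls or baseline["calls"] < min_calls:
            continue

        wer_delta = current["wer"] - baseline["wer"]
        p95_delta = current["p95_ms"] - baseline["p95_ms"]
        # Floor the baseline cost so a zero old cost cannot divide by zero.
        cost_ratio = current["cost_per_min"] / max(baseline["cost_per_min"], 0.01)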

        reasons = []
        if wer_delta > 0.015:
            reasons.append("quality")
        if p95_delta > 250:
            reasons.append("latency")
        if cost_ratio > 1.2:
            reasons.append("cost")
        if reasons:
            blockers.append({"slice": slice_name, "reasons": reasons})
    return blockers

Drill 3: Sliding Window Error Budget Burn

Given per-minute bad-request counts and total-request counts, flag windows where the error rate exceeds the SLO budget.

Hidden answer: strategy and Python solution

Maintain running bad and total counts for the current SLO window only: add the minute that enters on the right edge and subtract the minute that leaves on the left edge. Evaluate a window only once it is full, and skip windows with zero traffic because their error rate is undefined. Reject non-positive window sizes, negative request counts, and bad counts above total counts before computing burn. Test all-good traffic, all-bad traffic, zero traffic, invalid telemetry, and a single bad spike.

def burn_windows(points, window_size=5, budget_rate=0.01):
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if budget_rate < 0:
        raise ValueError("budget_rate must be non-negative")

    bad = 0
    total = 0
    alerts = []

    for right, point in enumerate(points):
        if point["bad"] < 0 or point["total"] < 0:
            raise ValueError("request counts must be non-negative")
        if point["bad"] > point["total"]:
            raise ValueError("bad requests cannot exceed total requests")

        bad += point["bad"]
        total += point["total"]

        if right >= window_size:
            left = points[right - window_size]
            bad -= left["bad"]
            total -= left["total"]

        if right + 1 >= window_size and total > 0:
            rate = bad / total
            if rate > budget_rate:
                alerts.append({
                    "ending_minute": point["minute"],
                    "error_rate": rate,
                    "window_size": window_size,
                })
    return alerts

Rubric

What A Strong Answer Must Include

Strong Signals

Clear SLOs, stage-level latency budgets, capacity math, slice-aware evaluation, privacy-safe observability, reversible rollouts, rollback targets, and cost-per-success reasoning.

Weak Signals

Only quoting global WER, ignoring tail latency, treating all traffic as one queue, omitting rollback, skipping data lineage, or proposing logs that expose private audio or transcripts.

Question: How should you close this exam in an interview?

Summarize the chosen architecture, name the highest-risk assumption, state the first metric you would watch after launch, and describe the exact rollback or degradation path if that metric fails.