Module 33

Real-Time Speech System Design Lab

Work through an advanced practice loop for low-latency ASR, speech-to-speech, spoken RAG, and TTS systems: draw the serving path, budget latency, define rollback gates, debug production symptoms, and write small deterministic utilities over aggregate telemetry.

System Design

Start With The Spoken Turn Contract

A real-time speech product is more than a model. It is a timed contract across capture, VAD, ASR, policy, retrieval, LLM reasoning, TTS, playback, observability, and rollback. The strongest answers make the contract explicit before debating model choices.

Interactive Path

Wake word, VAD, streaming ASR, endpointing, intent detection, retrieval, LLM/tool call, TTS first chunk, playback, and barge-in.

Question: Which metrics belong on the interactive path?

Time to first partial, partial rewrite rate, endpointing delay, first audio byte, complete spoken turn latency, cancellation success, barge-in recovery, task success, correction rate, and cost per successful turn. Final WER alone is too narrow.
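Several of these metrics fall out of a handful of per-turn timestamps. The sketch below is illustrative only: the `TurnEvents` field names are hypothetical, and it assumes all timestamps come from one shared clock and that complete turn latency is measured from the end of user speech to the end of playback. A real system may draw those boundaries differently.

```python
from dataclasses import dataclass


@dataclass
class TurnEvents:
    """Hypothetical per-turn timestamps, in milliseconds on one shared clock."""
    speech_start_ms: float
    first_partial_ms: float
    speech_end_ms: float
    endpoint_ms: float          # when the endpointer declared the turn finished
    first_audio_byte_ms: float  # first TTS audio byte sent toward the client
    playback_end_ms: float


def turn_metrics(e: TurnEvents) -> dict:
    """Derive interactive-path latencies from one turn's timestamps."""
    return {
        "time_to_first_partial_ms": e.first_partial_ms - e.speech_start_ms,
        "endpointing_delay_ms": e.endpoint_ms - e.speech_end_ms,
        # Assumption: first audio byte is measured from the endpoint decision.
        "first_audio_byte_ms": e.first_audio_byte_ms - e.endpoint_ms,
        # Assumption: the complete turn runs from end of speech to end of playback.
        "complete_turn_ms": e.playback_end_ms - e.speech_end_ms,
    }
```

Quality metrics such as task success and correction rate need labels or user signals and cannot be derived from timestamps alone.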

Control Plane

Model registry, prompt registry, dataset manifests, eval reports, canary routing, tenant quotas, rollback targets, and incident state.

Question: What makes the control plane interview-worthy?

It shows how research changes become reversible production changes. A strong design records model hashes, feature contracts, prompt versions, eval slices, privacy policy, canary cohorts, SLOs, cost gates, and known-good rollback targets.
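One way to make "reversible" concrete is to refuse a canary unless the release record is complete and points at a known-good rollback target. This is a minimal sketch under assumptions: `ReleaseRecord`, its fields, and `canary_blockers` are hypothetical names covering only a subset of the items listed above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReleaseRecord:
    """Hypothetical control-plane record for one candidate release."""
    model_hash: str
    prompt_version: str
    feature_contract: str
    eval_slices: tuple
    canary_cohort: str
    rollback_target: str  # model hash of the release to fall back to


def canary_blockers(record, known_good_hashes):
    """Return the reasons this record may not be canaried (empty list = OK)."""
    blockers = [
        f"missing:{f.name}" for f in fields(record) if not getattr(record, f.name)
    ]
    if record.rollback_target:
        if record.rollback_target == record.model_hash:
            blockers.append("rollback_target_is_candidate")
        elif record.rollback_target not in known_good_hashes:
            blockers.append("rollback_target_not_known_good")
    return blockers
```

Returning a list of named blockers rather than a boolean keeps the refusal auditable in incident review.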

Prompt: Design A Real-Time Meeting Assistant

Build a meeting assistant that streams captions, answers questions about the meeting, and produces a summary. It must support noisy rooms, private enterprise data, and rollback during business hours.

Hidden answer: strong architecture outline

Separate the live caption path from slower summarization and retrieval jobs. Use client audio capture, VAD, streaming ASR, stable-prefix rendering, speaker diarization if required, a consent-aware transcript store, retrieval over authorized meeting context, and background summarization. Gate releases by partial latency, final WER by noise and accent slice, speaker attribution, retrieval grounding, privacy checks, and cost per meeting hour. Keep previous ASR, diarization, prompt, and index versions available for rollback.
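Gating "final WER by noise and accent slice" can be sketched as a per-slice comparison against the baseline. The function name, the slice names in the usage note, and the 0.01 absolute tolerance are assumptions chosen for illustration, not recommended values.

```python
def regressed_slices(baseline_wer, candidate_wer, max_abs_increase=0.01):
    """List slices where the candidate is missing or WER rose past the tolerance.

    Both arguments map slice name -> WER as a fraction (0.12 means 12%).
    """
    regressed = []
    for slice_name, base in sorted(baseline_wer.items()):
        cand = candidate_wer.get(slice_name)
        if cand is None:
            # A slice with no candidate eval is treated as a failure, not a pass.
            regressed.append((slice_name, "missing_in_candidate"))
        elif cand - base > max_abs_increase:
            regressed.append((slice_name, "wer_regression"))
    return regressed
```

For example, a candidate that improves the quiet slice but worsens `noisy_room` from 0.14 to 0.17 is still flagged, so an average win cannot hide a slice loss.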

Production Debugging

Debug The Whole Turn, Not Just The Model

Many real-time regressions hide in orchestration, queueing, endpointing, caches, and client rendering. Practice asking for the smallest trace that separates model quality from system behavior.

Symptom: Good Final WER, Bad UX

Users say captions feel unstable. Offline final WER is unchanged.

Hidden answer: first checks

Check partial rewrite rate, stable-prefix policy, endpointing, VAD thresholds, chunk size, decoder beam settings, timestamp alignment, client debounce, network jitter, and slices by noise, accent, language, and device. Mitigate with a stable prefix, endpointing rollback, or affected-slice routing.
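A stable-prefix policy can be as simple as rendering only the words that agree across the last few partials. The sketch below is one possible policy, not a standard one; `stable_prefix` and `min_agreement` are hypothetical names.

```python
def stable_prefix(partials, min_agreement=2):
    """Return the leading words that are identical, position by position,
    across the last `min_agreement` partial transcripts."""
    if min_agreement < 1 or len(partials) < min_agreement:
        return []
    recent = [p.split() for p in partials[-min_agreement:]]
    stable = []
    # zip stops at the shortest partial, so only shared positions are compared.
    for words_at_position in zip(*recent):
        if all(w == words_at_position[0] for w in words_at_position):
            stable.append(words_at_position[0])
        else:
            break  # the first disagreement ends the stable prefix
    return stable
```

With partials `"i want to"` followed by `"i want two tickets"`, this returns `["i", "want"]`. Raising `min_agreement` trades caption latency for stability. This sketch does not by itself guarantee that a word shown once is never withdrawn; a client would also need to remember what it has already committed.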

Symptom: TTS Starts Late

First audio byte regressed, but total synthesis time is flat.

Hidden answer: first checks

Inspect request queue time, model warm pool, prompt preprocessing, text normalization, voice cache hits, network handoff, chunk streaming, and admission control. A flat synthesis metric can hide queueing or orchestration delay before synthesis begins.
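Splitting time to first audio byte into stages shows how synthesis can stay flat while the user-visible number regresses. The stage boundaries and parameter names below are assumptions; real traces may expose different spans.

```python
def first_byte_breakdown(received_ms, dequeued_ms, synth_start_ms, first_chunk_ms):
    """Split time to first audio byte into stages (ms timestamps, one clock)."""
    return {
        "queue_ms": dequeued_ms - received_ms,
        # Hypothetical grouping: text normalization, voice lookup, warm-up.
        "pre_synthesis_ms": synth_start_ms - dequeued_ms,
        "synthesis_to_first_chunk_ms": first_chunk_ms - synth_start_ms,
        "first_audio_byte_ms": first_chunk_ms - received_ms,
    }


def breakdown_delta(before, after):
    """Per-stage change between two breakdowns (positive = slower)."""
    return {stage: after[stage] - before[stage] for stage in before}
```

If `queue_ms` grows by 160 ms while `synthesis_to_first_chunk_ms` is unchanged, the regression sits in admission or pool capacity rather than in the TTS model.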

Symptom: Spoken RAG Hallucinates

ASR and retrieval dashboards look healthy, but spoken answers cite documents that do not support the answer.

Hidden answer: first checks

Audit ASR entity substitutions, query rewrite prompts, index freshness, ACL filtering, reranker version, top-k size, stale cache entries, answer grounding, citation selection, and TTS pronunciation of entities. Add evals that inject realistic ASR errors into retrieval queries.
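Injecting ASR errors into retrieval queries can start with a small confusion table of entity substitutions. The helper below is a sketch; the confusion pairs in the usage note are made up, and a real table would be mined from observed ASR substitutions.

```python
import re


def inject_asr_errors(query, confusions):
    """Return one perturbed query per applicable confusion.

    `confusions` maps a correct term to a plausible mishearing. Each variant
    applies exactly one substitution so failures can be attributed to it.
    """
    lowered = query.lower()
    variants = []
    for correct, misheard in confusions.items():
        # Word boundaries avoid rewriting the term inside longer words.
        pattern = r"\b" + re.escape(correct.lower()) + r"\b"
        if re.search(pattern, lowered):
            variants.append(re.sub(pattern, misheard, lowered))
    return variants
```

For instance, `inject_asr_errors("Summarize the Kubernetes migration for Q3", {"kubernetes": "cooper netties", "q3": "q tree"})` yields two variants. Comparing retrieval recall on the clean query against each variant shows which entities the pipeline cannot survive mishearing.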

Symptom: Cost Spike Under Flat Traffic

Traffic is stable, task success is flat, and GPU spend is up.

Hidden answer: first checks

Check average context tokens, retrieved chunks, retries, shadow traffic, fallback model routes, cache hit rate, batch efficiency, autoscaler floor, warm replicas, eval jobs on live pools, and tenant mix. Report cost per successful spoken turn, not only aggregate spend.
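Cost per successful spoken turn is a simple ratio, but writing it down forces the zero-success case to be handled. The function name and inputs are assumptions for illustration.

```python
def cost_per_successful_turn(gpu_cost_usd, successful_turns):
    """GPU spend divided by turns that completed the user's task.

    Returns None when there were no successful turns, because a ratio is
    undefined there and reporting 0.0 would look like a free system.
    """
    if successful_turns <= 0:
        return None
    return gpu_cost_usd / successful_turns
```

Computing this per tenant, model route, or feature flag, rather than only in aggregate, helps localize which of the drivers above moved.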

Coding Lab

Utilities That Encode Operational Judgment

These are interview-sized functions. Before opening the answers, state the invariant, edge cases, and tests.

Drill 1: End-To-End Latency Budget

Given component p95 latencies and a product SLO, return whether the path has budget left and name the largest contributors.

Hidden answer: invariant, tests, and Python solution

Invariant: the sum of component p95 estimates is a rough, usually conservative planning budget for a serial path, not a replacement for percentiles measured on end-to-end traces. Test an exact-budget case, empty components, one dominant component, and negative or missing values, which real code should reject through input validation.

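def latency_budget_report(components_ms, slo_ms, top_k=3):
    # Reject missing or negative estimates instead of silently summing them.
    for name, value in components_ms.items():
        if value is None or value < 0:
            raise ValueError(f"invalid latency for {name!r}: {value!r}")

    # Summing p95s is a planning estimate, not the p95 of the full trace.
    total = sum(components_ms.values())
    # Largest contributors first, so the report names where to cut.
    contributors = sorted(
        components_ms.items(),
        key=lambda item: item[1],
        reverse=True,
    )[:top_k]
    return {
        "total_ms": total,
        "slo_ms": slo_ms,
        "within_budget": total <= slo_ms,  # exactly on budget still passes
        "remaining_ms": slo_ms - total,  # negative when over budget
        "top_contributors": contributors,
    }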

Drill 2: Stable Prefix Churn

Given successive partial transcripts, compute how often already-shown words changed. Use this as a rough caption stability signal.

Hidden answer: invariant, common mistakes, and Python solution

Invariant: compare only the word positions that adjacent partials share, up to the length of the shorter one. Common mistakes include comparing full strings, counting appended words as churn, and mishandling empty partials.

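def prefix_churn_rate(partials):
    # Fraction of already-shown word positions that changed between
    # adjacent partials. Returns 0.0 when there is nothing to compare.
    comparisons = 0
    churns = 0

    for prev, curr in zip(partials, partials[1:]):
        prev_words = prev.split()
        curr_words = curr.split()
        # Only positions present in both partials count, so words
        # appended at the end are never treated as churn.
        shared = min(len(prev_words), len(curr_words))
        if shared == 0:
            continue  # an empty partial offers nothing to compare
        comparisons += shared
        for i in range(shared):
            if prev_words[i] != curr_words[i]:
                churns += 1

    return 0.0 if comparisons == 0 else churns / comparisons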

Drill 3: Canary Rollback Trigger

Decide whether to roll back a canary from aggregate slice metrics. Critical metrics should dominate average wins.

Hidden answer: strong reasoning and Python solution

A strong gate protects critical slices and privacy first. Average quality improvements cannot excuse a privacy error, protected-slice regression, or tail-latency breach.

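def should_rollback_canary(metrics):
    # Each check is independent: any single breach triggers rollback,
    # and no average improvement can offset it. Thresholds are examples.
    # Note: a missing metric defaults to 0 and therefore passes its check.
    reasons = []
    # Any privacy error at all is disqualifying.
    if metrics.get("privacy_errors", 0) > 0:
        reasons.append("privacy_errors")
    # Tail latency: p99 full-turn latency grew by more than 250 ms.
    if metrics.get("p99_turn_latency_delta_ms", 0) > 250:
        reasons.append("tail_latency")
    # Protected slice: task success in noisy rooms fell by more than 3 points.
    if metrics.get("noisy_room_task_success_delta", 0) < -0.03:
        reasons.append("noisy_room_regression")
    # Entity recall fell by more than 2 points.
    if metrics.get("entity_recall_delta", 0) < -0.02:
        reasons.append("entity_recall_regression")

    return {
        "rollback": bool(reasons),
        "reasons": reasons,
    }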

Timed Exam

Advanced Practice Prompts

Prompt 1: Local Versus Cloud Speech-To-Speech

A privacy-sensitive assistant needs wake word, dictation, RAG over user documents, and natural TTS. Decide what runs on device, what runs on a private server, and what can use managed cloud services.

Hidden answer: strong decision outline

Keep wake word, VAD, privacy filters, and simple commands local when feasible. Use private server inference for heavy ASR, embeddings, retrieval, and LLM reasoning when user documents are involved. Use managed cloud only behind explicit consent, clear data boundaries, encryption, retention controls, and fallback behavior. Discuss latency, cost, model quality, offline mode, telemetry minimization, and rollback for each tier.
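The placement rules above can be written as a small, auditable decision function. This is a deliberately coarse sketch: `choose_tier` and its three boolean inputs are hypothetical, and a real policy would also weigh latency, cost, and model quality per stage.

```python
def choose_tier(fits_on_device, touches_user_documents, cloud_consent):
    """Pick where a pipeline stage runs, preferring the most private tier."""
    if fits_on_device:
        return "device"
    # Stages that see user documents stay on the private server, and so does
    # anything for which the user has not explicitly consented to cloud use.
    if touches_user_documents or not cloud_consent:
        return "private_server"
    return "managed_cloud"
```

Under these assumptions, wake word and VAD resolve to `device`, retrieval over user documents resolves to `private_server` even with consent, and only consented stages that touch no user documents may reach `managed_cloud`.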

Prompt 2: Debug A Full-Turn Latency Regression

The product SLO is p95 complete spoken turn under 1.8 seconds. After a release, p95 is 2.4 seconds. ASR, LLM, and TTS team dashboards each say their component is healthy.

Hidden answer: strong incident outline

Build an end-to-end trace waterfall and compare old versus new by queue time, orchestration gaps, retries, retrieval, context length, tool calls, TTS first byte, playback buffering, cancellation, client version, region, tenant, and model route. Mitigate by rolling back the release, disabling expensive features, using smaller model tiers, reducing retrieval top-k, or routing noninteractive work away from live pools. Add a release gate for full-turn latency, not only component metrics.
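When every component looks healthy, the regression often sits in time that no component span covers. The sketch below diffs two traces and adds an explicit `unattributed_gap` stage. The trace shape (`e2e_ms`, `spans_ms`) is an assumption, and it should be applied to individual representative traces rather than to component percentiles, since percentiles of components do not sum to the percentile of the whole.

```python
def waterfall_diff(old, new, top_k=3):
    """Rank stages by how much slower they are in `new` than in `old`.

    Each trace is {"e2e_ms": number, "spans_ms": {stage_name: number}}.
    Time not covered by any span is reported as "unattributed_gap".
    """
    def with_gap(trace):
        spans = dict(trace["spans_ms"])
        gap = trace["e2e_ms"] - sum(spans.values())
        spans["unattributed_gap"] = gap
        return spans

    old_spans, new_spans = with_gap(old), with_gap(new)
    stages = set(old_spans) | set(new_spans)
    deltas = {s: new_spans.get(s, 0) - old_spans.get(s, 0) for s in stages}
    # Largest slowdown first; ties broken by stage name for determinism.
    return sorted(deltas.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

If ASR, LLM, and TTS each move by 10 to 20 ms while the gap grows by several hundred, the next place to look is queueing, retries, and orchestration between stages.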

Prompt 3: Explain The Research-To-Production Handoff

A research team delivers a better ASR checkpoint. What artifacts do you require before production can safely canary it?

Hidden answer: handoff checklist

Require model hash, training data manifest, eval report by slice, decoding settings, feature extraction contract, tokenizer or vocabulary version, latency and memory profile, privacy review, known failure modes, compatible serving image, rollout plan, rollback target, dashboards, alerts, and owner contacts. The handoff should make the model reproducible, observable, and reversible.