Real-Time Speech System Design Lab

System Design

Start With The Spoken Turn Contract

A real-time speech product is more than a model. It is a timed contract across capture, VAD, ASR, policy, retrieval, LLM reasoning, TTS, playback, observability, and rollback. Strong answers make the contract explicit before debating model choices.

Interactive Path

Wake word, VAD, streaming ASR, endpointing, intent detection, retrieval, LLM/tool call, TTS first chunk, playback, and barge-in.

Question: Which metrics belong on the interactive path?

Time to first partial, partial rewrite rate, endpointing delay, first audio byte, complete spoken turn latency, cancellation success, barge-in recovery, task success, correction rate, and cost per successful turn. Final WER alone is too narrow.
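Several of these timings can be derived from per-turn event timestamps. The sketch below is illustrative only: the event names (`speech_start`, `endpoint_detected`, and so on) and the choice to measure the complete turn from end of user speech to playback start are assumptions, not definitions from this lab.

```python
def turn_metrics(events_ms):
    """Derive interactive-path timings from one turn's event timestamps (ms).

    Event names are hypothetical; adapt them to your own trace schema.
    """
    def gap(start, end):
        # Return None when either event is missing so telemetry gaps stay visible.
        if start not in events_ms or end not in events_ms:
            return None
        return events_ms[end] - events_ms[start]

    return {
        "time_to_first_partial_ms": gap("speech_start", "first_partial"),
        "endpointing_delay_ms": gap("speech_end", "endpoint_detected"),
        "first_audio_byte_ms": gap("endpoint_detected", "tts_first_byte"),
        "complete_turn_ms": gap("speech_end", "playback_start"),
    }


example_turn = {
    "speech_start": 0,
    "first_partial": 180,
    "speech_end": 2000,
    "endpoint_detected": 2300,
    "tts_first_byte": 2900,
    "playback_start": 2950,
}
report = turn_metrics(example_turn)
```

Returning None for a missing event, rather than zero, keeps an instrumentation gap from looking like a fast turn.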

Control Plane

Model registry, prompt registry, dataset manifests, eval reports, canary routing, tenant quotas, rollback targets, and incident state.

Question: What makes the control plane interview-worthy?

It shows how research changes become reversible production changes. A strong design records model hashes, feature contracts, prompt versions, eval slices, privacy policy, canary cohorts, SLOs, cost gates, and known-good rollback targets.
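One way to picture a reversible change is a registry that always remembers the release it replaced. This is a minimal sketch under assumptions: `Release`, `ReleaseRegistry`, and their fields are hypothetical names, and a real control plane would also persist state and record canary cohorts, SLOs, and cost gates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    # Hypothetical subset of the fields a release record might carry.
    model_hash: str
    prompt_version: str
    eval_report_id: str


class ReleaseRegistry:
    """Tracks the active release and the known-good rollback target."""

    def __init__(self, known_good):
        self.active = known_good
        self.rollback_target = known_good

    def promote(self, candidate):
        # The outgoing release becomes the rollback target before the swap.
        self.rollback_target = self.active
        self.active = candidate

    def rollback(self):
        # Restore the previous release; the target itself stays known-good.
        self.active = self.rollback_target
        return self.active


registry = ReleaseRegistry(Release("sha-a", "prompt-v1", "eval-001"))
registry.promote(Release("sha-b", "prompt-v2", "eval-002"))
restored = registry.rollback()
```

Freezing the record means a release cannot be edited after evaluation, so the hash that was evaluated is the hash that ships.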

Prompt: Design A Real-Time Meeting Assistant

Build a meeting assistant that streams captions, answers questions about the meeting, and produces a summary. It must support noisy rooms, private enterprise data, and rollback during business hours.

Hidden answer: strong architecture outline

Separate the live caption path from slower summarization and retrieval jobs. Use client audio capture, VAD, streaming ASR, stable-prefix rendering, speaker diarization if required, a consent-aware transcript store, retrieval over authorized meeting context, and background summarization. Gate releases on partial latency, final WER sliced by noise condition and by approved aggregate or consented accent/dialect tags, speaker attribution, retrieval grounding, privacy checks, and cost per meeting hour. Keep the previous ASR, diarization, prompt, and index versions available for rollback.

Production Debugging

Debug The Whole Turn, Not Just The Model

Many real-time regressions hide in orchestration, queueing, endpointing, caches, and client rendering. Practice asking for the smallest trace that separates model quality from system behavior.

Symptom: Good Final WER, Bad UX

Users say captions feel unstable. Offline final WER is unchanged.

Hidden answer: first checks

Check partial rewrite rate, stable-prefix policy, endpointing, VAD thresholds, chunk size, decoder beam settings, timestamp alignment, client debounce, network jitter, and slices by noise, language, device, and approved aggregate or consented accent/dialect tags. Mitigate with a stable prefix, endpointing rollback, or affected-slice routing.
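A stable-prefix mitigation can be sketched as "only show words that the most recent partials agree on". The function name and the `min_agreement` parameter below are assumptions for illustration; production renderers usually also avoid retracting words that were already committed, which this sketch does not model.

```python
def stable_prefix(partials, min_agreement=2):
    """Return the leading words shared by the last `min_agreement` partials."""
    if min_agreement < 1:
        raise ValueError("min_agreement must be at least 1")
    recent = [partial.split() for partial in partials[-min_agreement:]]
    if len(recent) < min_agreement:
        # Not enough history yet, so nothing is considered stable.
        return []

    stable = []
    # zip stops at the shortest partial, so only shared positions are compared.
    for words_at_position in zip(*recent):
        if any(word != words_at_position[0] for word in words_at_position):
            break
        stable.append(words_at_position[0])
    return stable


partials = ["turn on", "turn on the", "turn on the light", "turn on the lights"]
shown = stable_prefix(partials, min_agreement=2)
```

Raising `min_agreement` trades caption latency for stability, which is why it belongs next to partial rewrite rate on the dashboard.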

Symptom: TTS Starts Late

First audio byte regressed, but total synthesis time is flat.

Hidden answer: first checks

Inspect request queue time, model warm pool, prompt preprocessing, text normalization, voice cache hits, network handoff, chunk streaming, and admission control. A flat synthesis metric can hide queueing or orchestration delay before synthesis begins.
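To see where first-byte time went, split the span into stages and compare before and after. The timestamp field names below are hypothetical, and the sketch assumes the stages run strictly one after another within a single request.

```python
# (stage name, start timestamp key, end timestamp key); all keys are assumed names.
STAGES = (
    ("queue_ms", "request_received_ms", "synthesis_start_ms"),
    ("synthesis_to_first_chunk_ms", "synthesis_start_ms", "first_chunk_ready_ms"),
    ("handoff_ms", "first_chunk_ready_ms", "first_byte_sent_ms"),
)


def first_byte_breakdown(span):
    """Split time to first audio byte into queueing, synthesis, and handoff."""
    return {name: span[end] - span[start] for name, start, end in STAGES}


def largest_regression(before, after):
    """Return (stage, delta_ms) for the stage whose duration grew the most."""
    old = first_byte_breakdown(before)
    new = first_byte_breakdown(after)
    return max(((name, new[name] - old[name]) for name in old), key=lambda item: item[1])


before = {"request_received_ms": 0, "synthesis_start_ms": 20,
          "first_chunk_ready_ms": 140, "first_byte_sent_ms": 150}
after = {"request_received_ms": 0, "synthesis_start_ms": 220,
         "first_chunk_ready_ms": 340, "first_byte_sent_ms": 350}
worst = largest_regression(before, after)
```

In this made-up example the synthesis stage is unchanged at 120 ms while queueing grows by 200 ms, which matches the symptom of a flat synthesis metric with a late first byte.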

Symptom: Spoken RAG Hallucinates

ASR and retrieval dashboards look healthy, but spoken answers cite documents that do not support the answer.

Hidden answer: first checks

Audit ASR entity substitutions, query rewrite prompts, index freshness, ACL filtering, reranker version, top-k size, stale cache entries, answer grounding, citation selection, and TTS pronunciation of entities. Add evals that inject realistic ASR errors into retrieval queries.
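An eval that injects ASR errors can start very small. The sketch below is a toy under stated assumptions: the confusion map, the documents, and the keyword-overlap retriever are all invented for illustration and stand in for a real ASR error model and a real retrieval stack.

```python
def inject_asr_errors(query, confusions):
    """Swap words using a confusion map of plausible ASR substitutions."""
    return " ".join(confusions.get(word, word) for word in query.lower().split())


def keyword_retrieve(query, docs):
    """Toy retriever: return the id of the doc with the most word overlap."""
    query_words = set(query.lower().split())
    best_id, best_score = None, 0
    for doc_id, text in docs.items():
        score = len(query_words & set(text.lower().split()))
        if score > best_score:  # ties keep the earlier document
            best_id, best_score = doc_id, score
    return best_id


def recall_at_1(cases, docs, confusions=None):
    """Fraction of (query, expected_doc_id) cases where the top hit is correct."""
    if not cases:
        raise ValueError("cases must not be empty")
    hits = 0
    for query, expected in cases:
        if confusions:
            query = inject_asr_errors(query, confusions)
        hits += keyword_retrieve(query, docs) == expected
    return hits / len(cases)


docs = {
    "d0": "apex quarterly forecast",
    "d1": "acme quarterly forecast",
    "d2": "berlin hiring plan",
}
cases = [("acme forecast", "d1"), ("berlin plan", "d2")]
confusions = {"acme": "acne", "berlin": "burlin"}  # hypothetical entity substitutions

clean_recall = recall_at_1(cases, docs)
noisy_recall = recall_at_1(cases, docs, confusions)
```

Here the clean queries retrieve both documents, while the corrupted entity in the first query lets a near-duplicate document win. Healthy ASR and retrieval dashboards would not show that failure on their own.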

Symptom: Cost Spike Under Flat Traffic

Traffic is stable, task success is flat, and GPU spend is up.

Hidden answer: first checks

Check average context tokens, retrieved chunks, retries, shadow traffic, fallback model routes, cache hit rate, batch efficiency, autoscaler floor, warm replicas, eval jobs on live pools, and tenant mix. Report cost per successful spoken turn, not only aggregate spend.
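Cost per successful turn is easy to compute per tenant, which also exposes tenant-mix shifts. The row schema (`tenant`, `cost_usd`, `successful_turns`) is an assumption for this sketch, not a real billing export format.

```python
def cost_per_successful_turn_by_tenant(rows):
    """Aggregate cost and successes per tenant, then divide.

    Returns None for tenants with spend but no successful turns, so they
    stand out instead of raising a division error.
    """
    totals = {}
    for row in rows:
        cost, wins = totals.get(row["tenant"], (0.0, 0))
        totals[row["tenant"]] = (cost + row["cost_usd"], wins + row["successful_turns"])
    return {
        tenant: (cost / wins if wins else None)
        for tenant, (cost, wins) in totals.items()
    }


rows = [
    {"tenant": "a", "cost_usd": 10.0, "successful_turns": 100},
    {"tenant": "a", "cost_usd": 20.0, "successful_turns": 100},
    {"tenant": "b", "cost_usd": 5.0, "successful_turns": 0},
]
unit_costs = cost_per_successful_turn_by_tenant(rows)
```

A tenant that spends without succeeding is often the first lead when aggregate spend rises under flat traffic.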

Coding Lab

Utilities That Encode Operational Judgment

These are interview-sized functions. Before opening the answers, state the invariant, edge cases, and tests.

Drill 1: End-To-End Latency Budget

Given component p95 latencies and a product SLO, return whether the path has budget left and name the largest contributors.

Hidden answer: invariant, tests, and Python solution

Invariant: the sum of component p95 estimates is a planning approximation for budget ownership. It is not a true end-to-end percentile, a guaranteed upper bound, or a replacement for trace percentiles. Test an exact-budget case, empty components, and one dominant component, and confirm that input validation rejects negative, non-finite, non-numeric, and missing (None) values.

import math
from collections.abc import Mapping


def latency_budget_report(components_ms, slo_ms, top_k=3):
    """Sum component p95 estimates against an SLO and list the largest contributors."""

    if not isinstance(components_ms, Mapping):
        raise ValueError("components_ms must be a mapping of component names to latencies")
    if isinstance(slo_ms, bool) or not isinstance(slo_ms, (int, float)) or not math.isfinite(slo_ms) or slo_ms < 0:
        raise ValueError("slo_ms must be a finite non-negative number")
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    for name, value in components_ms.items():
        if isinstance(value, bool) or not isinstance(value, (int, float)) or not math.isfinite(value) or value < 0:
            raise ValueError(f"{name} latency must be a finite non-negative number")

    total = sum(components_ms.values())
    contributors = sorted(
        components_ms.items(),
        key=lambda item: item[1],
        reverse=True,
    )[:top_k]
    return {
        "total_ms": total,
        "slo_ms": slo_ms,
        "within_budget": total <= slo_ms,
        "remaining_ms": slo_ms - total,
        "top_contributors": contributors,
    }

Drill 2: Stable Prefix Churn

Given successive partial transcripts, compute how often already-shown words changed. Use this as a rough caption stability signal.

Hidden answer: invariant, common mistakes, and Python solution

Invariant: compare each new partial against the words that were already visible in the previous partial. Count substitutions and deletions as churn, but do not penalize append-only growth. Common mistakes include comparing full strings, counting appended words as churn, ignoring shortened partials, accepting a single transcript string as an iterable of characters, and allowing non-string partials to fail later with unclear errors.

from collections.abc import Iterable


def prefix_churn_rate(partials):
    """Return the fraction of previously shown words that changed or disappeared."""

    if isinstance(partials, (str, bytes)) or not isinstance(partials, Iterable):
        raise ValueError("partials must be an iterable of transcript strings")
    partials = list(partials)
    if any(not isinstance(partial, str) for partial in partials):
        raise ValueError("each partial transcript must be a string")

    comparisons = 0
    churns = 0

    for prev, curr in zip(partials, partials[1:]):
        prev_words = prev.split()
        curr_words = curr.split()
        if not prev_words:
            continue
        comparisons += len(prev_words)
        for i, prev_word in enumerate(prev_words):
            if i >= len(curr_words) or curr_words[i] != prev_word:
                churns += 1

    return 0.0 if comparisons == 0 else churns / comparisons

Drill 3: Canary Rollback Trigger

Decide whether to roll back a canary from aggregate slice metrics. Critical metrics should dominate average wins.

Hidden answer: strong reasoning and Python solution

A strong gate protects critical slices and privacy first. Average quality improvements cannot excuse a privacy error, approved critical-slice regression, tail-latency breach, or missing critical telemetry.

import math
from collections.abc import Mapping


def should_rollback_canary(metrics):
    """Return a rollback decision plus the reasons that triggered it."""
    if not isinstance(metrics, Mapping):
        raise ValueError("metrics must be a mapping of metric names to values")

    required = [
        "privacy_errors",
        "p99_turn_latency_delta_ms",
        "noisy_room_task_success_delta",
        "entity_recall_delta",
    ]
    reasons = []
    clean = {}
    for name in required:
        if name not in metrics:
            reasons.append(f"missing_{name}")
        elif isinstance(metrics[name], bool) or not isinstance(metrics[name], (int, float)) or not math.isfinite(metrics[name]):
            reasons.append(f"invalid_{name}")
        else:
            clean[name] = metrics[name]

    privacy_errors = clean.get("privacy_errors", 0)
    if privacy_errors < 0:
        reasons.append("invalid_privacy_errors")
    if privacy_errors > 0:
        reasons.append("privacy_errors")
    if clean.get("p99_turn_latency_delta_ms", 0) > 250:
        reasons.append("tail_latency")
    if clean.get("noisy_room_task_success_delta", 0) < -0.03:
        reasons.append("noisy_room_regression")
    if clean.get("entity_recall_delta", 0) < -0.02:
        reasons.append("entity_recall_regression")

    return {
        "rollback": bool(reasons),
        "reasons": reasons,
    }

Timed Exam

Advanced Practice Prompts

Prompt 1: Local Versus Cloud Speech-To-Speech

A privacy-sensitive assistant needs wake word, dictation, RAG over user documents, and natural TTS. Decide what runs on device, what runs on a private server, and what can use managed cloud services.

Hidden answer: strong decision outline

Keep wake word, VAD, privacy filters, and simple commands local when feasible. Use private server inference for heavy ASR, embeddings, retrieval, and LLM reasoning when user documents are involved. Use managed cloud only behind explicit consent, clear data boundaries, encryption, retention controls, and fallback behavior. Discuss latency, cost, model quality, offline mode, telemetry minimization, and rollback for each tier.

Prompt 2: Debug A Full-Turn Latency Regression

The product SLO is p95 complete spoken turn under 1.8 seconds. After a release, p95 is 2.4 seconds. ASR, LLM, and TTS team dashboards each say their component is healthy.

Hidden answer: strong incident outline

Build an end-to-end trace waterfall and compare old versus new by queue time, orchestration gaps, retries, retrieval, context length, tool calls, TTS first byte, playback buffering, cancellation, client version, region, tenant, and model route. Mitigate by rolling back the release, disabling expensive features, using smaller model tiers, reducing retrieval top-k, or routing noninteractive work away from live pools. Add a release gate for full-turn latency, not only component metrics.
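The waterfall comparison can be sketched by treating time not covered by any component span as its own line item. Everything here is assumed for illustration: the trace shape (`total_ms`, `spans_ms`), the example numbers, and the simplification that spans run sequentially and that each trace is representative of turns near p95.

```python
def waterfall_delta(old, new):
    """Rank per-span latency changes, including time no component span covers."""
    def with_gap(trace):
        spans = dict(trace["spans_ms"])
        # Time outside every component span: queueing, orchestration, retries.
        spans["unattributed_gap"] = trace["total_ms"] - sum(trace["spans_ms"].values())
        return spans

    old_spans, new_spans = with_gap(old), with_gap(new)
    names = set(old_spans) | set(new_spans)
    deltas = {name: new_spans.get(name, 0) - old_spans.get(name, 0) for name in names}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


old_trace = {"total_ms": 1700, "spans_ms": {"asr": 400, "llm": 700, "tts": 300}}
new_trace = {"total_ms": 2400, "spans_ms": {"asr": 410, "llm": 710, "tts": 300}}
ranked = waterfall_delta(old_trace, new_trace)
top_name, top_delta = ranked[0]
```

In this invented pair of traces each component moves by at most 10 ms, yet the unattributed gap grows by 680 ms. That is the pattern to look for when every team dashboard looks healthy and the full turn does not.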

Prompt 3: Explain The Research-To-Production Handoff

A research team delivers a better ASR checkpoint. What artifacts do you require before production can safely canary it?

Hidden answer: handoff checklist

Require model hash, training data manifest, eval report by slice, decoding settings, feature extraction contract, tokenizer or vocabulary version, latency and memory profile, privacy review, known failure modes, compatible serving image, rollout plan, rollback target, dashboards, alerts, and owner contacts. The handoff should make the model reproducible, observable, and reversible.
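The checklist can be enforced mechanically before a canary is allowed to start. The artifact keys below are hypothetical names derived from the list above, and the check only tests for presence, not for the quality of each artifact.

```python
# Hypothetical keys, one per artifact named in the checklist.
REQUIRED_ARTIFACTS = frozenset({
    "model_hash", "training_data_manifest", "eval_report_by_slice",
    "decoding_settings", "feature_extraction_contract", "tokenizer_version",
    "latency_memory_profile", "privacy_review", "known_failure_modes",
    "serving_image", "rollout_plan", "rollback_target",
    "dashboards", "alerts", "owner_contacts",
})


def missing_artifacts(handoff):
    """Return the sorted names of required artifacts that are absent or empty."""
    return sorted(name for name in REQUIRED_ARTIFACTS if not handoff.get(name))


handoff = {name: "provided" for name in REQUIRED_ARTIFACTS}
handoff["privacy_review"] = ""   # present but empty counts as missing
del handoff["rollback_target"]   # absent entirely
blockers = missing_artifacts(handoff)
```

A canary gate could refuse to route traffic while `blockers` is non-empty, which keeps the handoff reversible by construction.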