Production Practice

Speech AI Load Testing And Chaos Readiness

Learn how ML engineers prove that ASR, TTS, retrieval, LLM, and speech-to-speech systems will keep working when traffic, latency, dependency failures, and cost pressure arrive at the same time.

Operating Model

Test The Whole Spoken Turn, Not Just One Model

A speech product can pass unit tests and offline evals while failing live because the real bottleneck is queueing, turn-taking, GPU memory, cache misses, retrieval fanout, TTS warmup, client playback, or a bad retry policy. A strong load test validates quality, latency, cost, and graceful degradation together.

  1. Define the user promise: first partial, final transcript, first audio byte, barge-in, and task success targets.
  2. Model traffic: concurrency, utterance duration, silence, language mix, noisy devices, tenant skew, and burst shape.
  3. Protect privacy: use synthetic or approved aggregate fixtures, never private recordings or raw transcripts.
  4. Measure end to end: include client, network, VAD, ASR, retrieval, LLM, TTS, and playback timestamps.
  5. Force failure: inject slow dependencies, unavailable regions, cold starts, model rollback, quota pressure, and retry storms.

Question: Why is offline WER insufficient for launch readiness?

WER measures only transcript accuracy on a fixed evaluation set. It misses tail latency, partial transcript churn, endpointing errors, GPU queue saturation, retry storms, cost spikes, stale retrieval, TTS first-audio-byte latency, and user-visible recovery behavior. Launch readiness needs offline quality plus live-system stress signals.
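The end-to-end measurement in step 4 of the list above can be sketched as a per-turn timestamp breakdown. This is a minimal illustration, not a prescribed schema: the event names (`client_audio_end`, `asr_final`, and so on) and the `stage_breakdown` helper are hypothetical, and timestamps are assumed to be milliseconds on a shared clock.

```python
# Hypothetical event names; a real pipeline would emit its own.
STAGES = [
    ("asr", "client_audio_end", "asr_final"),
    ("retrieval", "asr_final", "retrieval_done"),
    ("llm_first_token", "retrieval_done", "llm_first_token"),
    ("tts_first_byte", "llm_first_token", "tts_first_byte"),
    ("playback_start", "tts_first_byte", "playback_start"),
]


def stage_breakdown(turn):
    """Return per-stage latency in ms and the end-to-end total for one turn."""
    out = {}
    for stage, start, end in STAGES:
        if start in turn and end in turn:
            out[stage] = turn[end] - turn[start]
        else:
            out[stage] = None  # a missing timestamp is reported, not guessed
    if "client_audio_end" in turn and "playback_start" in turn:
        out["end_to_end"] = turn["playback_start"] - turn["client_audio_end"]
    else:
        out["end_to_end"] = None
    return out


# Synthetic turn: all values are made up for illustration.
turn = {
    "client_audio_end": 0,
    "asr_final": 180,
    "retrieval_done": 260,
    "llm_first_token": 520,
    "tts_first_byte": 700,
    "playback_start": 760,
}
breakdown = stage_breakdown(turn)
```

Reporting each stage separately makes it possible to see which component owns a tail-latency regression, which a single end-to-end number cannot show.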

Load Testing

Design Tests Around Invariants

Good load tests are not random traffic floods. They encode the system invariants that must stay true while traffic changes.

Streaming ASR Invariant

Interactive traffic must not wait behind batch transcription, and first partial p95 must stay below the product target.

Hidden answer: useful test shape

Run mixed traffic with short live calls, long batch files, noisy mobile audio, and tenant-specific vocabulary updates. Verify priority queues, per-stream state limits, GPU queue depth, endpointing stability, and fallback to a smaller model when the interactive queue burns error budget.
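One way to picture the priority-queue check is a toy scheduler in which interactive jobs always dequeue before batch jobs. This is a sketch under simplifying assumptions: `PriorityScheduler` and the two priority constants are hypothetical names, and a production scheduler would also need per-stream state limits and batch starvation protection.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1  # lower number is served first


class PriorityScheduler:
    """Strict-priority queue: interactive work always dequeues before batch."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO tie-breaker within a class

    def submit(self, job_id, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), job_id))

    def next_job(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


sched = PriorityScheduler()
sched.submit("batch-1", BATCH)
sched.submit("live-1", INTERACTIVE)
sched.submit("batch-2", BATCH)
sched.submit("live-2", INTERACTIVE)
order = [sched.next_job() for _ in range(4)]
```

A load test can assert the same property on the real system: under mixed traffic, no interactive request should be observed waiting behind a batch request that arrived earlier.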

TTS Invariant

First audio byte must stay low even when a new voice, long prompt, or cold region enters the serving path.

Hidden answer: useful test shape

Test short confirmations, long policy explanations, multilingual text normalization, warm-pool depletion, cache misses, and vocoder placement. Separate first audio byte, full synthesis time, and playback underruns so mitigation is precise.
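Separating the three TTS signals can be sketched from chunk arrival times. The function below is an illustrative model only: `tts_timing` is a hypothetical helper, chunks are assumed to be `(arrival_ms, audio_ms)` pairs ordered by arrival, and playback is assumed to start the moment the first chunk arrives with no jitter buffer.

```python
def tts_timing(chunks):
    """Split one TTS response into first audio byte, full synthesis, and underruns.

    chunks: list of (arrival_ms, audio_ms), times relative to the request start.
    """
    if not chunks:
        return {"first_audio_byte_ms": None, "full_synthesis_ms": None, "underruns": 0}

    first_arrival = chunks[0][0]
    buffered_until = first_arrival  # time at which already-received audio runs out
    underruns = 0
    for arrival_ms, audio_ms in chunks:
        if arrival_ms > buffered_until:
            # The player ran dry before this chunk arrived.
            underruns += 1
            buffered_until = arrival_ms
        buffered_until += audio_ms

    return {
        "first_audio_byte_ms": first_arrival,
        "full_synthesis_ms": chunks[-1][0],
        "underruns": underruns,
    }


# Synthetic example: the third chunk arrives after the buffer is exhausted.
timing = tts_timing([(120, 200), (250, 200), (600, 200)])
```

Keeping the three numbers apart matters because they point at different mitigations: warm pools for first audio byte, model or vocoder placement for full synthesis time, and chunk pacing for underruns.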

Spoken RAG Invariant

Retrieval freshness, ACLs, and grounded answers must hold under index refresh and ASR entity errors.

Hidden answer: useful test shape

Replay synthetic noisy queries across tenants while rotating index versions. Track retrieval recall, stale-document rate, forbidden document exposure, answer grounding, LLM token cost, and handoff rate. Fail closed when ACL or freshness checks are uncertain.
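The fail-closed rule can be sketched as a document filter that blocks anything whose ACL or index version is unknown. The field names (`allowed_tenants`, `index_version`) and the `filter_documents` helper are assumptions made for this example, not a real retrieval API.

```python
def filter_documents(docs, tenant_id, min_index_version):
    """Return (allowed_ids, blocked) where blocked holds (doc_id, reason) pairs."""
    allowed = []
    blocked = []
    for doc in docs:
        acl = doc.get("allowed_tenants")    # None means the ACL is unknown
        version = doc.get("index_version")  # None means freshness is unknown
        if acl is None or version is None:
            blocked.append((doc["id"], "uncertain"))  # fail closed
        elif tenant_id not in acl:
            blocked.append((doc["id"], "forbidden"))
        elif version < min_index_version:
            blocked.append((doc["id"], "stale"))
        else:
            allowed.append(doc["id"])
    return allowed, blocked


# Synthetic fixtures only.
docs = [
    {"id": "d1", "allowed_tenants": {"acme"}, "index_version": 7},
    {"id": "d2", "allowed_tenants": {"globex"}, "index_version": 7},
    {"id": "d3", "allowed_tenants": {"acme"}, "index_version": 5},
    {"id": "d4", "index_version": 7},
]
allowed, blocked = filter_documents(docs, tenant_id="acme", min_index_version=7)
```

During an index rotation test, the blocked reasons give the stale-document rate and forbidden-exposure count directly, and any nonzero exposure in the allowed list is a hard failure.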

Cost Invariant

Cost per successful turn must not exceed the launch budget without a measured quality gain.

Hidden answer: useful test shape

Stress long conversations, retrieval fanout, disabled cache, retries, high-temperature regeneration, and fallback chains. Watch tokens per turn, GPU seconds, cache hit rate, successful task completion, and the point where a cheaper route should take over.
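Cost per successful turn divides all spend, including spend on failed turns, by the number of successes. The sketch below assumes a simple two-part price model; `cost_per_successful_turn`, the field names, and the unit prices are illustrative placeholders.

```python
def cost_per_successful_turn(turns, gpu_second_usd, token_usd):
    """Total spend across all turns divided by the count of successful turns."""
    total_cost = sum(
        t["gpu_seconds"] * gpu_second_usd + t["tokens"] * token_usd for t in turns
    )
    successes = sum(1 for t in turns if t["success"])
    if successes == 0:
        return None  # undefined: money was spent but nothing succeeded
    return total_cost / successes


# Synthetic aggregate: the failed turn still counts toward cost.
turns = [
    {"gpu_seconds": 2, "tokens": 1000, "success": True},
    {"gpu_seconds": 2, "tokens": 1000, "success": True},
    {"gpu_seconds": 4, "tokens": 2000, "success": False},
]
cost = cost_per_successful_turn(turns, gpu_second_usd=0.001, token_usd=0.000002)
```

Because failed turns stay in the numerator, retries and fallback chains raise this metric even when the success count does not change, which is the behavior the stress test is meant to expose.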

Chaos Readiness

Failure Drills For Speech Systems

Drill 1: GPU Region Loses Half Its Capacity

During peak call volume, one serving region loses half of its GPU workers. Batch ASR and interactive voice-agent calls share the same cluster.

Hidden answer: strong recovery plan

Preserve interactive traffic first. Shed or defer batch work, lower max batch delay for live calls, route overflow to a warm fallback region, and downgrade noncritical paths to smaller models. Watch p95 first partial, queue depth, retry rate, GPU memory headroom, tenant error budgets, and cost. The prevention item is hard isolation or strict priority plus a load test proving it.
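The ordering in this recovery plan (interactive first, then fallback region, then batch with whatever is left) can be expressed as a small allocation function. This is a deliberately simplified sketch: `plan_capacity_response` is a hypothetical helper, and all quantities are assumed to share one unit such as concurrent streams.

```python
def plan_capacity_response(capacity, interactive_demand, batch_demand, fallback_capacity):
    """Allocate remaining capacity with interactive traffic strictly first."""
    interactive_local = min(interactive_demand, capacity)
    overflow = interactive_demand - interactive_local
    interactive_fallback = min(overflow, fallback_capacity)
    interactive_shed = overflow - interactive_fallback

    # Batch only receives what interactive traffic did not need.
    batch_served = min(batch_demand, capacity - interactive_local)
    batch_deferred = batch_demand - batch_served

    return {
        "interactive_local": interactive_local,
        "interactive_fallback": interactive_fallback,
        "interactive_shed": interactive_shed,
        "batch_served": batch_served,
        "batch_deferred": batch_deferred,
    }


# Synthetic drill: a 100-stream region drops to 50 during peak.
plan = plan_capacity_response(
    capacity=50, interactive_demand=60, batch_demand=30, fallback_capacity=8
)
```

A nonzero `interactive_shed` in a drill like this signals that the warm fallback region is undersized for the failure being rehearsed.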

Drill 2: TTS Cache Is Accidentally Bypassed

A release changes text normalization. Cache hit rate falls from 62 percent to 7 percent, first audio byte doubles, and cost rises.

Hidden answer: diagnosis and rollback trigger

Compare normalized text keys, voice IDs, locale, punctuation, segmentation, and cache namespaces between old and new releases. Roll back if first audio byte or cost-per-success crosses the gate. Prevention: add a CI fixture that snapshots normalization keys for common utterances and a canary gate for cache hit rate.
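The CI fixture idea can be sketched by hashing the normalized text together with voice and locale, then comparing keys across the old and new normalizers. Everything here is illustrative: `cache_key`, `changed_keys`, and both normalizer functions are hypothetical stand-ins for whatever the real release uses.

```python
import hashlib


def cache_key(text, voice_id, locale, normalize):
    """Build a cache key from normalized text, voice, and locale."""
    raw = "\x1f".join([normalize(text), voice_id, locale])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def changed_keys(utterances, voice_id, locale, old_normalize, new_normalize):
    """Return the utterances whose cache key differs between two normalizers."""
    return [
        u for u in utterances
        if cache_key(u, voice_id, locale, old_normalize)
        != cache_key(u, voice_id, locale, new_normalize)
    ]


def old_normalize(text):
    return " ".join(text.lower().split())


def new_normalize(text):
    # Hypothetical release change: commas are split into their own token.
    return " ".join(text.lower().replace(",", " ,").split())


utterances = ["Your order has shipped.", "Sure, one moment."]
drifted = changed_keys(utterances, "voice-a", "en-US", old_normalize, new_normalize)
drift_rate = len(drifted) / len(utterances)
```

Running this over a fixture of common utterances before release gives an early estimate of how much of the cache a normalization change would invalidate.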

Drill 3: Retrieval Dependency Adds 800 ms

A vector database upgrade increases p95 retrieval latency. ASR and TTS are healthy, but voice-agent turns feel slow.

Hidden answer: graceful degradation

Use retrieval timeouts, smaller top-k, cached tenant policy snippets, answer templates for safe intents, or human handoff for high-risk intents. Do not silently answer from stale documents unless the product explicitly allows it. Add dependency-latency chaos tests and grounded-answer checks under timeout behavior.
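A retrieval timeout with an explicit, labeled fallback might look like the following sketch. It uses `asyncio.wait_for` from the standard library; the `retrieve_with_timeout` wrapper, the `allow_cached` flag, and the demo retrievers are assumptions made for illustration.

```python
import asyncio


async def retrieve_with_timeout(retrieve, query, timeout_s, cached_snippets, allow_cached):
    """Try live retrieval; on timeout use cached snippets only if the product allows it."""
    try:
        docs = await asyncio.wait_for(retrieve(query), timeout=timeout_s)
        return {"source": "live", "docs": docs}
    except asyncio.TimeoutError:
        if allow_cached and cached_snippets:
            # The source is labeled so downstream code never treats it as fresh.
            return {"source": "cache", "docs": cached_snippets}
        return {"source": "handoff", "docs": []}


async def fast_retriever(query):
    return ["doc-1"]


async def slow_retriever(query):
    await asyncio.sleep(0.2)  # simulated dependency slowdown
    return ["doc-1"]


live_result = asyncio.run(
    retrieve_with_timeout(fast_retriever, "refund policy", 0.05, ["policy"], True)
)
cached_result = asyncio.run(
    retrieve_with_timeout(slow_retriever, "refund policy", 0.05, ["policy"], True)
)
handoff_result = asyncio.run(
    retrieve_with_timeout(slow_retriever, "refund policy", 0.05, ["policy"], False)
)
```

Returning the `source` alongside the documents lets a chaos test assert that grounded-answer checks still run on the timeout path and that cached content is never used where the product forbids it.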

Coding Labs

Encode Readiness Gates In Python

These labs use synthetic aggregate metrics. They are safe to commit and useful in interviews because they turn operational judgment into clear, testable rules.

Lab 1: Load-Test Gate

Given aggregate load-test metrics by route, decide whether the release can ship. Each route has p95 latency, error rate, cost per successful turn, and task success delta versus baseline.

Hidden answer: invariant, tests, and Python solution

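Invariant: a route passes only if user experience, reliability, and cost are all inside budget; a cost overrun is tolerated only when task success improves by at least the configured minimum gain. Test a clean pass, a latency failure, a cost overrun with and without a quality gain, and missing route metrics.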

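def evaluate_load_gate(routes, budgets):
    """Return a ship decision plus a list of (route, reason, value) failures."""
    failures = []
    # Iterate over budgets so a route with no metrics fails instead of being skipped.
    for name, budget in budgets.items():
        metrics = routes.get(name)
        if metrics is None:
            failures.append((name, "missing_metrics", None))
            continue
        if metrics["p95_ms"] > budget["p95_ms"]:
            failures.append((name, "latency", metrics["p95_ms"]))
        if metrics["error_rate"] > budget["error_rate"]:
            failures.append((name, "errors", metrics["error_rate"]))

        # A cost overrun is acceptable only with a measured quality gain.
        cost_over = metrics["cost_per_success"] > budget["cost_per_success"]
        quality_gain = metrics["task_success_delta"] >= budget.get("min_gain_for_cost_overrun", 0.02)
        if cost_over and not quality_gain:
            failures.append((name, "cost_without_quality_gain", metrics["cost_per_success"]))

    # A route measured without a budget cannot be judged, so fail closed.
    for name in routes:
        if name not in budgets:
            failures.append((name, "missing_budget", None))

    return {"ship": not failures, "failures": failures}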

Lab 2: Retry Storm Detector

Detect when client or service retries are amplifying a dependency failure during a voice-agent incident.

Hidden answer: common mistakes and Python solution

Common mistakes are alerting on request count alone, ignoring the baseline, and missing the dependency error signal. Retry storms show more attempts per original user turn plus rising dependency errors.

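def retry_storm_windows(points, attempt_ratio_threshold=1.8, dependency_error_threshold=0.05):
    """Flag minutes where retry amplification coincides with dependency errors."""
    alerts = []
    for point in points:
        # Guard against division by zero in minutes with no user turns.
        user_turns = max(point["user_turns"], 1)
        # Attempts per original user turn.
        attempt_ratio = point["service_attempts"] / user_turns
        amplified = attempt_ratio >= attempt_ratio_threshold
        dependency_failing = point["dependency_error_rate"] >= dependency_error_threshold
        # Require both signals so a plain traffic spike does not alert.
        if amplified and dependency_failing:
            alerts.append({
                "minute": point["minute"],
                "attempt_ratio": round(attempt_ratio, 2),
                "dependency_error_rate": point["dependency_error_rate"],
                "action": "cap_retries_and_enable_fallback",
            })
    return alerts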
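To address the "ignoring the baseline" mistake, one option is to compare each window against the attempt ratio seen in healthy traffic instead of a fixed constant. The sketch below is an assumption about how that could be done: `baseline_attempt_ratio`, `is_amplified`, and the 1.5 multiplier are hypothetical and would need tuning against real aggregates.

```python
from statistics import median


def attempt_ratio(point):
    """Service attempts per original user turn, guarded against empty minutes."""
    return point["service_attempts"] / max(point["user_turns"], 1)


def baseline_attempt_ratio(healthy_points):
    """Median attempt ratio over windows known to be healthy."""
    ratios = [attempt_ratio(p) for p in healthy_points]
    return median(ratios) if ratios else None


def is_amplified(point, baseline_ratio, multiplier=1.5):
    """True when the window's attempt ratio exceeds the baseline by the multiplier."""
    return attempt_ratio(point) >= baseline_ratio * multiplier


# Synthetic aggregates only.
healthy = [
    {"user_turns": 100, "service_attempts": 110},
    {"user_turns": 100, "service_attempts": 120},
    {"user_turns": 100, "service_attempts": 100},
]
baseline = baseline_attempt_ratio(healthy)
storm = is_amplified({"user_turns": 100, "service_attempts": 200}, baseline)
normal = is_amplified({"user_turns": 100, "service_attempts": 130}, baseline)
```

A relative threshold adapts to routes that legitimately make several service attempts per turn, where a fixed ratio would either alert constantly or never.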

Lab 3: Capacity Step-Load Summary

Summarize a step-load test and find the first concurrency level where the system violates its latency or error budget.

Hidden answer: invariant and Python solution

Invariant: the supported capacity is the last passing step before the first violation. Stopping at the first break avoids claiming capacity from a single lucky point after overload has started.

def first_capacity_break(steps, p95_budget_ms, error_budget):
    """Find the first failing step; steps must be ordered by increasing concurrency."""
    last_passing = None
    for step in steps:
        passes = step["p95_ms"] <= p95_budget_ms and step["error_rate"] <= error_budget
        if passes:
            last_passing = step["concurrency"]
            continue
        # Stop at the first violation so later passing steps cannot raise the claim.
        return {
            "max_supported_concurrency": last_passing,
            "first_failing_concurrency": step["concurrency"],
            "p95_ms": step["p95_ms"],
            "error_rate": step["error_rate"],
        }
    # No step violated a budget, so the break point lies beyond the tested range.
    return {"max_supported_concurrency": last_passing, "first_failing_concurrency": None}
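If a single noisy step should not count as a break, a variant can require several consecutive failing steps before declaring one. This is a sketch of that trade-off, not a replacement for the solution above: `first_sustained_break` and its `min_consecutive` parameter are hypothetical, and tolerating isolated failures does risk accepting a lucky pass after overload has begun.

```python
def first_sustained_break(steps, p95_budget_ms, error_budget, min_consecutive=2):
    """Report a break only after min_consecutive failing steps in a row.

    Steps are assumed to be ordered by increasing concurrency.
    """
    last_passing = None
    streak = 0
    streak_start = None
    for step in steps:
        passes = step["p95_ms"] <= p95_budget_ms and step["error_rate"] <= error_budget
        if passes:
            last_passing = step["concurrency"]
            streak, streak_start = 0, None  # an isolated failure is forgiven
            continue
        if streak == 0:
            streak_start = step["concurrency"]
        streak += 1
        if streak >= min_consecutive:
            return {
                "max_supported_concurrency": last_passing,
                "first_failing_concurrency": streak_start,
            }
    # No sustained violation was observed within the tested range.
    return {"max_supported_concurrency": last_passing, "first_failing_concurrency": None}


# Synthetic step-load results: one blip at 100, sustained failure from 200.
steps = [
    {"concurrency": 50, "p95_ms": 400, "error_rate": 0.001},
    {"concurrency": 100, "p95_ms": 900, "error_rate": 0.001},
    {"concurrency": 150, "p95_ms": 600, "error_rate": 0.002},
    {"concurrency": 200, "p95_ms": 1200, "error_rate": 0.030},
    {"concurrency": 250, "p95_ms": 2000, "error_rate": 0.080},
]
result = first_sustained_break(steps, p95_budget_ms=800, error_budget=0.01)
```

Setting `min_consecutive=1` reduces this to stopping at the first failing step, which is the more conservative choice when each step already aggregates a long hold.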

Interview And Exam Prompts

Practice Strong Answers Under Constraint

Prompt 1: Load-Test A Speech-To-Speech Agent Before Launch

You have two weeks before launch. Design the load test, quality gates, rollback triggers, and dashboards for a speech-to-speech support agent serving enterprise tenants.

Hidden answer: strong outline

Cover traffic modeling, tenant skew, approved synthetic fixtures, ASR partial latency, LLM first token, retrieval latency, TTS first audio byte, barge-in, task success, groundedness, safety, cost per success, and rollback. Include real-time versus batch isolation, priority queues, fallback model routes, per-tenant dashboards, and a go/no-go meeting with explicit launch gates.

Prompt 2: Explain The Difference Between Load Testing And Chaos Testing

Give a concise strong answer, then map both to ASR, TTS, and spoken RAG.

Hidden answer: concise distinction

Load testing asks whether the system meets promises under expected and peak traffic. Chaos testing asks whether it degrades safely when parts fail. For ASR, load tests stress concurrency and chunking; chaos tests remove GPU capacity or slow VAD. For TTS, load tests stress voice pools and cache; chaos tests bypass cache or cold-start regions. For spoken RAG, load tests stress retrieval fanout; chaos tests add stale indexes, slow dependencies, and ACL uncertainty.