Speech Inference Platform Readiness

Architecture Review

Separate Real-Time Paths From Batch Paths

A production speech platform usually serves several workloads at once: wake word, VAD, streaming ASR, turn classification, LLM reasoning, retrieval, TTS, evaluation jobs, and offline transcription. Strong design starts by separating user-visible latency paths from throughput paths.

Real-Time Path

Optimize time to first partial transcript, time to first spoken response, queue age, jitter, cancellation, barge-in handling, and fallback behavior.

Question: What should be isolated from real-time capacity?

Offline transcription, eval backfills, embedding generation, batch redaction, shadow experiments, and heavy model sweeps. These jobs can consume GPU memory and scheduler attention without improving live user experience.
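
One way to make this isolation concrete is a routing table that maps each workload class to a capacity pool. The sketch below is illustrative only: the workload names and the pool labels `realtime` and `batch` are assumptions, not the interface of any particular scheduler.

```python
# Hypothetical workload classes; the names are assumptions for illustration.
REALTIME_WORKLOADS = {"wake_word", "vad", "streaming_asr", "turn_classification", "tts"}
BATCH_WORKLOADS = {
    "offline_transcription",
    "eval_backfill",
    "embedding_generation",
    "batch_redaction",
    "shadow_experiment",
    "model_sweep",
}


def assign_pool(workload):
    """Return the capacity pool for a workload class, failing closed on unknowns."""
    if workload in REALTIME_WORKLOADS:
        return "realtime"
    if workload in BATCH_WORKLOADS:
        return "batch"
    # Unclassified work is never admitted to live capacity by default.
    raise ValueError(f"unclassified workload: {workload}")
```

Rejecting unknown workload classes, rather than defaulting them to the live pool, keeps new job types from silently competing with user-visible traffic.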

Batch Path

Optimize accelerator utilization, idempotent writes, checkpointing, retry safety, lineage, and reproducible model or prompt versions.

Question: What makes a batch speech job production-ready?

It records input manifest versions, model hashes, decoding settings, prompt versions, redaction policy, output schema, checkpoint cursor, retry count, and aggregate quality metrics. It should never require committing private audio or transcripts.
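
The record described above could be sketched as a metadata-only dataclass. The class name `BatchRunRecord`, the field names, and the `idempotency_key` helper are assumptions made for this example; the point is that the record carries versions, hashes, and aggregates, never audio or transcript text.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BatchRunRecord:
    """Metadata-only record of a batch speech job; holds no audio or transcripts."""

    input_manifest_version: str
    model_hash: str
    decoding_settings: dict
    prompt_version: str
    redaction_policy: str
    output_schema: str
    checkpoint_cursor: int = 0
    retry_count: int = 0
    aggregate_metrics: dict = field(default_factory=dict)

    def idempotency_key(self):
        """Hash only the inputs that define the work, so retries map to the same key."""
        payload = json.dumps(
            {
                "manifest": self.input_manifest_version,
                "model": self.model_hash,
                "decoding": self.decoding_settings,
                "prompt": self.prompt_version,
                "redaction": self.redaction_policy,
                "schema": self.output_schema,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the checkpoint cursor and retry count are excluded from the key, a resumed or retried run writes under the same key as the original attempt, which is one way to keep writes idempotent.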

Launch Gates

Readiness Means Explicit Stop Conditions

Gate 1: Streaming ASR Canary

A new ASR model improves average WER by 4 percent relative, but noisy far-field traffic regresses by 8 percent relative. p95 first partial transcript latency is unchanged. Should you launch?

Hidden answer: strong launch decision

Do not launch broadly. A strong answer protects the regressed slice, checks whether far-field traffic is strategically important, audits sample size and label quality, and proposes a constrained canary or model routing rule only if the affected slice can be excluded. The launch gate should include slice-level WER, entity recall, latency, privacy logging checks, and rollback triggers.
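
A short worked example shows how the aggregate can improve while a slice regresses. The error counts below are synthetic and chosen to reproduce the scenario's numbers; the sketch assumes both arms are scored on the same evaluation set, so word counts cancel out of the relative delta.

```python
def relative_wer_delta(old_errors, new_errors):
    """Relative change in WER when both arms share the same word count."""
    if old_errors <= 0:
        raise ValueError("old_errors must be positive")
    return (new_errors - old_errors) / old_errors


# Synthetic error counts per slice as (old_errors, new_errors).
slices = {
    "near_field": (4500, 4140),
    "far_field": (1500, 1620),
}

per_slice = {name: relative_wer_delta(old, new) for name, (old, new) in slices.items()}
overall = relative_wer_delta(
    sum(old for old, _ in slices.values()),
    sum(new for _, new in slices.values()),
)
```

Here near-field errors fall by 8 percent relative, far-field errors rise by 8 percent relative, and the overall figure still improves by 4 percent relative because near-field contributes most of the errors. The average alone hides the regressed slice.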

Gate 2: TTS Voice Update

A new TTS voice has better MOS in human review, but first-audio-byte p99 is 700 ms slower and GPU cost per minute is 35 percent higher. Product wants to ship before a demo.

Hidden answer: cost and latency tradeoff

Ship only with an explicit product decision, and only to a narrow audience, if the demo value justifies it. Keep the old voice as the fallback, enforce p99 and cost alerts, pre-warm where needed, and define a rollback threshold. A strong answer separates subjective quality wins from reliability and unit-economics risks.
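
A rollback threshold is easier to enforce when it is written down as a check. The following is a minimal sketch; the function name `tts_voice_gate` and both default thresholds are placeholders for illustration, not recommended values.

```python
def tts_voice_gate(
    p99_first_audio_delta_ms,
    cost_per_minute_relative_delta,
    max_p99_delta_ms=200,
    max_cost_delta=0.15,
):
    """Decide which voice serves by default; thresholds are placeholder assumptions."""
    reasons = []
    if p99_first_audio_delta_ms > max_p99_delta_ms:
        reasons.append("p99_first_audio")
    if cost_per_minute_relative_delta > max_cost_delta:
        reasons.append("unit_cost")
    if reasons:
        # The new voice stays available only to an explicitly chosen audience.
        return {"default_voice": "old", "new_voice": "opt_in_only", "reasons": reasons}
    return {"default_voice": "new", "new_voice": "default", "reasons": []}
```

With the scenario's numbers (700 ms slower at p99 and 35 percent higher cost), both checks fail, so the old voice remains the default and the new voice is limited to the opt-in audience.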

Gate 3: Speech RAG Policy Change

Retrieval recall improved, but spoken answers now cite stale documents more often because the top-k includes older policy pages.

Hidden answer: quality-system fix

Add freshness-aware ranking, metadata filters, stale-source evaluation slices, and answer-level citation checks. Launch gates should measure retrieval recall, source freshness, groundedness, refusal behavior, ASR-noise robustness, and user-visible latency.
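
Freshness-aware ranking with a metadata filter might look like the sketch below. The exponential half-life decay, the `superseded` flag, and the candidate field names are assumptions for this example; a real system would choose its own scoring function and metadata.

```python
def freshness_rerank(candidates, half_life_days=180.0, top_k=3):
    """Rerank by relevance decayed by document age, dropping superseded pages."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    scored = []
    for doc in candidates:
        if doc.get("superseded", False):
            continue  # metadata filter: never cite a page marked as replaced
        if doc["age_days"] < 0:
            raise ValueError("age_days must be non-negative")
        decay = 0.5 ** (doc["age_days"] / half_life_days)
        scored.append((doc["relevance"] * decay, doc["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


candidates = [
    {"id": "policy_v1", "relevance": 0.95, "age_days": 720},
    {"id": "policy_v2", "relevance": 0.90, "age_days": 30},
    {"id": "policy_v0", "relevance": 0.99, "age_days": 10, "superseded": True},
]
ranked = freshness_rerank(candidates)
```

In this synthetic example the two-year-old page has slightly higher raw relevance but ranks below the recent page after decay, and the superseded page is removed outright regardless of its score.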

Incident Practice

First-Hour Debugging Drills

Partial Transcript Churn

Users complain that live captions keep changing. Final WER is stable, but partial text feels unusable.

Hidden answer: what to inspect

Inspect partial rewrite rate, endpointing thresholds, VAD sensitivity, beam stability, token timestamps, noise slices, client debounce behavior, and recent decoder changes. Mitigate by stabilizing committed prefixes, tuning endpointing, or rolling back the streaming decoder.
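
Partial rewrite rate and a committed prefix can both be computed from the sequence of partial hypotheses. This is a sketch under one possible definition (tokens shown in one partial that the next partial changes or removes); the function names and the `stable_updates` parameter are assumptions.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two hypotheses."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def partial_rewrite_rate(partials):
    """Fraction of previously shown tokens that the next partial changed or removed."""
    shown = 0
    rewritten = 0
    for prev, curr in zip(partials, partials[1:]):
        shown += len(prev)
        rewritten += len(prev) - common_prefix_len(prev, curr)
    return rewritten / shown if shown else 0.0


def committed_prefix(partials, stable_updates=2):
    """Tokens unchanged across the last `stable_updates` partials."""
    if stable_updates < 1:
        raise ValueError("stable_updates must be at least 1")
    recent = partials[-stable_updates:]
    if len(recent) < stable_updates:
        return []
    prefix = list(recent[0])
    for hyp in recent[1:]:
        prefix = prefix[: common_prefix_len(prefix, hyp)]
    return prefix
```

Displaying only the committed prefix as settled text, and rendering the remainder as tentative, is one way to reduce visible churn without changing final WER.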

GPU Cost Spike

Daily spend doubled with no obvious traffic growth. Quality metrics are flat and user-visible latency is slightly better.

Hidden answer: what to inspect

Check batch size, utilization, autoscaler floor, shadow traffic, eval jobs running on live pools, retry loops, longer contexts, disabled quantization, over-warm replicas, and tenant mix. The fix should preserve SLOs while restoring utilization and routing batch work away from live capacity.

Regional ASR Regression

One region reports worse command recognition after a deployment. Global dashboards look normal.

Hidden answer: what to inspect

Slice by region, language, pronunciation-variation tags (approved aggregates or consented labels only), microphone path, network jitter, client version, model version, feature extraction version, and routing policy. Compare canary and control cohorts. An experienced responder does not rely on global WER during a regional complaint.
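
The slice-and-compare step can be sketched over aggregate rows. The row fields (`region`, `cohort`, `errors`, `words`), the function name, and the 3 percent threshold are assumptions for this example; the data is synthetic.

```python
from collections import defaultdict


def slice_regressions(rows, slice_key="region", max_relative_delta=0.03):
    """Compare canary and control WER per slice; report regressions and thin slices."""
    totals = defaultdict(lambda: {"canary": [0, 0], "control": [0, 0]})
    for row in rows:
        bucket = totals[row[slice_key]][row["cohort"]]
        bucket[0] += row["errors"]
        bucket[1] += row["words"]

    regressed = {}
    insufficient = []
    for name, cohorts in totals.items():
        canary_errors, canary_words = cohorts["canary"]
        control_errors, control_words = cohorts["control"]
        if canary_words == 0 or control_words == 0 or control_errors == 0:
            insufficient.append(name)  # surfaced, not treated as healthy
            continue
        canary_wer = canary_errors / canary_words
        control_wer = control_errors / control_words
        delta = (canary_wer - control_wer) / control_wer
        if delta > max_relative_delta:
            regressed[name] = delta
    return {"regressed": regressed, "insufficient_data": sorted(insufficient)}


rows = [
    {"region": "us", "cohort": "canary", "errors": 450, "words": 10000},
    {"region": "us", "cohort": "control", "errors": 500, "words": 10000},
    {"region": "eu", "cohort": "canary", "errors": 60, "words": 1000},
    {"region": "eu", "cohort": "control", "errors": 50, "words": 1000},
    {"region": "apac", "cohort": "canary", "errors": 10, "words": 200},
]
report = slice_regressions(rows)
```

In this synthetic data the combined canary WER for the two compared regions is lower than control, yet the smaller region regresses by 20 percent relative, which is the pattern a global dashboard hides.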

TTS Abuse Alert

Abuse monitoring detects a burst of synthetic voice requests that resemble impersonation attempts.

Hidden answer: what to inspect

Rate-limit suspicious tenants, preserve privacy-safe aggregate evidence, disable risky voices if policy allows, inspect prompt and account patterns, escalate to trust and safety, and confirm that logs do not expose private voiceprints or raw recordings.
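
Rate limiting a suspicious tenant is often done with a token bucket. The class below is a minimal sketch, not a specific product's limiter; time is passed in explicitly as an assumption of this example so the behavior is deterministic and easy to test.

```python
class TokenBucket:
    """Per-tenant token bucket with an injected clock value."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        if refill_per_second < 0:
            raise ValueError("refill_per_second must be non-negative")
        self.capacity = float(capacity)
        self.refill_per_second = float(refill_per_second)
        self.tokens = float(capacity)
        self.updated_at = now

    def allow(self, now, cost=1.0):
        """Refill for elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = max(now, self.updated_at)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Lowering `capacity` and `refill_per_second` for a flagged tenant throttles synthesis bursts while the investigation proceeds, without logging any request content.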

Coding Lab

Release Gate Utilities

Implement small utilities that mirror production review work. Use synthetic aggregate metrics only.

Lab 1: Rollout Gate Evaluator

Given metric deltas for a model canary, decide whether to continue, pause, or roll back. Negative WER and latency deltas are improvements.

Hidden answer: Python solution

import math


def rollout_decision(metrics):
    """Return a continue, pause, or rollback decision for a model canary."""
    required = (
        "p95_latency_ms_delta",
        "far_field_wer_relative_delta",
        "privacy_error_count",
        "overall_wer_relative_delta",
    )
    # Fail closed: missing or non-finite telemetry is an error, not a pass.
    missing = [key for key in required if key not in metrics]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    for key in required:
        if not math.isfinite(metrics[key]):
            raise ValueError(f"{key} must be finite")
    if metrics["privacy_error_count"] < 0:
        raise ValueError("privacy_error_count must be non-negative")

    # Hard gates: any one of these forces a rollback regardless of average gains.
    hard_failures = []
    if metrics["p95_latency_ms_delta"] > 100:  # more than 100 ms slower at p95
        hard_failures.append("latency")
    if metrics["far_field_wer_relative_delta"] > 0.03:  # over 3 percent relative regression
        hard_failures.append("far_field_wer")
    if metrics["privacy_error_count"] > 0:  # any privacy error at all
        hard_failures.append("privacy")

    if hard_failures:
        return {"decision": "rollback", "reasons": hard_failures}

    # Continue only when overall WER improves by at least 2 percent relative.
    if metrics["overall_wer_relative_delta"] <= -0.02:
        return {"decision": "continue", "reasons": ["quality_gain"]}

    return {"decision": "pause", "reasons": ["insufficient_gain"]}

The invariant is that privacy errors, latency regressions, and critical-slice regressions dominate average quality wins. A common mistake is launching because aggregate WER improved. Missing canary metrics or impossible privacy counters are invalid telemetry, not evidence that a rollout is safe. Non-finite telemetry such as NaN or infinity should fail closed too.

Lab 2: Queue SLO Burn Rate

Compute the burn rate for a queue latency SLO. The error budget is the allowed fraction of requests above threshold.

Hidden answer: Python solution

import math


def queue_burn_rate(latencies_ms, threshold_ms=250, allowed_bad_fraction=0.01):
    """Return the observed bad fraction divided by the allowed bad fraction."""
    # Materialize once so generators and other one-shot iterables work below.
    latencies_ms = list(latencies_ms)
    if not math.isfinite(threshold_ms):
        raise ValueError("threshold_ms must be finite")
    if not math.isfinite(allowed_bad_fraction):
        raise ValueError("allowed_bad_fraction must be finite")
    if threshold_ms <= 0:
        raise ValueError("threshold_ms must be positive")
    if not 0 < allowed_bad_fraction <= 1:
        raise ValueError("allowed_bad_fraction must be in (0, 1]")
    # Reject invalid samples before computing anything from them.
    if any(not math.isfinite(x) for x in latencies_ms):
        raise ValueError("latencies_ms must be finite")
    if any(x < 0 for x in latencies_ms):
        raise ValueError("latencies_ms cannot contain negative values")
    # An empty window is treated as no burn; missing traffic needs its own check.
    if not latencies_ms:
        return 0.0
    bad = sum(x > threshold_ms for x in latencies_ms)
    observed_bad_fraction = bad / len(latencies_ms)
    return observed_bad_fraction / allowed_bad_fraction

A burn rate above 1 means the service is consuming error budget faster than planned for that window. Alerting should combine short and long windows to catch both spikes and slow burns. Negative latencies or impossible error-budget fractions are invalid telemetry, not launch evidence. NaN and infinity must be rejected before alerting logic decides the service is healthy.
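
Combining windows can be sketched as a small decision function over two burn rates that were computed separately. The function name and both default thresholds are placeholder assumptions; real thresholds depend on the SLO period and the window lengths chosen.

```python
import math


def burn_alert(short_window_burn, long_window_burn, page_threshold=14.4, ticket_threshold=1.0):
    """Combine two burn-rate windows into "page", "ticket", or "ok"."""
    for name, value in (
        ("short_window_burn", short_window_burn),
        ("long_window_burn", long_window_burn),
    ):
        if not math.isfinite(value) or value < 0:
            raise ValueError(f"{name} must be finite and non-negative")
    # Page only when the long window shows a fast burn AND the short window
    # confirms it is still happening, which filters out already-recovered spikes.
    if short_window_burn >= page_threshold and long_window_burn >= page_threshold:
        return "page"
    # A sustained burn above budget over the long window is a slow burn.
    if long_window_burn >= ticket_threshold:
        return "ticket"
    return "ok"
```

Requiring agreement between windows before paging trades a little detection delay for fewer false pages, while the lower ticket threshold keeps slow burns visible.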

Lab 3: Cost Regression Detector

Compare old and new cost per successful speech turn. Return the candidate drivers that increased and need investigation.

Hidden answer: Python solution

import math


def cost_regression(old, new, tolerance=0.10):
    """Compare GPU cost per successful turn across two windows."""
    if not math.isfinite(tolerance):
        raise ValueError("tolerance must be finite")
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    required = (
        "successful_turns",
        "gpu_dollars",
        "avg_context_tokens",
        "retry_rate",
        "shadow_traffic_fraction",
    )
    # Check for missing keys in both windows first, then validate the values.
    for label, window in (("old", old), ("new", new)):
        missing = [key for key in required if key not in window]
        if missing:
            raise ValueError(f"{label} missing metrics: {missing}")
    for label, window in (("old", old), ("new", new)):
        for key in required:
            if not math.isfinite(window[key]):
                raise ValueError(f"{label}.{key} must be finite")
            if window[key] < 0:
                raise ValueError(f"{label}.{key} must be non-negative")
    # Zero-turn windows are invalid telemetry; no fallback denominator is used.
    if old["successful_turns"] <= 0 or new["successful_turns"] <= 0:
        raise ValueError("successful_turns must be positive")
    old_unit = old["gpu_dollars"] / old["successful_turns"]
    new_unit = new["gpu_dollars"] / new["successful_turns"]
    if old_unit <= 0:
        raise ValueError("old unit cost must be positive")
    relative = (new_unit - old_unit) / old_unit

    # Candidate drivers: any tracked input that increased between windows.
    drivers = []
    for key in ("avg_context_tokens", "retry_rate", "shadow_traffic_fraction"):
        if new[key] > old[key]:
            drivers.append(key)

    return {
        "old_unit_cost": old_unit,
        "new_unit_cost": new_unit,
        "relative_delta": relative,
        "regressed": relative > tolerance,
        "drivers": drivers,
    }

The useful metric is cost per successful user outcome, not total spend alone. Rising traffic may be acceptable; rising unit cost requires a capacity or product explanation. Do not hide zero-turn windows with a fallback denominator; mark them as invalid telemetry and investigate the pipeline before making a rollout decision. Non-finite spend, traffic, retry, or context counters should also block the decision rather than silently producing a false non-regression.

Advanced Practice Prompts

Answer Out Loud Before Opening

Prompt 1: Design A Shared Speech Serving Platform

Design a platform that hosts streaming ASR, TTS, and speech-to-speech for multiple product teams. Cover APIs, tenancy, scheduling, observability, release gates, privacy, and cost controls.

Hidden answer: strong outline

Discuss separate real-time and batch pools, tenant quotas, model registry, feature extraction contracts, streaming APIs, cancellation, canary routing, slice metrics, privacy-safe telemetry, incident runbooks, model rollback, autoscaling, GPU utilization, and chargeback or showback. Call out that ASR, TTS, and LLM serving have different bottlenecks.
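
Showback can start as a proportional split of a pool's spend by usage. The sketch below assumes GPU-seconds per tenant are already metered; the function name and tenant names are hypothetical, and real chargeback usually needs extra rules for idle and reserved capacity.

```python
import math


def showback(gpu_seconds_by_tenant, total_gpu_dollars):
    """Split a pool's GPU spend across tenants in proportion to GPU-seconds used."""
    if not math.isfinite(total_gpu_dollars) or total_gpu_dollars < 0:
        raise ValueError("total_gpu_dollars must be finite and non-negative")
    for tenant, seconds in gpu_seconds_by_tenant.items():
        if not math.isfinite(seconds) or seconds < 0:
            raise ValueError(f"{tenant} usage must be finite and non-negative")
    total_seconds = sum(gpu_seconds_by_tenant.values())
    if total_seconds <= 0:
        raise ValueError("no usage recorded; cannot attribute spend")
    return {
        tenant: total_gpu_dollars * seconds / total_seconds
        for tenant, seconds in gpu_seconds_by_tenant.items()
    }
```

Raising on a zero-usage window follows the same fail-closed stance as the labs above: a window with spend but no metered usage points to a telemetry gap, not a free service.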

Prompt 2: Debug A Speech-To-Speech Latency Regression

p95 turn latency increased by 35 percent after a release. ASR, LLM, and TTS each claim their local metrics are healthy.

Hidden answer: strong outline

Build an end-to-end trace waterfall and inspect queue time, network, client buffering, orchestration gaps, retries, retrieval latency, context growth, TTS first-audio-byte, playback start, and cancellation. Local component health is not enough when the product SLO is a full spoken turn.
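
A trace waterfall can be reduced to per-stage durations plus the untraced gaps between stages. This sketch assumes spans are `(stage_name, start_ms, end_ms)` tuples on a shared clock; the stage names and timings are synthetic.

```python
def turn_waterfall(spans):
    """Return (rows, total_ms), where rows list stage durations and gaps in time order."""
    if not spans:
        raise ValueError("spans must not be empty")
    ordered = sorted(spans, key=lambda span: span[1])
    rows = []
    turn_start = ordered[0][1]
    cursor = turn_start
    for name, start, end in ordered:
        if end < start:
            raise ValueError(f"{name} ends before it starts")
        if start > cursor:
            # Time not covered by any span: queueing, orchestration, or buffering.
            rows.append((f"gap_before_{name}", start - cursor))
        rows.append((name, end - start))
        cursor = max(cursor, end)
    return rows, cursor - turn_start


spans = [
    ("asr", 0, 300),
    ("llm", 420, 900),
    ("tts_first_audio", 910, 1150),
    ("playback_start", 1400, 1450),
]
rows, total_ms = turn_waterfall(spans)
gap_ms = sum(duration for name, duration in rows if name.startswith("gap_before_"))
```

In this example each component's own span looks reasonable, yet 380 ms of the 1450 ms turn sits in gaps that no component owns. That is where a regression can hide when every team's local metrics are healthy.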

Prompt 3: Choose Local, Cloud, Or Hybrid Inference

A privacy-sensitive assistant needs wake word, short commands, long dictation, and high-quality TTS. Decide what runs locally versus in the cloud.

Hidden answer: strong outline

Run wake word, VAD, privacy filters, and simple commands locally when possible. Use cloud or dedicated servers for long dictation, large LLM reasoning, heavy retrieval, and high-quality TTS if consent and policy allow. Add offline fallback, explicit upload boundaries, telemetry minimization, and model/version compatibility checks.
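
The placement rules above can be sketched as a small routing function. The task names, the `local_fallback` label, and the two boolean inputs are assumptions for this example; a real policy would also account for device capability and model/version compatibility.

```python
# Hypothetical task classes for illustration.
LOCAL_TASKS = {"wake_word", "vad", "privacy_filter", "short_command"}
CLOUD_TASKS = {"long_dictation", "llm_reasoning", "heavy_retrieval", "high_quality_tts"}


def choose_runtime(task, upload_consent, online):
    """Pick where a task runs; audio leaves the device only with consent and connectivity."""
    if task in LOCAL_TASKS:
        return "local"
    if task not in CLOUD_TASKS:
        raise ValueError(f"unclassified task: {task}")
    if upload_consent and online:
        return "cloud"
    # Degraded on-device path, for example a smaller model or a deferred request.
    return "local_fallback"
```

Making consent an explicit input keeps the upload boundary visible in code review rather than buried in configuration.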