Serving Infrastructure

Speech GPU Serving Capacity Planning

Practice the advanced engineering layer between model quality and product reliability: GPU memory math, batching, streaming concurrency, autoscaling, load shedding, cost controls, and incident response for ASR, TTS, LLM, and speech-to-speech systems.

Mental Model

Capacity Is A Product Contract

Advanced interviewers expect you to connect traffic, latency, quality, memory, and cost. A capacity plan is not a GPU count. It is an argument that the system can keep its product promise when traffic, inputs, and dependencies become uneven.

  1. Segment traffic: real-time streams, async batch jobs, eval jobs, admin tools, and replay pipelines need separate queues.
  2. Budget stages: VAD, ASR partial, ASR final, retrieval, LLM first token, LLM decode, TTS first audio byte, and playback all consume SLO.
  3. Track scarce resources: GPU memory, KV cache, audio encoder occupancy, batch slots, CPU preprocessing, network egress, and warm voices.
  4. Reserve headroom: protect p95 and p99 with burst buffers, priority classes, and warm replicas for hot language and voice slices.
  5. Make rollback cheap: keep previous model bundles routable, preserve compatible schemas, and rehearse partial rollback by component.

Question: Why can average GPU utilization be high while user latency is still bad?

Utilization can hide queueing and tail behavior. A GPU may be busy with long prompts, oversized batches, low-priority replays, or cold TTS voices while real-time requests wait. Strong diagnosis splits utilization by traffic class, queue age, batch composition, model version, context length, and stage latency.
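
A minimal sketch of that split, assuming queue-age samples tagged with a traffic class. The function names, class labels, and numbers are hypothetical. It computes a nearest-rank p95 per class so a healthy-looking aggregate cannot hide a slow real-time slice.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p95_queue_age_by_class(samples):
    """samples: iterable of (traffic_class, queue_age_ms) pairs (assumed shape)."""
    by_class = defaultdict(list)
    for traffic_class, queue_age_ms in samples:
        by_class[traffic_class].append(queue_age_ms)
    return {cls: percentile(ages, 95) for cls, ages in by_class.items()}
```

With 5 real-time samples at 900 ms and 95 replay samples at 20 ms, the aggregate p95 is 20 ms while the real-time p95 is 900 ms. The same grouping can be repeated for model version, context length, or stage.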

Capacity Planning

Interview Prompts With Hidden Advanced Answers

Prompt 1: Size A Streaming ASR Fleet

A contact-center ASR service receives 12,000 concurrent calls at peak. Each call sends 16 kHz mono audio, needs first partial under 450 ms p95, and must tolerate one zone loss. Describe your capacity model and rollout checks.

Hidden answer: strong sizing outline

Start from concurrent streams and audio seconds per second, then split by language, model, region, and real-time priority. Measure max streams per replica at p95 first-partial latency with realistic noise and endpointing. Reserve zone-loss headroom, isolate eval and replay traffic, and autoscale on queue age plus active streams rather than GPU utilization alone. Rollout gates should compare WER/CER slices, partial churn, first partial p95/p99, finalization latency, endpointing errors, and cost per successful audio minute.
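
The sizing arithmetic can be sketched as follows. This is an illustration under stated assumptions, not a complete capacity model: `streams_per_replica` is a hypothetical load-test result measured at the first-partial SLO, `target_utilization` is an assumed headroom factor, and traffic is assumed to spread evenly across zones.

```python
import math


def asr_replicas_needed(peak_streams, streams_per_replica, zones,
                        target_utilization=0.75):
    """Replicas per zone and in total so peak traffic still fits after one zone is lost."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone loss")
    if streams_per_replica <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("invalid capacity inputs")
    # Usable streams per replica after leaving burst headroom.
    usable = streams_per_replica * target_utilization
    replicas_at_peak = math.ceil(peak_streams / usable)
    # The surviving zones - 1 zones must carry the whole peak.
    per_zone = math.ceil(replicas_at_peak / (zones - 1))
    return {"per_zone": per_zone, "total": per_zone * zones}
```

For the prompt's 12,000 concurrent calls, an assumed 40 streams per replica, three zones, and 75 percent target utilization, this gives 200 replicas per zone and 600 in total. In practice the calculation would be repeated per language, model, and region slice.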

Prompt 2: Estimate LLM KV Cache For Spoken RAG

A speech assistant sends ASR transcripts into an LLM with retrieval. p95 prompt length grows from 2,000 to 6,000 tokens after a prompt change, and first audio latency regresses. What do you inspect?

Hidden answer: memory, batching, and rollback

Inspect tokens per turn, retrieved chunk count, chat-history carryover, cache hit rate, batch occupancy, KV-cache eviction, decode throughput, and route mix. Longer prompts reduce effective batch size and raise prefill latency before TTS can start. Mitigate with retrieval caps, prompt compaction, session summarization, context budgets by tier, speculative decoding checks, and rollback to the previous prompt bundle if first-audio SLO burn continues.
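
A back-of-envelope sketch shows why longer prompts shrink the effective batch. It uses the standard per-sequence estimate for a decoder-only transformer (keys and values, per layer, per KV head). The model shape and memory budget in the usage note are hypothetical, and the estimate ignores allocator overhead, prefix sharing, and cache quantization.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache bytes for one sequence of the given length."""
    # Factor 2 covers keys and values; bytes_per_value=2 assumes fp16 or bf16.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * tokens


def max_concurrent_sequences(kv_budget_bytes, tokens, **model_shape):
    """How many sequences of this length fit in memory reserved for KV cache."""
    return kv_budget_bytes // kv_cache_bytes(tokens, **model_shape)
```

For an assumed shape of 32 layers, 8 KV heads, and head dimension 128 in fp16, each token costs 128 KiB. A 40 GiB KV budget then holds 163 sequences at 2,000 tokens but only 54 at 6,000 tokens, which is the batch-size collapse the answer describes.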

Prompt 3: Plan A TTS Voice Launch

Product wants to launch ten new expressive voices. The model is high quality but each cold voice adds startup latency and memory pressure. Design the serving plan.

Hidden answer: warm pools and guardrails

Launch by voice, language, tenant, and region. Keep warm pools for top voices and lazy-load low-volume voices behind a clear latency budget. Track first audio byte, streaming underruns, voice cache hit rate, memory fragmentation, fallback voice usage, MOS proxies, and abuse signals. Protect real-time traffic with priority queues and fall back to a smaller or neutral voice if cold-start queues exceed the SLO window.
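
One way to reason about warm pools is a least-recently-used cache with pinned top voices. The sketch below is illustrative only: the `VoiceCache` class and its fields are hypothetical, capacity is counted in voices rather than bytes, and real loading and unloading of model weights is left out.

```python
from collections import OrderedDict


class VoiceCache:
    """LRU set of loaded voices; pinned voices are never evicted."""

    def __init__(self, capacity, pinned=()):
        self.pinned = set(pinned)
        if capacity <= len(self.pinned):
            raise ValueError("capacity must leave room beyond pinned voices")
        self.capacity = capacity
        self.loaded = OrderedDict((voice, True) for voice in self.pinned)
        self.hits = 0
        self.misses = 0

    def acquire(self, voice_id):
        """Return True on a warm hit, False when the voice needed a cold load."""
        if voice_id in self.loaded:
            self.loaded.move_to_end(voice_id)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.loaded) >= self.capacity:
            self._evict_one()
        self.loaded[voice_id] = True
        return False

    def _evict_one(self):
        # Iterate from least to most recently used and drop the first unpinned voice.
        for candidate in self.loaded:
            if candidate not in self.pinned:
                del self.loaded[candidate]
                return
        raise RuntimeError("no evictable voice")
```

The hit and miss counters feed the voice cache hit rate mentioned above. A miss is the point where a caller could choose the fallback voice instead of waiting for a cold load.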

Autoscaling

Scale On User Pain, Not Only Hardware Counters

Prompt: Design Autoscaling Signals For Speech-To-Speech

Choose autoscaling and load-shedding signals for a cascaded speech-to-speech product that mixes ASR, retrieval, LLM, and TTS.

Hidden answer: autoscaling policy

Scale ASR on active streams, chunk queue age, first partial p95, and CPU feature extraction saturation. Scale LLM serving on prefill queue age, active tokens, KV-cache pressure, decode tokens per second, and time to first token. Scale TTS on first audio byte, active synthesis streams, warm voice cache misses, and underruns. Load shedding should degrade in order: reduce retrieval depth, shorten context, route to smaller models, delay non-real-time jobs, disable premium effects, and finally reject low-priority traffic with clear retry behavior.
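
The degradation order above can be written as an explicit ladder. This is a sketch under assumptions: the step names are hypothetical labels for the actions listed, and the rule of enabling one more step per `step_ms` of queue age past the SLO is an illustrative policy, not a recommended tuning.

```python
SHED_LADDER = [
    "reduce_retrieval_depth",
    "shorten_context",
    "route_to_smaller_model",
    "delay_non_realtime_jobs",
    "disable_premium_effects",
    "reject_low_priority",
]


def active_shed_steps(queue_age_ms, slo_ms, step_ms):
    """Return the degradations to enable, mildest first."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    overage = queue_age_ms - slo_ms
    if overage <= 0:
        return []
    # One step as soon as the SLO is breached, one more per step_ms beyond it.
    level = min(len(SHED_LADDER), 1 + int(overage // step_ms))
    return SHED_LADDER[:level]
```

Keeping the ladder as data makes the order reviewable and lets rejection stay the last resort.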

Question: What is the common autoscaling mistake in real-time ML systems?

Scaling only on GPU utilization. Real-time systems usually fail first through queue age, cold starts, memory fragmentation, context growth, noisy slices, retry storms, and dependency latency, and those symptoms often appear before GPU utilization shows anything unusual.
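
A minimal sketch of scaling on several signals at once, assuming each signal is reported as an (observed, target) pair and that load divides roughly evenly across replicas. It applies the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler and takes the largest result, so the worst signal drives the decision. Signal names and limits are hypothetical.

```python
import math


def desired_replicas(current_replicas, signals, min_replicas=1, max_replicas=100):
    """signals maps a name to (observed, target); scale to satisfy the worst one."""
    if current_replicas <= 0:
        raise ValueError("current_replicas must be positive")
    wanted = min_replicas
    for name, (observed, target) in signals.items():
        if target <= 0:
            raise ValueError(f"target for {name} must be positive")
        # Proportional rule: replicas scale with observed / target.
        wanted = max(wanted, math.ceil(current_replicas * observed / target))
    return min(wanted, max_replicas)
```

In the usage below GPU utilization alone would suggest scaling in, while queue age asks for twice the fleet. Latency-style signals do not scale linearly with replica count, so treat the proportional step as a starting point.

```python
signals = {
    "gpu_util_pct": (50, 75),
    "active_streams_per_replica": (36, 30),
    "queue_age_ms": (300, 150),
}
replicas = desired_replicas(10, signals)
```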

Production Incidents

First-Hour Drills

Drill 1: Real-Time Queue Saturation

Users report delayed assistant replies. Error rate is flat, but ASR partials and TTS first audio byte are both slower in one region.

Hidden answer: triage path

Split by stage, traffic class, region, language, voice, model bundle, queue age, and retry count. Check whether batch or replay jobs leaked into the real-time pool, whether one dependency is causing backpressure, and whether autoscaling is waiting on cold starts. Freeze rollouts, drain non-real-time jobs, route affected slices to warm fallback pools, and roll back the latest bundle if a version-linked tail regression persists.

Drill 2: Cost Spike With Better Latency

Latency improved after a serving change, but daily GPU spend rose 38 percent and task success did not move. What is the strong response?

Hidden answer: cost-quality investigation

Compare cost per successful turn, model mix, batch occupancy, speculative acceptance rate, prompt tokens, cache reuse, fallback rate, and overprovisioned warm pools. Better latency may come from wasteful headroom or smaller batches. Decide whether the product needs that latency gain, then set tier-specific SLOs, tighter warm pool targets, context budgets, and a release gate on cost per successful task rather than raw latency alone.
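
The release gate can be sketched as a comparison of cost per successful turn before and after the change. The function names, the example spend figures, and the 10 percent tolerance are hypothetical; only the 38 percent spend increase and the flat task success come from the drill.

```python
def cost_per_successful_turn(gpu_spend, successful_turns):
    """Spend divided by turns that met the task-success definition."""
    if successful_turns <= 0:
        raise ValueError("successful_turns must be positive")
    return gpu_spend / successful_turns


def relative_change(before, after):
    """Fractional change from before to after, for example 0.38 for +38 percent."""
    return (after - before) / before


def passes_cost_gate(before, after, max_increase):
    """True when cost per successful turn grew by at most max_increase."""
    return relative_change(before, after) <= max_increase
```

With spend rising from an assumed 1,000 to 1,380 per day and 80,000 successful turns on both days, cost per successful turn rises by about 38 percent, so a gate with a 10 percent tolerance blocks the release even though latency improved.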

Coding Drill

Blind 75 Patterns In Serving Infrastructure

These drills use familiar interview patterns while staying close to speech serving work. Open answers only after writing tests and a first implementation.

Problem 1: Sliding Window Capacity Alert

Given per-minute queue-age measurements, return the first minute where any window of k minutes has average queue age above a threshold. Include edge cases and common mistakes.

Hidden answer: invariant, tests, and Python solution

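Invariant: the running sum always equals the sum of the current k-minute window. The function returns the zero-based index of the last minute in the first breaching window, because alerting usually triggers at the end of the window. Common mistakes are off-by-one return times, integer division when computing the average, and returning the start of the window instead of its end. Test empty input, k larger than input, threshold exactly equal (which should not alert, since the comparison is strict), first window breach, later breach, and no breach.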

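def first_capacity_alert(queue_age_ms, k, threshold_ms):
    """Return the index of the last minute of the first k-minute window whose
    average queue age exceeds threshold_ms, or None if no window breaches."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(queue_age_ms) < k:
        return None  # not enough data for one full window

    # Running sum of the first full window.
    window = sum(queue_age_ms[:k])
    if window / k > threshold_ms:
        return k - 1  # the alert fires at the end of the window

    for right in range(k, len(queue_age_ms)):
        # Slide the window: add the new minute, drop the one that left.
        window += queue_age_ms[right] - queue_age_ms[right - k]
        if window / k > threshold_ms:
            return right
    return None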

Problem 2: Allocate Requests To Model Pools

Given requests with priority and estimated cost, choose which requests to admit under a fixed capacity budget. Higher priority wins; within the same priority, cheaper requests should be admitted first.

Hidden answer: strategy, tests, and Python solution

This is a sorting and greedy policy question, not optimal knapsack unless the interviewer asks for value maximization. State the product policy first. Test zero capacity, ties, requests larger than capacity, equal priorities with different costs, and stable identifiers for admitted requests.

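def admit_requests(requests, capacity):
    """Greedily admit request ids: higher priority first, then lower cost."""
    # The id is the last sort key so ties break deterministically.
    ordered = sorted(
        requests,
        key=lambda r: (-r["priority"], r["estimated_cost"], r["id"]),
    )
    admitted = []
    used = 0
    for request in ordered:
        cost = request["estimated_cost"]
        # A request that does not fit is skipped rather than ending the scan,
        # so a cheaper lower-priority request can still use leftover capacity.
        if used + cost <= capacity:
            admitted.append(request["id"])
            used += cost
    return admitted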

Problem 3: Detect A Retry Storm

Given aggregate records with minute, requests, and retries, flag sustained windows where retry rate is above a threshold and request volume is also high enough to matter.

Hidden answer: invariant, mistakes, and Python solution

Maintain rolling requests and retries. The invariant is that both sums represent exactly the current window. Avoid dividing by zero, alerting on tiny samples, or looking only at retry count without traffic volume.

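def retry_storm_windows(records, k, retry_threshold, min_requests):
    """Return the end minute of every k-record window whose retry rate exceeds
    retry_threshold while total requests are at least min_requests."""
    if k <= 0:
        raise ValueError("k must be positive")
    alerts = []
    req_sum = 0
    retry_sum = 0

    for i, row in enumerate(records):
        # Add the newest record to both rolling sums.
        req_sum += row["requests"]
        retry_sum += row["retries"]
        if i >= k:
            # Drop the record that just left the window.
            old = records[i - k]
            req_sum -= old["requests"]
            retry_sum -= old["retries"]
        # Evaluate only full windows with enough traffic to matter.
        if i >= k - 1 and req_sum >= min_requests:
            # The req_sum check guards against division by zero when min_requests is 0.
            if req_sum and retry_sum / req_sum > retry_threshold:
                alerts.append(row["minute"])
    return alerts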