Module 13

Inference Hosting And Serving Design

AI engineers need to turn model demos into services with clear latency, memory, throughput, reliability, privacy, and cost behavior. This lesson connects transformer serving to speech workloads.

Mental Model

Every Request Spends Four Budgets

Latency

Queue time, preprocessing, first token or first audio byte, decode time, post-processing, network, and client playback all count.

Question: Why is average latency a weak launch metric?

Interactive speech products are judged by tail latency and jitter. A good mean can hide overloaded queues, cold starts, long-context requests, batch starvation, or a few tenants causing bad p95 and p99 behavior.
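A minimal sketch of why the mean misleads, using made-up latencies (97 fast requests and 3 that hit an overloaded queue). The `percentile` helper and the sample values are assumptions for illustration, not measurements from a real service.

```python
import statistics

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q percent of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, (len(ordered) * q + 99) // 100)  # integer ceil(n * q / 100)
    return ordered[rank - 1]

# Hypothetical end-to-end latencies in milliseconds.
latencies_ms = [120] * 97 + [2500] * 3

mean_ms = statistics.mean(latencies_ms)  # 191.4, looks acceptable
p50_ms = percentile(latencies_ms, 50)    # 120
p99_ms = percentile(latencies_ms, 99)    # 2500, what the unlucky users feel

print(f"mean={mean_ms:.1f} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

The mean stays under 200 ms even though three users in every hundred wait 2.5 seconds, which is why launch criteria are usually written against p95 and p99.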

Memory

Weights, activations, KV cache, audio buffers, tokenizer state, feature tensors, and batch padding all compete for memory.

Question: What makes KV cache a serving problem?

The cache grows with layers, heads, head dimension, sequence length, batch size, and precision. Long conversations can consume more memory than the model weights, so schedulers need limits, eviction, paging, truncation, or summarization.

Throughput

Batching and streaming must be balanced. Larger batches improve accelerator utilization but can delay the first useful response.

Question: When should batching be limited?

Limit batching for conversational ASR, speech-to-speech, and TTS first-audio-byte paths where queue delay is visible. Use larger batches for offline transcription, embeddings, evaluation jobs, and analytics pipelines.
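One way to express that split is a flush rule with two knobs, sketched below. `BatchPolicy`, `should_flush`, and the numeric limits are hypothetical names and values chosen for illustration; real serving frameworks expose similar settings under their own names.

```python
from dataclasses import dataclass

@dataclass
class BatchPolicy:
    max_batch_size: int  # flush as soon as this many requests are queued
    max_wait_ms: float   # or as soon as the oldest request has waited this long

def should_flush(policy, queued, oldest_age_ms):
    """Return True when the batcher should stop waiting and run the batch."""
    if queued == 0:
        return False
    return queued >= policy.max_batch_size or oldest_age_ms >= policy.max_wait_ms

# Hypothetical settings: a tight wait for conversational paths, a loose one for offline jobs.
realtime = BatchPolicy(max_batch_size=8, max_wait_ms=5.0)
offline = BatchPolicy(max_batch_size=256, max_wait_ms=2000.0)

# Two requests queued, the oldest has waited 6 ms.
print(should_flush(realtime, queued=2, oldest_age_ms=6.0))  # True
print(should_flush(offline, queued=2, oldest_age_ms=6.0))   # False
```

The real-time policy gives up utilization to bound queue delay, while the offline policy keeps waiting to fill the accelerator.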

Quality

Quantization, pruning, smaller context, decoding shortcuts, and lower vocoder quality can save cost but must be checked by slice.

Question: Which slices are risky for audio?

Track accents, noisy rooms, far-field microphones, low-volume speech, domain terms, code-switching, children, elderly speakers, long silences, overlapping speech, and rare entity names. Global WER can hide severe regressions in these groups.
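A small sketch of how a global number can hide a slice regression. The slice names and the error and word counts are invented for illustration, and the counts are assumed to come from an existing WER scorer.

```python
def wer(errors, ref_words):
    """Word error rate from precomputed error and reference word counts."""
    return errors / ref_words

def wer_by_slice(counts):
    """Return (global WER, per-slice WER) from {slice: (errors, reference words)}."""
    per_slice = {name: wer(e, n) for name, (e, n) in counts.items()}
    total_errors = sum(e for e, _ in counts.values())
    total_words = sum(n for _, n in counts.values())
    return wer(total_errors, total_words), per_slice

# Hypothetical eval counts: (word errors, reference words).
counts = {
    "clean_close_mic": (180, 9000),
    "far_field": (60, 600),
    "code_switching": (80, 400),
}

global_wer, per_slice = wer_by_slice(counts)
print(f"global: {global_wer:.1%}")
for name, value in per_slice.items():
    print(f"{name}: {value:.1%}")
```

Here the global WER is 3.2 percent because clean audio dominates the word count, while the code-switching slice sits at 20 percent.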

Capacity Planning

Back-Of-The-Envelope Interview Math

In system design interviews, show that you can estimate before reaching for a benchmark. State assumptions, calculate, then explain which measurements would replace the estimate.

Prompt 1: Estimate GPU Replicas For A Streaming ASR Launch

You expect 10,000 concurrent streams at peak. Each stream sends 16 kHz mono audio. The streaming model consumes 0.35 GPU-seconds per audio second after batching, and one GPU should run at no more than 70 percent utilization. Estimate replicas.

Hidden answer: capacity calculation

Peak demand is 10,000 audio-seconds per wall-clock second. At 0.35 GPU-seconds per audio second, demand is 3,500 GPU-seconds per second. At a 70 percent utilization target, each GPU contributes 0.7 usable GPU-seconds per second. Estimate 3,500 / 0.7 = 5,000 GPUs before optimizations. A strong answer should then challenge the assumptions and look for savings: model or batching improvements, regional peak smoothing, smaller models, CPU preprocessing, VAD savings, and offline-vs-real-time separation.
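The same estimate as a small function, so assumptions can be varied quickly. `estimate_gpus` and its `speech_fraction` parameter are hypothetical; the 0.6 speech fraction in the second call is an invented value showing how a VAD assumption would enter the math, not a figure from the prompt.

```python
import math

def estimate_gpus(concurrent_streams, gpu_seconds_per_audio_second,
                  target_utilization, speech_fraction=1.0):
    """Back-of-the-envelope GPU count for a real-time streaming workload."""
    # Audio-seconds arriving per wall-clock second that actually reach the model.
    audio_seconds_per_second = concurrent_streams * speech_fraction
    # GPU-seconds of work demanded per wall-clock second.
    demand = audio_seconds_per_second * gpu_seconds_per_audio_second
    # Each GPU supplies only target_utilization usable GPU-seconds per second.
    # Rounding first keeps floating-point noise from bumping the ceiling by one.
    return math.ceil(round(demand / target_utilization, 6))

baseline = estimate_gpus(10_000, 0.35, 0.70)
with_vad = estimate_gpus(10_000, 0.35, 0.70, speech_fraction=0.6)
print(baseline, with_vad)
```

A benchmark would replace `gpu_seconds_per_audio_second` with a measured value at the real batch size, and production traces would replace the concurrency and speech-fraction guesses.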

Prompt 2: Estimate KV-Cache Memory

A decoder-only model has 32 layers, 32 KV heads, head dimension 128, fp16 cache, and 8,000 active tokens per request. Estimate KV cache per request before batching overhead.

Hidden answer: memory formula

Cache bytes are layers * tokens * KV tensors * KV heads * head dim * bytes per value, where KV tensors is 2 (one for keys, one for values) and fp16 uses 2 bytes per value. That is 32 * 8,000 * 2 * 32 * 128 * 2 bytes = 4,194,304,000 bytes, about 4.2 GB (roughly 3.9 GiB) per request. The answer should immediately discuss grouped-query attention, shorter context, cache paging, summarization, request limits, and why long conversations can dominate capacity.
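The formula above as a sketch in code. `kv_cache_bytes` is a hypothetical helper, and the 8-KV-head variant is an assumed grouped-query configuration added only to show how the estimate scales; it is not part of the prompt.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value, batch_size=1):
    """KV-cache size in bytes, ignoring allocator overhead and fragmentation."""
    kv_tensors = 2  # one tensor for keys, one for values
    return layers * tokens * kv_tensors * kv_heads * head_dim * bytes_per_value * batch_size

# The prompt's configuration: 32 layers, 32 KV heads, head dim 128, fp16, 8,000 tokens.
full = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, tokens=8_000, bytes_per_value=2)

# Assumed grouped-query variant with 8 KV heads, everything else unchanged.
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=8_000, bytes_per_value=2)

print(f"full: {full / 1e9:.2f} GB  gqa: {gqa / 1e9:.2f} GB")
```

Cutting KV heads from 32 to 8 cuts the cache by the same factor of four, which is why grouped-query attention is usually the first thing to mention.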

Serving Patterns

Choose The Runtime By Product Constraint

Embedded Local

Use for privacy, offline mode, small models, local wake words, VAD, and personal assistants where device limits are acceptable.

Hidden answer: staff-level tradeoff

Local serving reduces data exposure and network latency, but makes hardware variability, model updates, telemetry, crash debugging, and battery use harder. Keep a compatibility matrix and a rollback path for each release channel.

Online GPU API

Use for large ASR, speech-to-speech, LLM reasoning, neural codecs, and high-quality TTS that exceed device capacity.

Hidden answer: staff-level tradeoff

Online GPU serving gives centralized control, better utilization, and faster model iteration. It also introduces privacy, network, quota, regional capacity, cold-start, and tail-latency problems.

Batch Pipeline

Use for offline transcription, dataset labeling, eval runs, embedding generation, and backfills where throughput matters most.

Hidden answer: staff-level tradeoff

Batch jobs should maximize accelerator occupancy, retry safely, checkpoint progress, record model versions, and support idempotent outputs. They should not share fragile capacity with real-time conversational paths unless quotas are strict.

Production Debugging

First-Hour Incident Exercises

Incident 1: Queue Time Dominates p99

Model execution is stable, but p99 end-to-end latency doubled. Traces show most of the increase before inference begins.

Hidden answer: diagnosis and mitigation

Check arrival rate, per-tenant spikes, autoscaler lag, batcher thresholds, max queue age, priority classes, stuck workers, retries, and whether shadow or canary traffic is counted twice. Mitigate by reducing batch wait, adding replicas, shedding low-priority traffic, splitting real-time and batch queues, or rolling back the scheduler change.
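A minimal sketch of one mitigation from that list: shedding low-priority requests that have exceeded a maximum queue age while real-time traffic is kept and served first. `Queued`, `admit`, and the priority scheme are hypothetical and stand in for whatever the real scheduler uses.

```python
from dataclasses import dataclass

@dataclass
class Queued:
    request_id: str
    priority: int          # assumed scheme: 0 = real-time, larger = less urgent
    enqueued_at_ms: float

def admit(queue, now_ms, max_queue_age_ms):
    """Split a queue into requests to run (most urgent first) and requests to shed."""
    run, shed = [], []
    for req in queue:
        age_ms = now_ms - req.enqueued_at_ms
        # Only low-priority work is shed; real-time requests are never dropped here.
        if req.priority > 0 and age_ms > max_queue_age_ms:
            shed.append(req)
        else:
            run.append(req)
    run.sort(key=lambda r: (r.priority, r.enqueued_at_ms))
    return run, shed

queue = [
    Queued("a", priority=0, enqueued_at_ms=0.0),
    Queued("b", priority=1, enqueued_at_ms=0.0),
    Queued("c", priority=1, enqueued_at_ms=900.0),
]
run, shed = admit(queue, now_ms=1000.0, max_queue_age_ms=500.0)
print([r.request_id for r in run], [r.request_id for r in shed])
```

Request "b" is low priority and has waited 1,000 ms, so it is shed; "a" is real-time and "c" is still young, so both run.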

Incident 2: OOMs Only Happen On Long Conversations

Short requests pass. Long speech-to-speech sessions start failing with GPU out-of-memory errors after a few turns.

Hidden answer: diagnosis and mitigation

Inspect KV-cache growth, audio-token context, retained hidden states, conversation summarization, max-token limits, request cancellation cleanup, and cache fragmentation. Mitigate with hard context budgets, cache paging, summarization, session eviction, lower-precision caches, GQA/MQA models, or routing long sessions to a larger pool.
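A hard context budget can be sketched as keeping only the most recent turns that fit. `trim_to_budget`, the per-turn token counts, and the 4,000-token budget are assumptions for illustration; a production system would likely summarize the dropped turns rather than discard them.

```python
def trim_to_budget(turn_token_counts, max_tokens):
    """Keep the newest whole turns that fit in max_tokens.

    Returns (index of the first kept turn, tokens kept). If even the newest
    turn does not fit, the index equals len(turn_token_counts) and nothing is kept.
    """
    kept = 0
    start = len(turn_token_counts)
    # Walk backward from the newest turn and stop at the first one that would overflow.
    for i in range(len(turn_token_counts) - 1, -1, -1):
        if kept + turn_token_counts[i] > max_tokens:
            break
        kept += turn_token_counts[i]
        start = i
    return start, kept

# Hypothetical session: token counts per turn, oldest first.
turns = [3000, 2500, 2000, 1500]
start, kept = trim_to_budget(turns, max_tokens=4000)
print(start, kept)
```

With this budget the two newest turns are kept (3,500 tokens) and the two oldest are dropped, so cache growth is bounded regardless of session length.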

Incident 3: Cost Dropped But Entity Errors Increased

A quantized ASR model cut GPU cost by 30 percent, but customers report more mistakes on names and product SKUs.

Hidden answer: strong response

Roll back or canary-limit if the critical-entity error rate exceeds the launch budget. Compare by domain-term slice, accent, SNR, language, and utterance length. Consider mixed precision for sensitive layers, rescoring, contextual biasing, post-processing, or a cascade where the cheaper model handles easy traffic and uncertain requests go to the baseline.
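The cascade option can be sketched as a confidence gate. Everything here is hypothetical: `cascade_transcribe`, the two stand-in models, the transcripts, and the 0.9 threshold, which in practice would be tuned against the entity-error budget on held-out slices.

```python
def cascade_transcribe(audio, cheap_model, baseline_model, min_confidence=0.9):
    """Use the cheap model when it is confident; otherwise fall back to the baseline."""
    text, confidence = cheap_model(audio)
    if confidence >= min_confidence:
        return text, "cheap"
    text, _ = baseline_model(audio)
    return text, "baseline"

# Stand-in models that return (transcript, confidence). Real models would take audio.
def fake_quantized(audio):
    if "sku" in audio:
        return "order s k u 4471", 0.55  # low confidence on a product code
    return "turn on the lights", 0.97

def fake_baseline(audio):
    return "order SKU-4471", 0.93

easy = cascade_transcribe("lights request", fake_quantized, fake_baseline)
hard = cascade_transcribe("sku request", fake_quantized, fake_baseline)
print(easy, hard)
```

The cost saving then depends on what fraction of traffic stays on the cheap path, and the gate only helps if the cheap model's confidence is actually low on the utterances it gets wrong.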

Lab

Build An Inference Readiness Report

Create a one-page report for any ASR, TTS, embedding, or LLM model you can run locally. Do not commit private audio, transcripts, tokens, or credentials.

  1. Model contract: name weights, feature config, tokenizer, expected input, output schema, and license limits.
  2. Latency: measure cold start, warm p50/p95, first partial token or first audio byte, and total response time.
  3. Memory: record idle memory, peak memory, context length, batch size, and cache behavior.
  4. Quality: run a small public or synthetic eval and report errors by slice.
  5. Operations: define dashboards, alerts, rollback, cost budget, and privacy-safe logs.

Question: What makes this report advanced?

It connects model quality to operational constraints. A strong report explains what was measured, what was not measured, how the model can fail in production, what blocks launch, and exactly how to roll back or degrade gracefully.
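For the latency step of the report, a minimal timing harness might look like the sketch below. `measure_latency`, `load_model`, and `example_input` are hypothetical names: `load_model` is assumed to return a callable model, and cold start is taken here as load time plus the first inference. First-token or first-audio-byte timing would need a streaming hook that this sketch does not cover.

```python
import time

def time_call_ms(fn, *args):
    """Wall-clock duration of one call, in milliseconds."""
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) * 1000.0

def measure_latency(load_model, example_input, warm_runs=50):
    """Measure cold start (load plus first inference) and warm p50/p95 latency."""
    t0 = time.perf_counter()
    model = load_model()
    model(example_input)  # the first call often pays for lazy initialization
    cold_start_ms = (time.perf_counter() - t0) * 1000.0

    warm = sorted(time_call_ms(model, example_input) for _ in range(warm_runs))

    def pct(q):
        # Nearest-rank percentile over the sorted warm samples.
        rank = max(1, (len(warm) * q + 99) // 100)
        return warm[rank - 1]

    return {
        "cold_start_ms": cold_start_ms,
        "warm_p50_ms": pct(50),
        "warm_p95_ms": pct(95),
    }

if __name__ == "__main__":
    # Stand-in model: replace with a real loader and a non-private example input.
    report = measure_latency(lambda: (lambda x: sum(range(10_000))), "example")
    print(report)
```

On a GPU, the timed call would also need to wait for device work to finish before the clock stops, otherwise asynchronous kernels make the numbers look better than they are.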