Latency
Queue time, preprocessing, first token or first audio byte, decode time, post-processing, network, and client playback all count toward end-to-end latency.
Question: Why is average latency a weak launch metric?
Interactive speech products are judged by tail latency and
jitter, not by the typical request. A good mean can hide bad p95
and p99 behavior caused by overloaded queues, cold starts,
long-context requests, batch starvation, or a few tenants.
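A minimal sketch of how a mean can look healthy while the tail does not. The latency values are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank definition, not a function from any serving library.

```python
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample such that at least
    q percent of all samples are less than or equal to it (q is an int)."""
    ordered = sorted(samples)
    # Ceiling division on integers avoids floating-point rank errors.
    rank = max(1, -(-q * len(ordered) // 100))
    return ordered[rank - 1]


# Hypothetical end-to-end latencies in milliseconds:
# 90 fast requests and 10 slow ones (for example, cold starts).
latencies_ms = [200] * 90 + [3000] * 10

mean_ms = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

With these numbers the mean is 480 ms while p95 and p99 are both 3000 ms, so one request in ten is far slower than the average suggests.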
Memory
Weights, activations, KV cache, audio buffers, tokenizer state, feature tensors, and batch padding all compete for memory.
Question: What makes KV cache a serving problem?
The cache grows linearly with each of layers, heads, head
dimension, sequence length, batch size, and bytes per element.
The cache for long conversations can consume more memory than the
model weights, so schedulers need limits, eviction, paging,
truncation, or summarization.
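The growth described above can be sketched as a product of those factors, with a factor of 2 for storing both keys and values. The model dimensions and the 7-billion-parameter weight count below are hypothetical values chosen for illustration, not a specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Approximate KV cache size: one key and one value vector
    per layer, per KV head, per token, per sequence in the batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical decoder configuration (assumed for illustration).
cache = kv_cache_bytes(
    layers=32,
    kv_heads=32,
    head_dim=128,
    seq_len=4096,
    batch_size=8,
    bytes_per_elem=2,  # 16-bit precision
)

# Hypothetical weight footprint: 7e9 parameters at 2 bytes each.
weights = 7_000_000_000 * 2

print(f"KV cache: {cache / 1024**3:.1f} GiB")
print(f"Weights:  {weights / 1024**3:.1f} GiB")
print("cache exceeds weights:", cache > weights)
```

Halving the precision, the number of KV heads, or the retained sequence length each halves the cache in this estimate, which is why truncation and paging are useful scheduler levers.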
Throughput
Batching and streaming must be balanced. Larger batches improve accelerator utilization but can delay the first useful response.
Question: When should batching be limited?
Limit batching for conversational ASR, speech-to-speech, and TTS
first-audio-byte paths where queue delay is visible. Use larger
batches for offline transcription, embeddings, evaluation jobs,
and analytics pipelines.
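One common way to limit batching is to cap both the batch size and the time a request may wait for the batch to fill. The sketch below simulates that policy on a list of arrival times. The function names, arrival times, and limits are all hypothetical, and a real scheduler would also account for compute time.

```python
def form_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into (dispatch_time, members) batches.

    A batch is dispatched when it is full, or when max_wait_ms has
    passed since its first request arrived, whichever comes first.
    """
    batches = []
    current = []
    for t in arrivals_ms:
        if current and t - current[0] > max_wait_ms:
            # The deadline passed before this request arrived.
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        # A partial batch leaves when its deadline expires.
        batches.append((current[0] + max_wait_ms, current))
    return batches


def worst_queue_delay_ms(batches):
    """Longest time any request waited before its batch was dispatched."""
    return max(dispatch - arrival
               for dispatch, members in batches
               for arrival in members)


arrivals = [0, 5, 10, 40, 45, 200]  # hypothetical arrival times in ms

interactive = form_batches(arrivals, max_batch_size=2, max_wait_ms=10)
offline = form_batches(arrivals, max_batch_size=8, max_wait_ms=500)

print(len(interactive), worst_queue_delay_ms(interactive))
print(len(offline), worst_queue_delay_ms(offline))
```

In this simulation the interactive settings produce four small batches with a worst queue delay of 10 ms, while the offline settings produce one batch whose first request waits 500 ms.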
Quality
Quantization, pruning, shorter context, decoding shortcuts, and lower vocoder quality can save cost, but each must be evaluated per slice, not only in aggregate.
Question: Which slices are risky for audio?
Track accents, noisy rooms, far-field microphones, low-volume
speech, domain terms, code-switching, children, elderly speakers,
long silences, overlapping speech, and rare entity names. Global
word error rate (WER) can hide severe regressions in these groups.
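A minimal sketch of reporting WER per slice alongside the global figure. The slice names and transcripts are invented for illustration, and WER is computed here as word-level edit distance divided by reference word count, pooled over each slice.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (edit distance over words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance with one rolling row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_slice(samples):
    """samples: iterable of (slice_name, reference, hypothesis).
    Returns WER per slice plus a pooled 'global' entry."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for slice_name, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        for key in (slice_name, "global"):
            errors[key] += e
            words[key] += n
    return {key: errors[key] / words[key] for key in words}


# Hypothetical evaluation set: nine clean utterances, one far-field.
samples = [("clean", "turn on the lights", "turn on the lights")] * 9
samples.append(("far_field", "turn on the lights", "turn in the light"))

report = wer_by_slice(samples)
for name, wer in sorted(report.items()):
    print(f"{name:10s} WER={wer:.3f}")
```

Here the global WER is 5% while the far-field slice sits at 50%, so a launch gate on the global number alone would miss the regression.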