Model Architecture
Attention variants, MQA/GQA, Conformer blocks, distillation, adapters, pruning, and smaller specialist models.
Module 12
Learn how modern audio and text models become fast enough to serve: IO-aware attention, KV-cache memory, grouped-query attention, quantization, speculative decoding, distillation, adapters, and production latency tradeoffs.
Mental Model
Experienced engineers reason across algorithms, kernels, memory layout, serving queues, hardware, and quality gates. A faster model that misses domain terms, breaks timestamps, or cannot roll back is not production-ready.
Attention variants, MQA/GQA, Conformer blocks, distillation, adapters, pruning, and smaller specialist models.
FlashAttention, fused operations, quantized matmuls, prefill/decode split, and memory-aware batch scheduling.
Continuous batching, KV-cache paging, admission control, fallbacks, autoscaling, observability, and rollback.
Ask what dominates the user-visible SLO: prefill, decode, audio feature extraction, network, queueing, TTS first audio, or downstream calls. Optimize the measured bottleneck, not the component that is easiest to benchmark in isolation.
Attention Cost
Standard self-attention builds an n x n score matrix per head, so memory and compute for the scores grow quadratically with sequence length n. Audio models often face long frame sequences before subsampling. Text models face long context windows and expensive KV-cache memory during autoregressive decoding.
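To make the quadratic growth concrete, here is a minimal sketch that estimates the memory needed to materialize the full score matrix for one layer. The function name, the 16-head configuration, and the 2-byte (fp16) default are illustrative assumptions, not values taken from a specific model.

```python
def attention_scores_gib(seq_len, heads, batch_size=1, bytes_per_value=2):
    """GiB needed to hold one seq_len x seq_len score matrix per head (one layer)."""
    total_bytes = batch_size * heads * seq_len * seq_len * bytes_per_value
    return total_bytes / (1024 ** 3)


if __name__ == "__main__":
    # Hypothetical configuration: 16 heads, fp16 scores, batch size 1.
    for n in (1024, 4096, 16384):
        print(f"n={n}: {attention_scores_gib(n, heads=16):.3f} GiB per layer")
```

Doubling the sequence length quadruples the estimate, which is why kernels that avoid materializing this matrix matter at long sequence lengths.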
FlashAttention computes exact attention while reducing high-bandwidth memory traffic. It tiles query/key/value blocks, keeps partial softmax statistics, and avoids materializing the full attention matrix.
Attention is often memory-bandwidth bound. FlashAttention saves time by changing how data moves through memory, not by changing the mathematical result. A strong answer mentions exactness, tiling, SRAM/HBM traffic, sequence length, and kernel support.
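The idea of keeping partial softmax statistics can be illustrated in plain Python for a single query vector. This is only a sketch of the running-max and running-sum bookkeeping, not the real GPU kernel: the function names are made up for this example, the 1/sqrt(d) score scaling is omitted, and nothing here models SRAM or HBM.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize every score for one query, then apply softmax."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Visit keys/values block by block, keeping only running softmax statistics."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values seen so far
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier statistics so they share the new reference maximum.
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same result up to floating-point rounding, which is the sense in which the tiled approach is exact: only the order of memory access changes.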
Multi-query attention shares key/value heads across query heads. Grouped-query attention is the middle ground: several query heads share one key/value group.
MQA/GQA reduce KV-cache memory and memory bandwidth during decoding. Quality can drop if sharing is too aggressive, especially for tasks that need fine-grained conditioning. They are most helpful when decode latency and cache size dominate.
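A small sketch of the head sharing described above. It assumes contiguous grouping (query heads 0..g-1 share KV head 0, and so on), which is a common layout but not the only possible one; the function names are hypothetical.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Map a query head index to the key/value head it reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    if not 0 <= query_head < num_query_heads:
        raise ValueError("query_head out of range")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_reduction(num_query_heads, num_kv_heads):
    """How many times smaller the KV cache is compared with full multi-head attention."""
    return num_query_heads / num_kv_heads


if __name__ == "__main__":
    # MHA: 8 KV heads, GQA: 2 KV heads, MQA: 1 KV head (all with 8 query heads).
    for kv_heads in (8, 2, 1):
        mapping = [kv_head_for_query_head(h, 8, kv_heads) for h in range(8)]
        print(kv_heads, mapping, kv_cache_reduction(8, kv_heads))
```

With one KV head the mapping collapses to MQA, and with as many KV heads as query heads it is ordinary multi-head attention, so GQA covers the range between the two.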
You need to run an encoder over 30 seconds of audio represented as 100 frames per second. What efficiency choices do you consider before reaching for a bigger GPU?
Check frame rate, windowing, subsampling, streaming chunks, convolutional front-end, local attention, FlashAttention support, mixed precision, batch shape padding, and whether the model can emit partial states. Also ask whether the product needs all 30 seconds at once or can process incrementally with stable boundaries.
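The arithmetic behind the scenario is worth doing before any profiling. The sketch below counts encoder frames and full-attention score entries per head; the subsampling factor of 4 is an assumed example value, not something the scenario specifies.

```python
import math


def encoder_frames(seconds, frames_per_second, subsampling=1):
    """Number of frames the encoder attends over after front-end subsampling."""
    return math.ceil(seconds * frames_per_second / subsampling)


def score_entries_per_head(frames):
    """Entries in the full frames x frames attention score matrix for one head."""
    return frames * frames


if __name__ == "__main__":
    raw = encoder_frames(30, 100)                 # 30 s at 100 frames per second
    sub = encoder_frames(30, 100, subsampling=4)  # assumed 4x subsampling
    print(raw, score_entries_per_head(raw))
    print(sub, score_entries_per_head(sub))
```

Under these assumptions, 4x subsampling cuts the sequence from 3000 to 750 frames and the score matrix by a factor of 16, which is often a larger win than a kernel change.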
Generation
Autoregressive transformers reuse previous keys and values so each new token attends to cached history instead of recomputing every prefix. The cache can become the main memory cost.
Track prompt length, generated tokens, active sequences, cache bytes per request, eviction behavior, GPU memory fragmentation, batch size, time to first token, tokens per second, and p95/p99 latency by route and model version.
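Reporting p95/p99 by route and model version can be sketched with the standard library alone. The record layout `(route, model_version, latency_ms)` and the function names are assumptions for illustration; a production system would normally rely on its metrics backend rather than this nearest-rank helper.

```python
import math
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_report(records, pcts=(50, 95, 99)):
    """Group (route, model_version, latency_ms) records and report percentiles per group."""
    groups = defaultdict(list)
    for route, model_version, latency_ms in records:
        groups[(route, model_version)].append(latency_ms)
    return {
        key: {f"p{p}": percentile(vals, p) for p in pcts}
        for key, vals in groups.items()
    }
```

Keeping the model version in the grouping key is what makes a regression attributable to a specific rollout rather than to a shift in traffic mix.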
A small draft model proposes several tokens, and the larger target model verifies them. Accepted draft tokens reduce target-model forward passes.
It helps when the draft model is cheap and has a high acceptance rate. It may disappoint when the draft model is too inaccurate, verification overhead is high, batches are already saturated, or the task distribution differs from the draft model's strengths.
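The propose-and-verify loop can be sketched with toy models. This is the greedy variant only: a draft token is accepted when it equals the target's greedy choice, so the output matches plain greedy decoding with the target. Sampling-based acceptance rules are not shown. The callables `target_next` and `draft_next` are hypothetical stand-ins that map a token list to the next token.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_greedy(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The cheap draft model proposes draft_len tokens.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target verifies. A real system scores all positions in one
        #    batched forward pass; this toy calls target_next per position
        #    but counts the round as a single verification pass.
        rounds += 1
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's token replaces the first mismatch
                break
        else:
            # Every draft token matched, so the target's next token comes for free.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens], rounds


if __name__ == "__main__":
    def target(ts):
        return (ts[-1] + 1) % 10

    def draft(ts):
        return 9 if ts[-1] == 2 else (ts[-1] + 1) % 10  # wrong after token 2

    print(speculative_greedy(target, draft, [0], 12))
```

In this toy, fewer verification rounds than generated tokens is the whole benefit; a draft that mismatches early in every round would drive the round count back toward one per token.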
A local speech assistant takes 2.8 seconds before the user hears a reply. ASR partials arrive in 300 ms, but the LLM and TTS handoff feel sluggish. Diagnose the system.
Split latency into endpointing, transcript stabilization, prompt assembly, LLM prefill, time to first token, token rate, TTS chunking, vocoder first audio, playback buffer, and UI/event-loop delay. Mitigations include earlier turn prediction, streaming partial prompts, shorter context, prompt caching, smaller or quantized model, speculative decoding, streaming TTS, response templates for common intents, and a rollback if a recent model change moved p99 beyond the SLO.
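A latency split like the one above is easier to act on once it is written down as a budget. In this sketch the stage names follow the diagnosis, but every millisecond value is invented for illustration so that the total matches the 2.8 second scenario; real numbers must come from traces.

```python
def latency_breakdown(stages_ms):
    """Return (total_ms, dominant_stage, share_by_stage) for a dict of stage latencies."""
    total = sum(stages_ms.values())
    shares = {name: ms / total for name, ms in stages_ms.items()}
    dominant = max(stages_ms, key=stages_ms.get)
    return total, dominant, shares


if __name__ == "__main__":
    # Hypothetical measurements, not data from a real system.
    stages = {
        "endpointing": 600,
        "transcript_stabilization": 200,
        "prompt_assembly": 50,
        "llm_prefill": 700,
        "llm_first_tokens": 450,
        "tts_first_audio": 600,
        "playback_buffer": 200,
    }
    total, dominant, shares = latency_breakdown(stages)
    print(f"total={total} ms, dominant={dominant} ({shares[dominant]:.0%})")
```

With these made-up values no single stage exceeds a quarter of the total, which would suggest combining several mitigations rather than optimizing only one component.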
Compression
Lower precision weights and activations reduce memory and can improve throughput, but require evals by task slice.
Measure WER, entity error rate, punctuation, timestamp quality, toxicity/safety if relevant, p50/p95/p99 latency, memory, and cost. Quantization errors often show up unevenly across accents, noisy audio, rare words, and long contexts.
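One reason quantization errors land unevenly is visible in a toy round trip. The sketch below uses symmetric per-tensor int8 quantization, which is one simple scheme chosen for illustration; production stacks often use per-channel or group-wise scales instead, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.02, -1.27, 0.5, 0.003]
    q, scale = quantize_int8(weights)
    print(q, dequantize(q, scale))

    # A single outlier stretches the scale and wipes out the small value.
    q_outlier, scale_outlier = quantize_int8([0.02, 12.7])
    print(q_outlier, dequantize(q_outlier, scale_outlier))
```

The outlier case shows a small weight rounding to zero once the scale is set by a much larger value, which is the kind of effect that only slice-level evaluation will surface.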
Train a smaller model to match a larger teacher's outputs, logits, representations, or behavior on curated examples.
Distillation is useful when serving cost matters and the target domain is narrower than the teacher's full capability. Watch for inherited teacher mistakes, reduced uncertainty, and poor behavior outside the distilled distribution.
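Matching a teacher's logits is commonly expressed as a KL divergence between temperature-softened distributions. The sketch below shows that loss for a single example in plain Python. The temperature of 2.0 and the multiplication by temperature squared follow a widespread convention and are assumptions here, not settings from this module.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature**2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]
    print(distillation_kl(teacher, teacher))          # identical logits
    print(distillation_kl(teacher, [1.0, 4.0, 0.5]))  # student prefers the wrong class
```

Because the student is pulled toward whatever the teacher outputs, this loss transfers teacher mistakes as faithfully as teacher strengths, which is why curated examples and out-of-distribution checks still matter.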
Train a small number of parameters for domain adaptation while keeping most base weights frozen.
Adapters can be easier to store, swap, and roll back than full fine-tunes. In production, treat adapter identity as part of the model version and test interactions with quantization and cache layout.
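Two of the points above can be sketched briefly. The parameter count assumes a low-rank (LoRA-style) adapter, which is one common adapter design; the text does not say which adapter type is meant. The `ModelVersion` record and its field names are hypothetical, shown only to illustrate treating adapter identity as part of the version.

```python
from dataclasses import dataclass


def dense_update_params(d_in, d_out):
    """Parameters touched when fully fine-tuning one d_in x d_out weight matrix."""
    return d_in * d_out


def low_rank_adapter_params(d_in, d_out, rank):
    """Two factors, (d_in x rank) and (rank x d_out), stand in for the dense update."""
    return rank * (d_in + d_out)


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical version key: anything that changes outputs belongs in it."""
    base: str
    adapter: str
    quantization: str


if __name__ == "__main__":
    dense = dense_update_params(4096, 4096)
    adapter = low_rank_adapter_params(4096, 4096, rank=8)
    print(dense, adapter, adapter / dense)
    print(ModelVersion("base-7b", "medical-v3", "int8"))
```

Because the record is frozen and hashable, it can be used directly as a key for metrics, caches, and rollback targets, so two deployments that differ only in adapter never share a version.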
Lab
This small script estimates KV-cache memory. It is not a replacement for profiling, but it helps you reason about why long contexts, concurrency, and attention-head choices matter.
def kv_cache_gib(
    layers,
    kv_heads,
    head_dim,
    tokens,
    batch_size=1,
    bytes_per_value=2,
):
    # keys and values are both cached
    total_bytes = layers * 2 * kv_heads * head_dim * tokens * batch_size * bytes_per_value
    return total_bytes / (1024 ** 3)


if __name__ == "__main__":
    cases = [
        ("short chat", 32, 8, 128, 2048, 8, 2),
        ("long context", 32, 8, 128, 16384, 8, 2),
        ("larger batch", 32, 8, 128, 4096, 64, 2),
    ]
    for name, layers, kv_heads, head_dim, tokens, batch, bytes_per_value in cases:
        gib = kv_cache_gib(layers, kv_heads, head_dim, tokens, batch, bytes_per_value)
        print(f"{name}: {gib:.2f} GiB")
MQA and GQA reduce kv_heads. If a model has many query heads but only a few key/value heads, the cache is smaller than with full multi-head attention. This directly affects concurrency and whether long-context serving fits on the target hardware.
Test zero tokens, batch size one, large batch, fp16 versus int8 cache assumptions, MHA versus GQA head counts, and an expected hand-computed case. Also test that bad inputs such as negative tokens are rejected if this becomes a real utility.
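The test ideas above might look like the following. The block repeats the lab function so it runs on its own and adds input validation, which the original script does not have; raising `ValueError` for bad inputs is an assumption about how a real utility might behave.

```python
def kv_cache_gib(layers, kv_heads, head_dim, tokens, batch_size=1, bytes_per_value=2):
    """Lab estimator plus validation (the validation is an added assumption)."""
    args = (layers, kv_heads, head_dim, tokens, batch_size, bytes_per_value)
    if any(a < 0 for a in args):
        raise ValueError("all arguments must be non-negative")
    # keys and values are both cached
    total_bytes = layers * 2 * kv_heads * head_dim * tokens * batch_size * bytes_per_value
    return total_bytes / (1024 ** 3)


def test_zero_tokens():
    assert kv_cache_gib(32, 8, 128, 0) == 0.0


def test_hand_computed_case():
    # 32 * 2 * 8 * 128 * 2048 * 8 * 2 bytes = 2**31 bytes = 2 GiB
    assert kv_cache_gib(32, 8, 128, 2048, 8, 2) == 2.0


def test_batch_scales_linearly():
    assert kv_cache_gib(32, 8, 128, 4096, 64) == 64 * kv_cache_gib(32, 8, 128, 4096, 1)


def test_int8_cache_is_half_of_fp16():
    fp16 = kv_cache_gib(32, 8, 128, 2048, 8, bytes_per_value=2)
    int8 = kv_cache_gib(32, 8, 128, 2048, 8, bytes_per_value=1)
    assert int8 == fp16 / 2


def test_gqa_is_smaller_than_mha():
    mha = kv_cache_gib(32, 32, 128, 2048, 8)  # 32 KV heads
    gqa = kv_cache_gib(32, 8, 128, 2048, 8)   # 8 KV heads
    assert mha == 4 * gqa


def test_negative_tokens_rejected():
    try:
        kv_cache_gib(32, 8, 128, -1)
    except ValueError:
        return
    raise AssertionError("negative tokens should be rejected")


if __name__ == "__main__":
    for test in (test_zero_tokens, test_hand_computed_case, test_batch_scales_linearly,
                 test_int8_cache_is_half_of_fp16, test_gqa_is_smaller_than_mha,
                 test_negative_tokens_rejected):
        test()
    print("all tests passed")
```

The test functions follow the `test_` naming convention, so a runner such as pytest would collect them without changes.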
Interview Practice
You need to ship an ASR model on laptops. The fp16 model is accurate but too slow and memory-heavy. Design the quantization evaluation and rollout plan.
Define target hardware, latency, memory, battery, and accuracy requirements. Compare int8/int4 weight-only or activation-aware options. Evaluate WER/CER, entity error rate, timestamp accuracy, noisy speech, accents, long utterances, and silence/VAD behavior. Roll out behind a flag, keep fp16 fallback, collect privacy-safe aggregate metrics, and include rollback thresholds.
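Rollback thresholds are easier to enforce when they are checked per slice rather than on an aggregate. The sketch below flags slices whose WER regressed beyond an allowed absolute amount; the slice names, WER values, and the 0.01 threshold are all hypothetical.

```python
def regressed_slices(baseline_wer, candidate_wer, max_abs_regression):
    """Return {slice: delta} for slices where the candidate's WER rose too much."""
    failing = {}
    for slice_name, base in baseline_wer.items():
        if slice_name not in candidate_wer:
            raise KeyError(f"candidate is missing slice: {slice_name}")
        delta = candidate_wer[slice_name] - base
        if delta > max_abs_regression:
            failing[slice_name] = delta
    return failing


if __name__ == "__main__":
    # Hypothetical fp16 baseline versus int8 candidate.
    fp16 = {"clean": 0.050, "noisy": 0.120, "accented": 0.090}
    int8 = {"clean": 0.051, "noisy": 0.150, "accented": 0.092}
    failing = regressed_slices(fp16, int8, max_abs_regression=0.01)
    print("roll back" if failing else "proceed", failing)
```

In this made-up case the average moves little while the noisy slice regresses clearly, which is the situation a per-slice gate is meant to catch.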
A neural TTS service has good MOS but the monthly GPU bill is too high. What do you change without breaking quality?
Profile text normalization, acoustic model, vocoder, batching, and request mix. Consider caching repeated prompts, streaming first audio, smaller vocoder, distillation, quantization, CPU fallback for short/simple voices, autoscaling, and route-specific models. Keep MOS/intelligibility/speaker similarity tests, latency SLOs, and regression gates by voice and language.
Offline benchmarks improved, but the production endpoint shows no p95 improvement after enabling a faster attention kernel. Explain why.
The bottleneck may be queueing, pre/post-processing, network, decoding memory bandwidth, small sequence lengths, padding waste, batch scheduler behavior, CPU audio features, or downstream TTS. Verify that the kernel is actually active for the deployed shapes and dtype. Compare traces before and after, not only isolated model benchmarks.
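Padding waste is one of the listed causes that can be estimated from request logs alone. This sketch assumes every sequence in a batch is padded to the longest one; the function name and the example lengths are illustrative.

```python
def padding_waste(lengths):
    """Fraction of computed token positions that are padding in one padded batch."""
    if not lengths:
        raise ValueError("empty batch")
    longest = max(lengths)
    if longest == 0:
        return 0.0
    return 1 - sum(lengths) / (longest * len(lengths))


if __name__ == "__main__":
    print(padding_waste([10, 10, 100]))  # one long request dominates the batch
    print(padding_waste([96, 100, 98]))  # similar lengths, little waste
```

If most of a batch is padding, a faster attention kernel mostly speeds up work that should not have been done, so length bucketing or a different batch scheduler may matter more.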
Sources