Module 12

Efficient Transformers

Learn how modern audio and text models become fast enough to serve: IO-aware attention, KV-cache memory, grouped-query attention, quantization, speculative decoding, distillation, adapters, and production latency tradeoffs.

Mental Model

Efficiency Is A Stack, Not One Trick

Experienced engineers reason across algorithms, kernels, memory layout, serving queues, hardware, and quality gates. A faster model that misses domain terms, breaks timestamps, or cannot roll back is not production-ready.

Model Architecture

Attention variants, MQA/GQA, Conformer blocks, distillation, adapters, pruning, and smaller specialist models.

Runtime Kernels

FlashAttention, fused operations, quantized matmuls, prefill/decode split, and memory-aware batch scheduling.

Serving System

Continuous batching, KV-cache paging, admission control, fallbacks, autoscaling, observability, and rollback.

Question: What is the first efficiency question to ask?

Ask what dominates the user-visible SLO: prefill, decode, audio feature extraction, network, queueing, TTS first audio, or downstream calls. Optimize the measured bottleneck, not the component that is easiest to benchmark in isolation.

Attention Cost

From Quadratic Attention To Hardware-Aware Kernels

Standard self-attention builds an n x n score matrix per head, so compute and score-matrix memory grow quadratically with sequence length n. Audio models often face long frame sequences before subsampling. Text models face long context windows and expensive KV-cache memory during autoregressive decoding.

FlashAttention

Computes exact attention while reducing traffic to high-bandwidth memory (HBM). It tiles query/key/value blocks so they fit in on-chip SRAM, keeps running softmax statistics (row maximum and normalizer), and avoids materializing the full attention matrix.

Hidden answer: Why does it help?

Attention is often memory-bandwidth bound. FlashAttention saves time by changing how data moves through memory, not by changing the mathematical result. A strong answer mentions exactness, tiling, SRAM/HBM traffic, sequence length, and kernel support.
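
The exactness claim can be checked with a toy version of the running-softmax trick. This is a minimal pure-Python sketch for a single query row, not the real kernel: the function names and `block_size` are illustrative assumptions, and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math


def naive_attention_row(q, keys, values):
    """Reference: materialize every score for one query, then softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention_row(q, keys, values, block_size=2):
    """Visit key/value blocks one at a time, keeping only running statistics."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [-1.0, 0.5]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    print(naive_attention_row(q, keys, values))
    print(tiled_attention_row(q, keys, values))
```

Both functions should agree up to floating-point rounding. The tiled version never holds more than one block of scores at a time, which is the property a real kernel exploits to keep work in SRAM.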

MQA And GQA

Multi-query attention shares a single key/value head across all query heads. Grouped-query attention is the middle ground: the query heads are split into groups, and each group shares one key/value head.

Hidden answer: What tradeoff changes?

MQA/GQA reduce KV-cache memory and memory bandwidth during decoding. Quality can drop if sharing is too aggressive, especially for tasks that need fine-grained conditioning. They are most helpful when decode latency and cache size dominate.
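
A small sketch of how query heads map onto shared key/value heads. It assumes consecutive query heads form a group, which is a common convention but not the only possible layout; the function names are hypothetical.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the key/value head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_ratio(num_query_heads, num_kv_heads):
    """KV-cache size relative to full multi-head attention."""
    return num_kv_heads / num_query_heads


if __name__ == "__main__":
    # Assumed head counts: MHA (32 KV heads), GQA (8), MQA (1).
    for kv_heads in (32, 8, 1):
        mapping = [kv_head_for_query_head(h, 32, kv_heads) for h in range(32)]
        print(kv_heads, kv_cache_ratio(32, kv_heads), mapping[:8])
```

With 32 query heads, moving from 32 to 8 key/value heads cuts the cache to one quarter, and MQA cuts it to 1/32. Whether quality holds at that ratio is an evaluation question, not something this arithmetic answers.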

Checkpoint: Long Audio Context

You need to run an encoder over 30 seconds of audio represented at 100 frames per second, which is 3,000 frames before any subsampling. What efficiency choices do you consider before reaching for a bigger GPU?

Hidden answer: Strong checklist

Check frame rate, windowing, subsampling, streaming chunks, convolutional front-end, local attention, FlashAttention support, mixed precision, batch shape padding, and whether the model can emit partial states. Also ask whether the product needs all 30 seconds at once or can process incrementally with stable boundaries.
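
To make the frame-rate and subsampling items concrete, here is a rough estimate of the score-matrix memory for this checkpoint. The head count and the 4x subsampling factor are assumptions for illustration, and the estimate only applies when the scores are actually materialized (a FlashAttention-style kernel avoids that).

```python
def attention_score_bytes(frames, heads, bytes_per_value=2, batch_size=1):
    """Bytes for one layer's full score matrix if it is materialized."""
    return batch_size * heads * frames * frames * bytes_per_value


SECONDS = 30
FRAMES_PER_SECOND = 100
HEADS = 8        # assumed, not from the checkpoint
SUBSAMPLING = 4  # assumed convolutional front-end stride

full_frames = SECONDS * FRAMES_PER_SECOND
sub_frames = full_frames // SUBSAMPLING

if __name__ == "__main__":
    for label, frames in [("no subsampling", full_frames), ("4x subsampling", sub_frames)]:
        mib = attention_score_bytes(frames, HEADS) / (1024 ** 2)
        print(f"{label}: {frames} frames, {mib:.1f} MiB of scores per layer")
```

Because the cost is quadratic in frames, a 4x reduction in sequence length gives a 16x reduction in score memory, which is why front-end subsampling is usually checked before hardware.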

Generation

Decode Latency Is Usually Memory And Scheduling

KV Cache

Autoregressive transformers reuse previous keys and values so each new token attends to cached history instead of recomputing every prefix. The cache grows linearly with context length and with the number of active sequences, so it can become the main memory cost.

Hidden answer: What should you monitor?

Track prompt length, generated tokens, active sequences, cache bytes per request, eviction behavior, GPU memory fragmentation, batch size, time to first token, tokens per second, and p95/p99 latency by route and model version.

Speculative Decoding

A small draft model proposes several tokens, and the larger target model verifies them together in one forward pass. Each accepted draft token is one fewer sequential target-model pass.

Hidden answer: When does it fail to help?

It helps when the draft model is cheap and has a high acceptance rate. It may disappoint when the draft model is too inaccurate, verification overhead is high, batches are already saturated, or the task distribution differs from the draft model's strengths.
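
A back-of-the-envelope model of that tradeoff. It makes a strong simplifying assumption: each drafted token is accepted independently with the same probability, and the draft model costs a fixed fraction of a target pass. Real acceptance rates vary by position and by task, so treat this as a sketch for intuition rather than a predictor.

```python
def expected_tokens_per_target_pass(acceptance_rate, draft_len):
    """Expected tokens produced per target forward pass.

    Assumes i.i.d. acceptance; the target always contributes one token,
    either a correction or a bonus token after a fully accepted draft.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


def estimated_speedup(acceptance_rate, draft_len, draft_cost_ratio):
    """Rough speedup versus plain decoding.

    draft_cost_ratio is the cost of one draft step relative to one target pass.
    """
    tokens = expected_tokens_per_target_pass(acceptance_rate, draft_len)
    cost = draft_len * draft_cost_ratio + 1.0
    return tokens / cost


if __name__ == "__main__":
    # Hypothetical settings: a good cheap draft versus a poor expensive one.
    print(estimated_speedup(0.8, 4, 0.1))
    print(estimated_speedup(0.2, 4, 0.3))
```

Under these assumptions the second configuration comes out slower than plain decoding, which matches the failure modes listed above: low acceptance and a draft model that is not cheap enough.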

Production Prompt: Speech Assistant First Audio Is Too Slow

A local speech assistant takes 2.8 seconds before the user hears a reply. ASR partials arrive in 300 ms, but the LLM and TTS handoff feel sluggish. Diagnose the system.

Hidden answer: First-pass diagnosis

Split latency into endpointing, transcript stabilization, prompt assembly, LLM prefill, time to first token, token rate, TTS chunking, vocoder first audio, playback buffer, and UI/event-loop delay. Mitigations include earlier turn prediction, streaming partial prompts, shorter context, prompt caching, smaller or quantized model, speculative decoding, streaming TTS, response templates for common intents, and a rollback if a recent model change moved p99 beyond the SLO.
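
One way to start the split is a plain latency budget. The stage timings below are invented for illustration so that they sum to the 2.8 seconds in the prompt; in practice they would come from traces, and the stage names are only a suggested breakdown.

```python
def dominant_stages(stage_ms, top_n=3):
    """Return total latency and the top_n stages as (name, ms, share of total)."""
    if not stage_ms:
        raise ValueError("stage_ms must not be empty")
    total = sum(stage_ms.values())
    ranked = sorted(stage_ms.items(), key=lambda item: item[1], reverse=True)
    return total, [(name, ms, ms / total) for name, ms in ranked[:top_n]]


# Hypothetical measurements in milliseconds, NOT real data.
HYPOTHETICAL_TRACE = {
    "endpointing": 600,
    "transcript_stabilization": 250,
    "prompt_assembly": 50,
    "llm_prefill": 800,
    "first_sentence_decode": 400,
    "tts_first_chunk": 450,
    "playback_buffer": 200,
    "ui_event_loop": 50,
}

if __name__ == "__main__":
    total, top = dominant_stages(HYPOTHETICAL_TRACE)
    print(f"total: {total} ms")
    for name, ms, share in top:
        print(f"{name}: {ms} ms ({share:.0%})")
```

With a table like this, mitigations can be ranked by how much of the budget they can actually remove, rather than by how easy they are to try.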

Compression

Smaller Models Need Quality Guardrails

Quantization

Lower-precision weights and activations reduce memory and can improve throughput, but they require evals by task slice.

Hidden answer

Measure WER, entity error rate, punctuation, timestamp quality, toxicity/safety if relevant, p50/p95/p99 latency, memory, and cost. Quantization errors often show up unevenly across accents, noisy audio, rare words, and long contexts.
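
A toy illustration of why quantization error is uneven. This sketch uses symmetric per-tensor int8 quantization with a single scale, which is one simple scheme among many; production toolchains typically use per-channel or per-group scales. The weight values are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: one scale, integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


def max_roundtrip_error(values, check=None):
    """Worst absolute error after quantize -> dequantize over the first `check` values."""
    quantized, scale = quantize_int8(values)
    restored = dequantize(quantized, scale)
    n = len(values) if check is None else check
    return max(abs(v - r) for v, r in zip(values[:n], restored[:n]))


# Made-up weights: a tensor of small values, then the same tensor with one outlier.
small_weights = [0.010, -0.020, 0.030, 0.015]
with_outlier = small_weights + [8.0]

if __name__ == "__main__":
    print(max_roundtrip_error(small_weights))
    print(max_roundtrip_error(with_outlier, check=len(small_weights)))
```

A single outlier stretches the scale, so the small weights all round to zero. The same effect at model scale is one reason aggregate metrics can look fine while specific slices regress.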

Distillation

Train a smaller model to match a larger teacher's outputs, logits, representations, or behavior on curated examples.

Hidden answer

Distillation is useful when serving cost matters and the target domain is narrower than the teacher's full capability. Watch for inherited teacher mistakes, reduced uncertainty, and poor behavior outside the distilled distribution.
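
For the logit-matching variant, a common formulation is a temperature-softened KL divergence between teacher and student distributions, scaled by the square of the temperature. The sketch below is pure Python for a single example; the temperature of 2.0 and the logits are arbitrary assumptions, and real training code would combine this with a hard-label loss.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, times temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]  # arbitrary example logits
    print(distillation_loss([4.0, 1.0, 0.5], teacher))  # identical student
    print(distillation_loss([1.0, 3.0, 0.5], teacher))  # disagreeing student
```

The loss only rewards matching the teacher, which is why inherited teacher mistakes need a separate check against ground truth.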

Adapters And LoRA

Train a small number of parameters for domain adaptation while keeping most base weights frozen.

Hidden answer

Adapters can be easier to store, swap, and roll back than full fine-tunes. In production, treat adapter identity as part of the model version and test interactions with quantization and cache layout.
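
A quick parameter count shows why adapters are cheap to store and swap. For LoRA, a rank-r update to a d_out x d_in weight matrix is the product of a d_out x r matrix and an r x d_in matrix. The layer size and rank below are assumed values for illustration.

```python
def full_param_count(d_in, d_out):
    """Parameters in one dense weight matrix (bias ignored)."""
    return d_in * d_out


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for a rank-`rank` LoRA update to that matrix."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 8  # assumed sizes
    full = full_param_count(d_in, d_out)
    lora = lora_param_count(d_in, d_out, rank)
    print(f"full: {full}, lora: {lora}, fraction: {lora / full:.4%}")
```

The adapter is a small fraction of the base matrix, but it still changes model behavior, so it needs the same versioning and regression gates as any other weight change.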

Lab

Estimate KV-Cache Memory Before You Serve

This small script estimates KV-cache memory. It is not a replacement for profiling, but it helps you reason about why long contexts, concurrency, and attention-head choices matter.

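def kv_cache_gib(
    layers,
    kv_heads,
    head_dim,
    tokens,
    batch_size=1,
    bytes_per_value=2,
):
    """Estimate KV-cache size in GiB.

    kv_heads counts key/value heads, not query heads.
    tokens is the cached length per sequence; bytes_per_value is 2 for fp16/bf16.
    """
    # The factor of 2 covers keys and values, which are both cached.
    total_bytes = layers * 2 * kv_heads * head_dim * tokens * batch_size * bytes_per_value
    return total_bytes / (1024 ** 3)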


if __name__ == "__main__":
    cases = [
        ("short chat", 32, 8, 128, 2048, 8, 2),
        ("long context", 32, 8, 128, 16384, 8, 2),
        ("larger batch", 32, 8, 128, 4096, 64, 2),
    ]
    for name, layers, kv_heads, head_dim, tokens, batch, bytes_per_value in cases:
        gib = kv_cache_gib(layers, kv_heads, head_dim, tokens, batch, bytes_per_value)
        print(f"{name}: {gib:.2f} GiB")

Question: How do MQA and GQA appear in this estimate?

They reduce kv_heads. If a model has many query heads but only a few key/value heads, the cache is smaller than full multi-head attention. This directly affects concurrency and whether long-context serving fits on the target hardware.

Question: What test cases should you add around this estimate?

Test zero tokens, batch size one, large batch, fp16 versus int8 cache assumptions, MHA versus GQA head counts, and an expected hand-computed case. Also test that bad inputs such as negative tokens are rejected if this becomes a real utility.
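
The test list above could look like the sketch below. To keep it self-contained it defines `kv_cache_gib_checked`, a hypothetical variant of the lab function that uses the same formula and adds input validation; the validation policy (reject negatives, allow zero) is an assumption.

```python
def kv_cache_gib_checked(layers, kv_heads, head_dim, tokens, batch_size=1, bytes_per_value=2):
    """Same formula as the lab estimate, plus rejection of negative inputs."""
    fields = {
        "layers": layers,
        "kv_heads": kv_heads,
        "head_dim": head_dim,
        "tokens": tokens,
        "batch_size": batch_size,
        "bytes_per_value": bytes_per_value,
    }
    for name, value in fields.items():
        if value < 0:
            raise ValueError(f"{name} must be non-negative, got {value}")
    total_bytes = layers * 2 * kv_heads * head_dim * tokens * batch_size * bytes_per_value
    return total_bytes / (1024 ** 3)


def test_zero_tokens():
    assert kv_cache_gib_checked(32, 8, 128, 0) == 0.0


def test_hand_computed_case():
    # 32 * 2 * 8 * 128 * 2048 * 8 * 2 bytes = 2**31 bytes = 2 GiB
    assert kv_cache_gib_checked(32, 8, 128, 2048, 8, 2) == 2.0


def test_int8_cache_is_half_of_fp16():
    fp16 = kv_cache_gib_checked(32, 8, 128, 4096, 1, 2)
    int8 = kv_cache_gib_checked(32, 8, 128, 4096, 1, 1)
    assert int8 == fp16 / 2


def test_gqa_is_smaller_than_mha():
    mha = kv_cache_gib_checked(32, 32, 128, 4096)
    gqa = kv_cache_gib_checked(32, 8, 128, 4096)
    assert gqa == mha / 4


def test_negative_tokens_rejected():
    try:
        kv_cache_gib_checked(32, 8, 128, -1)
    except ValueError:
        return
    raise AssertionError("negative tokens should raise ValueError")


def run_all():
    """Run every test; a test runner such as pytest would discover them instead."""
    test_zero_tokens()
    test_hand_computed_case()
    test_int8_cache_is_half_of_fp16()
    test_gqa_is_smaller_than_mha()
    test_negative_tokens_rejected()


if __name__ == "__main__":
    run_all()
    print("all checks passed")
```

The hand-computed case matches the "short chat" row of the lab script, so it doubles as a regression check if the formula is ever refactored.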

Interview Practice

Advanced Prompts

Prompt 1: Quantize An ASR Model For Edge Deployment

You need to ship an ASR model on laptops. The fp16 model is accurate but too slow and memory-heavy. Design the quantization evaluation and rollout plan.

Hidden answer: What a strong plan includes

Define target hardware, latency, memory, battery, and accuracy requirements. Compare int8/int4 weight-only or activation-aware options. Evaluate WER/CER, entity error rate, timestamp accuracy, noisy speech, accents, long utterances, and silence/VAD behavior. Roll out behind a flag, keep fp16 fallback, collect privacy-safe aggregate metrics, and include rollback thresholds.

Prompt 2: Reduce GPU Cost For A TTS API

A neural TTS service has good MOS but the monthly GPU bill is too high. What do you change without breaking quality?

Hidden answer: Cost-quality tradeoffs

Profile text normalization, acoustic model, vocoder, batching, and request mix. Consider caching repeated prompts, streaming first audio, smaller vocoder, distillation, quantization, CPU fallback for short/simple voices, autoscaling, and route-specific models. Keep MOS/intelligibility/speaker similarity tests, latency SLOs, and regression gates by voice and language.

Prompt 3: FlashAttention Did Not Improve Production Latency

Offline benchmarks improved, but the production endpoint shows no p95 improvement after enabling a faster attention kernel. Explain why.

Hidden answer: Likely causes

The bottleneck may be queueing, pre/post-processing, network, decoding memory bandwidth, small sequence lengths, padding waste, batch scheduler behavior, CPU audio features, or downstream TTS. Verify that the kernel is actually active for the deployed shapes and dtype. Compare traces before and after, not only isolated model benchmarks.
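
Amdahl's law gives a quick sanity check on how much end-to-end gain to expect. The sketch assumes the request time splits cleanly into an attention share and everything else, which ignores overlap and queueing effects; the 10% share is a made-up example, not a measurement.

```python
def end_to_end_speedup(attention_fraction, kernel_speedup):
    """Overall speedup when only the attention share of request time gets faster."""
    if not 0.0 <= attention_fraction <= 1.0:
        raise ValueError("attention_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    remaining = (1.0 - attention_fraction) + attention_fraction / kernel_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical: attention is 10% of p95 request time and the kernel is 2x faster.
    print(f"{end_to_end_speedup(0.10, 2.0):.3f}x")
```

If attention is a small share of the traced request time, even a large kernel speedup disappears into the noise of p95, which is why the comparison has to use production traces.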

Sources

Reading List