Interview Track
A practical exam track for advanced machine learning and AI engineering roles. The goal is not memorized answers. The goal is to reason through ambiguous research and production problems with enough detail to earn trust on a real audio and speech ML team.
Current Hiring Signal
A recent review of online interview material on ML system design, MLOps and deployment, AI engineering, and speech recognition shows a common pattern: advanced loops test judgment under constraints. Strong candidates can move from problem framing to data, modeling, evaluation, serving, monitoring, rollback, and cost.
Explain architectures, losses, optimization limits, papers, ablations, and why a result should or should not transfer to speech.
Design reliable pipelines, model registries, rollout plans, observability, drift response, and rollback paths.
Choose between batch and online inference, CPU/GPU placement, batching, quantization, caching, and graceful degradation.
Exam Format
Production ML work is mostly choosing and operating the right system under imperfect constraints. Model architecture is only one part. The candidate must also handle data quality, evaluation, serving latency, monitoring, stakeholder metrics, privacy, cost, and rollback.
System Design
Build a streaming ASR service for a call-center product. It must support 5,000 concurrent calls, partial transcripts under 500 ms, final transcripts with word timestamps, domain vocabulary updates, and strict privacy controls.
Start with requirements: languages, sample rate, channels, latency, concurrency, retention, compliance, and accuracy targets. Propose VAD, chunking, streaming encoder or RNN-T/Transducer-style model, endpointing, partial/final transcript reconciliation, custom vocabulary or contextual biasing, and a WER/CER evaluation set split by accent, noise, domain terms, and call type. For serving, discuss autoscaling workers, GPU batching, CPU preprocessing, backpressure, per-stream state, trace IDs, and a fallback path. For operations, include drift checks, latency percentiles, audio quality metrics, privacy-safe logging, redaction, rollout, and rollback.
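A back-of-the-envelope check helps ground the requirements before any architecture is drawn. The sketch below adds up a partial-transcript latency budget and estimates worker count for 5,000 concurrent calls. Every number in it (chunk size, stage timings, 200 streams per worker, 30% headroom) is an assumption chosen for illustration, not a measurement.

```python
import math
from dataclasses import dataclass


@dataclass
class StreamingBudget:
    """Hypothetical per-stage timings, in milliseconds, for one partial result."""
    chunk_ms: float      # audio buffered before each encoder step
    queue_ms: float      # time spent waiting for a batch slot
    inference_ms: float  # model forward pass plus decoding
    network_ms: float    # round trip to the client

    def partial_latency_ms(self) -> float:
        return self.chunk_ms + self.queue_ms + self.inference_ms + self.network_ms


def workers_needed(concurrent_streams: int, streams_per_worker: int,
                   headroom: float = 0.3) -> int:
    """Workers required when a fraction of each worker is kept free for bursts."""
    usable = streams_per_worker * (1.0 - headroom)
    return math.ceil(concurrent_streams / usable)


budget = StreamingBudget(chunk_ms=160, queue_ms=40, inference_ms=120, network_ms=60)
print("partial latency (ms):", budget.partial_latency_ms())
print("meets 500 ms target:", budget.partial_latency_ms() < 500)
print("workers for 5,000 calls:", workers_needed(5000, streams_per_worker=200))
```

In an interview, the useful part is the structure of the budget rather than the numbers: it shows which stage to attack first when the target is missed.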
Build a voice assistant that listens, reasons, and responds by voice. It should feel conversational on a laptop, preserve user privacy, and support barge-in.
A cascaded ASR -> LLM -> TTS design is easier to debug, easier to evaluate, and easier to moderate, but it adds latency at every boundary and loses some prosody. A direct speech-to-speech model can preserve paralinguistic cues and reduce intermediate text bottlenecks, but is harder to control, inspect, and operate. A strong answer should create a latency budget for VAD, ASR partials, LLM first token, TTS first audio, playback, and interruption. It should also include state management, echo control, local model options, privacy rules, transcript UI, failure recovery, and measurements for both quality and responsiveness.
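The state management and barge-in requirements can be sketched as a small turn-taking state machine. The event names (`speech_start`, `speech_end`, `first_audio`, `playback_done`) and the `TurnManager` class are hypothetical; a real system would also cancel in-flight LLM and TTS work and flush the playback buffer when a turn is interrupted.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LISTENING = auto()
    THINKING = auto()   # ASR final received, waiting on LLM and TTS
    SPEAKING = auto()   # assistant audio is playing


class TurnManager:
    """Minimal turn-taking logic with barge-in (illustrative only)."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.cancelled_turns = 0

    def on_event(self, event: str) -> State:
        if event == "speech_start":
            # Barge-in: user speech always wins over a pending or playing reply.
            if self.state in (State.THINKING, State.SPEAKING):
                self.cancelled_turns += 1
            self.state = State.LISTENING
        elif event == "speech_end" and self.state is State.LISTENING:
            self.state = State.THINKING
        elif event == "first_audio" and self.state is State.THINKING:
            self.state = State.SPEAKING
        elif event == "playback_done" and self.state is State.SPEAKING:
            self.state = State.IDLE
        # Any other event is ignored in the current state.
        return self.state


manager = TurnManager()
for event in ["speech_start", "speech_end", "first_audio", "speech_start"]:
    manager.on_event(event)
print(manager.state, manager.cancelled_turns)
```

Keeping the transitions explicit makes echo problems easier to diagnose: if the assistant's own audio triggers `speech_start`, the cancelled-turn counter rises without any real user interruption.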
Multiple teams need to serve ASR, TTS, embedding, and audio classification models. Some are batch jobs. Some are real-time APIs. Some require GPUs. Design the platform.
Include a model registry with weights, code, tokenizer/feature config, environment, training data snapshot, eval report, and owner. Separate batch and online serving paths. Use a standard inference interface, request validation, model-specific preprocessing, model servers for GPU workloads, and direct embedded inference only for tiny low-latency models. Add canary rollout, shadow traffic, traffic splitting, rollback, dashboards, cost attribution, capacity planning, and incident playbooks.
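One way to make the registry contents concrete is a record type plus a completeness check that runs before a version can be registered. The field names and the `registry://` and `git:` values below are hypothetical placeholders that mirror the list above; they are not the schema of any particular registry product.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ModelVersion:
    """Hypothetical registry record: everything needed to reproduce a model."""
    name: str
    weights_uri: Optional[str] = None
    code_revision: Optional[str] = None
    feature_config: Optional[str] = None   # tokenizer or feature-extraction config
    environment: Optional[str] = None      # container image or lockfile reference
    data_snapshot: Optional[str] = None
    eval_report: Optional[str] = None
    owner: Optional[str] = None


def missing_fields(version: ModelVersion) -> List[str]:
    """Names of required fields that are unset or empty."""
    return [f.name for f in fields(version) if getattr(version, f.name) in (None, "")]


candidate = ModelVersion(
    name="asr-streaming",
    weights_uri="registry://example/asr-streaming/weights",  # placeholder value
    code_revision="git:abc123",                               # placeholder value
    owner="speech-team",
)
print(missing_fields(candidate))
```

Rejecting incomplete records at registration time is cheaper than discovering during an incident that nobody knows which data snapshot produced the deployed weights.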
R&D
Explain where each ASR objective fits, how alignment is learned, and why streaming constraints change the choice.
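CTC assumes monotonic alignment and conditional independence between output labels given acoustic frames. It marginalizes over all frame-level alignments, using a blank symbol to separate repeated labels, so no forced alignment is needed. It is simple and strong for ASR but may need a language model or rescoring to compensate for the independence assumption. Attention seq2seq learns a soft alignment and is flexible and powerful, but it usually attends over the whole utterance, which makes it less natural for strict streaming. RNN-T/Transducer models add a prediction network conditioned on previously emitted tokens and combine it with the acoustic encoder in a joint network, which keeps alignment monotonic while dropping CTC's independence assumption. That makes them strong for streaming ASR at the cost of more complex training and decoding.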
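The CTC alignment rule is easiest to see in the collapse step used by greedy decoding: merge consecutive repeated labels, then remove blanks. The sketch below works on per-frame label strings rather than model outputs, and the `_` blank symbol is an arbitrary choice for readability.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = "_"  # arbitrary blank symbol for this illustration


def ctc_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> List[str]:
    """Map a frame-level CTC path to its label sequence."""
    merged = [label for label, _ in groupby(frame_labels)]  # merge repeats
    return [label for label in merged if label != blank]    # drop blanks


# The blank between the two "l" runs is what allows a doubled letter.
print("".join(ctc_collapse("hh_e_ll_lo")))
# Without a blank, adjacent repeats merge into one label.
print("".join(ctc_collapse("hhelllo")))
```

Many different frame paths collapse to the same output, which is why the CTC loss sums over all of them instead of requiring a single frame-level alignment.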
Explain why combining convolution and self-attention is useful for speech.
Speech has local acoustic structure and long-range linguistic context. Convolution captures local patterns such as phonetic transitions efficiently, while attention connects distant frames. This is a better inductive bias than pure attention for many ASR settings, especially when data and latency are constrained.
Decide whether to fine-tune, prompt, add vocabulary biasing, or build post-processing.
Start with error analysis. Fine-tune when the failures are stable, data is available, and model adaptation is allowed. Prefer prompting, decoding constraints, vocabulary biasing, or post-processing when errors are narrow and can be corrected without risking global regression. Always keep a held-out domain eval, track WER by slice, and compare latency and memory changes.
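When errors are narrow, a post-processing rule is the least invasive option. The sketch below applies a small lexicon of known misrecognitions with word-boundary matching, so unrelated text is left alone. The lexicon entries are invented examples, and a real deployment would still run the held-out eval to confirm that the rule causes no regressions.

```python
import re
from typing import Callable, Dict


def build_corrector(lexicon: Dict[str, str]) -> Callable[[str], str]:
    """Return a function that rewrites known misrecognitions.

    Keys must be lowercase; matching is case-insensitive and whole-word only.
    """
    # Longest keys first so multi-word entries win over their substrings.
    keys = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )

    def correct(text: str) -> str:
        return pattern.sub(lambda m: lexicon[m.group(0).lower()], text)

    return correct


# Hypothetical domain lexicon built from error analysis.
correct = build_corrector({"my sequel": "MySQL", "cube control": "kubectl"})
print(correct("restart my sequel with Cube Control"))
print(correct("the sequel was better"))  # no whole-phrase match, left unchanged
```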
Explain why speech-to-speech systems use discrete audio tokens and what can go wrong.
Codec tokens compress the raw waveform into a much shorter discrete sequence that generative models can handle. They can preserve speaker, prosody, and acoustic detail better than text-only bottlenecks. Failure modes include bitrate artifacts, speaker leakage, unstable prosody, alignment drift, and safety/control difficulty.
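The sequence-length argument can be checked with simple arithmetic. The values below (24 kHz audio, 75 codec frames per second, 8 codebooks of 1,024 entries each) are assumed for illustration; actual codecs differ in all four.

```python
import math

sample_rate_hz = 24_000   # assumed waveform sample rate
frame_rate_hz = 75        # assumed codec frames per second
num_codebooks = 8         # assumed residual quantizer depth
codebook_size = 1024      # assumed entries per codebook

tokens_per_second = frame_rate_hz * num_codebooks
length_reduction = sample_rate_hz / tokens_per_second
bits_per_second = tokens_per_second * math.log2(codebook_size)

print("tokens per second:", tokens_per_second)
print("sequence length reduction vs samples:", length_reduction)
print("bitrate (kbps):", bits_per_second / 1000)
```

Even after this reduction, ten seconds of audio is thousands of tokens under these assumptions, which is one reason long-range prosody and alignment can drift in generation.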
Production
A 500 MB transformer endpoint now has p99 latency over 2 seconds during peak traffic. Accuracy is slightly better than the old model. What do you do?
First isolate where time is spent: request queue, preprocessing, model execution, post-processing, serialization, network, cold starts, or downstream calls. Compare old and new model traces. Check batch size, GPU utilization, memory pressure, context length, and autoscaling behavior. Mitigations include rollback, traffic reduction, canary pause, dynamic batching, quantization, ONNX or TensorRT-style optimization, shorter inputs, cache reuse, and timeout-aware graceful degradation. Do not defend accuracy if the product cannot meet latency SLOs.
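Isolating where time is spent usually starts with per-stage tail latency computed from traces. The sketch below assumes each trace is a dictionary of stage name to milliseconds, which is a hypothetical format; the percentile uses the nearest-rank definition.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def stage_p99(traces: List[Dict[str, float]]) -> Dict[str, float]:
    """p99 latency per stage across all traces."""
    by_stage = defaultdict(list)
    for trace in traces:
        for stage, ms in trace.items():
            by_stage[stage].append(ms)
    return {stage: percentile(ms_values, 99) for stage, ms_values in by_stage.items()}


# Synthetic traces: queue time grows under load, model time stays flat.
traces = [{"queue": float(i), "model": 50.0, "postprocess": 5.0} for i in range(1, 101)]
p99 = stage_p99(traces)
worst_stage = max(p99, key=p99.get)
print(p99, "-> investigate:", worst_stage)
```

In this synthetic case the model is not the bottleneck at all, which is the point of measuring before reaching for quantization.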
An ASR model improved from 10.2% to 8.7% WER offline, but customer support agents say the transcripts are less useful. Explain possible causes and what you measure next.
Offline WER may hide domain slices: names, product terms, accents, noise, overlapping speech, punctuation, timestamps, or diarization. Production utility may depend on entity accuracy, latency, partial stability, readability, and downstream summarization. Check train-test leakage, distribution shift, label quality, decoding changes, normalization rules, and whether the eval set matches current traffic. Add slice metrics and human review on real privacy-safe samples.
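Slice metrics are straightforward to compute once each eval sample carries a slice tag. The sketch below computes word error rate per slice with a word-level edit distance. The slice names and sentences are invented, and text normalization is omitted even though it matters in practice.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer_by_slice(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """samples are (slice_name, reference, hypothesis); references must be non-empty."""
    errors, words = defaultdict(int), defaultdict(int)
    for slice_name, ref, hyp in samples:
        ref_words, hyp_words = ref.split(), hyp.split()
        errors[slice_name] += edit_distance(ref_words, hyp_words)
        words[slice_name] += len(ref_words)
    return {name: errors[name] / words[name] for name in errors}


samples = [
    ("clean", "please reset my password", "please reset my password"),
    ("names", "call doctor nguyen today", "call doctor win today"),
]
print(wer_by_slice(samples))
```

Here the pooled WER is 12.5%, yet every error falls on the name slice, which is exactly the pattern an aggregate number hides.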
Move a manually trained audio classifier from notebooks to a production pipeline.
ML CI/CD must validate code, data, and model artifacts. Include data schema checks, feature extraction parity, dataset versioning, reproducible environments, training jobs, baseline comparison, slice metrics, bias/fairness checks if relevant, model registry promotion gates, staged deployment, rollback, and monitoring. A model version is not just weights; it includes code, feature config, dependencies, data snapshot, hyperparameters, and eval report.
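A registry promotion gate can be as simple as a function that compares candidate and baseline error rates per slice and blocks promotion on any regression beyond a tolerance. The slice names, numbers, and the 0.5-point tolerance are assumptions for illustration.

```python
from typing import Dict, List


def promotion_gate(baseline: Dict[str, float], candidate: Dict[str, float],
                   max_regression: float = 0.005) -> List[str]:
    """Return reasons to block promotion; an empty list means the gate passes.

    Metrics are error rates, so lower is better.
    """
    failures = []
    for slice_name, base_err in baseline.items():
        cand_err = candidate.get(slice_name)
        if cand_err is None:
            failures.append(f"{slice_name}: missing from candidate report")
        elif cand_err - base_err > max_regression:
            failures.append(
                f"{slice_name}: error rose from {base_err:.3f} to {cand_err:.3f}"
            )
    return failures


baseline = {"overall": 0.102, "noisy": 0.180}
candidate = {"overall": 0.087, "noisy": 0.230}
print(promotion_gate(baseline, candidate))
```

The candidate here wins on the overall metric and still fails the gate, which keeps an aggregate improvement from masking a slice regression.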
Coding
Coding rounds often test whether you can write clear ML code, find silent bugs, and design guardrails. These exercises should be added to the course labs as executable notebooks or scripts.
# Exercise: find the bug before this ships.
import torch
import torch.nn.functional as F


def training_step(model, batch):
    x, y = batch["features"], batch["labels"]
    logits = model(x)                 # [batch, classes]
    probs = F.softmax(logits)         # bug: dim is missing
    loss = F.cross_entropy(probs, y)
    return loss
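Cross entropy expects raw logits, not softmax probabilities. F.cross_entropy applies log-softmax internally, so passing probabilities applies softmax twice, which compresses the outputs, weakens gradients, and keeps the loss from approaching zero. The softmax call also omits the dimension argument. The correct code is usually loss = F.cross_entropy(logits, y). A strong answer also says how to catch this: unit tests on shape and loss behavior, overfit-one-batch tests, code review checklists, and metric sanity checks against a known baseline.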
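A loss-behavior test for this bug can be sketched without PyTorch. The code below reimplements per-sample softmax cross entropy in plain Python and compares the correct call with the double-softmax version on a confidently correct prediction. The logits are made up; the floor value follows from the fact that probabilities lie in [0, 1].

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], label: int) -> float:
    """Per-sample cross entropy: log-softmax of the inputs, then negative log-likelihood."""
    return -math.log(softmax(logits)[label])


logits = [10.0, 0.0, 0.0]  # model is confident and correct for label 0
correct_loss = cross_entropy(logits, 0)
buggy_loss = cross_entropy(softmax(logits), 0)  # softmax applied twice

# With 3 classes, inputs in [0, 1] can never push the loss below log(1 + 2/e).
floor = math.log(1 + 2 / math.e)
print(f"correct={correct_loss:.5f} buggy={buggy_loss:.5f} floor={floor:.5f}")
```

The same idea explains why an overfit-one-batch test catches the bug: the training loss plateaus well above zero no matter how long the model trains.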
Audio Domain
Rubric
Asks clarifying questions, creates an end-to-end design, picks metrics by product goal, and names failure modes.
Adds rollout, rollback, cost, monitoring, data lineage, privacy, capacity planning, and migration strategy.
Frames tradeoffs across teams, makes reversible decisions, identifies organizational risks, and creates a path from prototype to platform.
Sources Used For This Track