Advanced ML Exam Drills For Audio-Text Systems

Practice Loop

Answer Like An Owner

For each prompt, spend five minutes writing assumptions, ten minutes designing, five minutes on failure modes, and five minutes on metrics and rollout. Then open the hidden answer and compare.

State constraints: latency, quality, privacy, data retention, cost, traffic, and failure tolerance.
Name the model path: ASR, TTS, LLM, embedding, reranker, codec, or direct speech model.
Define evaluation: offline slices, online SLOs, human review, and regression budgets.
Plan operations: deployment, CI/CD, monitoring, alerts, rollback, and incident ownership.
Explain tradeoffs: what you deliberately did not optimize and why.

Question: What should every strong answer include even when the prompt is vague?

It should include explicit assumptions, measurable launch gates, privacy-safe observability, a rollback path, cost and latency reasoning, and a plan for learning from production failures.
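One way to make "measurable launch gates" and "a rollback path" concrete is a small gate check over canary metrics. The sketch below is illustrative only: the Gate class, the metric names, and the thresholds are assumptions, not values from any particular system. A missing metric fails closed, so a broken dashboard cannot silently promote a release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str             # hypothetical metric name
    threshold: float        # illustrative threshold, not a recommendation
    higher_is_better: bool


def evaluate_gates(gates, canary_metrics):
    """Return the list of failed gate names; an empty list means promote."""
    failures = []
    for gate in gates:
        value = canary_metrics.get(gate.metric)
        if value is None:
            # Fail closed: a metric that was never reported blocks the launch.
            failures.append(gate.metric)
        elif gate.higher_is_better and value < gate.threshold:
            failures.append(gate.metric)
        elif not gate.higher_is_better and value > gate.threshold:
            failures.append(gate.metric)
    return failures


def decide(gates, canary_metrics):
    """Map gate results to a rollout action."""
    return "promote" if not evaluate_gates(gates, canary_metrics) else "rollback"


gates = [
    Gate("wer", 0.12, higher_is_better=False),
    Gate("p95_latency_ms", 800.0, higher_is_better=False),
    Gate("grounded_answer_rate", 0.90, higher_is_better=True),
]
canary = {"wer": 0.10, "p95_latency_ms": 950.0, "grounded_answer_rate": 0.93}
print(evaluate_gates(gates, canary), decide(gates, canary))
```

In an answer, the point is less the code than the habit: every gate has a named metric, a threshold, a direction, and a default action when data is missing.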

Timed Rounds

Interview And Exam Prompts

Round 1: Multilingual Streaming ASR Launch

You are launching streaming ASR for support calls in English, Spanish, and code-switched conversations. Design the model, evaluation, rollout, and monitoring plan.

Hidden answer: strong outline

Clarify target latency, final-transcript word error rate (WER), entity error rate, language mix, retention rules, and traffic peaks. Use language ID or multilingual ASR with voice activity detection (VAD), streaming partials, punctuation restoration, and domain contextual biasing for product names. Do not route solely on a single early language-ID guess because code-switched calls can change language mid-utterance. Evaluate by language, accent, noise, code-switch, named entities, partial churn, endpoint delay, and correction rate. For high-risk attributes such as accent or dialect, slice only on approved aggregate tags or tags collected with consent. Roll out by tenant, language, and call type with feature flags and model-version rollback.
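Evaluating by slice rather than in aggregate can be sketched as a per-slice word error rate. This is a minimal illustration, assuming lowercase whitespace tokenization and hypothetical slice tags ("en", "es", "code_switch"); a real pipeline would apply the team's own text normalization before scoring.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists (sub, ins, del cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_by_slice(samples):
    """samples: iterable of (slice_tag, reference_text, hypothesis_text)."""
    errors = defaultdict(int)
    ref_words = defaultdict(int)
    for tag, ref, hyp in samples:
        ref_tokens = ref.lower().split()
        hyp_tokens = hyp.lower().split()
        errors[tag] += word_edit_distance(ref_tokens, hyp_tokens)
        ref_words[tag] += len(ref_tokens)
    # Pool errors over each slice instead of averaging per-utterance WER.
    return {tag: errors[tag] / ref_words[tag] for tag in ref_words if ref_words[tag]}


samples = [
    ("en", "reset my router please", "reset my router please"),
    ("es", "quiero cambiar mi plan", "quiero cambiar el plan"),
    ("code_switch", "necesito un refund hoy", "necesito un reef and hoy"),
]
print(wer_by_slice(samples))
```

In this toy data the code-switched slice is the worst, which an aggregate WER over all three utterances would partly hide.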

Round 2: GPU Serving Platform Capacity

A shared inference platform hosts embeddings, a small LLM, TTS, and batch ASR. Interactive requests are missing p95 latency. How do you redesign scheduling and capacity?

Hidden answer: scheduling and tradeoffs

Split interactive and batch pools, then add request deadlines, admission control, priority queues, warm pools, per-model quotas, and autoscaling from queue age rather than only GPU utilization. Track p50/p95/p99 latency, time in queue, throughput (tokens per second for the LLM, audio seconds processed per second for ASR and TTS), cold starts, error budget burn, and cost per successful request. Continuous batching helps LLM throughput but can hurt first-token latency if queueing is uncontrolled.
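Deadline-aware admission, priority ordering, and a queue-age scaling signal can be sketched together in a few lines. Everything here is a simplifying assumption made for illustration: a single worker, a fixed estimated service time, two priority levels (0 for interactive, 1 for batch), and the hypothetical names DeadlineQueue and desired_replicas.

```python
import heapq
import itertools
import math
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    priority: int                      # 0 = interactive, 1 = batch
    seq: int                           # tie-breaker keeps FIFO order per priority
    enqueued_at: float = field(compare=False)
    deadline: float = field(compare=False)
    name: str = field(compare=False)


class DeadlineQueue:
    def __init__(self, est_service_s):
        self.est_service_s = est_service_s
        self._heap = []
        self._seq = itertools.count()

    def admit(self, name, priority, now, deadline):
        """Reject up front if the request cannot plausibly meet its deadline."""
        # Only queued work at the same or higher priority runs ahead of it.
        ahead = sum(1 for r in self._heap if r.priority <= priority)
        if now + (ahead + 1) * self.est_service_s > deadline:
            return False
        heapq.heappush(self._heap, Request(priority, next(self._seq), now, deadline, name))
        return True

    def pop(self, now):
        """Return the next request that can still finish in time, dropping expired ones."""
        while self._heap:
            req = heapq.heappop(self._heap)
            if now + self.est_service_s <= req.deadline:
                return req
        return None

    def oldest_age(self, now):
        """Age of the oldest queued request, used as the scaling signal."""
        if not self._heap:
            return 0.0
        return now - min(r.enqueued_at for r in self._heap)


def desired_replicas(current, oldest_age_s, target_age_s, max_replicas):
    """Scale out in proportion to how far queue age exceeds its target."""
    if oldest_age_s <= target_age_s:
        return current
    return min(max_replicas, math.ceil(current * oldest_age_s / target_age_s))
```

Rejecting at admission is a deliberate tradeoff: a fast failure lets the caller fall back or retry elsewhere, whereas a request that waits past its deadline still consumes GPU time and returns nothing useful.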

Round 3: Audio RAG Quality Regression

A voice assistant's text RAG eval is stable, but spoken users receive more ungrounded answers after an ASR update. Diagnose and prevent it.

Hidden answer: eval and debugging plan

Re-run the retrieval eval using clean text, old ASR hypotheses, new ASR hypotheses, noisy partials, and final transcripts from a consented or de-identified evaluation set. Slice by entity substitutions, punctuation loss, homophones, language mix, wake-word clipping, and endpointing. Add retrieval recall at k, grounded answer rate, refusal precision, citation coverage, and human review for high-risk queries. Gate future ASR changes on downstream RAG metrics, not WER alone.
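The comparison across transcript conditions can be sketched as recall at k computed per condition, followed by a regression gate. The condition names, document IDs, and the 0.02 regression budget below are made-up illustrations; the retrieval results would come from running the same retriever over each transcript variant.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant document")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def recall_by_condition(runs, relevant, k):
    """runs: {condition: {query_id: ranked doc ids}}; relevant: {query_id: doc ids}."""
    report = {}
    for condition, ranked_by_query in runs.items():
        scores = [recall_at_k(ranked_by_query[q], relevant[q], k) for q in relevant]
        report[condition] = sum(scores) / len(scores)
    return report


def passes_gate(report, baseline, candidate, max_drop):
    """True if the candidate condition loses at most max_drop recall vs. the baseline."""
    return report[baseline] - report[candidate] <= max_drop


relevant = {"q1": ["d1", "d2"], "q2": ["d7"]}
runs = {
    "clean_text": {"q1": ["d1", "d2", "d9"], "q2": ["d7", "d3", "d4"]},
    "old_asr":    {"q1": ["d1", "d9", "d2"], "q2": ["d7", "d3", "d4"]},
    "new_asr":    {"q1": ["d1", "d9", "d8"], "q2": ["d3", "d4", "d5"]},
}
report = recall_by_condition(runs, relevant, k=3)
print(report, passes_gate(report, "old_asr", "new_asr", max_drop=0.02))
```

Keeping the clean-text row in the report separates the two failure sources: if clean text also regresses, the retriever or index changed; if only the new ASR row drops, the transcript is the cause.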

Round 4: TTS Safety And Latency Review

Product wants a more expressive TTS voice. Legal worries about voice cloning, and support worries about slower first audio byte. Design the release review.

Hidden answer: release review checklist

Require consent and provenance for voices, watermark or disclosure where appropriate, and abuse monitoring that does not treat a watermark or synthetic-speech detector as the only proof of misuse. Include text normalization tests, pronunciation evals, and refusal for unsafe synthesis requests. Measure first audio byte latency, chunk cadence, real-time factor, failure rate, mean opinion score (MOS) or listener preference, interruption rate, and abandonment. Use short first segments, warm vocoder pools, fallback voices, and canary rollback if latency or safety budgets are exceeded.
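The latency side of the review can be sketched from chunk arrival timestamps alone. The function below is an assumption-laden illustration: it takes the arrival of the last chunk as a proxy for total synthesis time when computing real-time factor, and it counts an underrun whenever a chunk arrives after all previously received audio would have finished playing.

```python
def tts_latency_metrics(request_ts, chunks):
    """chunks: list of (arrival_ts_s, audio_duration_s) in arrival order."""
    if not chunks:
        raise ValueError("no audio chunks received")

    first_audio_s = chunks[0][0] - request_ts
    total_audio_s = sum(duration for _, duration in chunks)
    # Proxy: wall time until the last chunk arrived, divided by audio produced.
    rtf = (chunks[-1][0] - request_ts) / total_audio_s

    # Simulate playback that starts as soon as the first chunk arrives.
    playback_end = chunks[0][0] + chunks[0][1]
    underruns = 0
    for arrival, duration in chunks[1:]:
        if arrival > playback_end:
            underruns += 1                     # listener heard a gap
            playback_end = arrival + duration
        else:
            playback_end += duration

    return {"first_audio_s": first_audio_s, "rtf": rtf, "underruns": underruns}


metrics = tts_latency_metrics(0.0, [(0.2, 0.5), (0.5, 0.5), (1.4, 0.5)])
print(metrics)
```

A real-time factor below 1.0 does not guarantee smooth playback: in the example the overall rate is fast enough, yet the third chunk still arrives late, which is why chunk cadence is tracked separately.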

Production Drills

Incident Exercises

Incident 1: Partial Transcript Churn Spike

After a model release, users see words change repeatedly while they speak. Final transcripts are acceptable, but the UI feels unstable.

Hidden answer: first-hour response

Check partial churn by language, device, network, VAD segment, decoder config, beam settings, endpoint delay, and UI commit policy. Mitigate by rollback, sticky old model for affected slices, longer stabilization before displaying partials, or UI smoothing. Add partial stability metrics to release gates because final WER alone misses the user-visible failure.
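A partial stability metric suitable for a release gate can be sketched as follows. The definition is an assumption chosen for illustration: between consecutive partials, every previously displayed word after the first point of disagreement counts as rewritten, and the total is normalized by the length of the last hypothesis.

```python
def rewritten_words(prev_words, curr_words):
    """Words already shown to the user that the new partial changed or dropped."""
    stable_prefix = 0
    for prev_word, curr_word in zip(prev_words, curr_words):
        if prev_word != curr_word:
            break
        stable_prefix += 1
    return len(prev_words) - stable_prefix


def partial_churn(partials):
    """Rewritten words per word of the last hypothesis; 0.0 means fully stable."""
    rewritten = 0
    prev_words = []
    for text in partials:
        curr_words = text.split()
        rewritten += rewritten_words(prev_words, curr_words)
        prev_words = curr_words
    if not prev_words:
        return 0.0
    return rewritten / len(prev_words)


stable = ["i want", "i want to", "i want to cancel"]
unstable = ["i want", "i want to", "i won't too", "i want to cancel"]
print(partial_churn(stable), partial_churn(unstable))
```

Both sequences end in the same final transcript, so final WER is identical, while the churn value separates them. That is the gap the release gate needs to close.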

Incident 2: Runaway Cost Without Quality Change

Quality dashboards are flat, but GPU spend doubled overnight for the speech-to-speech agent. What do you inspect?

Hidden answer: cost-debug checklist

Inspect traffic mix, retry loops, longer conversations, ASR segment count, LLM tokens per turn, TTS retries, cache hit rate, batch pool spillover, warm-pool size, autoscaling thresholds, and model version routing using aggregate counters and redacted request metadata rather than raw audio or transcripts. Stabilize with quotas, request deadlines, retry caps, fallback tiers, and per-stage cost alerts. Keep quality and safety gates active while reducing spend.
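Per-stage cost alerts can be sketched from aggregate counters only, with no raw audio or transcripts involved. The stage names, counter fields, GPU price, and the 1.5x alert ratio below are hypothetical values chosen for the example.

```python
def cost_per_success(counters, usd_per_gpu_second):
    """counters: {stage: {"gpu_seconds": float, "successes": int}} (aggregates only)."""
    costs = {}
    for stage, counts in counters.items():
        if counts["successes"] == 0:
            costs[stage] = float("inf")   # spend with no successful output
        else:
            costs[stage] = counts["gpu_seconds"] * usd_per_gpu_second / counts["successes"]
    return costs


def stages_over_budget(baseline, current, max_ratio):
    """Stages whose cost per successful request grew beyond max_ratio x baseline."""
    return sorted(
        stage
        for stage, cost in current.items()
        if stage in baseline and cost > baseline[stage] * max_ratio
    )


baseline = {"asr": 0.002, "llm": 0.005, "tts": 0.002}
counters = {
    "asr": {"gpu_seconds": 1000.0, "successes": 500},
    "llm": {"gpu_seconds": 6000.0, "successes": 500},   # e.g. retries or longer turns
    "tts": {"gpu_seconds": 1000.0, "successes": 500},
}
current = cost_per_success(counters, usd_per_gpu_second=0.001)
print(stages_over_budget(baseline, current, max_ratio=1.5))
```

Dividing by successes rather than by requests matters here: a retry loop raises GPU seconds without raising successes, so it shows up as a cost increase even when request volume and quality dashboards look flat.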

Self-Grading

Advanced Answer Rubric

Does the answer name the product SLO and the model metric?
Does it separate online serving, offline evaluation, data labeling, and rollout control paths?
Does it include CI/CD gates, canary criteria, rollback triggers, and post-release monitoring?
Does it discuss drift, privacy-safe logs, and slice metrics rather than only aggregate accuracy?
Does it make cost, latency, memory, and quality tradeoffs explicit?

Question: What is a red flag in an ML system design answer from an experienced candidate?

A red flag is treating model choice as the whole system. Production work includes data contracts, evaluation, launch gates, observability, debugging, rollback, cost control, and ownership after deployment.