Advanced Practice Capstone

Speech AI Interview Sprint

A timed capstone for ML engineer and AI engineer interviews: speech system design, R&D judgment, production debugging, model serving, MLOps, evaluation, and focused coding follow-ups with hidden answers.

How To Use This

Practice Like A Real Loop

Run each prompt under a timer before opening the answer. An experienced answer should make tradeoffs explicit, name observability and rollback hooks, and separate confirmed evidence from hypotheses.

  1. Five minutes: clarify product goal, users, traffic shape, privacy boundary, and launch target.
  2. Ten minutes: draw the serving, data, evaluation, and rollback paths.
  3. Ten minutes: quantify latency, cost, quality, and failure-mode tradeoffs.
  4. Five minutes: propose tests, dashboards, rollout gates, and the first incident response.

Question: What separates a strong answer from a mid-level answer?

A strong answer owns the full system lifecycle. It connects model choice to product constraints, data contracts, evaluation slices, deployment mechanics, monitoring, rollback, privacy, cost, and future iteration. It does not stop at "train a better model."

System Design

Four High-Signal Design Rounds

Prompt 1: Real-Time Meeting Assistant

Design a meeting assistant that streams captions, produces summaries, extracts action items, and answers spoken follow-up questions over the meeting history. It must support enterprise privacy controls.

Hidden answer: advanced design outline

Split real-time captioning from async summarization. Use VAD, streaming ASR, diarization, punctuation, chunk storage, redaction, retrieval over sanitized meeting chunks, and an answer service with citations. Track first partial latency, final WER, speaker-attributed WER, summary factuality, action-item precision/recall, retrieval recall, p95 answer latency, tenant isolation, retention policy, and deletion propagation. Roll out by tenant and feature flag; roll back the ASR, summarizer, retriever, and answer prompt independently.
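
Several of the metrics above build on word error rate. A minimal sketch, assuming whitespace-tokenized text with no normalization (the name `word_error_rate` is illustrative, not a library function), computes final WER from word-level edit distance:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

A speaker-attributed variant would additionally need speaker labels on both reference and hypothesis, which this sketch omits.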

Prompt 2: Low-Latency Voice Agent For Customer Support

Design a speech-to-speech support agent that handles account questions, can interrupt itself when the user speaks, and falls back safely when confidence is low.

Hidden answer: advanced design outline

Use a cascaded ASR-LLM-RAG-TTS pipeline unless direct speech-to-speech is needed for a narrow domain and comes with stronger safety gates. Budget latency across VAD, partial ASR, intent confidence, retrieval, tool calls, first token, first audio byte, and playback. Add barge-in events, echo cancellation, cancellation tokens, tool confirmation, handoff, and transcript repair. Evaluate task success, containment, groundedness, unsafe tool refusal, interruption recovery, first response latency, and cost per resolved call.
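
The per-stage budget can be written down and checked mechanically. The sketch below is one way to do that; the function name, the stage keys, and the idea of summing stage budgets are assumptions for illustration. Summed stage budgets are a rough planning figure, not the true end-to-end p95.

```python
def check_latency_budget(budget_ms, measured_ms, target_ms):
    """Compare measured stage latencies against per-stage budgets.

    budget_ms and measured_ms map stage name -> milliseconds.
    """
    total_budget = sum(budget_ms.values())
    # Stages with no measurement are reported rather than silently passed.
    missing = [stage for stage in budget_ms if stage not in measured_ms]
    overruns = {
        stage: measured_ms[stage] - limit
        for stage, limit in budget_ms.items()
        if stage in measured_ms and measured_ms[stage] > limit
    }
    return {
        "budget_total_ms": total_budget,
        "budget_fits_target": total_budget <= target_ms,
        # Largest overrun first, so the worst stage is reviewed first.
        "overruns_ms": dict(sorted(overruns.items(), key=lambda kv: -kv[1])),
        "missing_stages": missing,
    }
```

In an interview, stating the per-stage numbers out loud and naming which stage gets cut first when the total exceeds the target is usually worth more than the exact values.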

Prompt 3: Shared GPU Platform For ASR, TTS, And LLMs

Design a serving platform that hosts streaming ASR, batch transcription, TTS, embeddings, and LLM inference for several teams.

Hidden answer: advanced design outline

Isolate real-time pools from batch pools. Use workload classes, admission control, model registry, canary routing, autoscaling, queue-age SLOs, accelerator utilization, KV-cache or decoder memory budgets, warm pools for TTS, and backpressure. Track per-tenant cost, p95/p99 latency, queue age, error budget burn, saturation, retry storms, cold starts, model version, and rollback readiness. Never let eval backfills starve live speech traffic.
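
One simple form of admission control that keeps batch work from starving live traffic is a reserved-capacity rule. The sketch below assumes a fixed number of decode slots and two workload classes; the class `AdmissionController` and its parameters are hypothetical, and a real platform would add queues, priorities, and per-tenant limits.

```python
class AdmissionController:
    """Admit work into a fixed slot pool while reserving slots for real-time."""

    def __init__(self, total_slots: int, reserved_realtime_slots: int):
        if not 0 <= reserved_realtime_slots <= total_slots:
            raise ValueError("reserved slots must be between 0 and total_slots")
        self.total_slots = total_slots
        # Batch may never occupy the reserved real-time share.
        self.batch_limit = total_slots - reserved_realtime_slots
        self.in_use = {"realtime": 0, "batch": 0}

    def try_admit(self, workload_class: str) -> bool:
        if workload_class not in self.in_use:
            raise ValueError(f"unknown workload class: {workload_class}")
        if sum(self.in_use.values()) >= self.total_slots:
            return False  # pool is full; caller should queue or shed load
        if workload_class == "batch" and self.in_use["batch"] >= self.batch_limit:
            return False  # backpressure on batch, real-time headroom preserved
        self.in_use[workload_class] += 1
        return True

    def release(self, workload_class: str) -> None:
        if self.in_use.get(workload_class, 0) <= 0:
            raise ValueError(f"nothing to release for: {workload_class}")
        self.in_use[workload_class] -= 1
```

Real-time work can still borrow unused batch capacity under this rule, but batch can never take the reserved share.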

Prompt 4: Multilingual ASR Upgrade

A new multilingual ASR model improves aggregate WER but regresses code-switching and noisy far-field slices. How do you launch it?

Hidden answer: launch plan

Do not launch globally on aggregate WER alone. Define slice gates for language, code-switching, SNR, microphone type, duration, region, entity error rate, confidence calibration, and first partial latency. Use traffic routing to keep regressed slices on the old model while canarying improved slices. Add a model card, known limitations, rollback trigger, shadow comparison, and post-launch drift watch.
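
The routing rule above can be sketched as a small deterministic function. This assumes each request already carries slice tags (for example from language ID or an SNR estimate) and that sessions are bucketed by a hash of their ID; the function names and the tag strings are hypothetical.

```python
import hashlib


def session_bucket(session_id: str) -> float:
    """Map a session id to a stable value in [0, 1)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def choose_model(session_id, slice_tags, regressed_slices, canary_fraction):
    """Return "old" or "new" for one session.

    Regressed slices always stay on the old model. Other sessions go to the
    new model only if their stable bucket falls inside the canary fraction.
    """
    if set(slice_tags) & set(regressed_slices):
        return "old"
    if session_bucket(session_id) < canary_fraction:
        return "new"
    return "old"
```

Hashing the session ID keeps a session on one model for its whole lifetime, and setting `canary_fraction` to 0 acts as the rollback switch.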

R&D Judgment

Explain Research Tradeoffs Like An Owner

Prompt 5: CTC, RNN-T, Encoder-Decoder, Or Whisper-Style Model?

Compare model families for streaming dictation, offline transcription, and voice-command recognition.

Hidden answer: comparison points

CTC is simple and efficient but often needs decoding support and can be weaker for long context. RNN-T is strong for streaming partials and production dictation, with more complex training and decoding. Encoder-decoder models can use richer context but may be harder to stream tightly. Whisper-style weak supervision is robust offline and multilingual but may be costly and not ideal for ultra-low-latency partials. Match the family to latency, domain, data, decoding, and controllability constraints.
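
To make the "decoding support" point concrete: the simplest CTC decoder takes the most likely label per frame, collapses consecutive repeats, and drops blanks. The sketch below assumes the per-frame argmax IDs are already available and that the blank ID is 0, which is a convention rather than a requirement.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame label ids into an output label sequence.

    Consecutive repeats are merged, then blanks are removed. A blank between
    two identical labels keeps them as two separate outputs.
    """
    output = []
    previous = None
    for label in frame_ids:
        if label != previous and label != blank_id:
            output.append(label)
        previous = label
    return output
```

Greedy collapse uses no language model or context, which is one reason production CTC systems often add beam search or rescoring on top.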

Prompt 6: Cascaded Versus Direct Speech-To-Speech

Product asks whether a direct speech-to-speech model should replace the current ASR-LLM-TTS cascade.

Hidden answer: decision framework

Cascades are easier to inspect, moderate, localize, retrieve over, log with privacy controls, and roll back by component. Direct models may reduce latency and preserve prosody but are harder to debug, evaluate, and constrain. A strong answer proposes an experiment with matched tasks, latency budgets, safety checks, grounding, turn-taking recovery, controllability, and a fallback path before replacement.

Prompt 7: Efficient Transformer Choice

You need to cut serving cost 30 percent while protecting speech-agent task success. Which optimizations do you try first?

Hidden answer: optimization order

Start with routing and workload isolation before model surgery: smaller model for easy turns, cache retrieval and prompts, trim context, tune batching, use quantization where eval passes, and consider speculative decoding, GQA, FlashAttention, distillation, or LoRA only with slice-level quality gates. Watch p95 latency, first token latency, task success, hallucination, tool error, and cost per successful task.
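
Cost per successful task is the metric that ties the cost target to quality. A minimal sketch of a gate on it follows; the dictionary keys and the `max_success_drop` tolerance are assumptions, and measuring the reduction on cost per success rather than raw spend is one possible reading of the goal.

```python
def cost_per_success(total_cost: float, successful_tasks: int) -> float:
    if successful_tasks <= 0:
        raise ValueError("need at least one successful task")
    return total_cost / successful_tasks


def meets_cost_goal(baseline, candidate, min_cost_reduction, max_success_drop):
    """Check a candidate config against a baseline.

    baseline and candidate are dicts with "cost", "tasks", and "successes".
    Passes only if cost per success drops enough AND task success rate
    does not fall by more than max_success_drop (absolute).
    """
    base_cps = cost_per_success(baseline["cost"], baseline["successes"])
    cand_cps = cost_per_success(candidate["cost"], candidate["successes"])
    reduction = 1 - cand_cps / base_cps

    base_rate = baseline["successes"] / baseline["tasks"]
    cand_rate = candidate["successes"] / candidate["tasks"]
    success_drop = base_rate - cand_rate

    return reduction >= min_cost_reduction and success_drop <= max_success_drop
```

Running the same check per slice, not only in aggregate, catches an optimization that saves money by failing the hard turns.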

Production Debugging

First-Hour Incident Prompts

Prompt 8: Final WER Stable, Users Still Angry

After an ASR release, final WER is stable but live captions feel worse and support tickets rise.

Hidden answer: investigation plan

Inspect partial churn, first stable token latency, endpointing, punctuation rewrites, timestamp jitter, client debounce behavior, decoder rescoring, and noisy or long-utterance slices. Mitigate with rollback, committed-prefix stabilization, or endpointing changes. Add streaming UX metrics to release gates because final WER hides partial instability.
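
Partial churn and committed-prefix stabilization can both be expressed with a common-prefix helper. The sketch below assumes each partial hypothesis is a list of words; the definitions of churn (words shown and then rewritten) and of the committed prefix (agreement across the last `stable_for` partials) are one reasonable choice, not a standard.

```python
def common_prefix_len(a, b):
    """Number of leading words that two hypotheses share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def partial_churn(partials):
    """Count words that were displayed and later rewritten or retracted."""
    rewritten = 0
    for prev, curr in zip(partials, partials[1:]):
        rewritten += len(prev) - common_prefix_len(prev, curr)
    return rewritten


def committed_prefix(partials, stable_for):
    """Longest word prefix shared by the last `stable_for` partials."""
    if stable_for < 1 or len(partials) < stable_for:
        return []
    recent = partials[-stable_for:]
    prefix = recent[0]
    for hypothesis in recent[1:]:
        prefix = prefix[:common_prefix_len(prefix, hypothesis)]
    return list(prefix)
```

Showing only the committed prefix as final text trades a little latency for captions that stop flickering. Final WER is unaffected either way, which is why it cannot catch this regression.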

Prompt 9: GPU Spend Doubled Overnight

Speech platform spend doubles without matching traffic growth. Latency is slightly better and quality metrics are flat.

Hidden answer: investigation plan

Check autoscaler floor, warm replicas, batch jobs on live pools, shadow traffic, retry loops, longer prompts, disabled quantization, cache miss rate, tenant mix, and per-model utilization. Stabilize by isolating batch, restoring utilization targets, capping shadow traffic, and preserving SLOs. Add cost-per-success and utilization alerts, not just latency alerts.

Prompt 10: RAG Answers Cite Stale Policy

A spoken RAG assistant starts citing outdated policy pages after an index refresh.

Hidden answer: investigation plan

Pin the previous index or filter stale namespaces. Compare document lineage, chunker version, embedding model, top-k overlap, freshness metadata, ASR query variants, and answer-grounding checks. Add stale document fixtures, freshness-aware ranking, citation validation, and index-release rollback to CI/CD.
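
Two of these checks are small enough to sketch. The code assumes retrieval results are ranked lists of document IDs and that freshness metadata records the date on which a document was superseded; both function names and the `superseded_on` mapping are hypothetical.

```python
from datetime import date


def top_k_overlap(old_ids, new_ids, k):
    """Jaccard overlap between the top-k results of two index versions."""
    old_top, new_top = set(old_ids[:k]), set(new_ids[:k])
    union = old_top | new_top
    if not union:
        return 1.0  # both empty: nothing changed
    return len(old_top & new_top) / len(union)


def stale_citations(cited_ids, superseded_on, as_of: date):
    """Return cited document ids that were superseded on or before as_of.

    superseded_on maps document id -> date it was replaced; documents
    absent from the mapping are treated as current.
    """
    return [
        doc_id
        for doc_id in cited_ids
        if doc_id in superseded_on and superseded_on[doc_id] <= as_of
    ]
```

A sharp drop in overlap on a fixed query set right after the refresh points at the chunker or embedding change, while a non-empty stale list on a fixture query is a direct release blocker.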

Coding Follow-Up

Rollout Risk Scoring

Interviewers often turn system design into a small implementation exercise. Keep the invariant simple and test edge cases explicitly.

Lab: Rank Canary Risks

Given aggregate canary metrics by slice, return a sorted list of failing gates. Lower is better for WER, latency, cost, and unsafe rate; higher is better for task success and citation support.

Hidden answer: invariant, tests, and Python solution

Invariant: every slice is judged against its own configured budgets, and severity is normalized so different metric units can be sorted together. Test missing metrics, exactly-on-threshold values, lower-is-better metrics, higher-is-better metrics, and multiple failures in the same slice.

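def rank_canary_risks(slices, gates):
    """Return failing gates across all slices, most severe first.

    slices: {slice_name: {metric: value}}
    gates:  {metric: {"direction": "max" | "min", "threshold": number}}
    "max" means the value must not exceed the threshold (lower is better);
    "min" means the value must not fall below it (higher is better).
    """
    failures = []
    for slice_name, metrics in slices.items():
        for metric, gate in gates.items():
            if metric not in metrics:
                # A missing metric cannot be judged, so rank it as worst case.
                failures.append((float("inf"), slice_name, metric, "missing"))
                continue

            value = metrics[metric]
            direction = gate["direction"]
            threshold = gate["threshold"]

            if direction == "max":
                gap = value - threshold
            elif direction == "min":
                gap = threshold - value
            else:
                raise ValueError(f"unknown direction: {direction}")

            # Exactly-on-threshold values pass; only a positive gap fails.
            if gap > 0:
                # Normalize by the threshold so different units sort together.
                severity = gap / max(abs(threshold), 1e-9)
                failures.append((severity, slice_name, metric, value))

    # Explicit key: severity descending, then slice and metric ascending,
    # so ties are deterministic and mixed-type values are never compared.
    failures.sort(key=lambda f: (-f[0], f[1], f[2]))
    return [
        {"slice": s, "metric": m, "value": v, "severity": round(sev, 4)}
        for sev, s, m, v in failures
    ]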

Blind 75 Connection: Meeting Rooms II

Explain how the Meeting Rooms II heap pattern maps to GPU serving capacity planning for speech workloads.

Hidden answer: strategy and production analogy

In the coding problem, sort intervals by start time and keep a min-heap of end times; the heap size is the number of rooms needed. In serving, requests or streams occupy accelerator memory and decode slots for an interval. The same idea estimates peak concurrent sessions, but production also needs variable token rates, queueing, batching, priority classes, and safety margin.
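
The heap pattern translates almost directly into a peak-concurrency estimate. The sketch below assumes sessions are `(start, end)` pairs in the same time unit and that a session ending at time t frees its slot for one starting at t; `gpus_needed`, its `sessions_per_gpu` parameter, and the multiplicative safety margin are illustrative assumptions.

```python
import heapq
import math


def peak_concurrent_sessions(sessions):
    """Maximum number of sessions active at the same time."""
    active_ends = []  # min-heap of end times for sessions still running
    peak = 0
    for start, end in sorted(sessions):
        # Free every slot whose session has already ended.
        while active_ends and active_ends[0] <= start:
            heapq.heappop(active_ends)
        heapq.heappush(active_ends, end)
        peak = max(peak, len(active_ends))
    return peak


def gpus_needed(peak_sessions, sessions_per_gpu, safety_margin):
    """Accelerators required for the peak, with fractional headroom added."""
    if sessions_per_gpu <= 0:
        raise ValueError("sessions_per_gpu must be positive")
    return math.ceil(peak_sessions * (1 + safety_margin) / sessions_per_gpu)
```

This treats every session as one fixed-size slot. Variable token rates, batching, and priority classes break that assumption, which is where the production discussion should go next.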

Blind 75 Connection: Word Search II

Explain how trie pruning from Word Search II relates to spoken command recognition and constrained decoding.

Hidden answer: strategy and speech analogy

Word Search II uses a trie to reject impossible prefixes early and avoid exploring every path. Speech systems use similar prefix constraints for command grammars, contact names, entity biasing, or safety-sensitive tool names. The mistake is over-constraining so the decoder cannot recover from ASR noise or user paraphrases.
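
A word-level trie is enough to show the prefix constraint. In this sketch the decoder would ask which next words keep the hypothesis inside the command grammar; the function names and the `$end` sentinel are illustrative, and a real decoder would typically apply this as a soft bias over subword tokens rather than a hard filter.

```python
END = "$end"  # sentinel key marking a complete command


def build_command_trie(commands):
    """Build a nested-dict trie keyed by word."""
    root = {}
    for command in commands:
        node = root
        for word in command.split():
            node = node.setdefault(word, {})
        node[END] = True
    return root


def allowed_next_words(trie, prefix_words):
    """Words that can legally follow the prefix; empty set if prefix is invalid."""
    node = trie
    for word in prefix_words:
        if word == END or word not in node:
            return set()  # impossible prefix: prune this hypothesis early
        node = node[word]
    return {word for word in node if word != END}


def is_complete_command(trie, words):
    """True if the words spell out exactly one registered command."""
    node = trie
    for word in words:
        if word == END or word not in node:
            return False
        node = node[word]
    return END in node
```

Returning an empty set is the hard-pruning behavior from the coding problem. To avoid the over-constraining failure described above, a speech system might instead down-weight out-of-grammar words and keep a fallback path to open decoding.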