Speech AI Interview Sprint

How To Use This

Practice Like A Real Loop

Run each prompt under a timer before opening the answer. An experienced answer should make tradeoffs explicit, name observability and rollback hooks, and separate confirmed evidence from hypotheses.

Five minutes: clarify product goal, users, traffic shape, privacy boundary, and launch target.
Ten minutes: draw the serving, data, evaluation, and rollback paths.
Ten minutes: quantify latency, cost, quality, and failure-mode tradeoffs.
Five minutes: propose tests, dashboards, rollout gates, and the first incident response.

Question: What separates a strong answer from a mid-level answer?

A strong answer owns the full system lifecycle. It connects model choice to product constraints, data contracts, evaluation slices, deployment mechanics, monitoring, rollback, privacy, cost, and future iteration. It does not stop at "train a better model."

System Design

Four High-Signal Design Rounds

Prompt 1: Real-Time Meeting Assistant

Design a meeting assistant that streams captions, produces summaries, extracts action items, and answers spoken follow-up questions over the meeting history. It must support enterprise privacy controls.

Hidden answer: advanced design outline

Split real-time captioning from async summarization. Use VAD, streaming ASR, diarization, punctuation, chunk storage, redaction, retrieval over sanitized meeting chunks, and an answer service with citations. Track first partial latency, final WER, speaker-attributed WER, summary factuality, action-item precision/recall, retrieval recall, p95 answer latency, tenant isolation, retention policy, and deletion propagation. Roll out by tenant and feature flag; roll back ASR, summarizer, retriever, and answer prompt independently.

Prompt 2: Low-Latency Voice Agent For Customer Support

Design a speech-to-speech support agent that handles account questions, can interrupt itself when the user speaks, and falls back safely when confidence is low.

Hidden answer: advanced design outline

Use a cascaded ASR-LLM-RAG-TTS pipeline unless a narrow domain needs direct speech-to-speech and that model has stronger safety gates. Budget latency across VAD, partial ASR, intent confidence, retrieval, tool calls, first token, first audio byte, and playback. Add barge-in events, echo cancellation, cancellation tokens, tool confirmation, handoff, and transcript repair. Evaluate task success, containment, groundedness, unsafe tool refusal, interruption recovery, first response latency, and cost per resolved call.
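A latency budget is easier to defend when it is written down per stage. The sketch below sums hypothetical stage allowances against a turn-level target and ranks observed overruns. The stage names and millisecond figures are illustrative assumptions, not measurements from any system.

```python
# Hypothetical per-stage latency budget (milliseconds) for one voice-agent turn.
# All numbers are placeholders for illustration, not measurements.
STAGE_BUDGET_MS = {
    "vad_endpoint": 200,
    "asr_finalize": 150,
    "retrieval": 100,
    "llm_first_token": 300,
    "tts_first_audio": 150,
}


def check_budget(stage_budget_ms, target_ms):
    """Return (total, headroom, over_budget) for a turn-level latency target."""
    total = sum(stage_budget_ms.values())
    headroom = target_ms - total
    return total, headroom, headroom < 0


def worst_stages(observed_ms, stage_budget_ms):
    """List stages whose observed latency exceeds budget, largest overrun first."""
    overruns = [
        (observed_ms[name] - budget, name)
        for name, budget in stage_budget_ms.items()
        if name in observed_ms and observed_ms[name] > budget
    ]
    return [name for _, name in sorted(overruns, reverse=True)]
```

In an interview, naming the stage that owns the overrun is usually more persuasive than quoting a single end-to-end number.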

Prompt 3: Shared GPU Platform For ASR, TTS, And LLMs

Design a serving platform that hosts streaming ASR, batch transcription, TTS, embeddings, and LLM inference for several teams.

Hidden answer: advanced design outline

Isolate real-time pools from batch pools. Use workload classes, admission control, model registry, canary routing, autoscaling, queue-age SLOs, accelerator utilization, KV-cache or decoder memory budgets, warm pools for TTS, and backpressure. Track per-tenant cost, p95/p99 latency, queue age, error budget burn, saturation, retry storms, cold starts, model version, and rollback readiness. Never let eval backfills starve live speech traffic.
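The isolation rule can be illustrated with a toy admission controller. This is a minimal sketch under stated assumptions: the class name, the two workload classes ("realtime" and "batch"), and the slot model are hypothetical, and a real platform would also track completions, priorities, and per-tenant quotas.

```python
from collections import deque


class AdmissionController:
    """Toy admission control: batch work cannot occupy slots reserved for real-time traffic.

    Workload class names and limits are illustrative assumptions.
    """

    def __init__(self, total_slots, reserved_realtime_slots, max_batch_queue):
        if reserved_realtime_slots > total_slots:
            raise ValueError("reserved slots cannot exceed total slots")
        self.total_slots = total_slots
        self.reserved = reserved_realtime_slots
        self.max_batch_queue = max_batch_queue
        self.active = {"realtime": 0, "batch": 0}
        self.batch_queue = deque()

    def admit(self, request_id, workload_class):
        """Return 'admitted', 'queued', or 'rejected' (the backpressure signal)."""
        in_use = self.active["realtime"] + self.active["batch"]
        if workload_class == "realtime":
            if in_use < self.total_slots:
                self.active["realtime"] += 1
                return "admitted"
            return "rejected"
        # Batch may only use slots outside the real-time reservation.
        batch_limit = self.total_slots - self.reserved
        if self.active["batch"] < batch_limit and in_use < self.total_slots:
            self.active["batch"] += 1
            return "admitted"
        # A bounded queue turns overload into an explicit rejection instead of unbounded queue age.
        if len(self.batch_queue) < self.max_batch_queue:
            self.batch_queue.append(request_id)
            return "queued"
        return "rejected"
```

With four slots and two reserved, batch jobs stop at two active slots, so an eval backfill cannot consume the capacity held back for live speech.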

Prompt 4: Multilingual ASR Upgrade

A new multilingual ASR model improves aggregate WER but regresses code-switching and noisy far-field slices. How do you launch it?

Hidden answer: launch plan

Do not launch globally on aggregate WER alone. Define slice gates for language, code-switching, SNR, microphone type, duration, region tags (from approved aggregates or consented data), entity error rate, confidence calibration, and first partial latency. Use traffic routing to keep regressed slices on the old model while canarying improved slices. Add a model card, known limitations, rollback trigger, shadow comparison, and post-launch drift watch.
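Slice-aware routing can be sketched in a few lines. The function below is a hypothetical illustration: the tag names, the "old"/"new" labels, and the idea of a precomputed per-session bucket are assumptions, not a description of any particular router.

```python
def route_model(request_tags, regressed_slices, canary_fraction, bucket):
    """Pick a model version for one request.

    request_tags: set of slice tags attached to the request (hypothetical names).
    regressed_slices: set of slice tags where the new model failed its gate.
    canary_fraction: share of eligible traffic sent to the new model, in [0, 1].
    bucket: stable per-session value in [0, 1), for example a hashed session ID.
    """
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError("canary_fraction must be in [0, 1]")
    # Any regressed slice pins the request to the old model.
    if request_tags & regressed_slices:
        return "old"
    return "new" if bucket < canary_fraction else "old"
```

Setting `canary_fraction` to zero acts as a rollback for every slice without redeploying either model.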

R&D Judgment

Explain Research Tradeoffs Like An Owner

Prompt 5: CTC, RNN-T, Encoder-Decoder, Or Whisper-Style Model?

Compare model families for streaming dictation, offline transcription, and voice-command recognition.

Hidden answer: comparison points

CTC is simple and efficient but often needs decoding support and can be weaker for long context. RNN-T is strong for streaming partials and production dictation, with more complex training and decoding. Encoder-decoder models can use richer context but may be harder to stream tightly. Whisper-style weak supervision is robust offline and multilingual but may be costly and not ideal for ultra-low-latency partials. Match the family to latency, domain, data, decoding, and controllability constraints.

Prompt 6: Cascaded Versus Direct Speech-To-Speech

Product asks whether a direct speech-to-speech model should replace the current ASR-LLM-TTS cascade.

Hidden answer: decision framework

Cascades are easier to inspect, moderate, localize, retrieve over, log with privacy controls, and roll back by component. Direct models may reduce latency and preserve prosody but are harder to debug, evaluate, and constrain. A strong answer proposes an experiment with matched tasks, latency budgets, safety checks, grounding, turn-taking recovery, controllability, auditable transcript and tool-call traces, retention controls, and a fallback path before replacement.

Prompt 7: Efficient Transformer Choice

You need to cut serving cost 30 percent while protecting speech-agent task success. Which optimizations do you try first?

Hidden answer: optimization order

Start with routing and workload isolation before model surgery: smaller model for easy turns, cache retrieval and prompts, trim context, tune batching, use quantization where eval passes, and consider speculative decoding, GQA, FlashAttention, distillation, or LoRA only with slice-level quality gates. Watch p95 latency, first token latency, task success, hallucination, tool error, and cost per successful task.
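Cost per successful task is simple arithmetic, which makes it a useful gate for each optimization. The sketch below compares a candidate configuration against a baseline. The dictionary keys and both thresholds are illustrative assumptions, not recommended values.

```python
def cost_per_success(total_cost, tasks, success_rate):
    """Cost divided by the number of successful tasks."""
    successes = tasks * success_rate
    if successes <= 0:
        raise ValueError("no successful tasks to divide by")
    return total_cost / successes


def passes_cost_goal(baseline, candidate, min_saving=0.30, max_success_drop=0.01):
    """Check a candidate config against the baseline.

    baseline / candidate: dicts with 'total_cost', 'tasks', 'success_rate'.
    min_saving and max_success_drop are illustrative thresholds, not recommendations.
    """
    base = cost_per_success(**baseline)
    cand = cost_per_success(**candidate)
    saving = 1.0 - cand / base
    success_drop = baseline["success_rate"] - candidate["success_rate"]
    return saving >= min_saving and success_drop <= max_success_drop
```

A candidate that is cheaper per request can still fail this check if the drop in task success exceeds the allowed margin.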

Production Debugging

First-Hour Incident Prompts

Prompt 8: Final WER Stable, Users Still Angry

After an ASR release, final WER is stable but live captions feel worse and support tickets rise.

Hidden answer: investigation plan

Inspect partial churn, first stable token latency, endpointing, punctuation rewrites, timestamp jitter, client debounce behavior, decoder rescoring, and noisy or long-utterance slices. Mitigate with rollback, committed-prefix stabilization, or endpointing changes. Add streaming UX metrics to release gates because final WER hides partial instability.
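Partial churn can be measured directly from the sequence of partial hypotheses the client displayed. The sketch below uses one simple definition, assumed here for illustration: a token counts as churned if it was displayed and then fell outside the prefix shared with the next partial.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def partial_churn(partials):
    """Count displayed tokens that fell outside the stable prefix of the next partial.

    partials: list of token lists, in the order the client displayed them.
    """
    rewritten = 0
    for prev, curr in zip(partials, partials[1:]):
        rewritten += len(prev) - common_prefix_len(prev, curr)
    return rewritten
```

Two releases can reach the same final transcript, and therefore the same final WER, while differing widely on this count.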

Prompt 9: GPU Spend Doubled Overnight

Speech platform spend doubles without matching traffic growth. Latency is slightly better and quality metrics are flat.

Hidden answer: investigation plan

Check autoscaler floor, warm replicas, batch jobs on live pools, shadow traffic, retry loops, longer prompts, disabled quantization, cache miss rate, tenant mix, and per-model utilization. Stabilize by isolating batch, restoring utilization targets, capping shadow traffic, and preserving SLOs. Add cost-per-success and utilization alerts, not just latency alerts.
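A first pass at the per-model utilization check is to compare GPU-hours per request before and after the spend jump. The sketch below is a hypothetical helper: the input shape, the model names in the usage, and the 1.5x threshold are assumptions for illustration.

```python
def spend_regressions(before, after, ratio_threshold=1.5):
    """Rank models whose GPU-hours per request grew by at least ratio_threshold.

    before / after: {model_name: {"gpu_hours": float, "requests": int}}.
    Models burning GPU-hours with zero requests are ranked first.
    """
    flagged = []
    for model, now in after.items():
        if now["requests"] == 0:
            if now["gpu_hours"] > 0:
                flagged.append((float("inf"), model))  # Paying for idle capacity.
            continue
        prev = before.get(model)
        if prev is None or prev["requests"] == 0 or prev["gpu_hours"] == 0:
            continue  # No usable baseline; review new models separately.
        ratio = (now["gpu_hours"] / now["requests"]) / (prev["gpu_hours"] / prev["requests"])
        if ratio >= ratio_threshold:
            flagged.append((ratio, model))
    return [model for _, model in sorted(flagged, reverse=True)]
```

Idle-but-billed models surface first because warm replicas and forgotten shadow deployments add spend without adding traffic.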

Prompt 10: RAG Answers Cite Stale Policy

A spoken RAG assistant starts citing outdated policy pages after an index refresh.

Hidden answer: investigation plan

Pin the previous index or filter stale namespaces. Compare document lineage, chunker version, embedding model, top-k overlap, freshness metadata, ASR query variants, and answer-grounding checks. Add stale document fixtures, freshness-aware ranking, citation validation, and index-release rollback to CI/CD.
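Two of these checks are small enough to sketch: top-k overlap between index versions and a citation freshness check. Both helpers are hypothetical, and the version-map inputs are an assumed representation of document lineage, not a real index API.

```python
def topk_overlap(old_ids, new_ids, k):
    """Jaccard overlap between the top-k document IDs of two index versions."""
    old_top, new_top = set(old_ids[:k]), set(new_ids[:k])
    union = old_top | new_top
    return len(old_top & new_top) / len(union) if union else 1.0


def stale_citations(cited_ids, indexed_versions, current_versions):
    """Return cited document IDs whose indexed version differs from the current one.

    indexed_versions / current_versions: {doc_id: version} maps (assumed inputs).
    """
    return [
        doc_id for doc_id in cited_ids
        if indexed_versions.get(doc_id) != current_versions.get(doc_id)
    ]
```

A sharp drop in overlap for unchanged queries points at the chunker or embedding model, while stale citations with stable overlap point at freshness metadata.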

Coding Follow-Up

Rollout Risk Scoring

Interviewers often turn system design into a small implementation exercise. Keep the invariant simple and test edge cases explicitly.

Lab: Rank Canary Risks

Given aggregate canary metrics by slice, return a list of failing gates sorted by severity, worst first. Lower is better for WER, latency, cost, and unsafe rate; higher is better for task success and citation support.

Hidden answer: invariant, tests, and Python solution

Invariant: every slice is judged against its own configured budgets, and severity is normalized so different metric units can be sorted together, including zero-budget safety gates such as unsafe-rate ceilings. Validate gate definitions and metric values before scoring. Keep missing metrics explicit and finite in reports so launch tooling does not emit non-JSON severity values. Reject booleans, NaN, and infinity so malformed telemetry cannot silently pass a launch gate or distort severity sorting. Test missing metrics, exactly-on-threshold values, lower-is-better metrics, higher-is-better metrics, zero thresholds, invalid gates, nonnumeric metrics, non-finite values, and multiple failures in the same slice.

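from math import isfinite

# Severity assigned to a missing metric: finite so reports stay JSON-safe,
# and large enough to sort ahead of any realistic normalized overrun.
MISSING_SEVERITY = 1_000_000.0


def _finite_number(value, label):
    """Return value if it is a real, finite int or float; otherwise raise."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)) or not isfinite(value):
        raise ValueError(f"{label} must be a finite number")
    return value


def _validate_gates(gates):
    """Check every gate once, before scoring, so bad config fails even with no slices."""
    validated = {}
    for metric, gate in gates.items():
        direction = gate.get("direction")
        if direction not in {"max", "min"}:
            raise ValueError(f"unknown direction for {metric}: {direction}")
        threshold = _finite_number(gate.get("threshold"), f"threshold for {metric}")
        validated[metric] = (direction, threshold)
    return validated


def rank_canary_risks(slices, gates):
    """Return failing gates as dicts, sorted by severity, worst first."""
    validated = _validate_gates(gates)
    failures = []
    for slice_name, metrics in slices.items():
        for metric, (direction, threshold) in validated.items():
            if metric not in metrics:
                failures.append((MISSING_SEVERITY, slice_name, metric, "missing"))
                continue

            value = _finite_number(metrics[metric], f"value for {slice_name}/{metric}")
            # "max" gates are ceilings (lower is better); "min" gates are floors.
            excess = value - threshold if direction == "max" else threshold - value
            if excess > 0:
                # Floor the scale at 1.0 so zero-threshold gates do not divide by zero.
                severity = excess / max(abs(threshold), 1.0)
                failures.append((severity, slice_name, metric, value))

    # (slice, metric) pairs are unique, so tuple comparison never reaches the mixed-type value.
    failures.sort(reverse=True)
    return [
        {"slice": s, "metric": m, "value": v, "severity": round(sev, 4)}
        for sev, s, m, v in failures
    ]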

Speech Serving Follow-Up: Concurrent Stream Capacity

Given synthetic stream start and end times, explain how you would estimate peak concurrent sessions for GPU serving capacity planning.

Hidden answer: strategy and production caveats

Sort stream intervals by start time and keep the active end times in a min-heap. Before adding each stream, pop every end time at or before its start; the largest heap size seen is the peak concurrency. In serving, streams also occupy accelerator memory, decoder slots, network buffers, and sometimes per-tenant quotas. A production plan must add variable token rates, queueing, batching, priority classes, warm capacity, and safety margin before turning the estimate into replica count.
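A minimal sketch of the heap-based estimate follows. It assumes half-open intervals, so a stream ending at time t does not overlap one starting at t. That convention is a choice made here for illustration, and it covers only the counting step, none of the production caveats.

```python
import heapq


def peak_concurrency(intervals):
    """Peak number of overlapping streams, treating intervals as half-open [start, end)."""
    active_ends = []  # Min-heap of end times for streams still running.
    peak = 0
    for start, end in sorted(intervals):
        if end < start:
            raise ValueError(f"stream ends before it starts: {(start, end)}")
        # Retire every stream that finished at or before this start time.
        while active_ends and active_ends[0] <= start:
            heapq.heappop(active_ends)
        heapq.heappush(active_ends, end)
        peak = max(peak, len(active_ends))
    return peak
```

Sorting dominates the cost at O(n log n), and each interval is pushed and popped at most once.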

Speech Decoding Follow-Up: Prefix Constraints

Explain how prefix constraints can help spoken command recognition and where they can hurt ASR or speech-agent behavior.

Hidden answer: strategy and speech caveats

Prefix constraints can reject impossible command paths early and bias decoding toward valid tool names, contact names, product SKUs, or safety-sensitive actions. They are useful when the allowed action space is small and explicit. The failure mode is over-constraining: the decoder may not recover from ASR noise, pronunciation variation, code-switching, or user paraphrases, so launches need fallback paths and slice-level evaluation on approved aggregate or consented data.
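A token-level trie is one common way to implement prefix constraints. The sketch below is a hypothetical illustration over whitespace-tokenized commands; a real decoder would apply the allowed set to its own vocabulary and scores.

```python
END = None  # Sentinel key marking a complete command; never collides with string tokens.


def build_trie(commands):
    """Build a nested-dict trie over whitespace-tokenized commands."""
    root = {}
    for command in commands:
        node = root
        for token in command.split():
            node = node.setdefault(token, {})
        node[END] = True
    return root


def allowed_next(trie, prefix_tokens):
    """Return the set of tokens allowed after prefix_tokens, or None if the prefix is invalid."""
    node = trie
    for token in prefix_tokens:
        if token is END or token not in node:
            return None
        node = node[token]
    return {t for t in node if t is not END}


def is_complete(trie, tokens):
    """True if tokens spell out exactly one of the allowed commands."""
    node = trie
    for token in tokens:
        if token is END or token not in node:
            return False
        node = node[token]
    return END in node
```

The `None` return is where a fallback belongs: a misrecognized first token such as "kall" has no valid continuation, so the caller should drop to unconstrained decoding or ask for confirmation.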