Round 1: Multilingual Streaming ASR Launch
You are launching streaming ASR for support calls in English,
Spanish, and code-switched conversations. Design the model,
evaluation, rollout, and monitoring plan.
Hidden answer: strong outline
Clarify target latency, final WER, entity error rate, language mix,
retention rules, and traffic peaks. Use either language-ID routing to
per-language models or a single multilingual ASR model, with VAD,
streaming partials, punctuation restoration, and domain contextual
biasing for product names. Evaluate by language, accent, noise, and
code-switching slices, and track named-entity errors, partial churn,
endpoint delay, and correction rate. Roll out by tenant, language, and call
type with feature flags and model-version rollback.
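As an illustration of the sliced evaluation above, here is a minimal sketch of per-slice WER. The slice names, sample utterances, and function names (`edit_distance`, `wer_by_slice`) are hypothetical, and the text normalization (lowercasing and whitespace tokenization) is an assumption; a real evaluation would apply language-specific normalization.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_slice(samples):
    """samples: iterable of (slice_name, reference, hypothesis) strings.

    Errors and reference words are pooled per slice, so long utterances
    weigh more than short ones.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for slice_name, ref, hyp in samples:
        ref_tokens = ref.lower().split()  # assumed normalization
        hyp_tokens = hyp.lower().split()
        errors[slice_name] += edit_distance(ref_tokens, hyp_tokens)
        words[slice_name] += len(ref_tokens)
    return {s: errors[s] / words[s] for s in words if words[s] > 0}


# Hypothetical samples, one per slice.
samples = [
    ("en", "please reset my router", "please reset my router"),
    ("es", "necesito cambiar mi plan", "necesito cambiar el plan"),
    ("code_switch", "quiero un refund por favor", "quiero un re fund favor"),
]
print(wer_by_slice(samples))
```

Reporting one number per slice keeps a regression in code-switched calls from being hidden by a large, stable English slice.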
Round 2: GPU Serving Platform Capacity
A shared inference platform hosts embeddings, a small LLM, TTS, and
batch ASR. Interactive requests are missing their p95 latency target.
How do you redesign scheduling and capacity?
Hidden answer: scheduling and tradeoffs
Split interactive and batch pools, then add request deadlines,
admission control, priority queues, warm pools, per-model quotas,
and autoscaling from queue age rather than only GPU utilization.
Track p50/p95/p99 latency, time in queue, tokens or audio seconds
per second, cold starts, error budget burn, and cost per successful
request. Continuous batching helps LLM throughput but can hurt
first-token latency if queueing is uncontrolled.
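The deadline, admission-control, and queue-age ideas above can be sketched as follows. This is a simplified illustration, not a production scheduler: it assumes a single worker, a fixed per-request service-time estimate, and a proportional scaling rule; `DeadlineQueue` and `desired_replicas` are hypothetical names.

```python
import heapq
import itertools
import math


class DeadlineQueue:
    """Priority queue that rejects or drops requests that cannot meet their deadline."""

    def __init__(self, est_service_s):
        self.est_service_s = est_service_s  # assumed fixed service-time estimate
        self._heap = []
        self._counter = itertools.count()   # tie-breaker: FIFO within a priority

    def admit(self, request_id, priority, deadline_s, now_s):
        """Admission control: reject if the estimated finish time misses the deadline."""
        # Requests with equal or higher priority (lower number) run first.
        ahead = sum(1 for p, _, _, _ in self._heap if p <= priority)
        est_finish = now_s + (ahead + 1) * self.est_service_s
        if est_finish > deadline_s:
            return False
        heapq.heappush(
            self._heap, (priority, next(self._counter), deadline_s, request_id)
        )
        return True

    def pop_next(self, now_s):
        """Return the next request that can still finish in time; drop expired ones."""
        while self._heap:
            _, _, deadline_s, request_id = heapq.heappop(self._heap)
            if now_s + self.est_service_s <= deadline_s:
                return request_id
        return None


def desired_replicas(current, oldest_wait_s, target_wait_s, max_replicas):
    """Scale replicas in proportion to queue age relative to its target (assumed rule)."""
    if target_wait_s <= 0:
        raise ValueError("target_wait_s must be positive")
    scaled = math.ceil(current * oldest_wait_s / target_wait_s)
    return max(1, min(max_replicas, scaled))


queue = DeadlineQueue(est_service_s=0.1)
print(queue.admit("llm-1", priority=0, deadline_s=0.5, now_s=0.0))
print(queue.admit("asr-batch-1", priority=1, deadline_s=10.0, now_s=0.0))
print(queue.admit("llm-2", priority=0, deadline_s=0.15, now_s=0.0))
print(desired_replicas(current=4, oldest_wait_s=3.0, target_wait_s=1.0, max_replicas=16))
```

Rejecting a request at admission is usually cheaper than serving it late, and scaling on queue age reacts to waiting requests even when GPU utilization looks healthy.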
Round 3: Audio RAG Quality Regression
A voice assistant's text RAG eval is stable, but spoken users receive
more ungrounded answers after an ASR update. Diagnose and prevent it.
Hidden answer: eval and debugging plan
Re-run the retrieval eval using clean text, old ASR hypotheses, new
ASR hypotheses, noisy partials, and final transcripts. Slice by
entity substitutions, punctuation loss, homophones, language mix,
wake-word clipping, and endpointing. Add retrieval recall at k,
grounded answer rate, refusal precision, citation coverage, and
human review for high-risk queries. Gate future ASR changes on
downstream RAG metrics, not WER alone.
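One way to make the gating step concrete is to compute retrieval recall at k per transcript condition and block a release when the new ASR drops too far below the old one. The sketch below assumes document IDs as strings and an absolute-drop threshold; the data, threshold, and function names are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def mean_recall_by_condition(runs, k):
    """runs: {condition: [(retrieved_ids, relevant_ids), ...]} for the same queries."""
    return {
        condition: sum(recall_at_k(ret, rel, k) for ret, rel in pairs) / len(pairs)
        for condition, pairs in runs.items()
    }


def passes_gate(new_recall, baseline_recall, max_drop):
    """Allow the ASR change only if recall falls by at most max_drop (assumed rule)."""
    return (baseline_recall - new_recall) <= max_drop


# Hypothetical results for two queries under three transcript conditions.
runs = {
    "clean_text": [(["d1", "d2", "d3"], ["d1"]), (["d4", "d5", "d6"], ["d5"])],
    "old_asr": [(["d1", "d7", "d3"], ["d1"]), (["d5", "d4", "d6"], ["d5"])],
    "new_asr": [(["d1", "d7", "d8"], ["d1"]), (["d9", "d4", "d6"], ["d5"])],
}
recall = mean_recall_by_condition(runs, k=2)
print(recall)
print(passes_gate(recall["new_asr"], recall["old_asr"], max_drop=0.02))
```

Comparing the conditions on the same query set separates retrieval loss caused by the ASR change from loss that already existed with clean text.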
Round 4: TTS Safety And Latency Review
Product wants a more expressive TTS voice. Legal worries about voice
cloning, and support worries about a slower first audio byte. Design
the release review.
Hidden answer: release review checklist
Require consent and provenance for voices, watermark or disclosure
where appropriate, abuse monitoring, text normalization tests,
pronunciation evals, and refusal for unsafe synthesis requests.
Measure time to first audio byte, chunk cadence, real-time factor,
failure rate, MOS or preference score, interruption rate, and
abandonment. Use
short first segments, warm vocoder pools, fallback voices, and
canary rollback if latency or safety budgets are exceeded.
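The latency measurements and the canary rollback rule above can be sketched as below. The trace fields, budgets, and nearest-rank p95 are assumptions for illustration; real-time factor is taken here as total synthesis time divided by audio duration, so values below 1 mean faster than real time.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SynthesisTrace:
    request_s: float               # time the request was received
    chunk_arrivals_s: List[float]  # arrival time of each streamed audio chunk
    audio_duration_s: float        # total duration of the synthesized audio


def first_audio_byte_s(trace):
    """Delay from request to the first audio chunk."""
    return trace.chunk_arrivals_s[0] - trace.request_s


def real_time_factor(trace):
    """Total synthesis time divided by audio duration."""
    return (trace.chunk_arrivals_s[-1] - trace.request_s) / trace.audio_duration_s


def max_chunk_gap_s(trace):
    """Largest gap between consecutive chunks, a simple chunk-cadence signal."""
    arrivals = trace.chunk_arrivals_s
    return max((b - a for a, b in zip(arrivals, arrivals[1:])), default=0.0)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def should_roll_back(traces, fab_budget_s, rtf_budget):
    """Roll back the canary if p95 first-audio-byte or p95 RTF exceeds its budget."""
    fab_p95 = percentile([first_audio_byte_s(t) for t in traces], 95)
    rtf_p95 = percentile([real_time_factor(t) for t in traces], 95)
    return fab_p95 > fab_budget_s or rtf_p95 > rtf_budget


# Hypothetical traces: one healthy, one slow.
healthy = SynthesisTrace(0.0, [0.25, 0.5, 1.0], 2.0)
slow = SynthesisTrace(0.0, [0.5, 1.0, 3.0], 2.0)
print(first_audio_byte_s(healthy), real_time_factor(healthy), max_chunk_gap_s(healthy))
print(should_roll_back([healthy], fab_budget_s=0.3, rtf_budget=1.0))
print(should_roll_back([healthy, slow], fab_budget_s=0.3, rtf_budget=1.0))
```

A safety budget (for example, a rate of abuse reports or failed consent checks) could feed the same rollback decision; it is omitted here because the source does not define one.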