Prompt 1: Scale A Real-Time Speech-To-Speech Assistant
Design the serving architecture for a bilingual speech-to-speech
assistant that must support 20,000 concurrent calls during peak
traffic. The product promises a first partial transcript within 500 ms
at p95 and first synthesized audio within 1.8 seconds at p95.
Hidden answer: advanced design outline
Start with a streaming gateway, per-call session state, VAD, ASR
stream workers, retrieval and policy services, LLM serving, TTS
streaming, and playback telemetry. Isolate real-time traffic from
batch jobs, reserve warm model pools, and use priority queues for
interactive turns. Track concurrent streams, audio seconds processed
per wall-clock second, first-partial latency, finalization latency,
token throughput, TTS time to first audio byte, error budget burn,
and cost per successful turn. Roll out by language, region, device
class, and tenant.
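One way to picture the "priority queues for interactive turns" idea is the minimal sketch below. It assumes a single in-process queue and two hypothetical priority classes (INTERACTIVE and BATCH); the class names, job identifiers, and the TurnQueue type are illustrative and not part of the outline above. A real deployment would more likely enforce this in the scheduler or serving layer, with separate capacity pools.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Hypothetical priority classes; a lower number is served first.
INTERACTIVE = 0
BATCH = 1


@dataclass(order=True)
class Job:
    priority: int
    seq: int                              # tie-breaker: FIFO within a priority class
    job_id: str = field(compare=False)    # excluded from ordering


class TurnQueue:
    """Serve interactive turns before batch work, FIFO within each class."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, job_id, priority):
        heapq.heappush(self._heap, Job(priority, next(self._counter), job_id))

    def next_job(self):
        """Return the next job id, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap).job_id


queue = TurnQueue()
queue.submit("batch-reindex", BATCH)
queue.submit("call-17-turn-3", INTERACTIVE)
queue.submit("call-42-turn-1", INTERACTIVE)
order = [queue.next_job() for _ in range(3)]
print(order)
```

Strict priority like this can starve batch work under sustained load, which is one reason the outline also isolates real-time traffic from batch jobs rather than relying on queue ordering alone.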
Prompt 2: Choose A Routing Policy For Cost And Quality
You have a large ASR model with better noisy-call WER and a smaller
model with half the cost and lower latency. Design a routing policy
for production.
Hidden answer: policy, gates, and failure modes
Route by observable, aggregate-safe signals such as language,
device, expected noise class, tenant tier, real-time requirement,
and confidence from early chunks. Use the large model for hard
slices, high-value calls, low confidence, or escalation. Gate the
policy on WER/CER slices, first partial latency, abandonment,
downstream task success, and cost per successful minute. Failure
modes include biased routing, stale noise classifiers, retry loops,
and silently routing rare accents to the wrong model.
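The routing rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not a recommended policy: the CallSignals fields, the noise and tier labels, and the 0.80 confidence floor are all hypothetical placeholders that would be set from slice-level evaluation. The function returns a reason alongside the model choice so routing decisions can be audited per slice, and unknown noise classes fall back to the large model instead of being routed silently.

```python
from dataclasses import dataclass

# Hypothetical labels and threshold; real values would come from slice evals.
KNOWN_NOISE_CLASSES = {"clean", "noisy"}
CONFIDENCE_FLOOR = 0.80


@dataclass
class CallSignals:
    language: str
    noise_class: str         # output of an assumed upstream noise classifier
    tenant_tier: str         # e.g. "standard" or "premium"
    early_confidence: float  # small-model confidence on early chunks, 0..1


def choose_model(signals, escalated=False):
    """Return (model, reason) so every routing decision can be logged."""
    if escalated:
        return "large", "escalation"
    if signals.noise_class not in KNOWN_NOISE_CLASSES:
        # Fail toward quality when the classifier output is unrecognized.
        return "large", "unknown_noise_class"
    if signals.noise_class == "noisy":
        return "large", "hard_slice"
    if signals.tenant_tier == "premium":
        return "large", "high_value"
    if signals.early_confidence < CONFIDENCE_FLOOR:
        return "large", "low_confidence"
    return "small", "default"


easy = CallSignals("en", "clean", "standard", 0.93)
noisy = CallSignals("es", "noisy", "standard", 0.95)
unsure = CallSignals("en", "clean", "standard", 0.55)
print(choose_model(easy), choose_model(noisy), choose_model(unsure))
```

Logging the reason with each decision makes it possible to check, per language or accent slice, how often calls reach the large model and whether that split looks biased.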
Prompt 3: Design CI/CD For A Multi-Model Voice Agent
ASR, retrieval, prompt templates, LLM, TTS, and safety classifiers can
all ship independently. Design release gates and rollback strategy.
Hidden answer: release engineering answer
Give every artifact a version, owner, data lineage record, eval
report, and rollback target. CI validates schemas, model cards,
safety checks, privacy constraints, and reproducible eval inputs.
Pre-prod gates cover noisy ASR queries, retrieval freshness,
grounded answers, tool-call safety, TTS latency, and end-to-end
spoken-task success. Canary gates compare slice metrics against the
previous bundle, not only global averages. Rollback can pin one
artifact, restore a bundle, disable a feature, or route to a safer
fallback.
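To make the "slice metrics, not only global averages" point concrete, here is a minimal sketch of a canary gate. It assumes WER per slice is already computed for the previous bundle and the canary; the slice names, the WER values, and the 5% relative tolerance are made-up illustrations, and canary_regressions is a hypothetical helper name.

```python
def canary_regressions(baseline, canary, max_relative_increase=0.05):
    """Return slices whose canary WER regresses beyond the allowed margin.

    A slice missing from the canary report also counts as a failure,
    since an unmeasured slice cannot pass a gate.
    """
    failed = []
    for slice_name, base_wer in baseline.items():
        canary_wer = canary.get(slice_name)
        if canary_wer is None or canary_wer > base_wer * (1 + max_relative_increase):
            failed.append(slice_name)
    return sorted(failed)


# Illustrative numbers only: the global average improves while one slice regresses.
baseline = {"global": 0.100, "en/clean": 0.060, "es/noisy": 0.180}
canary = {"global": 0.098, "en/clean": 0.055, "es/noisy": 0.210}

failed = canary_regressions(baseline, canary)
print(failed)
```

In this made-up example the global WER looks better under the canary, yet the gate still blocks promotion because the es/noisy slice exceeds its tolerance, which is the case a global-only comparison would miss.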