ML System Design Casebook

Answer Shape

Use The Same Spine For Every Design Prompt

Interviewers are listening for structured tradeoffs, not a diagram packed with fashionable components. Start with product requirements, then narrow toward model, serving, evaluation, rollout, and incident behavior.

Clarify: users, latency target, quality target, traffic shape, privacy limits, and failure tolerance.
Separate paths: online, batch, eval, admin, data labeling, and fallback paths should not share fragile assumptions.
Choose models: explain why the ASR, TTS, LLM, embedding, or codec model fits the product constraint.
Define gates: offline eval, shadow, canary, SLO, cost, safety, and rollback thresholds.
Operate: dashboards, alerts, runbooks, privacy-safe logs, drift checks, and postmortem learning.

Question: What makes a system design answer staff-level rather than mid-level?

A strong answer names the uncertain assumptions, makes measurable tradeoffs, designs for rollback before launch, connects quality to SLOs and cost, and explains how the system improves after production failures without collecting unnecessary private data.

Case Prompts

Five Advanced Design Rounds

Treat each case as a 45-minute interview. Write assumptions, draw a block diagram, estimate capacity, define launch gates, then open the hidden answer.

Case 1: Enterprise Streaming Dictation

Design a streaming ASR service for clinicians. The product needs low partial latency, high named-entity accuracy, tenant isolation, and a rollback path when a model update hurts medical terms.

Hidden answer: strong design outline

Separate real-time streaming from offline correction jobs. Use VAD, normalization, streaming ASR, domain contextual biasing, entity post-processing, and privacy-safe aggregate telemetry. Gate releases by WER, entity error rate, partial churn, first partial latency, finalization delay, tenant-level SLOs, and correction rate. Roll back by model version and feature flag, not by redeploying the whole service.
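
Two of the release gates above, WER and entity error rate, can be sketched in a few lines. This is a minimal illustration, not a production scorer: the function names are hypothetical, text normalization is reduced to lowercasing and whitespace splitting, and entity error rate is assumed here to mean the fraction of reference entities that do not appear verbatim in the hypothesis.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def entity_error_rate(reference_entities, hypothesis):
    """Assumed definition: share of reference entities missing from the text."""
    if not reference_entities:
        raise ValueError("reference_entities must not be empty")
    hyp = hypothesis.lower()
    missed = sum(1 for entity in reference_entities if entity.lower() not in hyp)
    return missed / len(reference_entities)
```

A single misspelled drug name barely moves WER on a long note but moves entity error rate sharply, which is why the two are gated separately.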

Case 2: Speech-To-Speech Support Agent

Design a voice support agent that listens, reasons over account policy documents, speaks back naturally, supports interruption, and escalates to a human when confidence is low.

Hidden answer: architecture and tradeoffs

A cascaded ASR-LLM/RAG-TTS system is often easier to debug and control than a direct speech-to-speech model, while direct models may reduce latency or preserve prosody when they have strong evals and guardrails. Track ASR confidence, retrieval grounding, answer safety, TTS first audio byte, turn latency, barge-in, and handoff rate. Use session state with explicit TTLs, policy-document versioning, retrieval evals from spoken queries, and human escalation when grounding, identity, or intent confidence is weak.
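
The escalation rule at the end of that answer can be sketched as a per-turn check. The signal names, the dataclass, and the threshold values below are all hypothetical placeholders; real thresholds would come from evals, not from this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnSignals:
    """Per-turn confidence signals, each assumed to lie in [0, 1]."""
    grounding_score: float
    identity_confidence: float
    intent_confidence: float


# Hypothetical minimums; tune against labeled conversations.
DEFAULT_THRESHOLDS = {
    "grounding_score": 0.70,
    "identity_confidence": 0.90,
    "intent_confidence": 0.60,
}


def escalation_reasons(signals, thresholds=DEFAULT_THRESHOLDS):
    """Return the names of signals that fall below their minimum."""
    return [name for name, minimum in thresholds.items()
            if getattr(signals, name) < minimum]


def should_escalate(signals, thresholds=DEFAULT_THRESHOLDS):
    """Escalate to a human when any signal is weak."""
    return bool(escalation_reasons(signals, thresholds))
```

Returning the reasons rather than a bare boolean keeps the handoff rate debuggable: the dashboard can show which signal drove each escalation.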

Case 3: Cost-Aware Inference Platform

Your team hosts ASR, embeddings, LLM, and TTS models for many product teams. Design a platform that controls GPU cost while protecting interactive latency.

Hidden answer: platform design

Split interactive and batch capacity, enforce quotas, provide model contracts, support continuous batching where it helps, and expose per-model cost and latency dashboards. Use priority queues, request deadlines, admission control, canary routing, model pools, warmup, autoscaling, and fallback tiers. Cost savings should be gated by slice quality, not only aggregate throughput.
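
Priority queues, request deadlines, and admission control fit together roughly as in the toy queue below. The class and method names are hypothetical, and the depth limit is a deliberately simple stand-in for real admission control, which would usually apply separate limits per priority class.

```python
import heapq
import itertools


class DeadlineQueue:
    """Toy priority queue with a depth limit and deadline-based dropping."""

    def __init__(self, max_depth):
        if max_depth <= 0:
            raise ValueError("max_depth must be positive")
        self.max_depth = max_depth
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def admit(self, request_id, priority, deadline):
        """Queue a request; a lower priority number means more urgent."""
        if len(self._heap) >= self.max_depth:
            return False  # shed load instead of letting the queue grow
        entry = (priority, deadline, next(self._counter), request_id)
        heapq.heappush(self._heap, entry)
        return True

    def next_request(self, now):
        """Pop the most urgent live request and list any expired ones."""
        expired = []
        while self._heap:
            _, deadline, _, request_id = heapq.heappop(self._heap)
            if deadline <= now:
                expired.append(request_id)  # too late to be useful
                continue
            return request_id, expired
        return None, expired
```

Dropping expired work at dequeue time means GPU cycles are not spent on answers the caller has already stopped waiting for.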

Case 4: Audio Data Flywheel

Design a privacy-preserving data and evaluation loop for improving ASR on noisy audio, domain-specific vocabulary, and policy-approved aggregate slices of language or pronunciation variation.

Hidden answer: data operations plan

Collect opt-in examples or derived features under retention rules, label with clear rubrics, measure annotator agreement, and build eval slices before training changes. Store dataset lineage, model version, prompt/config version, and consent state. Use synthetic and public fixtures for CI, reserve private data for governed evals, and monitor drift through aggregate features and correction signals.
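
Annotator agreement is often summarized with Cohen's kappa for two annotators; the source does not name a specific statistic, so treat this choice as an assumption. A minimal sketch:

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("at least one labeled item is required")

    # Fraction of items where the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A low kappa on a slice usually points at an ambiguous rubric rather than at careless annotators, so it is worth checking before the labels feed an eval set.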

Case 5: Regression After A TTS Voice Upgrade

A new TTS voice wins preference tests but increases call abandonment. Design the investigation, mitigation, and prevention plan.

Hidden answer: production response

Compare the old and new voices by first audio byte, chunk cadence, text normalization, sentence segmentation, vocoder time, client playback, interruption rate, abandonment, and policy-approved aggregate language slices. Mitigate with canary rollback, a fallback voice, a shorter first segment, warm pools, or lower-cost vocoding for noncritical turns. Prevent with conversational evals, not only offline preference tests or MOS.
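
One way to start the investigation is to compare abandonment per slice rather than in aggregate. The sketch below assumes call records arrive as (voice, slice_name, abandoned) tuples; the record shape, the voice labels, and the 0.02 margin are hypothetical, and a real analysis would add confidence intervals before acting on small slices.

```python
from collections import defaultdict


def abandonment_by_slice(calls):
    """Return {slice_name: {voice: abandonment_rate}} from call tuples."""
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for voice, slice_name, abandoned in calls:
        cell = totals[slice_name][voice]
        cell[0] += 1 if abandoned else 0  # abandoned calls
        cell[1] += 1                      # all calls
    return {
        slice_name: {voice: a / n for voice, (a, n) in voices.items()}
        for slice_name, voices in totals.items()
    }


def regressed_slices(calls, old="old", new="new", max_increase=0.02):
    """List slices where the new voice raises abandonment past the margin."""
    flagged = []
    for slice_name, rates in sorted(abandonment_by_slice(calls).items()):
        if old in rates and new in rates:
            if rates[new] - rates[old] > max_increase:
                flagged.append(slice_name)
    return flagged
```

If only some slices regress, for example long answers, that narrows the search toward first-segment length or chunk cadence rather than the voice quality itself.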

Coding Labs

Small Interview Utilities

These are the kinds of practical snippets advanced candidates may write during debugging or system design follow-ups.

Lab 1: Canary Gate Decision

Given baseline and canary metrics, return whether to continue, pause, or roll back. Critical regressions should override wins.

Hidden answer: invariant, tests, and Python solution

Invariant: every metric is compared in the direction that matters for the product. Test quality regression with latency win, latency regression with quality win, missing metrics, invalid budget definitions, and exactly-on-budget changes.

def canary_decision(baseline, canary, budgets):
    actions = []
    for name, budget in budgets.items():
        required_fields = {"direction", "allowed_delta"}
        missing_fields = required_fields - budget.keys()
        if missing_fields:
            missing = ", ".join(sorted(missing_fields))
            raise ValueError(f"budget for {name} missing: {missing}")

        direction = budget["direction"]
        allowed = budget["allowed_delta"]
        # Validate the budget before touching metrics so a bad definition
        # fails loudly even when the metric itself is missing.
        if direction not in {"lower_is_better", "higher_is_better"}:
            raise ValueError(f"unknown direction for {name}: {direction}")
        if allowed < 0:
            raise ValueError("allowed_delta must be non-negative")

        if name not in baseline or name not in canary:
            actions.append(("pause", name, "missing metric"))
            continue

        # A positive delta means the canary value is above the baseline.
        delta = canary[name] - baseline[name]
        tolerance = 1e-12  # absorbs float noise on exactly-on-budget changes
        if direction == "lower_is_better" and delta - allowed > tolerance:
            actions.append(("rollback", name, delta))
        elif direction == "higher_is_better" and -delta - allowed > tolerance:
            actions.append(("rollback", name, delta))

    if any(action[0] == "rollback" for action in actions):
        return "rollback", actions
    if actions:
        return "pause", actions
    return "continue", []

Lab 2: Estimate Real-Time Audio Capacity

Estimate required replicas from concurrent streams, model cost per audio second, per-replica serving capacity, utilization target, and VAD savings.

Hidden answer: invariant, tests, and Python solution

Invariant: real-time streams produce audio work every wall-clock second, reduced only by silence skipping or routing. Test zero streams, no VAD savings, invalid utilization, invalid negative inputs, nonpositive replica capacity, and very large peaks.

import math


def estimate_replicas(concurrent_streams, gpu_seconds_per_audio_second,
                      replica_capacity=1.0, target_utilization=0.7,
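                      vad_savings=0.0):
    """Return the replica count needed to serve the concurrent streams.

    replica_capacity is the GPU seconds of work one replica completes per
    wall-clock second; vad_savings is the fraction of audio skipped as silence.
    """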
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    if not 0 <= vad_savings < 1:
        raise ValueError("vad_savings must be in [0, 1)")
    if replica_capacity <= 0:
        raise ValueError("replica_capacity must be positive")
    if concurrent_streams < 0:
        raise ValueError("concurrent_streams must be non-negative")
    if gpu_seconds_per_audio_second < 0:
        raise ValueError("gpu_seconds_per_audio_second must be non-negative")

    effective_audio_seconds = concurrent_streams * (1 - vad_savings)
    gpu_seconds_needed = effective_audio_seconds * gpu_seconds_per_audio_second
    usable_capacity = replica_capacity * target_utilization
    return math.ceil(gpu_seconds_needed / usable_capacity)
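
As a worked check of the same arithmetic, with illustrative numbers that are assumptions rather than measurements: let N be concurrent_streams, s be vad_savings, g be gpu_seconds_per_audio_second, C be replica_capacity, and u be target_utilization. For N = 1000, s = 0.25, g = 0.04, C = 1.0, and u = 0.8:

```latex
R = \left\lceil \frac{N\,(1 - s)\,g}{C\,u} \right\rceil
  = \left\lceil \frac{1000 \times (1 - 0.25) \times 0.04}{1.0 \times 0.8} \right\rceil
  = \left\lceil \frac{30}{0.8} \right\rceil
  = \lceil 37.5 \rceil
  = 38
```

Without the utilization headroom the answer would be 30 replicas, so the 0.8 target alone accounts for 8 of the 38.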

Exam Checks

Short Prompts With Hidden Answers

Prompt 1: Rollback Versus Hotfix

When should an ML engineer roll back immediately instead of debugging in place?

Hidden answer

Roll back when user harm, privacy risk, SLO breach, critical slice regression, data corruption, runaway cost, or unrecoverable queue growth exceeds the launch budget. Debug after stabilizing unless the rollback itself would create larger risk.

Prompt 2: Why RAG Evaluation Is Different For Spoken Queries

Explain why a text-only RAG eval can miss voice assistant failures.

Hidden answer

Spoken queries include ASR substitutions, punctuation loss, entity errors, disfluencies, pronunciation variation (measured on approved aggregate slices), language switches, endpointing mistakes, and rewritten partials. Evaluate clean text, final ASR, noisy ASR, and streaming partial paths separately.
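
Evaluating the paths separately can be as simple as reporting recall@k per path instead of one pooled number. In the sketch below the example record shape and the retrieve callable are hypothetical: retrieve is assumed to take a query string and return a ranked list of document ids.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Share of relevant documents found in the top k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def recall_by_path(examples, retrieve, k=5):
    """Average recall@k per query path.

    Each example is a dict with "relevant_ids" and "queries", where
    "queries" maps a path name (such as "clean_text" or "noisy_asr")
    to the query text seen on that path.
    """
    sums, counts = {}, {}
    for example in examples:
        for path, query in example["queries"].items():
            score = recall_at_k(retrieve(query), example["relevant_ids"], k)
            sums[path] = sums.get(path, 0.0) + score
            counts[path] = counts.get(path, 0) + 1
    return {path: sums[path] / counts[path] for path in sums}
```

A large gap between the clean-text and noisy-ASR rows is the failure that a text-only eval never surfaces.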