Capstone Exam

Advanced Audio ML Interview And Production Exam

A realistic final practice round for ML engineer and AI engineer roles covering speech R&D, model serving, MLOps, production debugging, cost control, and coding follow-ups.

Exam Shape

Four Hours, Four Advanced Signals

Use this after completing the system-design, production MLOps, inference hosting, and Blind 75 chapters. Answer each section on a timer before opening the hidden strong answer.

  1. R&D judgment: choose model families, metrics, and experiment plans under ambiguous quality goals.
  2. System design: design online, offline, eval, data, rollout, and rollback paths with explicit SLOs.
  3. Production operations: debug incidents using aggregate telemetry, privacy-safe logs, and reversible mitigation.
  4. Coding: implement small utilities with clear invariants, edge cases, and cost analysis.

Question: What should an experienced candidate do before drawing architecture boxes?

Clarify product constraints, traffic shape, latency budget, privacy limits, failure tolerance, quality target, cost ceiling, model update frequency, and how success will be measured after launch. The diagram should follow those constraints.

Round 1

Speech-To-Speech Enterprise Assistant

Design a voice assistant for enterprise help desks. It handles spoken questions, retrieves policy documents, answers by voice, lets users interrupt, and escalates to humans for risky actions.

Prompt 1: End-To-End Architecture

Give an advanced design answer for the online serving path, offline eval path, data feedback path, rollout process, and rollback plan.

Hidden answer: strong architecture outline

Prefer a cascaded VAD, streaming ASR, retrieval, LLM policy engine, tool router, and streaming TTS design unless the product requires direct speech-to-speech research. Separate interactive capacity from batch evaluation. Version ASR, retrieval index, prompt, tools, LLM, TTS voice, and client config. Gate launches by ASR slice WER, retrieval grounding, task success, unsafe action rate, first audio byte, end-to-end turn latency, interruption recovery, escalation precision, cost per resolved ticket, and rollback time.
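The launch-gate idea above can be sketched as a small threshold table checked against a candidate's measured metrics. This is a minimal illustration only: the metric names, the `LAUNCH_GATES` table, the `failed_gates` helper, and every threshold value are hypothetical and would come from the product's own SLOs.

```python
# Hypothetical gate table: metric name -> (direction, threshold).
# "max": the measured value must not exceed the threshold.
# "min": the measured value must not fall below the threshold.
# All names and numbers are illustrative assumptions.
LAUNCH_GATES = {
    "asr_slice_wer_worst": ("max", 0.12),
    "unsafe_action_rate": ("max", 0.001),
    "task_success_rate": ("min", 0.85),
    "p95_turn_latency_ms": ("max", 1800),
}


def failed_gates(measured, gates=LAUNCH_GATES):
    """Return the sorted names of gates that fail.

    A metric that is missing from `measured` fails closed, so an
    unmeasured gate blocks the launch instead of passing silently.
    """
    failures = []
    for name, (direction, threshold) in gates.items():
        value = measured.get(name)
        if value is None:
            failures.append(name)
        elif direction == "max" and value > threshold:
            failures.append(name)
        elif direction == "min" and value < threshold:
            failures.append(name)
    return sorted(failures)
```

Failing closed on missing metrics is one possible design choice; it trades occasional blocked launches for the guarantee that no gate is skipped because telemetry was absent.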

Prompt 2: Cost And Latency Tradeoff

Traffic doubles during business hours. GPU cost is too high, but the p95 spoken answer latency SLO must remain below 1800 ms. What changes do you try first?

Hidden answer: practical tradeoff stack

Start with measurement by stage and tenant before changing models. Consider VAD tuning, early partial ASR, streaming retrieval, smaller rerankers, prompt trimming, KV-cache reuse where safe, continuous batching with deadlines, speculative decoding only if quality holds, TTS first-segment streaming, warm pools, request admission for noninteractive work, quantization, and fallback model tiers. Do not blend batch and interactive queues if it harms tail latency. Every savings proposal needs quality, safety, and rollback gates.
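"Measurement by stage and tenant" could start with something as small as the sketch below, which groups latency samples and reports a nearest-rank p95 per group. The `(tenant, stage, latency_ms)` tuple schema and both function names are assumptions for illustration; a real system would read these values from its tracing or metrics backend.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Integer form of ceil(0.95 * n), which avoids float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_by_stage_and_tenant(spans):
    """Group latency samples and return the p95 for each group.

    `spans` is an iterable of (tenant, stage, latency_ms) tuples,
    a hypothetical schema chosen for this sketch.
    """
    grouped = defaultdict(list)
    for tenant, stage, latency_ms in spans:
        grouped[(tenant, stage)].append(latency_ms)
    return {key: p95(samples) for key, samples in grouped.items()}
```

Groups with very few samples produce unstable percentiles, so a minimum sample count per group is worth adding before acting on the numbers.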

Round 2

Production Incident Exercises

These prompts test whether you can debug real systems without asking for raw private audio or transcripts.

Incident 1: Accent Slice Regression

A new ASR model improves aggregate WER by 5 percent but doubles correction rate for accented support calls in two regions. What do you do in the first hour?

Hidden answer: first-hour response

Freeze promotion, route affected slices back to the prior model, preserve aggregate-only metrics, and compare model version, region, client, device, VAD, language ID, noise level, and contextual bias features. Check whether eval coverage missed those accents or whether serving config differs from offline evaluation. Communicate blast radius and rollback status. Prevention should add slice gates, consented eval fixtures, labeling rubric updates, and canary alerts on correction rate, not only WER.

Incident 2: TTS First Audio Byte Spike

A TTS release keeps MOS steady but p95 first audio byte jumps from 320 ms to 1100 ms, increasing barge-in failures. Diagnose it.

Hidden answer: diagnosis plan

Split latency into request queue, text normalization, segmentation, acoustic model, vocoder, streaming chunk assembly, network, and client playback. Compare cold versus warm paths, voice version, sentence length, language, punctuation, cache hit rate, and GPU saturation. Mitigate by rolling back the voice, sending shorter first segments, warming voice pools, or routing long noninteractive reads to a cheaper path. Add release gates for first audio byte and chunk cadence, not only MOS.
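Once latency is split by stage, a quick way to pick the first suspect is to rank stages by how much their p95 grew between the old and new release. The sketch below assumes two dictionaries mapping stage name to p95 latency in milliseconds; the function name and input shape are hypothetical.

```python
def rank_stage_regressions(baseline_ms, candidate_ms):
    """Rank stages by increase in p95 latency, largest increase first.

    Both arguments map stage name -> p95 latency in ms (assumed inputs).
    Per-stage p95 values do not add up to the end-to-end p95, so the
    result ranks suspects; it does not decompose the total.
    """
    deltas = []
    for stage, new in candidate_ms.items():
        # A stage that did not exist in the baseline counts in full.
        old = baseline_ms.get(stage, 0.0)
        deltas.append((new - old, stage))
    # Largest delta first; ties are broken by stage name for stable output.
    deltas.sort(key=lambda item: (-item[0], item[1]))
    return [(stage, delta) for delta, stage in deltas]
```

Running the same ranking separately for cold and warm requests, or per voice version, follows the comparison axes listed above.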

Round 3

Coding Follow-Ups With Blind 75 Patterns

Speech ML/AI interviews often use small coding tasks that resemble production utilities. State the invariant first, then code.

Coding Prompt 1: Sliding Window Error Rate

Given timestamped request records with ok booleans, return the maximum error rate over any fixed-size time window.

Hidden answer: invariant, tests, and Python solution

Invariant: after advancing left, every record in the active window is within window_seconds of the current right edge. Test empty input, all success, all failure, same timestamp, and records exactly on the boundary.

from collections import deque


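def max_window_error_rate(records, window_seconds):
    """Return the highest error rate over any window ending at a record.

    Assumes records are (timestamp, ok) pairs sorted by ascending timestamp.
    """
    window = deque()  # records currently inside the window
    failures = 0      # failed records inside the window
    best = 0.0

    for ts, ok in records:
        window.append((ts, ok))
        if not ok:
            failures += 1

        # Advance left: drop records older than window_seconds.
        # A record exactly on the boundary stays in the window.
        while window and ts - window[0][0] > window_seconds:
            _, old_ok = window.popleft()
            if not old_ok:
                failures -= 1

        # The window always holds at least the current record when
        # window_seconds >= 0, so the division is safe.
        best = max(best, failures / len(window))

    return best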
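One way to exercise the listed test cases is to compare the deque solution against a slow reference. The sketch below is a quadratic brute force with the same semantics as the solution above appears to have: each record in turn is treated as the right edge, and input is assumed sorted by timestamp. The function name is hypothetical.

```python
def brute_force_max_error_rate(records, window_seconds):
    """O(n^2) reference for cross-checking the sliding-window solution.

    For each record used as the right edge, rescan every record up to
    and including it and keep those within window_seconds (inclusive).
    Assumes records are (timestamp, ok) pairs sorted by timestamp.
    """
    best = 0.0
    for i, (right_ts, _) in enumerate(records):
        in_window = [
            ok
            for ts, ok in records[: i + 1]
            if right_ts - ts <= window_seconds
        ]
        failures = sum(1 for ok in in_window if not ok)
        # in_window always contains the right-edge record itself.
        best = max(best, failures / len(in_window))
    return best
```

On small randomized inputs the two functions should agree; a mismatch usually points at the boundary comparison or at the failure counter drifting during eviction.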

Coding Prompt 2: Top-K Regressed Slices

Given baseline and candidate metric maps by slice, return the k slices with the largest regression for a lower-is-better metric such as WER or p95 latency.

Hidden answer: invariant, mistakes, and Python solution

Invariant: each candidate slice is compared only if the baseline exists, and positive delta means worse. Common mistakes include sorting in the wrong direction, hiding missing baselines, and using aggregate averages that mask slice failures.

import heapq


def top_k_regressed_slices(baseline, candidate, k):
    heap = []
    missing = []
    for name, new_value in candidate.items():
        if name not in baseline:
            missing.append(name)
            continue
        delta = new_value - baseline[name]
        if delta > 0:
            heapq.heappush(heap, (-delta, name, baseline[name], new_value))

    worst = []
    while heap and len(worst) < k:
        neg_delta, name, old, new = heapq.heappop(heap)
        worst.append({
            "slice": name,
            "delta": -neg_delta,
            "baseline": old,
            "candidate": new,
        })
    return worst, missing
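A possible follow-up is to avoid building a heap of every regressed slice when k is small. The variant below is a sketch using `heapq.nsmallest` on negated deltas, which keeps the same ordering and tie-breaking by slice name as the solution above; the function name is hypothetical.

```python
import heapq


def top_k_regressed_slices_bounded(baseline, candidate, k):
    """Variant of the top-k solution that selects with heapq.nsmallest.

    Returns (worst, missing) in the same shape: a list of dicts for the
    k largest positive deltas, and the slice names with no baseline.
    """
    missing = [name for name in candidate if name not in baseline]
    regressions = (
        (baseline[name] - candidate[name], name, baseline[name], candidate[name])
        for name in candidate
        if name in baseline and candidate[name] > baseline[name]
    )
    # The first tuple field is the negated delta, so the smallest
    # entries correspond to the largest regressions.
    worst = heapq.nsmallest(k, regressions)
    return [
        {"slice": name, "delta": -neg_delta, "baseline": old, "candidate": new}
        for neg_delta, name, old, new in worst
    ], missing
```

For n regressed slices this does roughly O(n log k) work instead of pushing all n entries onto a heap, which matters only when the slice count is large relative to k.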

Coding Prompt 3: Rollback Target Resolver

Given a list of release records sorted newest first, return the first version that is healthy for a requested tenant and model family.

Hidden answer: invariant and Python solution

Invariant: because records are sorted newest first, the first record that matches the model family and tenant, is healthy, and is not blocked is the most recent safe rollback target. Test no match, tenant-specific block, model-family mismatch, and a global release that applies to all tenants.

def resolve_rollback_target(releases, tenant, family):
    for release in releases:
        tenants = release.get("tenants", ["*"])
        if family != release["family"]:
            continue
        if tenant not in tenants and "*" not in tenants:
            continue
        if release.get("healthy") and not release.get("blocked"):
            return release["version"]
    return None

Scoring

Advanced Rubric

Question: What is a failing capstone answer?

A failing answer optimizes aggregate metrics only, has no rollback target, mixes batch and interactive traffic without deadlines, cannot explain slice drift, requests private raw data casually, or writes code without testing empty, boundary, and missing-data cases.