Speech ML/AI Capstone Exam

Exam Shape

Four Hours, Four Advanced Signals

Use this after completing the speech system-design, production MLOps, inference hosting, and evaluation chapters. Answer each section on a timer before opening the hidden strong answer.

R&D judgment: choose model families, metrics, and experiment plans under ambiguous quality goals.
System design: design online, offline, eval, data, rollout, and rollback paths with explicit SLOs.
Production operations: debug incidents using aggregate telemetry, privacy-safe logs, and reversible mitigation.
Coding: implement small utilities with clear invariants, edge cases, and cost analysis.

Question: What should an experienced candidate do before drawing architecture boxes?

Clarify product constraints, traffic shape, latency budget, privacy limits, failure tolerance, quality target, cost ceiling, model update frequency, and how success will be measured after launch. The diagram should follow those constraints.

Round 1

Speech-To-Speech Enterprise Assistant

Design a voice assistant for enterprise help desks. It handles spoken questions, retrieves policy documents, answers by voice, lets users interrupt, and escalates to humans for risky actions.

Prompt 1: End-To-End Architecture

Give an advanced design answer for the online serving path, offline eval path, data feedback path, rollout process, and rollback plan.

Hidden answer: strong architecture outline

Prefer a cascaded design (VAD, streaming ASR, retrieval, LLM policy engine, tool router, and streaming TTS) unless the product requires direct speech-to-speech research. Separate interactive capacity from batch evaluation. Version the ASR model, retrieval index, prompt, tools, LLM, TTS voice, and client config. Gate launches on per-slice ASR WER, retrieval grounding, task success, unsafe action rate, first audio byte latency, end-to-end turn latency, interruption recovery, escalation precision, cost per resolved ticket, and rollback time.
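
The versioning and gating advice can be made concrete with a pinned manifest and a gate check. This is a minimal sketch only: the Gate class, the metric names, the version strings, and every threshold are hypothetical, except the 1800 ms latency figure, which mirrors the SLO used in Prompt 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One launch gate; names and limits are illustrative assumptions."""
    metric: str
    limit: float
    lower_is_better: bool = True


def failed_gates(metrics, gates):
    """Return the metric names of gates that fail or have no measurement."""
    failed = []
    for gate in gates:
        value = metrics.get(gate.metric)
        if value is None:
            failed.append(gate.metric)  # a missing measurement blocks launch
        elif gate.lower_is_better and value > gate.limit:
            failed.append(gate.metric)
        elif not gate.lower_is_better and value < gate.limit:
            failed.append(gate.metric)
    return failed


# Hypothetical manifest: every versioned component is pinned together so a
# rollback restores one known-good combination.
manifest = {
    "asr": "asr-v12",
    "retrieval_index": "idx-v31",
    "prompt": "prompt-v7",
    "tools": "tools-v3",
    "llm": "llm-v4",
    "tts_voice": "voice-v2",
    "client_config": "cfg-v9",
}

gates = [
    Gate("slice_wer_max", 0.12),
    Gate("p95_turn_latency_ms", 1800),
    Gate("unsafe_action_rate", 0.001),
    Gate("task_success", 0.85, lower_is_better=False),
]

candidate_metrics = {
    "slice_wer_max": 0.10,
    "p95_turn_latency_ms": 1950,
    "task_success": 0.88,
}

print(failed_gates(candidate_metrics, gates))
# → ['p95_turn_latency_ms', 'unsafe_action_rate']
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a launch with no unsafe-action measurement should not pass by default.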

Prompt 2: Cost And Latency Tradeoff

Traffic doubles during business hours. GPU cost is too high, but the p95 spoken answer latency SLO must remain below 1800 ms. What changes do you try first?

Hidden answer: practical tradeoff stack

Start by measuring latency and cost by stage and tenant before changing models. Then consider VAD tuning, early partial ASR, streaming retrieval, smaller rerankers, prompt trimming, KV-cache reuse where safe, continuous batching with deadlines, speculative decoding only if quality holds, TTS first-segment streaming, warm pools, admission control for noninteractive work, quantization, and fallback model tiers. Keep batch and interactive queues separate when sharing them harms tail latency. Every savings proposal needs quality, safety, and rollback gates.
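
Measuring by stage and tenant can be as simple as grouping span latencies and ranking the p95 values. The sketch below assumes spans arrive as (tenant, stage, latency_ms) tuples; the tenant and stage names and the sample numbers are made up for illustration.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def p95_by_stage_and_tenant(spans):
    """Group latencies by (tenant, stage) and return the p95 of each group."""
    grouped = defaultdict(list)
    for tenant, stage, latency_ms in spans:
        grouped[(tenant, stage)].append(latency_ms)
    return {key: p95(values) for key, values in grouped.items()}


def slowest_stages(spans, top=3):
    """Return the top (tenant, stage) groups by p95 latency, slowest first."""
    table = p95_by_stage_and_tenant(spans)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical aggregate spans; no audio or transcript content is involved.
spans = [
    ("acme", "asr", 120),
    ("acme", "asr", 140),
    ("acme", "llm", 900),
    ("acme", "llm", 1300),
    ("acme", "tts", 200),
    ("globex", "llm", 700),
]

print(slowest_stages(spans, top=2))
# → [(('acme', 'llm'), 1300), (('globex', 'llm'), 700)]
```

Per-stage p95 values do not add up to the end-to-end p95, so use this table to pick where to look, not to prove the 1800 ms SLO is met.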

Round 2

Production Incident Exercises

These prompts test whether you can debug real systems without asking for raw private audio or transcripts.

Incident 1: Accent Slice Regression

A new ASR model improves aggregate WER by 5% relative but doubles correction rate on an approved aggregate accent/dialect slice in two regions. What do you do in the first hour?

Hidden answer: first-hour response

Freeze promotion, route affected slices back to the prior model, preserve aggregate-only metrics, and compare model version, region, client, device, VAD, language ID, noise level, and contextual bias features. Check whether consented eval coverage missed that slice or whether serving config differs from offline evaluation. Communicate blast radius and rollback status. Prevention should add approved aggregate slice gates, consented eval fixtures, labeling rubric updates, and canary alerts on correction rate, not only WER.
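
A slice gate on correction rate can run on aggregate counts alone. This is a sketch under stated assumptions: the input shape, the slice names, the 10% relative-increase threshold, and the 500-request minimum are all hypothetical.

```python
def regressed_slices(baseline, candidate, max_relative_increase=0.10, min_requests=500):
    """Flag slices whose correction rate rose beyond the allowed relative increase.

    baseline and candidate map slice name -> (corrections, requests).
    Returns (flagged, under_sampled).
    """
    if min_requests < 1:
        raise ValueError("min_requests must be at least 1")

    flagged = []
    under_sampled = []
    for name, (cand_corrections, cand_requests) in candidate.items():
        base = baseline.get(name)
        if base is None or base[1] < min_requests or cand_requests < min_requests:
            # Too little data to judge; report it rather than hide it.
            under_sampled.append(name)
            continue
        base_rate = base[0] / base[1]
        cand_rate = cand_corrections / cand_requests
        if cand_rate > base_rate * (1 + max_relative_increase):
            flagged.append((name, base_rate, cand_rate))
    return flagged, under_sampled


# Hypothetical aggregate counts per approved slice.
baseline = {
    "region_a/accent_x": (40, 1000),
    "region_b/accent_x": (30, 1000),
    "region_c/accent_y": (5, 100),
}
candidate = {
    "region_a/accent_x": (80, 1000),
    "region_b/accent_x": (31, 1000),
    "region_c/accent_y": (9, 120),
}

flagged, under_sampled = regressed_slices(baseline, candidate)
print([name for name, _, _ in flagged])  # → ['region_a/accent_x']
print(under_sampled)                     # → ['region_c/accent_y']
```

Reporting under-sampled slices separately matters here: a slice with too few requests is a coverage gap to fix, not a pass.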

Incident 2: TTS First Audio Byte Spike

A TTS release keeps MOS steady but p95 first audio byte jumps from 320 ms to 1100 ms, increasing barge-in failures. Diagnose it.

Hidden answer: diagnosis plan

Split latency into request queue, text normalization, segmentation, acoustic model, vocoder, streaming chunk assembly, network, and client playback. Compare cold versus warm paths, voice version, sentence length, language, punctuation, cache hit rate, and GPU saturation. Mitigate by rolling back the voice, sending shorter first segments, warming voice pools, or routing long noninteractive reads to a cheaper path. Add release gates for first audio byte and chunk cadence, not only MOS.
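
A release gate on first audio byte and chunk cadence needs only request and chunk timestamps. The sketch below assumes each response is recorded as (request_ts_ms, [chunk_ts_ms, ...]); the function names and both limits are hypothetical.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    return ordered[(95 * len(ordered) + 99) // 100 - 1]


def chunk_timing(request_ts_ms, chunk_ts_ms):
    """Return (first_audio_byte_ms, max_gap_ms) for one synthesized response."""
    if not chunk_ts_ms:
        raise ValueError("no audio chunks were emitted")
    ordered = sorted(chunk_ts_ms)
    first_audio_byte_ms = ordered[0] - request_ts_ms
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return first_audio_byte_ms, max(gaps, default=0)


def tts_gate(responses, fab_limit_ms, gap_limit_ms):
    """Check p95 first audio byte and p95 worst inter-chunk gap against limits."""
    if not responses:
        raise ValueError("need at least one response to evaluate the gate")
    timings = [chunk_timing(request_ts, chunks) for request_ts, chunks in responses]
    fab_p95 = p95([fab for fab, _ in timings])
    gap_p95 = p95([gap for _, gap in timings])
    return {
        "fab_p95_ms": fab_p95,
        "gap_p95_ms": gap_p95,
        "pass": fab_p95 <= fab_limit_ms and gap_p95 <= gap_limit_ms,
    }


# Two hypothetical responses: one fast, one with a slow first chunk.
responses = [
    (0, [300, 380, 460]),
    (1000, [2100, 2180, 2400]),
]

print(tts_gate(responses, fab_limit_ms=400, gap_limit_ms=250))
# → {'fab_p95_ms': 1100, 'gap_p95_ms': 220, 'pass': False}
```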

Round 3

Coding Follow-Ups for Speech Production Utilities

Speech ML/AI interviews often use small coding tasks that resemble production utilities. State the invariant first, then code.

Coding Prompt 1: Sliding Window Error Rate

Given timestamped request records with ok booleans, return the maximum error rate over any fixed-size time window.

Hidden answer: invariant, tests, and Python solution

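Invariant: records are processed in nondecreasing timestamp order; after advancing the left edge, every record in the active window is within window_seconds of the current right edge, and the rate is scored on that trailing window after each record is appended. Each record enters and leaves the deque at most once, so the cost is O(n) time and, in the worst case, O(n) space. Test empty input, all success, all failure, same timestamp, out-of-order timestamps, negative windows, and records exactly on the boundary.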

from collections import deque


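def max_window_error_rate(records, window_seconds):
    """Return the highest failure fraction seen in any trailing window.

    records is an iterable of (timestamp, ok) pairs in nondecreasing
    timestamp order; an empty input returns 0.0.
    """
    if window_seconds < 0:
        raise ValueError("window_seconds must be non-negative")

    window = deque()  # records currently inside the window, oldest first
    failures = 0      # number of not-ok records inside the window
    best = 0.0
    previous_ts = None

    for ts, ok in records:
        if previous_ts is not None and ts < previous_ts:
            raise ValueError("records must be sorted by timestamp")
        previous_ts = ts

        window.append((ts, ok))
        if not ok:
            failures += 1

        # Evict records older than window_seconds; boundary records stay.
        while window and ts - window[0][0] > window_seconds:
            _, old_ok = window.popleft()
            if not old_ok:
                failures -= 1

        # The current record is never evicted, so the window is non-empty.
        best = max(best, failures / len(window))

    return best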
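
A slow reference implementation is a cheap way to test the sliding-window solution: run both on small inputs and compare the results. The sketch below assumes the same reading of the prompt as the solution above (sorted records, rate scored on the trailing window after each record). The function name is hypothetical, and unlike the solution it does not check sort order.

```python
def brute_force_max_error_rate(records, window_seconds):
    """O(n^2) reference: rescan the trailing window after every record.

    records must be a list of (timestamp, ok) pairs already sorted by
    timestamp.
    """
    if window_seconds < 0:
        raise ValueError("window_seconds must be non-negative")

    best = 0.0
    for i, (ts, _) in enumerate(records):
        # Everything seen so far that is still within the window of record i.
        in_window = [
            ok for old_ts, ok in records[: i + 1]
            if ts - old_ts <= window_seconds
        ]
        failures = sum(1 for ok in in_window if not ok)
        best = max(best, failures / len(in_window))
    return best


# A record exactly on the boundary stays in the window.
print(brute_force_max_error_rate([(0, True), (10, False)], 10))  # → 0.5
# One second narrower and the old success drops out.
print(brute_force_max_error_rate([(0, True), (10, False)], 9))   # → 1.0
```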

Coding Prompt 2: Top-K Regressed Slices

Given baseline and candidate metric maps by slice, return the k slices with the largest regression for a lower-is-better metric such as WER or p95 latency.

Hidden answer: invariant, mistakes, and Python solution

Invariant: each candidate slice is compared only if the baseline exists, and positive delta means worse. Pushing every regressed slice and popping up to k of them costs O(r log r) time and O(r) space for r regressed slices. Common mistakes include sorting in the wrong direction, hiding missing baselines, and using aggregate averages that mask slice failures. Test zero and negative k, missing baselines, ties (broken here by slice name), no regressions, and higher-is-better metrics that must be normalized before calling.

import heapq


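def top_k_regressed_slices(baseline, candidate, k):
    """Return (worst, missing) for a lower-is-better metric.

    worst holds up to k regressed slices, largest delta first; missing
    lists candidate slices that have no baseline to compare against.
    """
    if k < 0:
        raise ValueError("k must be non-negative")

    heap = []
    missing = []
    for name, new_value in candidate.items():
        if name not in baseline:
            # Surface missing baselines instead of silently skipping them.
            missing.append(name)
            continue
        delta = new_value - baseline[name]
        if delta > 0:
            # heapq is a min-heap, so negate delta to pop the worst first.
            heapq.heappush(heap, (-delta, name, baseline[name], new_value))

    worst = []
    while heap and len(worst) < k:
        neg_delta, name, old, new = heapq.heappop(heap)
        worst.append({
            "slice": name,
            "delta": -neg_delta,
            "baseline": old,
            "candidate": new,
        })
    return worst, missing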
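
Higher-is-better metrics such as task success need their sign flipped before the lower-is-better comparison. One simple normalization is negation, sketched below; the helper name and the slice names are hypothetical, and the delta check is inlined so the example runs on its own.

```python
def to_lower_is_better(metric_by_slice, higher_is_better):
    """Negate a higher-is-better metric so that a positive delta means worse."""
    if not higher_is_better:
        return dict(metric_by_slice)
    return {name: -value for name, value in metric_by_slice.items()}


# Hypothetical task-success rates by slice (higher is better).
baseline_success = {"en-US": 0.90, "en-IN": 0.88}
candidate_success = {"en-US": 0.91, "en-IN": 0.80}

base = to_lower_is_better(baseline_success, higher_is_better=True)
cand = to_lower_is_better(candidate_success, higher_is_better=True)

# Same comparison the top-k function performs: candidate minus baseline.
deltas = {name: cand[name] - base[name] for name in cand if name in base}
regressed = [name for name, delta in deltas.items() if delta > 0]
print(regressed)  # → ['en-IN']
```

After negation the baseline and candidate values are also negative, so flip the sign back before showing them in a report.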

Coding Prompt 3: Rollback Target Resolver

Given a list of release records sorted newest first, return the first version that is healthy for a requested tenant and model family.

Hidden answer: invariant and Python solution

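Invariant: because records are already sorted newest first, the first record that matches the model family and tenant, is healthy, and is not blocked is the newest eligible rollback target. The scan is O(n) in the number of release records. Test no match, tenant-specific block, model-family mismatch, and a global release that applies to all tenants.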

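def resolve_rollback_target(releases, tenant, family):
    """Return the newest eligible version, or None if no release qualifies."""
    for release in releases:
        # A release with no tenant list applies to every tenant ("*").
        tenants = release.get("tenants", ["*"])
        if family != release["family"]:
            continue
        if tenant not in tenants and "*" not in tenants:
            continue
        # Missing "healthy" counts as unhealthy; missing "blocked" as not blocked.
        if release.get("healthy") and not release.get("blocked"):
            return release["version"]
    return None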
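
The resolver trusts that its input is sorted newest first, so it is worth checking that precondition before the scan. This sketch assumes each release record carries a comparable timestamp under a hypothetical released_at key; that field does not appear in the original records.

```python
def is_newest_first(releases, key="released_at"):
    """True when every record is at least as new as the one after it."""
    stamps = [release[key] for release in releases]
    return all(newer >= older for newer, older in zip(stamps, stamps[1:]))


# Hypothetical release history with epoch-second timestamps.
releases = [
    {"version": "asr-v12", "family": "asr", "healthy": False, "released_at": 1700300000},
    {"version": "asr-v11", "family": "asr", "healthy": True, "released_at": 1700200000},
    {"version": "tts-v5", "family": "tts", "healthy": True, "released_at": 1700100000},
]

print(is_newest_first(releases))                  # → True
print(is_newest_first(list(reversed(releases))))  # → False
```

If the check fails, sort the records or reject the input; returning the first healthy match from an unsorted list can pick an older release than intended.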

Scoring

Advanced Rubric

Names assumptions and turns them into measurable requirements.
Separates serving, eval, data, admin, and rollback paths.
Chooses metrics that catch slice regressions and user-visible failures.
Explains cost and latency tradeoffs without sacrificing safety or debuggability.
Uses privacy-safe telemetry and avoids raw audio or transcript exposure.
Writes coding solutions with invariants, edge cases, and complexity analysis.

Question: What is a failing capstone answer?

A failing answer optimizes aggregate metrics only, has no rollback target, mixes batch and interactive traffic without deadlines, cannot explain slice drift, requests private raw data casually, or writes code without testing empty, boundary, and missing-data cases.