Advanced Speech ML/AI Evaluation And Incident Exam

Exam Method

Answer With Evidence, Not Vibes

Each prompt asks for a decision under uncertainty. Before opening the hidden answer, write the metric contract, the slice plan, the rollback trigger, and the smallest useful implementation.

Define quality: specify user-visible outcomes, proxy metrics, and failure modes.
Slice first: segment by language, accent or dialect (using only approved aggregate or consented tags), noise, device, tenant, region, model version, and traffic path.
Gate releases: combine offline evals, shadow traffic, canaries, SLOs, and rollback criteria.
Debug safely: use aggregate telemetry and consented fixtures instead of private raw audio or transcripts.
Encode judgment: write small deterministic utilities for promotion, rollback, drift, and cost decisions.

Question: Why can aggregate WER be a dangerous launch metric?

Aggregate WER can improve while a critical slice regresses. A well-designed launch gate checks user-impacting slices such as noisy mobile calls, accent or dialect groups (from approved aggregate or consented tags), domain terms, entity-heavy utterances, and regions with different network conditions. It also connects WER to downstream task success, correction rate, latency, safety, and cost.
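To make the risk concrete, here is a minimal sketch with made-up counts. The slice names, error counts, and word counts are illustrative assumptions, not measurements. It computes per-slice WER next to the word-weighted aggregate so the two can be compared directly.

```python
def wer(errors, words):
    """Word error rate: total errors (sub + del + ins) over reference words."""
    if words <= 0:
        raise ValueError("reference word count must be positive")
    return errors / words


def wer_report(slices):
    """Return per-slice WER plus the word-weighted aggregate.

    `slices` maps a slice name to (errors, reference_words).
    """
    report = {name: wer(e, w) for name, (e, w) in slices.items()}
    total_errors = sum(e for e, _ in slices.values())
    total_words = sum(w for _, w in slices.values())
    report["__aggregate__"] = wer(total_errors, total_words)
    return report


# Hypothetical counts: (errors, reference words) per slice.
baseline = {"quiet_desktop": (900, 9000), "noisy_mobile": (200, 1000)}
candidate = {"quiet_desktop": (720, 9000), "noisy_mobile": (260, 1000)}

base_report = wer_report(baseline)
cand_report = wer_report(candidate)
# aggregate:    0.110 -> 0.098 (looks better)
# noisy_mobile: 0.20  -> 0.26  (worse)
```

Because the large quiet slice dominates the word count, its improvement hides the regression on the small noisy slice. A gate that reads only the aggregate would promote this candidate.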

Round 1

Design The Evaluation Stack

These prompts test whether you can build evaluation systems that serve research, release engineering, and production operations.

Prompt 1: Streaming ASR Release Gate

A candidate streaming ASR model improves offline WER by 4 percent relative to baseline, but changes partial hypothesis behavior. Design the promotion gate.

Hidden answer: strong gate outline

Gate on final WER and CER by slice, entity recall, punctuation if user-visible, first partial latency, finalization latency, partial churn, endpointing false cuts, correction rate, and downstream task success. Run offline fixtures, shadow traffic, and a small canary. Promote only if no high-risk or high-value slice, including slices built from approved aggregate or consented tags, crosses a defined regression threshold. Roll back on tail-latency SLO breach, correction-rate spike, or severe entity recall regression.
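The three rollback triggers named in this answer can be sketched as a small deterministic check. The metric names, the limit names, and the numeric limits here are all hypothetical; a real gate would take them from the team's SLO definitions.

```python
def rollback_reasons(baseline, canary, limits):
    """Return the rollback triggers that fired; an empty list means keep the canary."""
    required = ("p99_latency_ms", "correction_rate", "entity_recall")
    for name, metrics in (("baseline", baseline), ("canary", canary)):
        missing = [m for m in required if m not in metrics]
        if missing:
            # Missing telemetry is an error, never treated as healthy.
            raise ValueError(f"{name} is missing metrics: {missing}")

    reasons = []
    if canary["p99_latency_ms"] > limits["p99_latency_slo_ms"]:
        reasons.append("tail_latency_slo_breach")
    spike_limit = baseline["correction_rate"] * limits["max_correction_rate_ratio"]
    if canary["correction_rate"] > spike_limit:
        reasons.append("correction_rate_spike")
    recall_drop = baseline["entity_recall"] - canary["entity_recall"]
    if recall_drop > limits["max_entity_recall_drop"]:
        reasons.append("entity_recall_regression")
    return reasons


# Hypothetical limits and aggregate canary telemetry.
limits = {
    "p99_latency_slo_ms": 800,
    "max_correction_rate_ratio": 1.2,
    "max_entity_recall_drop": 0.02,
}
baseline = {"p99_latency_ms": 640, "correction_rate": 0.050, "entity_recall": 0.91}
canary = {"p99_latency_ms": 700, "correction_rate": 0.071, "entity_recall": 0.90}

reasons = rollback_reasons(baseline, canary, limits)
# reasons == ["correction_rate_spike"]
```

Returning the list of reasons rather than a bare boolean keeps the decision auditable: the on-call engineer sees which trigger fired without re-deriving it.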

Prompt 2: Spoken RAG Evaluation

A voice assistant answers enterprise policy questions. Users speak through ASR, retrieval fetches private documents, and the answer is spoken back with TTS. What evals do you need before launch?

Hidden answer: eval plan

Evaluate ASR robustness on noisy and entity-heavy queries, retrieval recall with ASR variants, ACL correctness, freshness, grounded answer rate, refusal behavior, citation traceability, tool-use safety, TTS intelligibility, first audio byte, complete turn latency, escalation precision, and cost per successful answer. Include human review for high-risk intents and synthetic tests for stale documents, ambiguous policies, and prompt-injection attempts.
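As one illustration of "retrieval recall with ASR variants", the sketch below scores the same intent under a clean query and a simulated ASR transcript. The toy corpus, the word-overlap retriever, and the example queries are all assumptions made for the example; a real eval would call the production retriever with consented or synthetic fixtures.

```python
# Toy corpus (hypothetical document ids and text).
DOCS = {
    "expense_policy": "travel expense reimbursement policy",
    "pto_policy": "pto accrual and carryover policy",
}


def toy_retrieve(text, k):
    """Rank documents by word overlap with the query; ties keep corpus order."""
    words = set(text.lower().split())
    ranked = sorted(DOCS, key=lambda d: -len(words & set(DOCS[d].split())))
    return ranked[:k]


def recall_by_variant(retrieve, cases, k):
    """Recall@k per query variant label (for example 'clean' versus 'asr')."""
    hits, totals = {}, {}
    for case in cases:
        for label, text in case["variants"].items():
            totals[label] = totals.get(label, 0) + 1
            if case["expected_doc"] in retrieve(text, k):
                hits[label] = hits.get(label, 0) + 1
    return {label: hits.get(label, 0) / totals[label] for label in totals}


cases = [
    {
        "expected_doc": "pto_policy",
        "variants": {
            "clean": "pto carryover policy",
            "asr": "pea tea oh carry over policy",  # acronym split by ASR
        },
    },
    {
        "expected_doc": "expense_policy",
        "variants": {
            "clean": "travel expense reimbursement policy",
            "asr": "travel expanse reimbursement policy",
        },
    },
]

recall = recall_by_variant(toy_retrieve, cases, k=1)
# recall == {"clean": 1.0, "asr": 0.5}
```

The gap between the two numbers is the signal: clean-text retrieval evals alone would report perfect recall here and miss the acronym failure that spoken users would hit.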

Prompt 3: TTS Quality Versus Latency

A new TTS voice wins preference tests but increases first audio byte. Explain how you would decide whether to ship it.

Hidden answer: decision framework

Split product surfaces into interactive and noninteractive use. For interactive voice agents, require first audio byte, chunk cadence, barge-in recovery, pronunciation, safety, and abandonment to stay within gates. For long-form generation, slower synthesis may be acceptable if quality improves and cost is controlled. Consider segmented first responses, voice warm pools, cache policies, or model-tier routing before rejecting the release.
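A minimal sketch of this surface split follows. The gate names, limits, cost budget, and the "mitigate_then_retest" outcome label are hypothetical; the point is only that the same candidate can receive different decisions on different surfaces.

```python
# Hypothetical interactive gates; lower is better for each metric.
INTERACTIVE_GATES = {
    "first_audio_byte_ms": 300,
    "barge_in_recovery_ms": 250,
    "abandonment_rate": 0.04,
}


def tts_ship_decision(surface, candidate, gates, cost_budget):
    """Decide per surface whether a TTS voice ships, holds, or needs mitigation."""
    if surface not in {"interactive", "noninteractive"}:
        raise ValueError(f"unknown surface: {surface}")

    if surface == "noninteractive":
        # Long-form generation: latency does not gate, quality and cost do.
        ok = candidate["quality_improved"] and candidate["cost_per_1k_chars"] <= cost_budget
        return {"decision": "ship" if ok else "hold", "breaches": []}

    # Interactive agents: every gate must hold; a missing metric counts as a breach.
    breaches = sorted(
        name
        for name, limit in gates.items()
        if candidate.get(name) is None or candidate[name] > limit
    )
    decision = "ship" if not breaches else "mitigate_then_retest"
    return {"decision": decision, "breaches": breaches}


candidate = {
    "first_audio_byte_ms": 380,
    "barge_in_recovery_ms": 200,
    "abandonment_rate": 0.03,
    "quality_improved": True,
    "cost_per_1k_chars": 0.012,
}

interactive = tts_ship_decision("interactive", candidate, INTERACTIVE_GATES, 0.015)
long_form = tts_ship_decision("noninteractive", candidate, INTERACTIVE_GATES, 0.015)
# interactive -> mitigate_then_retest, breaches == ["first_audio_byte_ms"]
# long_form   -> ship
```

The "mitigate_then_retest" outcome reflects the last sentence of the answer: a breach on the interactive surface is a prompt to try segmented first responses, warm pools, caching, or tier routing, not an automatic rejection.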

Round 2

Production Incident Exercises

For each incident, write the first five checks, the safe mitigation, and the prevention item before opening the answer.

Incident 1: Partial Churn Spike

Users see the transcript rewrite itself several times before the final ASR result. Final WER is unchanged, but trust and completion rate drop.

Hidden answer: first-hour response

Slice by model, decoder config, VAD, chunk size, endpointing, client version, language, noise, and network path. Compare partial churn, first partial latency, finalization delay, and downstream correction rate. Mitigate with decoder rollback, more conservative partial display, endpointing rollback, or affected-slice routing. Add canary gates for partial stability and UI-visible rewrite rate.
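Partial stability needs a concrete definition before it can be gated. One possible definition, assumed here rather than taken from any standard, counts every already-displayed word that is not preserved as a prefix of the next hypothesis, then normalizes by final transcript length. The utterances are synthetic.

```python
def common_prefix_len(a, b):
    """Number of leading words shared by two word lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def partial_churn(partials, final):
    """Rewritten words per final word for one utterance.

    A displayed word counts as rewritten when the next hypothesis does not
    keep it in place as part of a shared prefix.
    """
    hyps = [p.split() for p in partials] + [final.split()]
    rewritten = 0
    for prev, nxt in zip(hyps, hyps[1:]):
        rewritten += len(prev) - common_prefix_len(prev, nxt)
    # Guard against an empty final transcript.
    return rewritten / max(len(hyps[-1]), 1)


stable = partial_churn(["turn", "turn on", "turn on the"], "turn on the lights")
unstable = partial_churn(
    ["recognize", "recognize speech", "wreck a nice", "wreck a nice beach"],
    "recognize speech today",
)
# stable == 0.0, unstable == 2.0
```

Both utterances could end with a correct final transcript, which is why final WER stays flat during this incident while the user-visible rewrite rate does not.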

Incident 2: Retrieval Freshness Regression

One tenant reports that the spoken assistant cites old policy after an overnight index refresh. Other tenants look healthy.

Hidden answer: investigation and mitigation

Check tenant index version, document ingestion watermark, ACLs, embedding version, chunking config, reranker version, cache TTL, and whether ASR entity substitutions changed retrieval queries. Pin the tenant to the last good index or force human handoff for affected intents. Add per-tenant freshness gates and ASR-noisy retrieval evals before index promotion.
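A per-tenant freshness gate can be sketched as a comparison between each tenant's ingestion watermark and its latest source update. The field names, tenant names, timestamps, and the six-hour lag budget are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_tenants(index_state, max_lag):
    """Return {tenant: lag} for tenants whose index trails the source by more than max_lag.

    A tenant with unknown freshness is reported with lag None so it fails the gate.
    """
    stale = {}
    for tenant, state in index_state.items():
        watermark = state.get("ingestion_watermark")
        latest = state.get("latest_source_update")
        if watermark is None or latest is None:
            stale[tenant] = None
            continue
        lag = latest - watermark
        if lag > max_lag:
            stale[tenant] = lag
    return stale


utc = timezone.utc
index_state = {
    "tenant_a": {
        "ingestion_watermark": datetime(2024, 5, 2, 1, 0, tzinfo=utc),
        "latest_source_update": datetime(2024, 5, 2, 0, 30, tzinfo=utc),
    },
    "tenant_b": {
        # The overnight refresh did not advance this tenant's watermark.
        "ingestion_watermark": datetime(2024, 4, 30, 1, 0, tzinfo=utc),
        "latest_source_update": datetime(2024, 5, 1, 22, 0, tzinfo=utc),
    },
}

stale = stale_tenants(index_state, max_lag=timedelta(hours=6))
# stale == {"tenant_b": timedelta(hours=45)}
```

Run before index promotion, a check like this would block the refresh for the one affected tenant while letting healthy tenants proceed.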

Incident 3: Cost Spike With Flat Traffic

GPU spend rises 55 percent while request count and task success stay nearly flat. p99 latency also worsens.

Hidden answer: likely causes and controls

Check prompt length, retrieved document count, conversation length, retry storms, speculative decoding fallback, model mix, cache hit rate, batch efficiency, and tenant-specific routes. Stabilize with token budgets, retrieval caps, priority queues, fallback model tiers, admission control for noninteractive work, and release rollback if a config change caused the spike. Add cost per successful turn to launch gates.
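The arithmetic behind "cost per successful turn" is short enough to write down. The spend, turn count, success rate, and the 10 percent gate below are made-up values chosen to mirror the 55 percent spend increase in this incident.

```python
def cost_per_successful_turn(gpu_spend, turns, success_rate):
    """GPU spend divided by the number of turns that completed the task."""
    successes = turns * success_rate
    if successes <= 0:
        raise ValueError("no successful turns to attribute cost to")
    return gpu_spend / successes


def cost_gate_passes(before, after, max_relative_increase):
    """True when unit cost grew by no more than the allowed fraction."""
    return (after / before - 1) <= max_relative_increase


# Hypothetical weekly numbers: spend rises 55 percent, traffic and success stay flat.
before = cost_per_successful_turn(10_000.0, 2_000_000, 0.80)
after = cost_per_successful_turn(15_500.0, 2_000_000, 0.80)
increase = after / before - 1
gate_ok = cost_gate_passes(before, after, max_relative_increase=0.10)
# increase is 0.55 up to float rounding; gate_ok is False
```

With flat traffic and flat success, the whole spend increase lands on unit cost, so a launch gate on this metric would have caught the change even though request count and task success dashboards looked normal.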

Round 3

Implementation Drills

These are intentionally small. They mirror interview utilities and production checks that an experienced engineer should implement cleanly.

Drill 1: Promotion Gate

Given baseline and candidate metrics by slice, return whether the candidate can promote. Each threshold declares whether lower or higher is better so latency/error metrics and recall/success metrics are judged in the right direction.

Hidden answer: invariant, edge cases, and Python solution

Invariant: every required slice and metric must pass. Missing candidate metrics are failures because a release gate cannot assume safety. Test missing slices, missing metrics, exact threshold, improvements, a lower-is-better regression, and a higher-is-better regression.

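def can_promote(baseline, candidate, thresholds):
    """Gate a candidate against the baseline, slice by slice.

    Returns {"promote": bool, "failures": [(slice, metric, detail), ...]},
    where detail is "missing_slice", "missing_metric", or the regression size.
    """
    failures = []

    for slice_name, base_metrics in baseline.items():
        cand_metrics = candidate.get(slice_name)
        if cand_metrics is None:
            # A slice the candidate never reported cannot be assumed safe.
            failures.append((slice_name, None, "missing_slice"))
            continue

        for metric, base_value in base_metrics.items():
            if metric not in thresholds:
                # Metrics without a threshold are not part of the gate.
                continue
            cand_value = cand_metrics.get(metric)
            if cand_value is None:
                failures.append((slice_name, metric, "missing_metric"))
                continue

            direction = thresholds[metric]["direction"]
            allowed = thresholds[metric]["allowed_regression"]

            # Express regression as a positive number when the candidate is worse.
            if direction == "lower_is_better":
                regression = cand_value - base_value
            elif direction == "higher_is_better":
                regression = base_value - cand_value
            else:
                raise ValueError(f"unknown direction for {metric}: {direction}")

            # Small tolerance so a regression exactly at the threshold passes
            # despite floating-point rounding.
            if regression - allowed > 1e-12:
                failures.append((slice_name, metric, regression))

    return {
        "promote": not failures,
        "failures": failures,
    }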

Drill 2: Consecutive Drift Detector

Given daily aggregate metrics for one slice, return the first day on which a metric completes a run of n consecutive threshold breaches. The threshold declares whether higher or lower values are worse.

Hidden answer: invariant, common mistakes, and Python solution

Invariant: streak is the number of consecutive days up to the current point that violate the threshold in the configured direction. Reset it on a healthy day. Common mistakes include flagging one-day noise, assuming every metric drifts upward, and forgetting to return the first sustained breach. In production, reject invalid windows and missing telemetry instead of treating unknown days as healthy.

def first_sustained_drift(points, metric, threshold, consecutive_days, direction):
    if consecutive_days <= 0:
        raise ValueError("consecutive_days must be positive")
    if direction not in {"higher_is_worse", "lower_is_worse"}:
        raise ValueError(f"unknown direction: {direction}")

    streak = 0
    for point in points:
        if metric not in point:
            raise ValueError(f"missing {metric} for day {point.get('day')}")

        value = point[metric]
        if direction == "higher_is_worse":
            breached = value > threshold
        else:
            breached = value < threshold

        if breached:
            streak += 1
            if streak >= consecutive_days:
                return point["day"]
        else:
            streak = 0
    return None

Drill 3: Rollback Candidate Ranking

Given release candidates with a known-good flag, blast radius, regression score, and rollback time, rank them from safest rollback target to least safe.

Hidden answer: strong reasoning and Python solution

A rollback target should reduce user impact quickly without hiding safety or privacy issues. Only rank candidates that are explicitly known-good; an unknown version is not a safe rollback target during an incident. Prefer lower regression score, then smaller blast radius, then shorter rollback time. The solution compares these in that order and uses observability only to break exact ties. In practice, if two options are merely close, choose the one with better observability and the smaller config difference from the current serving path.

def rank_rollback_targets(candidates):
    required = {
        "version",
        "known_good",
        "regression_score",
        "blast_radius_percent",
        "rollback_minutes",
    }

    safe_candidates = []
    for candidate in candidates:
        missing = required - candidate.keys()
        if missing:
            raise ValueError(f"missing rollback fields: {sorted(missing)}")
        if candidate["known_good"]:
            safe_candidates.append(candidate)

    if not safe_candidates:
        raise ValueError("no known-good rollback target")

    def score(candidate):
        return (
            candidate["regression_score"],
            candidate["blast_radius_percent"],
            candidate["rollback_minutes"],
            -candidate.get("observability_score", 0),
        )

    return sorted(safe_candidates, key=score)

Rubric

What A Strong Answer Includes

Evaluation Depth

Defines metrics, slices, fixtures, human review, rollout gates, and known blind spots.

Production Judgment

Chooses reversible mitigations, protects privacy, explains blast radius, and adds prevention gates.

Systems Tradeoffs

Connects model quality to latency, cost, queueing, cache behavior, deployment shape, and user experience.

Coding Discipline

States invariants, handles missing data, mentally walks through simple tests, and returns actionable output.

Question: What is the fastest way to weaken an otherwise strong answer?

Treating evaluation, serving, and incidents as separate worlds. A strong answer shows how offline metrics predict production behavior, how production telemetry exposes eval blind spots, and how every launch gate maps to a rollback or prevention decision.