Production Track

Speech Data And Evaluation Operations

Build the operating discipline behind strong ASR, TTS, speech-to-speech, and speech RAG systems: dataset contracts, labeling quality, eval slices, privacy boundaries, CI gates, drift monitoring, and release decisions.

Data Flywheel

Turn Production Signals Into Safer Model Improvements

ML engineers are expected to connect model training with product telemetry, human review, release gates, and rollback. The data loop must improve quality without turning private audio into an unmanaged asset.

  1. Observe: collect aggregate latency, confidence, slice, correction, and failure signals.
  2. Sample: select privacy-approved examples or synthetic fixtures that represent failure modes.
  3. Label: define rubrics, annotator checks, disagreement rules, and audit trails.
  4. Evaluate: run stable offline suites plus shadow or canary metrics before rollout.
  5. Ship: use CI gates, model registry metadata, rollout flags, and rollback drills.
  6. Monitor: track drift, cost, latency, quality, and user harm signals after launch.

Question: What separates a strong answer from a junior answer here?

A junior answer often says "collect more data." A strong answer says which data is allowed, which slice is missing, how labels are audited, how offline evals map to online metrics, what regression budget is acceptable, and how the team rolls back if the improvement fails in production.

Dataset Contracts

Every Dataset Needs A Release Contract

Required Metadata

Purpose, owner, consent scope, source, retention policy, allowed uses, language, domain, acoustic conditions, device class, and known gaps.

Hidden answer: why metadata changes model risk

Metadata tells you whether an eval result can be trusted for the launch target. A dataset dominated by clean English headset audio does not validate noisy mobile dictation, multilingual support, rare names, far-field assistants, or low-bandwidth calls.
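One way to make the contract enforceable is to check each dataset record against the required fields before it can back a release. The sketch below is illustrative only: the class name `DatasetContract`, the helper `missing_fields`, and the example values are assumptions, and the field list simply mirrors the required metadata above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DatasetContract:
    """Hypothetical contract type; fields mirror the required metadata list."""
    purpose: str
    owner: str
    consent_scope: str
    source: str
    retention_policy: str
    allowed_uses: tuple
    language: str
    domain: str
    acoustic_conditions: str
    device_class: str
    known_gaps: tuple


def missing_fields(record):
    """Return contract fields that are absent, None, or blank in `record`."""
    missing = []
    for field in fields(DatasetContract):
        value = record.get(field.name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field.name)
    return missing


# Example record with made-up values: consent scope is absent and the
# retention policy is blank, so both should be reported.
record = {
    "purpose": "streaming ASR eval",
    "owner": "speech-eval-team",
    "source": "opt-in dictation samples",
    "retention_policy": "",
    "allowed_uses": ("evaluation",),
    "language": "en-US",
    "domain": "dictation",
    "acoustic_conditions": "clean headset",
    "device_class": "desktop",
    "known_gaps": ("noisy mobile", "far-field"),
}
print(missing_fields(record))
```

A check like this only proves the fields are filled in. Whether the stated language, device class, and acoustic conditions match the launch target still needs human review.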

Versioned Artifacts

Keep immutable manifests, checksums, schema versions, split definitions, label rubric versions, and model-to-data lineage.

Hidden answer: release-review invariant

A model release must be reproducible enough to answer: which data trained it, which data evaluated it, which labels changed, which slices regressed, and which users could be affected. If the team cannot answer those questions, rollback and debugging become guesswork.
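A minimal sketch of an immutable manifest, assuming files are available as bytes and that SHA-256 is an acceptable checksum. The function name `build_manifest` and the manifest layout are hypothetical, not a standard format.

```python
import hashlib
import json


def build_manifest(files, schema_version, rubric_version):
    """Build a manifest from a mapping of relative path -> file bytes.

    The manifest id is a hash of the canonical JSON body, so any change to
    a file, the schema version, or the rubric version yields a new id.
    """
    entries = [
        {
            "path": path,
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        }
        for path, data in sorted(files.items())  # sorted for stable ordering
    ]
    body = {
        "schema_version": schema_version,
        "label_rubric_version": rubric_version,
        "files": entries,
    }
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    manifest_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return dict(body, manifest_id=manifest_id)


manifest = build_manifest(
    {"audio/0001.wav": b"fake-audio", "labels/0001.json": b"{}"},
    schema_version="2",
    rubric_version="asr-rubric-5",
)
print(manifest["manifest_id"])
```

Recording the manifest id alongside each trained or evaluated model is one way to keep model-to-data lineage answerable at release review.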

Label Quality

Design The Labeling System Before Scaling It

ASR Labels

Fix the conventions for punctuation, casing, numerals, hesitations, speaker labels (diarization), timestamps, and entity spelling before measuring WER.

Hidden answer: common failure

Teams often mix label conventions and then blame the model for noisy metrics. Decide whether "twenty one" and "21" are equivalent, whether filler words count, and how domain terms are canonicalized.
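The effect of label conventions on WER can be seen with a small sketch. The normalization rules here (a one-entry number map and a two-word filler list) are deliberately tiny assumptions for illustration. A real rubric would define them in full.

```python
import re

# Illustrative conventions only; a production rubric would be far larger.
NUMBER_WORDS = {"twenty one": "21"}
FILLERS = {"um", "uh"}


def normalize(text):
    """Lowercase, drop punctuation, canonicalize numerals, remove fillers."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep apostrophes inside words
    for words, digits in NUMBER_WORDS.items():
        text = re.sub(rf"\b{re.escape(words)}\b", digits, text)
    return [token for token in text.split() if token not in FILLERS]


def wer(ref_tokens, hyp_tokens):
    """Word error rate: token-level edit distance divided by reference length."""
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, start=1):
        cur = [i]
        for j, hyp_tok in enumerate(hyp_tokens, start=1):
            cur.append(min(
                prev[j] + 1,                        # deletion
                cur[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_tok != hyp_tok)  # substitution or match
            ))
        prev = cur
    return prev[-1] / max(len(ref_tokens), 1)


reference = "I paid twenty-one dollars."
hypothesis = "Um, I paid 21 dollars"

raw = wer(reference.split(), hypothesis.split())
normalized = wer(normalize(reference), normalize(hypothesis))
print(raw, normalized)  # normalized is 0.0 under these conventions
```

The same hypothesis scores as an error-heavy transcript or a perfect one depending only on the convention, which is why the rubric has to be fixed before metrics are compared across models or label batches.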

TTS Labels

Separate intelligibility, naturalness, pronunciation, speaker consistency, latency, clipping, prosody, and safety preference.

Hidden answer: strong rubric

Do not reduce TTS to a single preference score. A voice can sound pleasant but fail on names, numbers, long-form stability, time to first audio when streaming, or emotionally sensitive content.
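As a rough sketch of why per-dimension reporting matters, the helper below keeps each rubric dimension separate and flags any that fall under a floor, even when the overall average looks healthy. The function name `rubric_report`, the 1 to 5 scale, and the floor value are assumptions.

```python
from statistics import mean


def rubric_report(ratings, floor):
    """Summarize ratings per dimension and flag dimensions below `floor`.

    `ratings` is a list of dicts mapping dimension name -> score, all on
    the same scale (assumed here to be 1 to 5).
    """
    dimensions = sorted({dim for rating in ratings for dim in rating})
    per_dimension = {
        dim: mean(rating[dim] for rating in ratings if dim in rating)
        for dim in dimensions
    }
    failing = [dim for dim in dimensions if per_dimension[dim] < floor]
    return {
        "per_dimension": per_dimension,
        "overall": mean(per_dimension.values()),
        "failing": failing,
    }


report = rubric_report(
    [
        {"naturalness": 5, "pronunciation": 2},
        {"naturalness": 5, "pronunciation": 3},
    ],
    floor=3.5,
)
print(report["overall"], report["failing"])
```

Here the overall score clears the floor while pronunciation does not, which a single blended score would hide.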

Speech RAG Labels

Judge ASR hypothesis quality, retrieval recall, groundedness, citation correctness, answer usefulness, and spoken delivery separately.

Hidden answer: why separate stages

End-to-end scores are useful, but stage labels show where to fix the system. A bad spoken answer may come from ASR entity substitution, retrieval miss, prompt behavior, stale documents, or TTS delivery.
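A small sketch of stage attribution, assuming each example carries a pass/fail label per stage and that the pipeline order below is a reasonable approximation. The stage names and the "first failing stage" rule are illustrative choices, not a fixed standard.

```python
from collections import Counter

# Assumed pipeline order; earlier stages are blamed first because their
# errors propagate downstream.
STAGE_ORDER = (
    "asr",
    "retrieval",
    "groundedness",
    "citation",
    "usefulness",
    "delivery",
)


def first_failing_stage(labels):
    """Return the earliest stage labeled as failed, or None if all passed.

    Raises KeyError if a stage label is missing, so incomplete labels are
    surfaced instead of being counted as passes.
    """
    for stage in STAGE_ORDER:
        if not labels[stage]:
            return stage
    return None


def failure_breakdown(examples):
    """Count examples by their first failing stage."""
    stages = (first_failing_stage(labels) for labels in examples)
    return Counter(stage for stage in stages if stage is not None)


all_pass = {stage: True for stage in STAGE_ORDER}
examples = [
    all_pass,
    dict(all_pass, asr=False, retrieval=False),  # retrieval miss caused by ASR
    dict(all_pass, retrieval=False),
    dict(all_pass, delivery=False),
]
print(failure_breakdown(examples))
```

Attributing to the first failing stage is a simplification. It avoids double-counting downstream failures but can hide independent problems in later stages.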

Evaluation Design

Build Gates That Catch Real Regressions

Gate 1: Streaming ASR Release

Define the minimum offline and online signals required before a new streaming decoder reaches 25 percent of users.

Hidden answer: release gate

Require aggregate and slice WER, entity error, partial churn, first partial latency, finalization delay, correction rate, timeout rate, retry rate, cost per audio minute, privacy review, rollback flag, and owner sign-off. Set explicit budgets for accented speech, noisy mobile audio, rare names, code-switching, and long dictation.
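The gate can be expressed as a function that returns blocking reasons rather than a single boolean, so reviewers see what failed. This is a sketch under assumptions: the function name `gate_blockers`, the metric keys, and every threshold are made up for illustration, and all metrics are treated as lower-is-better.

```python
def gate_blockers(metrics, max_budgets, checks):
    """Return human-readable reasons that block a release.

    metrics:     metric name -> observed value (lower is better)
    max_budgets: metric name -> maximum allowed value
    checks:      check name  -> True if completed (reviews, flags, sign-off)
    A metric with a budget but no observed value blocks the release.
    """
    blockers = []
    for name, limit in sorted(max_budgets.items()):
        if name not in metrics:
            blockers.append(f"missing metric: {name}")
        elif metrics[name] > limit:
            blockers.append(f"over budget: {name} = {metrics[name]} > {limit}")
    for name, passed in sorted(checks.items()):
        if not passed:
            blockers.append(f"check not passed: {name}")
    return blockers


# Hypothetical numbers for illustration only.
blockers = gate_blockers(
    metrics={"wer/overall": 0.08, "wer/noisy_mobile": 0.19},
    max_budgets={
        "wer/overall": 0.10,
        "wer/noisy_mobile": 0.15,
        "first_partial_latency_ms_p95": 400,
    },
    checks={"privacy_review": True, "rollback_flag": False},
)
for reason in blockers:
    print(reason)
```

Treating a missing metric as a blocker keeps a gate from passing simply because a slice was never measured.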

Gate 2: TTS Voice Upgrade

The new voice wins preference tests but increases time to first audio byte by 300 ms. What should the release decision include?

Hidden answer: decision framework

Compare preference lift against latency, interruption rate, completion rate, device buffer behavior, warm-pool cost, and fallback quality. Consider routing short conversational answers to the faster path while using the preferred voice for longer or less latency-sensitive responses.
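The routing idea can be sketched as a simple policy function. Everything here is an assumption for illustration: the voice labels, the word-count threshold, and the notion that the caller already knows whether a turn is latency-sensitive. A real policy would be tuned against interruption and completion metrics.

```python
def choose_voice(text, latency_sensitive, short_answer_words=15):
    """Pick a synthesis path for one response.

    Short, latency-sensitive turns use the faster voice; everything else
    uses the voice that won preference tests. The threshold is a
    placeholder, not a recommended value.
    """
    word_count = len(text.split())
    if latency_sensitive and word_count <= short_answer_words:
        return "fast_voice"
    return "preferred_voice"


print(choose_voice("Sure, your timer is set.", latency_sensitive=True))
print(choose_voice("Here is the full summary of today's meeting notes " * 3,
                   latency_sensitive=True))
```

Mixing two voices in one conversation has its own cost in speaker consistency, so a routing policy like this should be evaluated as a product change, not only as a latency fix.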

Gate 3: Speech RAG Knowledge Refresh

A new document index improves text-query retrieval but hurts spoken query answers. What should the eval reveal?

Hidden answer: eval slices

Pair clean text queries with ASR hypotheses and noisy hypotheses. Measure retrieval recall, entity miss rate, answer groundedness, refusal correctness, citation quality, latency, and spoken response usefulness. Spoken-query regressions often come from ASR substitutions interacting with chunking or filters.
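A minimal sketch of the paired evaluation, assuming each row records the query condition, the ranked document ids returned by the retriever, and the ids judged relevant. The row layout, condition names, and document ids are hypothetical.

```python
from collections import defaultdict


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant document ids found in the top k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("each query needs at least one relevant document")
    hits = set(retrieved[:k]) & relevant_set
    return len(hits) / len(relevant_set)


def recall_by_condition(rows, k):
    """Average recall@k per query condition (text, ASR hypothesis, ...)."""
    grouped = defaultdict(list)
    for row in rows:
        score = recall_at_k(row["retrieved"], row["relevant"], k)
        grouped[row["condition"]].append(score)
    return {cond: sum(scores) / len(scores) for cond, scores in grouped.items()}


# The same intent issued three ways; the ASR variants lose the entity and
# the relevant document slips out of the top result.
rows = [
    {"condition": "text", "retrieved": ["d1", "d2"], "relevant": ["d1"]},
    {"condition": "asr_hypothesis", "retrieved": ["d7", "d1"], "relevant": ["d1"]},
    {"condition": "noisy_hypothesis", "retrieved": ["d7", "d9"], "relevant": ["d1"]},
]
print(recall_by_condition(rows, k=1))
```

Comparing conditions on the same underlying queries isolates the effect of transcription errors from changes in the index itself.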

Interview Drills

Practice Production Judgment

Prompt 1: Label Drift After A Product Redesign

A redesigned dictation UI changes how users correct text. Correction rate drops, but complaints rise. How do you debug the metric?

Hidden answer: strong response

Treat correction rate as product-coupled, not pure model quality. Compare edit affordances, abandoned sessions, manual retyping, thumbs-down feedback, support tickets, final transcript error audits, and cohort-level usage. Add a stable offline eval and a UI-independent human review sample before trusting the metric.

Prompt 2: Vendor Model Beats Internal Model Offline

A vendor ASR model has lower WER on your benchmark. What questions must be answered before replacing the internal model?

Hidden answer: review checklist

Ask about data rights, retention, latency, streaming partials, domain vocabulary, custom biasing, outages, observability, rollback, cost at peak, privacy review, security boundaries, support SLAs, and slice regressions. Offline WER is only one input to a production replacement decision.

Coding Lab

Compute Slice Regressions

Many advanced practice questions become easier when you can write the small analysis utilities cleanly.

Lab: Flag Slices That Exceed A Regression Budget

Given old and new metric values by slice, return the slices where the new value regresses more than the allowed amount.

Hidden answer: invariant and Python solution

Invariant: each output row names a slice present in both metric maps, and new - old is greater than the regression budget. This assumes a lower-is-better metric such as WER, so a positive delta is a regression. Missing slices should be reported by a schema check, not silently treated as zero.

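def slices_over_budget(old_metrics, new_metrics, budget):
    """Return slices whose metric regressed by more than `budget`.

    Assumes a lower-is-better metric such as WER, so a positive delta
    (new - old) is a regression.
    """
    flagged = []
    for name, old_value in old_metrics.items():
        # Skip slices absent from the new run. A separate schema check
        # should report them; they are never treated as zero here.
        if name not in new_metrics:
            continue
        new_value = new_metrics[name]
        delta = new_value - old_value
        if delta > budget:
            flagged.append({
                "slice": name,
                "old": old_value,
                "new": new_value,
                "delta": delta,
            })
    return flagged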
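The invariant calls for a schema check that reports missing slices. One possible companion to `slices_over_budget` is sketched below. The function name `slice_schema_errors` and its return shape are assumptions, and the slice names are made up.

```python
def slice_schema_errors(old_metrics, new_metrics):
    """Report slices present in only one of the two metric maps.

    Intended to run before the regression comparison so that a slice
    dropped from the new eval run is surfaced instead of skipped.
    """
    old_names = set(old_metrics)
    new_names = set(new_metrics)
    return {
        "missing_in_new": sorted(old_names - new_names),
        "missing_in_old": sorted(new_names - old_names),
    }


errors = slice_schema_errors(
    old_metrics={"overall": 0.080, "noisy_mobile": 0.140, "rare_names": 0.210},
    new_metrics={"overall": 0.078, "noisy_mobile": 0.160, "code_switching": 0.190},
)
print(errors)
if errors["missing_in_new"] or errors["missing_in_old"]:
    print("schema check failed; fix the eval run before comparing slices")
```

Failing the comparison outright when either list is non-empty is the stricter choice. A team could instead allow new slices and block only on slices that disappeared.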