Speech Data And Evaluation Operations

Data Flywheel

Turn Production Signals Into Safer Model Improvements

ML engineers are expected to connect model training with product telemetry, human review, release gates, and rollback. The data loop must improve quality without turning private audio into an unmanaged asset.

Observe: collect aggregate latency, confidence, slice, correction, and failure signals.
Sample: select privacy-approved examples or synthetic fixtures that represent failure modes.
Label: define rubrics, annotator checks, disagreement rules, and audit trails.
Evaluate: run stable offline suites plus shadow or canary metrics before rollout.
Ship: use CI gates, model registry metadata, rollout flags, and rollback drills.
Monitor: track drift, cost, latency, quality, and user harm signals after launch.

Question: What separates a strong answer from a junior answer here?

A junior answer often says "collect more data." A strong answer says which data is allowed, which slice is missing, how labels are audited, how offline evals map to online metrics, what regression budget is acceptable, and how the team rolls back if the improvement fails in production.

Dataset Contracts

Every Dataset Needs A Release Contract

Required Metadata

Purpose, owner, consent scope, source, retention policy, deletion obligations, allowed uses, approved coarse-grained language or pronunciation-variation metadata, domain, acoustic conditions, device class, and known gaps.

Hidden answer: why metadata changes model risk

Metadata tells you whether an eval result can be trusted for the launch target. A dataset dominated by clean English headset audio does not validate noisy mobile dictation, multilingual support, rare names, far-field assistants, or low-bandwidth calls.
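
One way to make the required metadata enforceable is to model it as a typed record and refuse to release a dataset with blank fields. The sketch below is illustrative only: the class name `DatasetContract`, the field types, and the helpers `missing_fields` and `use_is_allowed` are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DatasetContract:
    # Field names mirror the required metadata list; the types are assumptions.
    purpose: str
    owner: str
    consent_scope: str
    source: str
    retention_policy: str
    deletion_obligations: str
    allowed_uses: tuple
    language_metadata: str  # coarse, approved language or pronunciation-variation info
    domain: str
    acoustic_conditions: str
    device_class: str
    known_gaps: tuple


def missing_fields(contract):
    """Return the names of fields left empty (blank string or empty tuple)."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


def use_is_allowed(contract, use):
    """Check a proposed use (for example "training") against the contract."""
    return use in contract.allowed_uses
```

Treating an empty `known_gaps` as missing is a deliberate choice in this sketch: it forces the owner to state gaps explicitly, even if the entry is only a note that none have been identified yet.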

Versioned Artifacts

Keep immutable manifests, checksums, schema versions, split definitions, label rubric versions, derived artifact lineage, and model-to-data lineage.

Hidden answer: release-review invariant

A model release must be reproducible enough to answer: which data trained it, which data evaluated it, which transcripts, embeddings, or cached features were derived from it, which labels changed, which slices regressed, and which users could be affected. If the team cannot answer those questions, deletion requests, rollback, and debugging become guesswork.
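
As a minimal sketch of an immutable manifest, the example below hashes each file with SHA-256 and derives a manifest ID from a canonical JSON encoding, so any change to content, schema version, or rubric version yields a new ID. The function names and the manifest layout are assumptions made for illustration.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(files, schema_version, rubric_version):
    """Build a manifest from a mapping of relative path -> file bytes."""
    entries = [{"path": p, "sha256": sha256_hex(files[p])} for p in sorted(files)]
    body = {
        "schema_version": schema_version,
        "rubric_version": rubric_version,
        "entries": entries,
    }
    # Canonical encoding: sorted keys and fixed separators keep the ID stable.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {**body, "manifest_id": sha256_hex(canonical)}


def changed_paths(old_manifest, new_manifest):
    """List paths that were added, removed, or whose checksum changed."""
    old_sums = {e["path"]: e["sha256"] for e in old_manifest["entries"]}
    new_sums = {e["path"]: e["sha256"] for e in new_manifest["entries"]}
    all_paths = set(old_sums) | set(new_sums)
    return sorted(p for p in all_paths if old_sums.get(p) != new_sums.get(p))
```

Recording the manifest ID alongside each trained model is one way to answer "which data trained it" during release review. A real system would likely stream file contents rather than hold them in memory.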

Leakage Controls

Keep train, validation, regression, canary, and human-audit sets separated by speaker, session, document, and collection window when those fields can leak the answer.

Hidden answer: why leakage breaks gates

Speech systems can memorize recurring speakers, prompts, document passages, room acoustics, or vendor transcripts. If the same source appears in training and release evals, the gate may reward memorization instead of measuring quality on future traffic. Treat any promoted production sample as lineage-tracked data with explicit split eligibility.
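
A leakage check can be as simple as looking for any speaker or session ID that appears in more than one split. The sketch below assumes each record is a dict with a `split` field plus the grouping fields; the record layout and the function name `find_leaks` are hypothetical.

```python
from collections import defaultdict


def find_leaks(records, keys=("speaker", "session")):
    """Return {(key, value): [splits]} for values seen in more than one split."""
    seen = defaultdict(set)
    for rec in records:
        for key in keys:
            value = rec.get(key)
            if value is not None:
                seen[(key, value)].add(rec["split"])
    return {kv: sorted(splits) for kv, splits in seen.items() if len(splits) > 1}
```

Running a check like this in CI before a split definition is versioned keeps the gate from being validated on sources the model has already seen. Document and collection-window fields can be added to `keys` in the same way when they can leak the answer.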

Label Quality

Design The Labeling System Before Scaling It

ASR Labels

Normalize punctuation, casing, numerals, hesitations, diarization, timestamp rules, and entity spelling before measuring WER.

Hidden answer: common failure

Teams often mix label conventions and then blame the model for noisy metrics. Decide whether "twenty one" and "21" are equivalent, whether filler words count, and how domain terms are canonicalized.
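
To see how much label conventions move the metric, it helps to normalize both reference and hypothesis with the same rules before computing WER. The sketch below is deliberately tiny: the filler list and the single-entry number map are illustrative assumptions, and a production normalizer would need a real numeral and entity canonicalization step.

```python
import re

FILLERS = {"um", "uh"}               # assumed filler list
NUMBER_WORDS = {"twenty one": "21"}  # illustrative one-entry map


def normalize(text, drop_fillers=True):
    """Lowercase, strip punctuation, canonicalize numerals, drop fillers."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    for spoken, digits in NUMBER_WORDS.items():
        text = re.sub(rf"\b{re.escape(spoken)}\b", digits, text)
    tokens = text.split()
    if drop_fillers:
        tokens = [t for t in tokens if t not in FILLERS]
    return tokens


def wer(ref_tokens, hyp_tokens):
    """Word error rate: word-level edit distance divided by reference length."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_word in enumerate(ref_tokens, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp_tokens, 1):
            cur.append(min(
                prev[j] + 1,                            # deletion
                cur[j - 1] + 1,                         # insertion
                prev[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref_tokens)
```

With these rules, "I have twenty one apples." and "um i have 21 apples" score a WER of zero after normalization, while the raw whitespace-split strings do not. Whichever convention the team picks, the rule set should be versioned with the label rubric.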

TTS Labels

Separate intelligibility, naturalness, pronunciation, speaker consistency, latency, clipping, prosody, and safety preference.

Hidden answer: strong rubric

Do not reduce TTS to a single preference score. A voice can sound pleasant but fail on names, numbers, long-form stability, time to first audio when streaming, or emotionally sensitive content.

Speech RAG Labels

Judge ASR hypothesis quality, retrieval recall, groundedness, citation correctness, answer usefulness, and spoken delivery separately.

Hidden answer: why separate stages

End-to-end scores are useful, but stage labels show where to fix the system. A bad spoken answer may come from ASR entity substitution, retrieval miss, prompt behavior, stale documents, or TTS delivery.
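
Stage labels become actionable when each failed example is attributed to the earliest stage that failed. The sketch below assumes one boolean pass/fail label per stage; the stage names and their ordering are assumptions that follow the list of labels above.

```python
from collections import Counter

# Assumed pipeline order, earliest stage first.
STAGES = ("asr", "retrieval", "groundedness", "citation", "usefulness", "delivery")


def first_failing_stage(labels):
    """Return the earliest stage labeled as failed, or None if all passed."""
    for stage in STAGES:
        if stage not in labels:
            raise KeyError(f"missing label for stage: {stage}")
        if not labels[stage]:
            return stage
    return None


def failure_histogram(rows):
    """Count examples by their earliest failing stage."""
    counts = Counter(first_failing_stage(row) for row in rows)
    counts.pop(None, None)  # drop fully passing examples
    return dict(counts)
```

Attributing to the earliest failure is a simplification: a downstream stage can also fail independently, so the per-stage labels are worth keeping alongside the histogram.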

Reviewer Access

Minimize exposure to raw audio and sensitive transcripts with role-scoped queues, redaction where possible, synthetic calibration examples, audit logs, and short retention windows.

Hidden answer: safe labeling invariant

Labeling quality should not require broad access to private speech. Keep reviewer permissions tied to the task, separate identity fields from annotation views when possible, log exports, and use synthetic or de-identified examples for training and calibration before any production sample reaches a human queue.

Evaluation Design

Build Gates That Catch Real Regressions

Slice gates should pair point estimates with enough support to make the decision meaningful. A five-utterance slice can reveal a risk, but it should usually trigger more review or targeted sampling rather than a confident ship/no-ship call by itself. Keep holdout suites stable enough for trend comparisons, and route newly mined production failures through a separate triage set before they become training or regression data.
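
To illustrate why support matters, the sketch below computes a Wilson score interval for an utterance-level error rate, which is a simple proportion. This is an assumption made for the example: WER itself is not a binomial proportion, and a bootstrap over utterances would be a more faithful choice for it.

```python
from math import sqrt


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95 percent)."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Same observed error rate (20 percent), very different certainty.
small_slice = wilson_interval(1, 5)
large_slice = wilson_interval(100, 500)
```

The five-utterance interval spans more than half of the possible range, while the 500-utterance interval is narrow enough to compare against a regression budget. That gap is the reason a tiny slice should trigger targeted sampling rather than a ship decision.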

Gate 1: Streaming ASR Release

Define the minimum offline and online signals required before a new streaming decoder reaches 25 percent of users.

Hidden answer: release gate

Require aggregate and slice WER, entity error, partial churn, first partial latency, finalization delay, correction rate, timeout rate, retry rate, cost per audio minute, privacy review, rollback flag, and owner sign-off. Set explicit regression budgets for language and pronunciation-variation slices (only those approved for aggregate reporting or covered by consent), noisy mobile audio, rare names, code-switching, and long dictation. Bucket sparse sensitive slices into larger groups before reporting.
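
A release gate of this kind can be expressed as data plus a small checker that returns a list of blockers. The sketch below assumes every metric is reported as a lower-is-better delta against the current production model; the metric names, budget values, and sign-off names are placeholders.

```python
def gate_blockers(metric_deltas, budgets, signoffs, required_signoffs):
    """Return human-readable blockers; an empty list means the gate passes."""
    blockers = []
    for name in sorted(budgets):
        if name not in metric_deltas:
            # A metric with a budget but no measurement blocks the release.
            blockers.append(f"missing metric: {name}")
        elif metric_deltas[name] > budgets[name]:
            blockers.append(f"over budget: {name}")
    for name in sorted(required_signoffs):
        if not signoffs.get(name, False):
            blockers.append(f"missing sign-off: {name}")
    return blockers
```

Treating a missing metric as a blocker, rather than as a pass, keeps a broken dashboard or skipped eval job from quietly opening the gate.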

Gate 2: TTS Voice Upgrade

The new voice wins preference tests but increases time to first audio byte by 300 ms. What should the release decision include?

Hidden answer: decision framework

Compare preference lift against latency, interruption rate, completion rate, device buffer behavior, warm-pool cost, and fallback quality. Consider routing short conversational answers to the faster path while using the preferred voice for longer or less latency-sensitive responses.
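
The routing idea can be sketched as a small policy function. Everything here is hypothetical: the voice identifiers, the word-count threshold, and the `latency_sensitive` flag would come from product data and experiments, not from this example.

```python
def choose_voice(text, latency_sensitive, short_word_limit=12):
    """Pick a synthesis path for one response (illustrative policy only)."""
    word_count = len(text.split())
    if latency_sensitive and word_count <= short_word_limit:
        return "fast_voice"       # lower time to first audio byte
    return "preferred_voice"      # higher preference score, slower start
```

A policy like this should be evaluated as its own variant, because switching voices between turns can affect perceived speaker consistency.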

Gate 3: Speech RAG Knowledge Refresh

A new document index improves text-query retrieval but hurts spoken query answers. What should the eval reveal?

Hidden answer: eval slices

Pair clean text queries with ASR hypotheses and noisy hypotheses. Measure retrieval recall, entity miss rate, answer groundedness, refusal correctness, citation quality, latency, and spoken response usefulness. Spoken-query regressions often come from ASR substitutions interacting with chunking or filters.
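
A paired eval can be sketched by scoring the same relevance judgments against each query variant. In the example below, the record layout, the variant names, and the `retrieve` callable are assumptions; `retrieve` stands in for whatever search function the system exposes and is expected to return a ranked list of document IDs.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each example needs at least one relevant document")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def recall_by_variant(examples, retrieve, k=5):
    """Average recall@k per query variant (for example clean text vs ASR)."""
    scores = {}
    for example in examples:
        for variant, query in example["variants"].items():
            score = recall_at_k(retrieve(query), example["relevant"], k)
            scores.setdefault(variant, []).append(score)
    return {variant: sum(vals) / len(vals) for variant, vals in scores.items()}
```

A large gap between the clean-text and ASR-hypothesis variants points at entity substitutions or tokenization interacting with the index, which a text-only eval would not reveal.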

Interview Drills

Practice Production Judgment

Prompt 1: Label Drift After A Product Redesign

A redesigned dictation UI changes how users correct text. Correction rate drops, but complaints rise. How do you debug the metric?

Hidden answer: strong response

Treat correction rate as product-coupled, not pure model quality. Compare edit affordances, abandoned sessions, manual retyping, thumbs-down feedback, support tickets, final transcript error audits, and cohort-level usage. Add a stable offline eval and a UI-independent human review sample before trusting the metric.

Prompt 2: Vendor Model Beats Internal Model Offline

A vendor ASR model has lower WER on your benchmark. What questions must be answered before replacing the internal model?

Hidden answer: review checklist

Ask about data rights, retention, latency, streaming partials, domain vocabulary, custom biasing, outages, observability, rollback, cost at peak, privacy review, security boundaries, support SLAs, and slice regressions. Offline WER is only one input to a production replacement decision.

Coding Lab

Compute Slice Regressions

Many advanced practice questions become easier when you can write the small analysis utilities cleanly.

Lab: Flag Slices That Exceed A Regression Budget

Given old and new metric values by slice, return the slices where a lower-is-better metric regresses more than the allowed amount, and report slice mismatches or underpowered slices before making a release decision.

Hidden answer: invariant and Python solution

Invariant: each flagged row has a slice present in both metric maps, both metric values are finite, the slice has enough examples for a release gate, new - old is greater than the non-negative regression budget, and missing, newly added, or underpowered slices are reported separately instead of being silently treated as zero.

from math import isfinite
from numbers import Integral, Real


def _finite_number(value, name):
    if isinstance(value, bool) or not isinstance(value, Real) or not isfinite(value):
        raise ValueError(f"{name} must be a finite number")
    return float(value)


def _count(value, name):
    if isinstance(value, bool) or not isinstance(value, Integral) or value < 0:
        raise ValueError(f"{name} must be a non-negative integer")
    return int(value)


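def slices_over_budget(old_metrics, new_metrics, budget, slice_counts=None, min_count=1):
    """Flag slices whose lower-is-better metric regressed by more than budget.

    Returns a dict with the flagged regressions plus separately reported
    missing, newly added, and underpowered slices. The support check is
    skipped when slice_counts is None; when slice_counts is given, a slice
    with no entry counts as zero examples and is reported as underpowered.
    """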
    budget = _finite_number(budget, "budget")
    if budget < 0:
        raise ValueError("budget must be non-negative for lower-is-better metrics")
    min_count = _count(min_count, "min_count")
    if min_count == 0:
        raise ValueError("min_count must be positive for a release gate")

    missing_in_new = sorted(set(old_metrics) - set(new_metrics))
    new_only = sorted(set(new_metrics) - set(old_metrics))
    flagged = []
    underpowered = []
    for name in sorted(set(old_metrics) & set(new_metrics)):
        old_value = _finite_number(old_metrics[name], f"{name} old metric")
        new_value = _finite_number(new_metrics[name], f"{name} new metric")
        if slice_counts is not None:
            support = _count(slice_counts.get(name, 0), f"{name} count")
            if support < min_count:
                underpowered.append({"slice": name, "count": support})
                continue
        delta = new_value - old_value
        if delta > budget:
            flagged.append({
                "slice": name,
                "old": old_value,
                "new": new_value,
                "delta": delta,
            })
    return {
        "flagged": flagged,
        "missing_in_new": missing_in_new,
        "new_only": new_only,
        "underpowered": underpowered,
    }