Speech Feature Pipelines and Data Contracts

Mental Model

A Feature Pipeline Is a Release Dependency

Speech models often fail in production because the data path changes: VAD thresholds shift, resampling differs, transcript normalization changes, metadata drops a field, or replay jobs mix with live data. Experienced engineers write down the assumptions behind each of those steps as explicit contracts before training, evaluation, serving, and incident response depend on them.

Raw boundary: decide what audio, transcript, consent, locale, device, and quality metadata may enter the system.
Transform boundary: version resampling, VAD, chunking, normalization, augmentation, text cleanup, and feature extraction code.
Label boundary: store label schema, annotator instructions, disagreement rules, quality audits, and adjudication outcomes.
Serving boundary: prove online preprocessing matches offline evaluation for sample rate, channel handling, timestamps, and text normalization.
Audit boundary: keep lineage from model version to data snapshot, transform version, eval suite, rollout gate, and rollback target.
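
The audit boundary can be made concrete with a small immutable record. The sketch below is only an illustration: the `ReleaseLineage` class, its field names, and the example values are hypothetical and not taken from any existing system.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReleaseLineage:
    """Hypothetical record tying a model release to everything it depends on."""

    model_version: str
    data_snapshot: str
    transform_version: str
    eval_suite: str
    rollout_gate: str
    rollback_target: str

    def missing_fields(self):
        # A blank string counts as missing so an empty value cannot pass review.
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


lineage = ReleaseLineage(
    model_version="asr-example-1",
    data_snapshot="snapshot-example",
    transform_version="transform-example",
    eval_suite="eval-example",
    rollout_gate="canary-example",
    rollback_target="",
)
print(lineage.missing_fields())  # ['rollback_target']
```

A release check could refuse to promote any model whose record reports missing fields, which keeps the rollback target from being an afterthought.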

Question: Why is "we only changed preprocessing" a serious release risk?

Preprocessing defines the distribution the model sees. A small change in VAD, sample-rate conversion, silence trimming, casing, punctuation, timestamp alignment, or profanity normalization can invalidate offline metrics and create slice regressions. Treat preprocessing like model code: version it, test it, canary it, and make rollback possible.
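
One lightweight way to version preprocessing is to fingerprint its configuration and store the hash next to every metric. This is a minimal sketch, assuming the config is a JSON-serializable dict; the function name and the config keys (`sample_rate_hz`, `vad_threshold`, `lowercase`) are hypothetical.

```python
import hashlib
import json


def preprocessing_fingerprint(config):
    """Return a stable SHA-256 hex digest of a JSON-serializable config."""
    # sort_keys makes the hash independent of dict insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


baseline = {"sample_rate_hz": 16000, "vad_threshold": 0.5, "lowercase": True}
reordered = {"lowercase": True, "vad_threshold": 0.5, "sample_rate_hz": 16000}
candidate = {"sample_rate_hz": 16000, "vad_threshold": 0.6, "lowercase": True}

# Same settings in a different order produce the same fingerprint.
assert preprocessing_fingerprint(baseline) == preprocessing_fingerprint(reordered)
# A changed VAD threshold produces a different fingerprint.
assert preprocessing_fingerprint(baseline) != preprocessing_fingerprint(candidate)
```

Comparing fingerprints between the offline eval run and the serving build turns "we only changed preprocessing" into a visible diff rather than a surprise. The hash covers configuration only, so code changes still need their own version tag.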

System Design

Design Prompts With Hidden Advanced Answers

Prompt 1: Build A Training Dataset Pipeline For Streaming ASR

You need a weekly ASR training snapshot from consented call audio. The model must improve noisy-device and approved aggregate pronunciation-variation slices without storing unnecessary private content. Design the pipeline.

Hidden answer: advanced design outline

Start with consent and retention policy, then define aggregate-only discovery metrics before selecting clips. Store durable IDs, locale, device class, channel, noise bucket, duration, label status, and transform versions rather than ad hoc raw exports. Run privacy filters and redaction before labeling. Stratify sampling by approved aggregate or consented failure slices, keep holdout users out of training, version label instructions, and produce a snapshot manifest that ties every row to audio lineage, transcript version, VAD version, and evaluation eligibility.
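
Keeping holdout users out of training is easier when the split is a pure function of a durable user ID. The following is a sketch under assumptions: the salt, the 10 percent default, and the `user_id` row key are hypothetical choices, not part of the design above.

```python
import hashlib


def is_holdout_user(user_id, salt="holdout-example", holdout_percent=10):
    """Deterministically assign a user to holdout by hashing a durable ID."""
    if not 0 <= holdout_percent <= 100:
        raise ValueError("holdout_percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < holdout_percent


def split_rows(rows):
    """Split rows so every clip from one user lands on the same side."""
    train, holdout = [], []
    for row in rows:
        target = holdout if is_holdout_user(row["user_id"]) else train
        target.append(row)
    return train, holdout


rows = [{"user_id": f"user-{i}", "clip": j} for i in range(50) for j in range(2)]
train, holdout = split_rows(rows)
# No user appears on both sides of the split.
assert not {r["user_id"] for r in train} & {r["user_id"] for r in holdout}
```

Because the assignment depends only on the ID and the salt, weekly snapshots agree on who is held out without sharing a lookup table.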

Prompt 2: Match Offline And Online Features

Offline WER improves, but online canary WER regresses for long utterances. How do you debug the data contract?

Hidden answer: feature parity investigation

Compare online and offline sample-rate conversion, chunk overlap, VAD hangover, timestamp stitching, max-duration truncation, normalization, punctuation handling, and language routing. Replay canary traffic through both paths with synthetic or consented fixtures and diff intermediate features. Check whether the offline eval used full utterances while online serving streamed partial chunks. Freeze promotion until parity tests and slice metrics pass.
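
Diffing intermediate features can start with a helper that reports the first point where the two paths disagree. This is a minimal sketch, assuming features arrive as lists of frames with each frame a list of floats; the function name, the result keys, and the default tolerance are hypothetical.

```python
def first_feature_mismatch(offline, online, tolerance=1e-4):
    """Return a dict describing the first mismatch, or None if the paths agree."""
    if tolerance < 0:
        raise ValueError("tolerance must be non-negative")
    if len(offline) != len(online):
        return {"kind": "frame_count", "offline": len(offline), "online": len(online)}
    for frame, (off_frame, on_frame) in enumerate(zip(offline, online)):
        if len(off_frame) != len(on_frame):
            return {"kind": "dim_count", "frame": frame,
                    "offline": len(off_frame), "online": len(on_frame)}
        for dim, (a, b) in enumerate(zip(off_frame, on_frame)):
            # "not <=" also flags NaN, which would slip past a plain ">" check.
            if not abs(a - b) <= tolerance:
                return {"kind": "value", "frame": frame, "dim": dim,
                        "offline": a, "online": b}
    return None


offline = [[0.1, 0.2], [0.3, 0.4]]
online = [[0.1, 0.2], [0.3, 0.5]]
print(first_feature_mismatch(offline, online))
# {'kind': 'value', 'frame': 1, 'dim': 1, 'offline': 0.4, 'online': 0.5}
```

A frame-count mismatch usually points at chunking, truncation, or VAD differences, while a value mismatch at matching shapes points at resampling or normalization.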

Prompt 3: Add A Feature Store For Spoken RAG

A voice assistant needs searchable memories, transcript summaries, safety labels, and audio-quality signals. What should be online, offline, or excluded?

Hidden answer: storage and privacy tradeoffs

Keep online features minimal: retrieval embeddings, freshness, permissions, source type, summary version, safety flags, and TTL. Keep offline features for evaluation, active learning, and drift analysis: aggregate quality buckets, turn outcomes, device class, latency, and consent status. Exclude raw private audio unless there is explicit consent and retention purpose. Every feature should have an owner, schema, TTL, backfill plan, deletion path, and evaluation showing that it improves task success or safety.
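
The per-feature requirements can be enforced with a small registry check. The sketch below is illustrative: the key names in `REQUIRED_SPEC_KEYS` and the example spec are assumptions rather than an existing feature-store schema.

```python
REQUIRED_SPEC_KEYS = (
    "owner", "schema", "ttl_days", "backfill_plan", "deletion_path", "evaluation",
)


def validate_feature_spec(name, spec):
    """Return a list of problems for one feature spec; empty means it passes."""
    problems = [
        f"{name}: missing {key}"
        for key in REQUIRED_SPEC_KEYS
        if spec.get(key) is None or spec.get(key) == ""
    ]
    ttl = spec.get("ttl_days")
    # bool is a subclass of int in Python, so reject it explicitly.
    if ttl is not None and (isinstance(ttl, bool) or not isinstance(ttl, int) or ttl <= 0):
        problems.append(f"{name}: ttl_days must be a positive integer")
    return problems


spec = {
    "owner": "team-example",
    "schema": {"type": "string"},
    "ttl_days": 30,
    "backfill_plan": "recompute from the latest snapshot",
    "evaluation": "eval-example",
}
print(validate_feature_spec("summary_version", spec))
# ['summary_version: missing deletion_path']
```

Running a check like this in CI keeps a feature without a deletion path from reaching the online store.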

Production Debugging

Data Contract Incident Drills

Drill 1: Missing Locale Metadata

After a mobile app release, ASR fallback routing increases and WER spikes for bilingual users. The model bundle did not change.

Hidden answer: first-hour response

Split by app version, locale field presence, language ID output, model route, geography, and device class. If locale metadata dropped or changed encoding, route with a conservative language ID fallback and roll back the app or serving parser if possible. Protect dashboards from silent null buckets, add schema validation at ingestion, and add a canary gate that fails when critical metadata coverage regresses.
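
The split by app version and locale presence might look like the sketch below. It assumes ingestion events are dicts with `app_version` and `locale` keys; those names and the 0.98 default threshold are hypothetical.

```python
from collections import defaultdict


def locale_coverage_by_version(events):
    """Return {app_version: fraction of events with a non-empty locale}."""
    totals = defaultdict(int)
    present = defaultdict(int)
    for event in events:
        version = event.get("app_version", "unknown")
        totals[version] += 1
        locale = event.get("locale")
        # Nulls and blank strings both count as missing, never as their own bucket.
        if isinstance(locale, str) and locale.strip():
            present[version] += 1
    return {version: present[version] / totals[version] for version in totals}


def failing_versions(events, min_coverage=0.98):
    """Return app versions whose locale coverage is below the gate."""
    coverage = locale_coverage_by_version(events)
    return sorted(v for v, rate in coverage.items() if rate < min_coverage)


events = [
    {"app_version": "1.0", "locale": "en-US"},
    {"app_version": "1.0", "locale": "es-MX"},
    {"app_version": "1.1", "locale": None},
    {"app_version": "1.1", "locale": "en-US"},
]
print(failing_versions(events))  # ['1.1']
```

The same function can back the canary gate: a new app version that appears in the failing list blocks promotion.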

Drill 2: Labeling Guideline Drift

A new label vendor starts expanding abbreviations in transcripts. Offline CER improves, but downstream command accuracy falls.

Hidden answer: label-contract fix

The transcript label contract changed. Compare examples before and after the vendor switch, inspect normalization rules, and rerun command-intent evals that depend on literal wording. Version label guidelines, require overlap audits, keep adjudicated gold sets, and either normalize both styles consistently or block the dataset from training until the downstream contract is restored.
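
An overlap audit can be as simple as an exact-match rate on clips labeled under both guidelines. This sketch assumes labels are dicts from clip ID to transcript; the function name, the default normalizer, and the example transcripts are hypothetical.

```python
def _basic_normalize(text):
    # Lowercase and collapse whitespace only, so wording changes stay visible.
    return " ".join(text.lower().split())


def overlap_agreement(old_labels, new_labels, normalize=_basic_normalize):
    """Return (agreement_rate, disagreeing_clip_ids) over shared clip IDs."""
    shared = sorted(set(old_labels) & set(new_labels))
    if not shared:
        raise ValueError("no overlapping clips to audit")
    disagreements = [
        clip_id for clip_id in shared
        if normalize(old_labels[clip_id]) != normalize(new_labels[clip_id])
    ]
    return 1 - len(disagreements) / len(shared), disagreements


old = {"c1": "Call Dr Smith", "c2": "set timer 5 min", "c3": "play music"}
new = {"c1": "call dr smith", "c2": "set timer 5 minutes", "c4": "stop"}
print(overlap_agreement(old, new))  # (0.5, ['c2'])
```

The disagreeing IDs are the examples worth sending to adjudication, and a drop in the rate after a vendor switch is a signal to block the dataset until the guidelines are reconciled.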

Coding Interview Track

Speech Pipeline Coding Drills

These drills connect production speech data contracts to concrete implementation tasks: intervals for chunk stitching, heaps for active-learning queues, and sliding windows for drift detection.

Problem: Merge Overlapping Audio Chunks

Given chunk intervals from VAD and endpointing, merge overlapping intervals, and also merge neighboring intervals when the gap between them is at most max_gap_ms. Reject a negative max_gap_ms or inverted intervals (start after end) before they can corrupt timestamp lineage.

Questions, invariants, tests, and hidden Python answer

Questions: Are intervals closed or half-open? Should touching intervals merge? Are inputs sorted? Can intervals be empty?
Invariant: after sorting, the output is disjoint and the last interval is the only candidate that can overlap the next interval.
Test cases: empty input, one interval, nested intervals, touching intervals, small allowed gaps, and unsorted input.

def merge_audio_chunks(chunks, max_gap_ms=0):
    """Merge (start_ms, end_ms) chunks that overlap or sit within max_gap_ms of each other."""
    if max_gap_ms < 0:
        raise ValueError("max_gap_ms must be non-negative")
    if not chunks:
        return []

    ordered = sorted(chunks)
    for start, end in ordered:
        if start > end:
            raise ValueError("chunk start must be <= end")

    merged = [list(ordered[0])]

    for start, end in ordered[1:]:
        last = merged[-1]
        if start <= last[1] + max_gap_ms:
            last[1] = max(last[1], end)
        else:
            merged.append([start, end])

    return [tuple(item) for item in merged]


assert merge_audio_chunks([(30, 40), (0, 10), (10, 20)]) == [(0, 20), (30, 40)]
assert merge_audio_chunks([(0, 10), (13, 20)], max_gap_ms=3) == [(0, 20)]
try:
    merge_audio_chunks([(20, 10)])
    raise AssertionError("expected validation failure")
except ValueError:
    pass

Problem: Rank Active-Learning Candidates

Select the top k consented examples for human review using aggregate scoring signals such as uncertainty, approved coarse-slice rarity, and recent regression score. The queue must fail closed on invalid limits, missing scoring signals, or rare slices that are too small to review safely.

Questions, common mistakes, and hidden Python answer

Questions: Can private content be selected? Are scores stable under ties? Should one user dominate the queue?
Common mistakes: ignoring consent, sorting the entire stream when a heap is enough, using raw transcript content in logs, and optimizing uncertainty while starving approved rare slices.
Production notes: a production queue should also require minimum slice support, cap or diversify per-user contributions, and avoid exposing raw transcripts before sending examples to reviewers.

from heapq import heappush, heappushpop
from math import isfinite


def rank_review_candidates(rows, k, min_slice_count=20):
    """Return up to k consented example IDs, highest score first, using a size-k min-heap."""
    if not isinstance(k, int) or k < 0:
        raise ValueError("k must be a non-negative integer")
    if not isinstance(min_slice_count, int) or min_slice_count <= 0:
        raise ValueError("min_slice_count must be a positive integer")
    if k == 0:
        return []
    required = ("uncertainty", "slice_rarity", "regression", "slice_count", "example_id")
    heap = []
    for row in rows:
        if not row.get("consented"):
            continue
        missing = [name for name in required if name not in row]
        if missing:
            raise ValueError(f"missing candidate fields: {missing}")
        for name in ("uncertainty", "slice_rarity", "regression"):
            value = row[name]
            if not isinstance(value, (int, float)) or not isfinite(value) or not 0 <= value <= 1:
                raise ValueError(f"{name} must be a finite score between 0 and 1")
        if not isinstance(row["slice_count"], int) or row["slice_count"] < min_slice_count:
            continue
        score = (
            0.55 * row["uncertainty"]
            + 0.30 * row["slice_rarity"]
            + 0.15 * row["regression"]
        )
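        # Keep a min-heap of at most k items so the lowest score is evicted first.
        # Ties on score fall back to example_id because tuples compare element-wise.
        item = (score, row["example_id"])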
        if len(heap) < k:
            heappush(heap, item)
        else:
            heappushpop(heap, item)

    return [example_id for score, example_id in sorted(heap, reverse=True)]


rows = [
    {"example_id": "a", "consented": True, "uncertainty": 0.9, "slice_rarity": 0.2, "regression": 0.3, "slice_count": 80},
    {"example_id": "b", "consented": False, "uncertainty": 1.0, "slice_rarity": 1.0, "regression": 1.0, "slice_count": 100},
    {"example_id": "c", "consented": True, "uncertainty": 0.5, "slice_rarity": 0.9, "regression": 0.8, "slice_count": 40},
    {"example_id": "d", "consented": True, "uncertainty": 0.8, "slice_rarity": 1.0, "regression": 1.0, "slice_count": 3},
]
assert rank_review_candidates(rows, 2) == ["c", "a"]

Problem: Detect Sustained Schema Coverage Drop

Given hourly metadata coverage rates, return the start index of the first window of w consecutive hours whose average coverage falls below a threshold, or None if no such window exists. This is a sliding-window incident detector. Fail closed on invalid windows, thresholds, or impossible coverage values so bad telemetry cannot look like a healthy stream.

Hidden answer with production follow-up

from math import isfinite


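def first_coverage_drop(values, window, threshold):
    """Return the start index of the first window whose mean coverage is below threshold, else None."""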
    if not isinstance(window, int) or window <= 0:
        raise ValueError("window must be a positive integer")
    if not isinstance(threshold, (int, float)) or not isfinite(threshold) or not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    for value in values:
        if not isinstance(value, (int, float)) or not isfinite(value) or not 0 <= value <= 1:
            raise ValueError("coverage values must be finite rates between 0 and 1")
    if len(values) < window:
        return None

    # Running sum of the current window, updated in O(1) per step in the loop below.
    total = sum(values[:window])
    if total / window < threshold:
        return 0

    for right in range(window, len(values)):
        total += values[right] - values[right - window]
        left = right - window + 1
        if total / window < threshold:
            return left

    return None


assert first_coverage_drop([0.99, 0.98, 0.70, 0.72, 0.74], 3, 0.85) == 1
assert first_coverage_drop([0.99, 0.98, 0.97], 2, 0.95) is None
try:
    first_coverage_drop([0.99, 1.2], 2, 0.95)
    raise AssertionError("expected validation failure")
except ValueError:
    pass

Production follow-up: page only when the drop affects critical fields or approved aggregate launch slices, deduplicate alerts by app version and ingestion path, and attach rollout context so on-call can identify whether the first bad hour matches a deployment.
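
The paging rules in the follow-up can be sketched as a filter plus a deduplication step. The alert dict shape and the `critical_fields` default below are hypothetical; a real system would take both from its own schema registry.

```python
def dedupe_alerts(alerts, critical_fields=frozenset({"locale", "consent_status"})):
    """Keep the earliest alert per (app_version, ingestion_path), critical fields only."""
    earliest = {}
    for alert in alerts:
        if alert["field"] not in critical_fields:
            continue  # non-critical drops go to dashboards, not to the pager
        key = (alert["app_version"], alert["ingestion_path"])
        if key not in earliest or alert["hour"] < earliest[key]["hour"]:
            earliest[key] = alert
    return sorted(
        earliest.values(),
        key=lambda a: (a["hour"], a["app_version"], a["ingestion_path"]),
    )


alerts = [
    {"hour": 5, "app_version": "2.1", "ingestion_path": "mobile", "field": "locale"},
    {"hour": 3, "app_version": "2.1", "ingestion_path": "mobile", "field": "locale"},
    {"hour": 4, "app_version": "2.1", "ingestion_path": "web", "field": "consent_status"},
    {"hour": 1, "app_version": "2.0", "ingestion_path": "mobile", "field": "noise_bucket"},
]
pages = dedupe_alerts(alerts)
print([(a["hour"], a["ingestion_path"]) for a in pages])  # [(3, 'mobile'), (4, 'web')]
```

Keeping the earliest hour per key gives on-call the first bad hour to compare against deployment times.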

Timed Exam

Advanced Practice Prompts

R&D Prompt

A self-supervised audio model improves average WER but worsens short noisy commands. What experiments and release gates do you require before launch?

Hidden answer

Require slice-specific evals, command-intent task metrics, calibration by utterance length and noise bucket, online/offline preprocessing parity, and canary rollback gates. Inspect whether representation learning improved long-form transcription while hurting endpointing or short-command acoustics.
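
A slice-level release gate could be sketched as follows. The slice names, the WER numbers, and the 0.005 regression allowance are made-up illustrations, not measured results.

```python
def slice_gate(baseline_wer, candidate_wer, max_regression=0.005):
    """Return {slice: reason} for slices that block launch; empty means pass."""
    blocked = {}
    for slice_name, base in baseline_wer.items():
        cand = candidate_wer.get(slice_name)
        if cand is None:
            # Fail closed: a slice without a candidate metric cannot pass.
            blocked[slice_name] = "missing candidate metric"
        elif cand - base > max_regression:
            blocked[slice_name] = "WER regressed beyond the allowed margin"
    return blocked


baseline = {"all": 0.100, "short_noisy_commands": 0.180, "long_form": 0.090}
candidate = {"all": 0.092, "short_noisy_commands": 0.201}
print(sorted(slice_gate(baseline, candidate)))
# ['long_form', 'short_noisy_commands']
```

In this example the average improves while the gate still blocks launch, which is the situation the prompt describes.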

System Design Prompt

Design a backfill system that recomputes audio features for 100 million clips without impacting real-time serving.

Hidden answer

Use isolated batch queues, explicit resource quotas, resumable manifests, idempotent writes, transform-versioned output paths, data quality sampling, and throttles tied to serving headroom. Never let replay jobs share priority with real-time requests, and require lineage plus rollback to the previous feature snapshot.
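
The resumable, idempotent, throttled parts of that answer fit in a short loop. This is a sketch under assumptions: `store` is a plain dict standing in for versioned output paths, `done` is a set standing in for the manifest, and `has_headroom` is a hypothetical callable that reports whether serving has spare capacity.

```python
def run_backfill(clip_ids, compute, store, done, transform_version,
                 has_headroom, batch_size=2):
    """Process pending clips in batches; return how many were written this run."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    pending = [clip_id for clip_id in clip_ids if clip_id not in done]
    written = 0
    for start in range(0, len(pending), batch_size):
        if not has_headroom():
            break  # yield to real-time serving; a later run resumes from `done`
        for clip_id in pending[start:start + batch_size]:
            # Keying by transform version keeps the previous snapshot intact
            # for rollback and makes a repeated write harmless.
            store[(transform_version, clip_id)] = compute(clip_id)
            done.add(clip_id)
            written += 1
    return written


store, done = {}, set()
clips = ["c1", "c2", "c3", "c4", "c5"]
signals = iter([True, False])  # headroom for one batch, then serving needs capacity
first = run_backfill(clips, len, store, done, "v2", lambda: next(signals))
second = run_backfill(clips, len, store, done, "v2", lambda: True)
print(first, second, len(store))  # 2 3 5
```

A production version would persist the manifest and write to durable storage, but the control flow stays the same: check headroom, write under a versioned key, record completion.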