Advanced Data And MLOps

Speech Feature Pipelines And Data Contracts

Design privacy-safe audio feature pipelines that survive model releases, labeling changes, streaming constraints, drift, replay jobs, and scrutiny in both production and interviews.

Mental Model

A Feature Pipeline Is A Release Dependency

Speech models often fail in production because the data path changes: VAD thresholds shift, resampling differs, transcript normalization changes, metadata drops a field, or replay jobs mix with live data. Experienced engineers make those assumptions explicit before training, evaluation, serving, and incident response depend on them.

  1. Raw boundary: decide what audio, transcript, consent, locale, device, and quality metadata may enter the system.
  2. Transform boundary: version resampling, VAD, chunking, normalization, augmentation, text cleanup, and feature extraction code.
  3. Label boundary: store label schema, annotator instructions, disagreement rules, quality audits, and adjudication outcomes.
  4. Serving boundary: prove online preprocessing matches offline evaluation for sample rate, channel handling, timestamps, and text normalization.
  5. Audit boundary: keep lineage from model version to data snapshot, transform version, eval suite, rollout gate, and rollback target.

Question: Why is "we only changed preprocessing" a serious release risk?

Preprocessing defines the distribution the model sees. A small change in VAD, sample-rate conversion, silence trimming, casing, punctuation, timestamp alignment, or profanity normalization can invalidate offline metrics and create slice regressions. Treat preprocessing like model code: version it, test it, canary it, and make rollback possible.
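
One way to make "version it" concrete is to fingerprint the preprocessing configuration and compare the offline and online copies before a release. The sketch below is a minimal illustration, not a prescribed implementation: the function names and config keys (sample_rate_hz, vad_threshold, lowercase) are hypothetical, and it assumes the config is a flat, JSON-serializable dict.

```python
import hashlib
import json

_MISSING = object()  # sentinel so an absent key never equals an explicit None


def transform_fingerprint(config):
    """Return a short, stable hash of a JSON-serializable preprocessing config."""
    # sort_keys makes the hash independent of key insertion order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def differing_keys(offline_config, online_config):
    """Return the sorted keys whose values differ between the two configs."""
    keys = set(offline_config) | set(online_config)
    return sorted(
        key
        for key in keys
        if offline_config.get(key, _MISSING) != online_config.get(key, _MISSING)
    )


# Hypothetical config values, for illustration only.
offline_cfg = {"sample_rate_hz": 16000, "vad_threshold": 0.5, "lowercase": True}
online_cfg = {"sample_rate_hz": 16000, "vad_threshold": 0.6, "lowercase": True}

if transform_fingerprint(offline_cfg) != transform_fingerprint(online_cfg):
    print("preprocessing differs in:", differing_keys(offline_cfg, online_cfg))
```

Storing the fingerprint next to each model version gives a cheap equality check at release time; the key-level diff is only needed when the fingerprints disagree.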

System Design

Design Prompts With Hidden Advanced Answers

Prompt 1: Build A Training Dataset Pipeline For Streaming ASR

You need a weekly ASR training snapshot from consented call audio. The model must improve noisy-device and accented-speech slices without storing unnecessary private content. Design the pipeline.

Hidden answer: advanced design outline

Start with consent and retention policy, then define aggregate-only discovery metrics before selecting clips. Store durable IDs, locale, device class, channel, noise bucket, duration, label status, and transform versions rather than ad hoc raw exports. Run privacy filters and redaction before labeling. Stratify sampling by failure slices, keep holdout users out of training, version label instructions, and produce a snapshot manifest that ties every row to audio lineage, transcript version, VAD version, and evaluation eligibility.
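
Two parts of this outline lend themselves to a short sketch: keeping holdout users out of training, and tying each manifest row to its transform versions. The code below is an assumption-laden illustration: the ManifestRow fields, the salt string, and the 10 percent holdout rate are hypothetical, and hashing a durable user ID is one common way to get a stable split, not necessarily the one a given team uses.

```python
import hashlib
from dataclasses import dataclass


def is_holdout_user(user_id, holdout_percent=10, salt="asr-holdout-v1"):
    """Deterministically assign a user to the holdout set by hashing a durable ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # The same user always lands in the same bucket, across weekly snapshots.
    return int(digest, 16) % 100 < holdout_percent


@dataclass(frozen=True)
class ManifestRow:
    # Hypothetical manifest schema: metadata and versions only, no raw content.
    clip_id: str
    user_id: str
    locale: str
    noise_bucket: str
    transcript_version: str
    vad_version: str
    holdout: bool  # True means evaluation-only, never used for training


def build_manifest(clips, transcript_version, vad_version):
    """Build manifest rows for consented clips only."""
    rows = []
    for clip in clips:
        if not clip.get("consented"):
            continue  # unconsented audio never enters the snapshot
        rows.append(
            ManifestRow(
                clip_id=clip["clip_id"],
                user_id=clip["user_id"],
                locale=clip["locale"],
                noise_bucket=clip["noise_bucket"],
                transcript_version=transcript_version,
                vad_version=vad_version,
                holdout=is_holdout_user(clip["user_id"]),
            )
        )
    return rows
```

Splitting by user rather than by clip matters because two clips from the same speaker on either side of the split would leak speaker characteristics into evaluation.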

Prompt 2: Match Offline And Online Features

Offline WER improves, but online canary WER regresses for long utterances. How do you debug the data contract?

Hidden answer: feature parity investigation

Compare online and offline sample-rate conversion, chunk overlap, VAD hangover, timestamp stitching, max-duration truncation, normalization, punctuation handling, and language routing. Replay canary traffic through both paths with synthetic or consented fixtures and diff intermediate features. Check whether the offline eval used full utterances while online serving streamed partial chunks. Freeze promotion until parity tests and slice metrics pass.
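
Diffing intermediate features can be sketched as a stage-by-stage comparison that stops at the first divergence. This is a minimal sketch under stated assumptions: each path is represented as an ordered list of (stage_name, values) pairs with flat lists of floats, and the stage names and tolerance are hypothetical.

```python
import math


def first_divergent_stage(offline_stages, online_stages, tolerance=1e-4):
    """Return (stage_name, reason) for the first stage where the paths differ, else None.

    Each argument is an ordered list of (stage_name, list_of_floats) pairs.
    """
    for (off_name, off_vals), (on_name, on_vals) in zip(offline_stages, online_stages):
        if off_name != on_name:
            return (off_name, "stage order mismatch")
        if len(off_vals) != len(on_vals):
            # Typical of truncation or chunking differences on long utterances.
            return (off_name, "length mismatch")
        for a, b in zip(off_vals, on_vals):
            if not math.isclose(a, b, rel_tol=0.0, abs_tol=tolerance):
                return (off_name, "value mismatch")
    if len(offline_stages) != len(online_stages):
        return ("<end>", "stage count mismatch")
    return None


# Hypothetical fixture: the online path drops the last frame during chunking.
offline = [("resample", [0.1, 0.2, 0.3, 0.4]), ("chunk", [0.1, 0.2, 0.3, 0.4])]
online = [("resample", [0.1, 0.2, 0.3, 0.4]), ("chunk", [0.1, 0.2, 0.3])]
print(first_divergent_stage(offline, online))
```

Reporting the first divergent stage, rather than only the final feature diff, points the investigation at the transform that introduced the mismatch.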

Prompt 3: Add A Feature Store For Spoken RAG

A voice assistant needs searchable memories, transcript summaries, safety labels, and audio-quality signals. What should be online, offline, or excluded?

Hidden answer: storage and privacy tradeoffs

Keep online features minimal: retrieval embeddings, freshness, permissions, source type, summary version, safety flags, and TTL. Keep offline features for evaluation, active learning, and drift analysis: aggregate quality buckets, turn outcomes, device class, latency, and consent status. Exclude raw private audio unless there is explicit consent and retention purpose. Every feature should have an owner, schema, TTL, backfill plan, deletion path, and evaluation showing that it improves task success or safety.
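
The per-feature requirements in the last sentence can be enforced with a small registration check. The sketch below assumes a hypothetical FeatureSpec record; the field names mirror the list above, and the validation rules (non-empty strings, positive TTL) are illustrative choices rather than an established standard.

```python
from dataclasses import dataclass

REQUIRED_TEXT_FIELDS = ("owner", "schema", "backfill_plan", "deletion_path", "evaluation")


@dataclass(frozen=True)
class FeatureSpec:
    # Hypothetical registration record for one feature.
    name: str
    tier: str  # "online" or "offline"
    owner: str = ""
    schema: str = ""
    ttl_days: int = 0
    backfill_plan: str = ""
    deletion_path: str = ""
    evaluation: str = ""


def missing_contract_fields(spec):
    """Return the contract fields a feature still lacks; empty means it can register."""
    missing = [name for name in REQUIRED_TEXT_FIELDS if not getattr(spec, name).strip()]
    if spec.ttl_days <= 0:
        missing.append("ttl_days")
    if spec.tier not in ("online", "offline"):
        missing.append("tier")
    return missing


draft = FeatureSpec(name="summary_embedding", tier="online", owner="voice-memory-team")
print(missing_contract_fields(draft))
```

Running a check like this in CI keeps a feature without a deletion path or TTL from reaching the online store in the first place.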

Production Debugging

Data Contract Incident Drills

Drill 1: Missing Locale Metadata

After a mobile app release, ASR fallback routing increases and WER spikes for bilingual users. The model bundle did not change.

Hidden answer: first-hour response

Split by app version, locale field presence, language ID output, model route, geography, and device class. If locale metadata dropped or changed encoding, route with a conservative language ID fallback and roll back the app or serving parser if possible. Protect dashboards from silent null buckets, add schema validation at ingestion, and add a canary gate that fails when critical metadata coverage regresses.
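
A canary gate on metadata coverage might look like the following sketch. The record shape, the 0.02 maximum drop, and the function names are assumptions for illustration; a real gate would also segment by app version and ingestion path.

```python
def field_coverage(records, field):
    """Fraction of records where the field is present and non-empty."""
    if not records:
        return 0.0
    present = sum(1 for record in records if record.get(field) not in (None, ""))
    return present / len(records)


def coverage_gate(baseline, canary, critical_fields, max_drop=0.02):
    """Return {field: (baseline_coverage, canary_coverage)} for fields that regressed."""
    failures = {}
    for field in critical_fields:
        before = field_coverage(baseline, field)
        after = field_coverage(canary, field)
        if before - after > max_drop:
            failures[field] = (before, after)
    return failures


# Hypothetical traffic samples: the canary build drops locale on half its requests.
baseline = [{"locale": "en-US"}, {"locale": "es-MX"}, {"locale": "en-US"}, {"locale": "fr-FR"}]
canary = [{"locale": "en-US"}, {"locale": ""}, {"locale": None}, {"locale": "fr-FR"}]

failures = coverage_gate(baseline, canary, ["locale"])
if failures:
    print("block promotion:", failures)
```

Treating empty strings and nulls as missing is deliberate: a field that is present but blank is exactly the kind of silent null bucket that hides on dashboards.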

Drill 2: Labeling Guideline Drift

A new label vendor starts expanding abbreviations in transcripts. Offline CER improves, but downstream command accuracy falls.

Hidden answer: label-contract fix

The transcript label contract changed. Compare examples before and after the vendor switch, inspect normalization rules, and rerun command-intent evals that depend on literal wording. Version label guidelines, require overlap audits, keep adjudicated gold sets, and either normalize both styles consistently or block the dataset from training until the downstream contract is restored.
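
An overlap audit against an adjudicated gold set can separate style drift from real content errors. The sketch below is illustrative only: the abbreviation table is a tiny hypothetical stand-in for the actual guideline rules, and whitespace tokenization is a simplifying assumption.

```python
# Hypothetical abbreviation table; real guidelines would define the full mapping.
ABBREVIATIONS = {"dr": "doctor", "st": "street", "appt": "appointment"}


def expand(text):
    """Lowercase the text and expand known abbreviations token by token."""
    return " ".join(ABBREVIATIONS.get(token, token) for token in text.lower().split())


def audit_overlap(pairs):
    """Classify (gold, vendor) transcript pairs into literal, style-only, and content diffs."""
    counts = {"literal_match": 0, "style_only": 0, "content_diff": 0}
    for gold, vendor in pairs:
        if gold == vendor:
            counts["literal_match"] += 1
        elif expand(gold) == expand(vendor):
            # Same words after normalization: the label style changed, not the content.
            counts["style_only"] += 1
        else:
            counts["content_diff"] += 1
    return counts


pairs = [
    ("call dr smith", "call dr smith"),
    ("call dr smith", "call doctor smith"),
    ("call dr smith", "call mr smith"),
]
print(audit_overlap(pairs))
```

A rising style_only share after a vendor switch, with content_diff flat, suggests the guideline changed rather than annotator quality.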

Coding Interview Track

Blind 75 Patterns For Data Pipelines

These drills connect common interview patterns to feature pipelines: hashing for schema diffs, intervals for chunk stitching, heaps for active-learning queues, and sliding windows for drift detection.

Problem: Merge Overlapping Audio Chunks

Given chunk intervals from VAD and endpointing, merge overlaps and adjacent intervals when the gap is at most max_gap_ms. This is the production version of Merge Intervals.

Questions, invariants, tests, and hidden Python answer

Questions: Are intervals closed or half-open? Should touching intervals merge? Are inputs sorted? Can intervals be empty? Invariant: after sorting, the output is disjoint and the last interval is the only candidate that can overlap the next interval. Test cases: empty input, one interval, nested intervals, touching intervals, small allowed gaps, and unsorted input.

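def merge_audio_chunks(chunks, max_gap_ms=0):
    """Merge (start_ms, end_ms) chunks that overlap or sit within max_gap_ms of each other."""
    if not chunks:
        return []

    # Sort by start time so only the last merged interval can reach the next one.
    ordered = sorted(chunks)
    merged = [list(ordered[0])]

    for start, end in ordered[1:]:
        last = merged[-1]
        if start <= last[1] + max_gap_ms:
            # Overlapping, touching, or within the allowed gap: extend in place.
            last[1] = max(last[1], end)
        else:
            merged.append([start, end])

    return [tuple(item) for item in merged]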


assert merge_audio_chunks([(30, 40), (0, 10), (10, 20)]) == [(0, 20), (30, 40)]
assert merge_audio_chunks([(0, 10), (13, 20)], max_gap_ms=3) == [(0, 20)]

Problem: Rank Active-Learning Candidates

Select the top k consented aggregate examples for human review using uncertainty, slice rarity, and recent regression score. This maps to Top K Elements.

Questions, common mistakes, and hidden Python answer

Questions: Can private content be selected? Are scores stable under ties? Should one user dominate the queue? Common mistakes include ignoring consent, sorting the entire stream when a heap is enough, using raw transcript content in logs, and optimizing uncertainty while starving rare slices.

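from heapq import heappush, heappushpop


def rank_review_candidates(rows, k):
    """Return the example IDs of the k highest-scoring consented rows, best first."""
    heap = []  # min-heap holding at most k (score, example_id) pairs
    for row in rows:
        # Filter on consent before any scoring happens.
        if not row.get("consented"):
            continue
        score = (
            0.55 * row["uncertainty"]
            + 0.30 * row["slice_rarity"]
            + 0.15 * row["regression"]
        )
        item = (score, row["example_id"])
        if len(heap) < k:
            heappush(heap, item)
        else:
            # Push the new item and drop the smallest, so the heap keeps the top k.
            heappushpop(heap, item)

    return [example_id for score, example_id in sorted(heap, reverse=True)]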


rows = [
    {"example_id": "a", "consented": True, "uncertainty": 0.9, "slice_rarity": 0.2, "regression": 0.3},
    {"example_id": "b", "consented": False, "uncertainty": 1.0, "slice_rarity": 1.0, "regression": 1.0},
    {"example_id": "c", "consented": True, "uncertainty": 0.5, "slice_rarity": 0.9, "regression": 0.8},
]
assert rank_review_candidates(rows, 2) == ["c", "a"]
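
The question "Should one user dominate the queue?" can be answered with a per-user cap. The variant below is a sketch under assumptions: each row carries a hypothetical user_id and a precomputed score, and for clarity it sorts the eligible rows instead of maintaining a heap, which is acceptable for small batches but not for an unbounded stream.

```python
from collections import Counter


def rank_with_user_cap(rows, k, max_per_user=1):
    """Return up to k example IDs, best first, with at most max_per_user per user."""
    eligible = [row for row in rows if row.get("consented")]
    # Highest score first; example_id breaks ties deterministically.
    eligible.sort(key=lambda row: (-row["score"], row["example_id"]))

    picked = []
    per_user = Counter()
    for row in eligible:
        if len(picked) >= k:
            break
        if per_user[row["user_id"]] >= max_per_user:
            continue  # this user already fills their share of the queue
        per_user[row["user_id"]] += 1
        picked.append(row["example_id"])
    return picked


capped_rows = [
    {"example_id": "a", "user_id": "u1", "consented": True, "score": 0.9},
    {"example_id": "b", "user_id": "u1", "consented": True, "score": 0.8},
    {"example_id": "c", "user_id": "u2", "consented": True, "score": 0.7},
    {"example_id": "d", "user_id": "u3", "consented": False, "score": 1.0},
]
print(rank_with_user_cap(capped_rows, 2))
```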

Problem: Detect Sustained Schema Coverage Drop

Given hourly metadata coverage rates, return the start index of the first w-hour window whose average coverage falls below a threshold, or None if no window qualifies. This is a sliding-window incident detector.

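Hidden answer with production follow-up

def first_coverage_drop(values, window, threshold):
    """Return the start index of the first window whose mean is below threshold, else None."""
    if window <= 0 or len(values) < window:
        return None

    # Running sum of the current window, starting with the first one.
    total = sum(values[:window])
    if total / window < threshold:
        return 0

    for right in range(window, len(values)):
        # Slide by one hour: add the newest value and drop the oldest.
        total += values[right] - values[right - window]
        left = right - window + 1
        if total / window < threshold:
            return left

    return None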


assert first_coverage_drop([0.99, 0.98, 0.70, 0.72, 0.74], 3, 0.85) == 1
assert first_coverage_drop([0.99, 0.98, 0.97], 2, 0.95) is None

Production follow-up: page only when the drop affects critical fields or user-facing slices, deduplicate alerts by app version and ingestion path, and attach rollout context so on-call can identify whether the first bad hour matches a deployment.
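
The follow-up can be sketched as a small paging filter. Everything named here is hypothetical: the alert dict shape, the deploy_hours lookup, and the choice to key deduplication on (app_version, ingestion_path). The sketch covers the critical-field condition only and leaves user-facing slice checks out.

```python
def pages_for_alerts(alerts, critical_fields, deploy_hours):
    """Return one page per (app_version, ingestion_path), using the earliest critical alert.

    deploy_hours maps app_version to the hour that version was deployed.
    """
    first_by_key = {}
    for alert in sorted(alerts, key=lambda a: a["hour"]):
        if alert["field"] not in critical_fields:
            continue  # non-critical fields do not page
        key = (alert["app_version"], alert["ingestion_path"])
        if key in first_by_key:
            continue  # duplicate of an alert already paged for this key
        deploy_hour = deploy_hours.get(alert["app_version"])
        first_by_key[key] = dict(
            alert,
            deploy_hour=deploy_hour,
            matches_deploy=deploy_hour == alert["hour"],
        )
    return list(first_by_key.values())


alerts = [
    {"hour": 5, "field": "locale", "app_version": "4.2.0", "ingestion_path": "mobile"},
    {"hour": 3, "field": "locale", "app_version": "4.2.0", "ingestion_path": "mobile"},
    {"hour": 4, "field": "theme", "app_version": "4.2.0", "ingestion_path": "web"},
]
pages = pages_for_alerts(alerts, {"locale"}, {"4.2.0": 3})
print(pages)
```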

Timed Exam

Advanced Practice Prompts

R&D Prompt

A self-supervised audio model improves average WER but worsens short noisy commands. What experiments and release gates do you require before launch?

Hidden answer

Require slice-specific evals, command-intent task metrics, calibration by utterance length and noise bucket, online/offline preprocessing parity, and canary rollback gates. Inspect whether representation learning improved long-form transcription while hurting endpointing or short-command acoustics.
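
A slice-level release gate can be sketched as a comparison of per-slice WER between the baseline and the candidate. The slice names, WER values, and the 0.005 absolute tolerance below are made up for illustration; the point is that an average improvement does not override a slice regression.

```python
def slice_regressions(baseline_wer, candidate_wer, max_regression=0.005):
    """Return {slice_name: reason} for slices that block the release."""
    blocked = {}
    for slice_name, base in baseline_wer.items():
        candidate = candidate_wer.get(slice_name)
        if candidate is None:
            # A slice missing from the candidate eval is treated as a failure, not a pass.
            blocked[slice_name] = "missing in candidate eval"
        elif candidate - base > max_regression:
            blocked[slice_name] = f"WER {base:.3f} -> {candidate:.3f}"
    return blocked


# Hypothetical numbers: the average improves while short noisy commands get worse.
baseline_wer = {"all": 0.10, "short_noisy_commands": 0.18}
candidate_wer = {"all": 0.09, "short_noisy_commands": 0.21}
print(slice_regressions(baseline_wer, candidate_wer))
```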

System Design Prompt

Design a backfill system that recomputes audio features for 100 million clips without impacting real-time serving.

Hidden answer

Use isolated batch queues, explicit resource quotas, resumable manifests, idempotent writes, transform-versioned output paths, data quality sampling, and throttles tied to serving headroom. Never let replay jobs share priority with real-time requests, and require lineage plus rollback to the previous feature snapshot.
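
The resumable, idempotent, throttled core of such a backfill can be sketched in a few lines. This is a single-process illustration under assumptions: done and store stand in for a durable manifest and a feature store, headroom is a hypothetical callable that reports spare serving capacity as a fraction, and the 0.3 floor is arbitrary.

```python
def run_backfill(clip_ids, compute, store, done, transform_version, headroom, min_headroom=0.3):
    """Recompute features until the manifest is finished or serving headroom runs low.

    Returns the number of clips processed in this run.
    """
    processed = 0
    for clip_id in clip_ids:
        key = (transform_version, clip_id)
        if key in done:
            continue  # resumable: skip work finished by an earlier run
        if headroom() < min_headroom:
            break  # throttle: yield to real-time serving and resume later
        # Idempotent write: the key includes the transform version, so a rerun
        # overwrites the same slot and never touches the previous snapshot.
        store[key] = compute(clip_id)
        done.add(key)
        processed += 1
    return processed


store, done = {}, set()
readings = iter([0.9, 0.9, 0.1])  # hypothetical headroom samples
first_run = run_backfill(["a", "b", "c", "d"], len, store, done, "feat-v2", lambda: next(readings))
second_run = run_backfill(["a", "b", "c", "d"], len, store, done, "feat-v2", lambda: 0.9)
print(first_run, second_run, sorted(store))
```

Because the previous version's keys are never overwritten, rollback amounts to pointing readers back at the old transform version.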