Active Learning
Sample For Decisions, Not Just Volume
Active learning should answer a release or research question. For
speech systems, useful selection often combines model uncertainty,
user impact, audio conditions, language/domain coverage, and known
slice gaps.
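One way to combine these signals is a weighted priority score. The sketch below is illustrative only: the `Candidate` fields, their 0-to-1 normalization, and the weights are all assumptions chosen for the example, not values from any particular system.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical selection candidate; all scores assumed normalized to 0..1."""
    utt_id: str
    uncertainty: float  # e.g. 1 - mean token confidence
    user_impact: float  # e.g. normalized traffic share of the affected feature
    slice_gap: float    # how under-covered this slice is in the eval set


# Assumed weights; a real team would tune these to the release question.
WEIGHTS = {"uncertainty": 0.4, "user_impact": 0.3, "slice_gap": 0.3}


def priority(c: Candidate) -> float:
    """Weighted sum of the selection signals."""
    return (
        WEIGHTS["uncertainty"] * c.uncertainty
        + WEIGHTS["user_impact"] * c.user_impact
        + WEIGHTS["slice_gap"] * c.slice_gap
    )


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Highest-priority candidates first."""
    return sorted(candidates, key=priority, reverse=True)
```

With these weights, a moderately uncertain utterance from a high-impact, under-covered slice can outrank a very uncertain utterance that few users would ever hit.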
ASR Selection Signals
Low confidence, high churn in streaming partial hypotheses, endpointing retries, rare entities, code-switching, far-field audio, noisy mobile audio, and clusters of user corrections.
Hidden answer: common sampling mistake
Sampling only the lowest-confidence utterances can overrepresent
noise, unsupported languages, or unusable audio. A strong plan
stratifies by product slice and adds diversity caps so the label
batch improves the eval set instead of becoming a junk drawer.
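A minimal sketch of uncertainty sampling with a per-slice diversity cap follows. The record layout (`id`, `slice`, `confidence`) and the function name `select_batch` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def select_batch(
    candidates: List[Dict], budget: int, per_slice_cap: int
) -> List[Dict]:
    """Pick the lowest-confidence items, but never more than
    `per_slice_cap` from any single product slice.

    Each candidate is assumed to be a dict with the keys
    'id', 'slice', and 'confidence' (hypothetical schema).
    """
    # Most uncertain first.
    ordered = sorted(candidates, key=lambda c: c["confidence"])
    taken_per_slice = defaultdict(int)
    batch = []
    for cand in ordered:
        if len(batch) >= budget:
            break
        if taken_per_slice[cand["slice"]] >= per_slice_cap:
            continue  # cap reached: skip so one slice cannot flood the batch
        taken_per_slice[cand["slice"]] += 1
        batch.append(cand)
    return batch
```

Without the cap, a single noisy slice with many near-zero confidences would consume the whole labeling budget. The cap pushes the remaining budget toward the next most uncertain slices.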
TTS Selection Signals
Pronunciation complaints, long-form instability, number/date errors, emotional tone mismatch, clipping, latency outliers, and speaker inconsistency.
Hidden answer: why TTS active learning is different
TTS examples are often text prompts plus generated audio, so the
team needs separate labels for text difficulty, pronunciation,
prosody, audio artifacts, safety, and latency. A preference rating
alone shows which output listeners favor, not what to fix.
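The separate label dimensions could be captured in a record like the one sketched here. The class name, field names, rating scale, and the 500 ms latency budget are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TTSLabel:
    """Hypothetical per-example label with one field per dimension."""
    prompt_id: str
    text_difficulty: int      # assumed 1 (easy) to 5 (hard) scale
    pronunciation_ok: bool
    prosody_ok: bool
    artifact_free: bool
    safe: bool
    latency_ms: float
    preferred: Optional[bool] = None  # overall preference, if collected


def failing_dimensions(
    label: TTSLabel, latency_budget_ms: float = 500.0
) -> List[str]:
    """Return the names of the dimensions that failed for this example."""
    failed = []
    if not label.pronunciation_ok:
        failed.append("pronunciation")
    if not label.prosody_ok:
        failed.append("prosody")
    if not label.artifact_free:
        failed.append("audio_artifacts")
    if not label.safe:
        failed.append("safety")
    if label.latency_ms > latency_budget_ms:
        failed.append("latency")
    return failed
```

Two examples that both lose a preference comparison can fail on entirely different dimensions, which is the information the engineering team needs to route the fix.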
Voice Agent Selection Signals
ASR entity substitutions, retrieval misses, stale citations, low task completion, repeated clarification, barge-in failures, and unsafe tool attempts.
Hidden answer: stage-aware sampling
Attach a pipeline-stage tag to every selected example. Otherwise a
failed answer may be mislabeled as an LLM issue when the root cause
was ASR, retrieval, stale index metadata, tool policy, or TTS delivery.
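As a rough sketch, stage tags can be stored as a per-stage pass/fail map, with triage starting from the earliest stage that failed. The stage names, their ordering, and the "earliest failure first" heuristic are assumptions for this example. A real pipeline may need a richer attribution rule.

```python
from typing import Dict, Optional

# Assumed pipeline order for a voice agent, upstream to downstream.
STAGES = ("asr", "retrieval", "index_metadata", "llm", "tool_policy", "tts")


def first_failing_stage(stage_ok: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage explicitly marked as failed, or None.

    Stages missing from `stage_ok` are treated as unreviewed,
    not as failures.
    """
    for stage in STAGES:
        if stage_ok.get(stage) is False:
            return stage
    return None


def tag_example(example_id: str, stage_ok: Dict[str, bool]) -> Dict:
    """Bundle an example with its stage results and a triage hint."""
    return {
        "id": example_id,
        "stage_ok": dict(stage_ok),
        "triage_stage": first_failing_stage(stage_ok),
    }
```

For instance, an example where both `asr` and `llm` are marked failed is routed to ASR review first, since a wrong transcript can make a correct LLM look broken.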
Privacy-Safe Proxies
Use synthetic utterances, aggregate metrics, redacted text, hashed slice IDs, generated noise conditions, and review-approved fixtures.
Hidden answer: when proxies are enough
Proxies are enough for pipeline logic, CI gates, data contracts,
release tooling, and many regression tests. They are not enough
to claim final ASR, TTS, or human-experience quality for a real
target population.
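Of the proxies listed above, hashed slice IDs are the simplest to illustrate. The sketch below uses a keyed hash (HMAC-SHA256 from the Python standard library) so that slice names cannot be recovered by hashing a list of guesses without the key. The function name, the example slice names, and the 16-character truncation are assumptions.

```python
import hashlib
import hmac


def hashed_slice_id(slice_name: str, key: bytes) -> str:
    """Map a human-readable slice name to a stable opaque ID.

    `key` is a secret held by the data owner (assumption); the same
    key and name always produce the same ID, so aggregate metrics can
    still be joined per slice.
    """
    digest = hmac.new(key, slice_name.encode("utf-8"), hashlib.sha256)
    # Truncated for readability in dashboards; keep more hex characters
    # if the number of slices is large enough for collisions to matter.
    return digest.hexdigest()[:16]
```

A CI gate could then compare metrics keyed by `hashed_slice_id("es-MX/noisy_mobile", key)` across releases without the fixture ever containing the readable slice name.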