Recognition
From Alignment To Foundation Models
- Template and DTW systems: compare acoustic patterns with explicit time warping.
- HMM/GMM systems: model hidden phonetic states, acoustic likelihoods, pronunciation lexicons, and language models.
- DNN-HMM hybrids: replace GMM acoustic models with neural posterior estimators while keeping decoding graphs.
- End-to-end ASR: CTC, attention encoder-decoder, and RNN-T reduce hand-built components but shift complexity into data and decoding.
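- Self-supervised and weakly supervised systems: wav2vec-style self-supervised pretraining and Whisper-style large-scale weak supervision improve robustness, but production decisions still depend on per-slice evaluation (accent, noise, domain) rather than aggregate scores.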
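To make the first item concrete, here is a minimal sketch of dynamic time warping. It assumes 1-D toy sequences and an absolute-difference frame cost; real template systems compared frames of acoustic feature vectors, and the names `dtw_distance`, `template`, `slow_utterance`, and `different` are illustrative only.

```python
import math


def dtw_distance(a, b):
    """Cumulative DTW cost between two 1-D sequences (toy frame cost: |x - y|)."""
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only (b frame is reused)
                cost[i][j - 1],      # advance in b only (a frame is reused)
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Hypothetical data: the same contour spoken at half speed, and a different one.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow_utterance = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
different = [3.0, 3.0, 1.0, 1.0, 3.0]

print(dtw_distance(template, slow_utterance))  # 0.0: warping absorbs the tempo change
print(dtw_distance(template, different) > 0)   # True
```

The explicit warping path is what lets a stored template match a slower or faster rendition without any statistical model.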
Question: Why did HMM/GMM systems use lexicons and language models?
They separated acoustic evidence from pronunciation and from
word-sequence probability: the lexicon maps words to phone sequences,
and the language model scores word sequences. This made it possible to
inject vocabulary, constrain decoding, and adapt to domain language
without retraining the acoustic model. Modern systems often hide these
pieces, but contextual biasing, rescoring, and post-processing are
descendants of the same idea.
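A small sketch of that separation, under stated assumptions: each hypothesis carries a fixed acoustic log score, and a swappable language-model table supplies the word-sequence log probability. The function `rescore`, the weight `lm_weight`, and all scores are made up for illustration and do not come from any particular toolkit.

```python
def rescore(hypotheses, lm_logprob, lm_weight=1.0):
    """Rank hypotheses by acoustic log score plus weighted LM log probability."""
    scored = [
        (acoustic + lm_weight * lm_logprob[text], text)
        for text, acoustic in hypotheses
    ]
    return [text for _, text in sorted(scored, reverse=True)]


# Hypothetical n-best list: (text, acoustic log score). These never change.
hypotheses = [
    ("recognize speech", -12.0),
    ("wreck a nice beach", -11.5),
]

# Two hypothetical language models, expressed as log-probability tables.
general_lm = {"recognize speech": -4.0, "wreck a nice beach": -5.0}
beach_lm = {"recognize speech": -6.0, "wreck a nice beach": -3.0}

best_general = rescore(hypotheses, general_lm)[0]
best_beach = rescore(hypotheses, beach_lm)[0]
print(best_general)  # recognize speech
print(best_beach)    # wreck a nice beach
```

Only the language-model table changed between the two calls, which is the property the answer describes: domain adaptation without touching the acoustic scores.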
Question: What should you evaluate beyond WER?
Track entity error rate, timestamp error, diarization error, partial
transcript churn, language and accent slices, noise slices, latency,
memory, throughput, and downstream task success. WER can improve while
product quality regresses.
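The last sentence can be illustrated with a toy comparison. The sketch below computes WER with a standard token-level edit distance and a deliberately crude entity error rate (verbatim substring match). The sentences, the entity list, and the helper names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def entity_error_rate(entities, hyp):
    # Crude assumption: an entity counts as correct only if it appears verbatim.
    missed = sum(1 for entity in entities if entity not in hyp)
    return missed / len(entities)


ref = "please call doctor nguyen at the main clinic today"
entities = ["nguyen"]
hyp_a = "please call doctor nguyen at a main clinic to day"  # three small word errors
hyp_b = "please call doctor newen at the main clinic today"  # one error, on the name

for name, hyp in [("A", hyp_a), ("B", hyp_b)]:
    print(name, round(wer(ref, hyp), 3), entity_error_rate(entities, hyp))
```

System B has the lower WER but gets the one word wrong that a calling feature depends on, so a WER-only dashboard would rank it ahead of A.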
Lab
Latency Budget For A Local Voice Assistant
Build a spreadsheet or small script that budgets every step in one
conversational turn. Use placeholders first, then replace them with
real measurements from local experiments.
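# Placeholder per-stage latencies in milliseconds; replace with measurements.
budget_ms = {
    "vad_endpoint": 120,
    "asr_first_partial": 250,
    "asr_final": 550,
    "llm_first_token": 300,
    "tts_first_audio": 350,
    "playback_buffer": 80,
}

# Serial path to first audio. The sum uses asr_first_partial, so it effectively
# assumes the LLM can start from a partial transcript; asr_final is budgeted but
# not counted. Swap it in if the LLM waits for the final transcript.
total_to_first_audio = (
    budget_ms["vad_endpoint"]
    + budget_ms["asr_first_partial"]
    + budget_ms["llm_first_token"]
    + budget_ms["tts_first_audio"]
    + budget_ms["playback_buffer"]
)
print(total_to_first_audio)  # 1100 with the placeholder values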
Question: What does this toy budget hide?
It hides overlap between stages, network jitter, queueing, cold
starts, text normalization, endpointing mistakes, audio device
buffers, cancellation, barge-in, and retries. The next version should
record per-stage traces for successful, slow, interrupted, and failed
turns.
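One possible shape for that next version is sketched below. It stores start and end timestamps per stage instead of a single duration, so overlap becomes visible. The classes `StageSpan` and `TurnTrace`, the outcome labels, and every timestamp are assumptions for illustration, not measurements.

```python
from dataclasses import dataclass, field


@dataclass
class StageSpan:
    name: str
    start_ms: float
    end_ms: float


@dataclass
class TurnTrace:
    outcome: str  # e.g. "success", "slow", "interrupted", "failed"
    spans: list = field(default_factory=list)

    def record(self, name, start_ms, end_ms):
        self.spans.append(StageSpan(name, start_ms, end_ms))

    def serial_sum_ms(self):
        # What a simple additive budget computes: every stage counted in full.
        return sum(s.end_ms - s.start_ms for s in self.spans)

    def wall_clock_ms(self):
        # What the user experiences: first stage start to last stage end.
        return max(s.end_ms for s in self.spans) - min(s.start_ms for s in self.spans)


# Hypothetical successful turn in which streaming ASR runs while the VAD
# is still waiting for the endpoint.
trace = TurnTrace(outcome="success")
trace.record("vad_endpoint", 0, 120)
trace.record("asr_first_partial", 0, 250)
trace.record("llm_first_token", 250, 550)
trace.record("tts_first_audio", 550, 900)
trace.record("playback_buffer", 900, 980)

print(trace.serial_sum_ms())  # 1100
print(trace.wall_clock_ms())  # 980
```

Collecting one such trace per turn, tagged by outcome, makes it possible to compare the slow and interrupted populations against the successful one rather than tuning a single averaged number.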