ASR, TTS, And Speech-To-Speech History

Recognition

From Alignment To Foundation Models

Template and DTW systems: compare acoustic patterns with explicit time warping.
HMM/GMM systems: model hidden phonetic states, acoustic likelihoods, pronunciation lexicons, and language models.
DNN-HMM hybrids: replace GMM acoustic models with neural posterior estimators while keeping decoding graphs.
End-to-end ASR: CTC, attention encoder-decoder, and RNN-T reduce hand-built components but shift complexity into data and decoding.
Self-supervised and weakly supervised systems: wav2vec-style pretraining can reduce labeled-data needs, and Whisper-style scale can improve robustness. Production still depends on slice evaluation for domain terms, timestamps, accent slices built from approved aggregate or consented tags (including low-resource accents), and silence or noise behavior.
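
To make the explicit time warping in template and DTW systems concrete, here is a minimal sketch of the DTW cumulative-cost recurrence. It assumes 1-D features and an absolute-difference local cost. Real systems compared multi-dimensional spectral frames, and the sequences below are made-up values.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D feature sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b frame is reused
                cost[i][j - 1],      # b advances, a frame is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# Illustrative values only: a stored template, the same contour spoken
# more slowly, and a different contour.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow_utterance = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 1.0]
other_word = [3.0, 3.0, 1.0, 1.0, 3.0]

print(dtw_distance(template, slow_utterance))
print(dtw_distance(template, other_word))
```

The slower utterance aligns to the template at zero cost because warping lets frames repeat, while the different contour cannot be warped into a match.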

Question: Why did HMM/GMM systems use lexicons and language models?

They separated acoustic evidence from pronunciation and word sequence probability. This made it possible to inject vocabulary, constrain decoding, and improve domain language without retraining the acoustic model. Modern systems often hide these pieces, but contextual biasing, rescoring, and post-processing are descendants of the same idea.
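
A minimal sketch of that separation, framed as n-best rescoring: the acoustic score stays fixed and a swappable language model score is added with a weight. The hypothesis list, `toy_domain_lm`, `DOMAIN_TERMS`, and `lm_weight` are illustrative assumptions, not values from any real decoder.

```python
# Hypothetical domain vocabulary injected without touching the acoustic model.
DOMAIN_TERMS = {"metformin"}

def toy_domain_lm(text):
    """Toy log-probability: known domain terms cost nothing, other words cost 1."""
    return sum(0.0 if word in DOMAIN_TERMS else -1.0 for word in text.split())

def rescore(nbest, lm_logprob, lm_weight=0.5):
    """Return the hypothesis with the best acoustic + weighted LM log score."""
    return max(
        nbest,
        key=lambda h: h["acoustic_logprob"] + lm_weight * lm_logprob(h["text"]),
    )

# Made-up acoustic scores: the acoustically best hypothesis splits the drug name.
nbest = [
    {"text": "patient takes met for men", "acoustic_logprob": -4.0},
    {"text": "patient takes metformin", "acoustic_logprob": -4.5},
]

print(rescore(nbest, toy_domain_lm)["text"])
print(rescore(nbest, toy_domain_lm, lm_weight=0.0)["text"])
```

With the language model weight set to zero the acoustic winner is kept. With the domain language model switched on, the in-vocabulary hypothesis wins, and only the language model changed.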

Question: What should you evaluate beyond WER?

Track entity error rate, timestamp error, diarization error, partial transcript churn, approved aggregate language and accent slices, noise slices, latency, memory, throughput, and downstream task success. Average WER can improve while product quality regresses, for example when a rare but critical entity class gets worse.
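
As a rough illustration of why WER alone is not enough, the sketch below computes word-level WER next to a simple entity error rate. The entity metric here is an assumption made for the example (single-word entities, exact match); production definitions vary, and the sentences are invented.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: edit distance divided by reference length."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

def entity_error_rate(ref, hyp, entities):
    """Fraction of entities present in the reference but missing from the hypothesis."""
    ref_words, hyp_words = set(ref.split()), set(hyp.split())
    expected = [e for e in entities if e in ref_words]
    if not expected:
        return 0.0
    return sum(e not in hyp_words for e in expected) / len(expected)

ref = "give the patient ten milligrams of warfarin today"
hyp = "give the patient ten milligrams of warfare today"

print(wer(ref, hyp))
print(entity_error_rate(ref, hyp, ["warfarin"]))
```

One substituted word out of eight gives a low WER, yet the only entity that matters in the sentence is wrong.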

Synthesis

From Concatenation To Neural Voices

Classic And Statistical TTS

Rule-based and concatenative systems used text normalization, pronunciation dictionaries, prosody rules, and recorded unit selection. Statistical parametric systems later modeled acoustic features and vocoder parameters more compactly, but often gave up naturalness in exchange for controllability and a smaller footprint. These systems could sound clear in narrow domains but were hard to scale across styles.

Hidden answer: Why this still matters

Text normalization remains a major source of TTS bugs. Dates, currency, addresses, abbreviations, code snippets, and names can break naturalness before the neural model sees the input.
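
A small sketch of a rule-based normalizer shows how these bugs arise. The abbreviation table and the currency rule are hypothetical and deliberately naive; number verbalization is left out to keep the example short.

```python
import re

# Hypothetical abbreviation table; "St." is deliberately ambiguous
# (Street versus Saint) to show a typical failure.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}

def _speak_currency(match):
    """Turn a matched dollar amount into spoken-form words (digits are kept)."""
    dollars, cents = match.group(1), match.group(2)
    spoken = f"{dollars} dollar" + ("" if int(dollars) == 1 else "s")
    if cents:
        spoken += f" and {int(cents)} cent" + ("" if int(cents) == 1 else "s")
    return spoken

def normalize(text):
    """Expand abbreviations, then rewrite dollar amounts."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\$(\d+)(?:\.(\d{2}))?", _speak_currency, text)

print(normalize("Dr. Lee paid $5.50 on Elm St. today"))
print(normalize("St. Louis"))
```

The first input comes out as intended, while the second becomes "Street Louis". A context-free rule cannot tell the two readings of "St." apart, and the acoustic model never gets a chance to fix it.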

Neural TTS

Tacotron-style acoustic models, WaveNet-style vocoders, non-autoregressive FastSpeech systems, VITS-style end-to-end systems, and diffusion or flow models improved quality and speaker control.

Hidden answer: Production tradeoff

Models that score a higher MOS can also raise first-audio-byte latency, GPU cost, memory use, and failure risk. A strong rollout compares quality by slice, conversational latency, cold starts, streaming chunk behavior, safety controls, and rollback readiness.

Conversation

Speech-To-Speech Architectures

Cascaded ASR -> LLM -> TTS

The cascade is easy to inspect, test, and debug, and its text boundaries give moderation a clear place to run. Modern versions often stream micro-turns rather than waiting for a full transcript, but each boundary can still add latency and discard prosody, emotion, and turn-taking cues.
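
The shape of a streaming cascade can be sketched with chained generators. All three stages below are stubs invented for illustration (no real ASR, LLM, or TTS is called), and the `pitch` field stands in for any acoustic cue that a text-only boundary drops.

```python
def asr_stream(audio_chunks):
    """Stub ASR: emits text per chunk; acoustic cues such as pitch are dropped here."""
    for chunk in audio_chunks:
        yield chunk["words"]

def llm_stream(text_segments):
    """Stub LLM: replies as soon as a segment ends a sentence (one micro-turn)."""
    buffer = []
    for segment in text_segments:
        buffer.append(segment)
        if segment.endswith((".", "?", "!")):
            yield "You said: " + " ".join(buffer)
            buffer = []

def tts_stream(replies):
    """Stub TTS: emits one placeholder audio chunk per word."""
    for reply in replies:
        for word in reply.split():
            yield f"<audio:{word}>"

# Made-up input: each chunk carries words plus a prosodic cue.
audio_chunks = [
    {"words": "what time", "pitch": "rising"},
    {"words": "is it?", "pitch": "rising"},
]

out = list(tts_stream(llm_stream(asr_stream(audio_chunks))))
print(out)
```

Every intermediate value is plain text that can be logged, filtered, or asserted on, which is what makes the cascade easy to test. The rising pitch never reaches the reply stage.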

Hidden answer: When to choose the cascade

Choose it when correctness, explainability, safety review, transcript UI, tool use, enterprise compliance, and incremental rollout matter more than preserving every acoustic cue. Treat it as the baseline to beat, then compare native or codec-based speech models on latency, barge-in, grounding, safety, and rollback.

Direct Or Codec-Based Speech-To-Speech

Direct systems can model speech tokens or latent audio representations and produce speech without a text-only bottleneck. They may preserve timing and prosody better, especially in duplex conversation, but are harder to evaluate and control.

Hidden answer: What can go wrong?

Risks include speaker leakage, unstable prosody, hallucinated content, poor tool-use grounding, hard-to-audit safety behavior, bitrate artifacts, and difficult regression triage. Evaluation must include intelligibility, semantic faithfulness, latency, speaker similarity, safety, and user task success.

Lab

Latency Budget For A Local Voice Assistant

Build a spreadsheet or small script that budgets every step in one conversational turn. Use placeholders first, then replace them with real measurements from local experiments.

# Placeholder per-stage latencies in milliseconds; replace with measured values.
budget_ms = {
    "vad_endpoint": 120,
    "asr_first_partial": 250,
    "asr_final": 550,
    "llm_first_token": 300,
    "tts_first_audio": 350,
    "playback_buffer": 80,
}

# Optimistic path: the LLM starts from the first ASR partial.
total_to_first_audio = (
    budget_ms["vad_endpoint"]
    + budget_ms["asr_first_partial"]
    + budget_ms["llm_first_token"]
    + budget_ms["tts_first_audio"]
    + budget_ms["playback_buffer"]
)
# Conservative path: the LLM waits for the final ASR transcript.
conservative_to_first_audio = (
    budget_ms["vad_endpoint"]
    + budget_ms["asr_final"]
    + budget_ms["llm_first_token"]
    + budget_ms["tts_first_audio"]
    + budget_ms["playback_buffer"]
)
print(total_to_first_audio, conservative_to_first_audio)  # 1100 1400

Question: What does this toy budget hide?

It hides overlap between stages, network jitter, queueing, cold starts, text normalization, endpointing mistakes, audio device buffers, cancellation, barge-in, retries, and whether the assistant can safely act on an ASR partial or must wait for a final transcript. The next version should record per-stage traces for successful, slow, interrupted, and failed turns.
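
One possible shape for those per-stage traces is a list of timestamped spans, which makes overlap visible instead of assuming stages run back to back. The `Span` type, the stage names, and every timestamp below are assumptions for illustration, not measurements.

```python
from dataclasses import dataclass

@dataclass
class Span:
    stage: str
    start_ms: float
    end_ms: float

def time_to_first_audio(spans, user_stop_ms):
    """Wall-clock latency from end of user speech to first audio played."""
    first_audio = min(s.end_ms for s in spans if s.stage == "playback_buffer")
    return first_audio - user_stop_ms

def summed_stage_time(spans):
    """Naive budget: add every stage duration as if nothing overlapped."""
    return sum(s.end_ms - s.start_ms for s in spans)

# Hypothetical trace for one turn; the user stops speaking at 1000 ms and
# ASR finalization overlaps with VAD endpointing.
user_stop_ms = 1000
trace = [
    Span("vad_endpoint", 1000, 1120),
    Span("asr_final", 900, 1200),
    Span("llm_first_token", 1200, 1500),
    Span("tts_first_audio", 1500, 1850),
    Span("playback_buffer", 1850, 1930),
]

measured = time_to_first_audio(trace, user_stop_ms)
naive = summed_stage_time(trace)
print(measured, naive, naive - measured)
```

The gap between the naive sum and the measured latency is the overlap the toy budget cannot show. Recording the same spans for slow, interrupted, and failed turns would let the two be compared per outcome.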

Interview Prompts

Advanced Questions

Prompt: A New ASR Model Improves Average WER But Hurts Medical Terms

Decide whether to ship, roll back, fine-tune, add contextual biasing, or change post-processing.

Hidden answer: Strong response

Do not ship broadly if critical entities regress. Quantify the domain-term slice, inspect substitutions, compare decoding and normalization changes, test contextual biasing or rescoring, and canary only on safe traffic. Tie the decision to product risk, privacy constraints, owner approval, rollback, and a follow-up data plan.
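
The first step of that response can be written as a simple slice gate. This is a sketch under assumptions: the slice names, the WER numbers, and the three outcomes (`block`, `canary`, `hold`) are placeholders, and a real decision would also weigh the risk, privacy, and approval factors above.

```python
def ship_decision(baseline, candidate, critical_slices, tolerance=0.0):
    """Block when any critical slice regresses beyond tolerance (lower WER is better)."""
    regressions = {
        name: (baseline[name], candidate[name])
        for name in critical_slices
        if candidate[name] > baseline[name] + tolerance
    }
    if regressions:
        return "block", regressions
    if candidate["overall"] < baseline["overall"]:
        return "canary", {}
    return "hold", {}

# Placeholder WER values per slice, not real results.
baseline = {"overall": 0.082, "medical_terms": 0.110}
candidate = {"overall": 0.074, "medical_terms": 0.135}

decision, regressions = ship_decision(baseline, candidate, ["medical_terms"])
print(decision, regressions)
```

The gate blocks the candidate even though its overall WER is lower, and it returns the regressed slices so the follow-up (inspecting substitutions, testing contextual biasing) starts from the right data.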

Prompt: A TTS Voice Sounds Better Offline But Users Interrupt It More

Explain what metrics and experiments you run before deciding whether to keep the voice.

Hidden answer: Strong response

Measure first-audio-byte latency, completion rate, interruption rate, perceived pace, pronunciation errors, long-form fatigue, sentence streaming, cold starts, and task success. Offline MOS is useful, but conversational UX can fail because the voice starts too late, speaks too slowly, mispronounces key terms, or prevents natural turn-taking.
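
A few of those metrics can be computed from per-turn logs. The log schema (`interrupted`, `first_audio_ms`, `task_success`) and all values below are hypothetical; the point is only to compare two voices on conversational metrics side by side.

```python
from statistics import median

def summarize(turns):
    """Aggregate conversational metrics over a list of per-turn log records."""
    n = len(turns)
    return {
        "interruption_rate": sum(t["interrupted"] for t in turns) / n,
        "median_first_audio_ms": median(t["first_audio_ms"] for t in turns),
        "task_success_rate": sum(t["task_success"] for t in turns) / n,
    }

# Made-up logs for the current voice and the new, higher-MOS voice.
old_voice = [
    {"interrupted": False, "first_audio_ms": 420, "task_success": True},
    {"interrupted": False, "first_audio_ms": 450, "task_success": True},
    {"interrupted": True, "first_audio_ms": 480, "task_success": True},
    {"interrupted": False, "first_audio_ms": 430, "task_success": False},
]
new_voice = [
    {"interrupted": True, "first_audio_ms": 900, "task_success": True},
    {"interrupted": True, "first_audio_ms": 950, "task_success": False},
    {"interrupted": False, "first_audio_ms": 870, "task_success": True},
    {"interrupted": True, "first_audio_ms": 1010, "task_success": True},
]

print(summarize(old_voice))
print(summarize(new_voice))
```

In this invented data the new voice has the same task success but a much higher interruption rate and a later first audio chunk, which points at latency rather than voice quality as the first thing to investigate.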