Speech Systems

ASR, TTS, And Speech-To-Speech History

An experienced engineer should know where modern systems came from, because old design constraints still explain today's failure modes, evaluation choices, and production tradeoffs.

Recognition

From Alignment To Foundation Models

  1. Template and DTW systems: compare acoustic patterns with explicit time warping.
  2. HMM/GMM systems: model hidden phonetic states, acoustic likelihoods, pronunciation lexicons, and language models.
  3. DNN-HMM hybrids: replace GMM acoustic models with neural posterior estimators while keeping decoding graphs.
  4. End-to-end ASR: CTC, attention encoder-decoder, and RNN-T reduce hand-built components but shift complexity into data and decoding.
  5. Self-supervised and weakly supervised systems: wav2vec-style pretraining and Whisper-style scale improve robustness, but production still depends on slice evaluation.
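
To make the first step in the list concrete, here is a minimal sketch of dynamic time warping over one-dimensional sequences. Real template systems compared frames of acoustic feature vectors; the scalar values and the names `template`, `slow_utterance`, and `other` below are invented for illustration only.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences (O(len(a) * len(b)))."""
    n, m = len(a), len(b)
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allow a step in either sequence alone (stretch) or in both (match).
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Illustrative values, not real acoustic features.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow_utterance = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
other = [3.0, 3.0, 1.0, 1.0, 3.0]

print(dtw_distance(template, slow_utterance))
print(dtw_distance(template, other))
```

The slowed-down copy of the template aligns at zero cost because warping absorbs the change in speaking rate, while the different pattern keeps a positive cost.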

Question: Why did HMM/GMM systems use lexicons and language models?

They separated acoustic evidence from pronunciation and word sequence probability. This made it possible to inject vocabulary, constrain decoding, and improve domain language without retraining the acoustic model. Modern systems often hide these pieces, but contextual biasing, rescoring, and post-processing are descendants of the same idea.
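
The separation described above can be illustrated with a small n-best rescoring sketch: acoustic scores stay fixed while a swappable language model score changes the ranking. Everything here is an assumption for illustration, including the function names `rescore` and `toy_lm`, the scores, and the `lm_weight` parameter; it is not any specific toolkit's API.

```python
def rescore(nbest, lm_logprob, lm_weight=0.5):
    """Rank hypotheses by acoustic score plus a weighted language model score."""
    scored = [(text, am + lm_weight * lm_logprob(text)) for text, am in nbest]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical n-best list: (hypothesis text, acoustic log score).
nbest = [
    ("patient takes met for men", -10.0),
    ("patient takes metformin", -10.5),
]

# Hypothetical domain vocabulary injected without touching the acoustic model.
domain_terms = {"metformin"}


def toy_lm(text):
    """Toy stand-in for a domain LM: reward domain terms, penalize other words."""
    return sum(2.0 if word in domain_terms else -1.0 for word in text.split())


print(rescore(nbest, toy_lm, lm_weight=0.0)[0][0])  # acoustic score only
print(rescore(nbest, toy_lm, lm_weight=0.5)[0][0])  # with the domain LM
```

With the weight at zero the acoustically preferred hypothesis wins; with the domain model enabled, the in-vocabulary hypothesis moves to the top, which is the same lever that contextual biasing and rescoring pull in newer systems.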

Question: What should you evaluate beyond WER?

Track entity error rate, timestamp error, diarization error, partial transcript churn, language and accent slices, noise slices, latency, memory, throughput, and downstream task success. WER can improve while product quality regresses.
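
A small sketch can show how WER and an entity metric may move in opposite directions. The transcripts, the entity list, and the simplified `entity_error_rate` definition (an entity counts as wrong if it is absent from the hypothesis) are assumptions made up for this example.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(pairs):
    """Corpus WER: total word edits divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def entity_error_rate(pairs, entities):
    """Simplified: a reference entity is an error if missing from the hypothesis."""
    total = missed = 0
    for ref, hyp in pairs:
        hyp_words = set(hyp.split())
        for word in ref.split():
            if word in entities:
                total += 1
                missed += word not in hyp_words
    return missed / total


refs = ["please give the patient metformin today", "um the the dose is five hundred"]
# Invented outputs from two hypothetical models.
old_hyps = ["please give the patient metformin today", "the doors is five hundred"]
new_hyps = ["please give the patient metaphor today", "um the the dose is five hundred"]
entities = {"metformin"}

old_pairs = list(zip(refs, old_hyps))
new_pairs = list(zip(refs, new_hyps))
print("WER old/new:", wer(old_pairs), wer(new_pairs))
print("Entity error old/new:",
      entity_error_rate(old_pairs, entities),
      entity_error_rate(new_pairs, entities))
```

In this toy data the newer model has lower WER because it handles disfluencies better, yet it misses the only drug name, which is the regression a product team would care about most.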

Synthesis

From Concatenation To Neural Voices

Classic TTS

Rule-based and concatenative systems used text normalization, pronunciation dictionaries, prosody rules, and recorded unit selection. They could sound clear in narrow domains but were hard to scale across styles.

Hidden answer: Why this still matters

Text normalization remains a major source of TTS bugs. Dates, currency, addresses, abbreviations, code snippets, and names can break naturalness before the neural model sees the input.
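
As a rough illustration of why normalization is fragile, the sketch below expands a few abbreviations and whole-dollar amounts with simple rules. The `ABBREVIATIONS` table and the `normalize` function are hypothetical and deliberately naive; production front ends use far richer, context-aware grammars.

```python
import re

# Hypothetical, deliberately tiny abbreviation table.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}


def normalize(text):
    """Expand abbreviations and whole-dollar amounts into speakable words."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)

    def dollars(match):
        amount = match.group(1)
        unit = "dollar" if amount == "1" else "dollars"
        return f"{amount} {unit}"

    # Only handles whole amounts such as "$5"; decimals are out of scope here.
    return re.sub(r"\$(\d+)\b", dollars, text)


print(normalize("Dr. Lee owes $5"))
print(normalize("St. Mary's Hospital"))
```

The first input comes out as expected, but the second becomes "Street Mary's Hospital" because the rule cannot tell "Saint" from "Street". Errors like this reach the listener no matter how good the acoustic model or vocoder is.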

Neural TTS

Tacotron-style acoustic models, WaveNet-style vocoders, non-autoregressive FastSpeech systems, VITS-style end-to-end systems, and diffusion or flow models improved quality and speaker control.

Hidden answer: Production tradeoff

Models that score a higher MOS (mean opinion score) often cost more in first-audio-byte latency, GPU time, and memory, and they can add failure risk. A strong rollout compares quality by slice, conversational latency, cold starts, streaming chunk behavior, safety controls, and rollback readiness.

Conversation

Speech-To-Speech Architectures

Cascaded ASR -> LLM -> TTS

The cascade exposes text at each stage boundary, which makes it easy to inspect, moderate, test, and debug. It also adds latency at each boundary and may discard prosody, emotion, and turn-taking cues.
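
The shape of the cascade can be sketched with stub stages. The `run_cascade` helper and the three lambda stand-ins are assumptions for illustration; a real system would call actual ASR, LLM, and TTS components and would likely stream between them.

```python
import time


def run_cascade(audio, asr, llm, tts):
    """Run ASR -> LLM -> TTS serially and record the time spent in each stage."""
    trace = {}

    def timed(name, fn, arg):
        start = time.perf_counter()
        out = fn(arg)
        trace[name + "_ms"] = (time.perf_counter() - start) * 1000
        return out

    transcript = timed("asr", asr, audio)   # text boundary: loggable, moderatable
    reply = timed("llm", llm, transcript)   # text boundary: testable in isolation
    speech = timed("tts", tts, reply)
    return {"transcript": transcript, "reply": reply, "speech": speech, "trace": trace}


# Stub stages standing in for real models.
result = run_cascade(
    b"\x00\x01",
    asr=lambda audio: "what time is it",
    llm=lambda text: f"You asked: {text}",
    tts=lambda text: text.encode("utf-8"),
)
print(result["transcript"], "->", result["reply"])
print(sorted(result["trace"]))
```

Only plain strings cross the stage boundaries, so anything the ASR stage does not write into the transcript, such as tone or hesitation, never reaches the later stages.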

Hidden answer: When to choose the cascade

Choose it when correctness, explainability, safety review, transcript UI, tool use, enterprise compliance, and incremental rollout matter more than preserving every acoustic cue. It is often the best default for a production assistant.

Direct Or Codec-Based Speech-To-Speech

Direct systems can model speech tokens or latent audio representations and produce speech without a text-only bottleneck. They may preserve prosody better, but are harder to evaluate and control.

Hidden answer: What can go wrong?

Risks include speaker leakage, unstable prosody, hallucinated content, poor tool-use grounding, hard-to-audit safety behavior, bitrate artifacts, and difficult regression triage. Evaluation must include intelligibility, semantic faithfulness, latency, speaker similarity, safety, and user task success.

Lab

Latency Budget For A Local Voice Assistant

Build a spreadsheet or small script that budgets every step in one conversational turn. Use placeholders first, then replace them with real measurements from local experiments.

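# Placeholder per-stage latencies in milliseconds; replace with measurements.
budget_ms = {
    "vad_endpoint": 120,
    "asr_first_partial": 250,
    "asr_final": 550,  # listed for reference; not part of the sum below
    "llm_first_token": 300,
    "tts_first_audio": 350,
    "playback_buffer": 80,
}

# Serial sum of the stages on the path to first audio.
total_to_first_audio = (
    budget_ms["vad_endpoint"]
    + budget_ms["asr_first_partial"]
    + budget_ms["llm_first_token"]
    + budget_ms["tts_first_audio"]
    + budget_ms["playback_buffer"]
)
print(f"total_to_first_audio: {total_to_first_audio} ms")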

Question: What does this toy budget hide?

It hides overlap between stages, network jitter, queueing, cold starts, text normalization, endpointing mistakes, audio device buffers, cancellation, barge-in, and retries. The next version should record per-stage traces for successful, slow, interrupted, and failed turns.
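
One way to start on per-stage traces is to record a start and end time for each stage and read the latency off the trace instead of summing durations. The `Span` type, the stage names, and all timestamps below are made up for illustration; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Span:
    stage: str
    start_ms: float  # relative to the moment the user stopped speaking
    end_ms: float


def time_to_first_audio(spans):
    """Latency from end of user speech to the end of the playback buffer stage."""
    return min(s.end_ms for s in spans if s.stage == "playback_buffer")


def summed_durations(spans):
    """Naive serial sum of stage durations; ignores any overlap between stages."""
    return sum(s.end_ms - s.start_ms for s in spans)


# Hypothetical trace in which ASR finalization overlaps with endpoint detection.
turn = [
    Span("vad_endpoint", 0, 120),
    Span("asr_final", 0, 150),
    Span("llm_first_token", 150, 450),
    Span("tts_first_audio", 450, 800),
    Span("playback_buffer", 800, 880),
]

print(time_to_first_audio(turn))
print(summed_durations(turn))
```

In this invented turn the trace-based figure is 880 ms while the serial sum is 1000 ms. The same span format can carry a status field later, so slow, interrupted, and failed turns can be compared side by side.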

Interview Prompts

Advanced Questions

Prompt: A New ASR Model Improves Average WER But Hurts Medical Terms

Decide whether to ship, roll back, fine-tune, add contextual biasing, or change post-processing.

Hidden answer: Strong response

Do not ship broadly if critical entities regress. Quantify the domain-term slice, inspect substitutions, compare decoding and normalization changes, test contextual biasing or rescoring, and canary only on safe traffic. Tie the decision to product risk, privacy constraints, owner approval, rollback, and a follow-up data plan.
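
The first step of that response, quantifying the slice and refusing a broad rollout on a critical regression, can be sketched as a simple gate. The `ship_decision` function, the slice names, and the error rates are hypothetical; a real gate would also account for confidence intervals and sample sizes.

```python
def ship_decision(baseline, candidate, critical_slices, max_regression=0.0):
    """Return ("hold", regressions) if any critical slice got worse, else ("canary", {})."""
    regressions = {}
    for name in critical_slices:
        delta = candidate[name] - baseline[name]
        if delta > max_regression:
            regressions[name] = delta
    return ("hold", regressions) if regressions else ("canary", {})


# Invented per-slice error rates (lower is better).
baseline = {"overall": 0.082, "medical_terms": 0.110}
candidate = {"overall": 0.075, "medical_terms": 0.140}

decision, regressions = ship_decision(baseline, candidate, ["medical_terms"])
print(decision, regressions)
```

Here the candidate wins on the overall number but is held because the medical-term slice regressed, which matches the reasoning in the answer above.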

Prompt: A TTS Voice Sounds Better Offline But Users Interrupt It More

Explain what metrics and experiments you run before deciding whether to keep the voice.

Hidden answer: Strong response

Measure first-audio-byte latency, completion rate, interruption rate, perceived pace, pronunciation errors, long-form fatigue, sentence streaming, cold starts, and task success. Offline MOS is useful, but conversational UX can fail because the voice starts too late, speaks too slowly, mispronounces key terms, or prevents natural turn-taking.
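
Two of these metrics can be computed directly from turn logs. The log schema (`first_audio_ms`, `interrupted`) and the values below are assumptions for the sketch, not data from a real experiment.

```python
from statistics import median


def summarize(turns):
    """Interruption rate and median first-audio latency for a list of turn logs."""
    return {
        "interruption_rate": sum(t["interrupted"] for t in turns) / len(turns),
        "median_first_audio_ms": median(t["first_audio_ms"] for t in turns),
    }


# Invented turn logs for two voices.
old_voice = [
    {"first_audio_ms": 300, "interrupted": False},
    {"first_audio_ms": 320, "interrupted": False},
    {"first_audio_ms": 340, "interrupted": True},
    {"first_audio_ms": 310, "interrupted": False},
]
new_voice = [
    {"first_audio_ms": 520, "interrupted": True},
    {"first_audio_ms": 480, "interrupted": False},
    {"first_audio_ms": 560, "interrupted": True},
    {"first_audio_ms": 500, "interrupted": True},
]

print("old:", summarize(old_voice))
print("new:", summarize(new_voice))
```

If the new voice shows both higher first-audio latency and a higher interruption rate, as in this toy data, latency is a plausible cause worth isolating with an experiment that holds the voice constant and varies only the start delay.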