Speech ML/AI Domain Study Guide

North Star

Tour The Speech ML/AI Domain End To End

Foundation

Learn the math, data representations, and model families behind modern audio/text ML.

Build

Write small experiments for features, training loops, ASR/TTS evaluation, and inference.

Operate

Measure latency, memory, context size, batching, quantization, and streaming behavior.

Lead

Connect product goals, safety, cost, evaluation, reliability, and platform strategy across the speech AI stack.

Study Rhythm

Repeat This Loop

Read: one paper, chapter, or implementation note.
Rebuild: implement the smallest working version.
Measure: record accuracy, latency, memory, and failure modes.
Explain: write a one-page note in your own words.
Connect: map the idea back to ASR, TTS, or speech-to-speech.

Curriculum

Module Tour

The numbered tour links audio ML and speech AI modules, with additional practice pages below. Lesson pages include code, checkpoints, and hidden answers. Start with the transformer lesson, then return here to connect it to ASR, TTS, and speech-to-speech systems.

Orientation And Tooling

Set up Python, notebooks, PyTorch, audio tools, Git, experiment logs, and privacy boundaries.

Lab: load a WAV, plot waveform/spectrogram, save metadata only.

Math For ML

Linear algebra, probability, gradients, cross entropy, sequence likelihoods, and optimization intuition.

Lab: implement linear regression and softmax classification from scratch.

Open lesson: ML foundations for speech engineers
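
As a starting point for the softmax part of this lab, here is a minimal sketch in plain Python. The function names are illustrative and not taken from the lesson; the only technique assumed is the standard max-subtraction trick for numerical stability.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the target class."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


# Illustrative logits for a 3-class problem (made-up values).
example_probs = softmax([2.0, 1.0, 0.1])
example_loss = cross_entropy([2.0, 1.0, 0.1], 0)
```

With two equal logits the loss is ln 2, which is a handy sanity check before wiring this into a training loop.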

Neural Network Basics

Tensors, autograd, MLPs, initialization, normalization, regularization, train/validation loops.

Lab: train a small classifier and write a failure analysis.

Open lesson: training loops, evaluation, and debugging

Audio Representations

Sampling, quantization, STFT, mel filterbanks, MFCCs, log-mel features, framing, and augmentation.

Lab: compare waveform, spectrogram, and mel features for the same utterance.

Open lesson: audio features with code and hidden answers
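
Two small calculations sit behind framing and mel filterbanks. The sketch below assumes the common HTK-style mel formula and frames without padding; real toolkits differ in both choices, so treat the helper names and defaults as assumptions rather than the lesson's code.

```python
import math


def hz_to_mel(hz):
    """HTK-style mel scale: 2595 * log10(1 + hz / 700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def num_frames(num_samples, frame_length, hop_length):
    """Number of full frames when no padding is applied."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // hop_length


# One second of 16 kHz audio, 25 ms window (400 samples), 10 ms hop (160 samples).
frames_per_second = num_frames(16000, 400, 160)  # 98 full frames
```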

Text And Sequence Modeling

Tokenization, embeddings, n-grams, RNNs, seq2seq, attention, beam search, and decoding tradeoffs.

Lab: build a tiny character language model and inspect its decoding mistakes.

Transformers

Self-attention, causal masks, encoder/decoder stacks, positional encodings, RoPE, KV cache, and scaling limits.

Lab: implement single-head attention, then measure sequence length cost.

Open lesson: transformers with code and hidden answers
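
For the single-head attention lab, a dependency-free sketch can look like the following. It is written with nested lists instead of tensors so the T x T score loop is visible; the function names and the toy input are assumptions made for illustration.

```python
import math


def _softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, causal=False):
    """Scaled dot-product attention for one head.

    q, k, v are lists of T vectors (lists of floats). q and k must share
    the same vector dimension d; scores are divided by sqrt(d).
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if causal and j > i:
                scores.append(float("-inf"))  # mask future positions
            else:
                dot = sum(a * b for a, b in zip(qi, kj))
                scores.append(dot / math.sqrt(d))
        weights = _softmax(scores)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


# Toy self-attention input with T=3 and d=2 (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal_out = attention(x, x, x, causal=True)
```

The two nested loops over positions are why compute and score memory grow quadratically with sequence length, which is the cost the lab asks you to measure.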

ASR History

Dynamic time warping, HMM/GMM systems, pronunciation lexicons, acoustic models, language models, and decoding graphs.

Lab: diagram a classic ASR pipeline and map modern replacements.

Open lesson: ASR, TTS, and speech-to-speech history

Neural ASR

CTC, Listen-Attend-Spell, RNN-T, Conformer, wav2vec 2.0, Whisper, WER/CER, streaming ASR, and VAD.

Lab: run two ASR models on the same clips and compare errors by category.
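
To compare errors by category you need substitution, deletion, and insertion counts, not just a single score. Below is a minimal word-level sketch using a Levenshtein alignment; it assumes whitespace tokenization and no text normalization, both of which a real evaluation would need to define explicitly.

```python
def wer_counts(ref_words, hyp_words):
    """Return (edits, substitutions, deletions, insertions)."""
    R, H = len(ref_words), len(hyp_words)
    # dp[i][j] holds the best (edits, subs, dels, ins) aligning
    # ref_words[:i] with hyp_words[:j].
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            e, s, d, n = dp[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                diagonal = (e, s, d, n)
            else:
                diagonal = (e + 1, s + 1, d, n)
            e, s, d, n = dp[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = dp[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Tuples compare on total edits first.
            dp[i][j] = min(diagonal, deletion, insertion)
    return dp[R][H]


def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    edits, _, _, _ = wer_counts(ref_words, hypothesis.split())
    return edits / len(ref_words)
```

The same functions give CER if you pass lists of characters instead of words.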

TTS History

Concatenative synthesis, statistical parametric speech synthesis, vocoders, Tacotron, WaveNet, FastSpeech, VITS, and voice cloning risks.

Lab: compare TTS latency and intelligibility across two local engines.

Open lesson: TTS evolution and production tradeoffs

Neural Audio Codecs

Discrete audio tokens, residual vector quantization, codec language models, prosody, speaker information, and compression artifacts.

Lab: encode/decode audio and listen for bitrate-dependent artifacts.

Audio-Text Multimodality

Contrastive learning, cross-modal retrieval, audio captioning, speech translation, shared embedding spaces, and alignment losses.

Lab: build a tiny audio-text retrieval evaluation with hand-labeled examples.

Speech-To-Speech Architectures

Cascaded ASR-LLM-TTS, direct speech-to-speech, streaming turn-taking, barge-in, echo control, memory, and agent state.

Lab: design latency budgets for a local speech assistant turn.

Open lesson: cascaded and direct speech-to-speech
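
A latency budget for one cascaded turn can start as a simple sum of per-stage estimates checked against a target. The stage names and millisecond values below are hypothetical placeholders, not measurements from any system.

```python
def latency_report(stage_ms, budget_ms):
    """Summarize a per-stage latency budget for one spoken turn."""
    total = sum(stage_ms.values())
    slowest = max(stage_ms, key=stage_ms.get)
    return {
        "total_ms": total,
        "budget_ms": budget_ms,
        "headroom_ms": budget_ms - total,
        "within_budget": total <= budget_ms,
        "slowest_stage": slowest,
    }


# Hypothetical stage estimates for a cascaded ASR-LLM-TTS turn.
example_stages = {
    "vad_endpoint": 200,
    "asr_final": 150,
    "llm_first_token": 300,
    "tts_first_audio": 150,
}
example_report = latency_report(example_stages, budget_ms=1000)
```

Summing stages assumes they run strictly in sequence; streaming designs overlap stages, so the sum is an upper bound in that case.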

Efficient Transformers

FlashAttention, grouped-query attention, speculative decoding, quantization, pruning, distillation, adapters, LoRA, and prompt compression.

Lab: benchmark one model under two quantization levels and two context lengths.

Open lesson: efficient transformers and latency tradeoffs

Inference Hosting

KV-cache memory, continuous batching, PagedAttention, llama.cpp, MLX, vLLM-style serving, telemetry, cold starts, and local privacy.

Lab: build a local inference readiness report with latency, memory, throughput, and quality notes.

Open lesson: inference hosting and serving design
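
KV-cache memory can be estimated with back-of-envelope arithmetic before any benchmark runs. The sketch below assumes a standard decoder where every layer stores one key and one value tensor per token; the model shape in the example is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV-cache size in bytes.

    The leading factor 2 accounts for keys and values.
    bytes_per_value=2 corresponds to fp16 or bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128, 4096-token context, one sequence, fp16 cache.
gqa_bytes = kv_cache_bytes(32, 8, 128, 4096)
# Same shape with 32 KV heads (full multi-head attention).
mha_bytes = kv_cache_bytes(32, 32, 128, 4096)
```

In this example the grouped-query variant needs 512 MiB and the multi-head variant needs 2 GiB per sequence, which is why the number of KV heads matters for batching.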

Evaluation, RAG, And Quality Systems

Speech eval stacks, RAG for voice assistants, LLM judges, human review, slice metrics, and rollout gates.

Lab: build a privacy-safe speech RAG evaluation sheet with latency, retrieval, and spoken-answer scores.

Open lesson: speech evaluation, RAG, and quality systems

Production Debugging

First-hour incident response for ASR, TTS, speech-to-speech, and audio RAG, including rollback and privacy-safe telemetry.

Lab: compute partial-churn and latency-slice signals from synthetic request records.

Open lesson: production debugging for speech systems
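
The lab does not define partial churn here, so the sketch below adopts one possible definition as an assumption: the fraction of words shown in streaming partial transcripts that a later partial retracts. The partial transcripts in the example are synthetic.

```python
def common_prefix_len(a, b):
    """Length of the shared leading run of two word lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_churn(partials):
    """Retracted words divided by words shown, over a list of partials.

    Assumed definition: a word counts as retracted when it falls outside
    the common prefix between one partial and the next.
    """
    shown = 0
    retracted = 0
    prev = []
    for text in partials:
        words = text.split()
        keep = common_prefix_len(prev, words)
        retracted += len(prev) - keep
        shown += len(words) - keep  # words newly displayed in this partial
        prev = words
    return retracted / shown if shown else 0.0


# Synthetic partial transcripts for one utterance.
example_churn = prefix_churn(["i", "i want", "i won't to", "i want to go"])
```

Here 3 of the 7 displayed words were later revised, giving a churn of 3/7.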

Data And Evaluation Operations

Dataset contracts, labeling quality, eval slice design, CI gates, drift monitoring, and model release data lineage.

Lab: compute slice regressions from old and new metric maps before a release review.

Open lesson: speech data and evaluation operations
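
One way to approach the slice-regression lab is to diff two metric maps keyed by slice name. The sketch assumes an error-style metric such as WER where higher is worse, and the slice names, values, and threshold are synthetic.

```python
def slice_regressions(old, new, threshold, higher_is_worse=True):
    """Return (slice, worsening) pairs above threshold, worst first.

    Only slices present in both maps are compared; slices that appear
    in just one map need separate handling.
    """
    found = []
    for name in sorted(old.keys() & new.keys()):
        delta = new[name] - old[name]
        worsening = delta if higher_is_worse else -delta
        if worsening > threshold:
            found.append((name, worsening))
    found.sort(key=lambda item: item[1], reverse=True)
    return found


# Synthetic per-slice WER before and after a candidate release.
old_wer = {"clean": 0.05, "noisy": 0.12, "accented": 0.10}
new_wer = {"clean": 0.05, "noisy": 0.16, "accented": 0.11, "far_field": 0.20}
regressed = slice_regressions(old_wer, new_wer, threshold=0.02)
```

For accuracy-style metrics, pass higher_is_worse=False so that drops are flagged instead of increases.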

ML System Design Casebook

Practice advanced design rounds for streaming ASR, speech-to-speech agents, inference platforms, data flywheels, and TTS incidents.

Lab: implement canary gates and real-time audio capacity estimates.

Open lesson: ML system design casebook

Advanced ML Exam Drills

Timed prompts for ASR, TTS, audio RAG, inference platforms, incidents, rollout gates, and staff-level tradeoff explanations.

Lab: answer each prompt under time pressure, then compare against hidden strong outlines.

Open lesson: advanced ML exam drills

Speech Serving Observability

SLOs, dashboards, burn rates, approved aggregate drift slices, rollback decisions, cost alerts, and postmortems for production speech systems.

Lab: implement SLO burn-rate, slice-drift, and rollout-recommendation utilities.

Open lesson: speech serving observability and SLOs
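
Burn rate is usually defined as the observed error rate divided by the error budget implied by the SLO target. The helper below is a minimal sketch of that ratio; the request counts are synthetic and the function name is illustrative.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the error budget (1 - slo_target).

    A value of 1.0 means the budget is being spent exactly on schedule;
    values above 1.0 mean it will run out before the SLO window ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


# Synthetic window: 30 failed requests out of 1000 against a 99% SLO.
example_burn = burn_rate(30, 1000, 0.99)  # roughly 3x the sustainable rate
```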

Speech Safety, Privacy, And Security

Threat modeling for spoken prompt injection, voice impersonation, PII, consent, TTS abuse, and privacy-safe incident response.

Lab: implement transcript redaction, sensitive-tool confirmation, and abuse-spike detection utilities.

Open lesson: speech safety, privacy, and security
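
For the transcript-redaction part of the lab, a regex pass is a reasonable first fixture. The two patterns below are illustrative assumptions that cover only email addresses and long digit runs; they are not a complete PII detector.

```python
import re

# Illustrative patterns only; real PII coverage needs far more than this.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Seven or more digits, optionally separated by single spaces or hyphens.
LONG_NUMBER = re.compile(r"\b\d(?:[\s-]?\d){6,}\b")


def redact(text):
    """Replace emails and long digit runs with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


# Synthetic transcript line.
example_redacted = redact("call 555-123-4567 or mail a.b@example.com")
```

Short numbers such as a room number pass through unchanged, so the threshold of seven digits is a tunable design choice.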

Speech Inference Platform Readiness

Readiness reviews for ASR, TTS, LLM, and speech-to-speech serving: capacity, gates, SLOs, incidents, and cost-aware rollout decisions.

Lab: implement rollout gates, queue burn-rate checks, and cost-regression detectors with synthetic aggregate metrics.

Open lesson: speech inference platform readiness

Speech ML/AI Production Incidents

First-hour triage, rollback judgment, slice regression analysis, privacy-safe incident packets, drift, and speech-to-speech failure drills.

Lab: rank slice regressions, recommend rollout actions, and produce aggregate-only incident packets.

Open lesson: audio ML production incident drills

Voice Agent Evaluation Benchmarks

Advanced evaluation plans for spoken RAG agents, speech-to-speech repair, task success, latency, safety, cost, and launch gates.

Lab: implement a voice-agent release-decision aggregator over synthetic slice metrics.

Open lesson: voice agent evaluation benchmarks

Speech AI Interview Sprint

Timed capstone prompts for speech system design, R&D judgment, production incidents, serving platforms, MLOps, and coding follow-ups.

Lab: rank canary rollout risks and explain speech serving capacity and constrained-decoding follow-ups.

Open lesson: advanced speech AI interview sprint

Speech Model CI/CD And Release Engineering

Model registry contracts, CI gates, canaries, rollback targets, production release drills, and research-to-serving handoff discipline.

Lab: implement registry validation, canary promotion gates, and rollback target resolution over synthetic release metadata.

Open lesson: speech model CI/CD and release engineering

Speech ML/AI Capstone Exam

Final advanced practice for speech R&D, system design, inference hosting, incidents, MLOps, cost control, and coding follow-ups.

Lab: answer timed capstone prompts, then compare against hidden strong outlines and worked Python utilities.

Open lesson: advanced audio ML capstone exam

Speech AI Debugging Casebook

Production debugging cases for ASR, TTS, speech-to-speech agents, spoken RAG, safety classifiers, and rollback decisions.

Lab: implement first-bad-release detection, slice-aware rollback selection, and privacy-safe incident packet builders.

Open lesson: speech AI debugging casebook

Model Serving On-Call Runbook

Production on-call practice for ASR, TTS, LLM, retrieval, and speech-to-speech serving incidents under real traffic constraints.

Lab: implement queue saturation, cost-per-success, and rollback-candidate utilities with synthetic aggregate metrics.

Open lesson: model serving on-call runbook

Evaluation And Incident Exam

Advanced practice for speech evaluation stacks, rollout gates, production incidents, privacy-safe debugging, drift, cost spikes, and rollback ranking.

Lab: implement promotion gates, sustained-drift detection, and rollback-target ranking over synthetic aggregate metrics.

Open lesson: advanced audio ML evaluation and incident exam

Real-Time Speech System Design Lab

Practice complete spoken-turn design, latency budgets, canary rollback, partial transcript churn, and production debugging for real-time speech products.

Lab: implement latency budget reports, prefix churn measurement, and canary rollback triggers over synthetic aggregate telemetry.

Open lesson: real-time speech system design lab

Speech Data Flywheels And Active Learning

Practice privacy-safe data loops, active-learning sampling, labeling operations, release gates, and first-hour data incident debugging for speech systems.

Lab: rank aggregate review candidates with consent filtering, slice coverage, regression signals, and privacy-safe metadata.

Open lesson: speech data flywheels and active learning

Speech Serving Scaling And Reliability Exam

Practice capacity planning, SLOs, rollout gates, cost-aware routing, rollback judgment, and production debugging for speech serving systems.

Lab: implement peak concurrency, canary rollback, and sliding-window error-budget utilities over synthetic aggregate metrics.

Open lesson: speech serving scaling and reliability exam
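
Peak concurrency over a set of sessions can be computed with a sweep over start and end events. The sketch assumes each session is a (start, end) pair in the same time unit and that a session ending at time t does not overlap one starting at t.

```python
def peak_concurrency(sessions):
    """Maximum number of sessions active at the same time."""
    events = []
    for start, end in sessions:
        events.append((start, 1))
        events.append((end, -1))
    # Sorting puts (t, -1) before (t, +1), so an end at time t is
    # processed before a start at the same time.
    events.sort()
    current = 0
    peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Synthetic session intervals in seconds.
example_peak = peak_concurrency([(0, 10), (5, 15), (10, 20)])
```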

Speech GPU Serving Capacity Planning

Practice GPU memory math, batching, autoscaling, load shedding, warm pools, cost controls, and rollback for speech serving systems.

Lab: implement sliding-window queue alerts, priority admission, and retry-storm detectors over synthetic aggregate serving metrics.

Open lesson: speech GPU serving capacity planning
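
A sliding-window queue alert can be sketched with a deque and a running sum. The rule used here, alerting when the mean depth over the last `window` samples exceeds a threshold, is one assumed policy among many; the depth samples are synthetic.

```python
from collections import deque


def queue_alerts(depths, window, threshold):
    """Indices where the mean of the last `window` samples exceeds threshold."""
    recent = deque()
    running = 0
    alerts = []
    for i, depth in enumerate(depths):
        recent.append(depth)
        running += depth
        if len(recent) > window:
            running -= recent.popleft()  # drop the oldest sample
        if len(recent) == window and running / window > threshold:
            alerts.append(i)
    return alerts


# Synthetic queue-depth samples with a short burst in the middle.
example_alerts = queue_alerts([1, 2, 3, 10, 12, 11, 2, 1], window=3, threshold=6)
```

Averaging over a window keeps a single spike from paging anyone, at the cost of reacting a few samples late.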

Speech Feature Pipelines And Data Contracts

Design privacy-safe audio feature pipelines, schema contracts, label drift checks, online/offline parity tests, and feature-store boundaries.

Lab: implement audio chunk merging, active-learning ranking, and schema coverage drift detectors with hidden Python answers.

Open lesson: speech feature pipelines and data contracts

Speech AI Load Testing And Chaos Readiness

Practice launch readiness for ASR, TTS, spoken RAG, and speech-to-speech systems under load, dependency failures, cost pressure, and rollback constraints.

Lab: implement load-test gates, retry-storm detection, and capacity step-load summaries over synthetic aggregate metrics.

Open lesson: speech AI load testing and chaos readiness

Full-Duplex Speech-To-Speech Research Update

Study recent full-duplex speech model directions: native listen-while-speaking models, micro-turn cascades, role-conditioned agents, action streams, asynchronous retrieval, interactivity alignment, and benchmark gaps.

Lab: compare native, cascaded, front-end-assisted, and retrieval-augmented duplex serving designs with rollout gates and interruption metrics.

Open lesson: full-duplex speech-to-speech research update

Production Audio Judge Monitoring

Build fast feedback loops from flagged ASR/TTS events, consented or sanitized production samples, audio-aware judges, schema validation, and top-issue aggregation.

Lab: implement a privacy-safe top-issue aggregator over synthetic judge results from user-unsatisfied and consented sampled speech events.

Open lesson: production audio judge monitoring

Career Ladder

Practice From Junior To Principal

This track turns the speech ML/AI domain tour into role-level practice: junior implementation habits, mid-level debugging, advanced system ownership, staff-level platform tradeoffs, and principal-level strategy. It spans ML system design, R&D, MLOps, serving, monitoring, rollback, and depth in speech and multimodal AI.

Career Practice Track

Practice realistic prompts for streaming ASR, speech-to-speech assistants, audio model serving platforms, production incidents, and research tradeoffs.

Open lesson: advanced practice and exam readiness

Speech ML System Design

Practice end-to-end architecture for ASR, TTS, speech-to-speech, evaluation, deployment, rollback, and cost-aware serving.

Open lesson: speech ML system design interviews

System Design Casebook

Practice full ML system design cases with hidden strong answers, rollout gates, production tradeoffs, and coding follow-ups.

Open lesson: ML system design casebook

Advanced ML Exam Drills

Run timed interview and exam prompts for ASR, TTS, audio RAG, GPU serving, MLOps incidents, rollback, and cost control.

Open lesson: advanced ML exam drills

Production Speech MLOps

Work through serving architecture, CI/CD gates, observability, drift, rollback, cost controls, and realistic incident drills for audio systems.

Open lesson: production MLOps and incident lab

Production Debugging

Practice first-hour triage, rollback judgment, telemetry slicing, and privacy-safe debugging for speech production incidents.

Open lesson: production debugging for speech systems

Model Release Playbook

Practice release reviews for ASR, TTS, embeddings, and speech-to-speech systems, including gates, canaries, debugging, and rollback.

Open lesson: speech model release and rollback

Speech Serving Observability

Practice SLO definition, dashboard design, burn-rate alerts, privacy-safe drift checks, rollback recommendations, and speech incident postmortems.

Open lesson: speech serving observability and SLOs

Speech Safety And Privacy

Practice spoken prompt injection defense, voice impersonation controls, PII handling, consent, abuse monitoring, and privacy-safe debugging.

Open lesson: speech safety, privacy, and security

Inference Hosting

Practice GPU/CPU placement, batching, KV-cache memory, cost controls, rollback, and debugging for ASR, TTS, LLM, and speech-to-speech serving.

Open lesson: inference hosting and serving design

Inference Platform Readiness

Practice release gates, real-time versus batch isolation, cost regressions, queue SLO burn rates, and first-hour speech platform incident response.

Open lesson: speech inference platform readiness

Speech ML/AI Production Incidents

Practice production incident response for partial churn, TTS latency, voice RAG grounding, regressions on noisy or pronunciation-variant speech, barge-in, drift, and rollback.

Open lesson: audio ML production incident drills

Speech Evaluation And RAG

Practice end-to-end quality systems for voice assistants: ASR slices, retrieval failures, LLM judges, human review, and rollout decisions.

Open lesson: speech evaluation, RAG, and quality systems

Voice Agent Eval Benchmarks

Practice benchmark design for spoken RAG agents, multi-turn account help, speech-to-speech repair, launch gates, and cost-quality decisions.

Open lesson: voice agent evaluation benchmarks

Speech AI Interview Sprint

Practice timed advanced rounds that combine speech system design, R&D tradeoffs, serving infrastructure, incidents, and coding follow-ups.

Open lesson: advanced speech AI interview sprint

Speech Model CI/CD

Practice registry contracts, CI gates, canary promotion, rollback, and production release engineering for ASR, TTS, and speech-to-speech systems.

Open lesson: speech model CI/CD and release engineering

Speech ML/AI Capstone Exam

Practice a four-hour advanced round covering speech R&D, architecture, incidents, cost and latency tradeoffs, MLOps, and coding utilities.

Open lesson: advanced audio ML capstone exam

Evaluation And Incident Exam

Practice speech eval design, ASR/TTS/RAG release gates, production incident response, drift checks, cost controls, and rollback ranking.

Open lesson: advanced audio ML evaluation and incident exam

Real-Time Speech Design Lab

Practice full spoken-turn architecture, end-to-end latency traces, partial churn debugging, canary gates, and handoff checklists.

Open lesson: real-time speech system design lab

Speech AI Debugging Casebook

Practice first-hour incident method, hypothesis-driven slice analysis, rollback judgment, spoken RAG failures, and safety/cost tradeoffs.

Open lesson: speech AI debugging casebook

Model Serving On-Call

Practice queue triage, tail-latency diagnosis, cost-per-success analysis, graceful degradation, and rollback selection for speech AI serving.

Open lesson: model serving on-call runbook

Data And Evaluation Ops

Practice dataset contracts, labeling rubrics, CI release gates, drift monitoring, and data flywheel tradeoffs for speech products.

Open lesson: speech data and evaluation operations

Data Flywheels

Practice active learning, consent-aware sampling, label audits, data release diffs, and production data incident response for speech AI.

Open lesson: speech data flywheels and active learning

Scaling And Reliability Exam

Practice serving-plane design, capacity math, SLOs, CI/CD gates, rollback, cost controls, and production incident drills.

Open lesson: speech serving scaling and reliability exam

GPU Capacity Planning

Practice GPU serving capacity, autoscaling signals, load shedding, warm TTS voice pools, cost controls, and real-time queue incidents.

Open lesson: speech GPU serving capacity planning

Load Testing And Chaos

Practice end-to-end load tests, chaos drills, retry-storm detection, capacity gates, dependency timeouts, and rollback triggers for speech AI systems.

Open lesson: speech AI load testing and chaos readiness

Full-Duplex S2S Research Update

Practice staff-level tradeoffs for native full-duplex SpeechLMs, VAD-free micro-turn cascades, role-conditioned duplex models, asynchronous RAG, action streams, privacy risk, and turn-taking benchmarks.

Open lesson: full-duplex speech-to-speech research update

Production Audio Judge Monitoring

Practice production QC loops for ASR and TTS: flagged live events, user dissatisfaction signals, audio LLM diagnostics, schema failures, and top recurring issues.

Open lesson: production audio judge monitoring

What This Adds

The target bar is not reciting definitions. It is being able to design, build, ship, monitor, debug, and improve audio-text ML systems under real constraints.

Implementation Practice

Audio ML Coding Drills

Keep implementation practice tied to speech systems: feature pipelines, evaluation gates, serving telemetry, rollout decisions, and privacy-safe synthetic fixtures.

Feature Pipeline Utilities

Practice chunk merging, schema checks, active-learning ranking, and aggregate drift detectors for audio feature pipelines.

Open lesson: speech pipeline coding drills

Serving Reliability Utilities

Practice queue alerts, load-test gates, cost regressions, retry storms, and rollout recommendations from synthetic aggregate metrics.

Open lesson: speech AI load testing and chaos readiness

Evaluation And Release Gates

Practice promotion gates, sustained-drift detection, rollback ranking, and privacy-safe incident summaries.

Open lesson: advanced audio ML evaluation and incident exam

Real-Time Speech Telemetry

Practice latency budgets, prefix churn, canary triggers, and spoken-turn handoff checks for streaming products.

Open lesson: real-time speech system design lab

How To Use It

Solve from the speech-system prompt first, state the privacy boundary and metrics, then compare against the hidden strategy.

Projects

Milestones To Build

Project A: Audio Feature Lab

Feature extraction notebook with waveform, STFT, mel, augmentation, and metadata-only privacy rules.

Project B: ASR Error Analyzer

Run ASR, compute WER/CER, classify substitutions/deletions/insertions, and produce a readable report.

Project C: TTS Latency Bench

Compare local TTS paths by warmup time, generation time, perceived quality, and playback stability.

Project D: Speech Agent Loop

Combine VAD, ASR, LLM, TTS, shared state, and transcript UI into a local privacy-preserving assistant.

Project E: Efficient Serving Report

Benchmark prompt length, context memory, quantization, batching, and model choice for local hosting.

Project F: Speech-To-Speech Design Review

Compare cascaded and direct architectures for latency, controllability, alignment, safety, and quality.

Project G: Speech RAG Eval Harness

Build a small synthetic eval for spoken queries, noisy ASR hypotheses, document retrieval, grounded answers, and TTS latency.

Project H: Release Readiness Review

Write a model card, eval summary, rollout plan, rollback checklist, and first-hour incident runbook for one ASR or TTS release.

Project I: Speech Data Flywheel Review

Design a privacy-safe data collection, labeling, eval, CI, and drift-monitoring plan for one production speech model improvement.

Project J: Speech Safety Red-Team Pack

Create sanitized spoken prompt-injection, voice replay, PII leakage, and TTS abuse fixtures with release gates and rollback actions.

Project K: Inference Platform Readiness Review

Write a launch review for shared ASR, TTS, and speech-to-speech serving with capacity math, SLOs, canaries, rollback, and cost guardrails.

Project L: Audio Incident Runbook Pack

Create privacy-safe runbooks, dashboards, synthetic repros, rollout gates, and rollback criteria for three realistic speech production incidents.

Project M: Voice Agent Eval Benchmark Pack

Create synthetic spoken RAG, account-help, interruption, safety, latency, and rollback fixtures with slice-level launch thresholds.

Project N: Speech Model Release Pipeline

Build a sanitized model registry entry, CI gate report, canary promotion decision, rollback target, and post-release dashboard checklist.

Project O: Speech ML/AI Mock Exam

Run a timed capstone with architecture, incident, eval, cost, MLOps, and coding sections using only synthetic fixtures and aggregate metrics.

Project P: Speech AI Debugging Casebook

Create a reusable pack of synthetic incidents, slice dashboards, rollback decisions, and privacy-safe incident packets for advanced practice.

Project Q: Model Serving On-Call Pack

Create synthetic dashboards, rollback gates, cost-per-success reports, and graceful-degradation runbooks for ASR, TTS, LLM, and spoken RAG serving.

Project R: Streaming Coding Pattern Pack

Build stack and sliding-window utilities for queue alerts, partial transcript churn, parser readiness, and canary rollback exercises.

Project S: Full-Duplex S2S Launch Review

Write a launch review for a full-duplex speech-to-speech agent, including interruption handling, backchannel metrics, micro-turn budgets, action-stream safety, and rollback gates.

Project T: Production Audio Judge QC Loop

Design a local monitor that flags the most recent transcription or TTS output, runs an audio judge, and reports top recurring user-felt issues without committing private audio.

Canonical Sources

Starting Reading List

Attention Is All You Need - transformer foundation.
SpecAugment - simple augmentation for ASR features.
wav2vec 2.0 - self-supervised speech representations.
Conformer - convolution-augmented Transformer for ASR.
Whisper paper - robust speech recognition via weak supervision.
FastSpeech - fast non-autoregressive TTS.
VITS - end-to-end TTS with variational and adversarial learning.
FlashAttention - IO-aware exact attention.
PagedAttention / vLLM - efficient KV-cache management for serving.
Retrieval-Augmented Generation - retrieval plus generation evaluation baseline.
Moshi - real-time full-duplex speech-text foundation model.
BayLing-Duplex - native full-duplex speech dialogue with a single autoregressive LLM.
DuplexOmni - full-duplex speech, video, and asynchronous thinking interaction layer.
Qwen3.5-Omni - large-scale omni-modal speech, video, and text baseline adjacent to duplex systems.
NVIDIA Nemotron 3 VoiceChat - 12B real-time full-duplex speech-to-speech model card and early-access serving reference.
KRAFTON Raon-SpeechChat-9B - public full-duplex speech language model for simultaneous listening and speaking under a non-commercial license.
DuplexCascade - VAD-free cascaded ASR-LLM-TTS full-duplex pipeline.
PersonaPlex - role and voice control for full-duplex conversational speech models.
DuplexSLA - synchronized speech, language, and action streams for full-duplex tool use.
FastTurn and JAL-Turn - acoustic and streaming-semantic turn-control layers for practical full-duplex agents.
SID-Bench - semantic-aware interruption detection and Average Penalty Time for robust barge-in gates.
TurnGuide - text-guided full-duplex generation that preserves turn-level timing.
Multi-Faceted Interactivity Alignment - RL post-training for pause handling, turn-taking, backchannels, and interruption.
MoshiRAG - asynchronous retrieval for factual full-duplex speech language models.
HumDial-FDBench - ICASSP 2026 full-duplex interaction challenge dataset and leaderboard.
Full-Duplex-Bench-v3 - tool use under real human disfluency and multi-step tasks.
Full-Duplex-Bench v1.5 - overlap handling across interruptions, backchannels, side speech, and background speech.
Game-Time - temporal dynamics benchmark for tempo, synchronized responses, and simultaneous speaking.
LLMs-as-Judges survey - judge prompt criteria, references, explanations, feedback, and meta-evaluation.
AudioBench - broad audio LLM evaluation across speech, audio scenes, and paralinguistic understanding.
AudioJudge - multi-aspect large audio model judging, pairwise protocols, and bias checks.
Google Research: meaning preservation for ASR - semantic ASR evaluation beyond WER.