Speech ML/AI Domain Study Guide

A progressive tour from junior foundations through production ownership, staff-level system design, and principal-level audio AI strategy.

North Star

Tour The Speech ML/AI Domain End To End

Foundation

Learn the math, data representations, and model families behind modern audio/text ML.

Build

Write small experiments for features, training loops, ASR/TTS evaluation, and inference.

Operate

Measure latency, memory, context size, batching, quantization, and streaming behavior.

Lead

Connect product goals, safety, cost, evaluation, reliability, and platform strategy across the speech AI stack.

Study Rhythm

Repeat This Loop

  1. Read: one paper, chapter, or implementation note.
  2. Rebuild: implement the smallest working version.
  3. Measure: record accuracy, latency, memory, and failure modes.
  4. Explain: write a one-page note in your own words.
  5. Connect: map the idea back to ASR, TTS, or speech-to-speech.

Curriculum

Forty Modules

Lesson pages include code, checkpoints, and hidden answers. Start with the transformer lesson, then return here to connect it to ASR, TTS, and speech-to-speech systems.

00

Orientation And Tooling

Set up Python, notebooks, PyTorch, audio tools, Git, experiment logs, and privacy boundaries.

Lab: load a WAV, plot waveform/spectrogram, save metadata only.

01

Math For ML

Linear algebra, probability, gradients, cross entropy, sequence likelihoods, and optimization intuition.

Lab: implement linear regression and softmax classification from scratch.

Open lesson: ML foundations for speech engineers
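
A possible starting point for the softmax half of this lab, as a minimal pure-Python sketch. The function names `softmax` and `cross_entropy` are illustrative and do not come from the lesson materials. Subtracting the largest logit before exponentiating is the usual way to avoid overflow.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    # Shifting by the max leaves the result unchanged but keeps exp() bounded.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


if __name__ == "__main__":
    print(softmax([1.0, 2.0, 3.0]))
    print(cross_entropy([0.0, 0.0], 0))  # two equally likely classes
```

With two equal logits each class gets probability 0.5, so the loss is log 2, which is a handy sanity check before training anything.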

04

Text And Sequence Modeling

Tokenization, embeddings, n-grams, RNNs, seq2seq, attention, beam search, and decoding tradeoffs.

Lab: build a tiny character language model and inspect its decoding mistakes.

07

Neural ASR

CTC, Listen-Attend-Spell, RNN-T, Conformer, wav2vec 2.0, Whisper, WER/CER, streaming ASR, and VAD.

Lab: run two ASR models on the same clips and compare errors by category.
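
To compare errors by category, one option is a word-level edit-distance alignment that counts substitutions, deletions, and insertions. The sketch below assumes whitespace tokenization and no text normalization; the name `wer_counts` is hypothetical.

```python
def wer_counts(reference, hypothesis):
    """Return WER plus substitution/deletion/insertion counts (word level)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = (cost, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # match, no extra cost
                continue
            c, s, d, n = dp[i - 1][j - 1]
            substitute = (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            delete = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insert = (c + 1, s, d, n + 1)
            dp[i][j] = min(substitute, delete, insert, key=lambda t: t[0])

    cost, subs, dels, ins = dp[-1][-1]
    return {"wer": cost / len(ref), "sub": subs, "del": dels, "ins": ins}


if __name__ == "__main__":
    print(wer_counts("the cat sat on the mat", "the cat sit on mat"))
```

Running the same function on both models' outputs for each clip gives comparable per-category counts. When several alignments have the same cost, the split between categories depends on the tie-break, so treat the breakdown as indicative.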

08

TTS History

Concatenative synthesis, statistical parametric speech, vocoders, Tacotron, WaveNet, FastSpeech, VITS, and voice cloning risks.

Lab: compare TTS latency and intelligibility across two local engines.

Open lesson: TTS evolution and production tradeoffs

09

Neural Audio Codecs

Discrete audio tokens, residual vector quantization, codec language models, prosody, speaker information, and compression artifacts.

Lab: encode/decode audio and listen for bitrate-dependent artifacts.

10

Audio-Text Multimodality

Contrastive learning, cross-modal retrieval, audio captioning, speech translation, shared embedding spaces, and alignment losses.

Lab: build a tiny audio-text retrieval evaluation with hand-labeled examples.

11

Speech-To-Speech Architectures

Cascaded ASR-LLM-TTS, direct speech-to-speech, streaming turn-taking, barge-in, echo control, memory, and agent state.

Lab: design latency budgets for a local speech assistant turn.

Open lesson: cascaded and direct speech-to-speech
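
A latency budget for one cascaded turn can be sketched as a sum of per-stage estimates checked against a target. The stage names and millisecond values below are made-up placeholders rather than measurements, and `latency_budget_report` is a hypothetical helper.

```python
def latency_budget_report(stage_ms, budget_ms):
    """Summarize per-stage latency estimates against a turn budget."""
    total = sum(stage_ms.values())
    return {
        "total_ms": total,
        "headroom_ms": budget_ms - total,
        "within_budget": total <= budget_ms,
        # The slowest stage is usually the first place to look for savings.
        "largest_stage": max(stage_ms, key=stage_ms.get),
    }


if __name__ == "__main__":
    # Illustrative numbers only; replace with values measured on your system.
    stages = {
        "vad_endpoint": 200,
        "asr_final": 150,
        "llm_first_token": 350,
        "tts_first_audio": 200,
    }
    print(latency_budget_report(stages, budget_ms=1000))
```

A plain sum assumes the stages run strictly one after another. Streaming designs overlap stages, so a real budget would track time to first audio rather than a total.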

12

Efficient Transformers

FlashAttention, grouped-query attention, speculative decoding, quantization, pruning, distillation, adapters, LoRA, and prompt compression.

Lab: benchmark one model under two quantization levels and two context lengths.

Open lesson: efficient transformers and latency tradeoffs

13

Inference Hosting

KV-cache memory, continuous batching, PagedAttention, llama.cpp, MLX, vLLM-style serving, telemetry, cold starts, and local privacy.

Lab: build a local inference readiness report with latency, memory, throughput, and quality notes.

Open lesson: inference hosting and serving design
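
KV-cache memory for a decoder-only transformer can be estimated from the model shape: two tensors (keys and values) per layer, each of size KV heads × head dimension × sequence length × batch. The sketch below uses a hypothetical model shape for the example and ignores allocator overhead and paging granularity.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV-cache size in bytes (default 2 bytes/element, e.g. fp16)."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


if __name__ == "__main__":
    # Hypothetical shape: 32 layers, 8 KV heads (grouped-query), head_dim 128.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                          seq_len=4096)
    print(f"{size / 2**20:.0f} MiB per sequence")
```

Because the estimate scales linearly with sequence length and batch size, doubling either doubles the cache. This is why grouped-query attention, with fewer KV heads, directly reduces serving memory.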

14

Evaluation, RAG, And Quality Systems

Speech eval stacks, RAG for voice assistants, LLM judges, human review, slice metrics, and rollout gates.

Lab: build a privacy-safe speech RAG evaluation sheet with latency, retrieval, and spoken-answer scores.

Open lesson: speech evaluation, RAG, and quality systems

15

Production Debugging

First-hour incident response for ASR, TTS, speech-to-speech, audio RAG, rollback, and privacy-safe telemetry.

Lab: compute partial-transcript churn and latency-slice signals from synthetic request records.

Open lesson: production debugging for speech systems

16

Data And Evaluation Operations

Dataset contracts, labeling quality, eval slice design, CI gates, drift monitoring, and model release data lineage.

Lab: compute slice regressions from old and new metric maps before a release review.

Open lesson: speech data and evaluation operations
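
One way to approach the slice-regression lab is to diff two metric maps and rank the slices that got worse. This sketch assumes an error-style metric such as WER, where higher is worse. The slice names and threshold are invented for illustration.

```python
def slice_regressions(old_metrics, new_metrics, threshold=0.01):
    """Return (slice, delta) pairs where the metric worsened by more than threshold.

    Only slices present in both maps are compared; results are sorted
    from largest regression to smallest.
    """
    regressions = []
    for name in old_metrics.keys() & new_metrics.keys():
        delta = new_metrics[name] - old_metrics[name]
        if delta > threshold:
            regressions.append((name, delta))
    # Sort by delta descending, then by name so ties are deterministic.
    return sorted(regressions, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    old = {"en-US": 0.10, "noisy": 0.20, "far-field": 0.25}
    new = {"en-US": 0.10, "noisy": 0.26, "far-field": 0.27}
    for name, delta in slice_regressions(old, new):
        print(f"{name}: +{delta:.3f}")
```

Slices that appear in only one map are skipped here. A release review would probably want to flag those separately, since a missing slice can hide a regression.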

17

ML System Design Casebook

Practice advanced design rounds for streaming ASR, speech-to-speech agents, inference platforms, data flywheels, and TTS incidents.

Lab: implement canary gates and real-time audio capacity estimates.

Open lesson: ML system design casebook

18

Advanced ML Exam Drills

Timed prompts for ASR, TTS, audio RAG, inference platforms, incidents, rollout gates, and staff-level tradeoff explanations.

Lab: answer each prompt under time pressure, then compare against hidden strong outlines.

Open lesson: advanced ML exam drills

19

Speech Serving Observability

SLOs, dashboards, burn rates, drift slices, rollback decisions, cost alerts, and postmortems for production speech systems.

Lab: implement SLO burn-rate, slice-drift, and rollout-recommendation utilities.

Open lesson: speech serving observability and SLOs
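
Burn rate is commonly defined as the observed error rate divided by the error budget implied by the SLO target. A value of 1.0 means the budget is being consumed exactly as fast as the SLO allows. The sketch below follows that definition; the function name and example numbers are illustrative.

```python
def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the allowed error rate (1 - SLO target)."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (bad_events / total_events) / error_budget


if __name__ == "__main__":
    # 50 failed requests out of 10,000 against a 99.9% availability target.
    print(round(burn_rate(50, 10_000, slo_target=0.999), 2))
```

In practice the same function would be evaluated over more than one window length, for example a short and a long window, so that alerts fire on fast burns without paging on brief blips.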

20

Speech Safety, Privacy, And Security

Threat modeling for spoken prompt injection, voice impersonation, PII, consent, TTS abuse, and privacy-safe incident response.

Lab: implement transcript redaction, sensitive-tool confirmation, and abuse-spike detection utilities.

Open lesson: speech safety, privacy, and security
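
Transcript redaction can begin with simple pattern matching, as in this sketch. The two regular expressions are deliberately narrow assumptions (email-like strings and 10-digit phone-like strings) and would miss numbers spoken as words, which is common in ASR output. Treat it as a starting fixture rather than a complete PII filter.

```python
import re

# Assumed patterns for illustration; real systems need broader coverage.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")


def redact_transcript(text):
    """Replace email-like and phone-like spans with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


if __name__ == "__main__":
    print(redact_transcript("call me at 555-123-4567 or mail bob@example.com"))
```

Redacting before logging, rather than after, keeps raw identifiers out of telemetry in the first place.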

21

Speech Inference Platform Readiness

Readiness reviews for ASR, TTS, LLM, and speech-to-speech serving: capacity, gates, SLOs, incidents, and cost-aware rollout decisions.

Lab: implement rollout gates, queue burn-rate checks, and cost-regression detectors with synthetic aggregate metrics.

Open lesson: speech inference platform readiness

22

Speech ML/AI Production Incidents

First-hour triage, rollback judgment, slice regression analysis, privacy-safe incident packets, drift, and speech-to-speech failure drills.

Lab: rank slice regressions, recommend rollout actions, and produce aggregate-only incident packets.

Open lesson: audio ML production incident drills

24

Voice Agent Evaluation Benchmarks

Advanced evaluation plans for spoken RAG agents, speech-to-speech repair, task success, latency, safety, cost, and launch gates.

Lab: implement a voice-agent release-decision aggregator over synthetic slice metrics.

Open lesson: voice agent evaluation benchmarks

25

Speech AI Interview Sprint

Timed capstone prompts for speech system design, R&D judgment, production incidents, serving platforms, MLOps, and coding follow-ups.

Lab: rank canary rollout risks and connect Blind 75 patterns to speech serving capacity and constrained decoding.

Open lesson: advanced speech AI interview sprint

26

Speech Model CI/CD And Release Engineering

Model registry contracts, CI gates, canaries, rollback targets, production release drills, and research-to-serving handoff discipline.

Lab: implement registry validation, canary promotion gates, and rollback target resolution over synthetic release metadata.

Open lesson: speech model CI/CD and release engineering

27

Speech ML/AI Capstone Exam

Final advanced practice for speech R&D, system design, inference hosting, incidents, MLOps, cost control, and coding follow-ups.

Lab: answer timed capstone prompts, then compare against hidden strong outlines and worked Python utilities.

Open lesson: advanced audio ML capstone exam

28

Speech AI Debugging Casebook

Production debugging cases for ASR, TTS, speech-to-speech agents, spoken RAG, safety classifiers, and rollback decisions.

Lab: implement first-bad-release detection, slice-aware rollback selection, and privacy-safe incident packet builders.

Open lesson: speech AI debugging casebook
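
First-bad-release detection can be framed as a binary search over an ordered release list. The sketch assumes the regression persists once introduced, so every release after the first bad one is also bad. The `is_bad` callback is a hypothetical stand-in for whatever check the lab uses.

```python
def first_bad_release(releases, is_bad):
    """Return the earliest release for which is_bad is True, or None.

    Assumes releases are in deployment order and that badness is
    monotonic: good, good, ..., bad, bad.
    """
    lo, hi = 0, len(releases)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(releases[mid]):
            hi = mid          # first bad release is at mid or earlier
        else:
            lo = mid + 1      # first bad release is after mid
    return releases[lo] if lo < len(releases) else None


if __name__ == "__main__":
    history = ["r1", "r2", "r3", "r4", "r5"]
    known_bad = {"r4", "r5"}
    print(first_bad_release(history, lambda r: r in known_bad))
```

If the regression is intermittent or only affects some slices, the monotonic assumption breaks and a linear scan over slice-level metrics is safer.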

29

Model Serving On-Call Runbook

Production on-call practice for ASR, TTS, LLM, retrieval, and speech-to-speech serving incidents under real traffic constraints.

Lab: implement queue saturation, cost-per-success, and rollback-candidate utilities with synthetic aggregate metrics.

Open lesson: model serving on-call runbook

30

Coding Interview Workbook

Invariant-first coding practice for Blind 75 patterns, test design, debugging, and production speech ML follow-ups.

Lab: solve worked Python drills, then connect each pattern to ASR, RAG, queueing, or model-serving scenarios.

Open lesson: advanced coding interview workbook

31

Stacks And Sliding Windows

Blind 75 stack, monotonic-stack, parser, and sliding-window drills with worked Python, hidden answers, edge tests, and speech-serving follow-ups.

Lab: solve Daily Temperatures, Minimum Window Substring, Decode String, and production queue-window prompts by invariant.

Open lesson: Blind 75 stacks and sliding windows
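
As a reference point for the monotonic-stack pattern, here is a compact sketch of Daily Temperatures. The invariant is that the stack holds indices of days still waiting for a warmer day, with temperatures non-increasing from bottom to top.

```python
def daily_temperatures(temperatures):
    """For each day, return how many days until a warmer one (0 if none)."""
    answer = [0] * len(temperatures)
    waiting = []  # indices whose warmer day has not been found yet
    for day, temp in enumerate(temperatures):
        # Every waiting day that is colder than today is resolved now.
        while waiting and temperatures[waiting[-1]] < temp:
            earlier = waiting.pop()
            answer[earlier] = day - earlier
        waiting.append(day)
    return answer


if __name__ == "__main__":
    print(daily_temperatures([73, 74, 75, 71, 69, 72, 76, 73]))
```

Each index is pushed and popped at most once, so the loop is linear in the input length. The same shape appears in "time until the next queue-depth spike" style follow-ups.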

32

Evaluation And Incident Exam

Advanced practice for speech evaluation stacks, rollout gates, production incidents, privacy-safe debugging, drift, cost spikes, and rollback ranking.

Lab: implement promotion gates, sustained-drift detection, and rollback-target ranking over synthetic aggregate metrics.

Open lesson: advanced audio ML evaluation and incident exam

33

Real-Time Speech System Design Lab

Practice complete spoken-turn design, latency budgets, canary rollback, partial transcript churn, and production debugging for real-time speech products.

Lab: implement latency budget reports, prefix churn measurement, and canary rollback triggers over synthetic aggregate telemetry.

Open lesson: real-time speech system design lab
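
Prefix churn can be measured in several ways. The sketch below uses one assumed definition: for each pair of consecutive partial transcripts, count the words of the earlier partial that are not preserved as a prefix of the later one, then divide by the total words previously shown. The lesson may define it differently.

```python
def common_prefix_len(a, b):
    """Number of leading items shared by two word lists."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def prefix_churn(partials):
    """Return (retracted_words, churn_rate) over a sequence of partials."""
    retracted = 0
    shown = 0
    previous = []
    for text in partials:
        words = text.split()
        # Words after the shared prefix in the previous partial were revised.
        retracted += len(previous) - common_prefix_len(previous, words)
        shown += len(previous)
        previous = words
    rate = retracted / shown if shown else 0.0
    return retracted, rate


if __name__ == "__main__":
    print(prefix_churn(["i", "i want", "i won't to", "i want to go"]))
```

Under this definition a revision early in the utterance is costly, because every word after it counts as retracted even if it reappears unchanged.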

34

Speech Data Flywheels And Active Learning

Practice privacy-safe data loops, active-learning sampling, labeling operations, release gates, and first-hour data incident debugging for speech systems.

Lab: rank aggregate review candidates with consent filtering, slice coverage, regression signals, and privacy-safe metadata.

Open lesson: speech data flywheels and active learning

35

Speech Serving Scaling And Reliability Exam

Practice capacity planning, SLOs, rollout gates, cost-aware routing, rollback judgment, and production debugging for speech serving systems.

Lab: implement peak concurrency, canary rollback, and sliding-window error-budget utilities over synthetic aggregate metrics.

Open lesson: speech serving scaling and reliability exam
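
Peak concurrency over a set of sessions can be computed with a sweep over start and end events. This sketch treats end times as exclusive, so a session ending at time t does not overlap one starting at t. The tuple format `(start, end)` is an assumption.

```python
def peak_concurrency(sessions):
    """Maximum number of simultaneously active (start, end) sessions."""
    events = []
    for start, end in sessions:
        events.append((start, 1))   # session opens
        events.append((end, -1))    # session closes
    # Tuples sort by time, then by delta, so closes (-1) at the same
    # timestamp are processed before opens (+1).
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


if __name__ == "__main__":
    print(peak_concurrency([(0, 10), (5, 15), (10, 20)]))
```

For capacity planning the peak is usually compared against per-replica stream limits with some headroom, not used as the provisioning number directly.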

36

Speech GPU Serving Capacity Planning

Practice GPU memory math, batching, autoscaling, load shedding, warm pools, cost controls, and rollback for speech serving systems.

Lab: implement sliding-window queue alerts, priority admission, and retry-storm detectors over synthetic aggregate serving metrics.

Open lesson: speech GPU serving capacity planning

37

Speech Feature Pipelines And Data Contracts

Design privacy-safe audio feature pipelines, schema contracts, label drift checks, online/offline parity tests, and feature-store boundaries.

Lab: implement audio chunk merging, active-learning ranking, and schema coverage drift detectors with hidden Python answers.

Open lesson: speech feature pipelines and data contracts

38

Speech AI Load Testing And Chaos Readiness

Practice launch readiness for ASR, TTS, spoken RAG, and speech-to-speech systems under load, dependency failures, cost pressure, and rollback constraints.

Lab: implement load-test gates, retry-storm detection, and capacity step-load summaries over synthetic aggregate metrics.

Open lesson: speech AI load testing and chaos readiness

39

Full-Duplex Speech-To-Speech Research Update

Study current research directions in full-duplex speech models: native listen-while-speaking models, micro-turn cascades, role-conditioned agents, action streams, asynchronous retrieval, interactivity alignment, and benchmark gaps.

Lab: compare native, cascaded, front-end-assisted, and retrieval-augmented duplex serving designs with rollout gates and interruption metrics.

Open lesson: full-duplex speech-to-speech research update

Career Ladder

Practice From Junior To Principal

This track turns the speech ML/AI domain tour into role-level practice: junior implementation habits, mid-level debugging, advanced system ownership, staff-level platform tradeoffs, and principal-level strategy. It spans ML system design, R&D, MLOps, serving, monitoring, rollback, and depth in speech and multimodal AI.

Advanced ML Exam Drills

Run timed interview and exam prompts for ASR, TTS, audio RAG, GPU serving, MLOps incidents, rollback, and cost control.

Open lesson: advanced ML exam drills

Speech AI Debugging Casebook

Practice first-hour incident method, hypothesis-driven slice analysis, rollback judgment, spoken RAG failures, and safety/cost tradeoffs.

Open lesson: speech AI debugging casebook

Full-Duplex S2S Research Update

Practice staff-level tradeoffs for native full-duplex SpeechLMs, VAD-free micro-turn cascades, role-conditioned duplex models, asynchronous RAG, action streams, privacy risk, and turn-taking benchmarks.

Open lesson: full-duplex speech-to-speech research update

What This Adds

The target is not reciting definitions. It is being able to design, build, ship, monitor, debug, and improve audio-text ML systems under real constraints.

Coding Practice

Core Coding Patterns

Speech ML/AI engineers still need reliable general coding. This chapter covers the full Blind 75 problem bank with pattern strategies, hidden-answer notes, and common mistakes, then connects those patterns to speech data, evaluation, and serving work.

Graphs, Trees, And DP Deep Dive

Practice state search, recursive invariants, memoization, DP transitions, cycle handling, and worked Python solutions for high-yield Blind 75 items.

Open lesson: Blind 75 graphs, trees, and DP

Arrays, Intervals, Heaps, And Matrices

Practice high-frequency Blind 75 implementation rounds with hidden strategies, invariants, edge cases, and worked Python for production-shaped data structures.

Open lesson: Blind 75 arrays and intervals

Pipeline Coding Drills

Practice Merge Intervals, Top K Elements, and sliding windows as speech feature-pipeline utilities with privacy and schema-contract follow-ups.

Open lesson: speech pipeline coding drills
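
Merge Intervals maps directly onto merging overlapping audio chunks, such as VAD segments. The sketch below assumes chunks are `(start_seconds, end_seconds)` tuples. The optional `gap` parameter, which also merges chunks separated by a short silence, is an added assumption.

```python
def merge_chunks(chunks, gap=0.0):
    """Merge overlapping (start, end) chunks; also merge if within `gap` seconds."""
    merged = []
    for start, end in sorted(chunks):
        if merged and start <= merged[-1][1] + gap:
            # Overlaps (or nearly touches) the last chunk: extend it.
            last_start, last_end = merged[-1]
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(merge_chunks([(3.0, 4.0), (0.0, 1.0), (0.8, 2.0)]))
```

Sorting first keeps the invariant simple: only the most recently merged chunk can overlap the next one.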

How To Use It

Solve from the prompt first, classify the pattern, state invariants, then open the hidden strategy and compare against your code.

Projects

Milestones To Build

Project A: Audio Feature Lab

Feature extraction notebook with waveform, STFT, mel, augmentation, and metadata-only privacy rules.

Project B: ASR Error Analyzer

Run ASR, compute WER/CER, classify substitutions/deletions/insertions, and produce a readable report.

Project C: TTS Latency Bench

Compare local TTS paths by warmup time, generation time, perceived quality, and playback stability.

Project D: Speech Agent Loop

Combine VAD, ASR, LLM, TTS, shared state, and transcript UI into a local privacy-preserving assistant.

Project E: Efficient Serving Report

Benchmark prompt length, context memory, quantization, batching, and model choice for local hosting.

Project F: Speech-To-Speech Design Review

Compare cascaded and direct architectures for latency, controllability, alignment, safety, and quality.

Project G: Speech RAG Eval Harness

Build a small synthetic eval for spoken queries, noisy ASR hypotheses, document retrieval, grounded answers, and TTS latency.

Project H: Release Readiness Review

Write a model card, eval summary, rollout plan, rollback checklist, and first-hour incident runbook for one ASR or TTS release.

Project I: Speech Data Flywheel Review

Design a privacy-safe data collection, labeling, eval, CI, and drift-monitoring plan for one production speech model improvement.

Project J: Speech Safety Red-Team Pack

Create sanitized spoken prompt-injection, voice replay, PII leakage, and TTS abuse fixtures with release gates and rollback actions.

Project K: Inference Platform Readiness Review

Write a launch review for shared ASR, TTS, and speech-to-speech serving with capacity math, SLOs, canaries, rollback, and cost guardrails.

Project L: Audio Incident Runbook Pack

Create privacy-safe runbooks, dashboards, synthetic repros, rollout gates, and rollback criteria for three realistic speech production incidents.

Project M: Voice Agent Eval Benchmark Pack

Create synthetic spoken RAG, account-help, interruption, safety, latency, and rollback fixtures with slice-level launch thresholds.

Project N: Speech Model Release Pipeline

Build a sanitized model registry entry, CI gate report, canary promotion decision, rollback target, and post-release dashboard checklist.

Project O: Speech ML/AI Mock Exam

Run a timed capstone with architecture, incident, eval, cost, MLOps, and coding sections using only synthetic fixtures and aggregate metrics.

Project P: Speech AI Debugging Casebook

Create a reusable pack of synthetic incidents, slice dashboards, rollback decisions, and privacy-safe incident packets for advanced practice.

Project Q: Model Serving On-Call Pack

Create synthetic dashboards, rollback gates, cost-per-success reports, and graceful-degradation runbooks for ASR, TTS, LLM, and spoken RAG serving.

Project R: Streaming Coding Pattern Pack

Build stack and sliding-window utilities for queue alerts, partial transcript churn, parser readiness, and canary rollback exercises.

Project S: Full-Duplex S2S Launch Review

Write a launch review for a full-duplex speech-to-speech agent, including interruption handling, backchannel metrics, micro-turn budgets, action-stream safety, and rollback gates.

Canonical Sources

Starting Reading List