Speech ML/AI Domain Study Guide
A progressive tour from junior foundations through production ownership, staff-level system design, and principal-level audio AI strategy.
North Star
Learn the math, data representations, and model families behind modern audio/text ML.
Write small experiments for features, training loops, ASR/TTS evaluation, and inference.
Measure latency, memory, context size, batching, quantization, and streaming behavior.
Connect product goals, safety, cost, evaluation, reliability, and platform strategy across the speech AI stack.
Study Rhythm
Curriculum
Lesson pages include code, checkpoints, and hidden answers. Start with the transformer lesson, then return here to connect it to ASR, TTS, and speech-to-speech systems.
Set up Python, notebooks, PyTorch, audio tools, Git, experiment logs, and privacy boundaries.
Lab: load a WAV, plot waveform/spectrogram, save metadata only.
Linear algebra, probability, gradients, cross entropy, sequence likelihoods, and optimization intuition.
Lab: implement linear regression and softmax classification from scratch.
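As a starting point for the softmax part of this lab, here is a minimal sketch of a numerically stable softmax and its cross-entropy loss in plain Python. The function names are illustrative, not taken from the lesson pages, and a real lab solution would add the gradient step and a training loop.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

def softmax_grad(logits, target_index):
    """Gradient of cross entropy w.r.t. the logits: probs minus one-hot."""
    probs = softmax(logits)
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]

probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy([2.0, 1.0, 0.1], target_index=0)
```

The gradient has the simple form "predicted probability minus one-hot target", which is why softmax and cross entropy are usually implemented together.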
Tensors, autograd, MLPs, initialization, normalization, regularization, train/validation loops.
Lab: train a small classifier and write a failure analysis.
Sampling, quantization, STFT, mel filterbanks, MFCCs, log-mel features, framing, and augmentation.
Lab: compare waveform, spectrogram, and mel features for the same utterance.
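Two small formulas sit behind the features in this lesson: the Hz-to-mel mapping and the number of frames produced by a window and hop. The sketch below uses the common HTK-style mel formula and assumes no padding at the signal edges; audio libraries often pad by default, so their frame counts can differ. Function names are illustrative.

```python
import math

def hz_to_mel(hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def num_frames(num_samples, win_length, hop_length):
    """Count full frames when no edge padding is applied (an assumption)."""
    if num_samples < win_length:
        return 0
    return 1 + (num_samples - win_length) // hop_length

# One second of 16 kHz audio with a 25 ms window and a 10 ms hop.
frames = num_frames(16000, win_length=400, hop_length=160)
```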
Tokenization, embeddings, n-grams, RNNs, seq2seq, attention, beam search, and decoding tradeoffs.
Lab: build a tiny character language model and inspect its decoding mistakes.
Self-attention, causal masks, encoder/decoder stacks, positional encodings, RoPE, KV cache, and scaling limits.
Lab: implement single-head attention, then measure sequence length cost.
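For orientation, single-head scaled dot-product attention can be sketched in plain Python as below. This is a teaching sketch, assuming queries, keys, and values are given as lists of equal-length float vectors; the learned projection matrices are left out, and a real implementation would use tensors.

```python
import math

def attention(q, k, v, causal=False):
    """Scaled dot-product attention for one head.

    q, k, v are lists of vectors (lists of floats). With causal=True,
    position i may only attend to positions j <= i.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if causal and j > i:
                scores.append(float("-inf"))  # masked positions get zero weight
            else:
                scores.append(sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d))
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
masked = attention(x, x, values, causal=True)
```

The nested loops make the cost visible: every query is scored against every key, so the number of scores grows with the square of the sequence length.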
Dynamic time warping, HMM/GMM systems, pronunciation lexicons, acoustic models, language models, and decoding graphs.
Lab: diagram a classic ASR pipeline and map modern replacements.
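Dynamic time warping is the easiest piece of the classic pipeline to try in code. The sketch below aligns two one-dimensional sequences with absolute difference as the local cost; real systems align feature vectors such as MFCC frames, and the function name is illustrative.

```python
def dtw_distance(a, b):
    """Minimum cumulative |a_i - b_j| cost over monotonic alignments."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# A stretched copy of the same shape aligns at zero cost.
same_shape = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```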
CTC, Listen-Attend-Spell, RNN-T, Conformer, wav2vec 2.0, Whisper, WER/CER, streaming ASR, and VAD.
Lab: run two ASR models on the same clips and compare errors by category.
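Comparing errors by category starts with an edit-distance alignment that counts substitutions, deletions, and insertions. The following is a minimal sketch that splits on whitespace and applies no text normalization; both are simplifying assumptions, and the function names are illustrative.

```python
def edit_counts(ref_words, hyp_words):
    """Return (total_edits, substitutions, deletions, insertions)."""
    n, m = len(ref_words), len(hyp_words)
    # dp[i][j] holds the counts for aligning ref[:i] with hyp[:j].
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, ins = dp[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                diag = (t, s, d, ins)
            else:
                diag = (t + 1, s + 1, d, ins)
            t, s, d, ins = dp[i - 1][j]
            delete = (t + 1, s, d + 1, ins)
            t, s, d, ins = dp[i][j - 1]
            insert = (t + 1, s, d, ins + 1)
            dp[i][j] = min(diag, delete, insert, key=lambda c: c[0])
    return dp[n][m]

def wer(reference, hypothesis):
    """Word error rate: total edits divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    total, _, _, _ = edit_counts(ref_words, hypothesis.split())
    return total / len(ref_words)

counts = edit_counts("the cat sat on the mat".split(),
                     "the cat sit on mat".split())
```

Passing lists of characters instead of words to `edit_counts` gives the counts needed for CER.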
Concatenative synthesis, statistical parametric speech, vocoders, Tacotron, WaveNet, FastSpeech, VITS, and voice cloning risks.
Lab: compare TTS latency and intelligibility across two local engines.
Discrete audio tokens, residual vector quantization, codec language models, prosody, speaker information, and compression artifacts.
Lab: encode/decode audio and listen for bitrate-dependent artifacts.
Contrastive learning, cross-modal retrieval, audio captioning, speech translation, shared embedding spaces, and alignment losses.
Lab: build a tiny audio-text retrieval evaluation with hand-labeled examples.
Cascaded ASR-LLM-TTS, direct speech-to-speech, streaming turn-taking, barge-in, echo control, memory, and agent state.
Lab: design latency budgets for a local speech assistant turn.
FlashAttention, grouped-query attention, speculative decoding, quantization, pruning, distillation, adapters, LoRA, and prompt compression.
Lab: benchmark one model under two quantization levels and two context lengths.
KV-cache memory, continuous batching, PagedAttention, llama.cpp, MLX, vLLM-style serving, telemetry, cold starts, and local privacy.
Lab: build a local inference readiness report with latency, memory, throughput, and quality notes.
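For the memory column of such a report, KV-cache size for a standard transformer decoder can be estimated from the model shape. The sketch below assumes keys and values are stored for every layer and every KV head at a fixed element width; the configuration in the example is hypothetical, not a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """Estimate KV-cache size in bytes.

    The leading factor of 2 covers keys and values. bytes_per_element=2
    corresponds to fp16 or bf16 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

# Hypothetical shape: 32 layers, 8 KV heads of dimension 128, 4096 tokens.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
size_mib = size / (1024 ** 2)
```

Because the estimate is linear in sequence length and batch size, doubling either doubles the cache, which is the pressure that grouped-query attention and paged allocation are meant to relieve.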
Speech eval stacks, RAG for voice assistants, LLM judges, human review, slice metrics, and rollout gates.
Lab: build a privacy-safe speech RAG evaluation sheet with latency, retrieval, and spoken-answer scores.
First-hour incident response for ASR, TTS, speech-to-speech, audio RAG, rollback, and privacy-safe telemetry.
Lab: compute partial-churn and latency-slice signals from synthetic request records.
Dataset contracts, labeling quality, eval slice design, CI gates, drift monitoring, and model release data lineage.
Lab: compute slice regressions from old and new metric maps before a release review.
Practice advanced design rounds for streaming ASR, speech-to-speech agents, inference platforms, data flywheels, and TTS incidents.
Lab: implement canary gates and real-time audio capacity estimates.
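A first-pass real-time capacity estimate can be built from the real-time factor (processing time divided by audio duration). The sketch below assumes each worker handles streams without batching gains and is kept below a target utilization; both are simplifying assumptions, and all names and numbers are illustrative.

```python
import math

def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1.0 means the model runs faster than real time."""
    return processing_seconds / audio_seconds

def streams_per_worker(rtf, target_utilization=0.7):
    """Concurrent real-time streams one worker can sustain with headroom."""
    return math.floor(target_utilization / rtf)

def workers_needed(peak_streams, rtf, target_utilization=0.7):
    """Workers required to serve the peak number of concurrent streams."""
    per_worker = streams_per_worker(rtf, target_utilization)
    if per_worker < 1:
        raise ValueError("one worker cannot sustain a single stream")
    return math.ceil(peak_streams / per_worker)

rtf = real_time_factor(processing_seconds=2.5, audio_seconds=10.0)
fleet = workers_needed(peak_streams=100, rtf=rtf)
```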
Timed prompts for ASR, TTS, audio RAG, inference platforms, incidents, rollout gates, and staff-level tradeoff explanations.
Lab: answer each prompt under time pressure, then compare against hidden strong outlines.
SLOs, dashboards, burn rates, drift slices, rollback decisions, cost alerts, and postmortems for production speech systems.
Lab: implement SLO burn-rate, slice-drift, and rollout-recommendation utilities.
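Burn rate is the observed error rate divided by the error budget allowed by the SLO. The sketch below also shows a two-window paging check; the threshold and window choices are parameters you would tune, and the function names are illustrative rather than taken from the lesson.

```python
def burn_rate(bad_events, total_events, slo_target):
    """How many times faster than budgeted the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget

def should_page(long_window, short_window, slo_target, threshold):
    """Page only when both windows burn faster than the threshold.

    Each window is a (bad_events, total_events) tuple. Requiring the short
    window as well avoids paging on an incident that has already recovered.
    """
    return (burn_rate(*long_window, slo_target) >= threshold
            and burn_rate(*short_window, slo_target) >= threshold)

rate = burn_rate(bad_events=50, total_events=10000, slo_target=0.999)
```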
Threat modeling for spoken prompt injection, voice impersonation, PII, consent, TTS abuse, and privacy-safe incident response.
Lab: implement transcript redaction, sensitive-tool confirmation, and abuse-spike detection utilities.
Readiness reviews for ASR, TTS, LLM, and speech-to-speech serving: capacity, gates, SLOs, incidents, and cost-aware rollout decisions.
Lab: implement rollout gates, queue burn-rate checks, and cost-regression detectors with synthetic aggregate metrics.
First-hour triage, rollback judgment, slice regression analysis, privacy-safe incident packets, drift, and speech-to-speech failure drills.
Lab: rank slice regressions, recommend rollout actions, and produce aggregate-only incident packets.
Linked lists, binary search, heaps, dynamic programming, monotonic predicates, pointer invariants, and production follow-ups.
Lab: solve Blind 75 pointer, heap, binary-search, and DP problems before opening hidden Python answers.
Open lesson: Blind 75 linked lists, binary search, and heaps
Advanced evaluation plans for spoken RAG agents, speech-to-speech repair, task success, latency, safety, cost, and launch gates.
Lab: implement a voice-agent release-decision aggregator over synthetic slice metrics.
Timed capstone prompts for speech system design, R&D judgment, production incidents, serving platforms, MLOps, and coding follow-ups.
Lab: rank canary rollout risks and connect Blind 75 patterns to speech serving capacity and constrained decoding.
Model registry contracts, CI gates, canaries, rollback targets, production release drills, and research-to-serving handoff discipline.
Lab: implement registry validation, canary promotion gates, and rollback target resolution over synthetic release metadata.
Final advanced practice for speech R&D, system design, inference hosting, incidents, MLOps, cost control, and coding follow-ups.
Lab: answer timed capstone prompts, then compare against hidden strong outlines and worked Python utilities.
Production debugging cases for ASR, TTS, speech-to-speech agents, spoken RAG, safety classifiers, and rollback decisions.
Lab: implement first-bad-release detection, slice-aware rollback selection, and privacy-safe incident packet builders.
Production on-call practice for ASR, TTS, LLM, retrieval, and speech-to-speech serving incidents under real traffic constraints.
Lab: implement queue saturation, cost-per-success, and rollback-candidate utilities with synthetic aggregate metrics.
Invariant-first coding practice for Blind 75 patterns, test design, debugging, and production speech ML follow-ups.
Lab: solve worked Python drills, then connect each pattern to ASR, RAG, queueing, or model-serving scenarios.
Blind 75 stack, monotonic-stack, parser, and sliding-window drills with worked Python, hidden answers, edge tests, and speech-serving follow-ups.
Lab: solve Daily Temperatures, Minimum Window Substring, Decode String, and production queue-window prompts by invariant.
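As one worked instance of solving by invariant, Daily Temperatures can be handled with a monotonic stack. The invariant is that the stack holds indices of days still waiting for a warmer day, with strictly decreasing temperatures from bottom to top. This is a sketch to compare against after attempting the problem yourself.

```python
def daily_temperatures(temps):
    """For each day, return how many days until a warmer one (0 if none)."""
    answer = [0] * len(temps)
    stack = []  # indices of unresolved days, temperatures decreasing
    for day, temp in enumerate(temps):
        # The current day resolves every colder day on top of the stack.
        while stack and temps[stack[-1]] < temp:
            earlier = stack.pop()
            answer[earlier] = day - earlier
        stack.append(day)
    return answer

waits = daily_temperatures([73, 74, 75, 71, 69, 72, 76, 73])
```

Each index is pushed and popped at most once, so the loop runs in linear time despite the nested `while`.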
Advanced practice for speech evaluation stacks, rollout gates, production incidents, privacy-safe debugging, drift, cost spikes, and rollback ranking.
Lab: implement promotion gates, sustained-drift detection, and rollback-target ranking over synthetic aggregate metrics.
Practice complete spoken-turn design, latency budgets, canary rollback, partial transcript churn, and production debugging for real-time speech products.
Lab: implement latency budget reports, prefix churn measurement, and canary rollback triggers over synthetic aggregate telemetry.
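Prefix churn can be defined in several ways. The sketch below uses one possible definition, stated here as an assumption: for each pair of consecutive partial transcripts, count the words of the earlier partial that did not survive as a prefix of the later one, then divide by the total number of words previously shown.

```python
def common_prefix_len(a, b):
    """Number of leading items that two word lists share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

def prefix_churn(partials):
    """Fraction of previously shown words that a later partial revised."""
    revised = 0
    shown = 0
    for prev, cur in zip(partials, partials[1:]):
        prev_words, cur_words = prev.split(), cur.split()
        kept = common_prefix_len(prev_words, cur_words)
        revised += len(prev_words) - kept
        shown += len(prev_words)
    return revised / shown if shown else 0.0

churn = prefix_churn(["i", "i want", "i won two", "i want to go"])
```

A value near zero means partials only grow; a high value means users see text being rewritten while they speak.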
Practice privacy-safe data loops, active-learning sampling, labeling operations, release gates, and first-hour data incident debugging for speech systems.
Lab: rank aggregate review candidates with consent filtering, slice coverage, regression signals, and privacy-safe metadata.
Practice capacity planning, SLOs, rollout gates, cost-aware routing, rollback judgment, and production debugging for speech serving systems.
Lab: implement peak concurrency, canary rollback, and sliding-window error-budget utilities over synthetic aggregate metrics.
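Peak concurrency over a set of request intervals can be computed with a sweep over sorted start and end events. The sketch below treats intervals as half-open, so a request ending at time t does not overlap one starting at t; that convention is an assumption you should match to your own telemetry.

```python
def peak_concurrency(intervals):
    """Maximum number of simultaneously active [start, end) intervals."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a request becomes active
        events.append((end, -1))    # a request finishes
    # Sorting tuples puts -1 before +1 at equal times, so ends are
    # processed before starts, which matches the half-open convention.
    events.sort()
    active = 0
    peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak

peak = peak_concurrency([(0, 10), (5, 15), (10, 20)])
```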
Practice GPU memory math, batching, autoscaling, load shedding, warm pools, cost controls, and rollback for speech serving systems.
Lab: implement sliding-window queue alerts, priority admission, and retry-storm detectors over synthetic aggregate serving metrics.
Design privacy-safe audio feature pipelines, schema contracts, label drift checks, online/offline parity tests, and feature-store boundaries.
Lab: implement audio chunk merging, active-learning ranking, and schema coverage drift detectors with hidden Python answers.
Practice launch readiness for ASR, TTS, spoken RAG, and speech-to-speech systems under load, dependency failures, cost pressure, and rollback constraints.
Lab: implement load-test gates, retry-storm detection, and capacity step-load summaries over synthetic aggregate metrics.
Study current directions in full-duplex speech models: native listen-while-speaking models, micro-turn cascades, role-conditioned agents, action streams, asynchronous retrieval, interactivity alignment, and benchmark gaps.
Lab: compare native, cascaded, front-end-assisted, and retrieval-augmented duplex serving designs with rollout gates and interruption metrics.
Career Ladder
This track turns the speech ML/AI domain tour into role-level practice: junior implementation habits, mid-level debugging, advanced system ownership, staff-level platform tradeoffs, and principal-level strategy across ML system design, R&D, MLOps, serving, monitoring, rollback, and speech and multimodal AI depth.
Practice realistic prompts for streaming ASR, speech-to-speech assistants, audio model serving platforms, production incidents, and research tradeoffs.
Practice end-to-end architecture for ASR, TTS, speech-to-speech, evaluation, deployment, rollback, and cost-aware serving.
Practice full ML system design cases with hidden strong answers, rollout gates, production tradeoffs, and coding follow-ups.
Run timed interview and exam prompts for ASR, TTS, audio RAG, GPU serving, MLOps incidents, rollback, and cost control.
Work through serving architecture, CI/CD gates, observability, drift, rollback, cost controls, and realistic incident drills for audio systems.
Practice first-hour triage, rollback judgment, telemetry slicing, and privacy-safe debugging for speech production incidents.
Practice release reviews for ASR, TTS, embeddings, and speech-to-speech systems, including gates, canaries, debugging, and rollback.
Practice SLO definition, dashboard design, burn-rate alerts, drift checks, rollback recommendations, and speech incident postmortems.
Practice spoken prompt injection defense, voice impersonation controls, PII handling, consent, abuse monitoring, and privacy-safe debugging.
Practice GPU/CPU placement, batching, KV-cache memory, cost controls, rollback, and debugging for ASR, TTS, LLM, and speech-to-speech serving.
Practice release gates, real-time versus batch isolation, cost regressions, queue SLO burn rates, and first-hour speech platform incident response.
Practice production incident response for partial churn, TTS latency, voice RAG grounding, noisy accent regressions, barge-in, drift, and rollback.
Practice end-to-end quality systems for voice assistants: ASR slices, retrieval failures, LLM judges, human review, and rollout decisions.
Practice benchmark design for spoken RAG agents, multi-turn account help, speech-to-speech repair, launch gates, and cost-quality decisions.
Practice timed advanced rounds that combine speech system design, R&D tradeoffs, serving infrastructure, incidents, and coding follow-ups.
Practice registry contracts, CI gates, canary promotion, rollback, and production release engineering for ASR, TTS, and speech-to-speech systems.
Practice a four-hour advanced round covering speech R&D, architecture, incidents, cost and latency tradeoffs, MLOps, and coding utilities.
Practice speech eval design, ASR/TTS/RAG release gates, production incident response, drift checks, cost controls, and rollback ranking.
Practice full spoken-turn architecture, end-to-end latency traces, partial churn debugging, canary gates, and handoff checklists.
Practice first-hour incident method, hypothesis-driven slice analysis, rollback judgment, spoken RAG failures, and safety/cost tradeoffs.
Practice queue triage, tail-latency diagnosis, cost-per-success analysis, graceful degradation, and rollback selection for speech AI serving.
Practice invariant-first Blind 75 coding, edge tests, worked Python, and production follow-ups for speech ML systems.
Practice dataset contracts, labeling rubrics, CI release gates, drift monitoring, and data flywheel tradeoffs for speech products.
Practice active learning, consent-aware sampling, label audits, data release diffs, and production data incident response for speech AI.
Practice serving-plane design, capacity math, SLOs, CI/CD gates, rollback, cost controls, and production incident drills.
Practice GPU serving capacity, autoscaling signals, load shedding, warm TTS voice pools, cost controls, and real-time queue incidents.
Practice end-to-end load tests, chaos drills, retry-storm detection, capacity gates, dependency timeouts, and rollback triggers for speech AI systems.
Practice staff-level tradeoffs for native full-duplex SpeechLMs, VAD-free micro-turn cascades, role-conditioned duplex models, asynchronous RAG, action streams, privacy risk, and turn-taking benchmarks.
The target bar is not reciting definitions. It is being able to design, build, ship, monitor, debug, and improve audio-text ML systems under real constraints.
Coding Practice
Speech ML/AI engineers still need reliable general coding. This chapter covers the full Blind 75 problem bank with pattern strategies, hidden-answer notes, and common mistakes, then connects those patterns to speech data, evaluation, and serving work.
Practice arrays, hashing, binary search, dynamic programming, graphs, intervals, linked lists, matrix traversal, strings, trees, tries, and heaps.
Practice state search, recursive invariants, memoization, DP transitions, cycle handling, and worked Python solutions for high-yield Blind 75 items.
Practice high-frequency Blind 75 implementation rounds with hidden strategies, invariants, edge cases, and worked Python for production-shaped data structures.
Practice substring windows, palindromes, combinatorial search, prefix trees, pruning, edge cases, and production-style follow-ups.
Practice pointer rewiring, monotonic predicates, priority queues, top-k streams, and worked Python solutions for common Blind 75 rounds.
Practice 1D and 2D DP state definitions, recurrences, hidden-answer drills, worked Python, and production ML follow-ups.
Practice recursive tree contracts, BST boundaries, trie prefix search, Word Search II pruning, and speech production follow-ups.
Practice interview contracts, invariants, edge-case tests, worked Python drills, and production follow-ups for audio ML work.
Practice Valid Parentheses, Daily Temperatures, Histogram, Character Replacement, Minimum Window, Decode String, and speech-serving follow-ups.
Practice XOR invariants, masks, popcount, reverse bits, integer boundaries, encode/decode strings, and production feature-flag follow-ups.
Practice Merge Intervals, Top K Elements, and sliding windows as speech feature-pipeline utilities with privacy and schema-contract follow-ups.
Solve from the prompt first, classify the pattern, state invariants, then open the hidden strategy and compare against your code.
Projects
Feature extraction notebook with waveform, STFT, mel, augmentation, and metadata-only privacy rules.
Run ASR, compute WER/CER, classify substitutions/deletions/insertions, and produce a readable report.
Compare local TTS paths by warmup time, generation time, perceived quality, and playback stability.
Combine VAD, ASR, LLM, TTS, shared state, and transcript UI into a local privacy-preserving assistant.
Benchmark prompt length, context memory, quantization, batching, and model choice for local hosting.
Compare cascaded and direct architectures for latency, controllability, alignment, safety, and quality.
Build a small synthetic eval for spoken queries, noisy ASR hypotheses, document retrieval, grounded answers, and TTS latency.
Write a model card, eval summary, rollout plan, rollback checklist, and first-hour incident runbook for one ASR or TTS release.
Design a privacy-safe data collection, labeling, eval, CI, and drift-monitoring plan for one production speech model improvement.
Create sanitized spoken prompt-injection, voice replay, PII leakage, and TTS abuse fixtures with release gates and rollback actions.
Write a launch review for shared ASR, TTS, and speech-to-speech serving with capacity math, SLOs, canaries, rollback, and cost guardrails.
Create privacy-safe runbooks, dashboards, synthetic repros, rollout gates, and rollback criteria for three realistic speech production incidents.
Create synthetic spoken RAG, account-help, interruption, safety, latency, and rollback fixtures with slice-level launch thresholds.
Build a sanitized model registry entry, CI gate report, canary promotion decision, rollback target, and post-release dashboard checklist.
Run a timed capstone with architecture, incident, eval, cost, MLOps, and coding sections using only synthetic fixtures and aggregate metrics.
Create a reusable pack of synthetic incidents, slice dashboards, rollback decisions, and privacy-safe incident packets for advanced practice.
Create synthetic dashboards, rollback gates, cost-per-success reports, and graceful-degradation runbooks for ASR, TTS, LLM, and spoken RAG serving.
Build stack and sliding-window utilities for queue alerts, partial transcript churn, parser readiness, and canary rollback exercises.
Write a launch review for a full-duplex speech-to-speech agent, including interruption handling, backchannel metrics, micro-turn budgets, action-stream safety, and rollback gates.
Canonical Sources