ML Foundations For Speech Engineers

Core Math

Know What The Model Is Optimizing

A strong engineer can explain the objective before opening a profiler. These foundations show up in acoustic modeling, language modeling, speaker embeddings, retrieval, ranking, and production calibration.

Vectors And Matrices

Audio batches are tensors: batch, time, feature, channel, layer, head, and vocabulary axes must stay explicit.

Question: Why do shape mistakes survive code review?

Many tensor libraries broadcast silently. A wrong axis can produce plausible numbers while training the wrong objective. Experienced engineers write shape comments, assertions, and tests for feature extraction, masking, padding, and batch collation.
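The silent-broadcast failure can be reproduced in a few lines. This is a minimal sketch using NumPy; the `mse` helper and the array names are illustrative assumptions, not part of any particular pipeline.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error that refuses to broadcast mismatched shapes."""
    assert pred.shape == target.shape, f"shape mismatch: {pred.shape} vs {target.shape}"
    return float(np.mean((pred - target) ** 2))


pred = np.array([[1.0], [2.0], [3.0]])   # [batch, 1], e.g. a regression head
target = np.array([1.0, 2.0, 3.0])       # [batch], e.g. labels from a collate function

# Without a shape check, [3, 1] - [3] broadcasts to [3, 3]: every prediction is
# compared against every target, and the result is still a plausible number.
silent = float(np.mean((pred - target) ** 2))

# With the axes made explicit, the perfect predictions score zero as expected.
checked = mse(pred.squeeze(-1), target)
```

Here `silent` comes out above 1.0 even though every prediction matches its label, while `checked` is 0.0. Calling `mse(pred, target)` directly raises on the shape assertion instead of training on the wrong quantity.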

Probability

ASR and TTS systems rank uncertain hypotheses, not certainties. Learn likelihoods, priors, calibration, and confidence.

Question: Why is confidence not accuracy?

Confidence is a model score or derived probability. Accuracy is measured against labels. A model can be overconfident on noisy microphones or rare names, so launch decisions require calibration curves and slice metrics, not raw confidence alone.
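One common way to quantify the gap between confidence and accuracy is expected calibration error. The sketch below is a plain-Python version under simple assumptions: equal-width bins, a hypothetical bin count of 5, and made-up scores rather than real model output.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Average |accuracy - mean confidence| over equal-width bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical scores: the model reports 0.9 on every utterance but is right on only 6 of 10.
confidences = [0.9] * 10
correct = [True] * 6 + [False] * 4
gap = expected_calibration_error(confidences, correct)
```

The resulting gap of roughly 0.3 is the distance between reported confidence (0.9) and measured accuracy (0.6). Running the same function per slice (device, accent, noise level) shows where overconfidence concentrates.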

Loss Functions

Cross entropy, CTC, contrastive losses, regression losses, and ranking losses encode different supervision assumptions.

Question: When is cross entropy the wrong mental model?

Cross entropy fits classification targets that are already aligned with model outputs. Streaming ASR often has audio and text with no frame-level alignment, so CTC or transducer losses marginalize over every valid alignment rather than score a single one. Retrieval uses contrastive objectives, and TTS may combine duration, acoustic, adversarial, and vocoder losses.
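To make "marginalize alignments" concrete, the sketch below computes a CTC-style likelihood by brute force: it enumerates every frame-level path and sums the probability of those that collapse to the target. The helper names and the toy distribution are assumptions for illustration. Real implementations use dynamic programming, but the quantity is the same.

```python
from itertools import product

BLANK = "_"


def collapse(path):
    """CTC collapse rule: merge adjacent repeats, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != BLANK)


def ctc_likelihood(frame_probs, target):
    """Sum the probability of every frame-level path that collapses to target.

    frame_probs is a list (one entry per frame) of {symbol: probability} dicts.
    Brute force is exponential in the number of frames, so this is only a teaching aid.
    """
    symbols = sorted(frame_probs[0])
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if collapse(path) == target:
            prob = 1.0
            for frame, symbol in zip(frame_probs, path):
                prob *= frame[symbol]
            total += prob
    return total


# Three frames with a uniform distribution over blank, "a", and "b".
uniform = {BLANK: 1 / 3, "a": 1 / 3, "b": 1 / 3}
likelihood = ctc_likelihood([uniform] * 3, "ab")
```

Five of the 27 possible paths collapse to "ab" (`aab`, `abb`, `_ab`, `a_b`, `ab_`), so the likelihood is 5/27. Frame-level cross entropy would instead need one of those paths chosen in advance as the label.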

Optimization

Learning rate, batch size, initialization, normalization, clipping, and schedulers determine whether gradient updates are stable and large enough to make progress.

Question: What should you check before blaming the model?

Check data labels, leakage, feature normalization, loss scale, gradient norms, learning rate, batch construction, masking, train/eval mode, random seeds, and whether the model can overfit a tiny batch. Most early failures are pipeline failures.

Training Loop

A Minimal Classifier With Advanced Habits

This tiny example is not an ASR system. It is a clean template for habits that matter later: explicit shapes, split metrics, seeded fixtures, and small-batch overfit checks.

import random
import numpy as np
import torch
from torch import nn


def seed_everything(seed=7):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    mps = getattr(torch, "mps", None)
    mps_backend = getattr(torch.backends, "mps", None)
    if (
        mps is not None
        and mps_backend is not None
        and hasattr(mps, "manual_seed")
        and mps_backend.is_available()
    ):
        mps.manual_seed(seed)


class TinyClassifier(nn.Module):
    def __init__(self, in_features, hidden, classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.LayerNorm(hidden),
            nn.Linear(hidden, classes),
        )

    def forward(self, x):
        # x: [batch, features]
        return self.net(x)


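def train_step(model, batch, optimizer):
    """Run one optimization step and return the scalar loss."""
    model.train()  # training-mode behavior for layers such as dropout
    x, y = batch  # x: [batch, features], y: [batch] integer class indices
    logits = model(x)  # [batch, classes]
    assert logits.ndim == 2 and logits.shape[0] == y.shape[0]
    loss = nn.functional.cross_entropy(logits, y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Clip the global gradient norm so one bad batch cannot blow up the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return float(loss.detach())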

Question: Why include a tiny-batch overfit test?

If the model cannot drive loss near zero on a handful of examples, something basic is broken: labels, shapes, masking, optimizer, feature scale, loss wiring, or train/eval mode. This test is cheap and catches failures before expensive experiments.
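A tiny-batch overfit check might look like the sketch below. It is self-contained rather than reusing the classifier above, and the batch size, step count, learning rate, and the `overfit_tiny_batch` helper are all illustrative assumptions.

```python
import torch
from torch import nn


def overfit_tiny_batch(model, x, y, steps=500, lr=1e-2):
    """Train repeatedly on one fixed batch and return (first_loss, last_loss)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    losses = []
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        losses.append(float(loss.detach()))
    return losses[0], losses[-1]


torch.manual_seed(7)
x = torch.randn(16, 8)            # [batch, features]
y = torch.randint(0, 4, (16,))    # [batch]; random labels force pure memorization
model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
first_loss, last_loss = overfit_tiny_batch(model, x, y)
```

A healthy pipeline should drive `last_loss` far below `first_loss` here. If it plateaus near the loss of a uniform guess, suspect the wiring before the architecture.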

Question: How does this map to speech?

Replace fixed vectors with frames or learned audio tokens, add padding masks, and choose a sequence objective. The same discipline remains: verify tensor axes, loss behavior, gradient scale, split metrics, seeded fixtures, and a small overfit run. Seeds reduce run-to-run noise; on Apple Silicon, seed MPS explicitly when the installed PyTorch version exposes the MPS RNG helper. Exact determinism can still vary across PyTorch releases, hardware, kernels, and distributed training settings.
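Padding masks are the first place this mapping goes wrong. The sketch below, written in NumPy with a hypothetical `masked_mean` helper, shows how pooling over padded frames shifts the result for shorter utterances.

```python
import numpy as np


def masked_mean(frames, lengths):
    """Average over valid time steps only.

    frames:  [batch, time, feature], zero-padded along time
    lengths: [batch], number of valid frames per utterance
    """
    batch, time, _ = frames.shape
    mask = np.arange(time)[None, :] < lengths[:, None]      # [batch, time]
    summed = (frames * mask[:, :, None]).sum(axis=1)        # [batch, feature]
    return summed / lengths[:, None]                        # [batch, feature]


frames = np.zeros((2, 4, 1))
frames[0, :4, 0] = 2.0      # utterance 0 fills all 4 frames
frames[1, :2, 0] = 2.0      # utterance 1 has 2 real frames and 2 padded frames
lengths = np.array([4, 2])

naive = frames.mean(axis=1)             # padding drags utterance 1 down to 1.0
pooled = masked_mean(frames, lengths)   # both utterances average to 2.0
```

The same pattern applies to losses and attention: any reduction over the time axis needs the mask, or batch composition silently changes the objective.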

Evaluation

Measure Generalization, Not Just Training Loss

Split data by speaker, session, device, domain, and time when leakage is possible.
Track aggregate metrics and critical slices separately.
Compare against a frozen baseline before claiming progress.
Report confidence intervals or repeated-seed variance for small eval sets.
Keep private audio and transcripts out of Git; store sanitized aggregate results only.
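For the point about small eval sets, a percentile bootstrap is one simple way to attach an interval to a metric. This is a sketch under stated assumptions: per-example 0/1 outcomes, a hypothetical 40-utterance eval set, and default resample counts chosen for illustration.

```python
import random


def bootstrap_interval(correct, n_resamples=2000, alpha=0.05, seed=7):
    """Percentile bootstrap interval for accuracy over per-example 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical eval set: 40 utterances, 34 recognized correctly (85% accuracy).
outcomes = [1] * 34 + [0] * 6
low, high = bootstrap_interval(outcomes)
```

With only 40 examples the interval is wide, which is the point: a one- or two-point accuracy change on a set this small is usually within noise.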

Exam Prompt: Design A Validation Split For Wake-Word Detection

You have recordings from 500 speakers, multiple rooms, and several microphone models. How do you split train, validation, and test data?

Hidden answer: validation design

Split primarily by speaker so identity does not leak. Reserve device and room slices in validation and test, and keep a time-based holdout if recordings were collected over multiple app versions. Report false accepts, false rejects, latency, and noisy/far-field slices. Use a locked test set only for release decisions, not iterative tuning.
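A speaker-disjoint split can be sketched by hashing the speaker ID, which keeps assignments stable as new recordings arrive. The ID format, the 80/10/10 proportions, and the metadata rows below are hypothetical.

```python
import hashlib


def assign_split(speaker_id, val_pct=10, test_pct=10):
    """Map a speaker to train/val/test by hashing the ID, so the choice is stable."""
    digest = hashlib.sha256(speaker_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# Hypothetical metadata rows: (speaker_id, room, microphone).
recordings = [(f"spk{i % 500:03d}", f"room{i % 7}", f"mic{i % 3}") for i in range(2000)]

speakers_by_split = {"train": set(), "val": set(), "test": set()}
for speaker_id, _room, _mic in recordings:
    speakers_by_split[assign_split(speaker_id)].add(speaker_id)
```

Because every recording from a speaker lands in the same split, identity cannot leak. Room and microphone coverage still has to be checked per split afterward, since hashing by speaker does not balance those axes.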

Exam Prompt: Accuracy Improved But Product Quality Dropped

Offline validation accuracy increased, but users report worse dictation quality after release. What hypotheses do you test first?

Hidden answer: investigation path

Check data mismatch, leakage in validation, degraded latency, partial transcript churn, punctuation or normalization changes, entity error rate, accent or device slices, confidence calibration, rollout mix, and downstream UI behavior. Product quality may depend on timing, stability, and edit distance on important words, not global accuracy alone.
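The gap between global accuracy and errors on important words can be shown with a small sketch. The helper names, the example sentence, and the entity list are assumptions for illustration, and matching entities by exact word is a simplification.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1]


def entity_recall(reference, hypothesis, entities):
    """Fraction of important reference words that also appear in the hypothesis."""
    ref_words, hyp_words = set(reference.split()), set(hypothesis.split())
    present = [e for e in entities if e in ref_words]
    if not present:
        return 1.0  # nothing important to get wrong in this utterance
    return sum(e in hyp_words for e in present) / len(present)


reference = "send the report to priya before friday"
hypothesis = "send the report to maria before friday"
wer = word_errors(reference, hypothesis) / len(reference.split())
recall = entity_recall(reference, hypothesis, entities=["priya", "friday"])
```

One wrong word out of seven is a modest word error rate, yet half of the important words are wrong, and the message would go to the wrong person.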

Debugging

First Checks When Training Looks Wrong

Loss Does Not Move

Start with labels, feature scale, learning rate, frozen parameters, masking, and optimizer wiring.

Question: What is the fastest useful experiment?

Overfit 8 to 32 examples with augmentation disabled. If that fails, inspect one batch by hand, print shapes, check logits, check gradients, and confirm labels match the input examples.

Validation Regresses

Look for overfitting, a validation set that recently had leakage removed (so earlier scores were inflated), train/eval mismatch, bad augmentation, or slice-specific failures.

Question: What makes speech validation fragile?

Speakers, microphones, rooms, language mix, noise, silence, and transcript normalization can shift independently. A validation set that misses one of those axes can approve a model that fails in production.
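A per-slice breakdown is a cheap way to see an axis the aggregate hides. The record format and the near-field/far-field counts below are made up for illustration.

```python
from collections import defaultdict


def error_rate_by_slice(records, key):
    """Group eval records by one metadata field and return the error rate per group."""
    totals = defaultdict(lambda: [0, 0])   # slice value -> [errors, count]
    for record in records:
        bucket = totals[record[key]]
        bucket[0] += 0 if record["correct"] else 1
        bucket[1] += 1
    return {name: errors / count for name, (errors, count) in totals.items()}


# Hypothetical eval records: near-field is well covered, far-field is rare and weak.
records = (
    [{"mic": "near", "correct": True}] * 95
    + [{"mic": "near", "correct": False}] * 1
    + [{"mic": "far", "correct": True}] * 2
    + [{"mic": "far", "correct": False}] * 2
)
overall = sum(not r["correct"] for r in records) / len(records)
by_mic = error_rate_by_slice(records, "mic")
```

The overall error rate is 3%, while the far-field slice fails half the time on only four examples. Both the poor rate and the thin coverage are findings worth reporting.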

Study Companion

NotebookLM Review

Use the shared NotebookLM artifact as a companion review for this foundations module when you want an alternate summary or study guide before working through the prompts.

NotebookLM Artifact

Open the shared NotebookLM review artifact


Advanced Practice

Foundations Prompts

Prompt 1: Explain Bias, Variance, And Data Scale For ASR

A candidate model underfits clean speech and performs poorly on noisy speech. Explain what you would change and how you would measure it.

Hidden answer: strong response shape

Separate capacity from data mismatch. Underfitting clean speech suggests optimization, architecture, feature, or capacity issues. Poor noisy performance may require augmentation, better data coverage, denoising, or robust objectives. Measure clean and noisy slices separately, include entity error and latency, and require a baseline comparison before shipping.

Prompt 2: Diagnose A Train-Serving Skew

Offline eval is strong, but production requests show lower confidence and higher correction rate. What do you inspect?

Hidden answer: skew checklist

Compare sample rate, channel handling, loudness normalization, VAD, text normalization, tokenizer version, model artifact hash, dependency versions, padding and masks, runtime precision, feature extraction config, traffic slices, and client microphone changes. Add canaries that run the same public fixture through training and serving paths.
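Part of this checklist can be automated by diffing the configuration each path actually loaded. The sketch below assumes both paths can export their settings as flat dictionaries; the field names and values are hypothetical.

```python
MISSING = "<missing>"


def config_skew(training, serving):
    """Return {key: (training_value, serving_value)} for every field that differs."""
    skew = {}
    for key in sorted(set(training) | set(serving)):
        left = training.get(key, MISSING)
        right = serving.get(key, MISSING)
        if left != right:
            skew[key] = (left, right)
    return skew


# Hypothetical exported settings from each path.
training_config = {"sample_rate": 16000, "n_mels": 80, "tokenizer": "v3", "precision": "fp32"}
serving_config = {"sample_rate": 8000, "n_mels": 80, "tokenizer": "v3", "precision": "fp16"}

skew = config_skew(training_config, serving_config)
```

An empty result is necessary but not sufficient. The fixture canary still matters, because two paths with identical settings can differ in code, for example in resampling or padding behavior.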