Production Playbook

Speech Model Release And Rollback

An advanced release checklist for ASR, TTS, speech embeddings, and speech-to-speech systems. Use it for interviews, design reviews, and real production readiness.

Artifact Contract

What Must Ship With A Model

A model is more than its weights. A production release needs every input, transform, dependency, metric, and rollback path required to reproduce its behavior.

  1. Identity: model name, semantic version, owner, training code commit, serving code commit, and release date.
  2. Inputs: sample rate, channels, codec, max duration, language assumptions, text normalization, tokenizer, and feature config.
  3. Outputs: transcript schema, timestamps, confidence, speaker labels, audio format, or embedding dimension.
  4. Eval: baseline comparison, slice metrics, latency percentiles, memory footprint, cost estimate, and known regressions.
  5. Controls: canary plan, rollback command, feature flag, data retention policy, and on-call owner.

Question: Why is feature config part of the artifact?

Speech models are sensitive to preprocessing. A change to the mel-bin count, a normalization rule, the VAD threshold, a codec setting, or the tokenizer can change accuracy, latency, and failure modes without changing the weights. Release reviews should therefore treat preprocessing as part of the model contract.
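One way to make preprocessing part of the contract is to fingerprint the feature config and store the digest in the release metadata. The following is a minimal sketch under stated assumptions: the field names (sample_rate, n_mels, vad_threshold) and the function name config_fingerprint are hypothetical, and a real config would carry more fields.

```python
import hashlib
import json


def config_fingerprint(feature_config: dict) -> str:
    """Return a stable SHA-256 hex digest of a preprocessing config."""
    # sort_keys plus fixed separators make the serialization independent
    # of key order and whitespace, so equal configs hash identically.
    canonical = json.dumps(feature_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical configs; the keys are illustrative, not a required schema.
released = {"sample_rate": 16000, "n_mels": 80, "vad_threshold": 0.5}
serving = {"n_mels": 80, "vad_threshold": 0.5, "sample_rate": 16000}
drifted = {**released, "n_mels": 128}

# Same settings in a different key order produce the same fingerprint.
same = config_fingerprint(released) == config_fingerprint(serving)
# A changed mel-bin count produces a different fingerprint.
changed = config_fingerprint(released) != config_fingerprint(drifted)
```

A serving process could compare its own fingerprint against the one recorded at release time and refuse to start on a mismatch, which turns silent preprocessing drift into a visible deployment failure.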

Promotion Gates

Block Releases That Cannot Be Operated

Quality Gate

Compare against the current production model on global and critical slices.

Hidden answer: ASR gate example

Require no global WER regression, no critical entity error regression, and no material loss on accents, noise buckets, domain vocabulary, short utterances, long calls, and code-switched speech. Add manual review for any slice with high business risk.
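The gate above can be sketched as a per-slice comparison. This is an illustration only: the slice names, the WER values, and the 0.005 absolute tolerance are made up, and the function name quality_gate is hypothetical. It assumes zero tolerance on the global metric and on critical slices, and a small tolerance elsewhere as a stand-in for "no material loss".

```python
def quality_gate(baseline, candidate, critical_slices, tolerance=0.0):
    """Return the slices that block the release as (name, old_wer, new_wer)."""
    failures = []
    for name in sorted(baseline):
        if name not in candidate:
            # A slice that was not evaluated cannot be approved.
            failures.append((name, baseline[name], None))
            continue
        strict = name == "global" or name in critical_slices
        allowed = 0.0 if strict else tolerance
        if candidate[name] - baseline[name] > allowed:
            failures.append((name, baseline[name], candidate[name]))
    return failures


# Illustrative WER values per slice (lower is better).
baseline = {"global": 0.100, "accent_a": 0.150, "noisy": 0.200, "short_utterance": 0.120}
candidate = {"global": 0.093, "accent_a": 0.152, "noisy": 0.230, "short_utterance": 0.118}

failures = quality_gate(baseline, candidate, critical_slices={"short_utterance"}, tolerance=0.005)
blocked_slices = [name for name, _, _ in failures]
```

In this made-up example the global WER improves, yet the noisy slice regresses beyond the tolerance and blocks the release, which is the case a global-only gate would miss.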

Serving Gate

Prove the release can satisfy latency, memory, and concurrency constraints.

Hidden answer: Load test checklist

Measure p50, p95, p99, queue time, first partial, final latency, first audio byte, GPU memory, batch size, timeout rate, retry rate, and cost per audio hour. Run warm and cold tests because speech users feel first-turn latency sharply.
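As a small illustration of why percentiles matter more than averages here, the sketch below computes nearest-rank percentiles over a list of latency samples. The sample values are invented, and nearest-rank is only one of several percentile definitions; a metrics backend may interpolate differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented final-latency samples in milliseconds, with one cold-start outlier.
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 900, 126, 131]

summary = {
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
# summary == {"p50": 130, "p95": 900, "p99": 900}
```

The median looks healthy while the tail exposes the single slow request, which is the pattern cold-start tests are meant to surface.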

Privacy Gate

Verify that raw audio, transcripts, voiceprints, and API tokens are not logged or committed.

Hidden answer: Privacy-safe observability

Prefer trace IDs, durations, language probabilities, confidence distributions, redacted text, aggregate acoustic quality signals, and explicitly opted-in debug samples. Do not store private audio by default just because it makes debugging easier.
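One way to enforce this is an allowlist at the logging boundary, so that new fields are dropped unless someone deliberately approves them. A minimal sketch, assuming hypothetical field names and a hypothetical helper called safe_log_record:

```python
# Hypothetical allowlist; anything not named here never reaches the log.
ALLOWED_FIELDS = {"trace_id", "duration_ms", "language_prob", "confidence", "model_version"}


def safe_log_record(event: dict) -> dict:
    """Keep only allowlisted fields from a request event."""
    return {key: value for key, value in event.items() if key in ALLOWED_FIELDS}


# Illustrative event that carries private payloads alongside safe metadata.
event = {
    "trace_id": "t-123",
    "duration_ms": 840,
    "confidence": 0.91,
    "transcript": "call me at five five five ...",
    "audio_bytes": b"\x00\x01",
}

record = safe_log_record(event)
```

An allowlist fails closed: a newly added field such as a raw transcript is excluded by default, whereas a denylist would log it until someone noticed.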

Rollback Gate

Demonstrate that the old path can be restored before user impact grows.

Hidden answer: Rollback readiness

Keep the previous model artifact available, preserve compatible request and response schemas, document the flag or routing change, and test rollback in staging. If migration changes state or outputs, define a forward-fix plan too.
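Schema compatibility can be checked mechanically before a rollback is needed. The sketch below is an assumption-laden illustration: it treats a response schema as a flat set of field names, and the helper name rollback_breaks and the example fields are hypothetical.

```python
def rollback_breaks(old_fields, new_fields, fields_clients_read):
    """List fields that clients read from the new response but the old model never emits."""
    return sorted((set(fields_clients_read) & set(new_fields)) - set(old_fields))


# Illustrative response fields for the previous and candidate models.
old_fields = {"text", "confidence"}
new_fields = {"text", "confidence", "word_timestamps"}

# Clients that started depending on a new field would break after rollback.
broken = rollback_breaks(old_fields, new_fields, {"text", "word_timestamps"})
# Clients that only read shared fields are safe to roll back.
safe = rollback_breaks(old_fields, new_fields, {"text", "confidence"})
```

A non-empty result suggests either keeping the new field optional in clients or preparing the forward-fix plan mentioned above.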

Rollout

Canary By Risk, Not Just Percentage

Release Plan: New Streaming ASR Model

The candidate model improves offline WER by 7 percent relative, but uses more memory and changes partial transcript behavior. Propose a production rollout.

Hidden answer: Advanced rollout plan

Start with shadow traffic to compare transcripts, latency, and partial churn without user exposure. Then canary low-risk tenants, short calls, and one language before expanding. Gate each step on WER proxies, entity errors, p95 latency, p99 latency, queue depth, OOMs, partial churn, cancellation rate, and support tickets. Keep old workers warm until the canary is stable.
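The per-step gating described above might be sketched as a threshold check that runs before each traffic expansion. The metric names and limits are invented for illustration, and the function name canary_decision is hypothetical. The sketch treats a missing metric as a breach, on the assumption that a step should not advance on incomplete data.

```python
def canary_decision(metrics, limits):
    """Return ("advance", []) or ("hold", [breached metric names])."""
    breaches = sorted(
        name
        for name, limit in limits.items()
        # A missing metric counts as a breach rather than a pass.
        if metrics.get(name, float("inf")) > limit
    )
    return ("advance", []) if not breaches else ("hold", breaches)


# Invented upper limits for one canary step.
limits = {
    "p95_latency_ms": 800,
    "p99_latency_ms": 1500,
    "oom_count": 0,
    "partial_churn_rate": 0.15,
}
# Invented observations from the canary cohort.
observed = {
    "p95_latency_ms": 640,
    "p99_latency_ms": 1720,
    "oom_count": 0,
    "partial_churn_rate": 0.11,
}

decision, breached = canary_decision(observed, limits)
```

Here p95 passes while p99 breaches its limit, so the step holds. Whether a hold escalates to a rollback is a policy choice that this sketch leaves out.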

Release Plan: New TTS Voice

A new voice has better preference-test scores but a slower first audio byte. The product goal is conversational responsiveness.

Hidden answer: Tradeoff framing

Separate quality-sensitive and latency-sensitive use cases. Canary on non-critical flows first, stream by sentence, keep warm pools, cache fixed prompts, and set an explicit first-audio-byte budget. Roll back or route to the old voice when p95 latency breaks the conversation SLO, even if offline preference scores look better.
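The routing rule in that answer can be sketched as a budget check on recent first-audio-byte samples. Everything named here is an assumption for illustration: the voice identifiers, the 500 ms budget, the nearest-rank p95, and the function name choose_voice.

```python
import math


def choose_voice(first_audio_byte_ms, budget_ms, new_voice="voice_new", old_voice="voice_old"):
    """Route to the new voice only while its p95 first-audio-byte latency fits the budget."""
    if not first_audio_byte_ms:
        # With no evidence about the new voice, stay on the known-good path.
        return old_voice
    ordered = sorted(first_audio_byte_ms)
    rank = max(1, math.ceil(95 * len(ordered) / 100))
    p95 = ordered[rank - 1]
    return new_voice if p95 <= budget_ms else old_voice


# Invented windows of 20 samples each, in milliseconds.
healthy_window = [300] * 19 + [900]
degraded_window = [300] * 18 + [900, 900]

healthy_choice = choose_voice(healthy_window, budget_ms=500)
degraded_choice = choose_voice(degraded_window, budget_ms=500)
```

One slow sample in twenty leaves p95 within budget, while two push it over and route traffic back to the old voice, regardless of offline preference scores.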

Production Debugging

First 30 Minutes Of An Incident

  1. Scope: identify affected tenants, languages, devices, model versions, and traffic paths.
  2. Stabilize: pause rollout, reduce traffic, disable shadow load, or roll back if SLO impact is active.
  3. Compare: replay sanitized fixtures through old and new paths with the same preprocessing and decode settings.
  4. Localize: split queueing, preprocessing, model inference, decoding, postprocessing, network, and UI rendering time.
  5. Learn: add the missing metric, test, fixture, or release gate that would have caught the issue earlier.

Question: What mistake do even experienced teams commonly make during incidents?

Optimizing before scoping. A strong response first protects users and narrows the blast radius. Only then does it chase root cause. The difference matters because a model bug, a batcher bug, a network issue, and a UI partial-rendering bug can all look like "bad ASR" to users.
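The Localize step can be illustrated by comparing per-stage timings between the old and new paths for the same replayed fixture. The stage names follow the list above, the timings are invented, and the helper name largest_regression is hypothetical.

```python
def largest_regression(old_ms, new_ms):
    """Return (stage, delta_ms) for the stage whose time grew the most."""
    # A stage missing from the old path is treated as having taken 0 ms there.
    deltas = {stage: new_ms[stage] - old_ms.get(stage, 0.0) for stage in new_ms}
    stage = max(deltas, key=deltas.get)
    return stage, deltas[stage]


# Invented per-stage timings in milliseconds for one sanitized fixture.
old_path = {"queue": 20, "preprocess": 15, "inference": 180, "decode": 40, "postprocess": 10}
new_path = {"queue": 310, "preprocess": 15, "inference": 185, "decode": 42, "postprocess": 10}

stage, delta_ms = largest_regression(old_path, new_path)
```

In this example the model itself barely changed; the regression sits in queueing, which users would still report as slow or bad ASR.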

Mock Exam

Release Review Prompts

Prompt 1: Approve Or Block?

A speech embedding model improves retrieval recall by 3 percent, but doubles embedding latency and slightly changes vector dimensionality. Should it ship?

Hidden answer: Decision criteria

Do not approve without a migration plan. Changing dimensionality can require reindexing, dual writes, backfills, compatibility shims, and rollback planning. Compare product value from recall against latency, infrastructure cost, index rebuild time, retrieval quality by slice, and whether clients can handle mixed versions.
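To make the mixed-version risk concrete, the sketch below uses a toy in-memory index that rejects vectors from the wrong model version or dimension, plus a dual-write helper for the migration window. The class, the version labels, and the dimensions are all hypothetical; a real vector store would enforce this through its own schema or collection settings.

```python
class VersionedIndex:
    """Toy index that accepts vectors from exactly one model version and dimension."""

    def __init__(self, model_version, dim):
        self.model_version = model_version
        self.dim = dim
        self.vectors = {}

    def add(self, vector_id, vector, model_version):
        if model_version != self.model_version:
            raise ValueError(f"expected {self.model_version}, got {model_version}")
        if len(vector) != self.dim:
            raise ValueError(f"expected dim {self.dim}, got {len(vector)}")
        self.vectors[vector_id] = list(vector)


# Separate indexes per version keep old and new vectors from ever being compared.
old_index = VersionedIndex("emb-v1", dim=4)
new_index = VersionedIndex("emb-v2", dim=6)


def dual_write(vector_id, old_vector, new_vector):
    """Write both representations so either index can serve during migration or rollback."""
    old_index.add(vector_id, old_vector, "emb-v1")
    new_index.add(vector_id, new_vector, "emb-v2")


dual_write("utt-001", [0.1] * 4, [0.2] * 6)
```

Keeping the old index populated until the new one is validated is what preserves the rollback path; the cost is double storage and write load during the migration.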

Prompt 2: Design The CI/CD Checks

You own a repository that serves ASR and TTS models. What checks run before merging model wrapper changes?

Hidden answer: Practical checklist

Include unit tests for schema, sample-rate handling, empty audio, long audio, invalid codec, timeout cancellation, and deterministic fixtures. Add golden-output tests for sanitized samples, latency smoke tests, dependency lock checks, secret scanning, artifact metadata validation, container build, and a staging canary with rollback verification.
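A few of the unit tests named above might look like the following. The validate_audio function is a hypothetical stand-in for the wrapper's input checks, and the 16 kHz rate and 30 second limit are assumptions chosen for the example.

```python
import unittest


class AudioValidationError(ValueError):
    """Raised when an input clip violates the model's input contract."""


def validate_audio(samples, sample_rate, expected_rate=16000, max_seconds=30.0):
    """Return the clip duration in seconds, or raise if the clip is unusable."""
    if len(samples) == 0:
        raise AudioValidationError("empty audio")
    if sample_rate != expected_rate:
        raise AudioValidationError(f"expected {expected_rate} Hz, got {sample_rate} Hz")
    duration = len(samples) / sample_rate
    if duration > max_seconds:
        raise AudioValidationError(f"audio longer than {max_seconds} s")
    return duration


class ValidateAudioTest(unittest.TestCase):
    def test_valid_clip_returns_duration(self):
        self.assertEqual(validate_audio([0.0] * 16000, 16000), 1.0)

    def test_empty_audio_is_rejected(self):
        with self.assertRaises(AudioValidationError):
            validate_audio([], 16000)

    def test_wrong_sample_rate_is_rejected(self):
        with self.assertRaises(AudioValidationError):
            validate_audio([0.0] * 8000, 8000)

    def test_long_audio_is_rejected(self):
        with self.assertRaises(AudioValidationError):
            validate_audio([0.0] * (16000 * 31), 16000)
```

Saved as a module, these tests can be run with python -m unittest. Golden-output and latency smoke tests would sit alongside them but need real fixtures, so they are not sketched here.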