Prompt 1: Build A Training Dataset Pipeline For Streaming ASR
You need a weekly ASR training snapshot from consented call audio. The model must improve noisy-device and accented-speech slices without storing unnecessary private content. Design the pipeline.
Hidden answer: advanced design outline
Start with consent and retention policy, then define aggregate-only discovery metrics before selecting clips. Store durable IDs, locale, device class, channel, noise bucket, duration, label status, and transform versions, not casual raw exports. Run privacy filters and redaction before labeling. Stratify sampling by failure slices, keep holdout users out of training, version label instructions, and produce a snapshot manifest that ties every row to audio lineage, transcript version, VAD version, and evaluation eligibility.