Speech Safety, Privacy, And Security

Threat Model

Speech Systems Expand The Attack Surface

Audio systems inherit normal ML risks, then add speaker identity, acoustic context, transcription uncertainty, replay attacks, consent, and the emotional impact of generated voices.

Spoken Prompt Injection

A user or background speaker says instructions that try to override policy, reveal private context, or change tool behavior.

Hidden answer: strong mitigation

Treat ASR output as untrusted user input. Keep system and tool policy outside the transcript, classify tool intent separately, require explicit confirmations for sensitive actions, evaluate retrieval grounding, and log sanitized decision traces rather than raw audio by default.

Voice Impersonation

A generated or replayed voice attempts to pass as a trusted person, approve a transaction, or access private data.

Hidden answer: strong mitigation

Do not treat voice likeness as authentication. Use separate factors, liveness and replay checks where appropriate, risk-based step-up confirmation, consented enrollment, anti-spoof evals, and clear product boundaries around what a cloned voice may do.

PII In Audio And Transcripts

Raw audio, transcripts, embeddings, traces, and labels may contain private names, addresses, health data, or account facts.

Hidden answer: strong mitigation

Design retention before collection. Prefer derived metrics, explicit opt-in review pools, scoped access, redaction, encryption, dataset lineage, deletion workflows, and synthetic fixtures for CI. A debugging need is not a blanket reason to store raw audio.

TTS Abuse And Harmful Output

TTS can produce deceptive calls, harassment, unsafe instructions, or brand-damaging speech if generation is under-controlled.

Hidden answer: strong mitigation

Gate text before synthesis, restrict high-risk voices, maintain voice owner consent, rate-limit abuse patterns, and create rapid takedown and rollback paths. Use disclosure, provenance, or watermark signals where suitable, but do not rely on them as the only fraud or impersonation control. Measure abuse response latency as an operational metric.

Control Plane

Build Safety Into Release Gates

Safety should not be a final checklist after model quality passes. It should be part of dataset contracts, eval design, CI/CD, monitoring, incident response, and cost-aware serving.

Data contract: define consent, retention, redaction, allowed features, labeling access, and deletion behavior.
Model contract: document input modalities, output risks, unsupported use cases, safety filters, and fallback modes.
Eval contract: include prompt injection, replay, PII leakage, toxic synthesis, approved aggregate language slices, and noisy-background cases.
Serving contract: isolate tenants, version policies, record sanitized traces, enforce rate limits, and keep rollback fast.
Incident contract: define severity, owner, mitigation lever, user communication path, and evidence preservation rules.

Question: What is the difference between model safety and system safety?

Model safety asks whether the model behaves acceptably for tested inputs. System safety asks whether the full product remains safe when transcription is wrong, retrieval is stale, policies change, tools are available, attackers adapt, latency spikes, and operators need to debug without exposing private data.

Coding Labs

Small Safety Utilities

These exercises are deliberately compact. They train the habit of turning abstract safety concerns into inspectable controls and tests.

Lab 1: Transcript Redaction Gate

Given a transcript and a list of sensitive patterns, return a redacted transcript and a flag saying whether storage requires restricted handling.

Hidden answer: invariant, tests, and Python solution

Invariant: each sensitive detector is validated and applied to the original transcript before persistence, overlapping matches are redacted as one span, and any match upgrades handling. Test no matches, repeated matches, overlapping categories, invalid detector labels or patterns, and transcripts that should be discarded rather than stored.

import re


def redact_transcript(text, detectors):
    spans = []
    for label, pattern in detectors:
        if not re.fullmatch(r"[A-Z_]+", label):
            raise ValueError("detector labels must be trusted marker names")
        regex = re.compile(pattern)
        spans.extend((m.start(), m.end(), label) for m in regex.finditer(text))

    if not spans:
        return text, False

    spans.sort(key=lambda span: (span[0], -span[1]))
    merged = []
    for start, end, label in spans:
        if not merged or start > merged[-1][1]:
            merged.append([start, end, {label}])
        else:
            merged[-1][1] = max(merged[-1][1], end)
            merged[-1][2].add(label)

    parts = []
    cursor = 0
    for start, end, labels in merged:
        parts.append(text[cursor:start])
        parts.append("[REDACTED_" + "_".join(sorted(labels)) + "]")
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), True


detectors = [
    ("EMAIL", r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    ("PHONE", r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
]

Lab 2: Sensitive Tool Confirmation

Given a parsed voice intent, decide whether the assistant may execute it, needs confirmation, or must refuse.

Hidden answer: invariant, tests, and Python solution

Invariant: high-risk action classes require explicit confirmation from trusted session state, not a flag parsed from the transcript, and each confirmation is bound to the pending action, not just to an ID that could be replayed with a different command. Disallowed actions are refused even when the transcript sounds confident. Test background speaker commands, money movement, account deletion, mismatched confirmations, replayed confirmation IDs, and harmless queries.

def tool_decision(intent, policy, session):
    action = intent["action"]
    if action in policy["blocked"]:
        return "refuse"
    if action in policy["requires_confirmation"]:
        expected = session.get("pending_confirmation_id")
        confirmed = session.get("confirmed_confirmation_id")
        pending_action = session.get("pending_action")
        confirmed_action = session.get("confirmed_action")
        used_confirmations = session.get("used_confirmation_ids", set())
        if (
            intent.get("speaker") == "active_user"
            and expected
            and confirmed == expected
            and confirmed not in used_confirmations
            and pending_action == action
            and confirmed_action == action
            and intent.get("confirmation_id") == expected
        ):
            return "execute"
        return "confirm"
    return "execute"


policy = {
    "blocked": {"share_secret", "impersonate_person"},
    "requires_confirmation": {"send_message", "delete_file", "place_order"},
}

Lab 3: Abuse Spike Detector

Given per-window counts for blocked TTS requests, detect a possible abuse spike while ignoring tiny sample sizes.

Hidden answer: invariant, tests, and Python solution

Invariant: validate aggregate telemetry first, then alert only when volume is large enough and the blocked fraction meaningfully exceeds a valid baseline fraction. Test impossible counters, invalid baselines, zero traffic, low volume, high volume with normal rate, and high volume with a sharp blocked-rate increase.

def abuse_spike(window, baseline_rate, min_requests, multiplier):
    total = window["total"]
    blocked = window["blocked"]
    if total < 0 or blocked < 0 or blocked > total:
        raise ValueError("blocked and total must be valid aggregate counts")
    if baseline_rate < 0 or baseline_rate > 1 or multiplier <= 0 or min_requests < 0:
        raise ValueError("baseline, multiplier, and threshold must be valid")
    if total < min_requests:
        return False
    if baseline_rate <= 0:
        return blocked > 0
    return (blocked / total) >= baseline_rate * multiplier

Interview Prompts

Advanced Safety And Privacy Questions

Prompt 1: A Voice Agent Executes Background Instructions

A user reports that a support agent followed instructions spoken by a TV in the background. How do you triage and prevent recurrence?

Hidden answer: strong response

Mitigate first by disabling or confirming sensitive tools, then inspect sanitized traces for ASR segments, speaker labels, intent confidence, tool policy version, and confirmation state. Prevent recurrence with active-speaker checks, tool-risk tiers, confirmation prompts, injection evals, and a canary gate that includes noisy background audio.

Prompt 2: A Customer Requests Deletion Of Their Voice Data

What should a production speech platform delete, and what metadata may be kept?

Hidden answer: strong response

Delete raw audio, transcripts, labels, embeddings, derived examples, and model-training references linked to the user according to the retention contract. Keep only permitted aggregate metrics and audit records that do not reconstruct the user's content. For checkpoints already trained on the data, record affected lineage, stop future use, and retrain, unlearn, or replace the model when policy or law requires it. The answer should mention lineage, backups, downstream datasets, and proof of deletion.

Prompt 3: A TTS Voice Is Used For Fraudulent Calls

Your platform detects a spike in generated calls impersonating a public-facing employee. What actions do you take?

Hidden answer: strong response

Rate-limit or disable the abusive route, preserve evidence, notify safety/legal owners, revoke the voice if consent or policy is violated, add detection rules, and review gaps in voice enrollment, output filtering, and abuse monitoring. Measure time to mitigation and ship regression tests before re-enabling risky paths.