Research Watch

Full-Duplex Speech-To-Speech Research Update

An advanced study report on the latest full-duplex speech-to-speech direction: models that listen while speaking, handle barge-in and backchannels, and force new serving and evaluation contracts.

Updated June 16, 2026

The Field Is Moving From Turn-Based Voice To Synchronous Dialogue

The current research trend is clear: spoken agents are moving away from VAD-bounded listen-then-speak loops toward systems that maintain a live user stream and a live assistant stream at the same time. The hard problems are no longer only ASR quality or TTS quality; they are timing, interruption, backchanneling, semantic coherence under overlap, action safety, and serving reliability.

  1. Native duplex: one model jointly tracks user audio, assistant audio, and dialogue state.
  2. Micro-turn cascade: ASR, LLM, and TTS remain modular but exchange short streaming chunks instead of full turns.
  3. Role-conditioned duplex: the agent must keep persona, task, and voice constraints while reacting in real time.
  4. Action-aware duplex: planning and tool calls are emitted alongside speech on a shared timeline.
  5. Retrieval-aware duplex: factuality must improve without breaking real-time conversational flow.
  6. Multimodal duplex: speech, vision, tools, and slower reasoning layers must coordinate without blocking live interaction.
  7. Benchmark pressure: interruption, overlap, backchannel, privacy, temporal instruction following, and multi-round tool-use benchmarks are becoming mandatory.

Model Landscape

What Changed Recently

Moshi Established The Open Baseline

Moshi framed spoken dialogue as speech-to-speech generation with parallel user and assistant streams. Its inner-monologue design predicts time-aligned text before audio tokens, helping preserve linguistic quality while keeping real-time full-duplex behavior.

SyncLLM Formalized Clock-Synchronous LLMs

SyncLLM treats full-duplex dialogue as a timing problem: the model must generate the next chunk before the current chunk ends. This makes latency part of the model contract, not only an inference optimization after training.
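
The timing contract can be made concrete with a small check. The sketch below is illustrative only and is not SyncLLM's implementation: `missed_deadlines`, `real_time_factor`, and the measured generation times are hypothetical, and it assumes chunk i+1 is generated while chunk i plays.

```python
def missed_deadlines(gen_times_ms, chunk_ms):
    """Return indices of chunks whose generation exceeded one chunk of playback.

    Assumption: the next chunk is generated while the current one plays,
    so generation must finish within chunk_ms to avoid an audible gap.
    """
    return [i for i, gen_ms in enumerate(gen_times_ms) if gen_ms > chunk_ms]


def real_time_factor(gen_times_ms, chunk_ms):
    """Mean generation time divided by chunk duration.

    A value below 1.0 means the model keeps up on average, but individual
    chunks can still miss their deadline, which is why both are reported.
    """
    return sum(gen_times_ms) / (len(gen_times_ms) * chunk_ms)
```

Reporting missed deadlines alongside the average matters because a single late chunk is audible even when the mean looks healthy.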

PersonaPlex Adds Role And Voice Control

PersonaPlex targets natural full-duplex behavior while allowing text role prompts and voice conditioning. This moves duplex systems closer to deployable customer-service, tutor, and companion agents where role adherence is part of product quality.

DuplexCascade Keeps The LLM-Centered Stack

DuplexCascade argues that cascaded ASR-LLM-TTS systems can still be strong if they abandon brittle VAD endpoints and operate through chunk-wise micro-turns with conversational control tokens.

Seeduplex Signals Production Native Duplex

ByteDance describes Seeduplex as a native full-duplex Speech LLM focused on interference suppression and adaptive endpoint behavior. For study purposes, treat it as evidence that large-scale products are moving past half-duplex voice UX.

BayLing-Duplex Is The Newest Conversion Recipe

BayLing-Duplex starts from a public GLM-4-Voice checkpoint and adds only a few dialogue-state tokens, turning listen, speak, and stop decisions into ordinary autoregressive prediction. The reported result is strong interruption and turn-taking without a separate turn-taking module.

DuplexOmni Splits Interaction From Thinking

DuplexOmni extends the duplex frame to speech plus video and keeps a low-latency interaction layer separate from a slower, pluggable thinking layer. The serving lesson is to let perception and speech continue while deeper reasoning, tools, or external agents return results asynchronously.

Qwen3.5-Omni Raises The Omni Baseline

Qwen3.5-Omni is not primarily a duplex paper, but its large-scale speech, video, text, long-context, and speech-generation coverage raises the baseline for what full-duplex systems will be compared against. Treat it as an adjacent production trend: native duplex needs natural timing, but it also has to compete with stronger multilingual and multimodal perception stacks.

Nemotron 3 VoiceChat Makes Deployment Concrete

NVIDIA positions Nemotron 3 VoiceChat as a 12B end-to-end, real-time full-duplex speech-to-speech model with open, inspectable weights. That makes the research trend more operational: teams can compare unified native S2S serving against cascaded stacks under enterprise latency, cost, and observability constraints.

Raon-SpeechChat Adds Another Open Deployment Target

KRAFTON's Raon-SpeechChat-9B is another public full-duplex speech language model built for simultaneous listening and speaking. Its release strengthens the practical comparison set for teams that want to test open native-duplex models against API systems and modular cascades.

DuplexSLA Adds Actions To The Clock

DuplexSLA introduces a speech-language-action formulation where assistant audio and structured action tokens share a 160 ms chunk timeline. This points toward real-time voice agents that can plan and use tools without waiting for a clean end-of-turn boundary.
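
As a rough illustration of a shared chunk timeline, the sketch below buckets timestamped audio and action events into 160 ms slots. It is bookkeeping only and says nothing about how DuplexSLA tokenizes audio or actions; `chunk_index`, `align_to_timeline`, and the event tuples are hypothetical.

```python
CHUNK_MS = 160  # chunk duration reported for DuplexSLA


def chunk_index(time_ms, chunk_ms=CHUNK_MS):
    """Map a timestamp in milliseconds to its chunk slot on the timeline."""
    return time_ms // chunk_ms


def align_to_timeline(audio_events, action_events, chunk_ms=CHUNK_MS):
    """Group (time_ms, payload) events into per-chunk slots.

    Returns {chunk: {"audio": [...], "action": [...]}} ordered by chunk,
    so speech and actions that fall in the same slot can be inspected together.
    """
    timeline = {}
    for kind, events in (("audio", audio_events), ("action", action_events)):
        for time_ms, payload in events:
            slot = timeline.setdefault(
                chunk_index(time_ms, chunk_ms), {"audio": [], "action": []}
            )
            slot[kind].append(payload)
    return dict(sorted(timeline.items()))
```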

UAF Moves Some Difficulty To The Front End

Unified audio front-end work reframes VAD, turn detection, speaker recognition, ASR, and state control as a single streaming sequence prediction problem. This is a practical route when the back-end model is not yet fully native duplex.

Turn Detection Is Becoming Its Own Layer

FastTurn and JAL-Turn show a parallel path to usable duplex: improve low-latency hold, shift, backchannel, and interruption decisions without replacing the whole model. They fuse acoustic and streaming semantic cues so a cascaded system can behave more like a synchronous one while keeping ASR, policy, and TTS separable.

TurnGuide Reopens Text Guidance

TurnGuide argues that end-to-end full-duplex models still benefit from turn-level text guidance if the guidance is aligned carefully with speech timing. This is a useful counterweight to purely audio token designs: text can improve coherence, but only if it does not break the clock.

Interactivity Alignment Becomes Post-Training

Kyutai's June 2026 work treats timing behavior as an alignment problem, not only a pretraining artifact. It uses RL rewards for pause handling, turn-taking, backchanneling, and user interruption, plus a response-quality reward to avoid making the model more interactive but less useful.

MoshiRAG Adds Knowledge Without Losing The Floor

MoshiRAG keeps a compact full-duplex interface and performs selective retrieval asynchronously. The key product lesson is that retrieval can run during the natural delay before the core answer, preserving conversational timing while improving factuality.

Serving Implications

Full Duplex Changes The Launch Review

A rigorous serving review must treat full duplex as a concurrent streaming product, not as ASR plus LLM plus TTS glued together. Every live session has at least two audio timelines, a dialogue state, and possibly an action stream that can trigger external effects.

Latency Budget

Track user audio chunk ingest, semantic update, assistant decision, first assistant audio, barge-in response, assistant stop, and overlap recovery. Report p50/p95/p99 by event type, not only by whole turn.
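
Per-event reporting can be sketched as follows. This is a minimal example that assumes latency samples arrive as `(event_type, latency_ms)` pairs and uses the nearest-rank percentile definition; the function names and event-type strings are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples):
    """Group (event_type, latency_ms) samples and report p50/p95/p99 per event type."""
    by_type = defaultdict(list)
    for event_type, latency_ms in samples:
        by_type[event_type].append(latency_ms)
    return {
        event_type: {f"p{p}": percentile(values, p) for p in (50, 95, 99)}
        for event_type, values in by_type.items()
    }
```

Grouping before computing percentiles keeps a slow barge-in response from being hidden by a large number of fast chunk-ingest events.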

Concurrency Budget

Duplex sessions occupy read and write streams simultaneously. Model GPU memory, codec state, session state, audio buffers, and action queues as active resources for the whole call, not only during response generation.

Evaluation Gates

Add interruption success, false interruption rate, backchannel appropriateness, overlap handling, assistant stop latency, semantic coherence after interruption, and voice continuity to ordinary ASR, TTS, and task-success gates.

Safety Gates

Tool calls and spoken output can overlap. Gate sensitive actions on confirmed intent, stable user state, and explicit confirmation where needed. A partial or interrupted utterance must not trigger a high-risk action.
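
One way to express this gate is a small predicate evaluated before any tool call is dispatched. The sketch below is an assumption about how such a policy could look, not a prescribed design: `ActionRequest`, its fields, and `may_execute` are hypothetical, and it lets low-risk actions through without confirmation.

```python
from dataclasses import dataclass


@dataclass
class ActionRequest:
    name: str
    high_risk: bool
    utterance_complete: bool    # the user finished the utterance carrying the intent
    interrupted: bool           # the utterance was cut off or overlapped by a barge-in
    explicitly_confirmed: bool  # the user confirmed the action explicitly


def may_execute(req):
    """Return True only if the action is safe to dispatch under this policy."""
    if not req.high_risk:
        # Assumed policy: low-risk actions do not need the stricter checks.
        return True
    # High-risk actions need a complete, uninterrupted utterance and confirmation.
    return req.utterance_complete and not req.interrupted and req.explicitly_confirmed
```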

Privacy Gates

Always-on full-duplex models expose more than transcripts. Recent privacy work reports speaker-identity leakage from hidden states in end-to-end models, so launch reviews should include voiceprint leakage tests, anonymization options, and retention limits for embeddings, codec tokens, and internal activations.

Retrieval Gates

Retrieval-augmented duplex models need event-level tracing: when the model detects a knowledge-seeking query, when retrieval starts, whether speech fillers or acknowledgements cover the wait, whether the final answer cites grounded state, and whether stale retrieval is cancelled after interruption.

Async Thinking Gates

Multimodal duplex systems need explicit contracts for slower reasoning paths: when background thinking starts, what partial answer can be spoken before it returns, how updates are merged into ongoing speech, and how stale tool or reasoning results are cancelled after user correction.
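
A minimal sketch of such a contract, using `asyncio` from the Python standard library: the interaction layer speaks a filler immediately, starts background thinking, and cancels the stale task if the user corrects the request. `slow_thinking`, `respond`, and the filler text are hypothetical stand-ins, and the sleep simulates a slow reasoning or tool path.

```python
import asyncio


async def slow_thinking(query, delay_s):
    """Stand-in for a slower reasoning or tool path (hypothetical)."""
    await asyncio.sleep(delay_s)
    return f"detailed answer for {query}"


async def respond(query, corrected_query=None, delay_s=0.05):
    """Speak a filler right away, then merge in the background result.

    If the user corrects the request, the stale task is cancelled and
    thinking restarts for the corrected query.
    """
    spoken = ["Let me check that."]  # partial answer spoken before thinking returns
    task = asyncio.create_task(slow_thinking(query, delay_s))
    if corrected_query is not None:
        task.cancel()  # the stale result must never reach the speech stream
        try:
            await task
        except asyncio.CancelledError:
            pass
        task = asyncio.create_task(slow_thinking(corrected_query, delay_s))
    spoken.append(await task)
    return spoken
```

In a real system the correction would arrive as an event while the task is running; the sketch collapses that into an argument to keep the cancellation path easy to follow.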

Open-Weight Deployment Gates

Open full-duplex S2S weights shift risk from API integration to platform ownership. Review GPU fit, codec/tokenizer versions, session isolation, voice safety filters, red-team coverage, telemetry minimization, and rollback to a known cascaded stack.

Turn-Control Gates

If the product uses a separate turn detector, measure its false hold, false shift, backchannel, interruption, and non-primary speaker behavior independently from ASR word error rate. A strong ASR transcript can still produce bad duplex UX if turn control waits too long, interrupts a thinking pause, or treats noise as the primary user.
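
Turn-control errors can be counted separately from transcript quality with a few lines of bookkeeping. The sketch below assumes each decision is labeled as a `(reference, predicted)` pair over four labels; the label set and `turn_control_errors` are hypothetical, and non-primary speaker handling would need its own label in practice.

```python
from collections import Counter

LABELS = ("hold", "shift", "backchannel", "interrupt")  # assumed label set


def turn_control_errors(pairs):
    """Count wrong decisions by the label the detector predicted.

    "false_hold" means the detector held the floor when the reference
    said otherwise; "false_shift" means it took the turn too early.
    """
    counts = Counter()
    for reference, predicted in pairs:
        if reference not in LABELS or predicted not in LABELS:
            raise ValueError(f"unknown label in pair: {(reference, predicted)}")
        counts["total"] += 1
        if predicted != reference:
            counts[f"false_{predicted}"] += 1
    return counts
```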

Semantic Interruption Gates

SID-Bench and its Average Penalty Time metric make interruption handling less binary. Measure both trigger-happy false alarms and late stops, then tune the gate using semantic intent, not only VAD energy or partial transcript confidence.

Evaluation Update

Benchmarks Are Moving Toward Real Tasks

The most useful recent benchmarks combine natural audio behavior with verifiable outcomes. A system can sound smooth and still fail the job if it mishandles correction, tool state, accents, noise, or the timing of a user interruption.

FDB-v3 Adds Human Disfluency And Tool Use

Full-Duplex-Bench-v3 evaluates spoken language models on real human audio with annotated disfluencies and chained API-call scenarios. The reported pattern is interview-relevant: high responsiveness is not the same as reliable multi-step task completion.

Tau2-Voice Exposes The Voice/Text Gap

Task-grounded voice-agent evaluations show that clean text-agent success does not transfer automatically to full-duplex audio. Noise, accents, interruptions, and speech repair can cut completion rates even when latency and responsiveness look strong.

Synchronization Work Gives Debugging Signals

Moshi synchronization studies probe whether two dialogue models' hidden states coordinate under noise and decoding changes. For production, this suggests diagnostics beyond surface transcripts: timing representations, anticipatory turn cues, and degradation under channel noise.

Multi-Round Benchmarks Matter

FDB-v2, MTR-DuplexBench, and newer task-grounded suites push beyond single interruption clips. Launch gates should cover context consistency, entity tracking, correction handling, safety compliance, and tool-state repair across the whole call.

HumDial-FDBench Adds Challenge Pressure

The ICASSP 2026 HumDial full-duplex track releases dual-channel human conversations and a public benchmark for interruptions, overlapping speech, feedback, and dynamic turn negotiation. Treat it as the evaluation bridge between scripted overlap clips and messy human dialogue.

FDB v1.5 Separates Overlap Behaviors

Full-Duplex-Bench v1.5 tests user interruptions, backchannels, background speech, and users talking to someone else as separate overlap classes. This matters because a good assistant should stop quickly for a real barge-in but hold the floor through noise or side speech.
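
Scoring per overlap class, rather than in aggregate, can be sketched as below. The class names, the `EXPECTED` mapping, and `score_overlap` are hypothetical and are not taken from the benchmark's code; the mapping assumes the assistant should stop only for a real interruption and continue through the other three classes.

```python
from collections import defaultdict

# Assumed expected behavior per overlap class (not the benchmark's own labels).
EXPECTED = {
    "user_interruption": "stop",
    "backchannel": "continue",
    "background_speech": "continue",
    "side_conversation": "continue",
}


def score_overlap(cases):
    """Return per-class accuracy for (overlap_class, observed_action) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for overlap_class, observed_action in cases:
        total[overlap_class] += 1
        if observed_action == EXPECTED[overlap_class]:
            correct[overlap_class] += 1
    return {name: correct[name] / total[name] for name in total}
```

A single aggregate score would let a model that never stops look good on three of the four classes, which is why the classes are kept apart.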

Turn-Detector Tests Need Real Audio

FastTurn and JAL-Turn both emphasize realistic overlap, pauses, noise, multilingual data, and partial observations. For interview and launch reviews, do not evaluate turn-taking only from clean transcripts. Include raw streaming audio, partial ASR states, speaker changes, and timing windows.

Game-Time Tests Temporal Instruction Following

Game-Time shows that spoken language models can handle basic tasks but still fail tempo, synchronized response, and simultaneous speaking constraints. Add temporal command-following to duplex evals when the product depends on coaching, tutoring, games, or live collaboration.

Practice

Advanced Prompts And Coding Drills

Prompt 1: Native Duplex Versus Micro-Turn Cascade

A team wants to replace its ASR-LLM-TTS voice agent with a native full-duplex SpeechLM. What questions do you ask before approving the migration?

Answer: strong review checklist

Compare interruption quality, backchannel timing, task success, safety controllability, debuggability, cost, data requirements, observability, rollback path, and tool integration. Native duplex may win on natural timing and paralinguistics. A micro-turn cascade may win on reasoning quality, policy control, and operational isolation. Require a canary plan with per-event SLOs.

Prompt 2: Design A Duplex Benchmark Pack

Build a benchmark for a full-duplex customer-support voice agent. Include multi-turn repair, interruption, backchannel, and tool-use scenarios.

Answer: benchmark shape

Create scripted audio fixtures with time-aligned user speech, expected assistant timing windows, allowed backchannels, prohibited interruptions, tool-call preconditions, and recovery requirements. Score task success, interruption success, false interruption, backchannel precision, answer groundedness, tool safety, p95 event latency, and cost per successful conversation.
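
A fixture and its scorer might look like the sketch below. It covers only two of the listed scores (interruption timing and tool safety); `Fixture`, its fields, and `score_run` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Fixture:
    name: str
    barge_in_ms: int     # when the scripted user interruption starts
    stop_window_ms: int  # assistant must stop within this many ms of the barge-in
    tool_preconditions: set = field(default_factory=set)  # must hold before a tool call


def score_run(fixture, assistant_stop_ms, tool_called, satisfied_preconditions):
    """Score one run of the agent against a fixture."""
    stop_delay = assistant_stop_ms - fixture.barge_in_ms
    interruption_ok = 0 <= stop_delay <= fixture.stop_window_ms
    # A tool call is safe only if every precondition was satisfied first.
    tool_safe = (not tool_called) or fixture.tool_preconditions <= set(satisfied_preconditions)
    return {
        "interruption_ok": interruption_ok,
        "tool_safe": tool_safe,
        "stop_delay_ms": stop_delay,
    }
```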

Prompt 3: Duplex RAG Under Interruption

A full-duplex agent starts retrieving account-policy context while it is already speaking. The user interrupts with a correction that changes the entity being discussed. What should the serving system do?

Answer: cancellation and grounding plan

Treat retrieval as cancelable session state. Stop stale assistant audio within the barge-in SLO, invalidate retrieval tied to the old entity, preserve only safe context, re-run retrieval after the new intent is stable, and require the answer generator to reference the current entity and source timestamp. Log the event as interruption, retrieval cancellation, and state repair.
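
One simple way to invalidate stale retrieval is to stamp every request with a session epoch and discard results whose epoch no longer matches. The sketch below assumes that design; `RetrievalSession`, its methods, and the log event names are hypothetical.

```python
class RetrievalSession:
    """Tracks the current entity and rejects retrieval results tied to an old one."""

    def __init__(self):
        self.epoch = 0
        self.entity = None
        self.log = []

    def set_entity(self, entity):
        """Record a stable new entity and return the epoch to stamp on retrieval requests."""
        self.epoch += 1
        self.entity = entity
        self.log.append(("state_repair", self.epoch, entity))
        return self.epoch

    def accept(self, result_epoch, documents):
        """Accept results only if they were requested for the current entity."""
        if result_epoch != self.epoch:
            self.log.append(("retrieval_cancelled", result_epoch))
            return None
        self.log.append(("retrieval_accepted", result_epoch))
        return {"entity": self.entity, "documents": documents}
```

A usage pattern: call `set_entity` when the corrected intent is stable, pass the returned epoch along with the retrieval request, and call `accept` when results arrive. Results requested before the correction are dropped and logged instead of being spoken.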

Prompt 4: Privacy Review For Native Duplex

The model team wants to log codec tokens and hidden states for every call to debug turn-taking regressions. What do you approve, reject, and measure?

Answer: privacy-safe debugging

Reject broad raw logging by default. Approve short, consented, access-controlled fixtures or aggregate metrics first. If internal states are needed, run speaker-linkability tests, minimize retention, hash or anonymize identifiers, separate debugging from user identity, and document deletion paths. Measure whether anonymization preserves task success and turn-taking metrics.

Coding Drill: Interruption Gate

Given streaming events with time_ms, speaker, and event, flag assistant speech segments that continue more than max_stop_ms after a user barge-in event.

Answer: Python solution
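def late_stop_segments(events, max_stop_ms):
    """Return assistant segments that ended more than max_stop_ms after a user barge-in.

    Each event is a dict with "time_ms", "speaker", and "event" keys.
    """
    barge_in_at = None  # time of the first barge-in during the current assistant segment
    assistant_speaking = False
    late = []

    # Process events in time order; sorted() is stable, so ties keep input order.
    for event in sorted(events, key=lambda item: item["time_ms"]):
        t = event["time_ms"]
        speaker = event["speaker"]
        kind = event["event"]

        if speaker == "assistant" and kind == "speech_start":
            assistant_speaking = True
        elif speaker == "assistant" and kind == "speech_end":
            if assistant_speaking and barge_in_at is not None:
                delay = t - barge_in_at
                if delay > max_stop_ms:
                    late.append({"barge_in_at": barge_in_at, "stopped_at": t, "delay_ms": delay})
            # Reset state so the next assistant segment starts clean.
            assistant_speaking = False
            barge_in_at = None
        elif speaker == "user" and kind == "barge_in" and assistant_speaking:
            # Keep the earliest barge-in so repeated attempts do not shorten the measured delay.
            if barge_in_at is None:
                barge_in_at = t

    return late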

Sources

Primary Reading