Moshi Established The Open Baseline
Moshi framed spoken dialogue as speech-to-speech generation with
parallel user and assistant streams. Its inner-monologue design
predicts time-aligned text before audio tokens, helping preserve
linguistic quality while keeping real-time full-duplex behavior.
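A minimal sketch of the per-frame ordering this implies: one text token first, then the audio tokens of both streams. The function names, the assistant-before-user stream order, and the token values are illustrative assumptions, not Moshi's actual interface.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (stream name, token id)


def flatten_frame(text_token: int,
                  assistant_audio: List[int],
                  user_audio: List[int]) -> List[Token]:
    """Order one time step so the text token precedes all audio tokens.

    Assumption: the assistant stream comes before the user stream; only
    the text-before-audio ordering is taken from the description above.
    """
    frame = [("text", text_token)]
    frame += [("assistant_audio", t) for t in assistant_audio]
    frame += [("user_audio", t) for t in user_audio]
    return frame


def flatten_dialogue(frames) -> List[Token]:
    """Concatenate frames in time order into one autoregressive sequence."""
    sequence: List[Token] = []
    for text_token, assistant_audio, user_audio in frames:
        sequence += flatten_frame(text_token, assistant_audio, user_audio)
    return sequence
```

Because the text token opens every frame, the audio tokens that follow it can be conditioned on the linguistic content of that same time step.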
SyncLLM Formalized Clock-Synchronous LLMs
SyncLLM treats full-duplex dialogue as a timing problem: the model
must generate the next chunk before the current chunk ends. This
makes latency part of the model contract, not only an inference
optimization after training.
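The timing contract can be sketched as a deadline check: the chunk after chunk k has to be ready before chunk k finishes playing. The class name, the chunk duration, and the timestamps below are hypothetical and are not taken from SyncLLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkClock:
    chunk_ms: float  # duration of one audio chunk (assumed value)

    def deadline_ms(self, chunk_index: int) -> float:
        """Time at which chunk `chunk_index` finishes playing."""
        return (chunk_index + 1) * self.chunk_ms

    def meets_contract(self, chunk_index: int, ready_at_ms: float) -> bool:
        """True if the following chunk was ready before this one ended."""
        return ready_at_ms <= self.deadline_ms(chunk_index)


def late_chunks(clock: ChunkClock, ready_times_ms: List[float]) -> List[int]:
    """Indices k whose successor chunk missed the deadline.

    ready_times_ms[k] is the time at which chunk k+1 became available.
    """
    return [k for k, t in enumerate(ready_times_ms)
            if not clock.meets_contract(k, t)]
```

With a 200 ms chunk, `late_chunks(ChunkClock(200), [150, 390, 650])` returns `[2]`: the third generation step finished at 650 ms, after its 600 ms deadline, so playback would stall there.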
PersonaPlex Adds Role And Voice Control
PersonaPlex targets natural full-duplex behavior while allowing
text role prompts and voice conditioning. This moves duplex systems
closer to deployable customer-service, tutor, and companion agents
where role adherence is part of product quality.
DuplexCascade Keeps The LLM-Centered Stack
DuplexCascade argues that cascaded ASR-LLM-TTS systems can remain
competitive if they replace brittle VAD-based endpointing with
chunk-wise micro-turns driven by conversational control tokens.
Seeduplex Signals Native Duplex In Production
ByteDance describes Seeduplex as a native full-duplex Speech LLM
focused on interference suppression and adaptive endpoint behavior.
For study purposes, treat it as evidence that large-scale products
are moving past half-duplex voice UX.
BayLing-Duplex Is The Newest Conversion Recipe
BayLing-Duplex starts from a public GLM-4-Voice checkpoint and adds
only a few dialogue-state tokens, turning listen, speak, and stop
decisions into ordinary autoregressive prediction. The reported
result is strong interruption and turn-taking without a separate
turn-taking module.
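One way to picture state decisions as ordinary tokens is a decoder loop that treats a few special vocabulary entries as state switches. The token names `<listen>`, `<speak>`, and `<stop>` and the routing rule are assumptions for illustration, not BayLing-Duplex's actual vocabulary or code.

```python
from typing import Iterable, List, Tuple

# Hypothetical dialogue-state tokens living in the normal vocabulary.
STATE_TOKENS = {"<listen>", "<speak>", "<stop>"}


def route_tokens(token_stream: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Follow state tokens emitted by the model and keep only spoken output.

    Returns (spoken_tokens, state_trace), where state_trace records the
    state in effect after each token is consumed.
    """
    state = "listen"
    spoken: List[str] = []
    trace: List[str] = []
    for tok in token_stream:
        if tok in STATE_TOKENS:
            # <speak> takes the floor; <listen> and <stop> both yield it.
            state = "speak" if tok == "<speak>" else "listen"
        elif state == "speak":
            spoken.append(tok)
        # Tokens produced while listening are dropped, not played.
        trace.append(state)
    return spoken, trace
```

The point of the sketch is that no separate classifier is consulted: the same next-token prediction that yields speech tokens also yields the state switch.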
DuplexOmni Splits Interaction From Thinking
DuplexOmni extends the duplex frame to speech plus video and keeps
a low-latency interaction layer separate from a slower, pluggable
thinking layer. The serving lesson is to let perception and speech
continue while deeper reasoning, tools, or external agents return
results asynchronously.
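The serving pattern can be sketched with `asyncio`: a fast loop keeps ticking while a slower task runs in the background, and the result is folded in whenever it arrives. The function names, delays, and event format are hypothetical; this is not DuplexOmni's implementation.

```python
import asyncio
from typing import List, Tuple


async def slow_thinking(query: str, delay_s: float = 0.05) -> str:
    """Stand-in for a reasoning model, tool call, or external agent."""
    await asyncio.sleep(delay_s)
    return f"answer to {query!r}"


async def interaction_loop(n_ticks: int = 10,
                           tick_s: float = 0.01) -> List[Tuple[str, object]]:
    """Keep the interaction layer running while thinking is pending."""
    events: List[Tuple[str, object]] = []
    task = asyncio.create_task(slow_thinking("weather"))
    delivered = False
    for tick in range(n_ticks):
        if task.done() and not delivered:
            # The thinking result is picked up without ever blocking a tick.
            events.append(("result", task.result()))
            delivered = True
        else:
            # Placeholder for perception and speech work on this tick.
            events.append(("tick", tick))
        await asyncio.sleep(tick_s)
    if not delivered:
        events.append(("result", await task))
    return events
```

Running `asyncio.run(interaction_loop())` yields a run of `("tick", n)` events with exactly one `("result", ...)` event interleaved once the background task completes.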
Qwen3.5-Omni Raises The Omni Baseline
Qwen3.5-Omni is not primarily a duplex paper, but its large-scale
speech, video, text, long-context, and speech-generation coverage
raises the baseline for what full-duplex systems will be compared
against. Treat it as an adjacent production trend: native duplex
needs natural timing, but it also has to compete with stronger
multilingual and multimodal perception stacks.
Nemotron 3 VoiceChat Makes Deployment Concrete
NVIDIA positions Nemotron 3 VoiceChat as a 12B end-to-end,
real-time full-duplex speech-to-speech model with open,
inspectable weights. That makes the research trend more operational:
teams can compare unified native S2S serving against cascaded
stacks under enterprise latency, cost, and observability constraints.
Raon-SpeechChat Adds Another Open Deployment Target
KRAFTON's Raon-SpeechChat-9B is another public full-duplex speech
language model built for simultaneous listening and speaking. Its
release strengthens the practical comparison set for teams that
want to test open native-duplex models against API systems and
modular cascades.
DuplexSLA Adds Actions To The Clock
DuplexSLA introduces a speech-language-action formulation where
assistant audio and structured action tokens share a 160 ms chunk
timeline. This points toward real-time voice agents that can plan
and use tools without waiting for a clean end-of-turn boundary.
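A rough sketch of a shared chunk timeline, assuming audio and action events carry millisecond timestamps and are bucketed into 160 ms slots. Only the 160 ms figure comes from the text; the event format, function names, and the `call_tool` payload are hypothetical.

```python
from typing import Dict, List, Tuple

CHUNK_MS = 160  # chunk duration stated in the text

Event = Tuple[int, str]  # (timestamp in ms, payload)


def chunk_index(t_ms: int) -> int:
    """Map a timestamp to the index of the chunk that contains it."""
    return t_ms // CHUNK_MS


def build_timeline(audio_events: List[Event],
                   action_events: List[Event]) -> Dict[int, Dict[str, List[str]]]:
    """Place audio and action payloads on one chunk-indexed timeline."""
    timeline: Dict[int, Dict[str, List[str]]] = {}
    for kind, events in (("audio", audio_events), ("action", action_events)):
        for t_ms, payload in events:
            slot = timeline.setdefault(chunk_index(t_ms),
                                       {"audio": [], "action": []})
            slot[kind].append(payload)
    # Sort by chunk index so the timeline reads in playback order.
    return dict(sorted(timeline.items()))
```

For audio events at 0, 170, and 330 ms and an action at 200 ms, the action lands in chunk 1 alongside the second audio event, so it is scheduled mid-utterance rather than after an end-of-turn boundary.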
UAF Moves Some Difficulty To The Front End
Unified audio front-end work reframes VAD, turn detection, speaker
recognition, ASR, and state control as a single streaming sequence
prediction problem. This is a practical route when the back-end
model is not yet fully native duplex.
Turn Detection Is Becoming Its Own Layer
FastTurn and JAL-Turn show a parallel path to usable duplex:
improve low-latency hold, shift, backchannel, and interruption
decisions without replacing the whole model. They fuse acoustic and
streaming semantic cues so a cascaded system can behave more like a
synchronous one while keeping ASR, policy, and TTS separable.
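A toy illustration of fusing an acoustic cue with a semantic cue, covering only the hold versus shift decision. The weights, the 500 ms saturation point, the threshold, and the function name are arbitrary assumptions; FastTurn and JAL-Turn use learned models, not this rule.

```python
def turn_decision(silence_ms: float,
                  semantic_complete_prob: float,
                  w_acoustic: float = 0.5,
                  w_semantic: float = 0.5,
                  threshold: float = 0.6) -> str:
    """Return "shift" if the user seems finished, otherwise "hold".

    silence_ms: trailing silence measured on the audio stream.
    semantic_complete_prob: probability, from a streaming text model,
        that the partial transcript is a complete utterance.
    """
    # Saturate the acoustic cue so very long pauses do not dominate.
    acoustic = min(silence_ms / 500.0, 1.0)
    score = w_acoustic * acoustic + w_semantic * semantic_complete_prob
    return "shift" if score >= threshold else "hold"
```

Under these assumed weights, `turn_decision(500, 0.1)` returns `"hold"`: a long pause alone does not hand over the turn when the transcript looks unfinished, which is the failure a silence-only endpoint would make.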
TurnGuide Reopens Text Guidance
TurnGuide argues that end-to-end full-duplex models still benefit
from turn-level text guidance if the guidance is aligned carefully
with speech timing. This is a useful counterweight to purely
audio-token designs: text can improve coherence, but only if it
does not break the clock.
Interactivity Alignment Becomes Post-Training
Kyutai's June 2026 work treats timing behavior as an alignment
problem, not only a pretraining artifact. It uses RL rewards for
pause handling, turn-taking, backchanneling, and user interruption,
plus a response-quality reward so the model does not become more
interactive at the cost of usefulness.
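To see why the quality term matters, consider a toy combined reward: the mean of the timing-behavior scores plus a weighted quality score. The additive form, the equal weighting of behaviors, and all names are assumptions for illustration; the actual reward design is not specified here.

```python
from typing import Dict


def interactivity_reward(timing_scores: Dict[str, float],
                         quality_score: float,
                         quality_weight: float = 1.0) -> float:
    """Combine per-behavior timing scores with a response-quality score.

    timing_scores: scores in [0, 1] for behaviors such as pause handling,
        turn-taking, backchanneling, and user interruption.
    quality_score: score in [0, 1] for the usefulness of the response.
    """
    timing = sum(timing_scores.values()) / len(timing_scores)
    return timing + quality_weight * quality_score


behaviors = ["pause", "turn_taking", "backchannel", "interruption"]
# Perfectly timed but useless response.
snappy_but_empty = interactivity_reward({b: 1.0 for b in behaviors}, 0.0)
# Moderately well timed and mostly useful response.
balanced = interactivity_reward({b: 0.5 for b in behaviors}, 0.8)
```

In this toy setup `balanced` scores higher than `snappy_but_empty`, so a policy cannot maximize reward by optimizing timing alone.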
MoshiRAG Adds Knowledge Without Losing The Floor
MoshiRAG keeps a compact full-duplex interface and performs
selective retrieval asynchronously. The key product lesson is that
retrieval can run during the natural delay before the core answer,
preserving conversational timing while improving factuality.
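The timing idea can be sketched with `asyncio`: start retrieval, let the lead-in portion of the reply play, and use the retrieved text only if it is ready by then. The function names, latencies, and the fall-back-to-ungrounded policy are hypothetical and are not taken from MoshiRAG.

```python
import asyncio


async def retrieve(query: str, latency_s: float) -> str:
    """Stand-in for a retrieval backend with a given latency."""
    await asyncio.sleep(latency_s)
    return f"doc for {query!r}"


async def answer(query: str,
                 retrieval_latency_s: float,
                 lead_in_s: float = 0.05) -> str:
    """Overlap retrieval with the natural delay before the core answer."""
    task = asyncio.create_task(retrieve(query, retrieval_latency_s))
    # The lead-in of the reply plays here while retrieval runs in parallel.
    await asyncio.sleep(lead_in_s)
    if task.done():
        return "grounded: " + task.result()
    # Assumed policy: keep the floor rather than stall for a slow lookup.
    task.cancel()
    return "ungrounded answer"
```

A fast lookup (`retrieval_latency_s=0.0`) produces a grounded reply with no added delay, while a slow one (`retrieval_latency_s=1.0`) is abandoned so the response still starts on time.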