What question did this study set out to answer?

The aim is to create a system that facilitates natural, two-way voice conversations between humans and AI using large language models.

March 22, 2026 | Open Access

Bidirectional Real-Time Voice Conversation with Large Language Models: A Pipeline Architecture for Human-Like AI Dialogue

Key Points

The aim is to create a system that facilitates natural, two-way voice conversations between humans and AI using large language models.
Developed a voice "call" system in which the user and the AI agent speak to each other.
Implemented client-side voice activity detection and WebSocket-based audio streaming.
Transcribed speech on the server with Whisper.
Ran language model inference with GPT-5.4.
Pipelined text-to-speech through ElevenLabs at sentence boundaries, delivering audio chunks to the client in real time.
Achieved sub-3-second end-to-end latency.
Reported a P50 (median) time-to-first-audio of approximately 1.6 seconds.
Supported user interruptions mid-utterance.
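
The summary does not say which voice activity detection method the client uses. As a rough illustration of the idea only, the sketch below marks speech segments with a simple energy threshold plus a "hangover" of silent frames before an utterance is closed. The function names, the `threshold` and `hangover` values, and the choice of Python (the real client presumably runs in a browser) are all assumptions made for this example.

```python
import math
from typing import Iterable, Iterator, List, Tuple


def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_utterances(
    frames: Iterable[List[float]],
    threshold: float = 0.02,  # assumed energy cutoff, not from the paper
    hangover: int = 3,        # assumed number of silent frames that ends an utterance
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame index ranges judged to contain speech.

    `end` is exclusive: it is one past the last voiced frame.
    """
    start = None       # index of the first voiced frame of the open utterance
    last_voiced = -1   # index of the most recent voiced frame
    silent = 0         # consecutive silent frames since last_voiced
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            if start is None:
                start = i
            last_voiced = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= hangover:
                yield (start, last_voiced + 1)
                start = None
                silent = 0
    # Close an utterance that is still open when the stream ends.
    if start is not None:
        yield (start, last_voiced + 1)
```

In a setup like the one described, the client would only need to stream the frames inside each detected range over the WebSocket, and the end of a range could serve as the signal that the user has finished speaking.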

Abstract

Current approaches to AI interaction remain largely text-based and unidirectional - a user types, an AI responds. This paper presents a system architecture that enables bidirectional, real-time voice conversation between a human user and a large language model, achieving sub-3-second end-to-end latency. The system allows users to initiate a voice "call" with an AI agent, speak naturally, receive spoken responses, and interrupt mid-utterance - closely replicating the dynamics of a human telephone conversation. We describe the full pipeline: client-side voice activity detection, WebSocket-based audio streaming, server-side speech-to-text via Whisper, LLM inference with GPT-5.4, sentence-boundary text-to-speech pipelining via ElevenLabs, and real-time audio chunk delivery back to the client. We report a P50 time-to-first-audio of approximately 1.6 seconds and discuss architectural decisions that make this latency achievable. The architecture is domain-agnostic and applicable to any use case requiring natural spoken human-AI interaction.
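
The sentence-boundary pipelining mentioned in the abstract can be pictured as follows: rather than waiting for the full LLM reply, the server forwards each completed sentence to text-to-speech as soon as it appears in the token stream. The sketch below is a minimal illustration under that reading, not the paper's implementation. The function name `sentences_from_tokens` and the boundary rule (terminal punctuation followed by whitespace) are assumptions.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: one or more of . ! ? followed by whitespace.
# A number such as "1.6" is not split, but an abbreviation like "Dr. " would be.
_SENTENCE_END = re.compile(r"([.!?]+)(\s+)")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # One token may complete several sentences, so drain the buffer in a loop.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.end(1)].strip()
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    # Whatever remains when the stream ends is emitted as the final chunk.
    tail = buffer.strip()
    if tail:
        yield tail
```

Each yielded sentence could be handed to the speech synthesizer while the model is still generating the next one, which is one plausible way to keep time-to-first-audio well below the total response time.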

Authors

Dan Zabrotski
