Current approaches to AI interaction remain largely text-based and unidirectional - a user types, an AI responds. This paper presents a system architecture that enables bidirectional, real-time voice conversation between a human user and a large language model, achieving sub-3-second end-to-end latency. The system allows users to initiate a voice "call" with an AI agent, speak naturally, receive spoken responses, and interrupt mid-utterance - closely replicating the dynamics of a human telephone conversation. We describe the full pipeline: client-side voice activity detection, WebSocket-based audio streaming, server-side speech-to-text via Whisper, LLM inference with GPT-5.4, sentence-boundary text-to-speech pipelining via ElevenLabs, and real-time audio chunk delivery back to the client. We report a P50 time-to-first-audio of approximately 1.6 seconds and discuss architectural decisions that make this latency achievable. The architecture is domain-agnostic and applicable to any use case requiring natural spoken human-AI interaction.
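The sentence-boundary text-to-speech pipelining step named above can be illustrated with a minimal sketch. It assumes the LLM output arrives as a stream of text fragments and that each complete sentence is handed to the speech synthesizer as soon as it closes, rather than after the full response is generated. The function name `sentences_from_tokens` and the boundary rule (a period, exclamation mark, or question mark followed by whitespace) are hypothetical choices made for illustration; the abstract does not specify how the system detects sentence boundaries.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule (not from the paper): '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; the last part
        # is still being generated and stays in the buffer.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["Hel", "lo there", ". How", " are", " you? ", "Fine"]
    for sentence in sentences_from_tokens(stream):
        # In a real pipeline this is where a TTS request would be issued.
        print(sentence)
```

Because the generator yields the first sentence before later tokens arrive, synthesis of the opening sentence can overlap with generation of the rest, which is the property that reduces time-to-first-audio. A simple rule like this one would also split after abbreviations such as "Dr.", so a production system would likely need a more careful boundary detector.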
Dan Zabrotski
Dan Zabrotski (Fri,) studied this question.
www.synapsesocial.com/papers/69bf898bf665edcd009e954b — DOI: https://doi.org/10.5281/zenodo.19140279