What question did this study set out to answer?

This study benchmarks the GPT-SoVITS text-to-speech system on consumer Apple Silicon hardware, measuring fine-tuning time and end-to-end latency and identifying the platform incompatibilities that must be resolved to run it.

April 10, 2026 | Open Access

Benchmarking Real-Time Voice Cloning on Consumer Apple Silicon: A Practical Evaluation of GPT-SoVITS on M-Series Hardware

Key Points

Benchmarked GPT-SoVITS, an open-source few-shot text-to-speech system, running entirely on consumer Apple Silicon.
Identified and resolved seven critical platform incompatibilities, including pervasive float16 precision errors.
Fine-tuned a voice model on 37 minutes of speech data in approximately 70 minutes.
Achieved 1.5-second end-to-end latency for a real-time voice agent.
Demonstrated these results on a MacBook Pro M4 Pro (24GB).

Abstract

We present the first systematic benchmark of GPT-SoVITS, an open-source few-shot text-to-speech system, running entirely on consumer Apple Silicon hardware. We identify and resolve seven critical platform incompatibilities, including pervasive float16 precision errors. Using a MacBook Pro M4 Pro (24GB), we fine-tune a voice model on 37 minutes of speech data in ~70 minutes and achieve 1.5-second end-to-end latency for a real-time voice agent. Our complete toolkit is released as open source. Code: https://github.com/akhilsingh-git/voice-clone-toolkit
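
The abstract does not say which operations triggered the float16 errors, so the following is only a general illustration of why half precision is fragile, not a reproduction of the issues fixed in the toolkit. It is a minimal sketch using Python's standard `struct` module to round-trip values through IEEE 754 half precision; the helper names `to_float16` and `overflows_float16` are hypothetical and chosen for this example.

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def overflows_float16(x: float) -> bool:
    """Return True if x is too large to be stored as a finite float16."""
    try:
        struct.pack("<e", x)
    except OverflowError:
        return True
    return False


# float16 has an 11-bit significand, so not every integer above 2048
# is representable: 2049 rounds to the nearest representable value.
print(to_float16(2049.0))          # 2048.0

# Near 1.0 the spacing between float16 values is about 0.001,
# so a small increment is lost entirely.
print(to_float16(1.0001))          # 1.0

# The largest finite float16 is 65504; larger magnitudes overflow.
print(overflows_float16(65504.0))  # False
print(overflows_float16(70000.0))  # True
```

Rounding and overflow of this kind are a common reason to keep numerically sensitive steps in float32, though whether that matches the fixes applied in this study is an assumption.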
