We present the first systematic benchmark of GPT-SoVITS, an open-source few-shot text-to-speech system, running entirely on consumer Apple Silicon hardware. We identify and resolve seven critical platform incompatibilities, including pervasive float16 precision errors. On a MacBook Pro with an M4 Pro chip (24 GB), we fine-tune a voice model on 37 minutes of speech data in approximately 70 minutes and achieve an end-to-end latency of 1.5 seconds for a real-time voice agent. Our complete toolkit is released as open source. Code: https://github.com/akhilsingh-git/voice-clone-toolkit
Akhil Singh studied this question.