Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant to long-context reasoning are known to be difficult to evaluate automatically. We propose a synthetic data generation pipeline designed to serve both as a training resource and as a controlled evaluation environment, and we instantiate it for first-visit doctor-patient conversations, with SOAP note generation as the task. The pipeline, built entirely on open-weight models, has three stages: (1) persona-driven dialogue generation; (2) multi-speaker audio synthesis with overlap and pause modeling, room acoustics, and sound events; and (3) LLM-based production of reference SOAP notes. We release 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes. Evaluating current open-weight systems, we find that cascaded approaches still substantially outperform end-to-end models.
Yanis Labrak
David Grünert
Séverin Baroudi
Johns Hopkins University
University of Pittsburgh
The Ohio State University
Labrak et al. studied this question.
www.synapsesocial.com/papers/69d894ec6c1944d70ce05da6 — DOI: https://doi.org/10.5281/zenodo.19458084