Detecting violent content in audio recordings is crucial for public safety, autonomous surveillance, and content moderation, particularly when visual cues are unreliable or unavailable. A resource-aware two-stage cascade for acoustic violence detection is proposed, combining a lightweight Least Squares Linear Detector (LSLD) as a first-stage screener with a trimmed version of YAMNet as a second-stage classifier. A percentile-based forwarding rule controls the fraction of segments routed to the deep stage, turning the accuracy–cost trade-off into an explicit operating parameter for always-on deployment. The approach is evaluated on a publicly released dataset of real-world violent audio augmented with background noise and artificial reverberation. In the low-false-alarm regime, the proposed cascade preserves performance close to that of a Stage 2-only baseline while substantially reducing the average deep-inference workload. An ablation study validates the role of the LSLD as an inexpensive pre-filter, and robustness is assessed under clean, reverberant, and 12 dB noise conditions. Finally, an analytic energy consumption model is provided that links computational workload to daily energy demand and photovoltaic sizing on ultra-low-power hardware, supporting sustainable off-grid deployment.
Zhu-Zhou et al. (Mon,) studied this question.
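
To make the percentile-based forwarding rule in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each audio segment is already summarized by a fixed-length feature vector, that the LSLD is an ordinary least-squares fit with a bias term, and that the second stage is any callable returning one Boolean decision per forwarded segment. The names `fit_lsld`, `lsld_scores`, `forward_mask`, `cascade_decisions`, and `stage2_fn` are hypothetical, and the synthetic data and stand-in second stage are for illustration only.

```python
import numpy as np


def fit_lsld(X, y):
    """Fit a linear detector by ordinary least squares (assumed form of the LSLD)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w


def lsld_scores(X, w):
    """Stage-1 score for each segment: a single dot product per segment."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


def forward_mask(scores, forward_fraction):
    """Mark the highest-scoring `forward_fraction` of segments for Stage 2.

    The threshold is the (1 - forward_fraction) percentile of the scores, so
    ties or interpolation can shift the realized fraction slightly.
    """
    if not 0.0 <= forward_fraction <= 1.0:
        raise ValueError("forward_fraction must lie in [0, 1]")
    if forward_fraction == 0.0:
        return np.zeros(scores.shape, dtype=bool)
    threshold = np.percentile(scores, 100.0 * (1.0 - forward_fraction))
    return scores >= threshold


def cascade_decisions(X, w, forward_fraction, stage2_fn):
    """Run Stage 1 on every segment and Stage 2 only on forwarded segments.

    `stage2_fn` is a hypothetical placeholder for the deep classifier; it
    receives the forwarded feature rows and returns one Boolean per row.
    Segments that are not forwarded are labeled non-violent.
    """
    mask = forward_mask(lsld_scores(X, w), forward_fraction)
    decisions = np.zeros(X.shape[0], dtype=bool)
    if mask.any():
        decisions[mask] = stage2_fn(X[mask])
    return decisions, float(mask.mean())  # second value: deep-stage workload


# Synthetic illustration only: 200 segments with 8 made-up features each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)

w_demo = fit_lsld(X_demo, y_demo)
decisions_demo, workload_demo = cascade_decisions(
    X_demo, w_demo, forward_fraction=0.25,
    stage2_fn=lambda feats: feats[:, 0] > 0.5,  # stand-in for the deep stage
)
print(f"deep-stage workload: {workload_demo:.2f}")
```

In this sketch, `forward_fraction` is the explicit operating parameter: lowering it reduces the number of deep-stage calls, at the risk of discarding violent segments that the linear screener scores poorly.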