What question did this study set out to answer?

The aim is to develop a lightweight framework for personal voice activity detection that improves detection accuracy while maintaining efficiency.

April 10, 2026 · Open Access

HGRN2-Based Personal Voice Activity Detection: A Lightweight Recurrent Framework for Inference and Training

Key Points

Introduces FDE-HGRN2, a recurrent framework that replaces the LSTM modules of FDE-RNN with the HGRN2 gated linear RNN.
Trains with a cosine-annealing learning rate schedule.
Evaluates on a LibriSpeech-derived PVAD benchmark in which each utterance concatenates one to three speakers and one of them is randomly designated as the target.
Uses 40-dimensional Mel-filterbank features and 256-dimensional d-vector embeddings as inputs.
FDE-HGRN2 outperforms the original FDE-RNN baseline and several state-of-the-art PVAD models in mean Average Precision and frame-level accuracy.
Reduces the parameter count of the recurrent backbone by roughly 15%, yielding smaller models.
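
For readers unfamiliar with the cosine-annealing schedule mentioned above, the following is a minimal sketch of the standard single-cycle form. The function name, step count, and learning-rate bounds are illustrative assumptions; the source does not state the values used in the study, nor whether warm-up or restarts were applied.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` for a single-cycle cosine-annealing schedule.

    All arguments are hypothetical; the study's actual settings are not
    given in the source.
    """
    # Clamp the step so the schedule holds lr_min after the final step.
    progress = min(max(step, 0), total_steps) / total_steps
    # Half a cosine period: starts at lr_max and decays smoothly to lr_min.
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Illustrative values only (not taken from the paper).
schedule = [cosine_annealing_lr(s, 100, 1e-3, 1e-5) for s in range(101)]
```

The rate changes slowly near the start and end of training and fastest in the middle, which is the usual motivation for preferring this shape over a step decay.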

Abstract

This study presents HGRN2-based Flexible Dynamic Encoder Personal VAD (FDE-HGRN2), a recurrent framework for personal voice activity detection (PVAD). Building on the original LSTM-based FDE-RNN backbone, we replace all recurrent modules with the recently introduced HGRN2 gated linear RNN and adopt a cosine-annealing learning rate schedule to improve both detection accuracy and efficiency. HGRN2 uses gated linear recurrence with non-parametric state expansion, enlarging the recurrent state without increasing the number of trainable parameters and enabling more expressive long-range temporal modeling than conventional LSTMs. We evaluate FDE-HGRN2 on a LibriSpeech-derived PVAD benchmark, where multi-speaker mixtures are constructed by concatenating one to three speakers per utterance and randomly designating a target speaker, following established PVAD data construction practices to ensure direct comparability with prior work. The system uses 40-dimensional Mel-filterbank features as acoustic inputs and conditions the detector on 256-dimensional d-vector embeddings extracted from a pretrained speaker verification network. Experimental results show that FDE-HGRN2 consistently outperforms the original FDE-RNN baseline and several state-of-the-art PVAD models in terms of mean Average Precision and frame-level accuracy, while reducing the parameter count of the recurrent backbone by roughly 15% and yielding substantially smaller models than many competing systems. These findings indicate that HGRN2 provides a more temporally expressive and parameter-efficient alternative to LSTM for PVAD, offering a favorable accuracy–efficiency trade-off for real-world, deployment-oriented personalized speech interfaces.
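
The idea of enlarging the recurrent state without adding trainable parameters can be illustrated with a simplified, HGRN2-style recurrence. This is a sketch under stated assumptions, not the authors' implementation: it assumes the state update h_t = Diag(f_t) h_{t-1} + (1 - f_t) ⊗ i_t with output y_t = o_t · h_t, and it uses random stand-in gates where a real layer would compute them with learned projections. Multi-head splitting, forget-gate lower bounds, and normalization are omitted, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hgrn2_style_step(state, forget, inp, out_gate):
    """One step of a gated linear recurrence with an outer-product state.

    state    : d x d matrix (list of lists), the expanded recurrent state
    forget   : length-d forget gate with values in (0, 1)
    inp      : length-d input vector
    out_gate : length-d output gate
    """
    d = len(inp)
    # Row a decays by forget[a] and receives (1 - forget[a]) * inp.
    # The outer product (1 - forget) x inp introduces no extra parameters.
    new_state = [
        [forget[a] * state[a][b] + (1.0 - forget[a]) * inp[b] for b in range(d)]
        for a in range(d)
    ]
    # Read out by contracting the output gate with the state rows.
    output = [sum(out_gate[a] * new_state[a][b] for a in range(d)) for b in range(d)]
    return new_state, output


def run_sequence(frames, seed=0):
    """Run the recurrence over a sequence of length-d frames."""
    rng = random.Random(seed)
    d = len(frames[0])
    state = [[0.0] * d for _ in range(d)]
    outputs = []
    for frame in frames:
        # Assumption: random gates stand in for learned projections of the input.
        forget = [sigmoid(rng.gauss(0.0, 1.0)) for _ in range(d)]
        out_gate = [sigmoid(rng.gauss(0.0, 1.0)) for _ in range(d)]
        state, y = hgrn2_style_step(state, forget, frame, out_gate)
        outputs.append(y)
    return state, outputs


# Toy 4-dimensional frames; the study itself uses 40-dimensional Mel-filterbank features.
frames = [[0.1 * t, -0.2, 0.3, 0.05 * t] for t in range(5)]
final_state, outputs = run_sequence(frames)
```

In this sketch the per-step vectors have d entries while the state holds d × d entries, and each step involves only element-wise products and sums, which is why the update is linear in the previous state.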

Authors

Tzu-Wei Wang

Tai-You Chen

Chien-Chia Chiu

Journals

Electronics

Institutions

National Taiwan Normal University

National Chi Nan University
