Mapping speech tokens to the same feature space as text tokens has become the paradigm for integrating speech modality into decoder-only large language models (LLMs). An alternative is to use an encoder-decoder architecture that incorporates speech features through cross-attention. In this work, we connect the Whisper encoder with ChatGLM3 and provide in-depth comparisons of these two approaches using Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks. We evaluate their performance using the F1 score and a fine-grained taxonomy of ASR-NER errors. Our experiments reveal that the encoder-decoder model outperforms the decoder-only model if the context is short, while the decoder-only model benefits from a long context as it fully exploits all layers of the LLM. Additionally, we obtain a state-of-the-art F1 score of 0.805 on the AISHELL-NER test set by using chain-of-thought NER which first infers long-form ASR transcriptions and then predicts NER labels.
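The abstract reports F1 but does not spell out how it is computed. As a rough illustration only, the sketch below assumes exact matching on both entity text and entity type, micro-averaged over utterances; the function name `entity_f1`, the `(text, type)` pair format, and the sample entities are hypothetical and are not taken from the paper or from the AISHELL-NER tooling.

```python
from collections import Counter


def entity_f1(predicted, reference):
    """Micro-averaged F1 over (entity_text, entity_type) pairs.

    Assumption: an entity counts as correct only if both its text and
    its type match the reference exactly. Each argument is a list of
    utterances, and each utterance is a list of (text, type) pairs.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")

    true_pos = pred_total = ref_total = 0
    for pred_ents, ref_ents in zip(predicted, reference):
        pred_counts = Counter(pred_ents)
        ref_counts = Counter(ref_ents)
        # Multiset intersection handles entities repeated in one utterance.
        true_pos += sum((pred_counts & ref_counts).values())
        pred_total += sum(pred_counts.values())
        ref_total += sum(ref_counts.values())

    if true_pos == 0:
        return 0.0
    precision = true_pos / pred_total
    recall = true_pos / ref_total
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two utterances with made-up entities.
reference = [[("北京", "LOC"), ("张三", "PER")], [("华为", "ORG")]]
predicted = [[("北京", "LOC")], [("华为", "ORG"), ("李四", "PER")]]
score = entity_f1(predicted, reference)
```

Under this exact-match assumption, an ASR error inside an entity span makes that entity count as both a miss and, if it is still labelled, a false positive, which is one reason a finer-grained ASR-NER error taxonomy is useful alongside the single F1 number.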
Li et al. (Sun,) studied this question.
www.synapsesocial.com/papers/68e59d79b6db643587537935 — DOI: https://doi.org/10.21437/interspeech.2024-103
Yuang Li
Jiawei Yu
Min Zhang