What question did this study set out to answer?

The aim is to improve radiology report analysis by adapting language models with medical vocabulary.

April 1, 2026 | Open Access

Domain Adaptation with Medical Vocabulary-Aware Tokenizer for Radiology Report Analysis in RadNLP at KAIYO03

Key Points

Aimed to improve radiology report analysis by adapting language models to medical vocabulary.
Developed a two-step domain-transfer method for tokenizer and LM adaptation.
Replaced low-frequency tokens with high-frequency medical bi- and tri-grams.
Continually pre-trained the LM on a medical corpus using masked language modeling.
Evaluated performance on lung cancer staging in the RadNLP 2024 shared task.
The proposed method improved performance on this specialized task.
Customizing tokenizers and retraining language models helped reduce the domain gap.
Results were validated using both English and Japanese radiology reports.
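
The vocabulary-replacement step in the key points can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`count_ngrams`, `swap_vocabulary`), the word-level n-gram counting, the `budget` parameter, and the choice to reuse the ids of the removed tokens are all assumptions made for illustration.

```python
from collections import Counter


def count_ngrams(tokenized_docs, sizes=(2, 3)):
    """Count word-level bi- and tri-grams over pre-tokenized reports."""
    counts = Counter()
    for words in tokenized_docs:
        for n in sizes:
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts


def swap_vocabulary(vocab, token_freq, ngram_counts, budget,
                    protected=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Replace the `budget` rarest tokens with the most frequent n-grams.

    Assumption: each new n-gram takes over the id of the token it replaces,
    so the vocabulary (and embedding table) keeps its original size.
    """
    candidates = [t for t in vocab if t not in protected]
    # Rarest tokens first, measured on the medical corpus.
    rare = sorted(candidates, key=lambda t: (token_freq.get(t, 0), t))[:budget]
    frequent = [g for g, _ in ngram_counts.most_common() if g not in vocab][:budget]
    new_vocab = dict(vocab)
    for old, new in zip(rare, frequent):
        new_vocab[new] = new_vocab.pop(old)
    return new_vocab


# Toy data, invented for this example only.
reports = [
    ["ground", "glass", "opacity", "in", "right", "upper", "lobe"],
    ["ground", "glass", "opacity", "in", "left", "lower", "lobe"],
    ["pleural", "effusion", "with", "ground", "glass", "opacity"],
]
vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "lobe": 3,
         "zebra": 4, "sonnet": 5, "yacht": 6}
token_freq = {"the": 50, "lobe": 2}  # tokens missing here never occur in the corpus

ngrams = count_ngrams(reports)
new_vocab = swap_vocabulary(vocab, token_freq, ngrams, budget=3)
```

In this toy run the three tokens that never occur in the reports give up their ids to the three most frequent n-grams. A real tokenizer would also need its merge rules or segmentation logic updated so that the new entries are actually produced at tokenization time, and Japanese text would need a word segmenter before n-grams can be counted this way.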

Abstract

Recent advances in language models (LMs) have significantly improved the handling of complex medical narratives compared to classical methods. However, a major obstacle to the practical use of these LMs in the medical domain is that they are not trained on medical knowledge. In particular, standard tokenizers trained on open-domain corpora fail to accurately capture the domain-specific terminology, abbreviations, and writing styles found in radiology reports and clinical notes. To address this issue, we propose a two-step domain-transfer method that updates both the tokenizer vocabulary and the LM representations. First, we replace low-frequency tokens in the original general-domain vocabulary with high-frequency bi- and tri-grams extracted from medical text, ensuring that domain-relevant tokens are learned. Second, we continually pre-train the LM on the medical corpus using masked language modeling to align the model parameters more closely with domain-specific language. We evaluated the effectiveness of this approach in the RadNLP 2024 shared task on lung cancer staging from radiology reports, covering both English and Japanese. Experimental results indicate that our method improves performance on this specialized task, suggesting that customizing tokenizers and re-training language models can substantially mitigate the domain gap. In future work, we plan to address the standardization of radiology report formats to facilitate more robust and accurate automated analysis.
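
The masked language modeling objective used in the second step can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's training code: the 15% masking rate, the `mask_tokens` helper, and the use of -100 as an ignored label are common conventions assumed here, and the abstract does not specify the actual settings.

```python
import random

IGNORE_INDEX = -100  # assumed convention: positions with this label are skipped by the loss


def mask_tokens(token_ids, mask_id, special_ids=frozenset(), mask_prob=0.15, rng=None):
    """Build (inputs, labels) for one masked language modeling example.

    Each non-special token is replaced by `mask_id` with probability
    `mask_prob`; the label keeps the original id only at masked positions.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tid in token_ids:
        if tid not in special_ids and rng.random() < mask_prob:
            inputs.append(mask_id)        # hide the token from the model
            labels.append(tid)            # the model must recover it
        else:
            inputs.append(tid)            # token stays visible
            labels.append(IGNORE_INDEX)   # no loss at this position
    return inputs, labels


# Hypothetical token ids; id 4 plays the role of [MASK], id 10 a special token.
original = list(range(10, 110))
inputs, labels = mask_tokens(original, mask_id=4, special_ids={10},
                             mask_prob=0.5, rng=random.Random(0))
```

During continued pre-training, the model would be given `inputs` and trained to predict the original ids wherever `labels` is not `IGNORE_INDEX`, so that its representations, including those of newly added medical tokens, adapt to the medical corpus.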
