Recent advances in language models (LMs) have significantly improved the handling of complex medical narratives compared to classical methods. However, one major obstacle to the practical use of these LMs in the medical domain is that the models lack training on medical knowledge. In particular, standard tokenizers trained on open-domain corpora fail to accurately capture the domain-specific terminology, abbreviations, and writing styles found in radiology reports and clinical notes. To address this issue, we propose a two-step domain-transfer method that updates both the tokenizer vocabulary and the LM representations. First, we replace low-frequency tokens in the original general-domain vocabulary with high-frequency bi- and tri-grams extracted from medical text, so that domain-relevant tokens are learned. Second, we continually pre-train the LM on the medical corpus using masked language modeling to align the model parameters more closely with the domain-specific language. We evaluated the effectiveness of this approach in the RadNLP 2024 shared task on lung cancer staging from radiology reports, covering both English and Japanese. Experimental results indicate that our method improves performance on this specialized task, suggesting that customizing tokenizers and continually pre-training language models can substantially mitigate the domain gap. In future work, we plan to address the standardization of radiology report formats to facilitate more robust and accurate automated analysis.
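The first step, vocabulary replacement, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `top_ngrams` and `replace_rare_tokens`, the whitespace-based word n-grams, the `protected` token set, and the toy corpus are all assumptions made for illustration. Whitespace splitting in particular would not apply to Japanese text, which would need a morphological analyzer instead.

```python
from collections import Counter

def top_ngrams(corpus, k, sizes=(2, 3)):
    """Return the k most frequent word n-grams (bi- and tri-grams by default)."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().split()  # assumption: whitespace tokenization
        for n in sizes:
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return [gram for gram, _ in counts.most_common(k)]

def replace_rare_tokens(vocab, token_freq, new_tokens,
                        protected=frozenset({"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"})):
    """Overwrite the least frequent vocabulary entries with new tokens, reusing their IDs."""
    new_tokens = [t for t in new_tokens if t not in vocab]
    candidates = [t for t in vocab if t not in protected]
    # Rarest first; ties are broken by token ID so the result is deterministic.
    candidates.sort(key=lambda t: (token_freq.get(t, 0), vocab[t]))
    updated = dict(vocab)
    for old, new in zip(candidates, new_tokens):
        updated[new] = updated.pop(old)  # the new token inherits the old ID
    return updated

# Toy data (hypothetical): a tiny "medical" corpus and a tiny general-domain vocabulary.
corpus = [
    "pleural effusion is noted",
    "no pleural effusion",
    "small pleural effusion is noted",
]
vocab = {"[PAD]": 0, "the": 1, "zebra": 2, "lung": 3}
token_freq = {"the": 50, "lung": 20}  # "zebra" never occurs in the medical corpus

new_tokens = top_ngrams(corpus, k=1)
updated_vocab = replace_rare_tokens(vocab, token_freq, new_tokens)
print(new_tokens)     # ['pleural effusion']
print(updated_vocab)  # {'[PAD]': 0, 'the': 1, 'lung': 3, 'pleural effusion': 2}
```

Reusing the ID of a rare token keeps the vocabulary size fixed, so under this sketch the embedding matrix would not need resizing; the reused embedding row would still carry the old token's representation until it is relearned, which is the role the second step (continual pre-training with masked language modeling) would play.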
Shirafuji et al. (Fri,) studied this question.