What question did this study set out to answer?

The aim is to develop a cost-effective framework for automated Boolean query generation using fine-tuned generative models in biomedical systematic literature reviews.

May 8, 2026 · Open Access

Empowering open medium-sized generative language models for effective structured search in biomedical systematic reviews

Key Points

Developed a novel three-stage framework for automated Boolean query generation using BioGPT and BioT5.
Fine-tuned models on datasets of PubMed article titles, MeSH terms, and keywords using title-only and title-plus-metadata prompts.
Evaluated performance on benchmark datasets CLEF TAR and FASS-BSLR, comparing with state-of-the-art baselines.
Fine-tuned BioGPT achieved a Precision of 0.2544 and F1 of 0.2392 on the CLEF TAR dataset, surpassing all baselines.
On the FASS dataset, BioGPT reached a Recall of 0.1801, more than twice that of the PubMed-Title and PubMed-Keyword baselines.
BioT5 also outperformed most baselines but remained slightly behind BioGPT.
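
For readers unfamiliar with the metrics reported above, the following is a minimal sketch of set-based Precision, Recall, and F1 for a single review topic. It assumes that the documents retrieved by a Boolean query and the studies included in the review are both available as collections of PubMed IDs. The function name and the example IDs are illustrative and are not taken from the study, whose exact evaluation protocol is not described here.

```python
from typing import Iterable, Tuple


def precision_recall_f1(retrieved: Iterable[str],
                        relevant: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based Precision, Recall, and F1 for one topic.

    retrieved: PubMed IDs returned by the Boolean query (assumed input).
    relevant:  PubMed IDs of studies included in the review (assumed input).
    """
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)

    # Guard against empty sets so the function never divides by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical IDs: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f1 = precision_recall_f1(["1", "2", "3", "4"], ["2", "4", "5"])
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Benchmark scores are typically obtained by averaging such per-topic values over all topics in a collection.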

Abstract

Systematic Literature Reviews (SLRs) are essential in biomedical research, particularly for informing public health policy and clinical decision-making. However, the manual generation of Boolean queries for literature search is resource-intensive, prone to error, and difficult to scale. Recent advances in large language models (LLMs) have demonstrated potential, yet most existing approaches rely on zero-shot prompting of commercial models, overlooking the cost-efficiency and domain adaptability of fine-tuned open-source alternatives. This study proposes a novel, three-stage framework that employs medium-sized, open-source generative models, specifically BioGPT and BioT5, for automated Boolean query generation over PubMed. We develop and release datasets comprising PubMed article titles, MeSH terms, and keywords, and fine-tune the models using both title-only and title-plus-metadata prompts. We evaluate performance on two benchmark datasets: CLEF TAR and FASS-BSLR. Our experiments include comparisons with state-of-the-art baselines, prompt-based large language models, and ablation studies exploring the effects of training data size, metadata inclusion, and post-processing with PubMed’s Automatic Term Mapping. Fine-tuned BioGPT outperforms both traditional TAR models and commercial LLMs across key retrieval metrics. On the CLEF TAR dataset, it achieves a Precision of 0.2544, F1 of 0.2392, MAP@1000 of 0.1424, and NDCG@1000 of 0.2490, which surpass all baselines. On the FASS dataset, it reaches a Recall of 0.1801 and NDCG@1000 of 0.0900, again outperforming all competing models. While slightly behind BioGPT, BioT5 still outperforms most baselines. Notably, BioGPT’s Recall of 0.1801 on FASS is more than twice that of PubMed-Title and PubMed-Keyword, and exceeds GPT-3.5 Turbo, GPT-4, Gemini-2, and Llama-3. 
This work demonstrates that fine-tuned, open-source, medium-sized generative models can match or exceed the performance of much larger commercial LLMs in Boolean query generation for biomedical SLRs. These models offer a cost-effective, privacy-preserving, and scalable alternative for structured retrieval of biomedical scholarly texts.

• Proposes a novel three-stage framework for automated Boolean query generation in biomedical systematic literature reviews (SLRs).
• Demonstrates that fine-tuned, medium-sized open-source models (BioGPT, BioT5) outperform large commercial LLMs on CLEF TAR and FASS benchmarks.
• Shows that structured incorporation of MeSH terms and keywords improves large-scale recall and ranking quality.
• Provides a cost-effective, privacy-preserving alternative to proprietary LLMs for scalable biomedical evidence synthesis.
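
To make the target output concrete, here is a hedged sketch of how generated terms might be assembled into a PubMed Boolean query: terms describing the same concept are joined with OR, and the concept groups are joined with AND. The concept grouping, the function name, and the example terms are assumptions made for illustration; the abstract does not specify the three-stage framework at this level of detail. The `[MeSH Terms]` and `[Title/Abstract]` field tags are standard PubMed search tags.

```python
from typing import Dict, List


def build_boolean_query(concepts: List[Dict[str, List[str]]]) -> str:
    """Assemble a PubMed-style Boolean query (illustrative helper).

    Each concept is a dict with optional 'mesh' and 'keywords' lists.
    Terms within a concept are OR-ed; concepts are AND-ed together.
    """
    groups = []
    for concept in concepts:
        # MeSH headings are searched in the MeSH field, free-text
        # keywords in the title and abstract.
        terms = [f'"{m}"[MeSH Terms]' for m in concept.get("mesh", [])]
        terms += [f'"{k}"[Title/Abstract]' for k in concept.get("keywords", [])]
        if terms:  # skip concepts that produced no terms
            groups.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(groups)


# Hypothetical terms for a review on exercise and hypertension.
example_concepts = [
    {"mesh": ["Hypertension"], "keywords": ["high blood pressure"]},
    {"mesh": ["Exercise"], "keywords": ["physical activity"]},
]
print(build_boolean_query(example_concepts))
```

Grouping by concept keeps each OR block broad, which favors recall, while the AND between blocks restricts results to documents that touch every concept.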

Authors

Leandra Budau

Richard Finney

Faezeh Ensan

Journals

International Journal of Medical Informatics

Institutions

Metropolitan University
