Systematic literature reviews (SLRs) are essential in biomedical research, particularly for informing public health policy and clinical decision-making. However, manually constructing Boolean queries for literature search is resource-intensive, error-prone, and difficult to scale. Recent advances in large language models (LLMs) have shown promise for automating this task, yet most existing approaches rely on zero-shot prompting of commercial models, overlooking the cost-efficiency and domain adaptability of fine-tuned open-source alternatives. This study proposes a novel three-stage framework that employs medium-sized, open-source generative models, specifically BioGPT and BioT5, for automated Boolean query generation over PubMed. We develop and release datasets comprising PubMed article titles, MeSH terms, and keywords, and fine-tune the models using both title-only and title-plus-metadata prompts. We evaluate performance on two benchmark datasets: CLEF TAR and FASS-BSLR. Our experiments include comparisons with state-of-the-art baselines and prompt-based LLMs, as well as ablation studies exploring the effects of training data size, metadata inclusion, and post-processing with PubMed’s Automatic Term Mapping. Fine-tuned BioGPT outperforms both traditional TAR models and commercial LLMs across key retrieval metrics. On the CLEF TAR dataset, it achieves a Precision of 0.2544, F1 of 0.2392, MAP@1000 of 0.1424, and NDCG@1000 of 0.2490, surpassing all baselines. On the FASS-BSLR dataset, it reaches a Recall of 0.1801 and NDCG@1000 of 0.0900, again outperforming all competing models. BioT5 trails BioGPT slightly but still outperforms most baselines. Notably, BioGPT’s Recall of 0.1801 on FASS-BSLR is more than twice that of PubMed-Title and PubMed-Keyword, and exceeds that of GPT-3.5 Turbo, GPT-4, Gemini-2, and Llama-3.
This work demonstrates that fine-tuned, open-source, medium-sized generative models can match or exceed the performance of much larger commercial LLMs in Boolean query generation for biomedical SLRs. These models offer a cost-effective, privacy-preserving, and scalable alternative for structured retrieval of biomedical scholarly texts.
• Proposes a novel three-stage framework for automated Boolean query generation in biomedical systematic literature reviews (SLRs).
• Demonstrates that fine-tuned, medium-sized open-source models (BioGPT, BioT5) outperform large commercial LLMs on CLEF TAR and FASS benchmarks.
• Shows that structured incorporation of MeSH terms and keywords improves large-scale recall and ranking quality.
• Provides a cost-effective, privacy-preserving alternative to proprietary LLMs for scalable biomedical evidence synthesis.
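For readers less familiar with the retrieval metrics reported in the abstract (Precision, Recall, F1, MAP@1000, NDCG@1000), the following is a minimal sketch of how such scores are commonly computed for a single review topic. It assumes binary relevance, a ranked list without duplicate document IDs, and average precision normalised by the total number of relevant documents. The function names and toy document IDs are hypothetical, and the paper's actual evaluation protocol may differ in these details.

```python
import math


def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query (binary relevance)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision_at_k(ranked, relevant, k=1000):
    """Average precision over the top-k ranked documents.

    Assumption: normalised by the total number of relevant documents.
    MAP@k is the mean of this value over all topics.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def ndcg_at_k(ranked, relevant, k=1000):
    """NDCG over the top-k ranked documents with binary gains."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    # Ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy example with hypothetical document IDs (not data from the paper).
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "e"}
p, r, f1 = precision_recall_f1(ranked, relevant)
ap = average_precision_at_k(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevant)
```

In this toy case two of the four retrieved documents are relevant and one relevant document is never retrieved, so precision is 0.5 and recall is 2/3. The ranking metrics additionally reward placing relevant documents near the top of the list.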
Leandra Budau
Richard Finney
Faezeh Ensan
International Journal of Medical Informatics
Metropolitan University
Budau et al. studied this question.
www.synapsesocial.com/papers/69fd7e5cbfa21ec5bbf069df — DOI: https://doi.org/10.1016/j.ijmedinf.2026.106463