Abstract

Background
Uganda has a high incidence of tuberculosis (TB), and chest X-rays are widely used for diagnosis. However, interpreting chest X-rays requires radiologists, who are in short supply in Uganda. Machine learning has shown potential to automate this process, but it typically requires a large dataset of annotated chest X-ray images. We developed a multi-task, multimodal machine learning model that makes effective use of a small dataset of X-ray images.

Methods
Starting with a dataset of chest X-ray images annotated with labels for the TB diagnosis and segmentation tasks, we curated a multimodal dataset with additional labels for Visual Question Answering (VQA), object detection, and report generation. We developed PaliGemma-CXR by fine-tuning PaliGemma, a vision-language foundation model, jointly on all five tasks.

Results
PaliGemma-CXR achieved competitive performance across all tasks and outperformed the same model architecture trained separately on each individual task. Specifically, it attained 90.32% accuracy on TB diagnosis, 98.95% accuracy on close-ended VQA, a BLEU score of 41.3 for report generation, and mean average precision (mAP) scores of 19.4 and 16.0 for object detection and segmentation, respectively.

Conclusion
Our multi-task model outperforms task-specific approaches. Because a single model performs multiple essential tasks, it is a more effective and easier-to-deploy solution for low-resource clinical settings.
Musinguzi et al. (Thu,) studied this question.