What question did this study set out to answer?

To evaluate the effectiveness of different LLM coordination strategies in title-abstract screening tasks.

April 17, 2026 | Open Access

Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening

Key points

Aim: evaluate the effectiveness of different LLM coordination strategies for title-abstract screening.
Compared five coordination strategies: single-agent baseline, majority voting, recall-focused ensemble, confidence-weighted aggregation, and two-stage filtering.
Used four 4-bit quantised open-source LLMs: Mistral 7B, LLaMA 3.1 8B, Granite 3.3 8B, and Qwen 2.5 7B.
Evaluated every model in both zero-shot and few-shot configurations.
Evaluated against a Gold Standard of 200 papers drawn from a corpus of 2036 records on blockchain-based e-voting.
The best configuration, the single-agent strategy with Qwen 2.5 7B in few-shot mode, achieved 100% recall, 70.4% precision, and an F1 score of 82.6%.
The best single-agent configuration also reduced manual screening effort by 43.4% and outperformed all multi-agent alternatives.
Confidence-weighted aggregation produced results identical to majority voting, indicating that self-reported confidence from 7-8B parameter models added no discriminative value.
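
The reported F1 score can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The snippet below is a minimal sketch of that arithmetic; the function name `f1_score` is chosen here for illustration and is not taken from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the best configuration: 70.4% precision, 100% recall.
reported_f1 = f1_score(precision=0.704, recall=1.0)
print(f"{reported_f1:.1%}")  # 82.6%
```

The result matches the 82.6% F1 stated above, so the three reported figures are mutually consistent.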

Abstract

Title-abstract screening remains labour-intensive, especially in interdisciplinary domains where shared terminology increases misclassification risk. This study compared five LLM coordination strategies—single-agent baseline, majority voting, recall-focused ensemble, confidence-weighted aggregation, and two-stage filtering—using four 4-bit quantised open-source models (Mistral 7B, LLaMA 3.1 8B, Granite 3.3 8B, Qwen 2.5 7B) in zero-shot and few-shot configurations. The evaluation was conducted on a Gold Standard of 200 papers from a corpus of 2036 records on blockchain-based e-voting. The best-performing configuration—a single-agent strategy with Qwen 2.5 7B in few-shot mode—achieved recall of 100%, precision of 70.4%, F1 of 82.6%, and a 43.4% reduction in manual screening effort, outperforming all multi-agent alternatives. Confidence-weighted aggregation produced results identical to majority voting, indicating that self-reported confidence from 7–8B parameter models did not add discriminative value. All screening decisions were logged on a private blockchain with timestamped anchoring for reproducibility. These results suggest that, for domain-specific screening tasks, careful model selection outweighs multi-agent coordination overhead, and that few-shot prompting with a well-matched model can achieve human-level recall with substantially reduced manual effort.
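
To make the compared aggregation rules concrete, here is a minimal Python sketch of how the four multi-agent strategies might combine per-model decisions for a single record. Everything in it is an assumption for illustration: the `Vote` type, the function names, the tie-breaking rule (ties resolve to include), and the two-stage layout (one first-stage screener, then a majority among the remaining models) are not taken from the paper, whose exact definitions are not given in this summary.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    include: bool      # one model's include/exclude decision for a record
    confidence: float  # that model's self-reported confidence in [0, 1]


def majority_vote(votes: list[Vote]) -> bool:
    # Include when at least half of the models say include (ties -> include, an assumption).
    yes = sum(v.include for v in votes)
    return 2 * yes >= len(votes)


def recall_focused(votes: list[Vote]) -> bool:
    # Include when any single model says include, trading precision for recall.
    return any(v.include for v in votes)


def confidence_weighted(votes: list[Vote]) -> bool:
    # Compare summed confidence behind "include" against summed confidence behind "exclude".
    yes = sum(v.confidence for v in votes if v.include)
    no = sum(v.confidence for v in votes if not v.include)
    return yes >= no


def two_stage(first: Vote, second_stage: list[Vote]) -> bool:
    # Stage 1 screens the record; stage 2 only votes on records that passed stage 1.
    if not first.include:
        return False
    return majority_vote(second_stage)


# With four models that all report the same confidence, the weighted rule
# reduces to a head count, so it agrees with majority voting on every pattern.
uniform_agrees = all(
    majority_vote(votes) == confidence_weighted(votes)
    for decisions in itertools.product([True, False], repeat=4)
    for votes in [[Vote(d, 0.9) for d in decisions]]
)
print(uniform_agrees)  # True under these assumptions
```

The final check shows one way confidence weighting can collapse into majority voting: if the models report near-uniform confidence, the weights carry no extra signal. Whether that is what happened in the study is not stated here; the summary only reports that the two strategies produced identical results.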

Authors

Irina Radeva

Teodora Noncheva

Lyubka Doukovska

Journals

Electronics

Institutions

Bulgarian Academy of Sciences

Trakia University

Institute of Information and Communication Technologies
