Title-abstract screening remains labour-intensive, especially in interdisciplinary domains where shared terminology increases misclassification risk. This study compared five LLM coordination strategies—single-agent baseline, majority voting, recall-focused ensemble, confidence-weighted aggregation, and two-stage filtering—using four 4-bit quantised open-source models (Mistral 7B, LLaMA 3.1 8B, Granite 3.3 8B, Qwen 2.5 7B) in zero-shot and few-shot configurations. The evaluation was conducted on a Gold Standard of 200 papers from a corpus of 2036 records on blockchain-based e-voting. The best-performing configuration—a single-agent strategy with Qwen 2.5 7B in few-shot mode—achieved recall of 100%, precision of 70.4%, F1 of 82.6%, and a 43.4% reduction in manual screening effort, outperforming all multi-agent alternatives. Confidence-weighted aggregation produced results identical to majority voting, indicating that self-reported confidence from 7–8B parameter models did not add discriminative value. All screening decisions were logged on a private blockchain with timestamped anchoring for reproducibility. These results suggest that, for domain-specific screening tasks, careful model selection outweighs multi-agent coordination overhead, and that few-shot prompting with a well-matched model can achieve human-level recall with substantially reduced manual effort.
Building similarity graph...
Analyzing shared references across papers
Loading...
Irina Radeva
Teodora Noncheva
Lyubka Doukovska
Electronics
Bulgarian Academy of Sciences
Trakia University
Institute of Information and Communication Technologies
Building similarity graph...
Analyzing shared references across papers
Loading...
Radeva et al. (Wed,) studied this question.
synapsesocial.com/papers/69e1cecc5cdc762e9d857cc3 — DOI: https://doi.org/10.3390/electronics15081661