What question did this study set out to answer?

The research aims to evaluate the effectiveness of open-source Large Language Models in enhancing the automation and accuracy of Security Operations Centers.

April 17, 2026 · Open Access

Benchmarking Open-Source Large Language Models for Security Operations Center Automation: A Comprehensive Evaluation of Privacy-Preserving Threat Detection with Prompt Engineering, SOC Maturity Assessment, and Future Autonomous SOC Architecture

Key Points

Conducted a comprehensive assessment across ten open-source LLMs using the Ollama inference framework.
Evaluated models against 28 synthetic security use cases and seven performance dimensions.
Developed a SOC maturity assessment framework spanning five generations.
Proposed an AI SOC architecture incorporating multiple advanced technologies.
Prompt engineering improved accuracy by 12–28 percentage points and reduced token consumption by 40–65%.
Local LLMs achieved 78–94% accuracy on SOC triage tasks, with Qwen 2.5 14B being the top-performing model.
AI enhancements could recover 40–60% of visibility lost due to exclusion-driven alert management.
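
The two prompt-engineering figures above use different units: the accuracy gain is an absolute difference in percentage points, while the token saving is a reduction relative to the baseline. The following is a minimal sketch of how both could be computed from per-run results. The function names and all sample values are illustrative assumptions, not data or code from the study.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def percentage_point_gain(baseline_acc, engineered_acc):
    """Absolute accuracy difference, expressed in percentage points."""
    return (engineered_acc - baseline_acc) * 100


def relative_reduction(baseline_tokens, engineered_tokens):
    """Relative token saving, expressed as a percentage of the baseline."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - engineered_tokens) / baseline_tokens * 100


# Hypothetical example values, chosen only to illustrate the units.
labels = ["malicious", "benign", "malicious", "benign", "malicious"]
baseline_preds = ["malicious", "malicious", "benign", "benign", "malicious"]
engineered_preds = ["malicious", "benign", "malicious", "benign", "benign"]

baseline_acc = accuracy(baseline_preds, labels)      # 3 of 5 correct
engineered_acc = accuracy(engineered_preds, labels)  # 4 of 5 correct

gain_pp = percentage_point_gain(baseline_acc, engineered_acc)
saving_pct = relative_reduction(baseline_tokens=1000, engineered_tokens=500)
print(f"accuracy gain: {gain_pp:.1f} pp, token saving: {saving_pct:.1f}%")
```

Keeping the two helpers separate makes it harder to confuse a gain in percentage points with a relative percentage change when reporting results.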

Abstract

Modern enterprise Security Operations Centers (SOCs) are overwhelmed by the sheer volume and diversity of security alerts generated across their technology stack. A mid-sized organization typically processes 1,000–10,000 alerts daily, while large enterprises face 50,000–150,000+ alerts across multiple security products: SIEM platforms generate 500–5,000 correlation alerts daily from billions of raw log events; EDR solutions produce 200–2,000 endpoint alerts per day across thousands of managed devices; Data Loss Prevention (DLP) systems trigger 100–800 policy violation alerts daily, many of which are false positives from legitimate business workflows; Cloud Security (CASB/CSPM) tools generate 300–1,500 configuration and access anomaly alerts; Identity and Access Management (IAM/IDP) systems produce 200–1,000 authentication anomaly alerts; email security gateways flag 500–3,000 phishing and spam alerts daily; and compliance monitoring tools generate 100–500 regulatory deviation alerts across HIPAA, PCI-DSS, GDPR, and SOX frameworks.
The consequences of this alert volume are severe. Due to limited analyst capacity and rigid detection rules, organizations routinely implement broad exclusion lists and suppression filters that silently discard 30–60% of alerts before human review, creating dangerous blind spots. Insider threats, which account for 25–30% of data breaches, are frequently missed because behavioral anomalies are excluded as “known business activity.” DLP alerts for sensitive data movement are suppressed when they involve executive accounts or high-volume business units. Lateral movement detection fails silently when service account activity is blanket-excluded from monitoring. The result is a critical lack of visibility: organizations believe their SOC is monitoring their environment, but entire attack categories go undetected due to exclusion-driven coverage gaps.
This paper presents a comprehensive empirical evaluation of ten open-source Large Language Models (LLMs) deployed entirely offline using the Ollama inference framework to address these challenges. We evaluate models across 28 synthetic security use cases and seven dimensions. Our critical finding is that prompt engineering has a larger impact than model selection: engineered prompts reduce token consumption by 40–65% while improving accuracy by 12–28 percentage points. We contribute a SOC maturity assessment framework spanning five generations, present a Future AI SOC architecture incorporating agentic AI, RAG, MCP, agentless scanning, API security monitoring, and multi-agent systems, and provide a technology roadmap through 2030. Locally deployed LLMs achieve 78–94% accuracy on SOC triage tasks, with Qwen 2.5 14B ranking as the best overall model. By processing alerts that would otherwise be excluded or suppressed, AI-assisted SOC operations can recover 40–60% of the visibility currently lost to exclusion-based alert management. All benchmark data is released under CC BY 4.0.
Index Terms: Open-source LLM, SOC automation, prompt engineering, agentic AI, RAG, MCP, agentless security, API security, multi-agent systems, MITRE ATT&CK, threat detection, privacy-preserving AI, SOC maturity model, autonomous security operations, DLP, insider threat, alert fatigue, exclusion management, SIEM, EDR, compliance monitoring.
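
To make the offline setup described in the abstract more concrete, the sketch below shows one way an alert could be sent to a locally hosted model through Ollama's `/api/generate` endpoint with a constrained, JSON-only prompt. The prompt wording, label set, function names, and the `qwen2.5:14b` model tag are illustrative assumptions, not the study's actual prompts or harness. The transport is injectable so the logic can be exercised without a running server.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server; adjust for your setup.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical engineered prompt: fixed role, closed label set, JSON-only output.
PROMPT_TEMPLATE = (
    "You are a SOC triage analyst. Classify the alert below.\n"
    'Reply with JSON only: {{"verdict": "malicious|suspicious|benign", '
    '"reason": "<one sentence>"}}\n'
    "Alert: {alert}"
)


def post_json(url, payload):
    """Send a JSON POST request and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def triage_alert(alert, model="qwen2.5:14b", send=post_json):
    """Ask a locally hosted model for a verdict on a single alert."""
    payload = {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(alert=alert),
        "stream": False,   # return one complete response object
        "format": "json",  # ask Ollama to constrain output to valid JSON
        "options": {"temperature": 0},
    }
    reply = send(OLLAMA_URL, payload)
    result = json.loads(reply["response"])
    return {
        "verdict": result.get("verdict", "unknown"),
        "reason": result.get("reason", ""),
        # Token counts reported by Ollama, useful for tracking prompt cost.
        "tokens": reply.get("prompt_eval_count", 0) + reply.get("eval_count", 0),
    }
```

Calling `triage_alert("...")` with the default transport requires a running Ollama server with the model already pulled (for example via `ollama pull qwen2.5:14b`). Summing `prompt_eval_count` and `eval_count` gives a simple per-alert token figure that could be compared across prompt variants.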
