Large Language Models (LLMs) offer promising capabilities for converting unstructured software documentation into structured task flows, yet their outputs often lack the procedural reliability that software engineering requires. This paper presents a framework that benchmarks five leading LLMs (Gemini 2.5 Pro, Grok 3, GPT-Omni, DeepSeek-R1, and LLaMA-3) across five prompting strategies, including Zero-Shot, Chain-of-Thought, and ISO 21502-Guided, using real-world software tutorials from the "Build Your Own X" repository. We introduce the Hybrid Semantic Similarity Metric (HSSM), which combines SentenceTransformer embeddings with context-aware key-term overlap to capture both semantic fidelity and procedural coherence. Compared to traditional metrics such as BERTScore, SBERT, and USE, HSSM shows significantly lower variance (coefficient of variation, CV: 1.5–2.9%) and stronger correlation with human judgments. Our results show that even minimal prompting (Zero-Shot) can yield highly aligned task flows (HSSM: 96.33%) when evaluated with robust metrics. This work offers a scalable evaluation paradigm for LLM-assisted software planning, with implications for AI-driven project management, prompt engineering, and procedural generation in software education and tooling.
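To make the idea of a hybrid metric concrete, the following is a minimal sketch of how an embedding-based similarity and a key-term overlap could be blended into one score, together with the coefficient of variation used to compare metric stability. It is not the authors' HSSM implementation: the weighting `alpha`, the recall-style overlap, the key-term list, and the function names are all assumptions made for illustration. A toy bag-of-words vector stands in for SentenceTransformer embeddings so that the example runs with the standard library alone.

```python
import math
import re
import statistics
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bow_embed(text):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def key_term_overlap(reference, candidate, key_terms):
    """Fraction of the reference's key terms that also appear in the candidate."""
    ref_tokens, cand_tokens = set(tokenize(reference)), set(tokenize(candidate))
    in_reference = {t for t in key_terms if t in ref_tokens}
    if not in_reference:
        return 0.0
    return len(in_reference & cand_tokens) / len(in_reference)


def hybrid_score(reference, candidate, key_terms, alpha=0.7):
    """Weighted blend of semantic similarity and key-term overlap.

    alpha is a hypothetical weight; the source does not state how the two
    components are combined.
    """
    semantic = cosine(bow_embed(reference), bow_embed(candidate))
    lexical = key_term_overlap(reference, candidate, key_terms)
    return alpha * semantic + (1 - alpha) * lexical


def coefficient_of_variation(scores):
    """Population standard deviation as a percentage of the mean."""
    return 100.0 * statistics.pstdev(scores) / statistics.mean(scores)
```

In a realistic setting, `bow_embed` and `cosine` would be replaced by a sentence encoder and its vector cosine similarity, and the key terms would come from the tutorial being evaluated. A lower coefficient of variation across repeated runs indicates a more stable metric, which is the sense in which the abstract reports CV.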
Sarim et al. studied this question.
www.synapsesocial.com/papers/68e861b07ef2f04ca37e49af — DOI: https://doi.org/10.1038/s41598-025-19170-9
Mohammed Sarim
Faraz Masood
Manas Maheshwari
Scientific Reports
Aligarh Muslim University
University of Science and Technology