What type of study is this?

This is a quantitative study.

October 10, 2025 · Open Access

Generating reliable software project task flows using large language models through prompt engineering and robust evaluation

Key Points

Robust evaluation shows even minimal prompting yields task flows with 96.33% alignment, enhancing software project workflows.
Using the Hybrid Semantic Similarity Metric for evaluation, this approach shows lower variance and stronger correlation with human judgments.
Benchmarking five leading large language models reveals potential for improving procedural reliability in software education.
The study introduces effective prompting strategies for LLMs, indicating significant implications for AI-driven project management.

Abstract

Large Language Models (LLMs) offer promising capabilities for converting unstructured software documentation into structured task flows, yet their outputs often lack procedural reliability critical for software engineering. This paper presents a comprehensive framework that benchmarks five leading LLMs (Gemini 2.5 Pro, Grok 3, GPT-Omni, DeepSeek-R1, and LLaMA-3) across five prompting strategies, including Zero-Shot, Chain-of-Thought, and ISO 21502-Guided, using real-world software tutorials from the "Build Your Own X" repository. We introduce the Hybrid Semantic Similarity Metric (HSSM), which combines SentenceTransformer embeddings with context-aware key-term overlap, capturing both semantic fidelity and procedural coherence. Compared to traditional metrics like BERTScore, SBERT, and USE, HSSM demonstrates significantly lower variance (CV: 1.5–2.9%) and stronger correlation with human judgments. Our results show that even minimal prompting (Zero-Shot) can yield highly aligned task flows (HSSM: 96.33%) when evaluated with robust metrics. This work offers a scalable evaluation paradigm for LLM-assisted software planning, with implications for AI-driven project management, prompt engineering, and procedural generation in software education and tooling.
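To make the idea of a hybrid metric concrete, here is a minimal sketch of how an embedding similarity and a key-term overlap could be blended into one score, along with the coefficient of variation (CV) used to compare metric stability. It is not the paper's HSSM formula. The weight `alpha`, the function names, and the bag-of-words `toy_embed` stand-in (used in place of real SentenceTransformer embeddings so the example runs with the standard library alone) are all assumptions made for illustration.

```python
import math
import re
import statistics
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_embed(text):
    """ASSUMPTION: bag-of-words counts standing in for a sentence embedding."""
    return Counter(tokens(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def key_term_overlap(reference, candidate, key_terms):
    """Fraction of key terms found in the reference that the candidate keeps."""
    expected = set(key_terms) & set(tokens(reference))
    if not expected:
        return 1.0  # no key terms to miss (a choice made for this sketch)
    return len(expected & set(tokens(candidate))) / len(expected)


def hybrid_score(reference, candidate, key_terms, alpha=0.5):
    """Weighted blend of semantic similarity and key-term overlap.

    ASSUMPTION: a linear blend with weight alpha; the paper's actual
    combination rule is not given in the abstract.
    """
    semantic = cosine(toy_embed(reference), toy_embed(candidate))
    overlap = key_term_overlap(reference, candidate, key_terms)
    return alpha * semantic + (1 - alpha) * overlap


def coefficient_of_variation(scores):
    """CV in percent: sample standard deviation divided by the mean."""
    return statistics.stdev(scores) / statistics.mean(scores) * 100
```

In practice, `toy_embed` and `cosine` would be swapped for real sentence embeddings and a dense-vector cosine. A lower CV across repeated runs indicates a more stable metric, which is the sense in which the abstract reports a CV of 1.5–2.9%.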

Cite this study

Sarim et al. (2025) studied this question.

www.synapsesocial.com/papers/68e861b07ef2f04ca37e49af — DOI: https://doi.org/10.1038/s41598-025-19170-9

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

Correlation Coefficients: Appropriate Use and Interpretation · 2018 · 10,075 citations
An Empirical Study of the Code Generation of Safety-Critical Software Using LLMs · 2024 · 39 citations
On the Role of Morphological Information for Contextual Lemmatization · 2023 · 13 citations
BLEU · 2001 · 21,375 citations
The Stanford typed dependencies representation · 2008 · 921 citations

Authors

Mohammed Sarim

Faraz Masood

Manas Maheshwari

Journals

Scientific Reports

Institutions

Aligarh Muslim University

University of Science and Technology
