Laws and regulations increasingly influence software design, development, and quality assurance in regulated domains; however, the technology-neutral formulation of legal provisions complicates the derivation of concrete specifications, requirements, and acceptance criteria needed to verify software compliance. Producing these artifacts manually is labour-intensive and error-prone. Recent advances in generative AI, particularly large language models (LLMs), offer the potential for automated assistance in deriving software engineering artifacts from legal texts. Following a quasi-experimental design, we present the first systematic human-subject evaluation of LLMs’ ability to automatically derive Gherkin behavioural specifications from legal texts. Gherkin is a domain-specific language for specifying system behaviours through scenario-based descriptions written in the Given-When-Then format. Due to their structured and machine-readable nature, Gherkin specifications lend themselves more readily to automation within software-development processes. We recruited 10 participants to evaluate Gherkin specifications generated from food-safety regulations by two LLMs, Claude and Llama. Sixty specifications were generated. Each participant independently assessed 12 specifications across five quality criteria: relevance, clarity, completeness, singularity, and time savings. Each specification was evaluated by two participants, yielding 120 assessments with quantitative ratings and qualitative feedback. Ratings were uniformly high (top-two categories): relevance 95%, clarity 100%, completeness 94.2%, singularity 93.4%, and time savings 91.7%. No statistically reliable differences were observed across participants or between LLMs.
Qualitative feedback noted occasional omissions, hallucinations, and mixed intents; the first two, in particular, underscore the importance of human oversight, especially in safety-critical domains where non-compliance can have severe consequences. Our results suggest that, in the context of food safety, LLMs can assist in deriving Gherkin specifications from legal texts; however, observed omissions and hallucinations necessitate systematic human review.
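For readers unfamiliar with the Given-When-Then format, the following is a minimal sketch of what a Gherkin scenario looks like, together with a simple structural check. The scenario text, the temperature values, and the helper names `step_keywords` and `follows_given_when_then` are hypothetical illustrations, not material from the study. The check is deliberately simplified: Gherkin itself allows richer step sequences.

```python
import re

# Hypothetical scenario, invented for illustration; not taken from the study.
SCENARIO = """\
Feature: Cold storage of perishable food
  Scenario: Refrigerated product exceeds the temperature limit
    Given a refrigerated product with a maximum storage temperature of 4 degrees Celsius
    When the recorded storage temperature is 6 degrees Celsius
    Then the system flags the product as non-compliant
"""

# Matches a line that begins with a Gherkin step keyword.
STEP = re.compile(r"^\s*(Given|When|Then|And|But)\b")


def step_keywords(text):
    """Return the leading step keyword of each step line, in order."""
    return [m.group(1) for line in text.splitlines() if (m := STEP.match(line))]


def follows_given_when_then(text):
    """Simplified check: exactly one Given, one When, one Then, in that order."""
    primary = [k for k in step_keywords(text) if k in ("Given", "When", "Then")]
    return primary == ["Given", "When", "Then"]


print(step_keywords(SCENARIO))
print(follows_given_when_then(SCENARIO))
```

A check of this kind only confirms the scenario's shape. It cannot detect the omissions or hallucinations noted in the qualitative feedback, which is why human review of the content remains necessary.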
Shabnam Hassani
Mehrdad Sabetzadeh
Daniel Amyot
Information and Software Technology
University of Ottawa
Hassani et al. studied this question.
www.synapsesocial.com/papers/69ca1210883daed6ee094e48 — DOI: https://doi.org/10.1016/j.infsof.2026.108122