Laws and regulations increasingly influence software design, development, and quality assurance in regulated domains; however, the technology-neutral formulation of legal provisions complicates the derivation of the concrete specifications, requirements, and acceptance criteria needed to verify software compliance. Producing these artifacts manually is labour-intensive and error-prone. Recent advances in generative AI, particularly large language models (LLMs), offer the potential for automated assistance in deriving software engineering artifacts from legal texts. Following a quasi-experimental design, we present the first systematic human-subject evaluation of LLMs’ ability to automatically derive Gherkin behavioural specifications from legal texts. Gherkin is a domain-specific language for specifying system behaviours through scenario-based descriptions written in the Given–When–Then format. Because they are structured and machine-readable, Gherkin specifications lend themselves readily to automation within software-development processes. We recruited 10 participants to evaluate Gherkin specifications generated from food-safety regulations by two LLMs, Claude and Llama. Sixty specifications were generated. Each participant independently assessed 12 specifications against five quality criteria: relevance, clarity, completeness, singularity, and time savings. Each specification was evaluated by two participants, yielding 120 assessments with quantitative ratings and qualitative feedback. Ratings were consistently high; the share of assessments in the top two rating categories was 95% for relevance, 100% for clarity, 94.2% for completeness, 93.4% for singularity, and 91.7% for time savings. No statistically reliable differences were observed across participants or between LLMs.
Qualitative feedback noted occasional omissions, hallucinations, and mixed intents; the first two, in particular, underscore the importance of human oversight, especially in safety-critical domains where non-compliance can have severe consequences. Our results suggest that, in the context of food safety, LLMs can assist in deriving Gherkin specifications from legal texts; however, observed omissions and hallucinations necessitate systematic human review.
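To make the Given–When–Then format concrete, the following minimal Python sketch embeds a single Gherkin scenario and checks that its steps appear in Given, When, Then order. The scenario text is hypothetical and invented purely for illustration; it is not taken from the study's generated specifications or from any actual food-safety regulation. The helper names `parse_steps` and `follows_given_when_then` are likewise assumptions, and the check is a deliberately simplified stand-in for a full Gherkin parser.

```python
import re

# Hypothetical scenario, invented for illustration only. It is NOT taken from
# the study's data or from any actual regulation.
FEATURE = """\
Feature: Cold storage temperature monitoring

  Scenario: Alert when storage temperature exceeds the limit
    Given a cold storage unit with a configured maximum temperature of 4 degrees Celsius
    When a sensor reports a temperature of 6 degrees Celsius
    Then the system records a temperature deviation
    And the system notifies the responsible operator
"""

# A step line starts with one of the Gherkin step keywords.
STEP_RE = re.compile(r"^\s*(Given|When|Then|And|But)\s+(.*)$")


def parse_steps(text):
    """Return (keyword, step text) pairs; And/But inherit the preceding keyword."""
    steps = []
    last_keyword = None
    for line in text.splitlines():
        match = STEP_RE.match(line)
        if match is None:
            continue  # skip Feature/Scenario headers and blank lines
        keyword, body = match.groups()
        if keyword in ("And", "But"):
            if last_keyword is None:
                raise ValueError("And/But step without a preceding step")
            keyword = last_keyword
        else:
            last_keyword = keyword
        steps.append((keyword, body))
    return steps


def follows_given_when_then(steps):
    """Simplified check: all three keywords occur, in Given, When, Then order."""
    rank = {"Given": 0, "When": 1, "Then": 2}
    ranks = [rank[keyword] for keyword, _ in steps]
    return ranks == sorted(ranks) and set(ranks) == {0, 1, 2}


if __name__ == "__main__":
    for keyword, body in parse_steps(FEATURE):
        print(f"{keyword:5s} | {body}")
    print("Well-formed:", follows_given_when_then(parse_steps(FEATURE)))
```

A structural check of this kind covers only the form of a specification; whether a scenario is relevant or complete with respect to the underlying legal provision, or contains hallucinated content, still has to be judged by a human reviewer.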
Hassani et al. (Sun,) studied this question.