What question did this study set out to answer?

This research aims to evaluate the ability of large language models (LLMs) to derive Gherkin specifications from food-safety regulations.

March 30, 2026 · Open Access

From law to Gherkin: A human-centred quasi-experiment on the quality of LLM-generated behavioural specifications from food-safety regulations

Key Points

The study evaluated how well large language models (LLMs) derive Gherkin specifications from food-safety regulations.
It used a quasi-experimental design with 10 participants.
Sixty Gherkin specifications were generated using two LLMs, Claude and Llama.
Each participant assessed 12 specifications against five quality criteria; each specification was assessed by two participants, yielding 120 assessments.
Assessments combined quantitative ratings with qualitative feedback.
Ratings were uniformly high (share in the top-two rating categories): relevance 95%, clarity 100%, completeness 94.2%, singularity 93.4%, time savings 91.7%.
No statistically reliable differences were found across participants or between LLMs.
Qualitative feedback noted occasional omissions, hallucinations, and mixed intents, indicating the need for systematic human review.

Abstract

Laws and regulations increasingly influence software design, development, and quality assurance in regulated domains; however, the technology-neutral formulation of legal provisions complicates the derivation of concrete specifications, requirements, and acceptance criteria needed to verify software compliance. Producing these artifacts manually is labour-intensive and error-prone. Recent advances in generative AI, particularly large language models (LLMs), offer the potential for automated assistance in deriving software engineering artifacts from legal texts. Following a quasi-experimental design, we present the first systematic human-subject evaluation of LLMs’ ability to automatically derive Gherkin behavioural specifications from legal texts. Gherkin is a domain-specific language for specifying system behaviours through scenario-based descriptions written in the Given-When-Then format. Due to their structured and machine-readable nature, Gherkin specifications lend themselves more readily to automation within software-development processes. We recruited 10 participants to evaluate Gherkin specifications generated from food-safety regulations by two LLMs, Claude and Llama. Sixty specifications were generated. Each participant independently assessed 12 specifications across five quality criteria: relevance, clarity, completeness, singularity, and time savings. Each specification was evaluated by two participants, yielding 120 assessments with quantitative ratings and qualitative feedback. Ratings were uniformly high (top-two categories): relevance 95%, clarity 100%, completeness 94.2%, singularity 93.4%, and time savings 91.7%. No statistically reliable differences were observed across participants or between LLMs.
Qualitative feedback noted occasional omissions, hallucinations, and mixed intents; the first two, in particular, underscore the importance of human oversight, especially in safety-critical domains where non-compliance can have severe consequences. Our results suggest that, in the context of food safety, LLMs can assist in deriving Gherkin specifications from legal texts; however, observed omissions and hallucinations necessitate systematic human review.
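
To make the Given-When-Then format concrete, the following is a minimal Python sketch of how the structured nature of a Gherkin scenario lends itself to automated checks. The scenario text is hypothetical and written purely for illustration; it is not taken from the study's dataset or from any specific regulation, and the temperature values are assumed. The helper names `extract_steps` and `follows_given_when_then` are likewise illustrative, and the ordering check is a deliberately strict simplification rather than a full Gherkin parser.

```python
import re

# Hypothetical scenario for illustration only; not from the study's dataset
# and not derived from any specific regulation. Values are assumed.
SCENARIO = """\
Feature: Cold storage temperature monitoring
  Scenario: Alert when a chilled product exceeds its temperature limit
    Given a chilled product with a maximum storage temperature of 5 degrees Celsius
    When the recorded storage temperature is 7 degrees Celsius
    Then the system raises a temperature deviation alert
    And the deviation is recorded in the monitoring log
"""

# Matches lines that begin (after indentation) with a Gherkin step keyword.
STEP = re.compile(r"^\s*(Given|When|Then|And|But)\b\s*(.*)$")


def extract_steps(text):
    """Return (keyword, step text) pairs for each Gherkin step line."""
    steps = []
    for line in text.splitlines():
        match = STEP.match(line)
        if match:
            steps.append((match.group(1), match.group(2)))
    return steps


def follows_given_when_then(steps):
    """Strict check: exactly one Given, one When, one Then, in that order.

    And/But continuation steps are ignored. Real Gherkin is more permissive;
    this simplification only illustrates machine-checkable structure.
    """
    primary = [kw for kw, _ in steps if kw in ("Given", "When", "Then")]
    return primary == ["Given", "When", "Then"]


if __name__ == "__main__":
    parsed = extract_steps(SCENARIO)
    for keyword, body in parsed:
        print(f"{keyword:>5}: {body}")
    print("Given-When-Then order:", follows_given_when_then(parsed))
```

A structural check of this kind says nothing about whether a scenario is faithful to the legal text; omissions and hallucinations of the sort reported in the qualitative feedback would still require human review.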
