Transparent reasoning and interpretability are essential for AI-supported risk assessment, yet it remains unclear whether large language models (LLMs) can provide reliable, deterministic support for safety-critical tasks or merely simulate reasoning through plausible outputs. This study presents a systematic, multi-model empirical evaluation of reasoning-capable LLMs applied to machinery functional safety, focusing on Required Performance Level (PLr) estimation as defined by ISO 13849-1 and ISO 12100. Six state-of-the-art models (Claude-opus, o3-mini, o4-mini, GPT-5-mini, Gemini-2.5-flash, DeepSeek-Reasoner) were evaluated across six prompting strategies and two dataset variants: canonical ISO-style hazards (Variant 1) and engineer-authored free-text scenarios (Variant 2). Results show that rule-grounded prompting consistently stabilizes performance, achieving ceiling-level accuracy in Variant 1 and restoring reliability under lexical variability in Variant 2. In contrast, unconstrained chain-of-thought reasoning (CoT) and CoT together with Retrieval-Augmented Generation (RAG) introduce volatility, overprediction biases, and model-dependent degradations. Safety-critical coverage was quantified through per-class F1 and recall of PLr class e, confirming that only rule-grounded prompts reliably captured rare but high-risk hazards. Latency analysis demonstrated that rule-only prompts were both the most accurate and the most efficient, while CoT strategies incurred 2–10× overhead. A confusion/rescue analysis of retrieval interactions further revealed systematic noise mechanisms such as P-inflation and F-drift, showing that retrieval can either destabilize or rescue cases depending on model family. Intermediate severity/frequency/possibility (S/F/P) reasoning steps were found to diverge from ISO-consistent logic, reinforcing critiques that LLM “reasoning” reflects surface-level continuation rather than genuine inference. All reported figures include 95% confidence intervals, t-intervals across runs (r=5) for accuracy and timing, and class-stratified bootstrap CIs for Micro/Macro/Weighted-F1 and per-class metrics. Overall, this study establishes a rigorous benchmark for evaluating LLMs in functional safety workflows such as PLr determination. It shows that deterministic, safety-critical classification requires strict rule-constrained prompting and careful retrieval governance, rather than reliance on assumed model reasoning abilities.
Building similarity graph...
Analyzing shared references across papers
Loading...
Padma Iyenghar (Fri,) studied this question.
www.synapsesocial.com/papers/68d45e6a31b076d99fa5f1ca — DOI: https://doi.org/10.3390/electronics14183624
Padma Iyenghar
Electronics
Building similarity graph...
Analyzing shared references across papers
Loading...