Deploying large language models for high-stakes domain-specific reasoning requires addressing challengesabsent from standard benchmarks: handling incomplete information, quantifying uncertainty, and performing multi-step numerical calculations with authoritative source attribution. We present a hybrid architecture combining parameter-efficient fine-tuning via Quantized Low-Rank Adaptation (QLoRA) with Retrieval-Augmented Generation (RAG), evaluated on Saudi Arabia’s End-of-Service Benefits calculation—a legally binding financial computation involving 16 interacting legal provisions across 35 termination scenarios. Our contributions include: a comprehensive synthetic dataset of 10,000 samples systematically modeling real-world legal consultation complexities—incomplete information (15%), conflicting evidence (10%), legal interpretation ambiguities (5%), and adversarial examples (5%)—grounded in empirical distributions from 47,382 actual cases, 3,847 labor court disputes, and expert interviews (n=23); a hybrid architectural approach demonstrating that combining QLoRA fine-tuning (0.42% trainable parameters, 93.5% memory reduction) with retrieval-augmented generation yields complementary benefits, outperforming isolated components by 5.8–8.7 percentage points;and integrated uncertainty quantification mechanisms combining epistemic (MC Dropout), aleatoric (retrieval confidence, linguistic hedging), and calibration (temperature scaling) methods achieving Expected Calibration Error of 0.043 and 89.4% precision in detecting ambiguous cases requiring human review. Evaluation on 1,000 held-out synthetic test cases—stratified across six complexity tiers—shows 94.2% accuracy (±5% tolerance), 91.5% legal citation correctness, and graceful degradation across complexity tiers (98.7% standard cases → 82.0% adversarial examples). We note that all quantitative evaluation is conducted on synthetic data; real-world deployment validation remains an important next step. Human evaluation by five Saudi legal experts (inter-rater κ = 0.73) yields 4.4/5 overall rating with unanimous recommendation for pilot deployment. While our primary evaluation relies on synthetic data and focuses on a single legal calculation domain, the methodological framework—synthetic modeling of domain ambiguity, architectural patterns for parametric-retrieval integration, and uncertainty-aware human-AI collaboration—provides a transferable template for specialized reasoning tasks requiring numerical precision, source attribution, and confidence calibration. We discuss threats to external validity and outline concrete steps toward real-world validation.
Building similarity graph...
Analyzing shared references across papers
Loading...
Nasser Aldosari
Farookh Hussain
Mohammed Tawfik
University of Technology Sydney
Ajloun National University
Building similarity graph...
Analyzing shared references across papers
Loading...
Aldosari et al. (Sat,) studied this question.
www.synapsesocial.com/papers/69df2ba0e4eeef8a2a6b0a3d — DOI: https://doi.org/10.19139/soic-2310-5070-3444