Standard LLM benchmarks measure capability - what the model can do - but not constraint - what the model should not do. We present a practical evaluation framework for assessing LLM sensitive data safety across four data categories: credentials, personally identifiable information (PII), protected health information (PHI), and financial data. Testing 24+ models across 6 model families, we find that models exhibit a clear sensitivity hierarchy: format-based recognition (structured credentials, SSN patterns) is significantly more reliable than context-based recognition (names that become sensitive through association with diagnoses or financial data). A model with a 0% credential leak rate leaked patient identifiers on every PHI test run. We document two distinct failure modes - leaking (echoing sensitive data verbatim) and missing (failing to identify sensitive data entirely) - and demonstrate that aggressive prompt engineering and fine-tuning on negative examples both increase rather than decrease leak rates. We propose a minimum evaluation protocol: binary scoring, multi-run testing (3+ runs per model), and category-specific assessment. The framework is designed to be handed to an evaluation team and integrated into a model selection pipeline. Architectural patterns that predict which models fail are presented in a companion paper.
Building similarity graph...
Analyzing shared references across papers
Loading...
Mohammad Al Zubaidi
Building similarity graph...
Analyzing shared references across papers
Loading...
Mohammad Al Zubaidi (Sat,) studied this question.
www.synapsesocial.com/papers/69e07dfe2f7e8953b7cbef8d — DOI: https://doi.org/10.5281/zenodo.19574048