April 15, 2026 · Open Access

A Cross-Domain Empirical Comparison of SHAP, LIME, and PDP: Concordance, Stability, and Sensitivity

Key Points

The aim is to evaluate the consistency and reliability of XAI methods (SHAP, LIME, PDP) across different datasets and models.
Conducted a cross-domain comparison of four explainable AI methods on three benchmark datasets.
Utilized four ML models: XGBoost, Random Forest, LightGBM, and Logistic Regression.
Evaluated methods based on concordance, stability, computational efficiency, and sensitivity to data conditions.
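SHAP–LIME concordance depended on the dataset, ranging from a strong positive rank correlation (Spearman ρ = 0.92) on the small, low-dimensional Heart Disease dataset to a negative one (ρ = -0.47) on the large, high-dimensional Adult Census dataset.
LIME's stability degraded sharply in high-dimensional settings (rank coefficient of variation = 0.40, Top-3 Jaccard similarity = 0.52), while SHAP remained stable across all conditions (CV < 0.14).
SHAP and PDP showed consistently high concordance (ρ = 0.55–0.90), making them a reliable pair for cross-checking each other's output.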
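
As an illustration of what a concordance score between two explanation methods could look like, the sketch below computes a Spearman rank correlation between two vectors of per-feature importance. It is a dependency-free approximation, not the study's code: the function names and the importance values are hypothetical, and how the study aggregated SHAP, LIME, or PDP outputs into one importance value per feature is not stated here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Hypothetical importance scores for five features (not data from the study).
shap_importance = [0.30, 0.22, 0.15, 0.08, 0.05]
lime_importance = [0.28, 0.25, 0.10, 0.12, 0.02]

print(round(spearman(shap_importance, lime_importance), 3))
```

A value near 1 means the two methods order the features almost identically; a negative value, like the one reported for Adult Census, means the orderings tend to oppose each other.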

Abstract

Post-hoc explainable artificial intelligence (XAI) methods such as SHAP (SHapley Additive exPlanations), LIME (Local Interpretable Model-agnostic Explanations), PDP (Partial Dependence Plot), and ALE (Accumulated Local Effects) are widely used to interpret machine learning models, yet it remains unclear whether these methods produce consistent explanations when applied to the same model and data. This study presents a systematic cross-domain comparison of four XAI methods across three benchmark datasets (German Credit, N = 1,000; Heart Disease, N = 270; Adult Census, N = 48,842) and four ML models (XGBoost, Random Forest, LightGBM, Logistic Regression), using a four-dimensional evaluation framework encompassing concordance, stability, computational efficiency, and sensitivity to data conditions. Results reveal that XAI concordance is dataset-dependent rather than universal: SHAP–LIME agreement ranged from strong positive correlation (Spearman ρ = 0.92) on the small, low-dimensional Heart Disease dataset to negative correlation (ρ = -0.47) on the large, high-dimensional Adult Census dataset. LIME explanation stability degraded dramatically in high-dimensional settings (rank coefficient of variation = 0.40, Top-3 Jaccard similarity = 0.52), while SHAP remained stable across all conditions (CV < 0.14). SHAP and PDP showed consistently high concordance (ρ = 0.55–0.90), forming a reliable cross-validation pair. A controlled correlation injection experiment further showed that the top SHAP feature in the small Heart Disease dataset dropped from rank 1 to rank 7 when a synthetic feature with r = 0.9 was added, demonstrating SHAP's vulnerability to correlated features in small-sample settings. These findings provide empirical evidence that XAI method selection substantively affects analytical conclusions and offer condition-specific guidelines for practitioners choosing among post-hoc explanation methods.
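
The abstract names two stability metrics (rank coefficient of variation and Top-3 Jaccard similarity) and a correlation injection step, but does not define them precisely. The sketch below shows one plausible reading of each, using only the Python standard library. Every function name, the averaging choices, and the sample data are assumptions made for illustration, not the study's implementation.

```python
import random
from itertools import combinations
from statistics import mean, pstdev


def rank_cv(rank_runs):
    """Assumed definition: std/mean of each feature's rank across repeated
    explanation runs, averaged over features. Ranks are 1-based (1 = top)."""
    per_feature = zip(*rank_runs)  # one tuple of ranks per feature
    return mean(pstdev(ranks) / mean(ranks) for ranks in per_feature)


def top_k_jaccard(rank_runs, k=3):
    """Assumed definition: mean pairwise Jaccard similarity of the top-k
    feature sets across runs. Needs at least two runs."""
    top_sets = [{f for f, r in enumerate(run) if r <= k} for run in rank_runs]
    return mean(len(a & b) / len(a | b) for a, b in combinations(top_sets, 2))


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def inject_correlated_feature(x, r=0.9, seed=0):
    """Build a synthetic feature whose sample correlation with x equals r.

    Mixes standardized x with noise that has been centered, made orthogonal
    to x, and rescaled to unit variance. Assumes x is non-constant and has
    at least three values.
    """
    rng = random.Random(seed)
    mu, sigma = mean(x), pstdev(x)
    z = [(v - mu) / sigma for v in x]
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    noise_mean = mean(noise)
    e = [n - noise_mean for n in noise]
    # Remove the component of the noise that lies along z.
    proj = sum(ei * zi for ei, zi in zip(e, z)) / sum(zi * zi for zi in z)
    e = [ei - proj * zi for ei, zi in zip(e, z)]
    scale = pstdev(e)
    e = [ei / scale for ei in e]
    return [r * zi + (1 - r * r) ** 0.5 * ei for zi, ei in zip(z, e)]


# Hypothetical feature ranks from three repeated explanation runs.
runs = [[1, 2, 3, 4, 5], [1, 3, 2, 5, 4], [1, 2, 4, 3, 5]]
print(round(rank_cv(runs), 3), round(top_k_jaccard(runs), 3))

# Hypothetical feature column; the injected copy correlates with it at r = 0.9.
ages = [29, 41, 35, 52, 63, 47, 58, 38]
synthetic = inject_correlated_feature(ages, r=0.9)
print(round(pearson(ages, synthetic), 3))
```

In an experiment of this kind, the synthetic column would be appended to the feature matrix, the model retrained, and the feature ranks recomputed to see how far the original top feature moves.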

Authors

Minyeong Kim
