Evaluating Explainability: A Framework for Systematic Assessment of Explainable AI Features in Medical Imaging
Abstract
Explainability features are intended to provide insight into the internal mechanisms of an Artificial Intelligence (AI) device, yet there is a lack of techniques for assessing the quality of the explanations they provide. We propose a framework to assess and report explainable AI features in medical imaging. Our evaluation framework for AI explainability is based on four criteria that reflect the particular needs of AI-enabled medical devices: (1) consistency quantifies the variability of explanations across similar inputs; (2) plausibility estimates how close the explanation is to the ground truth; (3) fidelity assesses the alignment between the explanation and the model's internal mechanisms; and (4) usefulness evaluates the explanation's impact on task performance. We describe these four criteria and give examples of how they can be evaluated. As a case study, we use Ablation CAM and Eigen CAM to illustrate the evaluation of explanation heatmaps for the detection of breast lesions in synthetic mammograms, evaluating the first three criteria in task-relevant scenarios. Finally, we developed a scorecard for AI explainability methods in medical imaging that serves as a complete description and evaluation to accompany this type of device. This framework establishes criteria through which the quality of explanations provided by medical devices can be quantified.
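The consistency criterion above can be made concrete in several ways. The sketch below is one hypothetical instantiation, not the paper's exact protocol: it scores consistency as the mean intersection-over-union (IoU) between the salient region of an image's heatmap and those of slightly perturbed copies of the image. The `heatmap_fn` argument stands in for any saliency method (e.g., a CAM variant); the toy identity heatmap is only a placeholder so the example runs end to end.

```python
import numpy as np

def binarize(heatmap, q=0.8):
    """Keep the top (1 - q) fraction of activations as the salient region."""
    return heatmap >= np.quantile(heatmap, q)

def consistency_iou(heatmap_fn, image, rng, n_perturb=10, noise_std=0.01):
    """Consistency score: mean IoU between the salient region of the
    original image's heatmap and those of noise-perturbed copies.
    heatmap_fn maps an image array to a 2-D saliency map."""
    ref = binarize(heatmap_fn(image))
    ious = []
    for _ in range(n_perturb):
        noisy = image + rng.normal(0.0, noise_std, size=image.shape)
        mask = binarize(heatmap_fn(noisy))
        inter = np.logical_and(ref, mask).sum()
        union = np.logical_or(ref, mask).sum()
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))

# Toy stand-in for a CAM method; a real one would use model activations.
def toy_heatmap(img):
    return img

rng = np.random.default_rng(0)
image = rng.random((64, 64))
score = consistency_iou(toy_heatmap, image, rng)
```

A score near 1 indicates explanations that are stable under small input perturbations; the choice of perturbation model and of the binarization quantile are design decisions that should be reported alongside the score.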
Key Points
Objective
To propose a structured evaluation framework for assessing the explainability features of AI in medical imaging.
Methods
- Developed a framework based on four criteria for assessing explainability: consistency, plausibility, fidelity, and usefulness.
- Created a scorecard to systematically evaluate AI explainability methods in medical imaging.
- Illustrated the framework using Ablation CAM and Eigen CAM on synthetic mammograms for breast lesion detection.
Results
- Established a method for quantifying the quality of explanations provided by AI in medical devices.
- Demonstrated that the first three criteria (consistency, plausibility, and fidelity) can be evaluated in task-relevant scenarios.
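The plausibility criterion can likewise be quantified when a ground-truth annotation is available. The following is a minimal sketch under assumed inputs (a heatmap array and a binary lesion mask; the threshold value and the IoU formulation are illustrative choices, not the paper's prescribed ones): it scores plausibility as the overlap between the thresholded heatmap and the annotated lesion region.

```python
import numpy as np

def plausibility_iou(heatmap, lesion_mask, thresh=0.5):
    """Plausibility score: IoU between the salient region of an explanation
    heatmap (values thresholded at `thresh`) and a ground-truth lesion mask."""
    salient = heatmap >= thresh
    inter = np.logical_and(salient, lesion_mask).sum()
    union = np.logical_or(salient, lesion_mask).sum()
    return inter / union if union else 1.0

# Toy example: a noisy heatmap peaked on a square "lesion".
mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 10:20] = True
rng = np.random.default_rng(1)
heatmap = np.where(mask, 1.0, 0.0) + rng.normal(0.0, 0.05, size=(32, 32))
score = plausibility_iou(heatmap, mask)
```

High plausibility means the explanation highlights the clinically relevant region; note that plausibility alone does not establish fidelity, since a heatmap can overlap the lesion without reflecting the model's actual decision process.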