We address the "black-box problem" in LLMs by tracing outputs to the behavior of their internal states in a way that is stable, causal, and trajectory-aware. Existing attribution methods (IG, SHAP, attention weights) analyze single forward passes, ignore trajectory multiplicity, lack stability under variation, and lack reverse probabilistic admissibility. We introduce Reverse Markov Chains (RMC), a post-hoc framework that integrates Integrated Gradients (local sensitivity), L3-Shapley values (coalitional causality), and reverse posterior weighting (trajectory plausibility). We show that reverse posterior weighting stabilizes attribution across multiple forward trajectories that yield identical outputs. Theoretical guarantees follow from axiomatic IG sensitivity and L3-Shapley admissibility under an SCM approximation.
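To make the idea of weighting attributions by trajectory plausibility concrete, here is a minimal sketch. It is not the paper's algorithm, which the abstract does not specify. It assumes that each forward trajectory reaching the same output comes with a per-feature attribution vector and a log-probability, and that the reverse posterior over trajectories is the normalized trajectory probability. The function names `posterior_weights` and `weighted_attribution` and the example numbers are hypothetical.

```python
import math


def posterior_weights(log_probs):
    """Normalize trajectory log-probabilities into weights that sum to 1.

    Assumption: every trajectory yields the same output, so the posterior
    over trajectories given that output is proportional to each
    trajectory's own probability.
    """
    peak = max(log_probs)  # subtract the max for numerical stability
    unnormalized = [math.exp(lp - peak) for lp in log_probs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_attribution(attributions, log_probs):
    """Average per-trajectory attribution vectors under the posterior weights."""
    weights = posterior_weights(log_probs)
    n_features = len(attributions[0])
    return [
        sum(w * attr[i] for w, attr in zip(weights, attributions))
        for i in range(n_features)
    ]


# Hypothetical data: two trajectories, three input features.
attributions = [
    [0.6, 0.1, 0.3],  # attribution scores from trajectory A
    [0.2, 0.5, 0.3],  # attribution scores from trajectory B
]
log_probs = [math.log(0.75), math.log(0.25)]  # A is three times as plausible as B

combined = weighted_attribution(attributions, log_probs)
print(combined)
```

In this sketch the more plausible trajectory dominates the combined score, so a feature that matters only along an unlikely trajectory is down-weighted. How the per-trajectory attributions themselves would be produced (for example by Integrated Gradients or L3-Shapley values) is outside the scope of the example.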
Gabiro Arnauld
Gabiro Arnauld (Sat,) studied this question.
www.synapsesocial.com/papers/69ada962bc08abd80d5bcab3 — DOI: https://doi.org/10.5281/zenodo.18903789