Deep-learning accelerators such as TPUs and GPUs now run ever larger models. To address the impact of soft errors on computation, various specialized fault-tolerant DNN accelerators have been proposed, typically employing full-element protection. Yet emerging large language models (LLMs) have far higher computational demands than traditional CNN models, which makes conventional fault-tolerant accelerators both cost-prohibitive and ineffective at handling multi-point faults. To tackle these challenges, we conduct fault injection experiments on multiple representative LLMs, revealing the inherent parameter redundancy in Transformer-based models. Specifically, only 1%–2% of the elements significantly affect the output when perturbed; these critical elements are identified as outliers. Leveraging this insight, we propose OrCA, a hierarchically redundant fault-tolerant accelerator that introduces the principle of selective protection for critical elements and optimizes the dataflow accordingly. Through extensive fault injection experiments and hardware simulations, we demonstrate that OrCA outperforms conventional fault-tolerant accelerators, achieving superior protection at equal or lower area overhead. Notably, OrCA delivers better performance under fault rates up to 10× higher and supports elastic protection against diverse hardware faults (e.g., transient and permanent faults), adapting to varying fault-tolerance requirements. Furthermore, OrCA removes the limitation of traditional accelerators that require separate error detection and correction steps, enabling more efficient fault resilience.
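The idea of fault injection combined with selective protection of outliers can be illustrated with a minimal Python sketch. It is not OrCA's implementation: the abstract does not say how outliers are identified or how redundancy is realized, so the sketch assumes that outliers are the largest-magnitude elements, that a fault is a single bit flip in a float32 encoding, and that protected elements are always corrected. The names `flip_bit`, `outlier_indices`, and `inject_faults` are hypothetical.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB, 31 = sign) in the float32 encoding of `value`."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


def outlier_indices(values, fraction=0.02):
    """Return indices of the largest-magnitude elements.

    Assumption: "outlier" means large magnitude; the source does not
    specify the criterion.
    """
    k = max(1, int(len(values) * fraction))
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return set(order[:k])


def inject_faults(values, fault_rate, rng, protected=frozenset()):
    """Flip one random bit in each element with probability `fault_rate`.

    Assumption: elements at `protected` indices are perfectly corrected,
    an idealized stand-in for hardware redundancy.
    """
    out = []
    for i, v in enumerate(values):
        if i not in protected and rng.random() < fault_rate:
            v = flip_bit(v, rng.randrange(32))
        out.append(v)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy weight vector: mostly small values plus a few large outliers.
    weights = [rng.gauss(0.0, 0.02) for _ in range(1000)]
    for i in (10, 500, 900):
        weights[i] = 8.0
    critical = outlier_indices(weights, fraction=0.02)
    faulty = inject_faults(weights, fault_rate=0.01, rng=rng, protected=critical)
    changed = sum(1 for a, b in zip(weights, faulty) if a != b)
    print(f"protected {len(critical)} of {len(weights)} elements; {changed} corrupted")
```

Under these assumptions, only about 2% of the elements carry protection while the remaining elements stay exposed to bit flips, which mirrors the selective-protection principle at a purely conceptual level.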
Yihao Shi
Sheng Ma
Tao Li
ACM Transactions on Design Automation of Electronic Systems
National University of Defense Technology
National Defense University
Milli Savunma Üniversitesi
Shi et al. studied this question.
www.synapsesocial.com/papers/69d896566c1944d70ce07ad4 — DOI: https://doi.org/10.1145/3801976