Fault Aware

Fault-aware computing focuses on designing and training deep learning models that remain resilient to hardware failures, a critical concern as models scale to massive sizes and are deployed on increasingly unreliable hardware. Current research emphasizes techniques such as adaptive resource allocation (e.g., for Mixture-of-Experts models), pipeline adaptation to handle node failures during distributed training, and fault-aware quantization to mitigate the impact of permanent hardware faults in specialized accelerators. These advances aim to improve the reliability and efficiency of large-scale deep learning training and deployment, informing both the development of robust AI systems and the design of efficient hardware architectures.
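To make the adaptive-resource-allocation idea concrete, here is a minimal sketch of fault-aware routing for a Mixture-of-Experts layer: before top-k expert selection, the router masks out experts hosted on failed devices so their tokens are transparently rerouted to the best remaining experts. All names here (`fault_aware_topk_routing`, the `healthy` mask) are illustrative assumptions, not the API of any specific paper or framework.

```python
import numpy as np

def fault_aware_topk_routing(router_logits, healthy, k=2):
    """Route each token to its top-k experts, skipping failed ones.

    router_logits: (num_tokens, num_experts) gating scores.
    healthy: boolean mask of shape (num_experts,); False marks a failed expert.
    Returns: (num_tokens, k) indices of the selected healthy experts,
    ordered from highest to lowest score.
    """
    # Push failed experts to -inf so they can never be selected.
    masked = np.where(healthy, router_logits, -np.inf)
    # argsort is ascending, so take the last k columns and reverse them.
    topk = np.argsort(masked, axis=-1)[:, -k:][:, ::-1]
    return topk

logits = np.array([[2.0, 5.0, 1.0, 3.0]])          # expert 1 scores highest...
healthy = np.array([True, False, True, True])       # ...but expert 1 has failed
print(fault_aware_topk_routing(logits, healthy))    # tokens fall back to experts 3 and 0
```

The same masking trick composes with capacity limits: a production router would also rebalance the per-expert token budget after removing failed experts, since survivors absorb the rerouted load.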

Papers