Fault Tolerance

Fault tolerance research aims to design and implement systems that continue functioning correctly despite hardware or software failures. Current efforts focus on improving the resilience of deep learning models, particularly large language models and convolutional neural networks, using techniques such as checkpointing, redundant computation, and model-level hardening (e.g., parameter duplication, pruning). Such resilience matters for the reliability of AI systems in safety-critical applications such as autonomous driving and robotics, and for the efficiency of large-scale model training and deployment.
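As a rough illustration of the redundant-computation idea mentioned above, the following minimal sketch runs the same computation on several replicas and takes a majority vote over their outputs. It is not drawn from any of the papers listed here; the names `majority_vote`, `run_redundant`, `healthy`, and `faulty` are hypothetical, and the injected fault (a single flipped bit) is an assumption chosen only to make the example concrete.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(results: Sequence[T]) -> T:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replica outputs")
    return value


def run_redundant(replicas: Sequence[Callable[[int], T]], x: int) -> T:
    # Run every replica on the same input, then vote on the outputs.
    return majority_vote([replica(x) for replica in replicas])


def healthy(x: int) -> int:
    # Hypothetical fault-free computation.
    return x * x


def faulty(x: int) -> int:
    # Hypothetical fault: the lowest bit of the correct result is flipped.
    return (x * x) ^ 1


# Two healthy replicas outvote the single faulty one.
result = run_redundant([healthy, faulty, healthy], 7)
print(result)
```

With three replicas, this scheme masks one faulty output at the cost of roughly three times the computation, which is why much of the work in this area looks for cheaper alternatives.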

Papers

December 29, 2022

FlatENN: Train Flat for Enhanced Fault Tolerance of Quantized Deep Neural Networks
Akul Malhotra, Sumeet Kumar Gupta
Activation Sparsity Quantized Neural Network Fault Tolerance Document Flattening

October 16, 2022

Towards Dynamic Fault Tolerance for Hardware-Implemented Artificial Neural Networks: A Deep Learning Approach
Daniel Gregorek, Nils Hülsmeier, Steffen Paul
Deep Learning Deep Learning Approach Fault Tolerance Transient Fault

August 16, 2022

DRAGON: Decentralized Fault Tolerance in Edge Federations
Shreshth Tuli, Giuliano Casale, Nicholas R. Jennings
Neural Network GAN Model Generative Network Computational Resource Fault Tolerance Six DRAGON Fly

March 10, 2022

SoftSNN: Low-Cost Fault Tolerance for Spiking Neural Network Accelerators under Soft Errors
Rachmad Vidya Wicaksana Putra, Muhammad Abdullah Hanif, Muhammad Shafique
Hardware Accelerator Fault Tolerance Neural Network Accelerator Soft Error Fault Tolerant Neural Network

February 17, 2022

Winograd Convolution: A Perspective from Fault Tolerance
Xinghua Xue, Haitong Huang, Cheng Liu, Ying Wang, Tao Luo, Lei Zhang
Visual Perspective Energy Efficient Fault Tolerance Winograd Convolution Fault Tolerant Neural Network

February 4, 2022

SignSGD: Fault-Tolerance to Blind and Byzantine Adversaries
Jason Akoun, Sebastien Meyer
Stochastic Gradient Descent Fault Tolerance Byzantine Fault SignSGD MV SignSGD Algorithm

December 7, 2021

MESA: Offline Meta-RL for Safe Adaptation and Fault Tolerance
Michael Luo, Ashwin Balakrishna, Brijen Thananjeyan, Suraj Nair, Julian Ibarz, Jie Tan, Chelsea Finn, Ion Stoica, Ken Goldberg
Reinforcement Learning Meta Learning Safe Exploration Fault Tolerance Offline Meta Reinforcement Learning

December 4, 2021

PreGAN: Preemptive Migration Prediction Network for Proactive Fault-Tolerant Edge Computing
Shreshth Tuli, Giuliano Casale, Nicholas R. Jennings
Generative Adversarial Network Edge Computing Pre Trained Convolutional Neural Network Fault Tolerance ProActive Behavior Edge Deployment

November 9, 2021

Analyzing and Improving Fault Tolerance of Learning-Based Navigation Systems
Zishen Wan, Aqeel Anwar, Yu-Shun Hsiao, Tianyu Jia, Vijay Janapa Reddi, Arijit Raychowdhury
Learning Based Navigation System Fault Tolerance Unmanned Vehicle Protection Mechanism