Fault Tolerance

Fault tolerance research aims to design and implement systems that continue functioning correctly despite hardware or software failures. Current efforts focus on improving the resilience of deep learning models, particularly large language models and convolutional neural networks, using techniques like checkpointing, redundant computations, and model-level hardening (e.g., parameter duplication, pruning). This is crucial for ensuring the reliability of AI systems in safety-critical applications such as autonomous driving and robotics, as well as enhancing the efficiency of large-scale model training and deployment.

Papers