Fault Tolerance
Fault tolerance research aims to design and implement systems that continue functioning correctly despite hardware or software failures. Current efforts focus on improving the resilience of deep learning models, particularly large language models and convolutional neural networks, using techniques like checkpointing, redundant computations, and model-level hardening (e.g., parameter duplication, pruning). This is crucial for ensuring the reliability of AI systems in safety-critical applications such as autonomous driving and robotics, as well as enhancing the efficiency of large-scale model training and deployment.
Papers
August 16, 2022
March 10, 2022
February 17, 2022
February 4, 2022
December 7, 2021
December 4, 2021