Fault Tolerance
Fault tolerance research designs and implements systems that keep functioning correctly despite hardware or software failures. Current work focuses on making deep learning models more resilient, particularly large language models and convolutional neural networks, through techniques such as checkpointing, redundant computation, and model-level hardening (e.g., parameter duplication, pruning). These techniques matter for the reliability of AI systems in safety-critical applications such as autonomous driving and robotics, and for the efficiency of large-scale model training and deployment.
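Two of the techniques named above, checkpointing and redundant computation, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any of the listed papers: the toy `train` loop, the JSON checkpoint format, the `fail_at` crash simulation, and the helper names `save_checkpoint`, `load_checkpoint`, and `majority_vote` are all hypothetical.

```python
import json
import os
import tempfile
from collections import Counter


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # os.replace swaps the file in one step, so a crash never
        # leaves a half-written checkpoint at `path`.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=10, fail_at=None):
    """Toy training loop (hypothetical) that resumes from the last checkpoint."""
    state = load_checkpoint(path, {"step": 0, "weight": 0.0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["weight"] += 0.5  # stand-in for a real parameter update
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


def majority_vote(results):
    """Redundant computation: accept the value a strict majority agrees on."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise ValueError("no majority among redundant results")
    return value
```

In this sketch, a crash between checkpoints loses only the steps since the last save, and a rerun with the same path continues from there. `majority_vote` masks a single faulty result out of three redundant runs, at the cost of tripling the computation.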