Fault Aware
Fault-aware computing focuses on designing and training deep learning models that remain resilient to hardware failures, a critical concern as models grow to massive sizes and are deployed on increasingly failure-prone hardware. Current research emphasizes techniques such as adaptive resource allocation (e.g., for Mixture-of-Experts models), pipeline adaptation that lets distributed training continue through node failures, and fault-aware quantization that mitigates the impact of permanent hardware faults in specialized accelerators. These advances aim to improve the reliability and efficiency of large-scale deep learning training and deployment, informing both the development of robust AI systems and the design of efficient hardware architectures.
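As a rough illustration of the fault-aware quantization idea, the sketch below remaps each quantized weight to the nearest int8 codeword that a memory cell with one known stuck-at bit can actually store, so the stored value stays as close as possible to the ideal one. The function name, the single-stuck-bit fault model, and the symmetric quantization scheme are assumptions for illustration, not taken from any specific paper.

```python
import numpy as np

def fault_aware_quantize(weights, scale, stuck_bit, stuck_val):
    """Sketch: map each weight to the nearest int8 codeword that a faulty
    memory word can hold, given that bit `stuck_bit` is stuck at `stuck_val`."""
    # All 8-bit patterns whose faulty bit already matches the stuck value;
    # only these patterns survive storage in the damaged cell.
    codes = np.arange(256, dtype=np.uint8)
    valid = codes[((codes >> stuck_bit) & 1) == stuck_val]

    # Interpret the surviving patterns as signed two's-complement integers.
    valid_signed = valid.astype(np.int16)
    valid_signed[valid_signed > 127] -= 256

    # Ideal (fault-free) symmetric quantization to int8.
    ideal = np.clip(np.round(weights / scale), -128, 127)

    # For every weight, pick the representable codeword closest to the ideal one.
    idx = np.abs(ideal[..., None] - valid_signed[None, :]).argmin(axis=-1)
    return valid_signed[idx] * scale  # dequantized, fault-consistent weights

# Example: bit 3 of every stored weight word is stuck at 1.
w = np.array([0.12, -0.50, 0.31])
print(fault_aware_quantize(w, scale=0.01, stuck_bit=3, stuck_val=1))
```

Fault-unaware quantization would simply write the ideal codeword and let the stuck bit corrupt it; searching over the codewords the hardware can actually represent bounds the per-weight error instead of leaving it to chance.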