Soft Error
Soft errors are transient hardware faults that cause bit flips. They pose a significant reliability challenge for increasingly complex systems, particularly deep learning workloads and embedded applications such as autonomous vehicles. Current research focuses on efficient error mitigation strategies, including adaptive protection schemes that tailor redundancy to input complexity and selective protection of hardware components based on their vulnerability. These efforts aim to improve the reliability of AI systems and safety-critical devices without incurring excessive performance or area overhead.
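To illustrate why bit position matters when deciding what to protect, the minimal sketch below flips a single bit in the IEEE 754 float32 encoding of a value such as a network weight. The helper `flip_bit` and the example value are assumptions made for illustration; they are not taken from the papers listed here.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its float32 encoding inverted.

    Hypothetical helper for illustration: bit 0 is the least significant
    mantissa bit, bits 23-30 are the exponent, and bit 31 is the sign.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Invert the chosen bit, then reinterpret the result as float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


weight = 0.5
mantissa_flip = flip_bit(weight, 0)   # lowest mantissa bit: tiny change
exponent_flip = flip_bit(weight, 30)  # highest exponent bit: huge change
sign_flip = flip_bit(weight, 31)      # sign bit: value is negated

print(mantissa_flip, exponent_flip, sign_flip)
```

A flip in the low mantissa bit changes the value by 2^-24, while a flip in the top exponent bit turns 0.5 into 2^127. This difference in impact is one reason selective protection can concentrate redundancy on the most vulnerable bits or components.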
Papers
Designing DNNs for a trade-off between robustness and processing performance in embedded devices
Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe
Soft Checksums to Flag Untrustworthy Machine Learning Surrogate Predictions and Application to Atomic Physics Simulations
Casey Lauer, Robert C. Blake, Jonathan B. Freund