This week's paper discusses how silent hardware failures can lead to actual user-facing errors. In the paper, Facebook found that some files went missing because a power computation returned the wrong result on faulty hardware ("1.1^53 = 0", where the correct value is about 156). These failures were never raised anywhere, and the system seemed completely healthy. Super interesting to learn about this new error vector for large-scale applications.
Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure services. SDCs are not captured by error reporting mechanisms within a Central Processing Unit (CPU) and hence are not traceable at the hardware level. However, the data corruptions propagate across the stack and manifest as application-level problems. These types of errors can result in data loss and can require months of debug engineering time. In this paper, we describe common defect types observed in silicon manufacturing that lead to SDCs. We discuss a real-world example of silent data corruption within a data center application. We provide the debug flow followed to root-cause and triage faulty instructions within a CPU using a case study, as an illustration of how to debug this class of errors. We provide a high-level overview of the mitigations to reduce the risk of silent data corruptions within a large production fleet. In our large-scale infrastructure, we have run a vast library of silent error test scenarios across hundreds of thousands of machines in our fleet. This has resulted in hundreds of CPUs detected for these errors, showing that SDCs are a systemic issue across generations. We have monitored SDCs for a period longer than 18 months. Based on this experience, we determine that reducing silent data corruptions requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.
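To make the idea of a "silent error test scenario" a bit more concrete, here is a minimal sketch of a known-answer check in Python. It is not the paper's tooling: the function name `run_known_answer_checks`, the `KNOWN_ANSWERS` list, and the `repeats` parameter are all made up for illustration. The only thing taken from the text above is the 1.1^53 computation, whose integer part should be 156.

```python
import math

# Hypothetical known-answer vectors: (label, callable, expected result).
# The first entry mirrors the computation mentioned above; the others are
# illustrative additions, not taken from the paper.
KNOWN_ANSWERS = [
    ("int(1.1^53)", lambda: int(math.pow(1.1, 53)), 156),
    ("int(1.1^3)", lambda: int(math.pow(1.1, 3)), 1),
    ("int(2^10)", lambda: int(math.pow(2, 10)), 1024),
]


def run_known_answer_checks(vectors=KNOWN_ANSWERS, repeats=1000):
    """Run each computation `repeats` times and collect mismatches.

    Returns a list of (label, expected, actual) tuples, one per vector
    that produced a wrong result at least once.
    """
    failures = []
    for label, compute, expected in vectors:
        for _ in range(repeats):
            actual = compute()
            if actual != expected:
                # Record the first mismatch for this vector and move on.
                failures.append((label, expected, actual))
                break
    return failures


if __name__ == "__main__":
    failures = run_known_answer_checks()
    if failures:
        for label, expected, actual in failures:
            print(f"MISMATCH {label}: expected {expected}, got {actual}")
    else:
        print("all known-answer checks passed")
```

On healthy hardware this reports no mismatches. A check like this only exercises whichever core the process happens to be scheduled on, so a fleet-wide version would presumably need to pin itself to each core in turn and cover far more instructions and data patterns than this toy list does.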