How to lose files with your processor failing to calculate 1.1^53 correctly

Moin moin,

This week's paper discusses how silent hardware failures can lead to actual user-facing errors. In the paper, Facebook found that some files went missing because a power function returned a wrong result (“1.1^53=0”) due to a hardware defect. These failures were never raised anywhere, and the system seemed completely healthy. Super interesting to learn about this new error vector for large-scale applications.
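To make the arithmetic concrete: 1.1^53 is roughly 156.25, so a truncated result of 0 is far outside any rounding error. Below is a minimal sketch of how software could cross-check such a computation by redoing it on a different code path (exact rational arithmetic instead of the floating-point unit). The function name checked_pow and the tolerance are my own assumptions for illustration; this is not the approach used in the paper.

```python
import math
from fractions import Fraction


def checked_pow(base: Fraction, exponent: int, rel_tol: float = 1e-9) -> float:
    """Compute base**exponent in floating point and cross-check the result.

    Hypothetical helper for illustration only: the reference value comes
    from exact rational arithmetic, which takes a different path through
    the CPU than the floating-point power function.
    """
    fast = math.pow(float(base), exponent)   # floating-point path
    reference = float(base ** exponent)      # exact Fraction power, then converted
    if not math.isclose(fast, reference, rel_tol=rel_tol):
        # A mismatch this large cannot be explained by rounding.
        raise ArithmeticError(
            f"possible silent data corruption: got {fast}, expected about {reference}"
        )
    return fast


result = checked_pow(Fraction(11, 10), 53)
print(int(result))  # 156 on a healthy CPU
```

The obvious cost is doing the work twice, so in practice a check like this would only be affordable for a small sample of computations or as a periodic self-test.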

Software exists to create business value

I am Simon Frey, the author of the Weekly CS Paper Newsletter. And I have great news: you can work with me.

As CTO as a Service, I will help you choose the right technology for your company, build up your team and be a deeply technical sparring partner for your product development strategy.

Check out my website simon-frey.com to learn more, or contact me directly via the button below.

Let’s work together!

Abstract:

Silent Data Corruption (SDC) can have negative impact on large-scale infrastructure services. SDCs are not captured by error reporting mechanisms within a Central Processing Unit (CPU) and hence are not traceable at the hardware level. However, the data corruptions propagate across the stack and manifest as application-level problems. These types of errors can result in data loss and can require months of debug engineering time. In this paper, we describe common defect types observed in silicon manufacturing that leads to SDCs. We discuss a real-world example of silent data corruption within a data center application. We provide the debug flow followed to root-cause and triage faulty instructions within a CPU using a case study, as an illustration on how to debug this class of errors. We provide a high-level overview of the mitigations to reduce the risk of silent data corruptions within a large production fleet. In our large-scale infrastructure, we have run a vast library of silent error test scenarios across hundreds of thousands of machines in our fleet. This has resulted in hundreds of CPUs detected for these errors, showing that SDCs are a systemic issue across generations. We have monitored SDCs for a period longer than 18 months. Based on this experience, we determine that reducing silent data corruptions requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.

Download Link:

https://arxiv.org/pdf/2102.11245.pdf
