A paper authored by a team from the Universities of Wisconsin and Chicago:
We harden the Hadoop Distributed File System (HDFS) against fail-silent (non fail-stop) behaviors that result from memory corruption and software bugs using a new approach: selective and lightweight versioning (SLEEVE). With this approach, actions performed by important subsystems of HDFS (e.g., namespace management) are checked by a second implementation of the subsystem that uses lightweight, approximate data structures. We show that HARDFS detects and recovers from a wide range of fail-silent behaviors caused by random bit flips, targeted corruptions, and real software bugs. In particular, HARDFS handles 90% of the fail-silent faults that result from random memory corruption and correctly detects and recovers from 100% of 78 targeted corruptions and 5 real-world bugs. Moreover, it recovers orders of magnitude faster than full reboot by using micro-recovery. The extra protection in HARDFS incurs minimal performance and space overheads.
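To make the idea in the abstract more concrete, here is a minimal sketch of a main data structure being cross-checked by a second, approximate one. It is not the HARDFS code: the class names (`CountingBloom`, `CheckedNamespace`, `StateMismatch`) are hypothetical, and the choice of a counting Bloom filter as the "lightweight, approximate data structure" is an assumption made for illustration.

```python
import hashlib


class StateMismatch(Exception):
    """Raised when the main and lean versions disagree (hypothetical name)."""


class CountingBloom:
    """Approximate set summary: small, may give false positives, and gives
    no false negatives as long as every add/remove is mirrored correctly."""

    def __init__(self, size=4096, hashes=3):
        self.size = size
        self.hashes = hashes
        self.counts = [0] * size

    def _slots(self, key):
        # Derive `hashes` slot indexes from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for slot in self._slots(key):
            self.counts[slot] += 1

    def remove(self, key):
        for slot in self._slots(key):
            self.counts[slot] -= 1

    def might_contain(self, key):
        return all(self.counts[slot] > 0 for slot in self._slots(key))


class CheckedNamespace:
    """Main version keeps full metadata; lean version keeps only a summary."""

    def __init__(self):
        self.files = {}                # main version: path -> metadata
        self.shadow = CountingBloom()  # lean version: approximate membership

    def create(self, path, meta):
        self.files[path] = meta
        self.shadow.add(path)

    def delete(self, path):
        del self.files[path]
        self.shadow.remove(path)

    def lookup(self, path):
        # A path present in the main version but absent from the lean
        # version is a definite disagreement, since the filter has no
        # false negatives. The opposite case is ambiguous (it may be a
        # false positive), so this sketch does not flag it.
        if path in self.files and not self.shadow.might_contain(path):
            raise StateMismatch(f"main version has {path!r}, lean version does not")
        return self.files.get(path)


if __name__ == "__main__":
    ns = CheckedNamespace()
    ns.create("/user/a", {"replication": 3})
    # Simulate a silent corruption of the main version: the key changes
    # without the lean version being told about it.
    ns.files["/user/b"] = ns.files.pop("/user/a")
    try:
        ns.lookup("/user/b")
    except StateMismatch as err:
        print("detected:", err)
```

The lean version stores no metadata, only enough to notice that the main version has drifted. What happens after detection (the micro-recovery mentioned in the abstract) is outside this sketch.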
At very large scale, failures that we usually consider very rare occur much more often. HDFS already handles machine and disk failures. This paper is about handling fail-silent faults caused by memory corruption and software bugs.
You can download it from here.
Original title and link: HDFS Paper: HARDFS - Hardening HDFS With Selective and Lightweight Versioning (©myNoSQL)