hdfs: All content tagged as hdfs in NoSQL databases and polyglot persistence
Great retrospective by Todd Lipcon, with many architecture details of the improvements added to HDFS in 2012 and what is planned for this year.
For a quick overview:
- 2012: HDFS 2.0
- HA (in 2 phases)
- Performance improvements:
  - for Impala: faster libhdfs, APIs for spindle-based scheduling
  - for HBase and Accumulo: direct reads from block files in secure environments, application-level checksums, and IOPS elimination
- on-the-wire encryption
- rolling upgrades and wire compatibility
- HDFS snapshots
- better storage density and file formats
- caching and hierarchical storage management
Original title and link: What’s New and Upcoming in HDFS ( ©myNoSQL)
A paper authored by a team from Universities of Wisconsin and Chicago:
We harden the Hadoop Distributed File System (HDFS) against fail-silent (non fail-stop) behaviors that result from memory corruption and software bugs using a new approach: selective and lightweight versioning (SLEEVE). With this approach, actions performed by important subsystems of HDFS (e.g., namespace management) are checked by a second implementation of the subsystem that uses lightweight, approximate data structures. We show that HARDFS detects and recovers from a wide range of fail-silent behaviors caused by random bit flips, targeted corruptions, and real software bugs. In particular, HARDFS handles 90% of the fail-silent faults that result from random memory corruption and correctly detects and recovers from 100% of 78 targeted corruptions and 5 real-world bugs. Moreover, it recovers orders of magnitude faster than full reboot by using micro-recovery. The extra protection in HARDFS incurs minimal performance and space overheads.
At very large scale, failures we usually consider rare occur frequently enough to matter. HDFS already handles machine and disk failures; this paper is about handling memory corruption.
You can download it from here.
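To make the SLEEVE idea more concrete, here is a toy sketch (not the paper's implementation): the primary namespace is shadowed by a Bloom filter, a lightweight approximate data structure. A Bloom filter never produces false negatives, so if the primary claims a path exists but the shadow says it was never created, the primary's state must be corrupt. All names in the example are hypothetical.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: compact, approximate set membership."""
    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

# Primary namespace (the full data structure) and its lightweight shadow.
namespace = {}
shadow = BloomFilter()

def create(path):
    namespace[path] = {"replicas": 3}
    shadow.add(path)  # mirror every action in the approximate copy

def lookup(path):
    found = path in namespace
    # No false negatives from a Bloom filter: "present" in the primary
    # but "absent" in the shadow means the primary state is corrupt
    # (e.g. a bit flip resurrected an entry that was never created).
    if found and path not in shadow:
        raise RuntimeError(f"fail-silent corruption detected for {path}")
    return found

create("/user/data/part-0000")
namespace["/user/ghost"] = {"replicas": 3}  # simulate memory corruption
```

Calling `lookup("/user/ghost")` would then raise, while legitimate paths verify cleanly; the real system additionally uses micro-recovery instead of simply failing.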
Original title and link: HDFS Paper: HARDFS - Hardening HDFS With Selective and Lightweight Versioning ( ©myNoSQL)
Quantcast released a new Hadoop file system QFS:
- fully compatible with HDFS
- licensed under Apache 2.0 license
- written in C++
- while HDFS replicates data 3 times, QFS requires only 1.5x raw capacity
- QFS supports two types of fault tolerance: chunk replication and Reed-Solomon encoding
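The 1.5x figure follows from the arithmetic of striped Reed-Solomon encoding versus full replication. A quick sketch, assuming a 6 data + 3 parity stripe layout (a configuration consistent with the quoted 1.5x, not confirmed by the post):

```python
def raw_capacity_factor(data_stripes, parity_stripes=0, replicas=1):
    """Raw bytes stored per logical byte, for plain replication or
    striped Reed-Solomon erasure coding."""
    if parity_stripes:
        return replicas * (data_stripes + parity_stripes) / data_stripes
    return replicas

# HDFS default: 3 full copies of every block.
hdfs = raw_capacity_factor(1, replicas=3)        # 3.0

# QFS-style Reed-Solomon: 6 data + 3 parity stripes (assumed layout).
qfs = raw_capacity_factor(6, parity_stripes=3)   # 1.5
```

With 3 parity stripes, any 3 of the 9 chunks can be lost and the block is still recoverable, yet only half the raw disk of 3x replication is consumed.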
QFS components (more details here):
QFS performance comparison to HDFS:
Now I’m looking forward to hearing comments from HDFS experts about QFS.
Original title and link: Quantcast File System for Hadoop ( ©myNoSQL)
It’s unfortunate that the post focuses mostly on the usage of Spring and RabbitMQ and the slide deck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistence architecture:
The slide deck presenting architecture principles and numbers about the platform after the break.
The post is a bit old, but the data contained comparing different compression methods is helpful:
Original title and link: Comparing File Formats and Compression Methods in HDFS and Hive ( ©myNoSQL)
As I’m slowly recovering after a severe poisoning that I initially ignored but that finally put me in bed for almost a week, I’m going to post some of the most interesting articles I’ve read while resting.
Hadoop Namenode’s single point of failure has always been mentioned as one of the weaknesses of Hadoop, and also as a differentiator for Hadoop-based commercial offerings. But now the Namenode HA branch has been merged into trunk, and while it will take a couple of cycles to complete the tests, this will soon become part of the Hadoop distribution.
Significant enhancements were completed to make hot failover work:
- Configuration changes for HA
- The notion of active and standby states was added to the Namenode
- Client-side redirection
- Standby processing journal from Active
- Dual block reports to Active and Standby
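The configuration changes in the list above boil down to naming a logical nameservice backed by two Namenodes and pointing clients at a failover proxy. A minimal hdfs-site.xml sketch, where `mycluster`, `nn1`/`nn2`, and the host names are placeholders:

```xml
<configuration>
  <!-- Logical name for the HA nameservice (placeholder) -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The two Namenodes backing the nameservice -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>machine1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>machine2.example.com:8020</value>
  </property>
  <!-- Client-side redirection: clients discover the active Namenode
       through this proxy provider instead of a fixed address -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```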
As argued in a follow-up post to Gartner’s article Apache Hadoop 1.0 Doesn’t Clear Up Trunks and Branches Questions. Do Distributions?, the advantage of using custom distributions will slowly vanish and the open source version will be the one you’ll want to run in production.
Original title and link: Hadoop Namenode High Availability Merged to HDFS Trunk ( ©myNoSQL)