HDFS: All content tagged as HDFS in NoSQL databases and polyglot persistence
Edd Dumbill enumerates the various components of the Hadoop ecosystem:
Original title and link: The components and their functions in the Hadoop ecosystem ( ©myNoSQL)
A picture is worth a thousand words. A comic-like explanation of HDFS is worth some too:
Credit: Maneesh Varshney
Original title and link: Hadoop Distributed File System HDFS: A Cartoon Is Worth A ( ©myNoSQL)
Steve Loughran starts with a critical look at the NetApp Open Solution for Hadoop paper:
Actually it is weirder than I first thought. This is still HDFS, just running on more expensive hardware. You get the (current) HDFS limitations: no native filesystem mounting, a namenode to care about, security on a par with NFS, without the cost savings of pure-SATA-no-licensing-fees. Instead you have to use RAID everywhere, which not only bumps up your cost of storage, puts you at risk of RAID controller failure and errors in the OS drivers for those controller (hence their strict rules about which Linux releases to trust). If you do follow their recommendations and rely on hardware for data integrity, you’ve cut down the probability of node-local job execution, so all FUD about replication traffic is now moot as at least 1/3 more of your tasks will be running remote -possibly even with the Fair Scheduler, which waits for a bit to see if a local slot becomes free. What they are doing then is adding some HA hardware underneath a filesystem that is designed to give strong availability out of medium availability hardware. I have seen such a design before, and thought it sucked then too. Information week says this is a response to EMC, but it looks more like NetApp’s strategy to stay relevant, and Cloudera are partnering with them as NetApp offered them money and if it sells into more “enterprise customers” then why not? With the extra hardware costs of NetApp the cloudera licenses will look better value, and clearly both NetApp and their customers are in need of the hand-holding that Cloudera can offer.
Then, in a follow-up post, he looks at a few alternatives (Lustre, GPFS, IBRIX, etc.):
I’m not against running MapReduce—or the entire Hadoop stack—against alternate filesystems. There are some good cases where it makes sense. Other filesystems offer security, NFS mounting, the ability to be used by other applications and other features. HDFS is designed to scale well on “commodity” hardware, (where servers containing Xeon E5 series parts with 64GB RAM, 10GbE and 8-12 SFF HDDs are considered a subset of “commodity”).
Original title and link: A Short Incursion Into Alternate Hadoop Filesystems ( ©myNoSQL)
Dhruba Borthakur started a series of posts (part 1 and part 2) describing both the process that led Facebook to adopt HBase and Hadoop and the projects where these are used, along with their requirements:
After considerable research and experimentation, we chose Hadoop and HBase as the foundational storage technology for these next generation applications. The decision was based on the state of HBase at the point of evaluation as well as our confidence in addressing the features that were lacking at that point via in-house engineering. HBase already provided a highly consistent, high write-throughput key-value store. The HDFS NameNode stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode (AvatarNode) in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase’s version of LSM Trees, making local DataNode reads efficient and caching NameNode metadata). Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects, both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.
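For readers wondering why Bloom filters matter for read efficiency in an LSM tree, here is a minimal sketch in Python. It is not HBase's implementation: the names `BloomFilter`, `SortedFile`, and `lookup` are hypothetical, and the in-memory dictionary only stands in for an immutable on-disk file. The idea it illustrates is that a lookup may have to consult several files, and a per-file Bloom filter lets it skip most of the files that cannot contain the key.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity only

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class SortedFile:
    """Hypothetical stand-in for one immutable on-disk file of an LSM tree."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)
        self.disk_reads = 0  # counts simulated disk accesses

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # key is definitely absent: skip the disk read
        self.disk_reads += 1  # simulated disk access
        return self.rows.get(key)


def lookup(files, key):
    """Check files in order (newest first) and return the first value found."""
    for f in files:
        value = f.get(key)
        if value is not None:
            return value
    return None


files = [SortedFile({"row-a": 1}), SortedFile({"row-b": 2})]
result = lookup(files, "row-b")
```

A false positive only costs one wasted read, while a negative answer is always trustworthy, which is why the filter can safely sit in front of every file.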
The second part describes three problems Facebook is solving with HBase and Hadoop and provides further details on the requirements of each.
The two posts are a great resource for understanding not only where HBase and Hadoop can be used, but also how to formulate the requirements (and non-requirements) for new systems.
A Facebook team will present the paper “Apache Hadoop Goes Realtime at Facebook” at ACM SIGMOD. I’m looking forward to the paper becoming available.