HDFS: All content tagged as HDFS in NoSQL databases and polyglot persistence
Thursday, 14 February 2013
What’s New and Upcoming in HDFS
A great retrospective by Todd Lipcon, with many architectural details about the improvements added to HDFS in 2012 and what is planned for this year.
For a quick overview:
- 2012: HDFS 2.0
  - HA (in 2 phases)
  - Performance improvements:
    - for Impala: faster libhdfs, APIs for spindle-based scheduling
    - for HBase and Accumulo: direct reads from block files in secure environments, application level checksums and IOPS elimination
  - on-the-wire encryption
  - rolling upgrades and wire compatibility
- 2013:
  - HDFS snapshots
  - better storage density and file formats
  - caching and hierarchical storage management
Original title and link: What’s New and Upcoming in HDFS (©myNoSQL)
Data Deduplication Tactics With HDFS and MapReduce
5 techniques and links to research papers about data deduplication using HDFS and MapReduce:
Some of the common methods for data deduplication in storage architecture include hashing, binary comparison and delta differencing. In this post, we focus on how MapReduce and HDFS can be leveraged for eliminating duplicate data.
Patrick Durusau
via: http://www.hadoopsphere.com/2013/02/data-de-duplication-tactics-with-hdfs.html
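To make the hashing approach mentioned in the quote a bit more concrete, here is a minimal sketch of hash-based deduplication written in a MapReduce style. It is not taken from the linked post: the in-memory `files` dictionary is a hypothetical stand-in for files stored in HDFS, and the `map_phase` and `reduce_phase` names are mine.

```python
import hashlib
from collections import defaultdict


def map_phase(files):
    """Emit one (content_hash, path) pair per input file."""
    for path, content in files.items():
        yield hashlib.sha256(content).hexdigest(), path


def reduce_phase(pairs):
    """Group paths by hash; keep the first path and list the rest as duplicates."""
    groups = defaultdict(list)
    for digest, path in pairs:
        groups[digest].append(path)
    return {paths[0]: paths[1:] for paths in groups.values()}


# Hypothetical in-memory stand-in for files stored in HDFS.
files = {
    "/data/a.log": b"alpha",
    "/data/b.log": b"beta",
    "/data/c.log": b"alpha",
}

unique = reduce_phase(map_phase(files))
for kept, duplicates in unique.items():
    print(kept, "duplicates:", duplicates)
```

In a real job the map step would run next to the data and the shuffle would do the grouping by hash; the sketch only shows the shape of the key/value flow.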
HDFS Paper: HARDFS - Hardening HDFS With Selective and Lightweight Versioning
A paper authored by a team from the Universities of Wisconsin and Chicago:
We harden the Hadoop Distributed File System (HDFS) against fail-silent (non fail-stop) behaviors that result from memory corruption and software bugs using a new approach: selective and lightweight versioning (SLEEVE). With this approach, actions performed by important subsystems of HDFS (e.g., namespace management) are checked by a second implementation of the subsystem that uses lightweight, approximate data structures. We show that HARDFS detects and recovers from a wide range of fail-silent behaviors caused by random bit flips, targeted corruptions, and real software bugs. In particular, HARDFS handles 90% of the fail-silent faults that result from random memory corruption and correctly detects and recovers from 100% of 78 targeted corruptions and 5 real-world bugs. Moreover, it recovers orders of magnitude faster than full reboot by using micro-recovery. The extra protection in HARDFS incurs minimal performance and space overheads.
At very large scale, failures that we consider to be very rare occur more frequently. HDFS already handles machine and disk failures. This paper is about handling memory corruption.
You can download it from here.
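The idea of checking a main subsystem against a second, lightweight implementation can be illustrated with a toy example. This is only a sketch under my own assumptions, not the paper's code: I use a Bloom filter as the "approximate data structure", and the `BloomFilter` and `CheckedNamespace` classes are hypothetical names.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class CheckedNamespace:
    """Main namespace (a set of paths) cross-checked by a Bloom-filter shadow."""

    def __init__(self):
        self.paths = set()
        self.shadow = BloomFilter()

    def create(self, path):
        self.paths.add(path)
        self.shadow.add(path)

    def exists(self, path):
        answer = path in self.paths
        # The shadow never forgets an added path, so "present in main,
        # absent in shadow" can only mean the main state was corrupted.
        if answer and not self.shadow.might_contain(path):
            raise RuntimeError(f"namespace corruption detected for {path}")
        return answer


ns = CheckedNamespace()
ns.create("/user/demo/file")
print(ns.exists("/user/demo/file"))
```

The sketch only detects one direction of disagreement (a path that appears in the main structure without ever being created through `create`). Recovery, which the paper handles with micro-recovery, is left out entirely.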
Monday, 1 October 2012
Quick Reference to Hadoop File System Commands
Steve Jin has put together a quick list of HDFS commands:
The first part, hadoop fs, is always the same for file system related commands. After that is very much like typical Unix/Linux commands in syntax. Besides managing the HDFS itself, there are commands to import data files from local file system to HDFS, and export data files from HDFS to local file system. These commands are unique therefore deserve most attention.
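[-put <localsrc> ... <dst>]
[-copyFromLocal <localsrc> ... <dst>]
[-moveFromLocal <localsrc> ... <dst>]
[-get [-ignoreCrc] [-crc] <src> <localdst>]
[-getmerge <src> <localdst> [addnl]]
[-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
[-moveToLocal [-crc] <src> <localdst>]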
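As a small illustration of the import and export commands above, here is a sketch that builds `hadoop fs` argument lists from Python. The `hadoop_fs` helper and all file paths are hypothetical; only the `-put`, `-get` and `-getmerge` subcommands come from the list.

```python
import shlex


def hadoop_fs(subcommand, *args):
    """Build the argument list for one `hadoop fs` invocation."""
    return ["hadoop", "fs", f"-{subcommand}", *args]


# Import a local file into HDFS (paths are made up for the example).
put_cmd = hadoop_fs("put", "events.csv", "/user/demo/events.csv")

# Export a single file from HDFS to the local file system.
get_cmd = hadoop_fs("get", "/user/demo/events.csv", "events-copy.csv")

# Concatenate all files under an HDFS directory into one local file.
merge_cmd = hadoop_fs("getmerge", "/user/demo/output", "merged.txt")

for cmd in (put_cmd, get_cmd, merge_cmd):
    print(shlex.join(cmd))
```

The script only prints the commands. On a machine with a Hadoop client installed, each list could be passed to `subprocess.run(cmd, check=True)` to actually execute it.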
via: http://www.doublecloud.org/2012/09/hadoop-file-system-commands/
Friday, 28 September 2012
Quantcast File System for Hadoop
Quantcast released a new Hadoop file system QFS:
- fully compatible with HDFS
- licensed under Apache 2.0 license
- written in C++
- while HDFS replicates data 3 times, QFS requires only 1.5x raw capacity
- QFS supports two types of fault tolerance: chunk replication and Reed-Solomon encoding
- QFS components (more details here)
- QFS performance comparison to HDFS
Now I’m looking forward to hearing comments from HDFS experts about QFS.
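The 3x versus 1.5x raw capacity figures from the list above can be checked with a bit of arithmetic. The sketch below assumes a Reed-Solomon layout of 6 data stripes plus 3 parity stripes, which is my assumption and not stated in this post; the `raw_capacity_factor` function is a hypothetical helper.

```python
def raw_capacity_factor(data_units, redundancy_units):
    """Raw bytes stored per byte of user data."""
    return (data_units + redundancy_units) / data_units


# 3-way replication: 1 original copy plus 2 extra copies.
replication = raw_capacity_factor(1, 2)

# Reed-Solomon with 6 data stripes and 3 parity stripes (assumed layout).
reed_solomon = raw_capacity_factor(6, 3)

user_data_pb = 1  # 1 PB of user data, as an example
print(f"replication:  {replication}x -> {replication * user_data_pb} PB raw")
print(f"reed-solomon: {reed_solomon}x -> {reed_solomon * user_data_pb} PB raw")
```

Under these assumptions, 1 PB of user data needs 3 PB of raw disk with replication and 1.5 PB with the encoded layout.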
Monday, 6 August 2012
Big Data at Aadhaar With Hadoop, HBase, MongoDB, MySQL, and Solr
It’s unfortunate that the post focuses mostly on the usage of Spring and RabbitMQ and the slide deck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistence architecture:
The slide deck presenting architecture principles and numbers about the platform is available after the break.
Thursday, 26 July 2012
Attacking HDFS’s Defense: Why Does Cloudera *Really* Use HDFS?
In a reply to Cloudera’s defense of HDFS, Jeff Darcy comments about the portability of HDFS:
This is also not an HDFS exclusive. Any of the alternatives that were developed outside the Hadoopiverse have this quality as well. If you have data in Cassandra or Ceph you can keep it in Cassandra or Ceph as you go Hadoop-distro shopping. The biggest data-portability wall here is HDFS’s, because it’s one of only two such systems (the other being MapR) that’s Hadoop-specific. It doesn’t even try to be a general-purpose filesystem or database. A tremendous amount of work has gone into several excellent tools to import data into HDFS, but that work wouldn’t even be necessary with some of the alternatives. That’s not just a waste of machine cycles; it’s also a waste of engineer cycles. If they hadn’t been stuck in the computer equivalent of shipping and receiving, the engineers who developed those tools might have created something even more awesome. I know some of them, and they’re certainly capable of it. Each application can write the data it generates using some set of interfaces. If HDFS isn’t one of those, or if HDFS through that interface is unbearably slow because the HDFS folks treat anything other than their own special snowflake as second class, then you’ll be the one copying massive amounts of data before you can analyze it … not just once, but every time.
via: http://hekafs.org/index.php/2012/07/why-does-cloudera-really-use-hdfs/
Defending Hadoop’s HDFS - Cloudera Version
Building on Eric Baldeschwieler’s defense of HDFS, Cloudera’s Charles Zedlewski adds a couple of HDFS advantages:
- Choice: Customers get to work with any leading hardware vendor and let the best possible price / performer win the decision, not whatever the vendor decided to bundle in.
- Portability: It is possible for customers running Hadoop distributions based on HDFS to move between those different distributions without having to reformat the cluster or copy massive amounts of data. When you’re talking about petabytes of data, this kind of portability is vital. Without it, your vendor has incredible leverage when it comes time to negotiate the next purchase.
- Shared industry R&D: We at Cloudera are proud of our employees’ own contributions to HDFS, and they collaborate with their colleagues at Hortonworks. But today you will find that IBM, Microsoft and VMware are also contributing to HDFS to make it work better with their products. In the future I predict you’ll find hard drive, networking and server manufacturers also add patches to HDFS to ensure their technologies run optimally with it.
via: http://www.cloudera.com/blog/2012/07/why-we-build-our-platform-on-hdfs/
Defending Hadoop’s HDFS - Hortonworks Version
In reply to the attack on HDFS, Eric Baldeschwieler emphasizes the pros of HDFS:
- Extreme low cost per byte
- Very high bandwidth to support MapReduce workloads
- Data reliability
but also the state of the HDFS competition:
- not designed for Hadoop scale
- not using commodity hardware or open source software
- not meant for MapReduce
- unproven technology
via: http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/
Wednesday, 25 July 2012
Attacking Hadoop’s HDFS: 8 Ways to Replace HDFS
Derrick Harris for GigaOm:
Ironically, one of Hadoop’s biggest shortcomings now is also one of its biggest strengths going forward —the Hadoop Distributed File System.
But if the growing number of options for replacing HDFS signifies anything, it’s that HDFS isn’t quite where it needs to be.
No alternatives => vendor lock-in => bad
Multiple options => proof of weaknesses => bad
Confused? I am a bit.
via: http://gigaom.com/cloud/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
Monday, 16 July 2012
Comparing File Formats and Compression Methods in HDFS and Hive
The post is a bit old, but its data comparing different compression methods is still helpful:
Monday, 5 March 2012
Hadoop Namenode High Availability Merged to HDFS Trunk
As I’m slowly recovering after a severe poisoning that I initially ignored but finally put me to bed for almost a week, I’m going to post some of the most interesting articles I’ve read while resting.
Hadoop Namenode’s single point of failure has always been mentioned as one of the weaknesses of Hadoop and also as a differentiator for other Hadoop-based commercial offerings. But now the Namenode HA branch has been merged into trunk, and while it will take a couple of cycles to complete the tests, this will soon become part of the Hadoop distribution.
Here’s Jitendra Pandey’s announcement on Hortonworks’s blog:
Significant enhancements were completed to make HOT Failover work:
- Configuration changes for HA
- Notion of active and standby states were added to the Namenode
- Client-side redirection
- Standby processing journal from Active
- Dual block reports to Active and Standby
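To give a feel for what active/standby states and client-side redirection mean in practice, here is a toy simulation. It is not the actual HDFS client code: the `NameNode` class and the `call_with_failover` function are hypothetical, and real failover involves much more (journals, block reports, fencing).

```python
class NameNode:
    """Toy Namenode that only serves requests while in the active state."""

    def __init__(self, name, active):
        self.name = name
        self.active = active

    def handle(self, request):
        if not self.active:
            raise ConnectionRefusedError(f"{self.name} is in standby state")
        return f"{self.name} served {request}"


def call_with_failover(namenodes, request):
    """Try each configured Namenode in order until the active one answers."""
    for node in namenodes:
        try:
            return node.handle(request)
        except ConnectionRefusedError:
            continue  # standby refused the call, try the next one
    raise RuntimeError("no active Namenode found")


nn1 = NameNode("nn1", active=True)
nn2 = NameNode("nn2", active=False)

first = call_with_failover([nn1, nn2], "ls /")

# Simulate a failover: the standby takes over as active.
nn1.active, nn2.active = False, True
second = call_with_failover([nn1, nn2], "ls /")

print(first)
print(second)
```

The client keeps the same list of Namenodes before and after the failover; only the node that accepts the request changes.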
As argued in a follow-up post to Gartner’s article “Apache Hadoop 1.0 Doesn’t Clear Up Trunks and Branches Questions. Do Distributions?”, the advantage of using custom distributions will slowly vanish, and the open source version will be the one you’ll want to have in production.




