


HDFS: All content tagged as HDFS in NoSQL databases and polyglot persistence

Heterogeneous storages in HDFS

In my post about in-memory databases versus the Aster Data, Greenplum, and Hadoop market share, I proposed a scenario in which Aster Data and Greenplum could expand into the space of in-memory databases by employing hybrid storage.

What I didn't cover in that post is the possibility of Hadoop, or more precisely HDFS, expanding into hybrid storage.

But that is already happening: Hortonworks is working on introducing support for heterogeneous storages in HDFS:

We plan to introduce the idea of Storage Preferences for files. A Storage Preference is a hint to HDFS specifying how the application would like block replicas for the given file to be placed. Initially the Storage Preference will include:

  1. The desired number of file replicas (also called the replication factor); and
  2. The target storage type for the replicas.
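
To get a feel for what a two-part hint like this could look like, here is a minimal sketch; `StoragePreference`, the storage-type names, and `place_replicas` are hypothetical illustrations, not the actual HDFS API:

```python
from dataclasses import dataclass

# Hypothetical storage types a datanode might expose (names are illustrative).
STORAGE_TYPES = ("RAM", "SSD", "DISK")

@dataclass
class StoragePreference:
    """A hint: how many replicas, and on what kind of storage."""
    replication_factor: int   # desired number of block replicas
    storage_type: str         # preferred medium for those replicas

def place_replicas(pref, datanodes):
    """Pick up to `replication_factor` datanodes offering the preferred
    storage type, falling back to any remaining node if too few match."""
    preferred = [n for n, t in datanodes.items() if t == pref.storage_type]
    chosen = preferred[:pref.replication_factor]
    if len(chosen) < pref.replication_factor:
        others = [n for n in datanodes if n not in chosen]
        chosen += others[:pref.replication_factor - len(chosen)]
    return chosen

cluster = {"dn1": "SSD", "dn2": "DISK", "dn3": "SSD", "dn4": "DISK"}
print(place_replicas(StoragePreference(3, "SSD"), cluster))
# -> ['dn1', 'dn3', 'dn2']: both SSD nodes first, then one fallback node
```

The key point of the hint-based design is exactly the fallback in the middle: a preference is advisory, so the system can still place replicas when the preferred medium is scarce.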

Even if memory costs resume decreasing at their pre-2012 rate (they flat-lined in 2012), a cost-effective architecture will almost always rely on hybrid storage.

Original title and link: Heterogeneous storages in HDFS (NoSQL database©myNoSQL)

Hadoop on top of… Intel adds Lustre support to Hadoop

Intel adds Lustre support to Hadoop:

We abstracted out an HDFS layer but underneath that it is actually talking to lustre.

This is not the first project based on the principle "we already have this distributed system, file system, or database, so why not reuse it for Hadoop?" The first step of such a project is to provide an HDFS-API-compatible layer on top of the existing system. But what about the other assumptions baked into HDFS: large blocks, sequential access, local reads, etc.? How do you guarantee that your integration addresses all of them?
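
The "HDFS-compatible layer" pattern amounts to a thin facade. Here is a toy sketch over a local directory; the class and method names loosely mimic Hadoop's `FileSystem` API but are otherwise made up for illustration:

```python
import os

class HdfsLikeFileSystem:
    """A toy HDFS-API-compatible facade over a local directory.
    The method names mimic Hadoop's FileSystem API; everything else
    (the backing store, the semantics) is up to the adapter."""

    def __init__(self, root):
        self.root = root

    def _path(self, p):
        # Map an HDFS-style absolute path onto the backing store.
        return os.path.join(self.root, p.lstrip("/"))

    def create(self, path, data):
        with open(self._path(path), "wb") as f:
            f.write(data)

    def open(self, path):
        return open(self._path(path), "rb")

    def get_file_status(self, path):
        st = os.stat(self._path(path))
        # Note: a real layer must also honor HDFS assumptions such as
        # large blocks and locality hints; this sketch does not.
        return {"length": st.st_size}
```

The facade is the easy part; the comment in `get_file_status` marks exactly where the hard questions from above (block sizes, locality, sequential throughput) would have to be answered.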

If this trend continues, I could see one of the companies behind open source Hadoop, Cloudera or Hortonworks or both, coming up with a TCK sold to any company that claims HDFS compatibility.

Original title and link: Hadoop on top of… Intel adds Lustre support to Hadoop (NoSQL database©myNoSQL)

NFS access to HDFS

Brandon Li, co-author of the HDFS NFS Gateway proposal (PDF), tracked as HDFS-4750:

With NFS access to HDFS, you can mount the HDFS cluster as a volume on client machines and have native command line, scripts or file explorer UI to view HDFS files and load data into HDFS. NFS thus enables file-based applications to perform file read and write operations directly to Hadoop. This greatly simplifies data management in Hadoop and expands the integration of Hadoop into existing toolsets. […] Bringing the full capability of NFS to HDFS is an important strategic initiative for us.

So besides browsing the files stored in HDFS, which to me doesn’t sound too exciting, you’ll be able to upload or even stream data directly to HDFS. Now, that’s cool!
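
That is the whole appeal: once HDFS is mounted through the NFS gateway, plain file I/O is all an application needs. A minimal sketch, assuming a hypothetical mount point:

```python
import os

def load_into_hdfs(mount_point, name, data):
    """Write a file through an NFS-mounted HDFS volume using nothing
    but ordinary file I/O: no Hadoop client libraries needed."""
    path = os.path.join(mount_point, name)
    with open(path, "wb") as f:
        f.write(data)
    return os.path.getsize(path)

# With the gateway mounted at, say, /mnt/hdfs (a hypothetical path),
# an existing file-based tool could simply do:
# load_into_hdfs("/mnt/hdfs", "events.log", b"...")
```

Any legacy script or file explorer that can write to a mounted volume gets HDFS access for free, which is the "expands the integration of Hadoop into existing toolsets" point from the quote.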

Original title and link: NFS access to HDFS (NoSQL database©myNoSQL)


Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency

A fantastic post by Nicolas Liochon and Devaraj Das looking into possible HBase failure scenarios and configurations to reduce the Mean Time to Recover:

There are no global failures in HBase: if a region server fails, all the other regions are still available. For a given data-subset, the MTTR was often considered as around ten minutes. This rule of thumb was actually coming from a common case where the recovery was taking time because it was trying to use replicas on a dead datanode. Ten minutes would be the time taken by HDFS to declare a node as dead. With the new stale mode in HDFS, it’s not the case anymore, and the recovery is now bounded by HBase alone. If you care about MTTR, with the settings mentioned here, most cases will take less than 2 minutes between the actual failure and the data being available again in another region server.

Stepping back for a bit, it looks like the overall complexity comes from the various components involved (ZooKeeper, HBase, HDFS), each with its own failure detection mechanism. If these are not correctly configured and ordered, things can get pretty ugly; ugly as in a longer MTTR than one would expect.
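
A back-of-the-envelope model shows why detection time dominates MTTR; the component timings below are illustrative assumptions, not measurements from the post:

```python
def mttr_seconds(detection_s, reassignment_s, log_replay_s):
    """Recovery time = failure detection + region reassignment + WAL replay.
    All three component timings here are illustrative, not measured."""
    return detection_s + reassignment_s + log_replay_s

# Old behavior: recovery effectively waits ~10 minutes for HDFS to
# declare the datanode dead before it stops using its replicas.
old = mttr_seconds(detection_s=600, reassignment_s=30, log_replay_s=60)

# With HDFS's stale mode, stale replicas are deprioritized quickly
# (~30s here), so recovery is bounded by HBase's own work instead.
new = mttr_seconds(detection_s=30, reassignment_s=30, log_replay_s=60)

print(old, new)  # 690 120
```

Even with generous assumptions for reassignment and log replay, removing the dead-datanode timeout from the critical path is what moves the total from "around ten minutes" to the order of two minutes.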

Original title and link: Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency (NoSQL database©myNoSQL)


What’s New and Upcoming in HDFS

A great retrospective by Todd Lipcon, with many architectural details of the improvements added to HDFS in 2012 and of what is planned for this year.

For a quick overview:

  • 2012: HDFS 2.0
    • HA (in 2 phases)
    • Performance improvements:
      • for Impala: faster libhdfs, APIs for spindle-based scheduling
      • for HBase and Accumulo: direct reads from block files in secure environments, application-level checksums and IOPS elimination
    • on-the-wire encryption
    • rolling upgrades and wire compatibility
  • 2013:
    • HDFS snapshots
    • better storage density and file formats
    • caching and hierarchical storage management

Original title and link: What’s New and Upcoming in HDFS (NoSQL database©myNoSQL)

Data Deduplication Tactics With HDFS and MapReduce

5 techniques and links to research papers about data deduplication using HDFS and MapReduce:

Some of the common methods for data deduplication in storage architecture include hashing, binary comparison and delta differencing. In this post, we focus on how MapReduce and HDFS can be leveraged for eliminating duplicate data.
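The hashing method mentioned above maps naturally onto MapReduce: map each record to its content hash, then reduce by keeping one record per hash. A minimal single-process sketch (the `map_phase`/`reduce_phase` functions are illustrative, not taken from any of the linked papers):

```python
import hashlib

def map_phase(records):
    """Map: key each record by the hash of its content."""
    for rec in records:
        yield hashlib.sha256(rec).hexdigest(), rec

def reduce_phase(pairs):
    """Reduce: keep exactly one record per content hash (first wins)."""
    seen = {}
    for digest, rec in pairs:
        seen.setdefault(digest, rec)
    return list(seen.values())

data = [b"row-a", b"row-b", b"row-a", b"row-c", b"row-b"]
print(reduce_phase(map_phase(data)))  # [b'row-a', b'row-b', b'row-c']
```

In a real job the shuffle groups identical hashes onto the same reducer, so deduplication parallelizes for free; binary comparison would then only be needed to rule out hash collisions.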

Patrick Durusau

Original title and link: Data Deduplication Tactics With HDFS and MapReduce (NoSQL database©myNoSQL)


HDFS Paper: HARDFS - Hardening HDFS With Selective and Lightweight Versioning

A paper authored by a team from Universities of Wisconsin and Chicago:

We harden the Hadoop Distributed File System (HDFS) against fail-silent (non fail-stop) behaviors that result from memory corruption and software bugs using a new approach: selective and lightweight versioning (SLEEVE). With this approach, actions performed by important subsystems of HDFS (e.g., namespace management) are checked by a second implementation of the subsystem that uses lightweight, approximate data structures. We show that HARDFS detects and recovers from a wide range of fail-silent behaviors caused by random bit flips, targeted corruptions, and real software bugs. In particular, HARDFS handles 90% of the fail-silent faults that result from random memory corruption and correctly detects and recovers from 100% of 78 targeted corruptions and 5 real-world bugs. Moreover, it recovers orders of magnitude faster than full reboot by using micro-recovery. The extra protection in HARDFS incurs minimal performance and space overheads.

At very large scale, failures that we consider very rare start to occur frequently. HDFS already deals with machine and disk failures; this paper is about handling memory corruption.
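
To get a feel for the "lightweight, approximate data structures" idea, here is a toy sketch of shadowing a namespace with a Bloom filter; the class and its sizes are made up for illustration and are not HARDFS code:

```python
import hashlib

class BloomShadow:
    """A tiny Bloom filter acting as a lightweight second copy of a
    namespace: cheap to keep in memory, and able to flag entries that
    the main structure claims exist but were never actually added."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        # Derive `hashes` deterministic bit positions per item.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # No false negatives; false positives are possible but rare.
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

shadow = BloomShadow()
shadow.add("/user/alice/data.txt")
print(shadow.might_contain("/user/alice/data.txt"))     # True: it was added
print(shadow.might_contain("/user/mallory/ghost.txt"))  # False, barring a rare false positive
```

If a bit flip makes the primary namespace "grow" an entry the shadow never saw, the disagreement is detectable at a fraction of the memory cost of a full second copy, which is the trade-off SLEEVE exploits.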

You can download it from here.

Original title and link: HDFS Paper: HARDFS - Hardening HDFS With Selective and Lightweight Versioning (NoSQL database©myNoSQL)

Quick Reference to Hadoop File System Commands

Steve Jin has put together a quick list of HDFS commands:

The first part, hadoop fs, is always the same for file system related commands. After that, the syntax is very much like typical Unix/Linux commands. Besides managing HDFS itself, there are commands to import data files from the local file system to HDFS, and to export data files from HDFS to the local file system. These commands are unique and therefore deserve the most attention.

[-put <localsrc> ... <dst>]
[-copyFromLocal <localsrc> ... <dst>]
[-moveFromLocal <localsrc> ... <dst>]
[-get [-ignoreCrc] [-crc] <src> <localdst>]
[-getmerge <src> <localdst> [addnl]]
[-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
[-moveToLocal [-crc] <src> <localdst>]

Original title and link: Quick Reference to Hadoop File System Commands (NoSQL database©myNoSQL)


Quantcast File System for Hadoop

Quantcast released a new Hadoop file system, QFS:

  1. fully compatible with HDFS
  2. licensed under the Apache 2.0 license
  3. written in C++
  4. while HDFS replicates data 3 times, QFS requires only 1.5x raw capacity
  5. QFS supports two types of fault tolerance: chunk replication and Reed-Solomon encoding
  6. QFS components (more details here)
  7. QFS performance comparison to HDFS:

    [chart: QFS Performance Comparison to HDFS]

Now I’m looking forward to hearing comments from HDFS experts about QFS.
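
The 1.5x figure in point 4 is simple arithmetic: Reed-Solomon with k data stripes and m parity stripes needs (k + m)/k of the logical size, and a 6+3 layout (assumed here; it is the layout that yields exactly 1.5x) compares to triple replication like this:

```python
def raw_bytes(logical_bytes, scheme):
    """Raw capacity needed to store `logical_bytes` under a redundancy
    scheme: n-way replication, or Reed-Solomon with k data + m parity
    stripes. The schemes passed in below are illustrative."""
    kind = scheme[0]
    if kind == "replication":
        return logical_bytes * scheme[1]
    if kind == "reed-solomon":
        k, m = scheme[1], scheme[2]
        return logical_bytes * (k + m) / k
    raise ValueError(kind)

tb = 10 * 10**12  # 10 TB of logical data
print(raw_bytes(tb, ("replication", 3)) / tb)      # 3.0 (HDFS default)
print(raw_bytes(tb, ("reed-solomon", 6, 3)) / tb)  # 1.5 (QFS's 1.5x)
```

The trade-off is not free: replication survives failures with a cheap copy read, while Reed-Solomon must reconstruct lost stripes from the surviving ones, which costs CPU and network at recovery time.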

Original title and link: Quantcast File System for Hadoop (NoSQL database©myNoSQL)

Big Data at Aadhaar With Hadoop, HBase, MongoDB, MySQL, and Solr

It’s unfortunate that the post focuses mostly on the usage of Spring and RabbitMQ and that the slide deck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistence architecture:

[diagram: Architecture of Big Data at Aadhaar]

[diagram: Big Data at Aadhaar Data Stores]

The slide deck, presenting architecture principles and numbers about the platform, is after the break.

Attacking HDFS’s Defense: Why Does Cloudera *Really* Use HDFS?

In a reply to Cloudera’s defense of HDFS, Jeff Darcy comments about the portability of HDFS:

This is also not an HDFS exclusive. Any of the alternatives that were developed outside the Hadoopiverse have this quality as well. If you have data in Cassandra or Ceph you can keep it in Cassandra or Ceph as you go Hadoop-distro shopping. The biggest data-portability wall here is HDFS’s, because it’s one of only two such systems (the other being MapR) that’s Hadoop-specific. It doesn’t even try to be a general-purpose filesystem or database. A tremendous amount of work has gone into several excellent tools to import data into HDFS, but that work wouldn’t even be necessary with some of the alternatives. That’s not just a waste of machine cycles; it’s also a waste of engineer cycles. If they hadn’t been stuck in the computer equivalent of shipping and receiving, the engineers who developed those tools might have created something even more awesome. I know some of them, and they’re certainly capable of it. Each application can write the data it generates using some set of interfaces. If HDFS isn’t one of those, or if HDFS through that interface is unbearably slow because the HDFS folks treat anything other than their own special snowflake as second class, then you’ll be the one copying massive amounts of data before you can analyze it … not just once, but every time.

Original title and link: Attacking HDFS’s Defense: Why Does Cloudera *Really* Use HDFS? (NoSQL database©myNoSQL)


Defending Hadoop’s HDFS - Cloudera Version

Building on Eric Baldeschwieler’s defense of HDFS, Cloudera’s Charles Zedlewski adds a couple of HDFS advantages:

  • Choice: Customers get to work with any leading hardware vendor and let the best possible price/performance win the decision, not whatever the vendor decided to bundle in.
  • Portability: It is possible for customers running Hadoop distributions based on HDFS to move between those different distributions without having to reformat the cluster or copy massive amounts of data. When you’re talking about petabytes of data, this kind of portability is vital. Without it, your vendor has incredible leverage when it comes time to negotiate the next purchase.
  • Shared industry R&D: We at Cloudera are proud of our employees’ own contributions to HDFS, and they collaborate with their colleagues at Hortonworks. But today you will find that IBM, Microsoft and VMware are also contributing to HDFS to make it work better with their products. In the future I predict you’ll find hard drive, networking and server manufacturers also adding patches to HDFS to ensure their technologies run optimally with it.

Original title and link: Defending Hadoop’s HDFS - Cloudera Version (NoSQL database©myNoSQL)