hbase: All content about hbase in NoSQL databases and polyglot persistence

Configuring HBase Memstore: What You Should Know

A very well documented post by Alex Baranau about HBase Memstore, HBase write and read operations and the importance of correctly configuring Memstore:

  • There are a number of configuration options for Memstore that one can use to achieve better performance and avoid issues. HBase will not adjust settings for you based on your usage pattern.
  • Frequent Memstore flushes can affect read performance and can bring additional load to the system
  • The way Memstore flushes work may affect your schema design
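
To make the flush behavior a bit more concrete, here is a minimal sketch in plain Java of a per-region write buffer that asks for a flush once it reaches a size threshold and rejects writes once it grows past a multiple of that threshold. It is a toy model, not HBase code: the class and method names are hypothetical, and the two constructor arguments only play the role of the settings HBase exposes as hbase.hregion.memstore.flush.size and hbase.hregion.memstore.block.multiplier.

```java
/** Toy model of a per-region memstore; not HBase code. */
final class MemstoreModel {
    enum Outcome { ACCEPTED, FLUSH_REQUESTED, BLOCKED }

    private final long flushSizeBytes;  // role of hbase.hregion.memstore.flush.size
    private final long blockingBytes;   // flush size times the block multiplier
    private long sizeBytes = 0;
    private int flushCount = 0;

    MemstoreModel(long flushSizeBytes, int blockMultiplier) {
        this.flushSizeBytes = flushSizeBytes;
        this.blockingBytes = flushSizeBytes * blockMultiplier;
    }

    /** Buffers a write unless the memstore has already grown past the blocking limit. */
    Outcome put(long bytes) {
        if (sizeBytes >= blockingBytes) {
            return Outcome.BLOCKED;     // writers must wait until a flush catches up
        }
        sizeBytes += bytes;
        return sizeBytes >= flushSizeBytes ? Outcome.FLUSH_REQUESTED : Outcome.ACCEPTED;
    }

    /** Stands in for the background flusher writing the buffer out as a new file. */
    void flush() {
        sizeBytes = 0;
        flushCount++;
    }

    int flushCount() { return flushCount; }

    long sizeBytes() { return sizeBytes; }
}
```

In this model every call to flush() corresponds to one more file on disk, which is why a flush threshold that is too small for the write rate translates into more files to read from and more compaction work.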


Original title and link: Configuring HBase Memstore: What You Should Know (NoSQL database©myNoSQL)


How to Organize Your HBase Keys

The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish.[…]

As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.

This sounds somewhat similar to how Amazon DynamoDB’s hash and range type primary keys or Oracle NoSQL’s major-minor keys work.
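
The funnel idea can be sketched in plain Java. The key layout below (customer ID, then a reversed timestamp, then an event type) is a hypothetical example and not taken from the original post; it assumes non-negative IDs and timestamps so that big-endian encoding agrees with the unsigned lexicographic byte ordering HBase sorts row keys by.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical composite row key: [customerId][reversed timestamp][eventType]. */
final class CompositeKeySketch {
    /** Serializes components broadest first, so a key prefix selects a contiguous range. */
    static byte[] rowKey(int customerId, long timestampMillis, String eventType) {
        byte[] type = eventType.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + 8 + type.length)
                .putInt(customerId)                         // always known: top of the funnel
                .putLong(Long.MAX_VALUE - timestampMillis)  // newest rows sort first
                .put(type)                                  // most specific component last
                .array();
    }

    /** Prefix shared by every row of one customer; usable as the start of a range scan. */
    static byte[] prefixFor(int customerId) {
        return ByteBuffer.allocate(4).putInt(customerId).array();
    }

    /** Unsigned lexicographic comparison of two keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    static boolean hasPrefix(byte[] key, byte[] prefix) {
        return key.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(key, prefix.length), prefix);
    }
}
```

With this layout a query for one customer is a single range scan, while a query by event type alone cannot use the key at all, which is the limitation the quote describes. If most traffic goes to a handful of customer IDs, those key ranges are also where the load would concentrate.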

Original title and link: How to Organize Your HBase Keys (NoSQL database©myNoSQL)


HBase HFile Explained

This is probably one of the most comprehensible and complete articles about how HBase stores data:

Hadoop comes with a SequenceFile file format that you can use to append your key/value pairs but due to the HDFS append-only capability, the file format cannot allow modification or removal of an inserted value. […] To help you solve this problem Hadoop has another file format, called MapFile, an extension of the SequenceFile. The MapFile, in reality, is a directory that contains two SequenceFiles: the data file “/data” and the index file “/index”. The MapFile allows you to append sorted key/value pairs and every N keys (where N is a configurable interval) it stores the key and the offset in the index.
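
The "index every N keys" idea can be illustrated with a small in-memory model. This is a sketch in plain Java and not the Hadoop MapFile API: the class name, the use of list positions as offsets, and the String keys are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a sorted data file plus a sparse index; not the Hadoop MapFile API. */
final class SparseIndexSketch {
    private final List<String> keys = new ArrayList<>();          // "data file": sorted keys
    private final List<String> values = new ArrayList<>();
    private final List<String> indexKeys = new ArrayList<>();     // "index file": every Nth key
    private final List<Integer> indexOffsets = new ArrayList<>(); // position of that key in the data
    private final int interval;

    SparseIndexSketch(int interval) {
        this.interval = interval;
    }

    /** Appends an entry; keys must arrive in strictly ascending order. */
    void append(String key, String value) {
        if (!keys.isEmpty() && keys.get(keys.size() - 1).compareTo(key) >= 0) {
            throw new IllegalArgumentException("keys must be appended in sorted order");
        }
        if (keys.size() % interval == 0) {  // record every Nth key and its offset
            indexKeys.add(key);
            indexOffsets.add(keys.size());
        }
        keys.add(key);
        values.add(value);
    }

    /** Binary-searches the index, then scans forward through the data from the indexed offset. */
    String get(String key) {
        int lo = 0;
        int hi = indexKeys.size() - 1;
        int start = -1;
        while (lo <= hi) {                  // find the last index key <= target
            int mid = (lo + hi) >>> 1;
            if (indexKeys.get(mid).compareTo(key) <= 0) {
                start = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (start < 0) {
            return null;                    // target sorts before the first key
        }
        for (int i = indexOffsets.get(start); i < keys.size(); i++) {
            int c = keys.get(i).compareTo(key);
            if (c == 0) {
                return values.get(i);
            }
            if (c > 0) {
                break;                      // passed the position where the key would be
            }
        }
        return null;
    }

    int indexSize() {
        return indexKeys.size();
    }
}
```

The interval is the trade-off knob: a larger N keeps the index small enough to hold in memory, at the cost of scanning up to N entries per lookup.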

Original title and link: HBase HFile Explained (NoSQL database©myNoSQL)


Hortonworks Data Platform 1.0

Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.

Some info points:

  1. Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop
    1. HDP 1.0 is based on Apache Hadoop 1.0
    2. Apache Ambari is used for installation and provisioning
    3. The same Apache Ambari is behind the Hortonworks Management Console
    4. For data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
    5. Apache HCatalog is the solution offering metadata and table management
  2. Hortonworks Data Platform is 100% open source. I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community
  3. HDP comes with 3 levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster

One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on using VMware-powered virtual machines for the NameNode and JobTracker. My first thought about this approach was that it was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly available version of the NameNode (the Cloudera Hadoop distribution uses this solution), and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.

You can read a lot of posts about this announcement, but you’ll find all the details in Hortonworks’s John Kreisa’s post here and the PR announcement.

Original title and link: Hortonworks Data Platform 1.0 (NoSQL database©myNoSQL)

Performance Evaluation of HBase and How Hardware Changes Results

Two posts by Oliver Meyn on measuring the performance of two HBase clusters (first the results on the original cluster, then the results on the upgraded cluster) using org.apache.hadoop.hbase.PerformanceEvaluation, with the resulting performance charts, Ganglia charts, and some thoughts and feedback from the HBase community.

Original title and link: Performance Evaluation of HBase and How Hardware Changes Results (NoSQL database©myNoSQL)

HBase 0.94 Released: What’s New

With over 350 enhancements and bug fixes, 0.94 is the new major release of HBase. This Cloudera blog post gives a good summary of the most interesting improvements:

  • Read caching improvements
  • Seek optimizations
  • WAL write optimizations
  • Added functionality in hbck: fixing orphaned regions, region holes, and overlapping regions
  • Simplified region sizing
  • Atomic Put & Delete in a single transaction

Original title and link: HBase 0.94 Released: What’s New (NoSQL database©myNoSQL)

Notes on the Hadoop and HBase Markets

Curt Monash shares what he heard from his customers:

  • Over half of Cloudera’s customers (N.B.: 100 subscription customers) use HBase
  • Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
  • There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.

Original title and link: Notes on the Hadoop and HBase Markets (NoSQL database©myNoSQL)


Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop

Apache Bigtop:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

Currently packaging:

  • Apache Hadoop 1.0.x
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. In practice, though, its goal is to ensure that all the components of the wide Hadoop ecosystem remain interoperable.

Original title and link: Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop (NoSQL database©myNoSQL)

Here Is Why in Cassandra vs. HBase, Riak, CouchDB, MongoDB, It's Cassandra FTW

Brian O’Neill:

Now, since choosing Cassandra, I can say there are a few other really important less tangible considerations. The first, is the code base. Cassandra has an extremely clean and well maintained code base. Jonathan and team do a fantastic job managing the community and the code. As we adopted NoSQL, the ability to extend the code-base and incorporate our own features has proven invaluable. (e.g. triggers, a REST interface, and server-side wide-row indexing)

Secondly, the community is phenomenal. That results in timely support, and solid releases on a regular schedule. They do a great job prioritizing features, accepting contributions, and cranking out features. (They are now releasing ~quarterly) We’ve all probably been part of other open source projects where the leadership is lacking, and features and releases are unpredictable, which makes your own release planning difficult. Kudos to the Cassandra team.

Everything sounds reasonable except for Riak being the “new kid on the block” and not finding support for it. Basho, where were you hiding?

Original title and link: Here Is Why in Cassandra vs. HBase, Riak, CouchDB, MongoDB, It’s Cassandra FTW (NoSQL database©myNoSQL)


The HBase Roadmap: Where Do We Want HBase to Be in Two Years?

The HBase project management committee:

After further banter, we arrived at list: reliability, operability (insight into the running application, dynamic config. changes, usability improvements that make it easier on a clueful ops), and performance (in this order). It was offered that we are not too bad on performance — especially in 0.94 — and that use cases will drive the performance improvements so focus should be on the first two items in the list. […] To improve reliability, testing has to be better. This has been said repeatedly in the past.

EMC has announced a 1000+ node cluster for Apache Hadoop testing, so maybe a similar initiative is needed for HBase too. Considering how many large organizations are using HBase, it shouldn’t be difficult to get these resources as long as someone assumes ownership and leadership of the effort.

Original title and link: The HBase Roadmap: Where Do We Want HBase to Be in Two Years? (NoSQL database©myNoSQL)


Architecture of HBase-based Lucene Implementation

Boris Lublinsky and Mike Segel:

The implementation tries to balance two conflicting requirements - performance: in memory cache can drastically improve performance by minimizing the amount of HBase reads for search and documents retrieval; and scalability: ability to run as many Lucene instances as required to support growing search clients population. The latter requires minimizing of the cache life time to synchronize content with the HBase instance (a single copy of truth). A compromise is achieved through implementing configurable cache time to live parameter, limiting cache presence in each Lucene instance.
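
The compromise described in the quote, a cache whose entries expire after a configurable time to live, can be sketched roughly as follows. This is plain Java and not the authors' implementation: the class name is hypothetical, the loader function stands in for an HBase read, and the clock is injected only so that the expiry logic can be exercised without waiting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Toy read-through cache with a time to live; the loader stands in for an HBase read. */
final class TtlCacheSketch<K, V> {
    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;

        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;
    private final Function<K, V> loader;

    TtlCacheSketch(long ttlMillis, LongSupplier clock, Function<K, V> loader) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.loader = loader;
    }

    /** Returns the cached value, reloading it from the backing store once it is older than the TTL. */
    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry == null || now - entry.loadedAtMillis >= ttlMillis) {
            entry = new Entry<>(loader.apply(key), now);  // one read against the source of truth
            entries.put(key, entry);
        }
        return entry.value;
    }
}
```

A shorter TTL bounds how stale any one instance can be relative to the shared store, while a longer TTL saves more reads; that is the performance versus scalability trade-off the authors describe.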

Architecture of HBase-based Lucene implementation

Besides existing Solr scaling approaches and the work to make Solr scalable, there’s also the recently released DataStax Enterprise which integrates Solr on top of Cassandra.

Original title and link: Architecture of HBase-based Lucene Implementation (NoSQL database©myNoSQL)


ACID in HBase: Row Level Operations Explained. Plus Something New

Lars Hofhansl:

HBase employs a kind of MVCC. And HBase has no mixed read/write transactions. […] When a write transaction (a set of puts or deletes) starts it retrieves the next highest transaction number. In HBase this is called a WriteNumber. When a read transaction (a Scan or Get) starts it retrieves the transaction number of the last committed transaction. HBase calls this the ReadPoint.
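
A rough way to picture the WriteNumber and ReadPoint mechanics is the single-threaded toy model below. It borrows the two terms from the quote, but everything else (the class, the single cell, the rule that commits happen in write-number order) is an assumption made to keep the sketch short; it is not how HBase implements it.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy single-threaded MVCC model for one cell; not HBase code. */
final class MvccSketch {
    private static final class Version {
        final long writeNumber;
        final String value;

        Version(long writeNumber, String value) {
            this.writeNumber = writeNumber;
            this.value = value;
        }
    }

    private final List<Version> versions = new ArrayList<>();
    private long nextWriteNumber = 1;
    private long readPoint = 0;  // write number of the last committed write

    /** Starts a write: takes the next write number and stages the value under it. */
    long beginWrite(String value) {
        long writeNumber = nextWriteNumber++;
        versions.add(new Version(writeNumber, value));
        return writeNumber;
    }

    /** Commits a write by advancing the read point; this toy requires commits in write-number order. */
    void commit(long writeNumber) {
        if (writeNumber != readPoint + 1) {
            throw new IllegalStateException("commit out of order");
        }
        readPoint = writeNumber;
    }

    /** Starts a read: captures the write number of the last committed write. */
    long beginRead() {
        return readPoint;
    }

    /** Returns the newest value visible at the captured read point, or null if there is none. */
    String read(long readPointAtStart) {
        String visible = null;
        for (Version v : versions) {  // versions are stored in ascending write-number order
            if (v.writeNumber <= readPointAtStart) {
                visible = v.value;
            }
        }
        return visible;
    }
}
```

A reader that captured its read point before a later write committed keeps seeing the older value, and a staged but uncommitted write is invisible to everyone, which is the isolation the quote attributes to this scheme.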

Understanding the behavior of read and write operations in HBase is definitely useful. Learning that an upcoming HBase version will support atomic multi operations (HBASE-3584) and even multi-row local transactions (HBASE-5229) is priceless.

For HBase atomic multi-operations:

 // ROW is the row key shared by both mutations;
 // a single row is what allows them to be applied atomically.
 Delete d = new Delete(ROW);
 Put p = new Put(ROW);
 AtomicRowMutation arm = new AtomicRowMutation(ROW);

HBase multi-row local transactions are implemented as the mutateRowsWithLocks method in HRegion and can be used by coprocessors only (there is no client API).

Original title and link: ACID in HBase: Row Level Operations Explained. Plus Something New (NoSQL database©myNoSQL)