ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

HBase: All content tagged as HBase in NoSQL databases and polyglot persistence

YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak

Put together by the team at Altoros Systems Inc., this time run in the Amazon EC2 and including Cassandra, HBase, MongoDB, MySQL Cluster, sharded MySQL and Riak:

After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.

Teaser: HBase got the best results in most of the benchmarks (with flush turned off though). And I’m not sure the setup included the latest HBase read improvements from Facebook.

Original title and link: YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak (NoSQL database©myNoSQL)

via: http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html&pagename=/news/tech/2012/102212-nosql-263595.html&pageurl=http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html&site=printpage&nsdr=n


Improving HBase Read Performance at Facebook

Starting from Hypertable v HBase benchmark and building on the things HBase could learn from it, the Facebook team set to improve the read performance in HBase. And they’ve accomplished it:

HBase v Hypertable read performance before-after improvements

Original title and link: Improving HBase Read Performance at Facebook (NoSQL database©myNoSQL)

via: http://hadoopstack.com/hbase-versus-hypertable/


Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra

A three part article from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:

Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.

Original title and link: Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra (NoSQL database©myNoSQL)


Accumulo, HBase, Cassandra and Some Unanswered Questions

Cade Metz for Wired:

The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.

Is this good for HBase and Cassandra? Is this good for encouraging innovation? Is this good for supporting businesses? Just a few questions I couldn’t answer myself after reading this article about the investigation initiated by the Senate into NSA’s open sourced Accumulo.

Original title and link: Accumulo, HBase, Cassandra and Some Unanswered Questions (NoSQL database©myNoSQL)

via: http://www.wired.com/wiredenterprise/2012/07/nsa-accumulo-google-bigtable/


Big Data at Aadhaar With Hadoop, HBase, MongoDB, MySQL, and Solr

It’s unfortunate that the post focuses mostly on the usage of Spring and RabitMQ and the slidedeck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistentency architecture:

Architecture of Big Data at Aadhaar

Big Data at Aadhaar Data Stores

The slide deck presenting architecture principles and numbers about the platform after the break.


Klout Data Architecture: MySQL, HBase, Hive, Pig, Elastic Search, MongoDB, SSAS

Just found slideck (embedded below) describing the data workflow at Klout. Their architecture includes many interesting pieces combining both NoSQL and relational databases with Hadoop and Hive and Pig and traditional BI. Even Excel gets a mention in the slides:

  1. Pig and Hive
  2. HBase
  3. Elastic Search
  4. MongoDB
  5. MySQL

Klout Data Architecture


Configuring HBase Memstore: What You Should Know

A very well documented post by Alex Baranau about HBase Memstore, HBase write and read operations and the importance of correctly configuring Memstore:

  • There are number of configuration options for Memstore one can use to achieve better performance and avoid issues. HBase will not adjust settings for you based on usage pattern.
  • Frequent Memstore flushes can affect reading performance and can bring additional load to the system
  • The way Memstore flushes work may affect your schema design

hbase_read_write_path2_small

Original title and link: Configuring HBase Memstore: What You Should Know (NoSQL database©myNoSQL)

via: http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/


How to Organize Your HBase Keys

The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish.[…]

As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.

This sounds in a way similar to how Amazon DynamoDB hash and range type primary keys or Oracle NoSQL Major-minor keys are working.

Original title and link: How to Organize Your HBase Keys (NoSQL database©myNoSQL)

via: http://tech.flurry.com/137492485


HBase HFile Explained

This is probably the most comprehensible and complete articles about how HBase is storing data:

Hadoop comes with a SequenceFile[1] file format that you can use to append your key/value pairs but due to the hdfs append-only capability, the file format cannot allow modification or removal of an inserted value. […] To help you solve this problem Hadoop has another file format, called MapFile[1], an extension of the SequenceFile. The MapFile, in reality, is a directory that contains two SequenceFiles: the data file “/data” and the index file “/index”. The MapFile allows you to append sorted key/value pairs and every N keys (where N is a configurable interval) it stores the key and the offset in the index.

Original title and link: HBase HFile Explained (NoSQL database©myNoSQL)

via: http://www.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/


Hortonworks Data Platform 1.0

Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.

Some info points:

  1. Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop

    hdp-diagram

    1. HDP 1.0 is based on Apache Hadoop 1.0
    2. Apache Ambari is used for installation and provisioning
    3. The same Apache Amabari is behind the Hortonworks Management Console
    4. For Data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
    5. Apache HCatalog is the solution offering metadata and table management
  2. Hortonworks Data Platform is 100% open source—I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community

  3. HDP comes with 3 levels of support subscriptions, pricing starting at $12500/year for a 10 nodes cluster

One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on using VMWare-powered virtual machines for the NameNode and JobTracker. My first thought about this approach is that it was chosen to strengthen a partnership with VMWare. On the other hand, Hadoop 2.0 contains already a new highly-available version of the NameNode (Cloudera Hadoop Distribution uses this solution) and VMWare has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.

You can read a lot of posts about this announcement, but you’ll find all the details in Hortonworks’s John Kreisa’s post here and the PR announcement.

Original title and link: Hortonworks Data Platform 1.0 (NoSQL database©myNoSQL)


Performance Evaluation of HBase and How Hardware Changes Results

Two posts by Oliver Meyn on measuring the performance of two HBase clusters—first results on the original cluster and results on the upgraded cluster— using org.apache.hadoop.hbase.PerformanceEvaluation, the resulting performance charts, Ganglia charts, and some thoughts and feedback from the HBase community.

Original title and link: Performance Evaluation of HBase and How Hardware Changes Results (NoSQL database©myNoSQL)


HBase 0.94 Released: What’s New

With over 350 enhancements and bug fixes, 0.94 is the new major release of HBase. This Cloudera blog post does a good summary of the most interesting improvements:

  • Read caching improvements
  • Seek optimizations
  • WAL writes optimizations
  • added functionality to HBck: fixing orphaned regions, region holes, overlapping regions
  • simplified region sizing
  • atomic Put & Delete in a single transaction

Original title and link: HBase 0.94 Released: What’s New (NoSQL database©myNoSQL)