HBase: All content tagged as HBase in NoSQL databases and polyglot persistence
Thursday, 25 October 2012
YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak
Put together by the team at Altoros Systems Inc., this time run on Amazon EC2 and including Cassandra, HBase, MongoDB, MySQL Cluster, sharded MySQL, and Riak:
After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.
Teaser: HBase got the best results in most of the benchmarks (with flush turned off though). And I’m not sure the setup included the latest HBase read improvements from Facebook.
Original title and link: YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak (©myNoSQL)
Tuesday, 23 October 2012
Improving HBase Read Performance at Facebook
Starting from the Hypertable vs. HBase benchmark and building on the things HBase could learn from it, the Facebook team set out to improve the read performance of HBase. And they’ve accomplished it:
Original title and link: Improving HBase Read Performance at Facebook (©myNoSQL)
Tuesday, 2 October 2012
Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra
A three part article from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:
Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
- Part 1: Pig, MongoDB and Node.js
- Part 2: Pig, HBase, JRuby and Sinatra
- Part 3: TF-IDF Topics with Cassandra, Python Streaming and Flask
Original title and link: Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra (©myNoSQL)
Tuesday, 7 August 2012
Accumulo, HBase, Cassandra and Some Unanswered Questions
Cade Metz for Wired:
The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.
Is this good for HBase and Cassandra? Is this good for encouraging innovation? Is this good for supporting businesses? Just a few questions I couldn’t answer myself after reading this article about the investigation initiated by the Senate into NSA’s open sourced Accumulo.
Original title and link: Accumulo, HBase, Cassandra and Some Unanswered Questions (©myNoSQL)
via: http://www.wired.com/wiredenterprise/2012/07/nsa-accumulo-google-bigtable/
Monday, 6 August 2012
Big Data at Aadhaar With Hadoop, HBase, MongoDB, MySQL, and Solr
It’s unfortunate that the post focuses mostly on the usage of Spring and RabbitMQ and that the slide deck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistence architecture:
The slide deck presenting architecture principles and numbers about the platform follows after the break.
Tuesday, 17 July 2012
Klout Data Architecture: MySQL, HBase, Hive, Pig, Elastic Search, MongoDB, SSAS
Just found a slide deck (embedded below) describing the data workflow at Klout. Their architecture includes many interesting pieces, combining both NoSQL and relational databases with Hadoop, Hive, Pig, and traditional BI. Even Excel gets a mention in the slides:
- Pig and Hive
- HBase
- Elastic Search
- MongoDB
- MySQL
Configuring HBase Memstore: What You Should Know
A very well documented post by Alex Baranau about HBase Memstore, HBase write and read operations and the importance of correctly configuring Memstore:
- There are a number of configuration options for Memstore one can use to achieve better performance and avoid issues. HBase will not adjust settings for you based on usage pattern.
- Frequent Memstore flushes can affect reading performance and can bring additional load to the system
- The way Memstore flushes work may affect your schema design
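To make the second point more concrete, here is a minimal toy sketch of a write buffer that flushes to immutable sorted segments. It is not HBase code: the `ToyStore` class, its `flush_threshold` parameter (counted in entries rather than bytes), and the "segments checked" counter are all assumptions made for illustration. It only shows why flushing more often can leave more files for a read to consult.

```python
import bisect


class ToyStore:
    """Toy model (not HBase): an in-memory buffer flushed to sorted segments."""

    def __init__(self, flush_threshold):
        # Hypothetical unit: number of buffered entries, not bytes.
        self.flush_threshold = flush_threshold
        self.memstore = {}   # recent writes, held in memory
        self.files = []      # flushed, immutable sorted segments (newest last)

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as one sorted segment and start a fresh buffer.
        if self.memstore:
            self.files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Return (value, number_of_segments_checked)."""
        if key in self.memstore:
            return self.memstore[key], 0
        checked = 0
        for segment in reversed(self.files):  # newest segment first
            checked += 1
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1], checked
        return None, checked


small = ToyStore(flush_threshold=2)   # flushes often
large = ToyStore(flush_threshold=6)   # flushes rarely
for n in range(6):
    small.put(f"row{n}", n)
    large.put(f"row{n}", n)

print(len(small.files), small.get("row0"))
print(len(large.files), large.get("row0"))
```

With the same six writes, the store with the smaller threshold ends up with more segments, so a read of an old key has to look in more places. In a real cluster, compactions and Bloom filters change this picture considerably, so treat the sketch as intuition only.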
Original title and link: Configuring HBase Memstore: What You Should Know (©myNoSQL)
via: http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
Thursday, 12 July 2012
How to Organize Your HBase Keys
The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish.[…]
As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.
This sounds somewhat similar to how Amazon DynamoDB’s hash and range type primary keys or Oracle NoSQL’s major-minor keys work.
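The funnel idea from the quoted post can be sketched in a few lines. This is a minimal illustration, not HBase client code: the `(user_id, timestamp)` layout, the `make_key` and `prefix_scan` helpers, and the sorted Python list standing in for a table are all assumptions. The sketch shows that fixed-width big-endian encoding keeps byte order aligned with numeric order, and that only a leading component of the key supports an efficient range lookup.

```python
import bisect
import struct


def make_key(user_id: int, timestamp: int) -> bytes:
    """Serialize (user_id, timestamp) so that byte order matches numeric order."""
    # Big-endian, fixed width: 4-byte user id followed by 8-byte timestamp.
    return struct.pack(">IQ", user_id, timestamp)


def prefix_scan(sorted_keys, prefix: bytes):
    """Return all keys that start with `prefix` from a sorted list of keys."""
    start = bisect.bisect_left(sorted_keys, prefix)  # jump to the first candidate
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing later can match
        result.append(key)
    return result


# A sorted list of keys stands in for a table ordered by row key.
keys = sorted(make_key(u, t) for u in (1, 2, 300) for t in (10, 20, 30))

# Efficient: the first component is known, so the matches are contiguous.
user2 = prefix_scan(keys, struct.pack(">I", 2))
print([struct.unpack(">IQ", k) for k in user2])

# Inefficient: filtering on the second component alone must visit every key.
at_20 = [k for k in keys if struct.unpack(">IQ", k)[1] == 20]
print(len(at_20))
```

The same ordering also explains the caveat in the quote: if most traffic targets one `user_id`, all of its keys sit next to each other and end up on the same region.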
Original title and link: How to Organize Your HBase Keys (©myNoSQL)
Monday, 9 July 2012
HBase HFile Explained
This is probably one of the most comprehensible and complete articles about how HBase stores data:
Hadoop comes with a SequenceFile[1] file format that you can use to append your key/value pairs but due to the hdfs append-only capability, the file format cannot allow modification or removal of an inserted value. […] To help you solve this problem Hadoop has another file format, called MapFile[1], an extension of the SequenceFile. The MapFile, in reality, is a directory that contains two SequenceFiles: the data file “/data” and the index file “/index”. The MapFile allows you to append sorted key/value pairs and every N keys (where N is a configurable interval) it stores the key and the offset in the index.
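The "index every N keys" idea described in the quote can be sketched as follows. This is not the Hadoop MapFile API: the `build_sparse_index` and `lookup` functions are hypothetical, and list positions stand in for the byte offsets a real index file would store. The sketch only shows how a sparse index narrows a lookup to a short sequential scan over sorted data.

```python
import bisect


def build_sparse_index(sorted_records, interval):
    """Record the key and position of every `interval`-th entry."""
    return [(sorted_records[i][0], i)
            for i in range(0, len(sorted_records), interval)]


def lookup(sorted_records, index, key):
    """Find `key` by jumping to the nearest indexed entry, then scanning forward."""
    index_keys = [k for k, _ in index]
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first record
    start = index[slot][1]
    for k, v in sorted_records[start:]:
        if k == key:
            return v
        if k > key:
            break  # passed the place where the key would be
    return None


# Sorted key/value pairs stand in for the data file.
records = [(f"key{n:03d}", n * n) for n in range(10)]
index = build_sparse_index(records, interval=4)

print(index)
print(lookup(records, index, "key006"))
```

A smaller interval makes the index larger but shortens the scan after each jump; a larger interval does the opposite.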
Original title and link: HBase HFile Explained (©myNoSQL)
via: http://www.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
Friday, 15 June 2012
Hortonworks Data Platform 1.0
Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.
Some info points:
- Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop
- HDP 1.0 is based on Apache Hadoop 1.0
- Apache Ambari is used for installation and provisioning
- The same Apache Ambari is behind the Hortonworks Management Console
- For Data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
- Apache HCatalog is the solution offering metadata and table management
- Hortonworks Data Platform is 100% open source. I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community
- HDP comes with 3 levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster
One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on using VMware-powered virtual machines for the NameNode and JobTracker. My first thought about this approach is that it was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly available version of the NameNode (Cloudera Hadoop Distribution uses this solution) and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.
You can read a lot of posts about this announcement, but you’ll find all the details in Hortonworks’s John Kreisa’s post here and the PR announcement.
Original title and link: Hortonworks Data Platform 1.0 (©myNoSQL)
Friday, 8 June 2012
Performance Evaluation of HBase and How Hardware Changes Results
Two posts by Oliver Meyn on measuring the performance of two HBase clusters (first the results on the original cluster, then the results on the upgraded cluster) using org.apache.hadoop.hbase.PerformanceEvaluation, with the resulting performance charts, Ganglia charts, and some thoughts and feedback from the HBase community.
Original title and link: Performance Evaluation of HBase and How Hardware Changes Results (©myNoSQL)
Thursday, 17 May 2012
HBase 0.94 Released: What’s New
With over 350 enhancements and bug fixes, 0.94 is the new major release of HBase. This Cloudera blog post gives a good summary of the most interesting improvements:
- Read caching improvements
- Seek optimizations
- WAL writes optimizations
- added functionality to HBck: fixing orphaned regions, region holes, overlapping regions
- simplified region sizing
- atomic Put & Delete in a single transaction
Original title and link: HBase 0.94 Released: What’s New (©myNoSQL)