hbase: All content tagged as hbase in NoSQL databases and polyglot persistence
Thursday, 31 January 2013
SQL Over HBase With Phoenix
Released by the Salesforce team, Phoenix adds a SQL layer on top of HBase and an almost complete JDBC driver.
Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
The project already has a page about the performance and the results are looking great. For a bullet list summary, check out James Taylor’s post.
Original title and link: SQL Over HBase With Phoenix (©myNoSQL)
Monday, 21 January 2013
11 Interesting Releases From the First Weeks of January
The list of releases I wanted to post about has been growing fast these last couple of weeks, so instead of waiting any longer, here it is (in no particular order[1]):
- (Jan.2nd) Cassandra 1.2 — announcement on DataStax’s blog. I’m currently learning and working on a post looking at what’s new in Cassandra 1.2.
- (Jan.10th) Apache Pig 0.10.1 — Hortonworks wrote about it
- (Jan.10th) DataStax Community Edition 1.2 and OpsCenter 2.1.3 — DataStax announcement
- (Jan.10th) CouchDB 1.0.4, 1.1.2, and 1.2.1 — releases fixing some security vulnerabilities
- (Jan.11th) MongoDB 2.3.2 unstable: announcement. This dev release includes support for full text indexing. For more details you can check:
  - MongoDB Full Text Search Explained and MongoDB Text Search Tutorial
  - Full text search in MongoDB: details about supported languages and queries
  - Indexing a Markdown blog using MongoDB full text indexing
  - Short demo of MongoDB text search and hashed shard keys
- (Jan.12th) Apache HBase 0.94.4 — announcement and release notes
- (Jan.14th) Apache Hive 0.10.0: Hortonworks’s post about it
- (Jan.15th) Hortonworks Data Platform 1.2 featuring Apache Ambari: official PR announcement
- (Jan.16th) Redis 2.6.9 — release notes
- (Jan.16th) HyperDex 1.0RC1 — no docs
- (Jan.16th) Klout’s Brickhouse — announcement:
[…] an open source project extending Hadoop and Hive with a collection of useful user-defined-functions. Its aim is to make the Hive Big Data developer more productive, and to enable scalable and robust dataflows.
[1] I’ve tried to order it chronologically, but most probably I’ve failed.
Monday, 19 November 2012
HBase Roadmap
Devaraj Das’s post on the Hortonworks blog details the current and future work on HBase:
- Reliability and High Availability (all data always available, and recovery from failures is quick)
- Autonomous operation (minimum operator intervention)
- Wire compatibility (to support rolling upgrades across a couple of versions at least)
- Cross data-center replication (for disaster recovery)
- Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
- Monitoring and Diagnostics (which regionserver is hot or what caused an outage)
Future:
- Better and improved clients (asynchronous clients, and, in multiple languages)
- Cell-level security (access control for every cell in a table)
- Multi-tenancy (HBase becomes a viable shared platform for multiple applications using it)
- Secondary indexing functionality
Current work=reliability. Future work=usability.
Thursday, 25 October 2012
YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak
Put together by the team at Altoros Systems Inc., this round of benchmarks was run on Amazon EC2 and included Cassandra, HBase, MongoDB, MySQL Cluster, sharded MySQL, and Riak:
After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.
Teaser: HBase got the best results in most of the benchmarks (with flush turned off though). And I’m not sure the setup included the latest HBase read improvements from Facebook.
Tuesday, 23 October 2012
Improving HBase Read Performance at Facebook
Starting from the Hypertable vs. HBase benchmark and building on the things HBase could learn from it, the Facebook team set out to improve the read performance in HBase. And they’ve accomplished it.
Tuesday, 2 October 2012
Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra
A three part article from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:
Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
- Part 1: Pig, MongoDB and Node.js
- Part 2: Pig, HBase, JRuby and Sinatra
- Part 3: TF-IDF Topics with Cassandra, Python Streaming and Flask
Tuesday, 7 August 2012
Accumulo, HBase, Cassandra and Some Unanswered Questions
Cade Metz for Wired:
The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.
Is this good for HBase and Cassandra? Is this good for encouraging innovation? Is this good for supporting businesses? Just a few questions I couldn’t answer myself after reading this article about the investigation initiated by the Senate into NSA’s open sourced Accumulo.
via: http://www.wired.com/wiredenterprise/2012/07/nsa-accumulo-google-bigtable/
Monday, 6 August 2012
Big Data at Aadhaar With Hadoop, HBase, MongoDB, MySQL, and Solr
It’s unfortunate that the post focuses mostly on the usage of Spring and RabbitMQ and the slide deck doesn’t dive deeper into the architecture, data flows, and data stores, but the diagrams below should give you an idea of this truly polyglot persistence architecture:
The slide deck presenting architecture principles and numbers about the platform after the break.
Tuesday, 17 July 2012
Klout Data Architecture: MySQL, HBase, Hive, Pig, Elastic Search, MongoDB, SSAS
Just found a slide deck (embedded below) describing the data workflow at Klout. Their architecture includes many interesting pieces, combining both NoSQL and relational databases with Hadoop, Hive, Pig, and traditional BI. Even Excel gets a mention in the slides:
- Pig and Hive
- HBase
- Elastic Search
- MongoDB
- MySQL
Configuring HBase Memstore: What You Should Know
A very well documented post by Alex Baranau about HBase Memstore, HBase write and read operations and the importance of correctly configuring Memstore:
- There are a number of configuration options for Memstore one can use to achieve better performance and avoid issues. HBase will not adjust settings for you based on usage pattern.
- Frequent Memstore flushes can affect reading performance and can bring additional load to the system
- The way Memstore flushes work may affect your schema design
via: http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
Thursday, 12 July 2012
How to Organize Your HBase Keys
The primary limitation of composite keys is that you can only query efficiently by known components of the composite key in the order they are serialized. Because of this limitation I find it easiest to think of your key like a funnel. Start with the piece of data you always need to partition on, and narrow it down to the more specific data that you don’t often need to distinguish.[…]
As a caveat to this process, keep in mind that HBase partitions its data across region servers based on the same lexicographic ordering that gets us the behavior we’re exploiting. If your reads/writes are heavily concentrated into a few values for the first (or first few) components of your key, you will end up with poorly distributed load across region servers. HBase functions best when the distribution of reads/writes is uniform across all potential row key values. While a perfectly uniform distribution might be impossible, this should still be a consideration when constructing a composite key.
This sounds similar to how Amazon DynamoDB’s hash and range type primary keys or Oracle NoSQL’s major-minor keys work.
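To make the funnel idea concrete, here is a minimal Python sketch. It is not HBase client code: a sorted in-memory list stands in for HBase's lexicographically ordered row keys, and the separator byte, the `make_key` and `prefix_scan` helpers, and the (customer, date, event id) key layout are all assumptions made up for illustration.

```python
import bisect

# Hypothetical separator byte; assumes key components never contain it.
SEP = b"\x00"


def make_key(*parts):
    """Serialize components into one composite key, most general part first."""
    return SEP.join(p.encode("utf-8") for p in parts)


def prefix_scan(sorted_keys, *prefix_parts):
    """Return the keys whose leading components equal prefix_parts.

    Only works for a leading run of components, which is the limitation
    the quoted post describes.
    """
    start = make_key(*prefix_parts) + SEP
    # Replacing the trailing separator with the next byte value gives an
    # exclusive upper bound for every key that starts with `start`.
    stop = start[:-1] + b"\x01"
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Illustrative rows: (customer, date, event id).
rows = [
    ("acme", "2012-07-11", "0001"),
    ("acme", "2012-07-12", "0001"),
    ("acme", "2012-07-12", "0002"),
    ("zenith", "2012-07-12", "0001"),
]
keys = sorted(make_key(*r) for r in rows)

# Efficient: both lookups touch one contiguous slice of the sorted keys.
acme_all = prefix_scan(keys, "acme")
acme_day = prefix_scan(keys, "acme", "2012-07-12")

# Inefficient: the date is not a leading component, so every key is checked.
by_date = [k for k in keys if k.split(SEP)[1] == b"2012-07-12"]
```

Because the sort order that makes the two prefix lookups cheap is the same order used to split data into regions, a key layout where most traffic shares one leading value (say, a single very busy customer) would concentrate load on a few region servers, which is the caveat the quote raises.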
Monday, 9 July 2012
HBase HFile Explained
This is probably one of the most comprehensible and complete articles about how HBase stores data:
Hadoop comes with a SequenceFile[1] file format that you can use to append your key/value pairs but due to the hdfs append-only capability, the file format cannot allow modification or removal of an inserted value. […] To help you solve this problem Hadoop has another file format, called MapFile[1], an extension of the SequenceFile. The MapFile, in reality, is a directory that contains two SequenceFiles: the data file “/data” and the index file “/index”. The MapFile allows you to append sorted key/value pairs and every N keys (where N is a configurable interval) it stores the key and the offset in the index.
via: http://www.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
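The data-plus-sparse-index layout described in the quote can be sketched in a few lines of Python. This is an in-memory illustration only, not Hadoop's MapFile API: the `SparseIndexedFile` class, its method names, and the use of list positions in place of byte offsets are assumptions, and the strictly increasing key check is a simplification.

```python
import bisect


class SparseIndexedFile:
    """Append-only sorted data log plus an index entry for every Nth key."""

    def __init__(self, index_interval=3):
        self.index_interval = index_interval  # the "N" from the quote
        self.data = []           # (key, value) records; list position stands in for a file offset
        self.index_keys = []     # every Nth key
        self.index_offsets = []  # offset of that key in the data log

    def append(self, key, value):
        # Simplification: require strictly increasing keys.
        if self.data and key <= self.data[-1][0]:
            raise ValueError("keys must be appended in increasing order")
        if len(self.data) % self.index_interval == 0:
            self.index_keys.append(key)
            self.index_offsets.append(len(self.data))
        self.data.append((key, value))

    def get(self, key):
        # Find the last indexed key that is <= the wanted key.
        i = bisect.bisect_right(self.index_keys, key) - 1
        if i < 0:
            return None  # smaller than the first key in the file
        start = self.index_offsets[i]
        # Scan at most one index interval of records from that offset.
        for k, v in self.data[start:start + self.index_interval]:
            if k == key:
                return v
            if k > key:
                break
        return None


mf = SparseIndexedFile(index_interval=3)
for n in range(10):
    mf.append(f"k{n:02d}", n * 10)

value = mf.get("k07")
missing = mf.get("k99")
```

The index holds only one entry per interval, so it stays small enough to keep in memory, at the cost of a short sequential scan on each lookup. A larger interval shrinks the index further and lengthens that scan.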




