
Big Data Debate: HBase or Cassandra

This debate about the pros and cons of HBase and Cassandra, set up by Doug Henschen for InformationWeek and featuring Jonathan Ellis (Cassandra, DataStax) and Michael Hausenblas (MapR), will stir some strong feelings:

Michael Hausenblas: An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for their internal use.

Jonathan Ellis: The technical shortcomings driving HBase’s lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.

✚ One question I couldn’t answer about this debate is why the HBase side wasn’t represented by an HBase community member or user. MapR does have an interest in HBase, but its product is not HBase.

Original title and link: Big Data Debate: HBase or Cassandra (NoSQL database©myNoSQL)

via: http://www.informationweek.com/software/enterprise-applications/big-data-debate-will-hbase-become-domina/240159475?nomobile=1


$45 million more for DataStax

Holy cow! That’s a 4 followed by a 5… with no dots in between.

  1. Derrick Harris for GigaOm: NoSQL startup DataStax raises $45M to ride Cassandra’s wave:

    Cassandra’s success with such large users has to do with its ability to handle large-scale online applications that demand steady levels of performance, DataStax CEO Billy Bosworth told me. Scalability and performance have never been among Cassandra’s shortcomings, and the database is capable of replicating data across data centers. Large companies used to choose Oracle for applications that needed these capabilities, but now that NoSQL options are around and relatively mature, companies are rethinking whether the relational database model was ever really correct for some applications in the first place.

  2. Alex Williams for TC: DataStax Readies For IPO, Raises $45M For Modern Database Platform Suited To New Data Intensive World:

    DataStax will use the funding to build out globally and invest in Apache Cassandra, the NoSQL open-source project and foundation for the company’s database distributions. The funding also signals a potential IPO for DataStax but much will depend on the direction of the markets, said CEO Billy Bosworth in an interview yesterday. “We are building the company for that direction (IPO),” he said. “A lot depends on external factors. Internally, the company is already starting that process.”

By my count:

  1. This is the largest round raised by a NoSQL company. It tops 10gen’s $42 million round for MongoDB.
  2. This is the 3rd largest round raised in the new data market, after Cloudera’s $65 million and Hortonworks’s $50 million rounds.

Original title and link: $45 million more for DataStax (NoSQL database©myNoSQL)


Get up and Running with Cassandra on Google Compute Engine

On the Google Cloud Platform blog:

The guide walks you through creating your nodes (instances), setting up Java, and creating and configuring a firewall. Included in the guide are several scripts that make the configuration and setup easy to understand and execute. Once you are finished with your cluster, a simple call to a teardown script cleans up your project’s environment.

Can you speculate why Cassandra is the first NoSQL database that gets mentioned on Google’s blog? (hint: maybe this?)

Original title and link: Get up and Running with Cassandra on Google Compute Engine (NoSQL database©myNoSQL)

via: http://googlecloudplatform.blogspot.com/2013/07/get-up-and-running-with-cassandra-on-google-compute-engine.html


How do you decide what database to use for what task?

Nathan Milford of Outbrain, answering the question of how they decide what database to use for what task:

We look at how the data will be queried, its size, and how it needs to be distributed. We might use things like MySQL for historical reasons and MongoDB for smaller tasks, and then Cassandra for situations where data doesn’t all fit into memory or where it spans multiple machines and possibly data centers.

This is indeed the right recipe: data access model, data size, distribution model.

Original title and link: How do you decide what database to use for what task? (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/the-five-minute-interview-outbrain-touches-over-80-of-all-us-online-users-with-help-from-cassandra


Cassandra Summit’s Bests

If you haven’t been to Cassandra Summit 2013 or you missed some presentations, now you can (re)watch them on YouTube. Jonathan Ellis put together his list of favorites here and here.

I’m posting this on a Saturday as there are a lot of interesting talks; if Cassandra is on your radar, it will take a couple of weekends to go through them.

Original title and link: Cassandra Summit’s Bests (NoSQL database©myNoSQL)


Best argument for official drivers

Jonathan Ellis:

More qualitatively but perhaps even more important, this addresses the paradox of choice we’ve had in the Cassandra Java world: multiple driver choices provide another barrier to newcomers, where each must evaluate the options for applicability to his project. Having just done such an evaluation to settle on Cassandra itself, this is the last thing they want to spend time on.

And that’s the best-case scenario. More often, a fragmented landscape leads to many solutions, each of which solves a different 80% of the problem. Better to have a single, well-thought-out solution that lets people get started writing their application immediately.

This is the best argument ever for having official drivers.

✚ In the early days, and over the long run, it’s quite difficult for a company to offer only official drivers. But there’s a solution for that too: recommend one, and support its maintainers.
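
✚ For reference, here is a minimal sketch of what getting started with the native CQL Java driver looks like; the contact point, keyspace, and table below are made-up placeholders, not anything from the announcement:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class NativeDriverSketch {
        public static void main(String[] args) {
            // Connect to the cluster and a keyspace (names are hypothetical).
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .build();
            Session session = cluster.connect("demo_ks");

            // The native protocol speaks CQL directly.
            ResultSet rows = session.execute("SELECT user_id, name FROM users LIMIT 10");
            for (Row row : rows) {
                System.out.println(row.getString("name"));
            }

            cluster.shutdown();
        }
    }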

Original title and link: Best argument for official drivers (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/the-native-cql-java-driver-goes-ga


Titan: Data Loading and Transactional Benchmark

The Aurelius team describing an advanced benchmark of Titan, a massively scalable property graph database supporting real-time traversals and updates; the benchmark was sponsored by Pearson and developed and run over 5 months:

The 10 terabyte, 121 billion edge graph was loaded into the cluster in 1.48 days at a rate of approximately 1.2 million edges a second with 0 failed transactions. These numbers were possible due to new developments in Titan 0.3.0 whereby graph partitioning is achieved using a domain-based byte order partitioner.

✚ The answer to why Titan is built on Cassandra can be found in this interview between Aurelius CTO Matthias Broecheler and DataStax co-founder Matt Pfeil:

[…] we don’t have to worry about things like replication, backup, and snapshots because all of that stuff is handled by Cassandra. We really just focus on: “How do you distribute a graph?”, “How do you represent a graph efficiently in a big table model?”, “How do you do things like etched compression and other things that are very graph specific in order to make the database fast?” And, lastly, “How do you build intelligent index structures so that graph traversals, which are the core of any graph database, are as fast as possible?”
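
✚ For a flavor of what loading data into Titan looks like, here is a minimal sketch against the Blueprints API that Titan exposes; the properties file path, property keys, and edge label are placeholders, and the benchmark itself batched far more work per transaction:

    import com.thinkaurelius.titan.core.TitanFactory;
    import com.thinkaurelius.titan.core.TitanGraph;
    import com.tinkerpop.blueprints.Vertex;

    public class TitanLoadSketch {
        public static void main(String[] args) {
            // titan.properties would point storage.backend at the Cassandra cluster.
            TitanGraph graph = TitanFactory.open("/path/to/titan.properties");

            Vertex alice = graph.addVertex(null);
            alice.setProperty("name", "alice");
            Vertex bob = graph.addVertex(null);
            bob.setProperty("name", "bob");

            // Every commit is a transaction; bulk loading batches many edge
            // additions per commit to amortize the transaction overhead.
            graph.addEdge(null, alice, bob, "follows");
            graph.commit();

            graph.shutdown();
        }
    }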

Original title and link: Titan: Data Loading and Transactional Benchmark (NoSQL database©myNoSQL)

via: http://www.planetcassandra.org/blog/post/educating-the-planet-with-pearson


HBase migration to the new Hadoop Metrics2 system

Elliott Clarke explains some of the work he’s doing migrating HBase’s metrics to Hadoop’s Metrics2 system:

As HBase’s metrics system grew organically, Hadoop developers were making a new version of the Metrics system called Metrics2. In HADOOP-6728 and subsequent JIRAs, a new version of the metrics system was created. This new subsystem has a new name space, different sinks, different sources, more features, and is more complete than the old metrics. When the Metrics2 system was completed, the old system (aka Metrics1) was deprecated. With all of these things in mind, it was time to update HBase’s metrics system so HBASE-4050 was started. I also wanted to clean up the implementation cruft that had accumulated.

The post focuses more on the specific implementation details than on the wide range of metrics HBase already supports and on how the new system will unify and extend them.
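
✚ To give a flavor of the API: a Metrics2 source publishes records that configured sinks (Ganglia, files, JMX) consume. This is not HBase’s actual implementation, just a sketch of the shape with hypothetical metric names:

    import org.apache.hadoop.metrics2.MetricsCollector;
    import org.apache.hadoop.metrics2.MetricsSource;
    import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
    import org.apache.hadoop.metrics2.lib.Interns;

    public class SketchMetricsSource implements MetricsSource {
        private volatile long readRequests; // hypothetical counter

        public void incrementReadRequests() {
            readRequests++;
        }

        // Called by the metrics system whenever a snapshot is needed.
        @Override
        public void getMetrics(MetricsCollector collector, boolean all) {
            collector.addRecord("SketchRecord")
                     .setContext("hbase")
                     .addCounter(Interns.info("readRequests",
                             "Read requests served"), readRequests);
        }

        public static SketchMetricsSource register() {
            SketchMetricsSource source = new SketchMetricsSource();
            DefaultMetricsSystem.instance()
                    .register("SketchSource", "A sketch of a Metrics2 source", source);
            return source;
        }
    }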

Original title and link: HBase migration to the new Hadoop Metrics2 system (NoSQL database©myNoSQL)

via: https://blogs.apache.org/hbase/entry/migration_to_the_new_metrics


Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency

A fantastic post by Nicolas Liochon and Devaraj Das looking into possible HBase failure scenarios and configurations to reduce the Mean Time to Recover:

There are no global failures in HBase: if a region server fails, all the other regions are still available. For a given data-subset, the MTTR was often considered as around ten minutes. This rule of thumb was actually coming from a common case where the recovery was taking time because it was trying to use replicas on a dead datanode. Ten minutes would be the time taken by HDFS to declare a node as dead. With the new stale mode in HDFS, it’s not the case anymore, and the recovery is now bounded by HBase alone. If you care about MTTR, with the settings mentioned here, most cases will take less than 2 minutes between the actual failure and the data being available again in another region server.

Stepping back for a bit, it looks like the overall complexity comes from the various components involved (ZooKeeper, HBase, HDFS), each with its own failure detection mechanism. If these are not correctly configured and ordered, things can get pretty ugly; ugly as in a longer MTTR than one would expect.
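
✚ If you want to experiment, the settings discussed in the post are plain HDFS and HBase configuration keys. The values below are illustrative, not recommendations; in a real deployment they live in hdfs-site.xml and hbase-site.xml rather than code:

    import org.apache.hadoop.conf.Configuration;

    public class MttrTuningSketch {
        public static Configuration lowMttrSettings() {
            Configuration conf = new Configuration();
            // Mark a datanode stale after 30s of missed heartbeats and avoid
            // it for reads/writes, instead of waiting the ~10 minutes it
            // takes HDFS to declare the node dead.
            conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
            conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
            conf.setLong("dfs.namenode.stale.datanode.interval", 30 * 1000L);
            // Region server failure is detected through ZooKeeper session
            // expiry; a lower timeout (in ms) means faster detection.
            conf.setInt("zookeeper.session.timeout", 90 * 1000);
            return conf;
        }
    }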

Original title and link: Introduction to HBase Mean Time to Recover (MTTR) - HBase Resiliency (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/


Cassandra anti-patterns: Queues and queue-like datasets or when Deletes can bite

Aleksey Yeschenko has an interesting post about the impact deletes can have on Cassandra and different workaround solutions:

Specifically, tombstones will bite you if you do lots of deletes (especially column-level deletes) and later perform slice queries on rows with a lot of tombstones.

I wouldn’t call this a case of “you got your data model wrong”, but rather a known implementation limitation that affects some scenarios, ones in which a different data model should be used; the difference, while only semantic, is that the error is not on the user.

In other words, if you use column-level deletes (or expiring columns) heavily and also need to perform slice queries over that data, try grouping columns with close “expiration date” together and getting rid of them in a single move.
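
✚ A minimal sketch of that grouping idea using the Java driver, with a hypothetical time-bucketed queue table: items that expire together share a partition, so a whole bucket is removed with a single partition-level delete instead of thousands of column tombstones. A full-partition delete still writes a tombstone, but readers never slice across it because they only query the current bucket:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class TimeBucketedQueueSketch {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo_ks"); // hypothetical keyspace

            // Assumed schema:
            //   CREATE TABLE queue (bucket text, item_id timeuuid, payload text,
            //                       PRIMARY KEY (bucket, item_id));
            long hour = System.currentTimeMillis() / (60 * 60 * 1000L);
            session.execute("INSERT INTO queue (bucket, item_id, payload) " +
                    "VALUES ('hour-" + hour + "', now(), 'work item')");

            // Once consumers are done with the previous bucket, drop it whole:
            session.execute("DELETE FROM queue WHERE bucket = 'hour-" + (hour - 1) + "'");

            cluster.shutdown();
        }
    }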

Original title and link: Cassandra anti-patterns: Queues and queue-like datasets or when Deletes can bite (NoSQL database©myNoSQL)

via: http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets


The Master-Slave Architecture of HBase

Fantastic post by Matteo Bertozzi looking at HBase’s master-slave architecture:

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.
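
✚ The point is visible from the client API itself; a plain read never involves the master. A minimal sketch, with a hypothetical table and row key:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ClientReadPathSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // The client locates the region via ZooKeeper and the META table,
            // then talks to the owning region server directly; the master
            // never sees this request.
            HTable table = new HTable(conf, "mytable");
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            System.out.println(result);
            table.close();
        }
    }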

Original title and link: The Master-Slave Architecture of HBase (NoSQL database©myNoSQL)

via: https://blogs.apache.org/hbase/entry/hbase_who_needs_a_master


HBase Data Modeling Tips & Tricks - Timeshifting

Jeff Kolesky describing the data model they are using with HBase and one (strange) trick to reduce the roundtrips to the database:

The idea is to put all of the data about a single entity into a single row in HBase. When you need to run a computation that involves that entity’s data, you have quick access to it by the row key, and all of the data is stored close together on disk.

Additionally, against many suggestions from the HBase community, and general confusion about how timestamps work, we are using timestamps with logical values. Instead of just letting the region server assign a timestamp version to each cell, we are explicitly setting those values so that we can use timestamp as a true queryable dimension in our gets and scans.

In addition to the real timeseries data that is indexed using the cell timestamp, we also have other columns that store metadata about the entity.

It’s amazing how many smart and weird tricks engineers put into their production systems when dealing with real requirements and SLAs.
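
✚ Here is what the timeshifting trick looks like in the stock HBase client API, with made-up table and column names: the application supplies the cell timestamp on write and later queries time as a real dimension.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeshiftingSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "entities"); // hypothetical table

            byte[] row = Bytes.toBytes("entity-42");
            byte[] family = Bytes.toBytes("d");
            byte[] qualifier = Bytes.toBytes("reading");

            // Write with an explicit, logical timestamp instead of letting
            // the region server assign the current wall-clock time.
            long observedAt = 1370000000000L;
            Put put = new Put(row);
            put.add(family, qualifier, observedAt, Bytes.toBytes(12.5d));
            table.put(put);

            // Read back using the timestamp as a queryable dimension: fetch
            // the hour of versions leading up to the observation.
            Get get = new Get(row);
            get.setTimeRange(observedAt - 3600 * 1000L, observedAt + 1);
            get.setMaxVersions(Integer.MAX_VALUE);
            Result result = table.get(get);
            System.out.println(result);

            table.close();
        }
    }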

Original title and link: HBase Data Modeling Tips & Tricks - Timeshifting (NoSQL database©myNoSQL)

via: http://www.heyitsopower.com/code/timeshifting-in-hbase/