


Cassandra: All content tagged as Cassandra in NoSQL databases and polyglot persistence

Quick intro to Apache Cassandra… comic style

You can find it here. Nice job by Alberto Diego Prieto Löfkrantz.

Original title and link: Quick intro to Apache Cassandra… comic style (NoSQL database©myNoSQL)

Facebook’s Cassandra paper, annotated and compared to Apache Cassandra 2.0

The evolution from the original paper to Cassandra 2.0 in an interesting format:

The release of Apache Cassandra 2.0 is a good point to look back at the past five years of progress after Cassandra’s release as open source. Here, we annotate the Cassandra paper from LADIS 2009 with the new features and improvements that have been added since.

Original title and link: Facebook’s Cassandra paper, annotated and compared to Apache Cassandra 2.0 (NoSQL database©myNoSQL)


Considering TokuDB as an engine for timeseries data... or Cassandra or OpenTSDB

Vadim Tkachenko:

  • Provide high insertion rate
  • Provide a good compression rate to store more data on expensive SSDs
  • Engine should be SSD friendly (less writes per timeperiod to help with SSD wear)
  • Provide a reasonable response time (within ~50 ms) on SELECT queries on hot recently inserted data

Looking on these requirements I actually think that TokuDB might be a good fit for this task.

There are solutions in the NoSQL space that are optimized for this scenario: Cassandra or OpenTSDB. Indeed using one of these will have an impact on the application side.

Most of the time, when the requirements dictate looking into different solutions, the easiest costs to estimate are the initial ones: development (nb: this includes not only pure development, but also learning costs, etc.) and hardware.

Unfortunately, we often fail to take long-term costs into consideration:

  • maintenance costs (hardware, operations, enhancements)
  • opportunity costs (features that the current architecture won’t be able to support as being either impossible or too expensive)
  • accounting for the risks of failed initial designs (the technical debt costs)

Far too often we optimize for the initial costs (the usual excuse is that familiarity delivers faster, with the more scientific-sounding forms being "time to market is essential" and "premature optimization is the root of all evil"), while almost completely ignoring the ongoing costs.

Original title and link: Considering TokuDB as an engine for timeseries data… or Cassandra or OpenTSDB (NoSQL database©myNoSQL)


Big Data Debate: HBase or Cassandra

This debate about the pros and cons of HBase and Cassandra, set up by Doug Henschen for InformationWeek and featuring Jonathan Ellis (Cassandra, DataStax) and Michael Hausenblas (MapR), will stir some strong feelings:

Michael Hausenblas: An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for their internal use.

Jonathan Ellis: The technical shortcomings driving HBase’s lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.

✚ One question I couldn’t answer about this dialog is why the HBase side wasn’t covered by either an HBase community member or a user. MapR does have an interest in HBase, but its product is not HBase.

Original title and link: Big Data Debate: HBase or Cassandra (NoSQL database©myNoSQL)


$45 million more for DataStax

Holy cow! That’s a 4 followed by a 5… with no dots in between.

  1. Derrick Harris for GigaOm: NoSQL startup DataStax raises $45M to ride Cassandra’s wave:

    Cassandra’s success with such large users has to do with its ability to handle large-scale online applications that demand steady levels of performance, DataStax CEO Billy Bosworth told me. Scalability and performance have never been among Cassandra’s shortcomings, and the database is capable of replicating data across data centers. Large companies used to choose Oracle for applications that needed these capabilities, but now that NoSQL options are around and relatively mature, companies are rethinking whether the relational database model was ever really correct for some applications in the first place.

  2. Alex Williams for TC: DataStax Readies For IPO, Raises $45M For Modern Database Platform Suited To New Data Intensive World:

    DataStax will use the funding to build out globally and invest in Apache Cassandra, the NoSQL open-source project and foundation for the company’s database distributions. The funding also signals a potential IPO for DataStax but much will depend on the direction of the markets, said CEO Billy Bosworth in an interview yesterday. “We are building the company for that direction (IPO),” he said. “A lot depends on external factors. Internally, the company is already starting that process.”

According to my books:

  1. This is the largest round raised by a NoSQL company. It tops 10gen’s $42mil for MongoDB.
  2. This is the 3rd largest round raised in the new data market, after Cloudera’s $65mil. and Hortonworks’s $50mil. rounds.

Original title and link: $45millions more for DataStax (NoSQL database©myNoSQL)

Get up and Running with Cassandra on Google Compute Engine

On the Google Cloud Platform blog:

The guide walks you through creating your nodes (instances), setting up Java, and creating and configuring a firewall. Included in the guide are several scripts that make the configuration and setup easy to understand and execute. Once you are finished with your cluster, a simple call to a teardown script cleans up your project’s environment.

Can you speculate why Cassandra is the first NoSQL database that gets mentioned on Google’s blog? (hint: maybe this?)

Original title and link: Get up and Running with Cassandra on Google Compute Engine (NoSQL database©myNoSQL)


How do you decide what database to use for what task?

Nathan Milford of Outbrain answering the question how do you decide what database to use for what task:

We look at how the data will be queried, its size, and how it needs to be distributed. We might use things like MySQL for historical reasons and MongoDB for smaller tasks, and then Cassandra for situations where data doesn’t all fit into memory or where it spans multiple machines and possibly data centers.

This is indeed the right recipe: data access model, data size, distribution model.

Original title and link: How do you decide what database to use for what task? (NoSQL database©myNoSQL)


Cassandra Summit’s Bests

If you haven’t been to Cassandra Summit 2013 or you missed some presentations, now you can (re)watch them on YouTube. Jonathan Ellis put together his list of favorites here and here.

I’m posting this on Saturday as there are a lot of interesting talks, and if Cassandra is on your radar it will take a couple of weekends to go through them.

Original title and link: Cassandra Summit’s Bests (NoSQL database©myNoSQL)

Best argument for official drivers

Jonathan Ellis:

More qualitatively but perhaps even more important, this addresses the paradox of choice we’ve had in the Cassandra Java world: multiple driver choices provide another barrier to newcomers, where each must evaluate the options for applicability to his project. Having just done such an evaluation to settle on Cassandra itself, this is the last thing they want to spend time on.

And that’s the best-case scenario. More often, a fragmented landscape leads to many solutions, each of which solve a different 80% of the problem. Better to have a single, well-thought-out solution, that lets people get started writing their application immediately.

This is the best argument ever for having official drivers.

✚ In the early days, and over the long term, it’s quite difficult for a company to offer only official drivers. But there’s a solution for that too: recommend one. And support its maintainers.

Original title and link: Best argument for official drivers (NoSQL database©myNoSQL)


Titan: Data Loading and Transactional Benchmark

The Aurelius team describing an advanced benchmark of Titan, a massive scale property graph allowing real-time traversals and updates, sponsored by Pearson, developed and run over 5 months:

The 10 terabyte, 121 billion edge graph was loaded into the cluster in 1.48 days at a rate of approximately 1.2 million edges a second with 0 failed transactions. These numbers were possible due to new developments in Titan 0.3.0 whereby graph partitioning is achieved using a domain-based byte order partitioner.

✚ The answer to why Titan is built on Cassandra can be found in this interview between Aurelius CTO Matthias Broecheler and DataStax co-founder Matt Pfeil:

[…] we don’t have to worry about things like replication, backup, and snapshots because all of that stuff is handled by Cassandra. We really just focus on: “How do you distribute a graph?”, “How do you represent a graph efficiently in a big table model?”, “How do you do things like edge compression and other things that are very graph specific in order to make the database fast?” And, lastly, “How do you build intelligent index structures so that the graph traversals, which are the core of any graph database, so that those are as fast as possible?”

Original title and link: Titan: Data Loading and Transactional Benchmark (NoSQL database©myNoSQL)


Cassandra anti-patterns: Queues and queue-like datasets or when Deletes can bite

Aleksey Yeschenko has an interesting post about the impact deletes can have on Cassandra and different workaround solutions:

Specifically, tombstones will bite you if you do lots of deletes (especially column-level deletes) and later perform slice queries on rows with a lot of tombstones.

I wouldn’t call this a case of “you got your data model wrong”, but rather a known implementation limitation that affects some scenarios, for which a different data model should be used; the difference, while only semantic, is that the error is not on the user.

In other words, if you use column-level deletes (or expiring columns) heavily and also need to perform slice queries over that data, try grouping columns with close “expiration date” together and getting rid of them in a single move.
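To make the tombstone effect concrete, here is a toy Python model, not Cassandra code. It assumes a row is a sorted set of columns, that a column-level delete leaves a marker behind, and that a slice query has to step over those markers. The names `ToyRow`, `queue_in_one_row`, and `queue_in_buckets` are made up for this illustration, and dropping a bucket stands in for getting rid of a whole group of columns in a single move.

```python
TOMBSTONE = object()  # marker left behind by a column-level delete (toy model)


class ToyRow:
    """Toy wide row: column name -> value, or TOMBSTONE after a delete."""

    def __init__(self):
        self.cells = {}

    def insert(self, column, value):
        self.cells[column] = value

    def delete(self, column):
        # The cell is not freed; it is overwritten with a marker.
        self.cells[column] = TOMBSTONE

    def slice(self, start, limit):
        """Return (live cells, cells scanned) for columns >= start."""
        live, scanned = [], 0
        for column in sorted(self.cells):
            if column < start:
                continue
            scanned += 1  # tombstones are scanned too
            value = self.cells[column]
            if value is not TOMBSTONE:
                live.append((column, value))
                if len(live) == limit:
                    break
        return live, scanned


def queue_in_one_row(n_messages, n_consumed):
    """Queue kept in a single row; every consumed message is a column delete."""
    row = ToyRow()
    for i in range(n_messages):
        row.insert(i, f"msg-{i}")
    for i in range(n_consumed):
        row.delete(i)
    return row.slice(0, 1)  # fetch the next unconsumed message


def queue_in_buckets(n_messages, n_consumed, bucket_size):
    """Same queue, one row per bucket; assumes n_consumed < n_messages."""
    buckets = {}
    for i in range(n_messages):
        buckets.setdefault(i // bucket_size, ToyRow()).insert(i, f"msg-{i}")
    for b in sorted(buckets):
        if (b + 1) * bucket_size <= n_consumed:
            del buckets[b]  # fully consumed bucket: dropped in one move
    for i in range(n_consumed):
        if i // bucket_size in buckets:
            buckets[i // bucket_size].delete(i)  # partially consumed bucket
    return buckets[min(buckets)].slice(0, 1)


print(queue_in_one_row(10_000, 9_990))       # scans 9,991 cells for one message
print(queue_in_buckets(10_000, 9_990, 100))  # scans 91 cells for the same message
```

In this model the single-row queue walks over every tombstone before it reaches the first live message, while the bucketed layout only ever pays for the tombstones in the bucket currently being consumed.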

Original title and link: Cassandra anti-patterns: Queues and queue-like datasets or when Deletes can bite (NoSQL database©myNoSQL)


Kairosdb - Fast Scalable Time Series Database

KairosDB is introduced as a rewrite of OpenTSDB written primarily for Cassandra (nb: OpenTSDB is based on HBase). As for what it brings new, this page lists:

  • Uses Guice to load modules.
  • Incorporates Jetty for Rest API and serving up UI.
  • Pure Java build tool (Tablesaw)
  • UI uses Flot and is client side rendered.
  • Ability to customize UI.
  • Relative time now includes month and supports leap years.
  • Modular data store interface supports:
    • HBase
    • Cassandra
    • H2 (For development)
  • Milliseconds data support when using Cassandra.
  • Rest API for querying and submitting data.
  • Build produces deployable tar, rpm and deb packages.
  • Linux start/stop service scripts.
  • Faster.
  • Made aggregations optional (easier to get raw data).
  • Added abilities to import and export data.
  • Aggregators can aggregate data for a specified period.
  • Aggregators can be stacked or “piped” together.
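The last two items are easier to picture with a small example. The Python sketch below is not KairosDB's API; it only illustrates, under my own assumptions, what aggregating over a specified period and "piping" aggregators could mean. The helpers `windowed` and `pipe` and the sample points are hypothetical.

```python
from functools import reduce


def windowed(agg, width):
    """Build a stage that applies `agg` to every `width`-second window."""
    def stage(points):
        buckets = {}
        for ts, value in points:
            # Align each timestamp to the start of its window.
            buckets.setdefault(ts - ts % width, []).append(value)
        return [(start, agg(values)) for start, values in sorted(buckets.items())]
    return stage


def pipe(*stages):
    """Chain stages so the output of one feeds the next."""
    def run(points):
        return reduce(lambda acc, stage: stage(acc), stages, points)
    return run


# (timestamp in seconds, value) pairs; made-up sample data
points = [(0, 1), (30, 2), (60, 3), (90, 4), (120, 5), (150, 6)]

per_minute_sum = windowed(sum, 60)
peak_minute = pipe(windowed(sum, 60), windowed(max, 180))

print(per_minute_sum(points))  # one sum per 60-second window
print(peak_minute(points))     # largest per-minute sum within 180 seconds
```

Each stage takes and returns a list of `(timestamp, value)` pairs, which is what makes stacking them straightforward.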

Source code lives on GitHub. Let’s see where it goes.

Original title and link: Kairosdb - Fast Scalable Time Series Database (NoSQL database©myNoSQL)