


cassandra: All content tagged as cassandra in NoSQL databases and polyglot persistence

A Big Data Trifecta: Storm, Kafka and Cassandra

Brian O’Neill details his first experiments migrating from JMS to Kafka, in a very interesting architecture built around Storm, Kafka, and Cassandra:

Now, Kafka is fast. When running the Kafka Spout by itself, I easily reproduced Kafka’s claim that you can consume “hundreds of thousands of messages per second”. When I first fired up the topology, things went well for the first minute, but then quickly crashed as the Kafka spout emitted too fast for the Cassandra Bolt to keep up. Even though Cassandra is fast as well, it is still orders of magnitude slower than Kafka.
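
The failure described in the quote is a missing-backpressure problem: a fast source feeding a slower sink with nothing bounding the work in flight. Here is a minimal sketch of the general remedy, a bounded buffer that blocks the producer when the consumer falls behind. It does not use Storm, Kafka, or Cassandra; the function name, parameters, and the sleep-based writer are all stand-ins invented for illustration.

```python
import queue
import threading
import time


def run_pipeline(num_messages, max_pending, write_delay=0.001):
    """Push messages from a fast producer to a slow consumer through a bounded buffer."""
    pending = queue.Queue(maxsize=max_pending)  # bounded: put() blocks when full
    written = []
    peak_depth = 0

    def slow_writer():
        # Stand-in for the slow sink: each write takes write_delay seconds.
        while True:
            msg = pending.get()
            if msg is None:  # sentinel: the producer is done
                break
            time.sleep(write_delay)
            written.append(msg)

    writer = threading.Thread(target=slow_writer)
    writer.start()

    for i in range(num_messages):
        pending.put(i)  # blocks while the buffer is full, throttling the producer
        peak_depth = max(peak_depth, pending.qsize())
    pending.put(None)
    writer.join()
    return written, peak_depth


if __name__ == "__main__":
    written, peak = run_pipeline(num_messages=50, max_pending=5)
    print(len(written), "messages written, peak buffer depth", peak)
```

The producer never gets more than max_pending messages ahead of the writer, so memory stays bounded no matter how large the speed gap is.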

Original title and link: A Big Data Trifecta: Storm, Kafka and Cassandra (NoSQL database©myNoSQL)


Cassandra at Scandit

We use Cassandra in two ways: First, it holds our product database. Second, we use it to store and analyze the scans generated by the apps that integrate the Barcode Scanner SDK. We call this Scanalytics.

Scanalytics is a web-based analytics platform that lets app developers see what happens in their app: What kind of products do their users scan? Groceries, electronics, cosmetics, etc.? Where do they scan? At home? In the retail store? And so on. All that goes into Cassandra.

The Product database has 25 million records, so you could probably do it with any database. But I’d be interested to learn how data is modeled in Scanalytics.



The Benefits of Virtual Nodes and Performance Results

Sam Overton and Tom Wilkie of Acunu explain the advantages of using virtual nodes in distributed data storage engines, and the performance they measured after introducing virtual nodes in the Acunu platform, compared with Apache Cassandra:

One of the factors that limits the amount of data that can be stored on each node is the amount of time it takes to re-replicate that data when a node fails. That time matters, because it is a period during which the cluster is more vulnerable than normal to data loss. The challenge is that the more data stored on a node, the longer it takes to re-replicate it. Therefore, to store more data per node safely, we want to reduce the time taken to return to normal. This was one of our aims with virtual nodes.

Virtual Nodes reduces the time taken to re-replicate data as it involves every node in the cluster in the operation. In contrast, Apache Cassandra v1.1 will only involve a number of nodes equal to the Replication Factor (RF) of your keyspace. What’s more, with Virtual Nodes, the cluster remains balanced after this operation - you do not need to shuffle the tokens on the other nodes to compensate for the loss!
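
To make the intuition concrete, here is a toy token ring, assuming a simple placement rule in which each range is stored on the next RF distinct nodes clockwise. It is not Acunu's or Cassandra's implementation, and all names are hypothetical. It counts every node that shares at least one range with a failed node, which is a looser measure than the one in the quote, so the single-token figure will not equal RF exactly. What matters is the contrast: with one token per node only a few ring neighbours can help rebuild the lost data, while with many tokens per node nearly every node can.

```python
import random


def build_ring(num_nodes, tokens_per_node, seed=42):
    """Return a sorted list of (token, node) pairs with random tokens."""
    rng = random.Random(seed)
    ring = [(rng.getrandbits(64), node)
            for node in range(num_nodes)
            for _ in range(tokens_per_node)]
    return sorted(ring)


def replicas_for_range(ring, start_index, rf):
    """Walk clockwise from a token and collect the first rf distinct nodes."""
    owners = []
    i = start_index
    while len(owners) < rf:  # assumes rf <= number of distinct nodes
        node = ring[i % len(ring)][1]
        if node not in owners:
            owners.append(node)
        i += 1
    return owners


def peers_sharing_data(ring, failed, rf):
    """Nodes holding a replica of at least one range that `failed` also held."""
    peers = set()
    for idx in range(len(ring)):
        owners = replicas_for_range(ring, idx, rf)
        if failed in owners:
            peers.update(n for n in owners if n != failed)
    return peers


if __name__ == "__main__":
    for tokens in (1, 64):
        ring = build_ring(num_nodes=12, tokens_per_node=tokens)
        peers = peers_sharing_data(ring, failed=0, rf=3)
        print(tokens, "token(s) per node:", len(peers), "peers can help rebuild")
```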



EC2 Solid State Disks and Cassandra

Jonathan Ellis on using Cassandra with a mix of spinning disks and SSDs:

Finally, I should point out that taking advantage of SSDs in a Cassandra cluster doesn’t have to be all or nothing. You can mix SSD and spinning disks either at the individual node level, or at the cluster level. For the former, Cassandra allows putting “hot” tables on SSD while leaving “cold” ones on spinning disks. But if you want to use a group of nodes for analytical workloads the way DataStax Enterprise does, Cassandra will also be comfortable with having just those nodes be entirely based on cheaper spinning disks, with the remaining, “realtime” nodes based on SSDs. This latter configuration is a good fit for EC2 deployments.



Cassandra and Solid State Drives

A slide deck by Rick Branson explaining why and how Cassandra takes full advantage of SSDs.

Benchmarking High Performance I/O With SSD for Cassandra on AWS

Adrian Cockcroft:

The SSD based system running the same workload had plenty of IOPS left over and could also run compaction operations under full load without affecting response times. The overall throughput of the 12-instance SSD based system was CPU limited to about 20% less than the existing system, but with much lower mean and 99th percentile latency. This sizing exercise indicated that we could replace the 48 m2.4xlarge and 36 m2.xlarge with 15 hi1.4xlarge to get the same throughput, but with much lower latency.
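
The quote does not spell out how the figure of 15 instances was reached. One plausible reading, assuming throughput scales linearly with instance count, is that a 12-instance cluster delivering about 20% less throughput has to grow by a factor of 1/0.8. The helper below is a hypothetical sketch of that arithmetic, not Netflix's sizing method.

```python
def instances_for_equal_throughput(tested_instances, shortfall_percent):
    """Scale a tested cluster so it matches a baseline it fell short of.

    Assumes throughput grows linearly with the number of instances.
    """
    # Throughput achieved as a percentage of the baseline, e.g. 80 for "20% less".
    achieved_percent = 100 - shortfall_percent
    # Ceiling division in integer arithmetic, avoiding floating-point rounding.
    return -(-tested_instances * 100 // achieved_percent)


if __name__ == "__main__":
    print(instances_for_equal_throughput(12, 20))  # 12 * 100 / 80 = 15
```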

Tons of details and data about the benchmarks Netflix ran against the new high I/O SSD-backed EC2 instances. Results are even more impressive than the IOPS numbers in Werner Vogels’ High performance I/O instances for EC2.



Where Cassandra Really Shines

Steve Corona on Hacker News:

Where Cassandra REALLY shines and is often overlooked is ease of maintenance. Cassandra’s ability to bootstrap new nodes, replicate, reshard and handle down nodes (w/ hinted handoff) is almost magical. I use it in production and it works very reliably.

Sure, it’s got some cool big data stuff, but try doing any of those “maintenance” operations on other databases without ripping your hair out. For example, even bringing up a new MySQL slave is a huge pain in the ass, let alone doing something non-trivial like promoting a new master.

This reinforces exactly what I emphasized as the merits of NoSQL systems in “is SQL or NoSQL better for programmers”.


eBay's Cassandra Data Modeling Best Practices

Jay Patel (architect at eBay):

Our Cassandra deployment is not huge, but it’s growing at a healthy pace. In the past couple of months, we’ve deployed dozens of nodes across several small clusters spanning multiple data centers. You may ask, why multiple clusters? We isolate clusters by functional area and criticality. Use cases with similar criticality from the same functional area share the same cluster, but reside in different keyspaces.

This first post is focused on two old techniques that have been applied even with relational databases:

  1. model data around query patterns
  2. de-normalize and duplicate for read performance.
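
The two techniques can be illustrated with a small sketch. The "likes" scenario, the class, and every name in it are hypothetical, and plain dictionaries stand in for tables; nothing here is taken from eBay's schema. Each query gets its own structure, and a single logical write is duplicated into both so that each read is one lookup.

```python
from collections import defaultdict


class LikeStore:
    """Toy in-memory stand-in for two denormalized, query-specific tables."""

    def __init__(self):
        self.items_by_user = defaultdict(dict)  # user_id -> {item_id: item_title}
        self.users_by_item = defaultdict(dict)  # item_id -> {user_id: user_name}

    def record_like(self, user_id, user_name, item_id, item_title):
        # One logical event, two writes: each copy carries the fields its query needs.
        self.items_by_user[user_id][item_id] = item_title
        self.users_by_item[item_id][user_id] = user_name

    def items_liked_by(self, user_id):
        # Query 1: a single lookup, no join against an items table.
        return sorted(self.items_by_user.get(user_id, {}).values())

    def users_who_like(self, item_id):
        # Query 2: served from the duplicate copy keyed by item.
        return sorted(self.users_by_item.get(item_id, {}).values())


if __name__ == "__main__":
    store = LikeStore()
    store.record_like("u1", "Ann", "i1", "Camera")
    store.record_like("u2", "Bob", "i1", "Camera")
    store.record_like("u1", "Ann", "i2", "Tripod")
    print(store.items_liked_by("u1"))
    print(store.users_who_like("i1"))
```

The cost of this design is extra storage and the need to keep the copies in step when a name or title changes, which is the usual price of trading write work for read speed.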



From MongoDB to Cassandra: Why Atlas Platform Is Migrating

Sergio Bossa tells the story of migrating the Atlas platform from MongoDB to Cassandra, emphasizing the reasons behind their decision:

  • It works on the JVM, and we have lots of in-house experience on it.
  • It scales in terms of processing and storage capacity.
  • Its column-based data model gives us some advanced capabilities we will talk about in a few minutes.
  • Its tunable consistency levels provide greater control over high availability and consistency requirements.
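
The last point, tunable consistency, rests on a simple overlap rule: with N replicas, a read that waits for R replicas is guaranteed to include at least one replica from a write acknowledged by W replicas whenever R + W > N. A minimal sketch of that rule follows, with hypothetical function names and with Cassandra's ONE, QUORUM, and ALL levels expressed as plain replica counts.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_acks, read_acks):
    """True when every read replica set must intersect every write replica set."""
    return write_acks + read_acks > replication_factor


if __name__ == "__main__":
    rf = 3
    levels = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}
    for w_name, w in levels.items():
        for r_name, r in levels.items():
            verdict = "overlap" if read_overlaps_write(rf, w, r) else "may miss latest write"
            print(f"write {w_name:6} read {r_name:6} -> {verdict}")
```

Lower counts favour availability and latency, higher counts favour consistency, and the choice can be made per operation.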

As for what made them look for a different solution:

  • We need higher resiliency to faults: MongoDB provides replica sets, but we’re experiencing lots of problems with replication lags and during replica synchronization.
  • We need higher scalability: MongoDB global lock and huge memory requirements aren’t already going to cope well with our growing data set.



Using R With Cassandra Through JDBC or Hive

A short post by Jake Luciani listing two R packages, RJDBC and RCassandra, that enable using R with Cassandra through either the JDBC or Hive drivers.

This is a good example of what I meant by designing products with openness and integration in mind.



Cassandra at Workware Systems: Data Model FTW

One of the stories in which the deciding factor for using Cassandra was primarily the data model and not its scalability characteristics:

We started working with relational databases, and began building things primarily with PostgreSQL at first. But dealing with the kind of data that we do, the data model just wasn’t appropriate. We started with Cassandra in the beginning to solve one problem: we needed to persist large vector data that was updated frequently from many different sources. RDBMS’s just don’t do that very well, and the performance is really terrible for fast read operations. By contrast, Cassandra stores that type of data exceptionally well and the performance is fantastic. We went on from there and just decided to store everything in Cassandra.



NoSQL and Relational Databases Podcast With Mathias Meyer

EngineYard’s Ines Sombra recorded a conversation with Mathias Meyer about NoSQL databases and their evolution toward friendlier functionality, relational databases and their steps toward non-relational models, and a bit more on what polyglot persistence means.

Mathias Meyer is one of the people I could talk with for days about NoSQL and databases in general, with different infrastructure toppings, and he has some of the most well-balanced thoughts on this exciting space; see this conversation I had with him in the early days of NoSQL. I strongly encourage you to download the mp3 and listen to it.
