cassandra: All content tagged as cassandra in NoSQL databases and polyglot persistence
Tuesday, 7 August 2012
Latency-Consistency Analysis
A very interesting proposal and patch for enhancing nodetool to provide cluster latency-consistency analysis. From JIRA:
We’ve implemented Probabilistically Bounded Staleness, a new technique for predicting consistency-latency trade-offs within Cassandra. Our paper will appear in VLDB 2012, and, in it, we’ve used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than profiling and manually testing each possible replication configuration (especially in production!).
This analysis is important for the many users we’ve talked to and heard about who use “partial quorum” operation (e.g., non-QUORUM ConsistencyLevel). Should they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, short of profiling in production, there’s no existing way to answer these questions.
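To get a feel for the kind of question PBS answers, here is a minimal Monte Carlo sketch. It is not the patch from JIRA and not the authors' code: the function name `estimate_consistency`, the exponential delay distributions, and the default mean delays are all assumptions made for illustration. It estimates how often a read issued `t_ms` after a write commits sees that write, for a given N, R, and W.

```python
import random


def estimate_consistency(n, r, w, t_ms, trials=5000,
                         mean_write_ms=5.0, mean_other_ms=1.0, seed=42):
    """Estimate P(read sees the latest write) for a Dynamo-style quorum.

    Assumption: every message delay is exponentially distributed; real
    deployments would plug in measured latency distributions instead.
    """
    rng = random.Random(seed)
    consistent = 0
    for _ in range(trials):
        # Time at which each replica receives the write.
        write_arrival = [rng.expovariate(1.0 / mean_write_ms) for _ in range(n)]
        # Time at which the coordinator receives each replica's ack.
        ack_arrival = [wa + rng.expovariate(1.0 / mean_other_ms)
                       for wa in write_arrival]
        # The write commits once the w-th ack has arrived.
        commit_time = sorted(ack_arrival)[w - 1]

        # The read starts t_ms after the commit and goes to all replicas.
        read_start = commit_time + t_ms
        read_arrival = [read_start + rng.expovariate(1.0 / mean_other_ms)
                        for _ in range(n)]
        response_arrival = [ra + rng.expovariate(1.0 / mean_other_ms)
                            for ra in read_arrival]

        # The coordinator uses the first r responses it gets back.
        fastest = sorted(range(n), key=lambda i: response_arrival[i])[:r]
        # Consistent if any of those replicas already had the write
        # when the read request reached it.
        if any(write_arrival[i] <= read_arrival[i] for i in fastest):
            consistent += 1
    return consistent / trials


if __name__ == "__main__":
    for t in (0.0, 5.0, 20.0):
        print(t, estimate_consistency(n=3, r=1, w=1, t_ms=t))
```

With R + W > N (for example N=3, R=2, W=2) the estimate is always 1.0, because the read and write quorums must overlap. The interesting cases are the partial quorums such as R=W=1, where the answer depends on the delay distributions and on how long after the write the read arrives.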
Original title and link: Latency-Consistency Analysis (©myNoSQL)
Accumulo, HBase, Cassandra and Some Unanswered Questions
Cade Metz for Wired:
The bill bars the DoD from using the database unless the department can show that the software is sufficiently different from other databases that mimic BigTable. But at the same time, the bill orders the director of the NSA to work with outside organizations to merge the Accumulo security tools with alternative databases, specifically naming HBase and Cassandra.
Is this good for HBase and Cassandra? Is this good for encouraging innovation? Is this good for supporting businesses? Just a few questions I couldn’t answer myself after reading this article about the investigation initiated by the Senate into NSA’s open sourced Accumulo.
via: http://www.wired.com/wiredenterprise/2012/07/nsa-accumulo-google-bigtable/
Monday, 6 August 2012
A Big Data Trifecta: Storm, Kafka and Cassandra
Brian O’Neill details his first experiments migrating from JMS to Kafka in a very interesting architecture involving Storm, Kafka, and Cassandra:
Now, Kafka is fast. When running the Kafka Spout by itself, I easily reproduced Kafka’s claim that you can consume “hundreds of thousands of messages per second”. When I first fired up the topology, things went well for the first minute, but then quickly crashed as the Kafka spout emitted too fast for the Cassandra Bolt to keep up. Even though Cassandra is fast as well, it is still orders of magnitude slower than Kafka.
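The failure described here is a fast producer overrunning a slow consumer. One common remedy is backpressure: cap the number of in-flight messages so the producer blocks instead of piling up work. The sketch below is plain Python, not Storm or Kafka API; `run_pipeline`, `max_pending`, and the simulated write delay are hypothetical names and values chosen for illustration. (Storm's `topology.max.spout.pending` setting plays a similar role for spouts that emit tracked tuples.)

```python
import queue
import threading
import time


def run_pipeline(num_messages, max_pending, write_delay_s=0.001):
    """Push messages from a fast producer to a slow writer via a bounded queue."""
    pending = queue.Queue(maxsize=max_pending)  # the bound is the backpressure
    written = []

    def slow_writer():
        # Stand-in for the slower sink (the Cassandra bolt in the post).
        while True:
            msg = pending.get()
            if msg is None:  # sentinel: no more messages
                break
            time.sleep(write_delay_s)  # simulated write latency (assumption)
            written.append(msg)

    worker = threading.Thread(target=slow_writer)
    worker.start()

    peak_pending = 0
    for i in range(num_messages):
        pending.put(i)  # blocks while max_pending messages are already queued
        peak_pending = max(peak_pending, pending.qsize())

    pending.put(None)
    worker.join()
    return written, peak_pending


if __name__ == "__main__":
    written, peak = run_pipeline(num_messages=200, max_pending=10)
    print(len(written), "messages written; peak queue depth", peak)
```

The producer still runs as fast as it can, but only until the queue is full; after that its rate is set by the writer, and memory use stays bounded.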
via: http://brianoneill.blogspot.com/2012/08/a-big-data-trifecta-storm-kafka-and.html
Sunday, 5 August 2012
Cassandra at Scandit
We use Cassandra in two ways: First, it holds our product database. Second, we use it to store and analyze the scans generated by the apps that integrate the Barcode Scanner SDK. We call this Scanalytics.
Scanalytics is a web-based analytics platform that lets app developers see what happens in their app: What kind of products do their users scan? Groceries, electronics, cosmetics, etc.? Where do they scan? At home? In the retail store? And so on. All that goes into Cassandra.
The Product database has 25 million records, so you could probably do it with any database. But I’d be interested to learn how data is modeled in Scanalytics.
via: http://www.datastax.com/dev/blog/the-five-minute-interview-scandit
Tuesday, 31 July 2012
The Benefits of Virtual Nodes and Performance Results
Sam Overton and Tom Wilkie of Acunu explain the advantages of using virtual nodes in distributed data storage engines and the performance they’ve measured after introducing virtual nodes in the Acunu platform, compared with Apache Cassandra:
One of the factors that limits the amount of data that can be stored on each node is the amount of time it takes to re-replicate that data when a node fails. That time matters, because it is a period during which the cluster is more vulnerable than normal to data loss. The challenge is that the more data stored on a node, the longer it takes to re-replicate it. Therefore, to store more data per node safely, we want to reduce the time taken to return to normal. This was one of our aims with virtual nodes.
Virtual Nodes reduces the time taken to re-replicate data as it involves every node in the cluster in the operation. In contrast, Apache Cassandra v1.1 will only involve a number of nodes equal to the Replication Factor (RF) of your keyspace. What’s more, with Virtual Nodes, the cluster remains balanced after this operation - you do not need to shuffle the tokens on the other nodes to compensate for the loss!
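The effect is easy to see in a toy model. The sketch below is not Acunu's or Cassandra's implementation: it assumes random tokens, a simple "next RF distinct nodes clockwise" replica placement, and hypothetical helper names (`build_ring`, `replicas_for_range`, `peers_sharing_data`). It counts the peers that share at least one replicated range with a failed node, which is only a rough proxy for how many nodes can take part in re-replication.

```python
import random


def build_ring(num_nodes, tokens_per_node, seed=1):
    """Return a sorted list of (token, node) pairs with random tokens."""
    rng = random.Random(seed)
    ring = [(rng.random(), node)
            for node in range(num_nodes)
            for _ in range(tokens_per_node)]
    ring.sort()
    return ring


def replicas_for_range(ring, start_index, rf):
    """Walk clockwise from a token and collect the first rf distinct nodes.

    Requires rf <= number of distinct nodes in the ring.
    """
    replicas = []
    i = start_index
    while len(replicas) < rf:
        node = ring[i % len(ring)][1]
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas


def peers_sharing_data(ring, failed_node, rf):
    """Nodes that hold a replica of any range the failed node also held."""
    peers = set()
    for idx in range(len(ring)):
        replicas = replicas_for_range(ring, idx, rf)
        if failed_node in replicas:
            peers.update(replicas)
    peers.discard(failed_node)
    return peers


if __name__ == "__main__":
    single = peers_sharing_data(build_ring(20, 1), failed_node=0, rf=3)
    vnodes = peers_sharing_data(build_ring(20, 256), failed_node=0, rf=3)
    print("one token per node:", len(single), "peers")
    print("256 tokens per node:", len(vnodes), "peers")
```

With one token per node, only the failed node's immediate ring neighbours share data with it, so the count stays small no matter how large the cluster is. With many tokens per node, the failed node's ranges are scattered around the ring and far more of the cluster can help.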
via: http://www.acunu.com/2/post/2012/07/virtual-nodes-performance-results.html
Friday, 20 July 2012
EC2 Solid State Disks and Cassandra
Jonathan Ellis on using Cassandra with a mix of spinning disks and SSDs:
Finally, I should point out that taking advantage of SSDs in a Cassandra cluster doesn’t have to be all or nothing. You can mix SSD and spinning disks either at the individual node level, or at the cluster level. For the former, Cassandra allows putting “hot” tables on SSD while leaving “cold” ones on spinning disks. But if you want to use a group of nodes for analytical workloads the way DataStax Enterprise does, Cassandra will also be comfortable with having just those nodes be entirely based on cheaper spinning disks, with the remaining, “realtime” nodes based on SSDs. This latter configuration is a good fit for EC2 deployments.
via: http://www.datastax.com/dev/blog/solid-state-disks-now-available-on-amazon-ec2
Cassandra and Solid State Drives
A slide deck by Rick Branson explaining why and how Cassandra takes full advantage of SSDs.
Wednesday, 18 July 2012
Benchmarking High Performance I/O With SSD for Cassandra on AWS
Adrian Cockcroft:
The SSD based system running the same workload had plenty of IOPS left over and could also run compaction operations under full load without affecting response times. The overall throughput of the 12-instance SSD based system was CPU limited to about 20% less than the existing system, but with much lower mean and 99th percentile latency. This sizing exercise indicated that we could replace the 48 m2.4xlarge and 36 m2.xlarge with 15 hi1.4xlarge to get the same throughput, but with much lower latency.
Tons of details and data about the benchmarks Netflix ran against the new high I/O SSD-backed EC2 instances. The results are even more impressive than the IOPS numbers in Werner Vogels’ High performance I/O instances for EC2.
via: http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html
Tuesday, 17 July 2012
Where Cassandra Really Shines
Where Cassandra REALLY shines and is often overlooked is ease of maintenance. Cassandra’s ability to bootstrap new nodes, replicate, reshard and handle down nodes (w/ hinted handoff) is almost magical. I use it in production and it works very reliably.
Sure, it’s got some cool big data stuff, but try doing any of those “maintenance” operations on other databases without ripping your hair out. For example, even bringing up a new MySQL slave is a huge pain in the ass, let alone doing something non-trivial like promoting a new master.
This reinforces exactly what I emphasized as the merits of NoSQL systems in “Is SQL or NoSQL better for programmers”.
Monday, 16 July 2012
eBay's Cassandra Data Modeling Best Practices
Jay Patel (architect at eBay):
Our Cassandra deployment is not huge, but it’s growing at a healthy pace. In the past couple of months, we’ve deployed dozens of nodes across several small clusters spanning multiple data centers. You may ask, why multiple clusters? We isolate clusters by functional area and criticality. Use cases with similar criticality from the same functional area share the same cluster, but reside in different keyspaces.
This first post is focused on two old techniques that have been applied even with relational databases:
- model data around query patterns
- de-normalize and duplicate for read performance.
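As a quick illustration of both techniques, here is a sketch that uses plain Python dictionaries as stand-ins for column families. The "likes" use case, the class name `LikeStore`, and every field name are hypothetical, picked only to show the idea; this is not eBay's schema. Each query gets its own structure, and every write duplicates the data into both.

```python
from collections import defaultdict


class LikeStore:
    """Two query-shaped 'tables', both updated on every write."""

    def __init__(self):
        # Query 1: which items does a user like?  user_id -> {item_id: title}
        self.items_by_user = defaultdict(dict)
        # Query 2: which users like an item?      item_id -> {user_id: name}
        self.users_by_item = defaultdict(dict)

    def record_like(self, user_id, user_name, item_id, item_title):
        # Denormalize: copy the title and name next to the ids so that
        # each read is a single lookup with no join.
        self.items_by_user[user_id][item_id] = item_title
        self.users_by_item[item_id][user_id] = user_name

    def items_liked_by(self, user_id):
        return sorted(self.items_by_user.get(user_id, {}).values())

    def users_who_like(self, item_id):
        return sorted(self.users_by_item.get(item_id, {}).values())


if __name__ == "__main__":
    store = LikeStore()
    store.record_like("u1", "Alice", "i1", "Camera")
    store.record_like("u1", "Alice", "i2", "Tripod")
    store.record_like("u2", "Bob", "i1", "Camera")
    print(store.items_liked_by("u1"))
    print(store.users_who_like("i1"))
```

The cost is extra work on writes and the need to keep the copies in sync when a title or name changes; the benefit is that each read path touches exactly one row.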
via: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Thursday, 12 July 2012
From MongoDB to Cassandra: Why Atlas Platform Is Migrating
Sergio Bossa tells the story of migrating the Atlas platform from MongoDB to Cassandra, emphasizing the reasons behind their decision:
- It works on the JVM, and we have lots of in-house experience on it.
- It scales in terms of processing and storage capacity.
- Its column-based data model gives us some advanced capabilities we will talk about in a few minutes.
- Its tunable consistency levels provide greater control over high availability and consistency requirements.
As regards what made them look into a different solution:
- We need higher resiliency to faults: MongoDB provides replica sets, but we’re experiencing lots of problems with replication lags and during replica synchronization.
- We need higher scalability: MongoDB global lock and huge memory requirements aren’t already going to cope well with our growing data set.
via: http://metabroadcast.com/blog/looking-with-cassandra-into-the-future-of-atlas
Thursday, 24 May 2012
Using R With Cassandra Through JDBC or Hive
A short post by Jake Luciani listing two R modules, RJDBC and RCassandra, that enable using R with Cassandra through either the JDBC or Hive drivers.
This is a good example of what I meant by designing products with openness and integration in mind.
via: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive