cassandra: All content tagged as cassandra in NoSQL databases and polyglot persistence
Thursday, 24 May 2012
Using R With Cassandra Through JDBC or Hive
A short post by Jake Luciani listing 2 R modules—RJDBC module and RCassandra—that enable using R with Cassandra through either the JDBC or Hive drivers.
This is a good example of what I meant by designing products with openness and integration in mind.
Original title and link: Using R With Cassandra Through JDBC or Hive (©myNoSQL)
via: http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
Wednesday, 16 May 2012
Cassandra at Workware Systems: Data Model FTW
One of the stories in which the deciding factor for using Cassandra was primarily the data model and not its scalability characteristics:
We started working with relational databases, and began building things primarily with PostgreSQL at first. But dealing with the kind of data that we do, the data model just wasn’t appropriate. We started with Cassandra in the beginning to solve one problem: we needed to persist large vector data that was updated frequently from many different sources. RDBMS’s just don’t do that very well, and the performance is really terrible for fast read operations. By contrast, Cassandra stores that type of data exceptionally well and the performance is fantastic. We went on from there and just decided to store everything in Cassandra.
Original title and link: Cassandra at Workware Systems: Data Model FTW (©myNoSQL)
via: http://www.datastax.com/2012/04/the-five-minute-interview-workware-systems
Thursday, 10 May 2012
NoSQL and Relational Databases Podcast With Mathias Meyer
EngineYard’s Ines Sombra recorded a conversation with Mathias Meyer about NoSQL databases and their evolution toward friendlier functionality, relational databases and their steps toward non-relational models, and a bit more on what polyglot persistence means.
Mathias Meyer is one of the people I could talk with for days about NoSQL and databases in general, with different infrastructure toppings, and he has some of the most well-balanced thoughts when speaking about this exciting space (see this conversation I’ve had with him in the early days of NoSQL). I strongly encourage you to download the mp3 and listen to it.
Original title and link: NoSQL and Relational Databases Podcast With Mathias Meyer (©myNoSQL)
Monday, 7 May 2012
Cassandra 1.1 Released: What’s New
The newly released Cassandra 1.1 has too many interesting new features and improvements to cover them all here, but here’s the gist:
- Schema improvements
- Support for compound keys
- Concurrent schema changes
- A new version of Cassandra Query Language (CQL3) supporting compound keys and wide rows
- Better and easier tuning of the key and row caches
- Support for per-table hybrid storage (mixing SSDs and spinning disks)
This DataStax blog entry provides links to more details about all these features and the others I haven’t enumerated above.
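As a rough illustration of how compound keys and wide rows fit together, here is a minimal Python sketch. It is a toy in-memory model, not Cassandra code or its API. It assumes a hypothetical `timeline` table whose compound key is `(user_id, posted_at)`: the first component picks the storage row, and the second becomes part of each cell name, so a single storage row grows "wide" as entries are added. All names are made up for the example.

```python
from collections import defaultdict


# Toy model only: each partition key maps to one "wide row",
# whose cells are named by (clustering value, column name).
class ToyTable:
    def __init__(self):
        self.rows = defaultdict(dict)

    def insert(self, partition_key, clustering_key, **columns):
        wide_row = self.rows[partition_key]
        for name, value in columns.items():
            wide_row[(clustering_key, name)] = value

    def select(self, partition_key):
        """Return the logical rows of one partition, ordered by clustering key."""
        logical = defaultdict(dict)
        for (clustering_key, name), value in self.rows[partition_key].items():
            logical[clustering_key][name] = value
        return [(ck, logical[ck]) for ck in sorted(logical)]


timeline = ToyTable()  # hypothetical table keyed by (user_id, posted_at)
timeline.insert("alice", 2, body="second post")
timeline.insert("alice", 1, body="first post")
timeline.insert("bob", 1, body="hello")

alice_rows = timeline.select("alice")
```

In this model, both of Alice's posts live in the same storage row and come back ordered by the second key component, which is the general idea behind mapping compound keys onto wide rows.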
Original title and link: Cassandra 1.1 Released: What’s New (©myNoSQL)
Wednesday, 18 April 2012
NoSQL Releases and Announcements
Catching up after almost two weeks offline is no easy task, but I hope I’ll not miss any important events, releases, or posts. But if I do, please email me.
Cassandra 1.0.9: Maintenance Release
The complete change notes for Cassandra 1.0.9 are here:
- improve index sampling performance (CASSANDRA-4023)
- always compact away deleted hints immediately after handoff (CASSANDRA-3955)
- delete hints from dropped ColumnFamilies on handoff instead of erroring out (CASSANDRA-3975)
- add CompositeType ref to the CLI doc for create/update column family (CASSANDRA-3980)
- Avoid NPE during repair when a keyspace has no CFs (CASSANDRA-3988)
- Fix division-by-zero error on get_slice (CASSANDRA-4000)
- don’t change manifest level for cleanup, scrub, and upgradesstables operations under LeveledCompactionStrategy (CASSANDRA-3989, 4112)
- fix race leading to super columns assertion failure (CASSANDRA-3957)
- ensure that directory is selected for compaction for user-defined tasks and upgradesstables (CASSANDRA-3985)
- allow custom types in CLI’s assume command (CASSANDRA-4081)
- fix totalBytes count for parallel compactions (CASSANDRA-3758)
- fix intermittent NPE in get_slice (CASSANDRA-4095)
- remove unnecessary asserts in native code interfaces (CASSANDRA-4096)
- Fix EC2 snitch incorrectly reporting region (CASSANDRA-4026)
- Shut down thrift during decommission (CASSANDRA-4086)
- Merged from 0.8: Fix ConcurrentModificationException in gossiper (CASSANDRA-4019)
- Pig:
  - support Counter ColumnFamilies (CASSANDRA-3973)
  - Composite column support (CASSANDRA-3684)
- CQL:
  - fix NPE on invalid CQL delete command (CASSANDRA-3755)
  - Validate blank keys in CQL to avoid assertion errors (CASSANDRA-3612)
Apache Hadoop User Impersonation vulnerability
This vulnerability, discovered by Cloudera’s Aaron T. Myers, affects Hadoop versions 0.20.203.0, 0.20.204.0, 0.20.205.0, 1.0.0 to 1.0.1, and 0.23.0 to 0.23.1 where Kerberos is enabled. Complete details are available here.
CouchDB 1.2.0
This is the first important release after the CouchDB hubbub with Damien Katz and Couchbase at the start of the year. The new version is a major release in itself, deserving its own post: CouchDB 1.2.0: Performance, Security, API, Core and Replication Improvements.
Riak 1.1.2: Stabilization release
Just a maintenance release in the Riak 1.1 series. Complete release notes here.
Original title and link: NoSQL Releases and Announcements (©myNoSQL)
Tuesday, 3 April 2012
Here Is Why in Cassandra vs. HBase, Riak, CouchDB, MongoDB, It's Cassandra FTW
Brian ONeill:
Now, since choosing Cassandra, I can say there are a few other really important less tangible considerations. The first, is the code base. Cassandra has an extremely clean and well maintained code base. Jonathan and team do a fantastic job managing the community and the code. As we adopted NoSQL, the ability to extend the code-base and incorporate our own features has proven invaluable. (e.g. triggers, a REST interface, and server-side wide-row indexing)
Secondly, the community is phenomenal. That results in timely support, and solid releases on a regular schedule. They do a great job prioritizing features, accepting contributions, and cranking out features. (They are now releasing ~quarterly) We’ve all probably been part of other open source projects where the leadership is lacking, and features and releases are unpredictable, which makes your own release planning difficult. Kudos to the Cassandra team.
Everything sounds reasonable except for Riak being the “new kid on the block” and not finding support for it. Basho, where were you hiding?
Original title and link: Here Is Why in Cassandra vs. HBase, Riak, CouchDB, MongoDB, It’s Cassandra FTW (©myNoSQL)
via: http://brianoneill.blogspot.com/2012/04/cassandra-vs-couchdb-mongodb-riak-hbase.html
Monday, 2 April 2012
Cassandra: How to Upgrade an Early Cassandra Cluster -
The Scandit team shares their Cassandra upgrade process from 0.6.x to the latest 1.0.x:
After extensive testing, we found that it fit our needs and decided to use the 0.6.0 release for our first roll out. Over the next 12 months, we kept upgrading our cluster until we reached 0.6.13, which was the last release in the 0.6.x branch.
In the meantime, Cassandra was evolving at an amazing speed. Many cool new features, such as secondary indices, CQL and schema support were added. Since we were very happy with our deployment, we moved a little slower and skip the 0.7.x releases. Now that 1.0.x has been around for a few months, we decided it was time to upgrade. Because the list of changes between the two versions was fairly long, we did the upgrade in two steps: First from 0.6.13 to 0.8.7 and then from 0.8.7 to 1.0.8.
Original title and link: Cassandra: How to Upgrade an Early Cassandra Cluster - (©myNoSQL)
via: http://www.scandit.com/2012/03/29/tech-how-to-upgrade-path-for-an-early-cassandra-cluster/
Tuesday, 27 March 2012
NoSQL Databases Adoption in Numbers
The source of the data is downloads of Jaspersoft’s NoSQL connectors. RedMonk published a graphic and an analysis, and Klint Finley followed up with job trends:

A couple of things I don’t see mentioned in the RedMonk post:
- if and how data has been normalized based on each connector’s availability
  According to the post, data was collected between Jan. 2011 and Mar. 2012, and I think that not all connectors have been available since the beginning of the period.
- if and how the marketing pushes for each connector have been weighed in
  Announcing the Hadoop connector at an event with 2000 attendees or the MongoDB connector at an event with 800 attendees could definitely influence the results (nb: keep in mind that the largest number is less than 7000, thus 200-500 downloads triggered by such an event have a significant impact)
- Redis and VoltDB are mostly OLTP-only databases
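To put the point about event-driven downloads in numbers, here is a small Python sketch of the arithmetic. It only uses the figures from the note above (an upper bound of 7000 for the largest download count, and 200-500 downloads from a single event); the function name is made up for the example.

```python
# Numbers taken from the note above; 7000 is an upper bound on the
# largest download count, so these shares are lower bounds.
LARGEST_TOTAL = 7000
EVENT_DOWNLOADS = (200, 500)


def share_of_total(event_downloads, total):
    """Fraction of a connector's total downloads attributable to one event."""
    return event_downloads / total


shares = [share_of_total(d, LARGEST_TOTAL) for d in EVENT_DOWNLOADS]
for downloads, share in zip(EVENT_DOWNLOADS, shares):
    print(f"{downloads} downloads -> at least {share:.1%} of the total")
```

Even against the most downloaded connector, a single event would account for roughly 3% to 7% of the total, and for connectors with fewer downloads the share would be larger.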
Original title and link: NoSQL Databases Adoption in Numbers (©myNoSQL)
Wednesday, 21 March 2012
Which NoSQL Databases Are Robust to Net-Splits?
- Dynamo (key-value)
- Voldemort (key-value)
- Tokyo Cabinet (key-value)
- KAI (key-value)
- Cassandra (column-oriented/tabular)
- CouchDB (document-oriented)
- SimpleDB (document-oriented)
- Riak (document-oriented)
A couple of clarifications to the list above:
- Dynamo has never been available to the public. On the other hand DynamoDB is not exactly Dynamo
- Tokyo Cabinet is not a distributed database so it shouldn’t be in this list
- CouchDB isn’t a distributed database either, but one could argue that with its peer-to-peer replication it sits right at the border. On the other hand there’s BigCouch.
Original title and link: Which NoSQL Databases Are Robust to Net-Splits? (©myNoSQL)
Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0
The tl;dr version is: DataStax has announced
Cassandra + Hadoop + Solr on the same cluster plus Sqoop, Log4j, and workload provisioning = DataStax Enterprise 2.0
For the longer version, there are a couple of new things worth emphasizing in this release:
- Fully integrated enterprise search
- RDBMS data migration
- Snap-in application log ingestion
- Improvements to OpsCenter
- Elastic workload provisioning
Let’s take these one by one:
Fully integrated enterprise search or Solr on top of Cassandra
Cassandra’s distribution model is strongly inspired by Amazon Dynamo and is characterized by high availability, elasticity, and fault tolerance. Solr is the search platform built on top of Lucene. Over time people learned how to scale Solr, but current approaches are far from being simple or offering an out-of-the-box experience. Taking the Solr protocol and indexing capabilities and putting those on top of the Cassandra architecture makes a lot of sense.
Actually this has already been done in the form of Solandra (nb: the Solr integration in DataStax Enterprise 2.0 is not based on Solandra though). For a scalable search solution there’s already ElasticSearch, but for someone running a Cassandra cluster, this looks like a useful addition to the stack.
DataStax has already shown this direction with what was initially called Brisk (or Brangelina for friends): Hadoop on top of the Cassandra cluster, which became DataStax Enterprise 1.0. Solr on top of Cassandra is 2.0, but what will 3.0 be?
There are two cherries on top of this integration of Solr: easy index rebuild operations and CQL (Cassandra Query Language) access. I’ve seen XQuery translated to Lucene searches before, but I still need to see a SQL-like language translation.
As I’ve learned from Riak at Clipboard: Why Riak and How We Made Riak Search Faster, there is some complexity involved in scaling multi-matching search queries with term-based partitioning. Cassandra uses two partitioning strategies: random and order-preserving. It would be interesting to hear what partitioning strategy is used for Solr indexes. Update: I’ve got some answers so there’ll be a follow up with more details.
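To make the difference between the two partitioning strategies concrete, here is a toy Python sketch. It is not Cassandra’s implementation: the node names, the key-range boundaries, and the use of MD5 as the hash for the random strategy are all assumptions made for the illustration.

```python
import bisect
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def random_token(key: str) -> int:
    """Hash the key so placement is unrelated to key order (MD5 assumed)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def owner_random(key: str) -> str:
    # Split the 128-bit hash space into equal ranges, one per node.
    range_size = 2 ** 128 // len(NODES)
    return NODES[min(random_token(key) // range_size, len(NODES) - 1)]


# Order-preserving: nodes own contiguous, sorted key ranges.
# Each boundary is the first key owned by the next node (made-up values).
BOUNDARIES = ["h", "p"]


def owner_ordered(key: str) -> str:
    return NODES[bisect.bisect_right(BOUNDARIES, key)]


terms = ["apple", "apricot", "avocado", "zebra"]
ordered = {t: owner_ordered(t) for t in terms}
hashed = {t: owner_random(t) for t in terms}
```

With the order-preserving placement, neighbouring terms such as `apple`, `apricot`, and `avocado` all land on the same node, which keeps range scans local but can create hot spots. The hashed placement spreads keys without regard to their order, so it trades away that locality.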
RDBMS data migration: it must be Sqoop
Nothing special here. You have a DataStax Enterprise cluster with some Hadoop nodes defined and you need to process data. But some of it lives in relational databases. Sqoop to the rescue.
Snap-in application log ingestion: Flume or Scribe? No, it’s Log4j
When I read this bullet point my first thought was that this is Flume. Or maybe Scribe. But most probably Flume. It looks like DataStax went a different route and offers log ingestion using Log4j. It’s true that Log4j or one of its flavors most probably exists in every Java project, but it still feels like an odd choice. On the other hand there’s a Cassandra plugin for Flume.
OpsCenter Enterprise 2.0
The OpsCenter is the management, monitoring, and control tool for DataStax Enterprise. The new version includes pretty much what you’d expect from an admin/monitoring tool:
- multi-cluster monitoring
- visual backup
- search monitoring
Looking back at the NoSQL administration/monitoring tools I’ve seen lately, I’m pretty sure I’ve identified a trend: they all come in various shades of black.
DataStax OpsCenter Enterprise:


Elastic workload provisioning
I’ve left for the end the feature that interested me most: elastic workload provisioning.
To better understand what this is, I had to go back to DataStax Enterprise 1.x where a node could be either a Cassandra node (OLTP) or a Hadoop node (processing). The new version allows quasi-dynamic node provisioning by changing the mode of a cluster (between Hadoop, Cassandra, Solr) with a stop/start operation. So given a cluster one could adjust its capacity and performance for different workloads (e.g. time-sensitive applications or temporary cluster operations).
Workload management is a feature present in most commercial data warehouse solutions. Even if it is still in its very early days, DataStax Enterprise’s workload provisioning is the first take on workload management in the NoSQL space.
Original title and link: Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0 (©myNoSQL)
Monday, 19 March 2012
NoSQL Hosting Services
Michael Hausenblas put together a list of hosted NoSQL solutions including Amazon DynamoDB and SimpleDB, Google App Engine, Riak, Cassandra, CouchDB, MongoDB, Neo4j, and OrientDB. If you go through my posts on NoSQL hosting, you’ll find a couple more.
Original title and link: NoSQL Hosting Services (©myNoSQL)
via: http://webofdata.wordpress.com/2012/03/18/hosted-nosql/
Thursday, 8 March 2012
Scala Client for Cassandra From Twitter: Cassie
Staying in the land of recent open source data-related projects from Twitter, Ryan King:
Cassie is a Finagle and Scala-based client originally based on Coda Hale’s library.
While it is certainly stable— we use it in production to talk to a dozen clusters and over a thousand Cassandra machines— it is currently limited to the features we use in production and has a few rough edges.
For the JVM there’s also Netflix’s Cassandra client (Astyanax) available on GitHub.
Original title and link: Scala Client for Cassandra From Twitter: Cassie (©myNoSQL)
via: https://dev.twitter.com/blog/cassie-scala-client-for-cassandra