cassandra: All content tagged as cassandra in NoSQL databases and polyglot persistence
EngineYard’s Ines Sombra recorded a conversation with Mathias Meyer about NoSQL databases and their evolution towards friendlier functionality, relational databases and their steps towards non-relational models, and a bit more on what polyglot persistence means.
Mathias Meyer is one of the people I could talk with for days about NoSQL and databases in general, with different infrastructure toppings, and he has some of the most well-balanced thoughts on this exciting space (see this conversation I had with him in the early days of NoSQL). I strongly encourage you to download the mp3 and listen to it.
Original title and link: NoSQL and Relational Databases Podcast With Mathias Meyer ( ©myNoSQL)
There are too many interesting new features and improvements in the newly released Cassandra 1.1 to cover them all here, but here’s the gist:
- Schema improvements
- Support for compound keys
- Concurrent schema changes
- A new version of Cassandra Query Language (CQL3) supporting compound keys and wide rows
- Better and easier tuning of the key and row caches
- Support for per-table hybrid storage, mixing SSDs and spinning disks
This DataStax blog entry provides links to more details about all these features and the others I haven’t enumerated above.
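To make the compound key and wide row items above more concrete, here is a minimal Python sketch of the storage idea: the first component of a compound primary key selects a partition, and the remaining component orders the entries inside it. The `WideRowTable` class and the `user_id`/`posted_at`/`body` names are hypothetical and only illustrate the model; this is not Cassandra's API or its on-disk format.

```python
from collections import defaultdict


class WideRowTable:
    """Toy model of a table declared with PRIMARY KEY (user_id, posted_at).

    Assumption: user_id acts as the partition key and posted_at as the
    clustering column, so all entries of one user live in one wide row.
    """

    def __init__(self):
        # partition key -> {clustering value -> remaining columns}
        self._partitions = defaultdict(dict)

    def insert(self, user_id, posted_at, body):
        self._partitions[user_id][posted_at] = {"body": body}

    def slice(self, user_id, start, end):
        """Return (posted_at, body) pairs of one partition, in clustering order."""
        row = self._partitions.get(user_id, {})
        return [(ts, cols["body"]) for ts, cols in sorted(row.items())
                if start <= ts <= end]

    def partition_count(self):
        return len(self._partitions)


timeline = WideRowTable()
timeline.insert("alice", 3, "third")
timeline.insert("alice", 1, "first")
timeline.insert("bob", 2, "hello")
alice_rows = timeline.slice("alice", 1, 3)
print(alice_rows)
```

Three logical rows end up in two partitions, and a range query on the second key component only has to touch a single partition.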
Original title and link: Cassandra 1.1 Released: What’s New ( ©myNoSQL)
Catching up after almost two weeks offline is no easy task, but I hope I’ll not miss any important events, releases, or posts. But if I do, please email me.
Cassandra 1.0.9: Maintenance Release
The complete change notes for Cassandra 1.0.9 are here:
- improve index sampling performance (CASSANDRA-4023)
- always compact away deleted hints immediately after handoff (CASSANDRA-3955)
- delete hints from dropped ColumnFamilies on handoff instead of erroring out (CASSANDRA-3975)
- add CompositeType ref to the CLI doc for create/update column family (CASSANDRA-3980)
- Avoid NPE during repair when a keyspace has no CFs (CASSANDRA-3988)
- Fix division-by-zero error on get_slice (CASSANDRA-4000)
- don’t change manifest level for cleanup, scrub, and upgradesstables operations under LeveledCompactionStrategy (CASSANDRA-3989, 4112)
- fix race leading to super columns assertion failure (CASSANDRA-3957)
- ensure that directory is selected for compaction for user-defined tasks and upgradesstables (CASSANDRA-3985)
- allow custom types in CLI’s assume command (CASSANDRA-4081)
- fix totalBytes count for parallel compactions (CASSANDRA-3758)
- fix intermittent NPE in get_slice (CASSANDRA-4095)
- remove unnecessary asserts in native code interfaces (CASSANDRA-4096)
- Fix EC2 snitch incorrectly reporting region (CASSANDRA-4026)
- Shut down thrift during decommission (CASSANDRA-4086)
Merged from 0.8: Fix ConcurrentModificationException in gossiper (CASSANDRA-4019)
- support Counter ColumnFamilies (CASSANDRA-3973)
- Composite column support (CASSANDRA-3684)
- fix NPE on invalid CQL delete command (CASSANDRA-3755)
- Validate blank keys in CQL to avoid assertion errors (CASSANDRA-3612)
Apache Hadoop User Impersonation vulnerability
This vulnerability, discovered by Cloudera’s Aaron T. Myers, affects Hadoop versions 0.20.203.0, 0.20.204.0, 0.20.205.0, 1.0.0 to 1.0.1, and 0.23.0 to 0.23.1 when Kerberos is enabled. Complete details available here.
This is the first important release after the CouchDB hubbub at the start of the year involving Damien Katz and Couchbase. The new version is a major release in itself, deserving its own post: CouchDB 1.2.0: Performance, Security, API, Core and Replication Improvements.
Riak 1.1.2: Stabilization release
Original title and link: NoSQL Releases and Announcements ( ©myNoSQL)
A couple of things I don’t see mentioned in the RedMonk post:
- if and how data has been normalized based on each connector’s availability
According to the post, data was collected between Jan. 2011 and Mar. 2012, and I think that not all connectors were available from the beginning of that period.
- if and how marketing pushes for each connector have been weighed in
Announcing the Hadoop connector at an event with 2000 attendees or the MongoDB connector at an event with 800 attendees could definitely influence the results (nb: keep in mind that the largest number is less than 7000, thus 200-500 downloads triggered by such an event have a significant impact)
- Redis and VoltDB are mostly OLTP-only databases
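The remark above about event-driven downloads can be checked with quick arithmetic. The sketch below uses only the numbers quoted in the note (a largest total below 7000 and 200-500 event-triggered downloads); the `share` helper is a hypothetical name.

```python
LARGEST_TOTAL = 7000  # upper bound on the largest download count, per the note


def share(event_downloads, total=LARGEST_TOTAL):
    """Percentage of a connector's total that a single event would account for."""
    return 100.0 * event_downloads / total


low = share(200)
high = share(500)
print(f"{low:.1f}% to {high:.1f}% of the largest total")
```

Because 7000 is an upper bound, these percentages are lower bounds for the most downloaded connector, and the effect is proportionally larger for every connector with fewer downloads.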
Original title and link: NoSQL Databases Adoption in Numbers ( ©myNoSQL)
- Dynamo (key-value)
- Voldemort (key-value)
- Tokyo Cabinet (key-value)
- KAI (key-value)
- Cassandra (column-oriented/tabular)
- CouchDB (document-oriented)
- SimpleDB (document-oriented)
- Riak (document-oriented)
A couple of clarifications to the list above:
- Dynamo has never been available to the public. On the other hand, DynamoDB is not exactly Dynamo
- Tokyo Cabinet is not a distributed database, so it shouldn’t be in this list
- CouchDB isn’t a distributed database either, but one could argue that with its peer-to-peer replication it sits right at the border. On the other hand, there’s BigCouch.
Original title and link: Which NoSQL Databases Are Robust to Net-Splits? ( ©myNoSQL)
The tl;dr version is: DataStax has announced
Cassandra + Hadoop + Solr on the same cluster plus Sqoop, Log4j, and workload provisioning = DataStax Enterprise 2.0
For the longer version, there are a couple of new things worth emphasizing in this release:
- Fully integrated enterprise search
- RDBMS data migration
- Snap-in application log ingestion
- Improvements to OpsCenter
- Elastic workload provisioning
Let’s take these one by one:
Fully integrated enterprise search or Solr on top of Cassandra
Cassandra’s distribution model is strongly inspired by Amazon Dynamo and is characterized by high availability, elasticity, and fault tolerance. Solr is the search platform built on top of Lucene. Over time people learned how to scale Solr, but current approaches are far from being simple or offering an out-of-the-box experience. Taking the Solr protocol and indexing capabilities and putting those on top of the Cassandra architecture makes a lot of sense.
Actually, this has already been done in the form of Solandra (nb: Solr integration in DataStax Enterprise 2.0 is not based on Solandra though). For a scalable search solution there’s already ElasticSearch, but for someone running a Cassandra cluster, this looks like a useful addition to the stack.
DataStax has already shown this direction with what was initially called Brisk (or Brangelina for friends): Hadoop on top of the Cassandra cluster, which became DataStax Enterprise 1.0. Solr on top of Cassandra is 2.0, but what will 3.0 be?
There are two cherries on top of this integration of Solr: easy index rebuild operations and CQL (Cassandra Query Language) access. I’ve seen XQuery translated to Lucene searches before, but I still need to see a SQL-like language translation.
As I’ve learned from Riak at Clipboard: Why Riak and How We Made Riak Search Faster, there is some complexity involved in scaling multi-matching search queries with term-based partitioning. Cassandra uses two partitioning strategies: random and order-preserving. It would be interesting to hear which partitioning strategy is used for Solr indexes. Update: I’ve got some answers, so there’ll be a follow-up with more details.
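For readers unfamiliar with the two partitioning strategies mentioned above, here is a simplified Python sketch of the difference. It assumes an MD5-based token reduced into a 2^127 token space for the random case and the raw key as the token for the order-preserving case; the function names and node tokens are made up for illustration, and Cassandra's actual partitioner classes differ in detail.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 2 ** 127  # token space assumed for the simplified random partitioner


def random_token(key: str) -> int:
    """MD5-based token: spreads keys evenly but discards key ordering."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % RING_SIZE


def order_preserving_token(key: str) -> str:
    """The key itself is the token, so neighbouring keys stay neighbours."""
    return key


def owner(token, node_tokens):
    """Index of the first node whose token is >= `token`, wrapping around the ring."""
    return bisect_left(node_tokens, token) % len(node_tokens)


keys = ["apple", "apricot", "avocado"]

# Hypothetical 3-node ring for the order-preserving case.
op_nodes = ["h", "p", "z"]
op_owners = [owner(order_preserving_token(k), op_nodes) for k in keys]

# Hypothetical 4-node ring with evenly spaced tokens for the random case.
rp_nodes = [i * RING_SIZE // 4 for i in range(4)]
rp_owners = [owner(random_token(k), rp_nodes) for k in keys]

print("order-preserving owners:", op_owners)
print("random owners:", rp_owners)
```

With the order-preserving tokens, all three neighbouring keys land on the same node, which makes range scans cheap but can create hot spots; with hashed tokens, placement no longer follows key order.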
RDBMS data migration: it must be Sqoop
Nothing special here. You have a DataStax Enterprise cluster with some Hadoop nodes defined and you need to process data. But some of it lives in relational databases. Sqoop to the rescue.
Snap-in application log ingestion: Flume or Scribe? No, it’s Log4j
When I read this bullet point, my first thought was that this is Flume. Or maybe Scribe. But most probably Flume. It looks like DataStax went a different route and offers log ingestion using Log4j. It’s true that Log4j or one of its flavors most probably exists in every Java project, but it still feels like an odd choice. On the other hand, there’s a Cassandra plugin for Flume.
OpsCenter Enterprise 2.0
The OpsCenter is the management, monitoring, and control tool for DataStax Enterprise. The new version includes pretty much what you’d expect from an admin/monitoring tool:
- multi-cluster monitoring
- visual backup
- search monitoring
Looking back at the NoSQL administration/monitoring tools I’ve seen lately, I’m pretty sure I’ve identified a trend: they all come in various shades of black.
DataStax OpsCenter Enterprise:
Elastic workload provisioning
I’ve left for last the feature that interested me most: elastic workload provisioning.
To better understand what this is, I had to go back to DataStax Enterprise 1.x, where a node could be either a Cassandra node (OLTP) or a Hadoop node (processing). The new version allows quasi-dynamic node provisioning by changing the mode of a cluster (between Hadoop, Cassandra, Solr) with a stop/start operation. So, given a cluster, one could adjust its capacity and performance for different workloads (e.g. time-sensitive applications or temporary cluster operations).
Workload management is a feature present in most commercial data warehouse solutions. Even if it is still in its very early days, DataStax Enterprise’s workload provisioning is the first step towards workload management in the NoSQL space.
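As a rough illustration of the stop/start idea described above, here is a toy Python model. It assumes the mode is assigned per node; the `Node` class, `switch_mode`, and `capacity` are hypothetical names and do not correspond to any DataStax Enterprise command or API.

```python
from collections import Counter
from dataclasses import dataclass

MODES = {"cassandra", "hadoop", "solr"}


@dataclass
class Node:
    name: str
    mode: str
    running: bool = True

    def switch_mode(self, new_mode: str) -> None:
        """Change the workload a node serves via a stop/start cycle."""
        if new_mode not in MODES:
            raise ValueError(f"unknown mode: {new_mode}")
        self.running = False  # stop the node
        self.mode = new_mode  # reassign its workload
        self.running = True   # start it again


def capacity(nodes):
    """Count running nodes per workload."""
    return Counter(n.mode for n in nodes if n.running)


cluster = [
    Node("n1", "cassandra"),
    Node("n2", "cassandra"),
    Node("n3", "hadoop"),
    Node("n4", "solr"),
]
before = capacity(cluster)

# Temporarily shift one node to batch processing, e.g. for a nightly job.
cluster[1].switch_mode("hadoop")
after = capacity(cluster)
print(before, after)
```

The point of the model is only that the same hardware serves different workloads over time, with the mix adjusted by restarting nodes rather than by adding machines.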
Original title and link: Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0 ( ©myNoSQL)