DataStax: All content tagged as DataStax in NoSQL databases and polyglot persistence
Holy cow! That’s a 4 followed by a 5… with no dots in between.
Derrick Harris for GigaOm: NoSQL startup DataStax raises $45M to ride Cassandra’s wave:
Cassandra’s success with such large users has to do with its ability to handle large-scale online applications that demand steady levels of performance, DataStax CEO Billy Bosworth told me. Scalability and performance have never been among Cassandra’s shortcomings, and the database is capable of replicating data across data centers. Large companies used to choose Oracle for applications that needed these capabilities, but now that NoSQL options are around and relatively mature, companies are rethinking whether the relational database model was ever really correct for some applications in the first place.
DataStax will use the funding to build out globally and invest in Apache Cassandra, the NoSQL open-source project and foundation for the company’s database distributions. The funding also signals a potential IPO for DataStax but much will depend on the direction of the markets, said CEO Billy Bosworth in an interview yesterday. “We are building the company for that direction (IPO),” he said. “A lot depends on external factors. Internally, the company is already starting that process.”
According to my books:
- This is the largest round raised by a NoSQL company. It tops 10gen’s $42mil for MongoDB.
- This is the 3rd largest round raised in the new data market, after Cloudera’s $65mil. and Hortonworks’s $50mil. rounds.
Original title and link: $45millions more for DataStax ( ©myNoSQL)
On one side:
and on the other side:
- Riak Search: Solr-like but custom proprietary implementation
- MongoDB text search: custom proprietary implementation
I’m not going to argue about the pros and cons of each of these approaches, but I’m sure you already know which of these approaches I’m in favor of.
Original title and link: NoSQL and Full Text Indexing: Two Trends ( ©myNoSQL)
I’d say that raising another $25 million from Meritech Capital Partners and with the participation of existing investors Lightspeed Venture Partners and Crosslink Capital is a good enough reason for DataStax to party.
DataStax will use the funds to further enhance its Big Data platform and increase the value for current customers while driving global customer acquisition.
Congrats to DataStax and the Cassandra community!
Original title and link: $25 Million in C Round for DataStax ( ©myNoSQL)
The tl;dr version is: DataStax has announced
Cassandra + Hadoop + Solr on the same cluster plus Sqoop, Log4j, and workload provisioning = DataStax Enterprise 2.0
For the longer version, there are a couple of new things worth emphasizing in this release:
- Fully integrated enterprise search
- RDBMS data migration
- Snap-in application log ingestion
- improvements to OpsCenter
- Elastic workload provisioning
Let’s take these one by one:
Fully integrated enterprise search or Solr on top of Cassandra
Cassandra’s distribution model is strongly inspired by Amazon Dynamo and is characterized by high availability, elasticity, and fault tolerance. Solr is the search platform built on top of Lucene. Over time people learned how to scale Solr, but current approaches are far from simple and do not offer an out-of-the-box experience. Taking the Solr protocol and indexing capabilities and putting them on top of the Cassandra architecture makes a lot of sense.
Actually this has already been done in the form of Solandra (n.b. the Solr integration in DataStax Enterprise 2.0 is not based on Solandra though). For a scalable search solution there’s already ElasticSearch, but for someone running a Cassandra cluster, this looks like a useful addition to the stack.
DataStax has already shown this direction with what was initially called Brisk (or Brangelina for friends): Hadoop on top of the Cassandra cluster, which became DataStax Enterprise 1.0. Solr on top of Cassandra is 2.0, but what will 3.0 be?
There are two cherries on top of this integration of Solr: easy index rebuild operations and CQL (Cassandra Query Language) access. I’ve seen XQuery translated to Lucene searches before, but I still need to see a SQL-like language translation.
As I’ve learned from Riak at Clipboard: Why Riak and How We Made Riak Search Faster, there is some complexity involved in scaling multi-matching search queries with term-based partitioning. Cassandra uses two partitioning strategies: random and order-preserving. It would be interesting to hear what partitioning strategy is used for Solr indexes. Update: I’ve got some answers so there’ll be a follow up with more details.
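To make the difference between the two partitioning strategies concrete, here is a minimal sketch. It is not Cassandra's implementation: the three-node rings, the node names, and the helper functions are hypothetical, and the token range is simplified. It only shows that hashing keys scatters neighbouring keys across nodes, while an order-preserving scheme keeps them together.

```python
import bisect
import hashlib

def random_token(key: str) -> int:
    """Token derived from an MD5 hash of the key, so adjacent keys scatter."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")

def ordered_token(key: str) -> str:
    """Token is the key itself, so key order is preserved on the ring."""
    return key

def owner(ring, token):
    """ring is a sorted list of (token, node) pairs. Return the first node
    whose token is >= the given token, wrapping around to the first node."""
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, token)
    return ring[i % len(ring)][1]

# Hypothetical three-node rings (assumed for illustration only).
MAX = 2 ** 128
random_ring = [(MAX // 3, "node-a"), (2 * MAX // 3, "node-b"), (MAX - 1, "node-c")]
ordered_ring = [("h", "node-a"), ("p", "node-b"), ("\uffff", "node-c")]

keys = ["apple", "apricot", "avocado", "banana"]
random_owners = {k: owner(random_ring, random_token(k)) for k in keys}
ordered_owners = {k: owner(ordered_ring, ordered_token(k)) for k in keys}

print("random: ", random_owners)
print("ordered:", ordered_owners)
```

With the order-preserving ring all four keys land on the same node, which is what makes range scans cheap and hot spots likely. With the hashed ring, placement depends only on the hash, so load tends to spread evenly but a range of keys can touch every node.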
RDBMS data migration: it must be Sqoop
Nothing special here. You have a DataStax Enterprise cluster with some Hadoop nodes defined and you need to process data. But some of it lives in relational databases. Sqoop to the rescue.
Snap-in application log ingestion: Flume or Scribe? No, it’s Log4j
When I read this bullet point my first thought was that this is Flume. Or maybe Scribe. But most probably Flume. It looks like DataStax went a different route and offers log ingestion using Log4j. It’s true that Log4j or one of its flavors most probably exists in every Java project, but it still feels like an odd choice. On the other hand there’s a Cassandra plugin for Flume.
OpsCenter Enterprise 2.0
The OpsCenter is the management, monitoring, and control tool for DataStax Enterprise. The new version includes pretty much what you’d expect from an admin/monitoring tool:
- multi-cluster monitoring
- visual backup
- search monitoring
Looking back at the NoSQL administration/monitoring tools I’ve seen lately, I’m pretty sure I’ve identified a trend: they all come in various shades of black.
DataStax OpsCenter Enterprise:
Elastic workload provisioning
I’ve left for the end the feature I’m most interested in: elastic workload provisioning.
To better understand what this is, I had to go back to DataStax Enterprise 1.x where a node could be either a Cassandra node (OLTP) or a Hadoop node (processing). The new version allows quasi-dynamic node provisioning by changing the mode of a cluster (between Hadoop, Cassandra, Solr) with a stop/start operation. So given a cluster one could adjust its capacity and performance for different workloads (e.g. time-sensitive applications or temporary cluster operations).
Workload management is a feature present in most commercial data warehouse solutions. Even if it is still in its very early days, DataStax Enterprise’s workload provisioning is the first step towards workload management in the NoSQL space.
Original title and link: Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0 ( ©myNoSQL)
Cassandra as the Central Nervous System of Your Distributed Systems with Joe Stein - Powered by NoSQL
In the 4th week of DataStax’s Cassandra NYC 2011 video series, we have Joe Stein from Medialets talking about the architecture
Before diving into the video here are some interesting data points:
- Medialets serves rich media ads
- they handle 3-4TB of daily data
- microsecond-level response times
- Cassandra is used for time series and aggregate metrics
- all MapReduce jobs written in Python. This reminded me of the recent post about the performance impact of operations in Hadoop Map phase
Major components of Medialets’s architecture:
- Cassandra: 6 node cluster, 100k requests, single DC
- ZooKeeper: coordinates all the services on the platform
- some of the data in MySQL is replicated in Cassandra (and coordinated with ZooKeeper)
- data is fed back to MySQL
- Kafka for collecting analytics data:
  - aggregates go into Cassandra
  - events in Hadoop
- GROUP BY with Cassandra:
  - for real-time systems aggregations must be done upfront
  - the way data is segmented is critical
  - aggregation leads to data explosion
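The points about doing aggregations upfront can be sketched in a few lines. This is only an illustration, not Medialets's code: the event fields (`ad`, `country`, `hour`), the in-memory `Counter` standing in for Cassandra counters, and the helper names are all assumptions. Each incoming event increments one counter per combination of dimensions, so a later "GROUP BY" read becomes a single key lookup.

```python
from collections import Counter
from itertools import combinations

def rollup_keys(event, dimensions):
    """Yield one counter key per subset of the dimensions (including the
    empty subset, which is the grand total)."""
    for r in range(len(dimensions) + 1):
        for dims in combinations(dimensions, r):
            yield tuple((d, event[d]) for d in dims)

def record(counters, event, dimensions):
    """Pre-aggregate at write time: bump every rollup the event belongs to."""
    for key in rollup_keys(event, dimensions):
        counters[key] += 1

# Hypothetical events and dimensions, assumed for illustration only.
dimensions = ("ad", "country", "hour")
events = [
    {"ad": "a1", "country": "US", "hour": "2011-12-06T10"},
    {"ad": "a1", "country": "US", "hour": "2011-12-06T10"},
    {"ad": "a2", "country": "DE", "hour": "2011-12-06T10"},
]

counters = Counter()  # stand-in for counters stored in Cassandra
for e in events:
    record(counters, e, dimensions)

# "SELECT count(*) ... GROUP BY ad" for ad a1 is now a direct lookup.
clicks_a1 = counters[(("ad", "a1"),)]
writes_per_event = len(list(rollup_keys(events[0], dimensions)))
print(clicks_a1, writes_per_event, len(counters))
```

Every event turns into 2^n counter updates for n dimensions (8 here), which is the data explosion the talk mentions, and it is also why the choice of dimensions, that is how the data is segmented, has to be made before the data arrives.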