


DataStax: All content tagged as DataStax in NoSQL databases and polyglot persistence

Google Compute Engine and Data

Since the GA announcement a couple of weeks ago, I’ve been noticing quite a few data-related posts on the Google Compute Engine blog:

If you look at these, you’ll notice a theme: covering data from every angle; Cassandra/DSE from DataStax for OLTP, DataTorrent for stream processing, Qubole for Hadoop, MapR for their Hadoop-like solution. I can see this continuing for a while and making Google Compute Engine a strong competitor for Amazon Web Services.

One question remains though: will they be able to come up with a good integration strategy for all these 3rd party tools?

Original title and link: Google Compute Engine and Data (NoSQL database©myNoSQL)

Forbes Top 10 Most Funded Big Data Startups

  • MongoDB (formerly 10gen) $231m Document-oriented database
  • Mu Sigma $208m Data-Science-as-a-Service
  • Cloudera $141m Hadoop-based software, services and training
  • Opera Solutions $114m Data-Science-as-a-Service
  • Hortonworks $98m Hadoop-based software, services and training
  • Guavus $87m Big data analytics solution
  • DataStax $83.7m Cassandra-based big data platform
  • GoodData $75.5m Cloud-based platform and big data apps
  • Talend $61.6m App and business process integration platform
  • Couchbase $56m Document-oriented database

I’m not really sure there are any conclusions one could draw based on this data alone.



$45millions more for DataStax

Holy cow! That’s a 4 followed by a 5… with no dots in between.

  1. Derrick Harris for GigaOm: NoSQL startup DataStax raises $45M to ride Cassandra’s wave:

    Cassandra’s success with such large users has to do with its ability to handle large-scale online applications that demand steady levels of performance, DataStax CEO Billy Bosworth told me. Scalability and performance have never been among Cassandra’s shortcomings, and the database is capable of replicating data across data centers. Large companies used to choose Oracle for applications that needed these capabilities, but now that NoSQL options are around and relatively mature, companies are rethinking whether the relational database model was ever really correct for some applications in the first place.

  2. Alex Williams for TC: DataStax Readies For IPO, Raises $45M For Modern Database Platform Suited To New Data Intensive World:

    DataStax will use the funding to build out globally and invest in Apache Cassandra, the NoSQL open-source project and foundation for the company’s database distributions. The funding also signals a potential IPO for DataStax but much will depend on the direction of the markets, said CEO Billy Bosworth in an interview yesterday. “We are building the company for that direction (IPO),” he said. “A lot depends on external factors. Internally, the company is already starting that process.”

According to my books:

  1. This is the largest round raised by a NoSQL company. It tops 10gen’s $42mil for MongoDB.
  2. This is the 3rd largest round raised in the new data market, after Cloudera’s $65mil. and Hortonworks’s $50mil. rounds.


NoSQL and Full Text Indexing: Two Trends

On one side:

  1. DataStax with Solr
  2. MapR with LucidWorks Search (nb: Solr)

and on the other side:

  1. Riak Search: Solr-like but custom proprietary implementation
  2. MongoDB text search: custom proprietary implementation

I’m not going to argue about the pros and cons of each of these approaches, but I’m sure you already know which of these approaches I’m in favor of.


Hadoop, Security, and DataStax Enterprise

But the eWeek article demonstrates that the same concerns [nb: about security] exist where Hadoop implementations are concerned. The article says: “It [Hadoop] was not written to support hardened security, compliance, encryption, policy enablement and risk management.”

The story goes like this: in the early days of NoSQL, when no NoSQL database had any sort of security features, people behind the projects answered: “it’s too early. we’re focusing on more important features. and you can still get around security by placing your database behind firewalls”. Today, when more and more NoSQL databases are adding security features, the story these same people are telling is quite different: “ohhh, security is critical. we don’t really see how you could run a database without these features”.

Security is always critical. And exactly the same can be said about maintaining a solid, coherent story of what you are telling your users.



Oracle and DataStax on TechCrunch

I have some serious doubts about Alex Williams’s post on TechCrunch about the connection between the recently announced results from Oracle and DataStax. To exemplify, these paragraphs don’t make a lot of sense to me:

The reason for the drop has more to do with the enterprise acceptance of online applications more than anything else, said Datastax CEO Billy Bosworth in an interview last week.

Does it mean that enterprises are discovering online applications now?

When companies come to Datastax, they say the number one thing they need is security, Bosworth said. They are building from day one to avoid disaster scenarios.

DataStax introduced security features just recently, so I’ll assume Billy Bosworth was actually referring to fault tolerance and resilience. What ended up in the article is a different story.

Datastax has its own challenges. It competes with Amazon Web Services and all the other NoSQL providers such as 10gen.

Once again I’ll assume the author wanted to refer to Amazon Dynamo (and RDS?), but thought it would read better as “Amazon Web Services”.

Actually, now that I’ve read it twice, I realize that I shouldn’t link to it. But at least I can suggest that you waste no time on it.



A Quick Tour of Internal Authentication and Authorization Security in DataStax Enterprise and Apache Cassandra

Robin Schumacher describes the new security features added to Apache Cassandra and DataStax Enterprise:

This article will concentrate on the new internal authentication and authorization (or permission management) features that are part of both open source Cassandra as well as DataStax Enterprise. Authentication deals with validating incoming user connections to a database cluster, whereas authorization concerns itself with what a logged in user can do inside a database.

I’m happy to see NoSQL databases entering the security space, as this will ease their way into enterprises. But I’m a bit afraid of the moment when the marketing message changes from “it’s too early to provide security features” to “the first enterprise grade NoSQL database”.



NoSQL on MySQL: Stating the Obvious

Matthew Aslett on Couchbase’s and DataStax’s reactions to Oracle’s announcement of NoSQL API support in MySQL:

Sure, Couchbase and DataStax laid it on a bit thick, but these are corporate blog posts – it goes with the territory.

I’ve already linked to and commented on these: Couchbase’s reaction and DataStax’s reaction. What I didn’t know (more accurately I should probably write “I hoped”) is that this sort of reaction comes with the “corporate” badge. But I’ll keep my hope considering the exhaustive list of reactions from other NoSQL companies.



DataStax's Reaction to MySQL 5.6: Oracle’s MySQL Misses the NoSQL Mark

Jonathan Ellis, in a post about MySQL 5.6 and how Oracle got the whole NoSQL thing wrong, considering that NoSQL is, in this exact order, about scaling, continuous availability, flexibility, performance, and queryability:

The big news for MySQL 5.6 was the inclusion of “NoSQL” features in the form of a memcached api for get and put operations.

In cases like this, it’s tough to tell whether Oracle got this so wrong deliberately to sow confusion in the market, or because they really think that’s what NoSQL is about.

I know Jonathan Ellis has always had very strong opinions about the technical superiority of Cassandra, and Cassandra is indeed a very solid solution, but I’m always wary of calling a competitor stupid and of using the myopic argument “if I’m good at X and suck at Y, then what everyone is looking for is only X”.



$25 Million in C Round for DataStax

I’d say that raising another $25 million from Meritech Capital Partners and with the participation of existing investors Lightspeed Venture Partners and Crosslink Capital is a good enough reason for DataStax to party.

DataStax will use the funds to further enhance its Big Data platform and increase the value for current customers while driving global customer acquisition.

Congrats to DataStax and the Cassandra community!


Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0

The tl;dr version is: DataStax has announced

Cassandra + Hadoop + Solr on the same cluster plus Sqoop, Log4j, and workload provisioning = DataStax Enterprise 2.0

For the longer version, there are a couple of new things worth emphasizing in this release:

  1. Fully integrated enterprise search
  2. RDBMS data migration
  3. Snap-in application log ingestion
  4. Improvements to OpsCenter
  5. Elastic workload provisioning

Let’s take these one by one:

Fully integrated enterprise search or Solr on top of Cassandra

Cassandra’s distribution model is strongly inspired by Amazon Dynamo and is characterized by high availability, elasticity, and fault tolerance. Solr is the search platform built on top of Lucene. Over time people learned how to scale Solr, but current approaches are far from being simple or offering an out of the box experience. Taking the Solr protocol and indexing capabilities and putting those on top of the Cassandra architecture makes a lot of sense.

Actually this has already been done in the form of Solandra (nb: the Solr integration in DataStax Enterprise 2.0 is not based on Solandra though). For a scalable search solution there’s already ElasticSearch, but for someone running a Cassandra cluster, this looks like a useful addition to the stack.

DataStax has already shown this direction with what was initially called Brisk (or Brangelina for friends): Hadoop on top of the Cassandra cluster, which became DataStax Enterprise 1.0. Solr on top of Cassandra is 2.0, but what will 3.0 be?

There are two cherries on top of this Solr integration: easy index rebuild operations and CQL (Cassandra Query Language) access. I’ve seen XQuery translated to Lucene searches before, but I have yet to see a translation from a SQL-like language.

As I’ve learned from Riak at Clipboard: Why Riak and How We Made Riak Search Faster, there is some complexity involved in scaling multi-matching search queries with term-based partitioning. Cassandra uses two partitioning strategies: random and order-preserving. It would be interesting to hear what partitioning strategy is used for Solr indexes. Update: I’ve got some answers so there’ll be a follow up with more details.
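To make the difference between the two partitioning strategies concrete, here is a minimal sketch. It is not Cassandra’s implementation: the node count, the split points, and both function names are made up for illustration, and a simple modulo stands in for real token ranges. It only shows that hash-based ("random") placement scatters neighbouring keys across nodes, while order-preserving placement keeps them together.

```python
import bisect
import hashlib


def random_partition(key, num_nodes):
    """Hash-based ("random") placement: pick a node from an MD5 digest of the key.

    Simplification: real systems map the hash onto token ranges, not a modulo.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_nodes


def ordered_partition(key, split_points):
    """Order-preserving placement: pick a node by where the key sorts
    relative to a sorted list of split points."""
    return bisect.bisect_right(split_points, key)


# Hypothetical 3-node layout: keys below "b", keys from "b" up to "h", the rest.
split_points = ["b", "h"]
keys = ["apple", "apricot", "banana", "cherry", "grape", "melon"]

for key in keys:
    print(key, random_partition(key, 3), ordered_partition(key, split_points))
```

With the order-preserving function, a range scan over keys starting with "a" touches a single node; with the hash-based one the same keys can land anywhere, which balances load but turns range and multi-term lookups into fan-out queries.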

RDBMS data migration: it must be Sqoop

Nothing special here. You have a DataStax Enterprise cluster with some Hadoop nodes defined and you need to process data. But some of it lives in relational databases. Sqoop to the rescue.

Snap-in application log ingestion: Flume or Scribe? No, it’s Log4j

When I read this bullet point my first thought was that this is Flume. Or maybe Scribe. But most probably Flume. It looks like DataStax went a different route and offers log ingestion using Log4j. It’s true that Log4j or one of its flavors most probably exists in every Java project, but it still feels like an odd choice. On the other hand there’s a Cassandra plugin for Flume.

OpsCenter Enterprise 2.0

The OpsCenter is the management, monitoring, and control tool for DataStax Enterprise. The new version includes pretty much what you’d expect from an admin/monitoring tool:

  • multi-cluster monitoring
  • visual backup
  • search monitoring

Looking back at the NoSQL administration/monitoring tools I’ve seen lately, I’m pretty sure I’ve identified a trend: they all come in various shades of black.

DataStax OpsCenter Enterprise:

[Screenshot: DataStax OpsCenter]

Riak Control:

[Screenshot: Riak Control]

Elastic workload provisioning

I’ve left for the end the feature that interests me the most: elastic workload provisioning.

To better understand what this is, I had to go back to DataStax Enterprise 1.x where a node could be either a Cassandra node (OLTP) or a Hadoop node (processing). The new version allows quasi-dynamic node provisioning by changing the mode of a cluster (between Hadoop, Cassandra, Solr) with a stop/start operation. So given a cluster one could adjust its capacity and performance for different workloads (e.g. time-sensitive applications or temporary cluster operations).

Workload management is a feature present in most commercial data warehouse solutions. Even if it is still in its very early days, DataStax Enterprise’s workload provisioning is the first step towards workload management in the NoSQL space.


Big Data Market Analysis: Vendors Revenue and Forecasts

I think this is the first extensive Big Data report I’m reading that includes enough relevant and quite exhaustive data about the majority of players in the Big Data market, plus some captivating forecasts.

As of early 2012, the Big Data market stands at just over $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged CAGR of 58% between now and 2017.
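To get a feel for what a 58% CAGR means, the compounding can be worked out in a few lines. This is a rough sketch under two assumptions that are not stated in the quote: a starting value of $5.1 billion (the quote only says "just over $5 billion") and five full compounding periods between 2012 and 2017. The `project` helper is illustrative, not taken from the report.

```python
def project(value, cagr, years):
    """Compound `value` at annual growth rate `cagr` for `years` periods."""
    return value * (1.0 + cagr) ** years


BASE_2012_BN = 5.1  # assumption: "just over $5 billion", in $ billions
CAGR = 0.58         # growth rate quoted in the report

for year in range(2012, 2018):
    print(year, round(project(BASE_2012_BN, CAGR, year - 2012), 1))
```

Under these assumptions the market grows almost tenfold, to roughly $50 billion by 2017.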

[Chart: 2011 Big Data Pure-Play Vendors Yearly Big Data Revenue]

While there are many stories behind these numbers and many things to think about, here is what I’ve jotted down while studying the report:

  • it’s no surprise that “megavendors” (IBM, HP, etc.) account for the largest part of today’s Big Data market revenue
  • still, the revenue ratio of pure-players vs megavendors feels quite unbalanced: $311mil out of $5.1bil
    • the pure-player category includes: Vertica, Aster Data, Splunk, Greenplum, 1010data, Cloudera, Think Big Analytics, MapR, Digital Reasoning, Datameer, Hortonworks, DataStax, HPCC Systems, Karmasphere
    • there are a couple of names that position themselves in the Big Data market but do not show up anywhere (e.g. 10gen, Couchbase)
  • this could lead to the conclusion that the companies that include hardware in their offer benefit from larger revenues
    • I’m wondering though what the margin is in the hardware market segment. While I don’t have any data at hand, I think I’ve read reports about HP and Dell not doing so well precisely because of lower margins
    • see bullet point further down about revenue by hardware, software, and services
  • this could explain why so many companies are trying their hand at appliances
  • by looking at the various numbers you can see that those selling appliances usually have a large corporation behind them, supporting the production costs for hardware and probably the cost of the sales force
  • in the Big Data revenue by vendor you can find quite a few well-known names from the consulting segment
  • the revenue by type pie lists services as accounting for 44%, hardware for 31%, and software for 13% which might give an idea of what makes up the megavendors’ sales packages
    • most of the NoSQL database companies and Hadoop companies are in the software and services segment
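A quick back-of-the-envelope check of the ratios in the notes above. It uses only the figures quoted there, and assumes (this is not confirmed by the report) that the by-type percentages apply to the same $5.1bil total.

```python
TOTAL_MIL = 5100.0     # total Big Data revenue, in $ millions
PURE_PLAY_MIL = 311.0  # pure-play vendors' revenue, in $ millions

# Share of the market held by pure-play vendors.
pure_play_share = PURE_PLAY_MIL / TOTAL_MIL

# Revenue-by-type percentages as listed in the notes.
by_type = {"services": 0.44, "hardware": 0.31, "software": 0.13}

# Assumption: the percentages are fractions of the same total.
by_type_mil = {name: pct * TOTAL_MIL for name, pct in by_type.items()}

print(f"pure-play share: {pure_play_share:.1%}")
for name, amount in by_type_mil.items():
    print(f"{name}: ${amount:,.0f} mil")
```

Pure-play vendors come out at about 6% of the total, which is what makes the ratio feel so unbalanced.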

Great job done by the Wikibon team.
