


Brisk: All content tagged as Brisk in NoSQL databases and polyglot persistence

Petabyte-Scale Hadoop Clusters

Curt Monash quoting Omer Trajman (Cloudera) in a post counting petabyte-scale Hadoop deployments:

The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.
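The "mean is not the same as median" remark in the quote is easy to check with a few numbers. The sketch below uses invented node counts (they are not Cloudera's data) to show how a couple of very large clusters can pull the average above 200 while most clusters stay under 30 nodes.

```python
from statistics import mean, median

# Hypothetical node counts, invented for illustration only:
# many small clusters plus a couple of very large ones.
cluster_sizes = [10, 12, 15, 20, 20, 25, 25, 28, 30, 900, 1200]

avg = mean(cluster_sizes)    # pulled up by the two large clusters
mid = median(cluster_sizes)  # the "typical" cluster in this sample

print(f"mean:   {avg:.1f} nodes")
print(f"median: {mid} nodes")
```

With these made-up numbers the mean lands a little above 200 nodes while the median is 25, which is the shape of distribution the quote describes.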

The first questions that popped into my head after reading it:

  1. How many deployments does DataStax’s Brisk have? How many are close to or over a petabyte?
  2. How many clients run EMC Greenplum HD and how many are close to this scale?
  3. Same question about NetApp Hadoopler clients.
  4. Same question for MapR.

Answering these questions would give us a good overview of the Hadoop ecosystem.

Original title and link: Petabyte-Scale Hadoop Clusters (NoSQL databases © myNoSQL)


Where Riak Fits? Riak’s Sweetspot

Martin Schneider (Basho) trying to answer the question in the title:

Riak can be a data store to a purpose-built enterprise app; a caching layer for an Internet app, or part of the distributed fabric and DNA of a Global app. Those are of course highly arbitrary and vague examples, but it shows how flexible Riak is as a platform.

“Can be” is not the same as being the right solution, and even less so the best solution. Martin’s answer to this is:

For super scalable enterprise and global apps — those where the data inside is inherently valuable and dependability of the system to capture, process and store data/writes is imperative — well I see Riak outperforming any perceived competitor in the space in providing value here.

But even for these scenarios, there’s competition from solutions like Cassandra, HBase, and Hypertable. The whole spectrum of scalable storage solutions based on Google BigTable and Amazon Dynamo is covered: HBase (a BigTable implementation), Cassandra (a solution using the BigTable data model and the Dynamo distributed model), and Riak (a solution based mainly on the Amazon Dynamo paper).
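For readers who have not met the "Dynamo distributed model" mentioned above, here is a minimal sketch of its core idea, consistent hashing with replication to the next nodes on a ring. It is a toy under stated assumptions, not how Riak or Cassandra implement it: the class name `HashRing`, the node names, and the use of MD5 are all hypothetical choices for illustration, and real systems add virtual nodes, failure handling, and conflict resolution that are not shown.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: a key is stored on the next N nodes clockwise."""

    def __init__(self, nodes, replicas=3):
        # Never ask for more replicas than there are nodes.
        self.replicas = min(replicas, len(nodes))
        # One ring position per node (no virtual nodes in this sketch).
        self._ring = sorted((ring_position(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def preference_list(self, key):
        """Return the nodes responsible for `key`, walking clockwise from its position."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(self.replicas)
        ]


# Hypothetical four-node cluster with three copies of every key.
ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replicas=3)
owners = ring.preference_list("user:42")
print(owners)
```

Because every node can compute the same preference list from the key alone, no central master has to be consulted to locate data.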

While Riak presents itself as the cleanest Dynamo-based solution, I would venture to say that both Cassandra and HBase come to the table with some interesting characteristics that cannot be ignored:

  1. Strong communities and community-driven development processes: both HBase and Cassandra are top-level Apache Software Foundation projects
  2. Excellent integration with Hadoop, the leading batch processing solution. DataStax, the company offering services for Cassandra, went the extra mile by creating a custom Hadoop solution, Brisk, making this integration even better.

Bottom line, I don’t think we can declare a winner in this space and I believe all three solutions will stay around for a while competing for every scenario requiring dependability of the system to capture, process and store data.

Original title and link: Where Riak Fits? Riak’s Sweetspot (NoSQL databases © myNoSQL)

Hadoop Ecosystem: EMC, NetApp, Mellanox, SnapLogic, DataStax

GigaOm and RWW have coverage of the 5 Hadoop-related announcements:

  • DataStax Brisk: Hadoop and Hive on Cassandra
  • NetApp Hadoop Shared DAS
  • Mellanox Hadoop-Direct

    increase throughput in Hadoop clusters via its ConnectX-2 adapters with Hadoop Direct

  • SnapLogic SnapReduce

    SnapReduce transforms SnapLogic data integration pipelines directly into MapReduce tasks, making Hadoop processing much more accessible and resulting in optimal Hadoop cluster utilization.

  • EMC GreenplumHD

    Greenplum HD combines the Hadoop analytics platform with Greenplum’s database technology.

Ways to look at it:

  • 2 large corporations getting into Hadoop
  • 2 software solutions, 3 hardware solutions
  • 1 open source project, 4 commercial products or
  • 4 companies wanting to make a profit from Hadoop without contributing back to the community

Original title and link: Hadoop Ecosystem: EMC, NetApp, Mellanox, SnapLogic, DataStax (NoSQL databases © myNoSQL)

DataStax Hadoop on Cassandra Brisk Released

DataStax kept its promise and released Brisk: the Hadoop and Hive distribution using Cassandra, also known as Brangelina.

According to the official documentation, Brisk’s key advantages are:

  • no single point of failure
  • streamlined setup and operations
  • analytics without ETL
  • full integration with DataStax OpsCenter

Brisk Architecture


Original title and link: DataStax Hadoop on Cassandra Brisk Released (NoSQL databases © myNoSQL)

Brisk: The Brangelina of Big Data

Now that’s a title: The Brangelina of Big Data: Cassandra mates with Hadoop. Open source celebrity supercouple. The article is a genealogy tree: Hadoop, Hive, Cassandra, DataStax.

Original title and link: Brisk: The Brangelina of Big Data (NoSQL databases © myNoSQL)

Cassandra + Hadoop = Brisk by DataStax

I just heard the announcement that DataStax, the company offering Cassandra services, made about Brisk, a Hadoop and Hive distribution built on top of Cassandra:

Brisk provides integrated Hadoop MapReduce, Hive and job and task tracking capabilities, while providing an HDFS-compatible storage layer powered by Cassandra.

Brisk was announced officially during the MapReduce panel at the Structure Big Data event. But it looks like others have already had a chance to hear about Brisk. Is there something that I should be doing to hear the “unofficial” announcements?

DataStax has also made available a whitepaper, “Evolving Hadoop into a Low-Latency Data Infrastructure: Unifying Hadoop, Hive and Apache Cassandra for Real-time and Analytics”, which is available for download.

Original title and link: Cassandra + Hadoop = Brisk by DataStax (NoSQL databases © myNoSQL)