


DataStax: All content tagged as DataStax in NoSQL databases and polyglot persistence

NoSQL Applications Panel Video

Hey, it looks like the NoSQL applications panel I moderated at QCon SF 2011 went live minutes ago on InfoQ. It features Andy Gross (Basho), Frank Weigel (Couchbase), Matt Pfeil (DataStax), Michael Stack (StumbleUpon), Jared Rosoff (10gen), and yours truly.

Drop everything and start watching it now! I promise you’ll love every second of it[1].

  1. It misses my opening jokes though  

Original title and link: NoSQL Applications Panel Video (NoSQL database©myNoSQL)

8 Most Interesting Companies for Hadoop’s Future

Filtering and augmenting a Q&A on Quora:

  1. Cloudera: Hadoop distribution, Cloudera Enterprise, Services, Training
  2. Hortonworks: Apache Hadoop major contributions, Services, Training
  3. MapR: Hadoop distribution, Services, Training
  4. HPCC Systems: massive parallel-processing computing platform
  5. HStreaming: real-time data processing and analytics capabilities on top of Hadoop
  6. DataStax: DataStax Enterprise, Apache Cassandra based platform accepting real-time input from online applications, while offering analytic operations, powered by Hadoop
  7. Zettaset: Enterprise Data Analytics Suite built on Hadoop
  8. Hadapt: analytic platform based on Apache Hadoop and relational DBMS technology

I’ve left aside names like IBM, EMC, Informatica, which are doing a lot of integration work.


Petabyte-Scale Hadoop Clusters

Curt Monash quoting Omer Trajman (Cloudera) in a post counting petabyte-scale Hadoop deployments:

The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.

The first questions that popped into my head after reading it:

  1. How many deployments does DataStax’s Brisk have? How many are close to or over a petabyte?
  2. How many clients run EMC Greenplum HD and how many are close to this scale?
  3. Same question about NetApp Hadoopler clients.
  4. Same question for MapR.

Answering these questions would give us a good overview of the Hadoop ecosystem.
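
The mean-versus-median remark in the quote above is easy to see with a small sketch. The node counts below are made up purely for illustration (they are not Cloudera's data): most clusters are under 30 nodes, yet two large ones pull the average up to 200.

```python
from statistics import mean, median

# Made-up node counts for ten clusters, chosen only to illustrate the effect.
# Eight are small (30 nodes or fewer); two are large.
cluster_sizes = [10, 12, 15, 20, 20, 25, 28, 30, 640, 1200]

mean_size = mean(cluster_sizes)      # pulled up by the two large clusters
median_size = median(cluster_sizes)  # reflects the typical small cluster

print(f"mean:   {mean_size} nodes")
print(f"median: {median_size} nodes")
```

So an average "topping 200" says little about the typical deployment, which is why counts of petabyte-scale clusters per vendor would be more telling.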



Hadoop Ecosystem: EMC, NetApp, Mellanox, SnapLogic, DataStax

GigaOm and RWW have coverage of the 5 Hadoop-related announcements:

  • DataStax Brisk: Hadoop and Hive on Cassandra
  • NetApp Hadoop Shared DAS
  • Mellanox Hadoop-Direct

    increase throughput in Hadoop clusters via its ConnectX-2 adapters with Hadoop Direct

  • SnapLogic SnapReduce

    SnapReduce transforms SnapLogic data integration pipelines directly into MapReduce tasks, making Hadoop processing much more accessible and resulting in optimal Hadoop cluster utilization.

  • EMC Greenplum HD

    Greenplum HD combines the Hadoop analytics platform with Greenplum’s database technology.

Ways to look at it:

  • 2 large corporations getting into Hadoop
  • 2 software solutions, 3 hardware solutions
  • 1 open source project, 4 commercial products or
  • 4 companies wanting to make a profit from Hadoop without contributing back to the community


DataStax Hadoop on Cassandra Brisk Released

DataStax kept its promise and released Brisk: the Hadoop and Hive distribution using Cassandra, also known as Brangelina.

According to the official documentation, Brisk’s key advantages are:

  • No single point of failure
  • streamlined setup and operations
  • analytics without ETL
  • full integration with DataStax OpsCenter

Brisk Architecture



Amazon EC2 Cassandra Cluster with DataStax AMI

This AMI does the following:

  • installs Cassandra 0.7.4 on an Ubuntu 10.10 image
  • configures ephemeral disks in raid0, if applicable (EBS is a bad fit for Cassandra)
  • configures Cassandra to use the root volume for the commitlog and the ephemeral disks for data files
  • configures Cassandra to use the local interface for intra-cluster communication
  • configures all Cassandra nodes with the same seed for gossip discovery

Note the “EBS is a bad fit for Cassandra”. That’s what Adrian Cockcroft explains in Multi-tenancy and Cloud Storage Performance.
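
To make the layout in the list above concrete, here is a minimal sketch of the per-node settings it implies. Everything in it is an assumption for illustration: the IP addresses, the `/raid0` mount point, the directory paths, and the helper `node_settings` are hypothetical and not taken from the AMI, and the key names are only modeled on `cassandra.yaml` settings, so check them against the Cassandra version you run.

```python
# Hypothetical sketch of the settings described in the list above.
# Paths, addresses, and the helper name are illustrative assumptions,
# not the actual contents of the DataStax AMI.

def node_settings(private_ip, seed_ip, ephemeral_mount="/raid0"):
    """Return the settings one node would get under the described layout."""
    return {
        # Commit log stays on the root volume.
        "commitlog_directory": "/var/lib/cassandra/commitlog",
        # Data files go on the (striped) ephemeral disks.
        "data_file_directories": [f"{ephemeral_mount}/cassandra/data"],
        # Intra-cluster traffic uses the node's local interface.
        "listen_address": private_ip,
        # Every node points at the same seed for gossip discovery.
        "seeds": [seed_ip],
    }

# Three hypothetical private addresses; the first node acts as the seed.
nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
seed = nodes[0]
cluster = {ip: node_settings(ip, seed) for ip in nodes}

for ip, settings in cluster.items():
    print(ip, settings)
```

Keeping the commit log and the data files on separate devices means sequential commit-log writes do not compete with data-file I/O, which is the usual reason for this kind of split.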



Brisk: The Brangelina of Big Data

Now that’s a title: The Brangelina of Big Data: Cassandra mates with Hadoop. Open source celebrity supercouple. The article is a genealogy tree: Hadoop, Hive, Cassandra, DataStax.


Cassandra + Hadoop = Brisk by DataStax

I just heard the announcement that DataStax, the company offering Cassandra services, made about Brisk, a Hadoop and Hive distribution built on top of Cassandra:

Brisk provides integrated Hadoop MapReduce, Hive and job and task tracking capabilities, while providing an HDFS-compatible storage layer powered by Cassandra.

Brisk was officially announced during the MapReduce panel at the Structure Big Data event. But it looks like others have already had a chance to hear about Brisk. Is there something that I should be doing to hear the “unofficial” announcements?

DataStax has also made a whitepaper available for download: “Evolving Hadoop into a Low-Latency Data Infrastructure: Unifying Hadoop, Hive and Apache Cassandra for Real-time and Analytics”.


New Tools in the NoSQL and Big Data Market

DataStax OpsCenter for Apache Cassandra

DataStax (ex-Riptano) announced OpsCenter yesterday, their tool for managing, monitoring, and operating enterprise Cassandra applications, including sophisticated visualizations of the cluster and comprehensive management and configuration.

DataStax OpsCenter for Apache Cassandra will require a subscription, but a developer version, not to be used in production, will be made available too.

Call me an idealist, but I would have suggested a different model than Gold/Silver/Bronze or Mission-Critical/Premier:

  • 1-5 nodes: free (nb: good karma)
  • 6-low tens of nodes: moderately priced package
  • premier: everything else

EMC Greenplum Community Edition

After acquiring Greenplum[1], EMC is making available a community edition:

[…] the new EMC Greenplum Community Edition removes the cost barrier to entry for big data power tools empowering large numbers of developers, data scientists, and other data professionals. This free set of tools enables the community to not only better understand their data, gain deeper insights and better visualize insights, but to also contribute and participate in the development of next-generation tools and solutions. With the Community Edition stack, developers can build complex applications to collect, analyze and operationalize big data leveraging best of breed big data tools including the Greenplum Database with its in-database analytic processing capabilities.

I couldn’t find the details of the community edition license, but instead I’ve found this:

The software is only intended for research, development and experiments, with license purchases required for commercial uses.

You can read more about the (marketing) rationale behind this release on the blog of Chuck Hollis, Global Marketing CTO.

  1. Greenplum: a shared-nothing MPP architecture from the ground up for BI and analytical processing using commodity hardware


Riptano Becomes DataStax

But they lost the rhino logo:

Riptano DataStax old logo

I broke the news of Riptano’s creation in April. And this is the second renaming in the industry.
