


MapR: All content tagged as MapR in NoSQL databases and polyglot persistence

8 Most Interesting Companies for Hadoop’s Future

Filtering and augmenting a Q&A on Quora:

  1. Cloudera: Hadoop distribution, Cloudera Enterprise, Services, Training
  2. Hortonworks: Apache Hadoop major contributions, Services, Training
  3. MapR: Hadoop distribution, Services, Training
  4. HPCC Systems: massive parallel-processing computing platform
  5. HStreaming: real-time data processing and analytics capabilities on top of Hadoop
  6. DataStax: DataStax Enterprise, Apache Cassandra based platform accepting real-time input from online applications, while offering analytic operations, powered by Hadoop
  7. Zettaset: Enterprise Data Analytics Suite built on Hadoop
  8. Hadapt: analytic platform based on Apache Hadoop and relational DBMS technology

I’ve left aside names like IBM, EMC, Informatica, which are doing a lot of integration work.

Original title and link: 8 Most Interesting Companies for Hadoop’s Future (NoSQL database©myNoSQL)

Hadoop Market Competition: comScore From Cloudera to MapR

Mike Brown (comScore CTO):

We could capitalize the purchase [of MapR] with an annual maintenance charge versus a yearly cost per node. NFS allowed our enterprise systems to easily access the data in the cluster.

Some interesting bits:

  • comScore runs a self-hosted Hadoop cluster of 1000+ nodes
  • comScore migrated from Cloudera to MapR in 2 days
    • the migration was accomplished by copying and reloading data
    • depending on the size of the stored data, a better approach would be a rolling migration
  • comScore uses MapR’s Direct Access NFS feature, which exposes Hadoop Distributed File System (HDFS) data as NFS files that can then be easily mounted, modified, or overwritten
  • comScore will continue to use Cloudera for training purposes
    • Question: what is the advantage of paying two providers and maintaining two different clusters?
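
To illustrate why NFS access matters for "enterprise systems", here is a minimal sketch of a tool that knows nothing about Hadoop and reads cluster data with ordinary file I/O. It assumes the cluster's data is already mounted somewhere on the local file system; the mount location, the file name `part-00000`, and the helper `count_lines` are hypothetical, and a temporary directory stands in for the mount so the example runs anywhere.

```python
import os
import tempfile


def count_lines(mount_dir):
    """Count lines in every regular file directly under mount_dir.

    Uses only standard file I/O: no Hadoop client library is involved,
    which is the point of exposing cluster data through NFS.
    """
    totals = {}
    for name in sorted(os.listdir(mount_dir)):
        path = os.path.join(mount_dir, name)
        if os.path.isfile(path):
            with open(path, "r", encoding="utf-8") as handle:
                totals[name] = sum(1 for _ in handle)
    return totals


# A temporary directory stands in for the (hypothetical) NFS mount point.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "part-00000"), "w", encoding="utf-8") as f:
        f.write("a\nb\nc\n")
    demo_counts = count_lines(demo_dir)
```

On a real deployment, the only change would be passing the actual mount directory to `count_lines`.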

As previewed by the Cloudera-Hortonworks exchanges, the competition on the Hadoop market is becoming fierce. But at least this story involves companies that are actively involved in innovating and improving Hadoop, not those that just want to monetize it.

Original title and link: Hadoop Market Competition: comScore From Cloudera to MapR (NoSQL database©myNoSQL)


Hadoop, HPCC, MapR and the TeraSort Benchmark

Just in, from LexisNexis:

HPCC Systems 4 nodes cluster sorts 100 gigabytes in 98 seconds and is 25% faster than a 20 nodes Hadoop cluster.

Results achieved in December 2011 show that an HPCC Systems four node Thor cluster took only 98 seconds to complete a Terasort with a job size of 100 gigabytes (GB) on a cluster five times smaller than Hadoop. The HPCC Systems four node cluster was comprised of one (1) Dell PowerEdge C6100 2U server with Intel® Xeon® processors E5675 series, 48GB of memory, and 6 x 146GB SAS HDD’s. The Dell C6100 houses four nodes inside the 2U enclosure. The previous leader ran the same Terasort benchmark in 130 seconds on a 20-node Hadoop cluster using equivalent node hardware. HPCC Systems is an Open Source, enterprise-proven Big Data analytics-processing platform.

Thus Armando Escalante (SVP and CTO of LexisNexis Risk Solutions and head of HPCC Systems) concludes:

These results demonstrate that HPCC Systems is a leader in Big Data processing

Now switching to a post on MapR’s blog:

Recently a world record was claimed for a Hadoop benchmark. […] We were surprised to see that this world record was for a TeraSort benchmark on a 100GB of data. TeraSort is a standard benchmark and the name is derived from “sorting a terabyte”.  Any record claims for sorting a 100GB dataset across a 20 node cluster with 10 times as much memory is comical. The test is named TeraSort not GigaSort.
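
For reference, the arithmetic behind the numbers quoted from LexisNexis can be sketched as follows. Only the figures come from the quotes above (100 GB, 98 seconds on 4 nodes, 130 seconds on 20 nodes); the helper `sort_stats` is made up for illustration. The calculation says nothing about whether a 100 GB run is a meaningful TeraSort result, which is exactly MapR’s objection.

```python
def sort_stats(data_gb, seconds, nodes):
    """Return (cluster throughput, per-node throughput) in GB/s."""
    cluster = data_gb / seconds
    return cluster, cluster / nodes


# Figures taken from the LexisNexis announcement quoted above.
hpcc_cluster, hpcc_per_node = sort_stats(100, 98, 4)
hadoop_cluster, hadoop_per_node = sort_stats(100, 130, 20)

# Relative reduction in wall-clock time: (130 - 98) / 130.
time_reduction = (130 - 98) / 130

# How much more data each HPCC node sorted per second than each Hadoop node.
per_node_ratio = hpcc_per_node / hadoop_per_node

print(f"time reduction: {time_reduction:.1%}")
print(f"per-node throughput ratio: {per_node_ratio:.1f}x")
```

The time reduction works out to roughly 24.6%, which matches the "25% faster" claim, while the per-node ratio is about 6.6x.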

Original title and link: Hadoop, HPCC, MapR and the TeraSort Benchmark (NoSQL database©myNoSQL)

Hortonworks Data Platform: Hortonworks’ Hadoop Distribution

The announcement came out today[1]:

Hortonworks Data Platform, powered by Apache Hadoop — As we began to interact with enterprises and ecosystem partners, the one constant was the need for a base distribution of Apache Hadoop that is 100% open source and that contains the essential components used with every Hadoop installation.  A distribution was needed to provide an easy to install, tightly integrated and well tested set of servers and tools. As we interacted with potential partners, we also heard the message loud and clear that they wanted open and secure APIs to easily integrate and extend Hadoop. We believe we have succeeded on both fronts. The Hortonworks Data Platform is such an open source distribution.  It is powered by Apache Hadoop and includes the essential Hadoop components, plus some that make it more manageable, open and extensible. Our distribution is based on Hadoop 0.20.205, the first Apache Hadoop release that supports security and HBase.  It also includes some new APIs, such as WebHDFS and those in Ambari and HCatalog, which will make it easy for our partners to integrate their products with Apache Hadoop. For those new to Ambari, it is an open source Apache project that will bring improved installation and management to Hadoop. HCatalog is a metadata management service for simplifying the sharing of data between Hadoop and other data systems. We are releasing Hortonworks Data Platform initially as a limited technology preview with plans to open it up to the public in early 2012.

The fight is on, even if the tone is still polite for now. And if we add MapR and LexisNexis’ HPCC to the mix, not to mention the armies of marketers and salespeople coming from Oracle, IBM, EMC, NetApp, etc., this actually smells like war.

Edward Ribeiro aptly commented: “This reminds me of Linux distros war circa 2001”.

  1. The emphasis in the text is mine to underline the most important aspects of the announcement.  

Original title and link: Hortonworks Data Platform: Hortonworks’ Hadoop Distribution (NoSQL database©myNoSQL)

Datameer Is the First BI/Analytics Platform Built Natively on Hadoop

Brian Smith (Datameer Regional Director of Sales):

DAS is an open book at every stage of the data pipeline, with plug and play support at each phase – integration, analysis and visualization. Under the covers, DAS generates Java/MapReduce code that runs natively on the Hadoop cluster. All current Hadoop distros are supported – we’re Switzerland when it comes to platform support for Apache, Cloudera, MapR, IBM and the rest, we run all of it in a browser on Windows, Mac and Linux.

As always I won’t comment on statements referring to “first” or “best”. But I find Brian Smith’s assessment of the Hadoop economics very accurate:

The economics are compelling — Hadoop is moving out costly analytic databases and warehouses, driving IT to re-look at ADBMS sales cycles, shifting IT dollars and vendor roadmaps, and generally wreaking havoc in the traditional vendor community. We’ve gone from one or two distributions to nine in the last year! And, literally every vendor in the BI/DBMS space has a Hadoop connector, the latest being the recent Oracle announcement. Everybody is on board this train — All this based upon the premise of unlimited scale and data variety at a fraction of traditional costs.  Technical challenges exist, but its clear that there’s a sea change.

Original title and link: Datameer Is the First BI/Analytics Platform Built Natively on Hadoop (NoSQL database©myNoSQL)


MapR Raises $20m for Friendly Hadoop Distribution

We’ve already heard of MapR’s positioning as a Cloudera competitor for providing a friendly Hadoop distribution. But until now it was just one of the many companies trying to get into the Big Data with Hadoop space.

MapR’s goal sounds very similar to Cloudera’s: providing a tweaked, enterprise-friendly Hadoop distribution. But this $20 million round of funding, which places it close to Cloudera’s last round of $25 million, comes as a confirmation of this business model from the VC space.

Here’s a quote from a VentureBeat post covering this news:

MapR’s software takes Hadoop and makes it safer and more efficient for enterprises to use. It does this providing support for Linux HA, support for random read/write storage (or the ability to overwrite data) and a native network framework system. The NFS was created by running Hadoop on top of MapR’s storage system, as opposed to Linux.


If we add to this Hortonworks, Yahoo’s spinoff focused on Hadoop, I think it’s quite safe to conclude that we will see a lot of interesting things happening faster in the Hadoop world.

Original title and link: MapR Raises $20m for Friendly Hadoop Distribution (NoSQL database©myNoSQL)

Petabyte-Scale Hadoop Clusters

Curt Monash quoting Omer Trajman (Cloudera) in a post counting petabyte-scale Hadoop deployments:

The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.

The first questions that popped into my head after reading it:

  1. How many deployments does DataStax’ Brisk have? How many are close to or over a petabyte?
  2. How many clients run EMC Greenplum HD, and how many are close to this scale?
  3. Same question about NetApp Hadoopler clients.
  4. Same question for MapR.

Answering these questions would give us a good overview of the Hadoop ecosystem.
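
The mean-versus-median remark in the quote above is easy to see with a small sketch. The cluster sizes below are invented purely for illustration and are not Cloudera’s data: most clusters are small, and two large ones pull the average above 200 nodes.

```python
from statistics import mean, median

# Hypothetical cluster sizes (in nodes); NOT real survey data.
cluster_sizes = [10, 15, 20, 25, 28, 30, 40, 60, 500, 1300]

mean_size = mean(cluster_sizes)      # dominated by the two large clusters
median_size = median(cluster_sizes)  # reflects the typical small cluster

print(f"mean: {mean_size} nodes, median: {median_size} nodes")
```

With these made-up numbers the mean is just over 200 nodes while the median stays below 30, the same shape the quote describes.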

Original title and link: Petabyte-Scale Hadoop Clusters (NoSQL database©myNoSQL)


EMC Partners with MapR for Greenplum HD Enterprise Edition

EMC plans to bring MapR’s proprietary replacement for the Hadoop Distributed File System to its enterprise-ready Apache Hadoop Greenplum HD:

Because MapR’s file system is more efficient than HDFS, users will achieve two to five times the performance over standard Hadoop nodes in a cluster, according to Schroeder. That translates into being able to use about half the number of nodes typically required in a cluster, he said.

“Hadoop nodes cost about $4,000 per node depending on configuration. If you add in power costs, HVAC, switching, and rackspace, you’ll probably double that,” Schroeder said. “Our product can immediately save you $4,000 and over 8 years it’ll save you $8000 per node.”
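
Schroeder’s reasoning can be sketched as a back-of-the-envelope calculation. The $4,000 hardware cost, the doubling for power, HVAC, switching, and rackspace, and the "about half the number of nodes" claim come from the quotes above; the 100-node baseline and the helper `cluster_cost` are assumptions made for illustration only.

```python
HARDWARE_COST = 4_000   # per node, as quoted by Schroeder
OVERHEAD_FACTOR = 2     # power, HVAC, switching, rackspace "double that"


def cluster_cost(nodes, include_overhead=True):
    """Estimated cluster cost in dollars under the quoted assumptions."""
    per_node = HARDWARE_COST * (OVERHEAD_FACTOR if include_overhead else 1)
    return nodes * per_node


baseline_nodes = 100                 # hypothetical cluster size
reduced_nodes = baseline_nodes // 2  # "about half the number of nodes"

# Savings on hardware alone, and savings once overhead is included.
hardware_savings = (cluster_cost(baseline_nodes, include_overhead=False)
                    - cluster_cost(reduced_nodes, include_overhead=False))
total_savings = cluster_cost(baseline_nodes) - cluster_cost(reduced_nodes)

print(f"hardware savings: ${hardware_savings:,}")
print(f"savings including overhead: ${total_savings:,}")
```

The sketch takes the vendor’s figures at face value; it ignores licensing costs for the proprietary file system, which would reduce the savings.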

In terms of what MapR is bringing to the table, the article mentions MapR’s improvements to Apache Hadoop:

  • multiple channels to data through NFS protocol
  • a re-architected NameNode for high availability
  • eliminated single points of failure and automated jobs failover
  • data mirroring and snapshot capabilities
  • wide area replication

Filing this under the announced section of the Hadoop-related solutions list.

Original title and link: EMC Partners with MapR for Greenplum HD Enterprise Edition (NoSQL databases © myNoSQL)


Mapr: a Competitor to Hadoop Leader Cloudera

They are said to be building a proprietary replacement for the Hadoop Distributed File System that’s allegedly three times faster than the current open-source version. It comes with snapshots and no NameNode single point of failure (SPOF), and is supposed to be API-compatible with HDFS, so it can be a drop-in replacement.

Where can one get the MapR product?

Considering Yahoo is now focusing on Apache Hadoop and its plans for the next-generation Hadoop MapReduce, I wouldn’t hold my breath for MapR improvements.

Original title and link: Mapr: a Competitor to Hadoop Leader Cloudera (NoSQL databases © myNoSQL)