


MapR: All content tagged as MapR in NoSQL databases and polyglot persistence

MapR Claims Title as De Facto Standard for Hadoop

Maureen O’Gara:

The champagne has been flowing over at MapR since Google announced the integration of its Distribution for Hadoop with Google Compute Engine, the start-up’s second big win in a row.

Indeed, MapR on Amazon Elastic MapReduce and Google Compute Engine are two very important events in the life of MapR and for the Hadoop ecosystem in general. But there’s still a long way from these to being a de facto standard.

Original title and link: MapR Claims Title as De Facto Standard for Hadoop (NoSQL database©myNoSQL)


Cloudera or MapR for Hadoop Distribution?

A couple of links covering various aspects of this question:

  1. Quora thread covering this subject
  2. Joe Stein’s Hadoop distribution bake-off and my experience with Cloudera and MapR
  3. How I’d choose a Hadoop distribution
  4. MapR claims title as de facto standard for Hadoop

If you have other good references answering the question of what Hadoop distribution to choose please leave a comment.

Original title and link: Cloudera or MapR for Hadoop Distribution? (NoSQL database©myNoSQL)

The Hadoop Ecosystem Relationships

Excellent infographic about the relationships in the Hadoop market created with Datameer:


A while ago I created a Google Spreadsheet in which I tried to track all these relationships, but going through PR announcements wasn’t really my thing. Now there’s a CSV file with all this data.

Original title and link: The Hadoop Ecosystem Relationships (NoSQL database©myNoSQL)


MapR Hadoop Distribution on Amazon Elastic MapReduce

Another very interesting piece of news for the Hadoop space, this time coming from Amazon and MapR, which announced support for the MapR Hadoop distribution on Amazon Elastic MapReduce:

MapR introduces enterprise-focused features for Hadoop such as high availability, data snapshotting, cluster mirroring across AZs, and NFS mounts. Combined with Amazon Elastic MapReduce’s managed Hadoop environment, seamless integration with other AWS services, and hourly pricing with no upfront fees or long-term commitments, Amazon EMR with the MapR Distribution for Hadoop offers customers a powerful tool for generating insights from their data.

Following the logic of the Amazon Relational Database Service, which started with MySQL, the most popular open source database, and then added support for the commercial but also very popular Oracle and SQL Server, what does this announcement tell us? Either Amazon has received a lot of requests for MapR, or some very big AWS customers have mentioned MapR in their talks with Amazon. I go with the second option.

Original title and link: MapR Hadoop Distribution on Amazon Elastic MapReduce (NoSQL database©myNoSQL)

Pricing for Hadoop Support: Cloudera, Hortonworks, MapR

Found the following bits in a post on The Register by Timothy Prickett Morgan:

While Cloudera and MapR are charging $4,000 per node for their enterprise-class Hadoop distributions (including their proprietary extensions and tech support), Hortonworks doesn’t have any proprietary extensions and is living off of the support contracts for the HDP 1.0 stack. […] Hortonworks is not providing its full list price, but for a starter ten-node cluster, you can get a standard support contract for $12,000 per year.

Hortonworks’s pricing looks a bit aggressive, but this could be explained by the fact that Hortonworks Data Platform 1.0 was made available only this week.
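To put the quoted numbers side by side, here is a minimal sketch of the per-node arithmetic. It assumes the $4,000 per node figure is also a yearly price, which the quote does not state, and the variable names are mine.

```python
# Figures quoted from The Register post above.
ENTERPRISE_PER_NODE_USD = 4_000     # Cloudera / MapR, per node (assumed to be per year)
HORTONWORKS_STARTER_USD = 12_000    # Hortonworks standard support, per year
HORTONWORKS_STARTER_NODES = 10      # size of the starter cluster


def per_node_cost(total_usd, nodes):
    """Return the cost of a support contract divided evenly across nodes."""
    return total_usd / nodes


hortonworks_per_node = per_node_cost(HORTONWORKS_STARTER_USD, HORTONWORKS_STARTER_NODES)
ratio = ENTERPRISE_PER_NODE_USD / hortonworks_per_node

print(f"Hortonworks starter contract: ${hortonworks_per_node:,.0f} per node")
print(f"Cloudera/MapR list price is {ratio:.1f}x that amount per node")
```

Under that assumption the starter contract works out to $1,200 per node, which is what makes the pricing look aggressive.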

For running Hadoop in the cloud, there’s also Amazon Elastic MapReduce, whose pricing has always been clear. And Amazon has recently announced support for the MapR Hadoop distribution on Elastic MapReduce.

Original title and link: Pricing for Hadoop Support: Cloudera, Hortonworks, MapR (NoSQL database©myNoSQL)

Looking to Stay Ahead of Hortonworks and MapR in the Hadoop Market, Cloudera Delivers High Availability, Better Security, and Easier System Management

Compare the title, which is the subtitle of the InformationWeek post, with this paragraph which reflects the reality:

Both Cloudera and Hortonworks will be distributing open source software from Apache’s Hadoop 2.3 release, which includes upgrades aimed at high-availability and improved security. The release includes a hot-failover for the NameNode (metadata server) of the Hadoop Distributed File System (HDFS), which has long been a single point of failure.

Cloudera is indeed one of the biggest Hadoop contributors and a company that has helped a lot in proving, and thus popularizing, Hadoop through its packaging of open source Hadoop ecosystem components paired with its management tool (Cloudera Manager). But NameNode high availability and security improvements are part of the Apache Hadoop source code.

Original title and link: Looking to Stay Ahead of Hortonworks and MapR in the Hadoop Market, Cloudera Delivers High Availability, Better Security, and Easier System Management (NoSQL database©myNoSQL)


Big Data Market Analysis: Vendors Revenue and Forecasts

I think this is the first extensive Big Data report I’ve read that includes relevant and fairly exhaustive data about the majority of players in the Big Data market, plus some captivating forecasts.

As of early 2012, the Big Data market stands at just over $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged CAGR of 58% between now and 2017.
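As a rough check of what a 58% CAGR implies, here is a small sketch of the compound-growth formula. It assumes a starting point of $5.1 billion (the quote only says "just over $5 billion") and five years of growth from 2012 to 2017. The function name is mine, not the report's.

```python
def project(base, cagr, years):
    """Compound a base value at a constant annual growth rate."""
    return base * (1 + cagr) ** years


BASE_2012_BN = 5.1   # assumed starting market size, in billions of USD
CAGR = 0.58          # growth rate quoted in the report
YEARS = 5            # 2012 -> 2017

projected_2017_bn = project(BASE_2012_BN, CAGR, YEARS)
print(f"Projected 2017 market: about ${projected_2017_bn:.1f} billion")
```

With these inputs the market would grow almost tenfold, to roughly $50 billion by 2017.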

2011 Big Data Pure-Play Vendors Yearly Big Data Revenue

While there are many stories behind these numbers and many things to think about, here is what I’ve jotted down while studying the report:

  • it’s no surprise that “megavendors” (IBM, HP, etc.) account for the largest part of today’s Big Data market revenue
  • still, the revenue ratio of pure-players vs megavendors feels quite unbalanced: $311mil out of $5.1bil
    • the pure-player category includes: Vertica, Aster Data, Splunk, Greenplum, 1010data, Cloudera, Think Big Analytics, MapR, Digital Reasoning, Datameer, Hortonworks, DataStax, HPCC Systems, Karmasphere
    • there are a couple of names that position themselves in the Big Data market that do not show up anywhere (e.g. 10gen, Couchbase)
  • this could lead to the conclusion that the companies that include hardware in their offer benefit from larger revenues
    • I’m wondering though what the margin is in the hardware market segment. While not having any data at hand, I think I’ve read reports about HP and Dell not doing so well precisely because of lower margins
    • see bullet point further down about revenue by hardware, software, and services
  • this could explain why so many companies are trying their hand at appliances
  • by looking at the various numbers you can see that those selling appliances usually have a large corporation behind them, supporting the production costs for hardware and probably the cost of the sales force
  • in the Big Data revenue by vendor you can find quite a few well-known names from the consulting segment
  • the revenue by type pie lists services as accounting for 44%, hardware for 31%, and software for 13% which might give an idea of what makes up the megavendors’ sales packages
    • most of the NoSQL database companies and Hadoop companies are in the software and services segment
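To make the pure-play vs megavendor imbalance from the notes above concrete, here is a quick sketch of the share calculation using the two figures I jotted down ($311 million out of $5.1 billion). The variable names are illustrative only.

```python
PURE_PLAY_REVENUE_MUSD = 311      # pure-play vendors, in millions of USD
TOTAL_MARKET_MUSD = 5_100         # total Big Data market, in millions of USD

pure_play_share = PURE_PLAY_REVENUE_MUSD / TOTAL_MARKET_MUSD
everyone_else_musd = TOTAL_MARKET_MUSD - PURE_PLAY_REVENUE_MUSD

print(f"Pure-play share of the market: {pure_play_share:.1%}")
print(f"Revenue going to everyone else: ${everyone_else_musd:,} million")
```

That puts the pure-play vendors at about 6% of the total.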

Great job done by the Wikibon team.

Original title and link: Big Data Market Analysis: Vendors Revenue and Forecasts (NoSQL database©myNoSQL)


Hadoop Namenode High Availability Merged to HDFS Trunk

As I’m slowly recovering after a severe poisoning that I initially ignored but that finally put me in bed for almost a week, I’m going to post some of the most interesting articles I’ve read while resting.

Hadoop Namenode’s single point of failure has always been mentioned as one of the weaknesses of Hadoop and also as a differentiator of other Hadoop-based commercial offerings. But now the Namenode HA branch has been merged into trunk, and while it will take a couple of cycles to complete the tests, this will soon become part of the Hadoop distribution.

Here’s Jitendra Pandey’s announcement on Hortonworks’s blog:

Significant enhancements were completed to make HOT Failover work:

  • Configuration changes for HA
  • Notion of active and standby states were added to the Namenode
  • Client-side redirection
  • Standby processing journal from Active
  • Dual block reports to Active and Standby
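To illustrate the idea behind active/standby states and client-side redirection from the list above, here is a toy simulation. It is only a sketch of the general pattern and does not use or mirror the actual HDFS classes or configuration. All names in it (`NameNode`, `FailoverClient`, `StandbyError`) are hypothetical.

```python
class StandbyError(Exception):
    """Raised by a node that is not currently in the active state."""


class NameNode:
    """Toy metadata server that only serves requests while active."""

    def __init__(self, name, active=False):
        self.name = name
        self.active = active

    def handle(self, request):
        if not self.active:
            raise StandbyError(self.name)
        return f"{self.name} served {request}"


class FailoverClient:
    """Client that remembers the last working node and redirects on failure."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.current = 0

    def call(self, request):
        # Try each configured node at most once, starting with the last good one.
        for _ in range(len(self.nodes)):
            node = self.nodes[self.current]
            try:
                return node.handle(request)
            except StandbyError:
                # Redirect to the next configured node.
                self.current = (self.current + 1) % len(self.nodes)
        raise RuntimeError("no active NameNode found")


nn1 = NameNode("nn1", active=True)
nn2 = NameNode("nn2")
client = FailoverClient([nn1, nn2])

first = client.call("ls /")
nn1.active, nn2.active = False, True   # simulate a failover
second = client.call("ls /")
print(first)
print(second)
```

The client never needs to be told which node is active. It simply retries against the other configured node when the one it knows about reports that it is in standby.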

As argued in a follow-up post to Gartner’s article Apache Hadoop 1.0 Doesn’t Clear Up Trunks and Branches Questions. Do Distributions?, the advantage of using custom distributions will slowly vanish, and the open source version will be the one you’ll want to have in production.

Original title and link: Hadoop Namenode High Availability Merged to HDFS Trunk (NoSQL database©myNoSQL)

5 Top Misconceptions about Big Data and Hadoop

The MapR team analyzes the top 5 misconceptions in the Big Data/Hadoop market:

  1. Big Data is not simply about massive amounts of data — petabytes and beyond. Big Data represents a paradigm shift.
  2. Since Hadoop is a funny name and somewhat new to people they assume it must be risky.
  3. Another misconception about Hadoop, is that it is a batch process.
  4. Perhaps the biggest misconception is that Hadoop is a single, monolithic, component.
  5. With respect to open source, the question about a distribution is not a simple binary “open” or “closed”.

The first 4 points are indeed how things are seen from the outside.

While I do understand the nuance introduced by the last point, which allows them to plug MapR, things are black and white: it is either open source or not. But that’s just one dimension of the various components of the Hadoop stack. What really matters is how well a component integrates with the rest of the stack. The questions to be asked are: does it maintain the same interfaces? what’s the cost of replacing it? does it allow using a 3rd party component? does it force me to get special components or hardware?

Original title and link: 5 Top Misconceptions about Big Data and Hadoop (NoSQL database©myNoSQL)


12 Hadoop Vendors to Watch in 2012

My list of the 8 most interesting companies for the future of Hadoop didn’t try to include everyone having a product with the word Hadoop in it. But the list from InformationWeek does. To save you 15 clicks, here’s their list:

  • Amazon Elastic MapReduce
  • Cloudera
  • Datameer
  • EMC (with EMC Greenplum Unified Analytics Platform and EMC Data Computing Appliance)
  • Hadapt
  • Hortonworks
  • IBM (InfoSphere BigInsights)
  • Informatica (for HParser)
  • Karmasphere
  • MapR
  • Microsoft
  • Oracle

Original title and link: 12 Hadoop Vendors to Watch in 2012 (NoSQL database©myNoSQL)

MapR’s Map-Reduce Ready Distributed File System Patent Filing

Here’s the abstract of the patent filing submitted by MapR for a Map-Reduce Ready Distributed File System:

A map-reduce compatible distributed file system that consists of successive component layers that each provide the basis on which the next layer is built provides transactional read-write-update semantics with file chunk replication and huge file-create rates. A primitive storage layer (storage pools) knits together raw block stores and provides a storage mechanism for containers and transaction logs. Storage pools are manipulated by individual file servers. Containers provide the fundamental basis for data replication, relocation, and transactional updates. A container location database allows containers to be found among all file servers, as well as defining precedence among replicas of containers to organize transactional updates of container contents. Volumes facilitate control of data placement, creation of snapshots and mirrors, and retention of a variety of control and policy information. Key-value stores relate keys to data for such purposes as directories, container location maps, and offset maps in compressed files.

You can get the complete PDF from here.

Original title and link: MapR’s Map-Reduce Ready Distributed File System Patent Filing (NoSQL database©myNoSQL)

Partnerships in the Hadoop Market

Just a quick recap:

Amazon doesn’t partner with anyone for its Amazon Elastic MapReduce. And IBM is walking alone with the software-only InfoSphere BigInsights.

Original title and link: Partnerships in the Hadoop Market (NoSQL database©myNoSQL)