


Elastic MapReduce: All content tagged as Elastic MapReduce in NoSQL databases and polyglot persistence

Using Elastic MapReduce as a generic Hadoop cluster manager

Steve McPherson for the AWS Blog:

Despite the name Elastic MapReduce, the service goes far beyond batch-oriented processing. Clusters in EMR have a flexible and rich cluster-management framework that users can customize to run any Hadoop ecosystem application such as low-latency query engines like HBase (with Phoenix), Impala, Spark/Shark and machine learning frameworks like Mahout. These additional components can be installed using Bootstrap Actions or Steps.
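For concreteness, here is a hedged sketch of what launching such a customized cluster might look like through boto3; the release label, instance types, bucket, and script path are hypothetical placeholders, not values from the post:

```python
# Sketch: an EMR cluster with a bootstrap action that installs an extra
# ecosystem component. All names and values below are illustrative.

def build_cluster_request(name, script_s3_path, instance_count=3):
    """Build the keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",          # hypothetical release
        "Applications": [{"Name": "Spark"}, {"Name": "HBase"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": instance_count,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Bootstrap actions run on every node before Hadoop starts,
        # which is how extra components get installed.
        "BootstrapActions": [{
            "Name": "install-extra-tools",
            "ScriptBootstrapAction": {"Path": script_s3_path},
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

if __name__ == "__main__":
    req = build_cluster_request("analytics", "s3://example-bucket/install.sh")
    # To actually launch: boto3.client("emr").run_job_flow(**req)
    print(req["BootstrapActions"][0]["ScriptBootstrapAction"]["Path"])
```

The request is built as plain data so it can be inspected or tested before being handed to `run_job_flow`.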

Operational simplicity is critical in the early days of many companies, when large hardware investments and time matter most. Amazon is building a huge data ecosystem to convince its users to stay even afterwards (the more data you put in, the more difficult it is to move it out later).

Original title and link: Using Elastic MapReduce as a generic Hadoop cluster manager (NoSQL database©myNoSQL)


Amazon Web Services Annual Revenue Estimation

Over the weekend, Christopher Mims published an article in which he derives a figure for Amazon Web Services’ annual revenue: $2.4 billion:

Amazon is famously reticent about sales figures, dribbling out clues without revealing actual numbers. But it appears the company has left enough hints to, finally, discern how much revenue it makes on its cloud computing business, known as Amazon Web Services, which provides the backbone for a growing portion of the internet: about $2.4 billion a year.

There’s no way to decompose this number into the revenue of each AWS solution. For the data space, I’d be interested in:

  1. S3 revenues. This is the space Basho’s Riak CS competes in.

    After writing my first post about Riak CS, I learned that in Japan, where Riak CS runs Yahoo!’s new cloud storage, Gemini Mobile Technologies has been offering local ISPs a similar S3-like service built on top of Cassandra.

  2. Redshift is pretty new, and while I’m not aware of immediate competitors (what am I missing?), I don’t think it accounts for a significant part of this revenue, even though some of the early users, like Airbnb, report getting very good performance and costs from it.

    Redshift is powered by ParAccel, which, over the weekend, was acquired by Actian.

  3. Amazon Elastic MapReduce. This is another interesting space, one Microsoft wants a share of with its Azure HDInsight, developed in collaboration with Hortonworks.

    In this space there’s also the MapR and Google Compute Engine combination, which seems to be extremely performant.

  4. Interestingly, Amazon is also making money from some of the competitors of its DynamoDB and RDS services. That’s the advantage of owning the infrastructure.

Original title and link: Amazon Web Services Annual Revenue Estimation (NoSQL database©myNoSQL)

The Architecture of a Credit Card Analysis Platform: Using Project Voldemort, Elastic MapReduce, Pangool

Ivan de Prado and Pere Ferrera:

The solution we developed has an infrastructure cost of just a few thousand dollars per month, thanks to the use of the cloud (AWS), Hadoop, and Voldemort.


This is one of the few projects outside LinkedIn that I know of that uses Project Voldemort. Also of note: the Voldemort backend storage is configured to use BerkeleyDB.

Original title and link: The Architecture of a Credit Card Analysis Platform: Using Project Voldemort, Elastic MapReduce, Pangool (NoSQL database©myNoSQL)


MapR Hadoop Distribution on Amazon Elastic MapReduce

More interesting news for the Hadoop space, this time from Amazon and MapR, announcing support for the MapR Hadoop distribution on Amazon Elastic MapReduce:

MapR introduces enterprise-focused features for Hadoop such as high availability, data snapshotting, cluster mirroring across AZs, and NFS mounts. Combined with Amazon Elastic MapReduce’s managed Hadoop environment, seamless integration with other AWS services, and hourly pricing with no upfront fees or long-term commitments, Amazon EMR with the MapR Distribution for Hadoop offers customers a powerful tool for generating insights from their data.

Following the logic of the Amazon Relational Database Service, which started with MySQL, the most popular open source database, and then added support for the commercial but also very popular Oracle and SQL Server, what does this announcement tell us? Either Amazon has received a lot of requests for MapR, or some very big AWS customers have mentioned MapR in their talks with Amazon. I’d go with the second option.

Original title and link: MapR Hadoop Distribution on Amazon Elastic MapReduce (NoSQL database©myNoSQL)

Notes on the Hadoop and HBase Markets

Curt Monash shares what he heard from his customers:

  • Over half of Cloudera’s customers (nb 100 subscription customers) use HBase
  • Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
  • There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.

Original title and link: Notes on the Hadoop and HBase Markets (NoSQL database©myNoSQL)


What Are the Pros and Cons of Running Cloudera’s Distribution for Hadoop vs Amazon Elastic MapReduce Service?

Old Quora question, but still very relevant. Top response from Jeff Hammerbacher:

Elastic MapReduce Pros:

  • Dynamic MapReduce cluster sizing.
  • Ease of use for simple jobs via their proprietary web console.
  • Great documentation.
  • Integrates nicely with other Amazon Web Services.

Cloudera Distribution for Hadoop:

  • CDH is open source; you have access to the source code and can inspect it for debugging purposes and make modifications as required.
  • CDH can be run on a number of public or private clouds using an open source framework, Whirr, so you’re not tied to a single cloud provider.
  • With CDH, you can move your cluster to dedicated hardware with little disruption when the economics make sense. Most non-trivial applications will benefit from this move.
  • CDH packages a number of open source projects that are not included with EMR: Sqoop, Flume, HBase, Oozie, ZooKeeper, Avro, and Hue. You have access to the complete platform composed of data collection, storage, and processing tools.
  • CDH packages a number of critical bug fixes and features and the most recent stable releases, so you’re usually using a more stable and feature-rich product.
  • You can purchase support and management tools for CDH via Cloudera Enterprise.
  • CDH uses the open source Oozie framework for workflow management. EMR implemented a proprietary “job flow” system before major Hadoop users standardized on Oozie for workload management.
  • CDH uses the open source Hue framework for its user interface. If you require new features from your web interface, you can easily implement them using the Hue SDK.
  • CDH includes a number of integrations with other software components of the data management stack, including Talend, Informatica, Netezza, Teradata, Greenplum, Microstrategy, and others. […]
  • CDH has been designed and deployed in common Linux environments and you can use standard tools to debug your programs. […]

Make sure you also read Hadoop in the Cloud: Pros and Cons which addresses (almost) the same question.

A Twitter-style answer to this question would be: “Control and customization vs Automated and Managed Service”. 80 characters left to add your own perspective.

Original title and link: What Are the Pros and Cons of Running Cloudera’s Distribution for Hadoop vs Amazon Elastic MapReduce Service? (NoSQL database©myNoSQL)

A Guide to Elastic MapReduce and Hadoop Streaming for Astrophysicists

Arfon Smith[1]:

A couple of months ago I wrote about how the astrophysics community should place more value on those individuals building tools for their community - the informaticians. One example of a tool that I don’t think is particularly well known in many areas of research is the Apache Hadoop software framework.

Hadoop is a great tool but it can be fiddly to configure. With Elastic MapReduce you can focus on the design of your map/reduce workflow rather than figuring out how to get your cluster set up. Next I’m planning on making some small changes to software used by radio astronomers to find astrophysical sources in data cubes of the sky to make it work with Hadoop Streaming - bring it on SKA!
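Hadoop Streaming lets any executable that reads stdin and writes stdout act as a mapper or reducer. A minimal word-count pair in Python might look like the sketch below (this is an illustration of the mechanism, not the astronomy software from the post):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit a tab-separated (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per word; Hadoop sorts mapper output by key, so
    identical keys arrive as consecutive lines."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    output = mapper(sys.stdin) if stage == "map" else reducer(sys.stdin)
    for record in output:
        print(record)
```

On EMR the same file would be passed as both the `-mapper` and `-reducer` arguments to the hadoop-streaming jar (with different stage arguments), and `-input`/`-output` would point at S3 paths.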

Clearly, Hadoop has issues. Meanwhile, it helps local communities plan for snow removal and geophysicists find oil in the oceans, and who knows how many other similarly “problematic” implementations are out there.

Peter Skomoroch

  1. Arfon Smith is Director of Citizen Science at The Adler Planetarium where I build citizen science projects for The Zooniverse 

Original title and link: A Guide to Elastic MapReduce and Hadoop Streaming for Astrophysicists (NoSQL database©myNoSQL)


Polyglot persistence at Pinterest: Redis, Membase, MySQL

Pinterest architecture

I’ve created the diagram above based on this very brief answer on Quora:

We use python + heavily-modified Django at the application layer.  Tornado and (very selectively) node.js as web-servers.  Memcached and membase / redis for object- and logical-caching, respectively.  RabbitMQ as a message queue.  Nginx, HAproxy and Varnish for static-delivery and load-balancing.  Persistent data storage using MySQL.  MrJob on EMR for map-reduce.
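The object-caching versus logical-caching split in the quote is essentially a cache-aside pattern in front of MySQL. A toy sketch follows, with a plain dict standing in for memcached and another for the database; Pinterest’s actual code is not public, so every name here is invented:

```python
import json

cache = {}  # stands in for memcached/membase in this sketch

def fetch_user(user_id, db):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = f"user:{user_id}"
    if key in cache:
        return json.loads(cache[key])
    row = db[user_id]             # stand-in for a MySQL SELECT
    cache[key] = json.dumps(row)  # populate the cache for the next reader
    return row

def update_user(user_id, row, db):
    """Write the database, then invalidate the cached copy so the next
    read repopulates it with fresh data."""
    db[user_id] = row
    cache.pop(f"user:{user_id}", None)
```

The invalidate-on-write choice (rather than updating the cache in place) avoids stale entries when two writers race, at the cost of one extra database read after each update.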

Data from October 2011 showed Pinterest having over 3 million users generating 400+ million pageviews. There are plenty of questions to be answered though:

  1. What is node.js used for? What is RabbitMQ used for?

    Note: the whole section in the diagram about node.js and RabbitMQ is speculative.

  2. Is Amazon Elastic MapReduce used only for clickstream analysis (log-based analysis), or for more than that?

  3. How is data loaded into the Amazon cloud?

    Note: if Amazon Elastic MapReduce is used only for analyzing logs, these are probably uploaded regularly to Amazon S3.

  4. Why the need for both Redis and Membase?

Original title and link: Polyglot persistence at Pinterest: Redis, Membase, MySQL (NoSQL database©myNoSQL)

Get them by the data

Gavin Clarke and Chris Mellor about AWS Storage Gateway:

Once you’ve got them by the data, of course, their hearts and minds will follow, and Amazon’s using the AWS Storage Gateway beta as a sampler for the rest of its compute cloud.

The Storage Gateway is another piece, alongside S3, DynamoDB, SimpleDB, and Elastic MapReduce, in Amazon’s great strategic puzzle of a complete polyglot platform.

Original title and link: Get them by the data (NoSQL database©myNoSQL)


Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support

Starting today, customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:

  • Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
  • Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
  • Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.
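Since the metrics land in CloudWatch, they can also be pulled programmatically. A sketch of the request one might build for boto3’s `get_metric_statistics`, using the `IsIdle` health metric mentioned above (the job flow id is a placeholder):

```python
from datetime import datetime, timedelta

def isidle_metric_request(jobflow_id, hours=1):
    """Build parameters for CloudWatch get_metric_statistics to check
    whether an EMR job flow has been idle. EMR metrics arrive every
    five minutes, hence the 300-second period."""
    now = datetime.utcnow()
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": jobflow_id}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,
        "Statistics": ["Average"],
    }

# To actually query:
# boto3.client("cloudwatch").get_metric_statistics(**isidle_metric_request("j-XXXX"))
```

An average of 1.0 over the window would mean the cluster has done no work at all, a cheap signal for automatically shutting down forgotten job flows.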

That’s like free pr0n for operations teams.

On a different note, I’ve noticed that the Hadoop stack (Hadoop, Hive, Pig) on Amazon Elastic MapReduce is based on second-to-last versions, which suggests that extensive testing is performed on Amazon’s side before new versions are rolled out.

Original title and link: Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support (NoSQL database©myNoSQL)


Using Amazon Elastic MapReduce With DynamoDB: NoSQL Tutorials

Adam Gray[1]:

In this article, I’ll demonstrate how EMR can be used to efficiently export DynamoDB tables to S3, import S3 data into DynamoDB, and perform sophisticated queries across tables stored in both DynamoDB and other storage services such as S3.
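The export/import flow described in the quote hinges on Hive external tables backed by EMR’s DynamoDB storage handler. As a sketch, a small helper that generates such a DDL statement might look like this (the table and column names are invented for illustration):

```python
def dynamodb_external_table_ddl(hive_table, dynamo_table, columns):
    """Build the Hive DDL that maps a DynamoDB table into Hive on EMR.

    `columns` maps each Hive column name to a (hive_type, dynamodb_attribute)
    pair, which becomes the dynamodb.column.mapping property.
    """
    cols = ", ".join(f"{name} {typ}" for name, (typ, _) in columns.items())
    mapping = ",".join(f"{name}:{attr}" for name, (_, attr) in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({cols}) "
        "STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' "
        f"TBLPROPERTIES ('dynamodb.table.name' = '{dynamo_table}', "
        f"'dynamodb.column.mapping' = '{mapping}')"
    )
```

Once such a table exists, an `INSERT OVERWRITE` from it into an S3-backed Hive table performs the export, and the reverse direction performs the import.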

If you put together Amazon S3, Amazon DynamoDB, Amazon RDS, and Amazon Elastic MapReduce, you have a complete polyglot persistence solution in the cloud[2].

  1. Adam Gray is Product Manager on the Elastic MapReduce Team  

  2. Complete in the sense of core building blocks.  

Original title and link: Using Amazon Elastic MapReduce With DynamoDB: NoSQL Tutorials (NoSQL database©myNoSQL)


12 Hadoop Vendors to Watch in 2012

My list of the 8 most interesting companies for the future of Hadoop didn’t try to include every company with a Hadoop-branded product. The list from InformationWeek does. To save you 15 clicks, here it is:

  • Amazon Elastic MapReduce
  • Cloudera
  • Datameer
  • EMC (with EMC Greenplum Unified Analytics Platform and EMC Data Computing Appliance)
  • Hadapt
  • Hortonworks
  • IBM (InfoSphere BigInsights)
  • Informatica (for HParser)
  • Karmasphere
  • MapR
  • Microsoft
  • Oracle

Original title and link: 12 Hadoop Vendors to Watch in 2012 (NoSQL database©myNoSQL)