mapreduce: All content tagged as mapreduce in NoSQL databases and polyglot persistence
Wednesday, 1 February 2012
Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support
Starting today customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:
- Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
- Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
- Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.
That’s like free pr0n for operations teams.
On a different note, I’ve noticed that the Hadoop stack (Hadoop, Hive, Pig) on Amazon Elastic MapReduce is based on the second-to-last versions, which suggests that Amazon tests extensively before rolling out new versions:
- Hadoop: 0.20.205 (precursor of Hadoop 1.0.0); supports append and security, but doesn’t have RAID, symlinks, or MR2
- Hive: 0.7.1 (precursor of latest 0.8.0)
- Pig: 0.9.1 (precursor of latest 0.9.2)
Original title and link: Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support (©myNoSQL)
Friday, 27 January 2012
MapReduce With Hadoop: What Happens During Mapping
An interesting look at what happens during the map phase in Hadoop and the impact of emitting more key-value pairs than necessary:
- a direct negative impact on the map time and CPU usage, due to more serialization
- an indirect negative impact on CPU due to more spilling and additional deserialization in the combine step
- a direct impact on the map task, due to more intermediate files, which makes the final merge more expensive

The main point of the dynaTrace blog post is that even if Hadoop makes it easy to throw more hardware at a problem, wasting resources with bad code in MapReduce tasks comes with a noticeable and measurable cost.
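To make the cost of emitted pairs concrete, here is a minimal sketch in plain Python (not Hadoop code). It compares a mapper that emits one pair per token with one that aggregates inside the mapper first, a technique usually called in-mapper combining. The function names and the sample line are made up for illustration, and the dynaTrace post may recommend different fixes.

```python
from collections import Counter

def naive_map(line):
    """Emit one (word, 1) pair per token; each pair has to be serialized later."""
    return [(word, 1) for word in line.split()]

def combining_map(line):
    """Aggregate inside the mapper, then emit one pair per distinct word."""
    return list(Counter(line.split()).items())

def reduce_counts(pairs):
    """Sum the values per key, as a reducer would after the shuffle."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical input line, chosen only to show repeated keys.
line = "to be or not to be that is the question to be"
naive_pairs = naive_map(line)
combined_pairs = combining_map(line)

# Same final result, fewer intermediate pairs to serialize, spill, and merge.
print(len(naive_pairs), len(combined_pairs))
print(reduce_counts(naive_pairs) == reduce_counts(combined_pairs))
```

Both mappers lead to the same reducer output; the only difference is how many intermediate pairs travel through serialization and the spill files.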
via: http://blog.dynatrace.com/2012/01/25/about-the-performance-of-map-reduce-jobs/
Monday, 23 January 2012
MapR’s Map-Reduce Ready Distributed File System Patent Filing
Here’s the abstract of the patent filing submitted by MapR for a Map-Reduce Ready Distributed File System:
A map-reduce compatible distributed file system that consists of successive component layers that each provide the basis on which the next layer is built provides transactional read-write-update semantics with file chunk replication and huge file-create rates. A primitive storage layer (storage pools) knits together raw block stores and provides a storage mechanism for containers and transaction logs. Storage pools are manipulated by individual file servers. Containers provide the fundamental basis for data replication, relocation, and transactional updates. A container location database allows containers to be found among all file servers, as well as defining precedence among replicas of containers to organize transactional updates of container contents. Volumes facilitate control of data placement, creation of snapshots and mirrors, and retention of a variety of control and policy information. Key-value stores relate keys to data for such purposes as directories, container location maps, and offset maps in compressed files.
You can get the complete PDF from here.
Thursday, 19 January 2012
Pros and Cons of Using MapReduce With Distributed Key-Value Stores: HBase, Cassandra, Riak
Old Quora question with very good answers.
- (pro) can (potentially) query live data
- (pro) can (conceptually) be highly efficient at joining data sets that are identically sharded on the join key (the joins can be pushed down into the key-value store itself)
- (con) full scans (the most common pattern for map-reduce) are most likely to be much faster with raw file system access
- (con) because of the better decoupling of computation and storage in the GFS+Map-Reduce model - tolerating hot spots (resulting from MR jobs) is much easier
- (con) key-value stores are rarely arranged to have schemas optimized for analytics
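The join argument in the list above can be sketched in a few lines of Python. This is a toy model, not the API of HBase, Cassandra, or Riak: the partitioner `shard_of`, the shard count, and the sample records are all hypothetical. The point it tries to show is that when both data sets are sharded on the join key with the same function, each shard can be joined locally, with no shuffle between shards.

```python
def shard_of(key, num_shards):
    """Toy deterministic partitioner (an assumption; real stores use their own)."""
    return sum(key.encode("utf-8")) % num_shards

def partition(records, num_shards):
    """Place each (key, value) record on the shard chosen by the partitioner."""
    shards = [dict() for _ in range(num_shards)]
    for key, value in records:
        shards[shard_of(key, num_shards)].setdefault(key, []).append(value)
    return shards

def local_join(left_shard, right_shard):
    """Join two shards that are known to hold the same key range."""
    return [
        (key, left_value, right_value)
        for key, left_values in left_shard.items()
        if key in right_shard
        for left_value in left_values
        for right_value in right_shard[key]
    ]

def copartitioned_join(left, right, num_shards=3):
    """Join shard i of one data set only with shard i of the other."""
    joined = []
    for left_shard, right_shard in zip(partition(left, num_shards),
                                       partition(right, num_shards)):
        joined.extend(local_join(left_shard, right_shard))
    return sorted(joined)

# Hypothetical sample data.
users = [("alice", "NL"), ("bob", "US"), ("carol", "DE")]
orders = [("alice", "book"), ("bob", "pen"), ("alice", "lamp"), ("dave", "desk")]
joined_rows = copartitioned_join(users, orders)
print(joined_rows)
```

If the two data sets used different partitioners, matching keys could land on different shards and the per-shard join would silently miss rows, which is why the original answer stresses "identically sharded".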
Wednesday, 4 January 2012
Claim Chowder: Microsoft’s Dryad Technology to Take on Google’s MapReduce
In Dec. 2010, Joab Jackson writes for IDG News Service: Microsoft’s Dryad technology to take on Google’s MapReduce. Just 11 months later, in Nov. 2011, Doug Henschen writes for the same IDG News Service: Microsoft Ditches Dryad, Focuses On Hadoop - Software.
Nothing wrong with Microsoft’s decision. The same cannot be said, though, about the titles and articles published by the IDG News Service network.
Tuesday, 20 December 2011
Paper: The Efficiency of MapReduce in Parallel External Memory
For mathematically inclined MapReduce/Hadoop researchers, a paper (PDF) by Gero Greiner and Riko Jacob (Institute of Theoretical Computer Science):
In this, we present upper and lower bounds on the parallel I/O-complexity that are matching up to constant factors for the shuffle step. The shuffle step is the single communication phase where all information of one MapReduce invocation gets transferred from map workers to reduce workers.
[…] we can show that current implementations of the MapReduce model as a framework are almost optimal in the sense of worst-case asymptotic parallel I/O-complexity. This further yields a simple method to consider the external memory performance of an algorithm expressed in MapReduce.
On the one hand, this shows how much complexity can be “hidden” for an algorithm expressed in MapReduce compared to PEM. On the other hand, our results bound the worst-case performance loss of the MapReduce approach in terms of I/O-efficiency.
Monday, 19 December 2011
How to Run a MapReduce Job Against Common Crawl Data Using Amazon Elastic MapReduce
Steve Salevan’s 7-step guide to setting up, compiling, deploying, and running a basic MapReduce job.
When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.
Google published the paper. Yahoo backed the open source implementation, Hadoop. And Amazon is offering (unlimited) resources.
Update: In the Hacker News thread, the main question answered is which other corporations (besides the Internet companies) are using MapReduce. The answer is unfortunately extremely short: too many to enumerate them all.
Wednesday, 7 December 2011
Hadoop/MapReduce on Cassandra Using Ruby and REST
Brian O’Neill:
In an effort to make Hadoop/MapReduce on Cassandra more accessible, we added a REST layer to Virgil that allows you to run map reduce jobs written in Ruby against column families in Cassandra by simply posting the ruby script to a URL. This greatly reduces the skill set required to write and deploy the jobs, and allows users to rapidly develop analytics for data stored in Cassandra.
Smart: 10. Security: ?. Utility: 10.
via: http://brianoneill.blogspot.com/2011/12/hadoopmapreduce-on-cassandra-using-ruby.html
Tuesday, 6 December 2011
Combining Hadoop MapReduce and MPI for Terascale Learning
An attempt to combine MPI and Hadoop MapReduce to eliminate the drawbacks of each:
- MPI: The Allreduce function. The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers.
- MapReduce: Conceptual simplicity. One easy to understand function is enough.
- MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
- MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
- MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
- MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
- MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be walked around, but that’s annoying.
- MapReduce: Robustness. We don’t capture all the robustness of MapReduce which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures giving you something with >99% success rate on a running time reliable to within a factor of 2.
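The AllReduce contract described in the first item (n nodes each start with a number, all nodes end with the sum) can be simulated in a few lines of Python. This is a single-process sketch: the binary-tree layout and the function name `allreduce_sum` are assumptions made for illustration, and the quoted post does not say which topology the authors use.

```python
def allreduce_sum(values):
    """Simulate AllReduce: reduce up a binary tree, then broadcast the total down."""
    n = len(values)
    if n == 0:
        return []

    # Reduce phase: node i has parent (i - 1) // 2. Visiting nodes from last
    # to first guarantees a child's subtotal is complete before it is added
    # to its parent.
    partial = list(values)
    for node in range(n - 1, 0, -1):
        partial[(node - 1) // 2] += partial[node]

    # Broadcast phase: the root holds the global sum; each node copies it
    # from its parent, which always has a smaller index and is filled first.
    result = [0] * n
    result[0] = partial[0]
    for node in range(1, n):
        result[node] = result[(node - 1) // 2]
    return result

# Five simulated nodes, each starting with its own number.
print(allreduce_sum([1, 2, 3, 4, 5]))
```

In a learning setting the numbers would typically be vectors such as gradients, but the start and end states are the same as in this scalar version.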
Monday, 5 December 2011
MapReduce vs Parallel DBMS: Where Does Map Reduce Shine
From Jim Kaskade’s great post about MapReduce’s advantages:
One of the big attractive qualities of the MR programming model (and maybe its key attraction to the new generation of data scientists and application programmers) is its simplicity; an MR program consists of only two functions – Map and Reduce – written to process key/value data pairs. Therefore, the model is easy to use, even for programmers without experience with parallel and distributed systems.
It also hides the details of parallelization, fault-tolerance, locality optimization, and load balancing.
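As a rough illustration of the "only two functions" point, here is a single-process Python sketch of the model. The driver `run_mapreduce` stands in for everything a real framework hides (parallelization, fault tolerance, locality, load balancing); its name, and the word count job itself, are illustrative assumptions rather than any framework's API.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: turn one input record into zero or more (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: fold all values that share a key into the output records."""
    yield word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    """Toy driver: run the mapper, group values by key (shuffle), run the reducer."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results

# Hypothetical input: (offset, line) records, as a text input format might supply.
documents = [(0, "map reduce map"), (1, "reduce shuffle map")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

The programmer writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` is what the framework takes care of on a cluster.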
Looking for a Map Reduce Language
Java, Cascading, Pipes - C++, Hive, Pig, Rhipe, Dumbo, Cascalog… which one of these should you use for writing Map Reduce code?
Antonio Piccolboni takes them up for a test:
At the end of this by necessity incomplete and unscientific language and library comparison, there is a winner and there isn’t. There isn’t because language comparison is always multidimensional and subjective but also because the intended applications are very different. On the other hand, looking for a general purpose, moderately elegant, not necessarily most efficient, not necessarily mature language for exploration purposes, Rhipe seems to fit the bill pretty nicely.
via: http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html
Friday, 2 December 2011
Design Patterns for Efficient Graph Algorithms in MapReduce: Paper and Video
One of the most cited limitations of Hadoop is graph processing.
This problem has been approached in a few different ways until now. Google’s graph processing framework Pregel, which has some major differences compared to MapReduce, is one of them. There are also some MapReduce implementations for graph processing. Last but not least, different approaches are being tried for scaling graph databases.
In 2010, Jimmy Lin and Michael Schatz published a paper on the subject, Design patterns for efficient graph algorithms in MapReduce (pdf):
Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the world wide web, module detection of protein-protein interaction networks, and privacy analysis of social networks. Many graphs of interest are difficult to analyze because of their large size, often spanning millions of vertices and billions of edges. As such, researchers have increasingly turned to distributed solutions. In particular, MapReduce has emerged as an enabling technology for large-scale graph processing. However, existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. In this paper, we present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
After the break you can find a video of Jimmy Lin talking about current best practices in designing large-scale graph algorithms in MapReduce and how to avoid some of the shortcomings, especially those related to partitioning, serializing, and distributing the graph. He shows three enhanced design patterns applicable to a large class of graph algorithms that address many of these deficiencies.
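For readers who have not seen message passing expressed in MapReduce, here is a single-process Python sketch of one PageRank iteration in its basic form. It is the plain formulation, not the optimized design patterns from the paper. The damping factor, the three-node graph, and all function names are assumptions for illustration, and dangling nodes are not handled.

```python
from collections import defaultdict

DAMPING = 0.85  # conventional value, assumed here

def pagerank_map(node, state):
    """Pass the graph structure along and send rank mass to each neighbor."""
    rank, neighbors = state
    yield node, ("graph", neighbors)
    for neighbor in neighbors:
        yield neighbor, ("mass", rank / len(neighbors))

def pagerank_reduce(node, messages, num_nodes):
    """Recover the adjacency list and sum the incoming rank mass."""
    neighbors, incoming = [], 0.0
    for kind, payload in messages:
        if kind == "graph":
            neighbors = payload
        else:
            incoming += payload
    rank = (1 - DAMPING) / num_nodes + DAMPING * incoming
    return node, (rank, neighbors)

def pagerank_iteration(graph):
    """One map, shuffle, reduce round over the whole graph."""
    groups = defaultdict(list)
    for node, state in graph.items():
        for key, message in pagerank_map(node, state):
            groups[key].append(message)
    return dict(pagerank_reduce(node, msgs, len(graph))
                for node, msgs in groups.items())

# Hypothetical three-node graph: node -> (initial rank, outgoing links).
graph = {"a": (1 / 3, ["b", "c"]), "b": (1 / 3, ["c"]), "c": (1 / 3, ["a"])}
for _ in range(30):
    graph = pagerank_iteration(graph)
ranks = {node: rank for node, (rank, _) in graph.items()}
print(ranks)
```

Note that every iteration re-emits the adjacency lists through the shuffle alongside the rank messages; shipping the graph structure around like this is the kind of overhead the abstract refers to under serializing and distributing the graph.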