mapreduce: All content on NoSQL databases and projects about mapreduce, featuring the best daily NoSQL articles, news, and links on mapreduce
Thursday, 2 September 2010
Riak: Sort by with MapReduce ☞
Alexander Sicular:
The focus of this post is to show you how to do the equivalent of the sql “SORT BY date DESC” using Riak’s map/reduce interface. Due to Riak’s schemaless, document focused nature Riak lacks internal indexing and by extension, native sorting capabilities.
Complete code included (and embedded below):
A couple of links you’ll probably find useful before/after reading the article:
- Riak has improved the fetching of keys in a bucket, that making MapReduce on buckets directly not so expensive
- Even if there are some saying MapReduce is complicated, take a look at how to translate SQL to MapReduce or this MapReduce explanation in simple terms
- A complete guide to MapReduce with Riak
Original title and link for this post: Riak: Sort by with MapReduce (published on the NoSQL blog: myNoSQL)
CouchDB and MongoDB: Querying ☞
Andrew Glover:
Both MongoDB and CouchDB are document-oriented datastores. They both work with JSON documents. They both are usually thrown into the NoSQL bucket. They’re both hip. But that’s where the similarities, for the most part, stop.
When it comes to queries, both couldn’t be any more different.
They differ even in the implementation and behavior of MapReduce.
Original title and link for this post: CouchDB and MongoDB: Quering (published on the NoSQL blog: myNoSQL)
Tuesday, 24 August 2010
MapReduce with MongoDB and Python ☞
Complete example of using MongoDB MapReduce with PyMongo:
In this post, I’ll present a demonstration of a map-reduce example with MongoDB and server side JavaScript. Based on the fact that I’ve been working with this technology recently, I thought it would be useful to present here a simple example of how it works and how to integrate with Python.
Original title and link for this post: MapReduce with MongoDB and Python (published on the NoSQL blog: myNoSQL)
Saturday, 21 August 2010
Riak Map/Reduce Queries in Clojure ☞
Over this week I’ve been working on a proof of concept to see if it’s possible to use Clojure as the map/reduce language for Riak, in the same way now we can use Javascript and Erlang for that purpose. To accomplish that I needed a way to call Clojure code from Erlang. So I set up a very simple server in Clojure that runs as an Erlang node using Closerl.
Theoretically nice… practically I’d say there is a fundamental problem with this idea (different than the ones listed in the article). Map and reduce functions are supposed to run on the nodes hosting the data[1]
. If you need to wire this data is like implementing mapreduce on your application so the data locality property is lost. Not to mention that adding another variable to the equation (the JVM) your distributed system will become more sensible to failures.
- As mentioned in this question about Riak MapReduce, currently Riak runs only the map functions on all nodes, while reduce function is run on the node receiving the request. (↩)
Original title and link for this post: Riak Map/Reduce Queries in Clojure (published on the NoSQL blog: myNoSQL)
Thursday, 19 August 2010
Getting Started with Hadoop ☞
Good intro material about Hadoop (and a bit of Hive):
One design pattern that both Google and Facebook share is the ability to distribute computations among large clusters of machines that all share a common data source. The pattern is called Map/Reduce, and Hadoop is an open source implementation of this. This article is an introduction to Hadoop. Even if you donʼt currently have a massive scaling issue, it can be worthwhile to become familiar with Map/Reduce as a concept, and playing with Hadoop is a good way to do that.
If you are new to map/reduce and Hadoop, keep also in mind that many NoSQL databases — Riak, CouchDB, MongoDB to name a few — are able to run natively map/reduce jobs.
Getting Started with Hadoop originally posted on the NoSQL blog: myNoSQL
Wednesday, 18 August 2010
Howl: Unifying Metadata Layer for Hive and Pig ☞
Yet another contribution from Yahoo!:
Common metadata layer for Hadoop’s Map Reduce, Pig, and Hive
Howl: Unifying Metadata Layer for Hive and Pig originally posted on the NoSQL blog: myNoSQL
Thursday, 12 August 2010
Hadoop: The Problem of Many Small Files ☞
On why storing small files in HDFS is inefficient and how to solve this issue using Hadoop Archive:
When there are many small files stored in the system, these small files occupy a large portion of the namespace. As a consequence, the disk space is underutilized because of the namespace limitation. In one of our production clusters, there are 57 millions files of sizes less than 128 MB, which means that these files contain only one block. These small files use up 95% of the namespace but only occupy 30% of the disk space.
Hadoop: The Problem of Many Small Files originally posted on the NoSQL blog: myNoSQL
Wednesday, 11 August 2010
Big Data and the Need for New Approaches to Data Integration ☞
I’d say Dave Linthicum got some things wrongly:
First is the ability to manage large data sets more efficiently than with traditional relational technology as done in the past. The methodology is to leverage an approach called MapReduce.
MapReduce is about processing data, but you got to store that data first.
The “Map” portion of MapReduce is the master node that accepts the request and divides it among any number of worker nodes. The “Reduce” portion means that the master node considers the results from the worker nodes and combines them to determine the answer to the request. The power of this architecture is the simplistic nature of MapReduce, meaning it’s both easy to understand and to implement.
???
It is clear to me that using the cloud’s ability to provide massive amounts of commodity computing power, on-demand, when combined with a database architecture that will exploit that power means data processing power on scales we have never seen at these low price points.
This is still something I’m not yet convinced of. Processing in the cloud is indeed a good option. But data must be available on the cloud. And in the case of big data either storing it or moving it to the cloud doesn’t seem to be the best alternative.
Big Data and the Need for New Approaches to Data Integration originally posted on the NoSQL blog: myNoSQL
Friday, 23 July 2010
Tutorial: MapReduce with Riak
While we’ve talked in the past about Riak and MapReduce support and Sean Cribbs’s Riak tutorial is covering it too, the following video covers exclusively MapReduce with Riak.
Enjoy the video and slides:
Wednesday, 21 July 2010
Comparing Pregel and MapReduce ☞
Following his post on graph processing, Ricky Ho explains the major difference between Pregel and MapReduce applied to graph processing:
Since Pregel model retain worker state (the same worker is responsible for the same set of nodes) across iteration, the graph can be loaded in memory once and reuse across iterations. This will reduce I/O overhead as there is no need to read and write to disk at each iteration. For fault resilience, there will be a periodic check point where every worker write their in-memory state to disk.
Also, Pregel (with its stateful characteristic), only send local computed result (but not the graph structure) over the network, which implies the minimal bandwidth consumption.
If you need to summarize that even further it is basically:
- reducing I/O as much as possible
- ensuring data locality
Saturday, 10 July 2010
Cloudera Adds HBase to CDH ☞
Cloudera talks about the addition of HBase to the Cloudera’s Distribution of Hadoop announced during the Hadoop summit:
Analysis of continuously updated data: With data access methods available for all major languages, it’s simple to interface data-generating applications like web crawlers, log collectors, or web applications to write into HBase. For example, the next generation of the Nutch web crawler stores its data in HBase. Once the data generators insert the data, HBase enables MapReduce analysis on either the latest data or a snapshot at any recent timestamp.
User-facing analytics: These user-facing data applications rely on the ability not just to compute the models, but also to make the computed data available for latency-sensitive lookup operations. For these applications, it’s simple to integrate HBase as the destination for a MapReduce job. Extremely efficient incremental and bulk loading features allow the operational data to be updated while simultaneously serving traffic to latency-sensitive workloads. Compared with alternative data stores, the tight integration with other Hadoop projects as well as the consolidation of infrastructure are tangible benefits.
While it may look like the main reason for the HBase addition is its perfect integration with Hadoop, other NoSQL databases like Cassandra and Hypertable are also on improving their integration with tools like Hive, Hadoop, etc..
Thursday, 8 July 2010
Hadoop and HBase Status Updates after Hadoop Summit
As you can expect after such a large summit, there are tons of updates coming in.
For now I’ve selected two, but if you find others as interesting please share them with us.
James Hamilton using a colleague’s report:
Key Takeaways
- Yahoo and Facebook operate the world largest Hadoop clusters, 4,000/2,300 nodes with 70/40 petabytes respectively. They run full cluster replicas to assure availability and data durability.
- Yahoo released Hadoop security features with Kerberos integration which is most useful for long running multitenant Hadoop clusters.
- Cloudera released paid enterprise version of Hadoop with cluster management tools and several dB connectors and announced support for Hadoop security.
- Amazon Elastic MapReduce announced expand/shrink cluster functionality and paid support.
- Many Hadoop users use the service in conjunction with NoSQL DBs like Hbase or Cassandra.
Tim Sells has an extensive report on HBase status:
The next version will be 0.90. It will be a reliability release, but also includes performance gains. The version change will break from hadoop version numbers. 0.90 was chosen as there’s a belief it is maturing towards a 1.0 release.
The main points I picked up are:
- New batch importing allows writing hfiles directly and then just telling hbase where they are.
- Taking advantage of appends in hdfs for genuine durability.
- The namenode single point of failure is being addressed, facebook is planning to release their HA namenode.
- Replication between clusters. Allows cross data center replication. Eventually consistent.
- Tighter integration with zookeeper through a master rewrite.
- Significant work to have less temperamental behaviour during compaction and splits.
- Facebook are planning to release their distribution of hadoop and their highly available namenode.

