ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

Efficient Large-Scale Graph Analysis with Hadoop

Michael Schatz[1]:

Otherwise, the main technical challenge of this design is the graph structure must be available at each step of the iterative algorithm, but in the design above we only distribute the mutable values (partial PageRank value, partial search path, etc). This challenge is normally resolved by encoding the graph structure and mutable values as tuples that are separately tagged to indicate if the record should be interpreted as a node or a special mutable value. This way the map function can read in a node as input, emit messages for neighboring nodes using the neighboring node ids as the keys, and also reemit the node tuple with the current node id as the key. Then, as usual, the shuffle phase collects key-value pairs with the same key, which effectively collects together a node tuple with all the messages destined for that node for every node in the graph. The reduce function then processes each node tuple with associated messages, computes an updated value, and saves away the updated node with the complete graph structure for the next round of computation.

I guess this is exactly the reason Google came up with Pregel, which even if somehow similar to MapReduce is optimized for graph processing. While we don’t have access (yet?) at Google’s implementation, there’s an attempt to build an open source version: Phoebus, Erlang-based implementation of Pregel.


  1. Michael Schatz is an assistant professor in the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory  ()

Original title and link: Efficient Large-Scale Graph Analysis with Hadoop (NoSQL databases © myNoSQL)

via: http://www.cloudera.com/blog/2010/11/do-the-schimmy-efficient-large-scale-graph-analysis-with-hadoop/