Efficient Large-Scale Graph Analysis with Hadoop
Michael Schatz[1]
:
Otherwise, the main technical challenge of this design is the graph structure must be available at each step of the iterative algorithm, but in the design above we only distribute the mutable values (partial PageRank value, partial search path, etc). This challenge is normally resolved by encoding the graph structure and mutable values as tuples that are separately tagged to indicate if the record should be interpreted as a node or a special mutable value. This way the map function can read in a node as input, emit messages for neighboring nodes using the neighboring node ids as the keys, and also reemit the node tuple with the current node id as the key. Then, as usual, the shuffle phase collects key-value pairs with the same key, which effectively collects together a node tuple with all the messages destined for that node for every node in the graph. The reduce function then processes each node tuple with associated messages, computes an updated value, and saves away the updated node with the complete graph structure for the next round of computation.
I guess this is exactly the reason Google came up with Pregel, which even if somehow similar to MapReduce is optimized for graph processing. While we don’t have access (yet?) at Google’s implementation, there’s an attempt to build an open source version: Phoebus, Erlang-based implementation of Pregel.
- Michael Schatz is an assistant professor in the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory (↩)
Original title and link: Efficient Large-Scale Graph Analysis with Hadoop (NoSQL databases © myNoSQL)