Design Patterns for Efficient Graph Algorithms in MapReduce: Paper and Video
One of the most cited limitations of Hadoop is graph processing.
This problem has been approached in a few different ways until now. Google’s graph processing framework Pregel, which has some major differences compared to MapReduce, is one of them. There are also some MapReduce implementations for graph processing. Last, but not least different approaches are being tried for scaling graph databases.
Jimmy Lin and Michael Schatz have published in 2010 a paper on the subject of Design patterns for efficient graph algorithms in MapReduce (pdf):
Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the world wide web, module detection of protein-protein interaction networks, and privacy analysis of social networks. Many graphs of interest are difficult to analyze because of their large size, often spanning millions of vertices and billions of edges. As such, researchers have increasingly turned to distributed solutions. In particular, MapReduce has emerged as an enabling technology for large-scale graph processing. However, existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serial- izing, and distributing the graph. In this paper, we present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
After the break you can find a video of Jimmy Lin talking about current best practices in designing large-scale graph algorithms in MapReduce and how to avoid some of the shortcomings, especially those related to partitioning, serializing, and distributing the graph. He shows three enhanced design patterns applicable to large class of graph algorithms that address many of these deficiencies.
There’s also a slidedeck of Jimmy Lin and Michael Schatz’s presentation at Hadoop Summit (PDF).
Original title and link: Design Patterns for Efficient Graph Algorithms in MapReduce: Paper and Video (©myNoSQL)