Papers from VLDB 2012 are starting to surface. Authored by a Chinese team, the “Efficient Subgraph Matching on Billion Node Graphs” paper introduces a new algorithm optimized for large-scale graphs:
We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data.
The paper also includes a comparison of the space and time complexity of other subgraph matching algorithms.
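To make the exploration idea a bit more concrete, here is a minimal single-machine sketch of index-free subgraph matching by backtracking over adjacency lists. It only illustrates the general approach, not the paper's distributed algorithm; the graph representation and function names are my own assumptions.

```python
# Minimal sketch: match a small query graph against a data graph by
# expanding candidate vertices along edges, with no precomputed index.
# Illustration only, not the paper's distributed algorithm.
from collections import defaultdict

def subgraph_match(query_adj, data_adj):
    """Return all mappings {query vertex -> data vertex} that preserve edges.

    query_adj / data_adj: dict mapping a vertex to the set of its neighbors
    (undirected adjacency sets).
    """
    query_vertices = list(query_adj)
    results = []

    def extend(mapping):
        if len(mapping) == len(query_vertices):
            results.append(dict(mapping))
            return
        q = query_vertices[len(mapping)]          # next query vertex to bind
        # Explore only neighbors of already-matched vertices when possible,
        # otherwise fall back to scanning all data vertices.
        matched_neighbors = [mapping[p] for p in query_adj[q] if p in mapping]
        if matched_neighbors:
            candidates = set.intersection(*(data_adj[v] for v in matched_neighbors))
        else:
            candidates = set(data_adj)
        for d in candidates:
            if d in mapping.values():
                continue
            # Every already-bound query neighbor must map to a data neighbor.
            if all(mapping[p] in data_adj[d] for p in query_adj[q] if p in mapping):
                mapping[q] = d
                extend(mapping)
                del mapping[q]

    extend({})
    return results

# Tiny usage example: find all triangles in a 4-vertex data graph.
query = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
data = defaultdict(set)
for u, v in [(1, 2), (2, 3), (1, 3), (3, 4)]:
    data[u].add(v)
    data[v].add(u)
print(subgraph_match(query, dict(data)))
```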
Marko A. Rodriguez in Exploring Wikipedia with Gremlin Graph Traversals:
There are numerous ways in which Wikipedia can be represented as a graph. The articles and the href hyperlinks between them is one way. This type of graph is known as a single-relational graph because all the edges have the same meaning — a hyperlink. A more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph. For the purpose of this post, a middle ground representation is used. The vertices are Wikipedia articles and Wikipedia categories. The edges are hyperlinks between articles as well as taxonomical relations amongst the categories.
Imagine the richness of the model you’d achieve if every piece of data and metadata became a vertex or an edge. It’s not just the wealth of data but also the connectivity. Time would be the only missing dimension.
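As a rough illustration of that middle-ground model, here is a small sketch using networkx as a stand-in for a graph database. The vertex names, edge labels, and the article-to-category membership edge are my own assumptions.

```python
# Sketch of the "middle ground" Wikipedia model: article and category
# vertices, hyperlink edges between articles, and taxonomical edges among
# categories. Names and labels are illustrative, not from the original post.
import networkx as nx

g = nx.MultiDiGraph()

# Vertices: Wikipedia articles and Wikipedia categories.
g.add_node("article:Graph_database", kind="article")
g.add_node("article:NoSQL", kind="article")
g.add_node("category:Database_management_systems", kind="category")
g.add_node("category:Databases", kind="category")

# Edges: hyperlinks between articles ...
g.add_edge("article:Graph_database", "article:NoSQL", label="hyperlink")
# ... article-to-category membership (an assumed edge type) ...
g.add_edge("article:Graph_database", "category:Database_management_systems",
           label="in_category")
# ... and taxonomical relations amongst the categories.
g.add_edge("category:Database_management_systems", "category:Databases",
           label="subcategory_of")

# A simple multi-relational question: which categories is the article
# directly filed under?
for _, target, attrs in g.out_edges("article:Graph_database", data=True):
    if attrs["label"] == "in_category":
        print(target)
```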
Original title and link: The Richness of the Graph Model: The Sky Is the Limit ( ©myNoSQL)
Even though my first post about the Microsoft Research graph database Trinity dates back to March last year, I haven’t heard much about it since. Based on my tip, Klint Finley published an interesting speculation about Trinity, Dryad, Probase, and Bing. Since then, though, Microsoft has moved away from Dryad toward Hadoop, and I’m still not sure about the status of the Trinity project. But I have found a paper about the Trinity graph engine authored by Bin Shao, Haixun Wang, and Yatao Li. You can read it or download it after the break.
We introduce Trinity, a memory-based distributed database and computation platform that supports online query processing and offline analytics on graphs. Trinity leverages graph access patterns in online and offline computation to optimize the use of main memory and communication in order to deliver the best performance. With Trinity, we can perform efficient graph analytics on web-scale, billion-node graphs using dozens of commodity machines, while existing platforms such as MapReduce and Pregel require hundreds of machines. In this paper, we analyze several typical and important graph applications, including search in a social network, calculating Pagerank on a web graph, and sub-graph matching on web-scale graphs without using index, to demonstrate the strength of Trinity.
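To give a feel for the access pattern the abstract describes, here is a tiny single-process sketch: adjacency lists kept in a memory-resident key-value map and queried directly for a few-hop people search. This is my own illustration, not Trinity’s API or data model.

```python
# Adjacency lists held in an in-memory key-value map, standing in for a
# distributed memory store, queried for a bounded-depth "people search".
# Purely illustrative; not Trinity's actual interface.
from collections import deque

store = {
    1: [2, 3],
    2: [4],
    3: [4, 5],
    4: [6],
    5: [],
    6: [],
}

def within_hops(start, max_hops):
    """Return all vertices reachable from `start` in at most `max_hops` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in store.get(vertex, []):   # one key-value lookup per vertex
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

print(within_hops(1, 2))   # people within two hops of vertex 1 -> {2, 3, 4, 5}
```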
- Max de Marzi has lately been my favorite source for graph data visualization posts
- Even though the diagram looks amazing, I’m wondering if it would scale to larger data sets
- Even though I’ve given it some thought, I’m still not sure how graph databases can record historical relationships/the evolution of relationships in a graph. If you have any ideas I’d love to hear them; one possible modeling approach is sketched below.
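One way I can imagine modeling relationship history, offered purely as an assumption and not tied to any particular graph database, is to attach valid_from/valid_to properties to every edge so the graph can be queried as of a point in time:

```python
# Sketch: relationships as edges carrying validity intervals, queryable
# "as of" a date. Data and property names are illustrative assumptions.
from datetime import date

# (source, target, relation, valid_from, valid_to); valid_to=None means current
edges = [
    ("alice", "acme",    "works_for", date(2008, 1, 1),  date(2010, 6, 30)),
    ("alice", "initech", "works_for", date(2010, 7, 1),  None),
    ("alice", "bob",     "knows",     date(2009, 3, 15), None),
]

def edges_as_of(edges, when):
    """Relationships that were valid on the given date."""
    return [
        (src, dst, rel)
        for src, dst, rel, start, end in edges
        if start <= when and (end is None or when <= end)
    ]

print(edges_as_of(edges, date(2009, 6, 1)))   # alice->acme, alice->bob
print(edges_as_of(edges, date(2012, 1, 1)))   # alice->initech, alice->bob
```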
Original title and link: Neo4j and D3.js: Visualizing Connections Over Time ( ©myNoSQL)
Large-scale graph processing has been approached in a few different ways so far. Google’s graph processing framework Pregel, which has some major differences compared to MapReduce, is one of them. There are also some MapReduce implementations for graph processing. Last but not least, different approaches are being tried for scaling graph databases.
In 2010, Jimmy Lin and Michael Schatz published a paper on the subject of Design patterns for efficient graph algorithms in MapReduce (pdf):
Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the world wide web, module detection of protein-protein interaction networks, and privacy analysis of social networks. Many graphs of interest are difficult to analyze because of their large size, often spanning millions of vertices and billions of edges. As such, researchers have increasingly turned to distributed solutions. In particular, MapReduce has emerged as an enabling technology for large-scale graph processing. However, existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. In this paper, we present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
After the break you can find a video of Jimmy Lin talking about current best practices in designing large-scale graph algorithms in MapReduce and how to avoid some of the shortcomings, especially those related to partitioning, serializing, and distributing the graph. He shows three enhanced design patterns applicable to a large class of graph algorithms that address many of these deficiencies.
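For context, here is a toy single-machine simulation of the message-passing PageRank formulation those design patterns optimize: the map phase emits a rank contribution along each outgoing edge, and the reduce phase sums contributions per vertex. It illustrates only the baseline formulation, not the paper’s three patterns; the example graph and damping factor are illustrative.

```python
# One PageRank iteration written in a map/reduce message-passing style.
# Baseline formulation only; not the paper's enhanced design patterns.
from collections import defaultdict

DAMPING = 0.85

def pagerank_iteration(adjacency, ranks):
    """adjacency: vertex -> list of out-neighbors; ranks: vertex -> rank."""
    # "Map" phase: each vertex sends rank / out-degree to its neighbors.
    messages = defaultdict(list)
    for vertex, neighbors in adjacency.items():
        if neighbors:
            share = ranks[vertex] / len(neighbors)
            for neighbor in neighbors:
                messages[neighbor].append(share)

    # "Reduce" phase: each vertex sums its incoming contributions.
    n = len(adjacency)
    return {
        vertex: (1 - DAMPING) / n + DAMPING * sum(messages.get(vertex, []))
        for vertex in adjacency
    }

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 / len(graph) for v in graph}
for _ in range(20):
    ranks = pagerank_iteration(graph, ranks)
print(ranks)
```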