Graph Databases: The graph model and processing
A new must read article from Ricky Ho on graph databases in which he covers the basics of the graph model and some graph algorithms.
I found many of the graph algorithms follows a general processing pattern. There are multiple rounds of (sequential) processing iterations. Within each iteration, there are a set of active nodes that perform local processing in parallel.
The article refers to Neo4j and Gremlin, but doesn’t mention InfoGrid (note: there is a comment that adds some details to the article from the InfoGrid perspective though).
Neo4j provide a restricted, single-threaded graph traversal model
- At each round, the set of active nodes is always a single node
- The set of active nodes of next round is determined by the traversal policy (breath or depth-first), but is still a single node
- It offers a callback function to determine whether this node should be included in the result set
Gremlin, on the other hand, provides an interactive graph traversal model where user can step through each iteration. It uses an XPath like syntax to express the navigation.
- The node is expressed as Node(id, inE, outE, properties)
- The arc is expressed as Arc(id, type, inV, outV, properties)
The article also covers how graph algorithms (like topological sort, minimum spanning tree, single source shortest path) can benefit (or not) from a mapreduce implementations:
For example, graph algorithms with a breath-first search nature fits better into parallel computing paradigm with those that has a depth-first search nature. On the other hand, perform search at all nodes fits better in parallel computing than perform search at a particular start node.