graph processing: All content tagged as graph processing in NoSQL databases and polyglot persistence
Friday, 18 January 2013
On Graph Computing: Practical Applications and Graph Computing Technologies
Marko A. Rodriguez in a must-read, must-bookmark, must-print article about graphs, graph processing, their applicability, and related technologies:
The concept of a graph has been around since the dawn of mechanical computing and for many decades prior in the domain of pure mathematics. Due in large part to this golden age of databases, graphs are becoming increasingly popular in software engineering. Graph databases provide a way to persist and process graph data. However, the graph database is not the only way in which graphs can be stored and analyzed. Graph computing has a history prior to the use of graph databases and has a future that is not necessarily entangled with typical database concerns. There are numerous graph technologies that each have their respective benefits and drawbacks. Leveraging the right technology at the right time is required for effective graph computing.
Original title and link: On Graph Computing: Practical Applications and Graph Computing Technologies ( ©myNoSQL)
via: http://markorodriguez.com/2013/01/09/ongraphcomputing/
Thursday, 3 January 2013
Detecting Cycles in a Network Graph With MapReduce
We’ll maintain two types of graph data:
 A set of link files where each line represents an edge in the graph. These files will be static.
 A set of path files where each line represents a path from one node to another. These files will be updated in each round of map/reduce.
The general idea is to grow the path files until either no path can grow further or a path contains a cycle. We’ll use two global flags to keep track of that: a “change” flag to record whether a new path was discovered, and a “cycle” flag to record whether a cycle was detected.
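The round-based growing idea above can be sketched in plain Python, simulating the map/reduce rounds in a single process. This is a sketch of the idea only, not the post's actual Hadoop code; the edge list and function names are illustrative.

```python
def grow_paths(edges, paths):
    """One 'round': try to extend every known path by one edge.

    Returns the new path set plus the two flags the post describes:
    'changed' (a new path was discovered) and 'cycle' (a cycle was found).
    """
    new_paths, changed, cycle = set(paths), False, False
    for path in paths:
        tail = path[-1]
        for (src, dst) in edges:
            if src == tail:
                if dst in path:          # revisiting a node => cycle detected
                    cycle = True
                else:
                    extended = path + (dst,)
                    if extended not in new_paths:
                        new_paths.add(extended)
                        changed = True   # a new path was discovered
    return new_paths, changed, cycle

def has_cycle(edges):
    # Seed the path set with every edge, then grow until a cycle is found
    # or no path can grow further (the 'change' flag stays false).
    paths = {(s, d) for (s, d) in edges}
    while True:
        paths, changed, cycle = grow_paths(edges, paths)
        if cycle:
            return True
        if not changed:
            return False

print(has_cycle([("a", "b"), ("b", "c")]))              # False
print(has_cycle([("a", "b"), ("b", "c"), ("c", "a")]))  # True
```

The inefficiency questioned below is visible here: every round re-scans all accumulated paths against the full edge list.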
Isn’t this approach very inefficient (by not being able to account for or reuse already processed paths)?
Original title and link: Detecting Cycles in a Network Graph With MapReduce ( ©myNoSQL)
via: http://horicky.blogspot.com/2013/01/mapreducedetectingcyclesinnetwork.html
Thursday, 30 August 2012
Paper: Efficient Subgraph Matching on Billion Node Graphs
Papers from VLDB 2012 are starting to surface. Authored by a Chinese team, the “Efficient Subgraph Matching on Billion Node Graphs” paper introduces a new algorithm optimized for large-scale graphs:
We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on superlinear indices, we use efficient graph exploration and massively parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data.
The paper also includes a comparison of the space and time complexity of other subgraph matching algorithms.
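To make the "graph exploration instead of indices" idea concrete, here is a minimal brute-force sketch: recursively extend partial vertex mappings and keep only those that preserve every pattern edge. This illustrates the general technique only; it is not the paper's algorithm, and the pattern/graph data is made up.

```python
def match(pattern, graph, mapping=None):
    """Yield injective mappings of pattern vertices onto graph vertices
    that preserve every directed pattern edge.

    pattern, graph: adjacency dicts, vertex -> list of out-neighbors.
    """
    mapping = mapping or {}
    if len(mapping) == len(pattern):
        yield dict(mapping)
        return
    v = next(p for p in pattern if p not in mapping)   # next unmapped vertex
    for candidate in graph:
        if candidate in mapping.values():              # keep mapping injective
            continue
        mapping[v] = candidate
        # every pattern edge whose endpoints are both mapped must exist
        if all(mapping[b] in graph[mapping[a]]
               for a in pattern
               for b in pattern[a]
               if a in mapping and b in mapping):
            yield from match(pattern, graph, mapping)  # explore further
        del mapping[v]                                 # backtrack

# pattern: a directed triangle; the data graph contains exactly one
pattern = {"x": ["y"], "y": ["z"], "z": ["x"]}
graph = {1: [2], 2: [3], 3: [1, 4], 4: []}
print(list(match(pattern, graph)))  # 3 rotations of the same triangle
```

A real system running at billion-node scale needs candidate pruning and parallel exploration rather than this exhaustive backtracking, which is precisely what the paper addresses.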
Tuesday, 20 March 2012
The Richness of the Graph Model: The Sky Is the Limit
Marko A. Rodriguez in Exploring Wikipedia with Gremlin Graph Traversals:
There are numerous ways in which Wikipedia can be represented as a graph. The articles and the href hyperlinks between them is one way. This type of graph is known as a single-relational graph because all the edges have the same meaning — a hyperlink. A more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices,” live in particular “city-vertices,” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph. For the purpose of this post, a middle ground representation is used. The vertices are Wikipedia articles and Wikipedia categories. The edges are hyperlinks between articles as well as taxonomical relations amongst the categories.
Imagine the richness of the model you’d achieve if every piece of data and metadata became a vertex or an edge. It’s not just the wealth of data but also the connectivity. Time would be the only missing dimension.
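A multi-relational graph like the one described can be sketched with labeled edge triples; the post itself uses Gremlin, so this plain-Python representation and its vertex/edge labels are purely illustrative.

```python
# Each edge is (source, relation, target): mixing hyperlinks between
# articles with taxonomical relations among categories, as in the post.
edges = [
    ("Graph theory", "links_to", "Leonhard Euler"),   # article -> article
    ("Graph theory", "in_category", "Mathematics"),   # article -> category
    ("Mathematics", "subcategory_of", "Science"),     # category taxonomy
]

def out_neighbors(graph, vertex, label=None):
    """Follow outgoing edges, optionally restricted to one relation type."""
    return [dst for (src, rel, dst) in graph
            if src == vertex and (label is None or rel == label)]

print(out_neighbors(edges, "Graph theory"))                 # both edge types
print(out_neighbors(edges, "Graph theory", "in_category"))  # category only
```

The point of the multi-relational model shows up in the `label` parameter: a single traversal step can mean "hyperlink", "taxonomy", or both, depending on which relation you filter on.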
Original title and link: The Richness of the Graph Model: The Sky Is the Limit ( ©myNoSQL)
Friday, 16 March 2012
Neo4j and the Java Universal Network/Graph Framework
Max De Marzi¹:
In the world of graph databases, one such stock room is the Java Universal Network/Graph Framework (JUNG) which contains a cache of algorithms from graph theory, data mining, and social network analysis, such as routines for clustering, decomposition, optimization, random graph generation, statistical analysis, and calculation of network distances, flows, and importance measures (centrality, PageRank, HITS, etc.).
Update: there’s a second part, in which De Marzi looks into visualizing graphs with Node Quilt:

I’ve already told you that Max De Marzi became my favorite read on graph database subjects. There’s only one thing I don’t like, but it’s not his content. ↩
Original title and link: Neo4j and the Java Universal Network/Graph Framework ( ©myNoSQL)
via: http://architects.dzone.com/articles/howimplementjavauniversal
Thursday, 15 March 2012
Using Graph Theory to Predict Basketball Teams Rankings
A directed network is simply a collection of nodes (representing teams) connected by arrows, called directed edges. Every time a team defeated another, an arrow was drawn from the losing team’s node to the winning team’s node to represent that game.
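One way to turn such a loser-to-winner graph into rankings is a PageRank-style power iteration, so that score flows from losing teams toward the teams that beat them. This is a sketch of that general idea, not necessarily the exact method the post uses; the game data is made up.

```python
def rank(games, damping=0.85, iters=50):
    """Rank teams from (loser, winner) game results via power iteration."""
    teams = sorted({t for g in games for t in g})
    score = {t: 1.0 / len(teams) for t in teams}
    losses = {t: sum(1 for (l, _) in games if l == t) for t in teams}
    for _ in range(iters):
        new = {t: (1 - damping) / len(teams) for t in teams}
        # each losing team sends its score mass along its loss-arrows
        for (loser, winner) in games:
            new[winner] += damping * score[loser] / losses[loser]
        # undefeated teams have no outgoing arrows; they keep their mass
        for t in teams:
            if losses[t] == 0:
                new[t] += damping * score[t]
        score = new
    return sorted(score, key=score.get, reverse=True)

# (loser, winner) pairs: B and C lost to A, C also lost to B
print(rank([("B", "A"), ("C", "A"), ("C", "B")]))  # ['A', 'B', 'C']
```

The appeal over raw win counts is that beating a highly ranked team transfers more score than beating a weak one, which is exactly what the arrow structure encodes.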
Basketball and beers.
Original title and link: Using Graph Theory to Predict Basketball Teams Rankings ( ©myNoSQL)
via: http://blog.biophysengr.net/2012/03/eigenbracket2012usinggraphtheoryto.html
Wednesday, 14 March 2012
NoSQL Paper: The Trinity Graph Engine
Even if my first post about the Microsoft Research graph database Trinity dates back to March last year, I haven’t heard much about it since. Based on my tip, Klint Finley published an interesting speculation about Trinity, Dryad, Probase, and Bing. Since then, though, Microsoft has moved away from Dryad to Hadoop, and I’m still not sure about the status of the Trinity project. But I have found a paper about the Trinity graph engine authored by Bin Shao, Haixun Wang, and Yatao Li. You can read it or download it after the break.
We introduce Trinity, a memory-based distributed database and computation platform that supports online query processing and offline analytics on graphs. Trinity leverages graph access patterns in online and offline computation to optimize the use of main memory and communication in order to deliver the best performance. With Trinity, we can perform efficient graph analytics on web-scale, billion-node graphs using dozens of commodity machines, while existing platforms such as MapReduce and Pregel require hundreds of machines. In this paper, we analyze several typical and important graph applications, including search in a social network, calculating PageRank on a web graph, and subgraph matching on web-scale graphs without using an index, to demonstrate the strength of Trinity.
Friday, 9 March 2012
Neo4j and D3.js: Visualizing Connections Over Time
Another great graph data visualization using Neo4j and D3.js from Max De Marzi:
 Max de Marzi is lately my favorite source for graph data visualization posts
 Even if the diagram looks amazing I’m wondering if it would scale for larger data sets
 Even if I gave it some thought, I’m still not sure how graph databases can record historical relationships/the evolution of relationships in a graph. If you have any ideas, I’d love to hear them.
Original title and link: Neo4j and D3.js: Visualizing Connections Over Time ( ©myNoSQL)
Thursday, 8 March 2012
Big Graph-Processing Library From Twitter: Cassovary
Cassovary is designed from the ground up to efficiently handle graphs with billions of edges. It comes with some common node and graph data structures and traversal algorithms. A typical usage is to do large-scale graph mining and analysis.
If you are reading this you’ve most probably heard of Pregel—if you didn’t then you should check out the Pregel: a system for large-scale graph processing paper and then how Pregel and MapReduce compare—and also the 6 Pregel-inspired frameworks.
The Cassovary project page introduces it as:
Cassovary is a simple “big graph” processing library for the JVM. Most JVM-hosted graph libraries are flexible but not space efficient. Cassovary is designed from the ground up to first be able to efficiently handle graphs with billions of nodes and edges. A typical example usage is to do large-scale graph mining and analysis of a big network. Cassovary is written in Scala and can be used with any JVM-hosted language. It comes with some common data structures and algorithms.
I’m not sure yet if:
 Cassovary works with any graphy data source or requires FlockDB—which is more of a persisted graph than a graph database
 Cassovary is inspired by Pregel in any way, or if it’s addressing a limited problem space (similar to FlockDB)
Update: Pankaj Gupta helped clarify the first question (and probably part of the second too):
At Twitter we use flockdb as our realtime graphdb, and export daily for use in cassovary, but any store could be used.
Original title and link: Big Graph-Processing Library From Twitter: Cassovary ( ©myNoSQL)
via: http://engineering.twitter.com/2012/03/cassovarybiggraphprocessinglibrary.html
Monday, 6 February 2012
Calculating a Graph's Degree Distribution Using R MapReduce over Hadoop
Marko Rodriguez is experimenting with R on Hadoop, and one of his exercises is calculating a graph’s degree distribution. I confess I had to use Wikipedia to remind myself of the definition of a node’s degree:
 The degree of a node in a network (sometimes referred to incorrectly as the connectivity) is the number of connections or edges the node has to other nodes. The degree distribution P(k) of a network is then defined to be the fraction of nodes in the network with degree k.
 The degree distribution is very important in studying both real networks, such as the Internet and social networks, and theoretical networks.
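The definition above translates directly into a few lines of code. The post itself uses R MapReduce over Hadoop; this single-process Python version, with a made-up edge list, is just to make the definition concrete.

```python
from collections import Counter

def degree_distribution(edges):
    """P(k): the fraction of nodes in the network with degree k,
    for an undirected edge list."""
    degree = Counter()
    for (a, b) in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(degree)                       # number of nodes seen
    counts = Counter(degree.values())     # how many nodes have each degree
    return {k: counts[k] / n for k in sorted(counts)}

# a 'star' graph: one hub (node 0) connected to three leaves
print(degree_distribution([(0, 1), (0, 2), (0, 3)]))  # {1: 0.75, 3: 0.25}
```

At MapReduce scale the two counting passes become two map/reduce jobs: one counting each node's edges, a second counting nodes per degree value.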
As an imagination exercise, think of a graph database that actively maintains an internal degree distribution and uses it to make suggestions or to partition the graph. Would that work?
Original title and link: Calculating a Graph’s Degree Distribution Using R MapReduce over Hadoop ( ©myNoSQL)
via: http://groups.google.com/group/gremlinusers/browse_thread/thread/db50a72f92a26e06
Friday, 2 December 2011
Design Patterns for Efficient Graph Algorithms in MapReduce: Paper and Video
One of the most cited limitations of Hadoop is graph processing.
This problem has been approached in a few different ways so far. Google’s graph processing framework Pregel, which differs from MapReduce in some major ways, is one of them. There are also some MapReduce implementations for graph processing. Last but not least, different approaches are being tried for scaling graph databases.
In 2010, Jimmy Lin and Michael Schatz published a paper on the subject, Design Patterns for Efficient Graph Algorithms in MapReduce (PDF):
Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the world wide web, module detection of protein-protein interaction networks, and privacy analysis of social networks. Many graphs of interest are difficult to analyze because of their large size, often spanning millions of vertices and billions of edges. As such, researchers have increasingly turned to distributed solutions. In particular, MapReduce has emerged as an enabling technology for large-scale graph processing. However, existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. In this paper, we present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
After the break you can find a video of Jimmy Lin talking about current best practices in designing large-scale graph algorithms in MapReduce and how to avoid some of their shortcomings, especially those related to partitioning, serializing, and distributing the graph. He shows three enhanced design patterns, applicable to a large class of graph algorithms, that address many of these deficiencies.
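The message-passing shape that the paper's patterns optimize can be illustrated with a toy PageRank round in the map/reduce mold: "map" emits (neighbor, mass) messages, "reduce" sums the mass arriving at each node. This sketch omits the paper's actual optimizations (such as combining and smarter partitioning) and assumes every node has at least one out-edge.

```python
from collections import defaultdict

def pagerank_round(ranks, adjacency, damping=0.85):
    """One PageRank iteration expressed as message passing."""
    messages = defaultdict(float)
    # "map" phase: each node sends rank mass along its outgoing edges
    for node, neighbors in adjacency.items():
        for nbr in neighbors:
            messages[nbr] += ranks[node] / len(neighbors)
    # "reduce" phase: each node sums the mass it received
    n = len(adjacency)
    return {node: (1 - damping) / n + damping * messages[node]
            for node in adjacency}

adjacency = {"a": ["b"], "b": ["c"], "c": ["a"]}   # a symmetric 3-cycle
ranks = {node: 1 / 3 for node in adjacency}
for _ in range(20):
    ranks = pagerank_round(ranks, adjacency)
print(ranks)  # by symmetry, every rank stays at 1/3
```

The shortcomings the paper targets are visible even here: the graph structure (`adjacency`) has to travel alongside the rank messages in a real MapReduce job, which is exactly what their design patterns reduce.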
Thursday, 3 November 2011
6 Pregel-Inspired Frameworks
A quick overview of 6 Pregel-inspired frameworks (Apache Hama, GoldenOrb, Apache Giraph, Phoebus, Signal/Collect, and HipG):
So, to summarize, what Hama, GoldenOrb, and Giraph have in common is: Java platform, Apache License (and incubation), BSP computation. Where they differ: Hama offers BSP primitives, not a graph processing API (so it sits at a lower level); GoldenOrb provides Pregel’s API but requires deploying additional software to your existing Hadoop infrastructure; Giraph provides Pregel’s API (and is fairly complete in its current state) and doesn’t require additional infrastructure.
Original title and link: 6 Pregel-Inspired Frameworks ( ©myNoSQL)
via: http://blog.acaro.org/entry/googlepregeltheriseoftheclones