


graph processing: All content tagged as graph processing in NoSQL databases and polyglot persistence

Storing, processing, and computing with graphs

Marko Rodriguez is on a roll with yet another fantastic article about graphs:

To the adept, graph computing is not only a set of technologies, but a way of thinking about the world in terms of graphs and the processes therein in terms of traversals. As data is becoming more accessible, it is easier to build richer models of the environment. What is becoming more difficult is storing that data in a form that can be conveniently and efficiently processed by different computing systems. There are many situations in which graphs are a natural foundation for modeling. When a model is a graph, then the numerous graph computing technologies can be applied to it.

✚ If you missed it, the other recent article I’m referring to is “Knowledge representation and reasoning with graph databases”.

Original title and link: Storing, processing, and computing with graphs (NoSQL database©myNoSQL)


On Graph Computing: Practical Applications and Graph Computing Technologies

Marko A. Rodriguez in a must-read-must-bookmark-must-print article about graphs, graph processing, their applicability, and related technologies:

The concept of a graph has been around since the dawn of mechanical computing and for many decades prior in the domain of pure mathematics. Due in large part to this golden age of databases, graphs are becoming increasingly popular in software engineering. Graph databases provide a way to persist and process graph data. However, the graph database is not the only way in which graphs can be stored and analyzed. Graph computing has a history prior to the use of graph databases and has a future that is not necessarily entangled with typical database concerns. There are numerous graph technologies that each have their respective benefits and drawbacks. Leveraging the right technology at the right time is required for effective graph computing.

Original title and link: On Graph Computing: Practical Applications and Graph Computing Technologies (NoSQL database©myNoSQL)


Detecting Cycles in a Network Graph With MapReduce

We’ll maintain two types of graph data:

  1. A set of link files, where each line represents an edge in the graph. These files are static.
  2. A set of path files, where each line represents a path from one node to another. These files are updated in each round of map/reduce.

The general idea is to grow the path file until either the paths cannot grow any further or a path contains a cycle. We’ll use two global flags to keep track of that: a “change” flag to record whether a new path was discovered, and a “cycle” flag to record whether a cycle was detected.
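The round described above can be sketched on a single machine as follows. This is an illustrative reduction of the map/reduce jobs to plain functions; the names (`grow_paths`, `detect_cycle`) are mine, not the article’s.

```python
# One "round" from the article: extend each known path by joining its last
# node against the static edge list; the "change" flag records newly
# discovered paths, the "cycle" flag is set when a path revisits a node.

def grow_paths(edges, paths):
    """Extend each path by every edge leaving its last node."""
    out = set(paths)
    changed = False
    cycle = False
    for path in paths:
        last = path[-1]
        for src, dst in edges:
            if src != last:
                continue
            if dst in path:          # revisiting a node closes a cycle
                cycle = True
                continue
            new_path = path + (dst,)
            if new_path not in out:  # a genuinely new path was discovered
                out.add(new_path)
                changed = True
    return out, changed, cycle

def detect_cycle(edges):
    """Iterate rounds until no new path appears or a cycle is found."""
    paths = {(src, dst) for src, dst in edges}
    changed, cycle = True, False
    while changed and not cycle:
        paths, changed, cycle = grow_paths(edges, paths)
    return cycle
```

The quadratic blow-up of the path set is exactly what the question below is about.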

Isn’t this approach very inefficient (by not being able to account for or reuse already processed paths)?

Original title and link: Detecting Cycles in a Network Graph With MapReduce (NoSQL database©myNoSQL)


Paper: Efficient Subgraph Matching on Billion Node Graphs

Papers from VLDB 2012 are starting to surface. Authored by a team from China, the “Efficient Subgraph Matching on Billion Node Graphs” paper introduces a new algorithm optimized for large-scale graphs:

We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data.
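The core contrast in the abstract — graph exploration instead of super-linear indices — can be illustrated with a toy backtracking matcher. This is purely a sketch of the idea; the paper’s distributed, memory-cloud algorithm is far more sophisticated, and all names below are mine.

```python
# Match a query subgraph against a data graph by exploring candidate
# bindings directly (backtracking), with no precomputed structural index.

def subgraph_match(data_adj, query_edges):
    """data_adj: {vertex: set(neighbors)}; query_edges: [(qa, qb), ...].
    Yields dicts mapping query vertices to distinct data vertices."""
    qvars = sorted({v for e in query_edges for v in e})

    def extend(binding, remaining):
        if not remaining:
            yield dict(binding)
            return
        qv = remaining[0]
        for dv in data_adj:                 # explore candidate data vertices
            if dv in binding.values():      # keep the mapping injective
                continue
            binding[qv] = dv
            # check every query edge whose endpoints are both bound
            ok = all(binding[b] in data_adj[binding[a]]
                     for a, b in query_edges
                     if a in binding and b in binding)
            if ok:
                yield from extend(binding, remaining[1:])
            del binding[qv]

    yield from extend({}, qvars)
```

For example, matching a directed triangle query against a directed triangle data graph yields its three rotations.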

A comparison of the space and time complexity of other subgraph matching algorithms:

Subgraph Matching Methods

The Richness of the Graph Model: The Sky Is the Limit

Marko A. Rodriguez in Exploring Wikipedia with Gremlin Graph Traversals:

There are numerous ways in which Wikipedia can be represented as a graph. The articles and the href hyperlinks between them is one way. This type of graph is known as a single-relational graph because all the edges have the same meaning — a hyperlink. A more complex rendering could represent the people discussed in the articles as “people-vertices” who know other “people-vertices” and that live in particular “city-vertices” and work for various “company-vertices” — so forth and so on until what emerges is a multi-relational concept graph. For the purpose of this post, a middle ground representation is used. The vertices are Wikipedia articles and Wikipedia categories. The edges are hyperlinks between articles as well as taxonomical relations amongst the categories.

Imagine the richness of the model you’d achieve when every piece of data and metadata becomes a vertex or an edge. It’s not just the wealth of data, but also the connectivity. Time would be the only missing dimension.
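The “middle ground” model quoted above — article and category vertices, with differently-typed edges for hyperlinks and taxonomy — can be sketched minimally with labeled edges. The data and names here are illustrative only.

```python
# A tiny multi-relational graph: edge labels distinguish hyperlinks
# between articles from taxonomical relations amongst categories.

edges = [
    ("Graph theory", "links_to", "Leonhard Euler"),             # article -> article
    ("Graph theory", "in_category", "Mathematics"),             # article -> category
    ("Discrete mathematics", "subcategory_of", "Mathematics"),  # category -> category
]

def out_edges(vertex, label=None):
    """All edges leaving a vertex, optionally filtered by edge label."""
    return [(s, l, t) for s, l, t in edges
            if s == vertex and (label is None or l == label)]
```

A traversal language like Gremlin generalizes exactly this kind of label-filtered hop.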

Original title and link: The Richness of the Graph Model: The Sky Is the Limit (NoSQL database©myNoSQL)

Neo4j and the Java Universal Network/Graph Framework

Max De Marzi1:

In the world of graph databases, one such stock room is the Java Universal Network/Graph Framework(JUNG) which contains a cache of algorithms from graph theory, data mining, and social network analysis, such as routines for clustering, decomposition, optimization, random graph generation, statistical analysis, and calculation of network distances, flows, and importance measures (centrality, PageRank, HITS, etc.).

Update: there’s a second part, in which De Marzi looks into visualizing graphs with Node Quilt:

Node Quilt

  1. I’ve already told you that Max De Marzi became my favorite read on graph database subjects. There’s only one thing I don’t like, but it’s not his content. 

Original title and link: Neo4j and the Java Universal Network/Graph Framework (NoSQL database©myNoSQL)


Using Graph Theory to Predict Basketball Teams Rankings

A directed network is simply a collection of nodes (representing teams) and arrows connecting them, called directed edges. Every time a team defeated another, an arrow was drawn from the losing team’s node to the winning team’s node to represent the game.
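With edges pointing from loser to winner, a random-walk ranking such as PageRank pushes score toward teams that beat strong opponents. The sketch below shows that idea on made-up games; it is my illustration of the approach, not the post’s exact method.

```python
# Rank teams by running PageRank over the loser -> winner network.

def rank_teams(games, damping=0.85, iters=50):
    """games: list of (loser, winner) pairs. Returns {team: score}."""
    teams = {t for g in games for t in g}
    out = {t: [] for t in teams}
    for loser, winner in games:
        out[loser].append(winner)
    score = {t: 1.0 / len(teams) for t in teams}
    for _ in range(iters):
        new = {t: (1 - damping) / len(teams) for t in teams}
        for t in teams:
            if out[t]:
                share = damping * score[t] / len(out[t])
                for w in out[t]:
                    new[w] += share
            else:
                # undefeated team (dangling node): spread its mass evenly
                for u in teams:
                    new[u] += damping * score[t] / len(teams)
        score = new
    return score
```

With games A→B, C→B, B→D (loser first), D ends up ranked highest: it beat the team that beat everyone else.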

Basketball and beers.

Original title and link: Using Graph Theory to Predict Basketball Teams Rankings (NoSQL database©myNoSQL)


NoSQL Paper: The Trinity Graph Engine

Even though my first post about the Microsoft Research graph database Trinity dates back to March last year, I haven’t heard much about it since. Based on my tip, Klint Finley published an interesting speculation about Trinity, Dryad, Probase, and Bing. Since then, though, Microsoft has moved away from Dryad to Hadoop, and I’m still not sure about the status of the Trinity project. But I have found a paper about the Trinity graph engine, authored by Bin Shao, Haixun Wang, and Yatao Li. You can read it or download it after the break.

We introduce Trinity, a memory-based distributed database and computation platform that supports online query processing and offline analytics on graphs. Trinity leverages graph access patterns in online and offline computation to optimize the use of main memory and communication in order to deliver the best performance. With Trinity, we can perform efficient graph analytics on web-scale, billion-node graphs using dozens of commodity machines, while existing platforms such as MapReduce and Pregel require hundreds of machines. In this paper, we analyze several typical and important graph applications, including search in a social network, calculating Pagerank on a web graph, and sub-graph matching on web-scale graphs without using index, to demonstrate the strength of Trinity.

Neo4j and D3.js: Visualizing Connections Over Time

Another great graph data visualization using Neo4j and D3.js from Max De Marzi:

Graph data visualization of connections over time

  • Max De Marzi is lately my favorite source for graph data visualization posts
  • Even though the diagram looks amazing, I’m wondering if it would scale to larger data sets
  • Even after giving it some thought, I’m still not sure how graph databases can record historical relationships/the evolution of relationships in a graph. If you have any ideas, I’d love to hear them.

Original title and link: Neo4j and D3.js: Visualizing Connections Over Time (NoSQL database©myNoSQL)

Big Graph-Processing Library From Twitter: Cassovary

Cassovary is designed from the ground up to efficiently handle graphs with billions of edges. It comes with some common node and graph data structures and traversal algorithms. A typical usage is to do large-scale graph mining and analysis.

If you are reading this, you’ve most probably heard of Pregel—if you haven’t, then you should check out the “Pregel: a system for large-scale graph processing” paper, then how Pregel and MapReduce compare, and also the 6 Pregel-inspired frameworks.

The Cassovary project page introduces it as:

Cassovary is a simple “big graph” processing library for the JVM. Most JVM-hosted graph libraries are flexible but not space efficient. Cassovary is designed from the ground up to first be able to efficiently handle graphs with billions of nodes and edges. A typical example usage is to do large scale graph mining and analysis of a big network. Cassovary is written in Scala and can be used with any JVM-hosted language. It comes with some common data structures and algorithms.
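The “space efficient” claim is worth unpacking. A common trick for compact graph storage (illustrative here — I don’t know Cassovary’s actual internals) is a CSR-style layout: one flat array of neighbor ids plus an offset array, instead of per-node objects and pointer-heavy adjacency lists.

```python
# Compressed-sparse-row adjacency: offsets[i]..offsets[i+1] delimits
# node i's neighbors inside one flat, primitive-typed array.

from array import array

def build_csr(num_nodes, edges):
    """edges: list of (src, dst). Returns (offsets, neighbors) arrays."""
    degree = [0] * num_nodes
    for src, _ in edges:
        degree[src] += 1
    offsets = array("l", [0] * (num_nodes + 1))
    for i in range(num_nodes):
        offsets[i + 1] = offsets[i] + degree[i]
    neighbors = array("l", [0] * len(edges))
    cursor = list(offsets[:num_nodes])   # next write position per node
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors

def neighbors_of(offsets, neighbors, node):
    return neighbors[offsets[node]:offsets[node + 1]]
```

On the JVM the same layout would use primitive `int`/`long` arrays, which is exactly where the space savings over object-based libraries come from.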

I’m not sure yet if:

  1. Cassovary works with any graphy data source or requires FlockDB—which is more of a persisted graph than a graph database
  2. Cassovary is inspired by Pregel in any way, or if it addresses a more limited problem space (similarly to FlockDB)

Update: Pankaj Gupta helped clarify the first question (and probably part of the second too):

At Twitter we use flockdb as our real-time graphdb, and export daily for use in cassovary, but any store could be used.

Original title and link: Big Graph-Processing Library From Twitter: Cassovary (NoSQL database©myNoSQL)


Calculating a Graph's Degree Distribution Using R MapReduce over Hadoop

Marko Rodriguez is experimenting with R on Hadoop, and one of his exercises is calculating a graph’s degree distribution. I confess I had to use Wikipedia to remind myself of the definition of a node’s degree:

  1. The degree of a node in a network (sometimes referred to incorrectly as the connectivity) is the number of connections or edges the node has to other nodes. The degree distribution P(k) of a network is then defined to be the fraction of nodes in the network with degree k.
  2. The degree distribution is very important in studying both real networks, such as the Internet and social networks, and theoretical networks.
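The definition above translates directly into code. This is a local, in-memory sketch; the post computes the same quantity with R map/reduce jobs running over Hadoop.

```python
# P(k): the fraction of nodes in the network that have degree k,
# for an undirected graph given as a list of edges.

from collections import Counter

def degree_distribution(edges):
    """edges: iterable of (u, v). Returns {k: fraction of nodes with degree k}."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    dist = Counter(degree.values())
    return {k: count / n for k, count in dist.items()}
```

For the path graph 1–2–3, two nodes have degree 1 and one has degree 2, so P(1) = 2/3 and P(2) = 1/3.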

As an imagination exercise, think of a graph database that actively maintains an internal degree distribution and uses it to make suggestions or to partition the graph. Would that work?

Original title and link: Calculating a Graph’s Degree Distribution Using R MapReduce over Hadoop (NoSQL database©myNoSQL)


Design Patterns for Efficient Graph Algorithms in MapReduce: Paper and Video

One of the most cited limitations of Hadoop is graph processing.

This problem has been approached in a few different ways so far. Google’s graph processing framework Pregel, which differs from MapReduce in some major ways, is one of them. There are also MapReduce implementations for graph processing. Last but not least, different approaches are being tried for scaling graph databases.

Jimmy Lin and Michael Schatz published a paper in 2010 on the subject, “Design patterns for efficient graph algorithms in MapReduce” (pdf):

Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the world wide web, module detection of protein-protein interaction networks, and privacy analysis of social networks. Many graphs of interest are difficult to analyze because of their large size, often spanning millions of vertices and billions of edges. As such, researchers have increasingly turned to distributed solutions. In particular, MapReduce has emerged as an enabling technology for large-scale graph processing. However, existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. In this paper, we present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
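The message-passing pattern the paper optimizes looks like this in miniature: a map phase emits rank “messages” along outgoing edges, and a reduce phase sums the messages arriving at each node into its new rank. This is a simplified single-machine sketch (no dangling-mass correction, no convergence test), with names of my choosing.

```python
# One PageRank iteration expressed as the map (emit messages) and
# reduce (sum messages per destination) steps of a MapReduce job.

from collections import defaultdict

def pagerank_round(graph, ranks, damping=0.85):
    """graph: {node: [out-neighbors]}; ranks: {node: rank}."""
    messages = defaultdict(float)
    # "map": each node sends its rank mass along its outgoing edges
    for node, nbrs in graph.items():
        for nbr in nbrs:
            messages[nbr] += ranks[node] / len(nbrs)
    # "reduce": combine incoming messages into a new rank per node
    n = len(graph)
    return {node: (1 - damping) / n + damping * messages[node]
            for node in graph}
```

The shortcoming the paper targets is visible even here: the graph structure itself must be shuffled alongside the messages on every iteration, which is exactly what the three design patterns reduce.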

After the break you can find a video of Jimmy Lin talking about current best practices in designing large-scale graph algorithms in MapReduce and how to avoid some of their shortcomings, especially those related to partitioning, serializing, and distributing the graph. He presents three enhanced design patterns, applicable to a large class of graph algorithms, that address many of these deficiencies.