


pregel: All content tagged as pregel in NoSQL databases and polyglot persistence

Big Graph-Processing Library From Twitter: Cassovary

Cassovary is designed from the ground up to efficiently handle graphs with billions of edges. It comes with some common node and graph data structures and traversal algorithms. A typical usage is to do large-scale graph mining and analysis.

If you are reading this, you’ve most probably heard of Pregel. If you haven’t, you should check out the Pregel: a system for large-scale graph processing paper, then how Pregel and MapReduce compare, and also the 6 Pregel-inspired frameworks.

The Cassovary project page introduces it as:

Cassovary is a simple “big graph” processing library for the JVM. Most JVM-hosted graph libraries are flexible but not space efficient. Cassovary is designed from the ground up to first be able to efficiently handle graphs with billions of nodes and edges. A typical example usage is to do large scale graph mining and analysis of a big network. Cassovary is written in Scala and can be used with any JVM-hosted language. It comes with some common data structures and algorithms.
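The space-efficiency claim usually comes down to representation: storing adjacency in flat primitive arrays rather than per-node object collections. A minimal Python sketch of that idea follows; it is illustrative only, not Cassovary’s actual API (which is Scala).

```python
# Illustrative sketch, not Cassovary's API: all adjacency lists live in
# one flat array, indexed by a per-node offsets array (a CSR-style
# layout), instead of allocating a collection object per node.

class ArrayBackedDigraph:
    def __init__(self, edges, num_nodes):
        """edges: list of (src, dst) pairs; nodes are ints 0..num_nodes-1."""
        edges = list(edges)
        counts = [0] * num_nodes
        for src, _ in edges:
            counts[src] += 1
        # offsets[i] marks where node i's neighbors start in the flat array
        self.offsets = [0] * (num_nodes + 1)
        for i in range(num_nodes):
            self.offsets[i + 1] = self.offsets[i] + counts[i]
        self.neighbors = [0] * len(edges)
        cursor = list(self.offsets[:-1])  # next free slot per source node
        for src, dst in edges:
            self.neighbors[cursor[src]] = dst
            cursor[src] += 1

    def out_neighbors(self, node):
        return self.neighbors[self.offsets[node]:self.offsets[node + 1]]
```

With billions of edges, this layout avoids per-node object overhead, and a neighbor lookup is a single contiguous slice.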

I’m not sure yet whether:

  1. Cassovary works with any graph data source or requires FlockDB—which is more of a persisted graph than a graph database
  2. Cassovary is inspired by Pregel in any way or addresses a limited problem space (similarly to FlockDB)

Update: Pankaj Gupta helped clarify the first question (and probably part of the second too):

At Twitter we use flockdb as our real-time graphdb, and export daily for use in cassovary, but any store could be used.

Original title and link: Big Graph-Processing Library From Twitter: Cassovary (NoSQL database © myNoSQL)


6 Pregel-Inspired Frameworks

A quick overview of 6 Pregel-inspired frameworks (Apache Hama, GoldenOrb, Apache Giraph, Phoebus, Signal/Collect, and HipG):

So, to summarize, what Hama, GoldenOrb, and Giraph have in common is: the Java platform, the Apache License (and incubation), and BSP computation. Where they differ: Hama offers BSP primitives, not a graph-processing API (so it sits at a lower level); GoldenOrb provides Pregel’s API but requires deploying additional software to your existing Hadoop infrastructure; Giraph provides Pregel’s API (and is fairly complete in its current state) and doesn’t require additional infrastructure.



Paper: Graph Based Statistical Analysis of Network Traffic

Published by a group from Los Alamos National Lab (Hristo Djidjev, Gary Sandine, Curtis Storlie, Scott Vander Wiel):

We propose a method for analyzing traffic data in large computer networks such as big enterprise networks or the Internet. Our approach combines graph theoretical representation of the data and graph analysis with novel statistical methods for discovering patterns and time-related anomalies. We model the traffic as a graph and use temporal characteristics of the data in order to decompose it into subgraphs corresponding to individual sessions, whose characteristics are then analyzed using statistical methods. The goal of that analysis is to discover patterns in the network traffic data that might indicate intrusion activity or other malicious behavior.
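The session-decomposition step can be sketched roughly as follows; the record layout, field order, and the 60-second inactivity threshold are assumptions for illustration, not details taken from the paper.

```python
# Illustrative sketch: split a host pair's traffic records into sessions
# by temporal gaps, yielding one small subgraph (edge list) per session.

def split_into_sessions(flows, gap=60.0):
    """flows: (timestamp, src, dst) records. A new session (subgraph)
    starts whenever the inter-arrival gap exceeds `gap` seconds."""
    sessions, current, last_ts = [], [], None
    for ts, src, dst in sorted(flows):
        if last_ts is not None and ts - last_ts > gap:
            sessions.append(current)       # close the previous session
            current = []
        current.append((src, dst))         # edge in the session subgraph
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

Each returned session is a small edge list whose graph features (size, degrees, duration) can then be fed to the statistical analysis.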


GoldenOrb: Ravel Google Pregel Implementation Released

Announced back in March, Ravel has finally released GoldenOrb, an implementation of the Google Pregel paper. If you are not familiar with Google Pregel, check Pregel: Graph Processing at Large-Scale and Ricky Ho’s comparison of Pregel and MapReduce.

Until Ravel’s GoldenOrb, the only experimental implementation of Pregel was the Erlang-based Phoebus. GoldenOrb was released under the Apache License v2.0 and is available on GitHub.

GoldenOrb is a cloud-based open source project for massive-scale graph analysis, built upon best-of-breed software from the Apache Hadoop project modeled after Google’s Pregel architecture.


Graph Databases: Distributed Traversal Engines

Marko A. Rodriguez:

In the distributed traversal engine model, a traversal is represented as a flow of messages between elements of the graph. Generally, each element (e.g. vertex) is operating independently of the other elements. Each element is seen as its own processor with its own (usually homogenous) program to execute. Elements communicate with each other via message passing. When no more messages have been passed, the traversal is complete and the results of the traversal are typically represented as a distributed data structure over the elements. Graph databases of this nature tend to use the Bulk Synchronous Parallel model of distributed computing. Each step is synchronized in a manner analogous to a clock cycle in hardware. Instances of this model include Agrapa, Pregel, Trinity, GoldenOrb, and others.
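A single-process sketch of the model Rodriguez describes, with BFS distances as the per-vertex program; the names and structure are illustrative, not any particular engine’s API.

```python
# Sketch of a traversal as a flow of messages between graph elements:
# each vertex keeps local state, reacts to incoming messages, and the
# traversal completes when no messages remain to be passed.

def message_passing_bfs(adj, source):
    """Returns the traversal result as a per-vertex data structure:
    a {vertex: distance} map built purely by message passing."""
    dist = {v: float('inf') for v in adj}  # each vertex's local state
    inbox = {source: [0]}                  # messages for the first superstep
    while inbox:                           # no pending messages => done
        outbox = {}
        for v, msgs in inbox.items():      # each vertex processes its mail
            d = min(msgs)
            if d < dist[v]:
                dist[v] = d                # update local state
                for w in adj[v]:           # message the out-neighbors
                    outbox.setdefault(w, []).append(d + 1)
        inbox = outbox                     # barrier: next superstep's mail
    return dist
```

The `inbox = outbox` swap plays the role of the BSP synchronization point: messages sent in one superstep are only visible in the next.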

None of these graph databases offers a distributed traversal engine.



Ravel Hopes to Open-Source Graph Databases

Ravel, an Austin, Texas-based company, wants to provide a supported, open-source version of Google’s Pregel software called GoldenOrb to handle large-scale graph analytics.

Is it a new graph database or a Pregel implementation? Watch the interview for yourself and tell me what you think it is.


Pregel: Graph Processing at Large-Scale

A good presentation about Pregel:

The slides talk about:

  • Pregel compute model
  • Pregel C++ API
  • implementation details
  • fault tolerance
  • workers, master, and aggregators

As mentioned before, Pregel is MapReduce for graphs. And besides Google’s implementation, which we’ll probably never see, there’s Phoebus, an Erlang implementation of Pregel.


Phoebus: Erlang-based Implementation of Google’s Pregel

Chad DePue about Phoebus, the first (?) open source implementation of Google’s Pregel algorithm:

Essentially, Phoebus makes calculating data for each vertex and edge in parallel possible on a cluster of nodes. Makes me wish I had a massively large graph to test it with.

Developed by Arun Suresh (Yahoo!), the project page includes a bulleted description of the Pregel computational model:

  • A Graph is partitioned into groups of Records.
  • A Record consists of a Vertex and its outgoing Edges (an Edge is a Tuple consisting of the edge weight and the target vertex name).
  • A User specifies a ‘Compute’ function that is applied to each Record.
  • Computation on the graph happens in a sequence of incremental Super Steps.
  • At each Super Step, the Compute function is applied to all ‘active’ vertices of the graph.
  • Vertices communicate with each other via Message Passing.
  • The Compute function is provided with the Vertex record and all Messages sent to the Vertex in the previous Super Step.
  • A Compute function can:
    • Mutate the value associated with a vertex
    • Add/remove outgoing edges
    • Mutate edge weights
    • Send a Message to any other vertex in the graph
    • Change the state of the vertex from ‘active’ to ‘hold’
  • At the beginning of each Super Step, if there are no more active vertices and there are no messages to be sent to any vertex, the algorithm terminates.
  • A User may additionally specify ‘MaxSteps’ to stop the algorithm after some number of Super Steps.
  • A User may additionally specify a ‘Combine’ function that is applied to all the Messages targeted at a Vertex before the Compute function is applied to it.
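The model in the bullets above can be sketched in a few lines of Python; Phoebus itself is written in Erlang, and all names here are illustrative, not its API.

```python
# Hedged sketch of the computational model described above: supersteps,
# active/hold vertex states, message passing, MaxSteps, and an optional
# Combine function applied to a vertex's incoming messages.

def run_supersteps(vertices, compute, combine=None, max_steps=30):
    """vertices: {name: {'value': ..., 'edges': [(weight, target), ...],
    'active': bool}}. compute(name, record, msgs) may mutate the record
    and returns a list of (target, message) pairs for the next superstep."""
    inbox = {name: [] for name in vertices}
    for _ in range(max_steps):
        # terminate when no vertex is active and no messages are pending
        if not any(v['active'] for v in vertices.values()) and not any(inbox.values()):
            break
        outbox = {name: [] for name in vertices}
        for name, record in vertices.items():
            msgs = inbox[name]
            if msgs:
                record['active'] = True       # a message reactivates a held vertex
            if record['active']:
                if combine is not None and msgs:
                    msgs = [combine(msgs)]    # pre-aggregate incoming messages
                for target, message in compute(name, record, msgs):
                    outbox[target].append(message)
        inbox = outbox
    return vertices
```

A Compute function that returns no messages and sets its vertex to ‘hold’ goes quiet until new mail arrives, which is exactly what makes the no-active-vertices/no-messages termination test work.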

While it sounds similar to MapReduce, Pregel is optimized for graph operations: reducing I/O, ensuring data locality, and preserving processing state between phases.


Comparing Pregel and MapReduce

Following his post on graph processing, Ricky Ho explains the major differences between Pregel and MapReduce when applied to graph processing:

Since the Pregel model retains worker state (the same worker is responsible for the same set of nodes) across iterations, the graph can be loaded into memory once and reused across iterations. This reduces I/O overhead, as there is no need to read and write to disk at each iteration. For fault resilience, there is a periodic checkpoint where every worker writes its in-memory state to disk.

Also, Pregel (with its stateful characteristic) sends only locally computed results (but not the graph structure) over the network, which implies minimal bandwidth consumption.

If you need to summarize that even further, it is basically:

  • reducing I/O as much as possible
  • ensuring data locality
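The gist (keep the partition in memory across iterations, touch disk only at periodic checkpoints) can be sketched as follows; this is an assumed illustration, not code from Ricky Ho’s post.

```python
import json

# Illustrative sketch: the worker keeps its state in memory across
# iterations and persists it only at periodic checkpoints for fault
# resilience, instead of reading/writing disk at every iteration.

def run_iterations(state, step, iterations, checkpoint_every, path):
    for i in range(1, iterations + 1):
        state = step(state)              # purely in-memory iteration
        if i % checkpoint_every == 0:
            with open(path, 'w') as f:   # periodic checkpoint for recovery
                json.dump({'iteration': i, 'state': state}, f)
    return state
```

On failure, a worker would reload the last checkpoint and replay from that iteration, which is the trade-off Pregel makes against MapReduce’s per-iteration materialization.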


On Graph Processing

Ricky Ho explains these two fundamental graph papers:

The execution model is based on the BSP (Bulk Synchronous Parallel) model. In this model, multiple processing units proceed in parallel through a sequence of “supersteps”. Within each “superstep”, each processing unit first receives all messages delivered to it from the preceding “superstep”, then manipulates its local data and may queue up messages that it intends to send to other processing units. This happens asynchronously and simultaneously among all processing units. The queued-up messages will be delivered to the destined processing units but won’t be seen until the next “superstep”. When all the processing units finish message delivery (hence the synchronization point), the next superstep can start, and the cycle repeats until the termination condition has been reached.

Pregel execution model

Note that Google’s Pregel is, at a very high level, quite similar to Google’s MapReduce.


Two Must Read Graph Papers

First, “Constructions from Dots and Lines” by Marko A. Rodriguez and Peter Neubauer[1], available in PDF format:

The ability for a graph to denote objects and their relationships to one another allows for a surprisingly large number of things to be modeled as a graph. From the dependencies that link software packages to the wood beams that provide the framing to a house, most anything has a corresponding graph representation. However, just because it is possible to represent something as a graph does not necessarily mean that its graph representation will be useful. If a modeler can leverage the plethora of tools and algorithms that store and process graphs, then such a mapping is worthwhile. This article explores the world of graphs in computing and exposes situations in which graphical models are beneficial.

Second, the much-awaited “Pregel: a system for large-scale graph processing” by G. Malewicz et al. is now available on the ACM portal (thanks to Claudio Martella[2] for the tip):

Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.

Now it is time to read them and think about the interesting problem of scaling graph databases.

  1. Marko A. Rodriguez and Peter Neubauer have also authored the paper The Graph Traversal Pattern.
  2. Claudio has already posted an article on Pregel: Google Pregel is out. But what is Pregel?

An Interesting Problem: Scaling Graph Databases

One of the problems mentioned when discussing relational database scalability is that storage-enforced relationships, ACID, and scale do not play well together. In the NoSQL space there is a category of storage solutions built around highly interconnected data: graph databases. (Note also that some of these graph databases are transactional.)

Lately there have been quite a few interesting discussions about scaling graph databases. Alex Averbuch is working on a thesis on sharding Neo4j, and his recent post presents some of the possible solutions. Alex’s article is a very good starting point for anyone interested in scaling graph databases.

Then there is also an article on InfoGrid’s blog presenting a different, web-like solution based on a custom protocol: XPRISO (eXtensible Protocol for the Replication, Integration and Synchronization of distributed Objects). While I haven’t had the chance to dig deeper into InfoGrid’s suggested approach, one thing caught my attention right away: the association with web-scale is definitely an interesting idea, but requiring specific knowledge of node locations and a custom API for accessing them doesn’t seem to be the best solution. The web addressed this by giving each reachable resource a URI; InfoGrid should try a similar idea and get rid of the separate APIs for accessing local vs. remote nodes.

Update: make sure you check the comment thread for more details about InfoGrid’s perspective on scaling graph databases.

Oren Eini concludes in his post:

After spending some time thinking about it, I came to the conclusion that I can’t envision any general way to solve the problem. Oh, I can think of several ways to reduce the problem:

  • Batching cross-machine queries so we only perform them at the close of each breadth-first step.
  • Storing multiple levels of associations (so “users/ayende” would store not only its own relations but also “users/arik”’s relations).
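The first reduction can be sketched like this; the shard interface below is an assumption for illustration, not code from Oren Eini’s post.

```python
# Sketch of batching cross-machine queries: collect the frontier
# vertices by shard and query each shard once per breadth-first level,
# instead of issuing one remote call per neighbor lookup.

class LocalShard:
    """Stand-in for a remote shard; answers neighbor lookups in batch."""
    def __init__(self, adj):
        self.adj = adj

    def neighbors_of(self, vertices):
        return [self.adj.get(v, []) for v in vertices]

def batched_bfs_step(frontier, visited, shards, shard_of):
    """Expand one BFS level with at most one (batched) query per shard."""
    by_shard = {}
    for v in frontier:
        by_shard.setdefault(shard_of(v), []).append(v)
    next_frontier = set()
    for shard_id, vertices in by_shard.items():
        # one round trip per shard instead of one per frontier vertex
        for neighbors in shards[shard_id].neighbors_of(vertices):
            next_frontier.update(n for n in neighbors if n not in visited)
    visited |= next_frontier
    return next_frontier
```

The win is in round trips: a level touching k vertices across s shards costs s batched calls rather than k remote lookups.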

While I haven’t had enough time to think about this topic, my gut feeling is that possible solutions are to be found in some combination of unique identifiers for distributed nodes and a MapReduce-like approach. I cannot stop wondering if this is what Google’s Pregel is doing (nb: I should have read the paper first).