Erlang: All content tagged as Erlang in NoSQL databases and polyglot persistence
Chad DePue about Phoebus, the first (?) open source implementation of Google’s Pregel algorithm:
Essentially, Phoebus makes calculating data for each vertex and edge in parallel possible on a cluster of nodes. Makes me wish I had a massively large graph to test it with.
Developed by Arun Suresh (Yahoo!), the project page includes a bulleted description of the Pregel computational model:
- A Graph is partitioned into groups of Records.
- A Record consists of a Vertex and its outgoing Edges (An Edge is a Tuple consisting of the edge weight and the target vertex name).
- A User specifies a ‘Compute’ function that is applied to each Record.
- Computation on the graph happens in a sequence of incremental Super Steps.
- At each Super step, the Compute function is applied to all ‘active’ vertices of the graph.
- Vertices communicate with each other via Message Passing.
- The Compute function is provided with the Vertex record and all Messages sent to the Vertex in the previous SuperStep.
- A Compute function can:
- Mutate the value associated with a vertex
- Add/Remove outgoing edges.
- Mutate Edge weights
- Send a Message to any other vertex in the graph.
- Change the state of the vertex from ‘active’ to ‘hold’.
- At the beginning of each SuperStep, if there are no more active vertices and there are no messages to be sent to any vertex, the algorithm terminates.
- A User may additionally specify a ‘MaxSteps’ to stop the algorithm after a given number of SuperSteps.
- A User may additionally specify a ‘Combine’ function that is applied to all the Messages targeted at a Vertex before the Compute function is applied to it.
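The model above can be sketched as a vertex-centric ‘Compute’ function. The following is a minimal illustration for single-source shortest paths written against a hypothetical Phoebus-like callback shape; the record layout `{VertexName, Value, Edges}` and the `{NewRecord, OutMessages, State}` return convention are assumptions for illustration only, not the project’s actual API:

```erlang
-module(sssp).
-export([compute/3]).

%% Vertex value = current best distance from the source (the atom
%% 'infinity' initially; in Erlang term order atoms compare greater
%% than numbers, so lists:min/1 works). Edges = [{Weight, Target}],
%% Messages = candidate distances sent in the previous SuperStep.
compute({VertexName, Distance, Edges}, Messages, _SuperStep) ->
    %% Best distance among our current value and incoming proposals.
    MinMsg = lists:min([Distance | Messages]),
    case MinMsg < Distance of
        true ->
            %% Improved: mutate the vertex value and send the new
            %% distance (plus edge weight) to every outgoing neighbour.
            Out = [{Target, MinMsg + Weight} || {Weight, Target} <- Edges],
            {{VertexName, MinMsg, Edges}, Out, active};
        false ->
            %% No improvement: send nothing and switch to 'hold'; the
            %% vertex becomes active again only if a message arrives.
            {{VertexName, Distance, Edges}, [], hold}
    end.
```

The termination rule from the list above falls out naturally: once every vertex is on ‘hold’ and no messages are in flight, a SuperStep produces no work and the run ends.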
While it sounds similar to MapReduce, Pregel is optimized for graph operations: it reduces I/O, ensures data locality, and preserves processing state between phases.
Original title and link: Phoebus: Erlang-based Implementation of Google’s Pregel (NoSQL databases © myNoSQL)
Rusty Klophaus (@rklophaus) published a fantastic recap of the Erlang Factory London event. Two parts caught my attention: the summaries of Justin Sheehy’s presentation on the Riak architecture and of Ulf Wiger’s presentation on Mnesia.
There are eight distinct layers involved in reading/writing Riak data:
- The Client Application using Riak
- The client-side HTTP API or Protocol Buffers API that talks to the Riak cluster
- The server-side Riak Client containing the combined backing code for both APIs
- The Dynamo Model FSMs that interact with nodes using Dynamo style quorum behavior and conflict resolution
- Riak Core provides the fundamental distribution of the system (not covered in the talk)
- The VNode Master that runs on every physical node, and coordinates incoming interaction with individual VNodes
- Individual VNodes (Virtual Nodes) which are treated as lightweight local abstractions over K/V storage
- The swappable Storage Engine that persists data to disk
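The “Dynamo style quorum behavior” mentioned in the FSM layer can be made concrete with a small sketch. This is not Riak’s actual code, just the arithmetic behind the model: with N replicas per key, a read needs R replies and a write needs W acknowledgements, and choosing R + W > N guarantees every read quorum overlaps every write quorum on at least one replica:

```erlang
-module(quorum).
-export([overlap/3, satisfied/2]).

%% True when every read quorum intersects every write quorum,
%% i.e. a read is guaranteed to see at least one up-to-date replica.
overlap(N, R, W) -> R + W > N.

%% True when the number of replies received meets the required quorum.
satisfied(Replies, Required) -> Replies >= Required.
```

For the common N=3, R=2, W=2 configuration, `overlap(3, 2, 2)` holds, while the faster N=3, R=1, W=1 setup gives up that guarantee.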
Mnesia and NoSQL
The upsides are:
- Deployed commercially for over 10 years
- Comparable performance to current top performers in the clustered SQL space
- Scalable to 50 nodes
- Distributed transactions with loose time limits (in other words, appropriate for transactions across remote clusters)
- Built-in support for sharding (fragments)
- Incremental backup
The downsides are:
- Erlang only interface
- Tables limited to 2GB
- Deadlock prevention scales poorly
- Network partitions are not automatically handled; tables must be recombined manually
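The built-in sharding mentioned in the upsides is exposed through Mnesia’s standard `frag_properties` table option. A minimal sketch, assuming a running Mnesia node; the table name, attributes, and fragment count here are made up for illustration:

```erlang
%% Create a fragmented (sharded) Mnesia table: records are hashed on
%% the key and spread across 8 fragments over the node pool.
%% Requires mnesia:start/0 to have been called on a node with a schema.
create_sharded_table() ->
    mnesia:create_table(user,
        [{attributes, [id, name]},
         {frag_properties,
             [{n_fragments, 8},
              {node_pool, [node() | nodes()]}]}]).
```

Subsequent reads and writes go through the `mnesia_frag` activity module (e.g. `mnesia:activity(transaction, Fun, [], mnesia_frag)`) so that Mnesia routes each key to the right fragment.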
CouchDB, the document database built on Erlang, was also present at the event, but I couldn’t find a report about the talk or the slides.
If you are somewhat familiar with Erlang, you already know that Mnesia is a distributed database system that was designed with the following goals in mind:
- Fast real-time key/value lookup
- Complicated non real-time queries mainly for operation and maintenance
- Distributed data due to distributed applications
- High fault tolerance
- Dynamic re-configuration
- Complex objects
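The first two goals correspond to the two access styles Mnesia exposes. A minimal sketch, with a made-up `session` table; the dirty API skips transactions entirely for fast real-time lookup, while the transactional API adds locking and distributed atomicity for the more complicated queries:

```erlang
-record(session, {id, data}).

%% Fast real-time key/value lookup: dirty operations bypass the
%% transaction manager and its locks.
fast_lookup(Id) ->
    mnesia:dirty_read(session, Id).

%% Transactional read: serialized against concurrent writers and
%% atomic across distributed table copies.
safe_lookup(Id) ->
    {atomic, Result} =
        mnesia:transaction(fun() -> mnesia:read(session, Id) end),
    Result.
```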
Even if the presentation is not great (see below), Rickard Cardell’s experiments using Tokyo Cabinet and CouchDB as Mnesia backends sound like a new and interesting use case for NoSQL solutions.
The guys from Basho, producers of Riak, have just posted about a new way of running Riak: in a fully self-contained embedded node environment.
In case you are wondering what’s so interesting about this, the video below will walk you through everything from Riak’s compilation to the different benefits of running Riak in this mode. The one that sounded most important to me is that, thanks to the lack of dependencies, starting more nodes (note: on similar hardware) will be much easier. And as I wrote in The New Dimension of NoSQL Scalability: Complexity, reducing operational complexity is very important for NoSQL systems.