presentation: All content tagged as presentation in NoSQL databases and polyglot persistence
- what is FluidDB: a platform for the web of things, each represented by an openly writable “social” object
- why FluidDB: most of the information nowadays lives inside walled gardens, so it’s difficult to make real use of it. I especially enjoyed this slide explaining the problem with closed information:
- how to use FluidDB: all applications use the same FluidDB database through a RESTful API
Two great slide decks on schema design, #riak, and #ruby
Well, I’ve added one myself so make it three great Riak presentations. You can definitely use them as reference material:
Riak: A friendly key/value store for the web by Bruce Williams
Schema design for Riak by Sean Cribbs
There’s also a nice ☞ Q&A post covering a couple of very interesting topics:
- what’s the cost of listing keys in Riak and the impact on MapReduce
- modeling relationships with large numbers of associations
- caching of intermediate results for link-walking and map phase
- notification mechanisms
Riak and Ruby by Grant Schofield
We’ve already seen the MongoDB analytics case study before, when looking at how Eventbrite tracks page views with MongoDB, and also in a MongoDB-based real-time web traffic visualization tool called Hummingbird.
But Jared Rosoff’s presentation contains a series of slides that identify possible issues with each scaling approach:
- single database
- master-slave database
- sharded database
- key-value stores
- key-value store with Hadoop for reporting
The only part I don’t really understand is how using Hadoop is more complex than scaling MongoDB:
Maybe someone could explain?
Meanwhile, Jared Rosoff’s complete slidedeck below.
There’s one aspect of Riak’s MapReduce that I’ve always wondered about: why is the reduce phase run only on a single node?
As you can see in the images below — extracted from Jon Meredith’s Riak in Ten Minutes embedded below — the map phase is distributed on all machines having the target data. But the reduce phase is run only on the machine that triggered the processing.
There can be quite a few problems with this approach:
- saturating the network
- overwhelming the node with data and processing
Is this just a temporary solution? Or are there good reasons for this behavior?
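To make the concern concrete, here is a toy simulation in plain Python. It is not Riak code: the cluster layout, node names, and the word-count job are all made up for illustration. It only mimics the shape described above, with map running where the data lives and a single coordinating node receiving every mapped pair for the reduce.

```python
from collections import Counter

# Hypothetical three-node cluster: each node holds its own slice of the data.
CLUSTER = {
    "node_a": ["riak", "mongodb", "riak"],
    "node_b": ["cassandra", "riak"],
    "node_c": ["mongodb", "cassandra", "cassandra"],
}


def map_phase(values):
    """Runs locally on the node that owns the data: emit (value, 1) pairs."""
    return [(value, 1) for value in values]


def reduce_phase(pairs):
    """Runs once, on the coordinating node, over every mapped pair."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def run_job(cluster, coordinator):
    """Map on every node, then ship all map output to one node and reduce there."""
    gathered = []
    shipped = 0  # pairs that had to cross the network to reach the coordinator
    for node, values in cluster.items():
        mapped = map_phase(values)
        if node != coordinator:
            shipped += len(mapped)
        gathered.extend(mapped)
    return reduce_phase(gathered), shipped


result, shipped = run_job(CLUSTER, coordinator="node_a")
print(result)
print("pairs sent over the network:", shipped)
```

In this sketch everything not already on the coordinator has to travel to it, so both the network traffic and the reduce work grow with the total map output rather than with the share held by any one node.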
While I usually don’t believe in learning X in Y lessons, Jon Meredith’s presentation is a good intro to Riak. Think of it as a summary of Kevin Smith’s 209 slides introducing Riak, Sean Cribbs’s 145 on Riak and Ripple, or even the excellent two-hour Riak tutorial. If you haven’t checked these out, you should definitely start with this one, as it will give you the basics so you can dive deeper.
My 5 top favorites:
- Mathias Meyer: ☞ NoSQL - The Definitive Guide
- Rusty Klophaus: ☞ Riak from small to large Mon (pdf). New: video of the presentation is available here
- Mathias Stearn: ☞ Mongo DB - the new ‘M’ in your LAMP stack (pdf)
- Peter Neubauer: ☞ 5 cool problems you can solve with neo4j
- Doug Judd: ☞ Hypertable - The Ultimate Scaling Machine. New: video of the presentation is available here
What are yours?
- how can you run multiple map and/or reduce phases in your data processing?
- how can you better coordinate the data processing execution flow for more complex scenarios?
- how can you perform additional work between map/reduce phases?
Addressing these new challenges is the goal of the ☞ Cascading project:
Cascading is a feature rich API for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.
Christopher Curtin’s slides embedded below offer a good overview of what can be achieved using Cascading (starting with slide 20).
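The three questions above can be sketched in a few lines of plain Python. This is not the Cascading API (which is a Java library running on Hadoop); `map_reduce`, `run_flow`, and the three step functions are hypothetical names used only to show the idea of a flow made of several map/reduce phases with extra work in between.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """One map/reduce phase: map each record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


def run_flow(records, steps):
    """Run the steps in order; each step's output is the next step's input."""
    for step in steps:
        records = step(records)
    return records


def count_words(records):
    """Phase 1: count how often each word appears across all lines."""
    return map_reduce(
        records,
        mapper=lambda line: [(word, 1) for word in line.split()],
        reducer=lambda word, ones: (word, sum(ones)),
    )


def drop_rare(records):
    """Work between phases: discard words that were seen only once."""
    return [(word, n) for word, n in records if n > 1]


def group_by_count(records):
    """Phase 2: invert the result, grouping words by their count."""
    return map_reduce(
        records,
        mapper=lambda pair: [(pair[1], pair[0])],
        reducer=lambda n, words: (n, sorted(words)),
    )


lines = ["riak hadoop riak", "hadoop cassandra riak", "mongodb hadoop"]
flow = [count_words, drop_rare, group_by_count]
print(run_flow(lines, flow))
```

The point of the sketch is the shape: the flow is just an ordered list of steps, so adding another phase or another in-between transformation means adding one more entry to that list.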
A very informative presentation by Benjamin Black on Cassandra indexing:
There are so many interesting things to learn from these slides. Benjamin briefly introduces the main Cassandra terms (if you are not familiar with them, you can read more in this Cassandra tutorial) and then moves on to explain how column sorting and partitioning strategies should be used. The deck also has some really quotable fragments:
Relational stores are schema oriented. Start from your schema & work forwards
Column stores are query oriented. Start from your queries & work backwards
Cassandra is an index construction kit
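One way to read "start from your queries and work backwards" is sketched below in plain Python. Nothing here is Cassandra client code: `SortedRow` is a toy stand-in for a row whose columns are kept sorted by name, and the sign-up data and the query are invented for the example.

```python
import bisect


class SortedRow:
    """Toy stand-in for a wide row: columns are kept sorted by name."""

    def __init__(self):
        self._names = []
        self._values = {}

    def insert(self, name, value):
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def get_slice(self, start, finish):
        """Return columns with start <= name <= finish, in sorted order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, finish)
        return [(name, self._values[name]) for name in self._names[lo:hi]]


# Query to serve: "which users signed up in a given city between two dates?"
# Working backwards: one index row per city, column name = (date, user id).
signups_by_city = {}


def record_signup(city, date, user_id):
    row = signups_by_city.setdefault(city, SortedRow())
    row.insert((date, user_id), "")  # the column name itself carries the data


record_signup("Amsterdam", "2010-07-01", "u1")
record_signup("Amsterdam", "2010-07-15", "u2")
record_signup("Amsterdam", "2010-08-02", "u3")
record_signup("Berlin", "2010-07-10", "u4")

july = signups_by_city["Amsterdam"].get_slice(
    ("2010-07-01", ""), ("2010-07-31", "\uffff")
)
print([user for (date, user), _ in july])
```

Because the columns are sorted by (date, user id), the date-range query becomes a single contiguous slice of one row, which is the sense in which the sort order is the index.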
While most of Francisco Treacy’s (@frank06) “An Introduction to node.js and Riak” presentation focuses on the advantages of event-based architectures, it also shows how to integrate node.js and Riak using ☞ riak-js, a node.js library for Riak that takes advantage of the friendly HTTP-based Riak protocol.
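To give a feel for what "HTTP-based" means here, the sketch below builds the kind of plain HTTP requests a client library would issue. It is written in Python rather than node.js and is not the riak-js API. The assumptions are a node listening on 127.0.0.1:8098 and the `/riak/<bucket>/<key>` URL layout; the bucket, key, and document are made up, and `build_store`, `build_fetch`, and `send` are hypothetical helper names.

```python
import json
import urllib.request

# Assumptions: a local Riak node on port 8098 and the /riak/<bucket>/<key>
# URL layout; adjust both for your installation.
BASE_URL = "http://127.0.0.1:8098/riak"


def build_store(bucket, key, document):
    """Build (but do not send) a PUT that stores a JSON document under bucket/key."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{bucket}/{key}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_fetch(bucket, key):
    """Build (but do not send) a GET that reads the value stored under bucket/key."""
    return urllib.request.Request(url=f"{BASE_URL}/{bucket}/{key}", method="GET")


def send(request):
    """Send a request to a running node and return the raw response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()


store = build_store("presentations", "riak-in-ten-minutes", {"speaker": "Jon Meredith"})
fetch = build_fetch("presentations", "riak-in-ten-minutes")
print(store.get_method(), store.full_url)
print(fetch.get_method(), fetch.full_url)
```

Calling `send(store)` followed by `send(fetch)` against a running node would perform the actual round trip; the example stops at building the requests so it runs without a server.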
There are a couple of other interesting things that can be learned from this slide deck. For example the cost of I/O:
simply described afterwards:
In other words, reaching RAM is like going from here to the Red Light District. Accessing the network is like going to the moon.
Update: thanks to a comment on this post, here is what Googler Jeff Dean presented on the cost of I/O:
But as Frank mentions, there are some risks while working with cutting-edge technologies:
- Cutting-edge technologies are not bug-free
- Riak still has some rough edges (some in terms of performance)
- node.js is approaching its first stable version
- asynchronous JS code can get “boomerang-shaped”
- Cassandra data model
- Cassandra API
- consistency model
- the Hector Java client
- gossip, consistent hashing and consistency levels
You’ll find many more details about these in our getting started with Cassandra tutorial, but the bullet point format is useful sometimes.
- Note: this is a Google doc.