There’s one aspect of Riak’s MapReduce that I’ve always wondered about: why is the reduce phase run on only a single node?
As you can see in the images below, taken from Jon Meredith’s Riak in Ten Minutes (embedded below), the map phase is distributed across all machines that hold the target data. The reduce phase, however, runs only on the machine that triggered the processing.
This approach can cause at least two problems:
- saturating the network, since every map result has to be shipped to that one node
- overwhelming that node with data and processing
Is this just a temporary solution? Or are there good reasons for this behavior?
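To make the concern concrete, here is a toy simulation in plain Python. It is only a sketch and does not use Riak's actual API: the cluster layout, the word-count job, and the names `CLUSTER`, `map_phase`, `reduce_phase`, and `run_job` are all made up for illustration. It assumes, as described above, that map runs where the data lives and that every map result is then sent to one coordinating node for the reduce.

```python
from collections import Counter

# Hypothetical cluster: node name -> documents stored on that node.
CLUSTER = {
    "node_a": ["riak is distributed", "riak stores keys"],
    "node_b": ["map runs near data"],
    "node_c": ["reduce runs on one node", "one node gets everything"],
}


def map_phase(docs):
    """Runs on the node holding the data: emit a (word, 1) pair per word."""
    return [(word, 1) for doc in docs for word in doc.split()]


def reduce_phase(pairs):
    """Runs only on the coordinating node: sum the counts per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_job(cluster, coordinator):
    """Simulate one job; return (result, pairs sent over network, pairs reduced)."""
    received = []    # every map output ends up on the coordinator
    transferred = 0  # map outputs that had to cross the network
    for node, docs in cluster.items():
        output = map_phase(docs)
        if node != coordinator:
            transferred += len(output)
        received.extend(output)
    return reduce_phase(received), transferred, len(received)


if __name__ == "__main__":
    result, transferred, total = run_job(CLUSTER, coordinator="node_a")
    print(f"{transferred} of {total} map outputs crossed the network")
    print(result)
```

In this toy setup the coordinator receives every pair emitted by every node, so both the network traffic into it and the work it does grow with the total map output rather than with its own share of the data.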
While I usually don’t believe in learning X in Y lessons, Jon Meredith’s presentation is a good intro to Riak. Think of it as a summary of Kevin Smith’s 209 slides introducing Riak, Sean Cribbs’s 145 slides on Riak and Ripple, or even the excellent two-hour Riak Tutorial. If you haven’t checked any of these out yet, you should definitely start with this one, as it will give you the basics so you can dive deeper.