Question about Riak MapReduce

by Alex Popescu

There’s one aspect of Riak’s MapReduce that I’ve always wondered about: why is the reduce phase run on only a single node?

As you can see in the images below (taken from Jon Meredith’s Riak in Ten Minutes, embedded below), the map phase is distributed across all the machines that hold the target data. The reduce phase, however, runs only on the machine that triggered the processing.

This approach can cause a couple of problems:

  • saturating the network, since every map result has to be shipped to that one node
  • overwhelming that node with data and processing
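
To make the concern concrete, here is a toy Python sketch. It is not Riak code: the three-node cluster, the function names, and the word-count job are all made up for illustration. It only mimics the shape described above, with the map step running against each node's local data and a single reduce step running on one coordinating node, and it counts how many intermediate pairs that node has to receive.

```python
from collections import Counter

# Hypothetical cluster: node name -> documents stored on that node.
CLUSTER = {
    "node_a": ["riak is a key value store", "riak runs map on data nodes"],
    "node_b": ["reduce runs on one node"],
    "node_c": ["map output travels over the network", "one node reduces it all"],
}


def map_words(document):
    """Map function: emit one (word, 1) pair per word in the document."""
    return [(word, 1) for word in document.split()]


def run_map_phase(cluster):
    """Run the map function on each node, against that node's local data only."""
    return {
        node: [pair for doc in docs for pair in map_words(doc)]
        for node, docs in cluster.items()
    }


def run_reduce_on_coordinator(map_outputs):
    """Gather every node's map output in one place and reduce it there.

    Returns the final word counts and the number of pairs the single
    reducing node had to receive.
    """
    received = [pair for pairs in map_outputs.values() for pair in pairs]
    counts = Counter()
    for word, n in received:
        counts[word] += n
    return counts, len(received)


map_outputs = run_map_phase(CLUSTER)
counts, pairs_received = run_reduce_on_coordinator(map_outputs)
print("pairs shipped to the reducing node:", pairs_received)
print("count for 'riak':", counts["riak"])
```

In this sketch the number of pairs arriving at the reducing node equals the total map output of the whole cluster, so it grows with the data set rather than with the size of any one node. That is the pattern behind both worries in the list above.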

Is this just a temporary solution? Or are there good reasons for this behavior?




While I usually don’t believe in learning X in Y lessons, Jon Meredith’s presentation is a good intro to Riak. Think of it as a summary of Kevin Smith’s 209 slides introducing Riak, Sean Cribbs’s 145 slides on Riak and Ripple, or even the excellent two-hour Riak Tutorial. If you haven’t gone through any of these, you should definitely start with this one: it will give you the basics so you can dive deeper.