Kevin Smith has recently given a presentation about a set of upcoming improvements to the Riak Map/Reduce implementation.
From the current status where:
- Map phase executes in parallel with data locality
- Reduce phase executes on the node where the job was submitted
- Results are not cached or stored
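For context, a Riak Map/Reduce job is submitted as a JSON document to a node's HTTP `/mapred` endpoint: the map phase fans out to the nodes holding the data, while the single reduce phase runs on the coordinating node. A minimal sketch in Python of building such a job body, assuming the standard JSON job format (the bucket name is illustrative; the function names are Riak's built-in JavaScript helpers):

```python
import json

# Illustrative Map/Reduce job for Riak's HTTP /mapred endpoint.
# The map phase runs with data locality on each node holding keys of
# the input bucket; the reduce phase runs on the coordinating node.
job = {
    "inputs": "stocks",  # hypothetical bucket name
    "query": [
        # Built-in JavaScript map function shipped with Riak
        {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}},
        # Built-in reduce that sorts the collected map results
        {"reduce": {"language": "javascript", "name": "Riak.reduceSort"}},
    ],
}

# This JSON body would be POSTed to http://<node>:8098/mapred
payload = json.dumps(job)
print(payload)
```

Note that nothing in the job body is cached or stored server-side: each submission re-runs the full pipeline, which is part of what the improvements below address.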
The Basho team is working to improve the behavior for the following two issues:
- Mapping beats up nodes and is inefficient for large buckets => write a real query scheduler that can
- group keys into batches
- use replicas for better cluster utilization
- Querying data is expensive when all you have are map/reduce functions => integrate key filtering operations into the MapReduce pipeline
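Key filtering lets the cluster prune the input key set before the map phase runs, instead of streaming every key in the bucket through a map function. A hedged sketch of what such a job could look like, assuming the list-based key-filter syntax Riak later shipped (bucket name and filter expression are illustrative):

```python
import json

# Illustrative Map/Reduce job with key filters: instead of mapping over
# every key in the bucket, the pipeline first narrows the input to keys
# matching the filter expression, so far fewer map invocations run.
job = {
    "inputs": {
        "bucket": "logs",  # hypothetical bucket name
        # keep only keys ending in "-2010" (illustrative filter)
        "key_filters": [["ends_with", "-2010"]],
    },
    "query": [
        {"map": {"language": "javascript", "name": "Riak.mapValues"}},
    ],
}

payload = json.dumps(job)
print(payload)
```

Because the filter operates on key names alone, no object values are read for the excluded keys, which is what makes this cheaper than a full-bucket map.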
And more improvements are scheduled for the future:
- distributing the reduce phase
- allowing external MapReduce processes
While I’ve often mentioned the possibility of a distributed reduce phase, I like the improvements they are currently working on even more.
The complete slide deck is embedded below:
- Kevin Smith: Basho Technologies engineer (↩)