We recently open-sourced a number of internal tools we’ve built to help our
engineers write high-performance Cascading code as the cascading_ext
project. Today I’m going to talk about a tool we use to improve the
performance of asymmetric joins—joins where one data set in the join
contains significantly more records than the other, or where many of the
records in the larger set don’t share a common key with the smaller set.
In the relational world, the comparable technique is the hash join.
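
To make the idea concrete, here is a minimal in-memory sketch of the approach the title hints at: build a Bloom filter from the keys of the smaller data set, use it to discard records from the larger set that cannot match, then do an exact join on what survives. This is not the cascading_ext API. Every name here (`BloomJoinSketch`, `SimpleBloomFilter`, `filterLarge`, `join`) is hypothetical, the filter size and hash count are arbitrary, and the hashing scheme is a simple assumption chosen for brevity.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Cascading or cascading_ext.
final class BloomJoinSketch {

    /** Tiny Bloom filter over String keys (double hashing derived from hashCode). */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int size;
        private final int numHashes;

        SimpleBloomFilter(int size, int numHashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.numHashes = numHashes;
        }

        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = (h1 >>> 16) | 1; // odd step so the probes differ
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        /** May return true for absent keys (false positive), never false for added keys. */
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Stage 1: drop large-side records whose key (element 0) cannot be in the small side. */
    static List<String[]> filterLarge(Map<String, String> small, List<String[]> large) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1024, 3);
        for (String key : small.keySet()) {
            filter.add(key);
        }
        List<String[]> survivors = new ArrayList<>();
        for (String[] record : large) {
            if (filter.mightContain(record[0])) {
                survivors.add(record);
            }
        }
        return survivors;
    }

    /** Stage 2: exact inner join; this also removes any Bloom filter false positives. */
    static List<String[]> join(Map<String, String> small, List<String[]> candidates) {
        List<String[]> out = new ArrayList<>();
        for (String[] record : candidates) {
            String match = small.get(record[0]);
            if (match != null) {
                out.add(new String[] {record[0], record[1], match});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> small = new HashMap<>();
        small.put("u1", "premium");
        small.put("u2", "trial");

        List<String[]> large = new ArrayList<>();
        large.add(new String[] {"u1", "click"});
        large.add(new String[] {"u7", "view"});
        large.add(new String[] {"u2", "click"});
        large.add(new String[] {"u9", "view"});

        for (String[] row : join(small, filterLarge(small, large))) {
            System.out.println(String.join(",", row));
        }
    }
}
```

In a single process the filter stage is redundant, because the hash map already answers the membership question exactly. The benefit appears in a distributed job, where a compact filter can be shipped to the tasks reading the larger data set so that non-matching records are dropped before they are shuffled to the join.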
Original title and link: BloomJoin: BloomFilter + CoGroup for Cascading (©myNoSQL)