Hive Top-K Optimization
A simple optimization of top-k queries can make a huge difference. The default behavior is:
- sifting through all the data (necessary),
- sorting it all (necessary),
- writing all the results to disk (unnecessary: saving the top limit results from each map is enough), and
- having the reducer process all the data again (unnecessary: the previous step already reduced the amount of data down to limit * number_of_partitions).
For reference, a top-k query looks like this:
SELECT * FROM T ORDER BY a DESC LIMIT 10
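To make the idea concrete, here is a minimal Python sketch of the query above with the optimization applied. It is not Hive's implementation: the function names, the in-memory lists standing in for partitions, and the use of a heap are all assumptions made for illustration. Each "map" keeps only its own top limit rows, so the "reducer" merges at most limit * number_of_partitions candidates.

```python
import heapq


def map_side_top_k(partition, k):
    # Hypothetical map step: still scans every row of its partition,
    # but emits only the k rows with the largest value of column "a".
    return heapq.nlargest(k, partition, key=lambda row: row["a"])


def reduce_top_k(partial_results, k):
    # Hypothetical reduce step: merges the per-partition candidates
    # (at most k * number_of_partitions rows) and keeps the global top k.
    candidates = [row for partial in partial_results for row in partial]
    return heapq.nlargest(k, candidates, key=lambda row: row["a"])


def top_k(partitions, k):
    partials = [map_side_top_k(p, k) for p in partitions]
    return reduce_top_k(partials, k)


# Toy data: three partitions of table T, each row holding only column "a".
partitions = [
    [{"a": v} for v in (5, 42, 17, 8)],
    [{"a": v} for v in (99, 3, 23)],
    [{"a": v} for v in (64, 1, 77, 12, 30)],
]

# Equivalent in spirit to: SELECT * FROM T ORDER BY a DESC LIMIT 3
result = top_k(partitions, 3)
print([row["a"] for row in result])
```

In this toy run the reducer sees 9 candidate rows instead of all 12. On real tables, where each partition holds far more than limit rows, the gap is much larger.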
Original title and link: Hive Top-K Optimization (©myNoSQL)
via: http://www.qubole.com/blog/index.php/top-k-optimization/