Matthew Hayes introduces a very interesting new framework from LinkedIn:
Hourglass is designed to make computations over sliding windows more
efficient. For these types of computations, the input data is partitioned in
some way, usually according to time, and the range of input data to process
is adjusted as new data arrives. Hourglass works with input data that is
partitioned by day, as this is a common scheme for partitioning temporal data.
Hourglass is available on GitHub.
We have found that two types of sliding window
computations are extremely common in practice:
- Fixed-length: the length of the window is set to some
constant number of days and the entire window moves
forward as new data becomes available. Example: a daily
report summarizing the number of visitors to a site
from the past 30 days.
- Fixed-start: the beginning of the window stays constant,
but the end slides forward as new input data becomes
available. Example: a daily report summarizing all
visitors to a site since the site launched.
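To make the two window types concrete, here is a minimal Python sketch of how each one could be updated incrementally when a new day of data arrives. It is an illustration of the general idea only, not Hourglass's API: the `daily_visits` data and all function names are hypothetical, and it assumes the aggregate is a simple count that can be added and subtracted per day (a count of distinct visitors would need a different approach).

```python
from datetime import date, timedelta

# Hypothetical per-day visit counts, one entry per daily input partition.
START = date(2013, 1, 1)
daily_visits = {START + timedelta(days=i): 100 + i for i in range(40)}


def fixed_length_window(end_day, length):
    """Days covered by a window of `length` days ending on `end_day` (inclusive)."""
    return [end_day - timedelta(days=i) for i in range(length - 1, -1, -1)]


def fixed_start_window(start_day, end_day):
    """Days covered by a window from `start_day` through `end_day` (inclusive)."""
    n_days = (end_day - start_day).days + 1
    return [start_day + timedelta(days=i) for i in range(n_days)]


def recompute(counts, days):
    """Naive approach: re-read every day in the window."""
    return sum(counts[d] for d in days)


def slide_fixed_length(prev_total, counts, new_day, length):
    """Incremental update: add the newest day, subtract the day that left the window."""
    leaving_day = new_day - timedelta(days=length)
    return prev_total + counts[new_day] - counts.get(leaving_day, 0)


def slide_fixed_start(prev_total, counts, new_day):
    """Incremental update: the window only grows, so just add the newest day."""
    return prev_total + counts[new_day]


# Usage: compute a 30-day total once, then slide it forward by one day.
day_30 = START + timedelta(days=29)
day_31 = day_30 + timedelta(days=1)

total_30 = recompute(daily_visits, fixed_length_window(day_30, 30))
total_31 = slide_fixed_length(total_30, daily_visits, day_31, 30)

since_launch_30 = recompute(daily_visits, fixed_start_window(START, day_30))
since_launch_31 = slide_fixed_start(since_launch_30, daily_visits, day_31)
```

In both cases the incremental update touches one or two daily partitions instead of the whole window, which is the kind of saving a sliding-window framework is after.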
Original title and link: LinkedIn’s Hourglass: Incremental data processing in Hadoop