Paul Lam gives a brief but very clear explanation of the benefits of using Cascalog-Checkpoint:
Building Cascading/Cascalog queries can be visualised as assembling pipes to connect a flow of data. Imagine that you have Flow A and B. Flow B uses the result from A along with other bits. Thus, Flow B is dependent on A. Typically, if a MapReduce job fails for whatever reason, you simply fix what’s wrong and start the job all over again. But what if Flow A takes hours to run (which is common for a MR job) and the error happened in Flow B? Why re-do all that processing for Flow A if we know that it finished successfully?
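The idea in the quote can be sketched in a few lines of plain Python. This is only an illustration of checkpointing in general, not the Cascalog-Checkpoint API: `run_step`, `flow_a`, `flow_b`, `run_workflow`, and the JSON marker files are all hypothetical names invented for this example. Each step saves its result once it finishes, so a rerun after a failure in Flow B skips Flow A.

```python
import json
import tempfile
from pathlib import Path


def run_step(name, func, checkpoint_dir, *inputs):
    """Run func(*inputs) unless a saved result for `name` already exists."""
    marker = Path(checkpoint_dir) / f"{name}.json"
    if marker.exists():
        # Step finished in an earlier run: reuse its saved output.
        return json.loads(marker.read_text())
    result = func(*inputs)
    # Written only after func returns, so a failed step leaves no marker.
    marker.write_text(json.dumps(result))
    return result


# Call counters, used only to show which steps actually ran.
calls = {"a": 0, "b": 0}


def flow_a():
    calls["a"] += 1
    return [1, 2, 3]  # stands in for hours of processing


def flow_b(a_result, fail):
    calls["b"] += 1
    if fail:
        raise RuntimeError("simulated failure in Flow B")
    return sum(a_result)


def run_workflow(checkpoint_dir, fail_b):
    # Flow B depends on the result of Flow A.
    a = run_step("flow_a", flow_a, checkpoint_dir)
    return run_step("flow_b", flow_b, checkpoint_dir, a, fail_b)


checkpoint_dir = tempfile.mkdtemp()

# First run: Flow A succeeds and is checkpointed, Flow B fails.
try:
    run_workflow(checkpoint_dir, fail_b=True)
except RuntimeError:
    pass

# Second run, after the "fix": Flow A is skipped, only Flow B reruns.
total = run_workflow(checkpoint_dir, fail_b=False)
```

After both runs, `calls` shows that Flow A executed once and Flow B twice, which is exactly the saving the quote describes.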
Original title and link: Cascalog-Checkpoint: Fault-Tolerant MapReduce Topologies (©myNoSQL)