Jon Natkins (WibiData) has a guest post on Cloudera’s blog about using “Apache Oozie for scheduling recurring Hadoop jobs”.
CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects, includes a framework called Apache Oozie that can be used to design complex job workflows and coordinate them to occur at regular intervals. In this how-to, you’ll review a simple Oozie coordinator job, and learn how to schedule a recurring job in Hadoop. The example involves adding new data to a Hive table every hour, using Oozie to schedule the execution of recurring Hive scripts.
If you go through this tutorial, you’ll notice that it is not only about Hadoop and Oozie, but also about jars, Hive, and XML. And some more XML. The article ends up taking 10 “pages”[1] to explain how the task is done.
This is the sort of complexity in the Hadoop environment that vendors in the space talk about when promoting their own proprietary, oftentimes commercial, products. Even if I don’t like agreeing with it, it is true.
The current state of affairs in the Hadoop space is that pretty much everything is possible, but the effort needed to get results varies way too much. Open source companies like Cloudera and Hortonworks use this as an advantage (or excuse) to sell their training, consulting services, and tools. But by not addressing this sort of complexity in a more strategic, coordinated way[2], they’re also exposing themselves to the risk of losing market share to companies that will focus on transforming the doable into something repeatably easy and friendly.
[1] I counted the number of times I pressed Page Down to get to the end.
[2] “More strategic, coordinated way” sounds like BS, but I couldn’t find a better way of saying that they should bury the money hatchet and work together to make sure no other company steals their lunch just because it has more budget and focus.