Cloudera Distribution for Hadoop will include Pig and Hive, and why it matters
Cloudera distributes an easy-to-install, pre-packaged version of Hadoop that includes various bug fixes and optimizations. Yesterday they announced the availability of a new version called ☞ CDH2 (n.b. CDH stands for Cloudera Distribution for Hadoop), as well as the first beta of the upcoming version, which will include support for Pig and Hive, the tools that help you put your NoSQL data to work.
But why is this important? While NoSQL solutions are helping us tackle problems like
- cost[1], complexity[2], and productivity
- availability, scalability
- storing huge amounts of data[3]
none of these are really the end goals. While I don’t feel comfortable disagreeing with Google’s chief scientist, Peter Norvig:
We don’t have better algorithms than anyone else. We just have more data.
I don’t think it is only about the data, but rather about the intelligence that can be built around it. And that’s exactly what tools like Hadoop, Pig, and Hive will help us achieve.
References
- [1] In the interview about Cassandra usage at Twitter, Ryan mentioned: (↩)
We have a system in place based on shared mysql + memcache but its quickly becoming prohibitively costly (in terms of manpower) to operate.
- [2] Scalability is not only about size, but also complexity: The new dimension of NoSQL scalability: complexity (↩)
- [3] Why NoSQL is here to stay? (↩)