Fantastic post on the Yahoo! Hadoop blog presenting a series of scenarios where using Pig and Hive makes things a lot better:
> The widespread use of Pig at Yahoo! has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well.
> Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse — as soon as the factory is finished, it is available in the warehouse.
> It will also enable us to share — across both the factory and the warehouse — metadata, monitoring, and management tools; support and operations teams; and hardware.
> So we are excited to add Hive to our toolkit, and look forward to using both these tools together as we lean on Hadoop to do more and more of our data processing.
The use cases mentioned in the post:
- data preparation and presentation:
Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.
- data factories: pipelines (Pig + Oozie), iterative processing (Pig), research (Pig)
- data warehouse: business-intelligence analysis and ad-hoc queries
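To make the split concrete, here is a minimal sketch (not from the original post) of what a factory-side Pig pipeline step might look like; the table and field names (`logs/clicks`, `user_clicks`, etc.) are hypothetical:

```pig
-- Hypothetical factory pipeline step: aggregate raw click logs per user.
raw = LOAD 'logs/clicks' USING PigStorage('\t')
      AS (user:chararray, url:chararray, ts:long);
grouped = GROUP raw BY user;
counts  = FOREACH grouped GENERATE group AS user, COUNT(raw) AS clicks;
-- Write the aggregate where the warehouse side can pick it up.
STORE counts INTO 'warehouse/user_clicks';
```

On the warehouse side, an analyst could then run an ad-hoc HiveQL query over the same (hypothetical) table, e.g. `SELECT user, clicks FROM user_clicks WHERE clicks > 100 ORDER BY clicks DESC LIMIT 10;`.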
In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field.
The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop.
Yahoo! gets way too little credit for its work on big data and its contributions to open source.
Original title and link for this post: Pig and Hive at Yahoo! (published on the NoSQL blog: myNoSQL)