hive: All content tagged as hive in NoSQL databases and polyglot persistence
Monday, 14 March 2011
Hadoop, Hive and Redis for Foursquare Analytics
Foursquare’s move from querying the production databases to a data analytics system using Hadoop and Hive with Redis playing the role of a cache:
- Provide an easy-to-use end-point to run data exploration queries (using SQL and simple web-forms).
- Cache the results of queries (in a database) to power reports, so that the data is available to everyone, whenever it is needed.
- Allow our hadoop cluster to be totally dynamic without having to move data around (we shut it down at night and on weekends).
- Add new data in a simple way (just put it in Amazon S3!).
- Analyse data from several data sources (mongodb, postgres, log-files).

One of the most often heard complains about NoSQL databases is about their reduced querying capabilities. Running reports and analysis against the production servers is only going to work when you have little data and the set of queries is limitted and stable over time. Otherwise you’ll want to run these against a copy of your data to avoid bringing down production databases and avoid corrupting data.
Original title and link: Hadoop, Hive and Redis for Foursquare Analytics (NoSQL databases © myNoSQL)
Monday, 28 February 2011
Cloudera’s Distribution for Apache Hadoop version 3 Beta 4
New version of Cloudera’s Hadoop distro — complete release notes available here:
CDH3 Beta 4 also includes new versions of many components. Highlights include:
- HBase 0.90.1, including much improved stability and operability.
- Hive 0.7.0rc0, including the beginnings of authorization support, support for multiple databases, and many other new features.
- Pig 0.8.0, including many new features like scalar types, custom partitioners, and improved UDF language support.
- Flume 0.9.3, including support for Windows and improved monitoring capabilities.
- Sqoop 1.2, including improvements to usability and Oracle integration.
- Whirr 0.3, including support for starting HBase clusters on popular cloud platforms.
Plus many scalability improvements contributed by Yahoo!.
Cloudera’s CDH is the most popular Hadoop distro bringing together many components of the Hadoop ecosystem. Yahoo remains the main innovator behind Hadoop.
Original title and link: Cloudera’s Distribution for Apache Hadoop version 3 Beta 4 (NoSQL databases © myNoSQL)
via: http://www.cloudera.com/blog/2011/02/cdh3-beta-4-now-available
Saturday, 5 February 2011
The Backstory of Yahoo and Hadoop
We currently have nearly 100 people working on Apache Hadoop and related projects, such as Pig, ZooKeeper, Hive, Howl, HBase and Oozie. Over the last 5 years, we’ve invested nearly 300 person-years into these projects. […] Today Yahoo runs on over 40,000 Hadoop machines (>300k cores). They are used by over a thousand regular users from our science and development teams. Hadoop is at the center of our research in search, advertising, spam detection, personalization and many other topics.
I assume there’s no surpise to anyone I’m a big fan of Yahoo! open source initiatives.
Original title and link: The Backstory of Yahoo and Hadoop (NoSQL databases © myNoSQL)
via: http://developer.yahoo.com/blogs/hadoop/posts/2011/01/the-backstory-of-yahoo-a
Monday, 24 January 2011
Hive and HBase in Toad for Cloud Demo
Jeremiah Peschka put together two short videos demoing Toad for Cloud Eclipse plugin with Hive and HBase. Those complaining about lack of SQL in NoSQL databases should check it out. On a different note though, I did express a few concerns about such a tool related to the complexity and performance of building the indirection layer and supporting operations that are not native to target system. I’d add to these the fact that some NoSQL databases are continuously adding features that can radically change the way this tool performs (e.g. Cassandra 0.7 will feature secondary indexes).
Hive and Toad for Cloud
HBase and Toad for Cloud
Looking at how slow the tool performs and the fact that it doesn’t have any sort of results pagination seems to be a confirmation of some of the concerns expressed above. On the other hand it is kind of difficult to understand a tool by just watching a video.
Original title and link: Hive and HBase in Toad for Cloud Demo (NoSQL databases © myNoSQL)
Monday, 10 January 2011
Amazon Elastic MapReduce Updates
Updates from Amazon including upgraded Hive, multipart upload, optimized JDBC drivers:
- Support for S3’s Large Objects and Multipart Upload
Amazon Elastic MapReduce supports this feature too allowing MapREduce to behin the upload before the Hadoop task is finished
- Upgraded Hive Support
Currently you can run both Hive 0.5 and 0.7 concurrenty in the same cluster
- JDBC Drivers for Hive
Optimized JDBC drivers for Hive.
Original title and link: Amazon Elastic MapReduce Updates (NoSQL databases © myNoSQL)
Thursday, 11 November 2010
Quick Reference: Hadoop Tools Ecosystem
Just a quick reference of the continuously growing Hadoop tools ecosystem.
Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
HDFS
A distributed file system that provides high throughput access to application data.
MapReduce
A software framework for distributed processing of large data sets on compute clusters.
Amazon Elastic MapReduce
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
aws.amazon.com/elasticmapreduce/
Cloudera Distribution for Hadoop (CDH)
Cloudera’s Distribution for Hadoop (CDH) sets a new standard for Hadoop-based data management platforms.
ZooKeeper
A high-performance coordination service for distributed applications. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
HBase
A scalable, distributed database that supports structured data storage for large tables.
Avro
A data serialization system. Similar to ☞ Thrift and ☞ Protocol Buffers.
Sqoop
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
- Imports individual tables or entire databases to files in HDFS
- Generates Java classes to allow you to interact with your imported data
- Provides the ability to import from SQL databases straight into your Hive data warehouse
Flume
Flume is a distributed, reliable, and available service for efficiently moving large amounts of data soon after the data is produced.
archive.cloudera.com/cdh/3/flume/
Hive
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files. It provides a mechanism to put structure on this data and it also provides a simple query language called Hive QL which is based on SQL and which enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.
Pig
A high-level data-flow language and execution framework for parallel computation. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Apache Hadoop. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce).
Cascading
Cascading is a Query API and Query Planner used for defining and executing complex, scale-free, and fault tolerant data processing workflows on a Hadoop cluster.
Cascalog
Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.
github.com/nathanmarz/cascalog
HUE
Hue is a graphical user interface to operate and develop applications for Hadoop. Hue applications are collected into a desktop-style environment and delivered as a Web application, requiring no additional installation for individual users.
You can read more about HUE on ☞ Cloudera blog.
Chukwa
Chukwa is a data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Mahout
A Scalable machine learning and data mining library.
Integration with Relational databases
- Oracle
- Hadoop connector for Oracle Ora-Oop
- Hadoop and Oracle Parallel Processing
Integration with Data Warehouses
The only list I have is MapReduce, RDBMS, and Data Warehouse, but I’m afraid it is quite a bit old. So maybe someone could help me update it.
Anything else? Once we validate this list, I’ll probably have to move it on the NoSQL reference
Original title and link: Quick Reference: Hadoop Tools Ecosystem (NoSQL databases © myNoSQL)
Thursday, 16 September 2010
Cloudera: All Your Big Data Are Belong to Us
Matt Asay (GigaOm):
Where Cloudera shines, however, is in taking these different contributions and making Hadoop relevant for enterprise IT[8], where data mining has waxed and waned over the years. […] Cloudera, in other words, is banking on the complexity of Hadoop to drive enterprise IT to its own Cloudera Enteprise tools.
Additionally, I think what Cloudera is “selling” is a good set of tools — Hadoop, HBase, Hive, Pig, Oozie, Sqoop, Flume, Zookeeper, Hue — put together based on their expertise in the field.
Original title and link: Cloudera: All Your Big Data Are Belong to Us (NoSQL databases © myNoSQL)
via: http://cloud.gigaom.com/2010/09/14/cloudera-all-your-big-data-are-belong-to-us/
Wednesday, 1 September 2010
MongoDB or Hadoop?
Posted on the MongoDB mailing list:
I have about 500M log file entries each representing an “ad impression” (we are an advertising company). Each “hit” has about 50 attributes to it (example: Country, State, City, Adsize, Browser, OS, etc) .. I want to load all 500M into some form of database and then run queries against this set.
As you could expect MongoDB is considered as a possibility. But I’d call that a biased vendor advise. I’ll be blunt: invest in your future by using Hadoop and Pig. Hive may fit too.
Original title and link for this post: MongoDB or Hadoop? (published on the NoSQL blog: myNoSQL)
Monday, 30 August 2010
Pig and Hive at Yahoo!
Fantastic post on Yahoo! Hadoop blog presenting a series of scenarios where using Pig and Hive makes things a lot better:
The widespread use of Pig at Yahoo! has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well. Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse — as soon as the factory is finished, it is available in the warehouse. It will also enable us to share — across both the factory and the warehouse — metadata, monitoring, and management tools; support and operations teams; and hardware. So we are excited to add Hive to our toolkit, and look forward to using both these tools together as we lean on Hadoop to do more and more of our data processing.
The use cases mentioned in the post:
- data preparation and presentation:
Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.
- data factories: pipelines (Pig + Oozie), iterative processing (Pig), research (Pig)
- data warehouse: business-intelligence analysis and ad-hoc queries
In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field. The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop.
Yahoo! gets way to little credit for its work on bigdata and its contributions to the open source.
Original title and link for this post: Pig and Hive at Yahoo! (published on the NoSQL blog: myNoSQL)
via: http://developer.yahoo.net/blogs/hadoop/2010/08/pig_and_hive_at_yahoo.html
Tuesday, 17 August 2010
Howl: Unifying Metadata Layer for Hive and Pig
Yet another contribution from Yahoo!:
Common metadata layer for Hadoop’s Map Reduce, Pig, and Hive
Howl: Unifying Metadata Layer for Hive and Pig originally posted on the NoSQL blog: myNoSQL
Monday, 9 August 2010
NoSQL Databases and Data Warehousing
I didn’t know data warehousing strictly imposes a relational model:
From a philosophical standpoint, my largest problem with NoSQL databases is that they don’t respect relational theory. In short, they aren’t meant to deal with sets of data, but lists. Relational algebra was created to deal with the large sets of data and have them interact. Reporting and analytics rely on that.
I’d bet people building and using Hive, Pig, Flume and other data warehousing tools would disagree with Eric Hewitt.
NoSQL Databases and Data Warehousing originally posted on the NoSQL blog: myNoSQL
Friday, 23 July 2010
Hive integrations at Facebook: HBase & RCFile
Recently I’ve written about more integration for Hive mentioning that Facebook is working on integrating Hive with HBase.
Yahoo! Developer Network Blog has ☞ published the video of John Sichi and Yongqiang He of Facebook presenting about Hive and HBase integration at Hadoop summit[1]
:
- You can find some of the news coming from Hadoop 2010 Summit in this post Hadoop and HBase status updates after Hadoop Summit (↩)
Most Popular Articles
- Translate SQL to MongoDB MapReduce
- Tutorial: Getting Started With Cassandra
- CouchDB vs MongoDB: An attempt for a More Informed Comparison
- Cassandra @ Twitter: An Interview with Ryan King
- A Couple of Nice GUI Tools for MongoDB
- NoSQL benchmarks and performance evaluations
- Ehcache: Distributed Cache or NoSQL Store?
- Document Databases Compared: CouchDB, MongoDB, RavenDB
- Quick Review of Existing Graph Databases
- NoSQL Data Modeling