
hive: All content tagged as hive in NoSQL databases and polyglot persistence

The components and their functions in the Hadoop ecosystem

Edd Dumbill enumerates the various components of the Hadoop ecosystem:

[Image: Hadoop ecosystem]

My quick reference to the Hadoop ecosystem includes a couple of other tools that are not in this list, but it is missing Ambari and HCatalog, which were released later.

Original title and link: The components and their functions in the Hadoop ecosystem (NoSQL database©myNoSQL)


Visualizing Hadoop data with Tableau Software and Cloudera Connector for Tableau

Put together one of the most impressive visualization tools, Tableau Software, with one of the best solutions for big data, Hadoop, and you’ll probably get some astonishing results.

[Image: Cloudera Connector for Tableau. Credit: Cloudera.]

While Tableau Software works with structured data only, with this connector it gets access to Hive through HiveQL.

Original title and link: Visualizing Hadoop data with Tableau Software and Cloudera Connector for Tableau (NoSQL database©myNoSQL)

via: http://www.cloudera.com/blog/2012/02/cloudera-connector-for-tableau-has-been-released/


Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support

Starting today customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:

  • Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
  • Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
  • Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.

That’s like free pr0n for operations teams.
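As a rough illustration of the contention metric quoted above (the ratio between map tasks remaining and map slots), here is a minimal Python sketch. The function name, its parameters, and the handling of zero slots are assumptions made for this example; the announcement does not say how EMR computes or reports the value.

```python
def map_backlog_ratio(map_tasks_remaining: int, map_slots: int) -> float:
    """Hypothetical helper: remaining map tasks per available map slot.

    A value well above 1 suggests the job flow has more pending map work
    than slots to run it on; the zero-slot handling is an assumption.
    """
    if map_slots <= 0:
        # No slots at all: any remaining work is an unbounded backlog.
        return float("inf") if map_tasks_remaining > 0 else 0.0
    return map_tasks_remaining / map_slots


if __name__ == "__main__":
    # Made-up numbers, not real EMR output.
    print(map_backlog_ratio(map_tasks_remaining=120, map_slots=40))
```

Watching a ratio like this over time is one way an operations team might decide whether a job flow needs more task nodes.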

On a different note, I’ve noticed that the Hadoop stack (Hadoop, Hive, Pig) on Amazon Elastic MapReduce is based on second-to-last versions, which suggests that extensive testing is performed on Amazon’s side before new versions are rolled out.

Original title and link: Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support (NoSQL database©myNoSQL)

via: http://aws.typepad.com/aws/2012/01/new-elastic-mapreduce-features-metrics-updates-vpc-and-cluster-compute-support-guest-post.html


Powered by Hadoop and Hive: Budgeting for snow removal in your local community

I don’t know how I ended up becoming the head of our local community association. Anyhow, I’m now responsible for laying out next year’s budget. Most of our expenses seem to be fixed from one year to another, but then there’s the expense for the snow removal service. This year, no snow. Last year, most snow on record in 30 years! How do you budget for something as volatile as snow? I need more data!

Instead of just googling the answer, we’re going to fetch some raw data and feed it into Hadoop Hive.

Hadoop FTW!

Original title and link: Powered by Hadoop and Hive: Budgeting for snow removal in your local community (NoSQL database©myNoSQL)

via: http://magnusljadas.wordpress.com/2012/01/29/search-for-snow-with-hadoop-hive/


Measuring User Retention With Hadoop and Hive

A very practical example of how Hive and Hadoop could deliver value when applied to clickstreams, the most common data for each web property:

Hadoop, Hive, and related technologies are formidable tools for unlocking value from data. […] Retention measurements are particularly significant because they paint a detailed picture about the overall stickiness of a product across the entire userbase.
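To make the idea of a retention measurement concrete, here is a small in-memory Python sketch. It is not the Hive-based approach from the linked post: the event format `(user_id, date)`, the function name, and the sample data are all assumptions for illustration. For each number of days since a user's first visit, it reports the fraction of all users who were active on that day.

```python
from collections import defaultdict
from datetime import date


def retention_by_day(events):
    """Map day offset since first visit -> fraction of all users active.

    `events` is an iterable of (user_id, date) pairs (an assumed format).
    """
    events = list(events)

    # First pass: find each user's first active day.
    first_seen = {}
    for user, day in events:
        if user not in first_seen or day < first_seen[user]:
            first_seen[user] = day

    # Second pass: bucket users by days elapsed since their first visit.
    active = defaultdict(set)
    for user, day in events:
        active[(day - first_seen[user]).days].add(user)

    total = len(first_seen)
    return {offset: len(users) / total for offset, users in sorted(active.items())}


if __name__ == "__main__":
    # Made-up clickstream sample.
    clicks = [
        ("alice", date(2012, 1, 1)),
        ("alice", date(2012, 1, 2)),
        ("bob", date(2012, 1, 1)),
        ("carol", date(2012, 1, 2)),
        ("carol", date(2012, 1, 3)),
        ("alice", date(2012, 1, 8)),
    ]
    print(retention_by_day(clicks))
```

On a real clickstream the same two passes would typically be expressed as grouped queries in Hive rather than Python loops.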

The same clickstream data can be used to calculate visitors’ conversion with the Bayesian discriminant using Hadoop.

Original title and link: Measuring User Retention With Hadoop and Hive (NoSQL database©myNoSQL)

via: http://blog.polarmobile.com/2012/01/measuring-user-retention-with-hadoop-and-hive/


Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0

Let’s start the year with a quick review of the latest releases that happened in December. Make sure that you scroll to the end as there are quite a few important ones.

MongoDB 2.0.2

Announced on Dec. 15th, MongoDB 2.0.2 is a bug fix release:

  • Hit config server only once per mongos on meta data change to not overwhelm
  • Removed unnecessary connection close and open between mongos and mongod after getLastError
  • Replica set primaries close all sockets on stepDown()
  • Do not require authentication for the buildInfo command
  • scons option for using system libraries

Apache Hive 0.8.0

Apache Hive 0.8.0 came out on Dec. 19th. The list of new features, improvements, and bug fixes is extremely long.

Just as a side note, who came up with the idea of having a Hive fans’ page on Facebook?

Apache ZooKeeper 3.4.2

ZooKeeper 3.4.0 was quickly followed by two new minor version updates fixing some critical bugs. The list of issues fixed in ZooKeeper 3.4.1 can be found here, and the 2 bugs fixed in ZooKeeper 3.4.2 are listed here.

As with ZooKeeper 3.4.0, these versions are not yet production ready.

Apache Whirr 0.7.0

Apache Whirr 0.7.0 was released on Dec. 21st, featuring 56 improvements and bug fixes, including support for Puppet & Chef, and Mahout and Ganglia as a service. The complete list can be found here.

Some more details about Whirr 0.7.0 can be found here.

Apache HBase 0.90.5

Released Dec. 23rd, HBase 0.90.5 packs 81 bug fixes. The complete list can be found here.

Redis 2.4.5

Redis 2.4.5 was released on Dec. 23rd and provides 4 bug fixes:

  • [BUGFIX] Fixed a ZUNIONSTORE/ZINTERSTORE bug that can cause a NaN to be inserted as a sorted set element score. This happens when one of the elements has +inf/-inf score and the weight used is 0.
  • [BUGFIX] Fixed memory leak in CLIENT INFO.
  • [BUGFIX] Fixed a non critical SORT bug (Issue 224).
  • [BUGFIX] Fixed a replication bug: now the timeout configuration is respected during the connection with the master.
  • --quiet option implemented in the Redis test.
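The NaN in the first fix above comes from ordinary floating-point arithmetic: multiplying an infinite score by a weight of 0 has no defined result under IEEE 754, so it yields NaN. The short Python sketch below shows only that arithmetic; it is not Redis code, and the function name is made up for illustration.

```python
import math


def weighted_score(score: float, weight: float) -> float:
    """Hypothetical stand-in for applying a weight to a sorted set score."""
    return score * weight


if __name__ == "__main__":
    # An infinite score multiplied by a zero weight is NaN, not 0.
    print(math.isnan(weighted_score(math.inf, 0.0)))
    print(math.isnan(weighted_score(-math.inf, 0.0)))
    # A finite score with the same weight behaves as expected.
    print(weighted_score(2.0, 0.0))
```

A NaN score is a problem for a sorted set because NaN does not compare as less than, equal to, or greater than any other value, so the element has no well-defined position.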

Last, but definitely one of the most important announcements made in December:

Hadoop 1.0.0

Based on the 0.20-security code line, Hadoop 1.0.0 was announced on Dec. 29th. This release includes support for:

  • HBase (append/hsync/hflush) and Security
  • Webhdfs (with full support for security)
  • Performance enhanced access to local files for HBase
  • Other performance enhancements, bug fixes, and features
  • All version 0.20.205 and prior 0.20.2xx features

Complete release notes are available here.

Stéphane Fréchette, Ryan Slobojan, Duane Moore, Arun C. Murthy

And with this we are ready for 2012.

Original title and link: Last NoSQL Releases in 2011: MongoDB, Hive, ZooKeeper, Whirr, HBase, Redis, and Hadoop 1.0.0 (NoSQL database©myNoSQL)


Doug Cutting About Hadoop, Its Adoption and Future, and Its Relationship With Relational Databases

Jaikumar Vijayan (Computerworld) interviews Doug Cutting:

Q: How would you describe Hadoop to a CIO or a CFO? Why should enterprises care about it?

A: At a really simple level, it lets you affordably save and process vastly more data than you could before. With more data and the ability to process it, companies can see more, they can learn more, they can do more. [With Hadoop] you can start to do all sorts of analyses that just weren’t practical before. You can start to look at patterns over years, over seasons, across demographics. You have enough data to fill in patterns and make predictions and decide, “How should we price things?” and “What should we be selling now?” and “How should we advertise?” It is not only about having data for longer periods, but also richer data about any given period.

The interview covers topics such as the reasons behind the interest in Hadoop, Hadoop adoption inside and outside the enterprise world, and the limitations of relational databases. It is a must-read, if only they had added some newlines here and there.

Original title and link: Doug Cutting About Hadoop, Its Adoption and Future, and Its Relationship With Relational Databases (NoSQL database©myNoSQL)

via: http://www.computerworld.com/s/article/9222758/The_Grill_Doug_Cutting


Looking for a Map Reduce Language

Java, Cascading, Pipes (C++), Hive, Pig, Rhipe, Dumbo, Cascalog… which one of these should you use for writing Map Reduce code?

Antonio Piccolboni puts them to the test:

At the end of this by necessity incomplete and unscientific language and library comparison, there is a winner and there isn’t. There isn’t because language comparison is always multidimensional and subjective but also because the intended applications are very different. On the other hand, looking for a general purpose, moderately elegant, not necessarily most efficient, not necessarily mature language for exploration purposes, Rhipe seems to fit the bill pretty nicely.
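For readers new to the programming model that all of these tools target, here is a minimal single-process Python sketch of the map and reduce phases, using word counting as the example. It does not use Hadoop or any of the libraries compared in the post, and the function names are invented for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def reduce_phase(word, counts):
    """Combine all counts emitted for one word into a single total."""
    return word, sum(counts)


def run_word_count(lines):
    # Shuffle step: group the mapped values by key before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(run_word_count(["Hive Pig Hive", "pig Rhipe"]))
```

The languages and libraries in the comparison differ mainly in how much of this plumbing (grouping, distribution across machines, input and output formats) they hide from the author.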

Original title and link: Looking for a Map Reduce Language (NoSQL database©myNoSQL)

via: http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html


What Is Informatica HParser for Hadoop?

Sifting through the PRish announcements related to Informatica HParser, what I’ve figured out so far is:

  • it is the T in ETL
  • a visual tool for creating parsing definitions for formats like web logs, XML, JSON, FIX, SWIFT, HL7, CDR, WORD, PDF, XLS, etc.
  • transformations can be accessed from Hadoop MapReduce, Hive, or Pig
  • the benefits of using HParser come from being able to share the same parsing definitions/transformations in the context of the Hadoop distributed environment
  • HParser tries to provide an optimal transformation solution when streaming, splitting, and processing large files
  • HParser is available in two licensing formats: community and commercial

Original title and link: What Is Informatica HParser for Hadoop? (NoSQL database©myNoSQL)


Biodiversity Indexing: Offline Processing With Hadoop, Hive, Sqoop, Oozie

The architecture for offline processing of biodiversity data, based on Sqoop, Hadoop, Oozie, and Hive:

[Image: Hadoop, Sqoop, Oozie, and Hive in biodiversity indexing]

And its future:

Following this processing work, we expect to modify our crawling to harvest directly into HBase. The flexibility HBase offers will allow us to grow incrementally the richness of the terms indexed in the Portal, while integrating nicely into Hadoop based workflows. The addition of coprocessors to HBase is of particular interest to further reduce the latency involved in processing, by eliminating batch processing altogether.

Many companies working with large datasets have to deal with multiple systems and duplicate data between the online services and offline processors. While the infrastructure costs are going down, the costs of complexity are not. The HBase + Hadoop and Cassandra + Brisk combos are starting to address this problem.

Original title and link: Biodiversity Indexing: Offline Processing With Hadoop, Hive, Sqoop, Oozie (NoSQL database©myNoSQL)

via: http://www.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/


Choosing Technologies: The Library of Congress and the Twitter Archive

Remember when everyone was suggesting solutions for Twitter’s architecture? Now the Library of Congress is trying to figure out what technologies to use to store the Twitter archive:

The project is still very much under construction, and the team is weighing a number of different open source technologies in order to build out the storage, management and querying of the Twitter archive. While the decision hasn’t been made yet on which tools to use, the library is testing the following in various combinations: Hive, ElasticSearch, Pig, Elephant-bird, HBase, and Hadoop.

Note that in terms of storage only HBase is mentioned, even though Twitter’s main tweet storage is MySQL.

Original title and link: Choosing Technologies: The Library of Congress and the Twitter Archive (NoSQL database©myNoSQL)

via: http://blogs.forbes.com/oreillymedia/2011/06/13/the-library-of-congress-twitter-archive-one-year-later/


Experimenting with Hadoop using Cloudera VirtualBox Demo

[Image: CDH VirtualBox VM on Mac OS X]

If you don’t count the download, you’ll get this up and running in 5 minutes tops. At the end you’ll have Hadoop, Sqoop, Pig, Hive, HBase, ZooKeeper, Oozie, Hue, Flume, and Whirr all configured and ready to experiment with.

Making it easy for users to experiment with these tools increases the chances for adoption. Adoption means business.

Original title and link: Experimenting with Hadoop using Cloudera VirtualBox Demo (NoSQL databases © myNoSQL)

via: http://www.cloudera.com/blog/2011/06/cloudera-distribution-including-apache-hadoop-3-demo-vm-installation-on-mac-os-x-using-virtualbox-cdh/