
PIG: All content tagged as PIG in NoSQL databases and polyglot persistence

Hortonworks Data Platform 1.0

Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.

Some info points:

  1. Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop


    1. HDP 1.0 is based on Apache Hadoop 1.0
    2. Apache Ambari is used for installation and provisioning
    3. The same Apache Ambari is behind the Hortonworks Management Console
    4. For Data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
    5. Apache HCatalog is the solution offering metadata and table management
  2. Hortonworks Data Platform is 100% open source—I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community

  3. HDP comes with 3 levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster

One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on using VMware-powered virtual machines for the NameNode and JobTracker. My first thought about this approach is that it was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly available version of the NameNode (the Cloudera Hadoop distribution uses this solution), and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.

You can read a lot of posts about this announcement, but you’ll find all the details in Hortonworks’s John Kreisa’s post here and the PR announcement.

Original title and link: Hortonworks Data Platform 1.0 (NoSQL database©myNoSQL)


Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop

Apache Bigtop:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

Currently packaging:

  • Apache Hadoop 1.0.x
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. Practically, though, its goal is to ensure that all the components of the wide Hadoop ecosystem remain interoperable.



Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark

Sami Badawi enumerates the issues he encountered while trying all these tools (Pig [1], Scalding [2], Scoobi [3], Hive [4], Spark [5], Scrunch [6], Cascalog [7]) for a simple experiment with Hadoop:

The task was to read log files join with other data do some statistics on arrays of doubles. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data.


  1. Pig : a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. 

  2. Scalding: A Scala API for Cascading 

  3. Scoobi: a Scala productivity framework for Hadoop 

  4. Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. 

  5. Spark: open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write 

  6. Scrunch: a Scala wrapper for Crunch 

  7. Cascalog: a fully-featured Clojure-based data processing and querying library for Hadoop  


via: http://blog.samibadawi.com/2012/03/hive-pig-scalding-scoobi-scrunch-and.html


Jython UDFs In Pig - The More Powerful The Language, The Shorter The Program

Jython UDFs were added to Pig in version 0.8, and are pretty stable in the current version, 0.9.2. They are highly convenient, and a major timesaver.
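
To give an idea of how short such a UDF can be, here is a minimal sketch of a Jython UDF file. The function `extract_domain`, the file name `udfs.py`, and the alias `myfuncs` are made up for illustration and do not come from the linked post. Pig makes an `outputSchema` decorator available when it loads the script; the fallback definition is only an assumption-labelled shim so the same file can be tried under plain Python.

```python
# Hypothetical file: udfs.py (name chosen for this example only).
# Pig provides `outputSchema` when it loads a Jython UDF script.
# The fallback below exists only so the file also runs outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            # Store the schema string on the function for inspection.
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("domain:chararray")
def extract_domain(url):
    """Return the lower-cased host part of a URL, or None for null input."""
    if url is None:
        return None
    # Drop the scheme if present, then keep everything before the first slash.
    rest = url.split("://", 1)[-1]
    return rest.split("/", 1)[0].lower()


# From a Pig script, usage would look roughly like this
# (relation and field names are hypothetical):
#   REGISTER 'udfs.py' USING jython AS myfuncs;
#   hosts = FOREACH logs GENERATE myfuncs.extract_domain(url);
```

The equivalent Java UDF would need a class, a build step, and a jar to register, which is where the time saving comes from.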

The subtitle—”the more powerful the language, the shorter the program”—says it all.


via: http://datasyndrome.com/post/17584921570/jython-udfs-in-pig


Lessons in Data Visualization: How to create a visualization

Pete Warden:

Pick a question. Now that I had a rough idea for what I wanted to visualize, I really needed to focus on what I would be doing. The best way to do that is to choose the exact title you want to give your visualization.

Oftentimes, you might be tempted to start with an answer in the form of a hypothesis or preconception. The results you get might be valid but radically different.

As for the technologies used for data crunching, it’s Pig on Hadoop over a Cassandra cluster:

In my case, we have a Cassandra cluster with information on more than 350 million photos shared on Facebook. I’ve been running Pig analytics jobs regularly to get a view of what we have in there. […] In this case I already had some Pig scripts asking similar questions, so I was able to adapt one of those. The biggest surprise was when I ran into issues with some of the joins. The hard part was running the Hadoop job to gather the raw data from our Cassandra cluster, and that worked. I was able to output smaller files containing the gathered data, and then run a local Pig job to do the joins I needed.


via: http://radar.oreilly.com/2012/02/how-to-create-visualization-facebook-vacation.html


The components and their functions in the Hadoop ecosystem

Edd Dumbill enumerates the various components of the Hadoop ecosystem:

Hadoop ecosystem

My quick reference of the Hadoop ecosystem includes a couple of other tools that are not in this list, but it does not cover Ambari and HCatalog, which were released later.



PigEditor: Eclipse plugin for Apache Pig

PigEditor:

  • syntax/errors highlighting
  • check alias name existence
  • auto complete keywords, UDF names
  • outline…



Paper: TiMR is a Time-oriented data processing system in MapReduce

From the “Temporal Analytics on Big Data for Web Advertising” paper:

TiMR is a framework that transparently combines a map-reduce (M-R) system with a temporal DSMS. Users express time-oriented analytics using a temporal (DSMS) query language such as StreamSQL or LINQ. Streaming queries are declarative and easy to write/debug, real-time-ready, and often several orders of magnitude smaller than equivalent custom code for time-oriented applications. TiMR allows the temporal queries to transparently scale on offline temporal data in a cluster by leveraging existing M-R infrastructure.

Broadly speaking, TiMR’s architecture of compiling higher level queries into M-R stages is similar to that of Pig/SCOPE. However, TiMR specializes in time-oriented queries and data, with several new features such as: (1) the use of an unmodified DSMS as part of compilation, parallelization, and execution; and (2) the exploitation of new temporal parallelization opportunities unique to our setting. In addition, we leverage the temporal algebra underlying the DSMS in order to guarantee repeatability across runs in TiMR within M-R (when handling failures), as well as over live data.

According to the paper, DSMSs (data stream management systems) work well for real-time data, but are not massively scalable. On the other hand, Map-Reduce is extremely scalable, but computation is performed on offline data. TiMR proposes a solution that gets closer to a real-time map-reduce.

Read or download the paper after the break.


Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support

Starting today customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:

  • Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
  • Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
  • Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.

That’s like free pr0n for operations teams.

On a different note, I’ve noticed that the Hadoop stack (Hadoop, Hive, Pig) on Amazon Elastic MapReduce is based on second-to-last versions, which suggests that extensive testing is performed on Amazon’s side before new versions are rolled out.


via: http://aws.typepad.com/aws/2012/01/new-elastic-mapreduce-features-metrics-updates-vpc-and-cluster-compute-support-guest-post.html


DataFu: Open Source Apache Pig UDFs by LinkedIn

Here’s a taste of what you can do with DataFu:

  • Run PageRank on a large number of independent graphs.
  • Perform set operations such as intersect and union.
  • Compute the haversine distance between two points on the globe.
  • Create an assertion on input data which will cause the script to fail if the condition is not met.
  • Perform various operations on bags such as append a tuple, prepend a tuple, concatenate bags, generate unordered pairs, etc.
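
As an illustration of one item on that list, the haversine distance is short enough to sketch in a few lines of Python. This is not DataFu's implementation, just the standard formula; the function name `haversine_km` and the mean Earth radius of 6371 km are my own assumptions.

```python
import math

# Assumed mean Earth radius in kilometres; other values are also in use.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

Two points on the equator half a world apart, `haversine_km(0, 0, 0, 180)`, come out at half the circumference of a sphere with that radius.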

I’m starting to notice a pattern here. Twitter is open sourcing pretty much everything they are doing related to data storage. Yahoo (now Hortonworks) and Cloudera are the forces behind the open source Hadoop and the tools to bring data to Hadoop. And LinkedIn is starting to open source the tools they are using on top of Hadoop to analyze big data.

What is interesting about this is that you might not get the most polished tools, but they definitely are battle tested.


via: http://engineering.linkedin.com/open-source/introducing-datafu-open-source-collection-useful-apache-pig-udfs


Doug Cutting About Hadoop, Its Adoption and Future, and Its Relationship With Relational Databases

Jaikumar Vijayan (Computerworld) interviews Doug Cutting:

Q: How would you describe Hadoop to a CIO or a CFO? Why should enterprises care about it?

A: At a really simple level, it lets you affordably save and process vastly more data than you could before. With more data and the ability to process it, companies can see more, they can learn more, they can do more. [With Hadoop] you can start to do all sorts of analyses that just weren’t practical before. You can start to look at patterns over years, over seasons, across demographics. You have enough data to fill in patterns and make predictions and decide, “How should we price things?” and “What should we be selling now?” and “How should we advertise?” It is not only about having data for longer periods, but also richer data about any given period.

The interview covers topics such as the reasons for the interest in Hadoop, Hadoop adoption in the enterprise world and outside it, and the limitations of relational databases. It is a must-read, if only they had added some newlines here and there.


via: http://www.computerworld.com/s/article/9222758/The_Grill_Doug_Cutting


Amazon Elastic MapReduce Upgrades to Hadoop 0.20.205, Pig 0.9.1, AMI Versioning, and Amazon VPC

Starting today you can run your job flows using Hadoop 0.20.205 and Pig 0.9.1. To simplify the upgrade process, we have also introduced the concept of AMI versions. You can now provide a specific AMI version to use at job flow launch or specify that you would like to use our “latest” AMI, ensuring that you are always using our most up-to-date features. The following AMI versions are now available:

  • Version 2.0: Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1, Debian 6.0.2 (Squeeze)
  • Version 1.0: Hadoop 0.18.3 and 0.20.2, Hive 0.5 and 0.7.1, Pig 0.3 and 0.6, Debian 5.0 (Lenny)

Amazon Elastic MapReduce is the perfect solution for:

  1. learning and experimenting with Hadoop
  2. running huge processing jobs in cases where your company doesn’t already have the necessary resources


via: https://forums.aws.amazon.com/ann.jspa?annID=1275