PIG: All content tagged as PIG in NoSQL databases and polyglot persistence
Friday, 15 June 2012
Hortonworks Data Platform 1.0
Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.
Some info points:
- Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop
- HDP 1.0 is based on Apache Hadoop 1.0
- Apache Ambari is used for installation and provisioning
- The same Apache Ambari is behind the Hortonworks Management Console
- For data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
- Apache HCatalog is the solution offering metadata and table management
- Hortonworks Data Platform is 100% open source; I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community
- HDP comes with 3 levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster
One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on using VMware-powered virtual machines for the NameNode and JobTracker. My first thought about this approach was that it was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly available version of the NameNode (Cloudera Hadoop Distribution uses this solution), and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.
You can read a lot of posts about this announcement, but you’ll find all the details in Hortonworks’s John Kreisa’s post here and the PR announcement.
Original title and link: Hortonworks Data Platform 1.0 (©myNoSQL)
Wednesday, 4 April 2012
Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop
The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.
Currently packaging:
- Apache Hadoop 1.0.x
- Apache Zookeeper 3.4.3
- Apache HBase 0.92.0
- Apache Hive 0.8.1
- Apache Pig 0.9.2
- Apache Mahout 0.6.1
- Apache Oozie 3.1.3
- Apache Sqoop 1.4.1
- Apache Flume 1.0.0
- Apache Whirr 0.7.0
Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. Practically, though, its goal is to ensure that all the components of the wide Hadoop ecosystem remain interoperable.
Original title and link: Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop (©myNoSQL)
Monday, 26 March 2012
Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark
Sami Badawi enumerates the issues he encountered while trying all these tools (Pig, Scalding, Scoobi, Hive, Spark, Scrunch, Cascalog) for a simple experiment with Hadoop:
The task was to read log files join with other data do some statistics on arrays of doubles. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data.
- Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
- Spark: open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write.
- Cascalog: a fully-featured Clojure-based data processing and querying library for Hadoop.
Original title and link: Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark (©myNoSQL)
via: http://blog.samibadawi.com/2012/03/hive-pig-scalding-scoobi-scrunch-and.html
Thursday, 16 February 2012
Jython UDFs In Pig - The More Powerful The Language, The Shorter The Program
Jython UDFs were added to Pig in version 0.8, and are pretty stable in the current version, 0.9.2. They are highly convenient, and a major timesaver.
The subtitle, “the more powerful the language, the shorter the program”, says it all.
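To give a feel for how short such a UDF can be, here is a minimal sketch. The file name `udfs.py` and the function `extract_domain` are made up for this illustration and do not come from the linked post. Pig provides the `outputSchema` decorator when it loads the script; the fallback definition is only an assumption that lets the same file run under plain Python for testing.

```python
# udfs.py: hypothetical Jython UDF file for Pig (all names are illustrative).
# Pig makes the outputSchema decorator available when it loads the script;
# the fallback below exists only so the file also runs under plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema  # remember the declared Pig schema
            return func
        return wrap


@outputSchema("domain:chararray")
def extract_domain(url):
    """Return the lower-cased host part of a URL, or None for empty input."""
    if url is None:
        return None
    rest = url.split("://", 1)[-1]   # drop the scheme, if there is one
    host = rest.split("/", 1)[0]     # drop the path
    host = host.split(":", 1)[0]     # drop the port
    return host.lower() or None
```

Registered with something like `REGISTER 'udfs.py' USING jython AS myfuncs;`, the function could then be called as `myfuncs.extract_domain(url)` inside a `FOREACH ... GENERATE` statement.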
Original title and link: Jython UDFs In Pig - The More Powerful The Language, The Shorter The Program (©myNoSQL)
via: http://datasyndrome.com/post/17584921570/jython-udfs-in-pig
Wednesday, 15 February 2012
Lessons in Data Visualization: How to create a visualization
Pete Warden:
Pick a question. Now that I had a rough idea for what I wanted to visualize, I really needed to focus on what I would be doing. The best way to do that is to chose the exact title you want to give your visualization.
Oftentimes, you might be tempted to start with an answer in the form of a hypothesis or preconception. The results you get might be valid but radically different.
As for the technologies used for data crunching, it’s Pig on Hadoop over a Cassandra cluster:
In my case, we have a Cassandra cluster with information on more than 350 million photos shared on Facebook. I’ve been running Pig analytics jobs regularly to get a view of what we have in there. […] In this case I already had some Pig scripts asking similar questions, so I was able to adapt one of those. The biggest surprise was when I ran into issues with some of the joins. The hard part was running the Hadoop job to gather the raw data from our Cassandra cluster, and that worked. I was able to output smaller files containing the gathered data, and then run a local Pig job to do the joins I needed.
Original title and link: Lessons in Data Visualization: How to create a visualization (©myNoSQL)
via: http://radar.oreilly.com/2012/02/how-to-create-visualization-facebook-vacation.html
Monday, 13 February 2012
The components and their functions in the Hadoop ecosystem
Edd Dumbill enumerates the various components of the Hadoop ecosystem:

My quick reference of the Hadoop ecosystem includes a couple of other tools that are not in this list, with the exception of Ambari and HCatalog, which were released later.
Original title and link: The components and their functions in the Hadoop ecosystem (©myNoSQL)
Wednesday, 8 February 2012
PigEditor: Eclipse plugin for Apache Pig
- syntax/errors highlighting
- check alias name existence
- auto complete keywords, UDF names
- outline…

Original title and link: PigEditor: Eclipse plugin for Apache Pig (©myNoSQL)
Sunday, 5 February 2012
Paper: TiMR is a Time-oriented data processing system in MapReduce
From the “Temporal Analytics on Big Data for Web Advertising” paper:
TiMR is a framework that transparently combines a map-reduce (M-R) system with a temporal DSMS. Users express time-oriented analytics using a temporal (DSMS) query language such as StreamSQL or LINQ. Streaming queries are declarative and easy to write/debug, real-time-ready, and often several orders of magnitude smaller than equivalent custom code for time-oriented applications. TiMR allows the temporal queries to transparently scale on offline temporal data in a cluster by leveraging existing M-R infrastructure.
Broadly speaking, TiMR’s architecture of compiling higher level queries into M-R stages is similar to that of Pig/SCOPE. However, TiMR specializes in time-oriented queries and data, with several new features such as: (1) the use of an unmodified DSMS as part of compilation, parallelization, and execution; and (2) the exploitation of new temporal parallelization opportunities unique to our setting. In addition, we leverage the temporal algebra underlying the DSMS in order to guarantee repeatability across runs in TiMR within M-R (when handling failures), as well as over live data.
According to the paper, DSMS work well for real-time data, but are not massively scalable. On the other hand, Map-Reduce is extremely scalable, but computation is performed on offline data. TiMR proposes a solution that is getting closer to a real-time map-reduce.
Read or download the paper after the break.
Wednesday, 1 February 2012
Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support
Starting today customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:
- Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
- Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
- Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.
That’s like free pr0n for operations teams.
On a different note, I’ve noticed that the Hadoop stack (Hadoop, Hive, Pig) on Amazon Elastic MapReduce is based on second-to-last versions, which suggests that Amazon performs extensive testing before rolling new versions out:
- Hadoop: 0.20.205 (precursor of Hadoop 1.0.0), which supports append and security, but doesn’t have RAID, symlinks, or MR2
- Hive: 0.7.1 (precursor of latest 0.8.0)
- Pig: 0.9.1 (precursor of latest 0.9.2)
Original title and link: Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support (©myNoSQL)
Thursday, 12 January 2012
DataFu: Open Source Apache Pig UDFs by LinkedIn
Here’s a taste of what you can do with DataFu:
- Run PageRank on a large number of independent graphs.
- Perform set operations such as intersect and union.
- Compute the haversine distance between two points on the globe.
- Create an assertion on input data which will cause the script to fail if the condition is not met.
- Perform various operations on bags such as append a tuple, prepend a tuple, concatenate bags, generate unordered pairs, etc.
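As an illustration of one item on this list, the haversine distance can be sketched in a few lines. This is not DataFu's implementation (DataFu ships its UDFs for use from Pig); it is a plain Python version of the standard formula, and the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions of the sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # min() guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

A quarter of the equator, `haversine_km(0, 0, 0, 90)`, should come out as a quarter of the circumference implied by the chosen radius, which makes for an easy sanity check.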
I’m starting to notice a pattern here. Twitter is open sourcing pretty much everything they are doing related to data storage. Yahoo (now Hortonworks) and Cloudera are the forces behind the open source Hadoop and the tools to bring data to Hadoop. And LinkedIn is starting to open source the tools they are using on top of Hadoop to analyze big data.
What is interesting about this is that you might not get the most polished tools, but they definitely are battle tested.
Original title and link: DataFu: Open Source Apache Pig UDFs by LinkedIn (©myNoSQL)
Tuesday, 20 December 2011
Doug Cutting About Hadoop, Its Adoption and Future, and Its Relationship With Relational Databases
Jaikumar Vijayan (Computerworld) interviews Doug Cutting:
Q: How would you describe Hadoop to a CIO or a CFO? Why should enterprises care about it?
A: At a really simple level, it lets you affordably save and process vastly more data than you could before. With more data and the ability to process it, companies can see more, they can learn more, they can do more. [With Hadoop] you can start to do all sorts of analyses that just weren’t practical before. You can start to look at patterns over years, over seasons, across demographics. You have enough data to fill in patterns and make predictions and decide, “How should we price things?” and “What should we be selling now?” and “How should we advertise?” It is not only about having data for longer periods, but also richer data about any given period.
The interview covers topics such as the reasons for the interest in Hadoop, Hadoop adoption inside and outside the enterprise world, and the limitations of relational databases. It is a must read, if only they had added some newlines here and there.
Original title and link: Doug Cutting About Hadoop, Its Adoption and Future, and Its Relationship With Relational Databases (©myNoSQL)
via: http://www.computerworld.com/s/article/9222758/The_Grill_Doug_Cutting
Monday, 12 December 2011
Amazon Elastic MapReduce Upgrades to Hadoop 0.20.205, Pig 0.9.1, AMI Versioning, and Amazon VPC
Starting today you can run your job flows using Hadoop 0.20.205 and Pig 0.9.1. To simplify the upgrade process, we have also introduced the concept of AMI versions. You can now provide a specific AMI version to use at job flow launch or specify that you would like to use our “latest” AMI, ensuring that you are always using our most up-to-date features. The following AMI versions are now available:
- Version 2.0: Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1, Debian 6.0.2 (Squeeze)
- Version 1.0: Hadoop 0.18.3 and 0.20.2, Hive 0.5 and 0.7.1, Pig 0.3 and 0.6, Debian 5.0 (Lenny)
Amazon Elastic MapReduce is the perfect solution for:
- learning and experimenting with Hadoop
- running huge processing jobs in cases where your company doesn’t already have the necessary resources
Original title and link: Amazon Elastic MapReduce Upgrades to Hadoop 0.20.205, Pig 0.9.1, AMI Versioning, and Amazon VPC (©myNoSQL)
