Pig: All content tagged as Pig in NoSQL databases and polyglot persistence

DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn

Sam Shah in a guest post on Hortonworks blog:

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something. […] Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.”

“A penetrating oil and water-displacing spray”? “littlepiggy”? Seriously?

How could one come up with these names for such a useful library of statistical functions, PageRank, and set and bag operations?
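For a taste of what the library offers, here is a hedged sketch of computing quantiles with DataFu's `Quantile` UDF from Pig. The jar name, input file, and schema below are illustrative assumptions, not from the post:

```pig
-- Sketch: median and extremes of a column via DataFu's Quantile UDF.
-- The jar path and input schema are made-up placeholders.
REGISTER 'datafu.jar';
DEFINE Quantile datafu.pig.stats.Quantile('0.0', '0.5', '1.0');

scores  = LOAD 'scores.tsv' AS (member_id:int, score:double);
grouped = GROUP scores ALL;
-- Quantile expects its input bag to be sorted
quantiles = FOREACH grouped {
    sorted = ORDER scores BY score;
    GENERATE Quantile(sorted.score);
};
DUMP quantiles;
```

Whatever the names, a one-line `DEFINE` replacing hand-rolled statistics code is exactly the WD-40 role the quote is describing.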

Original title and link: DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/datafu/


11 Interesting Releases From the First Weeks of January

The list of releases I wanted to post about has been growing fast these last couple of weeks, so instead of waiting any longer, here it is (in no particular order1):

  1. (Jan. 2nd) Cassandra 1.2 — announcement on DataStax’s blog. I’m currently studying it and working on a post looking at what’s new in Cassandra 1.2.
  2. (Jan. 10th) Apache Pig 0.10.1 — Hortonworks wrote about it
  3. (Jan. 10th) DataStax Community Edition 1.2 and OpsCenter 2.1.3 — DataStax announcement
  4. (Jan. 10th) CouchDB 1.0.4, 1.1.2, and 1.2.1 — releases fixing some security vulnerabilities
  5. (Jan. 11th) MongoDB 2.3.2 unstable — announcement. This dev release includes support for full text indexing. For more details you can check:

    […] an open source project extending Hadoop and Hive with a collection of useful user-defined-functions. Its aim is to make the Hive Big Data developer more productive, and to enable scalable and robust dataflows.


  1. I’ve tried to order it chronologically, but most probably I’ve failed. 

Original title and link: 11 Interesting Releases From the First Weeks of January (NoSQL database©myNoSQL)


What Is the Spring Data Project?

Short answer: another sign that the Spring framework wants to do everything everywhere. A mammoth1.

Version 1.0 was released in 2004 as a lightweight alternative to Enterprise Java Beans (EJB). Since then, Spring has expanded into many other areas of enterprise development, such as enterprise integration (Spring Integration), batch processing (Spring Batch), web development (Spring MVC, Spring Webflow), and security (Spring Security). Spring continues to push the envelope for mobile applications (Spring Mobile), social media (Spring Social), rich web applications (Spring MVC, s2js Javascript libraries), and NoSQL data access (Spring Data).

[…]

The complete pipeline can be implemented using Spring for Apache Hadoop along with Spring Integration and Spring Batch. However, Hadoop has its own set of challenges which the Spring for Apache Hadoop project is designed to address. Like all Spring projects, it leverages the Spring Framework to provide a consistent structure and simplify writing Hadoop applications. For example, Hadoop applications rely heavily on command shell tools, so applications end up being a hodge-podge of Perl, Python, Ruby, and bash scripts. Spring for Apache Hadoop provides a dedicated XML namespace for configuring Hadoop jobs, with embedded scripting features and support for Hive and Pig.
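The namespace mentioned above looks roughly like the following sketch. The element names reflect my reading of the Spring for Apache Hadoop documentation of the time; the file paths, class names, and property values are illustrative assumptions:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hedged sketch of the hdp: namespace; mapper/reducer classes and
     paths are invented for illustration. -->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
         http://www.springframework.org/schema/beans
         http://www.springframework.org/schema/beans/spring-beans.xsd
         http://www.springframework.org/schema/hadoop
         http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

  <!-- Cluster configuration as plain properties -->
  <hdp:configuration>
    fs.default.name=hdfs://localhost:9000
  </hdp:configuration>

  <!-- A MapReduce job declared instead of shell-scripted -->
  <hdp:job id="wordcountJob"
           input-path="/input/" output-path="/output/"
           mapper="org.example.WordMapper"
           reducer="org.example.WordReducer"/>
</beans>
```

The point of the sketch: the Perl/Python/bash glue is replaced by declarative configuration that participates in the normal Spring lifecycle.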


  1. There’s a business reason for doing this though: when you have tons of clients, you want to make sure they don’t have a chance to step outside. Is this new year’s resolution a heresy: I plan to use vastly less Spring this year.

Original title and link: What Is the Spring Data Project? (NoSQL database©myNoSQL)

via: http://www.odbms.org/blog/2013/01/the-spring-data-project-interview-with-david-turanski/


Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra

A three part article from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:

Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
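The "few lines of code" claim is easy to picture with a hedged sketch: read from MongoDB, aggregate, and write to HBase. The connector jar, database URIs, table names, and schemas below are my own placeholders, not taken from the Hortonworks series:

```pig
-- Sketch of the "duct tape" idea: MongoDB in, HBase out.
-- Jars, URIs, and field names are illustrative assumptions.
REGISTER 'mongo-hadoop-pig.jar';

events = LOAD 'mongodb://localhost:27017/mydb.events'
         USING com.mongodb.hadoop.pig.MongoLoader('user_id:chararray, clicks:int');

by_user = FOREACH (GROUP events BY user_id)
          GENERATE group AS user_id, SUM(events.clicks) AS total_clicks;

-- First field becomes the HBase row key; the rest map to columns
STORE by_user INTO 'hbase://user_stats'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('stats:total_clicks');
```

Swapping either end for a different store is mostly a matter of changing the `LOAD`/`STORE` functions, which is what makes Pig the connective tissue here.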

Original title and link: Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra (NoSQL database©myNoSQL)


Pig Performance and Optimization Analysis

Although Pig is designed as a data flow language, it supports all the functionalities required by TPC-H; thus it makes sense to use TPC-H to benchmark Pig’s performance. Below is the final result.

[Figure: TPC-H 100GB benchmark results]
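TPC-H queries map naturally onto Pig's relational operators. A hedged sketch in the spirit of TPC-H Q1, with the schema heavily simplified (the real `lineitem` table has 16 columns, and the file layout here is an assumption):

```pig
-- Sketch of a TPC-H Q1-style pricing summary in Pig.
-- Schema is a simplified stand-in for the real lineitem table.
lineitem = LOAD 'lineitem.tbl' USING PigStorage('|')
           AS (returnflag:chararray, linestatus:chararray,
               quantity:double, extendedprice:double, shipdate:chararray);

recent = FILTER lineitem BY shipdate <= '1998-09-02';

pricing = FOREACH (GROUP recent BY (returnflag, linestatus))
          GENERATE FLATTEN(group) AS (returnflag, linestatus),
                   SUM(recent.quantity)      AS sum_qty,
                   AVG(recent.extendedprice) AS avg_price,
                   COUNT(recent)             AS count_order;

DUMP pricing;
```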

Original title and link: Pig Performance and Optimization Analysis (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/pig-performance-and-optimization-analysis/


Groovy User Defined Functions for Pig

After supporting UDFs in Python and Ruby, and also embedding (running Pig scripts inside Python programs), it’s now time for Pig to accept Groovy UDFs. Nice.
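Judging from PIG-2763, registration is expected to mirror the existing jython form; the exact syntax, script name, and function below are my assumptions, so check the ticket for the final API:

```pig
-- Hedged sketch of registering a Groovy UDF, mirroring the jython syntax.
-- Script and function names are invented; see PIG-2763 for the real API.
REGISTER 'myudfs.groovy' USING groovy AS gudf;

lines = LOAD 'input.txt' AS (line:chararray);
upper = FOREACH lines GENERATE gudf.toUpper(line);
```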

Original title and link: Groovy User Defined Functions for Pig (NoSQL database©myNoSQL)

via: https://issues.apache.org/jira/browse/PIG-2763


Hortonworks Data Platform 1.0

Hortonworks has announced the 1.0 release of the Hortonworks Data Platform prior to the Hadoop Summit 2012 together with a lot of supporting quotes from companies like Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata, and VMware.

Some info points:

  1. Hortonworks Data Platform is a platform meant to simplify the installation, integration, management, and use of Apache Hadoop

    [Figure: Hortonworks Data Platform architecture diagram]

    1. HDP 1.0 is based on Apache Hadoop 1.0
    2. Apache Ambari is used for installation and provisioning
    3. The same Apache Ambari is behind the Hortonworks Management Console
    4. For data integration, HDP offers WebHDFS, HCatalog APIs, and Talend Open Studio
    5. Apache HCatalog is the solution offering metadata and table management
  2. Hortonworks Data Platform is 100% open source—I really appreciate Hortonworks’s dedication to the Apache Hadoop project and open source community

  3. HDP comes with 3 levels of support subscriptions, with pricing starting at $12,500/year for a 10-node cluster

One of the most interesting aspects of the Hortonworks Data Platform release is that the high-availability (HA) option for HDP is based on running the NameNode and JobTracker in VMware-powered virtual machines. My first thought was that this approach was chosen to strengthen a partnership with VMware. On the other hand, Hadoop 2.0 already contains a new highly-available version of the NameNode (Cloudera’s Hadoop distribution uses this solution), and VMware has bigger plans for a virtualization-friendly Hadoop environment with project Serengeti.

You can read a lot of posts about this announcement, but you’ll find all the details in John Kreisa’s post on the Hortonworks blog and in the PR announcement.

Original title and link: Hortonworks Data Platform 1.0 (NoSQL database©myNoSQL)


Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop

Apache Bigtop:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

Currently packaging:

  • Apache Hadoop 1.0.x
  • Apache ZooKeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. Practically though, its goal is to ensure that all the components of the wide Hadoop ecosystem remain interoperable.

Original title and link: Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop (NoSQL database©myNoSQL)


Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark

Sami Badawi enumerates the issues he encountered while trying all these tools (Pig1, Scalding2, Scoobi3, Hive4, Spark5, Scrunch6, Cascalog7) for a simple experiment with Hadoop:

The task was to read log files, join with other data, and do some statistics on arrays of doubles. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data.
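For a sense of scale, the shape of that task is only a few lines in Pig. The file names, delimiters, and schemas below are my own guesses, not from Badawi's post:

```pig
-- Sketch of the experiment's shape: load logs, join with reference
-- data, aggregate doubles. All names and schemas are illustrative.
logs  = LOAD 'access.log' USING PigStorage('\t')
        AS (user_id:chararray, latency:double);
users = LOAD 'users.tsv' AS (user_id:chararray, country:chararray);

joined = JOIN logs BY user_id, users BY user_id;

stats = FOREACH (GROUP joined BY users::country)
        GENERATE group AS country,
                 AVG(joined.logs::latency) AS avg_latency;
```

The equivalent hand-written MapReduce job in Java would be several classes and a driver, which is the comparison the quote is making.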


  1. Pig : a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. 

  2. Scalding: A Scala API for Cascading 

  3. Scoobi: a Scala productivity framework for Hadoop 

  4. Hive: a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. 

  5. Spark: open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write 

  6. Scrunch: a Scala wrapper for Crunch 

  7. Cascalog: a fully-featured Clojure-based data processing and querying library for Hadoop  

Original title and link: Impressions About Hive, Pig, Scalding, Scoobi, Scrunch, Spark (NoSQL database©myNoSQL)

via: http://blog.samibadawi.com/2012/03/hive-pig-scalding-scoobi-scrunch-and.html


Jython UDFs In Pig - The More Powerful The Language, The Shorter The Program

Jython UDFs were added to Pig in version 0.8, and are pretty stable in the current version, 0.9.2. They are highly convenient, and a major timesaver.

The subtitle—”the more powerful the language, the shorter the program”—says it all.
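A minimal sketch of what such a UDF looks like. The function name and schema are mine; in a real Pig run the `outputSchema` decorator is injected by Pig's Jython wrapper, so it is stubbed below only to let the file also run as plain Python:

```python
# Sketch of a Jython UDF for Pig 0.8+. When registered via
# `register 'udfs.py' using jython as myfuncs;`, Pig provides
# `outputSchema`; the stub exists only for standalone runs.
try:
    outputSchema  # provided by Pig's Jython support at runtime
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.output_schema = schema  # record the declared schema
            return func
        return wrap

@outputSchema("word:chararray")
def reverse_word(word):
    """Return the input chararray reversed (None stays None)."""
    return word[::-1] if word is not None else None
```

From Pig Latin it would then be invoked as something like `FOREACH lines GENERATE myfuncs.reverse_word(line);` — a one-line call where a Java UDF would need a class, a build, and a jar.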

Original title and link: Jython UDFs In Pig - The More Powerful The Language, The Shorter The Program (NoSQL database©myNoSQL)

via: http://datasyndrome.com/post/17584921570/jython-udfs-in-pig


Lessons in Data Visualization: How to create a visualization

Pete Warden:

Pick a question. Now that I had a rough idea for what I wanted to visualize, I really needed to focus on what I would be doing. The best way to do that is to choose the exact title you want to give your visualization.

Oftentimes, you might be tempted to start with an answer in the form of a hypothesis or preconception. The results you get might be valid but radically different.

As for the technologies used for data crunching, it’s Pig on Hadoop over a Cassandra cluster:

In my case, we have a Cassandra cluster with information on more than 350 million photos shared on Facebook. I’ve been running Pig analytics jobs regularly to get a view of what we have in there. […] In this case I already had some Pig scripts asking similar questions, so I was able to adapt one of those. The biggest surprise was when I ran into issues with some of the joins. The hard part was running the Hadoop job to gather the raw data from our Cassandra cluster, and that worked. I was able to output smaller files containing the gathered data, and then run a local Pig job to do the joins I needed.
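The split Warden describes — a cluster job to extract from Cassandra, then a local Pig job for the joins — looks roughly like this for the second stage. Paths and schemas are invented; the point is only that the same script runs locally via `pig -x local`:

```pig
-- Sketch of the local stage of the workflow above, run with
-- `pig -x local joins.pig` on the small extracted files.
-- All file and field names are illustrative.
photos = LOAD 'extracted/photos.tsv' AS (photo_id:chararray, place:chararray);
places = LOAD 'extracted/places.tsv' AS (place:chararray, lat:double, lon:double);

located = JOIN photos BY place, places BY place;
STORE located INTO 'extracted/photos_located';
```

Doing the join locally sidesteps the cluster-side join issues he mentions while keeping the heavy Cassandra extraction on Hadoop.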

Original title and link: Lessons in Data Visualization: How to create a visualization (NoSQL database©myNoSQL)

via: http://radar.oreilly.com/2012/02/how-to-create-visualization-facebook-vacation.html


The components and their functions in the Hadoop ecosystem

Edd Dumbill enumerates the various components of the Hadoop ecosystem:

[Figure: Hadoop ecosystem components]

My quick reference to the Hadoop ecosystem includes a couple of other tools that are not in this list; the exceptions are Ambari and HCatalog, which were released later.

Original title and link: The components and their functions in the Hadoop ecosystem (NoSQL database©myNoSQL)