ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

PIG: All content tagged as PIG in NoSQL databases and polyglot persistence

Scaling Big Data Mining Infrastructure at Twitter

I’m almost always enjoying the lessons learned-style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, have been used at HadoopSummit. Besides the technical and practical details, there are two things that I really like:

DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”

and then the reality check:

  1. Your boss says something vague
  2. You think very hard on how to move the needle
  3. Where’s the data?
  4. What’s in this dataset?
  5. What’s all the f#$#$ crap in the data?
  6. Clean the data
  7. Run some off-the-shelf data mining algorithm
  8. Productionize, act on the insight
  9. Rinse, repeat

Enjoy!


A Brief Guide to Pig Latin for the SQL Guy

Cat Miller from Mortar Data offers a quick intro to Pig Latin from a SQLish perspective:

Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.

Pig and SQL similarities are in the operations they both support. But the whole model is different. Pig is an imperative data manipulation tool, while SQL is a declarative query language.

Original title and link: A Brief Guide to Pig Latin for the SQL Guy (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/pig-eye-for-the-sql-guy/


Apache Pig Goes 0.11

Almost lost in the tons of Hadoopy releases, I have found the announcement of Apache Pig 0.11, which, as a serious open source project, packages nice new features for a point release:

  1. DateTime data type
  2. RANK, CUBE, ROLLUP operators
  3. Groovy UDFs

Plus tons of improvements.

Original title and link: Apache Pig Goes 0.11 (NoSQL database©myNoSQL)

via: https://blogs.apache.org/pig/entry/apache_pig_it_goes_to


Flatten Entire HBase Column Families With Pig and Python UDFs

Chase Seibert:

Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working to rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode to dimensions, such as date and counter type.

Original title and link: Flatten Entire HBase Column Families With Pig and Python UDFs (NoSQL database©myNoSQL)

via: http://chase-seibert.github.com/blog/2013/02/10/pig-hbase-flatten-column-family.html


Using Hadoop Pig With MongoDB

In this post, we’ll see how to install MongoDB support for Pig and we’ll illustrate it with an example where we join 2 MongoDB collections with Pig and store the result in a new collection.

Color me very biased this time, but all these (especially the JOIN) can be done directly using RethinkDB.

Original title and link: Using Hadoop Pig With MongoDB (NoSQL database©myNoSQL)

via: http://chimpler.wordpress.com/2013/02/07/using-hadoop-pig-with-mongodb/


Playing With Hadoop Pig

Anything missing from Pig?

[…] the following SQL operations can be translated as follows. We put the order in which the operations have to be run between parenthesis.

  • SELECT id, name: resultData = FOREACH limitData GENERATE id, name
  • FROM Table: data = LOAD ‘person.csv’ USING PigStorage(‘,’) AS (id:int, name:chararray, age:int)
  • WHERE a=1: filteredData = FILTER data BY a=1
  • ORDER BY age DESC: orderedData = ORDER filteredData BY age DESC
  • LIMIT 10: limitData = LIMIT orderedData 10

One can also use left join and join as follows:

  • JOIN: join_data: JOIN data1 BY id1, data2 BY id2
  • LEFT JOIN: left_join_data = JOIN data1 BY id1 LEFT OUTER, data2 BY id2

Original title and link: Playing With Hadoop Pig (NoSQL database©myNoSQL)

via: http://chimpler.wordpress.com/2013/02/04/playing-with-hadoop-pig/


DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn

Sam Shah in a guest post on Hortonworks blog:

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something. […] Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.”

a penetrating oil and water-displacing spray“? “littlepiggy”? Seriously?

How could one come up with these names for such a useful library of statistical functions, PageRank, set and bag operations?

Original title and link: DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/datafu/


11 Interesting Releases From the First Weeks of January

The list of releases I wanted to post about has been growing fast these last couple of weeks, so instead of waiting leaving it to Here it is (in no particular order1):

  1. (Jan.2nd) Cassandra 1.2 — announcement on DataStax’s blog. I’m currently learning and working on a post looking at what’s new in Cassandra 1.2.
  2. (Jan.10th) Apache Pig 0.10.1 — Hortonworks wrote about it
  3. (Jan.10th) DataStax Community Edition 1.2 and OpsCenter 2.1.3 — DataStax announcement
  4. (Jan.10th) CouchDB 1.0.4, 1.1.2, and 1.2.1 — releases fixing some security vulnerabilities
  5. (Jan.11th) MongoDB 2.3.2 unstable — announcement. This dev release includes support for full text indexing. For more details you can check:

    […] an open source project extending Hadoop and Hive with a collection of useful user-defined-functions. Its aim is to make the Hive Big Data developer more productive, and to enable scalable and robust dataflows.


  1. I’ve tried to order it chronologically, but most probably I’ve failed. 

Original title and link: 11 Interesting Releases From the First Weeks of January (NoSQL database©myNoSQL)


What Is the Spring Data Project?

Short answer: another sign that the Spring framework wants to do everything everywhere. A mammoth1.

Version 1.0 was released in 2004 as a lightweight alternative to Enterprise Java Beans (EJB). Since, then Spring has expanded into many other areas of enterprise development, such as enterprise integration (Spring Integration), batch processing (Spring Batch), web development (Spring MVC, Spring Webflow), security (Spring Security). Spring continues to push the envelope for mobile applications (Spring Mobile), social media (Spring Social), rich web applications (Spring MVC, s2js Javascript libraries), and NoSQL data access(Spring Data).

[…]

The complete pipeline can be implemented using Spring for Apache Hadoop along with Spring Integration and Spring Batch. However, Hadoop has its own set of challenges which the Spring for Apache Hadoop project is designed to address. Like all Spring projects, it leverages the Spring Framework to provide a consistent structure and simplify writing Hadoop applications. For example, Hadoop applications rely heavily on command shell tools. So applications end up being a hodge-podge of Perl, Python, Ruby, and bash scripts. Spring for Apache Hadoop, provides a dedicated XML namespace for configuring Hadoop jobs with embedded scripting features and support for Hive and Pig.


  1. There’s a business reason for doing this though: when you have tons of clients you want to make sure they don’t have a chance to step outside. Is this new year resolution a heresy : I plan to use vastly less Spring this year

Original title and link: What Is the Spring Data Project? (NoSQL database©myNoSQL)

via: http://www.odbms.org/blog/2013/01/the-spring-data-project-interview-with-david-turanski/


Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra

A three part article from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:

Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.

Original title and link: Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra (NoSQL database©myNoSQL)


Pig Performance and Optimization Analysis

Although Pig is designed as a data flow language, it supports all the functionalities required by TPC-H; thus it makes sense to use TPC-H to benchmark Pig’s performance. Below is the final result.

tpc-h 100gb

Original title and link: Pig Performance and Optimization Analysis (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/pig-performance-and-optimization-analysis/


Groovy User Defined Functions for Pig

After supporting UDFs in Python and Ruby and also embedding (embedding Pig scripts inside Python programs), now it’s time for Pig to accept Groovy UDFs. Nice.

Original title and link: Groovy User Defined Functions for Pig (NoSQL database©myNoSQL)

via: https://issues.apache.org/jira/browse/PIG-2763