pig: All content tagged as pig in NoSQL databases and polyglot persistence
Wednesday, 3 April 2013
Scaling Big Data Mining Infrastructure at Twitter
I almost always enjoy the lessons-learned style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, were presented at Hadoop Summit. Besides the technical and practical details, there are two things that I really like:
DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”
and then the reality check:
- Your boss says something vague
- You think very hard on how to move the needle
- Where’s the data?
- What’s in this dataset?
- What’s all the f#$#$ crap in the data?
- Clean the data
- Run some off-the-shelf data mining algorithm
- …
- Productionize, act on the insight
- Rinse, repeat
Enjoy!
Monday, 4 March 2013
A Brief Guide to Pig Latin for the SQL Guy
Cat Miller from Mortar Data offers a quick intro to Pig Latin from a SQLish perspective:
Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.
Pig and SQL similarities are in the operations they both support. But the whole model is different. Pig is an imperative data manipulation tool, while SQL is a declarative query language.
Original title and link: A Brief Guide to Pig Latin for the SQL Guy (©myNoSQL)
Tuesday, 26 February 2013
Apache Pig Goes 0.11
Almost lost among the tons of Hadoopy releases was the announcement of Apache Pig 0.11, which, as befits a serious open source project, packs nice new features into a point release:
- DateTime data type
- RANK, CUBE, ROLLUP operators
- Groovy UDFs
Plus tons of improvements.
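For anyone who has not met CUBE and ROLLUP before, here is a rough sketch in plain Python (not Pig Latin) of which grouping combinations each operator aggregates over, following the usual SQL-style semantics. The helper names `rollup_groupings`, `cube_groupings` and `aggregate`, as well as the sample records, are made up for illustration and are not part of Pig.

```python
from collections import defaultdict
from itertools import combinations


def rollup_groupings(dims):
    """Hierarchical prefixes of the dimensions: (a, b), (a,), ()."""
    return [tuple(dims[:i]) for i in range(len(dims), -1, -1)]


def cube_groupings(dims):
    """Every subset of the dimensions, largest subsets first."""
    return [c for r in range(len(dims), -1, -1) for c in combinations(dims, r)]


def aggregate(records, dims, groupings, measure):
    """Sum `measure` per grouping; dimensions that are rolled up become None."""
    totals = defaultdict(int)
    for rec in records:
        for grouping in groupings:
            key = tuple(rec[d] if d in grouping else None for d in dims)
            totals[key] += rec[measure]
    return dict(totals)


# Made-up sample data, purely illustrative.
sales = [
    {"region": "EU", "product": "pig", "amount": 10},
    {"region": "EU", "product": "hive", "amount": 5},
    {"region": "US", "product": "pig", "amount": 7},
]
dims = ["region", "product"]

rollup_totals = aggregate(sales, dims, rollup_groupings(dims), "amount")
cube_totals = aggregate(sales, dims, cube_groupings(dims), "amount")
```

The difference shows up in the per-product totals across all regions: the cube contains a `(None, "pig")` entry, while the rollup only has per-region subtotals and the grand total.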
Original title and link: Apache Pig Goes 0.11 (©myNoSQL)
via: https://blogs.apache.org/pig/entry/apache_pig_it_goes_to
Monday, 11 February 2013
Flatten Entire HBase Column Families With Pig and Python UDFs
Chase Seibert:
Most Pig tutorials you will find assume that you are working with data where you know all the column names ahead of time, and that the column names themselves are just labels, versus being composites of labels and data. For example, when working with HBase, it’s actually not uncommon for both of those assumptions to be false. Being a columnar database, it’s very common to be working with rows that have thousands of columns. Under that circumstance, it’s also common for the column names themselves to encode two dimensions, such as date and counter type.
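To make the idea concrete, here is a minimal sketch in plain Python of the flattening step, not the code from the linked post. It assumes a hypothetical layout where each column name looks like `2013-02-10:clicks` (date, separator, counter type); the function name `flatten_row`, the separator and the sample row are all assumptions.

```python
def flatten_row(row_key, columns, sep=":"):
    """Turn one wide row into (row_key, date, counter_type, value) tuples.

    `columns` maps composite column names such as "2013-02-10:clicks"
    (a hypothetical layout) to their stored values.
    """
    flat = []
    for name, value in sorted(columns.items()):
        # Split the composite column name into its two dimensions.
        date, counter_type = name.split(sep, 1)
        flat.append((row_key, date, counter_type, int(value)))
    return flat


# Made-up wide row: one column per (date, counter type) pair.
row = {
    "2013-02-10:clicks": "12",
    "2013-02-10:views": "340",
    "2013-02-11:clicks": "9",
}
records = flatten_row("user42", row)
```

Each tuple now carries the date and counter type as ordinary fields, so they can be grouped and filtered like any other column. Registering such a function as a Pig UDF is left out here.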
Original title and link: Flatten Entire HBase Column Families With Pig and Python UDFs (©myNoSQL)
via: http://chase-seibert.github.com/blog/2013/02/10/pig-hbase-flatten-column-family.html
Using Hadoop Pig With MongoDB
In this post, we’ll see how to install MongoDB support for Pig and we’ll illustrate it with an example where we join 2 MongoDB collections with Pig and store the result in a new collection.
Color me very biased this time, but all these (especially the JOIN) can be done directly using RethinkDB.
Original title and link: Using Hadoop Pig With MongoDB (©myNoSQL)
via: http://chimpler.wordpress.com/2013/02/07/using-hadoop-pig-with-mongodb/
Playing With Hadoop Pig
Anything missing from Pig?
[…] the following SQL operations can be translated as follows. We put the order in which the operations have to be run in parentheses.
- SELECT id, name (5): resultData = FOREACH limitData GENERATE id, name
- FROM Table (1): data = LOAD 'person.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int)
- WHERE a=1 (2): filteredData = FILTER data BY a == 1
- ORDER BY age DESC (3): orderedData = ORDER filteredData BY age DESC
- LIMIT 10 (4): limitData = LIMIT orderedData 10
One can also use left join and join as follows:
- JOIN: join_data = JOIN data1 BY id1, data2 BY id2
- LEFT JOIN: left_join_data = JOIN data1 BY id1 LEFT OUTER, data2 BY id2
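The step-by-step nature of that translation is easier to see when every intermediate relation is a named value. Below is a rough Python analogue of the same data flow, followed by a naive join helper. It is only a sketch: the inline CSV, the `age > 30` predicate, the `cities` relation and the `join` function are made up for illustration and say nothing about how Pig executes these operations.

```python
import csv
import io

# Made-up stand-in for person.csv (columns: id, name, age).
PERSON_CSV = "1,alice,34\n2,bob,27\n3,carol,41\n"

# (1) LOAD ... AS (id:int, name:chararray, age:int)
data = [
    {"id": int(i), "name": n, "age": int(a)}
    for i, n, a in csv.reader(io.StringIO(PERSON_CSV))
]
# (2) FILTER: keep rows matching a predicate (age > 30 is an arbitrary choice)
filtered_data = [row for row in data if row["age"] > 30]
# (3) ORDER ... BY age DESC
ordered_data = sorted(filtered_data, key=lambda row: row["age"], reverse=True)
# (4) LIMIT 10
limit_data = ordered_data[:10]
# (5) FOREACH ... GENERATE id, name
result_data = [(row["id"], row["name"]) for row in limit_data]


def join(left, right, left_key, right_key, left_outer=False):
    """Nested-loop join; with left_outer=True unmatched left rows pair with None."""
    joined = []
    for left_row in left:
        matches = [r for r in right if r[right_key] == left_row[left_key]]
        if matches:
            joined.extend((left_row, r) for r in matches)
        elif left_outer:
            joined.append((left_row, None))
    return joined


# Hypothetical second relation to join against.
cities = [{"person_id": 1, "city": "Paris"}, {"person_id": 3, "city": "Oslo"}]
inner_join = join(data, cities, "id", "person_id")
left_join = join(data, cities, "id", "person_id", left_outer=True)
```

Each statement consumes the relation produced by the previous one, which is why the SQL clauses end up being applied in a different order than they are written.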
Original title and link: Playing With Hadoop Pig (©myNoSQL)
via: http://chimpler.wordpress.com/2013/02/04/playing-with-hadoop-pig/
Wednesday, 30 January 2013
DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn
Sam Shah in a guest post on the Hortonworks blog:
If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something. […] Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.”
“a penetrating oil and water-displacing spray”? “littlepiggy”? Seriously?
How could one come up with these names for such a useful library of statistical functions, PageRank, set and bag operations?
Original title and link: DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn (©myNoSQL)
Monday, 21 January 2013
11 Interesting Releases From the First Weeks of January
The list of releases I wanted to post about has been growing fast these last couple of weeks, so instead of waiting any longer, here it is (in no particular order [1]):
- (Jan.2nd) Cassandra 1.2 — announcement on DataStax’s blog. I’m currently learning and working on a post looking at what’s new in Cassandra 1.2.
- (Jan.10th) Apache Pig 0.10.1 — Hortonworks wrote about it
- (Jan.10th) DataStax Community Edition 1.2 and OpsCenter 2.1.3 — DataStax announcement
- (Jan.10th) CouchDB 1.0.4, 1.1.2, and 1.2.1 — releases fixing some security vulnerabilities
- (Jan.11th) MongoDB 2.3.2 unstable: announcement. This dev release includes support for full text indexing. For more details you can check:
  - MongoDB Full Text Search Explained and MongoDB Text Search Tutorial
  - Full text search in MongoDB: details about supported languages and queries
  - Indexing a Markdown blog using MongoDB full text indexing
  - Short demo of MongoDB text search and hashed shard keys
- (Jan.12th) Apache HBase 0.94.4 — announcement and release notes
- (Jan.14th) Apache Hive 0.10.0: Hortonworks’s post about it
- (Jan.15th) Hortonworks Data Platform 1.2 featuring Apache Ambari: official PR announcement
- (Jan.16th) Redis 2.6.9 — release notes
- (Jan.16th) HyperDex 1.0RC1 — no docs
- (Jan.16th) Klout’s Brickhouse — announcement:
[…] an open source project extending Hadoop and Hive with a collection of useful user-defined-functions. Its aim is to make the Hive Big Data developer more productive, and to enable scalable and robust dataflows.
[1] I’ve tried to order it chronologically, but most probably I’ve failed.
Original title and link: 11 Interesting Releases From the First Weeks of January (©myNoSQL)
Thursday, 3 January 2013
What Is the Spring Data Project?
Short answer: another sign that the Spring framework wants to do everything everywhere. A mammoth [1].
Version 1.0 was released in 2004 as a lightweight alternative to Enterprise Java Beans (EJB). Since then, Spring has expanded into many other areas of enterprise development, such as enterprise integration (Spring Integration), batch processing (Spring Batch), web development (Spring MVC, Spring Webflow), security (Spring Security). Spring continues to push the envelope for mobile applications (Spring Mobile), social media (Spring Social), rich web applications (Spring MVC, s2js Javascript libraries), and NoSQL data access (Spring Data).
[…]
The complete pipeline can be implemented using Spring for Apache Hadoop along with Spring Integration and Spring Batch. However, Hadoop has its own set of challenges which the Spring for Apache Hadoop project is designed to address. Like all Spring projects, it leverages the Spring Framework to provide a consistent structure and simplify writing Hadoop applications. For example, Hadoop applications rely heavily on command shell tools. So applications end up being a hodge-podge of Perl, Python, Ruby, and bash scripts. Spring for Apache Hadoop provides a dedicated XML namespace for configuring Hadoop jobs with embedded scripting features and support for Hive and Pig.
[1] There’s a business reason for doing this though: when you have tons of clients you want to make sure they don’t have a chance to step outside. Is this New Year’s resolution a heresy: I plan to use vastly less Spring this year?
Original title and link: What Is the Spring Data Project? (©myNoSQL)
via: http://www.odbms.org/blog/2013/01/the-spring-data-project-interview-with-david-turanski/
Tuesday, 2 October 2012
Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra
A three-part article from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:
Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
- Part 1: Pig, MongoDB and Node.js
- Part 2: Pig, HBase, JRuby and Sinatra
- Part 3: TF-IDF Topics with Cassandra, Python Streaming and Flask
Original title and link: Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra (©myNoSQL)
Tuesday, 4 September 2012
Pig Performance and Optimization Analysis
Although Pig is designed as a data flow language, it supports all the functionalities required by TPC-H; thus it makes sense to use TPC-H to benchmark Pig’s performance. Below is the final result.
Original title and link: Pig Performance and Optimization Analysis (©myNoSQL)
via: http://hortonworks.com/blog/pig-performance-and-optimization-analysis/
Thursday, 12 July 2012
Groovy User Defined Functions for Pig
After supporting UDFs in Python and Ruby, as well as embedding Pig scripts inside Python programs, it’s now time for Pig to accept Groovy UDFs. Nice.
Original title and link: Groovy User Defined Functions for Pig (©myNoSQL)
