PIG: All content tagged as PIG in NoSQL databases and polyglot persistence
I almost always enjoy the lessons-learned-style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, were presented at Hadoop Summit. Besides the technical and practical details, there are two things that I really like:
DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”
and then the reality check:
- Your boss says something vague
- You think very hard on how to move the needle
- Where’s the data?
- What’s in this dataset?
- What’s all the f#$#$ crap in the data?
- Clean the data
- Run some off-the-shelf data mining algorithm
- Productionize, act on the insight
- Rinse, repeat
The list of releases I wanted to post about has been growing fast these last couple of weeks, so instead of waiting, here it is (in no particular order [1]):
- (Jan.2nd) Cassandra 1.2 — announcement on DataStax’s blog. I’m currently learning and working on a post looking at what’s new in Cassandra 1.2.
- (Jan.10th) Apache Pig 0.10.1 — Hortonworks wrote about it
- (Jan.10th) DataStax Community Edition 1.2 and OpsCenter 2.1.3 — DataStax announcement
- (Jan.10th) CouchDB 1.0.4, 1.1.2, and 1.2.1 — releases fixing some security vulnerabilities
- (Jan.11th) MongoDB 2.3.2 unstable — announcement. This dev release includes support for full text indexing. For more details you can check:
  - MongoDB Full Text Search Explained and MongoDB Text Search Tutorial
  - Full text search in MongoDB: details about supported languages and queries
  - Indexing a Markdown blog using MongoDB full text indexing
  - Short demo of MongoDB text search and hashed shard keys
- (Jan.12th) Apache HBase 0.94.4 — announcement and release notes
- (Jan.14th) Apache Hive 0.10.0: Hortonworks’s post about it
- (Jan.15th) Hortonworks Data Platform 1.2 featuring Apache Ambari — official PR announcement
- (Jan.16th) Redis 2.6.9 — release notes
- (Jan.16th) HyperDex 1.0RC1 — no docs
- (Jan.16th) Klout’s Brickhouse — announcement:
[…] an open source project extending Hadoop and Hive with a collection of useful user-defined-functions. Its aim is to make the Hive Big Data developer more productive, and to enable scalable and robust dataflows.
[1] I’ve tried to order it chronologically, but most probably I’ve failed.
Original title and link: 11 Interesting Releases From the First Weeks of January ( ©myNoSQL)
A three-part article from Hortonworks showing how Pig can be used with MongoDB, HBase, and Cassandra:
Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.
- Part 1: Pig, MongoDB and Node.js
- Part 2: Pig, HBase, JRuby and Sinatra
- Part 3: TF-IDF Topics with Cassandra, Python Streaming and Flask
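For readers who haven’t met TF-IDF, the scoring that Part 3 builds its topics on, here is a minimal plain-Python sketch of the idea. It is not the code from the Hortonworks series: the function names `tf_idf` and `top_topics`, the sample documents, and the particular weighting (term count over document length, times log of N over document frequency) are my own assumptions, and other TF-IDF variants exist.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs: list of token lists. Assumes tf = count / doc length and
    idf = log(N / document frequency), one common variant among several.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_topics(doc_scores, k=3):
    """Pick the k highest-scoring terms of one document (hypothetical helper)."""
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    docs = [
        "pig loads data from cassandra".split(),
        "pig stores data in hbase".split(),
        "flask serves topics from cassandra".split(),
    ]
    for tokens, doc_scores in zip(docs, tf_idf(docs)):
        print(" ".join(tokens), "->", top_topics(doc_scores, k=2))
```

A term that shows up in every document gets a score of zero, which is the whole point: only the words that set a document apart survive as topics. In a Pig setup the same arithmetic would presumably be split across grouping steps or a streaming script rather than run in one process.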
Original title and link: Pig the Big Data Duct Tape: Examples for MongoDB, HBase, and Cassandra ( ©myNoSQL)