pig: All content tagged as pig in NoSQL databases and polyglot persistence
Donald Miner, author of MapReduce Design Patterns and CTO at ClearEdge IT Solutions, discusses how he chooses between Pig and MapReduce, weighing developer time against processing time, maintainability and deployment, and the repurposing of engineers who are new to Java and Pig.
Video and slides after the break.
In 4 years of writing this blog I haven’t seen such a prolific month:
- Apache Hadoop 2.2.0 (more links here)
- Apache HBase 0.96 (here and here)
- Apache Hive 0.12 (more links here)
- Apache Ambari 1.4.1
- Apache Pig 0.12
- Apache Oozie 4.0.0
- Plus Presto.
Actually, I don’t think I’ve ever seen an ecosystem like the one created around Hadoop.
Original title and link: A prolific season for Hadoop and its ecosystem (©myNoSQL)
I almost always enjoy the lessons-learned-style presentations from Twitter’s people. The slides below, by Jimmy Lin and Dmitriy Ryaboy, were presented at Hadoop Summit. Besides the technical and practical details, there are two things that I really like:
DJ Patil: “It’s impossible to overstress this: 80% of the work in any data project is in cleaning the data”
and then the reality check:
- Your boss says something vague
- You think very hard on how to move the needle
- Where’s the data?
- What’s in this dataset?
- What’s all the f#$#$ crap in the data?
- Clean the data
- Run some off-the-shelf data mining algorithm
- Productionize, act on the insight
- Rinse, repeat