DataFu: Open Source Apache Pig UDFs by LinkedIn
Here’s a taste of what you can do with DataFu:
- Run PageRank on a large number of independent graphs.
- Perform set operations such as intersect and union.
- Compute the haversine distance between two points on the globe.
- Create an assertion on input data which will cause the script to fail if the condition is not met.
- Perform various operations on bags, such as appending a tuple, prepending a tuple, concatenating bags, generating unordered pairs, etc.
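To make one of these concrete, here is a minimal sketch of the haversine calculation mentioned in the list. It is written in plain Python, not Pig, and it is not DataFu's implementation: the function name `haversine_km`, the choice of kilometers, and the mean Earth radius of 6371 km are all assumptions made for illustration.

```python
import math

# Assumption: mean Earth radius in kilometers; DataFu's UDF may use a different value or unit.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees (hypothetical helper)."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine formula: a is the squared half-chord length between the two points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)

    # Clamp to 1.0 so floating-point rounding cannot push asin out of its domain.
    a = min(1.0, a)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

In a Pig script the same computation would be applied per record through a UDF rather than called directly like this.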
I’m starting to notice a pattern here. Twitter is open sourcing pretty much everything they are doing related to data storage. Yahoo (and now its spin-off Hortonworks) and Cloudera are the forces behind open source Hadoop and the tools for bringing data into Hadoop. And LinkedIn is starting to open source the tools they use on top of Hadoop to analyze big data.
What is interesting about this is that you might not get the most polished tools, but they are definitely battle-tested.