DataFu: Open Source Apache Pig UDFs by LinkedIn

Here’s a taste of what you can do with DataFu:

  • Run PageRank on a large number of independent graphs.
  • Perform set operations such as intersect and union.
  • Compute the haversine distance between two points on the globe.
  • Create an assertion on input data which will cause the script to fail if the condition is not met.
  • Perform various operations on bags such as append a tuple, prepend a tuple, concatenate bags, generate unordered pairs, etc.
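To make the haversine item concrete, here is a minimal Python sketch of the great-circle distance that such a UDF computes. It is not DataFu's implementation: the function name `haversine_km`, the kilometer unit, and the mean Earth radius of 6371 km are all assumptions made for illustration.

```python
import math

# Assumed constant: mean Earth radius in kilometers.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)

    # Haversine formula: a is the squared half-chord length between the points.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)

    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))
```

In a Pig script the same computation would be applied per record by a UDF; the sketch only shows the arithmetic involved.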

I’m starting to notice a pattern here. Twitter is open sourcing pretty much everything it does related to data storage. Yahoo (whose Hadoop team spun out as Hortonworks) and Cloudera are the forces behind open source Hadoop and the tools that bring data into Hadoop. And LinkedIn is starting to open source the tools it uses on top of Hadoop to analyze big data.

What is interesting is that these may not be the most polished tools, but they are definitely battle-tested.

Original title and link: DataFu: Open Source Apache Pig UDFs by LinkedIn (NoSQL database©myNoSQL)

via: http://engineering.linkedin.com/open-source/introducing-datafu-open-source-collection-useful-apache-pig-udfs