ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

DataFu: All content tagged as DataFu in NoSQL databases and polyglot persistence

LinkedIn's Hourglass: Incremental data processing in Hadoop

Matthew Hayes introduces a very interesting new framework from LinkedIn

Hourglass is designed to make computations over sliding windows more efficient. For these types of computations, the input data is partitioned in some way, usually according to time, and the range of input data to process is adjusted as new data arrives. Hourglass works with input data that is partitioned by day, as this is a common scheme for partitioning temporal data.

Hourglass is available on GitHub.

We have found that two types of sliding window computations are extremely common in practice:

  • Fixed-length: the length of the window is set to some constant number of days and the entire window moves forward as new data becomes available. Example: a daily report summarizing the the number of visitors to a site from the past 30 days.
  • Fixed-start: the beginning of the window stays constant, but the end slides forward as new input data becomes available. Example: a daily report summarizing all visitors to a site since the site launched.

Original title and link: LinkedIn’s Hourglass: Incremental data processing in Hadoop (NoSQL database©myNoSQL)

via: http://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop


DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn

Sam Shah in a guest post on Hortonworks blog:

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something. […] Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.”

a penetrating oil and water-displacing spray“? “littlepiggy”? Seriously?

How could one come up with these names for such a useful library of statistical functions, PageRank, set and bag operations?

Original title and link: DataFu: A Collection of Pig UDFs for Data Analysis on Hadoop by LinkedIn (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/datafu/