Scaling the Facebook data warehouse to 300 PB

Fascinating read, raising interesting observations on different levels:

  1. At Facebook, data warehouse means Hadoop and Hive.

    Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB.

  2. I don’t see how in-memory solutions, like SAP HANA, will see their market expanding.

    In “Enterprise Data Warehouses and the first Hadoop squeeze”, Rob Klopp predicted that the EDW market would be squeezed between in-memory DBMSs and Hadoop. I still think that in-memory will become just a custom engine in the Hadoop toolkit and in existing EDW products.

    On the often-repeated argument that “not everybody is Facebook”, I think the part that gets swept under the rug is that today’s data size is the smallest you’ll ever have.

    In the last year, the warehouse has seen a 3x growth in the amount of data stored. Given this growth trajectory, storage efficiency is and will continue to be a focus for our warehouse infrastructure.

  3. At Facebook’s scale, balancing availability and costs is again a challenge. But there’s no mention of network attached storage.

    There are many areas we are innovating in to improve storage efficiency for the warehouse – building cold storage data centers, adopting techniques like RAID in HDFS to reduce replication ratios (while maintaining high availability), and using compression for data reduction before it’s written to HDFS.

  4. For the nuts and bolts of effectively optimizing compression, read the rest of the post, which covers the optimizations Facebook brought to the ORCFile format.

    There seem to be two competing formats at play: ORCFile (with support from Hortonworks and Facebook) and Parquet (with support from Twitter and Cloudera). Unfortunately I don’t have any good comparison of the two. And I couldn’t find one (why?).
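
To put the replication-ratio remark from point 3 into numbers, here is a rough back-of-the-envelope sketch. It is not taken from the Facebook post. It assumes that the 300 PB figure is the logical (pre-redundancy) size, that the baseline is plain 3x HDFS replication, and that the RAID layer uses a Reed-Solomon (10, 4) code; all three are assumptions made only for illustration, and the names `Scheme` and `raw_petabytes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """A redundancy scheme described per stripe of data blocks."""
    name: str
    data_blocks: int
    redundancy_blocks: int  # extra replicas or parity blocks per stripe

    @property
    def overhead(self) -> float:
        # Raw bytes stored on disk per logical byte.
        return (self.data_blocks + self.redundancy_blocks) / self.data_blocks


def raw_petabytes(logical_pb: float, scheme: Scheme) -> float:
    """Raw disk footprint for a given logical size under a scheme."""
    return logical_pb * scheme.overhead


# Figure quoted in the post; treating it as logical size is an assumption.
LOGICAL_PB = 300.0

# Assumed schemes, for illustration only.
REPLICATION_3X = Scheme("3x replication", data_blocks=1, redundancy_blocks=2)
RS_10_4 = Scheme("Reed-Solomon (10, 4)", data_blocks=10, redundancy_blocks=4)

if __name__ == "__main__":
    for scheme in (REPLICATION_3X, RS_10_4):
        raw = raw_petabytes(LOGICAL_PB, scheme)
        print(f"{scheme.name}: {scheme.overhead:.1f}x overhead, {raw:.0f} PB raw")
```

Under these assumptions, erasure coding would cut the raw footprint to less than half of what triple replication needs, which is why reducing the replication ratio matters so much at this scale. Compression applied before the write to HDFS shrinks the logical size itself, so the two savings multiply.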

Original title and link: Scaling the Facebook data warehouse to 300 PB (NoSQL database © myNoSQL)

via: https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/