ALL COVERED TOPICS

NoSQLBenchmarksNoSQL use casesNoSQL VideosNoSQL Hybrid SolutionsNoSQL PresentationsBig DataHadoopMapReducePigHiveFlume OozieSqoopHDFSZooKeeperCascadingCascalog BigTableCassandraHBaseHypertableCouchbaseCouchDBMongoDBOrientDBRavenDBJackrabbitTerrastoreAmazon DynamoDBRedisRiakProject VoldemortTokyo CabinetKyoto CabinetmemcachedAmazon SimpleDBDatomicMemcacheDBM/DBGT.MAmazon DynamoDynomiteMnesiaYahoo! PNUTS/SherpaNeo4jInfoGridSones GraphDBInfiniteGraphAllegroGraphMarkLogicClustrixCouchDB Case StudiesMongoDB Case StudiesNoSQL at AdobeNoSQL at FacebookNoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

Facebook: All content tagged as Facebook in NoSQL databases and polyglot persistence

Scaling the Facebook data warehouse to 300 PB

Fascinating read, raising interesting observations on different levels:

  1. At Facebook, data warehouse means Hadoop and Hive.

    Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily rate of about 600 TB.

  2. I don’t see how in-memory solutions, like Hana, will see their market expanding.

    In the Enterprise Data Warehouses and the first Hadoop squeeze, Rob Klopp predicted a squeeze of the EDW market under the pressure of in-memory DBMS and Hadoop. I still think that in-memory will become just a custom engine in the Hadoop toolkit and existing EDW products.

    On the always mentioned argument that “not everybody is Facebook”, I think that the part that is hidden under the rug is that today’s size of data is the smallest you’ll ever have.

    In the last year, the warehouse has seen a 3x growth in the amount of data stored. Given this growth trajectory, storage efficiency is and will continue to be a focus for our warehouse infrastructure.

  3. At Facebook’s scale, balancing availability and costs is again a challenge. But there’s no mention of network attached storage.

    There are many areas we are innovating in to improve storage efficiency for the warehouse – building cold storage data centers, adopting techniques like RAID in HDFS to reduce replication ratios (while maintaining high availability), and using compression for data reduction before it’s written to HDFS.

  4. For the nits and bolts of effectively optimizing compression, read the rest of the post which covers the optimization Facebook brought to the ORCFile format.

    There seem to be two competing formats at play: ORCFile (with support from Hortonworks and Facebook) and Parquet (with support from Twitter and Cloudera). Unfortunately I don’t have any good comparison of the two. And I couldn’t find one (why?).

Original title and link: Scaling the Facebook data warehouse to 300 PB (NoSQL database©myNoSQL)

via: https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/