Fascinating read, raising interesting observations on different levels:
At Facebook, data warehouse means Hadoop and Hive.
Our warehouse stores upwards of 300 PB of Hive data, with an incoming daily
rate of about 600 TB.
I don’t see how in-memory solutions, like SAP HANA, will see their market expand.
In the Enterprise Data Warehouses and the first Hadoop squeeze, Rob Klopp predicted a squeeze of the EDW market under pressure from in-memory DBMSs and Hadoop. I still think that in-memory will become just a custom engine in the Hadoop toolkit and in existing EDW products.
As for the oft-repeated argument that “not everybody is Facebook”, I think the part that gets swept under the rug is that today’s data size is the smallest you’ll ever have.
In the last year, the warehouse has seen a 3x growth
in the amount of data stored. Given this growth trajectory, storage
efficiency is and will continue to be a focus for our warehouse
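A quick back-of-the-envelope check of the quoted figures, as a sketch only: it assumes decimal units (1 PB = 1000 TB) and a constant daily rate, neither of which the post states. Incoming data is also not the same thing as net stored data (compression and deletion both get in the way), so the two results are only roughly comparable.

```python
# Figures quoted from the Facebook post.
stored_pb = 300          # Hive data currently stored, in PB
daily_ingest_tb = 600    # incoming data per day, in TB
growth_factor = 3        # growth in stored data over the last year

# Assumption: 1 PB = 1000 TB and the daily rate holds all year.
yearly_ingest_pb = daily_ingest_tb * 365 / 1000

# Size a year ago implied by 3x growth, and the net amount added since.
size_last_year_pb = stored_pb / growth_factor
net_added_pb = stored_pb - size_last_year_pb

print(f"ingest per year:         {yearly_ingest_pb:.0f} PB")
print(f"implied size a year ago: {size_last_year_pb:.0f} PB")
print(f"net added in the year:   {net_added_pb:.0f} PB")
```

Under those assumptions the yearly ingest (about 219 PB) and the net growth (about 200 PB) land in the same ballpark, so the two quoted numbers are at least consistent with each other.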
At Facebook’s scale, balancing availability and costs is again a challenge. But there’s no mention of network-attached storage.
There are many areas we are innovating in to improve
storage efficiency for the warehouse – building cold
storage data centers, adopting techniques like RAID in
HDFS to reduce replication ratios (while maintaining high
availability), and using compression for data reduction
before it’s written to HDFS.
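To make “reducing replication ratios” concrete, here is a minimal sketch comparing the raw storage cost of plain 3x replication (the HDFS default) with an erasure-coded layout. The (10 data, 4 parity) stripe is a hypothetical example chosen for illustration; the quoted text does not say which layout Facebook uses.

```python
def storage_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte for an erasure-coded stripe."""
    return (data_blocks + parity_blocks) / data_blocks

# HDFS default: every block is stored three times.
replication_3x = 3.0

# Hypothetical Reed-Solomon style stripe: 10 data blocks + 4 parity blocks.
rs_10_4 = storage_overhead(10, 4)

# Fraction of raw storage saved by switching from 3x replication.
savings = 1 - rs_10_4 / replication_3x

print(f"3x replication: {replication_3x:.1f}x raw storage")
print(f"(10, 4) stripe: {rs_10_4:.1f}x raw storage")
print(f"saved:          {savings:.0%}")
```

With these assumed parameters the stripe survives the loss of any 4 of its 14 blocks while storing 1.4 bytes per logical byte instead of 3, which is why this kind of encoding is attractive for colder data.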
For the nuts and bolts of effectively optimizing compression, read the rest of the post, which covers the optimization Facebook brought to the
There seem to be two competing formats at play: ORCFile (with support from Hortonworks and Facebook) and Parquet (with support from Twitter and Cloudera). Unfortunately, I don’t have a good comparison of the two, and I couldn’t find one (why?).
Original title and link: Scaling the Facebook data warehouse to 300 PB