A recent GigaOM article provides some interesting data points about how much data Facebook is handling:
- 2.5 billion content items shared per day
- 2.7 billion likes per day
- 300 million photos uploaded per day
- 500+ terabytes of data ingested per day
The numbers above do not include how many data points Facebook is collecting for analytics purposes. But I don’t think I’d be far off assuming that figure is a sizable multiple of the numbers above. We’ll go with 10 to keep things simple.
A couple of days ago, James Hamilton posted an analysis of Facebook’s Carbon and Energy Impact:
Using the Facebook PUE number of 1.07, we know they are delivering 54.27MW to the IT load (servers and storage). We don’t know the average server draw at Facebook but they have excellent server designs (see Open Compute Server Design) so they likely average at or below 300W per server. Since 300W is an estimate, let’s also look at 250W and 350W per server:
- 250W/server: 217,080 servers
- 300W/server: 180,900 servers
- 350W/server: 155,057 servers
It’s difficult to determine how many of the ~180,000 servers are databases, but assuming a 1:10 ratio of database servers to front-end + cache servers gives an approximate number of 18,000 database servers ingesting 500+ terabytes of data through a guesstimated 50+ billion calls per day.
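The back-of-the-envelope arithmetic above can be reproduced in a few lines. The 54.27MW IT load and the 1:10 database-to-front-end ratio come from the article; everything else is plain division:

```python
# Estimate server counts from the IT load and an assumed per-server draw.
IT_LOAD_WATTS = 54.27e6  # power delivered to servers and storage (from the article)

for watts_per_server in (250, 300, 350):
    servers = IT_LOAD_WATTS / watts_per_server
    print(f"{watts_per_server}W/server: {servers:,.0f} servers")

# Assuming 1 database server per 10 front-end + cache servers (a guess
# from the article, not a published Facebook figure):
servers_at_300w = IT_LOAD_WATTS / 300
db_servers = servers_at_300w / 10
print(f"~{db_servers:,.0f} database servers")
```

Running this reproduces the 217,080 / 180,900 / 155,057 figures quoted above.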
There’s also something that confuses me about these numbers. If Facebook is getting 300 million photo uploads per day and ingests 500+ terabytes, that could mean either 1) the average photo size is very low, or 2) Facebook doesn’t count photos when mentioning the ingested data size.
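A quick sanity check makes the puzzle concrete: if all 300 million daily photos were fully counted inside the 500 terabytes of daily ingest, the average photo could be at most about 1.7MB, and that bound ignores the other 2.5 billion content items sharing the same budget:

```python
# Upper bound on average photo size, assuming photos are counted in
# the 500 TB/day ingest figure (decimal terabytes, for simplicity).
ingest_bytes = 500e12
photos_per_day = 300e6

max_avg_photo_mb = ingest_bytes / photos_per_day / 1e6
print(f"upper bound on average photo size: {max_avg_photo_mb:.2f} MB")
# The real average would have to be smaller still, since the 500 TB
# also covers 2.5 billion non-photo content items and 2.7 billion likes.
```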
Original title and link: Fun With Numbers: How Much Data Is Facebook Ingesting ( ©myNoSQL)
The middle part of this Wired article talks a bit about the way Facebook is storing its Open Graph data:
We have an object store, which stores things like users and events and groups and photos, and then we have an edge store that stores the relationship between objects. With Open Graph, we built a layer on top of those systems that allowed developers to define what their objects look like and what their edges look like and then publish those third party objects and edges into the same infrastructure that we used to store all of the first party objects and edges.
Couple of thoughts:
- this data is a good example of a multigraph
- I don’t think Facebook is actually using a graph database to store this data. Considering the size of the data Facebook is handling, that would be understandable
- There’s no mention of how the metadata (the description of the objects and edges) is stored. I assume it must somehow be connected to historical data, so the data can evolve while its original meaning is preserved over time.
- The processing happening on this multigraph data sounds like cluster analysis
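The object-store/edge-store split described in the quote can be sketched as a typed multigraph. This is a toy illustration only; the class and method names are my own, not Facebook's actual API (their production system is far more involved):

```python
from collections import defaultdict

class GraphStore:
    """Toy split between an object store and an edge store."""

    def __init__(self):
        self.objects = {}              # object id -> (type, attributes)
        self.edges = defaultdict(list) # source id -> [(edge type, target id)]

    def put_object(self, obj_id, obj_type, **attrs):
        self.objects[obj_id] = (obj_type, attrs)

    def add_edge(self, src, edge_type, dst):
        # The same pair of objects can be connected by several edges of
        # different types, which is exactly what makes this a multigraph.
        self.edges[src].append((edge_type, dst))

    def neighbors(self, src, edge_type):
        return [dst for et, dst in self.edges[src] if et == edge_type]

store = GraphStore()
store.put_object("u1", "user", name="Alice")
store.put_object("p1", "photo", caption="Sunset")
store.add_edge("u1", "uploaded", "p1")
store.add_edge("u1", "likes", "p1")    # second edge between the same pair
print(store.neighbors("u1", "likes"))  # ['p1']
```

Third-party (Open Graph) objects and edges would simply be more entries in the same two stores, with their types defined by developers rather than by Facebook.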
Original title and link: Inside Facebook’s Open Graph ( ©myNoSQL)
Shawn Rogers has a short but compelling list of Big Data deployments in his article Big Data is Scaling BI and Analytics. The list also shows that, even though there are some common components like Hadoop, there are no blueprints yet for dealing with Big Data.
Facebook: Hadoop analytic data warehouse, using HDFS to store more than 30 petabytes of data. Their Big Data stack is based only on open source solutions.
Quantcast: a 3,000-core, 3,500-terabyte Hadoop deployment that processes more than a petabyte of raw data each day
University of Nebraska-Lincoln: a Hadoop cluster holding 1.6 petabytes of physics data
Yahoo!: 100,000 CPUs in 40,000 computers, all running Hadoop. Also running a 12 terabyte MOLAP cube based on Tableau Software
eBay: has 3 separate analytics environments:
- 6PB data warehouse for structured data and SQL access
- 40PB deep analytics (Teradata)
- 20PB Hadoop system to support advanced analytic workload on unstructured data
Original title and link: Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others ( ©myNoSQL)
Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application’s requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies. We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.
Dhruba Borthakur started a series of posts (part 1 and part 2) describing both the process that led Facebook to using HBase and Hadoop and the projects where these are used, along with their requirements:
After considerable research and experimentation, we chose Hadoop and HBase as the foundational storage technology for these next generation applications. The decision was based on the state of HBase at the point of evaluation as well as our confidence in addressing the features that were lacking at that point via in-house engineering. HBase already provided a highly consistent, high write-throughput key-value store. The HDFS NameNode stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode (AvatarNode) in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase’s version of LSM Trees, making local DataNode reads efficient and caching NameNode metadata). Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects, both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.
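The Bloom-filter point in the quote is worth unpacking: in an LSM-tree store, a read may have to consult several on-disk files, and a per-file Bloom filter lets the store skip any file that definitely does not contain the key. Here is a minimal sketch of the idea; it uses Python's built-in `hash()` with salts rather than anything resembling HBase's actual implementation:

```python
class BloomFilter:
    """Toy Bloom filter: answers 'definitely absent' or 'maybe present'."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash.
        return [hash((i, key)) % self.size for i in range(self.hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False -> key is definitely absent (the disk read can be skipped);
        # True  -> key *may* be present (a false positive is possible).
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("row:42")
print(bf.might_contain("row:42"))   # True (Bloom filters never give false negatives)
print(bf.might_contain("row:999"))  # usually False
```

Because a "no" answer is always correct, the filter can only ever save disk reads, never cause a missed key; the cost is a small false-positive rate that triggers an occasional unnecessary read.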
The second part describes three problems Facebook is solving with HBase and Hadoop and provides further details on the requirements of each.
The two posts are a great resource for understanding not only where HBase and Hadoop can be used, but also how to formulate the requirements (and non-requirements) for new systems.
A Facebook team will present the paper “Apache Hadoop Goes Realtime at Facebook” at ACM SIGMOD. I’m looking forward to the paper becoming available.