Nice screenshot by the TechCrunch folks of a slide describing the data lifecycle at Facebook:
Based on it, you’ll have a better picture of how Facebook’s data ingestion numbers correlate with its architecture.
Original title and link: Life of Data at Facebook ( ©myNoSQL)
A recent GigaOM article provides some interesting data points about how much data Facebook is handling:
- 2.5 billion content items shared per day
- 2.7 billion likes per day
- 300 million photos uploaded per day
- 500+ terabytes of data ingested per day
The numbers above don’t include how many data points Facebook collects for analytics purposes. But I don’t think I’d be far off in assuming that figure is a sizable multiple of the numbers above; we’ll go with 10× to keep things simple.
A couple of days ago, James Hamilton posted an analysis of Facebook’s Carbon and Energy Impact:
Using the Facebook PUE number of 1.07, we know they are delivering 54.27MW to the IT load (servers and storage). We don’t know the average server draw at Facebook, but they have excellent server designs (see Open Compute Server Design), so they likely average at or below 300W per server. Since 300W is an estimate, let’s also look at 250W and 350W per server:
- 250W/server: 217,080 servers
- 300W/server: 180,900 servers
- 350W/server: 155,057 servers
It’s difficult to determine how many of those ~180k servers are database servers, but assuming a 1:10 ratio of database servers to front-end and cache servers, that gives roughly 18k database servers ingesting 500+ terabytes of data per day through a guesstimated 50+ billion calls. A quick back-of-the-envelope calculation is sketched below.
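For a quick sanity check, here’s a minimal Python sketch reproducing these numbers. The IT load comes from James Hamilton’s analysis and the 500TB/day from GigaOM; the 10% database-server fraction is my own guess from above, not anything Facebook has published:

```python
# Back-of-the-envelope math for the server-count and ingestion estimates above.
IT_LOAD_WATTS = 54.27e6     # 54.27MW delivered to servers and storage (Hamilton)
DAILY_INGEST_TB = 500       # GigaOM's 500+ terabytes/day figure
DB_FRACTION = 0.10          # guessed share of the fleet that is database servers

for watts_per_server in (250, 300, 350):
    servers = IT_LOAD_WATTS / watts_per_server
    db_servers = servers * DB_FRACTION
    gb_per_db_server = DAILY_INGEST_TB * 1024 / db_servers
    print(f"{watts_per_server}W/server: {servers:,.0f} servers total, "
          f"~{db_servers:,.0f} database servers, "
          f"~{gb_per_db_server:.0f}GB ingested per database server per day")
```

At the 300W estimate this works out to roughly 28GB of ingested data per database server per day, which makes the 500+ terabyte total feel less outlandish.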
There’s also something that confuses me about these numbers. If Facebook gets 300 million photo uploads per day and ingests 500+ terabytes, then either 1) the average photo size is very low (500 terabytes spread across 300 million photos caps the average at roughly 1.7MB per photo, and that’s before counting any non-photo data); or 2) Facebook doesn’t count photos in the ingested data figure.
Original title and link: Fun With Numbers: How Much Data Is Facebook Ingesting ( ©myNoSQL)
The middle part of this Wired article talks a bit about how Facebook stores its Open Graph data:
We have an object store, which stores things like users and events and groups and photos, and then we have an edge store that stores the relationship between objects. With Open Graph, we built a layer on top of those systems that allowed developers to define what their objects look like and what their edges look like and then publish those third party objects and edges into the same infrastructure that we used to store all of the first party objects and edges.
A couple of thoughts:
- this data is a good example of a multigraph (a toy sketch follows this list)
- I don’t think Facebook is actually using a graph database to store it; considering the size of the data Facebook handles, that would be understandable
- there’s no mention of how the metadata (the descriptions of the objects and edges) is stored; I assume it must somehow be connected to historical data so the data can evolve while keeping its original meaning over time
- the processing happening on this multigraph data sounds like cluster analysis
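To make the object-store/edge-store split concrete, here’s a toy in-memory Python sketch. All names and the schema are illustrative assumptions on my part; Facebook’s actual infrastructure is of course nothing this simple:

```python
# Toy sketch of an object store plus an edge store, per the quote above.
# Being a multigraph, it allows several edges (of the same or different
# types) between the same pair of objects.
from collections import defaultdict

class GraphStore:
    def __init__(self):
        self.objects = {}               # object store: id -> typed attributes
        self.edges = defaultdict(list)  # edge store: src id -> [(edge type, dst id)]

    def put_object(self, obj_id, obj_type, **attrs):
        self.objects[obj_id] = {"type": obj_type, **attrs}

    def add_edge(self, src, edge_type, dst):
        self.edges[src].append((edge_type, dst))

    def neighbors(self, src, edge_type=None):
        return [dst for etype, dst in self.edges[src]
                if edge_type is None or etype == edge_type]

store = GraphStore()
store.put_object("u1", "user", name="Alice")
store.put_object("p1", "photo", caption="Sunset")
store.add_edge("u1", "uploaded", "p1")
store.add_edge("u1", "likes", "p1")    # a second edge between the same pair
print(store.neighbors("u1", "likes"))  # ['p1']
```

The interesting part of Open Graph is the layer on top: third-party developers define their own object and edge types, which then flow into the same first-party infrastructure.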
Original title and link: Inside Facebook’s Open Graph ( ©myNoSQL)
Shawn Rogers has a short but compelling list of Big Data deployments in his article Big Data is Scaling BI and Analytics. The list also shows that even though there are some common components, like Hadoop, there are no blueprints yet for dealing with Big Data.
- Facebook: a Hadoop analytic data warehouse using HDFS to store more than 30 petabytes of data; their Big Data stack is built entirely on open source solutions
- Quantcast: a 3,000-core, 3,500-terabyte Hadoop deployment that processes more than a petabyte of raw data each day
- University of Nebraska-Lincoln: a 1.6-petabyte Hadoop cluster for physics data
- Yahoo!: 100,000 CPUs in 40,000 computers, all running Hadoop; also running a 12-terabyte MOLAP cube based on Tableau Software
- eBay: three separate analytics environments:
  - a 6PB data warehouse for structured data and SQL access
  - a 40PB deep analytics platform (Teradata)
  - a 20PB Hadoop system supporting advanced analytics workloads on unstructured data
Original title and link: Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others ( ©myNoSQL)
Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application’s requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies. We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.
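As a rough illustration of what a messages-style workload looks like against HBase, here’s a sketch using the happybase Python client. The table layout, column names, and row-key scheme are invented for illustration and are not Facebook’s actual Messages schema:

```python
# Illustrative HBase access via happybase; the "messages" table and its
# row-key scheme are assumptions, not Facebook's real schema.
import happybase

# Assumes an HBase Thrift server is reachable on localhost.
connection = happybase.Connection("localhost")
table = connection.table("messages")

# Row key: user id plus a reverse timestamp, so a user's newest
# messages sort first in a prefix scan.
table.put(b"user123-9999999999", {
    b"msg:from": b"user456",
    b"msg:body": b"Hello!",
})

# Fetch the ten most recent messages for user123.
for key, data in table.scan(row_prefix=b"user123-", limit=10):
    print(key, data)
```

The paper’s point is less about the client API and more about the server side: HBase’s ordered, column-oriented storage and single-writer consistency per row made it a better fit for this workload than the sharded MySQL scheme it replaced.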