


facebook: All content tagged as facebook in NoSQL databases and polyglot persistence

Life of Data at Facebook

Nice screenshot by the TechCrunch folks of the slide describing the data lifecycle at Facebook:


Credit: TechCrunch

With these numbers in mind, you’ll have a better picture of how Facebook’s data ingestion figures correlate with its architecture.

Original title and link: Life of Data at Facebook (NoSQL database©myNoSQL)

Fun With Numbers: How Much Data Is Facebook Ingesting

A recent GigaOM article provides some interesting data points about how much data Facebook is handling:

  • 2.5 billion content items shared per day
  • 2.7 billion likes per day
  • 300 million photos uploaded per day
  • 500+ terabytes of data ingested per day

The numbers above do not include any details about how many data points Facebook collects for analytics purposes. But I don’t think I’d be far off in assuming that number is a multiple of the figures above. We’ll go with 10x to keep things simple.

A couple of days ago, James Hamilton posted an analysis of Facebook’s Carbon and Energy Impact:

Using the Facebook PUE number of 1.07, we know they are delivering 54.27MW to the IT load (servers and storage). We don’t know the average server draw at Facebook, but they have excellent server designs (see Open Compute Server Design), so they likely average at or below 300W per server. Since 300W is an estimate, let’s also look at 250W and 350W per server:

  • 250W/server: 217,080 servers
  • 300W/server: 180,900 servers
  • 350W/server: 155,057 servers
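The estimates above are just the IT load divided by the per-server draw; a quick sketch (the 54.27MW figure comes from the quote, the per-server watts are the same guesses):

```python
# Back-of-the-envelope server counts: IT load / per-server power draw.
it_load_watts = 54.27e6  # 54.27 MW delivered to servers and storage

for watts_per_server in (250, 300, 350):
    servers = int(it_load_watts // watts_per_server)
    print(f"{watts_per_server}W/server: {servers:,} servers")
    # matches the 217,080 / 180,900 / 155,057 figures above
```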

It’s difficult to determine how many of the ~180k servers are databases, but assuming a 1:10 ratio of database servers to front-end and cache servers, that would give us roughly 18k database servers ingesting 500+ terabytes of data through a guesstimated 50+ billion calls.

There’s also something that confuses me about these numbers. If Facebook is getting 300 million photo uploads per day and ingesting 500+ terabytes, that means either 1) the average photo size is very low; or 2) Facebook doesn’t count photos when reporting the ingested data size.
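A quick sanity check supports this: even if the entire 500 terabytes were photos, the average upload would be under 2MB, so one of the two explanations has to hold:

```python
photos_per_day = 300e6
ingested_bytes_per_day = 500e12  # 500 terabytes (decimal)

# Upper bound on average photo size if photos made up ALL ingested data
avg_mb = ingested_bytes_per_day / photos_per_day / 1e6
print(f"{avg_mb:.2f} MB/photo at most")  # ~1.67 MB
```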

Original title and link: Fun With Numbers: How Much Data Is Facebook Ingesting (NoSQL database©myNoSQL)

Inside Facebook’s Open Graph

The middle part of this Wired article describes how Facebook stores its Open Graph data:

We have an object store, which stores things like users and events and groups and photos, and then we have an edge store that stores the relationship between objects. With Open Graph, we built a layer on top of those systems that allowed developers to define what their objects look like and what their edges look like and then publish those third party objects and edges into the same infrastructure that we used to store all of the first party objects and edges.

A couple of thoughts:

  1. This data is a good example of a multigraph.
  2. I don’t think Facebook is actually using a graph database to store it; considering the size of the data Facebook handles, that would be understandable.
  3. There’s no mention of how the metadata (the descriptions of the objects and edges) is stored. I assume it must somehow be connected to historical data to allow the data to evolve while maintaining its original meaning over time.
  4. The processing happening on this multigraph data sounds like cluster analysis.
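As a toy illustration of the object-store/edge-store split from the quote (all names and sample data here are made up, not Facebook’s actual API):

```python
# Toy object store + edge store. Allowing several edge types between the
# same pair of objects is exactly what makes this data a multigraph.
objects = {}  # id -> {"type": ..., plus arbitrary fields}
edges = {}    # (source_id, edge_type) -> list of target ids

def add_object(oid, otype, **fields):
    objects[oid] = {"type": otype, **fields}

def add_edge(src, etype, dst):
    edges.setdefault((src, etype), []).append(dst)

# First-party and third-party (Open Graph) objects/edges share the store
add_object(1, "user", name="ada")
add_object(2, "song", title="some-song")
add_edge(1, "likes", 2)      # first-party edge type
add_edge(1, "listened", 2)   # developer-defined Open Graph edge type
print(edges[(1, "listened")])  # [2]
```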

Original title and link: Inside Facebook’s Open Graph (NoSQL database©myNoSQL)

What Big Data Is Used for at Facebook

Just a couple of examples: product and brand engagement, advertising.

A recent study we just published in the Proceedings of the National Academy of Sciences tells a new story about the way people adopt products and engage with them. The prevailing theories about this process suggested that what influences a person [to] adopt technologies is the number or percentage of friends who have already adopted the same technology, along with a person’s threshold for adopting such technologies. Our study shows that it’s less about the number of your friends who are using the technology, but more about their diversity. […] Some of the work we’re interested in understanding is how your friends influence your decisions to engage with advertising and brands.

Original title and link: What Big Data Is Used for at Facebook (NoSQL database©myNoSQL)


Yahoo Patent Letter to Facebook Referring to Memcached and Other Open Source Technologies

Sarah Lacy:

The technologies in question include things like memcached which was created in 2003 by LiveJournal and has been used longer than Facebook has been alive.[…]

Other examples include Open Compute, an open hardware project started by Facebook that focuses on low-cost, energy-efficient server and data center hardware; Tornado, a Python-based web server used for building real-time Web services; and HPHP, a source code transformer that turns PHP into C++.

I have no other details about the patent letter Yahoo sent Facebook, but I seriously doubt it targets these technologies separately. Most probably it refers to some combination of them, one that Facebook has mentioned as part of its IP.

Original title and link: Yahoo Patent Letter to Facebook Referring to Memcached and Other Open Source Technologies (NoSQL database©myNoSQL)


A Story of MySQL and InnoDB at Facebook Told by Mark Callaghan

Just whetting your appetite for this interview with Mark Callaghan about MySQL, InnoDB, and his work at Facebook:

Q: How do you make MySQL both “less slow” and “faster” at the same time?

A: I ask questions like, “If I can make it do 10 things per second today, can I make it do 20 things per second tomorrow?” For example, we used to use an algorithm that is very CPU intensive to check database pages. Another person on my team, Ryan Mack, modified it to use hardware support on x86 processors. We profiled the servers in production to see how much time they were spending computing these checksums. We then realized that the newest CPUs had a faster way to do that, so we modified MySQL to use CRC32 for checksums. The hard part there was upgrading the servers on the fly from the old checksums to the new checksums without taking the site down.
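For illustration, here is what a CRC32 page checksum looks like (this uses Python’s zlib as a stand-in, not InnoDB’s implementation; the hardware support mentioned is the SSE4.2 CRC32 instruction on modern x86 CPUs):

```python
import zlib

# A fake 16KB "database page"; InnoDB checksums pages of this kind to
# detect corruption on disk.
page = b"\x00" * 16 * 1024

# CRC32 is cheap to compute, especially with SSE4.2 hardware support;
# zlib's software CRC32 here just illustrates the interface.
checksum = zlib.crc32(page) & 0xFFFFFFFF
print(f"page checksum: {checksum:#010x}")
```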

Exciting and scary.

Original title and link: A Story of MySQL and InnoDB at Facebook Told by Mark Callaghan (NoSQL database©myNoSQL)


Facebook: There Are No Published Cases of NoSQL Databases Operating at the Scale of Facebook’s MySQL Database

Joe Maguire, referring to the Facebook talk embedded below about MySQL and HBase:

if Facebook doesn’t need NoSQL, who does?

My answer: many of those that cannot employ a specialized team to hack the hell out of MySQL to make it work at that scale.

On the flipside, many other companies don’t have the time or engineering power to grow their product together with a NoSQL database.


Memcached and Sherpa for Yahoo! News Activity Data Service

Mixer, Yahoo!’s recently announced data service for news activity, uses Memcached and Sherpa for its data backend, plus a combination of asynchronous libraries and task-execution tools:

(Diagram: Mixer architecture with Memcached and Sherpa for Yahoo! News Activity)

The data processing model and the clear separation between the read and write data solutions are not only compelling but essential for maintaining the SLA (max. 250ms per response):

Memcache maintains two types of materialized views: 1) Consumer-pivoted, and 2) Producer-pivoted. Consumer-pivoted views (e.g. user’s friends’ latest read activity) are refreshed at query time by refresh tasks. Producer-pivoted views (e.g. user’s latest read activity) are refreshed at update time (i.e. when “read” event is posted). And producer-pivoted views are used to refresh consumer-pivoted views.
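The two view types in the quote can be sketched in a few lines (names here are illustrative, not Mixer’s API):

```python
producer_view = {}  # user -> latest read activity; refreshed at update time
consumer_view = {}  # user -> friends' latest activity; refreshed at query time
friends = {"bob": ["alice", "carol"]}

def post_read_event(user, story):
    # Producer-pivoted view: refreshed when the "read" event is posted
    producer_view[user] = story

def friends_latest_reads(user):
    # Consumer-pivoted view: refreshed at query time from producer views
    consumer_view[user] = [producer_view[f] for f in friends[user]
                           if f in producer_view]
    return consumer_view[user]

post_read_event("alice", "story-42")
print(friends_latest_reads("bob"))  # ['story-42']
```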

Sherpa is Yahoo!’s cloud-based NoSQL data store that provides low-latency reads and writes of key-value records and short range scans. Efficient range scans are particularly important for the Mixer use cases. The “read” event is stored in the Updates table. The Updates table is a Sherpa Distributed Ordered Table that is ordered by “user, timestamp desc”. This provides efficient scans through a user’s latest read activity. A reference to the “read” record is stored in the UpdatesIndex table to support efficient point lookups. UpdatesIndex is a Sherpa Distributed Hash Table.
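The “ordered by user, timestamp desc” trick can be mimicked with an inverted-timestamp sort key, so a short forward scan returns a user’s newest activity first (a sketch of the key design, not Sherpa’s actual encoding):

```python
import bisect

MAX_TS = 10**13  # sentinel larger than any millisecond timestamp

table = []  # sorted list standing in for a Distributed Ordered Table

def put(user, ts, event):
    # Inverting the timestamp makes newer events sort first per user
    bisect.insort(table, ((user, MAX_TS - ts), event))

def latest(user, n):
    # Efficient short range scan starting at the user's first key
    i = bisect.bisect_left(table, ((user, 0), ""))
    return [e for (u, _), e in table[i:i + n] if u == user]

put("alice", 100, "read:story-1")
put("alice", 200, "read:story-2")
print(latest("alice", 2))  # newest first: ['read:story-2', 'read:story-1']
```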

Original title and link: Memcached and Sherpa for Yahoo! News Activity Data Service (NoSQL database©myNoSQL)


Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others

Shawn Rogers has a short but compelling list of Big Data deployments in his article Big Data is Scaling BI and Analytics. This list also shows that even if there are some common components like Hadoop, there are no blueprints yet for dealing with Big Data.

  • Facebook: Hadoop analytic data warehouse, using HDFS to store more than 30 petabytes of data. Their Big Data stack is based only on open source solutions.

  • Quantcast: 3,000-core, 3,500-terabyte Hadoop deployment that processes more than a petabyte of raw data each day

  • University of Nebraska-Lincoln: a Hadoop cluster storing 1.6 petabytes of physics data

  • Yahoo!: 100,000 CPUs in 40,000 computers, all running Hadoop. Also running a 12-terabyte MOLAP cube based on Tableau Software

  • eBay: has 3 separate analytics environments:

    • 6PB data warehouse for structured data and SQL access
    • 40PB deep analytics (Teradata)
    • 20PB Hadoop system to support advanced analytic workload on unstructured data

Original title and link: Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others (NoSQL database©myNoSQL)

Moving an Elephant: Large Scale Hadoop Data Migration at Facebook

The story of migrating 30PB of HDFS stored data:

The scale of this migration exceeded all previous ones, and we considered a couple of different migration strategies. One was a physical move of the machines to the new data center. We could have moved all the machines within a few days with enough hands at the job. However, this was not a viable option as our users and analysts depend on the data 24/7, and the downtime would be too long.

Facebook could have convinced part of the 750+ million users to help carry over the physical machines.

Luckily, HDFS has replication built in, and Facebook’s engineers have paired it with their own improvements:

Another approach was to set up a replication system that mirrors changes from the old cluster to the new, larger cluster. Then at the switchover time, we could simply redirect everything to the new cluster. This approach is more complex as the source is a live file system, with files being created and deleted continuously. Due to the unprecedented cluster size, a new replication system that could handle the load would need to be developed. However, because replication minimizes downtime, it was the approach that we decided to use for this massive migration.
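The mirror-then-switch idea reduces to bulk-copying a point-in-time snapshot and then replaying the changes logged while the copy ran; a toy version (all names illustrative, nothing like the real HDFS replication system):

```python
old = {"f1": "v1", "f2": "v2"}  # live "cluster": files keep changing
changelog = []                   # change events captured during the copy

def apply_change(fs, op, path, data=None):
    if op == "create":
        fs[path] = data
    elif op == "delete":
        fs.pop(path, None)

# 1) Bulk copy of a point-in-time snapshot to the new cluster
new = dict(old)

# 2) The source stays live; record every change as it happens
for op, path, data in [("create", "f3", "v3"), ("delete", "f1", None)]:
    apply_change(old, op, path, data)
    changelog.append((op, path, data))

# 3) At switchover, replay the log so the new cluster catches up
for op, path, data in changelog:
    apply_change(new, op, path, data)

assert new == old  # clusters converge; now redirect everything to `new`
```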

Another bit worth emphasizing in this post:

In 2010, Facebook had the largest Hadoop cluster in the world, with over 20 PB of storage. By March 2011, the cluster had grown to 30 PB — that’s 3,000 times the size of the Library of Congress!

Until now I thought Yahoo! had the largest Hadoop deployment, at least in terms of number of machines.

Original title and link: Moving an Elephant: Large Scale Hadoop Data Migration at Facebook (NoSQL database©myNoSQL)


An Alternative Approach for Big Data Real Time Analytics

Starting from the architecture of Facebook’s real-time analytics presented in the paper Apache Hadoop Goes Realtime at Facebook and Dhruba Borthakur’s excellent posts HDFS: Realtime Hadoop and HBase Usage at Facebook, Nati Shalom describes an alternative approach to real-time analytics using data grids, based on changing the following design assumptions:

They had some assumptions in design that centered around the reliability of in-memory systems and database neutrality that affected what they did: for memory, that transactional memory was unreliable, and for the database, that HBase was the only targeted data store.

What if those assumptions are changed? We can see reliable transactional memory in the field, as a requirement for any in-memory data grid, and certainly there are more databases than HBase; given database and platform neutrality, and reliable transactional memory, how could you build a realtime analytics system?

While it’s a great read, I get the feeling something is off. Maybe this:

There are lots of areas in which you can see potential improvements, if the assumptions are changed. As a contrast to Facebook’s working system: […] We can consolidate the analytics system so that management is easier and unified. While there are system management standards like SNMP that allow management events to be presented in the same way no matter the source, having so many different pieces means that managing the system requires an encompassing understanding, which makes maintenance and scaling more difficult.

and then:

One other advantage of data grids is in write-through support. With write-through, updates to the data grid are written asynchronously to a backend data store – which could be HBase (as used by Facebook), Cassandra, a relational database such as MySQL, or any other data medium you choose for long-term storage, should you need that.
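The pattern described in the quote (synchronous in-memory update, asynchronous flush to a backing store, often called write-behind) can be sketched as follows; the class and its names are illustrative, not any particular data grid’s API:

```python
import queue
import threading

class WriteBehindGrid:
    """Toy data grid: in-memory writes flushed asynchronously to a backend."""

    def __init__(self, backend_write):
        self.cache = {}
        self.pending = queue.Queue()
        t = threading.Thread(target=self._flush, args=(backend_write,),
                             daemon=True)
        t.start()

    def put(self, key, value):
        self.cache[key] = value         # fast, synchronous in-memory update
        self.pending.put((key, value))  # queued for the long-term store

    def _flush(self, backend_write):
        while True:
            key, value = self.pending.get()
            backend_write(key, value)   # could be HBase, Cassandra, MySQL
            self.pending.task_done()

backend = {}  # stand-in for any long-term store
grid = WriteBehindGrid(backend.__setitem__)
grid.put("user:1", "profile-data")
grid.pending.join()  # demo only: wait for the async flush to finish
print(backend["user:1"])  # 'profile-data'
```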

Original title and link: An Alternative Approach for Big Data Real Time Analytics (NoSQL database©myNoSQL)


Paper: Apache Hadoop Goes Realtime at Facebook

Links to the most publicized Hadoop use-case paper, which Facebook presented at SIGMOD 2011, with the paper and presentation slides:

Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application’s requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies. We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.

Also embedded below for your quick reference: