facebook: All content tagged as facebook in NoSQL databases and polyglot persistence
Wednesday, 16 May 2012
What Big Data Is Used for at Facebook
Just a couple of examples: product and brand engagement, advertising.
A recent study we just published in the Proceedings of the National Academy of Sciences tells a new story about the way people adopt products and engage with them. The prevailing theories about this process suggested that what influences a person [to] adopt technologies is the number or percentage of friends who have already adopted the same technology, along with a person’s threshold for adopting such technologies. Our study shows that it’s less about the number of your friends who are using the technology, but more about their diversity. […] Some of the work we’re interested in understanding is how your friends influence your decisions to engage with advertising and brands.
Original title and link: What Big Data Is Used for at Facebook (©myNoSQL)
via: http://www.technologyreview.com/printer_friendly_article.aspx?id=40184
Friday, 4 May 2012
Yahoo Patent Letter to Facebook Referring to Memcached and Other Open Source Technologies
Sarah Lacy:
The technologies in question include things like memcached which was created in 2003 by LiveJournal and has been used longer than Facebook has been alive.[…]
Other examples include Open Compute, an open hardware project started by Facebook that focuses on low-cost, energy efficient server and data center hardware; Tornado a Python-based web server used for building real-time Web services; and HPHP, a source code transformer that turns PHP into C++.
I have no other details about this patent letter Yahoo sent Facebook, but I seriously doubt it targets these technologies separately. Most probably it refers to some sort of combinations of these and one that Facebook has mentioned as part of their IP.
Original title and link: Yahoo Patent Letter to Facebook Referring to Memcached and Other Open Source Technologies (©myNoSQL)
Tuesday, 13 March 2012
A Story of MySQL and InnoDB at Facebook Told by Mark Callaghan
Just whetting your apetite for this interview with Mark Callaghan about MySQL, InnoDB, and his work at Facebook:
Q: How do you make MySQL both “less slow” and “faster” at the same time?
A: I ask questions like, “If I can make it do 10 things per second today, can I make it do 20 things per second tomorrow?” For example, we used to use an algorithm that is very CPU intensive to check database pages. Another person on my team, Ryan Mack, modified it to use hardware support on X86 processors so we could profile the servers in production to see what they were doing in these computing checksums. We then realized that the newest CPUs had a faster way to do that, so we modified the MySQL to use the CRC32 for checksums. The hard part there was upgrading the servers on the fly from using the old check zones to the new checksums without taking the site down.
Exciting and scary.
Original title and link: A Story of MySQL and InnoDB at Facebook Told by Mark Callaghan (©myNoSQL)
Sunday, 11 December 2011
Facebook: There Are No Published Cases of NoSQL Databases Operating at the Scale of Facebook’s MySQL Database
Joe Maguire referring to the Facebook talk embedded below MySQL and HBase:
if Facebook doesn’t need NoSQL, who does?
My answer: many of those that cannot employ a specialized team to hack the hell out of MySQL to make it work at that scale.
On the flipside, many other companies don’t have the time or engineering power to grow their product together with a NoSQL database.
via: http://josephmaguire.blogspot.com/2011/12/facebook-there-are-no-published-cases.html
Friday, 23 September 2011
Memcached and Sherpa for Yahoo! News Activity Data Service
Mixer, the recently announced Yahoo’s new data service for news activities, uses Memcached and Sherpa for its data backend. Plus a combination of asynchronous libraries and task execution tools:

The data processing model and the clear separation between read and write data solutions is not only compelling, but essential for maintaining the SLA (max. 250ms/response):
Memcache maintains two types of materialized views: 1) Consumer-pivoted, and 2) Producer-pivoted. Consumer-pivoted views (e.g. user’s friends’ latest read activity) are refreshed at query time by refresh tasks. Producer-pivoted views (e.g. user’s latest read activity) are refreshed at update time (i.e. when “read” event is posted). And producer-pivoted views are used to refresh consumer-pivoted views.
Sherpa is Yahoo!’s cloud-based NoSql data store that provides low-latency reads and writes of key-value records and short range scans. Efficient range scans are particular important for the Mixer use cases. The “read” event is stored in the Updates table. The Updates table is a Sherpa Distributed Ordered Table that is ordered by “user,timestamp desc”. This provides efficient scans through a user’s latest read activity. A reference to the “read” record is stored in the UpdatesIndex table to support efficient point lookups. UpdatesIndex is a Sherpa Distributed Hash Table
Original title and link: Memcached and Sherpa for Yahoo! News Activity Data Service (©myNoSQL)
Thursday, 22 September 2011
Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others
Shawn Rogers has a short but compelling list of Big Data deployments in his article Big Data is Scaling BI and Analytics. This list also shows that even if there are some common components like Hadoop, there are no blueprints yet for dealing with Big Data.
-
Facebook: Hadoop analytic data warehouse, using HDFS to store more than 30 petabytes of data. Their Big Data stack is based only on open source solutions.
-
Quantcast: 3,000 core, 3,500 terabyte Hadoop deployment that processes more than a petabyte of raw data each day
-
University of Nebraska-Lincoln: 1.6 petabytes of physics data Hadoop cluster
-
Yahoo!: 100,000 CPUs in 40,000 computers, all running Hadoop. Also running a 12 terabyte MOLAP cube based on Tableau Software
-
eBay: has 3 separate analytics environments:
- 6PB data warehouse for structured data and SQL access
- 40PB deep analytics (Teradata)
- 20PB Hadoop system to support advanced analytic workload on unstructured data
Original title and link: Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others (©myNoSQL)
Monday, 8 August 2011
Moving an Elephant: Large Scale Hadoop Data Migration at Facebook
The story of migrating 30PB of HDFS stored data:
The scale of this migration exceeded all previous ones, and we considered a couple of different migration strategies. One was a physical move of the machines to the new data center. We could have moved all the machines within a few days with enough hands at the job. However, this was not a viable option as our users and analysts depend on the data 24/7, and the downtime would be too long.
Facebook could have convinced part of the 750+ million users to help carry over the physical machines.
Luckily HDFS has replication built in and Facebook engineers have paired it with their own improvements:
Another approach was to set up a replication system that mirrors changes from the old cluster to the new, larger cluster. Then at the switchover time, we could simply redirect everything to the new cluster. This approach is more complex as the source is a live file system, with files being created and deleted continuously. Due to the unprecedented cluster size, a new replication system that could handle the load would need to be developed. However, because replication minimizes downtime, it was the approach that we decided to use for this massive migration.
Another bit worth emphasizing in this post:
In 2010, Facebook had the largest Hadoop cluster in the world, with over 20 PB of storage. By March 2011, the cluster had grown to 30 PB — that’s 3,000 times the size of the Library of Congress!
Until now I thought Yahoo! has the largest Hadoop deployment, at least in terms of number of machines.
Original title and link: Moving an Elephant: Large Scale Hadoop Data Migration at Facebook (©myNoSQL)
Thursday, 4 August 2011
An Alternative Approach for Big Data Real Time Analytics
Starting from the architecture of Facebook’s realtime analytics presented in the paper Apache Hadoop Goes Realtime at Facebook and Dhruba Borthakur’s excellent posts HDFS: Realtime Hadoop and HBase Usage at Facebook, Nati Shalom describes an alternative approach for real-time analytics using data grids making the following assumptions:
They had some assumptions in design that centered around the reliability of in-memory systems and database neutrality that affected what they did: for memory, that transactional memory was unreliable, and for the database, that HBase was the only targeted data store.
What if those assumptions are changed? We can see reliable transactional memory in the field, as a requirement for any in-memory data grid, and certainly there are more databases than HBase; given database and platform neutrality, and reliable transactional memory, how could you build a realtime analytics system?
While a great read, I get the feeling there’s something wrong. Maybe this:
There are lots of areas in which you can see potential improvements, if the assumptions are changed. As a contrast to Facebook’s working system: […] We can consolidate the analytics system so that management is easier and unified. While there are system management standards like SNMP that allow management events to be presented in the same way no matter the source, having so many different pieces means that managing the system requires an encompassing understanding, which makes maintenance and scaling more difficult.
and then:
One other advantage of data grids is in write-through support. With write-through, updates to the data grid are written asynchronously to a backend data store – which could be HBase (as used by Facebook), Cassandra, a relational database such as MySQL, or any other data medium you choose for long-term storage, should you need that.
Original title and link: An Alternative Approach for Big Data Real Time Analytics (©myNoSQL)
Monday, 4 July 2011
Paper: Apache Hadoop Goes Realtime at Facebook
Links to the most publicized paper of a Hadoop use case, Facebook has presented at SIGMOD 2011: paper and presentation slides:
Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application’s requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies. We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.
Also embedded below for your quick reference:
Monday, 6 June 2011
HDFS: Realtime Hadoop and HBase Usage at Facebook
Dhruba Borthakur started a series of posts — part 1 and part 2 — describing both the process that lead Facebook to using HBase and Hadoop, but also the projects where these are used and their requirements:
After considerable research and experimentation, we chose Hadoop and HBase as the foundational storage technology for these next generation applications. The decision was based on the state of HBase at the point of evaluation as well as our confidence in addressing the features that were lacking at that point via in- house engineering. HBase already provided a highly consistent, high write-throughput key-value store. The HDFS NameNode stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode (AvatarNode) in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase’’s version of LSM Trees, making local DataNode reads efficient and caching NameNode metadata). Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects, both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.
The second part is describing 3 problems Facebook is solving using HBase and Hadoop and provides further details on the requirements of each of these.
The two posts represent a great resource for understanding not only where HBase and Hadoop can be used, but also on how to formulate the requirements (and non-requirements) for new systems.
A Facebook team will present the paper “Apache Hadoop Goes Realtime at Facebook” at ACM SIGMOD. I’m looking forward for the moment the paper will be available.
Original title and link: HDFS: Realtime Hadoop and HBase Usage at Facebook (NoSQL databases © myNoSQL)
Tuesday, 22 March 2011
Firefox Downloads Visualization Powered by HBase
Not only is Mozilla celebrating the release of Firefox 4, but they took the time to set up a nice visualization for downloads.
glow.mozilla.org is powered by tailing logs and streaming data into HBase:
- The various load balancing clusters that host download.mozilla.org are configured to log download requests to a remote syslog server.
- The remote server is running rsyslog and has a config that specifically filters those remote syslog events into a dedicated file that rolls over hourly
- SQLStream is installed on that server and it is tailing those log files as they appear.
- The SQLStream pipeline does the following for each request:
- filtering out anything other than valid download requests
- uses MaxMind GeoIP to get a geographic location from the IP address
- uses a streaming group by to aggregate the number of downloads by product, location, and timestamp
- every 10 seconds, sends a stream of counter increments to HBase for the timestamp row with the column qualifiers being each distinct location that had downloads in that time interval
- The glow backend is a python app that pulls the data out of HBase using the Python Thrift interface and writes a file containing a JSON representation of the data every minute.
- That JSON file can be cached on the front-end forever since each minute of data has a distinct filename
- The glow website pulls down that data and plays back the downloads or allows you to browse the geographic totals in the arc chart view
This sounds a lot like what Facebook is doing for the new Real-Time Analytics system.. The parts missing are Scribe and ptail.
Original title and link: Firefox Downloads Visualization Powered by HBase (NoSQL databases © myNoSQL)
via: http://blog.mozilla.com/data/2011/03/22/how-glow-mozilla-org-gets-its-data/
Saturday, 5 March 2011
Facebook Builds HBase-based Real-Time Analytics
More applications of HBase at Facebook, after the new messaging system:
If you are interesting to read more about Facebook messages here’s a list of posts:
- Facebook replacing Cassandara with HBase in new messaging system
- The underlying technology of messages using HBase
- HBase at Facebook and why not MySQL or Cassandra
- HBase at Facebook: A technical presentation of the underlying technology
- Facebook messages: a presentation from FOSDEM
Original title and link: Facebook Builds HBase-based Real-Time Analytics (NoSQL databases © myNoSQL)
Most Popular Articles
- Translate SQL to MongoDB MapReduce
- Tutorial: Getting Started With Cassandra
- CouchDB vs MongoDB: An attempt for a More Informed Comparison
- Cassandra @ Twitter: An Interview with Ryan King
- A Couple of Nice GUI Tools for MongoDB
- NoSQL benchmarks and performance evaluations
- Ehcache: Distributed Cache or NoSQL Store?
- Document Databases Compared: CouchDB, MongoDB, RavenDB
- Quick Review of Existing Graph Databases
- NoSQL Data Modeling