ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Membase Amazon SimpleDB MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

facebook: All content tagged as facebook in NoSQL databases and polyglot persistence

Facebook: There Are No Published Cases of NoSQL Databases Operating at the Scale of Facebook’s MySQL Database

Joe Maguire referring to the Facebook talk embedded below MySQL and HBase:

if Facebook doesn’t need NoSQL, who does?

My answer: many of those that cannot employ a specialized team to hack the hell out of MySQL to make it work at that scale.

On the flipside, many other companies don’t have the time or engineering power to grow their product together with a NoSQL database.

Original title and link: Facebook: There Are No Published Cases of NoSQL Databases Operating at the Scale of Facebook’s MySQL Database (NoSQL database©myNoSQL)

via: http://josephmaguire.blogspot.com/2011/12/facebook-there-are-no-published-cases.html


Memcached and Sherpa for Yahoo! News Activity Data Service

Mixer, the recently announced Yahoo’s new data service for news activities, uses Memcached and Sherpa for its data backend. Plus a combination of asynchronous libraries and task execution tools:

Mixer - Memcached Sherpa Yahoo News Activity

The data processing model and the clear separation between read and write data solutions is not only compelling, but essential for maintaining the SLA (max. 250ms/response):

Memcache maintains two types of materialized views: 1) Consumer-pivoted, and 2) Producer-pivoted. Consumer-pivoted views (e.g. user’s friends’ latest read activity) are refreshed at query time by refresh tasks. Producer-pivoted views (e.g. user’s latest read activity) are refreshed at update time (i.e. when “read” event is posted). And producer-pivoted views are used to refresh consumer-pivoted views.

Sherpa is Yahoo!’s cloud-based NoSql data store that provides low-latency reads and writes of key-value records and short range scans. Efficient range scans are particular important for the Mixer use cases. The “read” event is stored in the Updates table. The Updates table is a Sherpa Distributed Ordered Table that is ordered by “user,timestamp desc”. This provides efficient scans through a user’s latest read activity. A reference to the “read” record is stored in the UpdatesIndex table to support efficient point lookups. UpdatesIndex is a Sherpa Distributed Hash Table

Original title and link: Memcached and Sherpa for Yahoo! News Activity Data Service (NoSQL database©myNoSQL)

via: http://developer.yahoo.com/blogs/ydn/posts/2011/09/mixer-?-the-data-service-that-powers-yahoo-news-activity/


Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others

Shawn Rogers has a short but compelling list of Big Data deployments in his article Big Data is Scaling BI and Analytics. This list also shows that even if there are some common components like Hadoop, there are no blueprints yet for dealing with Big Data.

  • Facebook: Hadoop analytic data warehouse, using HDFS to store more than 30 petabytes of data. Their Big Data stack is based only on open source solutions.

  • Quantcast: 3,000 core, 3,500 terabyte Hadoop deployment that processes more than a petabyte of raw data each day

  • University of Nebraska-Lincoln: 1.6 petabytes of physics data Hadoop cluster

  • Yahoo!: 100,000 CPUs in 40,000 computers, all running Hadoop. Also running a 12 terabyte MOLAP cube based on Tableau Software

  • eBay: has 3 separate analytics environments:

    • 6PB data warehouse for structured data and SQL access
    • 40PB deep analytics (Teradata)
    • 20PB Hadoop system to support advanced analytic workload on unstructured data

Original title and link: Big Data Is Going Mainstream: Facebook, Yahoo!, eBay, Quantcast, and Many Others (NoSQL database©myNoSQL)


Moving an Elephant: Large Scale Hadoop Data Migration at Facebook

The story of migrating 30PB of HDFS stored data:

The scale of this migration exceeded all previous ones, and we considered a couple of different migration strategies. One was a physical move of the machines to the new data center. We could have moved all the machines within a few days with enough hands at the job. However, this was not a viable option as our users and analysts depend on the data 24/7, and the downtime would be too long.

Facebook could have convinced part of the 750+ million users to help carry over the physical machines.

Luckily HDFS has replication built in and Facebook engineers have paired it with their own improvements:

Another approach was to set up a replication system that mirrors changes from the old cluster to the new, larger cluster. Then at the switchover time, we could simply redirect everything to the new cluster. This approach is more complex as the source is a live file system, with files being created and deleted continuously. Due to the unprecedented cluster size, a new replication system that could handle the load would need to be developed. However, because replication minimizes downtime, it was the approach that we decided to use for this massive migration.

Another bit worth emphasizing in this post:

In 2010, Facebook had the largest Hadoop cluster in the world, with over 20 PB of storage. By March 2011, the cluster had grown to 30 PB — that’s 3,000 times the size of the Library of Congress!

Until now I thought Yahoo! has the largest Hadoop deployment, at least in terms of number of machines.

Original title and link: Moving an Elephant: Large Scale Hadoop Data Migration at Facebook (NoSQL database©myNoSQL)

via: http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920


An Alternative Approach for Big Data Real Time Analytics

Starting from the architecture of Facebook’s realtime analytics presented in the paper Apache Hadoop Goes Realtime at Facebook and Dhruba Borthakur’s excellent posts HDFS: Realtime Hadoop and HBase Usage at Facebook, Nati Shalom describes an alternative approach for real-time analytics using data grids making the following assumptions:

They had some assumptions in design that centered around the reliability of in-memory systems and database neutrality that affected what they did: for memory, that transactional memory was unreliable, and for the database, that HBase was the only targeted data store.

What if those assumptions are changed? We can see reliable transactional memory in the field, as a requirement for any in-memory data grid, and certainly there are more databases than HBase; given database and platform neutrality, and reliable transactional memory, how could you build a realtime analytics system?

While a great read, I get the feeling there’s something wrong. Maybe this:

There are lots of areas in which you can see potential improvements, if the assumptions are changed. As a contrast to Facebook’s working system: […] We can consolidate the analytics system so that management is easier and unified. While there are system management standards like SNMP that allow management events to be presented  in the same way no matter the source, having so many different pieces means that managing the system requires an encompassing understanding, which makes maintenance and scaling more difficult.

and then:

One other advantage of data grids is in write-through support. With write-through, updates to the data grid are written asynchronously to a backend data store – which could be HBase (as used by Facebook), Cassandra, a relational database such as MySQL, or any other data medium you choose for long-term storage, should you need that.

Original title and link: An Alternative Approach for Big Data Real Time Analytics (NoSQL database©myNoSQL)

via: http://natishalom.typepad.com/nati_shaloms_blog/2011/07/real-time-analytics-for-big-data-an-alternative-approach.html


Paper: Apache Hadoop Goes Realtime at Facebook

Links to the most publicized paper of a Hadoop use case, Facebook has presented at SIGMOD 2011: paper and presentation slides:

Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application’’s requirements for consistency, availability, partition tolerance, data model and scalability. We explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies. We discuss the motivations behind our design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. We offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.

Also embedded below for your quick reference:


HDFS: Realtime Hadoop and HBase Usage at Facebook

Dhruba Borthakur started a series of posts — part 1 and part 2 — describing both the process that lead Facebook to using HBase and Hadoop, but also the projects where these are used and their requirements:

After considerable research and experimentation, we chose Hadoop and HBase as the foundational storage technology for these next generation applications. The decision was based on the state of HBase at the point of evaluation as well as our confidence in addressing the features that were lacking at that point via in- house engineering. HBase already provided a highly consistent, high write-throughput key-value store. The HDFS NameNode stood out as a central point of failure, but we were confident that our HDFS team could build a highly-available NameNode (AvatarNode) in a reasonable time-frame, and this would be useful for our warehouse operations as well. Good disk read-efficiency seemed to be within striking reach (pending adding Bloom filters to HBase’’s version of LSM Trees, making local DataNode reads efficient and caching NameNode metadata). Based on our experience operating the Hive/Hadoop warehouse, we knew HDFS was stellar in tolerating and isolating faults in the disk subsystem. The failure of entire large HBase/HDFS clusters was a scenario that ran against the goal of fault-isolation, but could be considerably mitigated by storing data in smaller HBase clusters. Wide area replication projects, both in-house and within the HBase community, seemed to provide a promising path to achieving disaster recovery.

The second part is describing 3 problems Facebook is solving using HBase and Hadoop and provides further details on the requirements of each of these.

The two posts represent a great resource for understanding not only where HBase and Hadoop can be used, but also on how to formulate the requirements (and non-requirements) for new systems.

A Facebook team will present the paper “Apache Hadoop Goes Realtime at Facebook” at ACM SIGMOD. I’m looking forward for the moment the paper will be available.

Original title and link: HDFS: Realtime Hadoop and HBase Usage at Facebook (NoSQL databases © myNoSQL)


Firefox Downloads Visualization Powered by HBase

Not only is Mozilla celebrating the release of Firefox 4, but they took the time to set up a nice visualization for downloads.

glow.mozilla.org is powered by tailing logs and streaming data into HBase:

  1. The various load balancing clusters that host download.mozilla.org are configured to log download requests to a remote syslog server.
  2. The remote server is running rsyslog and has a config that specifically filters those remote syslog events into a dedicated file that rolls over hourly
  3. SQLStream is installed on that server and it is tailing those log files as they appear.
  4. The SQLStream pipeline does the following for each request:
    • filtering out anything other than valid download requests
    • uses MaxMind GeoIP to get a geographic location from the IP address
    • uses a streaming group by to aggregate the number of downloads by product, location, and timestamp
    • every 10 seconds, sends a stream of counter increments to HBase for the timestamp row with the column qualifiers being each distinct location that had downloads in that time interval
  5. The glow backend is a python app that pulls the data out of HBase using the Python Thrift interface and writes a file containing a JSON representation of the data every minute.
  6. That JSON file can be cached on the front-end forever since each minute of data has a distinct filename
  7. The glow website pulls down that data and plays back the downloads or allows you to browse the geographic totals in the arc chart view

This sounds a lot like what Facebook is doing for the new Real-Time Analytics system.. The parts missing are Scribe and ptail.

Original title and link: Firefox Downloads Visualization Powered by HBase (NoSQL databases © myNoSQL)

via: http://blog.mozilla.com/data/2011/03/22/how-glow-mozilla-org-gets-its-data/


Facebook Builds HBase-based Real-Time Analytics

More applications of HBase at Facebook, after the new messaging system:

If you are interesting to read more about Facebook messages here’s a list of posts:

Original title and link: Facebook Builds HBase-based Real-Time Analytics (NoSQL databases © myNoSQL)


Facebook Messages: FOSDEM NoSQL Event

From this year’s FOSDEM, Facebook talking about the technology behind the messaging platform:

Original title and link: Facebook Messages: FOSDEM NoSQL Event (NoSQL databases © myNoSQL)


HBase at Facebook: The Underlying Technology of Messages

There have been lots of discussions and speculations after the announcement that Facebook is using HBase for the new messaging system. In case you missed it, here are the most important bits:

  • Kannan Muthukkaruppan: The underlying Technology of Messages (facebook.com)

    We spent a few weeks setting up a test framework to evaluate clusters of MySQL, Apache Cassandra, Apache HBase, and a couple of other systems. We ultimately chose HBase. MySQL proved to not handle the long tail of data well; as indexes and data sets grew large, performance suffered. We found Cassandra’s eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure.

  • Quora.com: How does HBase write performance differ from write performance in Cassandra with consistency level ALL

    While setting the a write consistency level of ALL with a read level of ONE in Cassandra provides a strong consistency model similar to what HBase provides (and in fact using quorum writes and reads would as well), the two operations are actually semantically different and lead to different durability and availability guarantees.

  • Cassandra mailing list: Facebook messaging and choice of HBase over Cassandra

  • Todd Hoff: Facebook’s New Real-Time Messaging System: HBase to Store 135+ Billion Messages a Month (highscalability.com)

    HBase is a scaleout table store supporting very high rates of row-level updates over massive amounts of data. Exactly what is needed for a Messaging system. HBase is also a column based key-value store built on the BigTable model. It’s good at fetching rows by key or scanning ranges of rows and filtering. Also what is needed for a Messaging system. Complex queries are not supported however. Queries are generally given over to an analytics tool like Hive, which Facebook created to make sense of their multi-petabyte data warehouse, and Hive is based on Hadoop’s file system, HDFS, which is also used by HBase.

  • Jeremiah Peschka: Facebook messaging - HBase Comes of Age (facility9.com)

    Existing expertise: The technology behind HBase – Hadoop and HDFS – is very well understood and has been used previously at Facebook. […] Since Hive makes use of Hadoop and HDFS, these shared technologies are well understood by Facebook’s operations teams. As a result, the same technology that allows Facebook to scale their data will be the technology that allows Facebook to scale their Social Messaging feature. The operations team already understands many of the problems they will encounter.

  • Quora.com: What version of HBase is Facebook using for its new messaging platform?

    Facebook has an internal branch of HBase which periodically updates from the Apache SVN. As far as I know, the current version in production is very similar to the 0.89.20100924 development release with a couple more patches pulled in from trunk.

    Facebook engineers continue to actively contribute to the open source trunk, though - it’s not an internal “fork”

    Todd Lipcon (HBase committer)

The engineering team behind Facebook’s new messaging system has posted now a video talking more about their choice of HBase. You can watch the a bit over 1 hour long video here.

The engineering team behind Facebook Messages spent the past year building out a robust, scalable infrastructure. We shared some details about the technology on our Engineering Blog (http://fb.me/95OQ8YaD2rkb3r). This tech talk digs deeper into some of the twenty different infrastructure services we created for the project as well as how we’re using Apache HBase.

I’m still watching the video, so my notes will follow.

Why HBase?

Choosing HBase at Facebook

  • Strong consistency model
  • Automatic failover
  • Multiple shards per server for load balancing

  • Prevents cascading failures

  • Compression: save disk space and network bandwidth

  • Read-modify-write operation support, like counter increment
  • Map Reduce supported out of the box

I’m still not sure why one needs a strong consistency model for messages (and that’s the part missing from all these articles).

As a side note, I feel like the decission was based not on some major facts, but rather a sum of small but important features that HBase was offering compared to other solutions (i.e. consistent increments, perfect integration with Hadoop, etc.)

Original title and link: HBase at Facebook: The Underlying Technology of Messages (NoSQL databases © myNoSQL)


MySQL at Facebook with Mark Callaghan

When Facebook talks MySQL, it usually means BigData MySQL, high availability and scalable MySQL, and last, but not least NoSQLized MySQL. Mark Callaghan:

And another one from O’Reilly MySQL Conference: