MySQL at Twitter: Storing 250mil Tweets Daily

Todd Hoff took the time to dissect Jeremy Cole’s talk[1] “Big and Small Data at @Twitter” from the O’Reilly MySQL conference and extract the interesting bits in a post:

  • MySQL works well enough most of the time that it’s worth using. Twitter values stability over features so they’ve stayed with older releases.
  • MySQL doesn’t work for ID generation and graph storage.
  • MySQL is used for smaller datasets of < 1.5TB, which is the size of their RAID array, and as a backing store for larger datasets.
  • Typical database server config: HP DL380, 72GB RAM, 24 disk RAID10. Good balance of memory and disk.

In my summary of the talk I’ve noted:

  • Use MySQL when it works, something else when not - fortunately MySQL often does work
  • MySQL is used by Twitter because it’s robust, replication works and it’s easy to use and run
  • MySQL doesn’t work well for graphs or auto_increment ID generation, and replication lag is a problem
  • MySQL replication improvements, like a crash-safe multi-threaded slave, are what they need
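
To illustrate the ID generation point: a single auto_increment column needs one authoritative counter, which gets awkward once writes are spread over many servers. The talk summary does not say what replaces it, so the following is only a generic sketch of one common alternative, time-ordered 64-bit IDs built from a millisecond timestamp, a worker number and a per-millisecond sequence. The class name, the bit widths (10 worker bits, 12 sequence bits) and the `clock` parameter are assumptions made for illustration, not details from the talk.

```python
import threading
import time


class IdGenerator:
    """Hypothetical time-ordered ID generator (illustration only)."""

    WORKER_BITS = 10      # assumed width: up to 1024 workers
    SEQUENCE_BITS = 12    # assumed width: up to 4096 IDs per ms per worker
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id, epoch_ms=0, clock=None):
        if not 0 <= worker_id < (1 << self.WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self.epoch_ms = epoch_ms
        # clock returns the current time in milliseconds; injectable for tests
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # refuse to hand out IDs that could collide with earlier ones
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # sequence exhausted for this millisecond: wait for the next
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - self.epoch_ms
            return (
                (timestamp << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self.worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )


if __name__ == "__main__":
    gen = IdGenerator(worker_id=3)
    print([gen.next_id() for _ in range(3)])
```

Each worker generates IDs without talking to a central database, and the IDs still sort roughly by creation time because the timestamp occupies the high bits.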

But Twitter is also one of the most prominent use cases of polyglot persistence. While MySQL is an important piece of the Twitter architecture, it is not the only storage or data processing engine.

The following other data solutions get mentioned in Jeremy’s talk:

  • Cassandra is used for high velocity writes and lower velocity reads. The advantages are that Cassandra can run on cheaper hardware than MySQL, it can expand more easily, and they like the schemaless design.
  • Hadoop is used to process unstructured and large datasets, hundreds of billions of rows.
  • Vertica is being used for analytics and large aggregations and joins so they don’t have to write MapReduce jobs. 

Yet that’s not the whole story. Twitter is using Cassandra and Memcached for real-time URL fetchers, and they also experimented with using Gizzard for Redis. After buying BackType, Twitter got and then open sourced Storm, a Hadoop-like real-time data processing tool. And who knows what’s in the Twitter labs right now.

I’m embedding below Jeremy Cole’s “Big and Small Data at @Twitter”:

  1. Jeremy Cole is a DBA Team Lead/Database Architect at Twitter  

Original title and link: MySQL at Twitter: Storing 250mil Tweets Daily (NoSQL database©myNoSQL)