An interesting pull request submitted to Redis by Pierre-Yves Ritschard:
This is similar to what Snowflake and the recent Boundary solution do, but it makes sense to use Redis for this type of use case for people wanting a simple way to get incremental IDs in distributed systems without requiring an additional daemon.
Snowflake is the network service for generating unique ID numbers at high scale used (and open sourced) by Twitter. Flake is a decentralized, k-ordered ID generation service written in Erlang and open sourced by Boundary.
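To make the k-ordered idea concrete, here is a minimal sketch of a Snowflake-style generator: a millisecond timestamp in the high bits, a worker id, and a per-millisecond sequence in the low bits, so ids sort roughly by creation time. The bit widths follow Snowflake's published 64-bit layout; the Redis pull request applies the same idea to a 128-bit layout, and the class name here is illustrative, not any project's actual API.

```python
import threading
import time

# Snowflake-style layout: 41-bit ms timestamp | 10-bit worker id | 12-bit sequence.
# The custom epoch shrinks the timestamp so it fits in 41 bits.
EPOCH_MS = 1288834974657
WORKER_BITS = 10
SEQUENCE_BITS = 12

class SnowflakeLike:
    def __init__(self, worker_id):
        assert 0 <= worker_id < (1 << WORKER_BITS)
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                # same millisecond: bump the per-millisecond sequence
                self.sequence = (self.sequence + 1) & ((1 << SEQUENCE_BITS) - 1)
                if self.sequence == 0:
                    # sequence exhausted; spin until the next millisecond
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return (((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                    | (self.worker_id << SEQUENCE_BITS)
                    | self.sequence)

gen = SnowflakeLike(worker_id=1)
a, b = gen.next_id(), gen.next_id()
assert a < b   # ids are k-ordered: roughly sorted by generation time
```

Because no coordination is needed beyond assigning worker ids, each node can generate ids independently, which is exactly why this avoids a central `auto_increment` bottleneck.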
Original title and link: Redis: Adding a 128-Bit K-Ordered Unique Id Generator ( ©myNoSQL)
- MySQL works well enough most of the time that it’s worth using. Twitter values stability over features so they’ve stayed with older releases.
- MySQL doesn’t work for ID generation and graph storage.
- MySQL is used for smaller datasets of < 1.5TB, which is the size of their RAID array, and as a backing store for larger datasets.
- Typical database server config: HP DL380, 72GB RAM, 24 disk RAID10. Good balance of memory and disk.
In my summary of the talk I’ve noted:
- Use MySQL when it works, something else when not - fortunately MySQL often does work
- MySQL is used by Twitter because it’s robust, replication works and it’s easy to use and run
- MySQL doesn’t work well for graphs or auto_increment, and replication lag is a problem
- MySQL replication improvements like a crash-safe, multi-threaded slave are what they need
But Twitter is also one of the most prominent examples of polyglot persistence. While MySQL is an important piece of the Twitter architecture, it is not the only storage or data processing engine.
The following other data solutions get mentioned in Jeremy’s talk:
- Cassandra is used for high velocity writes and lower velocity reads. Its advantages: Cassandra can run on cheaper hardware than MySQL, it scales out more easily, and they like its schemaless design.
- Hadoop is used to process unstructured and large datasets, hundreds of billions of rows.
- Vertica is being used for analytics and large aggregations and joins so they don’t have to write MapReduce jobs.
Yet that’s not the whole story. Twitter is using Cassandra and Memcached for real-time URL fetchers, and it has also experimented with running Gizzard on top of Redis. After acquiring BackType, Twitter picked up and then open sourced Storm, a Hadoop-like real-time data processing tool. And who knows what’s in the Twitter labs right now.
I’m embedding below Jeremy Cole’s “Big and Small Data at @Twitter”:
That’s to say BackType, creator of Cascalog, ElephantDB, and Storm, has been acquired by Twitter (which, in case you didn’t know, names most of its open source libraries and storage solutions after birds).
The announcement is here. Looking forward to seeing Storm open sourced.
Original title and link: ElephantDB and Storm Join the Twitter Flock ( ©myNoSQL)
Jeremy Cole, DBA Lead at Twitter, talking about MySQL at Twitter at MySQL CE 2011:
Roland Bouman had some interesting notes (nb: actually tweets) from the talk:
115 mln tweets a day, 1 bln tweets a week, about 50,000 new accounts/day
random server: uptime 212d, 127 bln questions (6,943/s), rows read: 1.36 mln/s
Use MySQL when it works, something else when not - fortunately MySQL often does work
MySQL is used by twitter because it’s robust, replication works and it’s easy to use and run
MySQL doesn’t work well for graphs or auto_increment; replication lag is a problem
MySQL replication improvements like a crash-safe, multi-threaded slave are exactly what they need
Twitter open sourced snowflake (id generation system) and Gizzard distributed data storage
Use soft launches: new code is launched in a disabled state, turn up slowly, back down if needed
Gizzard builds on MySQL/InnoDB and handles sharding, replication, job scheduling
Twitter uses Cassandra too for some projects. high velocity writes, schemaless design
Twitter uses Hadoop for analyzing extremely large datasets: 10 to 100 blns rows (http logs)
Twitter also uses Vertica for analysis of 100 mln to 10 bln rows; runs 100x faster than MySQL
MySQL’s happy place: <= 1.5 TB datasets, archive store for larger sets.
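The "soft launch" note above (new code ships disabled, is turned up slowly, and can be turned back down) is commonly implemented as a percentage-based feature flag. A minimal sketch of that pattern follows; the flag name and `FLAGS` store are purely illustrative, not Twitter's actual implementation.

```python
import hashlib

# Percentage-based feature flag: code ships at 0% (disabled), is ramped up
# gradually, and can be rolled back without a redeploy.
# The flag name and in-memory store below are hypothetical examples.
FLAGS = {"new_timeline_path": 0}   # feature -> enabled percentage, 0..100

def is_enabled(feature, user_id):
    pct = FLAGS.get(feature, 0)
    # hash the (feature, user) pair so each user lands in a stable bucket;
    # the same user stays enabled as the percentage only moves up
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < pct

FLAGS["new_timeline_path"] = 10    # ramp up to roughly 10% of users
enabled = sum(is_enabled("new_timeline_path", u) for u in range(10000))
# setting the percentage back to 0 disables the code path for everyone
```

Hashing per user (rather than picking randomly per request) keeps each user's experience consistent across requests, which matters when the new code path changes visible behavior.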
Three presentations covering the various NoSQL usages at Twitter:
Kevin Weil talking about data analysis using Scribe for logging, base analysis with Pig/Hadoop, and specialized data analysis with HBase, Cassandra, and FlockDB on InfoQ
Ryan King’s presentation from last year’s QCon SF NoSQL track on Gizzard, Cassandra, Hadoop, and Redis on InfoQ
Dmitriy Ryaboy on Hadoop from Devoxx 2010:
- Twitter: Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis
- Facebook: Cassandra, HBase, Hadoop, Scribe, Hive
- Netflix: Amazon SimpleDB, Cassandra
- Digg: Cassandra
- SimpleGeo: Cassandra
- StumbleUpon: HBase, OpenTSDB
- Yahoo!: Hadoop, HBase, PNUTS
- Rackspace: Cassandra
And probably many more missing from the list. But that could change if you leave a comment.
Kevin Weil presented Twitter’s ZooKeeper and Cassandra based solution for realtime analytics named Rainbird at Strata 2011:
Until recently, counters were a feature unique to HBase. While the latest released version of Cassandra does not include distributed counters, the feature is already available in Cassandra’s trunk.
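The core idea behind a counter-based analytics system like Rainbird can be sketched in a few lines: every event increments counters at several time granularities at once, so reads become simple lookups. This is an in-memory illustration only; a plain dict stands in for what would be Cassandra distributed counter columns, and the key names are made up for the example.

```python
from collections import defaultdict

# Each event bumps one counter per time granularity, plus an all-time total.
# In a Cassandra-backed system these would be distributed counter columns;
# here a dict keyed by (key, granularity, bucket) stands in for them.
GRANULARITIES = {"minute": 60, "hour": 3600, "day": 86400}
counters = defaultdict(int)

def record(key, timestamp, count=1):
    for name, seconds in GRANULARITIES.items():
        bucket = timestamp - (timestamp % seconds)   # start of the time bucket
        counters[(key, name, bucket)] += count
    counters[(key, "all", 0)] += count

def read(key, granularity, timestamp):
    bucket = timestamp - (timestamp % GRANULARITIES[granularity])
    return counters[(key, granularity, bucket)]

record("t.co/abc", 1000)
record("t.co/abc", 1010)   # same minute bucket as the first event
record("t.co/abc", 4000)   # a later minute and hour
assert read("t.co/abc", "minute", 1015) == 2
assert counters[("t.co/abc", "all", 0)] == 3
```

Paying the write amplification up front (one increment per granularity) is what makes "count of X in the last hour" a single read instead of a scan, which is the point of doing this on top of fast distributed counters.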
Original title and link: Rainbird: Twitter’s ZooKeeper + Cassandra Based Realtime Analytics Solution (NoSQL databases © myNoSQL)
There was one NoSQL conference I missed, and I was really pissed off about it: Hadoop World. Even though I followed and curated the Twitter feed, resulting in Hadoop World in tweets, the feeling of not being there made me really sad. But now, thanks to Cloudera, I’ll be able to watch most of the presentations. Many of them have already been published and the complete list can be found ☞ here.
Based on the Twitter activity that day, I’ve selected below the ones that seem to have generated the most buzz. The list contains names like Facebook, Twitter, eBay, Yahoo!, StumbleUpon, comScore, Mozilla, AOL. And there are quite a few more …