NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



twitter: All content tagged as twitter in NoSQL databases and polyglot persistence

Redis: Adding a 128-Bit K-Ordered Unique Id Generator

Interesting pull request submitted to Redis by Pierre-Yves Ritschard:

This is similar to what snowflake and the recent boundary solution do, but it makes sense to use redis for that type of use cases for people wanting a simple way to get incremental ids in distributed systems without an additional daemon requirement.

Snoflake is the network service for generating unique ID numbers at high scale used (and open sourced) by Twitter. Flake is an Erlang decentralized, k-ordered id generation service open sourced by Boundary.

Original title and link: Redis: Adding a 128-Bit K-Ordered Unique Id Generator (NoSQL database©myNoSQL)

Interesting Data Sets and Tools: Monthly Twitter Activity for All Members of the U.S. Congress

Drew Conway:

Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API. […] But be forewarned, working with this system and CouchDB requires a mature understanding of several tools and languages; including but not restricted to; curl, map/reduce, Javascript, and JSON. And that’s before you have even done any analysis.

Original title and link: Interesting Data Sets and Tools: Monthly Twitter Activity for All Members of the U.S. Congress (NoSQL database©myNoSQL)


MySQL at Twitter: Storing 250mil Tweets Daily

Todd Hoff took the time to disect and extract in a post the interesting bits from Jeremy Cole’s talk[1]Big and Small Data at @Twitter from the O’Reilly MySQL conference:

  • MySQL works well enough most of the time that it’s worth using. Twitter values stability over features so they’ve stayed with older releases.
  • MySQL doesn’t work for ID generation and graph storage.
  • MySQL is used for smaller datasets of < 1.5TB, which is the size of their RAID array, and as a backing store for larger datasets.
  • Typical database server config: HP DL380, 72GB RAM, 24 disk RAID10. Good balance of memory and disk.

In my summary of the talk I’ve noted:

  • Use MySQL when it works, something else when not - fortunately MySQL often does work
  • MySQL is used by Twitter because it’s robust, replication works and it’s easy to use and run
  • MySQL doesn’t work good for graphs, auto_increment, replication lag is a problem
  • MySQL replication improvements like crash safe multi-threaded slave is what they need

But Twitter is also one of the most prominent use cases of polyglot persistence.While MySQL is an important piece of the Twitter architecture, it is not the only storage or data processing engine.

The following other data solutions get mentioned in Jeremy’s talk:

  • Cassandra is used for high velocity writes, and lower velocity reads. The advantage is Cassandra can run on cheaper hardware than MySQL, it can expand easier, and they like schemaless design.
  • Hadoop is used to process unstructured and large datasets, hundreds of billions of rows.
  • Vertica is being used for analytics and large aggregations and joins so they don’t have to write MapReduce jobs. 

Yet that’s not the whole story. Twitter is using Cassandra and Memcached for real-time URL fetchers and they also experimented with using Gizzard for Redis. After buying BackType, Twitter got and then open sourced Storm, a Hadoop-like real-time data processing tool. And who knows what’s in the Twitter labs right now.

I’m embedding below Jeremy Cole’s “Big and Small Data at @Twitter”:

Twitter's Real-Time URL Fetcher Using Cassandra and Memcached

Twitter’s real-time URL fetcher, code named SpiderDuck, is an excellent example of how NoSQL databases fit in the architecture of today’s systems:

Metadata Store: This is a Cassandra-based distributed hash table that stores page metadata and resolution information keyed by URL, as well as fetch status for every URL recently encountered by the system. This store serves clients across Twitter that need real-time access to URL metadata.

SpiderDuck is also using memcached:

Memcached: This is a distributed cache used by the fetchers to temporarily store robots.txt files.

SpiderDuck Architecture Cassandra Memcached

Original title and link: Twitter’s Real-Time URL Fetcher Using Cassandra and Memcached (NoSQL database©myNoSQL)


Twitter Open Sourcing Storm at Strange Loop

Ask and you’ll be answered. Nathan Marz announces that Twitter will open source Storm, the Hadoop-like real-time data processing tool developed at BackType:

I’m pleased to announce that I will be releasing Storm at Strange Loop on September 19th!

Here’s a recap of the three broad use cases for Storm:

  • Stream processing: Storm can be used to process a stream of new data and update databases in realtime. Unlike the standard approach of doing stream processing with a network of queues and workers, Storm is fault-tolerant and scalable.
  • Continuous computation: Storm can do a continuous query and stream the results to clients in realtime. An example is streaming trending topics on Twitter into browsers. The browsers will have a realtime view on what the trending topics are as they happen.
  • Distributed RPC: Storm can be used to parallelize an intense query on the fly. The idea is that your Storm topology is a distributed function that waits for invocation messages. When it receives an invocation, it computes the query and sends back the results. Examples of Distributed RPC are parallelizing search queries or doing set operations on large numbers of large sets.

Original title and link: Twitter Open Sourcing Storm at Strange Loop (NoSQL database©myNoSQL)


ElephantDB and Storm Join the Twitter Flock

That’s to say BackType, creators of Cascalog, ElephantDB, and Storm , has been acquired by Twitter (which in case you didn’t know names most of their open source libraries and storage solutions using bird names).

The announcement is here . Looking forward to seeing Storm open sourced.

Original title and link: ElephantDB and Storm Join the Twitter Flock (NoSQL database©myNoSQL)

Big and Small Data at Twitter: MySQL CE 2011

Twitter DBA Lead at Twitter, Jeremy Cole‘s talk about MySQL at Twitter from MySQL CE 2011:

Roland Bouman had some interesting notes (nb: actually tweets) from the talk:

  • 115 mln tweets a day, 1 bln tweets a week, about 50.000 new accounts / day

  • random server uptime 212d, 127 bln questions (6943/s) rows read: 1.36 mln/s

  • Use MySQL when it works, something else when not - fortunately MySQL often does work

  • MySQL is used by twitter because it’s robust, replication works and it’s easy to use and run

  • MySQL doesn’t work good for graphs, auto_increment, replication lag is a problem

  • MySQL replication improvements like crash safe multi-threaded slave exactly what they need

  • Twitter open sourced snowflake (id generation system) and Gizzard distributed data storage

  • Use soft launches: new code is launched in a disabled state, turn up slowly, back down if needed

  • Gizzard builds in MySQL/InnoDB handles sharding, replication, job scheduling

  • Twitter uses Cassandra too for some projects. high velocity writes, schemaless design

  • Twitter uses Hadoop for analyzing extremely large datasets: 10 to 100 blns rows (http logs)

  • Twitter also uses vertica for analysis, 100M - 10Blns of rows. Runs 100x faster than MySQL

  • MySQL’s happy place: <= 1.5 TB datasets, archive store for larger sets.

Original title and link: Big and Small Data at Twitter: MySQL CE 2011 (NoSQL databases © myNoSQL)

Hadoop and NoSQL Databases at Twitter

Three presentations covering the various NoSQL usages at Twitter:

  1. Kevin Weil talking about data analysis using Scribe for logging, base analysis with Pig/Hadoop, and specialized data analysis with HBase, Cassandra, and FlockDB on InfoQ

  2. Ryan King’s presentation from last year’s QCon SF NoSQL track on Gizzard, Cassandra, Hadoop, and Redis on InfoQ

  3. Dmitriy Ryaboy on Hadoop from Devoxx 2010:

By looking at the powered by NoSQL page and my records, Twitter seems to be the largest adopter of NoSQL solutions. Here is an updated version of who is using Cassandra and HBase

  • Twitter: Cassandra, HBase, Hadoop, Scribe, FlockDB, Redis
  • Facebook: Cassandra, HBase, Hadoop, Scribe, Hive
  • Netflix: Amazon SimpleDB, Cassandra
  • Digg: Cassandra
  • SimpleGeo: Cassandra
  • StumbleUpon: HBase, OpenTSDB
  • Yahoo!: Hadoop, HBase, PNUTS
  • Rackspace: Cassandra

And probably many more missing from the list. But that could change if you leave a comment.

Original title and link: Hadoop and NoSQL Databases at Twitter (NoSQL databases © myNoSQL)

Rewriting the Redis Twitter Clone

The Redis Twitter clone app is showing its age:

I’m looking at the Twitter Clone and noticed a N + 1 -like “get” in the code […] The above code seems rather suboptimal, if my understanding is correct.

At least three better approaches have been suggested, so who is up for experimenting with Redis and rewriting this app to use latest Redis features?

  1. use pipelining to get all the posts in one server roundtrip (won’t change the code much and be much faster)
  2. use SORTGET semantics to get all the post data at once from the list of ids (should be somewhat faster than 1)
  3. Use MGET to get all the post data at once.

Original title and link: Rewriting the Redis Twitter Clone (NoSQL databases © myNoSQL)


Rainbird: Twitter’s ZooKeeper + Cassandra Based Realtime Analytics Solution

Kevin Weil[1] presented Twitter’s ZooKeeper and Cassandra based solution for realtime analytics named Rainbird at Strata 2011:

Until recently, counters where a unique feature of HBase. While the latest version of Cassandra does not include distributed counters, this feature is available in Cassandra’s trunk.

  1. Kevin Weil: Product Lead for Revenue, Twitter, @kevinweil  

Original title and link: Rainbird: Twitter’s ZooKeeper + Cassandra Based Realtime Analytics Solution (NoSQL databases © myNoSQL)

CouchDB Usecase: Decentralizing Twitter

J.Chris Anderson in an interview over ReadWriteWeb:

Klint Finley: Let’s start at the top: what exactly is Twebz? It’s described as a “decentralized Twitter client.” What exactly does that mean?

J Chris Anderson: The aim is to allow you to interact with Twitter when Twitter is up and you are online. But if Twitter is down for maintenance or you are in the middle of nowhere, you can still tweet. And when you can reach Twitter again, it will go through.

If lots of folks are using it, then they can see each other’s tweets come in even when Twitter is down.

Mostly the goal was to show the way on how to integrate CouchDB with web services and APIs.

A classical example of CouchDB powerful P2P replication capabilities. Dave Winer would probably be its ☞ biggest fan.

Original title and link: CouchDB Usecase: Decentralizing Twitter (NoSQL databases © myNoSQL)


Videos from Hadoop World

There was one NoSQL conference that I’ve missed and I was really pissed off: Hadoop World. Even if I’ve followed and curated the Twitter feed, resulting in Hadoop World in tweets, the feeling of not being there made me really sad. But now, thanks to Cloudera I’ll be able to watch most of the presentations. Many of them have already been published and the complete list can be found ☞ here.

Based on the twitter activity on that day, I’ve selected below the ones that seemed to have generated most buzz. The list contains names like Facebook, Twitter, eBay, Yahoo!, StumbleUpon, comScore, Mozilla, AOL. And there are quite a few more …