Twitter: All content tagged as Twitter in NoSQL databases and polyglot persistence

Building TweetReach with Sinatra, Tokyo Cabinet and Grackle

I’m starting to lose count of the NoSQL-powered Twitter apps I’ve mentioned on the NoSQL blog (fortunately the consistent tagging helps, so you can find them all under the Twitter tag), but every time I find a new one I feel like posting about it.

This time it is a presentation about building a Twitter utility using Tokyo Cabinet and ☞ Sinatra (a Ruby web framework).

The author concludes with some Tokyo Cabinet lessons learned:

  • Lack of auto-expiration when using as mostly a key-value cache is annoying

  • Would definitely use it again for this type of task

I think it is interesting to note that, of the key-value stores covered here, only Redis comes with support for key expiration.
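
For a store without auto-expiration, the missing behaviour can be approximated in application code. The sketch below is only an illustration, not what the TweetReach author did: `ExpiringStore` and its methods are hypothetical names, and a plain dict stands in for the real key-value backend. It keeps a deadline next to each value, drops stale entries lazily on read, and offers a purge step that could run periodically.

```python
import time


class ExpiringStore:
    """Wraps a plain dict-like key-value store and adds per-key expiry."""

    def __init__(self, backend=None, clock=time.monotonic):
        # `backend` stands in for the real store; a dict keeps the sketch runnable.
        self._backend = {} if backend is None else backend
        self._clock = clock

    def set(self, key, value, ttl_seconds):
        # Keep the deadline next to the value, since the backend cannot expire keys.
        self._backend[key] = (self._clock() + ttl_seconds, value)

    def get(self, key, default=None):
        entry = self._backend.get(key)
        if entry is None:
            return default
        deadline, value = entry
        if self._clock() >= deadline:
            # Lazy expiration: a stale entry is removed on the first read after its deadline.
            del self._backend[key]
            return default
        return value

    def purge_expired(self):
        """Removes every expired entry; meant to be called periodically."""
        now = self._clock()
        expired = [k for k, (deadline, _) in self._backend.items() if now >= deadline]
        for key in expired:
            del self._backend[key]
        return len(expired)
```

With Redis none of this bookkeeping is needed, since the server handles it through commands such as EXPIRE and SETEX.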

Getting started with Redis, Python and YQL

A quick intro to Redis by Khashayar, showing why he loves Redis, how to install it and perform basic operations against it, and how to build an RSS-to-Twitter tool with Python, YQL and Redis:

In this code we first use YQL to get the RSS. Then we parse the RSS to get our desired field […]. After that we save these values to our database […]
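
The parse-then-save part of that flow could look roughly like the sketch below. It is not Khashayar's code: the function names, the `rss:item:` key prefix and the sample feed are made up for illustration, the YQL fetch is left out so the example runs offline, and a dict stands in for Redis.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; in the original tool this text comes from a YQL query.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Returns (link, title) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("link"), item.findtext("title"))
        for item in root.iter("item")
    ]


def save_new_items(store, items):
    """Saves unseen items keyed by link and returns the titles that were new."""
    new_titles = []
    for link, title in items:
        key = "rss:item:" + link
        if key not in store:
            store[key] = title
            new_titles.append(title)
    return new_titles
```

Only the titles returned by `save_new_items` would need to be tweeted, so running the tool twice on the same feed posts nothing the second time. With a real Redis client the membership check and the assignment would become an existence check and a set call on the same key.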


nosql:eu - Second day

The 2nd day at nosql:eu is over. It is time to review the great twits from the 1st day, the slides from the 1st day presentations and the great twits below.

For those of us who haven’t made it to the ☞ nosql:eu conference, I’ve extracted below some (hopefully most) of the most interesting twits from the conference. I’ll also be posting the slides of the presentations as they become available.

nosql:eu quotes

Check also the best twits from 1st day @ nosql:eu

  • kevinweil: Modifying my talk in realtime for #nosqleu. Adding Cassandra, HBase, FlockDB to already existing discussion of Scribe, Hadoop, Pig.
  • emileifrem: Cassandra is simply the best in its category. Check out @spyced’s latest deck: #nosql #nosqleu
  • maslett: RT @natishalom: @maslett with the planned support for memcache - gigaspaces turns memcache to a real NoSQL alternative IMO #nosqleu

    Note: personally I’d find that quite confusing. If Gigaspaces is no longer an elastic cache, then what is it?

  • danharvey: #nosqleu question for today: how do you backup casandra / HBase for user/dev errors? The failure back up is built in.
  • Werner: Arrived at #nosqleu for the first presentation of the day.
  • awhitehouse: @werner We should challenge assumptions that DB partitioning papers make; sometimes smallest possibilities are treated as reality. #nosqleu
  • AndySeaborn: #nosqleu ☞
  • awhitehouse: @werner: We should all read “The 1995 SQL Reunion: People, Project, and Politics” ☞ #nosqleu
  • AndySeaborne: #nosqleu ☞
  • tlossen: “in real systems, there are no corners to cut” — werner vogels about the importance of occam’s razor in systems design #nosqleu
  • hungryblank: RT @tlossen: “in real systems, there are no corners to cut” — werner vogels about the importance of occam’s razor in systems design #nosqleu
  • matwall: @Werner #nosqleu nosql is about choice, not a fight between SQL and new tech.
  • tlossen: “you should all read the multics book” — werner vogels #nosqleu
  • martinbtt: #nosqleu @Werner “on the birth of dynamo”
  • tlossen: “real systems are pretty nasty things” — werner vogels #nosqleu
  • tlossen: “scaling amazon was all about the database, every year scaling out, scaling out ….” — werner vogels #nosqleu
  • tlossen: “scalability, availability, performance, cost-effectiveness are all in the end dominated by data management” — werner vogels #nosqleu
  • martinbtt: “The Amazon homepage is constructed by 200-300 different web services”. #Werner #soa #nosqleu
  • maslett: Amazon CTO @werner: “It all comes down to data management… that’s where the scalability is… that’s where most of the costs are” #nosqleu
  • tlossen: “i HATE eventual consistenty” — werner vogels #nosqleu
  • maslett: .@werner: “What we all want is strongly consistent systems - this eventual consistency stuff is a compromise.” #nosqleu
  • tlossen: “your customers will ALWAYS use your system in a way you did not expect” — werner vogels #nosqleu #dynamo
  • mfiguiere: #nosqleu Werner Vogels: “Customer put something in the shopping cart, they are about to give you money, that should ALWAYS works !”
  • monkchips: “In 2004 we felt we could no longer rely on commercial [relational] systems to operate at Amazon scale”. @werner vogels Amazon CTO, #nosqleu
  • buzzkills: “there were no comercial systems that could support amaon’s scale” [for many of their use cases] @Werner #nosqleu
  • tlossen: “at scale, ALL of this shit happens” — werner vogels on datacenter SNAFUs like flooding from the roof down etc. #nosqleu
  • tlossen: “scaling amazon = upgrading cessna to 747 in mid-flight” — werner vogels #nosqleu
  • tlossen: “object storage is FOREVER” — werner vogels on data outliving software #nosqleu
  • tlossen: “don’t forget, hardward LIES to you!” — werner vogels #nosqleu
  • awhitehouse: @werner: “Economies of scale are mostly about people” (and the knowledge they need to run your system) #nosqleu
  • tlossen: “we really have to dive deep and understand all the problems from top to bottom” - werner vogels on INTELLECTUAL economies of scale #nosqleu
  • beobal: “economies of scale are not just about technologies, it has a lot to do with people” @werner #nosqleu
  • tlossen: “transparency is EVIL” — werner vogels about NFS etc. #nosqleu
  • tlossen: “remember that storage is a very long-lasting relationship” — werner vogels #nosqleu
  • maslett: .@werner: “We shouldn’t all be doing this.” #nosqleu Companies should be focused on their business, not their databases.
  • simonw: Werner Vogels: “S3 is a better key/value store than Dynamo” (due to list/prefix operators) #nosqleu
  • tlossen: “if you keep your system simple, it drives simplicity at the customer side as well” — werner vogels on importance of occam’s razor #nosqleu
  • awhitehouse: “Simplicity needs to happen at the interface” … the API to your system drives the architecture @werner at #nosqleu
  • seanparsons : @Werner’s talk at #nosqleu was illuminating about the focus on managing interaction between systems.
  • CooperDino : WernerVogels at #NoSQLeu: When u do trillions of ops per day even the slightest probability becomes reality
  • martinbtt : Fantastic talk by @Werner at #nosqleu - loads of useful tech nuggets to take away. Great start to the day so far.
  • CooperDino : WernerVogels at #NoSQLeu: Bruce Lindsay & Jim Gray are our heroes, we should all read about their data sys work in the 70s
  • CooperDino : @Werner at #NoSQLeu: Last time Amazon was down was 2004 & it was related to an RDB crashing
  • CooperDino : @Werner at #NoSQLeu: 70% of storage operations in Amazon are key/value
  • CooperDino : WernerVogels at #NoSQLeu: If u have2 jump thru lots of hoops 2use any DB then it prob wrong choice. #JOOB is fresh choice4 #dotNet
  • CooperDino : @Werner at #NoSQLeu: Customers will not look at a DB in isolation, they will always look at where it sits in big picture
  • matwall : Head buzzing from inspiring talk from @werner at #nosqleu
  • monkchips : now @kevinweil (twitter’s analytics lead) presents via skype video… just showed us some very dark twitter offices ;-) #nosqleu
  • awhitehouse : Big hand to @kevinweil for giving his talk from Twitter HQ at 3am local time. #nosqleu
  • matwall : @kevinweil say twitter increase userbase by 300K per day, generate 7Tb of data *per day* #nosqleu
  • buzzkills : Twitter gave up on syslog because it didn’t scale #nosqleu
  • thobe : This is me contributing to the 300GB of twitter data generated while @kevinweil talk about it on #nosqleu
  • tlossen : “you write log lines — scribe does the rest” — kevin weill about logging at scale #nosqleu
  • buzzkills : @buzzkills apparently faceyb wrote scribe, Twitter are big contris #nosqleu (thx to @ianmeyers for correction)
  • matwall : @kevinweil from Twitter describing their Scribe -> Hadoop -> Pig pipeline for data alanysis at #nosqleu Very interesting, I want one.
  • tlossen : “want less java in your life? use pig!” — kevin weill, giving advice on hadoop #nosqleu
  • matwall : @kevinweill on datamining user data: It’s easy to answer questions, it’s hard to ask the right questions. #nosqleu
  • wwwicked : Loving the simplicity of a Pig script versus the equivalent Hadoop/Java code #nosqleu
  • beobal : “value the system that promotes innovation, iteration” @kevinweil #nosqleu
  • monkchips : facebook’s scribe at master - GitHub ☞ a logging system for client performance data, also used by twitter. #nosqleu
  • awhitehouse : @kevinweil: Twitter does most of its data analysis in Pig - scripts can call user-defined functions coded in Java (v. powerful) #nosqleu
  • matwall : Twitter using Apache Mahout coupled with Pig for machine learning when examining user behaviour #nosqleu
  • andrewgarner : Totally sold on Pig #nosqleu
  • wwwicked : A friend of mine said “NoSQL is retarded”. The more I’ve heard over the past 2 days, more more I realise he’s wildly wrong #nosqleu
  • emileifrem : @wwwicked Term is retarded. Notion all RDBMSes will be replaced is retarded. That we’re heading to a polyglot persistence era isnt. #nosqleu
  • monkchips : “we’re trying to move all tweets to Cassandra”. @kevinweil Twitter #nosqleu

    Note: You can read the whole story in the myNoSQL exclusive Cassandra @ Twitter: An Interview with Ryan King

  • tlossen : “better eventual consistency than POTENTIAL consistency” — kevin weil on reasons to use cassandra at twitter #nosqleu
  • maslett : Twitter is working with Digg to create real-time analytics for Cassandra. Plans to open source. #nosqleu
  • msk_y : RT @buzzkills: Twitter store their log files in Lzo compressed, protocol buffers format on hdfs #nosqleu
  • kingsleydavies : #nosqleu CouchDB used at BBC - typically used as a KVS and is used in iPlayer and parts of the homepage…
  • tlossen : “you can throw rocks and stones at it, and it just keeps going” — enda farrell (bbc) about robustness of couchdb #nosqleu
  • matwall : @endafarrell CouchDb restarts in < 1sec. Occasionally restart in production as restarts are far less than TCP timeout! #nosqleu
  • tlossen : enda farrell shared a neat idea: “pre-sharding” — running 4 instances of couchdb on every node [couchdb @ bbc talk] #nosqleu
  • matwall : @endafarrell “Having things that just work and are simple from the users perspective is brilliant” #nosqleu
  • CooperDino : #NoSQLeu: BBC web site handles 200m requests per day on 1.5TB of data using 8 servers & #CouchDB
  • monkchips : exciting! presentation at #nosqleu from Comcast chief engineer @jon_moore : Why Big Enterprises Are Interested in NoSQL
  • matwall : Agree with @jon_moore at #nosqleu : storage is a means to a business end, nosql contains intrinsic risk
  • benoitc : idealized api of comcast looks like the #couchdb one get,post, get _views #nosqleu
  • matwall : @jon_moore at #nosqleu Can I add more capacity without adding too many more sysadmins? Can my admins work 9-5?
  • matwall : @jon_moore at #nosqleu Is there a company behind product to provide operational support? Important for commoditization
  • monkchips : surprising requirement of the #nosqleu conference. NoSQL providers take note: Enterprises expect JMX support. java ain’t dead. devops?
  • wwwicked : #nosqleu @jon_moore made a fair point re: my comment about analytics on KV stores; may not be best idea but “they” will want to do it anyway
  • timanglade : Totally awesome break-down of the CAP theorem (in the context of Multiple Datacenters) by the amazing @jon_moore. Refreshingly enlightening.
  • kingsleydavies : loving the name *Tokyo Tyrant* and a great, upbeat start to @makoto_inoue preso… #nosqleu
  • matwall : @makato_inoue Says that @al3xandu’s site myNoSL is like “Hello magazine for nosql” :) #nosqleu
  • matwall : Can we have a 3 hour workshop with @makato_inoue please? He’s great! #nosqleu
  • kingsleydavies : +1 yeah… I fear we wont have enough time :-( RT @matwall: Can we have a 3 hour workshop with @makato_inoue please? He’s great! #nosqleu
  • maslett : great presentation on the highly random world of Tokyo Cabinet/Tyrant by @makoto_inoue #nosqleu
  • maslett : Quote of the day: “myNoSQL is the Hello magazine of NoSQL" #nosqleu
  • michaeltiberg : #nosqleu conference is to an end - attendees seems to be satisfied and that makes my day

nosql:eu presentations

Check also the nosql:eu presentations from 1st day

On the Birth of Dynamo - Werner Vogels

Nothing here yet :-(.

Twitter’s use of Cassandra, Pig and HBase - Kevin Weil

Slides from Kevin Weil (@kevinweil) presentation on Twitter’s use of Cassandra, Pig and HBase

CouchDB at the BBC - Enda Farrell

Nothing here yet :(

Why Big Enterprises are Interested in NoSQL - Jon Moore

Slides from Jon Moore (@jon_moore) presentation: Why Big Enterprises are Interested in NoSQL

Memory as the New Disk: Why Redis Rocks - Tim Lossen

Slides from Tim Lossen (@tlossen): Memory as the New Disk: Why Redis Rocks

Tokyo Cabinet, Tokyo Tyrant and Kyoto Cabinet - Makoto Inoue

Nothing here yet :(

Notes from the field: NoSQL tools in Production - Matthew Ford

Slides from Matthew Ford (@matthewcford) Notes from the field: NoSQL tools in Production presentation

nosql:eu - First day

In case you’ve missed the ☞ nosql:eu conference, you can find below the most interesting nosql:eu twits and some of the nosql:eu presentations. And make sure to check nosql:eu second day.

nosql:eu quotes

Check also the best twits from nosql:eu 2nd day

  • emileifrem: Very envious of the crowd that’s just about to roll into the @nosqleu venue. VolcaNoSQL has been rough, but it will still be a kickass show.
  • al3xandru: Make your twits through the ash cloud #nosqleu so those stopped by the #VolcaNoSQL can hear something too
  • nosqleu: All the equipment is set, the remote presentation setup has been tested and attendees are already pouring in. Let’s go — wish us luck!
  • jystewart: wow! @werner managed to get on an iceland-glasgow flight and so will make it to #nosqleu against all the odds
  • wwwicked: used Redis for analysis of leaked BNP membership list - mapping postcodes to constituencies #nosqleu
  • wwwicked: Guardian used a traditional RDBMS for first MPs expenses crowd-sourced review app; “possibly the worst possible implementation” #nosqleu

    Note: we’ve published a post on this topic: Redis Usecase: replacing MySQL order by rand()

  • monkchips : Guardian Zeitgeist. “We use Big Table as a dumping ground for data you can sort by 1 or 2 columns when you need to” @simonw @nosqleu #nosql
  • zacksm: “spreadsheets are NoSQL too…” #nosqleu
  • kingsleydavies: use of #AppEngine and #BigTable w/ scatter/gather and shards at the guardian + GOOG sprdshts for rapid prototyping and releases.. #nosqleu
  • monkchips : “there is a form of NoSQL we have been using for years - spreadsheets”. says @simonw. #pragmatism #NoSQL #NoSQLeu
  • klbostee: The Guardian’s web architecture as shown at #nosqleu
  • matshenricson: Total total kick-ass presentation by two guys at ! The future of journalism, from a developer point of view. #nosqleu
  • monkchips : one of the best tech talks i have ever seen. @simonw and @matwal explained exactly how NoSQL supports the Guardian’s *business*. #nosqleu
  • philb0: You need to understand your DB query patterns before choosing a technology #nosqleu

    Note: the last time I wrote about it was yesterday, in the considering data stores post.

  • awhitehouse: Use the right tool for the job » @timanglade at #nosqleu: “Let’s not hack a RDBMS to do a graph, let’s use a real graph database”
  • wwwicked: NoSQL sucks! It’s true… if you use it badly; e.g. don’t try analytics on a key-value store. #nosqleu
  • kingsleydavies: flagrant promotion ;-) #nosqleu 1 stop shop for NoSQL links and info…
  • ianmeyers : #nosqleu Wondering how many people have approached NOSQL from the perspective of NOT being anti-SQL?
  • mfiguiere: #nosqleu Le NoSQL n’est pas une question de volume de données, mais de représentation de données

    (trans): #nosqleu NoSQL is not a matter of volume of data, but data representation

  • wwwicked: “I am terrified of the uber database that is yet to come” - @matwall urges us to think Unix not Windows re: task/use-specific DBs #nosqleu
  • coderholic: Great morning so far at #nosqleu including spreadsheet backed websites from the guardian!
  • kingsleydavies: #nosqleu actually, some of the benefits of a choppy preso transmission, is I heard *enough* to want to chase up independantly
  • kingsleydavies: Key featr’s of #Riak: links, map-reduce, vctr clocks directly impl; dynmo conflict res; distrbtd, auto-scales (on the fly), durable #nosqleu
  • awhitehouse: Distributed key stores would benefit from better management tools (e.g. SNMP-based) - @hobbyist’s team looking at this. #nosqleu
  • benoitc: @mongodb definition by comparing itself to @CouchDB … #nosqleu
  • kingsleydavies: #nosqleu document oriented datastores sounds pretty much like ‘the new’ object DB to me ?
  • benoitc: for those in #nosqleu who have questions about #couchdb dont hesitate to ask me or to @endafarrell
  • matwall: MongoDB looking fantastic. I want one now! How best to pick from couch or mongo? #nosqleu
  • ck1125: wishing i booked a place to #nosqleu
  • buzzkills: Mongodb supports geolocation queries #nosqleu
  • kjlloydie: MongoDB looks really useful. Would love to see performance metrics for large datasets. #nosqleu
  • matwall: Wow. MongoDb features geo-location also. Beautiful query syntax in the python examples. #nosqleu
  • coderholic: The geo features of mongodb look very cool #nosqleu
  • wwwicked: MongoDB has quite a sexy update syntax. Geolocation features are nice, too. #nosqleu
  • buzzkills: Mongodb makes it easy to send deltas to your documents. Increase values, push values into arrays etc #nosqleu

    Note: it’s quite interesting to see that, out of tons of features, people got excited about geo support, which was added only in the latest MongoDB release, 1.4.

  • wwwicked: Quite a hard sell by the MongoDB guy though. Far more so than the Riak guy #nosqleu
  • buzzkills: Mongodb prez is much more about what it does rather than how it does it, I think I would have prefered the latter. #nosqleu
  • jystewart: liking the look of mongodb’s geo features. not surprised @foursquare are using them #nosqleu
  • PaulDJohnston: #nosqleu I’m getting confused about when to use specific types of database… don’t just say “use mine” but “use mine for *this*”
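  • mfiguiere: #nosqleu La plus large base MongoDB installée en production : 12 To sur une seule instance !

    (trans): #nosqleu The largest MongoDB database installed in production: 12 TB on a single instance!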

    Note: never heard of this MongoDB size before, so I’m wondering if the project is secret!

  • awhitehouse: Jonathan Ellis aka @spyced has just started a company called Riptano based around Cassandra #nosqleu

    Breaking: Riptano - First company focused on Cassandra started by Cassandra project chair, Jonathan Ellis

  • matwall: I wonder how cassandra knows which machines are in each rack? #nosqleu
  • mfiguiere: #nosqleu if your software wakes people up at 4 am to fix it, then you’re probably doing things wrong…
  • monkchips: Cassandra uses JMX? blimey. didn’t expect to hear that acronym today. “like most things in Java its quite clunky”. #cassandra #nosqleu
  • matwall: Feel like we’re watching the battle of the low end data structures at the mo. How does it help my business? #nosqleu
  • buzzkills: Good to hear digg are contributing a vector clock implementation to cassandra for the next version #nosqleu

    Note: Digg announced quite a few more goodies to be added to Cassandra. Some of them have already been included in the Cassandra 0.6.0 release

  • awhitehouse: @monkchips asked “what use cases does Cassandra cover?” - @spyced ref’d to this talk [“RDBMS’s don’t scale”] #nosqleu

    Note: Link to What every developer should know about database scalability presentation.

  • monkchips: one of the most useful insights about “why NoSQL?” so far today. “data is more and more semi-structured” #nosqleu #neo4j
  • wwwicked: Oh! Graph databases are nothing like what I pictured. No pun intended. #nosqleu
  • matwall: Good examples of possible usages of Neo4j explained well. #nosqleu
  • kingsleydavies: #nosqleu watching @thobe present on #neo4j graph DB. can already think og at least 1 use case to trial this.. great to see use cases in pres
  • kevinweil: Looking forward to giving my #nosqleu talk remotely from TwitterHQ at 3am (11am London time) this morning
  • maslett: #nosqleu phrase of the day: choose the best solution/tool/storage model for the job. There might be something in “Not Only SQL” after all
  • NeilRobbins: For me the best talks of the day were the Cassandra & Riak talks, though thanks to the better quality audio the Cassandra talk wins #nosqleu
  • tom_wilkie: #nosqleu good day. Guardian talk the best IMHO

nosql:eu presentations

Check also the nosql:eu presentations from 2nd day

NoSql at - Matthew Wall & Simon Willison

Slides from Matthew Wall (@matwall) & Simon Willison (@simonw) on NoSQL usage at

An Overview of NoSQL - Tim Anglade

Slides from Tim Anglade (@timanglade) An Overview of NoSQL

Key-value stores and Riak - Bryan Fink

Nothing here yet :(.

Document-oriented databases and MongoDB - Mathias Stearn

Slides from Mathias Stearn (remote) presentation on document-oriented databases and MongoDB

Column-oriented databases and Cassandra - Jonathan Ellis

Slides from Jonathan Ellis (@spyced) presentation on column-oriented databases and Cassandra

Graph databases and Neo4j - Tobias Ivarsson

Slides from Tobias Ivarsson (@thobe) presentation on graph databases and Neo4j

Redis-powered Facebook-like newsfeeds

As we’ve learned over time, there are only two ways to keep your service usable: either make it fast for every access or do the work upfront. Each of these comes with its own limitations and costs. For the proposed solution, which uses the precomputing approach, these are well explained in the linked article:

In any real-life architecture, there are trade-offs made between speed, storage type and memory usage. In this case, memory and disk space is being compromised for the sake of speed of access. Retrieving any type of newsfeed is virtually free, which means that page loading speed will not be affected. Writes are still very fast, as there is no disk or SQL access involved. Although memory usage is relatively high, the minimal amount of data which is stored, and the use of MD5 keys to avoid unneccesary data replication, help to keep it within reason. Additionally, appropriate use of Redis’ automatic key expiry settings, and a regular cronjob pruning of old lists, will help even further.

While nothing written in the article is incorrect per se, there are some additional aspects of complexity which are not covered:

  • write explosion: when using the precomputing approach with denormalized data, there are many situations in which a single core write triggers many additional write operations. A very simple example is a user with tons of followers. However fast each individual write is, at some point performing all of them synchronously will become noticeable to the user, and that’s something you’ll want to avoid.
  • data explosion, or data not fitting on a single machine: at that point, for each write you’ll also need to determine the different locations the data has to be pushed to. Basically, you’ll have to route your data to various locations in your cluster. Also, without a very good data partitioning strategy (I’m not referring here only to core data, i.e. the newsfeed, but also to the additional info that is usually presented along with it), data retrieval might require multiple roundtrips to various machines, and that will contribute to the complexity of your system.
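
To make the write explosion concrete, here is a small in-memory sketch of the precompute (fan-out on write) approach. It is not the linked article's implementation: `NewsfeedStore`, `FEED_LENGTH` and the queue-based worker are assumptions for illustration, and dicts and deques stand in for Redis keys and lists. Each event is stored once under an MD5 key, the user-facing write only queues the fan-out, and a separate worker step performs one feed write per follower.

```python
import hashlib
from collections import defaultdict, deque

FEED_LENGTH = 100  # assumed cap on the number of entries kept per feed


class NewsfeedStore:
    def __init__(self):
        self.events = {}                   # md5 key -> event payload, stored once
        self.followers = defaultdict(set)  # user -> users following them
        self.feeds = defaultdict(lambda: deque(maxlen=FEED_LENGTH))
        self.pending = deque()             # fan-out jobs waiting for a worker

    def follow(self, follower, followee):
        self.followers[followee].add(follower)

    def publish(self, author, text):
        """The single core write: store the event once and queue the fan-out."""
        payload = f"{author}:{text}"
        key = hashlib.md5(payload.encode("utf-8")).hexdigest()
        self.events[key] = payload
        self.pending.append((author, key))
        return key

    def run_fanout(self):
        """Worker step: each queued event becomes one feed write per follower."""
        writes = 0
        while self.pending:
            author, key = self.pending.popleft()
            for follower in self.followers[author]:
                self.feeds[follower].appendleft(key)
                writes += 1
        return writes

    def read_feed(self, user):
        # Reads do no computation beyond resolving the stored keys, newest first.
        return [self.events[key] for key in self.feeds[user]]
```

A user with 1,000 followers turns one `publish` call into 1,000 feed writes inside `run_fanout`, which is why that step is kept off the request path here. Once the feeds no longer fit on one machine, each of those writes would additionally need to be routed to the node holding the follower's feed.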

Bottom line: this is a common problem for services like Facebook, Twitter and many others. As far as I know, Twitter is using the precompute strategy, while Facebook is using both the precompute and the make-it-fast-for-every-access strategies, relying heavily on parallelization and various optimizations[1].


  • [1] Disclaimer: I haven’t worked for any of these and my comment is based on the architecture talks I’ve been watching about the two services.


Redis and Twitter filters in Python or Ruby

Mirko Froehlich has a ☞ long post explaining the problem and the rationale behind the chosen architectures. Then he goes on to present the various pieces used in building the solution.

Code is available on ☞ GitHub.

Bulkan Evcimen took this sample application and built it on a Python stack.

So now you have yet another “good” reason[1] to play with Redis and Twitter.


Cassandra @ Twitter: An Interview with Ryan King

There have been confirmed rumors[1] about Twitter planning to use Cassandra for a long time. But apart from the mentioned post, I couldn’t find any other references.

Twitter is fun by itself and we all know that NoSQL projects love Twitter. So imagine how excited I was when, after posting about the Cassandra 0.5.0 release, I received a short email from Ryan King, the lead of the Cassandra efforts at Twitter, simply saying that he would be glad to talk about these efforts.

So without further ado, here is the conversation I had with Ryan King (@rk) about Cassandra usage at Twitter:

MyNoSQL: Can you please start by stating the problem that led you to look into NoSQL?

Ryan King: We have a lot of data, the growth factor in that data is huge and the rate of growth is accelerating.

We have a system in place based on shared mysql + memcache but it’s quickly becoming prohibitively costly (in terms of manpower) to operate. We need a system that can grow in a more automated fashion and be highly available.

MyNoSQL: I imagine you’ve investigated many possible approaches, so what are the major solutions that you have considered?

Ryan King:

  • A more automated sharded mysql setup
  • Various databases: HBase, Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, HyperTable and probably some others I’m forgetting.

MyNoSQL: What kind of tests have you run to evaluate these systems?

Ryan King: We first evaluated them on their architectures by asking many questions along the lines of:

  • How will we add new machines?
  • Are there any single points of failure?
  • Do the writes scale as well?
  • How much administration will the system require?
  • If it’s open source, is there a healthy community?
  • How much time and effort would we have to expend to deploy and integrate it?
  • Does it use technology which we know we can work with?
  • … and so on.

Asking these questions narrowed down our choices dramatically. Everything but Cassandra was ruled out by those questions. Given that it seemed to be our best choice, we went about testing its functionality (“can we reasonably model our data in this system?”) and load testing.

The load testing mostly focused on the write-path. In the medium/long term we’d like to be able to run without a cache in front of Cassandra, but for now we have plenty of memcache capacity and experience with scaling traffic that way.

MyNoSQL: If you draw a line, what were the top reasons for going with Cassandra?

Ryan King:

  • No single points of failure
  • Highly scalable writes (we have highly variable write traffic)
  • A healthy and productive open source community

MyNoSQL: Will Cassandra completely replace the current solution?

Ryan King: Over time, yes. We’re currently moving our largest (and most painful to maintain) table — the statuses table, which contains all tweets and retweets. After this we’ll start putting some new projects on Cassandra and migrating other tables.

MyNoSQL: How do you plan to migrate existing data?

Ryan King: We have a nice system for dynamically controlling features on our site. We commonly use this to roll out new features incrementally across our user base. We use the same system for rolling out new infrastructure.

So to roll out the new data store we do this:

  1. Write code that can write to Cassandra in parallel to Mysql, but keep it disabled by the tool I mentioned above
  2. Slowly turn up the writes to Cassandra (we can do this by user groups “turn this feature on for employees only” or by percentages “turn this feature on for 1.2% of users”)
  3. Find a bug :)
  4. Turn the feature off
  5. Fix the bug and deploy
  6. GOTO #2
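
As an illustration only (this is not Twitter's tool), the loop above can be sketched with a flag that is enabled per user group or per percentage, plus a write path that mirrors writes to the new store when the flag is on. `RolloutFlag`, `save_status` and the dicts standing in for the MySQL and Cassandra clients are all hypothetical.

```python
import zlib


class RolloutFlag:
    """Decides per user whether a feature is on, by group or by percentage."""

    def __init__(self, percent=0.0, groups=()):
        self.percent = percent      # e.g. 1.2 means 1.2% of users
        self.groups = set(groups)   # e.g. {"employees"}

    def enabled_for(self, user_id, user_groups=()):
        if self.groups.intersection(user_groups):
            return True
        # A stable hash puts each user in one of 10,000 buckets, so a user who
        # is enabled at a low percentage stays enabled as it is turned up.
        bucket = zlib.crc32(str(user_id).encode("utf-8")) % 10000
        return bucket < self.percent * 100


def save_status(user_id, status, primary, secondary, flag, user_groups=()):
    """Always writes to the primary store; mirrors the write when the flag is on."""
    primary.setdefault(user_id, []).append(status)
    if flag.enabled_for(user_id, user_groups):
        try:
            secondary.setdefault(user_id, []).append(status)
        except Exception:
            # A failure in the store under test must not break the user-facing write.
            pass
```

Turning the feature off (step 4) is then just setting `percent` back to 0 and clearing `groups`, with no deploy needed. A real version would log mirrored-write failures rather than swallow them, since finding those bugs is the point of step 3.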

Eventually we get to a point where we’re doing 100% doubling of our writes and are comfortable that we’re going to stay there. Then we:

  1. Take a backup from the mysql databases
  2. Run an importer that imports the data to cassandra

    Some side notes here about importing. We were originally trying to use the BinaryMemtable[2] interface, but we actually found it to be too fast — it would saturate the backplane of our network. We’ve switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster.

  3. Once the data is imported we start turning on real read traffic to Cassandra (in parallel to the mysql traffic), again by user groups and percentages.

  4. Once we’re satisfied with the new system (we’re using the real production traffic with instrumentation in our application to QA the new datastore) we can start turning down traffic to the mysql databases.

A philosophical note here — our process for rolling out new major infrastructure can be summed up as “integrate first, then iterate”. We try to get new systems integrated into the application code base as early in their development as possible (but likely only activated for a small number of people). This allows us to iterate on many fronts in parallel: design, engineering, operations, etc.

MyNoSQL: Please include anything I’ve missed.

Ryan King: I can’t really think of anything else.

MyNoSQL: Thank you very much!


NoSQL Twitter Applications

These days everyone is building a Twitter-like or Twitter-related project using some NoSQL solution. I guess they can use Nati Shalom’s (Gigaspaces) great ☞ post on the common principles behind NoSQL alternatives as a ‘scientific’ explanation for these experiments (the post was inspired by his talk at QCon on building a scalable Twitter application; the presentation is embedded below).


Even if the project code is not available and I couldn’t get the mentioned online version to work, I’d say that the combination of Redis and HTML5 WebSockets makes it worth mentioning. And in case you cannot get it to work either, there is a screencast for it:


TStore is a Twitter search result backup tool built in Python and CouchDB. The source code is available on ☞ GitHub.


Retwis is a non-distributed Twitter clone built in PHP and using Redis. The source code and extended details about the implementation are available ☞ here.

According to this page, there is already a port of this solution to Ruby and Sinatra: ☞ Retwis-RB.

Update: Thanks to @koevert, the list now also includes a Java port of Retwis: ☞ twayis


Floxee is a commercial tweetstream search and tagging platform built using MongoDB. You can read a bit more about MongoDB usage ☞ here

I am pretty sure I haven’t found all Twitter-like/Twitter-related NoSQL apps out there, so please feel free to send me more. I’ll be happy to update the post.

And in case you are not interested in NoSQL Twitter applications, then you can check the MongoDB-based forum/message-boards apps.

Nati Shalom: Designing a Scalable Twitter - Patterns for Designing Scalable Real-Time Web Applications