This is the story of scaling Draw Something, told by its CTO Jason Pearlman.
The early days, a custom key-value service built on top of Amazon S3:
The original backend for Draw Something was designed as a simple key/value store with versioning. The service was built into our existing Ruby API (using the Merb framework and the Thin web server). Our initial idea was: why not use our existing API for all the stuff we'd done before, like users, signup/login, virtual currency, and inventory, and write some new key/value code for Draw Something? Since we design for scale, we initially chose Amazon S3 as our data store for all this key/value data. The idea was to sacrifice some latency in exchange for effectively unlimited scalability and storage.
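The post doesn't show the service's actual interface, but a versioned key/value store like the one described might look something like the sketch below. An in-memory `Hash` stands in for Amazon S3 so the example is self-contained; the class and method names are illustrative, not the real API.

```ruby
# Illustrative sketch of a key/value store with versioning, where each
# put appends a new version of the value under the same key. In the real
# service the versions would live in Amazon S3; a Hash stands in here.
class KVStore
  def initialize
    @data = Hash.new { |h, k| h[k] = [] }  # key => ordered list of versions
  end

  # Store a new version of the value and return its version number (1-based).
  def put(key, value)
    @data[key] << value
    @data[key].length
  end

  # Fetch a specific version, or the latest when no version is given.
  def get(key, version = nil)
    versions = @data[key]
    return nil if versions.empty?
    version ? versions[version - 1] : versions.last
  end
end

store = KVStore.new
store.put("drawing:42", "v1-strokes")  # => 1
store.put("drawing:42", "v2-strokes")  # => 2
store.get("drawing:42")                # => "v2-strokes"
store.get("drawing:42", 1)             # => "v1-strokes"
```

Keeping every version append-only maps naturally onto S3, where objects are written once and overwrites are avoided.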
Then the early signs of growth and the same key-value service using a different Ruby stack:
Always interested in the latest tech out there, we had been looking at Ruby 1.9, fibers, and in particular EventMachine + Synchrony for a while. Combined with the need for a solution ASAP, this led us to Goliath, a non-blocking Ruby app server written by the team at PostRank. Over the next 24 hours I ported over the key/value code and other supporting libraries, wrote a few tests, and we launched the service live. The result was great: we went from 115 app instances on over six servers to just 15 app instances.
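The big win here comes from Ruby 1.9 fibers, the primitive that Goliath and EM-Synchrony build on: a handler can pause at an I/O point and be resumed with the result, so a single thread interleaves many in-flight requests instead of blocking one process per request. The real stack drives this with EventMachine's event loop; the sketch below fakes the loop with a simple array so the mechanism is visible on its own.

```ruby
# Minimal illustration of the Fiber primitive underlying Goliath and
# EM-Synchrony. The fiber pauses at a fake "I/O" point; a stand-in event
# loop later resumes it with the fetched data. (EventMachine plays the
# role of the loop in the real stack.)
pending = []

handler = Fiber.new do
  # Reads like a blocking call, but actually yields control to the loop.
  value = Fiber.yield(:fetch_drawing)
  "rendered #{value}"
end

request = handler.resume            # runs the fiber up to the "I/O" yield
pending << [handler, "drawing:42"]  # loop records the pending fetch

# Later, when the "I/O" completes, the loop resumes the fiber with the data.
fiber, data = pending.shift
result = fiber.resume(data)
# request => :fetch_drawing, result => "rendered drawing:42"
```

While one handler is parked in `pending`, the loop is free to run other handlers, which is why far fewer app instances were needed for the same traffic.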
The custom built key-value service didn’t last long though and the switch to a real key-value store was made:
We brought up a small cluster of Membase (a.k.a. Couchbase), rewrote the entire app, and deployed it live at 3 a.m. that same night. Instantly, our cloud datastore issues slowed down, although we still relied on the old datastore for a lazy migration of data to our new Couchbase cluster.
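The lazy-migration pattern mentioned here is a read-through copy: each read tries the new cluster first, falls back to the old datastore on a miss, and copies the value forward so the next read is served from the new store. The post doesn't show the implementation, so the sketch below uses plain `Hash`es in place of the Couchbase and S3 clients.

```ruby
# Hedged sketch of lazy migration between two stores. Hashes stand in for
# the new Couchbase cluster and the old cloud datastore; the class name and
# interface are assumptions for illustration.
class LazyMigrator
  def initialize(new_store, old_store)
    @new_store = new_store
    @old_store = old_store
  end

  def get(key)
    return @new_store[key] if @new_store.key?(key)
    value = @old_store[key]
    @new_store[key] = value unless value.nil?  # migrate on first read
    value
  end
end

old_data = { "drawing:7" => "strokes" }  # still lives in the old datastore
new_data = {}                            # freshly provisioned cluster
db = LazyMigrator.new(new_data, old_data)
db.get("drawing:7")  # => "strokes", now also copied into the new store
```

The appeal of this approach under fire is that no offline bulk migration is needed: hot keys move themselves, and cold keys can trickle over later.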
Finally, learning to scale, tune and operate Couchbase at scale:
Even with the issues we were having with Couchbase, we decided it was too much of a risk to move off our current infrastructure and go with something completely different. At this point, Draw Something was being played by 3-4 million players each day. We contacted Couchbase and got some advice, which essentially was to expand our clusters, eventually onto really beefy machines with SSDs and tons of RAM. We did this, made multiple clusters, and sharded between them for even more scalability over the next few days. We were also continuing to improve and scale all of our backend services as traffic continued to skyrocket. We were now averaging hundreds of drawings per second.
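The post doesn't say how keys were sharded between the clusters; a common scheme, assumed here purely for illustration, is hash-mod routing over a fixed cluster list, so every key deterministically maps to one cluster.

```ruby
require 'zlib'  # CRC32 gives a cheap, stable hash for routing

# Illustrative sketch of client-side sharding across several Couchbase
# clusters. Simple hash-mod routing is an assumption; the post doesn't
# describe the actual scheme.
class ShardedClient
  def initialize(clusters)
    @clusters = clusters  # e.g. one connection object per cluster
  end

  # Route a key to the same cluster on every call.
  def cluster_for(key)
    @clusters[Zlib.crc32(key) % @clusters.length]
  end
end

clusters = ["cluster-a", "cluster-b", "cluster-c"]
router = ShardedClient.new(clusters)
router.cluster_for("drawing:42")  # always the same cluster for this key
```

The trade-off of hash-mod routing is that adding a cluster remaps most keys, which is why fixed cluster counts (or consistent hashing) are the usual choices when growing under load.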
Scaling Draw Something is a success story. But looking at the steps above and considering how fast things had to change and evolve, think how many teams could have stumbled at each of these phases; think what it would have meant not to be able to tell which parts of the system had to change, or to have to take the system offline to upgrade parts of it.
Original title and link: The Story of Scaling Draw Something From an Amazon S3 Custom Key-Value Service to Using Couchbase ( ©myNoSQL)