Powered by NoSQL: All content tagged as Powered by NoSQL in NoSQL databases and polyglot persistence

Bitly Forget Table - Building Categorical Distributions in Redis

In the comment thread of the post “Using Redis as an external index for surfacing interesting content”, Micha Gorelick pointed to a post covering a similar solution used at Bitly:

We store the categorical distribution as a set of event counts, along with a ‘normalising constant’ which is simply the number of all the events we’ve stored. […]

All this lives in a Redis sorted set where the key describes the variable which, in this case, would simply be bitly_country and the value would be a categorical distribution. Each element in the set would be a country and the score of each element would be the number of clicks from that country. We store a separate element in the set (traditionally called z) that records the total number of clicks stored in the set. When we want to report the categorical distribution, we extract the whole sorted set, divide each count by z, and report the result.

Storing the categorical distribution in this way allows us to make very rapid writes (simply increment the score of two elements of the sorted set) and means we can store millions of categorical distributions in memory. Storing a large number of these is important, as we’d often like to know the normal behavior of a particular key phrase, or the normal behavior of a topic, or a bundle, and so on.
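The scheme above can be sketched in a few lines. This is a minimal pure-Python stand-in for the Redis sorted set (the class and member names are illustrative, not Forget Table's actual code): each write increments two elements, the category and the normalising constant z, and reading divides every count by z.

```python
# Pure-Python sketch of the design quoted above: a categorical
# distribution stored as per-category counts plus a normalising
# constant "z", mirroring what ZINCRBY/ZRANGE would do in Redis.

class CategoricalDistribution:
    Z = "z"  # special member holding the total event count

    def __init__(self):
        self.scores = {}  # member -> score, stands in for one sorted set

    def observe(self, category):
        # One write = two increments, as described:
        # ZINCRBY key 1 category ; ZINCRBY key 1 z
        self.scores[category] = self.scores.get(category, 0) + 1
        self.scores[self.Z] = self.scores.get(self.Z, 0) + 1

    def distribution(self):
        # Extract the whole set, divide each count by z.
        z = self.scores.get(self.Z, 0)
        return {m: c / z for m, c in self.scores.items() if m != self.Z}

d = CategoricalDistribution()
for country in ["US", "US", "US", "GB"]:
    d.observe(country)
print(d.distribution())  # {'US': 0.75, 'GB': 0.25}
```

In real Redis the two increments would go through a pipeline, keeping each write a single round trip.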

The Bitly team has open-sourced their solution, named Forget Table, and the code is available on GitHub.

Original title and link: Bitly Forget Table - Building Categorical Distributions in Redis (NoSQL database©myNoSQL)


Now All Reads Come From Redis at YouPorn

Speaking of Redis as the primary data store, this post from Andrea reminded me of YouPorn's usage of Redis:

The datastore is the most interesting part. Initially they used MySQL, but more than 200 million pageviews and 300K queries per second are too much to handle with MySQL alone. The first try was to add ActiveMQ to enqueue writes, but a separate Java infrastructure was too expensive to maintain. Finally they added Redis in front of MySQL and used it as the main datastore.

Now all reads come from Redis. MySQL is used to allow building new sorted sets as requirements change, and it's highly normalized because it's not used directly by the site. After the switchover additional Redis nodes were added, not because Redis was overworked, but because the network cards couldn't keep up with Redis. Lists are stored in sorted sets and MySQL is used as the source to rebuild them when needed. Pipelining allows Redis to be faster, and append-only file (AOF) persistence is an efficient strategy for easy data backup.
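The read-from-Redis, rebuild-from-SQL pattern can be sketched as follows (this is not YouPorn's actual code; the schema and key names are invented for illustration): the normalised store is only scanned when a denormalised sorted set has to be rebuilt, and the request path never touches it.

```python
# Sketch of the pattern described above: Redis-style sorted sets serve
# every read; the normalised SQL store is only consulted to rebuild a
# set from scratch when requirements change.

# Stand-in for the normalised MySQL source of truth.
videos = [
    {"id": 1, "title": "a", "views": 500},
    {"id": 2, "title": "b", "views": 900},
    {"id": 3, "title": "c", "views": 100},
]

# Stand-in for Redis: key -> {member: score}, one sorted set per key.
redis_sets = {}

def rebuild_top_videos():
    # Equivalent to DEL + repeated ZADD driven by a SQL scan.
    redis_sets["videos:by_views"] = {v["id"]: v["views"] for v in videos}

def read_top(n):
    # All reads hit the Redis-side structure, never SQL.
    s = redis_sets["videos:by_views"]
    return [m for m, _ in sorted(s.items(), key=lambda kv: -kv[1])][:n]

rebuild_top_videos()
print(read_top(2))  # [2, 1]
```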

Original title and link: Now All Reads Come From Redis at YouPorn (NoSQL database©myNoSQL)


Using Redis as an External Index for Surfacing Interesting Content at Heyzap

Micah Fivecoate introduces a series of algorithms used at Heyzap for surfacing interesting content:

  1. currently popular
  2. hot stream
  3. drip stream
  4. friends stream

All of them are implemented using Redis ZSETs:

In all my examples, I’m using Redis as an external index. You could add a column and an index to your posts table, but it’s probably huge, which presents its own limitations. Additionally, since we only care about the most popular items, we can save memory by only indexing the top few thousand items.
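Capping the index to the top few thousand items maps to ZADD followed by a ZREMRANGEBYRANK trim in Redis. A minimal pure-Python sketch of that capped index (the cap of 3 and the member names are arbitrary, chosen to keep the example small):

```python
# Sketch of the capped external index described above: keep only the
# top-N scored items and trim the rest to bound memory, as
# ZADD + ZREMRANGEBYRANK would do against a Redis ZSET.

CAP = 3  # in practice "a few thousand", per the quote

index = {}  # member -> score, stands in for one Redis ZSET

def bump(member, delta):
    index[member] = index.get(member, 0) + delta
    # Trim everything below the top CAP entries.
    if len(index) > CAP:
        lowest = sorted(index.items(), key=lambda kv: kv[1])[: len(index) - CAP]
        for m, _ in lowest:
            del index[m]

for post, score in [("p1", 5), ("p2", 9), ("p3", 2), ("p4", 7)]:
    bump(post, score)
print(sorted(index))  # ['p1', 'p2', 'p4']
```

The trade-off is the one the quote names: items that fall out of the top N are forgotten entirely, which is fine when only the most popular items matter.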

Original title and link: Using Redis as an External Index for Surfacing Interesting Content at Heyzap (NoSQL database©myNoSQL)


Graph Based Recommendation Systems at eBay

Slidedeck from eBay explaining how they have implemented a graph-based recommendation system built on (surprise! not a graph database) Cassandra.

Original title and link: Graph Based Recommendation Systems at eBay (NoSQL database©myNoSQL)

Cassandra at MetricsHub for Cloud Monitoring

Charles Lamanna (CEO MetricsHub):

We use Cassandra for recording time series information (e.g. metrics) as well as special events (e.g. server failure) for our customers. We have a multi-tenant Cassandra cluster for this. We record over 16 data points per server per second, 24 hours a day, 7 days a week. We use Cassandra to store and crunch this data.
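Some quick arithmetic on the figures quoted above puts the write volume in perspective: at 16+ data points per server per second, around the clock, every monitored server generates well over a million points a day.

```python
# Back-of-the-envelope volume from the quoted numbers.
points_per_sec = 16
per_day = points_per_sec * 60 * 60 * 24
print(per_day)  # 1382400 data points per server per day
```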

Many NoSQL databases can be used for monitoring. For example, for small-scale self-monitoring you could use Redis.

Original title and link: Cassandra at MetricsHub for Cloud Monitoring (NoSQL database©myNoSQL)


From MongoDB to Riak at Shareaholic

Robby Grossman talked at the Boston Riak meetup about Shareaholic's migration from MongoDB to Riak, covering their requirements and their evaluation of the top contenders: HBase, Cassandra, and Riak.

Why not MongoDB?

  • working set needs to fit in memory
  • global write lock blocks all queries despite not having transactions/joins
  • standbys not “hot”

Bullet-point pros and cons for HBase, Cassandra, and Riak are in the slides.


Cassandra at Scandit

We use Cassandra in two ways: First, it holds our product database. Second, we use it to store and analyze the scans generated by the apps that integrate the Barcode Scanner SDK. We call this Scanalytics.

Scanalytics is a web-based analytics platform that lets app developers see what happens in their app: What kind of products do their users scan? Groceries, electronics, cosmetics, etc.? Where do they scan? At home? In the retail store? And so on. All that goes into Cassandra.

The product database has 25 million records, so you could probably handle it with any database. But I'd be interested to learn how data is modeled in Scanalytics.

Original title and link: Cassandra at Scandit (NoSQL database©myNoSQL)


From S3 to CouchDB and Redis and Then Half Way Back for Serving Ads

The story of going from S3 to CouchDB and Redis and then back to S3 and Redis for ad serving:

The solution to this situation has a touch of irony. With Redis in place, we replaced CouchDB for placement- and ad-data with S3. Since we weren’t using any CouchDB-specific features, we simply published all the documents to S3 buckets instead. We still did the Redis cache warming upfront and data updates in the background. So by decoupling the application from the persistence layer using Redis, we also removed the need for a super fast database backend. We didn’t care that S3 is slower than a local CouchDB, since we updated everything asynchronously.
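The decoupling described above, in which Redis is warmed up front and the slow backing store never sits on the serving path, can be sketched like this (the store layout, keys, and the simulated S3 call are invented for illustration):

```python
import time

def slow_s3_get(key):
    # Stand-in for an S3 GET; deliberately "slow" compared to a cache hit.
    time.sleep(0.01)
    return {"ad-1": "creative-a", "ad-2": "creative-b"}[key]

cache = {}  # stand-in for Redis

def warm(keys):
    # Upfront cache warming: pay the S3 latency once, off the hot path.
    for k in keys:
        cache[k] = slow_s3_get(k)

def serve(key):
    # The request path only ever touches the cache; background
    # refreshes would re-run warm() asynchronously.
    return cache[key]

warm(["ad-1", "ad-2"])
print(serve("ad-1"))  # creative-a
```

Because all reads are served from the cache and updates happen asynchronously, the backend behind it only needs durability, not low latency, which is exactly why S3 became good enough again.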

Besides the detailed blog post, there's also a slidedeck.

Original title and link: From S3 to CouchDB and Redis and Then Half Way Back for Serving Ads (NoSQL database©myNoSQL)


Berkeley DB at Yammer: Application Specific NoSQL Data Stores for Everyone

Even though I’ve been using Berkeley DB for over 6 years, I’ve very rarely heard stories about it. This presentation from Yammer tells the story of taking Berkeley DB a long way:

In early 2011 Yammer set out to replace an 11 billion row PostgreSQL message delivery database with something a bit more scale-ready. They reached for several databases with which they were familiar, but none proved to be a fit for various reasons. Following in the footsteps of so few before them, they took the wheel of the SS Berkeley DB Java Edition and piloted it into the uncharted waters of horizontal scalability.

In this talk, Ryan will cover Yammer’s journey through log cleaner infested waters, being hijacked on the high seas by the BDB B-tree cache, and their eventual flotilla of a 45 node, 256 partition BDB cluster.

Pinterest Architecture Numbers

Todd Hoff caught some new numbers about the Pinterest architecture; here are the ones interesting from the data point of view:

  1. 125 EC2 memcached instances, of which 90 are for production and 35 for internal usage:

    Another 90 EC2 instances are dedicated towards caching, through memcache. “This allows us to keep a lot of data in memory that is accessed very often, so we can keep load off of our database system,” Park said. Another 35 instances are used for internal purposes.

  • 70 master MySQL databases on EC2

    • sharded at 50% capacity
    • backup databases in different regions

    Behind the application, Pinterest runs about 70 master databases on EC2, as well as another set of backup databases located in different regions around the world for redundancy.

    In order to serve its users in a timely fashion, Pinterest sharded its database tables across multiple servers. When a database server gets more than 50% filled, Pinterest engineers move half its contents to another server, a process called sharding. Last November, the company had eight master-slave database pairs. Now it has 64 pairs of databases. “The sharded architecture has let us grow and get the I/O capacity we need,” Park said.

  • 80 million objects (410TB) stored in S3

  • no details about Redis
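The 50%-split policy quoted above can be sketched as a toy: when a shard passes half capacity, half its rows move to a new shard. The capacity number and the in-memory layout are invented for illustration; Pinterest's real scheme also has to keep key routing stable across splits.

```python
# Toy sketch of split-at-50% sharding: a shard that grows past half
# capacity gives the upper half of its rows to a brand-new shard.

CAPACITY = 8  # rows per shard, tiny for the example

def maybe_split(shards, i):
    """If shard i is more than 50% full, move half its rows out."""
    shard = shards[i]
    if len(shard) > CAPACITY // 2:
        keys = sorted(shard)
        moved = {k: shard.pop(k) for k in keys[len(keys) // 2:]}
        shards.append(moved)

shards = [{k: f"row{k}" for k in range(5)}]  # one shard at 5/8 capacity
maybe_split(shards, 0)
print([sorted(s) for s in shards])  # [[0, 1], [2, 3, 4]]
```

Repeated splits are how a fleet grows from 8 master-slave pairs to 64 without any single shard ever filling up.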

Original title and link: Pinterest Architecture Numbers (NoSQL database©myNoSQL)

Cassandra at Workware Systems: Data Model FTW

One of the stories in which the deciding factor for using Cassandra was primarily the data model and not its scalability characteristics:

We started working with relational databases, and began building things primarily with PostgreSQL at first.  But dealing with the kind of data that we do, the data model just wasn’t appropriate. We started with Cassandra in the beginning to solve one problem: we needed to persist large vector data that was updated frequently from many different sources. RDBMS’s just don’t do that very well, and the performance is really terrible for fast read operations. By contrast, Cassandra stores that type of data exceptionally well and the performance is fantastic. We went on from there and just decided to store everything in Cassandra.

Original title and link: Cassandra at Workware Systems: Data Model FTW (NoSQL database©myNoSQL)


The Story of Scaling Draw Something From an Amazon S3 Custom Key-Value Service to Using Couchbase

This is the story of scaling Draw Something told by its CTO Jason Pearlman.

The early days, a custom key-value service built on top of Amazon S3:

The original backend for Draw Something was designed as a simple key/value store with versioning. The service was built into our existing ruby API (using the merb framework and thin web server). Our initial idea was why not use our existing API for all the stuff we’ve done before, like users, signup/login, virtual currency, inventory; and write some new key/value stuff for Draw Something? Since we design for scale, we initially chose Amazon S3 as our data store for all this key/value data. The idea behind this was why not sacrifice some latency but gain unlimited scalability and storage.
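The shape of that service, a key/value store with versioning, can be sketched in a few lines (the real one sat behind a Ruby API on S3; this in-memory layout and the key names are invented for illustration):

```python
# Sketch of a key/value store with versioning: puts append to a key's
# history, gets return the latest value or any earlier version.

versions = {}  # key -> list of values, oldest first

def put(key, value):
    versions.setdefault(key, []).append(value)
    return len(versions[key]) - 1  # version number just written

def get(key, version=None):
    history = versions[key]
    return history[-1] if version is None else history[version]

put("drawing:42", "v0-strokes")
put("drawing:42", "v1-strokes")
print(get("drawing:42"))     # v1-strokes
print(get("drawing:42", 0))  # v0-strokes
```

On S3 this maps naturally to one object per key per version, trading latency for effectively unlimited storage, which is the bet described above.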

Then the early signs of growth and the same key-value service using a different Ruby stack:

Being always interested in the latest tech out there, we were looking at Ruby 1.9, fibers, and in particular EventMachine + synchrony for a while. Combined with the need for a solution ASAP, this led us to Goliath, a non-blocking Ruby app server written by the guys at PostRank. Over the next 24 hours I ported over the key/value code and other supporting libraries, wrote a few tests, and we launched the service live. The result was great. We went from 115 app instances on over six servers to just 15 app instances.

The custom built key-value service didn’t last long though and the switch to a real key-value store was made:

We brought up a small cluster of Membase (a.k.a. Couchbase), rewrote the entire app, and deployed it live at 3 a.m. that same night. Instantly, our cloud datastore issues slowed down, although we still relied on it to do a lazy migration of data to our new Couchbase cluster.

Finally, learning to scale, tune and operate Couchbase at scale:

Even with the issues we were having with Couchbase, we decided it was too much of a risk to move off our current infrastructure and go with something completely different. At this point, Draw Something was being played by 3-4 million players each day. We contacted Couchbase, got some advice, which really was to expand our clusters, eventually to really beefy machines with SSD hard drives and tons of ram. We did this, made multiple clusters, and sharded between them for even more scalability over the next few days. We were also continuing to improve and scale all of our backend services, as traffic continued to skyrocket. We were now averaging hundreds of drawings per second.

Scaling Draw Something is a success story. But looking at the steps above and considering how fast things had to change and evolve, think how many teams could have stumbled at each of these phases; think what it would have meant to not be able to tell which parts of the system had to change, or to have to take the system offline to upgrade parts of it.

Original title and link: The Story of Scaling Draw Something From an Amazon S3 Custom Key-Value Service to Using Couchbase (NoSQL database©myNoSQL)