


usecase: All content tagged as usecase in NoSQL databases and polyglot persistence

Tracking page views with MongoDB

After looking at four different alternatives — Google Analytics, sharding the existing MySQL database, an ETL process (i.e. log processing), and MongoDB — Eventbrite decided to go the MongoDB way, dismissing the other approaches:

  • Google Analytics
    • Time-consuming to set up and test
    • Migrating existing page-view data is tricky
    • Not real time
  • Shard the MySQL table
    • Requires downtime to make schema changes
    • Introduces routing at the code level
    • Would still be prone to row locks if we outgrew # of shards
  • ETL process (aka, write to log file and have a process that aggregates and periodically writes to the database)
    • No data integrity
    • Not real time
    • Requires management of log files over multiple web servers

While the post doesn’t really detail the reasons why MongoDB would be able to solve this problem, there are a couple of MongoDB features described in this ☞ post that make it a good fit for this scenario.

Basically, the solution is based on a combination of the following MongoDB features:

  • upsert and the $inc operator
  • write operations returning immediately (without waiting for the server to confirm the write)
  • the $inc operator being a very good replacement for a read-modify-update cycle
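The upsert + $inc combination can be sketched in plain Ruby without a MongoDB server; the class, collection, and field names below are my own illustrations, and a real setup would issue a single driver update with the upsert option set:

```ruby
# Plain-Ruby sketch of the upsert + $inc pattern; in the real driver this
# would be one update({_id: page}, {"$inc" => {"count" => 1}}, upsert: true).
class PageViewCounter
  def initialize
    @collection = {}  # stands in for a MongoDB collection, keyed by page id
  end

  def record_view(page_id)
    doc = @collection[page_id] ||= { "_id" => page_id, "count" => 0 }  # upsert
    doc["count"] += 1                                                  # $inc
    doc
  end

  def count(page_id)
    (@collection[page_id] || {})["count"].to_i
  end
end

counter = PageViewCounter.new
3.times { counter.record_view("/events/42") }
puts counter.count("/events/42")  # => 3
```

Because the increment happens server-side in MongoDB, no read-modify-update round trip is needed, which is exactly what makes this a good fit for page-view counting.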
As a side note, remember there’s another MongoDB-based tool, Hummingbird, that provides real-time web traffic visualization.

This post is part of the MongoDB case studies series.


A Headless Web Site Screenshot Service using Redis and Resque

Just another example of using Redis-based queues to build a headless web site screenshot service.

# add a sample
ruby sample.rb

# start a worker
QUEUE=* rake resque:work
resque-web # optional

# run the webserver
ruby server.rb

# schedule a screenshot
wget http://localhost:4567/schedule?url=
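The queueing underneath Resque is just a Redis list; a minimal in-memory sketch of that pattern (the class and queue names are my own, and a real deployment would enqueue a Resque job that drives the headless browser):

```ruby
# In-memory sketch of the Redis-list-backed job queue Resque uses:
# LPUSH to enqueue, RPOP to dequeue. No Redis server required here.
class ScreenshotQueue
  def initialize
    @list = []  # stands in for a Redis list, e.g. "queue:screenshots"
  end

  def enqueue(url)  # ~ LPUSH queue:screenshots <payload>
    @list.unshift({ "class" => "Screenshot", "args" => [url] })
  end

  def dequeue       # ~ RPOP queue:screenshots
    @list.pop
  end
end

q = ScreenshotQueue.new
q.enqueue("http://example.com")
job = q.dequeue
puts job["args"].first  # => "http://example.com"
```

The web endpoint only has to enqueue; the worker process blocks on the queue and renders screenshots asynchronously.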


CouchDB and The Distributed Web Data

Michael Hunger (@mesirii) reports on Jan Lehnardt’s talk at Berlin Buzzwords:

Jan also stressed privacy issues and ownership of data, mentioning Facebook and the Diaspora project. One proposed solution could be to keep more personal data (or at least a safe copy) in a local CouchDB instance (“Every machine should run a webserver anyway - that is one of the original ideas of the web.”). If you have those instances on any of your machines, then information such as:

  • contacts
  • appointments
  • bookmarks and browse history
  • even email and IM messages

could be stored as documents in the database and automatically synced (at the database level) to the CouchDB instances on all the other devices you use.

CouchDB has quite a few features (e.g. a friendly HTTP/REST API, replication, and some more) that make it interesting for such distributed web data scenarios.
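The database-level sync mentioned above is triggered by POSTing to CouchDB’s _replicate endpoint; a small sketch of building that request body (the host and database names are made up, and this only constructs the JSON rather than sending it to a live instance):

```ruby
require "json"

# Builds the body for POST /_replicate on a CouchDB instance.
# "continuous" => true keeps the two databases synced as documents change.
def replication_request(source_db, target_url)
  JSON.generate({
    "source"     => source_db,
    "target"     => target_url,
    "continuous" => true
  })
end

body = replication_request("contacts", "http://laptop.local:5984/contacts")
puts body
```

Running the same request in both directions on each device would give the multi-master, offline-friendly sync the talk describes.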


E-commerce apps with MongoDB

Kyle Banker (@hwaet) has two articles — ☞ here and ☞ here — explaining how combining some of MongoDB’s features can make it useful for e-commerce applications.

These articles look at scenarios like catalog management, shopping carts, orders, and inventory management. Most of the proposed solutions for these scenarios are based on a combination of MongoDB’s rich document support, document-level atomic operations, inline operators, and its extensive querying support.

A lot of the arguments levied against NoSQL in e-commerce center around transactions, consistency, and durability. A couple points, then, are worth noting.

Concerning transactions, it is true that MongoDB doesn’t support the multi-object kind; however, it does support atomic operations on individual documents. And this, combined with the document-oriented modeling just described and a little creativity, is adequate for many e-commerce problems. Certainly, if we needed to literally debit one account and credit another in the same operation, or if we wanted rollback, we’d need full-fledged transactions. However, the transactionality provided by MongoDB may be sufficient for most, if not all, e-commerce operations.

If you’re concerned about consistency and durability, know that write operations in MongoDB can be made consistent across connections. Furthermore, MongoDB 1.5 supports near-real-time replication, so it’s now possible to ensure that an operation has been replicated before returning.
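The single-document atomicity argued for above can be sketched with a conditional inventory decrement; this is the shape of a MongoDB update whose query guards the modification, though the document and field names here are my own and a real application would issue the update through the driver:

```ruby
# Plain-Ruby sketch of an atomic, conditional stock reservation -- the
# shape of update({sku: s, qty: {"$gte" => n}}, {"$inc" => {"qty" => -n}}).
# Because MongoDB applies it to one document atomically, no stock can be
# reserved twice even under concurrent checkouts.
def reserve_stock(doc, n)
  return false unless doc["qty"] >= n  # the query condition guards the update
  doc["qty"] -= n                      # $inc, atomic within a single document
  true
end

item = { "sku" => "tee-shirt-m", "qty" => 2 }
puts reserve_stock(item, 1)  # => true,  qty is now 1
puts reserve_stock(item, 5)  # => false, not enough stock, qty unchanged
```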

Redis PUB/SUB Used for Real-time Collaborative Web Editing

Lakshan Perera implemented a real-time collaborative editor using Redis PUB/SUB support (which will become available with Redis 2.0, announced for May 21st and currently available as RC1), Node.js[1], and WebSockets[2]. Redis is also used for storing the documents in the app.

On each pad, there are different types of messages that users send on different events. These messages need to be propagated correctly to other users.

Mainly, the following messages need to be sent:

  • When a user joins a pad
  • When a user leaves a pad
  • When a user sends a diff
  • When a user sends a chat message

For handling the message delivery, I used Redis’ newly introduced pub/sub implementation. Every time a user is connected (i.e. visits a pad), two Redis client instances are initiated for him. One client is used for publishing his messages, while the other is used to listen to incoming messages from subscribed channels.
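The fan-out behind that two-client-per-user setup can be sketched in memory; the channel naming is my own assumption, and in the real app one Redis connection would PUBLISH while a second, blocked on SUBSCRIBE, would receive:

```ruby
# In-memory sketch of Redis SUBSCRIBE/PUBLISH fan-out; no server needed.
class PubSub
  def initialize
    @subscribers = Hash.new { |h, k| h[k] = [] }
  end

  def subscribe(channel, &callback)  # ~ SUBSCRIBE pad:42
    @subscribers[channel] << callback
  end

  def publish(channel, message)      # ~ PUBLISH pad:42 <message>
    @subscribers[channel].each { |cb| cb.call(message) }
    @subscribers[channel].size       # Redis returns the receiver count
  end
end

bus = PubSub.new
received = []
bus.subscribe("pad:42") { |msg| received << msg }
bus.publish("pad:42", "user joined")
puts received.inspect  # => ["user joined"]
```

Join, leave, diff, and chat messages would all travel the same channel, tagged by type, so every subscribed browser sees them in order.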

Source code of the project can be found on ☞ GitHub and there’s also a short video demo of the app:

Just in case you are wondering about the Node.js-Redis combination, Michael Bleigh’s presentation below will detail some of the benefits of using the asynchronous model:

Last, but not least, I should say that Node.js usage is not at all uncommon in the NoSQL world.


  • [1] ☞ Node.js: evented I/O for V8 JavaScript
  • [2] ☞ WebSocket API: enables web pages to use the WebSocket protocol for two-way communication with a remote host. WebSockets are a real solution to problems that were previously solved using long-poll, Comet, etc.


Lessons from Redis-Land

Short summary of an adventurous journey in the NOSQL world with Redis.

Leaving aside the introductory part, what I enjoyed most is the analysis that went into a simple use case: determining how to map many-to-many relationships into Redis. This part follows pretty closely the process of data modeling with key-value stores I’ve described before.
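Mapping a many-to-many relation onto a key-value store usually means maintaining a set on each side of the relation; a sketch of that idea with users and groups (the key names follow a common Redis convention but are my own choice, and a Ruby Set stands in for Redis SADD/SMEMBERS):

```ruby
require "set"

# Sketch of a many-to-many relation (users <-> groups) on key-value sets,
# the way one would use SADD/SMEMBERS on keys like "user:1:groups" and
# "group:a:users". Both directions are written so either side is queryable.
class ManyToMany
  def initialize
    @store = Hash.new { |h, k| h[k] = Set.new }  # stands in for Redis sets
  end

  def link(user, group)
    @store["user:#{user}:groups"] << group  # ~ SADD user:1:groups a
    @store["group:#{group}:users"] << user  # ~ SADD group:a:users 1
  end

  def groups_of(user)
    @store["user:#{user}:groups"].to_a.sort
  end

  def users_in(group)
    @store["group:#{group}:users"].to_a.sort
  end
end

m = ManyToMany.new
m.link("1", "a"); m.link("1", "b"); m.link("2", "a")
puts m.groups_of("1").inspect  # => ["a", "b"]
puts m.users_in("a").inspect   # => ["1", "2"]
```

The double write is the price of having no joins: each direction of the relation gets its own index, kept consistent by the application.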


Palm webOS and CouchDB or NoSQL is Not Only About Scale

Last week, in the CouchDB case studies, based on a single tweet, I mentioned a very interesting CouchDB use case related to Palm webOS. Now the ☞ Palm Developer Center Blog is giving more details about an upcoming webOS native JSON storage named db8, which is designed to sync with CouchDB in the cloud:

db8: what if you had access to a fantastic, performant native JSON store? That is where db8 comes in, our new open source JSON datastore that includes:

  • Native JSON storage with query mechanism
  • Built-in primitives for easy cloud syncing (easily query changed/deleted data; designed to sync with CouchDB in the cloud)
  • Fine-grained access control for apps
  • Mobile-optimized and fast (especially for updates)
  • Pluggable back-end

While many still associate the whole NoSQL space with scalability or big data, these scenarios — there is also this atypical Riak use case — prove that NoSQL is about choosing the best tool for the job.

Update: In a ☞ recent article on ArsTechnica, Ryan Paul expresses his concerns related to using CouchDB for desktop configuration storage and syncing:

CouchDB can’t seem to handle the load of Gwibber’s messages, leading to excessive CPU consumption and poor performance in certain cases. For example, the overhead of computing the views causes lag when the user switches streams after Gwibber refreshes. The cost of pulling the account configuration data out of the database can also sometimes cause a noticeable lag that lasts up to four or five seconds when opening Gwibber’s account manager.

I’d really love to hear from CouchDB experts some comments related to these concerns.

Update 2: Make sure you are reading the comment below that clarifies the above reported issues.

CouchDB Case Studies

The guys from ☞ Couchio have started to publish a series of CouchDB case studies. This is a very good initiative that is on par with myNoSQL’s intentions. Unfortunately, the three published so far — you can read them ☞ here — are in my opinion too thin on technical details. Here is a short list of questions that I’d love to hear more about:

  1. Why and how did you get to CouchDB?

  2. Have you had to migrate existing data? How did you do that? Are you still using a relational or other storage solution?

  3. What kind of replication strategy are you using?

  4. Are you sharding your data? If yes, what strategy/solution are you using?

  5. What lessons have you learned while using CouchDB?

If you can help me get these answers, I bet it would make these CouchDB case studies even more interesting for the NoSQL community.

Another case study that J.Chris pointed out while I was away is the webOS announcement ☞ mentioned by Ed Finkler (@funktron):

If you’re into CouchDB and JavaScript, webOS is getting 100% more awesome: #palmdev

Riak in Production: An Atypical Story

A non-enterprisey and non-twitteresque, but very interesting Riak deployment on a church’s kiosks:

Currently, we are running four Riak nodes (writing out to the Filesystem backend) outside of the three Kiosks themselves. I also have various Riak nodes on my random linux servers because I can use the CPU cycles on my other nodes to distribute MapReduce functions and store information in a redundant fashion.

Please note also the reduced complexity of bringing new kiosks up:

As I bring more kiosks into operation, the distributed map-reduce feature is becoming more valuable. Since I typically run reports during off hours, the kiosks aren’t overloaded by the extra processing power. So far I have been able to roll out a new kiosk within 2 hours of receiving the hardware. Most of this time is spent doing the installation and configuration of the touchscreen.

Atypical I’d say, but definitely exciting!


HBase @ Adobe: An Interview with Adobe SaaS Infrastructure Team

About one month ago, the Adobe SaaS Infrastructure Team (ASIT) published two excellent articles[1] on their experience and work with HBase. I had a chance to get into some more details with the team driving this effort — thanks a lot, guys! — and here is the final result of our conversation:

myNoSQL: It looks like during your evaluation phase[2], you’ve only considered column-based solutions (Cassandra, HBase, Hypertable). Would you mind describing some of the requirements that led you to not consider others?

ASIT: There weren’t many solutions that could respond to our needs in 2008 (approximately 1 year before the NoSQL movement started).

We considered other solutions as well (e.g. memcachedb, Berkeley DB), but we only evaluated the three you mentioned.

We had two divergent requirements: dynamic schema and huge number of records.

We had to provide a data storage service that could handle 40 million entities, each with many different 1-N relations. These required real-time read/write access. For performance reasons, we denormalized data to avoid joins. At the same time we had to offer schema flexibility (clients could add their custom metadata), and this was relatively awkward to do in MySQL. In the columnar model you can dynamically and easily add new columns.

Every few hours data had to be aggregated in order to compute some sort of billboards and other statistics (client facing, not internal usage information). We were familiar with Hadoop / Map-Reduce, and we were looking for something that could distribute the processing jobs.

myNoSQL: You also seemed to be concerned by data corruption scenarios — this being mentioned in both articles[3] — and that sounds like a sign of HBase not being stable enough. Is that the case? Or was it something else that led you to dig deeper in this direction?

ASIT: Although HBase stability was initially unknown to us (we didn’t have experience running the system in production), we would have gone through the same draconian testing with any other system. We promised our clients that we won’t lose or corrupt their data, so this criterion is paramount.

In a relational database you have referential integrity, databases rarely get corrupted and there are multiple ways to backup your data and restore it. Moreover, you understand where the data is, and know what the filesystem guarantees are. Many times there are teams that deal with backups and database operations for you (not our case however).

When you go with a different model, and you have huge volumes of data, it becomes very hard to back it up frequently and restore it fast. With HBase and Hadoop under active development we need to make sure all the building blocks are “safe”. Our clients have different requirements, but there is a very important common one: data safety.

myNoSQL: The whole setup for being able to move from HBase to MySQL and back sounds really interesting. Could you expand a bit on what tools have you used?

ASIT: We built most of the system in-house. Everything was running on a bunch of cron jobs. The system would:

  1. Export the actual HBase tables to HDFS, using map-reduce. This also performed data integrity checks.
  2. We had two backup “end-points”. Once the data was exported to HDFS, we compressed it and pushed it into another distributed file system.
  3. Another step was to create a MySQL backup out of it and import it on a stand-by MySQL cluster that could act as failover storage. The MySQL failover didn’t have any processing capabilities, so it was running in “degraded mode”, but it was good enough for a temporary solution.

We had the same thing running “in reverse”. The MySQL backup was kept in more places, and it could be imported back in HBase.

On top of this we had monitoring, so whenever something went wrong we got alerted. We also had variable backup frequency (keep one for each of the last 6 months, one for each of the last 4 weeks, one for each of the last 7 days, plus backups at a 2-hourly level) to help with disk utilization.

This was supposed to be a temporary solution, because beyond a certain threshold the data would have become too big to fit in MySQL.

Today we look at multi-site data replication as a way of backing up large amounts of data in a scalable manner and soon HBase will support this. We cannot do failover to a MySQL system with our current data volume :).
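A variable-frequency retention policy like the one described above can be sketched as a simple age-bucketing function; the tiers come from the answer (daily for a week, weekly for a month, monthly for six months — the 2-hourly tier is omitted here), while the bucketing rules themselves are my own illustration, not Adobe’s code:

```ruby
require "date"

# Decide whether a dated backup should be kept: daily tier for the last
# 7 days, weekly tier (Sundays) up to 4 weeks, monthly tier (1st of the
# month) up to ~6 months; everything older is pruned.
def keep?(backup_date, today)
  age = (today - backup_date).to_i
  case age
  when 0..7    then true                 # daily tier
  when 8..28   then backup_date.sunday?  # weekly tier
  when 29..183 then backup_date.day == 1 # monthly tier
  else false
  end
end

today = Date.new(2010, 6, 15)
puts keep?(Date.new(2010, 6, 14), today)  # => true  (within the last 7 days)
puts keep?(Date.new(2010, 6, 2),  today)  # => false (weekday in the weekly tier)
```

A cron job running this over the backup listing keeps disk utilization bounded while preserving progressively coarser history.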

myNoSQL: Based on the two articles I have had a hard time deciding if the system is currently in production or not. Is it? In case it is, how many ops people are responsible for managing your HBase cluster?

ASIT: There is actually more than one cluster: one is currently in production, and the larger one we mention in the article will enter production soon. This is going to sound controversial, but we — the team — are solely responsible for managing our clusters. Of course, someone else is dealing with the basics: rack & stack, replacing defective hardware, prepping new servers, and making sure we are allocated the right IP addresses.

We believe in the dev-ops movement. We automate everything and do monitoring so the operations work for the production system is really low. Last time we had to intervene was due to a bug in the backup frequency algorithm (we exceeded our quota, and we had alerts pouring in).

For full disclosure: These are not business critical systems, today. At some point we will need a dedicated operations team but we’ll still continue to work on the entire vertical, from the feature development part down to actual service deployment and monitoring.

myNoSQL: You mention that you are developing against the HBase trunk. That could mean that you are updating your cluster quite frequently. How do you achieve this without compromising uptime?

ASIT: This is the area we’re starting to focus on more, in the same manner we did with data integrity.

The current production cluster runs on HBase and Hadoop 0.18.X. Since the first time it failed, in December 2008, we haven’t had any downtime, but we didn’t upgrade it either, as we envisioned that the new system would replace it completely at some point.

Currently it’s possible to upgrade a running cluster one machine at a time, without downtime, if RPC compatibility is maintained between the builds. Major upgrades that impact the RPC interface or data model could be done without downtime using multi-site replication, but we haven’t done that yet. However, in order to have something solid we would need rolling upgrade capabilities from both HDFS and HBase. In theory, this will be possible once they migrate the current RPC to Avro and enable RPC versioning. In the meantime we make sure our clients understand that we might have downtime when upgrading. We’ll probably contribute to the rolling upgrade features as well.

myNoSQL: Based on the following quote from your articles, I’d be tempted to say that some of the requirements were more about ACID guarantees. How would you comment?

So, for us, juggling with Consistency, Availability and Partition tolerance is not as important as making sure that data is 100% safe.

ASIT: It boils down to the requirements of the applications that run on top of your system. If you provide APIs that need to be fully consistent you have to be able to ensure that in every layer of the system. Some of our clients do need full consistency.

Another reason why we need full consistency is to be able to correctly build higher-level features such as distributed transactions and indexing. These go hand in hand and are a lot easier to implement when the underlying system can provide certain guarantees. It’s much like the way ZooKeeper relies on TCP for data transmission guarantees, rather than implementing them from scratch.

We’re not consistency “zealots” and understand the benefits of eventual consistency like increased performance and availability in some scenarios. A good example is how we’ll deal with multi-site replication. It’s just that most of our use-cases fit better over a fully consistent system.

myNoSQL: There are a couple more questions related to the full-consistency section, but I don’t want to flood you with too many questions.

ASIT: This is indeed an interesting topic and we’ll probably post more about it on ☞

myNoSQL: Thanks a lot guys!


  • [1] The two articles referenced from this interview are ☞ Why we’re using HBase part 1 and ☞ Why we’re using HBase part 2.
  • [2] According to the ☞ first part:

    There were no benchmarking reports then, no “NoSQL” moniker, therefore no hype :). We had to do our own objective assessment. The list was (luckily) short: HBase, Hypertable and Cassandra were on the table. MySQL was there as a last resort.

  • [3] Just some fragments mentioning data integrity/corruption:

    We had to be able to detect corruption and fix it. As we had an encryption expert in the team (who authored a watermarking attack), he designed a model that would check consistency on-the-fly with CRC checksums and allow versioning. The Thrift-serialized data was wrapped in another layer that contained both the inner data types and the HBase location (row, family and qualifier). (He’s a bit paranoid sometimes, but that tends to come in handy when disaster strikes :). Pretty much what Avro does.

    Most of our development efforts go towards data integrity. We have a draconian set of failover scenarios. We try to guarantee every byte regardless of the failure and we’re committed to fixing any HBase or HDFS bug that would imply data corruption or data loss before letting any of our production clients write a single byte.

Use case: Superfeedr uses Redis to replace MySQL+Memcached

We all love “war stories”, and the one about Superfeedr using Redis to replace their MySQL + Memcached setup is really interesting. The features that made Redis a better fit for Superfeedr’s scenario were mostly additions in the Redis 1.2.0 release.

And I’m pretty sure that Superfeedr will benefit from Redis Virtual Memory once it becomes available, considering their deployment on 2GB slices at Slicehost.