SimpleDB: All content tagged as SimpleDB in NoSQL databases and polyglot persistence

Amazon SimpleDB, MongoDB, CouchDB, and RavenDB Compared

I wanted to share this before the weekend is over: Jesse Wolgamott’s video: “Battle of NoSQL starts: Amazon’s SDB vs MongoDB vs CouchDB vs RavenDB” from September’s Lone Star Ruby Conference.

You can download the video from the Confreaks site for watching offline.

Original title and link: Amazon SimpleDB, MongoDB, CouchDB, and RavenDB Compared (NoSQL databases © myNoSQL)

Amazon SimpleDB: An Intro

Tom Borthwick:

Amazon’s SimpleDB is a NoSql datastore with a whole lot of no: no sql, no datatypes (except utf-8 strings), no transactions, no joins, no indexes, no schema, no administration, and no cost for minimal usage. But when you google it, you find Amazon’s docs, a lot of bold predictions about it from 2007 and 2008… and not much else. SimpleDB seems like an interesting solution in search of a problem, but its ease of use and lack of administration effort make it worth at least checking out.

There are a few yeses in the SimpleDB offering, but first you need to get past the nos.



Paper: Netflix’s Transition to High-Availability Storage Systems

A while ago, Sid Anand[1] wrote a series of posts on the challenges of a hybrid Oracle - Amazon SimpleDB solution. These have now become a paper that offers a much better organized and more detailed view of Netflix’s transition to a hybrid Oracle - Amazon Web Services (SimpleDB, S3) architecture.

Go read the ☞ paper if one of these applies:

  • interested in Amazon SimpleDB and SimpleDB best practices
  • interested in running an on-premise and cloud hybrid architecture
  • interested in architecting a multi data source system

  1. Siddharth “Sid” Anand, Netflix cloud engineer, @r39132


Practical Tips for Optimizing SimpleDB Access

Sid Anand, Netflix cloud engineer, shares a set of tips for optimizing access to SimpleDB based on his extensive experience using it:

I’ve been a heavy-user of SimpleDB since January 2009, storing, writing, and reading billions of items. Based on my experience, I’ve compiled a list of best practices and conventions to simplify working with SimpleDB.

His article talks about handling numerical and time data, UUIDs/GUIDs, composite value attributes, batched PUTs and a couple more tricks.
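Since SimpleDB stores every value as a UTF-8 string and compares values lexicographically, numbers and timestamps have to be encoded so that string order matches their natural order. The following is a minimal sketch of one common way to do that (zero-padding plus an offset for negative values); it is not taken from Sid’s article, and the names `OFFSET`, `WIDTH`, `encode_int`, `decode_int` and `encode_time` are hypothetical.

```python
from datetime import datetime, timezone

# Assumed bounds for this sketch; choose them to fit your own data.
OFFSET = 10**9  # shifts negative integers into the non-negative range
WIDTH = 10      # number of digits every encoded value is padded to


def encode_int(value: int) -> str:
    """Encode an integer so that string order matches numeric order."""
    shifted = value + OFFSET
    if not 0 <= shifted < 10**WIDTH:
        raise ValueError(f"{value} is outside the encodable range")
    return str(shifted).zfill(WIDTH)


def decode_int(text: str) -> int:
    """Reverse encode_int."""
    return int(text) - OFFSET


def encode_time(moment: datetime) -> str:
    """Encode a timezone-aware timestamp as fixed-width UTC ISO 8601.

    Fixed-width ISO 8601 strings sort chronologically when compared as text.
    """
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

With these helpers, sorting the encoded strings gives the same order as sorting the original numbers, which is what a string-only range query needs.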

Update: The original post has since been split into three parts: ☞ part 2, covering tips on case sensitivity, sharding, non-indexed queries, eventual consistency and batched PUTs, and ☞ part 3, sharing tips on attribute value length, default query limit,


NoSQL Smackdown at SXSW

The Changelog guys have ☞ published the audio recording from the NoSQL smackdown at SXSW. On stage were Stu Hood (Cassandra), Jan Lehnardt (CouchDB) and Wynn Netherland (The Changelog), and they were quickly joined by Werner Vogels (Amazon CTO).

Werner was definitely the salt and pepper of the discussion, followed closely by Jan Lehnardt. Below you can find my notes (nb quotes are reproduced from memory so they may not be exact):

Werner Vogels: There are two reasons for replication:

  • gaining fault tolerance
  • getting a higher level of concurrency, resulting in higher throughput (read or write)


Replication leads to decisions about how writes should be implemented. […] So the whole set of consistency models is an abstraction of the implementation leaking up.

Werner Vogels: The Dynamo system is not user-friendly. S3 is much better.

Werner Vogels: List operations are complicated with Dynamo.

Stu Hood: Cassandra has an advantage as it doesn’t use hashing, being closer to BigTable.

Werner Vogels: You shouldn’t run your own database. Those times have passed.

This somewhat contradicts what Werner said a bit later during the NoSQL smackdown show:

Werner Vogels: If you look at your applications, for each of your requirements you’ll find a dedicated solution.

Jan Lehnardt had a few great arguments against the “don’t run your own database” stance:

Cloud is awesome as long as you are connected.

Plus the web is not meant to live in silos which is what Google, Amazon and a couple of others are proposing.

Jonathan Ellis has a post on some reasons why the cloud may not be the best place for your data.

Jan Lehnardt: Having an HTTP-based database means that you don’t need all that crap in the middle.

This sounds very similar to our post NoSQL Protocols Are Important and re-emphasizes that CouchDB can change the architecture of your next web app.

Even if the recording quality is not great, it’s still highly entertaining.

Update: Some videos of the event are being published. You can watch them below

MapReduce: Hadoop and Cloud MapReduce

Ricky Ho has two great articles on how MapReduce is implemented by Hadoop and Cloud MapReduce:

Cloud MapReduce enjoys inherent scalability and resiliency, which greatly simplifies its architecture.

  1. Cloud MapReduce doesn’t need to design central coordinator components (like the NameNode and JobTracker in the Hadoop environment). It simply stores the job progress status information in the distributed metadata store (SimpleDB).
  2. Cloud MapReduce doesn’t need to worry about scalability in the communication path or how data can be moved efficiently between nodes; all of this is taken care of by the underlying Cloud OS.
  3. Cloud MapReduce doesn’t need to worry about disk I/O issues because all storage is effectively remote and is taken care of by the Cloud OS.

The Cloud MapReduce implementation is detailed in this ☞ paper (PDF).

These are very interesting details on how to build a scalable (probably also fault tolerant) solution.

NoSQL Protocols Are Important

The more mature NoSQL solutions grow, the more their developers realize the importance of the protocols they are using. And more and more NoSQL projects try not to repeat the LDAP protocol history.

I’d say that the flagship NoSQL projects that understood the benefits of protocol simplicity are CouchDB, the relaxed document database, and SimpleDB, Amazon’s key-value store, both of which look like they were built on the web and for the web (note: as one of the MyNoSQL readers correctly pointed out, SimpleDB’s use of HTTP is quite incorrect though). But they are definitely not the only ones.

Riak, the decentralized key-value store, is also using JSON over HTTP. Not only that, but the Basho team, producers of Riak, recently decided to drop their custom protocol ☞ Jiak completely.

Terrastore, the consistent, partitioned and elastic document database, despite being quite young, did its homework and debuted as HTTP/JSON friendly.

Neo4j, the graph database, has recently added a RESTful interface which, even if not available in the Neo4j 1.0 release, makes it accessible to a new range of programming languages.

There are some NoSQL solutions that are still using custom protocols. Redis has defined its own protocol, but made sure to keep it “easy to parse by a computer and easy to parse by a human”. Redis also got some help from 3rd party tools/libraries to make it even more accessible through HTTP/JSON: RedBottle, a REST app for Redis and Sikwamic, a Redis over HTTP library.
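To give a feeling for how simple such a protocol can be, here is a minimal sketch of encoding a command in Redis’s multi-bulk request format: an argument count, followed by each argument prefixed with its byte length. The function name `encode_command` is hypothetical, and the sketch only builds the bytes; it does not open a connection or parse replies.

```python
def encode_command(*args: str) -> bytes:
    """Encode a command as a Redis multi-bulk request.

    Layout: "*<argument count>\r\n", then for every argument
    "$<byte length>\r\n<argument bytes>\r\n".
    """
    parts = [b"*" + str(len(args)).encode("ascii") + b"\r\n"]
    for arg in args:
        data = arg.encode("utf-8")
        # The length prefix counts bytes, not characters.
        parts.append(b"$" + str(len(data)).encode("ascii") + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


if __name__ == "__main__":
    # Printing the bytes shows that the request stays readable for a human.
    print(encode_command("SET", "key", "value"))
```

Every field is either a length or a payload, so a parser never has to scan for delimiters inside values, yet the request can still be read by eye.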

GT.M, a NoSQL solution you can learn more about from the Introduction to GT.M and M/DB or from two talks at FOSDEM (GT.M and OpenStreetMap, and MDB and MDBX: Open Source SimpleDB Projects based on GTM), has also realized the importance of the protocol and is now introducing ☞ M/Wire, which was inspired by the simplicity of the Redis protocol.

MongoDB is another example of a NoSQL store that uses a custom wire protocol. While the MongoDB ecosystem already includes a lot of libraries, I’d really love to see Kristina’s ☞ Sleepy.Mongoose moving forward (nb: Kristina, I’m also pretty sure that Sleepy.Mongoose can get much nicer RESTful URIs too ;-) ).

And the story can go on and on, but the lesson to be learned should be quite obvious: the simpler and easier your protocol is, the more accessible your data will be and the easier it will be for the community to come up with (innovative) projects and libraries. The NoSQL libraries page should give you a feeling of which NoSQL solutions are using simple protocols and which are not.

Update: I received a hint from Mathias Meyer (@roidrage) that BSON, the binary JSON serialization used by MongoDB, has a new ☞ home.

Practical tips for using SimpleDB

After loading 1 billion records into Amazon SimpleDB, you definitely learn some tips & tricks about it:

  • use multiple domains
  • don’t burst your writes (note: it doesn’t need to be SimpleDB to have issues with bursting writes)
  • if service is failing, then back-off (note: not sure which one came first, but Twitter API documentation describes the same retry strategy)
  • make sure you understand the service API
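
Two of the tips above, spreading writes over multiple domains and backing off when the service is failing, can be sketched as follows. This is only an illustration under my own assumptions, not code from the original article: `domain_for` and `with_backoff` are hypothetical helpers, the `records` prefix is made up, and `operation` stands for whatever call your SimpleDB client library exposes.

```python
import hashlib
import random
import time


def domain_for(item_name: str, domain_count: int, prefix: str = "records") -> str:
    """Pick a domain by hashing the item name, spreading writes across domains."""
    digest = hashlib.sha256(item_name.encode("utf-8")).hexdigest()
    return f"{prefix}_{int(digest, 16) % domain_count}"


def with_backoff(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry `operation`, waiting exponentially longer (with jitter) after each failure.

    Catching every Exception keeps the sketch short; real code would only
    retry the throttling and service-unavailable errors of its client library.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            # Wait a random time of up to base_delay * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Hashing the item name keeps the mapping stable, so reads can find the domain an item was written to without a lookup table, while the random jitter keeps many clients from retrying at the same moment.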


Top 10 Reasons to Get More Information about SimpleDB

I am not really sure where ☞ this old article resurfaced from, but I think it is a good thing that others are ☞ still reacting and showing how wrong it can be:

1. Data integrity is not guaranteed

There are two and a half points I’d like to make. First, constraints are usually the first to go because they’re costly. Costly to implement and costly at runtime. Especially when the system is being designed with the ability to run on multiple machines. […]

3. Aggregate operations will require more coding.

If Ryan really wanted to make an argument about aggregates, the best thing would be to go on about how a non-RDBMS requires you to know what type of aggregates you’ll want up front and then do insert time calculations for these values. While that will work just fine, it makes ad-hoc queries harder.

4. Complicated reports, and ad hoc queries, will require a lot more coding.

As far as I can tell, the argument is that SQL makes complex reports easy even though it still might take hundreds of lines to get the data required. And the other thing that’s not mentioned, these reports can still take a substantial amount of time to generate.

5. Aggregate operations will be much slower if you don’t use an RDBMS.

I’ll just point out that there are non-RDBMS systems that provide aggregate functionality and anything that uses a b+tree probably uses binary search.

8. Relational databases are scalable, even with massive data sets.

I don’t have a better response than the commenter jackson on the original blog post. Once an RDBMS is scaled to multiple machines, lots of the benefits are nullified and you’re dealing with the same issues that the non-RDBMS folks are.

9. Super-scalability is overrated. Slowing the pace of your product development is even worse.

[…] But in reality, the issue isn’t adding the hundredth node to a system, it’s adding the second. […]
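
The insert-time aggregation mentioned in the response to point 3 could look roughly like the sketch below. It uses an in-memory stand-in rather than any real SimpleDB API, and the class `InsertTimeAggregates` with its methods is hypothetical; the point is only that the aggregates are decided up front and maintained on every write.

```python
from collections import defaultdict


class InsertTimeAggregates:
    """Keeps a per-group count and sum up to date as items are written."""

    def __init__(self):
        self.items = {}                  # item name -> (group, amount)
        self.count = defaultdict(int)    # group -> number of items
        self.total = defaultdict(float)  # group -> sum of amounts

    def put(self, item_name, group, amount):
        # Undo the old contribution first so overwrites are not counted twice.
        previous = self.items.get(item_name)
        if previous is not None:
            old_group, old_amount = previous
            self.count[old_group] -= 1
            self.total[old_group] -= old_amount
        self.items[item_name] = (group, amount)
        self.count[group] += 1
        self.total[group] += amount

    def average(self, group):
        """Return the group average without scanning items, or None if empty."""
        if self.count[group] == 0:
            return None
        return self.total[group] / self.count[group]
```

Reading an aggregate is cheap here, but any aggregate that was not planned for (a median, say) would still need a full scan, which is exactly the ad-hoc query cost the response describes.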

And there is also a set of “hmmm… are you serious?” points… (Paul Davis is a bit more “polite” about them).

2. Inconsistency will provide a terrible user experience

6. Data import, export, and backup will be slow and difficult.

7. SimpleDB isn’t that fast.

10. SimpleDB is useful, but only in certain contexts.

Now, I guess the only “excuse” would be that at the time of the article, there was no MyNoSQL to get informed about NoSQL solutions.

Challenges of a Hybrid Solution: Oracle - SimpleDB

I have covered some hybrid solutions before, most of them involving traditional databases “tweaked” to get rid of unnecessary constraints. This is so far the only hybrid solution I’ve read about that combines a NoSQL store with an RDBMS. Sid Anand (@r39132), Netflix cloud engineer, has a series of articles covering the challenges his team faced while working on this Oracle/SimpleDB hybrid solution:

The challenges can be summarized in several parts:

  1. Pulling data out of Oracle Efficiently
  2. Solving the Oracle-SimpleDB Eventual Consistency Problem
  3. Defining the SimpleDB-Oracle translation

After reading the articles I still have some unanswered questions:

  • the first phase of data migration is still unclear.

    My understanding is that there is a secondary process going over the existing records and “updating” them so that triggers are activated.

  • how does the SimpleDB to Oracle synchronization work?

  • part 3, covering the feature mismatch between SimpleDB and Oracle, does not address all of the aspects presented:

    • Triggers
    • Stored Procedures
    • Constraints (e.g. integrity, foreign key, unique, etc…)
    • Sequences
    • Sequences used as Primary Keys
    • Locks
    • Tables without Primary Keys or Unique Keys or both
    • Relationships between tables

The part I found most interesting was the one about the “simple” algorithm used for ensuring eventual consistency. And in the same piece, something to note:

Without the anticipated Amazon API, we cannot build an eventually-consistent Hybrid system optimized for AP (i.e. from CAP theorem). We would have had to rely on dual-writes, defeating our goal to be highly-available.

Introducing the Oracle-SimpleDB Hybrid

I have been building an eventually-consistent, multi-master data store at Netflix. This system is comprised of an Oracle replica and several SimpleDB replicas.

I think that loading 1 Billion Rows into Amazon SimpleDB was part of the experiment.