SimpleDB: All content tagged as SimpleDB in NoSQL databases and polyglot persistence
I wanted to share this before the weekend is over: Jesse Wolgamott’s video: “Battle of NoSQL starts: Amazon’s SDB vs MongoDB vs CouchDB vs RavenDB” from September’s Lone Star Ruby Conference.
You can download the video from the Confreaks site for watching offline.
Original title and link: Amazon SimpleDB, MongoDB, CouchDB, and RavenDB Compared (NoSQL databases © myNoSQL)
A while ago, Sid Anand wrote a series of posts on the challenges of a hybrid Oracle - Amazon SimpleDB solution. This has now become a paper, which offers a much better organized and more detailed view of Netflix’s transition to a hybrid Oracle - Amazon Web Services (SimpleDB, S3) architecture.
Go read the ☞ paper if one of these applies:
- interested in Amazon SimpleDB and SimpleDB best practices
- interested in running an on-premise and cloud hybrid architecture
- interested in architecting a multi data source system
Original title and link: Paper: Netflix’s Transition to High-Availability Storage Systems (NoSQL databases © myNoSQL)
The Changelog guys have ☞ published the audio recording from the NoSQL smackdown at SXSW. On stage we had Stu Hood (Cassandra), Jan Lehnardt (CouchDB) and Wynn Netherland (The Changelog) and they were quickly joined by Werner Vogels (CTO Amazon).
Werner was definitely the salt and pepper of the discussion, followed closely by Jan Lehnardt. Below you can find my notes (nb quotes are reproduced from memory so they may not be exact):
Werner Vogels: There are two reasons for replication:
- gaining fault tolerance
- getting a higher level of concurrency, which results in higher throughput (read or write)
Replication leads to decisions related to how writes should be implemented. […] So the whole consistency models is abstractions from the implementation leaking up.
Werner Vogels: The Dynamo system is not user-friendly. S3 is much better.
Werner Vogels: List operations are complicated with Dynamo.
Stu Hood: Cassandra has an advantage as it doesn’t use hashing, being closer to BigTable.
Werner Vogels: You shouldn’t run your own database. These times are passed.
This somewhat contradicts what Werner said a bit later during the NoSQL smackdown show:
Werner Vogels: If you look at your applications, for each of your requirements you’ll find a dedicated solution.
Jan Lehnardt had a few great arguments against the “don’t run your own database” position:
Cloud is awesome as long as you are connected.
Plus the web is not meant to live in silos which is what Google, Amazon and a couple of others are proposing.
Jonathan Ellis has a post on some reasons why the cloud may not be the best place for your data.
Jan Lehnardt: Having an HTTP-based database means that you don’t need all that crap in the middle.
This sounds very similar to our post “NoSQL protocols are important”, and it re-emphasizes that CouchDB can change the architecture of your next web app.
Even if the recording quality is not great, it’s still highly entertaining.
Update: Some videos of the event are being published. You can watch them below
Ricky Ho has two great articles on how MapReduce is implemented by Hadoop and Cloud MapReduce:
Cloud MapReduce enjoys the inherent scalability and resiliency, which greatly simplifies its architecture.
- Cloud MapReduce doesn’t need to design central coordinator components (like the NameNode and JobTracker in the Hadoop environment). It simply stores the job progress status information in the distributed metadata store (SimpleDB).
- Cloud MapReduce doesn’t need to worry about scalability in the communication path and how data can be moved efficiently between nodes; all of that is taken care of by the underlying Cloud OS.
- Cloud MapReduce doesn’t need to worry about disk I/O issues because all storage is effectively remote and is taken care of by the Cloud OS.
Cloud MapReduce implementation is detailed in this ☞ paper (PDF).
These are very interesting details on how to build a scalable (probably also fault tolerant) solution.
The more mature the NoSQL solutions grow, the more they realize the importance of the protocols they are using. And more and more NoSQL projects try not to repeat the history of the LDAP protocol.
I’d say that the flagship NoSQL projects that understood the benefits of protocol simplicity are CouchDB, the relaxed document database, and SimpleDB, Amazon’s key-value store. Both look as if they were built on the web and for the web (note: as one of the MyNoSQL readers correctly pointed out, SimpleDB’s use of HTTP is quite incorrect, though). But they are definitely not the only ones.
Riak, the decentralized key-value store, is also using JSON over HTTP. Not only that, but the Basho team, the makers of Riak, recently decided to drop their custom protocol ☞ Jiak completely.
Terrastore, the consistent, partitioned and elastic document database, is quite young, but it did its homework and debuted as HTTP/JSON friendly.
Neo4j, the graph database, has recently added a RESTful interface. Even though it is not available in the Neo4j 1.0 release, it makes Neo4j accessible to a new range of programming languages.
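To make the “built on the web” point concrete, here is a minimal sketch of how little client code an HTTP/JSON store needs. It assumes a CouchDB-style document URL (`PUT /database/doc_id`) on a local server; the database name, document id, and document body are hypothetical, and the request is only built, never sent.

```python
import json
import urllib.request


def build_put_document(base_url, database, doc_id, document):
    """Build (but do not send) an HTTP PUT that stores a JSON document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{database}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical database and document, used only for illustration.
request = build_put_document(
    "http://localhost:5984",
    "articles",
    "nosql-protocols",
    {"title": "NoSQL protocols are important", "tags": ["http", "json"]},
)
print(request.get_method(), request.full_url)
```

Nothing here is specific to a driver: any language with an HTTP client and a JSON encoder can do the same, which is the accessibility argument in a nutshell.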
There are some NoSQL solutions that are still using custom protocols. Redis has defined its own protocol, but made sure to keep it “easy to parse by a computer and easy to parse by a human”. Redis also got some help from 3rd party tools/libraries to make it even more accessible through HTTP/JSON: RedBottle, a REST app for Redis and Sikwamic, a Redis over HTTP library.
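As an illustration of the “easy to parse” claim, the sketch below encodes a command in the Redis multi-bulk request format (an argument count, then each argument prefixed with its byte length) and decodes the single-line reply types. The function names are my own and this is not a client library, just enough to show the shape of the wire format.

```python
def encode_command(*args):
    """Encode a command in the Redis multi-bulk request format."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode("utf-8")
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def decode_simple_reply(raw):
    """Decode a single-line reply: status (+), integer (:), or error (-)."""
    kind = raw[:1]
    payload = raw[1:].rstrip(b"\r\n").decode("utf-8")
    if kind == b"+":
        return payload
    if kind == b":":
        return int(payload)
    if kind == b"-":
        raise RuntimeError(payload)
    raise ValueError("not a single-line reply")


wire = encode_command("SET", "mykey", "myvalue")
print(wire)  # b'*3\r\n$3\r\nSET\r\n$5\r\nmykey\r\n$7\r\nmyvalue\r\n'
```

Both directions fit in a couple dozen lines, which goes a long way toward explaining why Redis has clients in so many languages despite not speaking HTTP.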
GT.M has also realized the importance of the protocol and is now introducing ☞ M/Wire, which was inspired by the simplicity of the Redis protocol. You can learn more about GT.M from the Introduction to GT.M and M/DB or from these two talks at FOSDEM: “GT.M and OpenStreetMap” and “MDB and MDBX: Open Source SimpleDB Projects based on GTM”.
MongoDB is another example of a NoSQL storage that uses a custom wire protocol. While the MongoDB ecosystem already includes a lot of libraries, I’d really love to see Kristina’s ☞ Sleepy.Mongoose moving forward (nb: Kristina, I’m also pretty sure that Sleepy.Mongoose can get much nicer RESTful URIs too ;-) ).
And the story can go on and on, but the lesson to be learned should be quite obvious: the simpler and easier your protocol is, the more accessible your data will be and the easier it will be for the community to come up with (innovative) projects and libraries. The NoSQL libraries page should give you a feeling for which NoSQL solutions are using simple protocols and which are not.
1. Data integrity is not guaranteed
There are two and a half points I’d like to make. First, constraints are usually the first to go because they’re costly. Costly to implement and costly at runtime. Especially when the system is being designed with the ability to run on multiple machines. […]
3. Aggregate operations will require more coding.
If Ryan really wanted to make an argument about aggregates, the best thing would be to go on about how a non-RDBMS requires you to know what type of aggregates you’ll want up front and then do insert time calculations for these values. While that will work just fine, it makes ad-hoc queries harder.
4. Complicated reports, and ad hoc queries, will require a lot more coding.
As far as I can tell, the argument is that SQL makes complex reports easy even though it still might take hundreds of lines to get the data required. And the other thing that’s not mentioned, these reports can still take a substantial amount of time to generate.
5. Aggregate operations will be much slower if you don’t use an RDBMS.
I’ll just point out that there are non-RDBMS systems that provide aggregate functionality and anything that uses a b+tree probably uses binary search.
8. Relational databases are scalable, even with massive data sets.
I don’t have a better response than the commenter jackson on the original blog post. Once an RDBMS is scaled to multiple machines, lots of the benefits are nullified and you’re dealing with the same issues that the non-RDBMS folks are.
9. Super-scalability is overrated. Slowing the pace of your product development is even worse.
[…] But in reality, the issue isn’t adding the hundredth node to a system, it’s adding the second. […]
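The response to point 3 above mentions computing aggregates at insert time. A minimal sketch of that idea follows, using an in-memory dictionary as a stand-in for a non-relational store; the `OrderStore` class and its fields are hypothetical and not tied to SimpleDB or any other product.

```python
from collections import defaultdict


class OrderStore:
    """Toy key-value store that maintains per-customer aggregates on insert."""

    def __init__(self):
        self.orders = {}
        # Aggregates chosen up front: order count and total amount.
        self.totals = defaultdict(lambda: {"count": 0, "sum": 0.0})

    def insert(self, order_id, customer, amount):
        if order_id in self.orders:
            raise KeyError(f"order {order_id} already exists")
        self.orders[order_id] = {"customer": customer, "amount": amount}
        totals = self.totals[customer]
        totals["count"] += 1
        totals["sum"] += amount

    def average(self, customer):
        """Answered from the precomputed totals, without scanning orders."""
        totals = self.totals[customer]
        return totals["sum"] / totals["count"] if totals["count"] else 0.0

    def largest_order(self, customer):
        """An ad-hoc question nobody planned for: requires a full scan."""
        amounts = [o["amount"] for o in self.orders.values()
                   if o["customer"] == customer]
        return max(amounts, default=None)


store = OrderStore()
store.insert("o1", "alice", 10)
store.insert("o2", "alice", 30)
print(store.average("alice"))        # 20.0
print(store.largest_order("alice"))  # 30
```

The average is cheap because it was anticipated; the largest order was not, so it falls back to scanning every record. That asymmetry is the “ad-hoc queries get harder” trade-off in miniature.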
And there is also a set of “hmmm… are you serious?” points… (Paul Davis is a bit more “polite” about them).
2. Inconsistency will provide a terrible user experience
6. Data import, export, and backup will be slow and difficult.
7. SimpleDB isn’t that fast.
10. SimpleDB is useful, but only in certain contexts.
Now, I guess the only “excuse” would be that at the time of the article, there was no MyNoSQL to get informed about NoSQL solutions.
I have covered some hybrid solutions before, most of them involving “tweaked” traditional databases that get rid of unnecessary constraints, so this is so far the only hybrid solution I’ve read about that involves both a NoSQL storage and an RDBMS. Sid Anand (@r39132), Netflix cloud engineer, has a series of articles covering the challenges the team there faced while working on this Oracle/SimpleDB hybrid NoSQL solution:
The challenges can be summarized in several parts:
- Pulling data out of Oracle Efficiently
- Solving the Oracle-SimpleDB Eventual Consistency Problem
- Defining the SimpleDB-Oracle translation
After reading the articles I still have some unanswered questions:
The first phase of the data migration is still unclear.
My understanding is that there is a secondary process going over the existing records and “updating” them so that triggers are activated.
How does the SimpleDB to Oracle synchronization work?
Part 3, covering the feature mismatch between SimpleDB and Oracle, does not address all the aspects it presents:
- Stored Procedures
- Constraints (e.g. integrity, foreign key, unique, etc…)
- Sequences used as Primary Keys
- Tables without Primary Keys or Unique Keys or both
- Relationships between tables
The part I found most interesting was the one about the “simple” algorithm used for ensuring eventual consistency. And in the same piece, something to note:
Without the anticipated Amazon API, we cannot build an eventually-consistent Hybrid system optimized for AP (i.e. from CAP theorem). We would have had to rely on dual-writes, defeating our goal to be highly-available.
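The worry about dual-writes in the quote above can be illustrated with a toy model. To be clear, this is not Netflix’s algorithm: the `Store` and `LoggedWriter` classes are hypothetical in-memory stand-ins, and the sketch only contrasts a synchronous dual write with a write that is logged and replicated later.

```python
class Store:
    """In-memory stand-in for a database that can be marked unavailable."""

    def __init__(self):
        self.data = {}
        self.available = True

    def put(self, key, value):
        if not self.available:
            raise ConnectionError("store unavailable")
        self.data[key] = value


def dual_write(primary, replica, key, value):
    """Synchronous dual write: the call fails if either store is down."""
    primary.put(key, value)
    replica.put(key, value)


class LoggedWriter:
    """Write to the primary, queue the change, replicate it later."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        self.pending = []

    def write(self, key, value):
        self.primary.put(key, value)
        self.pending.append((key, value))

    def replicate(self):
        """Apply queued changes in order; stop at the first failure."""
        while self.pending:
            key, value = self.pending[0]
            try:
                self.replica.put(key, value)
            except ConnectionError:
                return False
            self.pending.pop(0)
        return True


oracle, simpledb = Store(), Store()
writer = LoggedWriter(oracle, simpledb)

simpledb.available = False
writer.write("movie:1", "queued")    # accepted even though the replica is down
caught_up_early = writer.replicate() # False: the change stays queued

simpledb.available = True
caught_up_late = writer.replicate()  # True: the replica has converged
```

With `dual_write`, an outage of either store rejects the write (and can leave the primary updated while the replica is not). With the logged variant, writes depend only on the primary and the replica becomes consistent eventually, which is the property the quoted passage says dual-writes would give up.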
-  ☞ Introducing the Oracle-SimpleDB Hybrid
-  ☞ Part1: Pulling data out of Oracle Efficiently
-  ☞ Part2: Solving the eventual consistency problem
-  ☞ Part 3: Defining the SimpleDB-Oracle Translation
-  The Beginning of an Interesting Friendship: MapReduce and RDBMS
-  Drizzle Replication: Opening the Doors to Hybrid Solutions
-  Bringing NoSQL to the people: Now Django