


AWS: All content tagged as AWS in NoSQL databases and polyglot persistence

Reliable, Scalable, and Kinda Sorta Cheap: A Cloud Hosting Architecture for MongoDB

Using MongoDB replica sets:

At Famigo, we house all of our valuable data in MongoDB and we also serve all requests from Amazon EC2 instances. We’ve devoted many mental CPU cycles to finding the right architecture for our data in the cloud, focusing on 3 main factors: cost, reliability, and performance.

Original title and link: Reliable, Scalable, and Kinda Sorta Cheap: A Cloud Hosting Architecture for MongoDB (NoSQL database©myNoSQL)


MongoDB and Amazon: Why EBS?

After I linked to MongoDB in the Amazon cloud, MongoDB and EC2, and the older MongoDB on Amazon EC2 with EBS volumes, Arnout Kazemier commented:

The only thing I dislike about that EC2 guide is that it’s suggesting to use EBS instead of the regular EC2 instance storage

This is an apt question in light of the prolonged Amazon outage, Reddit’s experience with EBS, the unpredictable EBS performance, and Netflix’s Adrian Cockcroft’s explanation of the impact of multi-tenancy on Amazon EBS performance. Maybe someone could answer it.

Original title and link: MongoDB and Amazon: Why EBS? (NoSQL database©myNoSQL)

Setting Up MongoDB Replica Sets on Amazon EC2

Zachary Witte:

When you have the instance basically set, go back into the AWS control panel, right click the instance and choose Create Image. You can start up any number of these for the replica set, but you need to change the /etc/hostname and /etc/hosts file to reflect the individual IP address and hostname of the bot (db1, db2, db3, etc.)
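The per-host bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not Zachary Witte's procedure: the private IP addresses, the set name `rs0`, and the helper `replica_set_plan` are all assumptions made for the example. The config document follows the shape MongoDB expects when a replica set is initiated (an `_id` plus a `members` list of `{_id, host}` entries).

```python
import json


def replica_set_plan(set_name, hosts, port=27017):
    """Build /etc/hosts lines and a replica set config document.

    hosts maps a hostname (db1, db2, ...) to that instance's IP address.
    """
    # One "<ip><TAB><hostname>" line per member, to append to /etc/hosts.
    hosts_lines = [f"{ip}\t{name}" for name, ip in hosts.items()]
    # Config document in the shape passed when initiating a replica set.
    config = {
        "_id": set_name,
        "members": [
            {"_id": index, "host": f"{name}:{port}"}
            for index, name in enumerate(hosts)
        ],
    }
    return hosts_lines, config


# Hypothetical addresses; replace with the instances' real private IPs.
EXAMPLE_HOSTS = {"db1": "10.0.0.11", "db2": "10.0.0.12", "db3": "10.0.0.13"}

if __name__ == "__main__":
    lines, config = replica_set_plan("rs0", EXAMPLE_HOSTS)
    print("\n".join(lines))
    print(json.dumps(config, indent=2))
```

Referring to members by hostname rather than by raw IP means only /etc/hosts needs updating if an instance comes back with a new address.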

Before you set up MongoDB on EC2, make sure you understand the various aspects of running MongoDB in the Amazon cloud.

Original title and link: Setting Up MongoDB Replica Sets on Amazon EC2 (NoSQL database©myNoSQL)


Hadoop Chaos Monkey: The Fault Injection Framework

Do you remember the 5 lessons Netflix learned while using Amazon Web Services, where they talked about the Chaos Monkey? Judging by how much Netflix has shared about their experience in the cloud, including Amazon SimpleDB, I’d say these 5 are only the tip of the iceberg.

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.

Hadoop provides a similar framework, the Fault Injection Framework:

The idea of fault injection is fairly simple: it is an infusion of errors and exceptions into an application’s logic to achieve a higher coverage and fault tolerance of the system. Different implementations of this idea are available today. Hadoop’s FI framework is built on top of Aspect Oriented Paradigm (AOP) implemented by AspectJ toolkit.

As a sidenote, this is one of the neatest uses of AspectJ I’ve read about.
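The fault-injection idea quoted above can be illustrated without AspectJ. The sketch below is not Hadoop's framework: it swaps the aspect weaving for a plain Python decorator, and the names `inject_faults`, `InjectedFault`, and `call_with_retries` are invented for the example. The principle is the same: wrap a call site, fail it on purpose, and check that the caller copes.

```python
import functools
import random


class InjectedFault(Exception):
    """Raised by the injector instead of running the wrapped call."""


def inject_faults(probability, rng=None):
    """Decorator that makes the wrapped call fail with the given probability."""
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0, 1): 0.0 never fails, 1.0 always fails.
            if rng.random() < probability:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_retries(func, attempts):
    """Call func until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as error:
            last_error = error
    raise last_error


@inject_faults(probability=0.3, rng=random.Random(42))
def read_block():
    """Stand-in for an operation that should survive occasional failures."""
    return "block-data"


if __name__ == "__main__":
    print(call_with_retries(read_block, attempts=20))
```

Passing a seeded `random.Random` keeps a failing test run reproducible, which matters more for fault injection than for most tests.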

Update: Abhijit Belapurkar says that Fault injection using AOP was part of Recovery Oriented Computing research at Stanford/UCB many years ago: JAGR: An Autonomous Self-Recovering Application Server.

Original title and link: Hadoop Chaos Monkey: The Fault Injection Framework (NoSQL database©myNoSQL)

Building an Ad Network Ready for Failure

The architecture of a fault-tolerant ad network built on top of HAProxy, Apache with mod_wsgi and Python, Redis, a bit of PostgreSQL and ActiveMQ deployed on AWS:

The real workhorse of our ad targeting platform was Redis. Each box slaved from a master Redis, and on failure of the master (which happened once), a couple “slaveof” calls got us back on track after the creation of a new master. A combination of set unions/intersections with algorithmically updated targeting parameters (this is where experimentation in our setup was useful) gave us a 1 round-trip ad targeting call for arbitrary targeting parameters. The 1 round-trip thing may not seem important, but our internal latency was dominated by network round-trips in EC2. The targeting was similar in concept to the search engine example I described last year, but had quite a bit more thought regarding ad targeting. It relied on the fact that you can write to Redis slaves without affecting the master or other slaves. Cute and effective. On the Python side of things, I optimized the redis-py client we were using for a 2-3x speedup in network IO for the ad targeting results.
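The "set unions/intersections" approach mentioned above can be sketched with plain Python sets standing in for Redis sets. This is an assumption-laden illustration, not the author's implementation: the index layout, the dimension names, and the function `target_ads` are hypothetical. The rule it shows is the usual one for this kind of targeting: union the sets for acceptable values within a dimension, then intersect across dimensions.

```python
def target_ads(index, criteria):
    """Return the ad ids that match every targeting dimension.

    index maps (dimension, value) to the set of ad ids targeting that value.
    criteria maps a dimension to the list of values the request accepts.
    """
    result = None
    for dimension, values in criteria.items():
        # Union: an ad matches the dimension if it targets any accepted value.
        matched = set()
        for value in values:
            matched |= index.get((dimension, value), set())
        # Intersection: an ad must match every dimension in the request.
        result = matched if result is None else result & matched
    # With no criteria at all, this sketch returns no ads.
    return result if result is not None else set()


# Hypothetical targeting index.
EXAMPLE_INDEX = {
    ("country", "us"): {1, 2, 3},
    ("country", "ca"): {3, 4},
    ("device", "mobile"): {2, 3, 4},
    ("device", "desktop"): {1},
}

if __name__ == "__main__":
    request = {"country": ["us", "ca"], "device": ["mobile"]}
    print(sorted(target_ads(EXAMPLE_INDEX, request)))
```

Against a real Redis instance the same unions and intersections would run server-side, which is what makes a single round-trip possible.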

Original title and link: Building an Ad Network Ready for Failure (NoSQL database©myNoSQL)


HBase on EC2 using EBS volumes : Lessons Learned

There lies the answer! We have a requirement of recreating the cluster in case we accidentally delete entire data or if we lose our master. In such a case the reliable backup can only be taken if your HDFS data does not reside on the root devices. A reliable backup of the root device cannot be taken without rebooting the device. Furthermore it’s stored as an AMI, which means you have to create a new AMI every day and delete the old one. This means to solve all of our problems we need HBase installation and data both stored on attached EBS volumes that are not the root devices.

Update: after reading the post both Bradford Stephens[1] and Andrew Purtell[2] recommended using instance store instead of EBS:

EBS adds complexity, failure risk, and cost

  1. CEO of Drawn to Scale  

  2. Systems architect and HBase committer, @akpurtell  

Original title and link: HBase on EC2 using EBS volumes : Lessons Learned (NoSQL databases © myNoSQL)


Membase on Amazon EC2 with EBS

The decision was made and we decided to go with a 2 server solution, each server has 16G of memory and 100G of EBS volume attached to it.

Both will have membase latest stable version installed and perform as a cluster in case one falls or anything happens, a fail safe if you will.

In this post, I will walk you through what was done to perform this and how exactly it was done on the Amazon cloud.

Wouldn’t it be easier if there were an always up-to-date official Membase AMI and a corresponding guide (making sure important details about EBS are not left out)?

Original title and link: Membase on Amazon EC2 with EBS (NoSQL databases © myNoSQL)


Neo4j REST Server Image in Amazon EC2

OpenCredo created it, Jussi Heinonen shares the details:

[Image: Neo4j EC2 components]

Original title and link: Neo4j REST Server Image in Amazon EC2 (NoSQL databases © myNoSQL)


MongoDB on EC2

The basic setup:

MongoDB on EC2

The advanced guide can be found in the MongoDB in the Amazon cloud post.

Original title and link: MongoDB on EC2 (NoSQL databases © myNoSQL)


A Rake Task for Backing Up a MongoDB Database

Daniel Doubrovkine:

I tried mongodump and mongorestore. Those are straightforward tools that let you export and import Mongo data (Mongo people did their job very well there, much less hassle than with a traditional RDBMS where you have to backup the database, deal with the transaction log, bla bla bla). All is well when working with local machines. Remotely, you need to go the extra step of figuring out the database address, username and password. This gets messier with Heroku and eventually starts smelling bad.

I want to do this the “Rails Way” by invoking a single rake command that imports and exports Mongo data in any of my environments

So he wrote a Rake task for backing up MongoDB to Amazon S3.
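The per-environment bookkeeping that makes remote dumps tedious can be sketched as follows. This is not Daniel Doubrovkine's Rake task, and it is in Python rather than Ruby: the helpers `mongodump_command` and `backup_key` and the S3 key layout are assumptions for illustration. The `mongodump` options used (`--host`, `--port`, `--db`, `--out`, `--username`, `--password`) are standard ones.

```python
from datetime import datetime


def mongodump_command(host, port, db, out_dir, username=None, password=None):
    """Build the mongodump argument list for one environment."""
    command = [
        "mongodump",
        "--host", host,
        "--port", str(port),
        "--db", db,
        "--out", out_dir,
    ]
    # Credentials are only needed for remote, authenticated databases.
    if username:
        command += ["--username", username]
    if password:
        command += ["--password", password]
    return command


def backup_key(environment, db, when):
    """Name the archive so backups sort by environment and time (hypothetical layout)."""
    return f"{environment}/{db}-{when:%Y%m%d-%H%M%S}.tar.gz"


if __name__ == "__main__":
    # To actually run the dump: subprocess.run(command, check=True)
    print(" ".join(mongodump_command("localhost", 27017, "app_development", "/tmp/dump")))
    print(backup_key("development", "app_development", datetime(2011, 5, 1, 12, 30, 0)))
```

Archiving the dump directory and uploading it to S3 are left out, since they depend on the client library in use.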

Original title and link: A Rake Task for Backing Up a MongoDB Database (NoSQL databases © myNoSQL)


Amazon EC2 Cassandra Cluster with DataStax AMI

This AMI does the following:

  • installs Cassandra 0.7.4 on an Ubuntu 10.10 image
  • configures ephemeral disks in RAID0, if applicable (EBS is a bad fit for Cassandra)
  • configures Cassandra to use the root volume for the commitlog and the ephemeral disks for data files
  • configures Cassandra to use the local interface for intra-cluster communication
  • configures all Cassandra nodes with the same seed for gossip discovery

Note the “EBS is a bad fit for Cassandra” remark. That’s what Adrian Cockcroft explains in Multi-tenancy and Cloud Storage Performance.

Original title and link: Amazon EC2 Cassandra Cluster with DataStax AMI (NoSQL databases © myNoSQL)


Multi-tenancy and Cloud Storage Performance

Adrian Cockcroft[1] has a great explanation of the impact of multi-tenancy on cloud storage performance. The connection with NoSQL databases lies not so much in the Amazon EBS and SSD Price, Performance, QoS comparison as in this remark about benchmarks:


If you ever see public benchmarks of AWS that only use m1.small, they are useless, it shows that the people running the benchmark either didn’t know what they were doing or are deliberately trying to make some other system look better. You cannot expect to get consistent measurements of a system that has a very high probability of multi-tenant interference.

  1. Adrian Cockcroft: Netflix, @adrianco  

Original title and link: Multi-tenancy and Cloud Storage Performance (NoSQL databases © myNoSQL)