


aws: All content tagged as aws in NoSQL databases and polyglot persistence

Amazon Redshift - Now Broadly Available

Jeff Barr:

We announced Amazon Redshift, our fast and powerful, fully managed, petabyte-scale data warehouse service, late last year (see my earlier blog post for more info).


We’ve designed Amazon Redshift to be cost-effective, easy to use, and flexible.


  1. Who is the ideal Redshift user? I assume it should be AWS users that already have data in the Amazon cloud. Otherwise I have a hard time imagining trucks carrying tons of hard drives into Amazon data centers.
  2. What happens if, for some reason, you decide to move your data out of Redshift? How would that work?
  3. What is the next move and counter-argument of Greenplum, Netezza, Vertica, etc. to Redshift?

Original title and link: Amazon Redshift - Now Broadly Available (NoSQL database©myNoSQL)


Deploying Riak on EC2 - What to Pick?

Deepak Bala sharing his recommendations for running Riak on EC2 based on his own experience:

There are a couple of problems to field when deploying Riak.

  1. The EC2 instances that are provisioned by default change the following on restart:

    • Private IP address
    • Public IP address
    • Private DNS
    • Public DNS
  2. Performance and durability: EBS volumes provide stable, durable storage, while ephemeral storage offers better, more predictable performance at the cost of losing data on restarts.
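One common workaround for the changing public addresses is to re-associate an Elastic IP with the node after each restart. A minimal sketch, assuming a boto3-style EC2 client and placeholder instance/allocation IDs (all names here are hypothetical):

```python
# Sketch: keep a stable public address on a Riak node by re-associating an
# Elastic IP after a restart. In practice `ec2` would be built with
# boto3.client("ec2") and configured AWS credentials; it is injected here
# so the helper stays testable.
def pin_elastic_ip(ec2, instance_id, allocation_id):
    # An Elastic IP gives the node a public IP/DNS name that survives
    # stop/start cycles. Note the private address can still change, which
    # is why Riak nodes are usually named after internal DNS entries.
    resp = ec2.associate_address(InstanceId=instance_id,
                                 AllocationId=allocation_id)
    return resp["AssociationId"]
```

This only stabilizes the public side of the addressing problem; the private IP/DNS issue is typically handled by naming nodes after DNS entries you control rather than raw addresses.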

Original title and link: Deploying Riak on EC2 - What to Pick? (NoSQL database©myNoSQL)


Deep Dive Into Amazon ElastiCache

Harish Ganesan published an in-depth article about Amazon ElastiCache covering:

  1. Connection overhead of ElastiCache's per-TCP-client-connection buffer approach
  2. Possible solutions for dealing with an elastic Amazon ElastiCache cluster (nb: memcached nodes are not cluster aware)
  3. Auto discovery (just recently added by the AWS team as a patch to the spymemcached Java client)
  4. ElastiCache node types
  5. Memory allocation and eviction policies
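Since memcached nodes are not cluster aware, spreading keys across an elastic set of nodes is the client's job, usually via consistent hashing so that adding or removing a node remaps only a fraction of the keys. A minimal sketch of such a hash ring (node names are hypothetical; real clients such as spymemcached ship their own ring implementations):

```python
# Client-side consistent hashing: each node is placed at many points on a
# ring; a key is served by the first node clockwise from its hash.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, replicas=100):
        self.ring = {}          # ring position -> node name
        self.sorted_keys = []   # sorted ring positions
        for node in nodes:
            for i in range(replicas):
                pos = self._hash(f"{node}:{i}")
                self.ring[pos] = node
                bisect.insort(self.sorted_keys, pos)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Find the first ring position at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect(self.sorted_keys, self._hash(key))
        idx %= len(self.sorted_keys)
        return self.ring[self.sorted_keys[idx]]
```

With `replicas` virtual nodes per server, removing one node redistributes only that node's keys instead of reshuffling the entire keyspace.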

Original title and link: Deep Dive Into Amazon ElastiCache (NoSQL database©myNoSQL)


The Architecture of a Credit Card Analysis Platform: Using Project Voldemort, Elastic MapReduce, Pangool

Ivan de Prado and Pere Ferrera on

The solution we developed has an infrastructure cost of just a few thousands of dollars per month thanks to the use of the cloud (AWS), Hadoop and Voldemort.


This is one of the few projects outside LinkedIn that I know of that uses Project Voldemort. Plus, the Voldemort backend storage is configured to use BerkeleyDB.

Original title and link: The Architecture of a Credit Card Analysis Platform: Using Project Voldemort, Elastic MapReduce, Pangool (NoSQL database©myNoSQL)


The New EC2 High Storage Instance Family

The High Storage Eight Extra Large (hs1.8xlarge) instances are a great fit for applications that require high storage depth and high sequential I/O performance. Each instance includes 117 GiB of RAM, 16 virtual cores (providing 35 ECU of compute performance), and 48 TB of instance storage across 24 hard disk drives capable of delivering up to 2.4 GB per second of I/O performance.

This is local, ephemeral storage, so from a data storage perspective it should be used only with redundant, highly available databases (e.g. Riak).
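The quoted figures work out to commodity hard disks doing sequential I/O, which is worth checking:

```python
# Sanity check of the hs1.8xlarge storage numbers quoted above.
drives = 24
tb_per_drive = 48 / drives                   # 48 TB total across 24 drives
mb_s_per_drive = 2.4 * 1000 / drives         # 2.4 GB/s aggregate, in MB/s

print(tb_per_drive)    # 2.0 TB per drive
print(mb_s_per_drive)  # 100.0 MB/s per drive: sequential HDD territory
```

100 MB/s per spindle is what a single 7200 RPM drive delivers on sequential reads, so the 2.4 GB/s figure assumes all 24 drives streaming in parallel, not random I/O.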

P.S.: I get the feeling Jeff Darcy will be happy reading this post.

Original title and link: The New EC2 High Storage Instance Family (NoSQL database©myNoSQL)


RiakCS Multi-Datacentre Redundancy

RiakCS, the Riak-based multi-tenant, distributed, S3-compatible cloud storage solution from Basho, now supports multi-datacenter replication:

RiakCS has two data replication options for cloud administrators: full sync and real-time sync. Full sync copies data from a primary RiakCS store to a secondary site at a frequency of administrators’ choosing, though the default is six hours. The secondary data stores regularly ask the primary datastore whether anything has changed and, if it has, they will update their own data to bring it in line.

Real-time sync, meanwhile, triggers when a person requests information from a RiakCS pile of data. If they are requesting from a secondary site, the database will check with the primary to see if anything has changed and update accordingly, while if they are requesting data from the primary, there’s no wait.

The naming of the second sync solution as "real-time" sounds strange¹. I'd probably call it sync-on-read.

  1. My first reaction was: "there's no way the Basho guys implemented a 2PC or even a Paxos algorithm for syncing in real time, so what is this?"
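The sync-on-read behavior described above can be sketched as a toy model (this is not Basho's implementation; all names and the version-number scheme are made up for illustration):

```python
# Toy model of "real-time" (sync-on-read) replication: on each read from a
# secondary site, check the primary's version and pull the object only if
# the local copy is missing or stale.
class Site:
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def put(self, key, version, value):
        self.data[key] = (version, value)

    def get_local(self, key):
        return self.data.get(key)

def sync_on_read(secondary, primary, key):
    local = secondary.get_local(key)
    remote = primary.get_local(key)
    if remote is None:
        return None
    # Pull from the primary only when the secondary is behind.
    if local is None or local[0] < remote[0]:
        secondary.put(key, *remote)
    return secondary.get_local(key)[1]
```

Reads from the primary skip the check entirely, which matches the "no wait" behavior in the quote.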

Original title and link: RiakCS Multi-Datacentre Redundancy (NoSQL database©myNoSQL)


Amazon EBS, SSD, and Rackspace IOPS Per Dollar

Staying on the subject of IOPS in the cloud, Jeff Darcy did some testing with GlusterFS against Amazon EBS, Amazon SSD, Storm on Demand SSD, and Rackspace instance storage, and computed IOPS/$ for each:

  • Amazon EBS: 1000 IOPS (provisioned) for $225/month or 4.4 IOPS/$ (server not included)
  • Amazon SSD: 4300 IOPS for $4464/month or 1.0 IOPS/$ (that's pathetic)
  • Storm on Demand SSD: 5500 IOPS for $590/month or 9.3 IOPS/$
  • Rackspace instance storage: 3400 IOPS for $692/month (8GB instances) or 4.9 IOPS/$
  • Rackspace with 4x block storage per server: 9600 IOPS for $811/month or 11.8 IOPS/$ (hypothetical, assuming CPU or network don’t become bottlenecks)
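The quoted IOPS-per-dollar figures are easy to recompute from the raw numbers:

```python
# Recomputing the IOPS/$ figures quoted above from the listed IOPS and
# monthly prices.
offerings = {
    "Amazon EBS (provisioned)":   (1000, 225),
    "Amazon SSD":                 (4300, 4464),
    "Storm on Demand SSD":        (5500, 590),
    "Rackspace instance storage": (3400, 692),
    "Rackspace 4x block storage": (9600, 811),
}
for name, (iops, dollars_per_month) in offerings.items():
    print(f"{name}: {iops / dollars_per_month:.1f} IOPS/$")
```

All five come out as listed, including the 1.0 IOPS/$ for Amazon's SSD instances.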

Original title and link: Amazon EBS, SSD, and Rackspace IOPS Per Dollar (NoSQL database©myNoSQL)


Benchmarking EC2 I/O: An Extensive Analysis by Scalyr

Way too much to be learned from this fantastic post.

EC2 I/O Performance

Original title and link: Benchmarking EC2 I/O: An Extensive Analysis by Scalyr (NoSQL database©myNoSQL)


YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak

Put together by the team at Altoros Systems Inc., this time run on Amazon EC2 and including Cassandra, HBase, MongoDB, MySQL Cluster, sharded MySQL, and Riak:

After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.

Teaser: HBase got the best results in most of the benchmarks (with flush turned off though). And I’m not sure the setup included the latest HBase read improvements from Facebook.

Original title and link: YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak (NoSQL database©myNoSQL)


Redis and Memcached Benchmark on Amazon Cloud

Garantia Data, providers of Redis and Memcached as-a-Service-in-the-Amazon-Cloud, published the results of a throughput and latency benchmark for different AWS deployment models:

The first thing we looked at when putting together our benchmark was the various architectural alternatives we wanted to compare. Users typically choose the most economical AWS instance based on the initial size estimate of their dataset; however, it's crucial to also keep in mind that other AWS users might share the same physical server that runs your data (as nicely explained by Adrian Cockcroft here). This is especially true if you have a small-to-medium dataset, because instances between m1.small and m1.large are much more likely to be shared on a physical server than large instances like m2.2xlarge and m2.4xlarge, which typically run on a dedicated physical server. Your "neighbours" may become "noisy" once they start consuming excess I/O and CPU resources from your physical server. In addition, small-to-medium instances are by nature weaker in processing power than large instances.

Only two comments:

  1. it’s not clear if there were multiple instances of Redis used per machine when the chosen instances had multi-cores
  2. I would have really liked to also have a pricing comparison in the conclusion section

Original title and link: Redis and Memcached Benchmark on Amazon Cloud (NoSQL database©myNoSQL)


Qubole: New On-Demand Hadoop Service by Hive Creators

Derrick Harris for GigaOm:

Two key members of the Facebook team that created the Hadoop query language Hive are launching their own big data startup called Qubole on Thursday. […] Qubole is also optimized to run on cloud-based resources that typically don’t offer performance on a par with their physical counterparts. Thusoo said the product incorporates a specially-designed cache system that lets queries run five times faster than traditional Hadoop jobs in the cloud, and users have the option to change the types of instances their jobs are running on if the situation requires.

Running on Amazon infrastructure.

Original title and link: Qubole: New On-Demand Hadoop Service by Hive Creators (NoSQL database©myNoSQL)


Notes on the Hadoop and HBase Markets

Curt Monash shares what he heard from his customers:

  • Over half of Cloudera's customers (nb: 100 subscription customers) use HBase
  • Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
  • There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.

Original title and link: Notes on the Hadoop and HBase Markets (NoSQL database©myNoSQL)