aws: All content tagged as aws in NoSQL databases and polyglot persistence

Amazon EBS, SSD, and Rackspace IOPS Per Dollar

Staying on the subject of IOPS in the cloud, Jeff Darcy did some testing with GlusterFS against Amazon EBS, Amazon SSD, Storm on Demand SSD, and Rackspace instance storage, and computed the IOPS/$ for each:

  • Amazon EBS: 1000 IOPS (provisioned) for $225/month or 4.4 IOPS/$ (server not included)
  • Amazon SSD: 4300 IOPS for $4464/month or 1.0 IOPS/$ (that’s pathetic)
  • Storm on Demand SSD: 5500 IOPS for $590/month or 9.3 IOPS/$
  • Rackspace instance storage: 3400 IOPS for $692/month (8GB instances) or 4.9 IOPS/$
  • Rackspace with 4x block storage per server: 9600 IOPS for $811/month or 11.8 IOPS/$ (hypothetical, assuming CPU or network don’t become bottlenecks)
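The IOPS/$ figures above are simply the measured IOPS divided by the monthly price. A minimal Python sketch of that arithmetic follows, using only the numbers quoted in the list; the names `OFFERINGS`, `iops_per_dollar`, and `ranked` are made up for this illustration and do not come from Jeff Darcy's post.

```python
# (label, IOPS, monthly cost in USD), copied from the list above.
OFFERINGS = [
    ("Amazon EBS (provisioned)", 1000, 225),
    ("Amazon SSD", 4300, 4464),
    ("Storm on Demand SSD", 5500, 590),
    ("Rackspace instance storage", 3400, 692),
    ("Rackspace with 4x block storage", 9600, 811),
]


def iops_per_dollar(iops, monthly_cost):
    """Return IOPS delivered per dollar of monthly spend."""
    return iops / monthly_cost


def ranked(offerings):
    """Return (label, IOPS/$ rounded to one decimal), best value first."""
    rows = [
        (label, round(iops_per_dollar(iops, cost), 1))
        for label, iops, cost in offerings
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for label, ratio in ranked(OFFERINGS):
        print(f"{label}: {ratio} IOPS/$")
```

Sorting by the ratio puts the hypothetical Rackspace 4x block storage setup first and Amazon SSD last, matching the numbers in the list.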

Original title and link: Amazon EBS, SSD, and Rackspace IOPS Per Dollar (NoSQL database©myNoSQL)

via: http://pl.atyp.us/wordpress/index.php/2012/10/rackspace-block-storage/


Benchmarking EC2 I/O: An Extensive Analysis by Scalyr

There is a lot to be learned from this fantastic post.

EC2 I/O Performance


via: http://blog.scalyr.com/2012/10/16/a-systematic-look-at-ec2-io/


YCSB Benchmark Results for Cassandra, HBase, MongoDB, MySQL Cluster, and Riak

Put together by the team at Altoros Systems Inc., this benchmark was run on Amazon EC2 and includes Cassandra, HBase, MongoDB, MySQL Cluster, sharded MySQL, and Riak:

After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.

Teaser: HBase got the best results in most of the benchmarks (with flush turned off though). And I’m not sure the setup included the latest HBase read improvements from Facebook.


via: http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html&pagename=/news/tech/2012/102212-nosql-263595.html&pageurl=http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html&site=printpage&nsdr=n


Redis and Memcached Benchmark on Amazon Cloud

Garantia Data, providers of Redis and Memcached as-a-Service-in-the-Amazon-Cloud, published the results of a throughput and latency benchmark for different AWS deployment models:

The first thing we looked at when putting together our benchmark was the various architectural alternatives we wanted to compare. Users typically choose the most economical AWS instance based on the initial size estimate of their dataset, however, it’s crucial to also keep in mind that other AWS users might share the same physical server that runs your data (as nicely explained by Adrian Cockcroft here). This is especially true if you have a small-to-medium dataset, because instances between m1.small and m1.large are much more likely to be shared on a physical server than large instances like m2.2xlarge and m2.4xlarge, which typically run on a dedicated physical server. Your “neighbours” may become “noisy” once they start consuming excess I/O and CPU resources from your physical server. In addition, small-to-medium instances are by nature weaker in processing power than large instances.

Only two comments:

  1. it’s not clear whether multiple Redis instances were run per machine when the chosen instances had multiple cores
  2. I would have really liked to also have a pricing comparison in the conclusion section


via: https://garantiadata.com/blog/its-true-even-modest-datasets-can-enjoy-the-speediest-performance


Qubole: New On-Demand Hadoop Service by Hive Creators

Derrick Harris for GigaOm:

Two key members of the Facebook team that created the Hadoop query language Hive are launching their own big data startup called Qubole on Thursday. […] Qubole is also optimized to run on cloud-based resources that typically don’t offer performance on a par with their physical counterparts. Thusoo said the product incorporates a specially-designed cache system that lets queries run five times faster than traditional Hadoop jobs in the cloud, and users have the option to change the types of instances their jobs are running on if the situation requires.

Qubole runs on Amazon infrastructure.


via: http://gigaom.com/cloud/exclusive-the-brains-behind-hive-launch-on-demand-hadoop-service/


Notes on the Hadoop and HBase Markets

Curt Monash shares what he heard from his customers:

  • Over half of Cloudera’s customers (note: 100 subscription customers) use HBase.
  • Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
  • There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.


via: http://www.dbms2.com/2012/04/24/notes-on-the-hadoop-and-hbase-markets/


What Are the Pros and Cons of Running Cloudera’s Distribution for Hadoop vs Amazon Elastic MapReduce Service?

Old Quora question, but still very relevant. Top response from Jeff Hammerbacher:

Elastic MapReduce Pros:

  • Dynamic MapReduce cluster sizing.
  • Ease of use for simple jobs via their proprietary web console.
  • Great documentation.
  • Integrates nicely with other Amazon Web Services.

Cloudera Distribution for Hadoop Pros:

  • CDH is open source; you have access to the source code and can inspect it for debugging purposes and make modifications as required.
  • CDH can be run on a number of public or private clouds using an open source framework, Whirr, so you’re not tied to a single cloud provider
  • With CDH, you can move your cluster to dedicated hardware with little disruption when the economics make sense. Most non-trivial applications will benefit from this move.
  • CDH packages a number of open source projects that are not included with EMR: Sqoop, Flume, HBase, Oozie, ZooKeeper, Avro, and Hue. You have access to the complete platform composed of data collection, storage, and processing tools.
  • CDH packages a number of critical bug fixes and features and the most recent stable releases, so you’re usually using a more stable and feature-rich product.
  • You can purchase support and management tools for CDH via Cloudera Enterprise.
  • CDH uses the open source Oozie framework for workflow management. EMR implemented a proprietary “job flow” system before major Hadoop users standardized on Oozie for workload management.
  • CDH uses the open source Hue framework for its user interface. If you require new features from your web interface, you can easily implement them using the Hue SDK.
  • CDH includes a number of integrations with other software components of the data management stack, including Talend, Informatica, Netezza, Teradata, Greenplum, Microstrategy, and others. […]
  • CDH has been designed and deployed in common Linux environments and you can use standard tools to debug your programs. […]

Make sure you also read Hadoop in the Cloud: Pros and Cons which addresses (almost) the same question.

A Twitter-style answer to this question would be: “Control and customization vs Automated and Managed Service”. 80 characters left to add your own perspective.



The Total Cost of (Non) Ownership of a NoSQL Database Service

The Amazon team released a whitepaper comparing the total cost of ownership for 3 scenarios:

  1. on-premise NoSQL database
  2. NoSQL database deployed on Amazon EC2 and Amazon EBS
  3. Amazon DynamoDB

The Total Cost of Ownership of a NoSQL Database service

As you can imagine, DynamoDB comes out as the most cost-effective solution (79% more cost-effective than an on-premise NoSQL database and 61% more cost-effective than an AWS-hosted NoSQL database). Read or download the paper after the break.


Wordnik: Migrating From a Monolithic Platform to Micro Services

The story of how Wordnik changed a monolithic platform to one based on Micro Services and the implications at the data layer (MongoDB):

To address this, we made a significant architectural shift. We have split our application stack into something called Micro Services — a term that I first heard from the folks at Netflix. […] This translates to the data tier as well. We have low cost servers, and they work extremely well when they stay relatively small. Make them too big and things can go sour, quickly. So from the data tier, each service gets its own data cluster. This keeps services extremely focused, compact, and fast — there’s almost no fear that some other consumer of a shared data tier is going to perform some ridiculously slow operation which craters the runtime performance. Have you ever seen what happens when a BI tool is pointed at the runtime database? This is no different.


via: http://blog.wordnik.com/with-software-small-is-the-new-big


The Design of 99designs - A Clean Tens of Millions Pageviews Architecture

By pure coincidence, General Chicken just published on High Scalability a bullet point summary of the 99designs architecture I linked to and commented on earlier.



MongoDB vs MySQL: A DevOps point of view

Pierre Bailet and Mathieu Poumeyrol of fotopedia (a French photo site) share their experience of operating a small MongoDB cluster since September 2009, compared to operating a MySQL cluster.

Some details about fotopedia:

  • fotopedia is 100% on AWS
  • Amazon RDS for MySQL
  • 4-node MongoDB cluster
  • 150 million photo views

MongoDB advantages:

  • no alter table
  • background index creation
  • data backup & restoration
    • note: as far as I can tell MySQL is able to do the same
  • replica sets
  • hardware migration
    • note: the same procedure can be used for MySQL

Before leaving you with the slides, here is an interesting accepted trade-off:

Quietly losing seconds of writes is preferable to:

  • weekly minutes-long maintenance periods
  • minutes-long unscheduled downtime and manual failover in case of hardware failures


Thoughts on SimpleDB, DynamoDB and Cassandra

Adrian Cockcroft:

So the lesson here is that for a first step into NoSQL, we went with a hosted solution so that we didn’t have to build a team of experts to run it, and we didn’t have to decide in advance how much scale we needed. Starting again from scratch today, I would probably go with DynamoDB. It’s a low “friction” and developer friendly solution.

You can look at this in two ways: 1) a biased opinion of someone who has already bet on Amazon with the infrastructure of a multi-billion business; 2) the opinion of someone who has accumulated a ton of experience in the NoSQL space and who is successfully[1] running the infrastructure of a multi-billion business on NoSQL solutions. I’d strongly suggest you think of it as the latter.


  1. Netflix was one of the few companies that continued to operate during Amazon’s EBS major failure. 


via: http://perfcap.blogspot.com/2012/01/thoughts-on-simpledb-dynamodb-and.html