Amazon: All content tagged as Amazon in NoSQL databases and polyglot persistence
Thursday, 19 July 2012
Amazon Introduces High I/O SSD-backed EC2 Instances
Jeff Barr:
In order to meet this need, we are introducing a new family of EC2 instances that are designed to run low-latency, I/O-intensive applications, and are an exceptionally good host for NoSQL databases such as Cassandra and MongoDB.
Many complaints about running databases on EC2 instances were about the I/O. I guess Amazon has been hearing this loud and clear.
Specs of the new EC2 instance:
- 8 virtual cores (35 ECU)
- HVM and PVM virtualization.
- 60.5 GB of RAM.
- 10 Gigabit Ethernet connectivity with support for cluster placement groups.
- 2 TB of local SSD-backed storage, visible as a pair of 1 TB volumes.
Original title and link: Amazon Introduces High I/O SSD-backed EC2 Instances (©myNoSQL)
via: http://aws.typepad.com/aws/2012/07/new-high-io-ec2-instance-type-hi14xlarge.html
Monday, 16 July 2012
From S3 to CouchDB and Redis and Then Half Way Back for Serving Ads
The story of going from S3 to CouchDB and Redis and then back to S3 and Redis for ad serving:
The solution to this situation has a touch of irony. With Redis in place, we replaced CouchDB for placement- and ad-data with S3. Since we weren’t using any CouchDB-specific features, we simply published all the documents to S3 buckets instead. We still did the Redis cache warming upfront and data updates in the background. So by decoupling the application from the persistence layer using Redis, we also removed the need for a super fast database backend. We didn’t care that S3 is slower than a local CouchDB, since we updated everything asynchronously.
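The pattern described in the quote (warm the cache up front, serve requests only from the cache, refresh from the slow store in the background) can be sketched roughly as follows. This is a minimal illustration, not the Adcloud implementation: `SlowStore` and `AdCache` are hypothetical in-memory stand-ins for an S3 bucket and for Redis, and the refresh interval is an arbitrary assumption.

```python
import threading


class SlowStore:
    """Hypothetical stand-in for a slow document store such as an S3 bucket."""

    def __init__(self, documents):
        self._documents = dict(documents)

    def fetch_all(self):
        # In a real system this would be a slow network call.
        return dict(self._documents)

    def put(self, key, value):
        self._documents[key] = value


class AdCache:
    """Hypothetical stand-in for Redis: the only thing the request path reads."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def warm(self, store):
        # Load a full snapshot from the slow store and swap it in atomically.
        snapshot = store.fetch_all()
        with self._lock:
            self._data = snapshot

    def get(self, key):
        # Request path: never touches the slow store.
        with self._lock:
            return self._data.get(key)


def refresh_forever(cache, store, interval_seconds, stop_event):
    # Background updates: re-warm the cache until asked to stop.
    # Event.wait returns True once stop_event is set, which ends the loop.
    while not stop_event.wait(interval_seconds):
        cache.warm(store)
```

Because requests only ever read from the cache, the latency of the backing store affects how fresh the data is, not how fast an ad is served, which is why swapping CouchDB for S3 did not hurt in the setup described above.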
Besides the detailed blog post, there is also a slide deck.
via: http://dev.adcloud.com/blog/2012/07/13/nosql-not-only-a-fairy-tale/
Tuesday, 10 July 2012
The Behavior of EC2/EBS Metadata Replicated Datastore
The Amazon post about the service disruption that happened late last month provides an interesting description of the behavior of the Amazon EC2 and EBS metadata datastores:
The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. These datastores are used to store metadata for resources such as instances, volumes, and snapshots. To protect against datastore corruption, currently when the primary copy loses power, the system automatically flips to a read-only mode in the other Availability Zones until power is restored to the affected Availability Zone or until we determine it is safe to promote another copy to primary.
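The behavior described in the quote can be pictured with a toy model. This is only a sketch of the stated rule (primary loses power, the remaining copies serve reads but reject writes until power returns or another copy is promoted); the class and method names are hypothetical, replication is reduced to a single shared dictionary, and nothing here reflects how Amazon actually implements it.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while the store is in read-only mode."""


class ReplicatedMetadataStore:
    """Toy model of a multi-zone datastore with a single primary copy."""

    def __init__(self, zones, primary):
        if primary not in zones:
            raise ValueError("primary must be one of the zones")
        self.zones = list(zones)
        self.primary = primary
        self.read_only = False
        self._data = {}  # simplification: all copies share one dict

    def primary_lost_power(self):
        # Protect against corruption: stop accepting writes everywhere.
        self.read_only = True

    def power_restored(self):
        # The original primary is back, so writes are safe again.
        self.read_only = False

    def promote(self, zone):
        # Operators decided it is safe to make another copy the primary.
        if zone not in self.zones:
            raise ValueError("unknown zone")
        self.primary = zone
        self.read_only = False

    def read(self, key):
        # Reads keep working in read-only mode.
        return self._data.get(key)

    def write(self, key, value):
        if self.read_only:
            raise ReadOnlyError("datastore is in read-only mode")
        self._data[key] = value
```

In this model, describe-style calls keep succeeding during the outage while anything that mutates metadata fails until `power_restored()` or `promote()` is called.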
Monday, 9 July 2012
Google Cloud Platform Is the Biggest Deal in IT Since Amazon Launched EC2
Remember what I wrote in the state of the Hadoop market about having a second option for on-demand cloud-based Hadoop services? Benjamin Black compares Google Cloud Platform with Amazon services:
- Cloud Engine is a lot like EC2 & EBS
- Cloud Storage is a lot like S3
- Cloud SQL is a lot like RDS
- Analytics can be used like CloudWatch (and I know of people putting billions of their own data points in Analytics)
- BigQuery has no AWS equivalent, but maybe you could build it with EMR?
- PageSpeed has no AWS equivalent
Hadoop and MapR are already listed as possible use cases for Google Cloud Platform.
I don’t think I could write a better conclusion than Black did in his post:
This is big, planetary scale infrastructure. This is cloud legitimized and super-sized. In the words of the prophet: Shit just got real.
via: http://blog.b3k.us/2012/07/04/cloud-independence-day.html
Friday, 15 June 2012
Qubole: New On-Demand Hadoop Service by Hive Creators
Derrick Harris for GigaOm:
Two key members of the Facebook team that created the Hadoop query language Hive are launching their own big data startup called Qubole on Thursday. […] Qubole is also optimized to run on cloud-based resources that typically don’t offer performance on a par with their physical counterparts. Thusoo said the product incorporates a specially-designed cache system that lets queries run five times faster than traditional Hadoop jobs in the cloud, and users have the option to change the types of instances their jobs are running on if the situation requires.
Running on Amazon infrastructure.
via: http://gigaom.com/cloud/exclusive-the-brains-behind-hive-launch-on-demand-hadoop-service/
MapR Hadoop Distribution on Amazon Elastic MapReduce
Another very interesting piece of news for the Hadoop space, this time coming from Amazon and MapR, who announced support for the MapR Hadoop distribution on Amazon Elastic MapReduce:
MapR introduces enterprise-focused features for Hadoop such as high availability, data snapshotting, cluster mirroring across AZs, and NFS mounts. Combined with Amazon Elastic MapReduce’s managed Hadoop environment, seamless integration with other AWS services, and hourly pricing with no upfront fees or long-term commitments, Amazon EMR with the MapR Distribution for Hadoop offers customers a powerful tool for generating insights from their data.
Following the logic of the Amazon Relational Database Service, which started with MySQL, the most popular open source database, and then added support for the commercial but also very popular Oracle and SQL Server, what does this announcement tell us? Either Amazon has received a lot of requests for MapR, or some very big AWS customers have mentioned MapR in their talks with Amazon. I go with the second option.
Pricing for Hadoop Support: Cloudera, Hortonworks, MapR
Found the following bits in a post on The Register by Timothy Prickett Morgan:
While Cloudera and MapR are charging $4,000 per node for their enterprise-class Hadoop distributions (including their proprietary extensions and tech support), Hortonworks doesn’t have any proprietary extensions and is living off of the support contracts for the HDP 1.0 stack. […] Hortonworks is not providing its full list price, but for a starter ten-node cluster, you can get a standard support contract for $12,000 per year.
Hortonworks’s pricing looks a bit aggressive, but this could be explained by the fact that Hortonworks Data Platform 1.0 was made available only this week.
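To put the quoted numbers side by side, here is a quick back-of-the-envelope calculation. It assumes the $4,000 per node figure covers the same period as the $12,000 per year starter contract (the quote does not say so explicitly), and it ignores that the Cloudera and MapR price also includes proprietary extensions.

```python
def per_node_cost(total_cost, nodes):
    """Cost of a support contract divided evenly across the cluster nodes."""
    return total_cost / nodes


# Figures from the quoted article.
hortonworks_per_node = per_node_cost(12_000, 10)  # ten-node starter cluster
cloudera_mapr_per_node = 4_000                    # assumed to be per node per year

# How many times more the per-node list price is for Cloudera/MapR.
ratio = cloudera_mapr_per_node / hortonworks_per_node

print(f"Hortonworks: ${hortonworks_per_node:,.0f} per node")  # $1,200 per node
print(f"Cloudera/MapR: ${cloudera_mapr_per_node:,} per node")
print(f"Ratio: {ratio:.1f}x")
```

Under those assumptions the Hortonworks starter contract works out to $1,200 per node, roughly a third of the per-node price quoted for Cloudera and MapR.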
For running Hadoop in the cloud, there’s also Amazon Elastic MapReduce, whose pricing was always clear. And Amazon has recently announced support for the MapR Hadoop distribution on Elastic MapReduce.
Friday, 11 May 2012
Calculating the Cost of Storing PHP Sessions Using Amazon DynamoDB
Aside from nominal data storage and data transfer fees, the costs associated with using Amazon DynamoDB are calculated based on provisioned throughput capacity and item size (see the Amazon DynamoDB pricing details). Throughput is measured in units of Read Capacity and Write Capacity. Ultimately, the throughput and costs required for your sessions table is going to be based on your website traffic, but the following is a list of the capacity units required for each session-related operation with the assumption that your sessions are less than 1KB in size:
Reading via session_start():
- With locking enabled: 1 unit of Write Capacity + 1 unit of Write Capacity for each time it must retry acquiring the lock
- With locking disabled: 1 unit of Read Capacity (or 0.5 units of Read Capacity if consistent reads are disabled)
Writing via session_write_close(): 1 unit of Write Capacity
Deleting via session_destroy(): 1 unit of Write Capacity
Garbage Collecting via DynamoDBSessionHandler::garbage_collect(): 0.5 units of Read Capacity per KB of data in the sessions table + 1 unit of Write Capacity per expired item
Nice translation of PHP function calls to effective Amazon DynamoDB capacity units.
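The per-operation rules quoted above are easy to turn into a small estimator. The sketch below is written in Python rather than PHP and only encodes the quoted rules for sessions under 1 KB; the `SessionTraffic` type and the function names are made up for illustration, and the result is a count of capacity units consumed, not a price.

```python
from dataclasses import dataclass


@dataclass
class SessionTraffic:
    """Hypothetical operation counts for some period of website traffic."""
    reads: int         # calls to session_start()
    lock_retries: int  # extra attempts to acquire the session lock
    writes: int        # calls to session_write_close()
    deletes: int       # calls to session_destroy()


def capacity_units(traffic, locking=True, consistent_reads=True):
    """Return (read_units, write_units) consumed, assuming sessions under 1 KB."""
    read_units = 0.0
    write_units = 0.0
    if locking:
        # With locking, each read costs 1 write unit, plus 1 per lock retry.
        write_units += traffic.reads + traffic.lock_retries
    else:
        # Without locking, a read costs 1 read unit, or 0.5 if
        # consistent reads are disabled.
        read_units += traffic.reads * (1.0 if consistent_reads else 0.5)
    # Writing and deleting a session each cost 1 write unit.
    write_units += traffic.writes + traffic.deletes
    return read_units, write_units


def garbage_collect_units(table_size_kb, expired_items):
    """Return (read_units, write_units) for one garbage collection pass."""
    # 0.5 read units per KB in the table, 1 write unit per expired item.
    return 0.5 * table_size_kb, float(expired_items)
```

For example, 100 session reads with 5 lock retries, 100 writes, and 10 deletes consume 215 write units with locking enabled; with locking and consistent reads both disabled, the same traffic consumes 50 read units and 110 write units.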
via: http://aws.typepad.com/aws/2012/04/scalable-session-handling-in-php-using-amazon-dynamodb.html
Friday, 6 April 2012
What Are the Pros and Cons of Running Cloudera’s Distribution for Hadoop vs Amazon Elastic MapReduce Service?
Old Quora question, but still very relevant. Top response from Jeff Hammerbacher:
Elastic MapReduce Pros:
- Dynamic MapReduce cluster sizing.
- Ease of use for simple jobs via their proprietary web console.
- Great documentation.
- Integrates nicely with other Amazon Web Services.
Cloudera Distribution for Hadoop:
- CDH is open source; you have access to the source code and can inspect it for debugging purposes and make modifications as required.
- CDH can be run on a number of public or private clouds using an open source framework, Whirr, so you’re not tied to a single cloud provider
- With CDH, you can move your cluster to dedicated hardware with little disruption when the economics make sense. Most non-trivial applications will benefit from this move.
- CDH packages a number of open source projects that are not included with EMR: Sqoop, Flume, HBase, Oozie, ZooKeeper, Avro, and Hue. You have access to the complete platform composed of data collection, storage, and processing tools.
- CDH packages a number of critical bug fixes and features and the most recent stable releases, so you’re usually using a more stable and feature-rich product.
- You can purchase support and management tools for CDH via Cloudera Enterprise.
- CDH uses the open source Oozie framework for workflow management. EMR implemented a proprietary “job flow” system before major Hadoop users standardized on Oozie for workload management.
- CDH uses the open source Hue framework for its user interface. If you require new features from your web interface, you can easily implement them using the Hue SDK.
- CDH includes a number of integrations with other software components of the data management stack, including Talend, Informatica, Netezza, Teradata, Greenplum, Microstrategy, and others. […]
- CDH has been designed and deployed in common Linux environments and you can use standard tools to debug your programs. […]
Make sure you also read Hadoop in the Cloud: Pros and Cons which addresses (almost) the same question.
A Twitter-style answer to this question would be: “Control and customization vs Automated and Managed Service”. 80 characters left to add your own perspective.
DynamoDB Libraries, Mappers, and Mock Implementations
A list of DynamoDB libraries covering quite a few popular languages and frameworks:

A couple of things I’ve noticed (and that could be helpful to other NoSQL database companies):
- Amazon provides official libraries for several major programming languages (Java, .NET, PHP, Ruby)
- Amazon is not shy about promoting libraries that are not official but have established themselves as good libraries (e.g. Python’s Boto)
- The list doesn’t seem to include anything for C and Objective-C (Objective-C is the language of iOS and Mac apps)
Monday, 2 April 2012
The Total Cost of (Non) Ownership of a NoSQL Database Service
The Amazon team released a whitepaper comparing the total cost of ownership for 3 scenarios:
- on-premise NoSQL database
- NoSQL database deployed on Amazon EC2 and Amazon EBS
- Amazon DynamoDB

As you can imagine, DynamoDB comes out as the most cost-effective solution (79% more cost-effective than an on-premise NoSQL database and 61% more cost-effective than a NoSQL database hosted on AWS).
Wednesday, 28 March 2012
Basho Announces Riak-Based Multi-Tenant, Distributed, S3-Compatible Cloud Storage Platform
Coverage of the announcement of a new product from Basho, Riak CS, a multi-tenant, distributed, S3-compatible cloud storage platform:
- Klint Finley got the scoop: NoSQL Company Basho Unveils New Cloud Storage Software
- PR announcement
- Barb Darrow for GigaOm: Basho arms would-be Amazon killers with AWS-compatible storage
- Sudheer Raju for Tools Journal: Riak CS From Basho Enables Enterprise Cloud Storage
- Joe Brockmeier for RWW: Cloud Storage Competition Heats Up With RiakCS
- Liam Eagle for thewhir: Basho Launches Riak CS Cloud Storage Platform, Aims at Service Providers
My notes about Riak CS will follow shortly.