Amazon: All content tagged as Amazon in NoSQL databases and polyglot persistence

Amazon Introduces High I/O SSD-backed EC2 Instances

Jeff Barr:

In order to meet this need, we are introducing a new family of EC2 instances[1] that are designed to run low-latency, I/O-intensive applications, and are an exceptionally good host for NoSQL databases such as Cassandra and MongoDB.

Many complaints about running databases on EC2 instances were about I/O. I guess Amazon has been hearing them loud and clear.

  1. Specs of the new EC2 instance:

    • 8 virtual cores (35 ECU)
    • HVM and PVM virtualization.
    • 60.5 GB of RAM.
    • 10 Gigabit Ethernet connectivity with support for cluster placement groups.
    • 2 TB of local SSD-backed storage, visible as a pair of 1 TB volumes.

Original title and link: Amazon Introduces High I/O SSD-backed EC2 Instances (NoSQL database©myNoSQL)


High Performance I/O Instances for Amazon EC2

Werner Vogels:

Databases are one particular area that for scaling can benefit tremendously from high performance I/O. The I/O requirements of database engines, regardless whether they are Relational or Non-Relational (NoSQL) DBMS’s can be very demanding. Increasingly randomized access, and burst IO through aggregation put strains on any IO subsystem, physical or virtual, attached or remote. One area where we have seen this particularly culminate is in modern NoSQL DBMSs that are often the core of scalable modern web applications that exhibit a great deal of random access patterns. They require high replication factors to get to the aggregate random IO they require. Early users of these High I/O instances have been able to reduce their replication factors significantly while achieving rock solid performance and substantially reducing their cost in the process.

That means going from around 100 IOPS for a 15K RPM spinning disk to over 100,000 IOPS for random reads and 10,000-85,000 IOPS for random writes with SSDs.
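To make the replication-factor argument concrete, here is a back-of-the-envelope sketch. It assumes an idealized setup that is not in either post: random reads spread evenly across replicas, one storage device per replica, and a hypothetical target of 50,000 aggregate random-read IOPS. The per-device figures are the rough ones quoted above.

```python
import math


def replicas_for_iops(target_iops: int, iops_per_replica: int) -> int:
    """Smallest replica count whose combined random-read IOPS meets the target."""
    if target_iops <= 0 or iops_per_replica <= 0:
        raise ValueError("IOPS figures must be positive")
    return math.ceil(target_iops / iops_per_replica)


TARGET_IOPS = 50_000   # hypothetical workload, chosen only for illustration
DISK_IOPS = 100        # ~15K RPM spinning disk, figure from the post
SSD_IOPS = 100_000     # High I/O instance random reads, figure from the post

disk_replicas = replicas_for_iops(TARGET_IOPS, DISK_IOPS)  # 500
ssd_replicas = replicas_for_iops(TARGET_IOPS, SSD_IOPS)    # 1

print(f"spinning disks: {disk_replicas} replicas, SSDs: {ssd_replicas} replica(s)")
```

In practice you would still keep more than one replica for durability and availability; the point is only that raw IOPS stops being the reason for a high replication factor.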

Original title and link: High Performance I/O Instances for Amazon EC2 (NoSQL database©myNoSQL)


From S3 to CouchDB and Redis and Then Half Way Back for Serving Ads

The story of going from S3 to CouchDB and Redis and then back to S3 and Redis for ad serving:

The solution to this situation has a touch of irony. With Redis in place, we replaced CouchDB for placement- and ad-data with S3. Since we weren’t using any CouchDB-specific features, we simply published all the documents to S3 buckets instead. We still did the Redis cache warming upfront and data updates in the background. So by decoupling the application from the persistence layer using Redis, we also removed the need for a super fast database backend. We didn’t care that S3 is slower than a local CouchDB, since we updated everything asynchronously.
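The pattern described in the quote (warm the cache upfront, refresh it in the background, never hit the slow store on the request path) can be sketched roughly as follows. Everything here is a stand-in of my own: `SlowStore` plays the role of the S3 buckets and `AdCache` the role of Redis, using plain dictionaries so the example runs without any services. It is not the authors' code.

```python
from typing import Dict, List, Optional


class SlowStore:
    """Hypothetical stand-in for the S3 buckets holding placement and ad documents."""

    def __init__(self, documents: Dict[str, str]) -> None:
        self._documents = dict(documents)
        self.reads = 0  # counts how often the slow backend is touched

    def keys(self) -> List[str]:
        return list(self._documents)

    def get(self, key: str) -> str:
        self.reads += 1
        return self._documents[key]

    def put(self, key: str, document: str) -> None:
        self._documents[key] = document


class AdCache:
    """Hypothetical stand-in for Redis: the only thing the request path reads."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def warm(self, store: SlowStore) -> None:
        """Load every document upfront, before serving traffic."""
        for key in store.keys():
            self._data[key] = store.get(key)

    def refresh(self, store: SlowStore, key: str) -> None:
        """Background update of one document; request latency is unaffected."""
        self._data[key] = store.get(key)

    def lookup(self, key: str) -> Optional[str]:
        """Request path: served from the cache, never from the slow store."""
        return self._data.get(key)


store = SlowStore({"placement:1": "ad-A", "placement:2": "ad-B"})
cache = AdCache()
cache.warm(store)
reads_after_warm = store.reads          # 2, one read per document

served = cache.lookup("placement:1")    # "ad-A", no store read involved

store.put("placement:1", "ad-C")        # publisher updates the document
cache.refresh(store, "placement:1")     # asynchronous refresh picks it up
```

Because lookups never touch `SlowStore`, its latency only affects how quickly updates become visible, which is why a slower backend was acceptable.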

Besides the detailed blog post, there’s also a slide deck.

Original title and link: From S3 to CouchDB and Redis and Then Half Way Back for Serving Ads (NoSQL database©myNoSQL)


The Behavior of EC2/EBS Metadata Replicated Datastore

The Amazon post about the service disruption that happened late last month provides an interesting description of the behavior of the Amazon EC2 and EBS metadata datastores:

The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. These datastores are used to store metadata for resources such as instances, volumes, and snapshots. To protect against datastore corruption, currently when the primary copy loses power, the system automatically flips to a read-only mode in the other Availability Zones until power is restored to the affected Availability Zone or until we determine it is safe to promote another copy to primary.
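Read as a state machine, the behavior in the quote looks roughly like the toy model below. The class, method, and zone names are mine and purely illustrative; Amazon has not published the implementation, so treat this as a sketch of the described behavior only.

```python
from enum import Enum
from typing import Dict, List, Optional


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"


class ReplicatedMetadataStore:
    """Toy model of a multi-AZ metadata datastore with a single primary copy."""

    def __init__(self, zones: List[str], primary: str) -> None:
        if primary not in zones:
            raise ValueError("primary must be one of the zones")
        self.zones = list(zones)
        self.primary = primary
        self.mode = Mode.READ_WRITE
        self._data: Dict[str, str] = {}

    def primary_lost_power(self) -> None:
        # Protect against corruption: remaining copies serve reads only.
        self.mode = Mode.READ_ONLY

    def power_restored(self) -> None:
        # The original primary is back, so writes can resume.
        self.mode = Mode.READ_WRITE

    def promote(self, zone: str) -> None:
        # Taken only once it is judged safe to make another copy the primary.
        if zone not in self.zones:
            raise ValueError("unknown zone")
        self.primary = zone
        self.mode = Mode.READ_WRITE

    def read(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def write(self, key: str, value: str) -> None:
        if self.mode is Mode.READ_ONLY:
            raise RuntimeError("datastore is read-only until the primary is safe")
        self._data[key] = value


meta = ReplicatedMetadataStore(["az-a", "az-b", "az-c"], primary="az-a")
meta.write("vol-1", "attached")
meta.primary_lost_power()
still_readable = meta.read("vol-1")   # "attached": reads keep working
```

The trade-off is visible in `write`: during the read-only window, API calls that mutate metadata fail, while describe-style calls keep working.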

Original title and link: The Behavior of EC2/EBS Metadata Replicated Datastore (NoSQL database©myNoSQL)


Google Cloud Platform Is the Biggest Deal in IT Since Amazon Launched EC2

Remember what I wrote in the state of the Hadoop market post about having a second option for on-demand cloud-based Hadoop services? Benjamin Black compares Google Cloud Platform with Amazon services:

  • Cloud Engine is a lot like EC2 & EBS
  • Cloud Storage is a lot like S3
  • Cloud SQL is a lot like RDS
  • Analytics can be used like CloudWatch (and I know of people putting billions of their own data points in Analytics)
  • BigQuery has no AWS equivalent, but maybe you could build it with EMR?
  • PageSpeed has no AWS equivalent

Hadoop and MapR are already listed as possible use cases for Google Cloud Platform.

I don’t think I could write a better conclusion than Black did in his post:

This is big, planetary scale infrastructure. This is cloud legitimized and super-sized. In the words of the prophet: Shit just got real.

Original title and link: Google Cloud Platform Is the Biggest Deal in IT Since Amazon Launched EC2 (NoSQL database©myNoSQL)


Qubole: New On-Demand Hadoop Service by Hive Creators

Derrick Harris for GigaOm:

Two key members of the Facebook team that created the Hadoop query language Hive are launching their own big data startup called Qubole on Thursday. […] Qubole is also optimized to run on cloud-based resources that typically don’t offer performance on a par with their physical counterparts. Thusoo said the product incorporates a specially-designed cache system that lets queries run five times faster than traditional Hadoop jobs in the cloud, and users have the option to change the types of instances their jobs are running on if the situation requires.

Running on Amazon infrastructure.

Original title and link: Qubole: New On-Demand Hadoop Service by Hive Creators (NoSQL database©myNoSQL)


MapR Hadoop Distribution on Amazon Elastic MapReduce

Another very interesting piece of news for the Hadoop space, this time coming from Amazon and MapR, who are announcing support for the MapR Hadoop distribution on Amazon Elastic MapReduce:

MapR introduces enterprise-focused features for Hadoop such as high availability, data snapshotting, cluster mirroring across AZs, and NFS mounts. Combined with Amazon Elastic MapReduce’s managed Hadoop environment, seamless integration with other AWS services, and hourly pricing with no upfront fees or long-term commitments, Amazon EMR with the MapR Distribution for Hadoop offers customers a powerful tool for generating insights from their data.

Following the logic of the Amazon Relational Database Service, which started with MySQL, the most popular open source database, and then added support for the commercial but also very popular Oracle and SQL Server, what does this announcement tell us? Either Amazon has received a lot of requests for MapR, or some very big AWS customers have mentioned MapR in their talks with Amazon. I go with the second option.

Original title and link: MapR Hadoop Distribution on Amazon Elastic MapReduce (NoSQL database©myNoSQL)

Pricing for Hadoop Support: Cloudera, Hortonworks, MapR

Found the following bits in a post on The Register by Timothy Prickett Morgan:

While Cloudera and MapR are charging $4,000 per node for their enterprise-class Hadoop distributions (including their proprietary extensions and tech support), Hortonworks doesn’t have any proprietary extensions and is living off of the support contracts for the HDP 1.0 stack. […] Hortonworks is not providing its full list price, but for a starter ten-node cluster, you can get a standard support contract for $12,000 per year.

Hortonworks’s pricing looks a bit aggressive, but this could be explained by the fact that Hortonworks Data Platform 1.0 was made available only this week.
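For a sense of scale, here is the arithmetic for a ten-node cluster. One assumption to flag loudly: the quote gives $4,000 per node without stating a billing period, so I treat it as an annual figure to compare it with the $12,000 per year starter contract. The comparison also ignores whatever proprietary extensions are bundled into the per-node price.

```python
def cluster_support_cost(nodes: int, price_per_node: float) -> float:
    """Support cost for a cluster priced per node."""
    return nodes * price_per_node


NODES = 10                      # the "starter ten-node cluster" from the quote
PER_NODE_PRICE = 4_000          # Cloudera and MapR; assumed to be per year
HORTONWORKS_STARTER = 12_000    # standard support contract, per year

per_node_vendor_total = cluster_support_cost(NODES, PER_NODE_PRICE)  # 40000
hortonworks_per_node = HORTONWORKS_STARTER / NODES                   # 1200.0

print(f"per-node vendors: ${per_node_vendor_total:,}")
print(f"Hortonworks: ${HORTONWORKS_STARTER:,} (${hortonworks_per_node:,.0f} per node)")
```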

For running Hadoop in the cloud, there’s also Amazon Elastic MapReduce, whose pricing has always been clear. And Amazon has recently announced support for the MapR Hadoop distribution on Elastic MapReduce.

Original title and link: Pricing for Hadoop Support: Cloudera, Hortonworks, MapR (NoSQL database©myNoSQL)

Calculating the Cost of Storing PHP Sessions Using Amazon DynamoDB

Aside from nominal data storage and data transfer fees, the costs associated with using Amazon DynamoDB are calculated based on provisioned throughput capacity and item size (see the Amazon DynamoDB pricing details). Throughput is measured in units of Read Capacity and Write Capacity. Ultimately, the throughput and costs required for your sessions table is going to be based on your website traffic, but the following is a list of the capacity units required for each session-related operation with the assumption that your sessions are less than 1KB in size:

  • Reading via session_start()

    • With locking enabled: 1 unit of Write Capacity + 1 unit of Write Capacity for each time it must retry acquiring the lock

    • With locking disabled: 1 unit of Read Capacity (or 0.5 units of Read Capacity if consistent reads are disabled)

  • Writing via session_write_close(): 1 unit of Write Capacity

  • Deleting via session_destroy(): 1 unit of Write Capacity

  • Garbage Collecting via DynamoDBSessionHandler::garbage_collect(): 0.5 units of Read Capacity per KB of data in the sessions table + 1 unit of Write Capacity per expired item

Nice translation of PHP function calls to effective Amazon DynamoDB capacity units.
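The list above turns into a small calculator quite naturally. This is only a sketch: the function names are mine, sessions are assumed to stay under 1KB as in the quote, and treating a page view as one session_start() plus one session_write_close() is my own simplification.

```python
from typing import Tuple

Units = Tuple[float, float]  # (read capacity units, write capacity units)

WRITE_CLOSE: Units = (0.0, 1.0)   # session_write_close()
DESTROY: Units = (0.0, 1.0)       # session_destroy()


def session_start_units(locking: bool, lock_retries: int = 0,
                        consistent_reads: bool = True) -> Units:
    """Capacity consumed by one session_start() call."""
    if locking:
        # Acquiring the lock is a write, and so is every retry.
        return (0.0, 1.0 + lock_retries)
    return (1.0 if consistent_reads else 0.5, 0.0)


def garbage_collect_units(table_size_kb: float, expired_items: int) -> Units:
    """Capacity consumed by one garbage collection pass over the table."""
    return (0.5 * table_size_kb, float(expired_items))


def page_view_units(locking: bool, lock_retries: int = 0,
                    consistent_reads: bool = True) -> Units:
    """Assumed page view: session_start() followed by session_write_close()."""
    reads, writes = session_start_units(locking, lock_retries, consistent_reads)
    return (reads + WRITE_CLOSE[0], writes + WRITE_CLOSE[1])


locked_view = page_view_units(locking=True, lock_retries=2)               # (0.0, 4.0)
cheap_view = page_view_units(locking=False, consistent_reads=False)       # (0.5, 1.0)
gc_pass = garbage_collect_units(table_size_kb=2_000, expired_items=150)   # (1000.0, 150.0)
```

Multiplying the per-view units by peak page views per second gives a rough starting point for the provisioned throughput of the sessions table.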

Original title and link: Calculating the Cost of Storing PHP Sessions Using Amazon DynamoDB (NoSQL database©myNoSQL)


What Are the Pros and Cons of Running Cloudera’s Distribution for Hadoop vs Amazon Elastic MapReduce Service?

Old Quora question, but still very relevant. Top response from Jeff Hammerbacher:

Elastic MapReduce Pros:

  • Dynamic MapReduce cluster sizing.
  • Ease of use for simple jobs via their proprietary web console.
  • Great documentation.
  • Integrates nicely with other Amazon Web Services.

Cloudera Distribution for Hadoop:

  • CDH is open source; you have access to the source code and can inspect it for debugging purposes and make modifications as required.
  • CDH can be run on a number of public or private clouds using an open source framework, Whirr, so you’re not tied to a single cloud provider
  • With CDH, you can move your cluster to dedicated hardware with little disruption when the economics make sense. Most non-trivial applications will benefit from this move.
  • CDH packages a number of open source projects that are not included with EMR: Sqoop, Flume, HBase, Oozie, ZooKeeper, Avro, and Hue. You have access to the complete platform composed of data collection, storage, and processing tools.
  • CDH packages a number of critical bug fixes and features and the most recent stable releases, so you’re usually using a more stable and feature-rich product.
  • You can purchase support and management tools for CDH via Cloudera Enterprise.
  • CDH uses the open source Oozie framework for workflow management. EMR implemented a proprietary “job flow” system before major Hadoop users standardized on Oozie for workload management.
  • CDH uses the open source Hue framework for its user interface. If you require new features from your web interface, you can easily implement them using the Hue SDK.
  • CDH includes a number of integrations with other software components of the data management stack, including Talend, Informatica, Netezza, Teradata, Greenplum, Microstrategy, and others. […]
  • CDH has been designed and deployed in common Linux environments and you can use standard tools to debug your programs. […]

Make sure you also read Hadoop in the Cloud: Pros and Cons which addresses (almost) the same question.

A Twitter-style answer to this question would be: “Control and customization vs Automated and Managed Service”. 80 characters left to add your own perspective.

Original title and link: What Are the Pros and Cons of Running Cloudera’s Distribution for Hadoop vs Amazon Elastic MapReduce Service? (NoSQL database©myNoSQL)

DynamoDB Libraries, Mappers, and Mock Implementations

A list of DynamoDB libraries covering quite a few popular languages and frameworks:

DynamoDB Libraries, Mappers, and Mock Implementations

A couple of things I’ve noticed (and that could be helpful to other NoSQL database companies):

  1. Amazon provides official libraries for a couple of major programming languages (Java, .NET, PHP, Ruby)
  2. Amazon is not shy about promoting libraries that are not official but have established themselves as good libraries (e.g. Python’s Boto)
  3. The list doesn’t seem to include anything for C and Objective-C (Objective-C is the language of iOS and Mac apps)

Original title and link: DynamoDB Libraries, Mappers, and Mock Implementations (NoSQL database©myNoSQL)


The Total Cost of (Non) Ownership of a NoSQL Database Service

The Amazon team released a whitepaper comparing the total cost of ownership for 3 scenarios:

  1. on-premise NoSQL database
  2. NoSQL database deployed on Amazon EC2 and Amazon EBS
  3. Amazon DynamoDB

The Total Cost of Ownership of a NoSQL Database service

As you can imagine, DynamoDB comes out as the most cost-effective solution (79% more cost-effective than an on-premise NoSQL database and 61% more cost-effective than an AWS-hosted NoSQL database). Read or download the paper after the break.