AWS: All content tagged as AWS in NoSQL databases and polyglot persistence
What Are the Pros and Cons of Running Cloudera’s Distribution for Hadoop vs Amazon Elastic MapReduce Service?
Old Quora question, but still very relevant. Top response from Jeff Hammerbacher:
Elastic MapReduce Pros:
- Dynamic MapReduce cluster sizing.
- Ease of use for simple jobs via their proprietary web console.
- Great documentation.
- Integrates nicely with other Amazon Web Services.
Cloudera Distribution for Hadoop:
- CDH is open source; you have access to the source code and can inspect it for debugging purposes and make modifications as required.
- CDH can be run on a number of public or private clouds using an open source framework, Whirr, so you’re not tied to a single cloud provider
- With CDH, you can move your cluster to dedicated hardware with little disruption when the economics make sense. Most non-trivial applications will benefit from this move.
- CDH packages a number of open source projects that are not included with EMR: Sqoop, Flume, HBase, Oozie, ZooKeeper, Avro, and Hue. You have access to the complete platform composed of data collection, storage, and processing tools.
- CDH packages a number of critical bug fixes and features and the most recent stable releases, so you’re usually using a more stable and feature-rich product.
- You can purchase support and management tools for CDH via Cloudera Enterprise.
- CDH uses the open source Oozie framework for workflow management. EMR implemented a proprietary “job flow” system before major Hadoop users standardized on Oozie for workload management.
- CDH uses the open source Hue framework for its user interface. If you require new features from your web interface, you can easily implement them using the Hue SDK.
- CDH includes a number of integrations with other software components of the data management stack, including Talend, Informatica, Netezza, Teradata, Greenplum, Microstrategy, and others. […]
- CDH has been designed and deployed in common Linux environments and you can use standard tools to debug your programs. […]
Make sure you also read Hadoop in the Cloud: Pros and Cons which addresses (almost) the same question.
A Twitter-style answer to this question would be: “Control and customization vs Automated and Managed Service”. 80 characters left to add your own perspective.
Original title and link: What Are the Pros and Cons of Running Cloudera’s Distribution for Hadoop vs Amazon Elastic MapReduce Service? ( ©myNoSQL)
The Amazon team released a whitepaper comparing the total cost of ownership for 3 scenarios:
- on-premise NoSQL database
- NoSQL database deployed on Amazon EC2 and Amazon EBS
- Amazon DynamoDB
As you can imagine DynamoDB comes out as the most cost-effective solution (79% more effective than on-premise NoSQL database and 61% more cost-effective than AWS hosted NoSQL database). Read or download the paper after the break.
By pure coincidence, General Chicken just published on High Scalability a bullet point summary of the 99designs architecture I’ve linked and commented on earlier.
Original title and link: The Design of 99designs - A Clean Tens of Millions Pageviews Architecture ( ©myNoSQL)
Pierre Bailet and Mathieu Poumeyrol of fotopedia (a French photo site) share their experience of operating a small MongoDB cluster since Sep.2009 compared to a MySQL cluster.
Some details about fotopedia:
- fotopedia is 100% on AWS
- Amazon RDS for MySQL
- 4 nodes MongoDB cluster
- 150mil. photo views
- no alter table
- background index creation
- data backup & restoration
- note: as far as I can tell MySQL is able to do the same
- replica sets
- hardware migration
- note: the same procedure can be used for MySQL
Before leaving you with the slides, here is an interesting accepted trade-off:
Quietly losing seconds of writes is preferable to:
- weekly minutes-long maintenance periods
- minutes-long unscheduled downtime and manual failover in case of hardware failures