Amazon Elastic MapReduce: All content tagged as Amazon Elastic MapReduce in NoSQL databases and polyglot persistence
Monday, 19 December 2011
How to Run a MapReduce Job Against Common Crawl Data Using Amazon Elastic MapReduce
Steve Salevan’s 7-step guide to setting up, compiling, deploying, and running a basic MapReduce job.
When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.
Google published the paper. Yahoo backed the open source implementation, Hadoop. And Amazon is offering (virtually unlimited) resources to run it on.
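For readers new to the pattern the quoted paragraph describes, here is a minimal single-process sketch of it: a word count split into a map phase, a grouping step, and a reduce phase. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are illustrative assumptions, not taken from the guide or from Hadoop's API. A real framework would distribute the map and reduce calls across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map call depends only on its own record, which is what lets a
    # real framework run the calls in parallel on different machines.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
```

Because no map call shares state with another, adding more machines is enough to process more input, which is the property the paragraph above credits for analysis at massive scale.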
Update: The main question answered in the Hacker News thread is which other corporations, besides the Internet companies, are using MapReduce. The answer is unfortunately extremely short: too many to enumerate them all.
Original title and link: How to Run a MapReduce Job Against Common Crawl Data Using Amazon Elastic MapReduce (©myNoSQL)
Monday, 12 December 2011
Amazon Elastic MapReduce Upgrades to Hadoop 0.20.205, Pig 0.9.1, AMI Versioning, and Amazon VPC
Starting today you can run your job flows using Hadoop 0.20.205 and Pig 0.9.1. To simplify the upgrade process, we have also introduced the concept of AMI versions. You can now provide a specific AMI version to use at job flow launch or specify that you would like to use our “latest” AMI, ensuring that you are always using our most up-to-date features. The following AMI versions are now available:
- Version 2.0: Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1, Debian 6.0.2 (Squeeze)
- Version 1.0: Hadoop 0.18.3 and 0.20.2, Hive 0.5 and 0.7.1, Pig 0.3 and 0.6, Debian 5.0 (Lenny)
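As a rough illustration of pinning an AMI version versus following “latest” at job flow launch, the sketch below only builds the request parameters and makes no call to AWS. The key names follow the EMR `RunJobFlow` request shape as I understand it (`AmiVersion`, `Instances`, and so on) and should be checked against the current EMR documentation before use. The helper `build_job_flow_request`, the job name, the bucket, and the instance type are hypothetical.

```python
def build_job_flow_request(name, log_uri, ami_version="latest",
                           instance_type="m1.small", instance_count=3):
    """Build keyword arguments shaped like an EMR RunJobFlow request.

    Assumption: the key names mirror the RunJobFlow API. Nothing is sent
    to AWS here; the function only returns a dictionary.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "Name": name,
        "LogUri": log_uri,
        # "2.0" pins a specific AMI; "latest" follows the newest one.
        "AmiVersion": ami_version,
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Terminate the cluster once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [],  # Pig or Hadoop steps would be listed here.
    }


# Hypothetical job name and bucket, for illustration only.
pinned = build_job_flow_request("pig-report", "s3://example-bucket/logs/",
                                ami_version="2.0")
floating = build_job_flow_request("pig-report", "s3://example-bucket/logs/")
```

Pinning a version keeps the Hadoop, Hive, and Pig versions stable between runs, while “latest” trades that stability for automatic upgrades.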
Amazon Elastic MapReduce is the perfect solution for:
- learning and experimenting with Hadoop
- running huge processing jobs in cases where your company doesn’t already have the necessary resources
Thursday, 23 June 2011
Moving Away From Amazon’s EMR Service to an In-House Hadoop Cluster
Many of our systems use Amazon’s S3 as a backup repository for log data. Our data became too large to process by traditional techniques, so we started using Amazon’s Elastic MapReduce (EMR) to do more expensive queries on our data stored in S3. The major advantage of EMR for us was the lack of operational overhead. With a simple API call, we could have a 20 or 40 node cluster running to crunch our data, which we shut down at the conclusion of the run. We had two systems interacting with EMR. The first consisted of shell scripts to start an EMR cluster, run a Pig script, and load the output data from S3 into our data warehousing system. The second was a Java application that launched Pig jobs on an EMR cluster via the Java API and consumed the data in S3 produced by EMR.
Reasons that might make you consider moving from Amazon Elastic MapReduce, the cloud version of MapReduce, to an on-premises Hadoop cluster:
- performance and tuning
- monitoring
- API access
- lack of latest features
Monday, 10 January 2011
Amazon Elastic MapReduce Updates
Updates from Amazon, including upgraded Hive support, multipart upload, and optimized JDBC drivers:
- Support for S3’s Large Objects and Multipart Upload
Amazon Elastic MapReduce supports this feature too, allowing MapReduce to begin the upload before the Hadoop task is finished.
- Upgraded Hive Support
Currently you can run both Hive 0.5 and 0.7 concurrently in the same cluster.
- JDBC Drivers for Hive
Optimized JDBC drivers for Hive.
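To see why multipart upload lets an upload begin before a task has finished writing, consider this minimal sketch: it cuts a stream into parts and hands each one over as soon as it is full, instead of waiting for the whole output. It does not talk to S3 and is not how EMR implements the feature. `iter_parts` and its parameters are hypothetical names, and the small sizes in the usage lines are for illustration only, since S3 requires every part except the last to be at least 5 MB.

```python
import io

# S3 minimum size for every part except the last one.
MIN_PART_SIZE = 5 * 1024 * 1024


def iter_parts(stream, part_size=MIN_PART_SIZE, read_size=64 * 1024):
    """Yield (part_number, data) as soon as each part is full.

    A caller could upload each yielded part immediately, while the
    producer is still writing the rest of the stream.
    """
    buffer = bytearray()
    part_number = 1
    while True:
        chunk = stream.read(read_size)
        if not chunk:
            break
        buffer.extend(chunk)
        while len(buffer) >= part_size:
            yield part_number, bytes(buffer[:part_size])
            del buffer[:part_size]
            part_number += 1
    if buffer:
        # The final part may be smaller than part_size.
        yield part_number, bytes(buffer)


# Tiny sizes so the behaviour is easy to follow; real parts must be far larger.
output = io.BytesIO(b"x" * 2500)
parts = list(iter_parts(output, part_size=1000, read_size=300))
sizes = [len(data) for _, data in parts]
```

With 2500 bytes of input and 1000-byte parts, `sizes` comes out as two full parts followed by a 500-byte remainder.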