Amazon Elastic MapReduce: All content tagged as Amazon Elastic MapReduce in NoSQL databases and polyglot persistence

How to Run a MapReduce Job Against Common Crawl Data Using Amazon Elastic MapReduce

Steve Salevan’s 7-step guide to setting up, compiling, deploying, and running a basic MapReduce job.

When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.

Google published the paper. Yahoo! open sourced an implementation: Hadoop. And Amazon offers the (practically unlimited) resources.
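
For readers new to the model, here is the canonical word count in Hadoop’s Java API — a minimal sketch of the map/reduce pattern the paper describes. The input and output paths are placeholders; on Elastic MapReduce they would typically be s3n:// URIs:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: runs in parallel over input splits, emitting (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: receives all counts for a word and sums them.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. s3n://bucket/input/
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. s3n://bucket/output/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged as a JAR and uploaded to S3, a job like this can be submitted to Elastic MapReduce as a custom JAR step.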

Update: the Hacker News thread mainly answers the question of which companies, besides the big Internet players, are using MapReduce. The answer is short: too many to enumerate them all.

Original title and link: How to Run a MapReduce Job Against Common Crawl Data Using Amazon Elastic MapReduce (NoSQL databases © myNoSQL)

via: http://www.commoncrawl.org/mapreduce-for-the-masses/


Amazon Elastic MapReduce Upgrades to Hadoop 0.20.205, Pig 0.9.1, AMI Versioning, and Amazon VPC

Starting today you can run your job flows using Hadoop 0.20.205 and Pig 0.9.1. To simplify the upgrade process, we have also introduced the concept of AMI versions. You can now provide a specific AMI version to use at job flow launch or specify that you would like to use our “latest” AMI, ensuring that you are always using our most up-to-date features. The following AMI versions are now available:

  • Version 2.0: Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1, Debian 6.0.2 (Squeeze)
  • Version 1.0: Hadoop 0.18.3 and 0.20.2, Hive 0.5 and 0.7.1, Pig 0.3 and 0.6, Debian 5.0 (Lenny)
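
As a rough illustration of AMI pinning, here is what launching a job flow with a specific AMI version might look like through the AWS SDK for Java — a sketch only; the credentials, bucket, and instance settings are made-up values, not part of the announcement:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

    public class LaunchVersionedJobFlow {
      public static void main(String[] args) {
        // Placeholder credentials; substitute your own keys.
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
            new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("hadoop-0.20.205-job-flow")
            .withAmiVersion("2.0") // or "latest" to track the newest AMI
            .withLogUri("s3://my-bucket/emr-logs/") // hypothetical bucket
            .withInstances(new JobFlowInstancesConfig()
                .withInstanceCount(4)
                .withMasterInstanceType("m1.large")
                .withSlaveInstanceType("m1.large")
                .withKeepJobFlowAliveWhenNoSteps(false)); // shut down when steps finish

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Launched job flow: " + result.getJobFlowId());
      }
    }

Passing “latest” instead of “2.0” would track Amazon’s newest AMI, at the cost of run-to-run reproducibility.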

Amazon Elastic MapReduce is the perfect solution for:

  1. learning and experimenting with Hadoop
  2. running huge processing jobs in cases where your company doesn’t already have the necessary resources

Original title and link: Amazon Elastic MapReduce Upgrades to Hadoop 0.20.205, Pig 0.9.1, AMI Versioning, and Amazon VPC (NoSQL databases © myNoSQL)

via: https://forums.aws.amazon.com/ann.jspa?annID=1275


Moving Away From Amazon’s EMR Service to an In-House Hadoop Cluster

Many of our systems use Amazon’s S3 as a backup repository for log data.  Our data became too large to process by traditional techniques, so we started using Amazon’s Elastic MapReduce (EMR) to do more expensive queries on our data stored in S3.  The major advantage of EMR for us was the lack of operational overhead.  With a simple API call, we could have a 20 or 40 node cluster running to crunch our data, which we shutdown at the conclusion of the run. We had two systems interacting with EMR.  The first consisted of shell scripts to start an EMR cluster, run a pig script, and load the output data from S3 into our data warehousing system.  The second was a Java application that launched pig jobs on an EMR cluster via the Java API and consumed the data in S3 produced by EMR.
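
For a sense of what that second system involves, here is a sketch of launching a Pig script on a running job flow via the AWS SDK for Java’s StepFactory helper — the credentials, script location, and job flow id are placeholders:

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
    import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;
    import com.amazonaws.services.elasticmapreduce.util.StepFactory;

    public class RunPigStep {
      public static void main(String[] args) {
        // Placeholder credentials; substitute your own keys.
        AmazonElasticMapReduceClient emr = new AmazonElasticMapReduceClient(
            new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

        StepFactory stepFactory = new StepFactory();

        // Install Pig on the cluster, then run a script stored in S3.
        StepConfig installPig = new StepConfig()
            .withName("Install Pig")
            .withHadoopJarStep(stepFactory.newInstallPigStep());

        StepConfig runScript = new StepConfig()
            .withName("Run Pig script")
            .withActionOnFailure("TERMINATE_JOB_FLOW")
            .withHadoopJarStep(stepFactory.newRunPigScriptStep(
                "s3://my-bucket/scripts/process-logs.pig")); // hypothetical script

        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
            .withJobFlowId("j-XXXXXXXXXXXXX") // id of an already-running job flow
            .withSteps(installPig, runScript));
      }
    }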

What might make you consider moving from the cloud version of MapReduce, Amazon Elastic MapReduce, to an on-premise Hadoop cluster:

  1. performance and tuning
  2. monitoring
  3. API access
  4. lack of latest features

Original title and link: Moving Away From Amazon’s EMR Service to an In-House Hadoop Cluster (NoSQL databases © myNoSQL)

via: http://www.cloudera.com/blog/2011/06/migrating-from-elastic-mapreduce-to-a-cloudera’s-distribution-including-apache-hadoop-cluster/


Amazon Elastic MapReduce Updates

Updates from Amazon, including upgraded Hive support, multipart upload, and optimized JDBC drivers:

  • Support for S3’s Large Objects and Multipart Upload

Amazon Elastic MapReduce supports this feature too, allowing MapReduce to begin the upload to S3 before the Hadoop task is finished.

  • Upgraded Hive Support

Currently you can run both Hive 0.5 and 0.7 concurrently in the same cluster.

  • JDBC Drivers for Hive

Optimized JDBC drivers for Hive (a connection sketch follows below).
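
To illustrate, connecting to Hive over JDBC looks roughly like the sketch below. The driver class is Hive 0.7’s; the host, port, and table name are placeholder assumptions (on Elastic MapReduce you would typically reach the master node through an SSH tunnel):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcQuery {
      public static void main(String[] args) throws Exception {
        // Hive 0.7's JDBC driver class; host, port, and table are placeholders.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        Connection conn = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // COUNT(1) rather than COUNT(*) keeps the query valid on older Hive.
        ResultSet rs = stmt.executeQuery("SELECT COUNT(1) FROM my_table");
        while (rs.next()) {
          System.out.println(rs.getLong(1));
        }
        conn.close();
      }
    }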

Original title and link: Amazon Elastic MapReduce Updates (NoSQL databases © myNoSQL)

via: http://aws.typepad.com/aws/2011/01/elastic-mapreduce-updates-hive-multipart-upload-jdbc-squirrel-sql.html