Elastic MapReduce: All content tagged as Elastic MapReduce in NoSQL databases and polyglot persistence

Hadoop: Amazon Elastic MapReduce and Microsoft Project Isotope

This is how things are rolling these days. Microsoft talks about offering Hadoop integration with Project Isotope in 2012, while Amazon is announcing the immediate availability of new beefed-up instances (Cluster Compute Eight Extra Large, cc2.8xlarge) and reduced prices for some of the existing instances.

Original title and link: Hadoop: Amazon Elastic MapReduce and Microsoft Project Isotop (NoSQL database©myNoSQL)

Tanuki: A 30000 Cores AWS Cluster

Sometimes the only valid comment is wow.

We have now launched a cluster 3 times the size of Tanuki, or 30,000 cores, which cost $1279/hour to operate for a Top 5 Pharma. It performed genuine scientific work — in this case molecular modeling — and a ton of it. The complexity of this environment did not necessarily scale linearly with the cores.

In fact, we had to implement a triad of features within CycleCloud to make it a reality:

  1. MultiRegion support: To achieve the mind boggling core count of this cluster, we launched in three distinct AWS regions simultaneously, including Europe.
  2. Massive Spot instance support: This was a requirement given the potential savings at this scale by going through the spot market. Besides, our scheduling environment and the workload had no issues with the possibility of early termination and rescheduling.
  3. Massive CycleServer monitoring & Grill GUI app for Chef monitoring: There is no way that any mere human could keep track of all of the moving parts on a cluster of this scale.
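As a back-of-the-envelope check on the figures quoted above (30,000 cores at $1279/hour), the average price of a core-hour can be worked out directly. This is only a rough sketch: it assumes the quoted hourly cost covers the whole cluster, and the `run_cost` helper and any run duration passed to it are hypothetical, since the quote does not say how long the job ran.

```python
# Figures taken from the quote above.
CORES = 30_000
COST_PER_HOUR_USD = 1279.0


def cost_per_core_hour(total_cost_per_hour: float, cores: int) -> float:
    """Average hourly cost of a single core, in USD."""
    return total_cost_per_hour / cores


def run_cost(total_cost_per_hour: float, hours: float) -> float:
    """Total cost of keeping the cluster up for `hours` (hypothetical duration)."""
    return total_cost_per_hour * hours


per_core = cost_per_core_hour(COST_PER_HOUR_USD, CORES)
print(f"{per_core:.4f} USD per core-hour")  # prints "0.0426 USD per core-hour"
```

At roughly four cents per core-hour, it is easy to see why the spot market was listed as a requirement at this scale.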

Facebook runs a 30PB Hadoop analytic data warehouse and Yahoo! has a Hadoop cluster of 100,000 cores across 40,000 machines. I wonder what the largest Amazon Elastic MapReduce jobs ever run look like. Any ideas?

Original title and link: Tanuki: A 30000 Cores AWS Cluster (NoSQL database©myNoSQL)


Hadoop, Hive and Redis for Foursquare Analytics

Foursquare moved from querying its production databases to a data analytics system built on Hadoop and Hive, with Redis playing the role of a cache. The goals:

  • Provide an easy-to-use end-point to run data exploration queries (using SQL and simple web-forms).
  • Cache the results of queries (in a database) to power reports, so that the data is available to everyone, whenever it is needed.
  • Allow our hadoop cluster to be totally dynamic without having to move data around (we shut it down at night and on weekends).
  • Add new data in a simple way (just put it in Amazon S3!).
  • Analyse data from several data sources (mongodb, postgres, log-files).

Figure: Foursquare Analytics Architecture

One of the most frequently heard complaints about NoSQL databases is about their reduced querying capabilities. Running reports and analysis against the production servers is only going to work when you have little data and the set of queries is limited and stable over time. Otherwise you’ll want to run these against a copy of your data, so that analytics queries can neither bring down the production databases nor corrupt their data.
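The "cache the results of queries" idea from the list above can be sketched in a few lines. This is not Foursquare's code: the `QueryResultCache` class, the `run_query` callable (a stand-in for submitting a query to Hive), and the in-memory dict (a stand-in for Redis) are all assumptions made for illustration.

```python
import hashlib
import time
from typing import Callable, Dict, List, Optional, Tuple

Rows = List[Tuple]


class QueryResultCache:
    """Serve report queries from a cache and only hit the cluster on a miss."""

    def __init__(self, run_query: Callable[[str], Rows], ttl_seconds: float = 3600.0):
        self._run_query = run_query      # hypothetical: runs the query on Hive
        self._ttl = ttl_seconds          # how long a cached result stays fresh
        self._store: Dict[str, Tuple[float, Rows]] = {}  # stand-in for Redis

    @staticmethod
    def _key(sql: str) -> str:
        # Key on the exact query text (minus surrounding whitespace).
        return hashlib.sha256(sql.strip().encode("utf-8")).hexdigest()

    def get(self, sql: str, now: Optional[float] = None) -> Rows:
        now = time.time() if now is None else now
        key = self._key(sql)
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                # fresh cached result, no cluster needed
        rows = self._run_query(sql)      # miss or stale entry: run the query
        self._store[key] = (now, rows)
        return rows


# Usage with a fake backend that records how often it is called.
calls: List[str] = []


def fake_hive(sql: str) -> Rows:
    calls.append(sql)
    return [("2011-11-01", 42)]


cache = QueryResultCache(fake_hive, ttl_seconds=60)
query = "SELECT day, COUNT(*) FROM checkins GROUP BY day"
first = cache.get(query, now=0)     # miss: runs the query
second = cache.get(query, now=30)   # hit: served from the cache
third = cache.get(query, now=120)   # stale: runs the query again
```

With a cache like this in front of it, the Hadoop cluster can be shut down at night and on weekends while reports built on still-fresh results remain available.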

Original title and link: Hadoop, Hive and Redis for Foursquare Analytics (NoSQL databases © myNoSQL)