Elastic MapReduce: All content tagged as Elastic MapReduce in NoSQL databases and polyglot persistence
Thursday, 22 September 2011
Tanuki: A 30000 Cores AWS Cluster
Sometimes the only valid comment is wow.
We have now launched a cluster 3 times the size of Tanuki, or 30,000 cores, which cost $1279/hour to operate for a Top 5 Pharma. It performed genuine scientific work — in this case molecular modeling — and a ton of it. The complexity of this environment did not necessarily scale linearly with the cores.
In fact, we had to implement a triad of features within CycleCloud to make it a reality:
- MultiRegion support: To achieve the mind boggling core count of this cluster, we launched in three distinct AWS regions simultaneously, including Europe.
- Massive Spot instance support: This was a requirement given the potential savings at this scale by going through the spot market. Besides, our scheduling environment and the workload had no issues with the possibility of early termination and rescheduling.
- Massive CycleServer monitoring & Grill GUI app for Chef monitoring: There is no way that any mere human could keep track of all of the moving parts on a cluster of this scale.
Facebook runs a 30PB Hadoop analytic data warehouse and Yahoo! has a 100,000 cores/40,000 machines Hadoop cluster. I’m wondering what are the largest Amazon Elastic MapReduce jobs ever run. Any ideas?
Original title and link: Tanuki: A 30000 Cores AWS Cluster (©myNoSQL)
Monday, 14 March 2011
Hadoop, Hive and Redis for Foursquare Analytics
Foursquare’s move from querying the production databases to a data analytics system using Hadoop and Hive with Redis playing the role of a cache:
- Provide an easy-to-use end-point to run data exploration queries (using SQL and simple web-forms).
- Cache the results of queries (in a database) to power reports, so that the data is available to everyone, whenever it is needed.
- Allow our hadoop cluster to be totally dynamic without having to move data around (we shut it down at night and on weekends).
- Add new data in a simple way (just put it in Amazon S3!).
- Analyse data from several data sources (mongodb, postgres, log-files).

One of the most often heard complains about NoSQL databases is about their reduced querying capabilities. Running reports and analysis against the production servers is only going to work when you have little data and the set of queries is limitted and stable over time. Otherwise you’ll want to run these against a copy of your data to avoid bringing down production databases and avoid corrupting data.
Original title and link: Hadoop, Hive and Redis for Foursquare Analytics (NoSQL databases © myNoSQL)
Most Popular Articles
- Translate SQL to MongoDB MapReduce
- Tutorial: Getting Started With Cassandra
- CouchDB vs MongoDB: An attempt for a More Informed Comparison
- Cassandra @ Twitter: An Interview with Ryan King
- A Couple of Nice GUI Tools for MongoDB
- NoSQL benchmarks and performance evaluations
- Ehcache: Distributed Cache or NoSQL Store?
- Document Databases Compared: CouchDB, MongoDB, RavenDB
- Quick Review of Existing Graph Databases
- NoSQL Data Modeling