


aws: All content tagged as aws in NoSQL databases and polyglot persistence

Mahout as a Service in Apache Whirr 0.7.0

What’s included with Whirr 0.7.0 will definitely cut down the 2-3 hours required to get Mahout up and running on Amazon. At least that’s what Frank Scholten’s post made me believe.

Original title and link: Mahout as a Service in Apache Whirr 0.7.0 (NoSQL database©myNoSQL)


MongoDB and Amazon Elastic Block Storage (EBS)

The topic of running MongoDB on Amazon Web Services using Elastic Block Storage came up again among the 10 tips for running MongoDB from Engine Yard:

you should know that the performance of Amazon’s Elastic Block Storage (EBS) can be inconsistent.

Following up on that, Mahesh P-Subramanya aptly added:

Indeed!  I’d actually take it a step further and say Do not use EBS in any environment where reliability and/or performance characteristics of your disk-access are important.  Or, to put it differently, asynchronous backups - OK, disk-based databases - Not So Much.  

Interestingly, though, some presentations earlier this year (MongoDB in the Amazon Cloud and Running MongoDB on the Cloud) left me and others with the impression that EBS should not be dismissed so fast.

Original title and link: MongoDB and Amazon Elastic Block Storage (EBS) (NoSQL database©myNoSQL)

Hadoop: Amazon Elastic MapReduce and Microsoft Project Isotop

This is how things are rolling these days. Microsoft talks about offering Hadoop integration with Project Isotop in 2012, while Amazon announces the immediate availability of new beefed-up instances (Cluster Compute Eight Extra Large (cc2.8xlarge)) and reduced prices for some of the existing instances.

Original title and link: Hadoop: Amazon Elastic MapReduce and Microsoft Project Isotop (NoSQL database©myNoSQL)

How to Run a MapReduce Job Against Common Crawl Data Using Amazon Elastic MapReduce

Steve Salevan’s 7-step guide to setting up, compiling, deploying, and running a basic MapReduce job:

When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.

Google published the paper. Yahoo open sourced an implementation, Hadoop. And Amazon is offering (unlimited) resources.

Update: In the Hacker News thread, the main question answered is which other corporations (besides the Internet companies) are using MapReduce. The answer is unfortunately extremely short: too many to enumerate them all.

Original title and link: How to Run a MapReduce Job Against Common Crawl Data Using Amazon Elastic MapReduce (NoSQL database©myNoSQL)


Amazon Elastic MapReduce Upgrades to Hadoop 0.20.205, Pig 0.9.1, AMI Versioning, and Amazon VPC

Starting today you can run your job flows using Hadoop 0.20.205 and Pig 0.9.1. To simplify the upgrade process, we have also introduced the concept of AMI versions. You can now provide a specific AMI version to use at job flow launch or specify that you would like to use our “latest” AMI, ensuring that you are always using our most up-to-date features. The following AMI versions are now available:

  • Version 2.0: Hadoop 0.20.205, Hive 0.7.1, Pig 0.9.1, Debian 6.0.2 (Squeeze)
  • Version 1.0: Hadoop 0.18.3 and 0.20.2, Hive 0.5 and 0.7.1, Pig 0.3 and 0.6, Debian 5.0 (Lenny)

Amazon Elastic MapReduce is the perfect solution for:

  1. learning and experimenting with Hadoop
  2. running huge processing jobs in cases where your company doesn’t already have the necessary resources

Original title and link: Amazon Elastic MapReduce Upgrades to Hadoop 0.20.205, Pig 0.9.1, AMI Versioning, and Amazon VPC (NoSQL database©myNoSQL)


Extracting and Tokenizing 30TB of Web Crawl Data

All code for this 5-step process of extracting and tokenizing Common Crawl’s 30TB of data is available on GitHub:

  1. Distributed copy to get data into a Hadoop cluster
  2. Filtering text/html
  3. Using boilerpipe for extracting visible text
  4. Using Apache Tika LanguageIdentifier for filtering English content
  5. Tokenizing using the Stanford parser.

Original title and link: Extracting and Tokenizing 30TB of Web Crawl Data (NoSQL database©myNoSQL)


Mahout on Amazon EC2: Installing Hadoop/Mahout on High Performance Instance

Danny Bickson:

Full procedure should take around 2-3  hours.. :-(

I think this would be considered a good provisioning speed for ramping up a new machine in your data center. But it is not a good time for getting up to speed.

Original title and link: Mahout on Amazon EC2: Installing Hadoop/Mahout on High Performance Instance (NoSQL database©myNoSQL)


Backing Up HBase to Amazon S3

This is a guest post by the Bizosys Team, creators of HSearch, an open source, NoSQL, distributed, real-time search engine built on Hadoop and HBase.

We have evaluated various options to back up data inside HBase and built a solution. This post explains the options and also provides the solution for anyone to download and use with their own HBase installations.

Option: Backup the Hadoop DFS
  Pros: Block data files are backed up quickly.
  Cons: Even if there is no visible external load on HBase, internal HBase processes such as region balancing and compaction keep updating the HDFS blocks, so a raw copy may end up in an inconsistent state. In addition, HBase and HDFS keep data in memory and flush it at periodic intervals, which can also leave a raw copy inconsistent.

Option: HBase Import and Export tool
  Pros: The Map-Reduce job downloads data to the given output path.
  Cons: When given a path like s3://backupbucket/, the program fails with exceptions like: Jets3tFileSystemStore failed with AWSCredentials.

Option: HBase Table Copy tools
  Pros: Another parallel replicated setup to switch to.
  Cons: Huge investment to keep another parallel environment running to replicate production data.

After considering these options we developed a simple tool, which backs up data to Amazon S3 and restores it when needed. Another requirement is to take a full backup over the weekend and a daily incremental backup.

In a recovery scenario, it should first set up a clean environment with all tables created and populated with the latest full backup data. Then it should apply all incremental backups sequentially. However, with this method, deletes are not captured, and this may leave some unnecessary data in the tables. This is a known disadvantage of this method of backup and restore.

Internally, this backup program uses the HBase Import and Export tools and runs them as Map-Reduce jobs.
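The weekend-full, daily-incremental schedule can be sketched as a small shell helper. This is only an illustration, not part of the tool: the function name `backup_args_for_day`, the choice of Sunday for the full backup, and the 1440-minute window are assumptions. Only the `mode`, `backup.folder`, `tables` and `duration.mins` parameters come from the tool's usage shown further down.

```shell
# Pick the backup arguments for a given ISO weekday (1 = Monday ... 7 = Sunday).
# Assumption: the full backup runs on Sunday; every other day gets an
# incremental backup covering the last 24 hours (1440 minutes).
backup_args_for_day() {
  local weekday="$1" folder="$2" tables="$3"
  if [ "$weekday" -eq 7 ]; then
    echo "mode=backup.full backup.folder=$folder tables=$tables"
  else
    echo "mode=backup.incremental backup.folder=$folder duration.mins=1440 tables=$tables"
  fi
}

# Example: print the arguments for today (date +%u gives the ISO weekday).
backup_args_for_day "$(date +%u)" "s3://S3BucketABC/" "tab1,tab2,tab3"
```

A fixed 24-hour window only lines up if the job starts at the same time every day; a slightly larger window would overlap rather than miss changes.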

Top 10 Features of the backup tool

  1. Export complete data for the given set of tables to an S3 bucket.
  2. Export incremental data for the given set of tables to an S3 bucket.
  3. List all complete as well as incremental backup repositories.
  4. Restore a table from backup based on the given backup repository.
  5. Runs in Map-Reduce
  6. In case of connection failure, retries with increasing delays
  7. Handles special characters like _, which cause problems in the export and import activities.
  8. Enhances the existing Export and Import tools with detailed logging, so a failure is reported rather than the program just exiting with a status of 1.
  9. Works with a human-readable time format for taking, listing and restoring backups rather than system tick time or Unix epoch time (time represented as a number rather than in a readable format such as YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE)
  10. All parameters are taken from the command line, which allows a cron job to run the tool at regular intervals.
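The post does not show how the tool implements its retries, so the following is only a generic sketch of "retry with increasing delays" in shell. The function name `retry_with_backoff`, the doubling strategy and the attempt limit are all assumptions made for illustration.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the delay after every failure.
# Returns 0 on the first success, 1 once MAX_ATTEMPTS have failed.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_with_backoff 5 10 some_command` would wait 10, 20, 40 and 80 seconds between its five attempts (`some_command` is a placeholder).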

Setting up the tool

  1. Download the package from hbackup.install.tar
    This package includes the necessary jar files and the source code.
  2. Set up a configuration file. Download the hbase-site.xml file and add the fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey, fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties to it.
  3. Set up the class path with all jars inside the hbase/lib directory, the hbase.jar file, and the java-xmlbuilder-0.4.jar, jets3t-0.8.1a.jar and hbackup-1.0-core.jar files bundled inside the downloaded hbackup.install.tar. Make sure hbackup-1.0-core.jar is at the beginning of the classpath. In addition, add the configuration directory that contains the hbase-site.xml file to CLASSPATH.
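For step 2, the four properties use the usual Hadoop property layout (name plus value). As a rough sketch, the helper below prints them so they can be pasted inside the configuration element of hbase-site.xml. The function name `s3_credential_properties` and the placeholder key values are hypothetical, and using the same key pair for both the s3 and s3n file systems is an assumption.

```shell
# Print the four S3 credential properties as hbase-site.xml <property> elements.
# Assumption: the same key pair is used for both the s3 and s3n file systems.
s3_credential_properties() {
  local access_key="$1" secret_key="$2" scheme
  for scheme in s3 s3n; do
    printf '  <property>\n    <name>fs.%s.awsAccessKeyId</name>\n    <value>%s</value>\n  </property>\n' "$scheme" "$access_key"
    printf '  <property>\n    <name>fs.%s.awsSecretAccessKey</name>\n    <value>%s</value>\n  </property>\n' "$scheme" "$secret_key"
  done
}

# Placeholder values; replace them with real credentials before use.
s3_credential_properties "YOUR_ACCESS_KEY_ID" "YOUR_SECRET_ACCESS_KEY"
```

The helper does no XML escaping, so it only suits values without characters such as & or <.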

Running the tool

Usage: It runs in 4 modes as [backup.full], [backup.incremental], [backup.history] and [restore].


mode=backup.full tables="comma separated tables" backup.folder=S3-Path  date="YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE"


mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/ date="2011.12.01 17:03:38:546 IST"
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/


mode=backup.incremental tables="comma separated tables" backup.folder=S3-Path duration.mins=Minutes

Example of backup of changes occurred in the last 30 minutes:

mode=backup.incremental backup.folder=s3://S3BucketABC/ duration.mins=30 tables=tab1,tab2,tab3


mode=backup.history backup.folder=S3-Path

Example of listing past archives. Incremental ones end with .incr

mode=backup.history backup.folder=s3://S3BucketABC/


mode=restore  backup.folder=S3-Path/ArchieveDate tables="comma separated tables"

Example of adding the rows archived during that date. First apply a full backup and then apply incremental backups.

mode=restore backup.folder=s3://S3-Path/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY tables=tab1,tab2,tab3
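The "full backup first, then the incremental backups" order can be sketched as a dry-run helper that only prints the restore commands, one per archive. The function name `print_restore_commands` is hypothetical, and the archive paths must be passed oldest first, because the helper does not sort them. The class name and the `mode=restore` parameters are the ones used elsewhere in this post.

```shell
# print_restore_commands TABLES FULL_ARCHIVE [INCREMENTAL_ARCHIVE...]
# Prints one restore command per archive: the full backup first, then the
# incremental archives in the order they were passed (oldest first).
# Nothing is executed; review the output, then run the commands yourself.
print_restore_commands() {
  local tables="$1" archive
  shift
  for archive in "$@"; do
    echo "java com.bizosys.oneline.maintenance.HBaseBackup mode=restore backup.folder=$archive tables=$tables"
  done
}
```

Printing instead of executing keeps the sketch safe to try; piping the output to a shell would run the restores in sequence.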

Sample scripts to run the backup tool


$ cat
 # Add every jar in the HBase lib directory to the classpath
 for file in `ls /mnt/hbase/lib`
 do
   export CLASSPATH=$CLASSPATH:/mnt/hbase/lib/$file;
 done

 export CLASSPATH=/mnt/hbase/hbase-0.90.4.jar:$CLASSPATH

 export CLASSPATH=/mnt/hbackup/hbackup-1.0-core.jar:/mnt/hbackup/java-xmlbuilder-0.4.jar:/mnt/hbackup/jets3t-0.8.1a.jar:/mnt/hbackup/conf:$CLASSPATH

Full backup:

 $ cat
 . /mnt/hbackup/bin/

 dd=`date "+%Y.%m.%d %H:%M:%S:000 %Z"`
 echo Backing up for date $dd
 # Back up one table at a time, pausing between runs
 for table in `echo table1 table2 table3`
 do
   /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.full backup.folder=s3://mybucket/ tables=$table "date=$dd"
   sleep 10
 done

List of backups:

 $ cat
 . /mnt/hbackup/bin/
 /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.history backup.folder=s3://mybucket

Original title and link: Backing Up HBase to Amazon S3 (NoSQL database©myNoSQL)

Google Launches Google Cloud SQL a Relational Database as a Service

Google has just announced a new (lab) product, Google Cloud SQL, which is Google’s Database-as-a-Service counterpart to Amazon RDS. Based on initial information, Google Cloud SQL could be characterized as a very basic/intro version of Amazon RDS.

Main features listed in the announcement:

  • Managed environment
  • High reliability and availability - your data is replicated synchronously to multiple data centers. Machine, rack and data center failures are handled automatically to minimize end-user impact. It also supports asynchronous replication
  • Familiar MySQL database environment with JDBC support (for Java-based App Engine applications) and DB-API support (for Python-based App Engine applications). It even supports data import and export using mysqldump
  • Simple and powerful integration with Google App Engine.
  • Command line tool
  • SQL prompt in the Google APIs Console

The service is free for now. Google promises 30 days’ notice before that changes, but gives no hints about the pricing model.

Original title and link: Google Launches Google Cloud SQL a Relational Database as a Service (NoSQL database©myNoSQL)

Tanuki: A 30000 Cores AWS Cluster

Sometimes the only valid comment is wow.

We have now launched a cluster 3 times the size of Tanuki, or 30,000 cores, which cost $1279/hour to operate for a Top 5 Pharma. It performed genuine scientific work — in this case molecular modeling — and a ton of it. The complexity of this environment did not necessarily scale linearly with the cores.

In fact, we had to implement a triad of features within CycleCloud to make it a reality:

  1. MultiRegion support: To achieve the mind boggling core count of this cluster, we launched in three distinct AWS regions simultaneously, including Europe.
  2. Massive Spot instance support: This was a requirement given the potential savings at this scale by going through the spot market. Besides, our scheduling environment and the workload had no issues with the possibility of early termination and rescheduling.
  3. Massive CycleServer monitoring & Grill GUI app for Chef monitoring: There is no way that any mere human could keep track of all of the moving parts on a cluster of this scale.
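A quick back-of-the-envelope check on the figures quoted above: $1279/hour spread over 30,000 cores. The helper name `cost_per_core_hour` is made up for this sketch, and the result is plain division of the two quoted numbers, not a price published by Amazon or Cycle Computing.

```shell
# cost_per_core_hour HOURLY_COST CORES
# Divides the hourly cluster cost by the number of cores (4 decimal places).
cost_per_core_hour() {
  awk -v cost="$1" -v cores="$2" 'BEGIN { printf "%.4f\n", cost / cores }'
}

cost_per_core_hour 1279 30000   # prints 0.0426
```

That works out to roughly four cents per core per hour.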

Facebook runs a 30PB Hadoop analytic data warehouse and Yahoo! has a 100,000 cores/40,000 machines Hadoop cluster. I’m wondering what the largest Amazon Elastic MapReduce jobs ever run are. Any ideas?

Original title and link: Tanuki: A 30000 Cores AWS Cluster (NoSQL database©myNoSQL)


Running MongoDB on the Cloud

I’ve been posting a lot about deployments in the cloud and especially about deploying MongoDB in the Amazon cloud.

In this video Jared Rosoff covers topics like scaling and performance characteristics of running MongoDB in the cloud and he also shares some best practices when using Amazon EC2.

Memcached in the Cloud: Amazon ElastiCache

Amazon announced today a new service: Amazon ElastiCache, or Memcached in the cloud. The new service is still in beta and available only in the US East (Virginia) Region.

While many will find this new service useful, it is a bit of a disappointment that Amazon took the safe route and went with pure Memcached. The only notable feature of Amazon ElastiCache is automatic failure detection and recovery. But compared with Membase (and the soon-to-be-released Couchbase 2.0), it is missing clustering, replication, support for virtual nodes, etc. Even though it advertises push-button scaling, ElastiCache will lose cached data when instances are added or removed.

The pace at which Amazon is launching new services is indeed impressive. I’m wondering what will be the first NoSQL database that will get official Amazon support.

Original title and link: Memcached in the Cloud: Amazon ElastiCache (NoSQL database©myNoSQL)