aws: All content tagged as aws in NoSQL databases and polyglot persistence
Monday, 12 December 2011
Extracting and Tokenizing 30TB of Web Crawl Data
All code for this 5-step process of extracting and tokenizing Common Crawl’s 30TB of data is available on GitHub:
- Distributed copy to get data into a Hadoop cluster
- Filtering text/html
- Using boilerpipe for extracting visible text
- Using Apache Tika LanguageIdentifier for filtering English content
- Tokenizing using the Stanford parser
Original title and link: Extracting and Tokenizing 30TB of Web Crawl Data (©myNoSQL)
via: http://matpalm.com/blog/2011/12/10/common_crawl_visible_text/
Friday, 9 December 2011
Mahout on Amazon EC2: Installing Hadoop/Mahout on High Performance Instance
Danny Bickson:
Full procedure should take around 2-3 hours.. :-(
I think this would be considered a good provisioning speed for ramping up a new machine in your data center. But it is not a good getting-up-to-speed time.
Original title and link: Mahout on Amazon EC2: Installing Hadoop/Mahout on High Performance Instance (©myNoSQL)
via: http://bickson.blogspot.com/2011/02/mahout-on-amazon-ec2-part-5-installing.html
Wednesday, 7 December 2011
Backing Up HBase to Amazon S3
This is a guest post by the Bizosys team, creators of HSearch, an open source, NoSQL, distributed, real-time search engine built on Hadoop and HBase.
We evaluated various options for backing up data inside HBase and built a solution. This post explains the options and provides the solution for anyone to download and use with their own HBase installations.
After considering these options we developed a simple tool that backs up data to Amazon S3 and restores it when needed. Another requirement was to take a full backup over the weekend and a daily incremental backup.
In a recovery scenario, the tool should first set up a clean environment with all tables created and populated with the latest full backup data, then apply all incremental backups sequentially. However, with this method deletes are not captured, which may leave some unnecessary data in the tables. This is a known disadvantage of this method of backup and restore.
Internally, the backup program uses the HBase Import and Export tools to run the work as Map-Reduce jobs.
Top 10 Features of the backup tool
- Exports complete data for the given set of tables to an S3 bucket.
- Exports incremental data for the given set of tables to an S3 bucket.
- Lists all complete as well as incremental backup repositories.
- Restores a table from backup based on the given backup repository.
- Runs in Map-Reduce.
- In case of connection failure, retries with increasing delays.
- Handles special characters like _ that would otherwise cause problems for the export and import activities.
- Enhances the existing Export and Import tools with detailed logging to report a failure rather than just exiting with a program status of 1.
- Works with a human-readable time format for taking, listing and restoring backups rather than system tick time or Unix epoch time (time represented as a number rather than in a readable format such as YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE).
- Takes all parameters from the command line, which allows a cron job to run it at regular intervals.
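The post does not describe how the tool implements its retries (that logic lives in its Java code). As a rough illustration of the general idea only, retrying a command with increasing delays could be sketched in shell as follows. The function name `retry_with_backoff`, the attempt limit and the doubling of the delay are all assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of the backup tool): run a command and,
# on failure, retry it with increasing delays.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
retry_with_backoff() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Run the command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))       # double the wait before the next try
        attempt=$((attempt + 1))
    done
}
```

Called as `retry_with_backoff 5 10 some_command`, this would wait 10, 20, 40 and 80 seconds between the five attempts before giving up.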
Setting up the tool
- Download the package from hbackup.install.tar. This package includes the necessary jar files and the source code.
- Set up a configuration file. Download the hbase-site.xml file. Add to this the fs.s3.awsAccessKeyId, fs.s3.awsSecretAccessKey, fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties.
- Set up the classpath with all jars inside the hbase/lib directory, the hbase.jar file, and the java-xmlbuilder-0.4.jar, jets3t-0.8.1a.jar and hbackup-1.0-core.jar files bundled inside the downloaded hbackup.install.tar. Make sure hbackup-1.0-core.jar is at the beginning of the classpath. In addition, add the configuration directory that holds the hbase-site.xml file to the CLASSPATH.
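The four S3 properties named in the configuration step are ordinary Hadoop-style `<property>` entries. As a small sketch, the fragment to paste inside the `<configuration>` element of hbase-site.xml could be generated like this. The helper names `s3_property` and `s3_properties` are made up for this example, and the credentials are passed in as placeholders.

```shell
#!/bin/sh
# Hypothetical helper: print one Hadoop-style <property> element.
s3_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Print the four S3 credential properties named in the setup step.
# Usage: s3_properties ACCESS_KEY SECRET_KEY
# Note: values are written as-is, with no XML escaping.
s3_properties() {
    access_key=$1
    secret_key=$2
    s3_property fs.s3.awsAccessKeyId      "$access_key"
    s3_property fs.s3.awsSecretAccessKey  "$secret_key"
    s3_property fs.s3n.awsAccessKeyId     "$access_key"
    s3_property fs.s3n.awsSecretAccessKey "$secret_key"
}
```

Running `s3_properties YOUR_ACCESS_KEY YOUR_SECRET_KEY` prints the four entries to standard output. Because the file then contains credentials, it is worth restricting its permissions.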
Running the tool
Usage: The tool runs in 4 modes: [backup.full], [backup.incremental], [backup.history] and [restore].
[backup.full]
mode=backup.full tables="comma separated tables" backup.folder=S3-Path date="YYYY.MM.DD 24HH:MINUTE:SECOND:MILLSECOND TIMEZONE"
Example:
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/ date="2011.12.01 17:03:38:546 IST"
mode=backup.full tables=tab1,tab2,tab3 backup.folder=s3://S3BucketABC/
[backup.incremental]
mode=backup.incremental tables="comma separated tables" backup.folder=S3-Path duration.mins=Minutes
Example of backup of changes occurred in the last 30 minutes:
mode=backup.incremental backup.folder=s3://S3BucketABC/ duration.mins=30 tables=tab1,tab2,tab3
[backup.history]
mode=backup.history backup.folder=S3-Path
Example of listing past archives. Incremental ones end with .incr
mode=backup.history backup.folder=s3://S3BucketABC/
[restore]
mode=restore backup.folder=S3-Path/ArchiveDate tables="comma separated tables"
Example of adding the rows archived during that date. First apply a full backup and then apply incremental backups.
mode=restore backup.folder=s3://S3-Path/DAY_MON_HH_MI_SS_SSS_ZZZ_YYYY tables=tab1,tab2,tab3
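The restore order described above (the full backup first, then each incremental backup in sequence) can be sketched as a small wrapper around the tool. This is only an illustration: the function name `restore_in_order` and the overridable `JAVA` variable are assumptions, and the archive names passed in stand for whatever folder names mode=backup.history reports.

```shell
#!/bin/sh
# Sketch only: restore one full backup, then incremental backups in the order given.
# JAVA can be overridden; the default path follows the sample scripts in this post.
JAVA=${JAVA:-/usr/lib/jdk/bin/java}

# Usage: restore_in_order TABLES BUCKET FULL_ARCHIVE [INCR_ARCHIVE...]
# Pass the incremental archives oldest first.
restore_in_order() {
    tables=$1
    bucket=$2
    shift 2
    for archive in "$@"
    do
        echo "Restoring $archive" >&2
        # Stop at the first failure so later increments are not applied on top of a gap.
        $JAVA com.bizosys.oneline.maintenance.HBaseBackup mode=restore \
            "backup.folder=$bucket/$archive" "tables=$tables" || return 1
    done
}
```

With the classpath already set up, a call such as `restore_in_order tab1,tab2,tab3 s3://mybucket FULL_ARCHIVE INCR_ARCHIVE_1 INCR_ARCHIVE_2` would apply the three archives in that order. As noted earlier, rows deleted between backups are not removed by this procedure.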
Sample scripts to run the backup tool
Setup:
$ cat setenv.sh
for file in /mnt/hbase/lib/*
do
  export CLASSPATH=$CLASSPATH:$file
done
export CLASSPATH=/mnt/hbase/hbase-0.90.4.jar:$CLASSPATH
export CLASSPATH=/mnt/hbackup/hbackup-1.0-core.jar:/mnt/hbackup/java-xmlbuilder-0.4.jar:/mnt/hbackup/jets3t-0.8.1a.jar:/mnt/hbackup/conf:$CLASSPATH
Full backup:
$ cat backup_full.sh
. /mnt/hbackup/bin/setenv.sh
dd=`date "+%Y.%m.%d %H:%M:%S:000 %Z"`
echo "Backing up for date $dd"
for table in table1 table2 table3
do
  /usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.full backup.folder=s3://mybucket/ tables=$table "date=$dd"
  sleep 10
done
List of backups:
$ cat list.sh
. /mnt/hbackup/bin/setenv.sh
/usr/lib/jdk/bin/java com.bizosys.oneline.maintenance.HBaseBackup mode=backup.history backup.folder=s3://mybucket
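The samples above cover full backups and listing. An incremental backup script in the same style might look like the sketch below. The function name `backup_incremental` and the overridable `JAVA` variable are assumptions for this example. The class name and parameters are the ones shown in the usage section.

```shell
#!/bin/sh
# Sketch only: incremental backup in the style of backup_full.sh.
# JAVA can be overridden; the default path follows the sample scripts above.
JAVA=${JAVA:-/usr/lib/jdk/bin/java}

# Usage: backup_incremental MINUTES BUCKET TABLE [TABLE...]
backup_incremental() {
    minutes=$1
    bucket=$2
    shift 2
    for table in "$@"
    do
        echo "Incremental backup of $table (last $minutes minutes)" >&2
        # One tool invocation per table, as in the full backup script.
        $JAVA com.bizosys.oneline.maintenance.HBaseBackup mode=backup.incremental \
            "backup.folder=$bucket" "duration.mins=$minutes" "tables=$table" || return 1
    done
}
```

A cron job could source setenv.sh and then call, for example, `backup_incremental 30 s3://mybucket/ table1 table2 table3` every 30 minutes, with the interval and the duration.mins value chosen to match.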
Original title and link: Backing Up HBase to Amazon S3 (©myNoSQL)
Thursday, 6 October 2011
Google Launches Google Cloud SQL a Relational Database as a Service
Google has just announced a new (lab) product: Google Cloud SQL which is Google’s Database-as-a-Service version of Amazon RDS—based on initial information, Google Cloud SQL could be characterized as a very basic/intro version of Amazon RDS.
Main features listed in the announcement:
- Managed environment
- High reliability and availability - your data is replicated synchronously to multiple data centers. Machine, rack and data center failures are handled automatically to minimize end-user impact. It also supports asynchronous replication.
- Familiar MySQL database environment with JDBC support (for Java-based App Engine applications) and DB-API support (for Python-based App Engine applications). It even supports data import and export using mysqldump.
- Simple and powerful integration with Google App Engine.
- Command line tool
- SQL prompt in the Google APIs Console
The service is free for now, and Google promises 30 days’ notice, though without giving any hints about the pricing model.
Original title and link: Google Launches Google Cloud SQL a Relational Database as a Service (©myNoSQL)
Thursday, 22 September 2011
Tanuki: A 30000 Cores AWS Cluster
Sometimes the only valid comment is wow.
We have now launched a cluster 3 times the size of Tanuki, or 30,000 cores, which cost $1279/hour to operate for a Top 5 Pharma. It performed genuine scientific work — in this case molecular modeling — and a ton of it. The complexity of this environment did not necessarily scale linearly with the cores.
In fact, we had to implement a triad of features within CycleCloud to make it a reality:
- MultiRegion support: To achieve the mind boggling core count of this cluster, we launched in three distinct AWS regions simultaneously, including Europe.
- Massive Spot instance support: This was a requirement given the potential savings at this scale by going through the spot market. Besides, our scheduling environment and the workload had no issues with the possibility of early termination and rescheduling.
- Massive CycleServer monitoring & Grill GUI app for Chef monitoring: There is no way that any mere human could keep track of all of the moving parts on a cluster of this scale.
Facebook runs a 30PB Hadoop analytic data warehouse and Yahoo! has a 100,000 cores/40,000 machines Hadoop cluster. I’m wondering what the largest Amazon Elastic MapReduce jobs ever run are. Any ideas?
Original title and link: Tanuki: A 30000 Cores AWS Cluster (©myNoSQL)
Saturday, 27 August 2011
Running MongoDB on the Cloud
I’ve been posting a lot about deployments in the cloud and especially about deploying MongoDB in the Amazon cloud:
- MongoDB on Amazon EC2 with EBS Volumes
- MongoDB on EC2
- MongoDB in the Amazon Cloud
- Setting Up MongoDB Replica Sets on Amazon EC2
- MongoDB and Amazon: Why EBS?
- Amazon EBS vs SSD: Price, Performance, QoS
- Multi-tenancy and Cloud Storage Performance
In this video Jared Rosoff covers topics like scaling and performance characteristics of running MongoDB in the cloud and he also shares some best practices when using Amazon EC2.
Tuesday, 23 August 2011
Memcached in the Cloud: Amazon ElastiCache
Amazon announced today a new service, Amazon ElastiCache, or Memcached in the cloud. The new service is still in beta and available only in the US East (Virginia) Region.
While many will find this new service useful, it is a bit of a disappointment that Amazon took the safe route and went with pure Memcached. The only notable feature of Amazon ElastiCache is automatic failure detection and recovery. But compared with Membase (and the soon-to-be-released Couchbase 2.0) it is missing clustering, replication, support for virtual nodes, etc. Even though it advertises push-button scaling, ElastiCache will lose cached data when instances are added or removed.
The pace at which Amazon is launching new services is indeed impressive. I’m wondering what will be the first NoSQL database that will get official Amazon support.
Original title and link: Memcached in the Cloud: Amazon ElastiCache (©myNoSQL)
Monday, 22 August 2011
Reliable, Scalable, and Kinda Sorta Cheap: A Cloud Hosting Architecture for MongoDB
Using MongoDB replica sets:
At Famigo, we house all of our valuable data in MongoDB and we also serve all requests from Amazon EC2 instances. We’ve devoted many mental CPU cycles to finding the right architecture for our data in the cloud, focusing on 3 main factors: cost, reliability, and performance.
Original title and link: Reliable, Scalable, and Kinda Sorta Cheap: A Cloud Hosting Architecture for MongoDB (©myNoSQL)
via: http://www.codypowell.com/taods/2011/08/a-cloud-hosting-architecture-for-mongodb.html
Friday, 12 August 2011
MongoDB and Amazon: Why EBS?
After linking to MongoDB in the Amazon cloud, MongoDB and EC2 and the older MongoDB on Amazon EC2 with EBS volumes, Arnout Kazemier commented:
The only thing I dislike about that EC2 guide is that it’s suggesting to use EBS instead of the regular EC2 instance storage
This is an apt question in the light of the prolonged Amazon outage, Reddit’s experience with EBS, the unpredictable EBS performance, and Netflix’s Adrian Cockcroft’s explanation of the impact of multi-tenancy on Amazon EBS performance. Maybe someone could answer it.
Original title and link: MongoDB and Amazon: Why EBS? (©myNoSQL)
Tuesday, 9 August 2011
Setting Up MongoDB Replica Sets on Amazon EC2
Zachary Witte:
When you have the instance basically set, go back into the AWS control panel, right click the instance and choose Create Image. You can start up any number of these for the replica set, but you need to change the /etc/hostname and /etc/hosts file to reflect the individual IP address and hostname of the bot (db1, db2, db3, etc.)
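As a small illustration of the per-instance change described in the quote, the /etc/hosts entries for the replica set members could be generated as below. The function name `replica_hosts` is made up for this sketch, and the IP addresses in the usage note are placeholders, not values from the original guide.

```shell
#!/bin/sh
# Hypothetical helper: print /etc/hosts lines naming replica set members db1, db2, ...
# Usage: replica_hosts IP [IP...]
replica_hosts() {
    i=1
    for ip in "$@"
    do
        # One line per member: address, a tab, then the member's hostname.
        printf '%s\tdb%d\n' "$ip" "$i"
        i=$((i + 1))
    done
}
```

For example, `replica_hosts 10.0.0.11 10.0.0.12 10.0.0.13` prints three lines mapping those addresses to db1, db2 and db3. The output would be appended to /etc/hosts on each instance, while /etc/hostname on each machine holds only that machine's own name.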
Before you set up MongoDB on EC2, make sure you understand the various aspects of running MongoDB in the Amazon cloud.
Original title and link: Setting Up MongoDB Replica Sets on Amazon EC2 (©myNoSQL)
via: http://www.zacwitte.com/how-to-set-up-ubuntu-w-mongodb-replica-sets-on-amazon-ec2
Monday, 4 July 2011
Hadoop Chaos Monkey: The Fault Injection Framework
Do you remember the 5 lessons Netflix learned while using Amazon Web Services? Judging by how much Netflix has shared about their experience in the cloud, including with Amazon SimpleDB, I’d say these 5 are only the tip of the iceberg. In them, Netflix talked about the Chaos Monkey:
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
Hadoop provides a similar framework, the Fault Injection Framework:
The idea of fault injection is fairly simple: it is an infusion of errors and exceptions into an application’s logic to achieve a higher coverage and fault tolerance of the system. Different implementations of this idea are available today. Hadoop’s FI framework is built on top of Aspect Oriented Paradigm (AOP) implemented by AspectJ toolkit.
As a sidenote, this is one of the neatest usages of AspectJ I’ve read about.
Update: Abhijit Belapurkar says that Fault injection using AOP was part of Recovery Oriented Computing research at Stanford/UCB many years ago: JAGR: An Autonomous Self-Recovering Application Server.
Original title and link: Hadoop Chaos Monkey: The Fault Injection Framework (©myNoSQL)
Monday, 27 June 2011
Building an Ad Network Ready for Failure
The architecture of a fault-tolerant ad network built on top of HAProxy, Apache with mod_wsgi and Python, Redis, a bit of PostgreSQL and ActiveMQ deployed on AWS:
The real workhorse of our ad targeting platform was Redis. Each box slaved from a master Redis, and on failure of the master (which happened once), a couple “slaveof” calls got us back on track after the creation of a new master. A combination of set unions/intersections with algorithmically updated targeting parameters (this is where experimentation in our setup was useful) gave us a 1 round-trip ad targeting call for arbitrary targeting parameters. The 1 round-trip thing may not seem important, but our internal latency was dominated by network round-trips in EC2. The targeting was similar in concept to the search engine example I described last year, but had quite a bit more thought regarding ad targeting. It relied on the fact that you can write to Redis slaves without affecting the master or other slaves. Cute and effective. On the Python side of things, I optimized the redis-py client we were using for a 2-3x speedup in network IO for the ad targeting results.
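The quote does not spell out the exact recovery steps. One way the "slaveof" calls could look with redis-cli is sketched below, as an assumption-laden illustration rather than the author's procedure. Here an existing slave is promoted, whereas the quote mentions creating a new master. The host names, the `REDIS_CLI` and `REDIS_PORT` variables and the function name are hypothetical.

```shell
#!/bin/sh
# Sketch only: promote one Redis slave to master and repoint the other slaves at it.
# REDIS_CLI and REDIS_PORT can be overridden; host names are placeholders.
REDIS_CLI=${REDIS_CLI:-redis-cli}
REDIS_PORT=${REDIS_PORT:-6379}

# Usage: promote_and_repoint NEW_MASTER_HOST [SLAVE_HOST...]
promote_and_repoint() {
    new_master=$1
    shift
    # "SLAVEOF NO ONE" turns the chosen instance into a master.
    $REDIS_CLI -h "$new_master" -p "$REDIS_PORT" slaveof no one || return 1
    for slave in "$@"
    do
        # Each remaining slave starts replicating from the new master.
        $REDIS_CLI -h "$slave" -p "$REDIS_PORT" slaveof "$new_master" "$REDIS_PORT" || return 1
    done
}
```

A call such as `promote_and_repoint box1 box2 box3` would issue one SLAVEOF NO ONE to box1 and one SLAVEOF to each of the other two boxes. Any data written directly to a slave, as the quoted setup does, is replaced when that slave resynchronizes with its new master.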
Original title and link: Building an Ad Network Ready for Failure (©myNoSQL)
via: http://dr-josiah.blogspot.com/2011/06/building-ad-network-ready-for-failure.html
