Wednesday, 12 January 2011
Big Data Analysis at BackType
RWW has a nice post diving into the data flow and the tools that BackType, a company with only 3 engineers, uses to process and analyze large amounts of data.
They’ve invented their own language, Cascalog, to make analysis easy, and their own database, ElephantDB, to simplify delivering the results of their analysis to users. They’ve even written a system to update traditional batch processing of massive data sets with new information in near real-time.
Some highlights:
- 25 terabytes of compressed binary data, over 100 billion individual records
- all services and data storage are on Amazon S3 and EC2
- between 60 and 150 EC2 instances serving an average of 400 requests/s
- Clojure and Python as platform languages
- Hadoop, Cascading and Cascalog are central pieces of BackType’s platform
- Cascalog, a Clojure-based query language for Hadoop, was created and open sourced by BackType engineer Nathan Marz
- ElephantDB, the storage solution, is a read-only cluster built on top of Berkeley DB files
- Crawlers place data in Gearman queues for processing and storing
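The crawler-to-queue step can be pictured with a minimal producer/worker sketch. This is not BackType's code and does not use the Gearman API: Python's standard `queue.Queue` stands in for the Gearman queue, and the names `crawl`, `worker`, and `run`, as well as the trivial "processing" step, are assumptions made purely for illustration.

```python
import queue
import threading


def crawl(pages, jobs):
    """Producer: a hypothetical crawler that enqueues raw records."""
    for page in pages:
        jobs.put(page)
    jobs.put(None)  # sentinel telling the worker there is no more work


def worker(jobs, store):
    """Consumer: take records off the queue, process them, store the result."""
    while True:
        record = jobs.get()
        if record is None:
            break
        # Placeholder processing step (an assumption): normalize the text.
        store.append(record.strip().lower())


def run(pages):
    jobs = queue.Queue()  # stand-in for a Gearman queue
    store = []            # stand-in for the storage layer
    t = threading.Thread(target=worker, args=(jobs, store))
    t.start()
    crawl(pages, jobs)
    t.join()
    return store
```

The point of the pattern is decoupling: crawlers only need to know how to enqueue work, while processing and storage can be scaled or changed independently behind the queue.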
BackType data flow is presented in the following diagram:
Included below is an interview with Nathan about Cascalog:
Original title and link: Big Data Analysis at BackType (NoSQL databases © myNoSQL)
via: http://www.readwriteweb.com/hack/2011/01/secrets-of-backtypes-data-engineers.php
