hadoop: All content tagged as hadoop in NoSQL databases and polyglot persistence
Friday, 3 February 2012
A bit of history around Hadoop Companies
Imagine this story 5 years from now… with all the scars from the competitive battle with the other companies trying to monetize Hadoop and… the millions in the bank.
Original title and link: A bit of history around Hadoop Companies (©myNoSQL)
Thursday, 2 February 2012
Hadoop and Seismic Data Processing
Geophysicists have been pushing the limits of high-performance computing for more than three decades; they were early adopters of the first Cray supercomputers as well as the massively parallel Connection Machine. Today, the most challenging seismic data processing tasks are performed on custom compute clusters that take advantage of multiple GPUs per node, high-performance networking and storage systems for fast data access.
How many fields we’ve never heard of have spent years handcrafting their own solutions for dealing with big data, solutions that would fit so nicely in Hadoop today?
Original title and link: Hadoop and Seismic Data Processing (©myNoSQL)
via: http://www.cloudera.com/blog/2012/01/seismic-data-science-hadoop-use-case/
Doug Cutting About Hadoop's Adoption
Doug Cutting expressing his surprise at Hadoop’s growth in an interview with Audrey Watters on O’Reilly Radar:
Yes. I didn’t expect Hadoop to become such a central component of data processing. I recognized that Google’s techniques would be useful to other search engines and that open source was the best way to spread these techniques. But I did not realize how many other folks had big data problems nor how many of these Hadoop applied to.
Hadoop is not Doug Cutting’s first widely successful open source project, so I’m tempted to think this is just pure modesty.
Original title and link: Doug Cutting About Hadoop’s Adoption (©myNoSQL)
via: http://radar.oreilly.com/2012/02/hadoop-doug-cutting-apache-data-processing.html
Wednesday, 1 February 2012
Microsoft, Hadoop, and Open Source Contributions
Edd Dumbill:
Microsoft’s goals go beyond integrating Hadoop into Windows. It intends to contribute the adaptions it makes back to the Apache Hadoop project, so that anybody can run a purely open source Hadoop on Windows.
In the open source world contributions are measured in code or documentation or donations. Less so in interviews or PR announcements.
So far Microsoft doesn’t seem to know this game. But if its intentions are true, the community will help.
Original title and link: Microsoft, Hadoop, and Open Source Contributions (©myNoSQL)
via: http://radar.oreilly.com/2012/01/microsoft-big-data.html
How to Hadoop: Maximizing the value of big data
Brian Christian[1] (Zettaset) suggests two roads for adopting Hadoop:
The first, building the capability internally, seems to hold out the promise of flexibility and control for organizations that employ it. While this has sometimes been the case for some large companies, a variety of studies indicate that even among Fortune 500 companies, less than 20 percent that began Hadoop development succeeded in deploying a solution.
The second approach entails working with a big-data, Hadoop-focused third party to develop a bespoke solution. In addition to eliminating the requirement of enormous equipment and human capital investment, this approach also enables organizations, their executives, and IT staff to focus on their core value propositions rather than being forced to become Hadoop specialists.
It would be easy if the decision were just about CAPEX vs OPEX. Or on-premise vs managed deployments. But there are tons of variables that must be considered when going the Big Data way. Eventually pretty much everyone will do something around Big Data, but those at the forefront still have to figure out many important aspects.
[1] Brian Christian is CEO of Zettaset, which delivers a fault-tolerant and highly available solution for big data aggregation.
Original title and link: How to Hadoop: Maximizing the value of big data (©myNoSQL)
via: http://venturebeat.com/2012/01/24/big-data-server-efficiency/
Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support
Starting today customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed to CloudWatch every five minutes at no cost to you and include information on:
- Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
- Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
- Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.
That’s like free pr0n for operations teams.
On a different note, I’ve noticed that the Hadoop stack (Hadoop, Hive, Pig) on Amazon Elastic MapReduce is based on second-to-last versions, which suggests that extensive testing is performed on Amazon’s side before rolling new versions out:
- Hadoop: 0.20.205 (precursor of Hadoop 1.0.0); supports append and security, but doesn’t have RAID, symlinks or MR2
- Hive: 0.7.1 (precursor of latest 0.8.0)
- Pig: 0.9.1 (precursor of latest 0.9.2)
Original title and link: Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support (©myNoSQL)
Tuesday, 31 January 2012
Vertica and Hadoop for Big Data
Here is what I’ve jotted down during Vertica’s webinar Hadoop vs. RDBMS for Big Data Analytics: Why Choose?
- the webinar focused on clarifying where and how Vertica and Hadoop fit in the Big Data space
- Vertica’s strengths:
  - support for SQL, extended SQL, and analytics, making it suitable for interactive investigation of data
  - storage space efficiency (I don’t think it’s correct to interpret Hadoop data redundancy as storage space inefficiency)
  - analytics SDK (allows customizing in-database analytic functions)
  - ease of operation and maintenance (auto-tuning features)
- the following slide is pretty eloquent about Hadoop and Vertica being complementary solutions:

- when covering a scenario for using both Hadoop and Vertica, they chose the easy one: Hadoop as ETL. It’s not that it’s not a good one, but it’s the only one database vendors are using these days when speaking about integration with Hadoop.

- other possible Hadoop + Vertica use cases:
  - filter, join, and aggregation in Vertica with intermediate results fed into MR jobs
  - parallel import and export to HDFS
  - Hadoop MapReduce for data transformation and Vertica for optimized storage and retrieval
- there will be a community edition of Vertica. It was announced in October for the end of 2011, but I don’t think it’s out yet
- there’s a GitHub repo for user defined extensions for Vertica
- the following categorization of Big Data tools is interesting, but feels biased in favor of Vertica, which would be placed somewhere close to the center of the triangle

Original title and link: Vertica and Hadoop for Big Data (©myNoSQL)
NoSQL Tutorial: Setting Up a Hadoop Cluster with MongoDB Support on EC2
A complete and detailed guide for setting up a Hadoop cluster with MongoDB support by Artem Yankov. It uses the MongoDB Hadoop adapter, mongo-hadoop, which provides input and output adapters, support for InputSplits, and write-only Pig support.
What is covered in the tutorial:
- Creating an AMI with the custom settings (installed hadoop and mongo-hadoop)
- Launching a hadoop cluster on EC2
- Adding more nodes to the cluster
- Running some sample jobs
Original title and link: NoSQL Tutorial: Setting Up a Hadoop Cluster with MongoDB Support on EC2 (©myNoSQL)
via: http://artemyankov.com/post/16717104998/how-to-set-up-a-hadoop-cluster-with-mongo-support-on
Monday, 30 January 2012
Powered by Hadoop and Hive: Budgeting for snow removal in your local community
I don’t know how I ended up becoming the head of our local community association. Anyhow, I’m now responsible for laying out next year’s budget. Most of our expenses seem to be fixed from one year to another, but then there’s the expense for the snow removal service. This year, no snow. Last year, most snow on record in 30 years! How do you budget for something as volatile as snow? I need more data!
Instead of just googling the answer, we’re going to fetch some raw data and feed it into Hadoop Hive.
Hadoop FTW!
Original title and link: Powered by Hadoop and Hive: Budgeting for snow removal in your local community (©myNoSQL)
via: http://magnusljadas.wordpress.com/2012/01/29/search-for-snow-with-hadoop-hive/
Friday, 27 January 2012
MapReduce With Hadoop: What Happens During Mapping
An interesting look at what happens during the map phase in Hadoop and the impact of emitting key-value pairs:
- a direct negative impact on the map time and CPU usage, due to more serialization
- an indirect negative impact on CPU due to more spilling and additional deserialization in the combine step
- a direct impact on the map task, due to more intermediate files, which makes the final merge more expensive

The main point of the dynaTrace blog post is that even if Hadoop makes it easy to throw more hardware at a problem, wasting resources with bad code in MapReduce tasks comes with a noticeable and measurable cost.
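To make the cost of emitting key-value pairs concrete, here is a minimal sketch in plain Python (not the Hadoop API) that compares a naive word-count mapper with one using in-mapper combining, a common way to cut the number of emitted pairs. It is not necessarily the approach taken in the dynaTrace post, and the function names `naive_map` and `combining_map` are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def naive_map(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit one (word, 1) pair per occurrence; every pair has to be serialized."""
    for line in lines:
        for word in line.split():
            yield word, 1


def combining_map(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Aggregate counts in memory first, then emit one pair per distinct word."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    yield from counts.items()


if __name__ == "__main__":
    # A stand-in for one input split (assumption: whitespace-separated words).
    split = ["to be or not to be", "to do or not to do"]
    naive = list(naive_map(split))
    combined = list(combining_map(split))
    print(len(naive), "pairs emitted without combining")
    print(len(combined), "pairs emitted with in-mapper combining")
```

On this tiny input the naive mapper emits 12 pairs and the combining one emits 5. The trade-off is memory: the in-memory table grows with the number of distinct keys in the split, so a real job would likely need to flush it periodically.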
Original title and link: MapReduce With Hadoop: What Happens During Mapping (©myNoSQL)
via: http://blog.dynatrace.com/2012/01/25/about-the-performance-of-map-reduce-jobs/
Analysts' Predictions for Hadoop Market
With so many players in the market[1], it’s easy to see that not all of them will flourish. IDC has predicted that this year will see a lot of merger and acquisition activity as large technology companies rush to buy smaller companies with expertise in big data. By 2015, the analysts say it’s likely that none of the current “major players” in the Hadoop market will still exist.
These predictions also have a dark, scary side. Not in the sense that existing companies that bring value to the market do not deserve good exits in the next 3-4 years. But most of the time, if not ignored, these statements will lead to an amplification of BS and the creation of a ton of copy-cats bringing no value to a market that still has to see a lot of innovation, adoption, and return on investment for the users.
[1] According to Benjamin Woo, program vice president for worldwide storage systems at IDC, there are over 200 companies that claim to be in the big data space.
Original title and link: Analysts’ Predictions for Hadoop Market (©myNoSQL)
Measuring User Retention With Hadoop and Hive
A very practical example of how Hive and Hadoop can deliver value when applied to clickstreams, the most common data for any web property:
Hadoop, Hive, and related technologies are formidable tools for unlocking value from data. […] Retention measurements are particularly significant because they paint a detailed picture about the overall stickiness of a product across the entire userbase.
The same clickstream data can be used to calculate visitors’ conversion with the Bayesian discriminant using Hadoop.
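The linked post does this with Hive; as a rough illustration of the underlying idea, here is a small Python sketch that computes day-N retention from clickstream events. The `(user_id, day)` event schema, the function name `retention_by_day`, and the exact definition of retention (share of all users active N days after their first visit) are assumptions made for the example, not details taken from the post.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[str, date]  # (user_id, day of activity); a hypothetical schema


def retention_by_day(events: Iterable[Event]) -> Dict[int, float]:
    """Return the fraction of all users who were active N days after their first visit."""
    # Group the distinct active days per user (duplicate events collapse).
    days_by_user: Dict[str, Set[date]] = defaultdict(set)
    for user_id, day in events:
        days_by_user[user_id].add(day)

    # Count users active at each offset from their own first-seen day.
    active_users: Dict[int, int] = defaultdict(int)
    for days in days_by_user.values():
        first_seen = min(days)
        for day in days:
            active_users[(day - first_seen).days] += 1

    total_users = len(days_by_user)
    return {offset: count / total_users for offset, count in sorted(active_users.items())}


if __name__ == "__main__":
    events = [
        ("a", date(2012, 1, 1)), ("a", date(2012, 1, 2)), ("a", date(2012, 1, 8)),
        ("b", date(2012, 1, 1)), ("b", date(2012, 1, 8)),
        ("c", date(2012, 1, 3)),
    ]
    for offset, share in retention_by_day(events).items():
        print(f"day {offset}: {share:.0%} of users active")
```

In a Hadoop setting the first grouping step would correspond to the shuffle by user id, and the per-offset counts to a second aggregation; a Hive query would express both as GROUP BY stages.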
Original title and link: Measuring User Retention With Hadoop and Hive (©myNoSQL)
via: http://blog.polarmobile.com/2012/01/measuring-user-retention-with-hadoop-and-hive/