hadoop: All content tagged as hadoop in NoSQL databases and polyglot persistence

A bit of history around Hadoop Companies

Imagine this story 5 years from now… with all the scars from the competitive battle with the other companies trying to monetize Hadoop and … the millions in the bank.

Original title and link: A bit of history around Hadoop Companies (NoSQL database©myNoSQL)

via: http://www.businessinsider.com/this-former-yahoo-ers-startup-is-so-hot-even-the-cia-invested-in-it-2012-1


Hadoop and Seismic Data Processing

Geophysicists have been pushing the limits of high-performance computing for more than three decades; they were early adopters of the first Cray supercomputers as well as the massively parallel Connection Machine. Today, the most challenging seismic data processing tasks are performed on custom compute clusters that take advantage of multiple GPUs per node, high-performance networking and storage systems for fast data access.

How many fields we’ve never heard of have spent years handcrafting their own solutions for dealing with big data, solutions that would fit so nicely in Hadoop today?

Original title and link: Hadoop and Seismic Data Processing (NoSQL database©myNoSQL)

via: http://www.cloudera.com/blog/2012/01/seismic-data-science-hadoop-use-case/


Doug Cutting About Hadoop's Adoption

Doug Cutting expressing his surprise at Hadoop’s growth in an interview with Audrey Watters on O’Reilly Radar:

Yes. I didn’t expect Hadoop to become such a central component of data processing. I recognized that Google’s techniques would be useful to other search engines and that open source was the best way to spread these techniques. But I did not realize how many other folks had big data problems nor how many of these Hadoop applied to.

Hadoop is not Doug Cutting’s first widely successful open source project, so I’m tempted to think this is just pure modesty.

Original title and link: Doug Cutting About Hadoop’s Adoption (NoSQL database©myNoSQL)

via: http://radar.oreilly.com/2012/02/hadoop-doug-cutting-apache-data-processing.html


Microsoft, Hadoop, and Open Source Contributions

Edd Dumbill:

Microsoft’s goals go beyond integrating Hadoop into Windows. It intends to contribute the adaptions it makes back to the Apache Hadoop project, so that anybody can run a purely open source Hadoop on Windows.

In the open source world contributions are measured in code or documentation or donations. Less so in interviews or PR announcements.

So far Microsoft doesn’t seem to know this game. But if its intentions are true, the community will help.

Original title and link: Microsoft, Hadoop, and Open Source Contributions (NoSQL database©myNoSQL)

via: http://radar.oreilly.com/2012/01/microsoft-big-data.html


How to Hadoop: Maximizing the value of big data

Brian Christian[1] (Zettaset) suggests two roads for adopting Hadoop:

The first, building the capability internally, seems to hold out the promise of flexibility and control for organizations that employ it. While this has sometimes been the case for some large companies, a variety of studies indicate that even among Fortune 500 companies, less than 20 percent that began Hadoop development succeeded in deploying a solution.

The second approach entails working with a big-data, Hadoop-focused third party to develop a bespoke solution. In addition to eliminating the requirement of enormous equipment and human capital investment, this approach also enables organizations, their executives, and IT staff to focus on their core value propositions rather than being forced to become Hadoop specialists.

It would be easy if the decision were just about CAPEX vs OPEX. Or on-premises vs managed deployments. But there are tons of variables that must be considered when going the Big Data way. Eventually pretty much everyone will do something around Big Data, but those at the forefront still have to figure out many important aspects.


  1. Brian Christian is CEO of Zettaset, which delivers a fault-tolerant and highly available solution for big data aggregation 

Original title and link: How to Hadoop: Maximizing the value of big data (NoSQL database©myNoSQL)

via: http://venturebeat.com/2012/01/24/big-data-server-efficiency/


Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support

Starting today customers can view graphs of 23 job flow metrics within the EMR Console by selecting the Monitoring tab in the Job Flow Details page. These metrics are pushed CloudWatch every five minutes at no cost to you and include information on:

  • Job flow progress including metrics on the number of map and reduce tasks running and remaining in your job flow and the number of bytes read and written to S3 and HDFS.
  • Job flow contention including metrics on HDFS utilization, map and reduce slots open, jobs running, and the ratio between map tasks remaining and map slots.
  • Job flow health including metrics on whether your job flow is idle, if there are missing data blocks, and if there are any dead nodes.

That’s like free pr0n for operations teams.

On a different note, I’ve noticed that the Hadoop stack (Hadoop, Hive, Pig) on Amazon Elastic MapReduce is based on second-to-last versions, which suggests that Amazon performs extensive testing before rolling out new versions.

Original title and link: Amazon Elastic MapReduce New Features: Metrics, Updates, VPC, and Cluster Compute Support (NoSQL database©myNoSQL)

via: http://aws.typepad.com/aws/2012/01/new-elastic-mapreduce-features-metrics-updates-vpc-and-cluster-compute-support-guest-post.html


Vertica and Hadoop for Big Data

Here is what I’ve jotted down during Vertica’s webinar Hadoop vs. RDBMS for Big Data Analytics: Why Choose?

  • the webinar focused on clarifying where and how Vertica and Hadoop fit in the Big Data space
  • Vertica’s strengths:
    • support for SQL, extended SQL, and analytics, which makes interactive investigation of data possible
    • storage space efficiency (though I don’t think it’s correct to interpret Hadoop’s data redundancy as storage space inefficiency)
    • analytics SDK (allows customizing in-database analytic functions)
    • ease of operation and maintenance (auto-tuning features)
  • the following slide is pretty eloquent about Hadoop and Vertica being complementary solutions: Vertica vs Hadoop - Analytics Feature Comparison
  • when covering a scenario for using both Hadoop and Vertica, they chose the easy one: Hadoop as ETL. It’s not that it’s not a good one, but it’s the only one database vendors are using these days when speaking about integration with Hadoop.

    Hadoop + Vertica Use Case Example

  • other possible Hadoop + Vertica use cases:

    • Filter, join, and aggregation in Vertica with intermediate results fed into MR jobs
    • parallel import and export to HDFS
    • Hadoop MapReduce for data transformation and Vertica for optimized storage and retrieval
  • there will be a community edition of Vertica. It was announced in October for the end of 2011, but I don’t think it’s out yet
  • there’s a GitHub repo for user defined extensions for Vertica
  • the following categorization of Big Data tools is interesting but feels biased in favor of Vertica, which would be placed somewhere close to the center of the triangle

    Triangle of Big Data Tools

Original title and link: Vertica and Hadoop for Big Data (NoSQL database©myNoSQL)


NoSQL Tutorial: Setting Up a Hadoop Cluster with MongoDB Support on EC2

A complete and detailed guide for setting up a Hadoop cluster with MongoDB support, by Artem Yankov. It uses the MongoDB Hadoop adapter mongo-hadoop, which provides input and output adapters, support for InputSplits, and write-only Pig support.

What is covered in the tutorial:

  • Creating an AMI with the custom settings (installed hadoop and mongo-hadoop)
  • Launching a hadoop cluster on EC2
  • Adding more nodes to the cluster
  • Running some sample jobs

Original title and link: NoSQL Tutorial: Setting Up a Hadoop Cluster with MongoDB Support on EC2 (NoSQL database©myNoSQL)

via: http://artemyankov.com/post/16717104998/how-to-set-up-a-hadoop-cluster-with-mongo-support-on


Powered by Hadoop and Hive: Budgeting for snow removal in your local community

I don’t know how I ended up becoming the head of our local community association. Anyhow, I’m now responsible for laying out next year’s budget. Most of our expenses seem to be fixed from one year to another, but then there’s the expense for the snow removal service. This year, no snow. Last year, most snow on record in 30 years! How do you budget for something as volatile as snow? I need more data!

Instead of just googling the answer, we’re going to fetch some raw data and feed it into Hadoop Hive.

Hadoop FTW!

Original title and link: Powered by Hadoop and Hive: Budgeting for snow removal in your local community (NoSQL database©myNoSQL)

via: http://magnusljadas.wordpress.com/2012/01/29/search-for-snow-with-hadoop-hive/


MapReduce With Hadoop: What Happens During Mapping

An interesting look at what happens during the map phase in Hadoop and the impact of emitting more key-value pairs than necessary:

  • a direct negative impact on the map time and CPU usage, due to more serialization
  • an indirect negative impact on CPU due to more spilling and additional deserialization in the combine step
  • a direct impact on the map task, due to more intermediate files, which makes the final merge more expensive

Map Reduce Combine

The main point of the dynaTrace blog post is that even if Hadoop makes it easy to throw more hardware at a problem, wasting resources with bad code in MapReduce tasks comes with a noticeable and measurable cost.
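One common way to cut the number of emitted pairs is to aggregate inside the mapper before emitting. The sketch below is plain Python rather than Hadoop code, and it is not taken from the dynaTrace post: the function names (`naive_map`, `combining_map`, `reduce_counts`) and the word-count task are assumptions chosen only to make the difference in emitted pairs visible.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Pair = Tuple[str, int]


def naive_map(lines: Iterable[str]) -> Iterator[Pair]:
    """Emit one (word, 1) pair per token; every pair would be serialized."""
    for line in lines:
        for word in line.split():
            yield word, 1


def combining_map(lines: Iterable[str]) -> Iterator[Pair]:
    """Aggregate counts inside the mapper, then emit one pair per distinct word."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    yield from counts.items()


def reduce_counts(pairs: Iterable[Pair]) -> Dict[str, int]:
    """Sum the values for each key, as a reducer would."""
    totals: Counter = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


if __name__ == "__main__":
    lines = ["a b a", "b a c"]
    naive = list(naive_map(lines))
    combined = list(combining_map(lines))
    # Same final result, fewer intermediate pairs to serialize and spill.
    print(len(naive), len(combined))
    print(reduce_counts(naive) == reduce_counts(combined))
```

Both mappers lead to the same reduced result; the second one simply hands fewer pairs to the serialization, spill, and merge steps, at the price of holding a dictionary in the mapper's memory.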

Original title and link: MapReduce With Hadoop: What Happens During Mapping (NoSQL database©myNoSQL)

via: http://blog.dynatrace.com/2012/01/25/about-the-performance-of-map-reduce-jobs/


Analysts' Predictions for Hadoop Market

With so many players in the market[1], it’s easy to see that not all of them will flourish. IDC has predicted that this year will see a lot of merger and acquisition activity as large technology companies rush to buy smaller companies with expertise in big data. By 2015, the analysts say it’s likely that none of the current “major players” in the Hadoop market will still exist.

These predictions also have a dark, scary side. Not in the sense that existing companies that bring value to the market do not deserve good exits in the next 3-4 years. But most of the time, if not ignored, these statements will lead to an amplification of BS and the creation of a ton of copycats bringing no value to a market that still has to see a lot of innovation, adoption, and return on investment for the users.


  1. According to Benjamin Woo, program vice president for worldwide storage systems at IDC, there are over 200 companies that claim to be in the big data space.  

Original title and link: Analysts’ Predictions for Hadoop Market (NoSQL database©myNoSQL)

via: http://www.devx.com/Java/Article/47799/0/page/1


Measuring User Retention With Hadoop and Hive

A very practical example of how Hive and Hadoop can deliver value when applied to clickstreams, the most common kind of data every web property has:

Hadoop, Hive, and related technologies are formidable tools for unlocking value from data. […] Retention measurements are particularly significant because they paint a detailed picture about the overall stickiness of a product across the entire userbase.
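To make the idea of a retention measurement concrete, here is a minimal in-memory sketch in Python. It is not the Hive query from the linked post: the record shape `(user_id, day)`, the function name `retention_by_cohort`, and the definition used (share of a first-visit cohort that is active N days later) are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Set, Tuple

Event = Tuple[str, date]  # hypothetical clickstream record: (user_id, day)


def retention_by_cohort(events: Iterable[Event]) -> Dict[date, Dict[int, float]]:
    """Return cohort day -> {days since first visit: fraction of cohort active}."""
    records = list(events)

    # A user's cohort is the first day the user appears in the clickstream.
    first_seen: Dict[str, date] = {}
    for user, day in records:
        if user not in first_seen or day < first_seen[user]:
            first_seen[user] = day

    cohort_size: Dict[date, int] = defaultdict(int)
    for cohort in first_seen.values():
        cohort_size[cohort] += 1

    # Distinct active users per (cohort, day offset).
    active: Dict[date, Dict[int, Set[str]]] = defaultdict(lambda: defaultdict(set))
    for user, day in records:
        cohort = first_seen[user]
        active[cohort][(day - cohort).days].add(user)

    return {
        cohort: {
            offset: len(users) / cohort_size[cohort]
            for offset, users in sorted(offsets.items())
        }
        for cohort, offsets in active.items()
    }


if __name__ == "__main__":
    sample = [
        ("u1", date(2012, 1, 1)),
        ("u2", date(2012, 1, 1)),
        ("u1", date(2012, 1, 2)),
        ("u3", date(2012, 1, 2)),
        ("u3", date(2012, 1, 3)),
    ]
    for cohort, curve in sorted(retention_by_cohort(sample).items()):
        print(cohort, curve)
```

On a real clickstream the same grouping (first visit per user, then distinct users per cohort and day offset) would be expressed as joins and aggregations over the full log rather than in-memory dictionaries.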

The same clickstream data can be used to calculate visitors’ conversion with the Bayesian discriminant using Hadoop.

Original title and link: Measuring User Retention With Hadoop and Hive (NoSQL database©myNoSQL)

via: http://blog.polarmobile.com/2012/01/measuring-user-retention-with-hadoop-and-hive/