


yahoo: All content tagged as yahoo in NoSQL databases and polyglot persistence

Mapr: a Competitor to Hadoop Leader Cloudera

They are said to be building a proprietary replacement for the Hadoop Distributed File System that’s allegedly three times faster than the current open-source version. It comes with snapshots and no NameNode single point of failure (SPOF), and is supposed to be API-compatible with HDFS, so it can be a drop-in replacement.

Where can one get the Mapr product?

Considering Yahoo is now focusing on Apache Hadoop and their plans for the next generation Hadoop MapReduce, I wouldn’t hold my breath for Mapr improvements.

Original title and link: Mapr: a Competitor to Hadoop Leader Cloudera (NoSQL databases © myNoSQL)


Cloudera’s Distribution for Apache Hadoop version 3 Beta 4

New version of Cloudera’s Hadoop distro — complete release notes available here:

CDH3 Beta 4 also includes new versions of many components. Highlights include:

  • HBase 0.90.1, including much improved stability and operability.
  • Hive 0.7.0rc0, including the beginnings of authorization support, support for multiple databases, and many other new features.
  • Pig 0.8.0, including many new features like scalar types, custom partitioners, and improved UDF language support.
  • Flume 0.9.3, including support for Windows and improved monitoring capabilities.
  • Sqoop 1.2, including improvements to usability and Oracle integration.
  • Whirr 0.3, including support for starting HBase clusters on popular cloud platforms.

Plus many scalability improvements contributed by Yahoo!.

Cloudera’s CDH is the most popular Hadoop distro, bringing together many components of the Hadoop ecosystem. Yahoo remains the main innovator behind Hadoop.

Original title and link: Cloudera’s Distribution for Apache Hadoop version 3 Beta 4 (NoSQL databases © myNoSQL)


The Next Generation of Apache Hadoop MapReduce

I’m not sure how many companies have already hit this limit, but Yahoo! is again showing its Hadoop leadership:

The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale high-availability is built-in from the beginning; as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Hadoop Next Generation Architecture
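
To make the split described in the quote concrete, here is a toy sketch of a generic resource scheduler and a per-job component that drives execution. The names `ResourceScheduler` and `JobManager` are my own, hypothetical ones, not the actual Hadoop classes, and the sketch ignores high availability, security, and multi-tenancy entirely:

```python
from dataclasses import dataclass, field


@dataclass
class ResourceScheduler:
    """Generic scheduler: hands out slots and knows nothing about MapReduce."""
    free_slots: int

    def allocate(self, requested: int) -> int:
        # Grant as many slots as are free, up to the requested number.
        granted = min(requested, self.free_slots)
        self.free_slots -= granted
        return granted

    def release(self, slots: int) -> None:
        self.free_slots += slots


@dataclass
class JobManager:
    """Per-job component: owns the execution logic of one application."""
    name: str
    tasks: list
    completed: list = field(default_factory=list)

    def run(self, scheduler: ResourceScheduler) -> list:
        pending = list(self.tasks)
        while pending:
            granted = scheduler.allocate(len(pending))
            if granted == 0:
                raise RuntimeError("no capacity available")
            batch, pending = pending[:granted], pending[granted:]
            # Run the batch, then hand the slots back to the scheduler.
            self.completed.extend(task() for task in batch)
            scheduler.release(granted)
        return self.completed


# Hypothetical job: three tiny tasks on a two-slot cluster.
scheduler = ResourceScheduler(free_slots=2)
tasks = [lambda w=w: len(w) for w in ["pig", "hive", "oozie"]]
job = JobManager("word-lengths", tasks)
results = job.run(scheduler)
print(results)
```

The point of the split is that the scheduler only counts slots, so a different kind of job could reuse it by bringing its own manager.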

There are way too many interesting aspects covered in the post to spoil the pleasure of diving into them.

Original title and link: The Next Generation of Apache Hadoop MapReduce (NoSQL databases © myNoSQL)


YCSB Benchmark Results for Cassandra, HBase, MongoDB, Riak

A recent slide deck presenting the results of the YCSB benchmark run against the latest versions of Cassandra (0.6.10), HBase (0.20.6), MongoDB (1.6.5), and Riak (0.14.0):

Some of the results are striking, so I cannot help but wonder whether there were some configuration issues.

Update: A few users who had more luck reading the details on the slides have pointed out that this is not the YCSB benchmark, but rather a new one developed by the presenter. Another important detail is that the data set used was rather small and could easily fit in memory.

Original title and link: YCSB Benchmark Results for Cassandra, HBase, MongoDB, Riak (NoSQL databases © myNoSQL)

The Backstory of Yahoo and Hadoop

We currently have nearly 100 people working on Apache Hadoop and related projects, such as Pig, ZooKeeper, Hive, Howl, HBase and Oozie. Over the last 5 years, we’ve invested nearly 300 person-years into these projects. […] Today Yahoo runs on over 40,000 Hadoop machines (>300k cores). They are used by over a thousand regular users from our science and development teams. Hadoop is at the center of our research in search, advertising, spam detection, personalization and many other topics.

I assume it’s no surprise to anyone that I’m a big fan of Yahoo!’s open source initiatives.

Original title and link: The Backstory of Yahoo and Hadoop (NoSQL databases © myNoSQL)


Yahoo is Focusing on Apache Hadoop discontinues “The Yahoo Distribution of Hadoop”

This is big:

Yahoo! has decided to discontinue the “The Yahoo Distribution of Hadoop” and focus on Apache Hadoop. We plan to remove all references to a Yahoo distribution from our website, close our github repo and focus on working more closely with the Apache community. Our intent is to return to helping Apache produce binary releases of Apache Hadoop that are so bullet proof that Yahoo and other production Hadoop users can run them unpatched on their clusters.

What does this mean? Off the top of my head:

  • every Apache Hadoop user will benefit directly from Yahoo!’s extensive Hadoop expertise, testing, and improvements (nb: not only is Yahoo! the creator of Hadoop but it is running the largest Hadoop clusters out there)
  • probably Cloudera will have to refocus on creating even better Hadoop tools packages

Hats off, again, to Yahoo-ers!

Original title and link: Yahoo is Focusing on Apache Hadoop discontinues “The Yahoo Distribution of Hadoop” (NoSQL databases © myNoSQL)


Y! News: An inside look at rebuilding the largest news site on the web

Handling Yahoo! News data:

All this data is then pushed to a massive NoSQL data grid. A core goal when designing our data grid was the ability to easily attach new information to any existing piece of content. That means any team within Yahoo! can analyze and enhance the content. Yahoo! scientists are currently using technologies such as PIG and Hadoop to do things like find related clusters of news stories to show our users.

That, plus a JSON format for handling different content types and a consolidated storage strategy for “quick iterations and continuous innovation”. No mention, though, of what the storage engine is. PNUTS?

Original title and link: Y! News: An inside look at rebuilding the largest news site on the web (NoSQL databases © myNoSQL)


Videos from Hadoop World

There was one NoSQL conference that I missed, and I was really pissed off about it: Hadoop World. Even though I followed and curated the Twitter feed, resulting in Hadoop World in tweets, the feeling of not being there made me really sad. But now, thanks to Cloudera, I’ll be able to watch most of the presentations. Many of them have already been published and the complete list can be found ☞ here.

Based on the Twitter activity on that day, I’ve selected below the ones that seemed to have generated the most buzz. The list contains names like Facebook, Twitter, eBay, Yahoo!, StumbleUpon, comScore, Mozilla, AOL. And there are quite a few more …

Hadoop Best Practices and Anti-Patterns

An extensive post about Hadoop best practices and anti-patterns from Yahoo!:

This blog post represents a compendium of best practices for applications running on Apache Hadoop. In fact, we introduce the notion of a Grid Pattern which, similar to a Design Pattern, represents a general reusable solution for applications running on the Grid.

This blog post enumerates characteristics of well behaved applications and provides guidance on appropriate uses of various features and capabilities of the Hadoop framework.

Original title and link: Hadoop Best Practices and Anti-Patterns (NoSQL databases © myNoSQL)


Real-Time MapReduce

Yahoo! Labs Advertising Sciences has built a general-purpose, real-time, distributed, fault-tolerant, scalable, event driven, expandable platform called S4 which allows programmers to easily implement applications for processing continuous unbounded streams of data.

I cannot say it enough: Yahoo! is kicking ass again. After hearing about Google Caffeine, many were quick to announce the death of MapReduce/Hadoop. But Yahoo! was busy at work open sourcing real-time MapReduce.

Original title and link: Real-Time MapReduce (NoSQL databases © myNoSQL)


New HBase YCSB changes - improves speed drastically

Ryan Rawson:

There is a new commit to YCSB […] This fixes performance problems in the HBase DB adapter. In my own tests I found that my short scans, which were configured to read 100-column rows, 1-300 in zipfian, went from 60ms to 35ms.

Also there is column selection pushdown enabled, which will improve the speed of any tests that are doing single column gets on a wide row (eg: readallfields=false, fieldcount=X). This is all due to changing how YCSB uses the Result object. Check out the commit for some hints. I have a longer email and patch about this stuff coming really soon.

☞ mail thread

YCSB is probably the most complete and correct NoSQL benchmark. And that’s basically a 40% speed improvement.
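
For the record, here is the arithmetic behind the two scan latencies quoted above (60ms and 35ms). The throughput figure rests on my own assumption of a single client issuing one scan at a time, which the mail does not state:

```python
# Scan latencies quoted in Ryan Rawson's mail, in milliseconds.
before_ms = 60.0
after_ms = 35.0

# Fraction of the original latency that was shaved off.
latency_reduction = (before_ms - after_ms) / before_ms

# Assumption: one request in flight at a time, so throughput is 1 / latency.
throughput_gain = before_ms / after_ms - 1

print(f"latency reduction: {latency_reduction:.1%}")
print(f"throughput gain:   {throughput_gain:.1%}")
```

So "40%" describes the drop in latency; under that single-client assumption, the same change lets you run roughly 70% more scans in the same time.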

Original title and link: New HBase YCSB changes - improves speed drastically (NoSQL databases © myNoSQL)

MapReduce and Hadoop Future

In light of the ☞ Google Caffeine announcement (a summary of a summary would be that Google replaced MapReduce-based index updates with a new engine that provides more timely updates), ☞ Tony Bain is wondering whether Michael Stonebraker and DeWitt’s paper ☞ MapReduce: a major step backwards has thus been proved correct:

Firstly, was Stonebraker and Dewitt right? It is red faced time for those who came out and aggressively defended the Map/Reduce architecture?

And secondly what impact does this have on the future of Map/Reduce now those responsible for its popularity seem to have migrated their key use case? Is the proposition for Map/Reduce today still just as good now the Google don’t do it? (Yes I am sure Google still use Map/Reduce extensively and this is a bit tongue in cheek. But the primary quoted example relates to building the search index which is what, reportedly, has been moved away from MR).

While all these questions seem to be appropriate, I think some details could help with finding the correct answers.

Firstly, I think Google’s decision to “drop” MapReduce-based index updates was determined by their particular implementation and their storage strategy. Simply put, Google’s MapReduce-based index updates required reprocessing of data, so providing timely updates was more or less impossible. But as proved by CouchDB’s mapreduce implementation, this approach is not the only one possible. CouchDB views are built by running a pair of map and reduce functions and storing the results in B-trees. As for updates, CouchDB doesn’t need to reprocess all the initial data and rebuild the index from scratch, but only to apply the changes from the updates. In this regard, Stonebraker seems to have been right when saying that it is “a sub-optimal implementation, in that it uses brute force instead of indexing”.
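
The difference between the two approaches can be shown with a toy example. This is only a sketch under my own assumptions: it uses plain dictionaries rather than CouchDB's B-trees, the reduce step is a simple sum, and the names `map_doc`, `full_rebuild`, and `IncrementalView` are hypothetical:

```python
from collections import defaultdict


def map_doc(doc):
    """Map step: emit (key, value) pairs for one document."""
    yield doc["tag"], 1


def full_rebuild(docs):
    """Batch approach: reprocess every document on each refresh."""
    view = defaultdict(int)
    for doc in docs.values():
        for key, value in map_doc(doc):
            view[key] += value
    return dict(view)


class IncrementalView:
    """Incremental approach: only changed documents are mapped again."""

    def __init__(self):
        self.emitted = {}              # doc id -> pairs emitted last time
        self.totals = defaultdict(int)
        self.docs_mapped = 0           # counts map invocations

    def apply_change(self, doc_id, doc):
        # Retract the contribution of the previous revision, if any.
        for key, value in self.emitted.pop(doc_id, []):
            self.totals[key] -= value
            if self.totals[key] == 0:
                del self.totals[key]
        if doc is not None:            # None means the document was deleted
            pairs = list(map_doc(doc))
            self.docs_mapped += 1
            for key, value in pairs:
                self.totals[key] += value
            self.emitted[doc_id] = pairs

    def result(self):
        return dict(self.totals)


docs = {1: {"tag": "hadoop"}, 2: {"tag": "hadoop"}, 3: {"tag": "couchdb"}}
view = IncrementalView()
for doc_id, doc in docs.items():
    view.apply_change(doc_id, doc)

# One document changes: only that document is mapped again.
docs[2] = {"tag": "couchdb"}
view.apply_change(2, docs[2])

print(view.result(), view.docs_mapped)
```

Both paths give the same counts, but the incremental view maps only the one changed document, while `full_rebuild` would go through all three again.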

While Hadoop, the best-known mapreduce implementation, closely follows Google’s design, that doesn’t mean that no work is being done to improve its behavior for special scenarios like real-time stream processing, cascading, etc.

As for the questions about the impact of Google’s announcement on MapReduce adoption, I’d say that, taking a look at the reports from the Hadoop Summit, we would all agree that for quite some time the biggest proponents of MapReduce (in its Hadoop incarnation) have been Yahoo!, Facebook, Twitter, and other such companies. And, as I’ve said before, it sounds like Hadoop is actually processing more data than Google’s MapReduce.

Last, but not least, as with any NoSQL technology, none of this means that MapReduce or Hadoop will fit all scenarios.

Original title and link: MapReduce Future (NoSQL databases © myNoSQL)