Yahoo: All content tagged as Yahoo in NoSQL databases and polyglot persistence
A recent slide deck presents the results of a new benchmark run against the latest versions of Cassandra (0.6.10), HBase (0.20.6), MongoDB (1.6.5), and Riak (0.14.0).
Some of the results are striking, so I cannot help but wonder whether there were some configuration issues.
Update: A few users who had more luck reading the details on the slides have pointed out that this is not the YCSB benchmark, but rather a new one developed by the presenter. Another important detail is that the data set used was rather small and could easily fit in memory.
Original title and link: YCBS Benchmark Results for Cassandra, HBase, MongoDB, Riak (NoSQL databases © myNoSQL)
There was one NoSQL conference that I missed and was really pissed off about: Hadoop World. Even though I followed and curated the Twitter feed, resulting in Hadoop World in tweets, the feeling of not being there made me really sad. But now, thanks to Cloudera, I’ll be able to watch most of the presentations. Many of them have already been published and the complete list can be found ☞ here.
Based on the Twitter activity on that day, I’ve selected below the ones that seemed to have generated the most buzz. The list contains names like Facebook, Twitter, eBay, Yahoo!, StumbleUpon, comScore, Mozilla, AOL. And there are quite a few more …
There is a new commit to YCSB […] This fixes performance problems in the HBase DB adapter. In my own tests I found that my short scans, which were configured to read 100-column rows, 1-300 in zipfian, went from 60ms to 35ms.
Also there is column selection pushdown enabled, which will improve the speed of any tests that are doing single column gets on a wide row (eg: readallfields=false, fieldcount=X). This is all due to changing how YCSB uses the Result object. Check out the commit for some hints. I have a longer email and patch about this stuff coming really soon.
YCSB is probably the most complete and correct NoSQL benchmark, and going from 60ms to 35ms on short scans is roughly a 40% reduction in latency.
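The arithmetic behind that figure can be checked in a few lines. This is only a sketch based on the two latencies quoted above; the throughput number additionally assumes a single client issuing scans back to back, which the quoted message does not state.

```python
# Latencies quoted in the mailing-list message (short scans, before/after the commit).
before_ms, after_ms = 60.0, 35.0

# Fraction of per-scan time saved.
latency_reduction = (before_ms - after_ms) / before_ms

# Assumption: one client, scans issued serially, so throughput scales as 1/latency.
throughput_gain = before_ms / after_ms - 1

print(f"latency reduction: {latency_reduction:.1%}")  # 41.7%
print(f"throughput gain:   {throughput_gain:.1%}")    # 71.4%
```

So "40%" describes the drop in latency; expressed as scans per second under the serial-client assumption, the same change would be closer to a 70% gain.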
Original title and link: New HBase YCSB changes - improves speed drastically (NoSQL databases © myNoSQL)
In light of the ☞ Google Caffeine announcement (a summary of a summary would be that Google replaced MapReduce-based index updates with a new engine that provides more timely updates), ☞ Tony Bain is wondering whether Michael Stonebraker and David DeWitt’s paper ☞ MapReduce: a major step backwards hasn’t thus been proved correct:
Firstly, was Stonebraker and Dewitt right? It is red faced time for those who came out and aggressively defended the Map/Reduce architecture?
And secondly what impact does this have on the future of Map/Reduce now those responsible for its popularity seem to have migrated their key use case? Is the proposition for Map/Reduce today still just as good now the Google don’t do it? (Yes I am sure Google still use Map/Reduce extensively and this is a bit tongue in cheek. But the primary quoted example relates to building the search index which is what, reportedly, has been moved away from MR).
While all these questions seem to be appropriate, I think some details could help with finding the correct answers.
Firstly, I think Google’s decision to “drop” MapReduce-based index updates was determined by their particular implementation and their storage strategy. Simply put, Google’s MapReduce-based index updates required reprocessing the data, so providing timely updates was more or less impossible. But as CouchDB’s MapReduce implementation shows, this approach is not the only one possible. CouchDB views are built by running a pair of map and reduce functions and storing the results in B-trees. As for updates, CouchDB doesn’t need to reprocess all the initial data and rebuild the index from scratch; it only applies the changes from the updates. In this regard, Stonebraker and DeWitt seem to have been right when saying that it is “a sub-optimal implementation, in that it uses brute force instead of indexing”.
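To make the difference concrete, here is a minimal sketch of the two strategies. It is not CouchDB’s implementation (which keeps view rows and intermediate reduce values in a B-tree); it uses plain dictionaries, and `map_doc`, `full_rebuild`, and `IncrementalView` are hypothetical names invented for this illustration. The point is only that an incremental view re-runs the map function for the changed document, while the brute-force approach re-runs it for every document.

```python
from collections import defaultdict

def map_doc(doc):
    """Hypothetical map function: emit (word, 1) for each word in the body."""
    for word in doc["body"].split():
        yield word, 1

def full_rebuild(docs):
    """Brute force: re-run map over every document and sum from scratch."""
    index = defaultdict(int)
    for doc in docs.values():
        for key, value in map_doc(doc):
            index[key] += value
    return dict(index)

class IncrementalView:
    """Keeps each document's emitted rows so a change touches only that document."""

    def __init__(self):
        self.emitted = {}              # doc_id -> rows emitted by the last map run
        self.index = defaultdict(int)  # key -> reduced (summed) value
        self.map_calls = 0             # how many times map_doc was run

    def update(self, doc_id, doc):
        """Apply one changed document; pass doc=None for a deletion."""
        # Retract whatever the previous revision contributed.
        for key, value in self.emitted.pop(doc_id, []):
            self.index[key] -= value
            if self.index[key] == 0:
                del self.index[key]
        # Add the contribution of the new revision, if there is one.
        if doc is not None:
            rows = list(map_doc(doc))
            self.map_calls += 1
            self.emitted[doc_id] = rows
            for key, value in rows:
                self.index[key] += value

docs = {"a": {"body": "hadoop hbase"}, "b": {"body": "hbase riak"}}
view = IncrementalView()
for doc_id, doc in docs.items():
    view.update(doc_id, doc)

# One document changes: only that document is mapped again.
docs["b"] = {"body": "hbase cassandra"}
view.update("b", docs["b"])

assert dict(view.index) == full_rebuild(docs)
```

Both paths end with the same index, but the incremental view ran the map function once for the change instead of once per document. Subtracting old contributions only works because the reduce here is a sum; a B-tree that stores partial reduce values avoids that restriction.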
While Hadoop, the best-known MapReduce implementation, closely follows Google’s design, that doesn’t mean there isn’t work being done to improve its behavior for special scenarios like real-time stream processing, cascading, etc.
As for the questions about the impact of Google’s announcement on MapReduce adoption, I’d say that, looking at the reports from the Hadoop Summit, we would all agree that for quite some time the biggest proponents of MapReduce (in its Hadoop incarnation) have been Yahoo!, Facebook, Twitter, and other such companies. And, as I’ve said before, it sounds like Hadoop is actually processing more data than Google’s MapReduce.
Last, but not least, as with any NoSQL technology, none of this means that MapReduce or Hadoop will fit all scenarios.