mapreduce: All content tagged as mapreduce in NoSQL databases and polyglot persistence

Apache Hadoop 2.4.0 released with operational improvements

Hadoop 2.4.0 continues that momentum, with additional enhancements to both HDFS & YARN:

  • Support for Access Control Lists in HDFS
  • Native support for Rolling Upgrades in HDFS
  • Smooth operational upgrades with protocol buffers for HDFS FSImage
  • Full HTTPS support for HDFS
  • Support for Automatic Failover of the YARN ResourceManager (a.k.a. Phase 1 of YARN ResourceManager High Availability)
  • Enhanced support for new applications on YARN with Application History Server and Application Timeline Server
  • Support for strong SLAs in YARN CapacityScheduler via Preemption

Original title and link: Apache Hadoop 2.4.0 released with operational improvements (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/apache-hadoop-2-4-0-released/


Hydra takes on Hadoop

A good interview on InfoQ comparing Hadoop with AddThis’s open source Hydra:

What use case(s) is Hydra better suited for compared to Hadoop? When would Hadoop be a better choice?

Hydra is better at data exploration. You can follow a number of interesting leads from the results of a single, probably rather fast, map job. Queries on the resultant tree usually take on the order of seconds (or milliseconds).

Non-programmers can produce functioning products with a small amount of guidance. The web UI provides most everything that might be needed; it might be as simple as pressing clone on an existing job, changing the tree to use a couple different features and hitting go. In minutes they have a new URL endpoint to show your impressive new KPI on your company home page.

Hadoop has a few advantages though. It has stronger native support for very large, one-off joins. Technically speaking this just means more implicit sorting of files. Sorting huge numbers of things is expensive so we try pretty hard to avoid it, and as a result first order support for it is a little lacking. On the other hand, you might find that you don’t really need the full, perfect join and are instead content with a Bloom-filter-based probabilistic hybrid — in which case Hydra will once again save you some sweet cycles.
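To make the Bloom-filter remark concrete, here is a minimal sketch of a Bloom-filter-based join. It is not Hydra's or Hadoop's implementation; the `BloomFilter` class, the `bloom_semi_join` helper, and the sample data are all hypothetical names made up for illustration. The idea: build a small filter from the keys of the smaller data set, use it to cheaply discard most non-matching rows of the larger one without sorting anything, and accept that a few false positives may slip through.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_semi_join(small_keys, large_rows, key_of):
    """Keep only rows of large_rows whose key *might* be in small_keys."""
    bloom = BloomFilter()
    for key in small_keys:
        bloom.add(key)
    return [row for row in large_rows if bloom.might_contain(key_of(row))]


# Hypothetical sample data: a small "users" side and a larger "events" side.
users = {"u1": "alice", "u2": "bob"}
events = [("u1", "click"), ("u3", "view"), ("u2", "share"), ("u9", "click")]

# Probabilistic step: every true match survives, a false positive may too.
candidates = bloom_semi_join(users.keys(), events, key_of=lambda e: e[0])

# Optional exact step, for when the "full, perfect join" is really needed.
joined = [(users[uid], action) for uid, action in candidates if uid in users]
```

If the probabilistic answer is good enough, you stop at `candidates`; the exact lookup at the end is only there to show how the full join would be recovered.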

Original title and link: Hydra takes on Hadoop (NoSQL database©myNoSQL)

via: http://www.infoq.com/news/2014/04/hydra


Intel kills a Hadoop and feeds another

I seriously doubt you could have missed the 2nd part of this, but here’s the shortest executive summary:

  1. Intel has killed its own distribution of Hadoop — is there anyone that would disagree this is a good idea?
  2. Intel has invested $740mil in Cloudera (for 18%). That’s not a typo: $740 million.

The main questions:

  1. where will Cloudera put the $900mil raised in the last round(s)?
  2. why did Intel invest so much?

These questions were also asked by Dan Primack for CNN Money, and after looking at different angles he comes up empty.

So let’s check other sources:

  1. TechCrunch initially speculated that much of the investment went to existing shareholders.

    The post was later updated with a comment from Cloudera’s VP of marketing stating that the majority of the money went to the company. But no word on how it will be used.

  2. Reuters writes that Intel made the investment to ensure their leading position in server processors:

    Intel hopes that encouraging more companies to leap into Big Data analysis will lead to higher sales of its high-end Xeon server processors. The chipmaker believes that hitching its wagon to Cloudera’s version of Hadoop, instead of pushing its own version, will make that happen faster.

    Still no word on how Cloudera will be using the money.

  3. Derrick Harris for GigaOm writes that the deal makes a lot of sense for both companies[1]:

    Cloudera needs capital and Intel’s huge sales force to keep up its engineering efforts and grow the company internationally.

    As part of the deal, Cloudera will be an early adopter of Intel gear and will optimize its Hadoop software to run on Intel’s latest technologies. Intel will port some of its work into the Cloudera distribution and will maintain its own Hadoop engineering team that will work alongside Cloudera’s engineers to help unite the two companies’ goals.

  4. Jeff Kelly for SiliconAngle emphasizes the same channel advantages:

    Cloudera’s biggest reseller partner is Oracle. Based on my reading of the Intel announcement, the deal is not an official reseller partnership, but Intel will “market and promote CDH and Cloudera Enterprise to its customers as its preferred Hadoop platform.” Not quite as nice as having the Intel salesforce closing deals for it, but Cloudera stands to gain significant new business from the arrangement.


So how about this short list on how this round will be used by Cloudera:

  1. a part goes for international expansion
  2. a larger part goes to early shareholders
  3. the largest part goes into acquisitions

As for Intel, what if this investment also sealed an exclusive deal for a Hadoop-centric, Cloudera-supported, Intel-powered appliance?


  1. Insert snarky comment here about a $740m deal that would not make sense to one of the parties. How about not making sense to any of them? 

Original title and link: Intel kills a Hadoop and feeds another (NoSQL database©myNoSQL)


Three opinions about the future of Hadoop and Data Warehouse

Building on the same data coming from Gartner and a talk from Hadoop Summit (exactly the same), Matt Asay[1] and Timo Elliott[2] place Hadoop on the data warehouse map.

Matt Asay writes in the ReadWrite article that Hadoop is not replacing existing data warehouses, but it’s taking all new projects:

Hadoop (and its kissing cousin, the NoSQL database) isn’t replacing legacy technology so much as it’s usurping its place in modern workloads. This means enterprises will end up supporting both legacy technology and Hadoop/NoSQL to manage both existing and new workloads […]

Of course, given “the effective price of core Hadoop distribution software and support services is nearly zero” at this point, as Jeff Kelly highlights, more and more workloads will gravitate to Hadoop. So while data warehouse vendors aren’t dead—they’re not even gasping for breath—they risk being left behind for modern data workloads if they don’t quickly embrace Hadoop and other 21st Century data infrastructure.

On his blog, Timo Elliott makes sure that there’s some SAP in that future picture and uses their Hadoop partner, Hortonworks, to depict it:

No. Ignoring the many advantages of Hadoop would be dumb. But it would be just as dumb to ignore the other revolutionary technology breakthroughs in the DW space. In particular, new in-memory processing opportunities have created a brand-new category that Gartner calls “hybrid transactional/analytic platforms” (HTAP).

[Image: Hadoop modern architecture diagram]

The future I’d like to see is the one where:

  1. there is an integrated data platform. Note that in this ideal world, integrated does not mean any form of ETL
  2. it supports and runs in isolation different workloads from online transactions and bulk upload to various forms of analytics
  3. data is stored on dedicated mediums (spinning disks, flash, memory) depending on the workloads that touch it
  4. data would move between these storage mediums automatically, but the platform would allow fine tuning for maintaining the SLAs of the different components

  1. Matt Asay is VP of business development and corporate strategy at MongoDB 

  2. Timo Elliott is an Innovation Evangelist for SAP 

Original title and link: Three opinions about the future of Hadoop and Data Warehouse (NoSQL database©myNoSQL)


Thoughts on The Future of Hadoop in Enterprise Environments

In case you are looking for some sort of reassurance that big companies are into Hadoop, check the perspective of SAP’s Innovation Evangelist, Timo Elliott, on the Hadoop market. It should be no surprise what he sees as the main trend:

Companies want to take advantage of the cost advantages of Hadoop systems, but they realize that Hadoop doesn’t yet do everything they need (for example, Gartner surveys show a steady decline in the proportion of CIOs that believe that NoSQL will replace existing data warehousing rather than augmenting it – now just 3%). And companies see the performance advantages of in-memory processing, but aren’t sure how it can make a difference to their business.

Original title and link: Thoughts on The Future of Hadoop in Enterprise Environments (NoSQL database©myNoSQL)

via: http://timoelliott.com/blog/2014/03/thoughts-on-the-future-of-hadoop-in-enterprise-environments.html


Continuent Replication to Hadoop – Now in Stereo!

Hopefully by now you have already seen that we are working on Hadoop replication. I’m happy to say that it is going really well. I’ve managed to push a few terabytes of data and different data sets through into Hadoop on Cloudera, HortonWorks, and Amazon’s Elastic MapReduce (EMR). For those who have been following my long association with the IBM InfoSphere BigInsights Hadoop product, I’m pleased to say that it’s working there too.

Continuent is the company behind the Tungsten connector and replicator products which, in their words:

Continuent Tungsten allows enterprises running business-critical MySQL applications to provide high-availability (HA) and globally redundant disaster recovery (DR) capabilities for cloud-based and private data center installations. Tungsten Replicator provides high performance open source data replication for MySQL and Oracle and is a key part of Continuent Tungsten.

Original title and link: Continuent Replication to Hadoop – Now in Stereo! (NoSQL database©myNoSQL)

via: http://mcslp.wordpress.com/2014/03/31/continuent-replication-to-hadoop-now-in-stereo/


A practical comparison of Map-Reduce in MongoDB and RavenDB

Ben Foster looks at MongoDB’s Map-Reduce and aggregation framework and then compares them with RavenDB’s Map-Reduce:

I thought it would be interesting to do a practical comparison of Map-Reduce in both MongoDB and RavenDB.

There are more differences than similarities — I’m not referring to the API differences, but to fundamental differences to the ways they operate.

✚ RavenDB’s author has a follow up post in which he underlines another major difference: RavenDB’s Map-Reduce operates as an index, while MongoDB’s Map-Reduce is an online operation.
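The index-versus-online distinction can be sketched in a few lines. This is a toy model only, not the actual API or internals of RavenDB or MongoDB; `map_doc`, `online_map_reduce`, `MapReduceIndex`, and the sample documents are hypothetical names chosen for illustration. The online variant scans every document each time it is asked; the index variant folds each write into a stored result, so queries are just lookups.

```python
from collections import Counter


def map_doc(doc):
    """Map step shared by both variants: emit (category, amount)."""
    yield doc["category"], doc["amount"]


def online_map_reduce(docs):
    """Online style: scan every document at query time, then reduce."""
    totals = Counter()
    for doc in docs:
        for key, value in map_doc(doc):
            totals[key] += value
    return dict(totals)


class MapReduceIndex:
    """Index style: keep reduced results up to date as documents arrive."""

    def __init__(self):
        self.totals = Counter()

    def on_write(self, doc):
        # Fold the new document into the stored result; no rescan needed.
        for key, value in map_doc(doc):
            self.totals[key] += value

    def query(self, key):
        # Answering a query is a lookup against precomputed totals.
        return self.totals[key]


# Hypothetical sample documents.
docs = [
    {"category": "books", "amount": 10},
    {"category": "music", "amount": 5},
    {"category": "books", "amount": 20},
]

# Online: the work happens when the question is asked.
online_result = online_map_reduce(docs)

# Index: the work happens at write time.
index = MapReduceIndex()
for doc in docs:
    index.on_write(doc)
indexed_books = index.query("books")
```

Both variants give the same totals here; what differs is when the cost is paid. A real index also has to handle updates and deletes, and may lag behind writes, none of which this sketch covers.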

Original title and link: A practical comparison of Map-Reduce in MongoDB and RavenDB (NoSQL database©myNoSQL)

via: http://benfoster.io/blog/map-reduce-in-mongodb-and-ravendb


SSDs and MapReduce performance

Conclusions from comparing SSDs and HDDs across different cluster scenarios, looking at cost relative to both performance and storage capacity:

  • For a new cluster, SSDs deliver up to 70 percent higher MapReduce performance compared to HDDs of equal aggregate IO bandwidth.
  • For an existing HDD cluster, adding SSDs leads to more gains if configured properly.
  • On average, SSDs show 2.5x higher cost-per-performance, a gap far narrower than the 50x difference in cost-per-capacity.

The post offers many details of the tests run and also various results. But the 3 bullets above should be enough to drive your decision.

Original title and link: SSDs and MapReduce performance (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2014/03/the-truth-about-mapreduce-performance-on-ssds/


Cloudera Search Interface: Inside Cloudera's customer support Enterprise Data Hub

Great use of their own technologies to better serve the customer:

This application goes way beyond simple indexing and searching. We are using Cloudera Search, HBase, and MapReduce to process, store, and visualize stack traces that wouldn’t be possible with just a search index. How Monocle Stack Trace integrates with the larger CSI application goes way beyond that, though. It’s a great feeling when you are able to execute a search in Monocle Stack Trace that links directly to a point in time in a customer log file that an Impala query returned after churning through tens of GBs of data — done interactively from a Web UI on the order of a second or two.

I can easily see this becoming a real product used by software companies that offer direct customer support.

Original title and link: Cloudera Search Interface: Inside Cloudera’s customer support Enterprise Data Hub (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2014/02/secrets-of-cloudera-support-inside-our-own-enterprise-data-hub/


The Forrester Wave for Hadoop market

Update: I’d like to thank the people that pointed out in the comment thread that I’ve messed up quite a few aspects in my comments about the report. I don’t believe in taking down posts that have been out for a while, so please be warned that basically this article can be ignored.

Thank you and my apologies for those comments that were a misinterpretation of the report.


This is the Q1 2014 Forrester Wave for Hadoop:

[Image: Forrester Wave for Hadoop, Q1 2014]

A couple of thoughts:

  1. Cloudera, Hortonworks, MapR are positioned very (very) close.

    1. Hortonworks is positioned closer to the top right, meaning they report more customers/larger install base
    2. MapR is higher on the vertical axis meaning that MapR’s strategy is slightly better.

      For me, MapR’s strategy can be briefly summarized as:

      1. address some of the limitations in the Hadoop ecosystem
      2. provide API-compatible products for major components of the Hadoop ecosystem
      3. use the (trademarked) Apache product names to advertise their products

      I think the 1st point above explains the better positioning of MapR’s current offering.

    3. Even though Cloudera was the first pure-play Hadoop distribution, it’s positioned behind both Hortonworks and MapR.

  2. IBM has the largest market presence. That’s a big surprise, as I very rarely hear clear messages from IBM.

  3. IBM and Pivotal Software are considered to have the strongest strategy. That’s another interesting point in Forrester’s report. Apart from the fact that IBM has a ton of data products and that Pivotal Software is offering more than Hadoop, I don’t know what exactly explains this position.

    The Strategy positioning in the Forrester report is based on scoring the following categories: Licensing and pricing, Ability to execute, Product road map, Customer support. IBM and Pivotal are ranked first in all these categories (with maximum marks for the last 3). As a comparison, Hortonworks has 3/5 for Ability to execute (this must be related only to budget); Cloudera has 3/5 for both Ability to execute and Customer support.

    Pivotal is 3rd from last in terms of current offering. I guess my hypothesis for ranking Pivotal as 1st in terms of strategy is wrong.

  4. Microsoft, which through its collaboration with Hortonworks came up with HDInsight (basically enabling Hadoop for Excel and its data warehouse offering), is positioned 2nd from last on all 3 axes.

    No one seems to love Microsoft anymore.

  5. While not a pure Hadoop player, DataStax has been offering the DataStax Enterprise platform that includes support for analytics through Hadoop and search through Solr for at least 2 years. That’s actually way before anyone else from the group of companies in Forrester’s report had anything similar[1].

    This report focuses only on “general-purpose Hadoop solutions based on a differentiated, commercial Hadoop distribution”.

You can download the report after registering on Hortonworks’ site.


  1. DataStax is my employer. But what I wrote is a pure fact. 

Original title and link: The Forrester Wave for Hadoop market (NoSQL database©myNoSQL)


Hortonworks raises $100M to grow engineering and company's ecosystem globally

Derrick Harris for GigaOm has the scoop:

Hadoop vendor Hortonworks has raised $100 million in a new round of venture capital led by BlackRock and Passport Capital. The company’s existing investors — Dragoneer, Tenaya Capital, Benchmark, Index Ventures and Yahoo — also participated in the latest round. Hortonworks CEO Rob Bearden said in an interview that the new funding will help Hortonworks scale its engineering efforts, grow the company’s ecosystem and scale its global operations.

Last week’s round E for Cloudera turned out to be $160M instead of the $200M rumored by Bloomberg.

These big rounds raised by the Hadoop pure-players are a confirmation of the Hadoop market. But I also think they can be explained by the tough competition Cloudera and Hortonworks are facing from large corporations like IBM, Teradata, Oracle, Microsoft. At least in terms of budget.

✚ While some of the above mentioned companies are partnering with at least one pure-play Hadooper — Cloudera, Hortonworks, MapR — that doesn’t mean they are not keeping an eye on the prize.

Original title and link: Hortonworks raises $100M to grow engineering and company’s ecosystem globally (NoSQL database©myNoSQL)

via: http://gigaom.com/2014/03/24/hortonworks-raises-100m-to-scale-its-hadoop-business/


The NoSQL Family Tree

[Image: The NoSQL Family Tree]

Even if it includes just a handful of NoSQL databases, it’s still a nice visualization.

Original title and link: The NoSQL Family Tree (NoSQL database©myNoSQL)

via: https://cloudant.com/blog/the-nosql-family-tree/