MapReduce: All content tagged as MapReduce in NoSQL databases and polyglot persistence
Tuesday, 26 February 2013
Cloudera Pissed Off
Charles Zedlewski lays out Cloudera’s position on the recent attacks on Hadoop and Impala:
I’m reminded of our open source strategy this week not only because of the further validation of Hadoop’s popularity but also because of the entry of a new round of proprietary imitators. At one point there were six distinct vendors all promoting proprietary filesystems as alternatives to HDFS, many of which included breathless claims of how they could make Apache Hadoop faster and “more powerful.” This year we get to see history repeat itself, this time with SQL engines. The marketing is nearly identical to that of the proprietary filesystem era: damning open source with faint praise, pointing out its limitations and extolling the virtues of some feature(s) proprietary to that particular vendor.
Proprietary SQL vendors will pull a page from the proprietary storage playbook: damn open source Impala with faint praise and point out its limitations, both real and contrived. They will be equally ineffective. We will continue to bet on an open, integrated, and highly flexible big data platform. Saying you are “all in on Hadoop” while simultaneously promoting a proprietary platform means you are missing the point.
Neither Cloudera nor the other companies that have invested heavily, if not everything, in the Hadoop ecosystem are big enough to ignore large corporations attacking their bets. Every corporation is trying to emulate the Microsoft strategy: wait for a new technology to be validated, then jump at the opportunity in full force. But I really hope open source will prevail.
Original title and link: Cloudera Pissed Off (©myNoSQL)
via: http://blog.cloudera.com/blog/2013/02/open-source-flattery-and-the-platform-for-big-data/
What Makes Amazon Redshift Faster Than Hive?
I’m not implying that this question appeared on Quora after my link and comments about Redshift’s performance and costs at Airbnb, but Reynold Xin’s answer covers, in a more formal way, the reasons I suggested in that post for Redshift being faster than Hive:
Some of the advantages you gain from massive scale and flexibility make it challenging to build a more performant query engine. The following outlines how various features (or lack of features) influences performance:
- data format
- task launch overhead (nb: this can be optimized in Hive/Hadoop)
- intermediate data materialization vs pipelining
- columnar data format
- columnar query engine
- faster S3 connection
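To make two of the items above concrete, here is a minimal sketch in Python. It is not how Redshift or Hive are implemented; the sample data and function names are made up for illustration. It only contrasts a row layout with a columnar layout, and a pipelined filter-then-aggregate with one that materializes the intermediate result first.

```python
# Row-oriented layout: each record keeps all of its fields together.
# (Hypothetical sample data, for illustration only.)
rows = [
    {"user_id": 1, "country": "US", "amount": 10.0},
    {"user_id": 2, "country": "DE", "amount": 4.5},
    {"user_id": 3, "country": "US", "amount": 7.25},
]

# Column-oriented layout: one contiguous list per field.
columns = {
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "amount": [10.0, 4.5, 7.25],
}


def total_amount_rows(rows):
    # Every full record is visited even though only one field is needed.
    return sum(row["amount"] for row in rows)


def total_amount_columns(columns):
    # Only the "amount" column is read; the other fields are never touched.
    return sum(columns["amount"])


def materialized_us_total(rows):
    # Materialization: the filtered rows are fully built before aggregating
    # (a MapReduce stage would typically write them out between jobs).
    us_rows = [row for row in rows if row["country"] == "US"]
    return sum(row["amount"] for row in us_rows)


def pipelined_us_total(rows):
    # Pipelining: the filter feeds the aggregate one row at a time,
    # so no intermediate result set is ever built.
    us_rows = (row for row in rows if row["country"] == "US")
    return sum(row["amount"] for row in us_rows)
```

All four functions agree on the answers; what differs is how much data each one has to touch or hold in memory along the way.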
via: http://www.quora.com/Hive-computing/What-makes-Amazon-Redshift-faster-than-Hive
Big Data at Torbit: Custom MapReduce-like System
Tylor Arndt about Torbit’s “build-your-own-MapReduce”:
The final system begins with a web-service against which client systems interface. To ensure resiliency, an instance of the web-service runs on each cluster host. When a client request arrives the web-service creates a MapReduce job to fulfill client requests. The reducer function component of the MapReduce job runs within the web-service handling the request.
The requirements listed in the post are too high-level to understand why building their own solution was better. But if it works for them, that’s OK. Just keep in mind that NIH and distributed systems don’t always mix well.
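For readers who want a feel for the shape of the design in the quote, here is a rough sketch in Python. It is not Torbit’s code: the partitions, the page-hit counting, and the function names are all assumptions, and threads stand in for cluster hosts. The only point it illustrates is that the map tasks fan out while the reduce step runs inside the request handler itself.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the records on one host.
PARTITIONS = [
    ["/home", "/pricing", "/home"],
    ["/home", "/docs"],
    ["/docs", "/docs", "/pricing"],
]


def map_partition(records):
    """Map step: count page hits within a single partition."""
    return Counter(records)


def handle_request(partitions):
    """Stand-in for the web-service handler serving one client request."""
    # Fan the map tasks out (threads here play the role of cluster hosts).
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_partition, partitions))

    # Reduce step: merge the partial results inside the handler itself.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return dict(total)


result = handle_request(PARTITIONS)
```

Running the reducer in the handler keeps the request path short, but it also means the handler must be able to hold the merged result, which is one of those details a general-purpose framework would otherwise worry about for you.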
Monday, 25 February 2013
Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem
Avik Dey (Intel) sent the announcement of the new open source project from Intel to the Hadoop mailing list:
As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.
Project Rhino targets security at all levels: from encryption and key management to cell-level ACLs and audit logging.
Redshift Performance & Cost at Airbnb
As shown above the performance gain is pretty significant, and the cost saving is even more impressive: $13.60/hour versus $57/hour. This is hard to compare due to the different pricing models, but check out pricing here for more info. In fact, our analysts like Redshift so much that they don’t want to go back to Hive and other tools even though a few key features are lacking in Redshift. Also, we have noticed that big joins of billions of rows tend to run for a very long time, so for that we’d go back to hadoop for help.
If I’m not mistaken, this is the second story in the last week about the performance of Redshift. But here’s something I don’t understand (or I don’t see mentioned in this post):
- you use Hadoop to store your data. The reason is that 12 months ago, 6 months ago, and still today, there has been no more cost-effective and productive solution.
- in this time you learn about the data. You develop models and queries
- your analysts prefer SQL because that’s what makes them more productive
- you take the data, the knowledge you’ve built in this time, you craft it to fit into a columnar analytic database
- then you write that the columnar analytic-oriented database is more performant than using Hive over Hadoop
To me this feels like saying that you are more efficient in your mother tongue than in a foreign language. Or am I missing something?
Friday, 22 February 2013
Integrating MongoDB and Hadoop: Why & How
The Mortar blog:
Mongo was built for data storage and retrieval, and Hadoop was written for data processing. So naturally, data processing is often better offloaded to Hadoop. Here’s why:
- Easier, more expressive language
- Libraries to build on
- Big performance improvements
- Separate workloads mean less load
For the how part, the post recommends their own Hadoop-as-a-Service platform and a set of libraries the Mortar platform provides.
✚ While browsing the Mortar blog and website I couldn’t find any information related to the costs of transferring data. The AWS services usually have a data transfer dimension, which most often has an important impact on the total costs of a solution.
via: http://blog.mortardata.com/post/43080668046/mongodb-hadoop-why-how
Which Big Data Company Has the World's Biggest Hadoop Cluster?
Jimmy Wong:
Which companies use Hadoop for analyzing big data? How big are their clusters? I thought it would be fun to compare companies by the size of their Hadoop installations. The size would indicate the company’s investment in Hadoop, and subsequently their appetite to buy big data products and services from vendors, as well as their hiring needs to support their analytics infrastructure.
Unfortunately the data available is sooo little and soooo old.
via: http://www.hadoopwizard.com/which-big-data-company-has-the-worlds-biggest-hadoop-cluster/
Thursday, 21 February 2013
Vague Goals Seed Big Data Failures
Doug Henschen for InformationWeek, discussing the results of Infochimps’s survey “CIOs & Big Data: What Your IT Team Wants You to Know”:
What business problem are you trying to solve? If you could tell your IT employees what it is, they’d have a much better crack at big data success.
[…]
“Inaccurate scope” is cited by 58% as the top reason that big data IT projects fail. “Too many big data projects are structured like boil-the-ocean experiments”, Infochimps’ CEO, Jim Kaskade, told InformationWeek.
Some call these vague and unrealistic expectations the “trough of disillusionment”.
Wednesday, 20 February 2013
Hortonworks: The Fastest Path to Innovation: Community Driven Open Source
Shaun Connolly for the Hortonworks blog:
we believe the fastest way to innovate is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and to work diligently to progress existing open source projects and incubate new projects to meet those needs.
In support of our approach, this week we’ve announced the submission of two new incubation projects to the Apache Software foundation together with the launch of the “Stinger Initiative”, all aimed at enhancing the security and performance of Hadoop applications.
I’m forced, but extremely happy, to take back what I said.
- Stinger: an initiative to speed up Apache Hive for interactive queries. Read about it here
- Knox Gateway: a solution for authentication and security in Hadoop. More details here
- Tez framework: a new Hadoop YARN-based runtime for improved latency and throughput. Details here
Hortonworks believes in open source.
via: http://hortonworks.com/blog/hortonworks-community-leadership/
Counting Triangles Smarter (Or How to Beat Big Data Vendors at Their Own Game)
Davy Suvee shows that Datablend’s custom datastore could deliver better performance than generic solutions like Hadoop, Vertica, or Exadata:
Although Vertica and Oracle’s results are impressive, they require a significant hardware setup of 4 nodes, each containing 96GB of RAM and 12 cores. My challenge: beating the Big Data vendors at their own game by calculating triangles through a smarter algorithm that is able to deliver similar performance on commodity hardware (i.e. my MacBook Pro Retina).
Considering the size of the data (86 million relationships), I wonder what the result would be using a graph database like Neo4j. Anyone up for testing it?
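For anyone wanting a baseline to test against, here is a minimal sketch of triangle counting in Python. To be clear, this is not Datablend’s algorithm, which the quote does not describe. It is the plain adjacency-set intersection approach, and the function name and sample graphs are my own.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)

    triangles = 0
    for u in list(neighbors):
        for v in neighbors[u]:
            if u < v:
                # Shared neighbors of u and v close a triangle. Requiring
                # w > v counts each triangle exactly once, as u < v < w.
                common = neighbors[u] & neighbors[v]
                triangles += sum(1 for w in common if w > v)
    return triangles


# Hypothetical sample graphs: one triangle with a tail, and a 4-clique.
one_triangle = [(1, 2), (2, 3), (1, 3), (3, 4)]
four_clique = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

This keeps the whole graph in memory, so it says nothing about how any approach would behave at 86 million relationships; it is only a correctness reference for smaller inputs.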
Tuesday, 19 February 2013
Hortonworks and Community Driven Hadoop
First, “We Believe… in community driven Enterprise Apache Hadoop” and then, the next day, “Announcing Apache Hadoop 2.0.3 Release and Roadmap”. These two posts, published within two days of each other on Hortonworks’s blog, don’t entirely support each other. At least not without a slightly different formulation and a link to the announcement sent to the Hadoop mailing list.
Thursday, 14 February 2013
Data Deduplication Tactics With HDFS and MapReduce
5 techniques and links to research papers about data deduplication using HDFS and MapReduce:
Some of the common methods for data deduplication in storage architecture include hashing, binary comparison and delta differencing. In this post, we focus on how MapReduce and HDFS can be leveraged for eliminating duplicate data.
Patrick Durusau
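Of the methods named in the quote, hashing is the easiest to picture in MapReduce terms. Below is a minimal, in-memory sketch in Python, assuming whole-file deduplication with SHA-256. It does not use Hadoop or HDFS, it is not taken from the linked post, and the file names and helper functions are hypothetical.

```python
import hashlib
from collections import defaultdict


def map_record(path, content):
    """Map step: emit (content digest, path) for one file."""
    return hashlib.sha256(content).hexdigest(), path


def reduce_by_digest(pairs):
    """Shuffle/reduce step: group paths that share the same digest."""
    groups = defaultdict(list)
    for digest, path in pairs:
        groups[digest].append(path)
    return groups


def deduplicate(files):
    """Return one representative path per distinct content, sorted."""
    pairs = [map_record(path, content) for path, content in files.items()]
    groups = reduce_by_digest(pairs)
    # Keep the alphabetically first path in each group of duplicates.
    return sorted(min(paths) for paths in groups.values())


# Hypothetical input: file name -> raw bytes.
files = {"a.txt": b"hello", "b.txt": b"world", "c.txt": b"hello"}
unique_paths = deduplicate(files)
```

In a real job the digest would be the shuffle key, so identical content lands on the same reducer regardless of which node stored it.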
via: http://www.hadoopsphere.com/2013/02/data-de-duplication-tactics-with-hdfs.html