hadoop: All content tagged as hadoop in NoSQL databases and polyglot persistence

Use Cases for Hadoop's New Pluggable Sort

What is the big deal about Sort? Sort is fundamental to the MapReduce framework: the data is sorted between the Map and Reduce phases. Syncsort’s contribution allows the native Hadoop sort to be replaced by an alternative sort implementation, for both the Map and Reduce sides, i.e. it makes the Sort phase pluggable.

Tendu Yogurtcu describes three new use cases opened up by the pluggable sort implementation that Syncsort contributed to Apache Hadoop:

  1. Optimized sort implementations and full joins
  2. Hash-based aggregations with no sort requirements
  3. Reducers that can start before all Mappers complete
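To make the second use case more concrete, here is a minimal sketch in plain Python (not the Hadoop API) contrasting sort-based grouping with hash-based aggregation. The `pairs` data and both function names are made up for illustration; the only point is that a sum per key does not need sorted input.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical map output: (key, value) pairs in arbitrary order.
pairs = [("b", 2), ("a", 1), ("b", 3), ("a", 4), ("c", 5)]

def sort_based_sum(pairs):
    """Classic MapReduce style: sort by key, then reduce each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: sum(value for _, value in group)
            for key, group in groupby(ordered, key=itemgetter(0))}

def hash_based_sum(pairs):
    """Hash aggregation: a single pass over the data, with no sort step."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

Both functions return the same totals. The hash-based version skips the sort entirely, which is the kind of shortcut a pluggable sort phase makes possible when the job does not need ordered keys.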

Original title and link: Use Cases for Hadoop’s New Pluggable Sort (NoSQL database©myNoSQL)

via: http://blog.syncsort.com/2013/02/hadoop-mapreduce-to-sort-or-not-to-sort/


A Brief Guide to Pig Latin for the SQL Guy

Cat Miller from Mortar Data offers a quick intro to Pig Latin from a SQLish perspective:

Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.

Pig and SQL are similar in the operations they support, but the underlying model is different: Pig Latin is a procedural, step-by-step data manipulation language, while SQL is a declarative query language.
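A rough way to feel the difference without installing Pig is the sketch below. It uses Python's built-in `sqlite3` for the declarative side and ordinary Python steps standing in for a Pig-style data flow. The `orders` table, the sample rows, and the function names are all invented for this example, and the comments only loosely map each step to a Pig Latin operator.

```python
import sqlite3

# Hypothetical sample data: (customer, amount) rows.
rows = [("ann", 10), ("bob", 5), ("ann", 7), ("bob", 1), ("cat", 3)]

def declarative_totals(rows):
    """SQL style: describe the result you want; the engine picks the plan."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "WHERE amount > 2 GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return result

def stepwise_totals(rows):
    """Pig style: name each intermediate relation, one transformation at a time."""
    filtered = [r for r in rows if r[1] > 2]               # roughly FILTER ... BY
    grouped = {}
    for customer, amount in filtered:                      # roughly GROUP ... BY
        grouped.setdefault(customer, []).append(amount)
    totals = [(c, sum(a)) for c, a in grouped.items()]     # roughly FOREACH ... GENERATE
    return sorted(totals)                                  # roughly ORDER ... BY
```

The two functions produce the same rows. In the second one every intermediate result has a name and the order of the steps is yours to decide, which is what tends to disorient people coming from SQL.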

Original title and link: A Brief Guide to Pig Latin for the SQL Guy (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/pig-eye-for-the-sql-guy/


Intel Distribution of H* in 21 Links

I don’t think anyone besides the PR department at Intel had the time to read through all the media coverage the Intel Distribution of H* got in the last couple of days. Here’s a collection of links for your reference. Pick wisely.

Intel Announcements

  1. Intel Aims to Enrich Lives by Unlocking the Power of Big Data

  2. Intel Jumps into HADOOP

Media Coverage

  1. NYTimes Bits: Intel’s Big Data Push

  2. Wired: Intel Leaps on Software Elephant for Trip to Hardware Heaven

  3. WSJ: Intel Releases Own Version of Hit Hadoop Software

  4. ZDNet: Intel baking Apache Hadoop into silicon for big data, security uses

  5. The Register: Intel takes on all Hadoop disties to rule big data munching

  6. Forbes: Can Intel Heal the Hadoop Open Source Ecosystem?

  7. Forbes: Intel Drops a Big Data Shocker

  8. Slashdot: Intel Launches Its Own Apache Hadoop Distribution

  9. GigaOm: Cloudera who? Intel announces its own Hadoop distribution

  10. SiliconANGLE: Intel Gets Inside Big Data Chips With Hadoop

  11. eweek: Intel Releases Hadoop Distribution for Big Data

  12. InformationWeek: Intel Unveils New Distribution For Apache Hadoop

  13. Computerworld: Intel releases Hadoop software primed for its own chips

  14. PCMag: Intel Tackles Big Data With Release of Apache Hadoop Platform (http://www.pcmag.com/article2/0,2817,2415931,00.asp)

  15. DataInformed: Intel Jumps into Big Data Pool with Hadoop Distribution

  16. Slashdot: Intel’s New Hadoop Distribution Could Benefit Its Hardware Bottom Line

  17. VentureBeat: Intel moves into ‘big data’ software with Apache Hadoop distribution

  18. DatacenterKnowledge: Intel Enters the Hadoop Software Market

  19. Datacenter Dynamics: Intel launches own Hadoop distribution

Intel Distribution Partners

If, like me, you’re interested in archiving these, I’ve put this list together in a format that is easier to read and archive.

Original title and link: Intel Distribution of H* in 21 Links (NoSQL database©myNoSQL)


Some Interesting Facts, Sorry FUD About Hadoop MapReduce

If you feel like reading a bit of bullet-point-style FUD about Hadoop, check out Dr. David F. Rico’s PDF.

Original title and link: Some Interesting Facts, Sorry FUD About Hadoop MapReduce (NoSQL database©myNoSQL)


The History of Hadoop Changed the World

Over the next few years, Hadoop reinvented data analysis not only at Facebook and Yahoo but so many other web services. And then an army of commercial software vendors started selling the thing to the rest of the world. Soon, even the likes of Oracle and Greenplum were hawking Hadoop. These companies still treated Hadoop as an adjunct to the traditional database — as a tool suited only to certain types of data analysis. But now, that’s changing too.

I found the above fragment, which fully describes the impact Hadoop has had and still has in the data world, in Cade Metz’s article for Wired about Greenplum’s Pivotal HD announcement: “Why Hadoop Is the Future of the Database.”

Original title and link: The History of Hadoop Changed the World (NoSQL database©myNoSQL)


Apache Pig Goes 0.11

Almost lost among the tons of Hadoopy releases, I found the announcement of Apache Pig 0.11, which, as befits a serious open source project, packs some nice new features into a point release:

  1. DateTime data type
  2. RANK, CUBE, ROLLUP operators
  3. Groovy UDFs

Plus tons of improvements.

Original title and link: Apache Pig Goes 0.11 (NoSQL database©myNoSQL)

via: https://blogs.apache.org/pig/entry/apache_pig_it_goes_to


Spring for Apache Hadoop 1.0 Goes GA: Wrapping Hadoop in XML

Costin Leau, announcing the GA of Spring for Apache Hadoop:

What we have observed is that using the standard out of the box tools that come with Hadoop, you can easily end up with Hadoop applications that are a poorly structured collection of command line utilities, scripts and pieces of code stitched together.

Leaving aside the jokes and the fact that I don’t fully understand the purpose of this project, congrats on the release!

Original title and link: Spring for Apache Hadoop 1.0 Goes GA: Wrapping Hadoop in XML (NoSQL database©myNoSQL)

via: http://blog.springsource.org/2013/02/26/shdp-1-0-goes-ga/


Cloudera Pissed Off

Charles Zedlewski states Cloudera’s position on the recent attacks on Hadoop and Impala:

I’m reminded of our open source strategy this week not only because of the further validation of Hadoop’s popularity but also because of the entry of a new round of proprietary imitators. At one point there were six distinct vendors all promoting proprietary filesystems as alternatives to HDFS, many of which included breathless claims of how they could make Apache Hadoop faster and “more powerful.” This year we get to see history repeat itself, this time with SQL engines. The marketing is nearly identical to that of the proprietary filesystem era: damning open source with faint praise, pointing out its limitations and extolling the virtues of some feature(s) proprietary to that particular vendor.

Proprietary SQL vendors will pull a page from the proprietary storage playbook: damn open source Impala with faint praise and point out its limitations, both real and contrived. They will be equally ineffective. We will continue to bet on an open, integrated, and highly flexible big data platform. Saying you are “all in on Hadoop” while simultaneously promoting a proprietary platform means you are missing the point.

Neither Cloudera nor the other companies that have invested heavily, if not everything, in the Hadoop ecosystem are big enough not to care about large corporations attacking their bets. Every corporation is trying to emulate the Microsoft strategy: wait for a new technology to be validated, then jump at the opportunity with all your forces. But I really hope open source will prevail.

Original title and link: Cloudera Pissed Off (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/02/open-source-flattery-and-the-platform-for-big-data/


What Makes Amazon Redshift Faster Than Hive?

I’m not implying that this question appeared on Quora because of my link and comments about Redshift’s performance and costs at Airbnb, but Reynold Xin’s answer covers, in a more formal way, the reasons I suggested in that post for Redshift being faster than Hive:

Some of the advantages you gain from massive scale and flexibility make it challenging to build a more performant query engine. The following outlines how various features (or lack of features) influences performance:

  1. data format
  2. task launch overhead (nb: this can be optimized in Hive/Hadoop)
  3. intermediate data materialization vs pipelining
  4. columnar data format
  5. columnar query engine
  6. faster S3 connection
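The columnar points in the list above are easier to picture with a toy example. The sketch below is plain Python and says nothing about how Redshift or Hive actually lay out data on disk. The table, its column names, and the helper functions are assumptions made up for illustration.

```python
# Hypothetical table with three columns, stored row by row.
rows = [
    {"id": 1, "country": "US", "price": 120},
    {"id": 2, "country": "FR", "price": 80},
    {"id": 3, "country": "US", "price": 100},
]

def to_columns(rows):
    """Pivot the row-oriented records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def sum_price_rowwise(rows):
    """Row layout: every whole record is visited to pick out a single field."""
    return sum(row["price"] for row in rows)

def sum_price_columnar(columns):
    """Column layout: only the 'price' values are read; other columns are untouched."""
    return sum(columns["price"])

columns = to_columns(rows)
```

Both functions compute the same total. In a real system the columnar version would read far fewer bytes for an aggregate over one column of a wide table, which is one plausible reason an analytic query over a few columns runs faster on a columnar store.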

Original title and link: What Makes Amazon Redshift Faster Than Hive? (NoSQL database©myNoSQL)

via: http://www.quora.com/Hive-computing/What-makes-Amazon-Redshift-faster-than-Hive


Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem

Avik Dey (Intel) sent the announcement of the new open source project from Intel to the Hadoop mailing list:

As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.

Project Rhino targets security at all levels, from encryption and key management to cell-level ACLs and audit logging.

Original title and link: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem (NoSQL database©myNoSQL)

via: http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201302.mbox/%3cCD5137E5.15610%25avik.dey@intel.com%3e


Redshift Performance & Cost at Airbnb

Henry Cai from Airbnb reports on their experiment with moving from Hive on Hadoop to Amazon Redshift:

As shown above the performance gain is pretty significant, and the cost saving is even more impressive: $13.60/hour versus $57/hour. This is hard to compare due to the different pricing models, but check out pricing here for more info. In fact, our analysts like Redshift so much that they don’t want to go back to Hive and other tools even though a few key features are lacking in Redshift. Also, we have noticed that big joins of billions of rows tend to run for a very long time, so for that we’d go back to hadoop for help.

If I’m not mistaken, this is the second story in the last week about the performance of Redshift. But here’s something I don’t understand (or don’t see mentioned in this post):

  1. You use Hadoop to store your data. The reason is that 12 months ago and 6 months ago (and still today) there was no more cost-effective and productive solution.
  2. During this time you learn about the data. You develop models and queries.
  3. Your analysts prefer SQL because that’s what makes them more productive.
  4. You take the data and the knowledge you’ve built up during this time, and you shape it to fit into a columnar analytic database.
  5. Then you write that the columnar analytic database is more performant than Hive over Hadoop.

To me this feels like saying that you are more efficient in your mother tongue than in a foreign language. Or am I missing something?

Original title and link: Redshift Performance & Cost at Airbnb (NoSQL database©myNoSQL)


Integrating MongoDB and Hadoop: Why & How

The Mortar blog:

Mongo was built for data storage and retrieval, and Hadoop was written for data processing. So naturally, data processing is often better offloaded to Hadoop. Here’s why:

  1. Easier, more expressive language
  2. Libraries to build on
  3. Big performance improvements
  4. Separate workloads mean less load

For the how part, the post recommends their own Hadoop-as-a-Service platform and a set of libraries the Mortar platform provides.

✚ While browsing the Mortar blog and website I couldn’t find any information related to the costs of transferring data. The AWS services usually have a data transfer dimension, which most often has an important impact on the total costs of a solution.

Original title and link: Integrating MongoDB and Hadoop: Why & How (NoSQL database©myNoSQL)

via: http://blog.mortardata.com/post/43080668046/mongodb-hadoop-why-how