
Intel: All content tagged as Intel in NoSQL databases and polyglot persistence

Hadoop security: unifying Project Rhino and Sentry

One result of Intel’s investment in Cloudera is that the two companies’ teams have been put together to work on the same projects:

As the goals of Project Rhino and Sentry to develop more robust authorization mechanisms in Apache Hadoop are in complete alignment, the efforts of the engineers and security experts from both companies have merged, and their work now contributes to both projects. The specific goal is “unified authorization”, which goes beyond setting up authorization policies for multiple Hadoop components in a single administrative tool; it means setting an access policy once (typically tied to a “group” defined in an external user directory) and having it enforced across all of the different tools that this group of people uses to access data in Hadoop – for example access through Hive, Impala, search, as well as access from tools that execute MapReduce, Pig, and beyond.

A great first step.

You know what would be even better? A single security framework for Hadoop instead of two.
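To make “setting an access policy once, tied to a group” concrete, here is a minimal sketch of how such a policy could be defined through Sentry’s SQL-based GRANT statements issued over Hive JDBC. The HiveServer2 host, database, role, and group names below are hypothetical, and this assumes a Sentry-enabled cluster where SQL-based GRANT is available:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class UnifiedPolicySketch {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver; on a Sentry-enabled cluster the roles defined here
        // are the same ones checked for Hive, Impala and Search access.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default", "admin", "");
             Statement stmt = conn.createStatement()) {
            // Define the policy once, tied to a group from the external user directory.
            stmt.execute("CREATE ROLE analyst");
            stmt.execute("GRANT ROLE analyst TO GROUP analysts");
            stmt.execute("GRANT SELECT ON DATABASE sales TO ROLE analyst");
        }
    }
}
```

Whether the analysts group then comes in through Hive, Impala, or a BI tool sitting on top of either, the same role is what gets checked.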

Original title and link: Hadoop security: unifying Project Rhino and Sentry (NoSQL database©myNoSQL)

via: http://vision.cloudera.com/project-rhino-and-sentry-onward-to-unified-authorization/


Intel kills a Hadoop and feeds another

I seriously doubt you could have missed the 2nd part of this, but here’s the shortest executive summary:

  1. Intel has killed its own distribution of Hadoop — is there anyone who would disagree that this is a good idea?
  2. Intel has invested $740 million in Cloudera (for an 18% stake) — there’s no typo. 740 million.

The main questions:

  1. Where will Cloudera put the $900 million raised in the last round(s)?
  2. Why did Intel invest so much?

These questions were also asked by Dan Primack for CNN Money, and after looking at the deal from different angles he comes up empty.

So let’s check other sources:

  1. TechCrunch initially speculated that much of the investment went to existing shareholders.

    The post was later updated with a comment from Cloudera’s VP of marketing stating that the majority of the money went to the company. But no word on how it will be used.

  2. Reuters writes that Intel made the investment to ensure their leading position in server processors:

    Intel hopes that encouraging more companies to leap into Big Data analysis will lead to higher sales of its high-end Xeon server processors. The chipmaker believes that hitching its wagon to Cloudera’s version of Hadoop, instead of pushing its own version, will make that happen faster.

    Still no word on how Cloudera will be using the money.

  3. Derrick Harris for GigaOm writes that the deal makes a lot of sense for both companies1:

    Cloudera needs capital and Intel’s huge sales force to keep up its engineering efforts and grow the company internationally.

    As part of the deal, Cloudera will be an early adopter of Intel gear and will optimize its Hadoop software to run on Intel’s latest technologies. Intel will port some of its work into the Cloudera distribution and will maintain its own Hadoop engineering team that will work alongside Cloudera’s engineers to help unite the two companies’ goals.

  4. Jeff Kelly for SiliconAngle emphasizes the same channel advantages:

    Cloudera’s biggest reseller partner is Oracle. Based on my reading of the Intel announcement, the deal is not an official reseller partnership, but Intel will “market and promote CDH and Cloudera Enterprise to its customers as its preferred Hadoop platform.” Not quite as nice as having the Intel salesforce closing deals for it, but Cloudera stands to gain significant new business from the arrangement.


So how about this short list of how Cloudera will use this round:

  1. a part goes for international expansion
  2. a larger part goes to early shareholders
  3. the largest part goes into acquisitions

As for Intel, what if this investment also sealed an exclusive deal for a Hadoop-centric, Cloudera-supported, Intel-powered appliance?


  1. Insert snarky comment here about a $740m deal that would not make sense to one of the parties. How about not making sense to either of them?

Original title and link: Intel kills a Hadoop and feeds another (NoSQL database©myNoSQL)


The Forrester Wave for Hadoop market

Update: I’d like to thank the people who pointed out in the comment thread that I’ve messed up quite a few aspects of my comments about the report. I don’t believe in taking down posts that have been out for a while, so please be warned that this article can basically be ignored.

Thank you, and my apologies for the comments that misinterpreted the report.


This is the Q1 2014 Forrester Wave for Hadoop:

Forrester wave for Hadoop

A couple of thoughts:

  1. Cloudera, Hortonworks, and MapR are positioned very (very) close.

    1. Hortonworks is positioned closer to the top right, meaning they report more customers/a larger install base.
    2. MapR is higher on the vertical axis, meaning that MapR’s strategy is rated slightly better.

      For me, MapR’s strategy can be briefly summarized as:

      1. address some of the limitations in the Hadoop ecosystem
      2. provide API-compatible products for major components of the Hadoop ecosystem
      3. use these (trademarked) Apache product names to advertise their products

      I think the 1st point above explains the better positioning of MapR’s current offering.

    3. Even though Cloudera was the first pure-play Hadoop distribution, it’s positioned behind both Hortonworks and MapR.

  2. IBM has the largest market presence. That’s a big surprise, as I very rarely hear clear messages from IBM.

  3. IBM and Pivotal Software are considered to have the strongest strategy. That’s another interesting point in Forrester’s report. Except for the fact that IBM has a ton of data products and that Pivotal Software offers more than Hadoop, I don’t know exactly what explains this position.

    The Forrester report’s Strategy positioning is based on quantifying the following categories: Licensing and pricing, Ability to execute, Product road map, and Customer support. IBM and Pivotal are ranked first in all these categories (with maximum marks for the last 3). As a comparison, Hortonworks has 3/5 for Ability to execute — this must be related only to budget; Cloudera has 3/5 for both Ability to execute and Customer support.

    Pivotal is 3rd from last in terms of current offering. I guess my hypothesis for why Pivotal is ranked 1st in terms of strategy is wrong.

  4. Microsoft, which through its collaboration with Hortonworks came up with HDInsight (basically enabling Hadoop for Excel and its data warehouse offering), is positioned 2nd last on all 3 axes.

    No one seems to love Microsoft anymore.

  5. While not a pure Hadoop player, DataStax has been offering the DataStax Enterprise platform, which includes support for analytics through Hadoop and search through Solr, for at least 2 years. That’s actually way before anyone else from the group of companies in Forrester’s report had anything similar1.

    This report focuses only on “general-purpose Hadoop solutions based on a differentiated, commercial Hadoop distribution”.

You can download the report after registering on Hortonworks’ site: here.


  1. DataStax is my employer. But what I wrote is a pure fact. 

Original title and link: The Forrester Wave for Hadoop market (NoSQL database©myNoSQL)


What is Intel doing in the Hadoop business?

In case you forgot, Intel offers a distribution of Hadoop. Tony Baer, Principal Analyst at Ovum, explains why Intel created this distribution:

The answer is that Hadoop is becoming the scale-out compute pillar of Intel’s emerging Software-Defined Infrastructure initiative for the data center – a vision that virtualizes general- and special-purpose CPUs powering functions under common Intel hardware-based components. The value proposition that Intel is proposing is that embedding serving, network, and/or storage functions into the chipset is a play for public or private cloud – supporting elasticity through enabling infrastructure to reshape itself dynamically according to variable processing demands.

Theoretically it sounds possible. But as with the other attempts to explain Intel’s Hadoop distribution, I don’t believe this one either. Unfortunately I don’t have a good explanation myself, so I’ll keep asking.

Original title and link: What is Intel doing in the Hadoop business? (NoSQL database©myNoSQL)

via: http://ovum.com/2014/01/13/intel-refreshes-its-hadoop-distribution/


Pigs can build graphs too for graph analytics

An extremely interesting and intriguing use and extension of Pig at Intel:

Pigs eat everything and Pig can ingest many data formats and data types from many data sources, in line with our objectives for Graph Builder. Also, Pig has native support for local file systems, HDFS, and HBase, and tools like Sqoop can be used upstream of Pig to transfer data into HDFS from relational databases. One of the most fascinating things about Pig is that it only takes a few lines of code to define a complex workflow comprised of a long chain of data operations. These operations can map to multiple MapReduce jobs and Pig compiles the logical plan into an optimized workflow of MapReduce jobs. With all of these advantages, Pig seemed like the right tool for graph ETL, so we re-architected Graph Builder 2.0 as a library of User-Defined Functions (UDFs) and macros in Pig Latin.
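Graph Builder’s code is Intel’s, but to give a flavor of what “a library of UDFs and macros in Pig Latin” means in practice, here is a minimal, hypothetical Pig EvalFunc written in Java that turns a (user, page) record into a bag of edge tuples. The class name and field layout are illustrative only, not Graph Builder’s actual API:

```java
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical UDF: given a tuple (user, page), emit a one-edge bag (user, "visited", page).
public class ExtractEdges extends EvalFunc<DataBag> {
    private static final TupleFactory TUPLES = TupleFactory.getInstance();
    private static final BagFactory BAGS = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2) {
            return null; // nothing to turn into an edge
        }
        DataBag edges = BAGS.newDefaultBag();
        Tuple edge = TUPLES.newTuple(3);
        edge.set(0, input.get(0));   // source vertex
        edge.set(1, "visited");      // edge label
        edge.set(2, input.get(1));   // destination vertex
        edges.add(edge);
        return edges;
    }
}
```

In Pig Latin you would REGISTER the jar and apply it with something like FOREACH clicks GENERATE FLATTEN(ExtractEdges(user, page)); Pig then compiles the whole chain into an optimized set of MapReduce jobs, which is exactly the property the Graph Builder team is leaning on.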

Original title and link: Pigs can build graphs too for graph analytics (NoSQL database©myNoSQL)

via: http://blogs.intel.com/intellabs/2013/12/17/pigs-can-build-graphs-too/


Hadoop on top of… Intel adds Lustre support to Hadoop

Intel adds Lustre support to Hadoop:

We abstracted out an HDFS layer but underneath that it is actually talking to lustre.

This is not the first project based on the principle “we already have this distributed system, file system or database, so why not reuse it for Hadoop?”. What would be the first step of such a project? Provide an HDFS API-compatible layer on top of your existing system. But how about the other assumptions in HDFS: large blocks, sequential access, local access, etc.? How do you guarantee that your integration addresses all of them?
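For context, the HDFS compatibility story revolves around the org.apache.hadoop.fs.FileSystem abstraction: a vendor plugs its own implementation class in under a URI scheme and client code keeps calling the same API. Here is a rough sketch of what the client side looks like; the lustre:// scheme mapping and the implementation class name are hypothetical:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LustreBackedClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical: map the lustre:// scheme to the vendor's FileSystem implementation.
        conf.set("fs.lustre.impl", "org.example.lustre.LustreFileSystem");

        Path path = new Path("lustre:///data/events/part-00000");
        FileSystem fs = FileSystem.get(URI.create("lustre:///"), conf);

        // Application code is identical to HDFS; the open question is whether the
        // backend honors HDFS assumptions such as large blocks and data locality.
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] locations =
            fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.println("block size: " + status.getBlockSize()
            + ", block locations reported: " + locations.length);

        try (FSDataInputStream in = fs.open(path)) {
            byte[] buffer = new byte[4096];
            int read = in.read(buffer); // sequential read, as a MapReduce task would do
            System.out.println("read " + read + " bytes");
        }
    }
}
```

It is precisely the block-size and locality information surfaced by getFileBlockLocations that a non-HDFS backend has to provide sensibly for MapReduce scheduling to behave as expected.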

If this trend continues, I could see one of the companies behind open source Hadoop (Cloudera or Hortonworks, or both) coming up with a TCK sold to any company that claims HDFS compatibility.

Original title and link: Hadoop on top of… Intel adds Lustre support to Hadoop (NoSQL database©myNoSQL)


Thoughts on Intel's Hadoop distribution

Gwen Shapira has a very interesting theory about what led Intel to create its own distribution of Hadoop:

Intel is doing for Hadoop the same thing it did for C compilers – make sure they use the best hardware enhancements available in the CPUs and other hardware components available from Intel. The nice thing is that the enhancements are available as open source – Intel doesn’t care that the software is free, since they are selling the hardware!

But I don’t believe it.

  1. Intel is already using Hadoop internally. They can test the hell out of Hadoop and create improvements.
  2. As far as I know, Intel already has Hadoop committers who could push these improvements out. Even if Intel didn’t have Hadoop committers, I don’t think anyone in the Hadoop community would veto patches providing better utilization of cluster resources.

On the other hand I don’t have an alternative theory1. But I have a hypothesis that is related to the list of partners Intel signed when announcing their distribution2.


  1. I’m not going to state the obvious that Intel wants to make sure Hadoop works best on Intel so they don’t lose any market share. 

  2. Hint: check how many in that list are server/cluster/appliance vendors. 

Original title and link: Thoughts on Intel’s Hadoop distribution (NoSQL database©myNoSQL)

via: http://www.pythian.com/blog/intel-hadoop-distribution/


How Many Hadoops?

The short answer is there is only one Apache Hadoop distribution.

The long answer is that there are many distributions that include Apache Hadoop or are claiming compatibility with Apache Hadoop.

The oldest and probably most popular: Cloudera’s Distribution of Hadoop (CDH).

The 100% open source: Hortonworks Data Platform.

The proprietary: MapR.

The blue one: IBM InfoSphere BigInsights.

The latest: WANdisco Hadoop WDD, the Intel Distribution of Hadoop, and Pivotal HD from EMC Greenplum.

There’s also the version Facebook is running on its cluster, which includes Facebook Corona: a different approach to job scheduling and resource management.

But this list is not complete as it doesn’t include appliances featuring Hadoop. In this category we have:

  1. Oracle’s Big Data appliance featuring Cloudera’s Distribution of Hadoop
  2. NetApp’s Hadooplers
  3. EMC Greenplum DCA
  4. Teradata Aster Discovery Platform featuring Hortonworks Data Platform
  5. Data Direct Networks (DDN)

I hope I didn’t miss any important ones1. To conclude this list, my question is: who actually benefits from all these distributions?


  1. I left aside Hadoop-as-a-Service for now.

Original title and link: How Many Hadoops? (NoSQL database©myNoSQL)


Intel Distribution of H* in 21 Links

I don’t think anyone besides the PR department at Intel had the time to read through all the media coverage the Intel Distribution of H* got in the last couple of days. Here’s a collection of links for your reference. Pick wisely.

Intel Announcements

  1. Intel Aims to Enrich Lives by Unlocking the Power of Big Data

  2. Intel Jumps into HADOOP

Media Coverage

  1. NYTimes Bits: Intel’s Big Data Push

  2. Wired: Intel Leaps on Software Elephant for Trip to Hardware Heaven

  3. WSJ: Intel Releases Own Version of Hit Hadoop Software

  4. ZDNet: Intel baking Apache Hadoop into silicon for big data, security uses

  5. The Register: Intel takes on all Hadoop disties to rule big data munching

  6. Forbes: Can Intel Heal the Hadoop Open Source Ecosystem?

  7. Forbes: Intel Drops a Big Data Shocker

  8. Slashdot: Intel Launches Its Own Apache Hadoop Distribution

  9. GigaOm: Cloudera who? Intel announces its own Hadoop distribution

  10. SiliconAngle: Intel Gets Inside Big Data Chips With Hadoop

  11. eweek: Intel Releases Hadoop Distribution for Big Data

  12. InformationWeek: Intel Unveils New Distribution For Apache Hadoop

  13. Computerworld: Intel releases Hadoop software primed for its own chips

  14. PCMag: [Intel Tackles Big Data With Release of Apache Hadoop Platform](http://www.pcmag.com/article2/0,2817,2415931,00.asp)

  15. DataInformed: Intel Jumps into Big Data Pool with Hadoop Distribution

  16. Slashdot: Intel’s New Hadoop Distribution Could Benefit Its Hardware Bottom Line

  17. VentureBeat: Intel moves into ‘big data’ software with Apache Hadoop distribution

  18. DatacenterKnowledge: Intel Enters the Hadoop Software Market

  19. Datacenter Dynamics: Intel launches own Hadoop distribution

Intel Distribution Partners


If, like me, you’re interested in archiving these, I’ve put this list together in a format that’s easier to read and archive.

Original title and link: Intel Distribution of H* in 21 Links (NoSQL database©myNoSQL)


Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem

Avik Dey of Intel sent the announcement of the company’s new open source project to the Hadoop mailing list:

As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits performance impact. Project Rhino is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and contribute the code back to Apache.

Project Rhino targets security at all levels: from encryption and key management, to cell-level ACLs, to audit logging.
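To make the cell-level ACLs point concrete, here is a rough sketch using HBase’s client API (cell ACLs shipped with HBase 0.98). It assumes a cluster with the AccessController coprocessor enabled; the table, column, and user names are made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.security.access.Permission;
import org.apache.hadoop.hbase.util.Bytes;

public class CellAclSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table; requires the AccessController coprocessor on the cluster.
        HTable table = new HTable(conf, "patients");
        try {
            Put put = new Put(Bytes.toBytes("patient-42"));
            put.add(Bytes.toBytes("pii"), Bytes.toBytes("ssn"),
                    Bytes.toBytes("***-**-****"));
            // Attach a per-cell ACL: grant READ on this particular cell to one user,
            // independently of table- or column-family-level grants.
            put.setACL("billing_clerk", new Permission(Permission.Action.READ));
            table.put(put);
        } finally {
            table.close();
        }
    }
}
```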

Original title and link: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem (NoSQL database©myNoSQL)

via: http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201302.mbox/%3cCD5137E5.15610%25avik.dey@intel.com%3e


EMC Contributes 1000+ Nodes Cluster for Apache Hadoop Testing

The Greenplum Analytics Workbench incorporates technology from the world’s leading software and hardware manufacturers with the intention of providing the infrastructure needed to facilitate Apache Hadoop innovation. The test bed cluster, which consists of 1,000+ hardware nodes or 10,000 nodes with the addition of virtual machines, features 24 petabytes of physical storage. This is the equivalent of nearly half of the entire written works of mankind, from the beginning of recorded history.

Thanks!

Original title and link: EMC Contributes 1000+ Nodes Cluster for Apache Hadoop Testing (NoSQL database©myNoSQL)

via: http://www.greenplum.com/news/greenplum-analytics-workbench


Data Is the New Currency. But Who’s Leading the Way?

In 2005, Tim O’Reilly said: “data is the next Intel Inside”. Today, IDC’s Mario Morales (VP of semiconductor research) says data is the new currency. All’s good until you read the continuation:

And the companies that understand this are the ones already developing the analytics and infrastructure to extract that value—companies like IBM, HP, Intel, Microsoft, TI, Freescale and Oracle.

The article (nb: may require registration) continues by looking at what each of these companies is doing in the Big Data space, but devotes a large part to IBM Watson.

Going back to the question “who’s leading the Big Data way”, let’s take a quick look at the technology behind Watson. According to Jeopardy Goes to Hadoop and About Watson, Watson’s technology is based on Apache Hadoop, using an IBM language technology built on the Apache UIMA platform[1] and running Linux on IBM boxes.

To me it looks like open source is leading the advances in Big Data and these large organizations are just connecting the dots (as in packaging these technologies for enterprise environments and contributing missing pieces here and there)[2]. When did this happen before?


  1. Dmitriy Ryaboy taught me that UIMA came out of IBM in the first place and they’ve been critical in its development.  

  2. Or they are very secretive about their internal initiatives and research.  

Original title and link: Data Is the New Currency. But Who’s Leading the Way? (NoSQL database©myNoSQL)