EMC: All content tagged as EMC in NoSQL databases and polyglot persistence

Pivotal People

I really like the people page on the recently announced Pivotal website, and in particular a couple of the individual pictures:

Pivotal People

✚ I did some digging but came up empty on the relationship between Pivotal and Sir Tim Berners-Lee and Professor Joseph M. Hellerstein (CEO of Trifacta). Update: according to this post, the page lists people who inspire the Pivotal team.

Original title and link: Pivotal People (NoSQL database©myNoSQL)


Proprietary Hadoop Is a Losing Strategy

Matt Asay (10gen) for ReadWrite adding to the long discussion around EMC’s Pivotal HD announcement:

EMC has seemingly bottomless resources to throw at Hadoop, and every incentive to do so. It’s a smart, highly successful company and no doubt will prove successful with Pivotal HD. However, I can’t see it ever dominating an open-source infrastructure market with a proprietary distribution. Open source is the foundation for today’s most interesting markets, from Big Data to mobile to cloud computing. It’s unlikely that EMC will somehow stem this tide with a proprietary product, no matter its short-term performance or functionality advantages.

While I’ve linked to different perspectives about this topic, I’m not sure anyone outside our bubble actually came to a conclusion.

What I know, though, is that EMC is benefiting from this. A lot. Three weeks ago, I wasn’t reading anything about EMC and Hadoop. Today all major websites have at least a couple of articles about it.

Original title and link: Proprietary Hadoop Is a Losing Strategy (NoSQL database©myNoSQL)

via: http://readwrite.com/2013/03/12/proprietary-hadoop-is-a-losing-strategy


What It Means to Be “all In” on Hadoop

Another post about Pivotal HD and the accompanying statements, this time from Matthew Aslett:

Pivotal HD is not Hadoop
Neither is Cloudera’s Distribution, including Apache Hadoop.
Nor the Hortonworks Data Platform.
Nor the MapR Distribution.
Nor IBM’s InfoSphere BigInsights.
Nor the WANdisco Distro.
Nor Intel’s Distribution for Apache Hadoop.

Original title and link: What It Means to Be “all In” on Hadoop (NoSQL database©myNoSQL)

via: http://blogs.the451group.com/information_management/2013/03/11/all-in-on-hadoop/


How Many Hadoops?

The short answer is there is only one Apache Hadoop distribution.

The long answer is that there are many distributions that include Apache Hadoop or are claiming compatibility with Apache Hadoop.

The oldest and probably most popular: Cloudera’s Distribution of Hadoop (CDH).

The 100% open source: Hortonworks Data Platform.

The proprietary: MapR.

The blue one: IBM InfoSphere BigInsights.

The latest: WANdisco Hadoop WDD, Intel Distribution of Hadoop and Pivotal HD from EMC Greenplum.

There’s also the version Facebook’s running on their cluster which includes Facebook Corona: a different approach to job scheduling and resource management.

But this list is not complete as it doesn’t include appliances featuring Hadoop. In this category we have:

  1. Oracle’s Big Data appliance featuring Cloudera’s Distribution of Hadoop
  2. Netapp’s Hadooplers
  3. EMC Greenplum DCA
  4. Teradata Aster Discovery Platform featuring Hortonworks’s Hadoop Data Platform
  5. Data Direct Networks (DDN)

I hope I didn’t miss any important ones [1]. As a conclusion for this list, my question is: who is actually benefiting from all these distributions?


  1. I left aside for now Hadoop-as-a-Service.  

Original title and link: How Many Hadoops? (NoSQL database©myNoSQL)


EMC Contributes 1000+ Nodes Cluster for Apache Hadoop Testing

The Greenplum Analytics Workbench incorporates technology from the world’s leading software and hardware manufacturers with the intention of providing the infrastructure needed to facilitate Apache Hadoop innovation. The test bed cluster, which consists of 1,000+ hardware nodes or 10,000 nodes with the addition of virtual machines, features 24 petabytes of physical storage. This is the equivalent of nearly half of the entire written works of mankind, from the beginning of recorded history.

Thanks!

Original title and link: EMC Contributes 1000+ Nodes Cluster for Apache Hadoop Testing (NoSQL database©myNoSQL)

via: http://www.greenplum.com/news/greenplum-analytics-workbench


12 Hadoop Vendors to Watch in 2012

My list of the 8 most interesting companies for the future of Hadoop didn’t try to include everyone with a product that has the word Hadoop in it. But the list from InformationWeek does. To save you 15 clicks, here’s their list:

  • Amazon Elastic MapReduce
  • Cloudera
  • Datameer
  • EMC (with EMC Greenplum Unified Analytics Platform and EMC Data Computing Appliance)
  • Hadapt
  • Hortonworks
  • IBM (InfoSphere BigInsights)
  • Informatica (for HParser)
  • Karmasphere
  • MapR
  • Microsoft
  • Oracle

Original title and link: 12 Hadoop Vendors to Watch in 2012 (NoSQL database©myNoSQL)


Comparing Hadoop Appliances: Oracle’s Big Data Appliance, EMC Greenplum DCA, Netapp Hadooplers

Great post from Gwen Shapira over at Pythian diving into the pros and cons of Hadoop appliances vs. building your own Hadoop clusters. Plus a comparison of existing Hadoop appliances: Oracle Big Data Appliance, EMC Greenplum DCA, and Netapp Hadooplers.

Another good reason to roll your own is the flexibility: Appliances are called that way because they have a very specific configuration. You get a certain number of nodes, cpus, RAM and storage. Oracle’s offering is an 18 node rack. What if you want 12 nodes? or 23? tough luck. What if you want less RAM and more CPU? you are still stuck.

Original title and link: Comparing Hadoop Appliances: Oracle’s Big Data Appliance, EMC Greenplum DCA, Netapp Hadooplers (NoSQL database©myNoSQL)

via: http://www.pythian.com/news/29955/comparing-hadoop-appliances/


Partnerships in the Hadoop Market

Just a quick recap:

Amazon doesn’t partner with anyone for its Amazon Elastic MapReduce. And IBM is walking alone with the software-only InfoSphere BigInsights.

Original title and link: Partnerships in the Hadoop Market (NoSQL database©myNoSQL)


EMC Greenplum Database and Hadoop Distribution Puts a Social Spin on Big Data

Huge technological contribution to the Hadoop ecosystem:

Greenplum, the analytics division of EMC, has announced new software that lets data analysts explore all their organization’s data and share interesting findings and data sets Facebook-style among their colleagues.

Original title and link: EMC Greenplum Database and Hadoop Distribution Puts a Social Spin on Big Data (NoSQL database©myNoSQL)

via: http://gigaom.com/cloud/emc-greenplum-puts-a-social-spin-on-big-data/


Explaining Hadoop to Your CEO

Dan Woods (Forbes):

The answer is, yes, Hadoop could be helpful, but there are other technologies as well. For example, technologies such as Splunk allow you to explore big data sets in a way that’s more interactive than most Hadoop implementations. Splunk not only lets you play with big data; you can also distill it and visualize it. Pervasive’s DataRush allows you to write parallel programs using a simplified programming model, and then process lots of data at scale. 1010data allows you to look at a spreadsheet that has a trillion rows, as well as handle time series data. EMC Greenplum and Teradata Aster Data and SAP HANA will also want a crack at your business. If you take any of these technologies and combine them with QlikView, Tableau, or TIBCO Spotfire, you can figure out what a big data set means to your business very quickly. So if your job is understanding the business value of the data, Hadoop is one of many things that you should analyze.

Translation:

Blah blah blah Big Data, blah blah blah list of vendors, blah blah blah Big Data

It might even work for a dummy CEO.

Original title and link: Explaining Hadoop to Your CEO (NoSQL database©myNoSQL)

via: http://www.forbes.com/sites/danwoods/2011/11/03/explaining-hadoop-to-your-ceo/


Hadoop: It's Still a Niche Technology

In an otherwise generic but interesting post about Hadoop and its integration with data analytics and data warehouse solutions, Jessica Twentyman writes:

It’s still a niche technology, but Hadoop’s profile received a serious boost over that past year, thanks in part to start-up companies such as Cloudera and MapR that offer commercially licensed and supported distributions of Hadoop. Its growing popularity is also the result of serious interest shown by EDW vendors like EMC, IBM and Teradata. EMC bought Hadoop specialist Greenplum in June 2010; Teradata announced its acquisition of Aster Data in March 2011; and IBM announced its own Hadoop offering, Infosphere, in May 2011.

Unfortunately, she got this all wrong. It is the open source community, developers, data scientists, and Cloudera that helped popularize Hadoop.

These data analytics and data warehouse vendors are just capitalizing on Hadoop delivering results. They haven’t been knocking at doors asking: “Have you heard of Hadoop? Do you want to try it?”. They’ve run into Hadoop in most of the places they went and that made them realize it is a business opportunity.

So, I’ll say it again: Hadoop is popular thanks to the open source community, developers, data scientists and Cloudera.

Original title and link: Hadoop: It’s Still a Niche Technology (NoSQL database©myNoSQL)

via: http://searchdatamanagement.techtarget.co.uk/feature/Hadoop-for-big-data-puts-architects-on-journey-of-discovery


BI Pentaho Integrates Hadoop, NoSQL Databases, and Analytic Databases

Dr. Dobb’s:

  • The ability to orchestrate execution of Hadoop related tasks (i.e., executing a Hive Query, Pig Script, or M/R job) as part of a broader IT workflow.
  • The ability to setup dependencies, so if a step fails the job can branch down a recovery path or send a notification, or if it’s a success it goes on to subsequent dependent tasks. Likewise it supports initiating several tasks in parallel.
  • New integration for Pig — so that developers have the ability to execute a Pig job from a PDI Job flow, integrate the execution of Pig jobs in broader IT workflows through PDI Jobs, take advantage of our out of the box scheduler, and so on.
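As a rough illustration of the dependency handling described in the quote above (branch to a recovery or notification step on failure, continue to dependent tasks on success, run several tasks in parallel), here is a minimal Python sketch. It is not Pentaho’s API: `Step`, `run_flow`, `build_demo_flow`, and the step names are hypothetical, and plain Python callables stand in for the Hive, Pig, or M/R tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    """A hypothetical workflow step; not a Pentaho (PDI) class."""
    name: str
    action: Callable[[], None]
    on_success: List["Step"] = field(default_factory=list)
    on_failure: Optional["Step"] = None


def run_flow(step: Step, log: List[str]) -> None:
    """Run a step, then follow its success or failure branch."""
    try:
        step.action()
    except Exception:
        log.append(f"{step.name}: failed")
        # Failure path: branch to the recovery/notification step, if any.
        if step.on_failure is not None:
            run_flow(step.on_failure, log)
        return
    log.append(f"{step.name}: ok")
    # Success path: the dependent steps do not depend on each other,
    # so they are started in parallel.
    if step.on_success:
        with ThreadPoolExecutor() as pool:
            list(pool.map(lambda s: run_flow(s, log), step.on_success))


def failing_pig_script() -> None:
    # Stand-in for a Pig job that fails.
    raise RuntimeError("simulated Pig failure")


def build_demo_flow() -> Step:
    notify = Step("notify", lambda: None)
    report = Step("report", lambda: None)
    pig = Step("pig_script", failing_pig_script, on_failure=notify)
    return Step("hive_query", lambda: None, on_success=[pig, report])


if __name__ == "__main__":
    events: List[str] = []
    run_flow(build_demo_flow(), events)
    print(sorted(events))
```

In this sketch the order of log entries after the first step is not deterministic, because the dependent steps run concurrently; a real scheduler would also need retries, timeouts, and persistent state, which are left out here.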

The list of tools Pentaho 4 integrates with is quite long:

  • a long list of traditional RDBMS
  • analytics databases (Greenplum, Vertica, Netezza, Teradata, etc.)
  • NoSQL databases (MongoDB, HBase, etc.)
  • Hadoop variants
  • LexisNexis HPCC

This is the world of polyglot persistence and hybrid data storage.

Original title and link: BI Pentaho Integrates Hadoop, NoSQL Databases, and Analytic Databases (NoSQL database©myNoSQL)