Greenplum: All content tagged as Greenplum in NoSQL databases and polyglot persistence

When should I use Greenplum Database versus HAWQ?

Jon Roberts writes about the use cases for Greenplum and HAWQ, both technologies offered by Pivotal:

Greenplum is a robust MPP database that works very well for Data Marts and Enterprise Data Warehouses that tackle historical Business Intelligence reporting as well as predictive analytical use cases. HAWQ provides the most robust SQL interface for Hadoop and can tackle data exploration and transformation in HDFS.

The first questions that popped into my mind:

  1. why isn’t HAWQ good for reporting?
  2. why isn’t HAWQ good for predictive analytics?

I don’t have a good answer for either of these. For the first, I assume the implied answer is Hadoop’s latency. On the other hand, what I know is that Microsoft and Hortonworks are trying to bring Hadoop data into Excel with HDInsight. That is not traditional reporting, but if the latency is acceptable there, I’m not sure why it wouldn’t be acceptable for reporting too.

For the second question, Hadoop and the tools built around it are well known for predictive analytics, so maybe the separation is specific to HAWQ. Another explanation could be product positioning.

This last part seems to be confirmed by the rest of the post, which makes the point that data stored in HDFS is temporary: once it is processed with HAWQ, it is moved into Greenplum.

[Image: Greenplum and HAWQ]

In other words, HAWQ is just for ETL/ELT on Hadoop.
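To make the ETL/ELT reading concrete, here is a minimal sketch of what such a pipeline could look like. Everything in it is hypothetical: the host names, table names, and HDFS path are invented, and the external-table syntax is only illustrative since PXF profiles and ports vary across HAWQ versions. What is safe to assume is that HAWQ’s master speaks the PostgreSQL wire protocol, so a standard driver like psycopg2 can talk to it.

```python
import psycopg2  # HAWQ's master speaks the PostgreSQL wire protocol

# Hypothetical connection details.
conn = psycopg2.connect(host="hawq-master.example.com", port=5432,
                        dbname="analytics", user="gpadmin")
cur = conn.cursor()

# 1. Expose raw HDFS files to SQL through an external table
#    (illustrative PXF syntax; profiles and ports vary by HAWQ version).
cur.execute("""
    CREATE EXTERNAL TABLE clicks_raw (user_id int, url text, ts timestamp)
    LOCATION ('pxf://namenode:51200/data/clicks?profile=HdfsTextSimple')
    FORMAT 'TEXT' (DELIMITER ',');
""")

# 2. Transform inside HAWQ: the "T" of ELT happens in SQL, in parallel.
cur.execute("""
    CREATE TABLE daily_clicks AS
    SELECT user_id, date_trunc('day', ts) AS day, count(*) AS clicks
    FROM clicks_raw
    GROUP BY 1, 2;
""")
conn.commit()
cur.close()
conn.close()

# 3. Per the post's workflow, the aggregated result would then be moved
#    into Greenplum (e.g. via a writable external table and gpfdist) for
#    long-lived BI reporting.
```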

✚ I’m pretty sure that many traditional data warehouse companies forced to come up with coherent architecture proposals based on their core products and Hadoop are facing the same product positioning problem: it’s difficult to admit in front of customers that Hadoop might be capable of replacing core functionality of the products you are selling.

What is the best answer to this positioning dilemma?

  1. Find a spot for Hadoop that doesn’t hurt your core products. Let’s say ETL.
  2. Propose an architecture where your core products and Hadoop fully complement and interact with each other.

You already know my answer.

Original title and link: When should I use Greenplum Database versus HAWQ? (NoSQL database©myNoSQL)

via: http://www.pivotalguru.com/?p=642


Aster Data, HAWQ, GPDB and the First Hadoop Squeeze

Rob Klopp:

But there are three products, the Greenplum database (GPDB), HAWQ, and Aster Data, that will be squeezed more quickly as they are positioned either in between the EDW and Hadoop… or directly over Hadoop. In this post I’ll explain what I suspect Pivotal and Teradata are trying to do… why I believe their strategy will not work for long… and why readers of this blog should be careful moving forward.

This is a very interesting analysis of the enterprise data warehouse market. There’s also a nice visualization of this prediction:

[Image: The first squeeze]

Here’s an alternative though. As shown in the picture above, the expansion of in-memory databases depends heavily on the evolution of memory prices. It’s hard to argue against price predictions or Moore’s law, but accidents, even if rare, are still possible. Any significant change in the trend of memory costs, or in other hardware market conditions (e.g. an unpredicted drop in SSD prices), could give Teradata and Pivotal the extra time and conditions to break into advanced hybrid storage solutions: products slightly slower, but also less expensive, than their competitors’ in-memory databases.

Original title and link: Aster Data, HAWQ, GPDB and the First Hadoop Squeeze (NoSQL database©myNoSQL)

via: http://robklopp.wordpress.com/2013/12/11/aster-data-hawq-gpdb-and-the-first-hadoop-squeeze/


What It Means to Be “all In” on Hadoop

Another post about Pivotal HD and the accompanying statements, this time from Matthew Aslett:

Pivotal HD is not Hadoop
Neither is Cloudera’s Distribution, including Apache Hadoop.
Nor the Hortonworks Data Platform.
Nor the MapR Distribution.
Nor IBM’s InfoSphere BigInsights.
Nor the WANdisco Distro.
Nor Intel’s Distribution for Apache Hadoop.

Original title and link: What It Means to Be “all In” on Hadoop (NoSQL database©myNoSQL)

via: http://blogs.the451group.com/information_management/2013/03/11/all-in-on-hadoop/


How Many Hadoops?

The short answer is there is only one Apache Hadoop distribution.

The long answer is that there are many distributions that include Apache Hadoop or are claiming compatibility with Apache Hadoop.

The oldest and probably most popular: Cloudera’s Distribution of Hadoop (CDH).

The 100% open source: Hortonworks Data Platform.

The proprietary: MapR.

The blue one: IBM InfoSphere BigInsights.

The latest: WANdisco Hadoop WDD, Intel Distribution of Hadoop and Pivotal HD from EMC Greenplum.

There’s also the version Facebook is running on its cluster, which includes Facebook Corona, a different approach to job scheduling and resource management.

But this list is not complete as it doesn’t include appliances featuring Hadoop. In this category we have:

  1. Oracle’s Big Data appliance featuring Cloudera’s Distribution of Hadoop
  2. Netapp’s Hadooplers
  3. EMC Greenplum DCA
  4. Teradata Aster Discovery Platform featuring the Hortonworks Data Platform
  5. Data Direct Networks (DDN)

I hope I didn’t miss any important ones¹. As a conclusion to this list, my question is: who is actually benefiting from all these distributions?


  ¹ I left Hadoop-as-a-Service aside for now.

Original title and link: How Many Hadoops? (NoSQL database©myNoSQL)


Big Data Market Analysis: Vendors Revenue and Forecasts

I think this is the first extensive Big Data report I’ve read that includes enough relevant and quite exhaustive data about the majority of players in the Big Data market, plus some captivating forecasts.

As of early 2012, the Big Data market stands at just over $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged CAGR of 58% between now and 2017.

[Image: 2011 Big Data pure-play vendors’ yearly Big Data revenue]
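As a quick back-of-the-envelope check on what a 58% CAGR implies, here is a sketch that compounds the report’s starting figure of roughly $5.1 billion from 2012 forward:

```python
market = 5.1e9  # ~$5.1B Big Data revenue in early 2012, per the report
cagr = 0.58     # compound annual growth rate cited by the report

for year in range(2012, 2018):
    print(f"{year}: ${market / 1e9:5.1f}B")
    market *= 1 + cagr
```

Compounding at 58% turns ~$5 billion into roughly $50 billion by 2017, which gives a sense of why the report calls the growth “super-charged”.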

While there are many stories behind these numbers and many things to think about, here is what I’ve jotted down while studying the report:

  • it’s no surprise that “megavendors” (IBM, HP, etc.) account for the largest part of today’s Big Data market revenue
  • still, the revenue ratio of pure-plays vs. megavendors feels quite unbalanced: $311mil out of $5.1bil, or roughly 6% (see the quick computation after this list)
    • the pure-play category includes: Vertica, Aster Data, Splunk, Greenplum, 1010data, Cloudera, Think Big Analytics, MapR, Digital Reasoning, Datameer, Hortonworks, DataStax, HPCC Systems, Karmasphere
    • there are a couple of names that position themselves in the Big Data market that do not show up anywhere (e.g. 10gen, Couchbase)
  • this could lead to the conclusion that the companies that include hardware in their offer benefit from larger revenues
    • I’m wondering, though, what the margins are in the hardware segment. While I don’t have any data at hand, I think I’ve read reports about HP and Dell not doing so well due precisely to lower margins
    • see the bullet point further down about revenue by hardware, software, and services
  • this could explain why so many companies are trying their hand at appliances
  • looking at the various numbers, you can see that those selling appliances usually have a large corporation behind them supporting the production costs for hardware and probably the cost of the sales force
  • in the Big Data revenue by vendor you can find quite a few well-known names from the consulting segment
  • the revenue-by-type pie lists services as accounting for 44%, hardware for 31%, and software for 13%, which might give an idea of what makes up the megavendors’ sales packages
    • most of the NoSQL database and Hadoop companies sit in the software and services segments
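And the quick computation promised above, using only the figures from the bullets (the remaining ~12% of the pie is not broken out in the post):

```python
total = 5.1e9      # overall 2012 Big Data revenue, per the report
pure_play = 311e6  # combined revenue of the pure-play vendors

print(f"pure-play share: {pure_play / total:.1%}")  # -> 6.1%

# Revenue by type, per the report's pie chart.
for segment, share in [("services", 0.44), ("hardware", 0.31), ("software", 0.13)]:
    print(f"{segment}: ${total * share / 1e9:.2f}B")
```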

Great job done by the Wikibon team.

Original title and link: Big Data Market Analysis: Vendors Revenue and Forecasts (NoSQL database©myNoSQL)

via: http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues


12 Hadoop Vendors to Watch in 2012

My list of the 8 most interesting companies for the future of Hadoop didn’t try to include everyone with a product carrying the Hadoop name. The InformationWeek list does. To save you 15 clicks, here it is:

  • Amazon Elastic MapReduce
  • Cloudera
  • Datameer
  • EMC (with EMC Greenplum Unified Analytics Platform and EMC Data Computing Appliance)
  • Hadapt
  • Hortonworks
  • IBM (InfoSphere BigInsights)
  • Informatica (for HParser)
  • Karmasphere
  • MapR
  • Microsoft
  • Oracle

Original title and link: 12 Hadoop Vendors to Watch in 2012 (NoSQL database©myNoSQL)


Comparing Hadoop Appliances: Oracle’s Big Data Appliance, EMC Greenplum DCA, Netapp Hadooplers

Great post from Gwen Shapira over at Pythian diving into the pros and cons of Hadoop appliances vs. building your own Hadoop cluster. Plus a comparison of the existing Hadoop appliances: Oracle Big Data Appliance, EMC Greenplum DCA, and Netapp Hadooplers.

Another good reason to roll your own is the flexibility: Appliances are called that way because they have a very specific configuration. You get a certain number of nodes, cpus, RAM and storage. Oracle’s offering is an 18 node rack. What if you want 12 nodes? or 23? tough luck. What if you want less RAM and more CPU? you are still stuck.

Original title and link: Comparing Hadoop Appliances: Oracle’s Big Data Appliance, EMC Greenplum DCA, Netapp Hadooplers (NoSQL database©myNoSQL)

via: http://www.pythian.com/news/29955/comparing-hadoop-appliances/


Partnerships in the Hadoop Market

Just a quick recap:

Amazon doesn’t partner with anyone for Amazon Elastic MapReduce. And IBM is walking alone with the software-only InfoSphere BigInsights.

Original title and link: Partnerships in the Hadoop Market (NoSQL database©myNoSQL)


EMC Greenplum Database and Hadoop Distribution Puts a Social Spin on Big Data

Huge technological contribution to the Hadoop ecosystem:

Greenplum, the analytics division of EMC, has announced new software that lets data analysts explore all their organization’s data and share interesting findings and data sets Facebook-style among their colleagues.

Original title and link: EMC Greenplum Database and Hadoop Distribution Puts a Social Spin on Big Data (NoSQL database©myNoSQL)

via: http://gigaom.com/cloud/emc-greenplum-puts-a-social-spin-on-big-data/


Explaining Hadoop to Your CEO

Dan Woods (Forbes):

The answer is, yes, Hadoop could be helpful, but there are other technologies as well. For example, technologies such as Splunk allow you to explore big data sets in a way that’s more interactive than most Hadoop implementations. Splunk not only lets you play with big data; you can also distill it and visualize it. Pervasive’s DataRush allows you to write parallel programs using a simplified programming model, and then process lots of data at scale. 1010data allows you to look at a spreadsheet that has a trillion rows, as well as handle time series data. EMC Greenplum and Teradata Aster Data and SAP HANA will also want a crack at your business. If you take any of these technologies and combine them with QlikView, Tableau, or TIBCO Spotfire, you can figure out what a big data set means to your business very quickly. So if your job is understanding the business value of the data, Hadoop is one of many things that you should analyze.

Translation:

Blah blah blah Big Data, blah blah blah list of vendors, blah blah blah Big Data

It might even work for a dummy CEO.

Original title and link: Explaining Hadoop to Your CEO (NoSQL database©myNoSQL)

via: http://www.forbes.com/sites/danwoods/2011/11/03/explaining-hadoop-to-your-ceo/


Hadoop: It's Still a Niche Technology

In an otherwise generic but interesting post about Hadoop and its integration with data analytics and data warehouse solutions, Jessica Twentyman writes:

It’s still a niche technology, but Hadoop’s profile received a serious boost over the past year, thanks in part to start-up companies such as Cloudera and MapR that offer commercially licensed and supported distributions of Hadoop. Its growing popularity is also the result of serious interest shown by EDW vendors like EMC, IBM and Teradata. EMC bought Hadoop specialist Greenplum in June 2010; Teradata announced its acquisition of Aster Data in March 2011; and IBM announced its own Hadoop offering, Infosphere, in May 2011.

Unfortunately she got this all wrong. It is the open source community, developers, data scientists, and Cloudera that help popularize Hadoop.

These data analytics and data warehouse vendors are just capitalizing on Hadoop delivering results. They haven’t been knocking on doors asking: “Have you heard of Hadoop? Do you want to try it?”. They’ve run into Hadoop in most of the places they went, and that made them realize it was a business opportunity.

So, I’ll say it again: Hadoop is popular thanks to the open source community, developers, data scientists and Cloudera.

Original title and link: Hadoop: It’s Still a Niche Technology (NoSQL database©myNoSQL)

via: http://searchdatamanagement.techtarget.co.uk/feature/Hadoop-for-big-data-puts-architects-on-journey-of-discovery


R: the Leading Statistics Language and Key Weapon in Advanced Analytics Today

David Smith (Revolution Analytics):

Of course, this isn’t the first time that R has been embedded into a data warehousing appliance. IBM Netezza’s iClass device integrates with Revolution R, and AsterData, the Teradata Data Warehouse Appliance, and Greenplum all provide connections to R as well. Here at Revolution Analytics, we think that such enterprise-level integrations with R serve to grow the R ecosystem and serve as validation of R as a key platform for advanced analytics. As CEO Norman Nie said to GigaOm this weekend, 

“Oracle’s announcement to embed R demonstrates validation for the leading statistics language and offers further evidence that R is a key weapon in advanced analytics today”

And let’s not leave aside the strategic partnership between Revolution Analytics and Cloudera to include RevoConnectR in CDH.

Original title and link: R: the Leading Statistics Language and Key Weapon in Advanced Analytics Today (NoSQL database©myNoSQL)

via: http://www.r-bloggers.com/oracles-big-data-appliance-to-include-r/