Cloudera: All content tagged as Cloudera in NoSQL databases and polyglot persistence

What Are the Pros and Cons of Running Cloudera’s Distribution for Hadoop vs Amazon Elastic MapReduce Service?

Old Quora question, but still very relevant. Top response from Jeff Hammerbacher:

Elastic MapReduce Pros:

  • Dynamic MapReduce cluster sizing.
  • Ease of use for simple jobs via their proprietary web console.
  • Great documentation.
  • Integrates nicely with other Amazon Web Services.
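
To make the ease-of-use and dynamic cluster sizing points above a bit more concrete, here is a minimal sketch of driving EMR programmatically with the boto library (the bucket and script names are made up; EMR's web console does the same without any code):

    # Sketch: launch a small EMR streaming job flow with boto, then grow it.
    from boto.emr.connection import EmrConnection
    from boto.emr.step import StreamingStep
    from boto.emr.instance_group import InstanceGroup

    conn = EmrConnection()  # AWS credentials picked up from the environment

    step = StreamingStep(name='word count',
                         mapper='s3n://example-bucket/wordcount-mapper.py',
                         reducer='aggregate',  # Hadoop streaming's built-in aggregate reducer
                         input='s3n://example-bucket/input/',
                         output='s3n://example-bucket/output/')

    jobflow_id = conn.run_jobflow(name='emr-sketch',
                                  log_uri='s3://example-bucket/logs/',
                                  steps=[step],
                                  num_instances=3,
                                  master_instance_type='m1.small',
                                  slave_instance_type='m1.small')

    # "Dynamic MapReduce cluster sizing": add task nodes to the running job flow.
    conn.add_instance_groups(jobflow_id,
                             [InstanceGroup(5, 'TASK', 'm1.small', 'ON_DEMAND', 'extra-tasks')])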

Cloudera Distribution for Hadoop (CDH) Pros:

  • CDH is open source; you have access to the source code and can inspect it for debugging purposes and make modifications as required.
  • CDH can be run on a number of public or private clouds using an open source framework, Whirr, so you’re not tied to a single cloud provider.
  • With CDH, you can move your cluster to dedicated hardware with little disruption when the economics make sense. Most non-trivial applications will benefit from this move.
  • CDH packages a number of open source projects that are not included with EMR: Sqoop, Flume, HBase, Oozie, ZooKeeper, Avro, and Hue. You have access to the complete platform composed of data collection, storage, and processing tools.
  • CDH packages a number of critical bug fixes and features and the most recent stable releases, so you’re usually using a more stable and feature-rich product.
  • You can purchase support and management tools for CDH via Cloudera Enterprise.
  • CDH uses the open source Oozie framework for workflow management. EMR implemented a proprietary “job flow” system before major Hadoop users standardized on Oozie for workload management.
  • CDH uses the open source Hue framework for its user interface. If you require new features from your web interface, you can easily implement them using the Hue SDK.
  • CDH includes a number of integrations with other software components of the data management stack, including Talend, Informatica, Netezza, Teradata, Greenplum, Microstrategy, and others. […]
  • CDH has been designed and deployed in common Linux environments and you can use standard tools to debug your programs. […]

Make sure you also read Hadoop in the Cloud: Pros and Cons, which addresses (almost) the same question.

A Twitter-style answer to this question would be: “Control and customization vs Automated and Managed Service”. 80 characters left to add your own perspective.

Original title and link: What Are the Pros and Cons of Running Cloudera’s Distribution for Hadoop vs Amazon Elastic MapReduce Service? (NoSQL database©myNoSQL)


Big Data Market Analysis: Vendors Revenue and Forecasts

I think this is the first extensive Big Data report I’ve read that includes relevant and fairly exhaustive data about the majority of players in the Big Data market, plus some captivating forecasts.

As of early 2012, the Big Data market stands at just over $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged CAGR of 58% between now and 2017.
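
A quick back-of-the-envelope check on what that forecast implies (a sketch only, assuming the just-over-$5 billion 2012 base and the full five years to 2017):

    # Sketch: what a 58% CAGR on a ~$5.1B base implies for 2017
    base_2012 = 5.1                  # $ billions, "just over $5 billion"
    cagr = 0.58
    projected_2017 = base_2012 * (1 + cagr) ** 5
    print(round(projected_2017, 1))  # ~50.2, i.e. a market on the order of $50 billion by 2017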

(Chart: 2011 Big Data pure-play vendors’ yearly Big Data revenue)

While there are many stories behind these numbers and many things to think about, here is what I’ve jotted down while studying the report:

  • it’s no surprise that “megavendors” (IBM, HP, etc.) account for the largest part of today’s Big Data market revenue
  • still, the revenue ratio of pure-play vendors vs megavendors feels quite unbalanced: $311 million out of $5.1 billion
    • the pure-play category includes: Vertica, Aster Data, Splunk, Greenplum, 1010data, Cloudera, Think Big Analytics, MapR, Digital Reasoning, Datameer, Hortonworks, DataStax, HPCC Systems, Karmasphere
    • there are a couple of names that position themselves in the Big Data market that do not show up anywhere in the report (e.g. 10gen, Couchbase)
  • this could lead to the conclusion that companies that include hardware in their offering benefit from larger revenues
    • I’m wondering, though, what the margins are in the hardware segment; while I don’t have any data at hand, I think I’ve read reports about HP and Dell not doing so well precisely because of lower margins
    • see bullet point further down about revenue by hardware, software, and services
  • this could explain why so many companies are trying their hand at appliances
  • by looking at the various numbers you can see that those selling appliances usually have a large corporation behind them supporting the hardware production costs and, probably, the cost of the sales force
  • in the Big Data revenue by vendor you can find quite a few well-known names from the consulting segment
  • the revenue-by-type pie lists services as accounting for 44%, hardware for 31%, and software for 13%, which might give an idea of what makes up the megavendors’ sales packages
    • most of the NoSQL database and Hadoop companies sit in the software and services segments

Great job done by the Wikibon team.

Original title and link: Big Data Market Analysis: Vendors Revenue and Forecasts (NoSQL database©myNoSQL)

via: http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues


Big Data Investment Network Map

Very interesting visualization, by Benedikt Koehler and Joerg Blumtritt on the Beautiful Data blog, of some of the companies in the Big Data market connected through their venture capital and investment firms:

(Image: Big Data investment network map)

There’s only one company I couldn’t find on this map: Hortonworks.

Original title and link: Big Data Investment Network Map (NoSQL database©myNoSQL)


Teradata and Hortonworks Partnership and What It Means

Context

Teradata sells software, hardware, and services for data warehouses and analytic applications. Part of the Teradata portfolio is also the Teradata Aster MapReduce Platform, a massively parallel processing infrastructure with a software solution that embeds both SQL and MapReduce analytic processing for deeper analytic insights on multi-structured data and new analytic capabilities driven by data science.

Hortonworks offers services around the 100% Apache-licensed, open source Hortonworks Data Platform, an integrated solution built around Hadoop.

(Diagram: Hortonworks Data Platform)

Announcement

The interesting bits from the announcement and media coverage:

News release:

Teradata and Hortonworks will join forces to provide technologies and strategic guidance to help businesses build integrated, transparent, enterprise-class big data analytic solutions that leverage Apache Hadoop. The partnership will focus on enabling businesses to use Apache Hadoop to harness the value from new sources of data. Businesses will be able to quickly load and refine multi-structured data, some of which is being discarded today, for discovery and analytics. The resulting insights will enable analysts and front line users to make the best business decision possible.

(Diagram: Teradata, Hortonworks Hadoop, and Aster architecture)

For example, each day websites generate many terabytes of raw, complex data about customers’ viewing and buying habits. These web logs can be directly loaded into Teradata Aster or Apache Hadoop where they can be stored, transformed, and refined in preparation for analysis by the Teradata Aster MapReduce platform (nb: my emphasis).

Derrick Harris:

The company [Teradata] has already worked with Hortonworks’ competitor Cloudera on a connector between the Teradata Database and Cloudera’s Hadoop distribution, but the Hortonworks deal appears a little deeper and more strategic.

Quentin Hardy:

The alliance between Teradata and Hortonworks means that companies can get strategic advice about how to get into the new analytics game from Teradata, and have practical help on running the systems from Hortonworks.

Arun Murthy:

However, there are two important challenges that need to be addressed before broad enterprise adoption can occur:

  • Understanding the right use cases in which to utilize Apache Hadoop.
  • Integrating Apache Hadoop with existing data architectures in an appropriate manner to get better value from existing investments.

My sense of excitement about the Teradata/Hortonworks partnership is amplified by the fact that it addresses these two core challenges for Apache Hadoop:

  • We will be rolling out a reference architecture that provides guidance to enterprises that want to understand the best use cases for which to apply Hadoop. As part of that, we will be helping Teradata customers use Hadoop in conjunction with their Teradata and Teradata Aster analytic data solutions investments.
  • We will also be working closely with the Teradata engineering teams on jointly engineered solutions that optimize the integration points with Apache Hadoop.

Commentary

  • From Hortonworks’ perspective, this deal is weaker than the Oracle-Cloudera deal.

    In the Teradata-Hortonworks case, new Teradata sales do not necessarily result in new Hortonworks Data Platform installations, while in the Oracle-Cloudera partnership every appliance sale results in new business for Cloudera.

  • From Teradata’s perspective, this partnership gives them a perfect answer and solution for clients asking about unstructured data scenarios.

  • The announcement slightly positions Hadoop as part of the ETL process, but it is not as strict about this as other Hadoop integration architectures (see Netezza and Hadoop, and Vertica and Hadoop).

  • Depending on the level of integration the two teams will pull together, this partnership might result in one of the most complete and powerful structured and unstructured data warehousing and analytics platforms.

I’m looking forward to seeing the proposed architecture blueprint once it’s finalized.


Original title and link: Teradata and Hortonworks Partnership and What It Means (NoSQL database©myNoSQL)


12 Hadoop Vendors to Watch in 2012

My list of the 8 most interesting companies for the future of Hadoop didn’t try to include everyone with a product bearing the Hadoop name. The list from InformationWeek does. To save you 15 clicks, here’s their list:

  • Amazon Elastic MapReduce
  • Cloudera
  • Datameer
  • EMC (with EMC Greenplum Unified Analytics Platform and EMC Data Computing Appliance)
  • Hadapt
  • Hortonworks
  • IBM (InfoSphere BigInsights)
  • Informatica (for HParser)
  • Karmasphere
  • MapR
  • Microsoft
  • Oracle

Original title and link: 12 Hadoop Vendors to Watch in 2012 (NoSQL database©myNoSQL)


Apache Hadoop 1.0 Doesn’t Clear Up Trunks and Branches Questions. Do Distributions?

It looks like the three pictures about Hadoop versions (the first two by Cloudera and the third by Konstantin I. Boudnik, aka Cos) are actually worth 1066 Gartner words.

On the other hand, to address the question in the title—would custom distributions clarify Hadoop versions—I think that while custom distributions might be helpful for experimenting or getting started with Hadoop, long term they’ll actually lead to more segmentation in the market and bigger maintenance and upgrade costs for end users.

There are just a few companies with a track record of maintaining and distributing open source projects; in the Hadoop space these are Cloudera and Hortonworks (nb: Hortonworks is supporting the Apache Hadoop distribution). So if a vendor tries to sell you a Hadoop package, ask them about their history managing open source distributions.

Original title and link: Apache Hadoop 1.0 Doesn’t Clear Up Trunks and Branches Questions. Do Distributions? (NoSQL database©myNoSQL)


Partnerships in the Hadoop Market

Just a quick recap:

Amazon doesn’t partner with anyone for their Amazon Elastic MapReduce. And IBM is walking alone with the software-only InfoSphere BigInsights.

Original title and link: Partnerships in the Hadoop Market (NoSQL database©myNoSQL)


Oracle Big Data Appliance Released Features Cloudera Distribution of Hadoop: What You Need to Know

Oracle Big Data Appliance hardware specification

Klint Finley for ServicesANGLE:

18 Oracle Sun servers with a total of:

  • 864 GB main memory;
  • 216 CPU cores;
  • 648 TB of raw disk storage;
  • 40 Gb/s InfiniBand connectivity between nodes and other Oracle engineered systems; and,
  • 10 Gb/s Ethernet data center connectivity.
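
Divided across the 18 servers, those totals work out to a per-node configuration along these lines (simple arithmetic on the figures above):

    # Per-node breakdown of the rack totals quoted above
    nodes = 18
    print(864 // nodes, 'GB RAM per node')       # 48
    print(216 // nodes, 'CPU cores per node')    # 12
    print(648 // nodes, 'TB raw disk per node')  # 36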

Joab Jackson for PCWorld Business Center:

The package includes 40Gb/s InfiniBand connectivity among the nodes, a rarity among Hadoop deployments, many of which use Ethernet to connect the nodes. Lumpkin said InfiniBand would speed data transfers within the system. Multiple racks can be tethered together in a cluster configuration. There is no theoretical limit to how many racks can be clustered together, though configurations of more than eight racks would require additional switches, Lumpkin said.

Oracle Big Data Appliance software specification

  • Cloudera’s Distribution including Apache Hadoop
  • Cloudera Manager
  • Open source distribution of R
  • Oracle NoSQL Database Community Edition
  • Oracle Big Data Connectors
  • Oracle Linux

Joab Jackson for PCWorld Business Center:

Along with the release, Oracle also released Oracle Big Data Connectors, a set of drivers for exchanging data between the Big Data Appliance and other Oracle products, such as the Oracle Database 11g, the Oracle Exadata Database Machine, Oracle Exalogic Elastic Cloud and Oracle Exalytics In-Memory Machine.

Derrick Harris for GigaOm:

However, Oracle isn’t blind to the fact that not everyone will be gung ho about buying an appliance. Its custom-built Big Data Connectors are available as separate products for those customers wanting to connect existing Hadoop clusters to Oracle database environments or R statistical-analysis environments.

Klint Finley for ServicesANGLE:

According to Oracle’s announcement “The integrated Oracle and Cloudera architecture has been fully tested and validated by Oracle, who will also collaborate with Cloudera to provide support for Oracle Big Data Appliance.”

Oracle Big Data Appliance Services

George Lumpkin, Oracle’s vice president of data warehousing product management:

Oracle will provide first-line support for the appliance and all software (including the Hadoop distribution and Cloudera Manager) through its case-tracking support infrastructure. But when particularly tough support cases arise, Oracle will tap Cloudera’s expertise.

What’s more, Oracle will refer customers to Cloudera for Hadoop training and consulting engagements.

Oracle Big Data Appliance Positioning

George Lumpkin, Oracle’s vice president of data warehousing product management:

We are positioning this as something that runs alongside other Oracle-based systems. Big data is more than just a cluster of hardware running Hadoop. It is an overall information architecture for enabling companies to analyze data and make decisions.

Doug Henschen for InformationWeek:

Oracle highlighted the Big Data Appliance as a complement to a growing family of “engineered systems” that now includes Exadata, Exalogic, and the Exalytics In-Memory Machine.

Merv Adrian (Gartner analyst) cited by InformationWeek:

But what’s more remarkable is the fact that Oracle is finally looking beyond its core database. Oracle’s TimesTen and Essbase databases, which were recently upgraded for use in the Exalytics appliance, and BerkeleyDB, which was Oracle’s development starting point for the new NoSQL database, are examples of that shift.

Oracle is suddenly beginning to act as a data-management portfolio company, not just a company with a big brother and a bunch of starving siblings.

Joab Jackson for PCWorld Business Center:

Oracle is positioning the appliance for managing and analyzing large sets of data that may be too large, or otherwise unsuitable for keeping in databases, such as telemetry data, click-stream data or other log data. “You may not want to keep the data in a database, but you do want to store it and analyze it,” Lumpkin said. The appliance is intended for those organizations that want to undertake Big Data-style analysis but may not have the in-house expertise to assemble large Hadoop or NoSQL-based systems.

Pricing

Kurt Dunn, Cloudera’s chief operating officer, told InformationWeek:

Oracle has put together a very comprehensive product that is priced very well.

Brian Proffitt for ITworld:

The cost of the Big Data Appliance is what will really stand out. At $500,000, this may not seem like a bargain, but in reality it is. Typically, commoditized Hadoop systems run at about $4,000 a node. To get this much data storage capacity and power, you would need about 385 nodes… which puts the price tag at around $1.54 million—three times the price of Oracle’s Cloudera-based offering (which, I should add, excludes things like support costs and power).

Doug Henschen for InformationWeek:

The hardware and software combined will sell for $450,000, with an annual support fee for both hardware and software of 12%. That’s highly competitive, working out to less than $700 per terabyte and being in line with the low costs big data practitioners expect from deployments built on commodity hardware.
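
Both per-cost claims above are easy to sanity-check against the 648 TB of raw disk in the hardware spec and the prices as quoted (the node cost and list prices are, of course, rough figures):

    # Sanity-checking the quoted price comparisons
    raw_tb = 648                   # raw disk capacity of the appliance
    print(450000 / raw_tb)         # ~694, i.e. "less than $700 per terabyte"
    diy_cost = 385 * 4000          # ~385 commodity nodes at ~$4,000 each
    print(diy_cost)                # 1,540,000 -- the "$1.54 million" figure
    print(diy_cost / 500000)       # ~3.1, i.e. roughly "three times the price"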

Oracle-Cloudera Partnership

I wrote earlier my take on what this partnership means to both Oracle and Cloudera.

Doug Henschen for InformationWeek:

But by releasing the product early in the year in partnership with Cloudera, which has more customers and years in the market than any other Hadoop software and services provider, Oracle has made it clear that it is wasting no time and taking no chances with unproven technology.

“Cloudera brings us a couple of very important missing pieces, including its management software and assistance for a deeper second- and third-tier level of support,” said George Lumpkin, Oracle’s vice president of product management, data warehousing.

Speculations about the future of the Oracle-Cloudera partnership

Brian Proffitt for ITworld:

Students of Linux history will well remember that’s exactly what happened when Oracle partnered with Red Hat to introduce commoditized Oracle offerings… and then Larry Ellison and crew decided to roll their own Oracle Enterprise Linux in 2006 when they decided to cut Red Hat out of the stack.

This is strong historical evidence that Oracle will do the same with Cloudera, because frankly the big data market is too big for Oracle not to want to own. Big Data Appliance customers should note this, and be very prepared that future versions may not be tied to Cloudera at all, but rather Oracle’s version of Hadoop.

A few people suggested on Twitter that this partnership is a sign of a possible Oracle acquisition of Cloudera. TechCrunch’s Leena Rao links to an old post by Matt Asay suggesting this acquisition.


Original title and link: Oracle Big Data Appliance Released Features Cloudera Distribution of Hadoop: What You Need to Know (NoSQL database©myNoSQL)


Cloudera Distribution of Hadoop Powers Oracle’s Big Data Appliance

The announcement of the Oracle Big Data Appliance has been out for only a couple of hours and has already hit all the media sites. Before looking at the details of the announcement, let’s try to understand what it means for the parties involved.

What does it mean for Oracle?

  • Oracle enters a very busy Hadoop market associated with the best known company in the Hadoop ecosystem
  • With this partnership, Oracle didn’t have to make a huge investment in software development or services
  • Not having to build its own distribution of Hadoop, Oracle could focus on developing the Oracle Big Data Connectors
  • Oracle will delegate everything Hadoop to Cloudera, thus it won’t have to deal with a very fast-evolving open source project that might see some interesting events
  • Oracle seems to have changed the message about Hadoop being used only for basic ETL.

What does it mean for Cloudera?

  • Cloudera gets access to a pool of customers (many of them possibly very large customers)
  • Cloudera will not need a big sales force to reach these potential customers; even for the ones Cloudera already knew about, Oracle’s sales force will do the job
  • If Oracle mentions Cloudera’s name in every sales pitch, Cloudera will see a huge publicity bump that will sooner or later lead to more customers

Truth is I was expecting yet another distribution of Hadoop. And even if Oracle’s Big Data Appliance doesn’t feature the official Apache Hadoop distribution, I think that by choosing an existing distribution, Oracle did the right thing. For them and for their customers.

Original title and link: Cloudera Distribution of Hadoop Powers Oracle’s Big Data Appliance (NoSQL database©myNoSQL)


8 Most Interesting Companies for Hadoop’s Future

Filtering and augmenting a Q&A on Quora:

  1. Cloudera: Hadoop distribution, Cloudera Enterprise, Services, Training
  2. Hortonworks: Apache Hadoop major contributions, Services, Training
  3. MapR: Hadoop distribution, Services, Training
  4. HPCC Systems: massively parallel processing computing platform
  5. HStreaming: real-time data processing and analytics capabilities on top of Hadoop
  6. DataStax: DataStax Enterprise, Apache Cassandra based platform accepting real-time input from online applications, while offering analytic operations, powered by Hadoop
  7. Zettaset: Enterprise Data Analytics Suite built on Hadoop
  8. Hadapt: analytic platform based on Apache Hadoop and relational DBMS technology

I’ve left aside names like IBM, EMC, Informatica, which are doing a lot of integration work.

Original title and link: 8 Most Interesting Companies for Hadoop’s Future (NoSQL database©myNoSQL)


Hadoop Market Competition: comScore From Cloudera to MapR

Mike Brown (comScore CTO):

We could capitalize the purchase [of MapR] with an annual maintenance charge versus a yearly cost per node. NFS allowed our enterprise systems to easily access the data in the cluster.

Some interesting bits:

  • comScore runs a 1,000+ node self-hosted Hadoop cluster
  • comScore migrated from Cloudera to MapR in 2 days
    • the migration was accomplished by copying and reloading data
    • depending on the size of the stored data, a better approach would be a rolling migration
  • comScore uses MapR’s Direct Access NFS feature, which exposes Hadoop Distributed File System (HDFS) data as NFS files that can then be easily mounted, modified, or overwritten (see the sketch after this list)
  • comScore will continue to use Cloudera for training purposes
    • Question: what is the advantage of paying two providers and maintaining two different clusters?
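
The practical upshot of the Direct Access NFS bullet above is that, once the cluster’s NFS export is mounted on a client machine, cluster data behaves like ordinary files. A minimal sketch, assuming a hypothetical /mapr/cluster mount point and log path:

    # Sketch: plain POSIX file I/O against cluster data through a (hypothetical) NFS mount;
    # no HDFS client API is involved.
    log_path = '/mapr/cluster/weblogs/2012-02-21.log'   # hypothetical path

    with open(log_path) as f:
        records = sum(1 for _ in f)
    print(records, 'records')

    with open(log_path, 'a') as f:   # NFS also allows appending/modifying in place
        f.write('appended by an ordinary enterprise tool\n')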

As previewed by the Cloudera-Hortonworks exchanges, competition in the Hadoop market is becoming fierce. But at least this story involves companies that are actively innovating and improving Hadoop, not those that just want to monetize it.

Original title and link: Hadoop Market Competition: comScore From Cloudera to MapR (NoSQL database©myNoSQL)

via: http://searchdatamanagement.techtarget.com/news/2240112247/ComScore-moves-big-data-analytics-environment-from-Cloudera-to-MapR


Cloudera Enterprise: Cloudera Manager and Cloudera support

Cloudera Enterprise is what Cloudera sells in addition to CDH, their Hadoop distribution:

  • Cloudera Manager and Cloudera support
  • Cloudera Manager: end-to-end management application for Apache Hadoop
    • Deploy: automated installation
    • Discover: service health and monitoring, including events and alerts
    • Diagnose
      • Job analytics
      • Log search
      • Configuration recommendations
    • Act
      • Service and configuration management
      • Security management
    • Optimize
      • Resource and quota management
  • Free and Enterprise editions
  • Free edition: up to 50 nodes
  • Enterprise edition: no available pricing
  • Feature comparison
(Image: Cloudera Manager editions feature comparison)

After the break: a short video about Cloudera Manager and media coverage: