
Hortonworks: All content tagged as Hortonworks in NoSQL databases and polyglot persistence

Notes on the Hadoop and HBase Markets

Curt Monash shares what he heard from his customers:

  • Over half of Cloudera’s customers (nb 100 subscription customers) use HBase
  • Hortonworks thinks a typical enterprise Hadoop cluster has 20-50 nodes, with 50-100 already being on the large side.
  • There are huge amounts of Elastic MapReduce/Hadoop processing in the Amazon cloud. Some estimates say it’s the majority of all Amazon Web Services processing.

Original title and link: Notes on the Hadoop and HBase Markets (NoSQL database©myNoSQL)

via: http://www.dbms2.com/2012/04/24/notes-on-the-hadoop-and-hbase-markets/


Big Data and Hadoop for C-Suites in 3 Minutes

Bring your own (small) popcorn as this is just like a TV ad:

Focus on the voice. Then slowly start repeating in your mind: “Big data. Hadoop. I love big data. I love Hadoop.”



Big Data Market Analysis: Vendors Revenue and Forecasts

This is the first extensive Big Data report I’ve read that includes relevant and fairly exhaustive data about most of the players in the Big Data market, plus some captivating forecasts.

As of early 2012, the Big Data market stands at just over $5 billion based on related software, hardware, and services revenue. Increased interest in and awareness of the power of Big Data and related analytic capabilities to gain competitive advantage and to improve operational efficiencies, coupled with developments in the technologies and services that make Big Data a practical reality, will result in a super-charged CAGR of 58% between now and 2017.
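To get a feel for what the quoted growth rate implies, here is a back-of-the-envelope sketch. It assumes a base of $5.1 billion (the total cited in the notes further down) and five compounding years from 2012 to 2017; the function name and the year count are my own assumptions, not figures taken from the report.

```python
def project_market(base_billion, cagr, years):
    """Compound a starting market size at a fixed annual growth rate."""
    return base_billion * (1 + cagr) ** years


# Assumed inputs: $5.1bn base, 58% CAGR, 5 compounding years (2012 -> 2017).
projected = project_market(5.1, 0.58, 5)
print(f"Implied 2017 market size: ${projected:.1f} billion")
```

Put differently, a 58% CAGR held for five years multiplies the starting figure almost tenfold.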

2011 Big Data Pure-Play Vendors Yearly Big Data Revenue

While there are many stories behind these numbers and many things to think about, here is what I’ve jotted down while studying the report:

  • it’s no surprise that “megavendors” (IBM, HP, etc.) account for the largest part of today’s Big Data market revenue
  • still, the revenue ratio of pure-players vs megavendors feels quite unbalanced: $311mil out of $5.1bil
    • the pure-player category includes: Vertica, Aster Data, Splunk, Greenplum, 1010data, Cloudera, Think Big Analytics, MapR, Digital Reasoning, Datameer, Hortonworks, DataStax, HPCC Systems, Karmasphere
    • there are a couple of names that position themselves in the Big Data market but do not show up anywhere (e.g. 10gen, Couchbase)
  • this could lead to the conclusion that the companies that include hardware in their offer benefit from larger revenues
    • I’m wondering though what the margin is in the hardware market segment. While I don’t have any data at hand, I think I’ve read reports about HP and Dell not doing so well precisely because of lower margins
    • see bullet point further down about revenue by hardware, software, and services
  • this could explain why so many companies are trying their hand at appliances
  • looking at the various numbers, you can see that those selling appliances usually have a large corporation behind them covering the production costs for hardware and probably the cost of the sales force
  • in the Big Data revenue by vendor you can find quite a few well-known names from the consulting segment
  • the revenue by type pie lists services as accounting for 44%, hardware for 31%, and software for 13%, which might give an idea of what makes up the megavendors’ sales packages
    • most of the NoSQL database and Hadoop companies are in the software and services segments
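The proportions mentioned in the notes above are easy to check with a few lines of arithmetic. This is only a sketch using the numbers as quoted in this post; the helper name is hypothetical.

```python
def share_percent(part, total):
    """Return part as a percentage of total."""
    return 100.0 * part / total


# Figures from the notes above, both in millions of dollars.
pure_play = share_percent(311, 5100)
print(f"Pure-play share of Big Data revenue: {pure_play:.1f}%")

# Revenue-by-type percentages as quoted above. They sum to 88,
# so 12% is not accounted for in the figures as listed here.
by_type = {"services": 44, "hardware": 31, "software": 13}
unaccounted = 100 - sum(by_type.values())
print(f"Share not covered by the three listed types: {unaccounted}%")
```

So the pure-play vendors account for only about 6% of the total, which is why the ratio feels so unbalanced.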

Great job done by the Wikibon team.


via: http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues


JavaScript Console and Excel Coming to Hadoop

Eric Baldeschwieler about the Hortonworks and Microsoft partnership for bringing Apache Hadoop to Windows:

What makes this announcement significant is that Microsoft is opening up Apache Hadoop to literally millions of new users. There are millions of JavaScript developers that can now leverage the power of Apache Hadoop. There are many more millions of Excel and PowerPivot users that can also now derive value from Apache Hadoop using software that is already very familiar to them. Simply put, these contributions by Microsoft will extend Apache Hadoop to the most prolific data analysis tools in the world.

Me, back in January, after taking a look at Hadoop on Windows Azure:

The JavaScript console and the visualization support are very nice additions on top of the managed Hadoop on Azure.

Feature checklists are still important, but technology adoption depends more and more on the user experience. Think of getting up to speed as being the first impression someone gets of a new technology.

Think of integration with familiar tools and frameworks as a huge adoption accelerator.


via: http://hortonworks.com/blog/extending-apache-hadoop-to-millions-of-new-microsoft-users/


Hadoop Namenode High Availability Merged to HDFS Trunk

As I’m slowly recovering from a severe poisoning that I initially ignored but that finally put me in bed for almost a week, I’m going to post some of the most interesting articles I’ve read while resting.

The Hadoop Namenode’s single point of failure has always been mentioned as one of the weaknesses of Hadoop and also as a differentiator for other Hadoop-based commercial offerings. But now the Namenode HA branch has been merged into trunk, and while it will take a couple of cycles to complete the tests, it will soon become part of the Hadoop distribution.

Here’s Jitendra Pandey’s announcement on the Hortonworks blog:

Significant enhancements were completed to make HOT Failover work:

  • Configuration changes for HA
  • Notion of active and standby states were added to the Namenode
  • Client-side redirection
  • Standby processing journal from Active
  • Dual block reports to Active and Standby

As a follow-up to my post about Gartner’s article Apache Hadoop 1.0 Doesn’t Clear Up Trunks and Branches Questions. Do Distributions?: the advantage of using custom distributions will slowly vanish, and the open source version will be the one you’ll want to have in production.



More Details About the Teradata and Hortonworks Partnership

Some more interesting bits about the Teradata and Hortonworks partnership in Timothy Prickett Morgan’s “Teradata grabs Hortonworks by trunk” on The Register:

The Cloudera deal from September 2010 provided a pipe from a Hadoop cluster into the Teradata data warehouses, while the Hortonworks partnership announced today is providing a pipe between Hadoop and Aster Data appliances.

Hortonworks and Teradata will do joint marketing and development, and are exploring ways to better integrate their respective software. This will specifically be done on Data Platform 1.0 from Hortonworks and Aster Database 5.0 from Teradata. Future engineering work could include running the HortonWorks and Aster Data programs on the same physical clusters, side-by-side, although this is not the way customers tend to do it today, according to Argyros.



Teradata and Hortonworks Partnership and What It Means

Context

Teradata sells software, hardware, and services for data warehouses and analytic applications. Part of the Teradata portfolio is also the Teradata Aster MapReduce Platform, a massively parallel processing infrastructure with a software solution that embeds both SQL and MapReduce analytic processing for deeper analytic insights on multi-structured data and new analytic capabilities driven by data science.

Hortonworks offers services around the 100% Apache-licensed, open source Hortonworks Data Platform, an integrated solution built around Hadoop.

Hortonworks Data Platform

Announcement

The interesting bits from the announcement and media coverage:

News release:

Teradata and Hortonworks will join forces to provide technologies and strategic guidance to help businesses build integrated, transparent, enterprise-class big data analytic solutions that leverage Apache Hadoop. The partnership will focus on enabling businesses to use Apache Hadoop to harness the value from new sources of data. Businesses will be able to quickly load and refine multi-structured data, some of which is being discarded today, for discovery and analytics. The resulting insights will enable analysts and front line users to make the best business decision possible.

Teradata Hortonworks Hadoop Aster Architecture

For example, each day websites generate many terabytes of raw, complex data about customers’ viewing and buying habits. These web logs can be directly loaded into Teradata Aster or Apache Hadoop where they can be stored, transformed, and refined in preparation for analysis by the Teradata Aster MapReduce platform (nb: my emphasis).

Derrick Harris:

The company [Teradata] has already worked with Hortonworks’ competitor Cloudera on a connector between the Teradata Database and Cloudera’s Hadoop distribution, but the Hortonworks deal appears a little deeper and more strategic.

Quentin Hardy:

The alliance between Teradata and Hortonworks means that companies can get strategic advice about how to get into the new analytics game from Teradata, and have practical help on running the systems from Hortonworks.

Arun Murthy:

However, there are two important challenges that need to be addressed before broad enterprise adoption can occur:

  • Understanding the right use cases in which to utilize Apache Hadoop.
  • Integrating Apache Hadoop with existing data architectures in an appropriate manner to get better value from existing investments.

My sense of excitement about the Teradata/Hortonworks partnership is amplified by the fact that it addresses these two core challenges for Apache Hadoop:

  • We will be rolling out a reference architecture that provides guidance to enterprises that want to understand the best use cases for which to apply Hadoop. As part of that, we will be helping Teradata customers use Hadoop in conjunction with their Teradata and Teradata Aster analytic data solutions investments.
  • We will also be working closely with the Teradata engineering teams on jointly engineered solutions that optimize the integration points with Apache Hadoop.

Commentary

  • From Hortonworks’ perspective this deal is weaker than the Oracle-Cloudera deal.

    In the former case, new Teradata sales do not necessarily result in new Hortonworks Data Platform installations, while in the case of the Oracle-Cloudera partnership, every sale results in new business for Cloudera.

  • From Teradata’s perspective, this partnership gives them a perfect answer and solution for clients asking about unstructured data scenarios.

  • The announcement positions Hadoop somewhat as part of the ETL process, but is not as strict about this as other Hadoop integration architectures—see Netezza and Hadoop and Vertica and Hadoop.

  • Depending on the level of integration the two teams manage to achieve, this partnership might result in one of the most complete and powerful structured and unstructured data warehouse and analytics platforms.

I’m looking forward to seeing the proposed architecture blueprint once it’s finalized.



Apache Hadoop 1.0 Doesn’t Clear Up Trunks and Branches Questions. Do Distributions?

It looks like the three pictures about Hadoop versions (the first two by Cloudera and the third by Konstantin I. Boudnik & Cos) are actually worth 1066 Gartner words.

On the other hand, to address the question in the title—would custom distributions clarify Hadoop versions—I think that while custom distributions might be helpful for experimenting or getting started with Hadoop, long term they’ll actually lead to more segmentation in the market and bigger maintenance and upgrade costs for end users.

There are just a few companies with a track record of maintaining and distributing open source projects—in the Hadoop space these are Cloudera and Hortonworks (nb Hortonworks is supporting the Apache Hadoop distribution). So if a vendor tries to sell you a Hadoop package ask them about their history managing open source distributions.



Latest NoSQL Releases: HBase 0.92, DataStax Community Server, Hortonworks Data Platform, SolrCloud

Just a quick roundup of the latest releases and announcements.

Hortonworks Data Platform (HDP) version 2

HDP v2 will include:

  • NextGen MapReduce architecture
  • HDFS NameNode HA
  • HDFS Federation
  • up-to-date HCatalog, HBase, Hive, Pig

According to the announcement:

In order to avoid confusion, let me explain the two versions of HDP:

  • HDP v1 is based upon Apache Hadoop 1.0 (which comes from the 0.20.205 branch). It is the most stable, production-ready version of Hadoop that is currently found in many large enterprise deployments. HDP v1 is currently available as a private technology preview. A public technology preview will be made available later this quarter.
  • HDP v2 is based upon Apache Hadoop 0.23, which includes the next generation advancements mentioned above. It’s an important step forward in terms of scalability, performance, high availability and data integrity. A technology preview will also be made publicly available later in Q1.

SolrCloud Completes Phase 2

Mark Miller about the completion of phase 2:

The second phase of SolrCloud has been in full swing for a couple of months now and it looks like we are going to be able to commit this work to trunk very soon! In Phase1 we built on top of Solr’s distributed search capabilities and added cluster state, central config, and built-in read side fault tolerance. Phase 2 is even more ambitious and focuses on the write side. We are talking full-blown fault tolerance for reads and writes, near real-time support, real-time GET, true single node durability, optimistic locking, cluster elasticity, improvements to the Phase 1 features, and more.

Not there yet, but it’s coming.

DataStax Community Server 1.0.7

A new release of DataStax’s distribution of Cassandra, incorporating Cassandra 1.0.7.

HBase 0.92

Don’t let the version number trick you. This is an important release for HBase featuring:

  • coprocessors
  • security
  • new (self-migrating) file format
  • AWS improvements: EBS support, building a HA cluster

The list of new features, improvements, and bug fixes in HBase 0.92 is impressive. But the highlight of this release is, in my opinion, HBase coprocessors (Jira entry HBASE-2000).

I’m leaving you with Andrew Purtell’s slides about HBase Coprocessors:


Partnerships in the Hadoop Market

Just a quick recap:

Amazon doesn’t partner with anyone for its Amazon Elastic MapReduce. And IBM is walking alone with the software-only InfoSphere BigInsights.



Project Isotope Will Bring Together Hadoop Toolchain With Microsoft’s Data Products

A recent series of events makes me think Microsoft is nowhere near accepting defeat in the cloud services area. As regards Microsoft’s Project Isotope, things are much simpler than the ZDNet article makes them sound[1]: Microsoft is working on integrating Hadoop and its toolchain with their own products (SQL Server Analysis Services, PowerPivot).

Microsoft Project Isotope

A picture worth more than the 626 words.


  1. I bet the details of the integration are fascinating and far from simple, but the article does not focus on them.



8 Most Interesting Companies for Hadoop’s Future

Filtering and augmenting a Q&A on Quora:

  1. Cloudera: Hadoop distribution, Cloudera Enterprise, Services, Training
  2. Hortonworks: Apache Hadoop major contributions, Services, Training
  3. MapR: Hadoop distribution, Services, Training
  4. HPCC Systems: massive parallel-processing computing platform
  5. HStreaming: real-time data processing and analytics capabilities on top of Hadoop
  6. DataStax: DataStax Enterprise, Apache Cassandra based platform accepting real-time input from online applications, while offering analytic operations, powered by Hadoop
  7. Zettaset: Enterprise Data Analytics Suite built on Hadoop
  8. Hadapt: analytic platform based on Apache Hadoop and relational DBMS technology

I’ve left aside names like IBM, EMC, Informatica, which are doing a lot of integration work.
