ALL COVERED TOPICS

NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter

NAVIGATE MAIN CATEGORIES

Close

HPCC: All content tagged as HPCC in NoSQL databases and polyglot persistence

8 Most Interesting Companies for Hadoop’s Future

Filtering and augmenting a Q&A on Quora:

  1. Cloudera: Hadoop distribution, Cloudera Enterprise, Services, Training
  2. Hortonworks: Apache Hadoop major contributions, Services, Training
  3. MapR: Hadoop distribution, Services, Training
  4. HPCC Systems: massive parallel-processing computing platform
  5. HStreaming: real-time data processing and analytics capabilities on top of Hadoop
  6. DataStax: DataStax Enterprise, Apache Cassandra based platform accepting real-time input from online applications, while offering analytic operations, powered by Hadoop
  7. Zettaset: Enterprise Data Analytics Suite built on Hadoop
  8. Hadapt: analytic platform based on Apache Hadoop and relational DBMS technology

I’ve left aside names like IBM, EMC, Informatica, which are doing a lot of integration work.

Original title and link: 8 Most Interesting Companies for Hadoop’s Future (NoSQL database©myNoSQL)


Hadoop, HPCC, MapR and the TeraSort Benchmark

Just in, from LexisNexis:

HPCC Systems 4 nodes cluster sorts 100 gigabytes in 98 seconds and is 25% faster than a 20 nodes Hadoop cluster.

Results achieved in December 2011 show that an HPCC Systems four node Thor cluster took only 98 seconds to complete a Terasort with a job size of 100 gigabytes (GB) on a cluster five times smaller than Hadoop. The HPCC Systems four node cluster was comprised of one (1) Dell PowerEdge C6100 2U server with Intel® Xeon® processors E5675 series, 48GB of memory, and 6 x 146GB SAS HDD’s. The Dell C6100 houses four nodes inside the 2U enclosure. The previous leader ran the same Terasort benchmark in 130 seconds on a 20-node Hadoop cluster using equivalent node hardware. HPCC Systems is an Open Source, enterprise-proven Big Data analytics-processing platform.

Thus Armando Escalante (SVP and CTO of LexisNexis Risk Solutions and head of HPCC Systems) concludes:

These results demonstrate that HPCC Systems is a leader in Big Data processing

Now switching to a post on MapR’s blog:

Recently a world record was claimed for a Hadoop benchmark. […] We were surprised to see that this world record was for a TeraSort benchmark on a 100GB of data. TeraSort is a standard benchmark and the name is derived from “sorting a terabyte”.  Any record claims for sorting a 100GB dataset across a 20 node cluster with 10 times as much memory is comical. The test is named TeraSort not GigaSort.

Original title and link: Hadoop, HPCC, MapR and the TeraSort Benchmark (NoSQL database©myNoSQL)


Hortonworks Data Platform: Hortonworks’ Hadoop Distribution

Announcement came out today[1]:

Hortonworks Data Platform, powered by Apache Hadoop — As we began to interact with enterprises and ecosystem partners, the one constant was the need for a base distribution of Apache Hadoop that is 100% open source and that contains the essential components used with every Hadoop installation.  A distribution was needed to provide an easy to install, tightly integrated and well tested set of servers and tools. As we interacted with potential partners, we also heard the message loud and clear that they wanted open and secure APIs to easily integrate and extend Hadoop. We believe we have succeeded on both fronts. The Hortonworks Data Platform is such an open source distribution.  It is powered by Apache Hadoop and includes the essential Hadoop components, plus some that make it more manageable, open and extensible. Our distribution is based on Hadoop 0.20.205, the first Apache Hadoop release that supports security and HBase.  It also includes some new APIs, such as WebHDFS and those in Ambari and HCatalog, which will make it easy for our partners to integrate their products with Apache Hadoop. For those new to Ambari, it is an open source Apache project that will bring improved installation and management to Hadoop. HCatalog is a metadata management service for simplifying the sharing of data between Hadoop and other data systems. We are releasing Hortonworks Data Platform initially as a limited technology preview with plans to open it up to the public in early 2012.

The fight is on–even if for now the tone is still polite. And if we are adding to the mix MapR and LexisNexis’ HPCC, not to mention the armies of marketers and sales coming from Oracle, IBM, EMC, NetApp, etc. this actually smells like war.

Edward Ribeiro apty commented: “This reminds me of Linux distros war circa 2001”.


  1. The emphasis in the text is mine to underline the most important aspects of the announcement.  

Original title and link: Hortonworks Data Platform: Hortonworks’ Hadoop Distribution (NoSQL database©myNoSQL)


BI Pentaho Integrates Hadoop, NoSQL Databases, and Analytic Databases

Dr.Dobb’s:

  • The ability to orchestrate execution of Hadoop related tasks (i.e., executing a Hive Query, Pig Script, or M/R job) as part of a broader IT workflow.
  • The ability to setup dependencies, so if a step fails the job can branch down a recovery path or send a notification, or if it’s a success it goes on to subsequent dependent tasks. Likewise it supports initiating several tasks in parallel.
  • New integration for Pig — so that developers have the ability to execute a Pig job from a PDI Job flow, integrate the execution of Pig jobs in broader IT workflows through PDI Jobs, take advantage of our out of the box scheduler, and so on.

The list of tools Pentaho 4 integrates with is quite long:

  • a long list of traditional RDBMS
  • analytics databases (Greenplum, Vertica, Netezza, Teradata, etc.)
  • NoSQL databases (MongoDB, HBase, etc.)
  • Hadoop variants
  • LexisNexis HPCC

This is the world of polyglot persistence and hybrid data storage.

Original title and link: BI Pentaho Integrates Hadoop, NoSQL Databases, and Analytic Databases (NoSQL database©myNoSQL)