greenplum: All content tagged as greenplum in NoSQL databases and polyglot persistence

BI Pentaho Integrates Hadoop, NoSQL Databases, and Analytic Databases


  • The ability to orchestrate execution of Hadoop related tasks (i.e., executing a Hive Query, Pig Script, or M/R job) as part of a broader IT workflow.
  • The ability to set up dependencies, so if a step fails the job can branch down a recovery path or send a notification, or if it’s a success it goes on to subsequent dependent tasks. Likewise it supports initiating several tasks in parallel.
  • New integration for Pig — so that developers have the ability to execute a Pig job from a PDI Job flow, integrate the execution of Pig jobs in broader IT workflows through PDI Jobs, take advantage of our out of the box scheduler, and so on.
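
The dependency handling described in the list above (branch to a recovery path on failure, continue to dependent tasks on success, run several tasks in parallel) can be sketched in a few lines of Python. This is only an illustration of the idea, not Pentaho's API: `run_step`, `run_job`, and the step names are hypothetical, and the Hive, Pig, and M/R steps are stand-in functions rather than real jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def run_step(name, action):
    """Run one step; return True on success, False if it raised."""
    try:
        action()
        return True
    except Exception as exc:
        print(f"{name} failed: {exc}")
        return False


def run_job(step, on_success=(), on_failure=()):
    """Run `step`, then run the matching branch's steps in parallel.

    Returns (ok, executed): whether the first step succeeded, and the
    names of the steps that were started, first step included.
    """
    name, action = step
    ok = run_step(name, action)
    branch = on_success if ok else on_failure
    with ThreadPoolExecutor() as pool:
        # Each branch step is independent, so they can run side by side.
        list(pool.map(lambda s: run_step(*s), branch))
    return ok, [name] + [n for n, _ in branch]


def failing_hive_query():
    # Stand-in for a Hive query that fails (hypothetical).
    raise RuntimeError("simulated Hive failure")


log = []
ok, executed = run_job(
    ("hive_query", failing_hive_query),
    on_success=[
        ("pig_script", lambda: log.append("pig")),
        ("mr_job", lambda: log.append("mr")),
    ],
    on_failure=[("notify", lambda: log.append("notify"))],
)
```

Here the first step fails, so only the recovery branch (`notify`) runs. Had it succeeded, the two dependent steps would have been started in parallel.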

The list of tools Pentaho 4 integrates with is quite long:

  • a long list of traditional RDBMS
  • analytics databases (Greenplum, Vertica, Netezza, Teradata, etc.)
  • NoSQL databases (MongoDB, HBase, etc.)
  • Hadoop variants
  • LexisNexis HPCC

This is the world of polyglot persistence and hybrid data storage.

Original title and link: BI Pentaho Integrates Hadoop, NoSQL Databases, and Analytic Databases (NoSQL databases © myNoSQL)

EMC BigData Acquisition Budget: $3 Billion

Bloomberg reports on EMC’s planned budget for acquisitions in the BigData market:

EMC Corp. may spend about $3 billion on acquisitions this year, keeping pace with last year’s tally, to add businesses that can help corporate customers analyze reams of data, Chief Operating Officer Pat Gelsinger said.


EMC says it spent $3.2 billion last year on acquisitions including Isilon Systems Inc. and Greenplum Inc. to gain products that let its customers store and analyze a vast and rapid onslaught of data from business applications and the Web. EMC may spend about that much again in 2011 as it races Oracle Corp. (ORCL), International Business Machines Corp., Hewlett-Packard Co. (HPQ) and SAP AG (SAP) to offer more robust data-analysis products.

EMC joins HP, which has also directly[1] and indirectly announced its plans for acquisitions in the BigData market.

On the other hand, can you imagine how much could be done for the community-driven NoSQL databases with only 1-2% of this budget?

  1. Earlier this year, HP acquired Vertica.  

Original title and link: EMC BigData Acquisition Budget: $3 Billion (NoSQL databases © myNoSQL)


Druid: Distributed In-Memory OLAP Data Store

Over the last twelve months, we tried and failed to achieve scale and speed with relational databases (Greenplum, InfoBright, MySQL) and NoSQL offerings (HBase).

Stepping back from our two failures, let’s examine why these systems failed to scale for our needs:

  1. Relational Database Architectures

    • Full table scans were slow, regardless of the storage engine used
    • Maintaining proper dimension tables, indexes and aggregate tables was painful
    • Parallelization of queries was not always supported or non-trivial
  2. Massive NOSQL With Pre-Computation

    • Supporting high dimensional OLAP requires pre-computing an exponentially large amount of data
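
To make the "exponentially large" point concrete, here is a small sketch of the standard data-cube counting argument. It is not taken from the Druid post: the function names and the example dimensions are assumptions for illustration. With d dimensions there are 2^d possible group-by combinations, and if every combination is materialized, the number of cells is bounded by the product of (cardinality + 1) over all dimensions, where the extra 1 stands for "rolled up over this dimension".

```python
from itertools import combinations
from math import prod


def rollup_count(num_dimensions):
    """Number of distinct group-by combinations over the dimensions."""
    return 2 ** num_dimensions


def enumerate_rollups(dimensions):
    """List every subset of dimensions that would need its own aggregate."""
    return [
        combo
        for size in range(len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def max_precomputed_cells(cardinalities):
    """Upper bound on cells when every combination is materialized.

    Each dimension contributes its distinct values plus one "all" value.
    """
    return prod(c + 1 for c in cardinalities)


# Hypothetical dimensions, chosen only for the example.
dims = ["publisher", "country", "browser"]
rollups = enumerate_rollups(dims)
```

Three dimensions already give eight combinations, and each added dimension doubles that number, which is why pre-computing everything stops being practical for high-dimensional OLAP.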

Many of the questions you have in mind have already been asked in this comment thread, but few of them have been answered so far.

Original title and link: Druid: Distributed In-Memory OLAP Data Store (NoSQL databases © myNoSQL)


The Data Processing Platform for Tomorrow

In the blue corner we have IBM with Netezza as analytic database, Cognos for BI, and SPSS for predictive analytics. In the green corner we have EMC with Greenplum and the partnership with SAS[1]. And in the open source corner we have Hadoop and R.

Update: there’s also another corner I don’t know how to color where Teradata and its recently acquired Aster Data partner with SAS.

Who is ready to bet on which of these platforms will be processing more data in the next years?

  1. GigaOm has a good article on this subject here  

Original title and link: The Data Processing Platform for Tomorrow (NoSQL databases © myNoSQL)

Types of Big Data Work

Mike Minelli: Working with big data can be classified into three basic categories […] One is information management, a second is business intelligence, and the third is advanced analytics

Information management captures and stores the information, BI analyzes data to see what has happened in the past, and advanced analytics is predictive, looking at what the data indicates for the future.

There’s also a list of tools for BigData: AsterData (acquired by Teradata), Datameer, Paraccel, IBM Netezza, Oracle Exadata, EMC Greenplum.

Original title and link: Types of Big Data Work (NoSQL databases © myNoSQL)


Cloudera: A Business Intelligence Leader

The Informatica accord is Cloudera’s second partnership this year with a leading DI player. Back in August, Cloudera cemented a deal with open source software (OSS) data integration (DI) specialist Talend. It also has partnerships with Teradata Corp., the former Netezza Inc., the former Greenplum Software Corp., Aster Data Systems Inc., Vertica Inc., and Pentaho.

One thing’s for sure: Cloudera is certainly attracting attention.

The strategy is surprisingly simple: make it easy to put data in and get it out.

Original title and link: Cloudera: A Business Intelligence Leader (NoSQL databases © myNoSQL)


New Tools in the NoSQL and Big Data Market

DataStax OpsCenter for Apache Cassandra

DataStax (ex-Riptano) yesterday announced OpsCenter, its tool for managing, monitoring, and operating enterprise Cassandra applications, including sophisticated visualizations of the cluster and comprehensive management and configuration.

DataStax OpsCenter for Apache Cassandra will require a subscription, but a developer version, not to be used in production, will be made available too.

Call me an idealist, but I would have suggested a different model than Gold/Silver/Bronze or Mission-Critical/Premier:

  • 1-5 nodes: free (nb: good karma)
  • 6-low tens of nodes: moderately priced package
  • premier: everything else

EMC Greenplum Community Edition

After acquiring Greenplum[1], EMC is making available a community edition:

[…] the new EMC Greenplum Community Edition removes the cost barrier to entry for big data power tools empowering large numbers of developers, data scientists, and other data professionals. This free set of tools enables the community to not only better understand their data, gain deeper insights and better visualize insights, but to also contribute and participate in the development of next-generation tools and solutions. With the Community Edition stack, developers can build complex applications to collect, analyze and operationalize big data leveraging best of breed big data tools including the Greenplum Database with its in-database analytic processing capabilities.

I couldn’t find the details of the community edition license, but instead I’ve found this:

The software is only intended for research, development and experiments, with license purchases required for commercial uses.

You can read more about the (marketing) rationale behind this release on the blog of Chuck Hollis, Global Marketing CTO.

  1. Greenplum: a shared-nothing MPP architecture from the ground up for BI and analytical processing using commodity hardware

Original title and link: New Tools in the NoSQL and Big Data Market (NoSQL databases © myNoSQL)

Hadoop Spreading through Cloudera Partnerships

Cloudera, in its attempt to Hadoopize the world, goes on a partnership spree:

Many of you may have read about some of the recent announcements of partnerships between Cloudera and some of the leading data management software companies like Teradata, Netezza, Greenplum (EMC), Quest and Aster Data. We established these partnerships because Hadoop is increasingly serving as an open platform that many different applications and complementary technologies work with. Our goal is to make this as easy and as standardized as possible.

Checking the press release section turns up the following partnerships:

  • Membase
  • Talend
  • Quest
  • Pentaho
  • NTT Data
  • Aster Data
  • EMC Greenplum
  • Teradata
  • Netezza

Quite a few companies from the non-relational market.

Original title and link: Hadoop Spreading through Cloudera Partnerships (NoSQL databases © myNoSQL)