


EMC: All content tagged as EMC in NoSQL databases and polyglot persistence

Petabyte-Scale Hadoop Clusters

Curt Monash quoting Omer Trajman (Cloudera) in a post counting petabyte-scale Hadoop deployments:

The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.
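The mean-vs-median point is worth pausing on: a handful of very large clusters can pull the average far above the typical deployment. A quick sketch with made-up cluster sizes (hypothetical numbers, not Cloudera's actual data) shows the effect:

```python
# Hypothetical cluster sizes: most under 30 nodes, a few beefy ones.
clusters = [10, 15, 20, 25, 25, 30, 400, 500, 1000]

mean = sum(clusters) / len(clusters)
median = sorted(clusters)[len(clusters) // 2]

print(f"mean:   {mean:.0f} nodes")  # the large clusters pull this up
print(f"median: {median} nodes")    # the typical cluster is still small
```

With these numbers the mean lands at 225 nodes while the median stays at 25, which is exactly the shape Trajman describes.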

The first questions that popped into my head after reading it:

  1. How many deployments does DataStax’s Brisk have? How many are close to or over a petabyte?
  2. How many clients run EMC Greenplum HD and how many are close to this scale?
  3. Same question about NetApp Hadoopler clients.
  4. Same question for MapR.

Answering these questions would give us a good overview of the Hadoop ecosystem.

Original title and link: Petabyte-Scale Hadoop Clusters (NoSQL databases © myNoSQL)


2 Ways to Tackle Really Big Data

So there you have the two approaches to handling machine-generated data. If you have vast archives, EMC, IBM Netezza, and Teradata all have purpose-built appliances that scale into the petabytes. You also could use Hadoop, which promises much lower cost, but you’ll have to develop separate processes and applications for that environment. You’ll also have to establish or outsource expertise on Hadoop deployment, management, and data processing. For fast-query needs, EMC, IBM Netezza, and Teradata all have fast, standard appliances and faster, high-performance appliances (and companies including Kognitio and Oracle have similar configuration choices). Column-oriented database and appliance vendors including HP Vertica, InfoBright, ParAccel, and Sybase have speed advantages inherent in their database architectures.

I’m wondering why Hadoop is mentioned just in passing considering how many large datasets it is already handling.

Original title and link: 2 Ways to Tackle Really Big Data (NoSQL databases © myNoSQL)


EMC BigData Acquisition Budget: $3 Billion

Bloomberg reports on EMC’s planned budget for acquisitions in the BigData market:

EMC Corp. may spend about $3 billion on acquisitions this year, keeping pace with last year’s tally, to add businesses that can help corporate customers analyze reams of data, Chief Operating Officer Pat Gelsinger said.


EMC says it spent $3.2 billion last year on acquisitions including Isilon Systems Inc. and Greenplum Inc. to gain products that let its customers store and analyze a vast and rapid onslaught of data from business applications and the Web. EMC may spend about that much again in 2011 as it races Oracle Corp. (ORCL), International Business Machines Corp., Hewlett-Packard Co. (HPQ) and SAP AG (SAP) to offer more robust data-analysis products.

EMC joins HP, which has also announced, both directly[1] and indirectly, its plans for acquisitions in the BigData market.

On the other hand, can you imagine how much could be done for the community driven NoSQL databases with only 1-2% of this budget?

  1. Earlier this year, HP acquired Vertica.  

Original title and link: EMC BigData Acquisition Budget: $3 Billion (NoSQL databases © myNoSQL)


EMC Partners with MapR for Greenplum HD Enterprise Edition

EMC plans to bring MapR’s proprietary replacement for the Hadoop Distributed File System to Greenplum HD, its enterprise-ready Apache Hadoop distribution:

Because MapR’s file system is more efficient than HDFS, users will achieve two to five times the performance over standard Hadoop nodes in a cluster, according to Schroeder. That translates into being able to use about half the number of nodes typically required in a cluster, he said.

“Hadoop nodes cost about $4,000 per node depending on configuration. If you add in power costs, HVAC, switching, and rackspace, you’ll probably double that,” Schroeder said. “Our product can immediately save you $4,000 and over 8 years it’ll save you $8000 per node.”
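Schroeder’s savings claim can be checked with back-of-the-envelope arithmetic. The per-node figures come from the quote; the cluster size below is purely illustrative:

```python
# Figures from the quote: ~$4,000 hardware per node, roughly doubled
# once power, HVAC, switching, and rack space are added in.
hw_cost_per_node = 4_000
all_in_cost_per_node = 2 * hw_cost_per_node

# MapR's claim: the same workload runs on about half the nodes.
standard_nodes = 100  # illustrative cluster size, not from the article
mapr_nodes = standard_nodes // 2

savings = (standard_nodes - mapr_nodes) * all_in_cost_per_node
print(f"all-in savings for a {standard_nodes}-node workload: ${savings:,}")
```

At these numbers, halving a 100-node cluster saves $400,000 all-in, which is where the headline per-node savings figures come from.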

In terms of what MapR is bringing to the table, the article mentions MapR’s improvements to Apache Hadoop:

  • multiple channels to data through NFS protocol
  • a re-architected NameNode for high availability
  • eliminated single points of failure and automated job failover
  • data mirroring and snapshot capabilities
  • wide area replication

Filing this under the announced section of the Hadoop-related solutions list.

Original title and link: EMC Partners with MapR for Greenplum HD Enterprise Edition (NoSQL databases © myNoSQL)


IBM Hadoop Commitment

The company also cemented its commitment to the Hadoop open source data analytics tool, identifying it as “the cornerstone of [IBM’s] big data strategy” in a statement.

IBM is the latest in a line of enterprises to stress their commitment to Hadoop. Enterprise storage vendor EMC put a tweaked Hadoop distribution at the heart of a recently updated range of data analytics Greenplum appliances, while business intelligence company Jaspersoft announced plans to better integrate its products with Hadoop in February.

Sometimes I don’t get the meaning of the words commitment and investment. But this makes me believe others have the same problem understanding them.

Original title and link: IBM Hadoop Commitment (NoSQL databases © myNoSQL)


EMC: There's a time and place for Hadoop

Bill Cook (President and GM, Data Computing Division, EMC):

There’s a time and a place for the value that relational databases add to structured data, and there’s a time and a place for the value Hadoop can give to unstructured data. Many of our enterprise customers need both and, with the help of our partners, we’re able to provide them both, while also meeting their expectations around high availability, fault tolerance, and enterprise-class support and service.

Original title and link: EMC: There’s a time and place for Hadoop (NoSQL databases © myNoSQL)


Hadoop Ecosystem: EMC, NetApp, Mellanox, SnapLogic, DataStax

GigaOm and RWW have coverage of the 5 Hadoop-related announcements:

  • DataStax Brisk: Hadoop and Hive on Cassandra
  • NetApp Hadoop Shared DAS
  • Mellanox Hadoop-Direct

    increase throughput in Hadoop clusters via its ConnectX-2 adapters with Hadoop Direct

  • SnapLogic SnapReduce

    SnapReduce transforms SnapLogic data integration pipelines directly into MapReduce tasks, making Hadoop processing much more accessible and resulting in optimal Hadoop cluster utilization.

  • EMC GreenplumHD

    Greenplum HD combines the Hadoop analytics platform with Greenplum’s database technology.

Ways to look at it:

  • 2 large corporations getting into Hadoop
  • 2 software solutions, 3 hardware solutions
  • 1 open source project, 4 commercial products or
  • 4 companies wanting to make a profit from Hadoop without contributing back to the community

Original title and link: Hadoop Ecosystem: EMC, NetApp, Mellanox, SnapLogic, DataStax (NoSQL databases © myNoSQL)

The Data Processing Platform for Tomorrow

In the blue corner we have IBM with Netezza as analytic database, Cognos for BI, and SPSS for predictive analytics. In the green corner we have EMC with Greenplum and the partnership with SAS[1]. And in the open source corner we have Hadoop and R.

Update: there’s also another corner I don’t know how to color where Teradata and its recently acquired Aster Data partner with SAS.

Who is ready to bet on which of these platforms will be processing more data in the coming years?

  1. GigaOm has a good article on this subject here  

Original title and link: The Data Processing Platform for Tomorrow (NoSQL databases © myNoSQL)

Types of Big Data Work

Mike Minelli: Working with big data can be classified into three basic categories […] One is information management, a second is business intelligence, and the third is advanced analytics

Information management captures and stores the information, BI analyzes data to see what has happened in the past, and advanced analytics is predictive, looking at what the data indicates for the future.

There’s also a list of tools for BigData: AsterData (acquired by Teradata), Datameer, Paraccel, IBM Netezza, Oracle Exadata, EMC Greenplum.

Original title and link: Types of Big Data Work (NoSQL databases © myNoSQL)


Cloudera: A Business Intelligence Leader

The Informatica accord is Cloudera’s second partnership this year with a leading DI player. Back in August, Cloudera cemented a deal with open source software (OSS) data integration (DI) specialist Talend. It also has partnerships with Teradata Corp., the former Netezza Inc., the former Greenplum Software Corp., Aster Data Systems Inc., Vertica Inc., and Pentaho.

One thing’s for sure: Cloudera is certainly attracting attention.

The strategy is surprisingly simple: make it easy to put data in and get it out.

Original title and link: Cloudera: A Business Intelligence Leader (NoSQL databases © myNoSQL)


Everything Drives Storage

James Governor about storage and EMC:

It seems like every computing revolution drives storage volumes […]. But everything drives storage. Virtualisation drives storage (which helps explain both the rationalisation, and the huge success, of EMC’s VMware acquisition). The cloud drives storage. Big Data drives storage (obviously). Data Center consolidation drives storage. The Web drives storage.

… and they don’t believe in memory.

Original title and link: Everything Drives Storage (NoSQL databases © myNoSQL)


New Tools in the NoSQL and Big Data Market

DataStax OpsCenter for Apache Cassandra

DataStax (ex-Riptano) announced yesterday OpsCenter, their tool for managing, monitoring, and operating enterprise Cassandra applications, featuring sophisticated visualizations of the cluster and comprehensive management and configuration.

DataStax OpsCenter for Apache Cassandra will require a subscription, but a developer version, not to be used in production, will be made available too.

Call me an idealist, but I would have suggested a different model than Gold/Silver/Bronze or Mission-Critical/Premier:

  • 1-5 nodes: free (nb: good karma)
  • 6-low tens of nodes: moderately priced package
  • premier: everything else

EMC Greenplum Community Edition

After acquiring Greenplum[1], EMC is making available a community edition:

[…] the new EMC Greenplum Community Edition removes the cost barrier to entry for big data power tools empowering large numbers of developers, data scientists, and other data professionals. This free set of tools enables the community to not only better understand their data, gain deeper insights and better visualize insights, but to also contribute and participate in the development of next-generation tools and solutions. With the Community Edition stack, developers can build complex applications to collect, analyze and operationalize big data leveraging best of breed big data tools including the Greenplum Database with its in-database analytic processing capabilities.

I couldn’t find the details of the community edition license, but instead I’ve found this:

The software is only intended for research, development and experiments, with license purchases required for commercial uses.

About the (marketing) rationale behind this release, you can read more on the blog of Chuck Hollis, Global Marketing CTO.

  1. Greenplum: a shared-nothing MPP architecture built from the ground up for BI and analytical processing using commodity hardware  

Original title and link: New Tools in the NoSQL and Big Data Market (NoSQL databases © myNoSQL)