NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



NetApp: All content tagged as NetApp in NoSQL databases and polyglot persistence

Comparing Hadoop Appliances: Oracle’s Big Data Appliance, EMC Greenplum DCA, Netapp Hadooplers

Great post from Gwen Shapira over Pythian diving into the pros and cons of Hadoop appliances vs building your own Hadoop clusters. Plus a comparison of existing Hadoop appliances: Oracle Big Data Appliance, EMC Greenplum DCA, and Netapp Hadooplers.

Another good reason to roll your own is the flexibility: Appliances are called that way because they have a very specific configuration. You get a certain number of nodes, cpus, RAM and storage. Oracle’s offering is an 18 node rack. What if you want 12 nodes? or 23? tough luck. What if you want less RAM and more CPU? you are still stuck.

Original title and link: Comparing Hadoop Appliances: Oracle’s Big Data Appliance, EMC Greenplum DCA, Netapp Hadooplers (NoSQL database©myNoSQL)


A Short Incursion Into Alternate Hadoop Filesystems

Steve Loughran starts with a critical look at Netapp Open solution for Hadoop paper:

Actually it is weirder than I first thought. This is still HDFS, just running on more expensive hardware. You get the (current) HDFS limitations: no native filesystem mounting, a namenode to care about, security on a par with NFS, without the cost savings of pure-SATA-no-licensing-fees. Instead you have to use RAID everywhere, which not only bumps up your cost of storage, puts you at risk of RAID controller failure and errors in the OS drivers for those controller (hence their strict rules about which Linux releases to trust). If you do follow their recommendations and rely on hardware for data integrity, you’ve cut down the probability of node-local job execution, so all FUD about replication traffic is now moot as at least 1/3 more of your tasks will be running remote -possibly even with the Fair Scheduler, which waits for a bit to see if a local slot becomes free. What they are doing then is adding some HA hardware underneath a filesystem that is designed to give strong availability out of medium availability hardware. I have seen such a design before, and thought it sucked then too.  Information week says this is a response to EMC, but it looks more like NetApp’s strategy to stay relevant, and Cloudera are partnering with them as NetApp offered them money and if it sells into more “enterprise customers” then why not? With the extra hardware costs of NetApp the cloudera licenses will look better value, and clearly both NetApp and their customers are in need of the hand-holding that Cloudera can offer.

Then in a follow up post, he looks at a couple of alternatives (Lustre, GPFS, IBRIX, etc):

I’m not against running MapReduce—or the entire Hadoop stack—against alternate filesystems. There are some good cases where it makes sense. Other filesystems offer security, NFS mounting, the ability to be used by other applications and other features. HDFS is designed to scale well on “commodity” hardware, (where servers containing Xeon E5 series parts with 64GB RAM, 10GbE and 8-12 SFF HDDs are considered a subset of “commodity”).

Original title and link: A Short Incursion Into Alternate Hadoop Filesystems (NoSQL database©myNoSQL)

Petabyte-Scale Hadoop Clusters

Curt Monash quoting Omer Trajman (Cloudera) in a post counting petabyte-scale Hadoop deployments:

The number of Petabyte+ Hadoop clusters expanded dramatically over the past year, with our recent count reaching 22 in production (in addition to the well-known clusters at Yahoo! and Facebook). Just as our poll back at Hadoop World 2010 showed the average cluster size at just over 60 nodes, today it tops 200. While mean is not the same as median (most clusters are under 30 nodes), there are some beefy ones pulling up that average. Outside of the well-known large clusters at Yahoo and Facebook, we count today 16 organizations running PB+ clusters running CDH across a diverse number of industries including online advertising, retail, government, financial services, online publishing, web analytics and academic research. We expect to see many more in the coming years, as Hadoop gets easier to use and more accessible to a wide variety of enterprise organizations.

First questions that bumped in my head after reading it:

  1. How many deployments DataStax’ Brisk has? How many close or over petabyte?
  2. How many clients run EMC Greenplum HD and how many are close to this scale?
  3. Same question about NetApp Hadoopler clients.
  4. Same question for MapR.

Answering these questions would give us a good overview of the Hadoop ecosystem.

Original title and link: Petabyte-Scale Hadoop Clusters (NoSQL database©myNoSQL)


NetApp Hadoop Open Storage System

Did I say everyone wants a piece of Hadoop’s future profit?

From an infrastructure perspective, NetApp will simplify setup & scale with our first Hadoop Open Storage System (HOSS) informally aka “Hadoopler”, while delivering improved performance and storage efficiency. Although we recognize that RAID is not yet generally considered an HDFS best-practice within historical Hadoop whitepapers and Wiki’s, NetApp aims to work with the Hadoop community to educate and change that perception for our targeted Enterprise HDFS use-cases.

The good news is that NetApp will not create yet another proprietary Hadoop distribution, but rather focus on providing HDFS RAID configurations for DataNodes which separate data protection from job and query completion.

Original title and link: NetApp Hadoop Open Storage System (NoSQL database©myNoSQL)


Hadoop Ecosystem: EMC, NetApp, Mellanox, SnapLogic, DataStax

GigaOm and RWW have coverage of the 5 Hadoop-related announcements:

  • DataStax Brisk: Hadoop and Hive on Cassandra
  • NetApp Hadoop Shared DAS
  • Mellanox Hadoop-Direct

    increase throughput in Hadoop clusters via its ConnectX-2 adapters with Hadoop Direct

  • SnapLogic SnapReduce

    SnapReduce transforms SnapLogic data integration pipelines directly into MapReduce tasks, making Hadoop processing much more accessible and resulting in optimal Hadoop cluster utilization.

  • EMC GreenplumHD

    Greenplum HD combines the Hadoop analytics platform with Greenplum’s database technology.

Ways to look at it:

  • 2 large corporations getting into Hadoop
  • 2 software solutions, 3 hardware solutions
  • 1 open source project, 4 commercial products or
  • 4 companies wanting to make a profit from Hadoop without contributing back to the community

Original title and link: Hadoop Ecosystem: EMC, NetApp, Mellanox, SnapLogic, DataStax (NoSQL databases © myNoSQL)

NetApp Hadoop Shared DAS

In preparation for the EMC Hadoop related announcement:

Shared DAS addresses the inevitable storage capacity growth requirements of Hadoop nodes in a cluster by placing disks in an external shelf shared by multiple directly attached hosts (aka Hadoop compute nodes). The connectivity from host to disk can be SATA, SAS, SCSI or even Ethernet, but always in a direct rather than networked storage configuration.


Therefore the three dimensions of Shared DAS benefit are:

  1. NetApp E-Series Shared DAS solutions can dramatically reduce the amount of background replication tasks by employing highly efficient RAID configurations to offload post-disk failure reconstruction tasks from the Hadoop cluster compute nodes and cluster network,
  2. When compared against single disk I/O configuration of regular Hadoop nodes, NetApp E-Series Shared DAS enables significantly higher disk I/O bandwidth at lower latency due to wide striping within the shelf, and finally,
  3. NetApp E-Series Shared DAS improves storage efficiency by reducing the number of object replicas within a rack using low-overhead high-performance RAID. Fewer replicas mean less disks to buy or more objects stored within the same infrastructure.

But it can also be connected to DataStax Brisk .

Original title and link: NetApp Hadoop Shared DAS (NoSQL databases © myNoSQL)