


BigData: All content tagged as BigData in NoSQL databases and polyglot persistence

Enterprise Hadoop Market in 2013: Reflections and Directions

At the end of last year, Shaun Connolly (Hortonworks) posted a fantastic blog post looking at the Hadoop market and its future, reflecting on the open source community and its ability to continuously innovate at a fast pace, and putting all of this in a business perspective using the history of RedHat.

It is a must read.

Peter Goldmacher (analyst Cowen & Co):

“We believe Hadoop is a big opportunity and we can envision a small number of billion dollar companies based on Hadoop. We think the bigger opportunity is Apps and Analytics companies selling products that abstract the complexity of working with Hadoop from end users and sell solutions into a much larger end market of business users. The biggest opportunity in our mind, by far, is the Big Data Practitioners that create entirely new business opportunities based on data where $1M spent on Hadoop is the backbone of a $1B business.”

Original title and link: Enterprise Hadoop Market in 2013: Reflections and Directions (NoSQL database©myNoSQL)


2013 and 2014 for Hadoop adoption

Syncsort’s Keith Kohl, in a guest post on Hortonworks’s blog (on an unrelated topic):

I heard a quote the other day that really made me think about the experiences I hear from our customers and partners: 2013 was the year companies tried to find budget for Hadoop, 2014 is the year they ARE budgeting for Hadoop projects.

If I remember correctly, Gartner’s data doesn’t fully support this, but on the other hand I’m convinced that more projects using Hadoop will be rolled into production this year. The only questions left to be answered:

  1. will this number grow significantly?
  2. what distributions will see most of the growth?

Original title and link: 2013 and 2014 for Hadoop adoption (NoSQL database©myNoSQL)


Big Data's 2 big years is actually Hadoop

Doug Henschen makes two great points:

  1. Everyone wants to sell Hadoop:

    Practically every vendor out there has embraced Hadoop, going well beyond the fledgling announcements and primitive “connectors” that were prevalent two years ago. Industry heavyweights IBM, Microsoft, Oracle, Pivotal, SAP, and Teradata are all selling and supporting Hadoop distributions — partnering, in some cases, with Cloudera and Hortonworks. Four of these six have vendor-specific distributions, Hadoop appliances, or both.

  2. Then everyone is building SQL-on-Hadoop.

Original title and link: Big Data’s 2 big years is actually Hadoop (NoSQL database©myNoSQL)


SQL on Hadoop: An overview of frameworks and their applicability

An overview of the 3 SQL-on-Hadoop execution models — batch (10s of minutes and up), interactive (up to minutes), operational (sub-second) — their applicability to different classes of applications, and the main characteristics of the tools/frameworks in each of these categories:

Within the big data landscape there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. Each depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.

The usual suspects are included: Hive, Impala, Presto, Spark/Shark, Drill.
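The three execution models above boil down to latency tiers. A minimal sketch of that taxonomy — the function name and the exact thresholds are illustrative, not from the overview:

```python
# Hypothetical sketch of the 3 SQL-on-Hadoop execution models as latency
# tiers. Thresholds are illustrative readings of "sub-second", "up to
# minutes", and "10s of minutes and up".

BATCH = "batch"              # 10s of minutes and up (e.g. Hive on MapReduce)
INTERACTIVE = "interactive"  # seconds to minutes (e.g. Impala, Presto, Shark)
OPERATIONAL = "operational"  # sub-second

def execution_model(expected_latency_s: float) -> str:
    """Map an expected query latency (seconds) to an execution model."""
    if expected_latency_s < 1:
        return OPERATIONAL
    if expected_latency_s < 600:  # up to ~10 minutes
        return INTERACTIVE
    return BATCH
```

The point of the overview is exactly this mapping: pick the framework category from the latency budget of the application, then weigh SQL completeness and skillsets within that category.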


Original title and link: SQL on Hadoop: An overview of frameworks and their applicability (NoSQL database©myNoSQL)


Heterogeneous storages in HDFS

In my post about in-memory databases vs Aster Data and Greenplum vs Hadoop market share, I proposed a scenario in which Aster Data and Greenplum could expand into the space of in-memory databases by employing hybrid storage.

What I haven’t covered in that post is the possibility of Hadoop, actually HDFS, expanding into hybrid storage.

But that’s happening already and Hortonworks is already working on introducing support for heterogeneous storages in HDFS:

We plan to introduce the idea of Storage Preferences for files. A Storage Preference is a hint to HDFS specifying how the application would like block replicas for the given file to be placed. Initially the Storage Preference will include:

  1. The desired number of file replicas (also called the replication factor) and;
  2. The target storage type for the replicas.
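The Storage Preference described in the quote is just a pair: a replication factor plus a target storage type. A minimal sketch of that idea — the class and the set of storage types are hypothetical, not the actual HDFS API:

```python
# Hypothetical model of the "Storage Preference" hint quoted above:
# a desired replication factor and a target storage type for the replicas.
# Storage type names are illustrative.

from dataclasses import dataclass

STORAGE_TYPES = {"DISK", "SSD", "RAM_DISK", "ARCHIVE"}

@dataclass(frozen=True)
class StoragePreference:
    replication: int   # desired number of block replicas
    storage_type: str  # target medium for those replicas

    def __post_init__(self):
        if self.replication < 1:
            raise ValueError("replication factor must be >= 1")
        if self.storage_type not in STORAGE_TYPES:
            raise ValueError(f"unknown storage type: {self.storage_type}")

# e.g. a hot file: 3 replicas, preferably kept on SSD
hot = StoragePreference(replication=3, storage_type="SSD")
```

Because it is only a hint, HDFS remains free to fall back to another medium when the preferred one is full — which is what makes hybrid (memory/SSD/disk) tiering workable.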

Even if memory costs resume the steady decline they showed before 2012, when they flat-lined, a cost-effective architecture will almost always rely on hybrid storage.

Original title and link: Heterogeneous storages in HDFS (NoSQL database©myNoSQL)

Hadoop and Enterprise Data Hubs: Aspirational Marketing

Merv Adrian:

In those same shops, there are thousands of significant database instances, and tens of thousands of applications — and those are conservative numbers. So the first few Hadoop applications will represent a toehold in their information infrastructure. It will be a significant beachhead, and it will grow as long as the community of vendors and open source committers deliver on the exciting promise of added functionality we see described in the budding Hadoop 2.0 era, adding to its early successes in some analytics and data integration workloads.

So “Enterprise Data Hub?” Not yet. At best in 2014, Hadoop will begin to build a role as part of an Enterprise Data Spoke in some shops.

This is today. Tomorrow might be Data Lakes.

Original title and link: Hadoop and Enterprise Data Hubs: Aspirational Marketing (NoSQL database©myNoSQL)


Where is Big Data heading? The Data Lakes

Edd Dumbill for Forbes:

The data lake dream is of a place with data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment. Applications are no longer islands, and exist within the data cloud, taking advantage of high bandwidth access to data and scalable computing resource. Data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise.

Basically this perspective expands on the decentralized data model by adding an application layer living inside the data platform. That’s the part I don’t fully grok. Having access to all data and being close to the data are both different from having the application inside the data platform. And I really hope this doesn’t imply anything like huge PL/SQL-ish apps1.

  1. I actually think that this description is based mostly on the YARN architecture.  

Original title and link: Where is Big Data heading? The Data Lakes (NoSQL database©myNoSQL)


Performance advantages of the new Google Cloud Storage Connector for Hadoop

This guest post by Mike Wendt from Accenture Technology provides some very good answers to the questions I had about the recently announced Hadoop connector for Google Cloud Storage: how does it behave compared to local storage (data locality), what is the performance of accessing Google Cloud Storage directly from Hadoop, and, last but essential for cloud setups, what are the cost implications:

From our study, we can see that remote storage powered by the Google Cloud Storage connector for Hadoop actually performs better than local storage. The increased performance can be seen in all three of our workloads to varying degrees based on their access patterns. […] Availability of the files, and their chunks, is no longer limited to three copies within the cluster, which eliminates the dependence on the three nodes that contain the data to process the file or to transfer the file to an available node for processing.

[…] This availability of remote storage on the scale and size provided by Google Cloud Storage unlocks a unique way of moving and storing large amounts of data that is not available with bare-metal deployments.

If you are looking just for the conclusions:

First, cloud-based Hadoop deployments offer better price-performance ratios than bare-metal clusters. Second, the benefit of performance tuning is so huge that cloud’s virtualization layer overhead is a worthy investment as it expands performance-tuning opportunities. Third, despite the sizable benefit, the performance-tuning process is complex and time-consuming and thus requires automated tuning tools.

✚ Keep in mind though that this study was posted on the Google Cloud Platform blog, so you could expect the results to favor Google’s platform.

Original title and link: Performance advantages of the new Google Cloud Storage Connector for Hadoop (NoSQL database©myNoSQL)


What is Intel doing in the Hadoop business?

In case you forgot, Intel offers a distribution of Hadoop. Tony Baer, Principal Analyst at Ovum, explains why Intel created this distribution:

The answer is that Hadoop is becoming the scale-out compute pillar of Intel’s emerging Software-Defined Infrastructure initiative for the data center – a vision that virtualizes general- and special-purpose CPUs powering functions under common Intel hardware-based components. The value proposition that Intel is proposing is that embedding serving, network, and/or storage functions into the chipset is a play for public or private cloud – supporting elasticity through enabling infrastructure to reshape itself dynamically according to variable processing demands.

Theoretically it sounds plausible. But as with the other attempts to explain Intel’s Hadoop distribution, I don’t find this one convincing either. Unfortunately I don’t have a better explanation myself, so I’ll keep asking.

Original title and link: What is Intel doing in the Hadoop business? (NoSQL database©myNoSQL)


Pig vs MapReduce: When, Why, and How

Donald Miner, author of MapReduce Design Patterns and CTO at ClearEdge IT Solutions, discusses how he chooses between Pig and MapReduce, considering developer and processing time, maintainability and deployment, and repurposing engineers who are new to Java and Pig.

Video and slides after the break.
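One trade-off in this choice is boilerplate: raw MapReduce spells out map, shuffle, and reduce phases explicitly, while Pig Latin expresses the same pipeline as a handful of relational statements. An illustrative sketch (plain Python, not actual Hadoop code) of the classic word count written MapReduce-style:

```python
# Illustrative sketch of the explicit map / shuffle / reduce structure of a
# raw MapReduce job, using the classic word count. Not Hadoop code.

from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    # Mapper: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> dict:
    # Shuffle: group all values by key, as the framework would.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped: dict) -> dict:
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle_phase(map_phase(["Pig or MapReduce", "pig latin"])))
```

The equivalent Pig script would be roughly four statements (LOAD, TOKENIZE/FOREACH, GROUP, COUNT), which is the developer-time argument Miner weighs against MapReduce’s finer control over processing.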

Scale-up vs Scale-out for Hadoop: Time to rethink?

A paper authored by a Microsoft Research team:

In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment. Is this the right approach?

The main premise of the paper is based on different reports showing that “the majority of analytics jobs do not process huge data sets”. The authors cite publications from production clusters at Microsoft, Yahoo, and Facebook that put the median input size under 14GB (for Microsoft and Yahoo) and under 100GB for 90% of Facebook’s jobs. Obviously, this working hypothesis is critical for the rest of the paper.

Another important part for understanding and interpreting the results of this paper is the section on Optimizing Storage:

Storage bottlenecks can easily be removed either by using SSDs or by using one of many scalable back-end solutions (SAN or NAS in the enterprise scenario, e.g. [23], or Amazon S3/Windows Azure in the cloud scenario). In our experimental setup which is a small cluster we use SSDs for both the scale-up and the scale-out machines.

First, the common knowledge in the Hadoop community is to always avoid using SAN and NAS (to ensure data locality); I’m not referring here to the Hadoop reference architectures coming from storage vendors. Still, in the scale-up scenario, NAS/SAN can make sense for accommodating storage needs that would exceed the capacity and resilience of the scaled-up machine. But I expect that using such storage would change the total cost picture, and unfortunately the paper does not provide an analysis of it.

The other option, using SSDs for storage, implies either that the input size is the same as the total size of stored data, or that the cost of moving and loading the data to be processed is close to zero. Neither of these is true.
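The paper’s premise is easy to check with back-of-envelope arithmetic: if the median job reads ~14GB, even a generously inflated working set fits in the RAM of one large server, and no network shuffle is needed. A hypothetical sketch — the inflation factor and server size are assumptions, not figures from the paper:

```python
# Back-of-envelope sketch of the scale-up argument: does a job's working set
# fit on a single big server? The 3x intermediate-data inflation factor and
# the 512 GB server are illustrative assumptions.

def fits_in_memory(input_gb: float, server_ram_gb: float,
                   working_set_factor: float = 3.0) -> bool:
    """True if input plus assumed intermediate data fits in one server's RAM."""
    return input_gb * working_set_factor <= server_ram_gb

# The ~14 GB median job from the Microsoft/Yahoo traces vs a 512 GB box:
median_fits = fits_in_memory(14, 512)
# A 10 TB job clearly does not fit, and still needs the scale-out cluster:
big_fits = fits_in_memory(10_000, 512)
```

This is exactly why the input-size hypothesis is critical: the scale-up conclusion holds for the median job, not for the tail that motivated Hadoop in the first place.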


Everything is faster than Hive

Derrick Harris has brought together a series of benchmarks conducted by the different SQL-on-Hadoop implementors comparing their solutions (Impala, Stinger/Tez, HAWQ, Shark) with Hive:

For what it’s worth, everyone is faster than Hive — that’s the whole point of all of these SQL-on-Hadoop technologies. How they compare with each other is harder to gauge, and a determination probably best left to individual companies to test on their own workloads as they’re making their own buying decisions. But for what it’s worth, here is a collection of more benchmark tests showing the performance of various Hadoop query engines against Hive, relational databases and, sometimes, themselves.

As Derrick Harris remarks, the only direct comparisons are the one between HAWQ and Impala (which seems dated, as it mentions Impala being in beta) and the benchmark run by AMPLab (the team behind Shark) comparing Redshift, Hive, Shark, and Impala.

The good part is that both the Hive Testbench and AMPlab benchmark are available on GitHub.

Original title and link: Everything is faster than Hive (NoSQL database©myNoSQL)