


bigdata: All content tagged as bigdata in NoSQL databases and polyglot persistence

Does Hadoop replace or augment the enterprise data warehouse?

Wayne Eckerson:

For Cloudera, the first vendor to offer a Hadoop distribution, the answer is an unequivocal yes. Last November, Cloudera finally exposed its true sentiments by introducing the Enterprise Data Hub in which Hadoop replaces the data warehouse, among other things, as the center of an organization’s data management strategy. In contrast, Hortonworks takes a hybrid approach, partnering with leading commercial data management and analytics vendors to create a data environment that blends the best of Hadoop and commercial software. In short, Cloudera offers revolution, Hortonworks evolution.

You know what? Both are right. To replace existing enterprise data warehouses, the first step is cohabiting with them.

Original title and link: Does Hadoop replace or augment the enterprise data warehouse? (NoSQL database©myNoSQL)

Investments in the Hadoop market in 2013

A post looking at the investments made in the Hadoop market in 2013:

In 2013, the zeitgeist around big data hit a fever pitch, and with that surge came venture capital love for the Hadoop ecosystem – to the tune of $270 million. On a year-over-year basis, Hadoop VC funding grew 50% while deal activity rose 30%.

To see investment relationships you could use the beautiful Big Data investment map 2014 — I wish it was easier to navigate though.

Back to the original post, there were two aspects that caught my attention:

  1. The majority of funding growth in 2013 came from Series A. This could mean two things: 1) investors consider the market still open; or 2) many investors realized the potential of this market quite late and are now trying to make up for their late reaction. I’d go with the first option though.
  2. There seems to be quite a bit of variability (or inconsistency) in the investments made in the big data market since 2012. This chart shows exactly what I mean:

    [Chart: Hadoop investments]



A crisis of data confidence

Michael Vizard:

A recent survey of 442 business executives conducted by Harvard Business Review Analytics Services at the behest of QlikTech, provider of QlikView business intelligence software, finds that only 16 percent of the executives surveyed were confident in the accuracy of the data they used to make business decisions. Another 42 percent said they were not confident in their decisions simply because they couldn’t get access to all the relevant data they needed.

Just another data point for those that still believe that more data cannot help.



HDFS Explorer: Accessing HDFS from Windows Explorer

HDFS Explorer, by Red Gate Big Data:

At Red Gate we have been working on some query tools for Hadoop for a while and while testing we found ourselves endlessly typing hadoop fs. Getting data sets from our Windows desktops, to the cluster, or inspecting job output files was just taking too many steps. It should be as easy for us to access files on HDFS as files on my local drive. So we created HDFS Explorer, which works just like Windows Explorer, but connects to the WebHDFS APIs so we can browse files on our clusters.

Solving a pain point. Making HDFS more accessible and thus friendlier. Very good reasons for such a tool.
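
For readers who have not used WebHDFS, the REST API mentioned in the quote exposes file system operations as HTTP calls on the NameNode. The following is a minimal sketch of a directory listing, not a description of how HDFS Explorer itself is implemented. The host name, port, path, and both helper functions are assumptions made for illustration; the response is a canned sample in the documented `LISTSTATUS` shape so the sketch runs without a cluster.

```python
import json
from urllib.parse import quote, urlencode


def liststatus_url(host, port, path, user=None):
    """Build the WebHDFS URL that lists the directory at `path`."""
    params = {"op": "LISTSTATUS"}
    if user:
        # Optional user name for clusters running without security enabled.
        params["user.name"] = user
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


def parse_liststatus(body):
    """Turn a LISTSTATUS JSON response into (name, type, size) tuples."""
    statuses = json.loads(body)["FileStatuses"]["FileStatus"]
    return [(s["pathSuffix"], s["type"], s["length"]) for s in statuses]


# Canned response (hypothetical file names) so the example runs offline.
SAMPLE = json.dumps({
    "FileStatuses": {
        "FileStatus": [
            {"pathSuffix": "part-00000", "type": "FILE", "length": 1024},
            {"pathSuffix": "_logs", "type": "DIRECTORY", "length": 0},
        ]
    }
})

# Hypothetical NameNode address; substitute your own cluster's host and port.
print(liststatus_url("namenode.example.com", 50070, "/user/alice/output"))
print(parse_liststatus(SAMPLE))
```

Against a real cluster, the URL would be fetched with any HTTP client and the response body passed to `parse_liststatus`; that is roughly the work that typing `hadoop fs -ls` does from the command line.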



Enterprise Hadoop Market in 2013: Reflections and Directions

At the end of last year, Shaun Connolly (Hortonworks) posted a fantastic blog post looking at the Hadoop market and its future, reflecting on the open source community and its ability to continuously innovate at a fast pace, and putting all of this in perspective from a business point of view using the history of Red Hat.

It is a must read.

Peter Goldmacher (analyst at Cowen & Co):

“We believe Hadoop is a big opportunity and we can envision a small number of billion dollar companies based on Hadoop. We think the bigger opportunity is Apps and Analytics companies selling products that abstract the complexity of working with Hadoop from end users and sell solutions into a much larger end market of business users. The biggest opportunity in our mind, by far, is the Big Data Practitioners that create entirely new business opportunities based on data where $1M spent on Hadoop is the backbone of a $1B business.”



2013 and 2014 for Hadoop adoption

Syncsort’s Keith Kohl, in a guest post on Hortonworks’s blog (on an unrelated topic):

I heard a quote the other day that really made me think about the experiences I hear from our customers and partners: 2013 was the year companies tried to find budget for Hadoop, 2014 is the year they ARE budgeting for Hadoop projects.

If I remember correctly, Gartner’s data doesn’t fully support this, but on the other hand I’m convinced that more projects using Hadoop will be rolled into production this year. The only questions to be answered:

  1. will this number grow significantly?
  2. what distributions will see most of the growth?



Big Data's 2 big years is actually Hadoop

Doug Henschen makes two great points:

  1. Everyone wants to sell Hadoop:

    Practically every vendor out there has embraced Hadoop, going well beyond the fledgling announcements and primitive “connectors” that were prevalent two years ago. Industry heavyweights IBM, Microsoft, Oracle, Pivotal, SAP, and Teradata are all selling and supporting Hadoop distributions — partnering, in some cases, with Cloudera and Hortonworks. Four of these six have vendor-specific distributions, Hadoop appliances, or both.

  2. Then everyone is building SQL-on-Hadoop.



SQL on Hadoop: An overview of frameworks and their applicability

An overview of the 3 SQL-on-Hadoop execution models: batch (10s of minutes and up), interactive (up to minutes), and operational (sub-second). It covers the kinds of applications each model suits and the main characteristics of the tools/frameworks in each category:

Within the big data landscape there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. Each depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.

The usual suspects are included: Hive, Impala, Presto, Spark/Shark, Drill.




Heterogeneous storages in HDFS

In my post about in-memory databases vs Aster Data and Greenplum vs Hadoop market share, I proposed a scenario in which Aster Data and Greenplum could expand into the space of in-memory databases by employing hybrid storage.

What I didn’t cover in that post is the possibility of Hadoop, or more precisely HDFS, expanding into hybrid storage.

But that’s happening already: Hortonworks is working on introducing support for heterogeneous storages in HDFS:

We plan to introduce the idea of Storage Preferences for files. A Storage Preference is a hint to HDFS specifying how the application would like block replicas for the given file to be placed. Initially the Storage Preference will include:

  1. The desired number of file replicas (also called the replication factor) and;
  2. The target storage type for the replicas.
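
To make the idea of a "hint" concrete, here is a minimal sketch of what such a preference could look like as a data structure. It is not the HDFS API: the names `StorageType`, `StoragePreference`, and `place_replicas`, the set of storage tiers, and the fallback rule are all assumptions made for illustration. The only parts taken from the quote are the two fields (replication factor and target storage type) and the fact that the preference is advisory.

```python
from dataclasses import dataclass
from enum import Enum


class StorageType(Enum):
    # Hypothetical tiers; the quoted post does not enumerate them.
    DISK = "disk"
    SSD = "ssd"
    MEMORY = "memory"


@dataclass(frozen=True)
class StoragePreference:
    """A hint about replica placement, not a guarantee."""
    replication: int = 3
    storage_type: StorageType = StorageType.DISK

    def __post_init__(self):
        if self.replication < 1:
            raise ValueError("replication factor must be at least 1")


def place_replicas(pref, available):
    """Choose a storage type for each replica.

    `available` maps each StorageType to its number of free slots.
    The preferred type is used while it has room; after that, the
    sketch falls back to any other type with room (an assumed policy).
    """
    free = dict(available)
    placements = []
    for _ in range(pref.replication):
        if free.get(pref.storage_type, 0) > 0:
            chosen = pref.storage_type
        else:
            chosen = next((t for t, n in free.items() if n > 0), None)
            if chosen is None:
                raise RuntimeError("no storage left for replica")
        free[chosen] -= 1
        placements.append(chosen)
    return placements


pref = StoragePreference(replication=3, storage_type=StorageType.SSD)
print(place_replicas(pref, {StorageType.SSD: 2, StorageType.DISK: 5}))
```

With only two SSD slots free, the third replica lands on disk. That is the behavior one would expect from a hint rather than a hard requirement.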

Even if the cost of memory continues to decrease at the same rate as before 2012, when it flat-lined, a cost-effective architecture will almost always rely on hybrid storage.


Hadoop and Enterprise Data Hubs: Aspirational Marketing

Merv Adrian:

In those same shops, there are thousands of significant database instances, and tens of thousands of applications — and those are conservative numbers. So the first few Hadoop applications will represent a toehold in their information infrastructure. It will be a significant beachhead, and it will grow as long as the community of vendors and open source committers deliver on the exciting promise of added functionality we see described in the budding Hadoop 2.0 era, adding to its early successes in some analytics and data integration workloads.

So “Enterprise Data Hub?” Not yet. At best in 2014, Hadoop will begin to build a role as part of an Enterprise Data Spoke in some shops.

This is today. Tomorrow might be Data Lakes.



Where is Big Data heading? The Data Lakes

Edd Dumbill for Forbes:

The data lake dream is of a place with data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment. Applications are no longer islands, and exist within the data cloud, taking advantage of high bandwidth access to data and scalable computing resource. Data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise.

Basically this perspective expands on the decentralized data model by adding an application layer living inside the data platform. That’s the part I don’t fully grok. Having access to all data and being close to data are both different from having the application inside the data platform. And I really hope this doesn’t imply anything like huge PL/SQLish apps [1].

  [1] I actually think that this description is based mostly on the YARN architecture.



Performance advantages of the new Google Cloud Storage Connector for Hadoop

This guest post by Mike Wendt from Accenture Technology provides some very good answers to the questions I had about the recently announced Hadoop connector for Google Cloud Storage: how does it behave compared to local storage (data locality), what is the performance of accessing Google Cloud Storage directly from Hadoop, and, last but essential for cloud setups, what are the cost implications:

From our study, we can see that remote storage powered by the Google Cloud Storage connector for Hadoop actually performs better than local storage. The increased performance can be seen in all three of our workloads to varying degrees based on their access patterns. […] Availability of the files, and their chunks, is no longer limited to three copies within the cluster, which eliminates the dependence on the three nodes that contain the data to process the file or to transfer the file to an available node for processing.

[…] This availability of remote storage on the scale and size provided by Google Cloud Storage unlocks a unique way of moving and storing large amounts of data that is not available with bare-metal deployments.

If you are looking just for the conclusions:

First, cloud-based Hadoop deployments offer better price-performance ratios than bare-metal clusters. Second, the benefit of performance tuning is so huge that cloud’s virtualization layer overhead is a worthy investment as it expands performance-tuning opportunities. Third, despite the sizable benefit, the performance-tuning process is complex and time-consuming and thus requires automated tuning tools.

✚ Keep in mind though that this study was posted on the Google Cloud Platform, so you could expect the results to beat the competition.
