mapreduce: All content tagged as mapreduce in NoSQL databases and polyglot persistence

Investments in the Hadoop market in 2013

A post looking at the investments made in the Hadoop market in 2013:

In 2013, the zeitgeist around big data hit a fever pitch, and with that surge came venture capital love for the Hadoop ecosystem – to the tune of $270 million. On a year-over-year basis, Hadoop VC funding grew 50% while deal activity rose 30%.

To see investment relationships you could use the beautiful Big Data investment map 2014, though I wish it were easier to navigate.

Back to the original post, there were two aspects that caught my attention:

  1. The majority of funding growth in 2013 came from Series A rounds. This could mean one of two things: 1) investors consider the market still open; or 2) many investors realized the potential of this market quite late and are now trying to make up for their slow reaction. I’d go with the first option though.
  2. There seems to be quite a bit of variability (or inconsistency) in the investments made in the big data market since 2012. This chart shows exactly what I mean:

    [chart: Hadoop investments]

Original title and link: Investments in the Hadoop market in 2013 (NoSQL database©myNoSQL)

via: http://www.cbinsights.com/blog/trends/hadoop-venture-capital-investors


HDFS Explorer: Accessing HDFS from Windows Explorer

HDFS Explorer, by Red Gate Big Data:

At Red Gate we have been working on some query tools for Hadoop for a while and while testing we found ourselves endlessly typing hadoop fs. Getting data sets from our Windows desktops, to the cluster, or inspecting job output files was just taking too many steps. It should be as easy for us to access files on HDFS as files on my local drive. So we created HDFS Explorer, which works just like Windows Explorer, but connects to the WebHDFS APIs so we can browse files on our clusters.

Solving a pain point. Making HDFS more accessible and thus friendlier. Very good reasons for such a tool.
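
Under the hood this is plain HTTP: WebHDFS exposes file system operations as REST calls, which is what makes a tool like this possible. A minimal Java sketch of the same idea, listing a directory over WebHDFS (the namenode host and path are hypothetical; 50070 was the default namenode HTTP port at the time):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsList {
        public static void main(String[] args) throws Exception {
            // LISTSTATUS is one of the standard WebHDFS operations
            URL url = new URL(
                "http://namenode.example.com:50070/webhdfs/v1/user/demo?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON list of FileStatus objects
                }
            }
        }
    }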

Original title and link: HDFS Explorer: Accessing HDFS from Windows Explorer (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/windows-explorer-experience-hdfs/


Enterprise Hadoop Market in 2013: Reflections and Directions

At the end of last year, Shaun Connolly (Hortonworks) posted a fantastic blog looking at the Hadoop market and its future, reflecting on the open source community and its ability to continuously innovate at a fast pace, and putting all of this in a business perspective using the history of Red Hat.

It is a must read.

Peter Goldmacher (analyst at Cowen & Co):

“We believe Hadoop is a big opportunity and we can envision a small number of billion dollar companies based on Hadoop. We think the bigger opportunity is Apps and Analytics companies selling products that abstract the complexity of working with Hadoop from end users and sell solutions into a much larger end market of business users. The biggest opportunity in our mind, by far, is the Big Data Practitioners that create entirely new business opportunities based on data where $1M spent on Hadoop is the backbone of a $1B business.”

Original title and link: Enterprise Hadoop Market in 2013: Reflections and Directions (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/enterprise-hadoop-market-in-2013-reflections-and-directions/


2013 and 2014 for Hadoop adoption

Syncsort’s Keith Kohl, in a guest post on Hortonworks’s blog (on an unrelated topic):

I heard a quote the other day that really made me think about the experiences I hear from our customers and partners: 2013 was the year companies tried to find budget for Hadoop, 2014 is the year they ARE budgeting for Hadoop projects.

If I remember correctly, Gartner’s data doesn’t fully support this, but on the other hand I’m convinced that more projects using Hadoop will be rolled into production this year. The only questions to be answered:

  1. will this number grow significantly?
  2. what distributions will see most of the growth?

Original title and link: 2013 and 2014 for Hadoop adoption (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/hadoop-2-yarn-big-deal-syncsort/


Big Data's 2 big years is actually Hadoop

Doug Henschen makes two great points:

  1. Everyone wants to sell Hadoop:

    Practically every vendor out there has embraced Hadoop, going well beyond the fledgling announcements and primitive “connectors” that were prevalent two years ago. Industry heavyweights IBM, Microsoft, Oracle, Pivotal, SAP, and Teradata are all selling and supporting Hadoop distributions — partnering, in some cases, with Cloudera and Hortonworks. Four of these six have vendor-specific distributions, Hadoop appliances, or both.

  2. Then everyone is building SQL-on-Hadoop.

Original title and link: Big Data’s 2 big years is actually Hadoop (NoSQL database©myNoSQL)

via: http://www.informationweek.com/software/information-management/big-datas-2-big-years/d/d-id/1113664


SQL on Hadoop: An overview of frameworks and their applicability

An overview of the three SQL-on-Hadoop execution models — batch (tens of minutes and up), interactive (up to minutes), and operational (sub-second) — their applicability to different classes of applications, and the main characteristics of the tools/frameworks in each category:

Within the big data landscape there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. Each depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.

The usual suspects are included: Hive, Impala, Presto, Spark/Shark, Drill.
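
For a sense of how applications actually consume these engines: most of them expose a JDBC interface, and Impala notably speaks the same HiveServer2 protocol as Hive, just on a different port (21050 instead of 10000 by default). A minimal Java sketch of submitting a query to HiveServer2 (host, table, and query are hypothetical):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SqlOnHadoopQuery {
        public static void main(String[] args) throws Exception {
            // Hive's JDBC driver; Impala accepts the same hive2 protocol
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://gateway.example.com:10000/default");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }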

    [diagram: SQL-on-Hadoop segments]

Original title and link: SQL on Hadoop: An overview of frameworks and their applicability (NoSQL database©myNoSQL)

via: http://www.mapr.com/products/sql-on-hadoop-details


Heterogeneous storages in HDFS

In my post about in-memory databases vs Aster Data and Greenplum vs Hadoop market share, I proposed a scenario in which Aster Data and Greenplum could expand into the space of in-memory databases by employing hybrid storage.

What I didn’t cover in that post is the possibility of Hadoop, or more precisely HDFS, expanding into hybrid storage.

But that’s happening already: Hortonworks is working on introducing support for heterogeneous storages in HDFS:

We plan to introduce the idea of Storage Preferences for files. A Storage Preference is a hint to HDFS specifying how the application would like block replicas for the given file to be placed. Initially the Storage Preference will include:

  1. The desired number of file replicas (also called the replication factor); and
  2. The target storage type for the replicas.

Even if memory costs resume decreasing at their pre-2012 rate (they have flat-lined since), a cost-effective architecture will almost always rely on hybrid storage.
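
As it happens, this plan later shipped as HDFS storage policies (Hadoop 2.6+). A minimal Java sketch of expressing both preferences from the list above (the path, replication factor, and ALL_SSD policy are illustrative choices, not from the original post):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StoragePreferenceDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path hotData = new Path("/data/serving/latest");

            // 1. the desired number of replicas (the replication factor)
            fs.setReplication(hotData, (short) 2);

            // 2. the target storage type for those replicas:
            //    keep every replica on SSD
            fs.setStoragePolicy(hotData, "ALL_SSD");
        }
    }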

Original title and link: Heterogeneous storages in HDFS (NoSQL database©myNoSQL)


Hadoop and Enterprise Data Hubs: Aspirational Marketing

Merv Adrian:

In those same shops, there are thousands of significant database instances, and tens of thousands of applications — and those are conservative numbers. So the first few Hadoop applications will represent a toehold in their information infrastructure. It will be a significant beachhead, and it will grow as long as the community of vendors and open source committers deliver on the exciting promise of added functionality we see described in the budding Hadoop 2.0 era, adding to its early successes in some analytics and data integration workloads.

So “Enterprise Data Hub?” Not yet. At best in 2014, Hadoop will begin to build a role as part of an Enterprise Data Spoke in some shops.

This is today. Tomorrow might be Data Lakes.

Original title and link: Hadoop and Enterprise Data Hubs: Aspirational Marketing (NoSQL database©myNoSQL)

via: http://blogs.gartner.com/merv-adrian/2014/01/17/aspirational-marketing-and-enterprise-data-hubs/


Performance advantages of the new Google Cloud Storage Connector for Hadoop

This guest post by Mike Wendt of Accenture Technology provides some very good answers to the questions I had about the recently announced Hadoop connector for Google Cloud Storage: how does it behave compared to local storage (data locality), what is the performance of accessing Google Cloud Storage directly from Hadoop, and, last but essential for cloud setups, what are the cost implications:

From our study, we can see that remote storage powered by the Google Cloud Storage connector for Hadoop actually performs better than local storage. The increased performance can be seen in all three of our workloads to varying degrees based on their access patterns. […] Availability of the files, and their chunks, is no longer limited to three copies within the cluster, which eliminates the dependence on the three nodes that contain the data to process the file or to transfer the file to an available node for processing.

[…] This availability of remote storage on the scale and size provided by Google Cloud Storage unlocks a unique way of moving and storing large amounts of data that is not available with bare-metal deployments.
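
For context on how jobs reach that remote storage: the connector registers a gs:// file system with Hadoop, so inputs and outputs are addressed like any other path. A minimal Java sketch (bucket and project id are hypothetical; the fs.gs.* property names come from the connector's configuration):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GcsConnectorDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Register the connector's FileSystem implementation
            conf.set("fs.gs.impl",
                     "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            conf.set("fs.gs.project.id", "my-project"); // hypothetical

            // gs:// paths now behave like any other Hadoop file system
            FileSystem gcs = FileSystem.get(new URI("gs://my-bucket/"), conf);
            for (FileStatus f : gcs.listStatus(new Path("gs://my-bucket/input/"))) {
                System.out.println(f.getPath() + "\t" + f.getLen());
            }
        }
    }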

If you are looking just for the conclusions:

First, cloud-based Hadoop deployments offer better price-performance ratios than bare-metal clusters. Second, the benefit of performance tuning is so huge that cloud’s virtualization layer overhead is a worthy investment as it expands performance-tuning opportunities. Third, despite the sizable benefit, the performance-tuning process is complex and time-consuming and thus requires automated tuning tools.

✚ Keep in mind though that this study was posted on the Google Cloud Platform, so you could expect the results to beat the competition.

Original title and link: Performance advantages of the new Google Cloud Storage Connector for Hadoop (NoSQL database©myNoSQL)

via: http://googlecloudplatform.blogspot.com/2014/01/performance-advantages-of-the-new-google-cloud-storage-connector-for-hadoop.html


What is Intel doing in the Hadoop business?

In case you forgot, Intel offers a distribution of Hadoop. Tony Baer, Principal Analyst at Ovum, explains why Intel created this distribution:

The answer is that Hadoop is becoming the scale-out compute pillar of Intel’s emerging Software-Defined Infrastructure initiative for the data center – a vision that virtualizes general- and special-purpose CPUs powering functions under common Intel hardware-based components. The value proposition that Intel is proposing is that embedding serving, network, and/or storage functions into the chipset is a play for public or private cloud – supporting elasticity through enabling infrastructure to reshape itself dynamically according to variable processing demands.

Theoretically it sounds possible. But as with the other attempts to explain Intel’s Hadoop distribution, I don’t believe this one either. Unfortunately I don’t have a better explanation myself, so I’ll keep asking.

Original title and link: What is Intel doing in the Hadoop business? (NoSQL database©myNoSQL)

via: http://ovum.com/2014/01/13/intel-refreshes-its-hadoop-distribution/


Pig vs MapReduce: When, Why, and How

Donald Miner, author of MapReduce Design Patterns and CTO at ClearEdge IT Solutions, discusses how he chooses between Pig and MapReduce, considering developer and processing time, maintainability and deployment, and repurposing engineers who are new to Java and Pig.

Video and slides after the break.
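
The developer-time side of that tradeoff is easiest to see on the classic word count example. For comparison, a sketch: the Pig Latin version, quoted in the comments, is the commonly cited handful of lines, while the hand-written Java MapReduce job below is an order of magnitude more code (input and output paths come from the command line):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // The roughly equivalent Pig Latin script:
    //   lines  = LOAD 'input' AS (line:chararray);
    //   words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    //   counts = FOREACH (GROUP words BY word) GENERATE group, COUNT(words);
    //   STORE counts INTO 'output';
    public class WordCount {

        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // emit (word, 1) for every whitespace-separated token
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class); // same logic pre-aggregates map output
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }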


Scale-up vs Scale-out for Hadoop: Time to rethink?

A paper authored by a Microsoft Research team:

In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment. Is this the right approach?

The main premise of the paper is based on different reports showing that “the majority of analytics jobs do not process huge data sets”. The authors cite publications about production clusters at Microsoft, Yahoo, and Facebook that put the median input size under 14GB for Microsoft and Yahoo, and under 100GB for 90% of the jobs at Facebook. Obviously, this working hypothesis is critical for the rest of the paper.

Another important part for understanding and interpreting the results of this paper is the section on Optimizing Storage:

Storage bottlenecks can easily be removed either by using SSDs or by using one of many scalable back-end solutions (SAN or NAS in the enterprise scenario, e.g. [23], or Amazon S3/Windows Azure in the cloud scenario). In our experimental setup which is a small cluster we use SSDs for both the scale-up and the scale-out machines.

First, the common knowledge in the Hadoop community is to always avoid SAN and NAS (to preserve data locality); I’m not referring here to the Hadoop reference architectures coming from storage vendors. Still, in the scale-up scenario NAS/SAN can make sense for accommodating storage needs that would exceed the capacity and resilience limits of a single scaled-up machine. But I expect that using such storage would change the total cost picture, and unfortunately the paper does not provide an analysis of it.

The other option, using SSDs for storage, implies that the input size of a job is the same as the total size of the stored data, or that the cost of moving and loading the data to be processed is close to zero. Neither is true. As a quick sanity check: just copying 10TB onto a scale-up machine over a 10GbE link takes more than two hours before any processing starts.

via: http://research.microsoft.com/pubs/204499/a20-appuswamy.pdf