MapReduce: All content tagged as MapReduce in NoSQL databases and polyglot persistence

SQL on Hadoop: An overview of frameworks and their applicability

An overview of the three SQL-on-Hadoop execution models: batch (tens of minutes and up), interactive (up to minutes), and operational (sub-second). It covers the applications each model is suited for and the main characteristics of the tools and frameworks in each of these categories:

Within the big data landscape there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. Each depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.

The usual suspects are included: Hive, Impala, Presto, Spark/Shark, Drill.

[Figure: SQL-on-Hadoop segments diagram]
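
Since tolerating machine-generated SQL from BI tools is one of the listed considerations, it is worth noting that most of these engines expose JDBC/ODBC endpoints. Below is a minimal sketch of running a query against Hive through HiveServer2 over JDBC; the host, credentials, and table name are placeholders, and Impala and Presto accept essentially the same kind of client connection:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; requires the hive-jdbc jar on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Placeholder host and credentials; HiveServer2 listens on port 10000 by default.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hive-server.example.com:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 // access_logs is a placeholder table; the same SQL could be sent to
                 // Impala or Presto through their own JDBC endpoints.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }

What differs between the categories above is not so much the SQL sent over the wire as the latency and completeness you can expect back from it.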

Original title and link: SQL on Hadoop: An overview of frameworks and their applicability (NoSQL database©myNoSQL)

via: http://www.mapr.com/products/sql-on-hadoop-details


Heterogeneous storages in HDFS

In my post about in-memory databases vs Aster Data and Greenplum vs Hadoop market share, I proposed a scenario in which Aster Data and Greenplum could expand into the space of in-memory databases by employing hybrid storage.

What I didn’t cover in that post is the possibility of Hadoop, or more precisely HDFS, expanding into hybrid storage.

But that’s already happening: Hortonworks is working on introducing support for heterogeneous storages in HDFS:

We plan to introduce the idea of Storage Preferences for files. A Storage Preference is a hint to HDFS specifying how the application would like block replicas for the given file to be placed. Initially the Storage Preference will include:

  1. The desired number of file replicas (also called the replication factor); and
  2. The target storage type for the replicas.
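
The first half of such a hint can already be expressed per file through the existing FileSystem API; the storage-type half is the new part of the proposal. A minimal sketch, with the storage-type call shown only as a commented-out, hypothetical placeholder for what the proposal describes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StoragePreferenceSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path hotData = new Path("/data/hot/events-2014-01-20");

            // 1. Desired number of replicas: setReplication() exists in HDFS today.
            fs.setReplication(hotData, (short) 2);

            // 2. Target storage type (e.g. memory or SSD vs. spinning disk):
            //    hypothetical call sketching the Storage Preference hint described above.
            // fs.setStoragePreference(hotData, StorageType.SSD);
        }
    }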

Even if the cost of memory goes back to decreasing at the same rate as before 2012, when prices flat-lined, a cost-effective architecture will almost always rely on hybrid storage.

Original title and link: Heterogeneous storages in HDFS (NoSQL database©myNoSQL)


Hadoop and Enterprise Data Hubs: Aspirational Marketing

Merv Adrian:

In those same shops, there are thousands of significant database instances, and tens of thousands of applications — and those are conservative numbers. So the first few Hadoop applications will represent a toehold in their information infrastructure. It will be a significant beachhead, and it will grow as long as the community of vendors and open source committers deliver on the exciting promise of added functionality we see described in the budding Hadoop 2.0 era, adding to its early successes in some analytics and data integration workloads.

So “Enterprise Data Hub?” Not yet. At best in 2014, Hadoop will begin to build a role as part of an Enterprise Data Spoke in some shops.

This is today. Tomorrow might be Data Lakes.

Original title and link: Hadoop and Enterprise Data Hubs: Aspirational Marketing (NoSQL database©myNoSQL)

via: http://blogs.gartner.com/merv-adrian/2014/01/17/aspirational-marketing-and-enterprise-data-hubs/


Performance advantages of the new Google Cloud Storage Connector for Hadoop

This guest post by Mike Wendt from Accenture Technology provides some very good answers to the questions I had about the recently announced Hadoop connector for Google Cloud Storage: how it behaves compared to local storage (data locality), what the performance of accessing Google Cloud Storage directly from Hadoop looks like, and, last but essential for cloud setups, what the cost implications are:

From our study, we can see that remote storage powered by the Google Cloud Storage connector for Hadoop actually performs better than local storage. The increased performance can be seen in all three of our workloads to varying degrees based on their access patterns. […] Availability of the files, and their chunks, is no longer limited to three copies within the cluster, which eliminates the dependence on the three nodes that contain the data to process the file or to transfer the file to an available node for processing.

[…] This availability of remote storage on the scale and size provided by Google Cloud Storage unlocks a unique way of moving and storing large amounts of data that is not available with bare-metal deployments.
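
For context on what “remote storage powered by the connector” looks like from the application side: once the connector jar is on the classpath, jobs simply address gs:// URIs instead of hdfs:// ones. A minimal sketch, assuming the fs.gs.* property names from the connector’s documentation, with the bucket and project id as placeholders (authentication settings omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class GcsConnectorSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Register the Google Cloud Storage connector as the handler for gs:// URIs.
            conf.set("fs.gs.impl",
                     "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
            // Placeholder project id; ties bucket access to a Cloud project.
            conf.set("fs.gs.project.id", "my-gcp-project");

            // Input lives in a Cloud Storage bucket rather than in HDFS.
            Path input = new Path("gs://my-bucket/logs/2014/01/");
            FileSystem fs = input.getFileSystem(conf);
            for (FileStatus status : fs.listStatus(input)) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }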

If you are looking just for the conclusions:

First, cloud-based Hadoop deployments offer better price-performance ratios than bare-metal clusters. Second, the benefit of performance tuning is so huge that cloud’s virtualization layer overhead is a worthy investment as it expands performance-tuning opportunities. Third, despite the sizable benefit, the performance-tuning process is complex and time-consuming and thus requires automated tuning tools.

✚ Keep in mind, though, that this study was published on the Google Cloud Platform blog, so you should expect the results to favor Google’s offering.

Original title and link: Performance advantages of the new Google Cloud Storage Connector for Hadoop (NoSQL database©myNoSQL)

via: http://googlecloudplatform.blogspot.com/2014/01/performance-advantages-of-the-new-google-cloud-storage-connector-for-hadoop.html


What is Intel doing in the Hadoop business?

In case you forgot, Intel offers a distribution of Hadoop. Tony Baer, Principal Analyst at Ovum, explains why Intel created this distribution:

The answer is that Hadoop is becoming the scale-out compute pillar of Intel’s emerging Software-Defined Infrastructure initiative for the data center – a vision that virtualizes general- and special-purpose CPUs powering functions under common Intel hardware-based components. The value proposition that Intel is proposing is that embedding serving, network, and/or storage functions into the chipset is a play for public or private cloud – supporting elasticity through enabling infrastructure to reshape itself dynamically according to variable processing demands.

Theoretically it sounds possible. But like the other attempts to explain Intel’s Hadoop distribution, I don’t find this one convincing either. Unfortunately I don’t have a better explanation myself, so I’ll keep asking.

Original title and link: What is Intel doing in the Hadoop business? (NoSQL database©myNoSQL)

via: http://ovum.com/2014/01/13/intel-refreshes-its-hadoop-distribution/


Pig vs MapReduce: When, Why, and How

Donald Miner, author of MapReduce Design Patterns and CTO at ClearEdge IT Solutions, discusses how he chooses between Pig and MapReduce, considering developer and processing time, maintainability and deployment, and repurposing engineers who are new to Java and Pig.

Video and slides after the break.
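
For readers who haven’t felt the tradeoff first-hand, here is roughly what a simple grouped count looks like as raw Java MapReduce, with the handful of Pig Latin lines that express the same job shown in the leading comment. This is a generic sketch for illustration, not code from the talk; the input format and field layout are assumptions:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Counts visits per user from tab-separated "user<TAB>url" lines.
    // The equivalent Pig script is roughly:
    //   logs   = LOAD '/logs' AS (user:chararray, url:chararray);
    //   byUser = GROUP logs BY user;
    //   counts = FOREACH byUser GENERATE group, COUNT(logs);
    //   STORE counts INTO '/out';
    public class VisitCount {

        public static class VisitMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text user = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\t");
                if (fields.length >= 1 && !fields[0].isEmpty()) {
                    user.set(fields[0]);
                    context.write(user, ONE);
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values,
                                  Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "visit-count");
            job.setJarByClass(VisitCount.class);
            job.setMapperClass(VisitMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The Pig version trades away some control (custom partitioners, fine-grained counters, tuned data flow) for that brevity, which is essentially the decision the talk walks through.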


Scale-up vs Scale-out for Hadoop: Time to rethink?

A paper authored by a Microsoft Research team:

In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment. Is this the right approach?

The main premise of the paper is based on various reports showing that “the majority of analytics jobs do not process huge data sets”. The authors cite publications about production clusters at Microsoft, Yahoo, and Facebook that put the median input size under 14GB (for Microsoft and Yahoo) and under 100GB for 90% of the jobs run at Facebook. Obviously, this working hypothesis is critical for the rest of the paper.

Another important part for understanding and interpreting the results of this paper is the section on Optimizing Storage:

Storage bottlenecks can easily be removed either by using SSDs or by using one of many scalable back-end solutions (SAN or NAS in the enterprise scenario, e.g. [23], or Amazon S3/Windows Azure in the cloud scenario). In our experimental setup which is a small cluster we use SSDs for both the scale-up and the scale-out machines.

First, the common wisdom in the Hadoop community is to avoid SAN and NAS (to preserve data locality); I’m not referring here to the Hadoop reference architectures coming from storage vendors. Still, in the scale-up scenario, NAS/SAN can make sense for accommodating storage needs that exceed the capacity and resilience a single scaled-up machine can provide. But I expect that using such storage would change the total cost picture, and unfortunately the paper does not provide an analysis of it.

The other option, using SSDs for storage, assumes either that the input size of a job equals the total size of the stored data or that the cost of moving and loading the data to be processed is close to zero. Neither of these is true.

via: http://research.microsoft.com/pubs/204499/a20-appuswamy.pdf


Aster Data, HAWQ, GPDB and the First Hadoop Squeeze

Rob Klopp:

But there are three products, the Greenplum database (GPDB), HAWQ, and Aster Data, that will be squeezed more quickly as they are positioned either in between the EDW and Hadoop… or directly over Hadoop. In this post I’ll explain what I suspect Pivotal and Teradata are trying to do… why I believe their strategy will not work for long… and why readers of this blog should be careful moving forward.

This is a very interesting analysis of the enterprise data warehouse market. There’s also a nice visualization of this prediction:

[Figure: The first squeeze]

Here’s an alternative, though. As shown in the picture above, the expansion of in-memory databases depends heavily on the evolution of memory prices. It’s hard to argue against price predictions or Moore’s law, but accidents, even if rare, are still possible. Any significant change in the trend of memory costs, or in other hardware market conditions (e.g. an unexpected drop in SSD prices), could give Teradata and Pivotal the extra time and conditions to break into advanced hybrid storage solutions: slightly slower but also less expensive than their competitors’ in-memory databases.

Original title and link: Aster Data, HAWQ, GPDB and the First Hadoop Squeeze (NoSQL database©myNoSQL)

via: http://robklopp.wordpress.com/2013/12/11/aster-data-hawq-gpdb-and-the-first-hadoop-squeeze/


Cloudera shipped a mountain... what can you read between the lines

Cloudera Engineering (@ClouderaEng) shipped a mountain of new product (production-grade software, not just technical previews): Cloudera Impala, Cloudera Search, Cloudera Navigator, Cloudera Development Kit (now Kite SDK), new Apache Accumulo packages for CDH, and several iterative releases of CDH and Cloudera Manager. (And, the Cloudera Enterprise 5 Beta release was made available to the world.) Furthermore, as always, a ton of bug fixes and new features went upstream, notably but not exclusively HiveServer2 and Apache Sentry (incubating).

How many things can you read in this paragraph?

  1. A not-so-subtle stab at Hortonworks’ series of technical previews.
  2. More and more projects brought under the CDH umbrella. Does more ever become too much? (I cannot explain why, but my first thought was “this feels so Oracle-style”.)
  3. Cloudera’s current big bet is Impala: SQL and low-latency querying. A big win for the project, though not necessarily a direct financial win for Cloudera, was its addition as a supported service on Amazon Elastic MapReduce.

Original title and link: Cloudera shipped a mountain… what can you read between the lines (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2014/01/this-month-and-year-in-the-ecosystem-december-2013/


Parkour - Idiomatic Clojure for Map Reduce

If you are running out of interesting projects to experiment with during this seasonal break, Parkour is a Clojure library for writing MapReduce jobs.

From Marshall Bockrath-Vandegrift’s guest post on Cloudera’s blog:

Parkour is our new Clojure library that carries this philosophy to the Apache Hadoop’s MapReduce platform. Instead of hiding the underlying MapReduce model behind new framework abstractions, Parkour exposes that model with a clear, direct interface. Everything possible in raw Java MapReduce is possible with Parkour, but usually with a fraction of the code.

Original title and link: Parkour - Idiomatic Clojure for Map Reduce (NoSQL database©myNoSQL)


Cloudera's strategy for Hadoop

Alex Woodie about Cloudera’s strategy for Hadoop:

Cloudera has gone further than other Hadoop vendors in articulating a business-oriented strategy for converting Hadoop R&D into a profitable business model. The company unveiled its “enterprise data hub” strategy at the Strata + Hadoop World conference in October, in which it envisions Hadoop at the center of a new data-focused architecture. Every type of data, whether it’s analytical or transactional in nature, goes through Hadoop on its way to somewhere else. (Hortonworks, MapR Technologies, and Pivotal, for what it’s worth, have similar strategies in play, but Cloudera has jumped out front in articulating the marketing message in the cleanest manner.)

In the early days, a coherent strategy is not critical, as technology alone can win adopters quite easily through its direct value. Later, when penetrating the enterprise world, a big-picture strategy is at least a way to keep the conversation going, even if in the end the deployed solutions are highly customized.

Original title and link: Cloudera’s strategy for Hadoop (NoSQL database©myNoSQL)

via: http://www.datanami.com/datanami/2013-12-13/reaping_the_fruits_of_hadoop_labor_in_2014.html


LinkedIn's Hourglass: Incremental data processing in Hadoop

Matthew Hayes introduces a very interesting new framework from LinkedIn:

Hourglass is designed to make computations over sliding windows more efficient. For these types of computations, the input data is partitioned in some way, usually according to time, and the range of input data to process is adjusted as new data arrives. Hourglass works with input data that is partitioned by day, as this is a common scheme for partitioning temporal data.

Hourglass is available on GitHub.

We have found that two types of sliding window computations are extremely common in practice:

  • Fixed-length: the length of the window is set to some constant number of days and the entire window moves forward as new data becomes available. Example: a daily report summarizing the number of visitors to a site from the past 30 days.
  • Fixed-start: the beginning of the window stays constant, but the end slides forward as new input data becomes available. Example: a daily report summarizing all visitors to a site since the site launched.
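
The fixed-length case is where the incremental approach pays off most: rather than reprocessing all 30 daily partitions on every run, a job can reuse the previous aggregate, add the newest day, and subtract the day that slid out of the window. A toy sketch of that idea over per-member daily counts (plain Java illustrating the principle, not the Hourglass API itself):

    import java.util.HashMap;
    import java.util.Map;

    // Toy illustration of maintaining a fixed-length (e.g. 30-day) sliding window
    // incrementally: add the newly arrived daily partition and subtract the one
    // that fell out of the window, instead of recomputing over all 30 days.
    public class SlidingVisitorCount {

        private final Map<String, Long> windowTotals = new HashMap<>();

        /** Merge one day's per-member counts into the running window total. */
        public void addDay(Map<String, Long> dailyCounts) {
            for (Map.Entry<String, Long> e : dailyCounts.entrySet()) {
                windowTotals.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }

        /** Remove the counts of the day that just slid out of the window. */
        public void subtractDay(Map<String, Long> dailyCounts) {
            for (Map.Entry<String, Long> e : dailyCounts.entrySet()) {
                long updated = windowTotals.getOrDefault(e.getKey(), 0L) - e.getValue();
                if (updated <= 0) {
                    windowTotals.remove(e.getKey());
                } else {
                    windowTotals.put(e.getKey(), updated);
                }
            }
        }

        /** Advance the window by one day, touching only two daily partitions. */
        public void advance(Map<String, Long> newestDay, Map<String, Long> expiredDay) {
            addDay(newestDay);
            subtractDay(expiredDay);
        }

        public Map<String, Long> totals() {
            return windowTotals;
        }
    }

The fixed-start case is even simpler: there is nothing to subtract, so each run only folds in the newest day’s partition.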

Original title and link: LinkedIn’s Hourglass: Incremental data processing in Hadoop (NoSQL database©myNoSQL)

via: http://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop