bigdata: All content tagged as bigdata in NoSQL databases and polyglot persistence

Big Data's 2 big years is actually Hadoop

Doug Henschen makes two great points:

  1. Everyone wants to sell Hadoop:

    Practically every vendor out there has embraced Hadoop, going well beyond the fledgling announcements and primitive “connectors” that were prevalent two years ago. Industry heavyweights IBM, Microsoft, Oracle, Pivotal, SAP, and Teradata are all selling and supporting Hadoop distributions — partnering, in some cases, with Cloudera and Hortonworks. Four of these six have vendor-specific distributions, Hadoop appliances, or both.

  2. Then everyone is building SQL-on-Hadoop.

Original title and link: Big Data’s 2 big years is actually Hadoop (NoSQL database©myNoSQL)

via: http://www.informationweek.com/software/information-management/big-datas-2-big-years/d/d-id/1113664


SQL on Hadoop: An overview of frameworks and their applicability

An overview of the three SQL-on-Hadoop execution models: batch (tens of minutes and up), interactive (up to minutes), and operational (sub-second); their applicability to different classes of applications; and the main characteristics of the tools/frameworks in each category:

Within the big data landscape there are multiple approaches to accessing, analyzing, and manipulating data in Hadoop. Each depends on key considerations such as latency, ANSI SQL completeness (and the ability to tolerate machine-generated SQL), developer and analyst skillsets, and architecture tradeoffs.

The usual suspects are included: Hive, Impala, Presto, Spark/Shark, Drill.

[Diagram: SQL-on-Hadoop segments]

Original title and link: SQL on Hadoop: An overview of frameworks and their applicability (NoSQL database©myNoSQL)

via: http://www.mapr.com/products/sql-on-hadoop-details


Heterogeneous storages in HDFS

In my post about in-memory databases vs. Aster Data and Greenplum vs. Hadoop market share, I proposed a scenario in which Aster Data and Greenplum could expand into the space of in-memory databases by employing hybrid storage.

What I didn't cover in that post is the possibility of Hadoop, or rather HDFS, expanding into hybrid storage.

But that's already happening: Hortonworks is working on introducing support for heterogeneous storages in HDFS:

We plan to introduce the idea of Storage Preferences for files. A Storage Preference is a hint to HDFS specifying how the application would like block replicas for the given file to be placed. Initially the Storage Preference will include:

  1. The desired number of file replicas (also called the replication factor) and;
  2. The target storage type for the replicas.
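
As a rough sketch of what such a hint could look like from an application's point of view: the replication part of the preference is already expressible through the existing FileSystem API, while the storage-type part is only a proposal at this point, so the commented-out call below is an assumption about where the API might land, not something HDFS exposes today. The path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoragePreferenceSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path hotData = new Path("/data/hot/events"); // hypothetical path

        // 1. The desired number of replicas is already a per-file setting in HDFS.
        fs.setReplication(hotData, (short) 3);

        // 2. The target storage type is the new part of the proposal. A per-path
        //    policy along these lines is what the Hortonworks work describes;
        //    the call below is illustrative only and not part of the HDFS API
        //    at the time of writing.
        // fs.setStoragePolicy(hotData, "ALL_SSD");
    }
}
```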

Even if memory costs resume decreasing at the rate they did before 2012, when they flat-lined, a cost-effective architecture will almost always rely on hybrid storage.

Original title and link: Heterogeneous storages in HDFS (NoSQL database©myNoSQL)


Hadoop and Enterprise Data Hubs: Aspirational Marketing

Merv Adrian:

In those same shops, there are thousands of significant database instances, and tens of thousands of applications — and those are conservative numbers. So the first few Hadoop applications will represent a toehold in their information infrastructure. It will be a significant beachhead, and it will grow as long as the community of vendors and open source committers deliver on the exciting promise of added functionality we see described in the budding Hadoop 2.0 era, adding to its early successes in some analytics and data integration workloads.

So “Enterprise Data Hub?” Not yet. At best in 2014, Hadoop will begin to build a role as part of an Enterprise Data Spoke in some shops.

This is today. Tomorrow might be Data Lakes.

Original title and link: Hadoop and Enterprise Data Hubs: Aspirational Marketing (NoSQL database©myNoSQL)

via: http://blogs.gartner.com/merv-adrian/2014/01/17/aspirational-marketing-and-enterprise-data-hubs/


Where is Big Data heading? The Data Lakes

Edd Dumbill for Forbes:

The data lake dream is of a place with data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment. Applications are no longer islands, and exist within the data cloud, taking advantage of high bandwidth access to data and scalable computing resource. Data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise.

Basically, this perspective expands on the decentralized data model by adding an application layer living inside the data platform. That's the part I don't fully grok. Having access to all the data and being close to the data are both different from having the application inside the data platform. And I really hope this doesn't imply anything like huge PL/SQL-ish apps¹.


  1. I actually think that this description is based mostly on the YARN architecture.  

Original title and link: Where is Big Data heading? The Data Lakes (NoSQL database©myNoSQL)

via: http://www.forbes.com/sites/edddumbill/2014/01/14/the-data-lake-dream/


Performance advantages of the new Google Cloud Storage Connector for Hadoop

This guest post by Mike Wendt from Accenture Technology provides some very good answers to the questions I had about the recently announced Hadoop connector for Google Cloud Storage: how it behaves compared to local storage (data locality), what the performance of accessing Google Cloud Storage directly from Hadoop is, and, last but essential for cloud setups, what the cost implications are:

From our study, we can see that remote storage powered by the Google Cloud Storage connector for Hadoop actually performs better than local storage. The increased performance can be seen in all three of our workloads to varying degrees based on their access patterns. […] Availability of the files, and their chunks, is no longer limited to three copies within the cluster, which eliminates the dependence on the three nodes that contain the data to process the file or to transfer the file to an available node for processing.

[…] This availability of remote storage on the scale and size provided by Google Cloud Storage unlocks a unique way of moving and storing large amounts of data that is not available with bare-metal deployments.
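
For context, "accessing Google Cloud Storage directly from Hadoop" means the connector exposes GCS as a regular Hadoop filesystem under a gs:// scheme, so a job reads remote objects through the same FileSystem API it would use for HDFS. A minimal sketch follows; the bucket, path, project id, and the exact property names are assumptions based on the connector's documentation, not details taken from the post.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsDirectRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the connector for the "gs" scheme (property names assumed
        // from the connector docs) and point it at a hypothetical project.
        conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        conf.set("fs.gs.project.id", "my-project");

        // Open a GCS object through the standard Hadoop FileSystem API,
        // exactly as one would open a file on HDFS.
        FileSystem gcs = FileSystem.get(URI.create("gs://my-bucket/"), conf);
        try (FSDataInputStream in = gcs.open(new Path("gs://my-bucket/logs/part-00000"))) {
            byte[] buf = new byte[4096];
            int read = in.read(buf);
            System.out.println("Read " + read + " bytes directly from GCS");
        }
    }
}
```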

If you are looking just for the conclusions:

First, cloud-based Hadoop deployments offer better price-performance ratios than bare-metal clusters. Second, the benefit of performance tuning is so huge that cloud’s virtualization layer overhead is a worthy investment as it expands performance-tuning opportunities. Third, despite the sizable benefit, the performance-tuning process is complex and time-consuming and thus requires automated tuning tools.

✚ Keep in mind though that this study was posted on the Google Cloud Platform blog, so you could expect the results to favor Google's offering.

Original title and link: Performance advantages of the new Google Cloud Storage Connector for Hadoop (NoSQL database©myNoSQL)

via: http://googlecloudplatform.blogspot.com/2014/01/performance-advantages-of-the-new-google-cloud-storage-connector-for-hadoop.html


What is Intel doing in the Hadoop business?

In case you forgot, Intel offers a distribution of Hadoop. Tony Baer, Principal Analyst at Ovum, explains why Intel created this distribution:

The answer is that Hadoop is becoming the scale-out compute pillar of Intel’s emerging Software-Defined Infrastructure initiative for the data center – a vision that virtualizes general- and special-purpose CPUs powering functions under common Intel hardware-based components. The value proposition that Intel is proposing is that embedding serving, network, and/or storage functions into the chipset is a play for public or private cloud – supporting elasticity through enabling infrastructure to reshape itself dynamically according to variable processing demands.

Theoretically it sounds possible. But as with the other attempts to explain Intel's Hadoop distribution, I don't find this one convincing either. Unfortunately, I don't have a better explanation myself, so I'll keep wondering.

Original title and link: What is Intel doing in the Hadoop business? (NoSQL database©myNoSQL)

via: http://ovum.com/2014/01/13/intel-refreshes-its-hadoop-distribution/


Pig vs MapReduce: When, Why, and How

Donald Miner, author of MapReduce Design Patterns and CTO at ClearEdge IT Solutions, discusses how he chooses between Pig and MapReduce, considering developer and processing time, maintainability and deployment, and repurposing engineers who are new to Java and Pig.

Video and slides after the break.


Scale-up vs Scale-out for Hadoop: Time to rethink?

A paper authored by a Microsoft Research team:

In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment. Is this the right approach?

The main premise of the paper is based on different reports showing that "the majority of analytics jobs do not process huge data sets". The authors cite publications about production clusters at Microsoft, Yahoo, and Facebook that put the median input size under 14GB (for Microsoft and Yahoo) and, at Facebook, under 100GB for 90% of the jobs run. Obviously, this working hypothesis is critical for the rest of the paper.

Another important part for understanding and interpreting the results of this paper is the section on Optimizing Storage:

Storage bottlenecks can easily be removed either by using SSDs or by using one of many scalable back-end solutions (SAN or NAS in the enterprise scenario, e.g. [23], or Amazon S3/Windows Azure in the cloud scenario). In our experimental setup which is a small cluster we use SSDs for both the scale-up and the scale-out machines.

First, the common knowledge in the Hadoop community is to always avoid SAN and NAS (to preserve data locality); I'm not referring here to the Hadoop reference architectures coming from storage vendors. Still, in the scale-up scenario, NAS/SAN can make sense for accommodating storage needs that exceed the capacity and resilience of the scaled-up machine. But I expect that using such storage would change the total cost picture, and unfortunately the paper does not provide a cost analysis.

The other option, using SSDs for storage, implies either that a job's input size equals the total size of the stored data, or that the cost of moving and loading the data to be processed is close to zero. Neither of these is true.
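
A quick back-of-envelope illustrates the point, taking the 100GB 90th-percentile input size cited above and assuming, purely for illustration, 10Gb/s and 1Gb/s links to the remote storage:

$$ t_{10\,\mathrm{Gb/s}} = \frac{100 \times 10^{9} \times 8\ \mathrm{bits}}{10 \times 10^{9}\ \mathrm{bits/s}} \approx 80\ \mathrm{s}, \qquad t_{1\,\mathrm{Gb/s}} \approx 800\ \mathrm{s} \approx 13\ \mathrm{min} $$

Minutes of pure data movement are not negligible for the short jobs the paper is concerned with, so treating loading costs as near-zero tilts the comparison.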

via: http://research.microsoft.com/pubs/204499/a20-appuswamy.pdf


Everything is faster than Hive

Derrick Harris has brought together a series of benchmarks conducted by the different SQL-on-Hadoop implementors, each comparing their solution (Impala, Stinger/Tez, HAWQ, Shark) with Hive:

For what it’s worth, everyone is faster than Hive — that’s the whole point of all of these SQL-on-Hadoop technologies. How they compare with each other is harder to gauge, and a determination probably best left to individual companies to test on their own workloads as they’re making their own buying decisions. But for what it’s worth, here is a collection of more benchmark tests showing the performance of various Hadoop query engines against Hive, relational databases and, sometimes, themselves.

As Derrick Harris remarks, the only direct comparisons are between HAWQ and Impala (and this seems to be old as it mentions Impala being in beta) and the benchmark run by AMPlab (the guys behind Shark) comparing Redshift, Hive, Shark, and Impala.

The good part is that both the Hive Testbench and AMPlab benchmark are available on GitHub.

Original title and link: Everything is faster than Hive (NoSQL database©myNoSQL)

via: http://gigaom.com/2014/01/13/cloudera-says-impala-is-faster-than-hive-which-isnt-saying-much/


Aster Data, HAWQ, GPDB and the First Hadoop Squeeze

Rob Klopp:

But there are three products, the Greenplum database (GPDB), HAWQ, and Aster Data, that will be squeezed more quickly as they are positioned either in between the EDW and Hadoop… or directly over Hadoop. In this post I’ll explain what I suspect Pivotal and Teradata are trying to do… why I believe their strategy will not work for long… and why readers of this blog should be careful moving forward.

This is a very interesting analysis of the enterprise data warehouse market. There’s also a nice visualization of this prediction:

[Diagram: the first squeeze]

Here's an alternative though. As shown in the diagram above, the expansion of in-memory databases depends heavily on the evolution of memory prices. It's hard to argue against price predictions or Moore's law, but accidents, even if rare, are still possible. Any significant change in the trend of memory costs, or in other hardware market conditions (e.g. an unexpected drop in SSD prices), could give Teradata and Pivotal the extra time and conditions to break into advanced hybrid storage solutions offering slightly slower but less expensive products than their competitors' in-memory databases.

Original title and link: Aster Data, HAWQ, GPDB and the First Hadoop Squeeze (NoSQL database©myNoSQL)

via: http://robklopp.wordpress.com/2013/12/11/aster-data-hawq-gpdb-and-the-first-hadoop-squeeze/


Cloudera shipped a mountain... what can you read between the lines

Cloudera Engineering (@ClouderaEng) shipped a mountain of new product (production-grade software, not just technical previews): Cloudera Impala, Cloudera Search, Cloudera Navigator, Cloudera Development Kit (now Kite SDK), new Apache Accumulo packages for CDH, and several iterative releases of CDH and Cloudera Manager. (And, the Cloudera Enterprise 5 Beta release was made available to the world.). Furthermore, as always, a ton of bug fixes and new features went upstream, with the features notably but not exclusively HiveServer2 and Apache Sentry (incubating).

How many things can you read in this paragraph?

  1. a not-that-subtle stab at Hortonworks' series of technical previews.
  2. more and more projects brought under the CDH umbrella. Does more ever become too much? (I cannot explain why, but my first thought was “this feels so Oracle-style”)
  3. Cloudera’s current big bet is Impala. SQL and low latency querying. A big win for the project, but not necessarily a direct financial win for Cloudera, was its addition as a supported service on Amazon Elastic MapReduce.

Original title and link: Cloudera shipped a mountain… what can you read between the lines (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2014/01/this-month-and-year-in-the-ecosystem-december-2013/