


BigData: All content tagged as BigData in NoSQL databases and polyglot persistence

Performance advantages of the new Google Cloud Storage Connector for Hadoop

This guest post by Mike Wendt from Accenture Technology provides some very good answers to the questions I had about the recently announced Hadoop connector for Google Cloud Storage: how does it behave compared to local storage (data locality), what is the performance of accessing Google Cloud Storage directly from Hadoop, and, last but essential for cloud setups, what are the cost implications:

From our study, we can see that remote storage powered by the Google Cloud Storage connector for Hadoop actually performs better than local storage. The increased performance can be seen in all three of our workloads to varying degrees based on their access patterns. […] Availability of the files, and their chunks, is no longer limited to three copies within the cluster, which eliminates the dependence on the three nodes that contain the data to process the file or to transfer the file to an available node for processing.

[…] This availability of remote storage on the scale and size provided by Google Cloud Storage unlocks a unique way of moving and storing large amounts of data that is not available with bare-metal deployments.
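For context, pointing Hadoop at Google Cloud Storage is mostly a core-site.xml matter. This is a minimal sketch, not from the post: the project id and bucket names are hypothetical, and the property names should be checked against the connector's documentation:

```xml
<!-- core-site.xml: minimal sketch, assuming the GCS connector jar
     is already on the Hadoop classpath -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>my-gcp-project</value><!-- hypothetical project id -->
</property>
```

With that in place, jobs can read and write `gs://my-bucket/...` paths the same way they would `hdfs://` paths, which is what makes the remote-storage access patterns discussed above possible.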

If you are looking just for the conclusions:

First, cloud-based Hadoop deployments offer better price-performance ratios than bare-metal clusters. Second, the benefit of performance tuning is so huge that cloud’s virtualization layer overhead is a worthy investment as it expands performance-tuning opportunities. Third, despite the sizable benefit, the performance-tuning process is complex and time-consuming and thus requires automated tuning tools.

✚ Keep in mind though that this study was published on the Google Cloud Platform blog, so you could expect the results to favor Google’s offering.

Original title and link: Performance advantages of the new Google Cloud Storage Connector for Hadoop (NoSQL database©myNoSQL)


What is Intel doing in the Hadoop business?

In case you forgot, Intel offers its own distribution of Hadoop. Tony Baer, Principal Analyst at Ovum, explains why Intel created it:

The answer is that Hadoop is becoming the scale-out compute pillar of Intel’s emerging Software-Defined Infrastructure initiative for the data center – a vision that virtualizes general- and special-purpose CPUs powering functions under common Intel hardware-based components. The value proposition that Intel is proposing is that embedding serving, network, and/or storage functions into the chipset is a play for public or private cloud – supporting elasticity through enabling infrastructure to reshape itself dynamically according to variable processing demands.

Theoretically it sounds plausible. But like the other attempts to explain Intel’s Hadoop distribution, I don’t find this one convincing either. Unfortunately I don’t have a better explanation myself, so I’ll keep asking.

Original title and link: What is Intel doing in the Hadoop business? (NoSQL database©myNoSQL)


Pig vs MapReduce: When, Why, and How

Donald Miner, author of MapReduce Design Patterns and CTO at ClearEdge IT Solutions, discusses how he chooses between Pig and MapReduce, considering developer and processing time, maintainability and deployment, and repurposing engineers who are new to Java and Pig.

Video and slides after the break.
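To make the developer-time argument concrete, here is an illustrative sketch (mine, not from the talk): the canonical word count, which Pig expresses in about three statements, next to the mapper/shuffle/reducer plumbing you write by hand in the MapReduce model — sketched in plain Python rather than the Java MapReduce API:

```python
# In Pig, word count is roughly three statements:
#   lines  = LOAD 'input.txt' AS (line:chararray);
#   words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
#   counts = FOREACH (GROUP words BY word) GENERATE group, COUNT(words);
# The hand-written MapReduce equivalent is the mapper/reducer pair below.

from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every token, like a MapReduce map() call.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # Sum all counts for one key, like a MapReduce reduce() call.
    return word, sum(counts)

def word_count(lines):
    # Simulate the shuffle phase: sort all (word, 1) pairs by key,
    # then group by key and reduce each group.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(k, (c for _, c in g))
                for k, g in groupby(pairs, key=itemgetter(0)))

print(word_count(["the quick fox", "the lazy dog"]))
# → {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The logic is identical; the difference Miner discusses is how much of the surrounding plumbing (and its testing and maintenance) each model puts on the developer.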

Scale-up vs Scale-out for Hadoop: Time to rethink?

A paper authored by a Microsoft Research team:

In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment. Is this the right approach?

The main premise of the paper is based on various reports showing that “the majority of analytics jobs do not process huge data sets”. The authors cite publications about production clusters at Microsoft, Yahoo, and Facebook that put the median input size under 14GB (for Microsoft and Yahoo) and, for Facebook, under 100GB for 90% of the jobs run. Obviously, this working hypothesis is critical for the rest of the paper.

Another important part for understanding and interpreting the results of this paper is the section on Optimizing Storage:

Storage bottlenecks can easily be removed either by using SSDs or by using one of many scalable back-end solutions (SAN or NAS in the enterprise scenario, e.g. [23], or Amazon S3/Windows Azure in the cloud scenario). In our experimental setup which is a small cluster we use SSDs for both the scale-up and the scale-out machines.

First, the common wisdom in the Hadoop community is to avoid SAN and NAS altogether (to preserve data locality); I’m not referring here to the Hadoop reference architectures coming from storage vendors. Still, in the scale-up scenario NAS/SAN can make sense for accommodating storage needs that exceed the capacity and resilience limits of the scaled-up machine. But I expect that using such storage would change the total cost of the solution, and unfortunately the paper does not provide an analysis for it.

The other option, using SSDs for storage, implies that the input size of a job is either the same as the total size of the stored data, or that the cost of moving and loading the data to be processed is close to zero. Neither of these is true.
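A back-of-the-envelope calculation, with assumed figures that are mine rather than the paper’s (a 100GB job input pulled over a single 10GbE link at line rate), shows why the loading cost is not close to zero:

```python
# Rough, assumed figures -- not from the paper.
input_bytes = 100e9        # 100 GB job input
link_bps    = 10e9 / 8     # 10 GbE at line rate, in bytes/second

transfer_seconds = input_bytes / link_bps
print(f"Loading 100 GB over 10 GbE takes at least {transfer_seconds:.0f} s")
```

That is at least 80 seconds per 100GB before any protocol overhead or contention — and it grows linearly with the gap between the data you store and the data a single job actually reads.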


Everything is faster than Hive

Derrick Harris has brought together a series of benchmarks conducted by the different SQL-on-Hadoop implementors comparing their solutions (Impala, Stinger/Tez, HAWQ, Shark) with Hive:

For what it’s worth, everyone is faster than Hive — that’s the whole point of all of these SQL-on-Hadoop technologies. How they compare with each other is harder to gauge, and a determination probably best left to individual companies to test on their own workloads as they’re making their own buying decisions. But for what it’s worth, here is a collection of more benchmark tests showing the performance of various Hadoop query engines against Hive, relational databases and, sometimes, themselves.

As Derrick Harris remarks, the only direct comparisons are the one between HAWQ and Impala (which seems dated, as it mentions Impala still being in beta) and the benchmark run by AMPLab (the team behind Shark) comparing Redshift, Hive, Shark, and Impala.

The good part is that both the Hive Testbench and the AMPLab benchmark are available on GitHub.

Original title and link: Everything is faster than Hive (NoSQL database©myNoSQL)


Aster Data, HAWQ, GPDB and the First Hadoop Squeeze

Rob Klopp:

But there are three products, the Greenplum database (GPDB), HAWQ, and Aster Data, that will be squeezed more quickly as they are positioned either in between the EDW and Hadoop… or directly over Hadoop. In this post I’ll explain what I suspect Pivotal and Teradata are trying to do… why I believe their strategy will not work for long… and why readers of this blog should be careful moving forward.

This is a very interesting analysis of the enterprise data warehouse market. There’s also a nice visualization of this prediction:


Here’s an alternative though. As shown in the picture above, the expansion of in-memory databases depends heavily on the evolution of memory prices. It’s hard to argue against price predictions or Moore’s law. But accidents, even if rare, are still possible. Any significant change in the trend of memory costs, or in other hardware market conditions (e.g. an unexpected drop in SSD prices), could give Teradata and Pivotal the extra time and conditions to break into advanced hybrid storage solutions offering slightly slower but less expensive products than their competitors’ in-memory databases.

Original title and link: Aster Data, HAWQ, GPDB and the First Hadoop Squeeze (NoSQL database©myNoSQL)


Cloudera shipped a mountain... what can you read between the lines

Cloudera Engineering (@ClouderaEng) shipped a mountain of new product (production-grade software, not just technical previews): Cloudera Impala, Cloudera Search, Cloudera Navigator, Cloudera Development Kit (now Kite SDK), new Apache Accumulo packages for CDH, and several iterative releases of CDH and Cloudera Manager. (And the Cloudera Enterprise 5 Beta release was made available to the world.) Furthermore, as always, a ton of bug fixes and new features went upstream, with the features notably but not exclusively HiveServer2 and Apache Sentry (incubating).

How many things can you read in this paragraph?

  1. a not-that-subtle stab at Hortonworks’ series of technical previews.
  2. more and more projects brought under the CDH umbrella. Does more ever become too much? (I cannot explain why, but my first thought was “this feels so Oracle-style”)
  3. Cloudera’s current big bet is Impala: SQL and low-latency querying. A big win for the project, though not necessarily a direct financial win for Cloudera, was its addition as a supported service on Amazon Elastic MapReduce.

Original title and link: Cloudera shipped a mountain… what can you read between the lines (NoSQL database©myNoSQL)


Big Data 2014: Powering Up the Curve

Quentin Gallivan1:

  1. The big data ‘power curve’ in 2014 will be shaped by business users’ demand for data blending
  2. Big data needs to play well with others!
  3. You will see even more rapid innovation from the big data open source community
  4. You can’t prepare for tomorrow with yesterday’s tools

Just 4 truisms. Also, data blending? Really?

  1. Quentin Gallivan: CEO Pentaho 

Original title and link: Big Data 2014: Powering Up the Curve (NoSQL database©myNoSQL)


Big Data analytics predictions for 2014

Michele Chambers1:

In 2014, data analysts will be empowered through easy-to-use tools that leverage the insights of data scientists, by providing real-time forecasts and recommendations in their day-to-day business tools. Better analytics will make data analysis more effective, while automation frees up data scientists to focus on strategic initiatives and unlocking further value in corporate data stores.

Ease of use has been a mega-trend of the last few years. Those ignoring it try to make up for it through other means. But as proven over and over again, usability wins even over usefulness and correctness.

  1. Michele Chambers: Chief Strategy Officer and VP of Product Management at Revolution Analytics, the commercial R company. 

Original title and link: Big Data analytics predictions for 2014 (NoSQL database©myNoSQL)


Big Data Top Ten predictions for 2014 by Jim Kaskade

I couldn’t start the new year without a final round of predictions. Not mine, though.

Jim Kaskade:

  1. Consolidation of NoSQLs begins
  2. The Hadoop Clone wars end
  3. Open source business model is acknowledged by Wall Street
  4. Big Data and Cloud really means private cloud
  5. 2014 starts the era of analytic applications
  6. Search-based business intelligence tools will become the norm with Big Data
  7. Real-time in-memory analytics, complex event processing, and ETL combine
  8. Prescriptive analytics become more mainstream
  9. MDM will provide the dimensions for big data facts
  10. Security in Big Data won’t be a big issue

4.5 out of 10.

Original title and link: Big Data Top Ten predictions for 2014 by Jim Kaskade (NoSQL database©myNoSQL)


Big Data predictions for 2014

EMC’s Bill Schmarzo, concluding his eight Big Data predictions for 2014:

Whether these predictions are entirely true or only partially true, 2014 will be an even bigger year for Big Data as technology innovations enable organizations to leverage data to optimize key business processes and uncover new monetization opportunities. To be successful, organizations don’t need a Big Data strategy as much as they need a business strategy that incorporates Big Data.

One of the few prediction lists that are realistic.

Original title and link: Big Data predictions for 2014 (NoSQL database©myNoSQL)


Big Data 2.0: the next generation of Big Data

Gagan Mehra1 from VentureBeat:

The big data ecosystem has now reached a tipping point where the basic infrastructural capabilities for supporting big data challenges and opportunities are easily available. Now we are entering what I would call the next generation of big data — big data 2.0 — where the focus is on three key areas: Speed, Data Quality, Applications.

While I’m not convinced we can call the first cycle complete, if this is Big Data 2.0, can we skip ahead to 3.0 already2?

  1. Gagan Mehra: Chief evangelist for Software AG 

  2. Not that these are boring, but they sound like pre-Big Data problems. 

Original title and link: Big Data 2.0: the next generation of Big Data (NoSQL database©myNoSQL)