hadoop: All content tagged as hadoop in NoSQL databases and polyglot persistence

Aster Data, HAWQ, GPDB and the First Hadoop Squeeze

Rob Klopp:

But there are three products, the Greenplum database (GPDB), HAWQ, and Aster Data, that will be squeezed more quickly as they are positioned either in between the EDW and Hadoop… or directly over Hadoop. In this post I’ll explain what I suspect Pivotal and Teradata are trying to do… why I believe their strategy will not work for long… and why readers of this blog should be careful moving forward.

This is a very interesting analysis of the enterprise data warehouse market. There’s also a nice visualization of this prediction:


Here’s an alternative though. As shown in the picture above, the expansion of in-memory databases depends heavily on the evolution of the price of memory. It’s hard to argue against price predictions or Moore’s law. But accidents, even if rare, are still possible. Any significant change in the trend of memory costs, or in other hardware market conditions (e.g. an unpredicted decrease in the price of SSDs), could give Teradata and Pivotal the extra time and conditions to break into advanced hybrid storage solutions that would offer slightly slower but also less expensive products than their competitors’ in-memory databases.

Original title and link: Aster Data, HAWQ, GPDB and the First Hadoop Squeeze (NoSQL database©myNoSQL)


Cloudera shipped a mountain... what can you read between the lines

Cloudera Engineering (@ClouderaEng) shipped a mountain of new product (production-grade software, not just technical previews): Cloudera Impala, Cloudera Search, Cloudera Navigator, Cloudera Development Kit (now Kite SDK), new Apache Accumulo packages for CDH, and several iterative releases of CDH and Cloudera Manager. (And, the Cloudera Enterprise 5 Beta release was made available to the world.). Furthermore, as always, a ton of bug fixes and new features went upstream, with the features notably but not exclusively HiveServer2 and Apache Sentry (incubating).

How many things can you read in this paragraph?

  1. a not-so-subtle stab at Hortonworks’ series of technical previews.
  2. more and more projects brought under the CDH umbrella. Does more ever become too much? (I cannot explain why, but my first thought was “this feels so Oracle-style”)
  3. Cloudera’s current big bet is Impala. SQL and low latency querying. A big win for the project, but not necessarily a direct financial win for Cloudera, was its addition as a supported service on Amazon Elastic MapReduce.

Original title and link: Cloudera shipped a mountain… what can you read between the lines (NoSQL database©myNoSQL)


Cloudera's strategy for Hadoop

Alex Woodie about Cloudera’s strategy for Hadoop:

Cloudera has gone further than other Hadoop vendors in articulating a business-oriented strategy for converting Hadoop R&D into a profitable business model. The company unveiled its “enterprise data hub” strategy at the Strata + Hadoop World conference in October, in which it envisions Hadoop at the center of a new data-focused architecture. Every type of data, whether it’s analytical or transactional in nature, goes through Hadoop on its way to somewhere else. (Hortonworks, MapR Technologies, and Pivotal, for what it’s worth, have similar strategies in play, but Cloudera has jumped out front in articulating the marketing message in the cleanest manner.)

In the early days a coherent strategy is not a critical point as technology alone can win adopters quite easily through its direct value. Later, when penetrating the enterprise world, a big picture strategy is at least a way to keep the conversation going even if in the end the deployed solutions are highly customized.

Original title and link: Cloudera’s strategy for Hadoop (NoSQL database©myNoSQL)


LinkedIn's Hourglass: Incremental data processing in Hadoop

Matthew Hayes introduces a very interesting new framework from LinkedIn:

Hourglass is designed to make computations over sliding windows more efficient. For these types of computations, the input data is partitioned in some way, usually according to time, and the range of input data to process is adjusted as new data arrives. Hourglass works with input data that is partitioned by day, as this is a common scheme for partitioning temporal data.

Hourglass is available on GitHub.

We have found that two types of sliding window computations are extremely common in practice:

  • Fixed-length: the length of the window is set to some constant number of days and the entire window moves forward as new data becomes available. Example: a daily report summarizing the number of visitors to a site from the past 30 days.
  • Fixed-start: the beginning of the window stays constant, but the end slides forward as new input data becomes available. Example: a daily report summarizing all visitors to a site since the site launched.
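To make the two window types concrete, here is a minimal Python sketch of how a daily report could be updated incrementally instead of being recomputed from all the input each day. It is only an illustration under stated assumptions: it is not Hourglass's API (Hourglass runs on Hadoop), the class names `FixedLengthWindow` and `FixedStartWindow` are hypothetical, the daily totals are made up, and the metric is assumed to be additive (e.g. visit counts rather than distinct visitors), so that an old day can simply be subtracted.

```python
from collections import deque


class FixedLengthWindow:
    """Running sum over the most recent `length` daily totals (hypothetical helper)."""

    def __init__(self, length):
        self.length = length
        self.days = deque()  # per-day totals currently inside the window
        self.total = 0

    def add_day(self, day_total):
        # Merge the newest day, then subtract the day that slid out of the window.
        self.days.append(day_total)
        self.total += day_total
        if len(self.days) > self.length:
            self.total -= self.days.popleft()
        return self.total


class FixedStartWindow:
    """Running sum from a fixed start; only the end of the window moves."""

    def __init__(self):
        self.total = 0

    def add_day(self, day_total):
        # Nothing ever leaves the window, so only the newest day is merged.
        self.total += day_total
        return self.total


daily_visits = [120, 80, 100, 90, 110]  # made-up daily totals, oldest first

last_3_days = FixedLengthWindow(length=3)
since_launch = FixedStartWindow()

fixed_length_report = [last_3_days.add_day(v) for v in daily_visits]
fixed_start_report = [since_launch.add_day(v) for v in daily_visits]
```

In both cases each new day costs a constant amount of work on top of the previous result, which is the kind of saving incremental processing aims for. A metric that cannot be subtracted, such as distinct visitors, would instead need per-day partial results that are re-merged for the fixed-length case.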

Original title and link: LinkedIn’s Hourglass: Incremental data processing in Hadoop (NoSQL database©myNoSQL)


Hadoop will be made better through engineering

Dan Woods prefacing an interview with Scott Gnau of Teradata:

In this vision, because Hadoop can store unstructured and structured information, because it can scale massively, because it is open source, because it allows many forms of analysis, because it has a thriving ecosystem, it will become the one repository to rule them all.

In my view, the most extreme advocates for Hadoop need to sober up and right size both expectations and time frames. Hadoop is important but it won’t replace all other repositories. Hadoop will change the world of data, but not in the next 18 months. The open source core of Hadoop is a masterful accomplishment, but like many open source projects, it will be made better through engineering.

You have to agree: there’s no engineering behind Hadoop. Just a huge number of intoxicated… brogrammers.

Original title and link: Hadoop will be made better through engineering (NoSQL database©myNoSQL)


Pigs can build graphs too for graph analytics

Extremely interesting and intriguing usage and extension of Pig at Intel:

Pigs eat everything and Pig can ingest many data formats and data types from many data sources, in line with our objectives for Graph Builder. Also, Pig has native support for local file systems, HDFS, and HBase, and tools like Sqoop can be used upstream of Pig to transfer data into HDFS from relational databases. One of the most fascinating things about Pig is that it only takes a few lines of code to define a complex workflow comprised of a long chain of data operations. These operations can map to multiple MapReduce jobs and Pig compiles the logical plan into an optimized workflow of MapReduce jobs. With all of these advantages, Pig seemed like the right tool for graph ETL, so we re-architected Graph Builder 2.0 as a library of User-Defined Functions (UDFs) and macros in Pig Latin.

Original title and link: Pigs can build graphs too for graph analytics (NoSQL database©myNoSQL)


Datameer raises $19 Million

Announced yesterday:

“This funding is entirely about allowing us to meet the nonstop global demand for our product. Across every industry, companies are moving past Hadoop science projects and realizing they need a proven big data analytics tool that finally frees them from schemas and ETL,” said Stefan Groschupf, CEO of Datameer.

Funding in the Hadoop space is at a higher level than in the pure NoSQL database market. In the Big Data/BI market it’s easier to grasp who the competitors are and the market potential they’re fighting for. In the NoSQL market, many are still afraid to think that some of these players will actually make (big) dents in incumbents’ market segments.

Original title and link: Datameer raises $19 Million (NoSQL database©myNoSQL)


Challenges and Opportunities for Big Data - an interview with Actian's CTO Mike Hoskins

Roberto V. Zicari interviews Actian’s CTO Mike Hoskins:

Until recently, most data projects were solely focused on preparation. Seminal developments in the big data landscape, including Hortonworks Data Platform (HDP) 2.0 and the arrival of YARN (Yet Another Resource Negotiator) – which takes Hadoop’s capabilities in data processing beyond the limitations of the highly regimented and restrictive MapReduce programming model – provides an opportunity to move beyond the initial hype of big data and instead towards the more high-value work of predictive analytics.

Original title and link: Challenges and Opportunities for Big Data - an interview with Actian’s CTO Mike Hoskins (NoSQL database©myNoSQL)


Picking the Right Platform: Big Data or Traditional Warehouse?

Stephen Swoyer (TDWI) summarizes Richard Winter’s research into the cost efficiency of Hadoop vs. data warehouses:

“Under what circumstances, in fact, does Hadoop save you a lot of money, and under what circumstances does a data warehouse save you a lot of money?”

The conversation happened at a Teradata event, so you might already guess some of the findings. Anyway, without seeing the data, it’s difficult to agree or disagree:

In fact, he argued that misusing Hadoop for some types of decision support workloads could cost up to 2.8x more than a data warehouse.

Original title and link: Picking the Right Platform: Big Data or Traditional Warehouse? (NoSQL database©myNoSQL)


The three most common ways data junkies are using Hadoop

Shaun Connolly (Hortonworks) lists the three most common uses of Hadoop in a guest post on GigaOm:

  1. Data refinery
  2. Data exploration
  3. Application enrichment

Nothing new here, except the new buzzwords used to describe Hadoop use cases that have been slowly but steadily establishing themselves as patterns. And even if they sound nicer than ETL, analytics, etc., I doubt anyone needed new terms.

Original title and link: The three most common ways data junkies are using Hadoop (NoSQL database©myNoSQL)

Hadoop in Fedora 20

Being included in the default Fedora distro is yet another big step for Hadoop.

The hardest part about getting Hadoop into Fedora? “Dependencies, dependencies, dependencies!” says Farrellee. […]

For Hadoop? It was more difficult than usual. “There were some dependencies that were just missing and we had to work through those as you’d expect - there were a lot of these. Then there were dependencies that were older than what upstream was using - rare, I know, for Fedora, which aims to be on the bleeding edge. The hardest to deal with were dependencies that were newer than what upstream was using. We tried to write patches for these, but we weren’t always successful. […]”

On the other hand, one thing that continues to puzzle me is: how many different people coming from different backgrounds need to say that Hadoop is crazy complex?

Original title and link: Hadoop in Fedora 20 (NoSQL database©myNoSQL)


HAWK: Performance monitoring tool for Hive

JunHo Cho’s slides introducing HAWK, a performance monitoring tool for Hive:

✚ I couldn’t find a link for HAWK. The slides point to NexR.

Original title and link: HAWK: Performance monitoring tool for Hive (NoSQL database©myNoSQL)