


data warehouse: All content tagged as data warehouse in NoSQL databases and polyglot persistence

Notes on data warehouse appliance prices

Curt Monash:

Reasons people criticize per-terabyte data warehouse appliance price metrics include:

  • Price-per-terabyte metrics ignore issues of throughput, latency, workload, and so on.
  • Price-per-terabyte metrics ignore quality of storage medium (slow disks, fast disks, Flash, etc.)
  • Price-per-terabyte metrics can be radically affected by changes in disk size.

I’m not a specialist, but I don’t think price-per-terabyte metrics include operational costs either.
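To see how radically disk size alone can swing the metric, here is a small illustrative calculation (the chassis price and disk sizes are made-up figures, not vendor numbers):

```python
# Hypothetical appliance: same chassis price, two disk options.
chassis_price = 500_000          # USD, made-up figure
disks = 24                       # drive bays

def price_per_tb(disk_tb):
    """Naive $/TB metric: total price over raw capacity."""
    return chassis_price / (disks * disk_tb)

# Swapping 1 TB disks for 3 TB disks "improves" the metric 3x,
# even though throughput and latency are unchanged (or worse).
small = price_per_tb(1)   # ~20,833 $/TB
large = price_per_tb(3)   # ~6,944 $/TB
print(round(small), round(large))
```

The same box looks three times cheaper per terabyte, which is exactly the distortion the criticisms above point at.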

Original title and link: Notes on data warehouse appliance prices (NoSQL databases © myNoSQL)


Hadoop at Twitter: An Interview with Kevin Weil, Twitter Analytics Lead

Kevin Weil[1] in an interview about Twitter’s usage of Hadoop:

Hadoop is our data warehouse; every piece of data we store is archived in HDFS. We use HBase for data that sees updates frequently, or data we occasionally need low-latency access to. Every node in our cluster runs HBase. We use Java MapReduce for simple jobs, or jobs which have tight performance requirements. We use Pig for most of our analysis jobs, because its flexibility helps us iterate rapidly to arrive at the right way of looking at the data.

Our Hadoop use is also evolving: initially it was primarily used as an analysis tool to help us better understand the Twitter ecosystem, and that’s not going to change. But it’s increasingly used to build parts of products you use on the site every day such as People Search, the data for which is built with Hadoop. There are many more products like this in development.

Undeniably, Twitter is (deep) into NoSQL.

  1. Kevin Weil: Twitter Analytics Lead, @kevinweil

Original title and link: Hadoop at Twitter: An Interview with Kevin Weil, Twitter Analytics Lead (NoSQL databases © myNoSQL)


Netezza Acquired by IBM

Netezza, the data warehousing appliance maker, has been acquired by IBM for approximately $1.7 billion. While I haven’t covered Netezza before, this acquisition is interesting from the perspective of the BigData market.

Update: Daniel Abadi wrote ☞ here about a possible Netezza acquisition by IBM over a year ago.


Original title and link: Netezza Acquired by IBM (NoSQL databases © myNoSQL)

Teradata, Cloudera team up on Hadoop data warehousing

In other words, Hadoop and data warehousing isn’t a zero-sum game. The two technologies will co-exist. Teradata will bundle a connector (the Teradata Hadoop Connector) to its systems with Cloudera Enterprise at no additional cost. Cloudera will provide support for the connector as part of its enterprise subscription. The two parties will also jointly market the connector.

That’s why we are saying NoSQL is just another tool in our toolbox.

Original title and link: Teradata, Cloudera team up on Hadoop data warehousing (NoSQL databases © myNoSQL)


Pig and Hive at Yahoo!

Fantastic post on Yahoo! Hadoop blog presenting a series of scenarios where using Pig and Hive makes things a lot better:

The widespread use of Pig at Yahoo! has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well. Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse — as soon as the factory is finished, it is available in the warehouse. It will also enable us to share — across both the factory and the warehouse — metadata, monitoring, and management tools; support and operations teams; and hardware. So we are excited to add Hive to our toolkit, and look forward to using both these tools together as we lean on Hadoop to do more and more of our data processing.

The use cases mentioned in the post:

  • data preparation and presentation:

    Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.

  • data factories: pipelines (Pig + Oozie), iterative processing (Pig), research (Pig)
  • data warehouse: business-intelligence analysis and ad-hoc queries

    In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use. And it is already in use by both the tools and users in the field. The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop.

Yahoo! gets way too little credit for its work on big data and its contributions to open source.

Original title and link for this post: Pig and Hive at Yahoo! (published on the NoSQL blog: myNoSQL)


NoSQL Databases and Data Warehousing

I didn’t know data warehousing strictly imposes a relational model:

From a philosophical standpoint, my largest problem with NoSQL databases is that they don’t respect relational theory. In short, they aren’t meant to deal with sets of data, but lists. Relational algebra was created to deal with the large sets of data and have them interact. Reporting and analytics rely on that.

I’d bet people building and using Hive, Pig, Flume and other data warehousing tools would disagree with Eric Hewitt.

NoSQL Databases and Data Warehousing originally posted on the NoSQL blog: myNoSQL


More Integrations for Hive

Hive is a data warehouse infrastructure built on top of Hadoop, offering tools for data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop[1]. To better understand the benefits of Hive, you can check how Facebook is using Hive to deal with a petabyte-scale data warehouse.

Recently, John Sichi, a member of the Data Infrastructure team at Facebook, published an article on integrating Hive and HBase. There is also interest in having Hive work with Cassandra; this is ☞ tracked in the Cassandra JIRA (nb: not sure there’s been any progress on this yet though).

Hypertable, another wide-column store, provides a way to integrate with Hive, described ☞ here:

Hypertable-Hive integration allows Hive QL statements to read and write to Hypertable via SELECT and INSERT commands. […] Currently the Hypertable storage handler only supports external, non-native tables.

Somehow all this work to provide a common data warehouse infrastructure on top of existing NoSQL solutions (or at least the wide-column stores which are focused on large scale datasets) seems to confirm there’s no need for a common NoSQL language.

Presentation: Hive - A Petabyte Scale Data Warehouse Using Hadoop

Lately I’ve been mentioning Hive quite a few times when writing about working with NoSQL data, but I was missing a good slidedeck providing details of the Hive architecture, usage scenarios, and other interesting details about Hive.

The presentation embedded below, coming from the Facebook Data Infrastructure team, provides all these details and much more (e.g. Hive usage at Facebook, Hadoop and Hive clusters, etc.).

Putting your NoSQL data to work

The fact that you are storing your data in a NoSQL solution doesn’t mean that you are done with it. You’ll still have to put it to work: transform and move it, or do some data warehousing[1]. And the lack of SQL should not stop you from doing any of these.

One solution available in many NoSQL stores is MapReduce — as an example you can see how you can translate SQL to MongoDB MapReduce.
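To make the translation concrete, here is the shape of a SQL `SELECT tag, COUNT(*) … GROUP BY tag` expressed as map, shuffle, and reduce steps — sketched in plain Python rather than MongoDB’s JavaScript API, with invented sample documents:

```python
from collections import defaultdict

docs = [{"tag": "nosql"}, {"tag": "hadoop"}, {"tag": "nosql"}]

# map: emit a (key, 1) pair per document
emitted = [(d["tag"], 1) for d in docs]

# shuffle: group emitted values by key
groups = defaultdict(list)
for key, value in emitted:
    groups[key].append(value)

# reduce: sum each group -- the COUNT(*) per tag
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'nosql': 2, 'hadoop': 1}
```

The map and reduce functions are what you hand to the store; the shuffle is what the engine does for you between the two.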

But MapReduce is not the only option available and I’d like to quickly introduce you to a couple of alternative solutions.


Working with HBase can be quite verbose at times, and while Java is not very good at creating DSLs, a more fluent API is sometimes useful. This is exactly what HBase-dsl brings you:

However I found myself writing tons of code to perform some fairly simple tasks. So I set out to simplify my HBase code and ended up writing a Java HBase DSL. It’s still fairly rough around the edges but it does allow the use of standard Java types and it’s extensible.

    hBase.save("test").
        col("col1", "hello world!");

    String value = hBase.fetch("test").
        value("col1", String.class);


HBql’s goal is to bring a more SQL-ish interface to HBase for those missing SQL. You can take a look at ☞ HBql statements to get a better feeling for what it looks like.


Hive is a data warehouse infrastructure for Hadoop that provides a SQL-like query language to enable easy data ETL.


Pig is a platform for analyzing large data sets, built on Hadoop. I have found a great article ☞ comparing Pig Latin over Hadoop to SQL over a relational database:

  1. Pig Latin is procedural, where SQL is declarative.
  2. Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline.
  3. Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer.
  4. Pig Latin supports splits in the pipeline.
  5. Pig Latin allows developers to insert their own code almost anywhere in the data pipeline.
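Points 2 and 4 above (checkpoints and splits) are easiest to picture as an explicit, step-by-step pipeline. Here is a rough Python analogy of what a Pig Latin script expresses — the data and stage names are invented for illustration, not Pig syntax:

```python
# Each binding is an explicit stage, like a named relation in Pig Latin.
records = ["a,1", "b,2", "a,3"]

parsed = [r.split(",") for r in records]   # LOAD + parse stage

# Checkpoint: the developer decides this intermediate is worth keeping
# (in Pig you would STORE it; here we just materialize it).
checkpoint = list(parsed)

# Split: the same intermediate feeds two independent branches.
keys_only = [k for k, _ in checkpoint]
total = sum(int(v) for _, v in checkpoint)
print(keys_only, total)  # ['a', 'b', 'a'] 6
```

In SQL you would instead describe the final result and leave the staging, materialization, and branching to the optimizer — which is exactly the procedural-vs-declarative contrast the article draws.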

But don’t think that HBase and Hadoop are the only ones getting such tools. In the graph databases world, there is ☞ Gremlin: a graph-based programming language meant to ease graph query, analysis, and manipulation.

I think sooner rather than later we will see more such solutions appearing in the NoSQL environment.


Have you Heard of Kdb+?

Until two days ago, I didn’t know anything about Kdb+, a 16-year-old solution:

[…] a fast database for analyzing massive volumes of data.

Kdb+ is a unified database capturing and analyzing streaming and historical data.

I’ll have to read the papers to make sure I understand better what Kdb+ is.

But if you know something, don’t be shy and share it with us!

Update: I got this link from @lsbardel: A first look at kdb+, the article containing interesting info about kdb+:

kdb+ has an embedded Kx proprietary language called q [… which] is a proprietary array processing language developed by Arthur Whitney. The language serves as the query language for kdb+. q evolved from APL, as explained by its author in an ☞ interview.

The backbone of the q language is formed by atoms, lists, dictionaries and tables.
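That “lists, dictionaries and tables” progression reflects a column-oriented view of data: a q table is essentially a dictionary mapping column names to equal-length lists, and queries are whole-column (array) operations. A loose Python analogy — not q syntax, and the table contents are invented:

```python
# A "table" as a dictionary of column lists, kdb+/q style.
trades = {
    "sym":   ["IBM", "AAPL", "IBM"],
    "price": [125.0, 190.5, 126.0],
}

# An array-style query: average price where sym == "IBM".
mask = [s == "IBM" for s in trades["sym"]]
ibm_prices = [p for p, m in zip(trades["price"], mask) if m]
avg = sum(ibm_prices) / len(ibm_prices)
print(avg)  # 125.5
```

Operating on whole columns at once, rather than row by row, is what makes this model fast for the analytical workloads kdb+ targets.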

Like any serious proprietary software, kdb+ provides native interfaces for C/C++, Java, C# and Python.

[read the whole article]


Building the Data Warehouse

[…] we decided that we needed a data warehouse. After numerous false starts and blind alleys, we decided to make our own system from scratch using MySQL. We considered commercial products, and pre-built open source solutions, but they just didn’t seem to fit our needs properly. The data warehouse project commenced with phase 1.

Oldies, but goldies