data warehouse: All content tagged as data warehouse in NoSQL databases and polyglot persistence
Netezza, the data warehousing appliance maker, has been acquired by IBM for approximately $1.7 billion. While I haven’t covered Netezza before, this acquisition is interesting from the perspective of the Big Data market.
Update: Daniel Abadi wrote ☞ here about a possible Netezza acquisition by IBM over a year ago.
Hive is a data warehouse infrastructure built on top of Hadoop, offering tools for data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop. To better understand the benefits of Hive, you can check how Facebook is using Hive to run a petabyte-scale data warehouse.
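To make the SQL-like flavor concrete, here is a minimal Hive QL sketch; the table, column, and path names are illustrative, not taken from Facebook’s setup:

```sql
-- Hypothetical page-view log table (names are illustrative).
CREATE TABLE page_views (view_time INT, user_id BIGINT, page_url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- ETL step: load raw log files already sitting in HDFS.
LOAD DATA INPATH '/logs/page_views.tsv' INTO TABLE page_views;

-- A familiar-looking aggregation; Hive compiles this into MapReduce jobs.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url;
```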
Recently, John Sichi, a member of the Data Infrastructure team at Facebook, published an article on integrating Hive and HBase. There is also interest in having Hive work with Cassandra; this is ☞ tracked in the Cassandra JIRA (nb: not sure there’s been any progress on this yet though).
Hypertable-Hive integration allows Hive QL statements to read and write to Hypertable via SELECT and INSERT commands. […] Currently the Hypertable storage handler only supports external, non-native tables.
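For a feel of how these integrations are declared, here is a sketch along the lines of the Hive–HBase integration docs: a Hive table backed by an HBase table via a storage handler, with columns mapped through SERDEPROPERTIES (the Hypertable handler follows the same storage-handler pattern, but only with external, non-native tables). Table and column-family names are illustrative:

```sql
CREATE TABLE hbase_table_1 (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
```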
Somehow, all this work to provide a common data warehouse infrastructure on top of existing NoSQL solutions (or at least the wide-column stores, which focus on large-scale datasets) seems to confirm that there’s no need for a common NoSQL language.
Lately I’ve been mentioning Hive quite a few times when writing about working with NoSQL data, but I was missing a good slidedeck covering the Hive architecture, usage scenarios, and other interesting details. The presentation embedded below, coming from the Facebook Data Infrastructure team, provides all of these and much more (e.g. Hive usage at Facebook, Hadoop and Hive clusters, etc.).
The fact that you are storing your data in a NoSQL solution doesn’t mean that you are done with it. You’ll still have to put it to work: transform and move it, or do some data warehousing. And the lack of SQL should not stop you from doing any of these.
But MapReduce is not the only option available and I’d like to quickly introduce you to a couple of alternative solutions.
Working with HBase can at times be quite verbose, and while Java is not very good at creating DSLs, even a slightly more fluent API is sometimes useful. This is exactly what HBase-dsl brings you:
However, I found myself writing tons of code to perform some fairly simple tasks. So I set out to simplify my HBase code and ended up writing a Java HBase DSL. It’s still fairly rough around the edges, but it does allow the use of standard Java types and it’s extensible.
hBase.save("test")
     .row("abcd")
     .family("famA")
     .col("col1", "hello world!");

String value = hBase.fetch("test")
                    .row("abcd")
                    .family("famA")
                    .value("col1", String.class);
HBql’s goal is to bring a more SQL-ish interface to HBase for those missing SQL. You can take a look at ☞ HBql statements to get a better feel for what it looks like.
Hive is a data warehouse infrastructure for Hadoop that offers a SQL-like query language to enable easy data ETL.
Pig is a platform for analyzing large data sets, built on Hadoop. I found a great article ☞ comparing Pig Latin over Hadoop to SQL over a relational database:
- Pig Latin is procedural, where SQL is declarative.
- Pig Latin allows pipeline developers to decide where to checkpoint data in the pipeline.
- Pig Latin allows the developer to select specific operator implementations directly rather than relying on the optimizer.
- Pig Latin supports splits in the pipeline.
- Pig Latin allows developers to insert their own code almost anywhere in the data pipeline.
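A short Pig Latin sketch illustrates the procedural style and the pipeline splits listed above; file paths and field names are hypothetical:

```pig
-- Hypothetical user data; paths and fields are illustrative.
users  = LOAD 'users.tsv' AS (name:chararray, age:int);

-- Procedural style: every step names an intermediate relation.
adults = FILTER users BY age >= 18;

-- Split the pipeline into two branches.
SPLIT adults INTO young IF age < 30, older IF age >= 30;

-- An explicit STORE acts as a checkpoint for an intermediate result.
STORE young INTO 'young_users';

grouped = GROUP older BY age;
counts  = FOREACH grouped GENERATE group AS age, COUNT(older) AS n;
STORE counts INTO 'older_user_counts';
```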
But don’t think that HBase and Hadoop are the only ones getting such tools. In the graph databases world, there is ☞ Gremlin: a graph-based programming language meant to ease graph querying, analysis, and manipulation.
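As a taste of Gremlin, here is a sketch assuming the early Groovy-flavored syntax and the toy graph bundled with the Blueprints library; the vertex id and the edge label are illustrative:

```groovy
// Toy graph shipped with Blueprints (assumes Gremlin/Blueprints on the classpath).
g = TinkerGraphFactory.createTinkerGraph()

// Start at vertex 1, walk outgoing 'knows' edges,
// and read the names of the vertices reached.
g.v(1).out('knows').name
```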
I think sooner rather than later we will see more such solutions appearing in the NoSQL environment.