NoSQL Benchmarks NoSQL use cases NoSQL Videos NoSQL Hybrid Solutions NoSQL Presentations Big Data Hadoop MapReduce Pig Hive Flume Oozie Sqoop HDFS ZooKeeper Cascading Cascalog BigTable Cassandra HBase Hypertable Couchbase CouchDB MongoDB OrientDB RavenDB Jackrabbit Terrastore Amazon DynamoDB Redis Riak Project Voldemort Tokyo Cabinet Kyoto Cabinet memcached Amazon SimpleDB Datomic MemcacheDB M/DB GT.M Amazon Dynamo Dynomite Mnesia Yahoo! PNUTS/Sherpa Neo4j InfoGrid Sones GraphDB InfiniteGraph AllegroGraph MarkLogic Clustrix CouchDB Case Studies MongoDB Case Studies NoSQL at Adobe NoSQL at Facebook NoSQL at Twitter



sqoop: All content tagged as sqoop in NoSQL databases and polyglot persistence

Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop

Apache Bigtop:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

Currently packaging:

  • Apache Hadoop 1.0.x
  • Apache Zookeeper 3.4.3
  • Apache HBase 0.92.0
  • Apache Hive 0.8.1
  • Apache Pig 0.9.2
  • Apache Mahout 0.6.1
  • Apache Oozie 3.1.3
  • Apache Sqoop 1.4.1
  • Apache Flume 1.0.0
  • Apache Whirr 0.7.0

Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. Practically though it’s goal is to ensure that all the components of the wide Hadoop ecosystem remain interoperable.

Original title and link: Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop (NoSQL database©myNoSQL)

The Timeline of the Sqoop Project

A bit of history of yet another BigData-ish/NoSQLish graduating project:

A timeline of Sqoop Project

Original title and link: The Timeline of the Sqoop Project (NoSQL database©myNoSQL)


Sqoop Becomes an Apache Top-Level Project

Sqoop, the tool used to efficiently transfer bulk data from external datastores and enterprise data warehouses into HDFS, HBase, etc., has become an Apache top-level project. Not only is Sqoop supported by many NoSQL databases—just as some quick examples see DataStax Enterprise 2.0 or Couchbase—, but being a top level project is a sign of the maturity of the project community.

Original title and link: Sqoop Becomes an Apache Top-Level Project (NoSQL database©myNoSQL)

Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0

The tl;dr version is: DataStax has announced

Cassandra + Hadoop + Solr on the same cluster plus Sqoop, Log4j, and workload provisioning = DataStax Enterprise 2.0

For the longer version, there are a couple of new things worth emphasizing in this release:

  1. Fully integrated enterprise search
  2. RDBMS data migration
  3. Snap-in application log ingestion
  4. improvements to OpsCenter
  5. Elastic workload provisioning

Let’s take these one by one:

Fully integrated enterprise search or Solr on top of Cassandra

Cassandra distribution model is strongly inspired by Amazon Dynamo being characterized by high availability, elasticity, and fault tolerance. Solr is the search platform built on top of Lucene. Over time people learned how to scale Solr, but current approaches are far from being simple or offering an out of the box experience. Taking the Solr protocol and indexing capabilities and putting those on top of the Cassandra architecture makes a lot of sense.

Actually this has already been done in the form of Solandra (nb Solr integration in DataStax Enter. 2.0 is not based on Solandra though). For a scalable search solution there’s already ElasticSearch, but for someone running a Cassandra cluster, this looks like a useful addition to the stack.

DataStax has already showed this direction with what was called initially Brisk (or Brangelina for friends): Hadoop on top of the Cassandra cluster that became DataStax Enterprise 1.0. Solr on top of Cassandra is 2.0, but what will be the 3.0?

There are two cherries on top of this integration of Solr: easy index rebuild operations and CQL (Cassandra Query Language) access. I’ve seen XQuery translated to Lucene searches before, but I still need to see a SQL-like language translation.

As I’ve learned from Riak at Clipboard: Why Riak and How We Made Riak Search Faster, there is some complexity involved in scaling multi-matching search queries with term-based partitioning. Cassandra uses two partitioning strategies: random and order-preserving. It would be interesting to hear what partitioning strategy is used for Solr indexes. Update: I’ve got some answers so there’ll be a follow up with more details.

RDBMS data migration: it must be Sqoop

Nothing special here. You have a DataStax Enterprise cluster with some Hadoop nodes defined and you need to process data. But some of it lives in relational databases. Sqoop at rescue.

Snap-in application log ingestion: Flume or Scribe? No, it’s Log4j

When I read this bullet point my first thought was this is Flume. Or maybe Scribe. But most probably Flume. It looks like DataStax went a different route and offers log ingestion using Log4j. It’s true that Log4j or one of its flavors most probably exist in every Java project, but it still feels like an odd choice. On the other hand there’s a Cassandra plugin for Flume.

OpsCenter Enterprise 2.0

The OpsCenter is the management, monitoring, and control tool for DataStax Enterprise. The new version includes pretty much what you’d expect from an admin/monitoring tool:

  • multi-cluster monitoring
  • visual backup
  • search monitoring

Looking back at the NoSQL administration/monitoring tools I’ve seen lately, I’m pretty sure I’ve identified a trend: they all come in various shades of black.

DataStax OpsCenter Enterprise:

DataStax OpsCenter

Riak Control:

Riak Control

Elastic workload provisioning

I’ve left at the end the feature that got me most interested into: elastic workload provisioning.

To better understand what this is, I had to go back to DataStax Enterprise 1.x where a node could be either a Cassandra node (OLTP) or a Hadoop node (processing). The new version allows quasi-dynamic node provisioning by changing the mode of a cluster (between Hadoop, Cassandra, Solr) with a stop/start operation. So given a cluster one could adjust its capacity and performance for different workloads (e.g. time-sensitive applications or temporary cluster operations).

Workload management is a feature present in most of the commercial data warehouse solutions. Even if in the very early days, DataStax Enterprise’s workload provisioning is the first take towards workload management in the NoSQL space.

Original title and link: Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0 (NoSQL database©myNoSQL)

The components and their functions in the Hadoop ecosystem

Edd Dumbill enumerates the various components of the Hadoop ecosystem:

Hadoop ecosystem

My quick reference of the Hadoop ecosystem is including a couple of other tools that are not in this list, with the exception of Ambari and HCatalog which were released later.

Original title and link: The components and their functions in the Hadoop ecosystem (NoSQL database©myNoSQL)

Project Isotope Will Bring Together Hadoop Toolchain With Microsoft’s Data Products

There’s a series of events lately that makes me think Microsoft is nowhere near accepting defeat in the cloud services area. As regards Microsoft’s Project Isotop, things are much simpler than ZDNet article make them sound[1]: Microsoft is working on integrating Hadoop and its toolchain with their own products (SQL Server Analysis Services, PowerPivot).

Microsoft Project Isotop

A picture worth more than the 626 words.

  1. I bet the details of integration are fascinating and far from being simple, but the article is not focusing on those  

Original title and link: Project Isotope Will Bring Together Hadoop Toolchain With Microsoft’s Data Products (NoSQL database©myNoSQL)

Couchbase Hadoop Connector: Another Sqoop Example

Announced a couple of days ago, the Couchbase Hadoop Connector is just another example of using Sqoop:

The Couchbase Hadoop Connecter utilizes the Sqoop plug-in to stream data from the Couchbase system to Cloudera’s Distribution Including Apache Hadoop (CDH), enabling consistent application performance while also allowing for heavy duty MapReduce processing of the same dataset. In an interactive web application environment, such as an ad targeting platform, this ability ensures low latency and high throughput to make optimized decisions about real-time ad placement.

I’m wondering if this connector have already been used by the AOL Advertising Architecture, which is using Hadoop and Membase. In case it wasn’t how it would improve things[1]?

  1. If you know anyone that could speak about this (from Couchbase, Cloudera, or AOL) please contact me  

Original title and link: Couchbase Hadoop Connector: Another Sqoop Example (NoSQL database©myNoSQL)

Apache Sqoop: What, When Where, How

The other day I’ve posted about Sqoop’s first release under Apache umbrella, so I’ve thought of providing a bit more details about where Sqoop fits in picture. I’ve embedded below 3 presentations that will answer questions like what is Sqoop, when and where to use Sqoop, how to use Sqoop.

Apache Sqoop Announces First Release

Sqoop, originally created at Cloudera and now on Apache incubator, is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores. You can use Sqoop to import data from external structured datastores into HDFS or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

Yesterday Apache Sqoop announced its first release under the Apache umbrella with a long list of changes.

To get a better idea of where Apache Sqoop fits, check this video from Hadoop World 2011 (requires registration) which describes key scenarios driving Hadoop and RDBMS integration and reviewes Apache Sqoop project, which besides supporting data movement between Hadoop and any JDBC database, it is also providing an framework which allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza etc.

Original title and link: Apache Sqoop Announces First Release (NoSQL database©myNoSQL)

Biodiversity Indexing: Offline Processing With Hadoop, Hive, Sqoop, Oozie

The architecture for offline processing biodiversity based on Sqoop, Hadoop, Oozie, and Hive:

Hadoop Sqoop Oozie Hive Biodiversity Indexing

And its future:

Following this processing work, we expect to modify our crawling to harvest directly into HBase. The flexibility HBase offers will allow us to grow incrementally the richness of the terms indexed in the Portal, while integrating nicely into Hadoop based workflows. The addition of coprocessors to HBase is of particular interest to further reduce the latency involved in processing, by eliminating batch processing altogether.

Many companies working with large datasets have to deal with multiple systems and duplicate data between the online services and offline processors. While the infrastructure costs are going down, the costs of complexity are not. The HBase + Hadoop and Cassandra + Brisk combos are starting to address this problem.

Original title and link: Biodiversity Indexing: Offline Processing With Hadoop, Hive, Sqoop, Oozie (NoSQL database©myNoSQL)


Experimenting with Hadoop using Cloudera VirtualBox Demo

CDH Mac OS X VirtualBox VM

If you don’t count the download, you’ll get this up and running in 5 minutes tops. At the end you’ll have Hadoop, Sqoop, Pig, Hive, HBase, ZooKeeper, Oozie, Hume, Flume, and Whirr all configured and ready to experiment with.

Making it easy for users to experiment with these tools increases the chances for adoption. Adoption means business.

Original title and link: Experimenting with Hadoop using Cloudera VirtualBox Demo (NoSQL databases © myNoSQL)


Cloudera’s Distribution for Apache Hadoop version 3 Beta 4

New version of Cloudera’s Hadoop distro — complete release notes available here:

CDH3 Beta 4 also includes new versions of many components. Highlights include:

  • HBase 0.90.1, including much improved stability and operability.
  • Hive 0.7.0rc0, including the beginnings of authorization support, support for multiple databases, and many other new features.
  • Pig 0.8.0, including many new features like scalar types, custom partitioners, and improved UDF language support.
  • Flume 0.9.3, including support for Windows and improved monitoring capabilities.
  • Sqoop 1.2, including improvements to usability and Oracle integration.
  • Whirr 0.3, including support for starting HBase clusters on popular cloud platforms.

Plus many scalability improvements contributed by Yahoo!.

Cloudera’s CDH is the most popular Hadoop distro bringing together many components of the Hadoop ecosystem. Yahoo remains the main innovator behind Hadoop.

Original title and link: Cloudera’s Distribution for Apache Hadoop version 3 Beta 4 (NoSQL databases © myNoSQL)