sqoop: All content tagged as sqoop in NoSQL databases and polyglot persistence
The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.
- Apache Hadoop 1.0.x
- Apache Zookeeper 3.4.3
- Apache HBase 0.92.0
- Apache Hive 0.8.1
- Apache Pig 0.9.2
- Apache Mahout 0.6.1
- Apache Oozie 3.1.3
- Apache Sqoop 1.4.1
- Apache Flume 1.0.0
- Apache Whirr 0.7.0
Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. Practically, though, its goal is to ensure that all the components of the wide Hadoop ecosystem remain interoperable.
Original title and link: Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop ( ©myNoSQL)
Sqoop, the tool used to efficiently transfer bulk data from external datastores and enterprise data warehouses into HDFS, HBase, etc., has become an Apache top-level project. Sqoop is already supported by many NoSQL databases (for quick examples, see DataStax Enterprise 2.0 or Couchbase), and becoming a top-level project is also a sign of the maturity of the project’s community.
Original title and link: Sqoop Becomes an Apache Top-Level Project ( ©myNoSQL)
The tl;dr version is: DataStax has announced
Cassandra + Hadoop + Solr on the same cluster plus Sqoop, Log4j, and workload provisioning = DataStax Enterprise 2.0
For the longer version, there are a couple of new things worth emphasizing in this release:
- Fully integrated enterprise search
- RDBMS data migration
- Snap-in application log ingestion
- Improvements to OpsCenter
- Elastic workload provisioning
Let’s take these one by one:
Fully integrated enterprise search or Solr on top of Cassandra
Cassandra’s distribution model is strongly inspired by Amazon Dynamo and is characterized by high availability, elasticity, and fault tolerance. Solr is the search platform built on top of Lucene. Over time people learned how to scale Solr, but current approaches are far from being simple or offering an out-of-the-box experience. Taking the Solr protocol and indexing capabilities and putting those on top of the Cassandra architecture makes a lot of sense.
Actually, this has already been done in the form of Solandra (note that the Solr integration in DataStax Enterprise 2.0 is not based on Solandra, though). For a scalable search solution there’s already ElasticSearch, but for someone running a Cassandra cluster, this looks like a useful addition to the stack.
DataStax has already shown this direction with what was initially called Brisk (or Brangelina for friends): Hadoop on top of the Cassandra cluster, which became DataStax Enterprise 1.0. Solr on top of Cassandra is 2.0, but what will 3.0 be?
There are two cherries on top of this integration of Solr: easy index rebuild operations and CQL (Cassandra Query Language) access. I’ve seen XQuery translated to Lucene searches before, but I still need to see a SQL-like language translation.
As I’ve learned from Riak at Clipboard: Why Riak and How We Made Riak Search Faster, there is some complexity involved in scaling multi-matching search queries with term-based partitioning. Cassandra uses two partitioning strategies: random and order-preserving. It would be interesting to hear which partitioning strategy is used for Solr indexes. Update: I’ve got some answers, so there’ll be a follow-up with more details.
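To make the difference between the two partitioning strategies concrete, here is a minimal sketch. It is not Cassandra’s implementation: the node names and split points are made up, and the modulo step stands in for Cassandra’s token ring. The only detail taken from Cassandra is that the random partitioner derives its tokens from an MD5 hash of the row key.

```python
import bisect
import hashlib

# Hypothetical four-node cluster, used only for illustration.
NODES = ["node-a", "node-b", "node-c", "node-d"]

# Hypothetical split points for the order-preserving case:
# keys below "g" go to node-a, below "n" to node-b, below "t" to node-c,
# and everything else to node-d.
SPLITS = ["g", "n", "t"]


def random_partition(key, nodes=NODES):
    """Place a key by hashing it, so neighbouring keys scatter across nodes.

    Simplification: a real cluster assigns token ranges on a ring; here the
    MD5 token is simply reduced modulo the number of nodes.
    """
    token = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return nodes[token % len(nodes)]


def ordered_partition(key, nodes=NODES, splits=SPLITS):
    """Place a key by its sort order, so key ranges stay on the same node."""
    return nodes[bisect.bisect_right(splits, key)]


if __name__ == "__main__":
    for key in sorted(["apple", "apricot", "grape", "melon", "tomato", "zebra"]):
        print(key, random_partition(key), ordered_partition(key))
```

With the order-preserving variant a range scan touches a contiguous set of nodes, at the price of possible hot spots. With the random variant load spreads evenly, but a range of keys ends up scattered over the cluster.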
RDBMS data migration: it must be Sqoop
Nothing special here. You have a DataStax Enterprise cluster with some Hadoop nodes defined and you need to process data. But some of it lives in relational databases. Sqoop to the rescue.
Snap-in application log ingestion: Flume or Scribe? No, it’s Log4j
When I read this bullet point my first thought was that this is Flume. Or maybe Scribe. But most probably Flume. It looks like DataStax went a different route and offers log ingestion using Log4j. It’s true that Log4j or one of its flavors most probably exists in every Java project, but it still feels like an odd choice. On the other hand, there’s a Cassandra plugin for Flume.
OpsCenter Enterprise 2.0
The OpsCenter is the management, monitoring, and control tool for DataStax Enterprise. The new version includes pretty much what you’d expect from an admin/monitoring tool:
- multi-cluster monitoring
- visual backup
- search monitoring
Looking back at the NoSQL administration/monitoring tools I’ve seen lately, I’m pretty sure I’ve identified a trend: they all come in various shades of black.
[Screenshot: DataStax OpsCenter Enterprise]
Elastic workload provisioning
I’ve left for the end the feature that interested me most: elastic workload provisioning.
To better understand what this is, I had to go back to DataStax Enterprise 1.x where a node could be either a Cassandra node (OLTP) or a Hadoop node (processing). The new version allows quasi-dynamic node provisioning by changing the mode of a cluster (between Hadoop, Cassandra, Solr) with a stop/start operation. So given a cluster one could adjust its capacity and performance for different workloads (e.g. time-sensitive applications or temporary cluster operations).
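As a toy model of the idea (not DataStax’s API; every name here is invented for illustration), think of each node as having a mode that can only be changed by stopping and restarting it, with the cluster’s capacity per workload simply being the count of running nodes in each mode:

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    CASSANDRA = "cassandra"
    HADOOP = "hadoop"
    SOLR = "solr"


@dataclass
class Node:
    name: str
    mode: Mode
    running: bool = True

    def switch_mode(self, new_mode):
        """Change the workload of this node with a stop/start cycle."""
        self.running = False   # stop the node
        self.mode = new_mode   # reconfigure it for the new workload
        self.running = True    # start it again


def capacity(nodes):
    """Count the running nodes available for each workload."""
    return Counter(node.mode for node in nodes if node.running)


if __name__ == "__main__":
    # Hypothetical four-node cluster: three real-time nodes, one analytics node.
    cluster = [Node("n1", Mode.CASSANDRA), Node("n2", Mode.CASSANDRA),
               Node("n3", Mode.CASSANDRA), Node("n4", Mode.HADOOP)]
    # Temporarily shift one node to Hadoop for a heavy batch job.
    cluster[2].switch_mode(Mode.HADOOP)
    print(capacity(cluster))
```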
Workload management is a feature present in most commercial data warehouse solutions. Even if it is still in its very early days, DataStax Enterprise’s workload provisioning is the first step towards workload management in the NoSQL space.
Original title and link: Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0 ( ©myNoSQL)
Edd Dumbill enumerates the various components of the Hadoop ecosystem:
Original title and link: The components and their functions in the Hadoop ecosystem ( ©myNoSQL)
There’s been a series of events lately that makes me think Microsoft is nowhere near accepting defeat in the cloud services area. As for Microsoft’s Project Isotope, things are much simpler than the ZDNet article makes them sound: Microsoft is working on integrating Hadoop and its toolchain with its own products (SQL Server Analysis Services, PowerPivot).
A picture worth more than the 626 words.
I bet the details of the integration are fascinating and far from simple, but the article does not focus on those.
Original title and link: Project Isotope Will Bring Together Hadoop Toolchain With Microsoft’s Data Products ( ©myNoSQL)
The Couchbase Hadoop Connector utilizes the Sqoop plug-in to stream data from the Couchbase system to Cloudera’s Distribution Including Apache Hadoop (CDH), enabling consistent application performance while also allowing for heavy duty MapReduce processing of the same dataset. In an interactive web application environment, such as an ad targeting platform, this ability ensures low latency and high throughput to make optimized decisions about real-time ad placement.
I’m wondering if this connector has already been used by the AOL Advertising Architecture, which is using Hadoop and Membase. If it hasn’t, how would it improve things?
Original title and link: Couchbase Hadoop Connector: Another Sqoop Example ( ©myNoSQL)
The other day I posted about Sqoop’s first release under the Apache umbrella, so I thought I’d provide a few more details about where Sqoop fits in the picture. I’ve embedded below 3 presentations that will answer questions like what Sqoop is, when and where to use it, and how to use it.
Sqoop, originally created at Cloudera and now in the Apache Incubator, is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores. You can use Sqoop to import data from external structured datastores into HDFS or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
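For a feel of what the two directions look like in practice, here is a small sketch that only assembles `sqoop import` and `sqoop export` command lines; it does not run them. The JDBC URL, table name, HDFS paths, and user name are made-up placeholders, and the helper functions are my own. The flags used (`--connect`, `--table`, `--target-dir`, `--export-dir`, `--username`, `--num-mappers`) are standard Sqoop 1 arguments.

```python
import shlex


def sqoop_import_args(jdbc_url, table, target_dir, username=None, num_mappers=None):
    """Build the argument list for importing one relational table into HDFS."""
    args = ["sqoop", "import",
            "--connect", jdbc_url,
            "--table", table,
            "--target-dir", target_dir]
    if username is not None:
        args += ["--username", username]
    if num_mappers is not None:
        # Number of parallel map tasks used for the transfer.
        args += ["--num-mappers", str(num_mappers)]
    return args


def sqoop_export_args(jdbc_url, table, export_dir, username=None):
    """Build the argument list for exporting an HDFS directory into a table."""
    args = ["sqoop", "export",
            "--connect", jdbc_url,
            "--table", table,
            "--export-dir", export_dir]
    if username is not None:
        args += ["--username", username]
    return args


def as_command_line(args):
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Placeholder connection details, for illustration only.
    url = "jdbc:mysql://db.example.com/shop"
    print(as_command_line(sqoop_import_args(url, "orders", "/data/orders",
                                            username="etl", num_mappers=4)))
    print(as_command_line(sqoop_export_args(url, "order_totals", "/data/order_totals",
                                            username="etl")))
```

Passing the resulting list to something like `subprocess.run` would execute the transfer on a machine where Sqoop and the matching JDBC driver are installed.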
To get a better idea of where Apache Sqoop fits, check this video from Hadoop World 2011 (requires registration), which describes key scenarios driving Hadoop and RDBMS integration and reviews the Apache Sqoop project. Besides supporting data movement between Hadoop and any JDBC database, Sqoop also provides a framework that allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza, etc.
Original title and link: Apache Sqoop Announces First Release ( ©myNoSQL)