sqoop: All content tagged as sqoop in NoSQL databases and polyglot persistence
Wednesday, 4 April 2012
Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop
The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.
Currently packaging:
- Apache Hadoop 1.0.x
- Apache Zookeeper 3.4.3
- Apache HBase 0.92.0
- Apache Hive 0.8.1
- Apache Pig 0.9.2
- Apache Mahout 0.6.1
- Apache Oozie 3.1.3
- Apache Sqoop 1.4.1
- Apache Flume 1.0.0
- Apache Whirr 0.7.0
Apache Bigtop looks like the first step towards the Big Data LAMP-like platform analysts are calling for. In practice, though, its goal is to ensure that all the components of the wide Hadoop ecosystem remain interoperable.
Original title and link: Apache Bigtop: Apache Big Data Management Distribution Based on Apache Hadoop (©myNoSQL)
Tuesday, 3 April 2012
The Timeline of the Sqoop Project
A bit of history of yet another BigData-ish/NoSQLish graduating project:

via: https://blogs.apache.org/sqoop/entry/apache_sqoop_graduates_from_incubator
Sqoop Becomes an Apache Top-Level Project
Sqoop, the tool used to efficiently transfer bulk data from external datastores and enterprise data warehouses into HDFS, HBase, etc., has become an Apache top-level project. Not only is Sqoop supported by many NoSQL databases (for some quick examples, see DataStax Enterprise 2.0 or Couchbase), but becoming a top-level project is also a sign of the maturity of the project’s community.
Wednesday, 21 March 2012
Cassandra + Hadoop + Solr and Sqoop and Log4j => DataStax Enterprise 2.0
The tl;dr version is: DataStax has announced
Cassandra + Hadoop + Solr on the same cluster plus Sqoop, Log4j, and workload provisioning = DataStax Enterprise 2.0
For the longer version, there are a couple of new things worth emphasizing in this release:
- Fully integrated enterprise search
- RDBMS data migration
- Snap-in application log ingestion
- Improvements to OpsCenter
- Elastic workload provisioning
Let’s take these one by one:
Fully integrated enterprise search or Solr on top of Cassandra
Cassandra’s distribution model is strongly inspired by Amazon Dynamo and is characterized by high availability, elasticity, and fault tolerance. Solr is the search platform built on top of Lucene. Over time people learned how to scale Solr, but current approaches are far from simple and do not offer an out-of-the-box experience. Taking the Solr protocol and indexing capabilities and putting them on top of the Cassandra architecture makes a lot of sense.
Actually this has already been done in the form of Solandra (note that the Solr integration in DataStax Enterprise 2.0 is not based on Solandra, though). For a scalable search solution there’s already ElasticSearch, but for someone running a Cassandra cluster, this looks like a useful addition to the stack.
DataStax has already shown this direction with what was initially called Brisk (or Brangelina for friends): Hadoop on top of the Cassandra cluster, which became DataStax Enterprise 1.0. Solr on top of Cassandra is 2.0, but what will 3.0 be?
There are two cherries on top of this integration of Solr: easy index rebuild operations and CQL (Cassandra Query Language) access. I’ve seen XQuery translated to Lucene searches before, but I have yet to see a translation from a SQL-like language.
As I’ve learned from “Riak at Clipboard: Why Riak and How We Made Riak Search Faster”, there is some complexity involved in scaling multi-term search queries with term-based partitioning. Cassandra uses two partitioning strategies: random and order-preserving. It would be interesting to hear what partitioning strategy is used for Solr indexes. Update: I’ve got some answers, so there’ll be a follow-up with more details.
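To make the partitioning concern concrete, here is a minimal sketch in Python of the two classic ways to split an inverted index across nodes. Nothing here is Riak, Solr, or Cassandra code: the function names, the MD5-based placement, and the sample documents are all assumptions made for illustration. With term-based partitioning, each term of a multi-term query may live on a different partition, so full posting lists have to be fetched and intersected in one place. With document-based partitioning, every partition is asked, but each one intersects locally.

```python
from collections import defaultdict
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def tokenize(text: str) -> set:
    return set(text.lower().split())


def build_term_partitioned(docs: dict, num_partitions: int) -> list:
    """Term-based: a term's complete posting list lives on one partition."""
    partitions = [defaultdict(set) for _ in range(num_partitions)]
    for doc_id, text in docs.items():
        for term in tokenize(text):
            partitions[stable_hash(term) % num_partitions][term].add(doc_id)
    return partitions


def search_term_partitioned(partitions: list, terms: list):
    """AND query: fetch each posting list from its owner, intersect centrally.

    Returns the matching doc ids and the set of partitions contacted.
    """
    touched = set()
    result = None
    for term in terms:
        owner = stable_hash(term) % len(partitions)
        touched.add(owner)
        postings = partitions[owner].get(term, set())
        result = postings if result is None else result & postings
    return (result or set()), touched


def build_doc_partitioned(docs: dict, num_partitions: int) -> list:
    """Document-based: each partition indexes only its own documents."""
    partitions = [defaultdict(set) for _ in range(num_partitions)]
    for doc_id, text in docs.items():
        shard = partitions[stable_hash(doc_id) % num_partitions]
        for term in tokenize(text):
            shard[term].add(doc_id)
    return partitions


def search_doc_partitioned(partitions: list, terms: list) -> set:
    """AND query: every partition intersects locally; results are unioned."""
    hits = set()
    for shard in partitions:
        local = None
        for term in terms:
            postings = shard.get(term, set())
            local = postings if local is None else local & postings
        hits |= local or set()
    return hits


if __name__ == "__main__":
    # Hypothetical sample documents.
    docs = {
        "d1": "riak search is fast",
        "d2": "cassandra search",
        "d3": "riak and cassandra",
    }
    by_term = build_term_partitioned(docs, 4)
    by_doc = build_doc_partitioned(docs, 4)
    print(search_term_partitioned(by_term, ["riak", "search"]))
    print(search_doc_partitioned(by_doc, ["riak", "search"]))
```

Both layouts return the same documents. What differs is where the intersection happens and how much posting-list data has to travel, which is the cost that grows with multi-term queries under term-based partitioning.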
RDBMS data migration: it must be Sqoop
Nothing special here. You have a DataStax Enterprise cluster with some Hadoop nodes defined and you need to process data, but some of it lives in relational databases. Sqoop to the rescue.
Snap-in application log ingestion: Flume or Scribe? No, it’s Log4j
When I read this bullet point my first thought was that this must be Flume. Or maybe Scribe. But most probably Flume. It looks like DataStax went a different route and offers log ingestion using Log4j. It’s true that Log4j or one of its flavors most probably exists in every Java project, but it still feels like an odd choice. On the other hand, there’s a Cassandra plugin for Flume.
OpsCenter Enterprise 2.0
The OpsCenter is the management, monitoring, and control tool for DataStax Enterprise. The new version includes pretty much what you’d expect from an admin/monitoring tool:
- multi-cluster monitoring
- visual backup
- search monitoring
Looking back at the NoSQL administration/monitoring tools I’ve seen lately, I’m pretty sure I’ve identified a trend: they all come in various shades of black.
DataStax OpsCenter Enterprise:


Elastic workload provisioning
I’ve left for the end the feature that interests me most: elastic workload provisioning.
To better understand what this is, I had to go back to DataStax Enterprise 1.x where a node could be either a Cassandra node (OLTP) or a Hadoop node (processing). The new version allows quasi-dynamic node provisioning by changing the mode of a cluster (between Hadoop, Cassandra, Solr) with a stop/start operation. So given a cluster one could adjust its capacity and performance for different workloads (e.g. time-sensitive applications or temporary cluster operations).
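As a rough mental model only, the idea can be sketched in a few lines of Python. This is not the DataStax Enterprise API: the `Mode`, `Node`, `switch_mode`, and `capacity` names are hypothetical, and the sketch assumes the mode is a per-node setting that changes only across a stop/start cycle.

```python
from collections import Counter
from enum import Enum


class Mode(Enum):
    CASSANDRA = "cassandra"  # real-time (OLTP) workload
    HADOOP = "hadoop"        # batch processing workload
    SOLR = "solr"            # search workload


class Node:
    """Hypothetical cluster node that serves exactly one workload at a time."""

    def __init__(self, name: str, mode: Mode):
        self.name = name
        self.mode = mode
        self.running = True

    def switch_mode(self, new_mode: Mode) -> None:
        # The mode changes only across a stop/start cycle, hence
        # "quasi-dynamic" rather than fully dynamic provisioning.
        self.running = False
        self.mode = new_mode
        self.running = True


def capacity(nodes: list) -> Counter:
    """Count the running nodes currently serving each workload."""
    return Counter(node.mode for node in nodes if node.running)


if __name__ == "__main__":
    cluster = [Node(f"node{i}", Mode.CASSANDRA) for i in range(4)]
    cluster += [Node(f"node{i}", Mode.HADOOP) for i in range(4, 6)]
    # Temporarily shift two real-time nodes to batch processing.
    for node in cluster[:2]:
        node.switch_mode(Mode.HADOOP)
    print(capacity(cluster))
```

The same six machines end up split four to two in favour of batch processing, and can be switched back once the temporary job is done.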
Workload management is a feature present in most commercial data warehouse solutions. Even if it is still in its very early days, DataStax Enterprise’s workload provisioning is the first step towards workload management in the NoSQL space.
Monday, 13 February 2012
The components and their functions in the Hadoop ecosystem
Edd Dumbill enumerates the various components of the Hadoop ecosystem:

My quick reference of the Hadoop ecosystem includes a couple of other tools that are not in this list, with the exception of Ambari and HCatalog, which were released later.
Wednesday, 21 December 2011
Project Isotope Will Bring Together Hadoop Toolchain With Microsoft’s Data Products
There’s been a series of events lately that makes me think Microsoft is nowhere near accepting defeat in the cloud services area. As regards Microsoft’s Project Isotope, things are much simpler than the ZDNet article makes them sound[1]: Microsoft is working on integrating Hadoop and its toolchain with their own products (SQL Server Analysis Services, PowerPivot).

A picture worth more than the 626 words.
[1] I bet the details of the integration are fascinating and far from simple, but the article does not focus on those.
Thursday, 1 December 2011
Couchbase Hadoop Connector: Another Sqoop Example
Announced a couple of days ago, the Couchbase Hadoop Connector is just another example of using Sqoop:
The Couchbase Hadoop Connecter utilizes the Sqoop plug-in to stream data from the Couchbase system to Cloudera’s Distribution Including Apache Hadoop (CDH), enabling consistent application performance while also allowing for heavy duty MapReduce processing of the same dataset. In an interactive web application environment, such as an ad targeting platform, this ability ensures low latency and high throughput to make optimized decisions about real-time ad placement.
I’m wondering if this connector has already been used in the AOL Advertising architecture, which is using Hadoop and Membase. If it wasn’t, how would it improve things?[1]
[1] If you know anyone who could speak about this (from Couchbase, Cloudera, or AOL), please contact me.
Apache Sqoop: What, When, Where, How
The other day I posted about Sqoop’s first release under the Apache umbrella, so I thought I’d provide a bit more detail about where Sqoop fits in the picture. I’ve embedded below 3 presentations that answer questions like what Sqoop is, when and where to use it, and how to use it.
Wednesday, 30 November 2011
Apache Sqoop Announces First Release
Sqoop, originally created at Cloudera and now in the Apache Incubator, is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores. You can use Sqoop to import data from external structured datastores into HDFS or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
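For a feel of what the two directions look like in practice, here is a small Python sketch that only assembles Sqoop 1 command lines and does not run them. The flags used (`--connect`, `--table`, `--target-dir`, `--export-dir`, `--num-mappers`, `--username`, `-P`) are standard Sqoop 1 options. The helper function names, the JDBC URL, the table names, and the HDFS paths are made up for illustration.

```python
import shlex


def sqoop_import_args(jdbc_url, table, target_dir, username=None, num_mappers=4):
    """Build an argv list for importing one RDBMS table into HDFS."""
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,         # HDFS directory to write into
        "--num-mappers", str(num_mappers),  # number of parallel map tasks
    ]
    if username:
        args += ["--username", username, "-P"]  # -P prompts for the password
    return args


def sqoop_export_args(jdbc_url, table, export_dir, username=None):
    """Build an argv list for exporting HDFS files into an existing table."""
    args = [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--table", table,
        "--export-dir", export_dir,  # HDFS directory to read from
    ]
    if username:
        args += ["--username", username, "-P"]
    return args


def render(args):
    """Render an argv list as a shell-safe command line for display."""
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Hypothetical connection string, tables, and paths.
    url = "jdbc:mysql://db.example.com/shop"
    print(render(sqoop_import_args(url, "orders", "/data/shop/orders", "etl")))
    print(render(sqoop_export_args(url, "order_stats", "/data/shop/stats", "etl")))
```

Passing the resulting list to `subprocess.run` on a machine with Sqoop and the right JDBC driver installed would start the transfer. Building the list first keeps the sketch testable without a cluster.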
Yesterday Apache Sqoop announced its first release under the Apache umbrella with a long list of changes.
To get a better idea of where Apache Sqoop fits, check this video from Hadoop World 2011 (requires registration), which describes key scenarios driving Hadoop and RDBMS integration and reviews the Apache Sqoop project. Besides supporting data movement between Hadoop and any JDBC database, Sqoop also provides a framework that allows developers and vendors to create connectors optimized for specific targets such as Oracle, Netezza, etc.
Monday, 27 June 2011
Biodiversity Indexing: Offline Processing With Hadoop, Hive, Sqoop, Oozie
The architecture for offline processing of biodiversity data, based on Sqoop, Hadoop, Oozie, and Hive:

And its future:
Following this processing work, we expect to modify our crawling to harvest directly into HBase. The flexibility HBase offers will allow us to grow incrementally the richness of the terms indexed in the Portal, while integrating nicely into Hadoop based workflows. The addition of coprocessors to HBase is of particular interest to further reduce the latency involved in processing, by eliminating batch processing altogether.
Many companies working with large datasets have to deal with multiple systems and duplicate data between the online services and offline processors. While the infrastructure costs are going down, the costs of complexity are not. The HBase + Hadoop and Cassandra + Brisk combos are starting to address this problem.
via: http://www.cloudera.com/blog/2011/06/biodiversity-indexing-migration-from-mysql-to-hadoop/
Friday, 3 June 2011
Experimenting with Hadoop using Cloudera VirtualBox Demo

If you don’t count the download, you’ll get this up and running in 5 minutes tops. At the end you’ll have Hadoop, Sqoop, Pig, Hive, HBase, ZooKeeper, Oozie, Hue, Flume, and Whirr all configured and ready to experiment with.
Making it easy for users to experiment with these tools increases the chances for adoption. Adoption means business.
Monday, 28 February 2011
Cloudera’s Distribution for Apache Hadoop version 3 Beta 4
New version of Cloudera’s Hadoop distro — complete release notes available here:
CDH3 Beta 4 also includes new versions of many components. Highlights include:
- HBase 0.90.1, including much improved stability and operability.
- Hive 0.7.0rc0, including the beginnings of authorization support, support for multiple databases, and many other new features.
- Pig 0.8.0, including many new features like scalar types, custom partitioners, and improved UDF language support.
- Flume 0.9.3, including support for Windows and improved monitoring capabilities.
- Sqoop 1.2, including improvements to usability and Oracle integration.
- Whirr 0.3, including support for starting HBase clusters on popular cloud platforms.
Plus many scalability improvements contributed by Yahoo!.
Cloudera’s CDH is the most popular Hadoop distro bringing together many components of the Hadoop ecosystem. Yahoo remains the main innovator behind Hadoop.
via: http://www.cloudera.com/blog/2011/02/cdh3-beta-4-now-available