


hadoop: All content tagged as hadoop in NoSQL databases and polyglot persistence

Complex data manipulation in Cascalog, Pig, and Hive

Bruno Bonacci brings up some very good points about why using a single, coherent solution to manipulate data results in higher productivity, by comparing Cascalog with what Pig and Hive require:

In languages like Pig and Hive, in order to do complex manipulation of your data you have to write User Defined Functions (UDFs). UDFs are a great way to extend the basic functionality, but for Hive and Pig you have to use a different language to write them, as the basic SQL or Pig Latin languages have only a handful of functions and lack basic control structures. Both offer the possibility of writing UDFs in a number of different languages (which is great); however, this requires a programming-paradigm switch by the developer. Pig allows you to write UDFs in Java, Jython, JavaScript, Groovy, Ruby and Python; for Hive you need to write them in Java (good article here).

I won't use Java UDFs as the example, as the comparison wouldn't be fair (life is too short to write them in Java), but let's assume that you want to write a UDF for Pig and you want to use Python. If you go for the JVM platform version (Jython) you won't be able to use existing modules from the Python ecosystem (unless they are pure Python). The same goes for Ruby and JavaScript. If you decide to use CPython instead, you take on the setup burden of installing Python and all the modules that you intend to use on every Hadoop task node.

So: you start with a language such as Pig Latin or SQL; you have to write, compile and bundle UDFs in a different language; you are constrained to use only the plain language without importing modules, or face the extra burden of additional setup; and, as if that weren't enough, you have to smooth over the type differences between the two languages as data flows back and forth to the UDF. For me that's enough to say that we can do better than that. Cascalog is a Clojure DSL, so your main language is Clojure, your custom functions are Clojure, the data is represented in Clojure data types, and the runtime is the JVM: no switch required, no additional compilation required, no installation burden, and you can use all the available libraries in the JVM ecosystem.
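To make the paradigm switch concrete, here is a minimal sketch of a Pig UDF written in Java against Pig's EvalFunc API; the class name and the null handling are illustrative, not from the original post:

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // A trivial Pig UDF that upper-cases its first argument.
    // It illustrates the friction described above: Pig's Tuple type on
    // the way in, a Java String on the way out, and a separate
    // compile/jar/REGISTER step before Pig Latin can call it.
    public class ToUpper extends EvalFunc<String> {
        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return null;  // Pig treats a null return as a null field
            }
            return input.get(0).toString().toUpperCase();
        }
    }

A Pig Latin script then has to REGISTER the jar and invoke the function by its Java class name; the Cascalog equivalent is an ordinary Clojure function defined right next to the query that uses it.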

I’m not a big fan of SQL, except in the cases where it really belongs; SQL-on-Hadoop is my least favorite topic, second probably only to the whole complexity of the ecosystem. In the space of multi-format/unstructured data I’ve always liked the pragmatism and legibility of Pig. But the OP is definitely right about the added complexity.

This also reminded me of the Python vs R “war”.

Original title and link: Complex data manipulation in Cascalog, Pig, and Hive (NoSQL database©myNoSQL)


Cascading components for Big Data applications

Jules S. Damji in a quick intro to Cascading:

At the core of most data-driven applications is a data pipeline through which data flows, originating from Taps and Sources (ingestion) and ending in a Sink (retention) while undergoing transformation along a pipeline (Pipes, Traps, and Flows). And should something fail, a Trap (exception) must handle it. In the big data parlance, these are aspects of ETL operations.

You have to agree that, compared with the MapReduce model, these components could bring a lot of readability to your code. On the other hand, at first glance the Cascading API still feels verbose.
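To give a feel for both points, below is a minimal sketch of a Cascading flow, assuming the Cascading 2.x Hadoop planner; the HDFS paths and the pipe name are hypothetical. It wires a source Tap through a filtering Pipe into a sink Tap, the same pipeline shape described in the quote:

    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextLine;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class ErrorLines {
        public static void main(String[] args) {
            // Source tap (ingestion): read raw text lines from HDFS.
            Tap source = new Hfs(new TextLine(new Fields("line")), "hdfs:/logs/raw");
            // Sink tap (retention): write the filtered result back out.
            Tap sink = new Hfs(new TextLine(new Fields("line")), "hdfs:/logs/errors");

            // Pipe assembly: keep only the lines that contain "ERROR".
            Pipe pipe = new Pipe("errors");
            pipe = new Each(pipe, new Fields("line"), new RegexFilter(".*ERROR.*"));

            // Flow: connect source -> pipe -> sink and run it as MapReduce jobs.
            FlowDef flowDef = FlowDef.flowDef()
                .addSource(pipe, source)
                .addTailSink(pipe, sink);
            new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
        }
    }

Even this trivial filter needs taps, schemes, fields, and a flow connector, which supports both halves of the observation: the components name the moving parts clearly, but the API makes you spell all of them out.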

Original title and link: Cascading components for a Big Data applications (NoSQL database©myNoSQL)


Cloudera, Hadoop, Data warehouses and SLR camera

Amr Awadallah in an interview with Dan Woods for Forbes:

Our advantage is that we can encompass more data and run more workloads with less friction than any other platform. The analogy I use most often is the difference between the SLR camera and the camera on your smart phone. Almost everyone takes more pictures on their smart phone than on their SLR.

The SLR camera is like the enterprise data warehouse. The SLR camera is really, really good at taking pictures, in the same sense that an enterprise data warehouse is really, really good at running queries. But that’s the only thing it does. The data it picks is only exposed to that workload. The system we provide, the enterprise data hub, is more like the smartphone. It can take decent pictures—they won’t be as good as the SLR camera, and in this I’m referring to the Impala system. So Impala will run queries. The queries won’t run at the same interactive OLAP speeds that you get from a high-end data warehouse. However, for many use cases, that performance might be good enough, given that the cost is 10 times lower.

I’ve linked in the past to Ben Thompson’s visualizations of the innovator’s dilemma:

[Image: Ben Thompson’s visualization of the innovator’s dilemma]

The explanation goes like this: incumbents’ products usually over-serve consumer needs, leaving room for new entrants’ good-enough, lower-priced products.

Original title and link: Cloudera, Hadoop, Data warehouses and SLR camera (NoSQL database©myNoSQL)


The beauty and challenge of Hadoop

Chad Carson describes in a short but persuasive way how Hadoop gets inside companies and the first challenges that follow:

We hear stories like this all the time, though sometimes the urgent email turns out to be from the CEO! These scenarios follow a common pattern in Hadoop adoption: Hadoop is such a flexible, scalable system that it’s easy for an engineer to quickly grab data that could never before be combined in one place, write some jobs, and get interesting results. Sometimes the results are so interesting that other teams start using them, and all of a sudden the company’s business depends on something that started as an experiment.

Original title and link: The beauty and challenge of Hadoop (NoSQL database©myNoSQL)


Docker, Hadoop and YARN

Jack Clark (The Register) covers the work done to integrate Docker with Hadoop:

“Where Docker makes perfect sense for YARN is that we can use Docker Images to fully describe the entire unix filesystem image for any YARN container,” explained Arun Murthy, a founder and architect at Hortonworks, to El Reg in an email.

Original title and link: Docker, Hadoop and YARN (NoSQL database©myNoSQL)


The essence of Pig

I love this line from Wes Floyd’s slidedeck:

“Essence of Pig: Map-Reduce is too low a level, SQL too high”

Original title and link: The essence of Pig (NoSQL database©myNoSQL)

A retrospective of two years of Big Data with Andrew Brust

Andrew Brust on his way out from ZDNet to GigaOm Research:

As much as I chide the Hadoop world for having started out artificially siloed and aloof, it did the industry a great service: it took the mostly-ossified world of databases, data warehouses and BI and made it dynamic again.

Suddenly, the incumbent players had to respond, add value to their products, and innovate rapidly. It’s hard to imagine that having happened without Hadoop.

Original title and link: A retrospective of two years of Big Data with Andrew Brust (NoSQL database©myNoSQL)


Hadoop distro for IBM's Mainframe

IBM and its partner Veristorm are working to merge the worlds of big data and Big Iron with zDoop, a new offering unveiled last week that offers Apache Hadoop running in the mainframe’s Linux environment.

Three hip-hip-hoorays for Hadoop on mainframes.

Original title and link: Hadoop distro for IBM’s Mainframe (NoSQL database©myNoSQL)


Hadoop and big data: Where Apache Slider slots in and why it matters

Arun Murthy for ZDNet about Apache Slider:

Slider is a framework that allows you to bridge existing always-on services and makes sure they work really well on top of YARN without having to modify the application itself. That’s really important.

Right now it’s HBase and Accumulo but it could be Cassandra, it could be MongoDB, it could be anything in the world. That’s the key part.

I couldn’t find the project on the Incubator page.

Original title and link: Hadoop and big data: Where Apache Slider slots in and why it matters (NoSQL database©myNoSQL)


Price Comparison for Big Data Appliance and Hadoop

The main differences between Oracle Big Data Appliance and a DIY approach are:

  1. A DIY system - at list price with basic installation but no optimization - is a staggering $220 cheaper as an initial purchase
  2. A DIY system - at list price with basic installation but no optimization - is almost $250,000 more expensive over 3 years.
  3. The support for the DIY system includes five (5) vendors: your hardware support vendor, the OS vendor, your Hadoop vendor, your encryption vendor, as well as your database vendor. Oracle Big Data Appliance is supported end-to-end by a single vendor: Oracle
  4. Time to value. While we trust that your IT staff will get the DIY system up and running, the Oracle system allows for a much faster “loading dock to loading data” time. Typically a few days instead of a few weeks (or even months)
  5. Oracle Big Data Appliance is tuned and configured to take advantage of the software stack, the CPUs and InfiniBand network it runs on
  6. Any issue we, you or any other BDA customer finds in the system is fixed for all customers. You do not have a unique configuration, with unique issues on top of the generic issues.

This is coming from Oracle. Now, without nitpicking the prices (I’m pretty sure you’ll find better numbers for the different components), how do you sell Hadoop to a potential customer who has taken a look at this?

Original title and link: Price Comparison for Big Data Appliance and Hadoop (NoSQL database©myNoSQL)


Hadoop analytics startup Karmasphere sells itself to FICO

Derrick Harris (GigaOm):

The Fair Isaac Corporation, better known as FICO, has acquired the intellectual property of Hadoop startup Karmasphere. Karmasphere launched in 2010, and was one of the first companies to push the idea of an easy, visual interface for analyzing Hadoop data, and even analyzing it using traditional SQL queries.

Original title and link: Hadoop analytics startup Karmasphere sells itself to FICO (NoSQL database©myNoSQL)


Hortonworks: the Red Hat of Hadoop

However, John Furrier, founder of SiliconANGLE, posits that Hortonworks, with their similar DNA being applied in the data world, is, in fact, the Red Hat of Hadoop. “The discipline required,” he says, “really is a long game.”

It looks like Hortonworks’s positioning has been successful in that they are now perceived as the true (and only) open sourcerers.

Original title and link: Hortonworks: the Red Hat of Hadoop (NoSQL database©myNoSQL)