Impala: All content tagged as Impala in NoSQL databases and polyglot persistence

Cloudera Impala 1.0 Release Notes and A Couple of Questions

This is what I’ve been looking for since posting about Impala 1.0: the release notes. From the new features list:

  • support for ALTER TABLE
  • REFRESH for a single table
  • Hints for specifying particular join strategies
  • Dynamic resource management, allowing high concurrency for Impala queries

Question: if I remember correctly, Impala uses a single process on each machine to execute queries.

  1. Is it multi-threaded?
  2. Does it do any memory/CPU management so that one query cannot completely exhaust either of these resources?
  3. What happens to the queries that are executing when this process fails?

Original title and link: Cloudera Impala 1.0 Release Notes and A Couple of Questions (NoSQL database©myNoSQL)


Cloudera Impala Brings SQL Querying To Hadoop

InformationWeek about today’s Impala 1.0 release:

Impala supports direct querying of data in the Hadoop Distributed File System (HDFS) and HBase (NoSQL database) indexes, and Cloudera claims it’s 3X to 30X faster than Hive. Beta customers report results that are falling into that range. Six3 Systems, for example, a systems integrator serving federal agencies, has seen at least 14X faster querying than Hive, according to analytics developer Wayne Wheeles.

Original title and link: Cloudera Impala Brings SQL Querying To Hadoop (NoSQL database©myNoSQL)

via: http://www.informationweek.com/big-data/news/software/information-management/cloudera-impala-brings-sql-querying-to-h/240153861


Impala 1.0 - That was fast

Cloudera announces Impala 1.0 GA release.

That was fast. I guess this is one of the (little) advantages of having Hortonworks working on Stinger, Pivotal on HAWQ, and Qubole offering Hive, Pig, and Sqoop as-a-Service.

Original title and link: Impala 1.0 - That was fast (NoSQL database©myNoSQL)


Cloudera Pissed Off

Charles Zedlewski responds on Cloudera’s behalf to the recent attacks on Hadoop and Impala:

I’m reminded of our open source strategy this week not only because of the further validation of Hadoop’s popularity but also because of the entry of a new round of proprietary imitators. At one point there were six distinct vendors all promoting proprietary filesystems as alternatives to HDFS, many of which included breathless claims of how they could make Apache Hadoop faster and “more powerful.” This year we get to see history repeat itself, this time with SQL engines. The marketing is nearly identical to that of the proprietary filesystem era: damning open source with faint praise, pointing out its limitations and extolling the virtues of some feature(s) proprietary to that particular vendor.

Proprietary SQL vendors will pull a page from the proprietary storage playbook: damn open source Impala with faint praise and point out its limitations, both real and contrived. They will be equally ineffective. We will continue to bet on an open, integrated, and highly flexible big data platform. Saying you are “all in on Hadoop” while simultaneously promoting a proprietary platform means you are missing the point.

Neither Cloudera nor the other companies that have invested heavily, in some cases everything, in the Hadoop ecosystem are large enough to ignore big corporations attacking their bets. Every corporation is trying to emulate the Microsoft strategy: wait for a new technology to be confirmed, then jump at the opportunity with all your forces. But I really hope open source will prevail.

Original title and link: Cloudera Pissed Off (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/02/open-source-flattery-and-the-platform-for-big-data/


Inside Cloudera Impala: Runtime Code Generation

Nong Li about Cloudera’s Impala implementation:

Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code-generation eliminates and go over in more detail one of the queries in the TPCH workload where code generation improves overall query speeds by close to 3x.

This reminded me of the days I was working on Java AOP frameworks whose implementation was based on bytecode generation for the same purpose of optimization. Everything worked perfectly well as long as the underlying assumptions remained the same.
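To make the idea concrete, here is a minimal sketch of what runtime code generation buys you. It is only an analogy: Impala generates native code with LLVM, while this toy uses Python source generation. The tuple-based expression format, the helper names (interpret, to_source, compile_expr), and the sample rows are all hypothetical and not taken from the Impala code base. The interpreted path re-walks the expression tree and dispatches on node type for every row; the generated path does that work once and yields a function specialized to this one query.

```python
import operator

# Hypothetical mini expression tree (an assumption for illustration):
#   ("col", name) | ("const", value) | (op, left, right)
OPS = {
    "+": operator.add,
    "*": operator.mul,
    "<": operator.lt,
    "and": lambda a, b: a and b,
}


def interpret(expr, row):
    """Evaluate expr for one row by walking the tree on every call."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    # Generic path: look up the operator and recurse into both children.
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Turn the tree into Python source text for this specific query."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def compile_expr(expr):
    """Generate and compile a function specialized to expr, once per query."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


# Hypothetical predicate: l_quantity < 24 and 100.0 < l_extendedprice * l_discount
predicate = (
    "and",
    ("<", ("col", "l_quantity"), ("const", 24)),
    ("<", ("const", 100.0), ("*", ("col", "l_extendedprice"), ("col", "l_discount"))),
)

rows = [
    {"l_quantity": 10, "l_extendedprice": 5000.0, "l_discount": 0.05},
    {"l_quantity": 30, "l_extendedprice": 5000.0, "l_discount": 0.05},
]

fast = compile_expr(predicate)
matches = [row for row in rows if fast(row)]
```

The generated function bakes in exactly the assumptions that held when it was built (column names, operators, constants), which is also why this kind of optimization works well only as long as those assumptions stay the same.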

Original title and link: Inside Cloudera Impala: Runtime Code Generation (NoSQL database©myNoSQL)

via: http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/


Hadoop in 2013: What Hortonworks Will Focus On

Shaun Connolly summarizes a recent webinar about where Hortonworks’ work on Hadoop will focus in 2013:

[…] Interactive Query, Business Continuity (DR, Snapshots, etc.), Secure Access, as well as ongoing investments in Data Integration, Management (i.e. Ambari), and Online Data (i.e. HBase).
[…] Rather than abandon the Apache Hive community, Hortonworks is focused on working in the community to optimize Hive’s ability to serve big data exploration and interactive query in support of important BI use cases. Moreover, we are focused on enabling Hive to take advantage of YARN in Apache Hadoop 2.0, which will help ensure fast query workloads don’t compete for resources with the other jobs running in the cluster. Enabling Hadoop to predictably support enterprise workloads that span Batch, Interactive, and Online use cases is an important area of focus for us.

Basically this says that Hortonworks sees YARN and Hive as the answer to online or real-time interactive querying of Hadoop data. Cloudera’s take on this is different.

Original title and link: Hadoop in 2013: What Hortonworks Will Focus On (NoSQL database©myNoSQL)

via: http://hortonworks.com/blog/the-road-ahead-for-hortonworks-and-hadoop/


Overview of Dremel-Like Solutions: Moving Beyond Hadoop for Big Data Needs

Until I learn more about the recently announced Cloudera Impala and Druid from Metamarkets, this article by Jaikumar Vijayan should offer, despite some inherent mistakes [1], a good overview of the solutions aiming to offer alternatives to the batch-processing nature of Hadoop:

  • Google Dremel (BigQuery)
  • Cloudera Impala
  • Metamarkets Druid
  • Nodeable StreamReduce
  • SAP HANA integrated with Hadoop, etc.

  1. Just an example: “If you can stand latencies of a few seconds, Hadoop is fine. But Hadoop MapReduce is never going to be useful for sub-second latencies”. Then “The technology [i.e. Google Dremel] can run queries over trillion-row data tables in seconds…”

    Maybe just one more: consider the title “Moving beyond Hadoop” and then the quote from Google’s Ju-kay Kwek: “Google uses Dremel in conjunction with MapReduce. […] Hadoop and Dremel are distributed computing technologies, but each was built to address very different problems.”

Original title and link: Overview of Dremel-Like Solutions: Moving Beyond Hadoop for Big Data Needs (NoSQL database©myNoSQL)

via: http://www.infoworld.com/print/205879