Impala: All content tagged as Impala in NoSQL databases and polyglot persistence
Tuesday, 30 April 2013
Cloudera Impala 1.0 Release Notes and A Couple of Questions
This is what I’ve been looking for since posting about Impala 1.0: the release notes. From the new features list:
- support for ALTER TABLE
- REFRESH for a single table
- Hints for specifying particular join strategies
- Dynamic resource management, allowing high concurrency for Impala queries
Question: if I remember correctly, Impala uses a single process on each machine to execute queries:
- is it multi-threaded?
- does it do any memory/CPU management so one query is not completely exhausting any of these resources?
- what happens with the queries executing when this process fails?
Original title and link: Cloudera Impala 1.0 Release Notes and A Couple of Questions (©myNoSQL)
Cloudera Impala Brings SQL Querying To Hadoop
InformationWeek about today’s Impala 1.0 release:
Impala supports direct querying of data in the Hadoop Distributed File System (HDFS) and HBase (NoSQL database) indexes, and Cloudera claims it’s 3X to 30X faster than Hive. Beta customers report results that are falling into that range. Six3 Systems, for example, a systems integrator serving federal agencies, has seen at least 14X faster querying than Hive, according to analytics developer Wayne Wheeles.
Impala 1.0 - That was fast
Cloudera announces Impala 1.0 GA release.
That was fast. I guess this is one of the (little) advantages of having Hortonworks working on Stinger, Pivotal on HAWQ, and Qubole offering Hive, Pig, and Sqoop as a service.
Tuesday, 26 February 2013
Cloudera Pissed Off
Charles Zedlewski states Cloudera’s position on the recent attacks on Hadoop and Impala:
I’m reminded of our open source strategy this week not only because of the further validation of Hadoop’s popularity but also because of the entry of a new round of proprietary imitators. At one point there were six distinct vendors all promoting proprietary filesystems as alternatives to HDFS, many of which included breathless claims of how they could make Apache Hadoop faster and “more powerful.” This year we get to see history repeat itself, this time with SQL engines. The marketing is nearly identical to that of the proprietary filesystem era: damning open source with faint praise, pointing out its limitations and extolling the virtues of some feature(s) proprietary to that particular vendor.
Proprietary SQL vendors will pull a page from the proprietary storage playbook: damn open source Impala with faint praise and point out its limitations, both real and contrived. They will be equally ineffective. We will continue to bet on an open, integrated, and highly flexible big data platform. Saying you are “all in on Hadoop” while simultaneously promoting a proprietary platform means you are missing the point.
Neither Cloudera nor the other companies that have invested heavily, or everything, in the Hadoop ecosystem are large enough to ignore big corporations attacking their bets. Every corporation is trying to emulate the Microsoft strategy: wait for a new technology to be confirmed, then jump at the opportunity with all its forces. But I really hope open source will prevail.
via: http://blog.cloudera.com/blog/2013/02/open-source-flattery-and-the-platform-for-big-data/
Tuesday, 19 February 2013
Inside Cloudera Impala: Runtime Code Generation
Nong Li about Cloudera’s Impala implementation:
Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code-generation eliminates and go over in more detail one of the queries in the TPCH workload where code generation improves overall query speeds by close to 3x.
This reminded me of the days I was working on Java AOP frameworks whose implementation was based on bytecode generation for the same purpose of optimization. Everything worked perfectly well as long as the underlying assumptions remained the same.
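To make the idea concrete, here is a minimal sketch of the general technique, not of Impala's actual implementation. It contrasts walking an expression tree for every row with generating and compiling a function specialized for one expression, once per query. The tuple-based expression format and the names `interpret`, `to_source`, and `compile_expr` are all made up for illustration.

```python
import operator

# Hypothetical expression tree for the predicate: (price * qty) > 100
EXPR = (">", ("*", ("col", "price"), ("col", "qty")), ("const", 100))

OPS = {">": operator.gt, "*": operator.mul, "+": operator.add}


def interpret(node, row):
    """Walk the tree for every row: flexible, but pays dispatch cost each time."""
    kind = node[0]
    if kind == "col":
        return row[node[1]]
    if kind == "const":
        return node[1]
    return OPS[kind](interpret(node[1], row), interpret(node[2], row))


def to_source(node):
    """Turn the tree into a Python expression string, once per query."""
    kind = node[0]
    if kind == "col":
        return f"row[{node[1]!r}]"
    if kind == "const":
        return repr(node[1])
    return f"({to_source(node[1])} {kind} {to_source(node[2])})"


def compile_expr(node):
    """Generate and compile a function specialized for this one expression."""
    src = f"def generated(row):\n    return {to_source(node)}\n"
    namespace = {}
    exec(src, namespace)  # the tree is built by us, never from untrusted input
    return namespace["generated"]


rows = [{"price": 30, "qty": 2}, {"price": 60, "qty": 2}]
generated = compile_expr(EXPR)

# Both paths must agree; the generated one skips the per-row tree walk.
assert [interpret(EXPR, r) for r in rows] == [generated(r) for r in rows]
```

The generated function is only valid while its assumptions hold (same columns, same expression), which is exactly the caveat above: change the assumptions and the code has to be regenerated.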
via: http://blog.cloudera.com/blog/2013/02/inside-cloudera-impala-runtime-code-generation/
Wednesday, 30 January 2013
Hadoop in 2013: What Hortonworks Will Focus On
Shaun Connolly summarizing a recent webinar about where Hortonworks’ work on Hadoop will focus in 2013:
[…] Interactive Query, Business Continuity (DR, Snapshots, etc.), Secure Access, as well as ongoing investments in Data Integration, Management (i.e. Ambari), and Online Data (i.e. HBase).
[…] Rather than abandon the Apache Hive community, Hortonworks is focused on working in the community to optimize Hive’s ability to serve big data exploration and interactive query in support of important BI use cases. Moreover, we are focused on enabling Hive to take advantage of YARN in Apache Hadoop 2.0, which will help ensure fast query workloads don’t compete for resources with the other jobs running in the cluster. Enabling Hadoop to predictably support enterprise workloads that span Batch, Interactive, and Online use cases is an important area of focus for us.
Basically this says that Hortonworks sees YARN and Hive as the answer to online or real-time interactive querying of Hadoop data. Cloudera’s take on this is different.
via: http://hortonworks.com/blog/the-road-ahead-for-hortonworks-and-hadoop/
Monday, 29 October 2012
Overview of Dremel-Like Solutions: Moving Beyond Hadoop for Big Data Needs
Until I learn more about the recently announced Cloudera Impala and Druid from Metamarkets, this article by Jaikumar Vijayan should offer (with some inherent mistakes [1]) a good overview of the solutions aiming to offer alternatives to the batch-processing nature of Hadoop:
- Google Dremel (BigQuery)
- Cloudera Impala
- Metamarkets Druid
- Nodeable StreamReduce
- SAP HANA integrated with Hadoop, etc.
[1] Just an example: “If you can stand latencies of a few seconds, Hadoop is fine. But Hadoop MapReduce is never going to be useful for sub-second latencies”. Then: “The technology [n.b. Google Dremel] can run queries over trillion-row data tables in seconds…”
Maybe just one more: consider the title “Moving beyond Hadoop” and then the quote from Google’s Ju-kay Kwek: “Google uses Dremel in conjunction with MapReduce. […] Hadoop and Dremel are distributed computing technologies, but each was built to address very different problems.”