Cloudera Impala: an Exciting Technology

Cloudera Impala is an exciting Apache Hadoop-based technology for running fast SQL queries on big data. Historically, people have used Apache Hive, part of the Hadoop tool set, but queries over substantial data sets can take a long time to run. Hive turns each query into MapReduce jobs and runs them on Hadoop. Impala is a massively parallel query engine: it chops the data and the query into chunks and spreads the work across the cluster, which lets queries run dramatically faster. A query can complete in seconds where it might take an hour in Hive. Essentially, Impala offers a subset of the functionality that Hive provides. For instance, it does not support the map, list, set, or JSON data types, so things you might do in Hive with the SerDe functionality may not be possible in Impala. Some of Hive's data transformation features are also unsupported, and some DML functionality, such as update and delete, is missing. You can connect to Impala with a HiveServer2 driver pointed at the Impala-specific port, using ODBC, JDBC, and similar tools.
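To make the connection step concrete, here is a minimal Python sketch. It assumes an unsecured cluster, Impala's default HiveServer2-protocol port of 21050, and, for the query helper, the third-party impyla package. The host name and the helper names impala_jdbc_url and run_query are made up for illustration; adjust everything for your own setup.

```python
IMPALA_HS2_PORT = 21050  # Impala's default port for the HiveServer2 protocol


def impala_jdbc_url(host, port=IMPALA_HS2_PORT, database="default"):
    """Build a HiveServer2-style JDBC URL aimed at an Impala daemon.

    The ";auth=noSasl" suffix is an assumption: it fits a cluster
    without Kerberos or LDAP authentication.
    """
    return f"jdbc:hive2://{host}:{port}/{database};auth=noSasl"


def run_query(host, sql, port=IMPALA_HS2_PORT):
    """Run one statement against Impala and return all rows.

    Requires the third-party impyla package (an assumption, not something
    the article names); it is imported lazily so the URL helper above
    works without it.
    """
    from impala.dbapi import connect

    conn = connect(host=host, port=port)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical host name, shown only to illustrate the URL shape.
    print(impala_jdbc_url("impala-host.example.com"))
```

The same URL shape is what you would hand to a JDBC tool configured with the Hive driver; only the port differs from a regular HiveServer2 connection.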

Impala prefers the Parquet storage format, a column-oriented, compressed binary format, though it can also create and insert into text-format tables. It can query Avro, RCFile, and SequenceFile tables as well, but cannot insert into them. One particular issue when working with Impala alongside Hive on Parquet-format tables is that timestamp and decimal columns are not supported in earlier Hive versions; support is due in Hive 0.14, which is being tested at present.
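As a rough illustration of the format rules above, the following Python sketch records which formats the text describes as insertable and generates the two statements one might use to copy a text table into Parquet. The table names and helper names are hypothetical, and the exact keyword spelling in the CREATE TABLE statement may vary with your Impala version, so treat this as a starting point rather than a recipe.

```python
# Formats Impala can both query and insert into, per the description above.
INSERTABLE_FORMATS = {"PARQUET", "TEXTFILE"}
# Formats Impala can query but not insert into, per the description above.
QUERY_ONLY_FORMATS = {"AVRO", "RCFILE", "SEQUENCEFILE"}


def can_insert(file_format):
    """Return True if Impala is described as able to insert into this format."""
    return file_format.upper() in INSERTABLE_FORMATS


def parquet_copy_statements(source_table, target_table):
    """Return SQL that clones a table's schema as Parquet and copies the rows.

    Table names are interpolated as-is, so only pass trusted identifiers.
    """
    return [
        f"CREATE TABLE {target_table} LIKE {source_table} STORED AS PARQUET",
        f"INSERT INTO {target_table} SELECT * FROM {source_table}",
    ]


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    for statement in parquet_copy_statements("logs_text", "logs_parquet"):
        print(statement + ";")
```

The statements can be pasted into an Impala shell or sent through any HiveServer2 client connected to the Impala port.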

The big data SQL field has been changing recently, though, with Hive on Tez (which Hive 0.13 will support), Spark SQL, and Facebook's Presto engine.