Although I said in an earlier post that Impala is an exciting technology, using Impala with modest resources is very problematic. Queries over small amounts of data come back faster, but with large amounts of data the queries simply fail to execute and, frankly, only run in Hive. If you can afford new top-end servers with 256GB of memory (the recommended configuration), Impala will work for you; for those with modest budgets, it can be a real problem. A query that takes a long time to run is tolerable, but a query that does not run at all is a real problem.
The benchmarks claiming that Impala is faster than other SQL-on-Hadoop engines tend to use top-end servers, and even then some of the really large queries in those benchmarks fail to run. I was hoping that Impala would be useful for ad hoc querying on limited amounts of data. Sometimes you just need Hive, yet the performance of Hive queries, particularly on complicated processing flows, is just too slow.
I attended a couple of meetups presented by Cloudera people and listened to talks by Cloudera people, including the project lead, at Strata Hadoop New York 2014. I am impressed with the work that is going on and the direction the project is taking. However, some of the new direction seems to be diverging from Hive compatibility.
One frustration with Cloudera CDH 5.2 is that it does not come with the Tez engine for Hive; support says to install it at your own risk. A second is that the installation of Spark is less than perfect. I am heading toward exploring and testing the Hortonworks implementation, and I hope Hive on Tez is less frustrating.