Cloudera Impala, an Exciting Technology

Cloudera Impala is an exciting Apache Hadoop based technology for running fast SQL queries on big data. Historically people have used Apache Hive, part of the Hadoop tool set, but Hive queries on substantial data sets can take a long time to run, because Hive turns each query into MapReduce jobs and runs them on Hadoop. Impala is a massively parallel query engine: it chops the data and the query into chunks and spreads them over the cluster, which lets it run queries dramatically faster. A query can complete in seconds where it might take an hour in Hive. Essentially, Impala offers a subset of the functionality that Hive provides. For instance, it does not support the map, list, set, or JSON data types, so some things you might do in Hive with the SerDe functionality may not be possible in Impala. Some of Hive's data transformation features are also unsupported, and some DML functionality, such as update and delete, is missing. You can connect to Impala with a HiveServer2 driver pointed at the Impala-specific port, using ODBC, JDBC, and similar tools.
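To make the "chop it up and spread it over the cluster" idea concrete, here is a minimal sketch in Python. It is not how Impala is implemented; it only imitates the general scatter and gather pattern of a parallel query engine on one machine, using threads in place of cluster nodes. The sample rows, the function names, and the choice of three "nodes" are all made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample rows: (country, bytes_served). Not real data.
ROWS = [
    ("us", 120), ("de", 80), ("us", 40), ("fr", 10),
    ("de", 20), ("us", 5), ("fr", 90), ("de", 1),
]


def split_into_chunks(rows, n_chunks):
    """Deal rows round-robin into n_chunks lists, one per simulated node."""
    return [rows[i::n_chunks] for i in range(n_chunks)]


def partial_sum(chunk):
    """Work done on one node: SUM(bytes_served) GROUP BY country, chunk only."""
    totals = Counter()
    for country, nbytes in chunk:
        totals[country] += nbytes
    return totals


def parallel_group_sum(rows, n_nodes=3):
    """Scatter chunks to workers, then merge the partial results."""
    chunks = split_into_chunks(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(partial_sum, chunks))
    merged = Counter()
    for part in partials:
        merged.update(part)  # Counter.update adds counts together
    return dict(merged)


result = parallel_group_sum(ROWS)
print(result)
```

Each worker only ever sees its own chunk, and the final answer is the same no matter how many chunks the data is split into, which is what lets this kind of query scale out.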

Impala prefers the Parquet storage format, a column-oriented, compressed binary format, though it can also create and insert into text format tables. It can query Avro, RCFile, and SequenceFile tables as well, but can't insert into them. One particular issue when working with Impala alongside Hive on Parquet tables is that timestamp and decimal columns are not supported in earlier Hive versions; support is planned for Hive 0.14, which is being tested at present.
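As a rough illustration of why a column-oriented format suits analytic queries, the sketch below pivots a few rows into columns in plain Python. It does not read or write real Parquet files, and Parquet's actual encodings are more elaborate than the simple run-length encoding shown here. The table contents and function names are hypothetical.

```python
# Hypothetical table with three columns; the values are made up.
rows = [
    {"id": 1, "country": "us", "bytes": 120},
    {"id": 2, "country": "us", "bytes": 80},
    {"id": 3, "country": "de", "bytes": 40},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An aggregate over one column only has to touch that column's values.
total_bytes = sum(columns["bytes"])

# Values within a column are similar, so they tend to compress well.
encoded_country = run_length_encode(columns["country"])

print(total_bytes, encoded_country)
```

A query like a sum over one field reads a single list here instead of every field of every row, and storing like values together is what makes the compression effective.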

The big data SQL field has been changing recently, though, with Hive on Tez (which Hive 0.13 will support), Spark SQL, and Facebook's Presto engine.


Redis, the More Powerful NoSQL Alternative to Memcached

The NoSQL paradigm has proven to be an interesting alternative to RDBMS systems, particularly where you need something more flexible than a system requiring a very complicated schema to account for complex aspects of an application, especially aspects that change quickly. Document-oriented systems such as MongoDB fit this type of problem well. MongoDB supports a highly scalable sharding system. It can become complicated when sharding is involved, but so can other systems. If you want to store complicated model data, MongoDB is a good choice; it offers good performance and features. Memcached is a key-value store with a following for caching website content and MySQL queries. It is a purely memory-based store. MongoDB, on the other hand, is a disk-based store that memory-maps its files, so disk storage and disk I/O are always involved.

There is an interesting alternative to Memcached that offers more than a get/set model while still solving the problem Memcached is meant to solve. This is Redis, a key-value store that features lists, sets, and hashes in addition to simple key-value operations. Even in the plain key space, Redis adds features such as in-place string manipulation. It provides atomic operations on the new data types, protecting the integrity of the data they contain. Memcached is a good product for what it is, a cache system and key-value store, but it is limited in the semantics it supports for more complicated uses, and the developer winds up writing more complicated code to deal with those limitations.
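Here is a small sketch of what those data types look like from client code. It is written in Python and assumes a client object that behaves like `redis.Redis` from the redis-py library; the function name, key names, and the page view scenario are hypothetical, chosen only to touch each data type once.

```python
def record_visit(client, user_id, page):
    """Record one page view using several Redis data types.

    `client` is assumed to behave like redis-py's `redis.Redis`.
    All key names below are made up for this example.
    """
    history_key = f"user:{user_id}:history"

    # Plain key: INCR is atomic, so concurrent clients never lose a count.
    total = client.incr("stats:page_views")

    # List: push the page onto the user's history, newest first,
    # then trim the list so only the 10 most recent entries remain.
    client.lpush(history_key, page)
    client.ltrim(history_key, 0, 9)

    # Set: members are unique, so a repeat visitor is stored only once.
    client.sadd(f"page:{page}:visitors", user_id)

    # Hash: update a single field without rewriting the whole record.
    client.hset(f"user:{user_id}", "last_page", page)

    # In-place string manipulation: APPEND extends the stored value.
    client.append("log:visits", f"{user_id}:{page};")

    return total


# Usage, assuming redis-py is installed and a server runs on localhost:
#   import redis
#   record_visit(redis.Redis(host="localhost", port=6379), 42, "home")
```

With a plain get/set store, the history list and the visitor set would each need a read, a modification in application code, and a write back, with some locking scheme to keep concurrent clients from overwriting each other. Here each step is a single server-side command.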

Redis is also fast, comparable in speed to Memcached. One writer published a benchmark comparison that showed Redis slightly slower than Memcached, though Salvatore Sanfilippo, the lead developer of Redis, found the benchmark problematic, in part because it ran in a tight loop that didn't show how each system behaved with many clients. A rewritten apples-to-apples benchmark showed Redis performing slightly better, particularly with more clients. Even so, the comparison still feels like apples to oranges in a way, since it doesn't compare Redis to a product with comparable features.

Redis is a newer product, though the developers focus on a high-quality product with fast performance. The newest version is in release candidate status, with new features to manage multiple keys in one command, and clustering and sharding features are planned for the future. Personally I am not sure of the value of these, but I am sure there are those who can use them. One challenge is the clients. There are clients for many languages, even esoteric ones. There are a number of PHP clients. The most mature, however, only works with PHP 5.3 and requires namespaces, a feature that has value, but it won't work if you are supporting 5.2 servers. There is a C-based PHP module, but it is very new and not in the PECL collection yet. One I like is Rediska, written by a Russian group, which is well designed, performs well, and integrates very well with Zend Framework.