Cloudera Impala an Exciting Technology

cloudera-impalaCloudera Impala is an exciting Apache Hadoop based technology for doing fast sql queries on big data. Historically people have used Apache Hive, part of the hadoop tool set, but queries running on substantial data sets can take a long time to run. Hive turns the queries into map-reduce jobs and runs them on hadoop. Impala is a massively parallel query engine, chops up the data and the query facets into chunks and splits it up over the cluster which can run queries dramatically faster. A query can complete in seconds, where it might take an hour in hive. Essentially it is a subset of the functionality that hive provides. Impala does not support the map, list and set or json data types for instance, which one might use with the serde functionality, you might not be able to do with Impala. Some of the data transformation aspects of hive also aren’t supported. Some of the dml functionality update, delete are missing. You can connect to it with a Hive Server 2 driver to the impala specific port, using odbc, jdbc, and similar tools.

Impala prefers the Parquet storage format, which is a column oriented compressed binary format, though it can also create and insert in text formats. It can also query with Avro, RCFile, and SequenceFile, but can’t insert into. One particular issue working with Impala along side with Hive, using Parquet format tables, using timestamp columns or decimal fields is not supported in hive earlier, but will be provided in Hive 0.14 which is being tested at present.

Although the big data sql field has been changing recently with hive on Tez which Hive 0.13 will support, spark SQL and facebook’s Presto engine


Cassandra challenges

CassandraI have been recently working on doing research and development using Apache Cassandra (DataStax). Cassandra is an amazing piece of software. I went to a modeling class that a DataStax engineer ran that was quite impressive. He essentially said if you follow our advice it will work well, otherwise it might suck. I was struck by the need to ignore a lot of what we know about using relational databases, which I think can become a problem for some because the cql language makes you think that it is a relational database. when one works with it, one needs to build a model that both works well in Cassandra storage terms, but also in terms of your application. You can’t join, and entries based on the hash from the primary or cluster key might be scattered across your cluster. There are few functions to use, you really need to rethink how you architect and design your application.

Cassandra has also changed a lot from the earlier incantations to the new 2.1 version. The early versions used this thrift based api and the CQL language was introduced and enhanced and thrift is now essentially deprecated. There is a lot of drivers and solutions that have been built up using the old thrift based api, which going forward will not be usable. Several design ideas, for instance dynamic column families where you might have entries with the same column family or table having very different schemas, worked well in thrift, but will not in CQL. When researching compatible drivers, one should look for those implemented using CQL not thrift.

Loading large amounts of data into Cassandra is more difficult. It’s not like Mysql or Oracle where you can quickly load from text file or sql file, or a loader file. You essentially have two options. First write code that inserts into Cassandra using CQL through a driver, with this you might improve performance using async inserts and updates. Your other option may be building a sstable writer tool that rights into a sstable, essentially what Cassandra uses internally for storage, and streaming it into cassandra using sstableloader or the jmxloader. With this you are writing in java territory, fortunately there is a cql based sstable writer class you can use.

New York Times being Hacked, Implications

We heard this week about The New York Times being hacked by the Chinese government in retaliation for articles they have written. These remind me that we live  in unsafe times. Waiting for the government to make us safe from the outside world is silly. We need to be more thoughtful about security issues. We need to see that our applications are properly secured, and our networks are secured. Many companies use software like Java, or open source applications like Joomla, WordPress, PhpMyAdmin, Drupal, osCommerce, Zend Cart, X Cart, Openx Adserver, which if not updated frequently and properly secured can allow hackers to exploit or corrupt systems.

Antifragile and implications for software

AntifragileI just finished reading Nassim Nicholas Taleb’s book Antifragile: Things That Gain from Disorder. This is a fascinating book, particularly for those interested in statistics and critical thinking and better understanding the world we currently deal with. He is clearly very bright, and makes a lot of good points. However I don’t agree with many of his arguments. I would not like us to go back to MS-DOS and Windows 3.1 or Java 1.3, just because they are old. I tend to take a more careful thinking and evaluation before moving forward toward new technology. For example products like NodeJs. I don’t want to replace nginx or apache webservers with javascript code running in NodeJs, I think NodeJs, the community and a lot of the libraries are far to immature, like things were with Java 1.2 or Microsoft’s first C++ compiler, it takes more time before things develop. I Like Mongodb, but decided to wait before building applications depending on it. Taleb talks about Black Swan events, which remind me of Hurricane Sandy and the damage it left, and several of the Amazon Aws outages. I think companies need to be careful of putting all their operations with one provider or getting too tightly coupled to platforms that, can have outage events, or problems with availability and developers need to build in to their applications handling to deal with problems like availability and alternative schemes that can be switched, for instance local databases in one’s data center. Companies need to anticipates big swings in demand, and assume that’s something you will deal with, not I’ll deal with it when it comes

Twitter4j + twitter + hadoop

I have been doing a bit of work with hadoop of late in my work life, mainly using streaming map reduce and pig working to extract additional data out of weblogs, which is a powerful paradigm. Before the election I wanted to develop a way to look at data during the election period. Twitter is a powerful communication tool often trivialized, but is a powerful way to promote and for mass sentiment to be made known.

Twitter has a powerful streaming api, that allows twitter to push to the client the data in a large mass. PHP is often a tool that I have used as a rapid development tool, but usually lacks a multi-threaded model and libraries that implement features like twitter’s streaming api. Twitter4j is a good library for java and also works with android, which works well with twitter. This allowed me to capture a significant amount of data for analysis. the code had matured significantly by the time the town hall debate took place, which led to capturing a good quality of data. This run used a the Query Stream, which allowed to filter from the global data set that twitter is, and limit it to the united states and topics relating to the debate and presidential election. Wanting to do more work with hadoop’s java libraries and features, I wrote the hadoop map reduce jobs in java and setup a single pseudo distributed node to process the data. These are the results imported into Google spreadsheets.

Barnes and Noble, You Could Do Better

Let’s give Barnes and Noble some credit. The Nook color is a nice device as a reader with some tablet stuff. At the time they came out with it, the android honeycomb code base was not available, so one can see why they went with the older code base when they designed their device. The Honeycomb source has been available for a while now. Why not upgrade your device to the new Android codebase so you can run better more modern apps which can be those optimized for a tablet versus a phone. Now Barnes and Noble comes out with the Nook Tablet and they are still in the old android code base. What gives, the honeycomb code has been around for a while now. Why use an aging code base that isn’t even intended for tablet use?

Kindle Fire, Lame Decision

Amazon made a lame decision to build the kindle fire with an android 2.2 tablet and not working with the android Honeycomb class codebase, which are typically 3.2 version. The device has a limited amount of memory as well (8GB) and not 16GB or 32GB which is standard with the honeycomb tablets. the User Interface also looks like it was  borrowed from Barnes and Noble’s Nook Color device, also a android 2.2 device. One wonders what manufacturer is building these devices for them. When the honeycomb devices started to come out, Google clamped down on the rights to the code base so only well established manufacturers could get the honeycomb code base.

Built new Celebrity Astrology Tablet App

I just built a new version of my Celebrity Astrology App for Honeycomb tablets with a new design. It turned out very nice, much better than the standard phone modal designs. Using the new Fragment architecture allows multiple views to be placed on screen, each with separate code, so user isn’t switching as often to new Activities and the user flow is far more natural and easy to work with. This is a real aspect where the 2.2 android devices fall short. I am sure there are those who will try to emulate them for devices like Kindle Fire, but Honeycomb and later devices are far superior and are designed for tablets.

Experiences Developing PHP Extensions

In the past I had done a lot of development in C and C++. It used to be the language I worked the most with. C++ can help make developing a complicated application far easier than writing it in pure C. I have moved away into doing more Java work in the last four years. I started with PHP during the latter part of the C/C++ period. I wanted to try to work on developing a PHP extension based on some C and C++ code I had written a number of years ago.

Reading about extension writing appears very intimidating because of the Macros and Zend api aspects. Starting off from scratch this can be very challenging, particularly setting up the structures and the boilerplate as it were for the extension. PHP is written in standard K&R C so that it can be compiled on any platform, which is a great feature, but the code is very intricate, particularly when it comes to object oriented code and reference counting. There have been articles written about wrapping C++ classes into PHP extensions, which seems like a possibility in terms of working with C++ libraries. You physically can build extensions with C++, but for wider distribution on PECL writing your extensions in C is probably the smarter way to go.

I found a Pear project to do the code generation for PHP extensions. This project takes an xml descriptor including code blocks, which generates all the code especially the boilerplate code. It seems to be promissing, It seems it’s best to  continually doing edits in the .xml file and generating new sets of code whenever possible as it makes it easier adding new functions. This approach also adds the necessary support for unit tests. Certain aspects of it seems to not be as well flushed out as could be. In a more complex extension, going beyond what is provided with this project.

PHP uses the the GNU Autotools system, autoconfigure and automake to configure builds. It makes use of M4 macros to manage and extending the basic scripts. It provides a way to interrogate the system and determine the options needed to build a piece of code on a given system. Unfortunately this doesn’t work on Windows. Windows requires a separate build script system in JScript. Aspects of the M4 macros can be challenging to understand which may be a problem with more complex projects.

PHP uses a unit test system using files with snippets of code, expectations, etc, that once one completes make on the extension make test will execute the test and show test results. You write these tests through the .xml described above.

Redis, the more Powerful NoSql Alternative to Memcached

The NoSql Paradigm has proven to be an interesting alternative to rdms systems, particularly where having a system that is more flexible than one requiring a very complicated Schema to account for complex aspects of an application, particularly those that change quickly. Document oriented systems such as Mongodb are useful to fit this type of problem. It supports highly scalable sharding system. Mongodb can become complicated, when sharding is involved, but other systems can as well. If you want to store complicated model data, Mongodb is a good choice. It offers good performance and features. Memcached is a key value store which has had a following for caching website content and mysql queries. It is a purely memory based store. Mongodb, on the other side is a disk based store using memory mapping of the file, disk storage and disk I/O is always involved.

An interesting alternative to Memcached with more than just a get/set model as well as solving the problem that Memcached has to solve. This is Redis, a key value store featuring in addition to simple key value features, lists, sets, and hashes. Even in the plain key space, redis ads features such as in place string manipulation. With these new data types, it features atomic operations on these protecting the integrity of the data contained within. Memcached is a good product for what it is, a cache system and key value store but it’s limited in terms of the sematics it supports for more complicated uses, and the developer winds up writing more complicated code to deal with the limitations.

Redis is also fast. It is comparable in speed to Memcached. A writer did a benchmark comparison, though Salvatore Sanfililipo the lead developer of Redis found the benchmark problematical including it  being a tight loop which didn’t show how it behaved with many clients. This benchmark showed Redis slightly slower than memcached. A rewritten apples to apples benchmark showed Redis performing slightly better, particularly under more clients. Even with this comparison, it seems like comparing apples to oranges in a way, the comparison doesn’t show a comparison to a product with comparable features.

Redis is a newer product, though the developers focus on a high quality product with fast performance. The newest version is in release candidate status, with new features to manage multiple keys in one command, and future clustering and sharding features coming. Personally I am not sure of the value of these, but I am user there are those who can use them. One challenge are the clients. There are clients for many languages, even the estoeric. There are a number of php clients. The most mature however only works with php-5.3 and requires namespaces, a feature that has value, but if you are supporting 5.2 servers it won’t work. There is a C based php module, but it is very new and not in the Pecl collection yet. One I like, Rediska, written by a russian group which is well designed, performs well, and integrates very well with Zend Framework.