Resource Requirements of Impala can be frustrating

cloudera-impalaAlthough I said in an earlier post that Impala is an exciting technology, using Impala under modest resources is very problematic. Although queries using small amounts of data come back faster, with large amounts of data, the problem just becomes the queries Fail to execute, and frankly only run in Hive. If you can afford new top end servers with 256GB, impala will work for you (this is the recommended setting), in reality for those with modest budgets, it can be a real problem. It’s better if a query takes a longer time to run, but not running it all poses a real problem.

The benchmarks where they say that Impala is faster than other sql on hadoop, tend to have top end servers, and some really large queries in those benchmarks fail to run. I was hoping that Impala would be useful for adhoc querying on limited amounts of data. Sometimes you just need hive, and the performance of hive queries, particularly on complicated processing flows is just too slow.

I was at a couple of meetups which were presented by Cloudera people, as well as listening to talks at Strata Hadoop New York, 2014 by Cloudera people, including the Project Lead, I am impressed with the work that is going on and the direction the project is going. However some of the new direction, seems to be diverging from Hive compatibility.

One frustration with Cloudera CDH 5.2 is it does not come with the Tez engine for hive, support says install at your own risk. Second, the installation of Spark is less than perfect. I am heading toward exploring and testing out Hortonworks implementation, hopefully Hive on Tez is less frustrating.

Cloudera Impala an Exciting Technology

cloudera-impalaCloudera Impala is an exciting Apache Hadoop based technology for doing fast sql queries on big data. Historically people have used Apache Hive, part of the hadoop tool set, but queries running on substantial data sets can take a long time to run. Hive turns the queries into map-reduce jobs and runs them on hadoop. Impala is a massively parallel query engine, chops up the data and the query facets into chunks and splits it up over the cluster which can run queries dramatically faster. A query can complete in seconds, where it might take an hour in hive. Essentially it is a subset of the functionality that hive provides. Impala does not support the map, list and set or json data types for instance, which one might use with the serde functionality, you might not be able to do with Impala. Some of the data transformation aspects of hive also aren’t supported. Some of the dml functionality update, delete are missing. You can connect to it with a Hive Server 2 driver to the impala specific port, using odbc, jdbc, and similar tools.

Impala prefers the Parquet storage format, which is a column oriented compressed binary format, though it can also create and insert in text formats. It can also query with Avro, RCFile, and SequenceFile, but can’t insert into. One particular issue working with Impala along side with Hive, using Parquet format tables, using timestamp columns or decimal fields is not supported in hive earlier, but will be provided in Hive 0.14 which is being tested at present.

Although the big data sql field has been changing recently with hive on Tez which Hive 0.13 will support, spark SQL and facebook’s Presto engine


Cassandra challenges

CassandraI have been recently working on doing research and development using Apache Cassandra (DataStax). Cassandra is an amazing piece of software. I went to a modeling class that a DataStax engineer ran that was quite impressive. He essentially said if you follow our advice it will work well, otherwise it might suck. I was struck by the need to ignore a lot of what we know about using relational databases, which I think can become a problem for some because the cql language makes you think that it is a relational database. when one works with it, one needs to build a model that both works well in Cassandra storage terms, but also in terms of your application. You can’t join, and entries based on the hash from the primary or cluster key might be scattered across your cluster. There are few functions to use, you really need to rethink how you architect and design your application.

Cassandra has also changed a lot from the earlier incantations to the new 2.1 version. The early versions used this thrift based api and the CQL language was introduced and enhanced and thrift is now essentially deprecated. There is a lot of drivers and solutions that have been built up using the old thrift based api, which going forward will not be usable. Several design ideas, for instance dynamic column families where you might have entries with the same column family or table having very different schemas, worked well in thrift, but will not in CQL. When researching compatible drivers, one should look for those implemented using CQL not thrift.

Loading large amounts of data into Cassandra is more difficult. It’s not like Mysql or Oracle where you can quickly load from text file or sql file, or a loader file. You essentially have two options. First write code that inserts into Cassandra using CQL through a driver, with this you might improve performance using async inserts and updates. Your other option may be building a sstable writer tool that rights into a sstable, essentially what Cassandra uses internally for storage, and streaming it into cassandra using sstableloader or the jmxloader. With this you are writing in java territory, fortunately there is a cql based sstable writer class you can use.

Twitter4j + twitter + hadoop

I have been doing a bit of work with hadoop of late in my work life, mainly using streaming map reduce and pig working to extract additional data out of weblogs, which is a powerful paradigm. Before the election I wanted to develop a way to look at data during the election period. Twitter is a powerful communication tool often trivialized, but is a powerful way to promote and for mass sentiment to be made known.

Twitter has a powerful streaming api, that allows twitter to push to the client the data in a large mass. PHP is often a tool that I have used as a rapid development tool, but usually lacks a multi-threaded model and libraries that implement features like twitter’s streaming api. Twitter4j is a good library for java and also works with android, which works well with twitter. This allowed me to capture a significant amount of data for analysis. the code had matured significantly by the time the town hall debate took place, which led to capturing a good quality of data. This run used a the Query Stream, which allowed to filter from the global data set that twitter is, and limit it to the united states and topics relating to the debate and presidential election. Wanting to do more work with hadoop’s java libraries and features, I wrote the hadoop map reduce jobs in java and setup a single pseudo distributed node to process the data. These are the results imported into Google spreadsheets.

Great visualizations with D3

I’ve been seeing a lot of amazing infographics and visualizations. I was at a conference presented by Actuate the BI company. They presented a talk on visualizations in part because of purchasing a company to help compete with products like Tableau whose product helps visualizing data. In the discussion was d3 a javascript html5 library, which sites like The New York Times uses to do some of the wonderful graphics they do. You can see from the samples gallery some of the amazing things you can do with it. If you have good skills with Css and Javascript, you can create very dynamic graphics for projects you are working on.

You can see to the left a clip from a project I have been working on in my free time. During the period leading up to the elections I was working on a project to capture twitter data during the debates and later build hadoop jobs to crunch the data and reduce it down to data. The sample to the left from the town hall debate is from the source data, top 100 sources, which represents the twitter clients that people were using with the larget text representing the most used clients. These used a word cloud type visualization, which is hard to draw conclusions from it, though you can pick out the important information. The data needed to be scaled from a range of approximately 30:58000, so I scaled using log10(n/500).