Resource Requirements of Impala can be frustrating

Although I said in an earlier post that Impala is an exciting technology, using Impala with modest resources is very problematic. Queries over small amounts of data come back faster, but with large amounts of data the problem becomes that queries simply fail to execute and, frankly, only run in Hive. If you can afford new top-end servers with 256GB of memory (the recommended configuration), Impala will work for you; in reality, for those with modest budgets, it can be a real problem. A query that takes longer to run is acceptable; a query that does not run at all is a real problem.

The benchmarks claiming that Impala is faster than other SQL-on-Hadoop engines tend to be run on top-end servers, and some of the really large queries in those benchmarks still fail to run. I was hoping that Impala would be useful for ad-hoc querying on limited amounts of data. Sometimes you just need Hive, yet the performance of Hive queries, particularly on complicated processing flows, is just too slow.

I was at a couple of meetups presented by Cloudera people, and I listened to talks at Strata Hadoop New York 2014 by Cloudera people, including the project lead. I am impressed with the work that is going on and the direction the project is taking. However, some of the new direction seems to be diverging from Hive compatibility.

One frustration with Cloudera CDH 5.2 is that it does not come with the Tez engine for Hive; support says install it at your own risk. Second, the installation of Spark is less than perfect. I am heading toward exploring and testing the Hortonworks distribution, in the hope that Hive on Tez is less frustrating.

Cloudera Impala, an Exciting Technology

Cloudera Impala is an exciting Apache Hadoop-based technology for doing fast SQL queries on big data. Historically people have used Apache Hive, part of the Hadoop tool set, but Hive queries running on substantial data sets can take a long time. Hive turns the queries into map-reduce jobs and runs them on Hadoop. Impala is a massively parallel query engine: it chops the data and the facets of the query into chunks, splits the work over the cluster, and can run queries dramatically faster. A query can complete in seconds where it might take an hour in Hive. Essentially, Impala offers a subset of the functionality that Hive provides. Impala does not support the map, list, and set complex types or JSON data, for instance, so things one might do with Hive's SerDe functionality may not be possible in Impala. Some of the data-transformation aspects of Hive also aren't supported, and some DML functionality, such as UPDATE and DELETE, is missing. You can connect to it with a Hive Server 2 driver pointed at the Impala-specific port, using ODBC, JDBC, and similar tools.

Impala prefers the Parquet storage format, a column-oriented compressed binary format, though it can also create and insert into text-format tables. It can query Avro, RCFile, and SequenceFile tables, but cannot insert into them. One particular issue when working with Impala alongside Hive on Parquet-format tables is that timestamp and decimal columns are not supported in earlier Hive; support will be provided in Hive 0.14, which is being tested at present.
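As a sketch of the typical workflow (the table and column names here are made up for illustration), one might create a Parquet-backed table in impala-shell, populate it from an existing text-format staging table, and then query it:

```sql
-- Hypothetical example: a Parquet table loaded from a text-format table.
CREATE TABLE weblogs_parquet (
  ts     TIMESTAMP,
  url    STRING,
  status INT
) STORED AS PARQUET;

INSERT INTO weblogs_parquet SELECT ts, url, status FROM weblogs_text;

-- Column-oriented storage makes aggregations over a few columns fast.
SELECT status, COUNT(*) FROM weblogs_parquet GROUP BY status;
```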

The big data SQL field has been changing quickly of late, with Hive on Tez (which Hive 0.13 will support), Spark SQL, and Facebook's Presto engine all emerging as alternatives.


Cassandra challenges

I have recently been doing research and development using Apache Cassandra (DataStax). Cassandra is an amazing piece of software. I went to a modeling class run by a DataStax engineer that was quite impressive. He essentially said that if you follow their advice it will work well; otherwise it might suck. I was struck by the need to ignore a lot of what we know about using relational databases, which I think can become a problem for some, because the CQL language makes you think Cassandra is a relational database. When one works with it, one needs to build a model that works well both in Cassandra storage terms and in terms of your application. You can't join, and entries are scattered across your cluster based on the hash of the partition or cluster key. There are few functions to use; you really need to rethink how you architect and design your application.
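As an illustration of modeling around queries rather than normalization (the table and columns are hypothetical), a time-series table might be keyed so that each sensor's readings live together on one partition, ordered by time:

```sql
-- Hypothetical CQL schema: the partition key keeps one sensor's data
-- on one set of replicas; the clustering column orders rows within it.
CREATE TABLE readings_by_sensor (
  sensor_id  text,
  reading_ts timestamp,
  value      double,
  PRIMARY KEY ((sensor_id), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC);

-- Queries must follow the key structure; this one works because it
-- names the full partition key:
SELECT * FROM readings_by_sensor WHERE sensor_id = 's-42' LIMIT 100;
```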

Cassandra has also changed a lot from its earlier incarnations to the new 2.1 version. The early versions used a Thrift-based API; the CQL language was then introduced and enhanced, and Thrift is now essentially deprecated. A lot of drivers and solutions were built up using the old Thrift-based API, and going forward they will not be usable. Several design ideas, for instance dynamic column families, where entries in the same column family or table might have very different schemas, worked well in Thrift but will not in CQL. When researching compatible drivers, one should look for those implemented using CQL, not Thrift.

Loading large amounts of data into Cassandra is more difficult. It's not like MySQL or Oracle, where you can quickly load from a text file, an SQL file, or a loader file. You essentially have two options. First, write code that inserts into Cassandra using CQL through a driver; here you might improve performance using async inserts and updates. Your other option is building a tool around an SSTable writer, which writes into an SSTable, essentially what Cassandra uses internally for storage, and then streaming it into Cassandra using sstableloader or the JMX loader. With this you are in Java territory; fortunately there is a CQL-based SSTable writer class you can use.
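A minimal sketch of the second option using Cassandra's CQLSSTableWriter class; the keyspace, table, and paths here are made up for illustration, the Cassandra jars are assumed to be on the classpath, and the output directory would then be streamed in with sstableloader:

```java
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

// Sketch only: error handling omitted, schema is hypothetical.
public class BulkWrite {
    public static void main(String[] args) throws Exception {
        String schema = "CREATE TABLE demo.users (id int PRIMARY KEY, name text)";
        String insert = "INSERT INTO demo.users (id, name) VALUES (?, ?)";

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/demo/users")   // must already exist
                .forTable(schema)
                .using(insert)
                .build();

        for (int i = 0; i < 1000; i++) {
            writer.addRow(i, "user-" + i);        // bound in statement order
        }
        writer.close();
        // Then: sstableloader -d <node-address> /tmp/demo/users
    }
}
```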

Shellshock: the hyperbole and the reality

There has been a lot in the media about the Shellshock vulnerability, which has its roots in the bash command-line shell on Linux and Unix environments. For hackers, coders, and system administrators there are issues that should be checked out.

However, whenever I see security horror shows like the ones we have seen recently, I am reminded that many of these are dangerous mainly for the unsophisticated, the lazy, and the stupid. Unsophisticated users may create websites with many security issues and not know what to avoid. The lazy are those professionals who don't take proper steps when setting up systems and machines. The stupid I reserve for the arrogant who fail to secure systems.

In looking at this issue, there is much hyperbole, such as in this article and this self-serving one. Symantec wants to sell software. This is not going to lay waste to the internet.

A few facts. For most Mac users, shell scripting using bash is not enabled, and Apple has added security on top of its Unix. Small home routers generally use smaller-scale implementations of Linux built on BusyBox, largely for performance reasons. Windows computers are not a problem, at least for this issue. Most large systems are likely sitting behind firewalls, and many Linux and Unix systems can't be accessed from the outside. The vulnerability predominantly affects systems that sit outside firewalls and run webserver software. Even commodity web-hosting platforms like cPanel/WHM have auto-update running and update software without human intervention.

Here is a list of security tips that can help.

  1. Disable CGI on Apache or remove ExecCGI from vhosts.
  2. If you run PHP on Apache, enable suPHP to tighten security.
  3. Disable shell access for accounts that webservers run under, particularly apache.
  4. Use a more modern Apache, or nginx, and run whatever you can under FastCGI, which is more secure.
  5. Enable security modules like mod_security on Apache, and the Suhosin PHP module; however, Suhosin can be overkill for some users, so if it gets in the way, remove it with care.
  6. Remove CGI scripts, or secure them if they are absolutely necessary.
  7. Do not run applications like Bugzilla on mod_cgi unless they are not accessible beyond the firewall.
  8. Remove bash as the default shell on system accounts.
  9. Update bash.
  10. Disable and remove vulnerable software outside of the firewall.
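Beyond those tips, the widely circulated one-liner below checks whether a given bash binary is still vulnerable: a patched bash prints only "ok", while a vulnerable one also prints "vulnerable".

```shell
# The exploit smuggles a command after a function definition in an
# environment variable; patched bash refuses to run the trailing code.
env x='() { :;}; echo vulnerable' bash -c "echo ok"
```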

New York Times being Hacked, Implications

We heard this week about The New York Times being hacked by the Chinese government in retaliation for articles it has written. This reminds me that we live in unsafe times. Waiting for the government to make us safe from the outside world is silly. We need to be more thoughtful about security issues. We need to see that our applications are properly secured, and our networks are secured. Many companies use software like Java, or open source applications like Joomla, WordPress, phpMyAdmin, Drupal, osCommerce, Zen Cart, X-Cart, and the OpenX ad server, which, if not updated frequently and properly secured, can allow hackers to exploit or corrupt systems.

Antifragile and implications for software

I just finished reading Nassim Nicholas Taleb's book Antifragile: Things That Gain from Disorder. This is a fascinating book, particularly for those interested in statistics, critical thinking, and better understanding the world we currently deal with. He is clearly very bright and makes a lot of good points, but I don't agree with many of his arguments. I would not like us to go back to MS-DOS and Windows 3.1 or Java 1.3 just because they are old. I tend to apply more careful thinking and evaluation before moving toward new technology. Take products like NodeJS: I don't want to replace nginx or Apache webservers with JavaScript code running in NodeJS. I think NodeJS, its community, and a lot of its libraries are far too immature, like things were with Java 1.2 or Microsoft's first C++ compiler; it takes more time before things develop. I like MongoDB, but decided to wait before building applications that depend on it.

Taleb talks about Black Swan events, which remind me of Hurricane Sandy and the damage it left, and of several of the Amazon AWS outages. Companies need to be careful about putting all their operations with one provider, or getting too tightly coupled to platforms that can have outage events or problems with availability. Developers need to build handling into their applications to deal with problems like availability, along with alternative schemes that can be switched in, for instance local databases in one's own data center. Companies need to anticipate big swings in demand and assume that is something they will deal with, not something to handle when it comes.

Twitter4j + twitter + hadoop

I have been doing a bit of work with Hadoop of late in my work life, mainly using streaming map-reduce and Pig to extract additional data out of weblogs, which is a powerful paradigm. Before the election I wanted to develop a way to look at data during the election period. Twitter is a communication tool that is often trivialized, but it is a powerful way to promote ideas and for mass sentiment to be made known.

Twitter has a powerful streaming API that lets Twitter push data to the client in large volumes. PHP is a tool I have often used for rapid development, but it lacks a multi-threaded model and libraries that implement features like Twitter's streaming API. Twitter4j is a good Java library (it also works with Android) that works well with Twitter, and it allowed me to capture a significant amount of data for analysis. The code had matured significantly by the time the town hall debate took place, which led to capturing data of good quality. This run used a filtered query stream, which allowed me to narrow the global data set that Twitter is down to the United States and topics relating to the debate and the presidential election. Wanting to do more work with Hadoop's Java libraries and features, I wrote the Hadoop map-reduce jobs in Java and set up a single pseudo-distributed node to process the data. These are the results imported into Google spreadsheets.
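A minimal Twitter4j sketch of that kind of capture; the bounding box, track terms, and the println handler are my own illustrative assumptions, and OAuth credentials are assumed to be configured in twitter4j.properties:

```java
import twitter4j.FilterQuery;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class DebateCapture {
    public static void main(String[] args) {
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusListener() {
            @Override public void onStatus(Status status) {
                // A real run would append these to files for Hadoop to crunch.
                System.out.println(status.getUser().getScreenName()
                        + "\t" + status.getText());
            }
            @Override public void onDeletionNotice(StatusDeletionNotice n) {}
            @Override public void onTrackLimitationNotice(int lost) {}
            @Override public void onScrubGeo(long userId, long upToStatusId) {}
            @Override public void onStallWarning(StallWarning warning) {}
            @Override public void onException(Exception ex) { ex.printStackTrace(); }
        });

        FilterQuery query = new FilterQuery()
                .track("debate", "election")              // topic terms
                .locations(new double[][] {{-124.8, 24.5},  // rough SW corner of US
                                           {-66.9, 49.4}}); // rough NE corner of US
        stream.filter(query);                               // starts a background thread
    }
}
```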

Great visualizations with D3

I’ve been seeing a lot of amazing infographics and visualizations. I was at a conference presented by Actuate, the BI company. They gave a talk on visualizations, in part because they had purchased a company to help them compete with products like Tableau, whose product helps with visualizing data. The discussion covered D3, a JavaScript HTML5 library that sites like The New York Times use to produce some of their wonderful graphics. You can see from the samples gallery some of the amazing things you can do with it. If you have good skills with CSS and JavaScript, you can create very dynamic graphics for projects you are working on.

You can see to the left a clip from a project I have been working on in my free time. During the period leading up to the elections I was capturing Twitter data during the debates, and later I built Hadoop jobs to crunch the data and reduce it down. The sample to the left, from the town hall debate, shows the top 100 sources from the source data, which represent the Twitter clients people were using, with the largest text representing the most-used clients. It uses a word-cloud type of visualization, which is hard to draw conclusions from, though you can pick out the important information. The counts needed to be scaled from a range of approximately 30:58000, so I scaled them using log10(n/500).
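To illustrate how much that transform compresses the range, a quick check of the endpoints; the formula is from the post, while what to do with the negative values it produces for the rarest clients (clamping to a minimum font size, say) is left open:

```java
public class CloudScale {
    // log10(n/500) squeezes counts spanning ~30..58,000 into a few units.
    static double scale(double count) {
        return Math.log10(count / 500.0);
    }

    public static void main(String[] args) {
        System.out.printf("%.3f%n", scale(58000)); // ~2.064, the biggest client
        System.out.printf("%.3f%n", scale(500));   //  0.000, the pivot point
        System.out.printf("%.3f%n", scale(30));    // ~-1.222, would need clamping
    }
}
```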

Personal Project to Benefit Animals

Over the last months I have been working on a personal project for a cause that I care about: helping more dogs, cats, and other pets get adopted rather than fall through the cracks and be destroyed due to overcrowding in local shelters. Petfinder is an organization that a lot of smaller shelters work with, and it provides an XML API to its data that other developers can use. I have three sites based on the same source code and the same data: one has all the information, one has a filtered data set for shelters that have cats or specialize in cats, and one has the same for dogs.

The sites are built using the CodeIgniter PHP application framework, which has worked reasonably well. I use Zend Framework a lot at work, and I might have chosen Zend if I were starting today, but for a site of the scale I am working on, CodeIgniter works well. CodeIgniter is efficient in design and coding, but has a bit of ugliness underneath the core. I built classes that make requests to Petfinder's API. For the content that is more constant, I built utilities that use those classes to pull, via the XML API, the shelters I was interested in (I chose the tri-state area). The imports were then converted by another script into insert and update statements.

For more dynamic content I use direct access to the Petfinder API, for instance for a random-pet feature they have. Initially, for the features I didn't want to build, I linked out to Petfinder's site; later I built out more. I have also integrated Twitter, Facebook, the OpenX ad server, and Joomla to help with more content while involving less coding. Sidebar content is built with jQuery using AJAX, allowing it to be integrated dynamically from random pet pictures, Petfinder's random-pet feed, or the ad server. To improve the design, I have integrated web fonts into the site using Adobe's API.

I have wanted to build a state-of-the-art mobile site for this as well. Unfortunately Petfinder's mobile presence is very inconsistent with its web presence in terms of URL structure and how it can be integrated, so I knew I would have to build all the features that I wanted to present. I have worked with jQuery over the years and like a lot of the things this framework can do. I have seen people build mobile functionality with it, but until jQuery Mobile came out it seemed like much more work. I have been working with jQuery Mobile, which works reasonably well. It has some deficiencies; for instance, CSS fixed positioning doesn't work well on Android 2.3 and earlier, and the transitions can be a little jumpy, but in general it works.

Moving from Apache to Nginx with php-fpm

In order to address performance issues we have seen, our company has been switching from Apache with mod_php to nginx with php-fpm.

Given that the Apache httpd webserver is an older code base that does not take advantage of event-driven and asynchronous technology, this seems to be the future. There are a number of differences between the two.

The nginx webserver is a product that came out of Russia; it had been more obscure, though it is now the #2 webserver. The Apache webserver has been an important product in web-based technologies for many years and has been the dominant webserver. PHP is often run as a module inside Apache (mod_php) or through a CGI-type setup where PHP runs in a separate process. To improve the performance of CGI-type operation, PHP supports FastCGI. FastCGI runs a server that keeps a number of PHP processes alive and waiting for traffic; there are numerous advantages to this.

At high load, Apache has been shown not to manage memory as efficiently as possible.

Where nginx really shines is in serving static content, where its asynchronous technology allows it to serve more content than other webservers.

Switching to nginx has required changes: for instance, taking out uses of Apache-specific functions like getallheaders(), which provides headers in normal HTTP header form (User-Agent) rather than the CGI naming scheme found in the $_SERVER variable (HTTP_USER_AGENT), and converting configuration from .htaccess files to nginx rewrites.

Configuration of nginx is different from Apache's, and is more scripted in style than Apache's directives.
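As a sketch of the shape of that configuration (the server name, paths, and php-fpm socket address are assumptions; adjust them to your setup), a typical server block that hands PHP to php-fpm and replaces a common .htaccess front-controller rewrite looks like:

```nginx
server {
    listen 80;
    server_name example.com;
    root /var/www/example/public;

    # Roughly equivalent to the usual .htaccess rule:
    #   RewriteRule ^(.*)$ index.php?$1 [L,QSA]
    location / {
        try_files $uri $uri/ /index.php?$args;
    }

    # Hand .php requests to the php-fpm pool over a unix socket.
    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/var/run/php-fpm.sock;
    }
}
```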

As far as performance goes, the verdict is mixed; it can take time to tune nginx and php-fpm. If you are switching in the belief that it will be a fast win for PHP applications with performance problems, think again: you need to look at your code. If you serve a lot of static assets, though, this might be a boost for you.