Twitter4J + Twitter + Hadoop

I have been doing a bit of work with Hadoop lately in my work life, mainly using streaming MapReduce and Pig to extract additional data out of weblogs, which is a powerful paradigm. Before the election I wanted to develop a way to look at data during the election period. Twitter is a communication tool that is often trivialized, but it is a powerful way to promote ideas and for mass sentiment to be made known.

Twitter has a powerful streaming API that lets Twitter push data to the client in bulk. PHP is a tool I have often used for rapid development, but it usually lacks a multi-threaded model and libraries that implement features like Twitter’s streaming API. Twitter4J is a good Java library for working with Twitter, and it also works on Android. This allowed me to capture a significant amount of data for analysis. The code had matured significantly by the time the town hall debate took place, which led to capturing good-quality data. This run used the query stream, which let me filter the global data set that Twitter is and limit it to the United States and to topics relating to the debate and the presidential election. Wanting to do more work with Hadoop’s Java libraries and features, I wrote the Hadoop MapReduce jobs in Java and set up a single pseudo-distributed node to process the data. These are the results imported into Google Spreadsheets.
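The jobs themselves are not reproduced here. As a rough sketch of the map and reduce steps behind this kind of counting, the plain-Java example below tallies tweets per client. It does not use Hadoop's API. The class name SourceCountSketch and the tab-separated record layout (id, source, text) are assumptions made for illustration, not the format the captured data was actually stored in.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of map, shuffle and reduce; this is not Hadoop's API.
class SourceCountSketch {

    // Map step: emit a (source, 1) pair for one record.
    // Assumed (hypothetical) record layout: id <TAB> source <TAB> text
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String[] fields = record.split("\t", 3);
        // Skip malformed records that have no source field.
        if (fields.length >= 2 && !fields[1].isEmpty()) {
            out.add(new SimpleEntry<String, Integer>(fields[1], 1));
        }
        return out;
    }

    // Reduce step: sum all the counts emitted for one source.
    static int reduce(List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Driver: group the mapped pairs by key (the "shuffle"), then reduce each group.
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> pair : map(record)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample records, for demonstration only.
        List<String> sample = Arrays.asList(
                "1\tweb\tfirst tweet",
                "2\tTwitter for iPhone\tsecond tweet",
                "3\tweb\tthird tweet");
        System.out.println(run(sample));
    }
}
```

In a real Hadoop job the grouping in run() is done by the framework between the mapper and the reducer, so only the map and reduce steps would be written by hand.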

Great Visualizations with D3

I’ve been seeing a lot of amazing infographics and visualizations. I was at a conference presented by Actuate, the BI company. They gave a talk on visualizations, in part because they had purchased a company to help them compete with products like Tableau, which helps with visualizing data. Part of the discussion was D3, a JavaScript and HTML5 library that sites like The New York Times use for some of their wonderful graphics. You can see from the samples gallery some of the amazing things you can do with it. If you have good skills with CSS and JavaScript, you can create very dynamic graphics for projects you are working on.

You can see to the left a clip from a project I have been working on in my free time. During the period leading up to the elections I was working on a project to capture Twitter data during the debates and later build Hadoop jobs to crunch the data and reduce it down to summary data. The sample to the left, from the town hall debate, is drawn from the source data: the top 100 sources, which are the Twitter clients people were using, with the largest text representing the most-used clients. It uses a word cloud type of visualization, which makes it hard to draw conclusions, though you can pick out the important information. The data needed to be scaled from a range of approximately 30 to 58,000, so I scaled it using log10(n/500).
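To make the effect of that scaling concrete, here is a small sketch of log10(n/500) applied to the two ends of the range. The class and method names are made up for illustration. The original visualization was built with D3, so this only reproduces the arithmetic, not the drawing code.

```java
// Illustrative only: reproduces the log10(n/500) arithmetic, not the D3 code.
class LogScaleSketch {

    // Scale a raw count n with log10(n / 500).
    static double scale(double count) {
        return Math.log10(count / 500.0);
    }

    public static void main(String[] args) {
        // The approximate ends of the range, plus the pivot value 500.
        double[] counts = {30, 500, 58000};
        for (double c : counts) {
            System.out.printf("%8.0f -> %.3f%n", c, scale(c));
        }
    }
}
```

A range of roughly 30 to 58,000 compresses to roughly -1.22 to 2.06. Counts below 500 come out negative, so turning these values into font sizes would need a further offset or multiplier. How that step was done is not described here, so treat any such mapping as an assumption.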

Comparison of Three PHP MVC Frameworks

PHP is fast becoming one of the dominant languages on the web. It is easy enough for beginners, but has significant power for serious sites. Model View Controller (MVC) frameworks, which have been used in many different contexts for desktop applications, mobile phone architectures and web sites, have become a significant force. Java’s Spring Framework is great in Java and has been ported to C#. In PHP there are a number that are worth considering. Some key ones are Zend Framework, by the developers and maintainers of the PHP interpreter, CakePHP, CodeIgniter, and Symfony. I have been working with Zend professionally, and have been using Cake and CodeIgniter in some personal projects, though I have not worked with Symfony. I am going to compare the first three.

CakePHP is a powerful, strongly MVC, standards-based framework. It has a popular following and a lot of nice features. If your application can fit into the strict Cake pattern, it works well. When you can’t work within the framework’s expectations, it can be challenging and frustrating. This is similar to Spring in Java. Understanding the framework well enough to know how to implement your design is important. One challenge is how tightly coupled the database structure and the model are, including field and table naming. It also has substantial ORM features. Simple joins were difficult, requiring unnecessary association tables for a simple project. Scripts are used to manage the application, dealing with the mappings and creating objects. Cake is in active development and has a strong following, including fanatics.

CodeIgniter is a lighter, more flexible framework, with base classes that are highly optimized and kept light by comparison, though you can add complexity as you need it. It has a strong user base and is in active development. For a smaller project, this is a good match. The coupling between model and database is looser. It also has ORM mapping. The model can be as simple or as complicated as necessary. A big project with a lot of complexity may not be a good fit, but it has good features. Development is pretty fast. PHP templating is standard, though a template system is integrated as a feature.

Zend Framework is one of the best frameworks, but it has limitations. I have been using it to build a large REST middle tier to extend the XML-RPC services provided by the OpenX ad server. It has features for many types of technologies, including integrating STOMP JMS features, web services, Ajax, JSON and front end view functionality. It has some decent database technology which can work with more databases than just MySQL. It uses a modern PHP style reminiscent of Java, including interfaces, abstract base classes, and access levels, i.e. public, private and protected. It uses a substantial number of design patterns. Class naming is somewhat strict, but in return it gives wonderful autoloading features, so there is no include_once code; you just construct an object. Zend does not have strong ORM features, though users have integrated the Doctrine modeling framework. Others have also included the eZ Components framework, which adds some nice features. PHP views are standard, though it’s easy to integrate Smarty templates. An additional challenge is that the structure of the app has changed frequently, and technology has been added and improved, so some documentation you find from other sources may be out of date.