Cassandra challenges

I have recently been doing research and development with Apache Cassandra (DataStax). Cassandra is an amazing piece of software. I attended a data modeling class run by a DataStax engineer, and it was quite impressive. His message was essentially: follow our advice and it will work well; otherwise it might suck. I was struck by how much of what we know about relational databases has to be set aside, which I think can be a problem for some people because the CQL language makes you think you are working with a relational database. When you work with Cassandra, you need to build a model that works well both in Cassandra storage terms and for your application. You can’t join, and rows are placed according to a hash of the partition key (the first part of the primary key), so related entries may be scattered across your cluster. There are few built-in functions, so you really need to rethink how you architect and design your application.
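To make the scattering concrete, here is a minimal sketch of how hashing a key decides which node owns a row. It is a toy model, not Cassandra's actual code: the node names, the 32-bit ring size, and the use of MD5 in place of Cassandra's real partitioner are all assumptions made for illustration.

```python
import hashlib
from bisect import bisect_left

# Toy ring size; a real cluster uses a much larger token range.
RING_SIZE = 2 ** 32


def token_for(partition_key: str) -> int:
    """Map a partition key to a token (MD5 is only a stand-in hash here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_ring(nodes):
    """Give each node one evenly spaced token; returns sorted (token, node) pairs."""
    step = RING_SIZE // len(nodes)
    return [(i * step + step - 1, node) for i, node in enumerate(nodes)]


def owner(ring, partition_key: str) -> str:
    """Return the node whose token is the first one >= the key's token."""
    tokens = [token for token, _ in ring]
    # Wrap around to the first node if the key's token is past the last one.
    index = bisect_left(tokens, token_for(partition_key)) % len(ring)
    return ring[index][1]


# Hypothetical node names and keys, purely for illustration.
nodes = ["node-a", "node-b", "node-c"]
ring = build_ring(nodes)
keys = ["user:1", "user:2", "user:3", "user:4", "user:5", "user:6"]

for key in keys:
    print(key, "->", owner(ring, key))
```

Keys that look adjacent, such as user:1 and user:2, can land on different nodes, while every row sharing one key always lands on the same node. That is why the data model has to be built around the key your queries will look up.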

Cassandra has also changed a lot from its earlier incarnations to the new 2.1 version. Early versions used a Thrift-based API; the CQL language was then introduced and enhanced, and Thrift is now essentially deprecated. There are a lot of drivers and solutions built on the old Thrift-based API that will not be usable going forward. Several design ideas, for instance dynamic column families, where entries in the same column family or table might have very different schemas, worked well in Thrift but do not carry over directly to CQL. When researching compatible drivers, look for those implemented using CQL, not Thrift.

Loading large amounts of data into Cassandra is more difficult than with a relational database. It’s not like MySQL or Oracle, where you can quickly load from a text file, a SQL file, or a loader file. You essentially have two options. The first is to write code that inserts into Cassandra using CQL through a driver; here you might improve performance by using async inserts and updates. The other is to build an SSTable writer tool that writes SSTables, essentially the format Cassandra uses internally for storage, and then stream them into Cassandra using sstableloader or a JMX-based loader. That puts you in Java territory, but fortunately there is a CQL-based SSTable writer class you can use.
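As a rough sketch of the first option, the loader below keeps a fixed number of inserts running at once and collects failures for a later retry. It does not talk to Cassandra: insert_row is a hypothetical stand-in for whatever call your driver provides to execute a single CQL INSERT, and max_in_flight is an illustrative parameter, not a driver setting.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def insert_row(row):
    """Hypothetical stand-in for a driver call that executes one CQL INSERT."""
    # A real loader would bind `row` to a prepared statement and execute it here.
    return row["id"]


def load_rows(rows, insert=insert_row, max_in_flight=32):
    """Run inserts concurrently; return (succeeded_count, [(row, error), ...])."""
    succeeded = 0
    failed = []
    # The pool size caps how many inserts execute at the same time.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = {pool.submit(insert, row): row for row in rows}
        for future in as_completed(futures):
            try:
                future.result()
                succeeded += 1
            except Exception as exc:
                # Keep the failed row so the caller can retry or log it.
                failed.append((futures[future], exc))
    return succeeded, failed


rows = [{"id": i, "name": "row-%d" % i} for i in range(100)]
succeeded, failed = load_rows(rows)
print(succeeded, "inserted,", len(failed), "failed")
```

This version submits every row up front, which is fine for a sketch but would hold the whole data set in memory; a real loader would more likely feed rows in chunks and use the driver's own asynchronous API rather than threads.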