Zero to Billion records – Time series data processing with Cassandra

Time series data can be described as information generated by applications/systems at regular time intervals. For example, on a Linux server, the following data can be gathered for the Cassandra process using the “perf stat” command.

[sample “perf stat” output]

Now, the interesting thing to note is that the above data is collected at two different time points. The counters are the same, but their values vary across time points. In the business my employer is in, we gather such data points for thousands of systems, every hour, and for hundreds of such counters. Now you do the math!!

So, we are talking about roughly 10 billion counter data points being generated every week, and it is critical from a business standpoint to have the ability to mine through this data. This is where Apache Cassandra as a data storage technology plays an interesting role. Two features of Cassandra help make this happen:

  1. Infinite columns … well, almost. Cassandra supports up to 2 billion columns per row. So, if you store the timestamp (in seconds) as your column name, a single row can hold ~63 years' worth of data.
  2. Row-level data storage optimization – as more columns get added to the same row, Cassandra keeps compacting the SSTable files (where data and indexes are maintained). This ensures that when there is a need to fetch the data for a row, the seek operation on disk is highly optimized. Data in a time range can be fetched extremely fast (we are talking milliseconds here – even when there are billions of records).
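The “~63 years per row” claim in point 1 is just arithmetic, and can be sketched in plain Java (no Cassandra required):

```java
// Back-of-the-envelope check: how long can one row absorb one sample
// per second before hitting Cassandra's 2-billion-columns-per-row limit?
public class WideRowCapacity {
    public static void main(String[] args) {
        long maxColumnsPerRow = 2_000_000_000L;     // Cassandra's per-row limit
        double secondsPerYear = 365.25 * 24 * 3600; // 31,557,600
        double years = maxColumnsPerRow / secondsPerYear;
        System.out.printf("One row holds ~%.1f years of per-second samples%n", years);
        // prints: One row holds ~63.4 years of per-second samples
    }
}
```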

How to set up a single-node Cassandra?

Well, this is very easy.

  1. Install the Java 7 JRE on the Linux box
  2. Download latest Apache Cassandra (2.x) from their home page OR get the DataStax Community edition from DataStax site.
  3. Expand the Cassandra tar file
  4. Edit the listen_address and rpc_address fields in cassandra.yaml to refer to the current hostname
  5. To make use of default Cassandra settings (mentioned in conf/cassandra.yaml file), do the following:
    	bash-4.2$ sudo mkdir /var/lib/cassandra
    	bash-4.2$ sudo mkdir /var/log/cassandra
    	bash-4.2$ sudo chown -R vijayi /var/lib/cassandra
    	bash-4.2$ sudo chown -R vijayi /var/log/cassandra
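Step 4 above amounts to a two-line change in conf/cassandra.yaml; the hostname below is only a placeholder for your node's name or IP:

```yaml
# conf/cassandra.yaml -- replace cass-node1 with this node's hostname/IP
listen_address: cass-node1
rpc_address: cass-node1
```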
     

Then get started with Cassandra, creating the keyspace and table using CQLSH:

	./cassandra -f &

	cqlsh> create keyspace perf_objects with replication = {'class':'SimpleStrategy', 'replication_factor':1};
	
	cqlsh:perf_objects> CREATE TABLE PERF_SYSTEM ( counterkey text, capture_time timestamp, counter_value text, PRIMARY KEY (counterkey ,capture_time));

The first line starts Cassandra in the background. Then we connect to the node using CQLSH and create the required keyspace. Finally, the table is created – and this is the most interesting part. The PRIMARY KEY comprises two fields: counterkey and capture_time. Cassandra treats counterkey (the first field) as the partition key, and the values of capture_time are used as column names. This ensures that for a given counterkey we can execute really fast queries over a date range.
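With this layout, a typical read is a slice of a single partition. A sketch of such a query (table and counter key taken from this post; the one-day window is arbitrary):

```sql
-- fetch one counter's samples for a one-day window
SELECT capture_time, counter_value
  FROM perf_system
 WHERE counterkey = '013505.system.system.http_ops'
   AND capture_time >= '2013-04-03 00:00:00'
   AND capture_time <  '2013-04-04 00:00:00';
```

Because capture_time is the clustering column, this range scan is a sequential read within one row rather than a scatter across the cluster.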

That is it! Your single-node Cassandra server is up and running. Now start inserting rows with CQLSH or a separate client (Java in my case).

	 insert into perf_system (counterkey, capture_time, counter_value) values ('013505.system.system.http_ops','2013-04-03 08:02:00','12');
	 insert into perf_system (counterkey, capture_time, counter_value) values ('013505.system.system.cifs_ops','2013-04-03 09:02:00','11043827');

A very easy-to-use Java client is available from DataStax itself – the DataStax Java Driver.
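A minimal insert through that driver might look like the sketch below (driver 2.x API; the contact point and sample row are placeholders, and a running Cassandra node is assumed):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class PerfInsert {
    public static void main(String[] args) {
        // "cass-node1" is a placeholder contact point
        Cluster cluster = Cluster.builder().addContactPoint("cass-node1").build();
        Session session = cluster.connect("perf_objects");

        // Prepare once, bind many times -- avoids re-parsing the CQL per insert
        PreparedStatement ps = session.prepare(
            "INSERT INTO perf_system (counterkey, capture_time, counter_value) VALUES (?, ?, ?)");
        session.execute(ps.bind("013505.system.system.http_ops",
                                new java.util.Date(), "12"));

        cluster.close();
    }
}
```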

Setting up a multi node Cassandra cluster?
This is a lot easier than it sounds. So far, you have installed and started Cassandra on a single server. To create a cluster (multiple nodes), simply install Cassandra on the other server as well, then make one common change to the cassandra.yaml files on both nodes: set the "seeds:" value to both server names / IPs (comma-separated) in the cassandra.yaml of both instances. The servers will then start talking to each other, and partitioning and replication of data are done completely transparently in the background.
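In the 2.x cassandra.yaml, the seeds value lives under the seed_provider section. It could look like this, identical on both nodes (hostnames are placeholders):

```yaml
# conf/cassandra.yaml, same on both nodes -- hostnames are placeholders
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "cass-node1,cass-node2"
```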

What next?
I have a Java client that parses the incoming data and loads it into Cassandra using the DataStax Java driver. Inserts are very fast – close to 30,000 records per second. I did not benchmark the read requests, but they also complete in milliseconds. If the number of inserts is high, you should use the batch feature of the DataStax Java driver – it saves network round trips and unnecessary commits.
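The batch approach could be sketched as follows (driver 2.x API; UNLOGGED batches skip the atomic batch log, which is usually acceptable for time series inserts; hostnames and rows are placeholders, and a running cluster is assumed):

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.Date;

public class PerfBatchLoad {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("cass-node1").build();
        Session session = cluster.connect("perf_objects");
        PreparedStatement ps = session.prepare(
            "INSERT INTO perf_system (counterkey, capture_time, counter_value) VALUES (?, ?, ?)");

        // One network round trip carries many inserts
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        batch.add(ps.bind("013505.system.system.http_ops", new Date(), "12"));
        batch.add(ps.bind("013505.system.system.cifs_ops", new Date(), "11043827"));
        session.execute(batch);

        cluster.close();
    }
}
```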

Another thing you could try for such massive batch uploads is to build an SSTable file from the incoming data and then bulk-load that file into Cassandra. This should be a lot more efficient in my case.
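Cassandra ships tooling for exactly this workflow: SSTables written offline can be streamed into the cluster with the bundled sstableloader tool. A hedged sketch (the hostname and directory path are placeholders; the directory is expected to contain the SSTable files for the target keyspace/table):

```shell
# stream pre-built SSTables for perf_objects.perf_system into the cluster
bin/sstableloader -d cass-node1 /tmp/perf_objects/perf_system
```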
