Dipesh Majumdar

Blog and Paintings


July 21, 2012

So what is all this buzz about Cassandra?

To get started with the basics of Cassandra, let us begin with a simple concept: Cassandra has no master-slave architecture. Instead, a cluster is a set of nodes in which every node is equally powerful and can act as a proxy for any other node.

It operates on a gossip architecture: nodes periodically exchange state information with one another.

Cassandra is not an enemy of relational databases (RDBMS) such as the heavyweight Oracle or SQL Server, but it definitely provides significant benefits in some specific cases.

For example, when booking a railway ticket we need absolute data consistency, so in such cases an RDBMS rules. However, consider a case where we can live with slightly inconsistent data but cannot afford to be slow, for example a search query in the Google search engine. In such cases Cassandra proves very useful.

In this era of data explosion, the nature of data itself is changing. When we have to deal with an influx of many petabytes of data each day, we have to come up with new ways to manage and structure data, and an RDBMS might not always be the right choice!

How is data stored in Cassandra?

To understand this, let us first look at it from an SQL perspective. First we create a table named employee in an Oracle database.

create table employee (emp_id number(6), name varchar2(20), BUILDING_LOCATION varchar2(20), project varchar2(10));

This table has the following columns:
desc employee
Name               Null?  Type
EMP_ID                    NUMBER(6)
NAME                      VARCHAR2(20)
BUILDING_LOCATION         VARCHAR2(20)
PROJECT                   VARCHAR2(10)

After this I insert two rows into the employee table.

Now in Cassandra, instead of the table employee, we have a column family named employee. This is how you create the column family employee:

CREATE COLUMN FAMILY employee
WITH comparator = UTF8Type
AND key_validation_class=UTF8Type
AND column_metadata = [
{column_name: name, validation_class: UTF8Type}
{column_name: building_location, validation_class: UTF8Type}
{column_name: project, validation_class: UTF8Type}
];

Insert values –
SET employee['15961']['name']='dipesh majumdar';
SET employee['15961']['building_location']='electronic_city';
SET employee['15961']['project']='xyz';
SET employee['15962']['name']='ramesh jain';
SET employee['15962']['building_location']='MTP';
SET employee['15962']['project']='PQR';

Note: there is no difference between update and insert here. You simply set it. Also there is no concept of null values here. If a column name has no value then it is simply not set.
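
To make the note above concrete, here is a minimal Python sketch that treats a column family as a map from row key to a map of column name to value. It is only a toy model of the data layout, not how Cassandra stores anything internally; the helper name set_column is made up for this illustration, and row 15962 is deliberately left with only one column to show what "not set" means.

```python
# Toy model of a column family: row key -> {column name -> value}.
# Assumption: this only illustrates the data model, not Cassandra's storage engine.
employee = {}

def set_column(column_family, row_key, column_name, value):
    """Mimic the CLI's SET: create the row or column if missing, overwrite otherwise."""
    column_family.setdefault(row_key, {})[column_name] = value

set_column(employee, "15961", "name", "dipesh majumdar")
set_column(employee, "15961", "building_location", "electronic_city")
set_column(employee, "15961", "project", "xyz")

# For this illustration only the name is set for 15962.
# The other columns are not NULL; they simply do not exist in this row.
set_column(employee, "15962", "name", "ramesh jain")

# An "update" is exactly the same call as an "insert".
set_column(employee, "15961", "project", "abc")

print(employee["15961"]["project"])     # abc
print("project" in employee["15962"])   # False
```

Each row carries only the columns that were actually set, which is why two rows of the same column family need not have the same columns.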

How queries work –

We can query in Cassandra by row key. The row key here plays the role of emp_id, and in the command line interface a row is fetched by its key like this:

GET employee['15961'];

What if we want to query by the column name project? In that case we have to create another column family that holds the same data but uses project as the key. This is definitely denormalization, but normalization is not the purpose of Cassandra. That is also why there are no joins in Cassandra: joins lead to poor performance. Shown below is another method, which uses the update command:

UPDATE COLUMN FAMILY employee WITH comparator = UTF8Type AND column_metadata = [{column_name: project, validation_class: UTF8Type, index_type: KEYS}];

Because of the secondary index created on the column project, employees can now be queried directly by project as follows:

GET employee WHERE project = 'xyz';
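
The first approach, a second column family keyed by project, can be sketched in Python with plain dictionaries. This is only an illustration of the denormalization idea; the names build_by_project and employee_by_project are hypothetical, and in a real application the second column family would be written at the same time as the first rather than rebuilt from it.

```python
# The employee column family from this post, modelled as nested dictionaries.
employee = {
    "15961": {"name": "dipesh majumdar", "building_location": "electronic_city", "project": "xyz"},
    "15962": {"name": "ramesh jain", "building_location": "MTP", "project": "PQR"},
}

def build_by_project(employee_rows):
    """Build a second 'column family' whose row key is the project (hypothetical helper)."""
    by_project = {}
    for emp_id, columns in employee_rows.items():
        project = columns.get("project")
        if project is None:
            continue  # column not set for this row, so there is nothing to index
        # Row key = project, column name = emp_id, value = employee name.
        by_project.setdefault(project, {})[emp_id] = columns.get("name", "")
    return by_project

employee_by_project = build_by_project(employee)

# Looking up by project is now a lookup by row key, with no join needed.
print(employee_by_project["xyz"])   # {'15961': 'dipesh majumdar'}
```

The same data is stored twice, once per access pattern, which is the price paid for not having joins.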

OK, now that we are fairly familiar with how data is stored and queried in Cassandra, we are all set to move on to other features such as tunable consistency and fault tolerance. For this we need to know two important terms: 1. Replication Factor (RF) 2. Consistency Level (CL)

Replication Factor (RF): Recall the demon Raktabija and his fight with Goddess Kali. Raktabija was a difficult opponent because each drop of his blood turned into another Raktabija, so there would soon be an army of Raktabijas fighting Goddess Kali. This story helps us visualise the process of replication.

Now map this process onto the Cassandra database system. Suppose there are 20 nodes and the replication factor is 5. Any new data written to a particular node is then replicated to 4 other nodes, so total loss of that data due to an outage or other unpredictable event becomes unlikely, because replicas already exist on the other 4 nodes. This is the concept of replication in Cassandra.

Now the question arises: if this new data belongs to nodes 16 to 20 and I query node 3, what happens? The answer is that there is a mechanism by which the data is fetched from the responsible nodes through node 3, which acts as a proxy for them as mentioned at the start of this post. I am not elaborating on that mechanism here, but I would like readers to post comments on it.

Now that we have understood replication factor, let me restate a little of what we have learnt, because it lays the foundation for understanding consistency level.
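
The 20-node, RF=5 example can be sketched in Python as a simplified ring. The assumptions here are mine and not taken from the post: the first replica is chosen by hashing the row key with MD5, the remaining replicas are simply the next nodes clockwise, and a node that does not hold the data forwards the request to the first replica. A real cluster uses its configured partitioner and replication strategy, so treat this purely as an illustration.

```python
import hashlib

NODE_COUNT = 20          # nodes are numbered 1..20, as in the example above
REPLICATION_FACTOR = 5

def replica_nodes(row_key, node_count=NODE_COUNT, rf=REPLICATION_FACTOR):
    """Return the node numbers that hold a copy of row_key.

    Assumption: hash the key to pick the first replica, then take the
    next rf - 1 nodes clockwise around the ring.
    """
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    first = int(digest, 16) % node_count
    return [(first + i) % node_count + 1 for i in range(rf)]

def read(row_key, contacted_node):
    """Any node can be contacted: it works out who owns the key and acts as a proxy."""
    replicas = replica_nodes(row_key)
    if contacted_node in replicas:
        return f"node {contacted_node} answers for {row_key!r} from its own copy"
    # Simplification: always forward to the first replica.
    return f"node {contacted_node} forwards the request for {row_key!r} to node {replicas[0]}"

print(replica_nodes("15961"))
print(read("15961", contacted_node=3))
```

Because every node can compute the same replica list from the key alone, no master is needed to say where a piece of data lives.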

Consistency Level (CL): If there are 20 nodes in a Cassandra cluster and RF is 5, then data written to one node (say the parent node) is replicated to four other nodes. So we have 5 distinct nodes responsible for this piece of data (say, a row). Let us refer to these 5 distinct nodes as RF-Nodes, so RF=5 means there are 5 RF-Nodes, and so on. Also note that RF is best chosen as an odd number greater than or equal to 3 so that we can have a meaningful quorum; this is a rule of thumb, not something Cassandra enforces. More on quorum a little later.
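
The preference for an odd RF can be checked with a little arithmetic. The sketch below assumes the usual definition of a quorum as a strict majority, that is, floor(RF / 2) + 1; the function names are made up for this illustration.

```python
def quorum(rf):
    """Smallest number of replicas that forms a strict majority of rf."""
    return rf // 2 + 1

def tolerated_failures(rf):
    """How many replicas can be down while a quorum is still reachable."""
    return rf - quorum(rf)

for rf in (3, 4, 5):
    print(rf, quorum(rf), tolerated_failures(rf))
# 3 2 1
# 4 3 1
# 5 3 2
```

Going from RF=3 to RF=4 adds a replica but tolerates no additional failure, while RF=5 tolerates two, which is why odd values give better value for a quorum.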

To start with, note that the node on which the write operation takes place is the parent node; we assume this in this post for the sake of clarity. When CL=ALL, the write is considered a success on the parent node only when all the remaining 4 RF-Nodes acknowledge the write. When CL=ONE, at least one of the 5 RF-Nodes must acknowledge the write before it is considered successful. When CL=ANY, the write is considered successful as soon as any node has accepted it, even if no RF-Node has acknowledged it yet; in that case the accepting node keeps a hint and delivers the write later (hinted handoff). For CL=QUORUM, a majority of the RF-Nodes (3 out of 5 here) must acknowledge the write for it to be successful.
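
The consistency levels can be summarised as "how many RF-Nodes must acknowledge before the write counts as successful". The following Python sketch assumes RF=5 and counts acknowledgements from all five RF-Nodes, the parent node's own copy included. Treating ANY as needing zero replica acknowledgements is my simplification: it stands for the case where the write has only been accepted as a hint for later delivery. The function names are hypothetical.

```python
def required_acks(consistency_level, rf):
    """Number of replica acknowledgements needed for a write at the given level."""
    level = consistency_level.upper()
    if level == "ANY":
        return 0             # simplification: a stored hint is enough
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return rf // 2 + 1   # strict majority of the replicas
    if level == "ALL":
        return rf
    raise ValueError(f"unknown consistency level: {consistency_level}")

def write_succeeds(consistency_level, rf, replica_acks):
    """True if enough replicas acknowledged the write."""
    return replica_acks >= required_acks(consistency_level, rf)

# Suppose 2 of the 5 RF-Nodes are down, so only 3 can acknowledge.
print(write_succeeds("QUORUM", 5, 3))   # True
print(write_succeeds("ALL", 5, 3))      # False
print(write_succeeds("ONE", 5, 3))      # True
print(write_succeeds("ANY", 5, 0))      # True
```

This is the "tunable" part of tunable consistency: the same cluster can trade safety for speed on a per-request basis by choosing a different level.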

We can compare this concept of consistency level in Cassandra with the Oracle Data Guard modes of operation. Max-performance in Oracle Data Guard is like CL=ANY in Cassandra: the commit succeeds and LGWR does not wait for LNS to acknowledge that the given redo has been written to the standby database. Max-protection, i.e. sync mode, is like CL=ALL, because in this case LGWR has to wait for the LNS acknowledgement, just as a write on the parent node of a Cassandra cluster waits for write acknowledgements from all other RF-Nodes. Max-availability, which lies somewhere between max-performance and max-protection, is somewhat (not completely) similar to CL=QUORUM.

Conclusion: There are many more features and concepts in Cassandra, but it is not possible to capture everything in this post. The purpose here is simply to familiarize the reader with the basic concepts of Cassandra. The following features of Cassandra can be explored further after this basic refresher:

1. NoSQL database
2. open source
3. gossip architecture
4. tunable consistency
5. fault tolerant
6. decentralized
7. scalable
8. column-based storage engine
9. no joins

Note: I am a beginner in Cassandra and know very little, so if the reader finds any error, please let me know in the comments and I will try to rectify it.


I have one doubt. Suppose I have one master and two slave instances, and I insert bulk data into a table from Java code. If, midway through, I stop the Cassandra service on the master instance, the whole transaction stops and the data does not reach the slave instances.

What I need is for the data to be fully inserted on the slave instances even when the master is down.