CassandraDB Data Model Basis

Creating a data model in Cassandra is similar to traditional Databases, you can create tables, define schemas and query tables, but because the distruited nature of Cassandra, concept as foreing keys doesn't exist, and simple task like a join are not possible, for this reason denormalization is a key factor when you desing a Cassandra data model. Before star you will need install CassandraDB, at this post you can find how to do it

Primary Keys

As any traditional Database, a CassandraDb Primary Key helps to index data in order to query it quickly, but keep in mind that CassandraDb is a distribuitedDB, so your data will be randomly distribuited across the nodes. Good datatypes candidates as primary key are UUID and TIMEUUID data types. As the dabase model designer is your responsability choose a primaray key that ensures a correct distribution of records accross the nodes, because Cassandra Won't do it for you. If you choose a Primary as an Secuencial Integer similar to SQL Secuence, is very possible that almost all the records will be stored at one node, this will cause an overload of this node, and an incorrect Load Balancing process.

A Product/Order basic model

    CREATE TABLE product  (
      id uuid PRIMARY KEY,
      bar_code text,
      name text,
      price double
     );


    CREATE TABLE purchase_order  (
      id uuid PRIMARY KEY,
      product_id uuid,
      product_name text,
      quantity int
    );

Let's try to insert some products:

    INSERT INTO product(id, bar_code, name, price) VALUES(bdf3e070-4f75-11e1-a5c5-00215a17aed0,
                                                          '4F5F4', 'TABLET', 300);

    INSERT INTO product(id, bar_code, name, price) VALUES(0ba10600-6712-11e1-b653-00215a17aed0,
                                                         '4Fu84', 'TV', 800);

    INSERT INTO product(id, bar_code, name, price) VALUES(14bc9100-6712-11e1-990c-00215a17aed0,
                                                         '4R5F', 'PC', 400);

    INSERT INTO product(id, bar_code, name, price) VALUES(1a16bc50-6714-11e1-8366-00215a17aed0,
                                                         '8D4F', 'WATCH', 250);

Now we can create an order:

    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(35643ec0-1690-11e2-947c-00215a17aed0, bdf3e070-4f75-11e1-a5c5-00215a17aed0, 'TABLET', 3);

    INSERT INTO purchase_order(id, product_id, product_name, quantity) 
        VALUES(35643ec0-1690-11e2-947c-00215a17aed0, 0ba10600-6712-11e1-b653-00215a17aed0, 'TV', 8);

    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(35643ec0-1690-11e2-947c-00215a17aed0, 14bc9100-6712-11e1-990c-00215a17aed0, 'PC', 4);

    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(35643ec0-1690-11e2-947c-00215a17aed0, 1a16bc50-6714-11e1-8366-00215a17aed0, 'WATCH', 2);

After 4 inserts we expect see 4 records at purchase_order table, one by every product, actually this assumption is wrong as you will see:

    cqlsh:store> select * from purchase_order ;

     id                                   | product_id                           | product_name | quantity
    --------------------------------------+--------------------------------------+--------------+----------
     35643ec0-1690-11e2-947c-00215a17aed0 | 1a16bc50-6714-11e1-8366-00215a17aed0 |        WATCH |        2

Now let's try to query all purchase orders that have some product:

    cqlsh:store> select * from purchase_order where product_id=08d181c0-dc83-11e1-9c22-00215a17aed0;
    Bad Request: No indexed columns present in by-columns clause with Equal operator

This suggests that the purchase_order design is wrong, if we need to store differents products on a same purchase order we need to create a table with a COMPOUND PRIMARY KEY:

COMPOUND PRIMARY KEY

Similar to traditional Databases, a compound primary key is an index compounding for two or more values, the first value is known as Partion key and define where inside the cluster the record will be stored, The remaining column or columns of the PRIMARY KEY, are clustered columns, an CassadraDb uses this column - columns to create an ordered index this grants a very efficient retrieval of rows, let's drop purchase_order table, and create it again with the compound primary key:

    DROP TABLE purchase_order;

    CREATE TABLE purchase_order  (
      id uuid,
      product_id uuid,
      product_name text,
      quantity int,
      PRIMARY KEY(id, product_id)
    );

    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(35643ec0-1690-11e2-947c-00215a17aed0, bdf3e070-4f75-11e1-a5c5-00215a17aed0, 'TABLET', 3);

    INSERT INTO purchase_order(id, product_id, product_name, quantity) 
        VALUES(35643ec0-1690-11e2-947c-00215a17aed0, 0ba10600-6712-11e1-b653-00215a17aed0, 'TV', 8);

    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(35643ec0-1690-11e2-947c-00215a17aed0, 14bc9100-6712-11e1-990c-00215a17aed0, 'PC', 4);

    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(35643ec0-1690-11e2-947c-00215a17aed0, 1a16bc50-6714-11e1-8366-00215a17aed0, 'WATCH', 2);

Now we can query purchase_order table again and see what happends:

    cqlsh:store> select * from purchase_order;

     id                                   | product_id                           | product_name | quantity
    --------------------------------------+--------------------------------------+--------------+----------
     35643ec0-1690-11e2-947c-00215a17aed0 | bdf3e070-4f75-11e1-a5c5-00215a17aed0 |       TABLET |        3
     35643ec0-1690-11e2-947c-00215a17aed0 | 0ba10600-6712-11e1-b653-00215a17aed0 |           TV |        8
     35643ec0-1690-11e2-947c-00215a17aed0 | 14bc9100-6712-11e1-990c-00215a17aed0 |           PC |        4
     35643ec0-1690-11e2-947c-00215a17aed0 | 1a16bc50-6714-11e1-8366-00215a17aed0 |        WATCH |        2

Let's create another purchase order:

    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(5e15135e-4b72-11e1-b84d-00215a17aed0, 1a16bc50-6714-11e1-8366-00215a17aed0, 'WATCH', 8);

    INSERT INTO purchase_order(id, product_id, product_name, quantity) 
        VALUES(5e15135e-4b72-11e1-b84d-00215a17aed0, 0ba10600-6712-11e1-b653-00215a17aed0, 'TV', 1);


    cqlsh:store> select * from purchase_order;

     id                                   | product_id                           | product_name | quantity
    --------------------------------------+--------------------------------------+--------------+----------
     35643ec0-1690-11e2-947c-00215a17aed0 | bdf3e070-4f75-11e1-a5c5-00215a17aed0 |       TABLET |        3
     35643ec0-1690-11e2-947c-00215a17aed0 | 0ba10600-6712-11e1-b653-00215a17aed0 |           TV |        8
     35643ec0-1690-11e2-947c-00215a17aed0 | 14bc9100-6712-11e1-990c-00215a17aed0 |           PC |        4
     35643ec0-1690-11e2-947c-00215a17aed0 | 1a16bc50-6714-11e1-8366-00215a17aed0 |        WATCH |        2
     5e15135e-4b72-11e1-b84d-00215a17aed0 | 0ba10600-6712-11e1-b653-00215a17aed0 |           TV |        1
     5e15135e-4b72-11e1-b84d-00215a17aed0 | 1a16bc50-6714-11e1-8366-00215a17aed0 |        WATCH |        8

Let's filter by order id:

    cqlsh:store> SELECT * from purchase_order where id=5e15135e-4b72-11e1-b84d-00215a17aed0;

     id                                   | product_id                           | product_name | quantity
    --------------------------------------+--------------------------------------+--------------+----------
     5e15135e-4b72-11e1-b84d-00215a17aed0 | 0ba10600-6712-11e1-b653-00215a17aed0 |           TV |        1
     5e15135e-4b72-11e1-b84d-00215a17aed0 | 1a16bc50-6714-11e1-8366-00215a17aed0 |        WATCH |        8

Indexing

Let's filter by order product_id:

    cqlsh:store> SELECT * FROM purchase_order WHERE product_id=0ba10600-6712-11e1-b653-00215a17aed0;

    Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

In the query above, we tried to get purchase_order by product_id, but the PARTITION KEY is the order_id, so we can have several orders with one or more products that match the WHERE clause distribuited across the nodes, the operation required to get this information requeries read all prochase_order across all the nodes, check if match with the criteria, merge the results, CassandraDb can not handle this because the performance would be inaceptable. A solution that can be implemented in this case is create another table that store information by purchase_order_id , product_id and create two indexes:

    CREATE TABLE product_purchase_order  (
      id uuid PRIMARY KEY,
      purchase_id uuid,
      product_id uuid,
      product_name text,
      quantity int
    );

    CREATE INDEX ON product_purchase_order(purchase_id);
    CREATE INDEX ON product_purchase_order(product_id);


    INSERT INTO product_purchase_order(id, purchase_id, product_id, product_name, quantity)
        VALUES(f5b87f80-913c-11e1-8995-00215a17aed0 ,35643ec0-1690-11e2-947c-00215a17aed0, 
               bdf3e070-4f75-11e1-a5c5-00215a17aed0, 'TABLET', 3);

    INSERT INTO product_purchase_order(id, purchase_id, product_id, product_name, quantity)
        VALUES(54e793f0-a11a-11e1-bba9-00215a17aed0, 35643ec0-1690-11e2-947c-00215a17aed0,
               0ba10600-6712-11e1-b653-00215a17aed0, 'TV', 8);

    INSERT INTO product_purchase_order(id, purchase_id, product_id, product_name, quantity)
        VALUES(9ba4eda2-38a6-11e2-b72b-00215a17aed0, 35643ec0-1690-11e2-947c-00215a17aed0, 
               14bc9100-6712-11e1-990c-00215a17aed0, 'PC', 4);

    INSERT INTO product_purchase_order(id, purchase_id, product_id, product_name, quantity)
        VALUES(4b01bf90-7853-11e1-85e9-00215a17aed0, 35643ec0-1690-11e2-947c-00215a17aed0, 
               1a16bc50-6714-11e1-8366-00215a17aed0, 'WATCH', 2);

    INSERT INTO product_purchase_order(id, purchase_id, product_id, product_name, quantity)
        VALUES(d33dc000-1d1e-11e2-943c-00215a17aed0, 5e15135e-4b72-11e1-b84d-00215a17aed0,
               1a16bc50-6714-11e1-8366-00215a17aed0, 'WATCH', 8);

    INSERT INTO product_purchase_order(id, purchase_id, product_id, product_name, quantity) 
        VALUES(862fdbd0-26ac-11e2-95ee-00215a17aed0, 5e15135e-4b72-11e1-b84d-00215a17aed0, 0ba10600-6712-11e1-b653-00215a17aed0, 'TV', 1);

    

Now we can filter all the orders that have some product

    cqlsh:store> SELECT *  FROM product_purchase_order WHERE product_id=0ba10600-6712-11e1-b653-00215a17aed0;

     id                                   | product_id                           | product_name | purchase_id                          | quantity
    --------------------------------------+--------------------------------------+--------------+--------------------------------------+----------
     54e793f0-a11a-11e1-bba9-00215a17aed0 | 0ba10600-6712-11e1-b653-00215a17aed0 |           TV | 35643ec0-1690-11e2-947c-00215a17aed0 |        8
     862fdbd0-26ac-11e2-95ee-00215a17aed0 | 0ba10600-6712-11e1-b653-00215a17aed0 |           TV | 5e15135e-4b72-11e1-b84d-00215a17aed0 |        1

Behind the scenes, every Index is a new hidden table.

Some personal conclusions

  • The information contained at this post is not enough, CassandraDB is too Big and Complex.
  • Is easy insert data but difficult query data on CassandraDb.
  • Think very carefully about the queries that you will need when you are desinging your database.
  • Because joins and relational operations are not allowed on Cassandra, don't be afraid to duplicate your data accoss models
  • If choose a primary/partition key datataype different from uuid or timeuuid, make sure that the distribution of your data across nodes will be uniform before put your code on production

Resources