Counting, Indexing and Ordering in Cassandra

Traditional databases are very good at performing common tasks like ordering, filtering, and counting. These tasks are not always easy in Cassandra, because records of the same table are distributed across different nodes and the performance can be unacceptable. This post tries to show some examples of when it is possible to perform these tasks.

Counting

Let's create a product table and a purchase_order table (the latter will be used further down), insert a few products, and then count how many records are stored.

    CREATE TABLE product (
      id uuid PRIMARY KEY,
      category text,
      bar_code text,
      name text,
      price double
    );

    CREATE TABLE purchase_order (
      id uuid,
      product_id uuid,
      product_name text,
      quantity int,
      PRIMARY KEY(id, product_id)
    );

    INSERT INTO product(id, category, bar_code, name, price) 
        VALUES(bdf3e070-4f75-11e1-a5c5-00215a17aed0, 'mobile', '4F5F4', 'TABLET ORANGE', 300);

    INSERT INTO product(id, category, bar_code, name, price) 
        VALUES(d86cb040-0e02-11e0-9677-361934cac9ba, 'mobile', '4F5F4', 'PHONE BANANA', 100);

    INSERT INTO product(id, category, bar_code, name, price) 
        VALUES(4066b980-2f09-11e0-8551-361934cac9ba, 'tv', '4F5F4', 'SMART TV', 100);

    cqlsh:store> SELECT COUNT(*) FROM product;

     count
    -------
         3

Now let's see what happens after 20,000 inserts:

    cqlsh:store> SELECT COUNT(*) FROM product;

     count
    -------
     10000

    Default LIMIT of 10000 was used. Specify your own LIMIT clause to get more results.


    cqlsh:store> SELECT COUNT(*) FROM product limit 5000;

     count
    -------
     5000


    cqlsh:store> SELECT COUNT(*) FROM product limit 80000;

     count
    -------
     20000

Cassandra counts record by record across the nodes, applying a default limit, so the result may not be accurate: if the limit you set is not big enough, the count will be wrong. Now let's try to count after 6 million inserts:

    cqlsh:store> SELECT COUNT(*) FROM product limit 6000000;
    Request did not complete within rpc_timeout.

Counter Table

A counter table is a special kind of table that contains a primary key, which can be of any Cassandra data type, and one or more columns of type counter (every non-key column of a counter table must be a counter):

    CREATE TABLE counter_store (
        object text PRIMARY KEY,
        count counter
    );

Inserting Data and Incrementing a Counter

Only UPDATE operations are allowed on counter tables; the example shows how to increment a counter:


    UPDATE counter_store set count= count + 1 where object='product';

    cqlsh:store> select * from counter_store ;

     object  | count
    ---------+-------
     product |     1

    UPDATE counter_store set count = count + 3 where object='product';

    cqlsh:store> select * from counter_store ;

     object  | count
    ---------+-------
     product |     4


    UPDATE counter_store set count= count + 1 where object='purchase_order';


    cqlsh:store> select * from counter_store ;

     object         | count
    ----------------+-------
            product |     4
     purchase_order |     1
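
A counter can also be decremented, and the usual pattern is to bump it every time the application writes to the table it tracks (an INSERT against the counter table itself is rejected; only UPDATE can modify counter columns). The statements below are only a sketch of how an application could keep the product counter in sync; the uuid and product values are made up for illustration:

    -- inserting a new product: also increment the counter
    INSERT INTO product(id, category, bar_code, name, price)
        VALUES(7f4a2b10-3c5d-11e1-b86c-0800200c9a66, 'tv', '9A8B7', 'PLASMA TV', 500);
    UPDATE counter_store SET count = count + 1 WHERE object='product';

    -- deleting a product: decrement the counter
    DELETE FROM product WHERE id=7f4a2b10-3c5d-11e1-b86c-0800200c9a66;
    UPDATE counter_store SET count = count - 1 WHERE object='product';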

Indexing

Let's try to get all products for the mobile category:

    cqlsh:store> SELECT * FROM product WHERE category='mobile';
    Bad Request: No indexed columns present in by-columns clause with Equal operator

This means that, by default, it is not possible to query by a non-primary-key column. This can be solved by creating a secondary index on the fields to be filtered. Every secondary index creates a hidden table that handles the index in order to improve performance:

    cqlsh:store> CREATE INDEX product_category_idx ON product(category);

    cqlsh:store> SELECT * FROM product WHERE category='mobile';

     id                                   | bar_code | category | name          | price
    --------------------------------------+----------+----------+---------------+-------
     d86cb040-0e02-11e0-9677-361934cac9ba |    4F5F4 |   mobile |  PHONE BANANA |   100
     bdf3e070-4f75-11e1-a5c5-00215a17aed0 |    4F5F4 |   mobile | TABLET ORANGE |   300

When it is OK to use a Cassandra Index

Indexes are good for querying tables that have lots of records potentially matching the query criteria, e.g. when there are many products in a certain category. But if there are many categories with only a few products each, the index will not be effective.
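
As a rough sketch of the contrast (this treats bar_code as if it were unique per product, which the sample data above does not reflect):

    -- product_category_idx created above is a good fit:
    -- a few category values, each shared by many products.

    -- bar_code would be a poor fit: an (almost) unique value per product
    -- means the index lookup touches many nodes just to return one row.
    CREATE INDEX product_bar_code_idx ON product(bar_code);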

Ordering

Let's try to order products by price:

    cqlsh:store> SELECT * FROM product order by price;
    Bad Request: ORDER BY is only supported when the partition key is restricted by an EQ or an IN.

This error message means that ORDER BY is only possible when we restrict the query with a WHERE clause using an EQ or IN condition on the partition key. This makes no sense on the product table, because the id is unique and there is exactly one product per id. Maybe it is better to keep this table inside a relational database? Let's create some purchase orders and see if we can order that table:


    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(5e15135e-4b72-11e1-b84d-00215a17aed0, bdf3e070-4f75-11e1-a5c5-00215a17aed0, 'TABLET ORANGE', 3);

    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(5e15135e-4b72-11e1-b84d-00215a17aed0, d86cb040-0e02-11e0-9677-361934cac9ba, 'PHONE BANANA', 8);

    INSERT INTO purchase_order(id, product_id, product_name, quantity) 
        VALUES(5e15135e-4b72-11e1-b84d-00215a17aed0, 4066b980-2f09-11e0-8551-361934cac9ba, 'SMART TV', 1);


    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(c0a874f0-392a-11e0-a0df-361934cac9ba, d86cb040-0e02-11e0-9677-361934cac9ba, 'PHONE BANANA', 2);

    INSERT INTO purchase_order(id, product_id, product_name, quantity) 
        VALUES(c0a874f0-392a-11e0-a0df-361934cac9ba, 4066b980-2f09-11e0-8551-361934cac9ba, 'SMART TV', 3);


    cqlsh:store> select * from purchase_order ;

     id                                   | product_id                           | product_name  | quantity
    --------------------------------------+--------------------------------------+---------------+----------
     5e15135e-4b72-11e1-b84d-00215a17aed0 | d86cb040-0e02-11e0-9677-361934cac9ba |  PHONE BANANA |        8
     5e15135e-4b72-11e1-b84d-00215a17aed0 | 4066b980-2f09-11e0-8551-361934cac9ba |      SMART TV |        1
     5e15135e-4b72-11e1-b84d-00215a17aed0 | bdf3e070-4f75-11e1-a5c5-00215a17aed0 | TABLET ORANGE |        3
     c0a874f0-392a-11e0-a0df-361934cac9ba | d86cb040-0e02-11e0-9677-361934cac9ba |  PHONE BANANA |        2
     c0a874f0-392a-11e0-a0df-361934cac9ba | 4066b980-2f09-11e0-8551-361934cac9ba |      SMART TV |        3


    cqlsh:store> select * from purchase_order where id=c0a874f0-392a-11e0-a0df-361934cac9ba;

     id                                   | product_id                           | product_name | quantity
    --------------------------------------+--------------------------------------+--------------+----------
     c0a874f0-392a-11e0-a0df-361934cac9ba | d86cb040-0e02-11e0-9677-361934cac9ba | PHONE BANANA |        2
     c0a874f0-392a-11e0-a0df-361934cac9ba | 4066b980-2f09-11e0-8551-361934cac9ba |     SMART TV |        3


Now let's try to order purchase orders by quantity:

    cqlsh:store> select * from purchase_order where id=c0a874f0-392a-11e0-a0df-361934cac9ba order by quantity;
    Bad Request: Order by is currently only supported on the clustered columns of the PRIMARY KEY, got quantity

It is only possible to order records by first applying a WHERE clause, and the ordering column must be part of a compound primary key (a clustering column):

    DROP TABLE purchase_order;

    CREATE TABLE purchase_order  (
      id uuid ,
      product_id uuid,
      product_name text,
      quantity int,
      PRIMARY KEY(id, quantity, product_id)
    );



    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(5e15135e-4b72-11e1-b84d-00215a17aed0, bdf3e070-4f75-11e1-a5c5-00215a17aed0, 'TABLET ORANGE', 3);

    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(5e15135e-4b72-11e1-b84d-00215a17aed0, d86cb040-0e02-11e0-9677-361934cac9ba, 'PHONE BANANA', 8);

    INSERT INTO purchase_order(id, product_id, product_name, quantity) 
        VALUES(5e15135e-4b72-11e1-b84d-00215a17aed0, 4066b980-2f09-11e0-8551-361934cac9ba, 'SMART TV', 1);


    INSERT INTO purchase_order(id, product_id, product_name, quantity)
        VALUES(c0a874f0-392a-11e0-a0df-361934cac9ba, d86cb040-0e02-11e0-9677-361934cac9ba, 'PHONE BANANA', 2);

    INSERT INTO purchase_order(id, product_id, product_name, quantity) 
        VALUES(c0a874f0-392a-11e0-a0df-361934cac9ba, 4066b980-2f09-11e0-8551-361934cac9ba, 'SMART TV', 3);




    cqlsh:store> select * from purchase_order where id=c0a874f0-392a-11e0-a0df-361934cac9ba order by quantity;

     id                                   | quantity | product_id                           | product_name
    --------------------------------------+----------+--------------------------------------+--------------
     c0a874f0-392a-11e0-a0df-361934cac9ba |        2 | d86cb040-0e02-11e0-9677-361934cac9ba | PHONE BANANA
     c0a874f0-392a-11e0-a0df-361934cac9ba |        3 | 4066b980-2f09-11e0-8551-361934cac9ba |     SMART TV
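
Since quantity is now a clustering column, the sort direction can also be reversed at query time, or fixed when the table is created. A sketch of both options (purchase_order_desc is a hypothetical table name, not part of the original example):

    -- reverse the order at query time
    SELECT * FROM purchase_order
        WHERE id=c0a874f0-392a-11e0-a0df-361934cac9ba ORDER BY quantity DESC;

    -- or store rows already sorted by descending quantity
    CREATE TABLE purchase_order_desc (
      id uuid,
      product_id uuid,
      product_name text,
      quantity int,
      PRIMARY KEY(id, quantity, product_id)
    ) WITH CLUSTERING ORDER BY (quantity DESC, product_id ASC);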

Some personal conclusions

  • Some common tasks like ordering and filtering are easy for relational databases but difficult or even impossible in Cassandra.
  • When designing a Cassandra data model, think very carefully about how the data will be retrieved rather than how it will be inserted.
  • Test your models: insert millions of records in a test environment and make sure that you can retrieve the data.
  • Maintain a counter column for every table whose record count you need to know.
  • Realize that a COUNT operation is slow and could be wrong, so design your model so that data is inserted already ordered.