Data Modeling Basics: Materialized Views
Data modeling basics: Materialized views
Akbar S. Ahmed | Jan 6, 2015

Data modeling in CQL

Data modeling is one of the most important parts of designing a scalable database in Cassandra. With a proper data model, your Cassandra cluster will deliver some excellent performance characteristics including extremely fast, linearly scalable writes and reads.

One important part of designing a data model is for you to have a good conceptual understanding of Cassandra. The purpose of this post is to clearly define one the most confusing aspects of modeling data in Cassandra for those of us who come from a relational world.

When learning a new database, such as Cassandra, my first instinct is to recreate a schema that I’ve already implemented in an RDBMS. As part of the process I start with some basic tables and a simple query with a join. However, this process runs into problems when I try to apply the logical and conceptual model of a relational database on a non-relational database, such as Cassandra.

This post assumes you understand Cassandra and CQL basics. If not, we’ll cover these topics in future posts.

Tables and rows

Let’s begin by defining some terminology and improving our conceptual understanding of Cassandra.

In CQL, Cassandra’s SQL-like query language, we can define a table. While the CQL table and an SQL table share a common term, they are conceptually different. More specifically, a CQL table, unlike a relational table, contains partitions. Each partition in turn contains rows.

A CQL table contains partitions. Each partition contains rows.

When querying a CQL table, you will access a table and (typically) one partition within it. Once you query the table plus partition, you then gain access to rows that resemble SQL rows.

In summary, a CQL table plus a partition gives you a data structure that somewhat resembles an SQL table with columns and rows.

Materialized views

Wikipedia defines a materialized view as a database object that contains the results of a query, such as a join result. In an RDBMS, materialized views are used to speed up queries by avoiding joins. Simply stated, a materialized view is like a table that contains the result of a complex query.

Why are we discussing materialized views? A CQL table (plus partition) is more like an SQL materialized view than like an SQL table.

A CQL table is more like an SQL materialized view than like an SQL table.

Once we start to think of CQL tables as materialized views and stop thinking of them as SQL tables, then data modeling becomes a lot easier. Put another way, a shortcut to data modeling in Cassandra is to think about how you would model data in an RDBMS if you were only allowed to use fully materialized views with no ordinary tables or joins.

An example

Let’s walk through a basic example in Cassandra where we have a customers table and a purchases table.

Customers entity

The customers entity contains basic customer information (in this example only the customer id and name).

Column Type Description
id uuid The customer id
name text The customer's name

Purchases entity and relationship

The purchases entity tracks all purchase transactions and associates each purchase with a customer.

Column Type Description
id uuid The purchase id
product text The purchased product name
date timestamp The date/time of the purchase
customer_id uuid The customer's id so that we can associated a purchase with a customer

SQL solution (tables and joins)

In a relational database we would create two tables. Our queries would use joins to select data. For now, we’ll limit this example to two queries.

Select all purchases and order by date in descending order.

SELECT p.id, p.product, p.date, c.name
FROM purchases AS p
INNER JOIN customers AS c
    ON c.customer_id = p.id
ORDER BY p.date DESC;

Select all purchases by customer with id = ‘a1b2c3’;

SELECT p.id, p.product, p.date, c.name
FROM purchases AS p
INNER JOIN customers AS c
    ON p.customer_id = c.id
WHERE p.customer_id = 'a1b2c3'
ORDER BY p.date DESC;

SQL solution (materialized views)

Let’s now change our SQL data model to use fully materialized view (i.e. no joins). With materialized views in SQL we gain faster read access at the expense of some data duplication and slower writes. This tradeoff may make sense in a system with a high volume of reads.

We’ll only update the purchases table since it is the only one that needs to be changed.

purchases table

Column Type Description
id uuid The purchase id
product text The purchased product name
date timestamp The date/time of the purchase
customer_id uuid The customer's id
customer_name text The customer's name

Next, we can update our SELECT queries to eliminate the JOIN clause. Notice how the queries have been simplified. In a real schema we would add some indexes to the purchases table to improve query performance.

SELECT id, product, date, customer_name
FROM purchases
ORDER BY date DESC;

Select all purchases by customer with id = ‘a1b2c3’;

SELECT id, product, date, customer_name
FROM purchases
WHERE customer_id = 'a1b2c3'
ORDER BY date DESC;

CQL solution

We’re now ready to evaluate the CQL solution. Remember, think of the CQL table as an SQL materialized view, not an SQL table.

Before we show the CQL data model, there are a few key points about Cassandra that will affect how we model our data.

  1. Writes in Cassandra are extremely fast and are linearly scalable. Unlike an RDBMS, writes are not a bottleneck we have to worry about.
  2. Cassandra stores data in sorted order on disk which gives us extremely fast range scans within a partition.

Putting 1 and 2 together, we’re going to take the materialized view to the extreme by creating one CQL table for each of the SELECT queries. Why is this not insane? Unlike an RDBMS, writes will not be our bottleneck in Cassandra. Further, disk is cheap and I/O is expensive. So we’re going to take advantage of the fact that Cassandra allows us to increase disk usage in exchange for lower I/O which is a cost effective way to scale a database.

Select all purchases and order by date in descending order.

We have added one new column year_month which we set to the partition key. Remember that a CQL table contains partitions, partitions contain rows, and that we typically query a single partition at a time. By using year_month as the partition key we can easily query all purchases for a year and month, such as 2015-01 for January 2015.

If we have other queries that select all purchases per quarter, half year, or year, then we’d create separate CQL tables for each query.

CREATE TABLE purchases_by_year_month (
    id uuid,
    product text,
    date timestamp,
    year_month text,
    customer_id uuid,
    customer_name text,
    PRIMARY KEY(year_month, date)
) WITH CLUSTERING ORDER BY (date DESC);

The partition key year_month is the first position in the PRIMARY KEY.

SELECT id, product, date, customer_name
FROM purchases_by_year_month
WHERE year_month = "2015-01";

Select all purchases by customer with id = ‘a1b2c3’;

CREATE TABLE purchases_by_customer (
    id uuid,
    product text,
    date timestamp,
    customer_id uuid,
    customer_name text,
    PRIMARY KEY(customer_id, date)
) WITH CLUSTERING ORDER BY (date DESC);
SELECT id, product, date, customer_name
FROM purchases_by_customer
WHERE customer_id = 'a1b2c3';

As mentioned above, Cassandra stores the data on disk in sorted order per the clustering column. As a result, we can perform efficient range scans within a partition on the date.

Let’s now query all purchases by a customer in December 2014.

SELECT id, product, date, customer_name
FROM purchases_by_customer
WHERE customer_id = 'a1b2c3'
    AND date >= '2014-12-01'
    AND date <= '2014-12-31';

Summary

Data modeling is one of the most important factors in creating a scalable, high performance Cassandra cluster. We need to understand Cassandra’s unique characteristics to model data in Cassandra.

A CQL table is different from an SQL table. Unlike an SQL table, a CQL table contains partitions and each partition contains rows. A CQL table plus partition is more similar to an SQL materialized view than to an SQL table.

Write performance is rarely a constraint in Cassandra. In terms of costs, disk is cheap and I/O is expensive. Cassandra allows us to increase disk usage in exchange for I/O.

To achieve fast performance we create multiple CQL tables that contain the complete result set for a query.

With a proper data model we ensure that our Cassandra cluster is both high-performance and linearly scalable.




Subscribe to our newsletter

Contact Information

ABOUT EXPONENTIAL.IO

We specialize in helping professional developers, like you, expand your skill set. Our courses are focused on enabling you to learn everything necessary to use a new technology in a live, production application.

LOCATION

All courses are made with love in
Palo Alto, CA.

Subscribe to our newsletter