Do we need standardized terminology?
If you’ve spent any time watching Cassandra videos on YouTube or reading blog posts about Cassandra, you’ll notice that the same term is often used to mean different things. For example, the term “row” is used to refer both to a partition and to the CQL rows within a partition.
In this post, I propose that we (the Cassandra community) start to standardize terminology around CQL and try to limit terms to a single abstraction layer when possible. When a term must span multiple abstraction layers, say from CQL to physical, then it should refer to the same thing.
Please bear in mind I’m not trying to make this an “I’m right” kind of article. The goal is to improve Cassandra terminology by making it less ambiguous so that we all use the same terms to refer to the same things and to start a discussion on what terminology should be used to discuss Cassandra.
So what’s the problem? Let’s pick one simple term, “row”. Is the following a correct or incorrect statement?
Range scans across rows are efficient in Cassandra.
I say it’s true, you say it’s false, but who’s right? Actually, we both are, because the community currently uses “row” to refer either to a partition or to a CQL row (i.e. an abstract row within a partition).
If “row” refers to a partition, then range scans are not efficient. However, if we are using “row” to refer to a CQL row, then range scans are efficient.
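To make the two readings concrete, here is a minimal Python sketch of a toy model, not Cassandra's actual implementation. It assumes a table is a map from partition key to a list of rows kept sorted by a single clustering value, and that partitions are placed by a hash of the key rather than in key order. The names `TABLE`, `scan_rows_in_partition`, and `scan_partitions` are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Toy model (assumption): partition key -> rows sorted by clustering value.
# Each row is (clustering_value, reading). A plain dict stands in for the
# hash-based, unordered placement of partitions.
TABLE = {
    "sensor-b": [(1, 20.5), (2, 21.0), (3, 21.4), (4, 22.1)],
    "sensor-a": [(1, 18.0), (2, 18.2)],
    "sensor-c": [(1, 30.0)],
}


def scan_rows_in_partition(partition_key, low, high):
    """Range scan over CQL rows in one partition: low <= clustering <= high."""
    rows = TABLE[partition_key]
    # Rows are already sorted, so binary search finds both ends of the range.
    start = bisect_left(rows, (low,))
    end = bisect_right(rows, (high, float("inf")))
    return rows[start:end]


def scan_partitions(low_key, high_key):
    """Range scan over partition keys: every partition has to be inspected."""
    inspected = 0
    matches = []
    for key in TABLE:
        inspected += 1
        if low_key <= key <= high_key:
            matches.append(key)
    return sorted(matches), inspected


rows_in_range = scan_rows_in_partition("sensor-b", 2, 3)
keys_in_range, partitions_inspected = scan_partitions("sensor-a", "sensor-b")
```

In this model the row scan touches only the slice it returns, while the partition scan inspects all three partitions to find two matches. That is the difference hiding behind the single word “row”.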
The cause of this mixed use is well known. Jay Patel (Architect at eBay) has a slide that explains the problem well: “The more I read, the more I’m confused.” Please refer to slide 7 of Cassandra data modeling best practices.
In the case of Cassandra, the mixed terminology arose during the transition from Thrift to CQL. However, CQL is clearly Cassandra’s query language moving forward, so it’s time to upgrade our wording to simplify communication and help make Cassandra easier for newcomers to learn.
This post focuses on the two layers with the most confusing terminology: the CQL layer and the physical (storage) layer.
CQL terminology should be focused on explaining the CQL abstraction layer. Some terms, such as “partition key” are used in both the CQL layer and the storage layer. However, a consistent use of the term to refer to the same thing in both layers will help avoid ambiguity.
|Keyspace||A logical grouping of tables that all have the same replication factor and replication strategy.|
|Table||A CQL table is a logical grouping of partitions that share the same schema.|
|Partition||A partition is a collection of sorted CQL rows. A partition is the unit of data access as most queries access a single partition.|
|Row||A CQL row is an abstract data structure that is a collection of sorted columns. When a partition has multiple rows, the rows are physically sorted on disk per the clustering column(s).|
|Column name||A column name is how each column is identified. A column name may be used in the primary key as either the partition key or as a clustering column.|
|Column value||The value stored in a column.|
|Primary key||The CQL primary key is a composite key that defines the partition key and, optionally, one or more clustering columns. The partition key itself may be defined as a single key or composite key.|
|Partition key||The partition key is the first component of the primary key and must be unique within a CQL table. The partition key may be defined as a composite key if it is surrounded by parentheses and supplied as a comma-separated list of column names.|
|Clustering column||A clustering column is used to allow a partition to have multiple rows where each row is sorted per the clustering column(s).|
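The relationship between these CQL terms can be sketched in Python. This is an illustrative model only: it assumes a hypothetical table whose primary key would be written as `PRIMARY KEY ((country, city), year, month)`, and the column names and the helper `build_partitions` are invented for the example.

```python
from collections import defaultdict

# Hypothetical primary key: PRIMARY KEY ((country, city), year, month)
PARTITION_KEY = ("country", "city")     # composite partition key
CLUSTERING_COLUMNS = ("year", "month")  # sort order of rows in a partition


def build_partitions(rows):
    """Group rows by partition key, then sort each partition's rows
    by the clustering columns."""
    partitions = defaultdict(list)
    for row in rows:
        pk = tuple(row[col] for col in PARTITION_KEY)
        partitions[pk].append(row)
    for pk_rows in partitions.values():
        pk_rows.sort(key=lambda r: tuple(r[col] for col in CLUSTERING_COLUMNS))
    return dict(partitions)


rows = [
    {"country": "US", "city": "Austin", "year": 2014, "month": 3, "rainfall_mm": 50},
    {"country": "US", "city": "Austin", "year": 2013, "month": 11, "rainfall_mm": 72},
    {"country": "FR", "city": "Paris", "year": 2014, "month": 1, "rainfall_mm": 48},
    {"country": "US", "city": "Austin", "year": 2014, "month": 1, "rainfall_mm": 40},
]
partitions = build_partitions(rows)
austin_order = [(r["year"], r["month"]) for r in partitions[("US", "Austin")]]
```

Here the table holds two partitions, one per distinct `(country, city)` pair. The three Austin rows live in the same partition and come back ordered by `year` and then `month`, which is what the clustering columns are for.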
The physical layer is how Cassandra actually stores data on disk. Understanding the physical layer is an important part of performance tuning and data modeling in Cassandra.
|Partition||A partition is a physical unit of data that consists of a collection of sorted cells and is identified by a partition key.|
|Partition key||A hashed value that provides fast access to a partition and that must be unique within a table.|
|Cell||A cell is the smallest unit of storage. At a minimum, a cell consists of a cell name, a cell value and a timestamp. The cell value may be empty.|
|Cell name||A cell name is either:
|Cell value||The cell value can contain a value or be left empty.|
|Timestamp||The timestamp is a cell property that is used internally by Cassandra to keep track of when a cell was inserted or updated. It is also used by Cassandra's merge process during a read request.|
|TTL||TTL, or time to live, is a cell property that you can set to define the date/time after which a cell will be automatically deleted by Cassandra.|
|Tombstone||The tombstone is a cell property that represents a deletion marker. Cassandra will remove cells that contain a tombstone from the partition that is returned to a client during a read.|
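The cell properties above can be pictured with a small Python sketch. It is a deliberately simplified model, not Cassandra's storage format or read path: the `Cell` class, the `expires_at` field (standing in for an expiry time derived from the TTL), and `merge_for_read` are assumptions made for illustration, and tie-breaking between equal timestamps is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    name: str
    value: Optional[bytes]            # the cell value may be empty
    timestamp: int                    # when the cell was written
    expires_at: Optional[int] = None  # assumed: derived from the TTL
    tombstone: bool = False           # deletion marker


def merge_for_read(cell_versions, now):
    """Keep the newest version of each cell by timestamp, then drop
    cells that are tombstoned or past their expiry time."""
    newest = {}
    for cell in cell_versions:
        current = newest.get(cell.name)
        if current is None or cell.timestamp > current.timestamp:
            newest[cell.name] = cell
    return {
        name: cell.value
        for name, cell in newest.items()
        if not cell.tombstone
        and (cell.expires_at is None or now < cell.expires_at)
    }


versions = [
    Cell("email", b"a@x", 10),
    Cell("email", b"b@x", 20),                 # newer write wins
    Cell("phone", b"555", 10),
    Cell("phone", None, 30, tombstone=True),   # later delete hides the value
    Cell("token", b"t", 5, expires_at=100),    # visible until time 100
]
visible_early = merge_for_read(versions, now=50)
visible_late = merge_for_read(versions, now=100)
```

In this model the newer `email` write replaces the older one, the tombstone hides `phone`, and `token` disappears once `now` reaches its expiry time.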
It’s popular to show a table that maps CQL concepts to relational concepts. The mapping of SQL to CQL is designed to ease SQL developers into the NoSQL world of Cassandra.
However, the use of SQL-like terminology in CQL can confuse matters, as many terms have very different meanings in SQL vs. CQL. I have found that Cassandra works more like a database that has only materialized views than it does like a database with relational tables.
|Database||Keyspace||These two concepts are relatively similar as both contain tables. A keyspace defines the replication factor and replication strategy for all tables that it contains.|
|Table||No equivalent||See Materialized view below.|
|Materialized view||Table + Partition||A CQL table defines a schema much like an SQL table. However, CQL tables contain partitions and each partition contains rows. The combination of a CQL table plus a partition is similar to a materialized view in SQL.|
|Primary key||No direct equivalent||An SQL primary key is a unique identifier per row. There is no direct equivalent in CQL, although the term "primary key" is used in CQL.|
|No direct equivalent||Primary key||A CQL primary key is a composite key that defines the partition key and, optionally, clustering columns.|
|Column||Column||The concept of a column is very similar in Cassandra and an RDBMS, although how a column is physically stored is very different.|
|Value||Value||The concept of a value is very similar in Cassandra and an RDBMS.|
|ORDER BY||Clustering columns||Cassandra stores data in sorted order. Therefore, you achieve the equivalent of an SQL ORDER BY through the selection of clustering columns.|
|JOIN||Achieved via materialized view||As mentioned above, a CQL table plus partition is conceptually closer to a materialized view than a relational table. In a materialized view in an RDBMS you would achieve the equivalent of a JOIN by denormalizing data. The same concept applies to Cassandra where you denormalize data.|
As a community we all benefit from clear terminology. It makes it easier to understand, learn and communicate about Cassandra. Each term should be used in only one abstraction layer, if at all possible. When a term must be used across multiple layers then it should always refer to the same thing.