During our SQL optimisation training courses, I always stress the need to understand how databases run SQL queries internally. It's easier and more intuitive than you'd think, anyway. But if you overlook this aspect, there is always a risk of sticking to simple solutions and considering them universal even when they aren't. A typical example is… add an index on the columns used by your query. As I'll show in this article, there are cases when adding an index can make some queries slower. It's all about numbers, but no complex math is involved.
Quick reminder: what is an index?
An index can be defined on one or more columns. It can be a regular index, or a `UNIQUE` index, which means that it won't accept duplicate values other than `NULL`.
Indexes are ordered data structures. Yes, you might know that they are B-Tree data structures, or a variation of B-Tree (usually B+Tree). But a B+Tree is just an advanced ordered data structure: the leaf nodes of the tree can be read from the leftmost to the rightmost to scan all values in ascending order. The other nodes can be seen as a map to find any specific value stored in a leaf node.
There are exceptions. Some databases support index types that are not ordered, the most common being the hash index type. Those indexes might be harmful too, but for different reasons. They’re outside of the scope of this article.
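For reference, here is how the two index types can be created. This is a sketch using PostgreSQL syntax and a hypothetical `customer` table; support and syntax vary per database.

```sql
-- Default index type: B-Tree. Ordered, so it supports both
-- equality lookups and range scans (e.g. surname LIKE 'B%').
CREATE INDEX idx_customer_surname ON customer (surname);

-- Hash index: equality lookups only, no ordered or range scans.
CREATE INDEX idx_customer_surname_hash ON customer USING HASH (surname);
```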
Index cardinality and selectivity
Index cardinality is the number of unique values contained in your index. A high cardinality means that there are many unique values, a low cardinality means that there are just a few unique values. Some examples:
- For a primary key, the cardinality is the highest possible: it is the number of rows, because each row is unique.
- For a `UNIQUE` index, it might be as high, or much lower: it's the number of rows minus the `NULL` pseudo-values, because `NULL`s can be duplicated.
- For a `BOOLEAN NOT NULL` column, it is 2, which is a very low number!
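Cardinality is easy to measure directly. A sketch in standard SQL, using a hypothetical `customer` table with a `surname` column (note that `COUNT(DISTINCT …)` ignores `NULL`s):

```sql
-- Number of distinct values in the column (its cardinality):
SELECT COUNT(DISTINCT surname) AS cardinality FROM customer;

-- Compare it with the total number of rows:
SELECT COUNT(*) AS total_rows FROM customer;
```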
In the context of query optimisation, it's commonly said that good indexes have high cardinality. That's a simplification, and sometimes it's completely wrong. What counts is how selective an index is for the particular queries it serves. In other words, the important aspect is how many rows an index can exclude from the query plan. Let me show you a couple of examples.
High cardinality, low selectivity
If the cardinality is the highest possible, the selectivity for each query should also be excellent, right? Wrong. Because you might search for ranges that are not very selective. For example, suppose you have an index on the `surname` and `name` columns. Each combination is unique. But if you're looking for all surnames starting with B, there might be a lot of them!
SELECT * FROM customer WHERE surname ILIKE 'B%';
Suppose we have an index on `(surname, name)`, and its content is the following:
| surname   | name    |
|-----------|---------|
| Abbott    | Jack    |
| Babbage   | Charles |
| Baker     | Colin   |
| Baker     | Tom     |
| Benson    | Robert  |
| Book      | Peter   |
| Brewster  | Emma    |
| Constable | Lisa    |
| Keenan    | Martha  |
| Windsor   | Winston |
See? In this particular case, the above query will return more than half of the rows! You might say that this example is not very realistic, and you’re right. More realistic examples might include postcodes (if your customers are concentrated in a geographical area), country codes, product codes, and so on.
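To quantify this, you can measure what fraction of the rows a predicate matches. A sketch, assuming PostgreSQL (for `ILIKE` and the aggregate `FILTER` clause):

```sql
-- Fraction of rows matched by the predicate: close to 1 means
-- the index barely excludes anything for this query.
SELECT COUNT(*) FILTER (WHERE surname ILIKE 'B%')::numeric
       / COUNT(*) AS matched_fraction
FROM customer;
-- With the sample data above, 6 rows out of 10 match: 0.6.
```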
Low cardinality, high selectivity
Let's see a different case, where cardinality is low but selectivity is high. How is it possible? Well, consider the case of a table with blog posts. The table has a `deleted` column, which is `BOOLEAN NOT NULL`, so only two values are admitted. But the number of deleted rows is probably very low. So, the index will have terrible selectivity for a query like this:
SELECT * FROM blog_post WHERE deleted = FALSE;
But it will have great selectivity for the opposite query:
SELECT * FROM blog_post WHERE deleted = TRUE;
So, it’s important to look at index selectivity for practical queries, rather than cardinality.
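If only the `deleted = TRUE` side is ever queried, some databases (PostgreSQL and SQLite, for instance, but not MySQL/InnoDB) let you index just that side with a partial index. A sketch; the index name is made up:

```sql
-- Indexes only the (few) deleted rows, so it stays tiny and the
-- planner will never be tempted to use it for deleted = FALSE.
CREATE INDEX idx_blog_post_deleted_true
    ON blog_post (deleted)
    WHERE deleted = TRUE;
```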
Why is low selectivity so bad for query performance?
This is a great question. And once again, the correct answer is: sometimes it’s not!
Let me show you an example:
SELECT COUNT(*) FROM blog_post WHERE deleted = FALSE;
This query will have to read all the index entries where `deleted = FALSE`. Suppose they are tens of thousands. And suppose that only 20 rows have `deleted = TRUE`. The index will exclude those 20 rows from the query execution. Overall, this is irrelevant. But it's still the right thing to do or, at least, it won't cause any harm.
But consider this query:
SELECT * FROM blog_post WHERE deleted = FALSE;
Do you see the difference? The first query could be executed just by reading the index. It had all the needed information. In other words, the index on `deleted` is a covering index for the first query. But this time, the database will have to read all the columns, while the index only contains one column. So, for each row that matches our `WHERE` condition, the database will have to do an additional search to retrieve the row from the table data. This increases the work to do a lot!
To be clear, normally a database will not use this index for this query. But it could. Maybe the statistics on the index's data distribution are obsolete or wrong, maybe the query planner has a nasty bug, maybe there are more complex reasons. Whatever the reason, it shouldn't happen!
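The way to find out what your database actually decides is to ask it. `EXPLAIN` is supported, with varying output formats, by MySQL, MariaDB, and PostgreSQL; `EXPLAIN ANALYZE` (PostgreSQL syntax here) also runs the query and reports real row counts:

```sql
-- Show the chosen plan without running the query:
EXPLAIN SELECT * FROM blog_post WHERE deleted = FALSE;

-- PostgreSQL: also execute it and show actual rows and timings.
EXPLAIN ANALYZE SELECT * FROM blog_post WHERE deleted = FALSE;
```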
Another example:
SELECT * FROM blog_post WHERE (publication_date BETWEEN '2020-01-01' AND '2020-01-30') AND deleted = FALSE;
- 20,000 rows in the table;
- 19,980 rows with `deleted = FALSE`;
- 100 rows for the requested time period;
- 99 undeleted rows for the requested time period.
Now, how many rows will be read?
- If no index is used (table scan), 20,000 rows are read.
- If an index on `publication_date` is used, 100 rows are read. This is good enough!
  - For the sake of completeness: 1 deleted row will be read even if this is not necessary. Again, I wouldn't care.
- If an index on `deleted` alone is used, 19,980 entries are read from the index.
  - For each of them, the database will read the corresponding row to evaluate the condition on `publication_date`.
  - Only 99 rows match this condition and really needed to be read.
- If an index on `(publication_date, deleted)` exists, only 99 index entries are read.
  - For each of them, the corresponding row will be read, too.
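The winning option above corresponds to a composite index. A sketch (the index name is my own):

```sql
-- Same column order as in the article: the tree is ordered by
-- publication_date, so the entries for the requested month are
-- adjacent, and the deleted flag is checked on the index entries
-- themselves. Only the 99 matching rows are fetched from the table.
CREATE INDEX idx_blog_post_date_deleted
    ON blog_post (publication_date, deleted);
```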
An ASCII diagram from Claude
I asked Claude Sonnet 3.7 to draw a funny ASCII diagram to visually explain why using an index on `deleted` would be terrible for performance.
+----------------------------------------------+
| QUERY: SELECT * FROM blog_post |
| WHERE deleted = 0 |
| |
| (Translation: "Give me basically |
| everything except the trash, please") |
+----------------------------------------------+
|
| "Let me check my handy index!"
v
+----------------------------------------------+
| THE WORLD'S LEAST HELPFUL INDEX |
| (Secondary Index on deleted) |
|----------------------------------------------|
| deleted: 0 | id: 1 <--- "Found one!" |
|----------------------------------------------|
| deleted: 0 | id: 2 <--- "Another one!" |
|----------------------------------------------|
| deleted: 0 | id: 3 <--- "And another!" |
|----------------------------------------------|
| deleted: 0 | id: 4 <--- "Still going!" |
|----------------------------------------------|
| ... | ... <--- "This may take |
| | a while..." |
|----------------------------------------------|
| deleted: 0 | id: 998 <--- "Almost done!" |
|----------------------------------------------|
| deleted: 0 | id: 999 <--- "Last one!" |
|----------------------------------------------|
| deleted: 1 | id: 5 <--- "Nope, trash!" |
+----------------------------------------------+
|
| "Now I'll look up EACH ONE individually!"
v
+----------------------------------------------+
| PRIMARY INDEX (looking very exhausted) |
|----------------------------------------------|
| id: 1 | title: 'Why databases hate me' | ... |
|----------------------------------------------|
| id: 2 | title: 'My index brings all | ... |
| | the rows to the yard' | |
|----------------------------------------------|
| id: 3 | title: '99 problems but an | ... |
| | efficient query | |
| | ain't one' | |
|----------------------------------------------|
| ... |
+----------------------------------------------+
|
"After 999 separate lookups, I'm exhausted!"
|
v
+----------------------------------------------+
| DATABASE PERFORMANCE |
|----------------------------------------------|
| |
| TABLE SCAN: |############ | Fast! |
| | | |
| USING INDEX: |################## | Slow! |
| |################## | |
| |################## | |
| |################## | "Still |
| |################## | going |
| | | ..." |
+----------------------------------------------+
+----------------------------------------------+
| MYSQL OPTIMIZER'S INNER MONOLOGUE |
+----------------------------------------------+
| |
| "Let's see, I could: |
| |
| 1. Read the whole table in one smooth pass |
| |
| -OR- |
| |
| 2. Check this index, then jump back and |
| forth 999 times like I'm playing the |
| world's worst game of database pinball |
| |
| Hmm, tough choice... |
| |
| *chooses option 2* |
| |
| Developer: *facepalm*" |
| |
+----------------------------------------------+
Not bad for something that is believed to be a "stochastic parrot", isn't it? Let me know what you think about this use of AI. Personally, I think that it's OK to use it, as long as you remain in control and verify the quality of its output. Some other companies use it to write articles from scratch… and they end up describing database features that just don't exist! But this is a good story for another time.
Conclusions
A friend told me: "You database people are so annoying. Whatever I ask you, you reply: it depends!" Unfortunately, that's partly true. It's because databases are complex pieces of software in charge of complex tasks, so there aren't many universal truths that you can just apply to your specific case. You need to understand how they work (to some extent!) and then apply that knowledge differently from case to case.
Indexes are perfect examples. They are supposed to speed up queries, and usually they do it very well, leading to impressive optimisations. But not always, and actually sometimes they might damage your application horribly. To debug the problem, you’ll need to find out how indexes are implemented, and how the DBMS uses them. If you need help, you can use our training courses!
Federico Razzoli