Last updated on 25 Maggio 2022
MariaDB, like most databases, allow to run queries that use regular expressions in the
clause. While this is powerful, a problem with using a regexp in a database is that it can’t take advantage of an index. The reason is simple: the query optimiser can’t understand regular expressions, or make assumptions on what they will return. This article suggests a trivial but counter-intuitive optimisation that can sometimes workaround this limitation. I called it the Regexp + LIKE optimisation.
Consider the following regexp:
SELECT id, code FROM product WHERE code RLIKE '^AB[[:digit:]]+';
Just in case you don’t understand regexp syntax:
RLIKEperforms a regular expression.
REGEXPis a synonym.
^is the beginning of the string.
ABis just a sequence of characters we’re looking for.
[[:digit:]]+means one or more digits.
The whole regexp means: “a string that starts with AB, immediately followed by at least one digit”.
LIKE to the rescue
As mentioned at the beginning of this article, a regexp can never make use of an index. That is a pity, because we are looking for a precise substring at the beginning of a string, and this search could theoretically use an index (range search).
This condition should use an index:
WHERE code LIKE 'AB%'
Unfortunately, this won’t check if AB is followed by digits.
But what about this condition?
SELECT id, code FROM product WHERE code LIKE 'AB%' AND code RLIKE '^AB[[:digit:]]+';
Here’s what will happen (in most cases):
- MariaDB will run use the
LIKEexpression over an index. This should filter out most of the rows.
- It will check the regular expression against the rows that satisfy the
This shouldn’t be sensibly slower than using the
statement. Check that an index is usable (
column) and is actually used (
). If so, the
column should show that the number of rows examined by the second query is much lower.
Why is this an optimisation?
An index is an ordered data structure, just like a phone book. If you had to look Tom Baker on a British phone book, you’d find it in seconds: because of the alphabetical order, you could just skip all names except for a small number of them. If it wasn’t ordered, you’d have to read all the names one by one, and I don’t know how much it would take – probably months, or years.
A computer is obviously faster than you and me, but the principle is the same. Finding a set of rows by index normally takes much less than a second. Scanning a whole table may take a long time if the table is big.
I understand that this is counter-intuitive: looking at the optimised version of the query, you see that it has to run both a regular expression and a
comparison. And the latter is redundant. So you’d probably think that the second version takes more time. But running
against a small set of rows is much faster than running a regular expression on all the rows in a table.
To master more advanced optimisation techniques, consider our SQL optimisation training for teams.