Last updated on 26 May 2022
The most widely known query optimisation rule is that developers should always avoid SELECT *. Even when all columns are needed, they should list their names, because more columns could be created in the future. But developers find this really annoying: listing the column names the first time is tedious, maintaining the list when the application changes is annoying, and fixing existing SELECT * queries is worse. So they often ask: is SELECT * really slow? Well, half-good news for them: often it's fine.
DISCLAIMER: I got many comments on social networks saying that SELECT * is a bad practice for many reasons and that I shouldn't encourage it. Very true! That's why, if you take the time to read the article, you'll find out that I'm not denying it is a bad practice, and I'm not encouraging it! 🙂 However, whether we like it or not, most queries out there have the infamous asterisk. And we do database consulting: customers need effective advice to improve their databases' performance. To put it in simple terms: you can spend 2 days removing the asterisks from some queries and possibly see no measurable effect; or you can spend 2 days improving the indexes in your tables and your database configuration, and see a big performance impact. This article tries to shed some light on when removing the asterisk is important.
Why SELECT * has a bad reputation
When you run SELECT *, you generally read unneeded columns. Sure, this intuitively sounds like a bad practice. But what are the practical problems?
- These columns are read from memory or, in the worst case, from disk. This means higher resource usage and higher latency.
- If they're read from disk, they may be cached; and if they're cached, they'll stay in the cache for a longer time. This may be a waste of memory.
- Columns will be sent over the network. This causes more work for both the database server and the application server.
When SELECT * is bad
But, as you may have guessed, this waste of time and resources is usually minimal. There are, however, some cases where SELECT * actually affects overall application performance.
Columnar DBMSs, like Redshift and Snowflake, store different columns in different data structures. This includes different memory pages and different physical files. This makes queries that read only a few columns much faster, and improves data compression. But if you run SELECT *, all columns are accessed, and queries will be considerably slower than they should be.
Consider the following query:
SELECT id, surname FROM user WHERE surname < 'c';
If we have an index on the columns (surname, id), this query can be performed by reading that index alone. But consider this query:
SELECT * FROM user WHERE surname < 'c';
In this case we're reading all columns, not only those included in the index. So the query will still use the index, but for each row that satisfies the condition it will also have to access the table data. While the index allows us to find those rows quickly, each match still requires an additional lookup. Listing the needed columns is all we have to do to avoid that.
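The difference between the two queries can be made visible with an execution plan. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the DBMS (the table and index names are illustrative); the same principle applies to InnoDB, where secondary indexes implicitly contain the primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, surname TEXT, bio TEXT)")
conn.execute("CREATE INDEX idx_user_surname ON user (surname)")
conn.executemany(
    "INSERT INTO user (surname, bio) VALUES (?, ?)",
    [("anderson", "..."), ("baker", "..."), ("clark", "...")],
)

def plan(sql):
    # EXPLAIN QUERY PLAN returns rows whose last field describes the access path
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Only id and surname are needed: both live in the index entry (id is the
# rowid), so SQLite reports a COVERING INDEX and never touches the table.
print(plan("SELECT id, surname FROM user WHERE surname < 'c'"))

# SELECT * also needs bio, so every matching index entry triggers
# an extra lookup into the table itself (no "COVERING" in the plan).
print(plan("SELECT * FROM user WHERE surname < 'c'"))
```

Whatever DBMS you use, checking the plan of both variants of a hot query is a quick way to tell whether the asterisk is actually costing you anything.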
Strength is in numbers
I have seen production tables with an unreasonable number of columns, occasionally more than 100. Reading 100 columns when you only need an id is clearly a waste.
Most tables don't have so many columns. But some queries run very often, sometimes many times per second. So, even if they're fast, it's good to try to make them faster and less resource-hungry. This will most probably have an impact on application performance and resource usage.
Generated columns / Virtual columns
Generated columns, or virtual columns, are columns defined by an SQL expression. They may be calculated when a row is written and stored on disk; or they may be calculated on the fly when read. In the latter case, they should only be selected when needed, to avoid executing SQL expressions whose results will be thrown away.
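As a quick sketch, here is a VIRTUAL generated column in SQLite (version 3.31 or later; MariaDB and MySQL use very similar syntax). The table and expression are made up for illustration; the point is that the expression runs only when the column is actually read:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        name TEXT,
        net_price REAL,
        -- VIRTUAL: not stored on disk; evaluated every time it is read
        gross_price REAL GENERATED ALWAYS AS (net_price * 1.22) VIRTUAL
    )
""")
conn.execute("INSERT INTO product (name, net_price) VALUES ('widget', 100.0)")

# Selecting the generated column (explicitly or via SELECT *) evaluates
# its expression for every returned row...
print(conn.execute("SELECT gross_price FROM product").fetchone()[0])

# ...while listing only the columns you need skips that work entirely.
print(conn.execute("SELECT name, net_price FROM product").fetchone())
```

With a trivial expression like this the cost is negligible, but virtual columns can contain arbitrarily expensive expressions, and SELECT * pays for all of them on every row.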
Views and table elimination
Views may be built on JOINs. But there are cases when you have a reason to select rows from a view, yet you don't need to read columns from all the joined tables.
Some DBMSs support a feature called table elimination. This means that, under certain circumstances, selecting only the needed columns will result in some tables not being accessed at all, making the query faster.
While this feature was mainly designed to make anchor modeling possible, this optimisation is occasionally useful for OLTP workloads as well.
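Table elimination can also be observed in SQLite, which omits an unreferenced LEFT JOIN when the ON clause can match at most one row (here, a primary key lookup); MariaDB performs a similar optimisation. The view and table names below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE VIEW order_report AS
    SELECT orders.id AS order_id, orders.amount, customers.name AS customer_name
    FROM orders LEFT JOIN customers ON customers.id = orders.customer_id
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows end with a human-readable access-path description
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Only orders columns are selected: the planner drops the LEFT JOIN
# entirely, and customers never appears in the plan.
print(plan("SELECT order_id, amount FROM order_report"))

# SELECT * needs customer_name, so the join must be executed
# and both tables appear in the plan.
print(plan("SELECT * FROM order_report"))
```

In other words, with a column list the view behaves like a query on a single table; with the asterisk you pay for every join the view defines.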
InnoDB TEXTs and BLOBs
In InnoDB (MariaDB and MySQL's default storage engine), big variable-length text and binary columns are stored in separate pages. Fortunately, there is an optimisation: a prefix is stored in the same page as the rest of the row, so when a column is empty or a value is short, no extra page is required. But long values are stored in separate pages, and reading them requires at least one more physical read. This affects performance.
How much is performance affected? Usually the difference is not measurable, and therefore not relevant. But if you add a column list to your most frequently run queries, which currently select many unneeded TEXT or BLOB columns and retrieve many rows, you may well see an improvement in the latency of some HTTP calls, or in CPU usage.
An eye to the future
It's worth mentioning that, while table definitions don't change often, sometimes they do. Today you may use SELECT * on a table with a few integer columns. But if tomorrow many columns are added, including virtual columns, or the number of retrieved rows per query increases, your SELECT * may start to hurt performance. So it can be argued that it is best to follow good practices from the beginning, and list the columns explicitly.
Conclusions
SELECT * is not a good practice, but its effect on query performance and server resource usage is often exaggerated. Normally other aspects are much more important, namely using indexes well and avoiding the use of internal temporary tables for two-step sorts.
However, there are cases when developers should be careful to avoid SELECT *. I tried to list them here, though I don't claim that the list is complete. If you know more reasons to avoid SELECT *, please comment.
To master advanced optimisation techniques, consider our SQL optimisation training for teams.