The most widely known query optimisation rule is that developers should always avoid SELECT *
. Even when all columns are needed, they should list their names, because more columns could be created in the future. But developers find this really annoying: listing the column names the first time is frustrating, maintaining the list when the application changes is annoying, fixing existing SELECT *
is terrible. So they often question: is SELECT *
really slow? Well, half-good news for them: often it’s fine.
DISCLAIMER: I got many comments on social networks that said that SELECT *
is a bad practice for many reasons and I shouldn’t encourage it. Very true! That’s why, if you take the time to read the article, you’ll find out that I’m not denying it is a bad practice and I’m not encouraging it! 🙂 However, whether we like it or not, most queries out there have the infamous asterisk. But we do database consulting. Customers need efficient advise to improve their databases performance. And to put it in simple terms: you can spend 2 days removing the asterisks from some queries, and possibly see no measurable effect; or you can spend 2 days improving indexes in your tables and your database configuration, and see a big performance impact. This article tries to shred some light on when removing the asterisk is important.
Why SELECT * has a bad reputation
When you run SELECT *
, you generally read unneeded columns. Sure, this may intuitively be a bad practice. But what are the practical problems?
- These columns are read from memory or, in the worst case, from disk. This means higher resource usage and higher latency.
- If they’re read form disk they may be cached, if they’re cached they’ll stay in the cache for a longer time. This may be a waste of memory.
- Columns will be sent over a network. This causes more work both on the database server and the application server.
When SELECT * is bad
But, as you may have guessed, this waste of time and resources is usually minimal.
Usually.
There are some cases when SELECT *
actually affects overall application performance.
Columnar databases
Columnar DBMSs, like Redshift and Snowflake, store different columns into different data structures. This includes different memory pages and different physical files. This makes queries like SUM(column)
much faster, and improves data compression. But if you run SELECT *
, all columns are accessed and queries will be sensibly slower than they should be.
Covering indexes
Consider the following query:
SELECT id, surname FROM user WHERE surname < 'c';
If we have an index on the columns (surname, id)
, this query can be performed by only reading that index. But consider this query:
SELECT * FROM user WHERE surname < 'c';
In this case we’re reading all columns, not only those that are comprised in the index. So the query will use the index, but for each row that satisfies the condition it will have to access data. While the index allows to retrieve those rows quickly, it’s still an additional search for each relevant row. Listing the needed columns is all we have to do to avoid that.
Strength is in numbers
I saw production tables with an unreasonable number of columns. Occasionally more than 100. Reading 100 columns when you need an id is clearly a waste.
Most tables don’t have so many columns. But some queries run very often. Sometimes they run many times a seconds. So, even if they’re fast, it’s good to try to make them faster and less hungry of resources. This will most probably have an impact on application performance and resource usage.
Generated columns / Virtual columns
Generated columns, or Virtual columns, are columns defined as an SQL expression. They may be calculated on row writes and written to disk; or they can be calculated on the fly when needed. In the latter case, they should only be selected when needed to avoid executing SQL code that is not needed.
Views
Views may be built on JOIN
s. But there are cases when you have a reason to select rows from a view, but you don’t need to read columns from all the tables.
Some DBMSs support a feature called table elimination. This means that, under certain circumstances, only selecting the needed columns will result in some tables not being accessed at all, making the query faster.
While this feature was mainly designed to make anchor modeling possible, this optimisation is occasionally useful for OLTP workloads as well.
InnoDB TEXTs and BLOBs
In InnoDB (MariaDB and MySQL default storage engine), big variable-length text and binary columns are stored in separate memory pages. Fortunately, there is an optimisation: a prefix is stored in the same page as the rest of the row, so when a column is empty or a value is short no extra page is required. But long data are stored in separate page(s), and reading it requires at least one more physical read. This affects performance.
How much is performance affected? Usually the difference is not measurable and therefore not relevant. But if you add a column list to the most often run queries, which selects many unneeded TEXT
/BLOB
/VARCHAR
columns, and retrieve many rows, it’s possible you’ll see an improvement in some HTTP calls performance or in the CPU usage.
An eye to the future
It’s worth mentioning that, while table definitions don’t change often, sometimes they do. So, today you may use SELECT *
on a table with a few integer columns. But if tomorrow many columns are added, including virtual columns, the the number of retrieved rows per query increases, it’s possible that your SELECT *
will hurt performance. So it can be argued that it would be best to try following good practices from the beginning, and list the columns explicitly.
Conclusions
SELECT *
is not a good practice, but its effect on a query performance and server resource usage is often exaggerated. Normally other aspect are much more important, namely using indexes and avoiding the use of internal temporary tables for two-steps sorts.
However, there are cases when developers should be careful not to make a moderate use of SELECT *
. I tried to list them here, though I don’t claim that the list is complete. If you know more reasons to avoid SELECT *
, please comment.
To master advanced optimisation techniques, consider our SQL optimisation training for teams.
Federico Razzoli
Hi Federico,
Good advice. I think there is one other consideration I would add: how your retrieval from the database interacts with any caching systems.
I want to be careful in saying that caching is not for everyone because there are tonnes of pitfalls with naive implementations. *However* in the case that a row-cache or object cache makes sense, you typically will want to use SELECT *. Being too selective with which columns you retrieve can reduce the effectiveness of the cache (or duplicate its contents through different stored results).
Hey Morgan.
Thanks for the contribute. I tend to agree with you, but I didn’t touch this topic because I think it’s highly dependent with the specific workload. For example, using SELECT * every time could lead to cache many columns, or cache big TEXT and BLOB columns that are not often needed.
Hi Federico Razzoli,
In some circumstances, I think it would be acceptable to use SELECT *, like in the case of explicit need for every column in the table involved as opposed to every column that existed when the query was written. In this case database will internally expand the * into the complete list of column and I think there will be no performance difference.
Yes with respect to the points you mentioned, I would rather discourage to use SELECT * most of the time. One more thing I would like to mention is “Copying Data from one table to another”, in this case I have seen the most common way to copy data from one table to another is using SELECT *. Okay in some case it may work, but some cases you could potentially copy incorrect data into incorrect column if the order of the column is not same between both tables.
Hi Jolly Kannampuzha!
Note that tables may change over time, and columns may be added later. For this reason, maybe you’re selecting all columns because you need to do so – but when more columns are added, if you use SELECT *, your query will probably be reading more columns than necessary. At that point, some of the problems listed in this article may affect your query.