How slow is SELECT * ?

Last updated on 20 Settembre 2021

The most widely known query optimisation rule is that developers should always avoid SELECT * . Even when all columns are needed, they should list their names, because more columns could be created in the future. But developers find this really annoying: listing the column names the first time is frustrating, maintaining the list when the application changes is annoying, fixing existing SELECT * is terrible. So they often question: is SELECT * really slow? Well, half-good news for them: often it’s fine.

Why SELECT * has a bad reputation

When you run SELECT * , you generally read unneeded columns. Sure, this may intuitively be a bad practice. But what are the practical problems?

  • These columns are read from memory or, in the worst case, from disk. This means higher resource usage and higher latency.
  • If they’re read form disk they may be cached, if they’re cached they’ll stay in the cache for a longer time. This may be a waste of memory.
  • Columns will be sent over a network. This causes more work both on the database server and the application server.

When SELECT * is bad

But, as you may have guessed, this waste of time and resources is usually minimal.

Usually.

There are some cases when SELECT * actually affects overall application performance.

Columnar databases

Columnar DBMSs, like Redshift and Snowflake, store different columns into different data structures. This includes different memory pages and different physical files. This makes queries like SUM(column) much faster, and improves data compression. But if you run SELECT * , all columns are accessed and queries will be sensibly slower than they should be.

Covering indexes

Consider the following query:

SELECT id, surname FROM user WHERE surname < 'c';

If we have an index on the columns (surname, id) , this query can be performed by only reading that index. But consider this query:

SELECT * FROM user WHERE surname < 'c';

In this case we’re reading all columns, not only those that are comprised in the index. So the query will use the index, but for each row that satisfies the condition it will have to access data. While the index allows to retrieve those rows quickly, it’s still an additional search for each relevant row. Listing the needed columns is all we have to do to avoid that.

Strength is in numbers

I saw production tables with an unreasonable number of columns. Occasionally more than 100. Reading 100 columns when you need an id is clearly a waste.

Most tables don’t have so many columns. But some queries run very often. Sometimes they run many times a seconds. So, even if they’re fast, it’s good to try to make them faster and less hungry of resources. This will most probably have an impact on application performance and resource usage.

Generated columns / Virtual columns

Generated columns, or Virtual columns, are columns defined as an SQL expression. They may be calculated on row writes and written to disk; or they can be calculated on the fly when needed. In the latter case, they should only be selected when needed to avoid executing SQL code that is not needed.

Views

Views may be built on JOIN s. But there are cases when you have a reason to select rows from a view, but you don’t need to read columns from all the tables.

Some DBMSs support a feature called table elimination. This means that, under certain circumstances, only selecting the needed columns will result in some tables not being accessed at all, making the query faster.

While this feature was mainly designed to make anchor modeling possible, this optimisation is occasionally useful for OLTP workloads as well.

InnoDB TEXTs and BLOBs

In InnoDB (MariaDB and MySQL default storage engine), big variable-length text and binary columns are stored in separate memory pages. Fortunately, there is an optimisation: a prefix is stored in the same page as the rest of the row, so when a column is empty or a value is short no extra page is required. But long data are stored in separate page(s), and reading it requires at least one more physical read. This affects performance.

How much is performance affected? Usually the difference is not measurable and therefore not relevant. But if you add a column list to the most often run queries, which selects many unneeded TEXT / BLOB / VARCHAR columns, and retrieve many rows, it’s possible you’ll see an improvement in some HTTP calls performance or in the CPU usage.

Conclusions

SELECT * is not a good practice, but its effect on a query performance and server resource usage is often exaggerated. Normally other aspect are much more important, namely using indexes and avoiding the use of internal temporary tables for two-steps sorts.

However, there are cases when developers should be careful not to make a moderate use of SELECT * . I tried to list them here, though I don’t claim that the list is complete. If you know more reasons to avoid SELECT * , please comment.

Federico Razzoli

Did you like this article?

A proposito di Federico Razzoli

Federico is Vettabase Ltd director, and he's an expert database consultant specialised in the MariaDB and MySQL ecosystems.

4 Replies to “How slow is SELECT * ?”

  1. Hi Federico,

    Good advice. I think there is one other consideration I would add: how your retrieval from the database interacts with any caching systems.

    I want to be careful in saying that caching is not for everyone because there are tonnes of pitfalls with naive implementations. *However* in the case that a row-cache or object cache makes sense, you typically will want to use SELECT *. Being too selective with which columns you retrieve can reduce the effectiveness of the cache (or duplicate its contents through different stored results).

    • Hey Morgan.
      Thanks for the contribute. I tend to agree with you, but I didn’t touch this topic because I think it’s highly dependent with the specific workload. For example, using SELECT * every time could lead to cache many columns, or cache big TEXT and BLOB columns that are not often needed.

  2. Hi Federico Razzoli,

    In some circumstances, I think it would be acceptable to use SELECT *, like in the case of explicit need for every column in the table involved as opposed to every column that existed when the query was written. In this case database will internally expand the * into the complete list of column and I think there will be no performance difference.

    Yes with respect to the points you mentioned, I would rather discourage to use SELECT * most of the time. One more thing I would like to mention is “Copying Data from one table to another”, in this case I have seen the most common way to copy data from one table to another is using SELECT *. Okay in some case it may work, but some cases you could potentially copy incorrect data into incorrect column if the order of the column is not same between both tables.

    • Hi Jolly Kannampuzha!
      Note that tables may change over time, and columns may be added later. For this reason, maybe you’re selecting all columns because you need to do so – but when more columns are added, if you use SELECT *, your query will probably be reading more columns than necessary. At that point, some of the problems listed in this article may affect your query.

Lascia un commento

Il tuo indirizzo email non sarà pubblicato. I campi obbligatori sono contrassegnati *

*