Why Your Database Deserves Consistent Names and Types

by Federico Razzoli | Aug 22, 2025 | SQL Language

I see a lot of problems with database schemas: missing indexes, wrong types, tables used like a spreadsheet… but all these problems are widely known. An often overlooked problem is inconsistency. Let me explain how you can guarantee database consistency, and how inconsistency makes your life sadder than it needs to be.

What Schema Inconsistency Is

The inconsistencies we’ll talk about today are of these types:

  • Table name inconsistencies;
  • Column name inconsistencies;
  • Column type inconsistencies.

There are other aspects of a database that should be consistent, but let’s ignore them in this article.

A column might be defined in this way:

email VARCHAR(100)

It doesn’t matter here if it’s NULL or NOT NULL. Maybe in certain tables emails are not mandatory and we have a UNIQUE constraint on them, so NULL is necessary; whereas in other tables emails might be mandatory, so they should be defined as NOT NULL (though, actually, you might also use a separate table for optional data). This is not an inconsistency.

Inconsistencies include the following:

  • Using the name email in some tables, and mail in others.
    • Maybe even using different languages!
  • Using different types for fields with the same name across different tables.
  • Using VARCHAR, but with different sizes: 100, 200, 255…

Why are these things a problem?

Index usage

If your organisation doesn’t use types consistently, chances are that wrong types are occasionally chosen. This happens in most databases, so don’t feel stupid if you do this. I’m not here to judge you, I’m here to tell you that you have a problem.

Suppose your product table has a quantity column. You’d expect it to be an INTEGER. For reasons that I don’t know (if you do, please lecture me), in some databases such columns occasionally end up being FLOAT or, even worse, VARCHAR. Now, suppose you have this query:

SELECT id FROM product WHERE quantity < 10;

You might say that the query is wrong, because strings must be quoted ('10'). I prefer to say that the query is right and the column is wrong. But the point here is… this query cannot use an index!

Why? Because, to take advantage of an index, the query would have to be performed in this way:

  • Read the first entry of the index;
  • Keep reading rightward until you find '10'.

Or the other way around:

  • Find the value '10';
  • Keep reading leftward until the end of the index.

But there’s a problem: the integer 10 can match many strings, not only '10'. It can match:

  • ' 10';
  • '0010';
  • ' 10 ';
  • etc.

Since there is no 1:1 conversion, using an index over strings and relying on its order is not possible.
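
To make this concrete, here is a minimal sketch; the table and index are hypothetical, and the exact behaviour depends on your DBMS:

CREATE TABLE product (
    id BIGINT PRIMARY KEY,
    name VARCHAR(100),
    quantity VARCHAR(20)    -- wrong type: it should be INTEGER
);
CREATE INDEX idx_product_quantity ON product (quantity);

-- Depending on the DBMS, this comparison either fails with a type error
-- or forces a per-row string-to-number conversion; either way, the index
-- on quantity cannot be used for a range scan:
EXPLAIN SELECT id FROM product WHERE quantity < 10;

-- If quantity were defined as INTEGER, the same query could use the index.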

JOIN performance

A JOIN between two tables is performed by comparing values from the left table with values from the right table to find matches. As you can imagine, this part is potentially slow, because many combinations of values need to be tested. However, when we use indexes correctly and compare a reasonable number of rows, a JOIN shouldn’t be too bad.

But type or size inconsistency can cause problems here. When you compare two different types (INTEGER vs VARCHAR) or similar types of different sizes (INTEGER vs BIGINT), a conversion must happen. When we have to compare many rows, this might degrade a JOIN’s speed.

For this reason, it’s important that columns that are matched against each other have the same type and size. This is guaranteed if we use types consistently: for example, if we make sure that all columns called email are of type VARCHAR(100), we can also be sure that JOINs based on email columns won’t involve type conversions for row matching. More importantly, we’d better make sure that all primary keys are of the same type (BIGINT or UUID), so that all relationship columns (eg, customer_id) will be of that type too.
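
As a sketch of that convention (hypothetical tables), the relationship column simply copies the exact type of the primary key it references:

CREATE TABLE customer (
    id BIGINT PRIMARY KEY,
    email VARCHAR(100) NOT NULL
);

CREATE TABLE customer_order (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,    -- same type and size as customer.id
    created_at TIMESTAMP NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customer (id)
);

-- This JOIN matches rows without any type conversion:
SELECT o.id, c.email
FROM customer_order o
JOIN customer c ON c.id = o.customer_id;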

Set operations performance

Relational databases support some operations that are often described as set operations (even though they are operations defined by relational algebra):

  • UNION;
  • INTERSECT;
  • EXCEPT or MINUS.

UNION means that the results of a query are added to the results of another query:

(SELECT id, name, phone FROM customer)
UNION ALL
(SELECT id, name, phone FROM supplier);

There will be a single resultset, with one type and one size for every column. In the above example, id, name and phone must be of the same type and size. Some DBMSs tolerate small differences, but this means that a conversion must happen. For example, if the first query returns ids of type BIGINT and names of type VARCHAR(100), but the second query returns INTEGER and VARCHAR(50), the latter will probably be converted. With big resultsets, this might affect query speed.
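
If the types did differ, the implicit conversion would be roughly equivalent to writing the CAST yourself, as in this sketch (the exact CAST target syntax varies between DBMSs; MySQL and MariaDB, for instance, use CHAR rather than VARCHAR):

-- Hypothetical case: supplier.id is INTEGER and supplier.name is VARCHAR(50)
(SELECT id, name, phone FROM customer)
UNION ALL
(SELECT CAST(id AS BIGINT), CAST(name AS VARCHAR(100)), phone FROM supplier);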

INTERSECT returns the rows that are returned by both the left and right queries. This is equivalent to an INNER JOIN that matches all the columns. EXCEPT returns the rows that are returned by the left query but not by the right query. This is equivalent to an anti-join that matches all the columns. Examples:

-- customers that are also suppliers
(SELECT id, name, phone FROM customer)
INTERSECT DISTINCT
(SELECT id, name, phone FROM supplier);

-- customers that are not suppliers
(SELECT id, name, phone FROM customer)
EXCEPT ALL
(SELECT id, name, phone FROM supplier);

Type mismatches affect INTERSECT and EXCEPT speed in the same way they affect JOINs.

DISTINCT doesn’t need to be specified, as it is the default behaviour. It means that duplicate rows must be eliminated from the final results. ALL means that duplicates are allowed. These modifiers are not affected by type mismatches, because they come into play after conversions.

Automation and Data interoperability

In some cases, other technologies or scripts need to connect to a database to automate some operation with the data. Typical cases are:

  • Data anonymisation for staging databases;
  • Exporting data to other technologies.

In both cases, the program that connects to the database is probably flexible enough to work with any schema. But it will require that some key columns follow strict naming conventions, so that it can recognise them and do something special with them.

Anonymisation

Staging databases are often recreated nightly from a production backup. But some regulations require that data are anonymised before being exposed to the developers. Data anonymisation consists of replacing personal data (PII) with partially masked data, or random realistic data. An id is normally not considered sensitive information, but data like emails, phone numbers or bank data should not be in a staging database.

Columns that contain sensitive data should be identifiable by their name. While some universal names exist, other names might depend on the specific terminology used by the company, its developers’ mother tongue, and so on. The company should have a dictionary of names for columns that contain personal information, so that an anonymisation script can identify and scrub them.

It’s very important to follow this dictionary. If an anonymisation script expects email columns to be called email, there must not be columns called mail, email2, or email_new – or their data might not be anonymised. And keep in mind that even names like EMAIL, Email or e-mail might cause sensitive columns to be skipped.

It’s also important to pay attention to the data types. If the anonymisation script expects phone numbers to be strings, it could for example replace every digit character with another. But if phone numbers are stored as integers in some tables, the script will likely fail.
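
As a sketch (hypothetical table and column names), a string-based masking statement could look like this; it assumes that email and phone are stored as strings in every table it runs against:

-- Replace emails with synthetic addresses and mask every digit of the phone number.
-- REGEXP_REPLACE behaviour varies slightly between DBMSs.
UPDATE customer
SET email = CONCAT('user', id, '@example.invalid'),
    phone = REGEXP_REPLACE(phone, '[0-9]', '9');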

Data Exporting

Data are exported from OLTP databases to data warehouses and other data technologies. Data engineers build the data pipelines, exporting the data that is needed, excluding data that is not needed or not allowed, and identifying the columns with special meanings. Identifying the columns with special meanings is not necessarily intuitive for the application developers: for example, in a DWH primary keys are not production’s id columns, but other columns that make sense from a business perspective (see Primary Keys in Data Warehouses). You get the point, right? Data engineers should be able to identify all the columns they care about quickly, just by looking at their names.

Similarly, they might have some automation that filters and transforms the data, but it first needs to identify them. For example, this automation might need to recognise “relation tables” from their names, or it might need to know how to order data chronologically based on some columns’ names and types.

Without this level of consistency, data engineers will need to spend time understanding the source data, configuring table and column names, and coding new ways to filter or transform the data. Over time, inconsistency will make their automation more fragile and harder to maintain. Their daily job will also become more tedious and frustrating.

A Bad Solution: Domains

Most traditional DBMSs support domains, a feature included in SQL92 and later standards. Domains define a type and some constraints that can be applied to a set of columns. So, for example, there can be an email domain defined as VARCHAR(100) NOT NULL CHECK (VALUE LIKE '_%@_%.__%'). This domain can then be applied to all columns that are supposed to contain an email.
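
As a sketch in PostgreSQL syntax (MySQL and MariaDB don’t support domains), such a domain could be defined and used like this:

CREATE DOMAIN email_address AS VARCHAR(100)
    NOT NULL
    CHECK (VALUE LIKE '_%@_%.__%');

CREATE TABLE customer (
    id BIGINT PRIMARY KEY,
    -- the domain carries the type, the NOT NULL and the CHECK constraint
    email email_address
);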

In theory, this is a great way to define the type and the set of valid values for columns that should contain the same class of data: this is exactly the goal of domains. Also, domains are defined in the database itself. This means that they completely support the database’s types and constraints, and they don’t require using external technologies for validation.

Unfortunately, domains are a feature designed in the early 90s, when scalability was not a concern. Don’t misunderstand me: I’m not a relational database detractor, and I’m not implying that relational databases don’t scale just because they were designed in the 70s. Relational databases are an example of solid, extensible design that survived over the decades and will survive for more decades. But this particular feature won’t scale.

Let me give you an example. Suppose you have a domain, and tens of columns associated with that domain. Some of these columns might even contain millions of rows. Now suppose you need to change the domain’s size, or even its type. This might trigger a massive series of data writes that will happen in a single transaction. Even if you simply change a constraint, the new constraint will generally need to be verified against all existing data, in a single transaction. This kind of operation can easily kill your production database.
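
For example, a seemingly small change like the following (PostgreSQL syntax, hypothetical constraint) has to re-validate every existing value in every column that uses the domain, all at once:

ALTER DOMAIN email_address
    ADD CONSTRAINT email_is_lowercase CHECK (VALUE = LOWER(VALUE));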

A Good Solution: Acceptance Tests

Our databases are – or should be – versioned. This means that every schema change is committed to a branch, and a migration of the database corresponds to a pull request. Probably our pull requests are reviewed by one or more colleagues before being accepted.

However, a pull request might also be required to pass some acceptance tests. And if it includes schema changes (diffs in SQL files), these tests should cover such changes. The workflow should be similar to the following:

  • Read the up migration. Verify that there is a corresponding down migration.
  • Run the up migration against a staging database or, even better, an unused database.
  • Run queries against the information_schema to verify that rules are not violated. For example, the script shouldn’t find any table without a primary key, any primary key consisting of more than two columns, etc. (see the sketch after this list).
  • There should be a dictionary table containing column names and their expected definition (type, size). Columns with unexpected definitions won’t pass the tests. If it makes sense, some names might actually be regular expressions.
  • Unknown names should not pass the tests. If they don’t follow the conventions, they need to be fixed. If they are legitimately new, the new name needs to be added to the dictionary table.
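
As a sketch of such checks (standard information_schema; the schema name and the column_dictionary table are hypothetical):

-- Tables without a primary key:
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints tc
    ON  tc.table_schema = t.table_schema
    AND tc.table_name = t.table_name
    AND tc.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema = 'my_app'
  AND tc.constraint_name IS NULL;

-- Columns whose definition doesn't match the dictionary.
-- column_dictionary(column_name, expected_type, expected_length) is a hypothetical table.
SELECT c.table_name, c.column_name, c.data_type, c.character_maximum_length
FROM information_schema.columns c
JOIN column_dictionary d ON d.column_name = c.column_name
WHERE c.table_schema = 'my_app'
  AND (c.data_type <> d.expected_type
       OR COALESCE(c.character_maximum_length, 0) <> COALESCE(d.expected_length, 0));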

Other Good Practices

Every non-obvious part of your schema should be documented. Provided that you implement the solution described above, the dictionary table should also document the meaning of the column names and their formats. For example, if a string column contains lists, document the format of these lists (comma-separated words, a JSON array…). If the format is public (eg, EAN-13 or AWS ARN) just specify the documentation’s URL.

All DBMSs allow you to associate a comment with every table and column, and database GUIs will show it. Use this feature, but avoid comments that are just noise. Don’t state that email contains an email; instead, write down the other valuable pieces of information that a reader might need. For example (there’s a syntax sketch after the list below):

  • Why do we have that column? Maybe it’s legacy stuff that should be removed. Maybe it’s a redundant column that avoids big joins.
  • Information that’s beyond a physical data description. Is it verified information? Is there a risk that it’s obsolete? Where does that information come from?
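
A sketch of how to attach such a comment to a hypothetical column; the first statement uses MariaDB/MySQL syntax, the second is the PostgreSQL equivalent:

ALTER TABLE product
    MODIFY COLUMN legacy_code VARCHAR(20)
    COMMENT 'Code from the old ERP, only used by the nightly export; candidate for removal';

COMMENT ON COLUMN product.legacy_code
    IS 'Code from the old ERP, only used by the nightly export; candidate for removal';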

I’m not suggesting that you should do it for all columns, or that you should write a book about every column. But make sure that colleagues who join the company in 2 years will read the name and understand what they’re seeing. If they don’t, respect their time: give them more details and save them a long investigation.

Conclusions

We saw some reasons why your company should use column names and types consistently. Consistent typing will bring benefits to query performance. Naming and typing consistency will make both the data engineers and OLTP developers more productive.

As your application grows and your team expands, these small inconsistencies compound into major headaches. The schema choices you make today will either empower your team or frustrate them (and yourself) for years. Choose consistency now: establish naming and typing rules in your organisation.

I’d be curious to hear about your experience on this topic. Did you have problems because of schema inconsistency? Have you managed to solve them somehow? Do you recommend some particular tools?

Federico Razzoli

All content in this blog is distributed under the CreativeCommons Attribution-ShareAlike 4.0 International license. You can use it for your needs and even modify it, but please refer to Vettabase and the author of the original post. Read more about the terms and conditions: https://creativecommons.org/licenses/by-sa/4.0/

About Federico Razzoli
Federico Razzoli is a database professional, with a preference for open source databases, who has been working with DBMSs since year 2000. In the past 20+ years, he served in a number of companies as a DBA, Database Engineer, Database Consultant and Software Developer. In 2016, Federico summarized his extensive experience with MariaDB in the “Mastering MariaDB” book published by Packt. Being an experienced database events speaker, Federico speaks at professional conferences and meetups and conducts database trainings. He is also a supporter and advocate of open source software. As the Director of Vettabase, Federico does business worldwide but prefers to do it from Scotland where he lives.
