I see a lot of problems with database schemas: missing indexes, wrong types, tables used like a spreadsheet… but all these problems are widely known. An often overlooked problem is inconsistency. Let me explain how you should guarantee database consistency, and how inconsistency makes your life sadder than it needs to be.
What Schema Inconsistency Is
The inconsistencies we’ll talk about today are of these types:
- Table name inconsistencies;
- Column name inconsistencies;
- Column type inconsistencies.
There are other aspects of a database that should be consistent, but let’s ignore them in this article.
A column might be defined in this way:
email VARCHAR(100)
It doesn’t matter here if it’s NULL or NOT NULL. Maybe in certain tables emails are not mandatory and we have a UNIQUE constraint on them, so NULL is necessary; whereas in other tables emails might be mandatory, so they should be defined as NOT NULL (though, actually, you might also use a separate table for optional data). This is not an inconsistency.
Inconsistencies include the following:
- Using the email name in some tables, and the mail name in others.
  - Maybe even using different languages!
- Using different types for fields with the same name across different tables.
- Using VARCHAR, but with different sizes: 100, 200, 255…
Why are these things a problem?
Index usage
If your organisation doesn’t use types consistently, chances are that wrong types are occasionally chosen. This happens in most databases, so don’t feel stupid if you do this. I’m not here to judge you, I’m here to tell you that you have a problem.
Suppose your product table has a quantity column. You’d expect it to be an INTEGER. For reasons that I don’t know (if you do, please lecture me), in some databases such columns occasionally end up being FLOAT or, even worse, VARCHAR. Now, suppose you have this query:
SELECT id FROM product WHERE quantity < 10;
You might say that the query is wrong, because strings must be quoted ('10'). I prefer to say that the query is right and the column is wrong. But the point here is… this query cannot use an index!
Why? Because the query should be performed in this way:
- Read the first entry of the index;
- Keep reading rightward until you find '10'.
Or the other way around:
- Find the value '10';
- Keep reading leftward until you reach the beginning of the index.
But there’s a problem: the 10 integer can match many strings, not only '10'. It can match:
- ' 10';
- '0010';
- ' 10 ';
- etc.
Since there is no 1:1 conversion, using an index of strings and relying on its order is not possible.
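The real fix is to change the column type. Here’s a minimal sketch, assuming all existing quantity values are clean integers and using MariaDB/MySQL syntax (adjust for your DBMS):
-- convert the column to a proper numeric type
ALTER TABLE product MODIFY quantity INT NOT NULL;
-- an index on the numeric column can now serve a range scan
CREATE INDEX idx_product_quantity ON product (quantity);
SELECT id FROM product WHERE quantity < 10;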
JOIN performance
A JOIN between two tables is performed by comparing values from the left table with values from the right table to find matches. As you can imagine, this part is potentially slow because many combinations of values need to be tested. However, when we use indexes correctly and compare a reasonable number of rows, a JOIN shouldn’t be too bad.
But type or size inconsistencies can cause problems here. When you compare two different types (INTEGER vs VARCHAR) or similar types of different sizes (INTEGER vs BIGINT), a conversion must happen. When we have to compare many rows, this might degrade a JOIN’s speed.
For this reason, it’s important that columns that are matched have the same type and size. This is guaranteed if we use types consistently: for example, if we make sure that all columns called email are of type VARCHAR(100), we can also be sure that JOINs based on email columns won’t involve type conversions for row matching. More importantly, we’d better make sure that all primary keys are of the same type (BIGINT or UUID), and consequently all relationship columns (eg, customer_id) will be of that type.
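As a minimal sketch of what this looks like in practice (the table and column names are just an example):
-- both sides of the relationship use the same key type: BIGINT
CREATE TABLE customer (
    id BIGINT PRIMARY KEY,
    email VARCHAR(100) NOT NULL
);
CREATE TABLE customer_order (
    id BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customer (id)
);
-- no type conversion is needed to match the rows
SELECT o.id
FROM customer_order o
JOIN customer c ON c.id = o.customer_id;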
Set operations performance
Relational databases support some operations that are often described as set operations (even though they are operations defined by relational algebra):
- UNION;
- INTERSECT;
- EXCEPT or MINUS.
UNION means that the results of a query are added to the results of another query:
(SELECT id, name, phone FROM customer)
UNION ALL
(SELECT id, name, phone FROM supplier);
There will be a single resultset, with one type and one size for every column. In the above example, id, name and phone must be of the same type and size. Some DBMSs are tolerant to small differences, but this means that a conversion must happen. For example, if the first query returns ids of type BIGINT and names of type VARCHAR(100), but the second query has types INTEGER and VARCHAR(50), probably the latter will be converted. With big resultsets, this might affect query speed.
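The implicit conversion is roughly what you would get by writing the casts yourself. A sketch, assuming supplier.id is INTEGER and supplier.name is VARCHAR(50), in standard SQL spelling (some DBMSs name the target types differently, for example SIGNED and CHAR in MySQL/MariaDB):
(SELECT id, name, phone FROM customer)
UNION ALL
(SELECT CAST(id AS BIGINT), CAST(name AS VARCHAR(100)), phone FROM supplier);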
INTERSECT returns the rows that are returned by both the left and right queries. This is equivalent to an INNER JOIN that matches all the columns. EXCEPT returns the rows that are returned by the left query but not by the right query. This is equivalent to an anti-join that matches all the columns. Examples:
-- customers that are also suppliers
(SELECT id, name, phone FROM customer)
INTERSECT DISTINCT
(SELECT id, name, phone FROM supplier);
-- customers that are not suppliers
(SELECT id, name, phone FROM customer)
EXCEPT ALL
(SELECT id, name, phone FROM supplier);
Type mismatches affect INTERSECT and EXCEPT speed in the same way they affect JOINs.
DISTINCT doesn’t need to be specified, as it is the default behaviour. It means that duplicate rows must be eliminated from the final results. ALL means that duplicates are allowed. These modifiers are not affected by type mismatches, because they come into play after conversions.
Automation and Data interoperability
In some cases, other technologies or scripts need to connect to a database to automate some operation with the data. Typical cases are:
- Data anonymisation for staging databases;
- Exporting data to other technologies.
In both cases, the program that connects to the database is probably flexible enough to work with any schema. But it will require that some key columns follow strict naming conventions, so that it can recognise them and do something special with them.
Anonymisation
Staging databases are often recreated nightly from a production backup. But some regulations require that data are anonymised before being exposed to the developers. Data anonymisation consists of replacing personal data (PII) with partially masked data, or random realistic data. An id is normally not considered sensitive information, but data like emails, phone numbers or bank data should not be in a staging database.
Columns that contain sensitive data should be identifiable by their name. While some universal names exist, other names might depend on the specific terminology used by the company, its developers’ mother tongue, and so on. The company should have a dictionary of names for columns that contain personal information, so that an anonymisation script can identify and scrub them.
It’s very important to follow this dictionary. If an anonymisation script expects email columns to be called email, there must not be columns called mail, email2, or email_new – or their data might not be anonymised. And keep in mind that even names like EMAIL, Email or e-mail might cause sensitive columns to be skipped.
It’s also important to pay attention to the data types. If the anonymisation script expects phone numbers to be strings, it could for example replace every digit character with another. But if phone numbers are stored as integers in some tables, the script will likely fail.
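As a minimal sketch of what such a script might run (the customer table and the masking rules here are purely hypothetical):
-- replace each email with a unique, non-identifying placeholder
UPDATE customer SET email = CONCAT('user', id, '@example.com');
-- keep only the country prefix of the phone number, mask the rest
UPDATE customer SET phone = CONCAT(LEFT(phone, 3), 'XXXXXXX');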
Data Exporting
Data are exported from OLTP databases to data warehouses and other data technologies. Data engineers build the data pipelines, exporting the data that is needed, excluding data that is not needed or not allowed, and identifying the columns with special meanings. Identifying the columns with special meanings is not necessarily intuitive for the application developers: for example, in a DWH primary keys are not production’s id columns, but other columns that make sense from a business perspective (see Primary Keys in Data Warehouses). You get the point, right? Data engineers should be able to identify all the columns they care about quickly, just by looking at their names.
Similarly, they might have some automation that filters and transforms the data, but it first needs to identify them. For example, this automation might need to recognise “relation tables” from their names, or it might need to know how to order data chronologically based on some columns’ names and types.
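A pipeline could look for a chronological column by name with a query against the standard information_schema. A sketch, where the created_at naming convention is just an assumption:
-- find, for every table, the column used to order rows chronologically
SELECT table_schema, table_name, column_name, data_type
FROM information_schema.columns
WHERE column_name = 'created_at'
  AND data_type IN ('timestamp', 'datetime');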
Without this level of consistency, data engineers will need to spend time understanding the source data, configuring table and column names, and coding new ways to filter or transform the data. Over time, inconsistency will make their automation more fragile and harder to maintain. Their daily job will also become more tedious and frustrating.
A Bad Solution: Domains
Most traditional DBMSs support domains, a feature included in SQL92 and later standards. Domains define a type and some constraints that can be applied to a set of columns. So, for example, there can be an email domain defined as VARCHAR(100) NOT NULL CHECK (VALUE LIKE '_%@_%.__%'). This domain can then be applied to all columns that are supposed to contain an email.
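In PostgreSQL syntax, for example, the definition might look like this (a sketch; the exact CHECK expression is up to you):
-- a reusable definition for all email columns
CREATE DOMAIN email AS VARCHAR(100)
    NOT NULL
    CHECK (VALUE LIKE '_%@_%.__%');

CREATE TABLE customer (
    id BIGINT PRIMARY KEY,
    email email
);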
In theory, this is a great way to define the type and the set of valid values for columns that should contain the same class of data: this is exactly the goal of domains. Also, domains are defined in the database itself. This means that they completely support the database’s types and constraints, and they don’t require using external technologies for validation.
Unfortunately, domains are a feature designed in the early 90s, when scalability was not a concern. Don’t misunderstand me: I’m not a relational database detractor, and I’m not implying that relational databases don’t scale just because they were designed in the 70s. Relational databases are an example of solid, extensible design that has survived over the decades and will survive for more decades. But this particular feature won’t scale.
Let me give you an example. Suppose you have a domain, and tens of columns associated with that domain. Some of those tables might even have millions of rows. Suppose you need to change the domain’s size, or even its type. This might trigger a massive series of data writes that will happen in a single transaction. Even if you simply change a constraint, the new constraint will generally need to be verified again, in a single transaction. This kind of operation can easily kill your production database.
A Good Solution: Acceptance Tests
Our databases are – or should be – versioned. This means that every schema change is committed to a branch, and a migration of the database corresponds to a pull request. Probably our pull requests are reviewed by one or more colleagues before being accepted.
However, a pull request might also be required to pass some acceptance tests. And if it includes schema changes (diffs in SQL files), these tests should cover such changes. The workflow should be similar to the following:
- Read the up migration. Verify if there is a corresponding down migration.
- Run the up migration against a staging database or, even better, an unused database.
- Run queries against the information_schema to verify that rules are not violated. For example, the script shouldn’t find any table without a primary key, any primary key consisting of more than two columns, etc.
- There should be a dictionary table containing column names and their expected definition (type, size). Columns with unexpected definitions won’t pass the tests. If it makes sense, some names might actually be regular expressions.
- Unknown names should not pass the tests. If they don’t follow the pattern, they need to be fixed. If they are new, a new name needs to be added to the dictionary table. A sketch of such checks follows this list.
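Here’s a minimal sketch of what these checks could look like in MySQL/MariaDB, assuming a hypothetical dictionary table called column_dictionary whose expected_type column holds values like 'varchar(100)':
-- tables without a primary key should fail the test
SELECT t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
    ON c.table_schema = t.table_schema
   AND c.table_name = t.table_name
   AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'my_app'
  AND t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL;

-- columns whose definition doesn't match the dictionary should fail the test
SELECT c.table_name, c.column_name, c.column_type, d.expected_type
FROM information_schema.columns c
JOIN column_dictionary d ON d.column_name = c.column_name
WHERE c.table_schema = 'my_app'
  AND c.column_type <> d.expected_type;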
Other Good Practices
Every non-obvious part of your schema should be documented. Provided that you implement the solution described above, the dictionary table should also document the meaning of the column names and their formats. For example, if a string column contains lists, document the format of these lists (comma-separated words, a JSON array…). If the format is public (eg, EAN-13 or AWS ARN) just specify the documentation’s URL.
All DBMSs allow you to associate a comment with every table and column, and database GUIs will show it. Use this feature, but avoid comments that are just noise. Don’t state that email contains an email; there are other, more valuable pieces of information that you might need to write into the comment. For example:
- Why do we have that column? Maybe it’s legacy stuff that should be removed. Maybe it’s a redundant column that avoids big joins.
- Information that’s beyond a physical data description. Is it verified information? Is there a risk that it’s obsolete? Where does that information come from?
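As a sketch of what this could look like in MariaDB/MySQL syntax (the column and the wording are just an example):
ALTER TABLE customer
    MODIFY billing_email VARCHAR(100) NOT NULL
    COMMENT 'Copied from the CRM nightly; may lag production by up to 24 hours';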
I’m not suggesting that you should do it for all columns, or that you should write a book about every column. But make sure that colleagues who join the company in two years will read the name and understand what they’re seeing. If they can’t, respect their time: give them more details and save them a long investigation.
Conclusions
We saw some reasons why your company should use column names and types consistently. Consistent typing will bring benefits to query performance. Naming and typing consistency will make both the data engineers and OLTP developers more productive.
As your application grows and your team expands, these small inconsistencies compound into major headaches. The schema choices you make today will either empower your team or frustrate them (and yourself) for years. Choose consistency now: establish naming and typing rules in your organisation.
I’d be curious to hear about your experience on this topic. Did you have problems because of schema inconsistency? Have you managed to solve them somehow? Do you recommend some particular tools?
Federico Razzoli