Most tables have a concept of “duplicate rows”, where duplicates should never be allowed. Typical examples are:
- We have a `product` table, and the `name` column is `UNIQUE`. If I try to insert a row with a `name` that already exists, the product I’m trying to insert is a duplicate.
- Or maybe duplicate names are allowed, as long as the products are in different categories. So “The Avengers” can be the title of a film and the title of a game. In this case, the `UNIQUE` index is on `(name, category)`.
But what should happen when the application tries to insert a duplicate? It depends on the case. The possible effects are the following:
- The `INSERT` fails with an error. This is the most common case: there is an error that we must handle. For example, a user tries to sign up with an email that is already in our database.
- The `INSERT` silently fails. This might be the case for a batch insertion from a CSV file that happens every night. We are aware that some products already exist, and we don’t care. We’ll just ignore duplicates.
- The existing row should be automatically modified. This is the case if existing duplicates might hold old information – in all columns, or just in some columns.
- The row should be completely replaced. Foreign keys and triggers will remove related rows in other tables, and the rows will be re-inserted.
Let’s see how to handle each of these cases in SQL, in a simple way, without re-inventing the wheel. You will save yourself some useless effort, and will avoid bugs in some corner cases.
Duplicate creation should fail with an error
Use cases: new user sign-up, email change
A user is attempting to sign up. The user will enter their email and, optionally, their LinkedIn profile. If either matches an existing email or LinkedIn profile, we want the sign-up process to fail, and we want to show the user an error.
A very similar use case is when a user changes their email. In this case we’ll use an `UPDATE` rather than an `INSERT`.
The anti-pattern: INSERT after SELECT
The anti-pattern is the following:
- `SELECT` rows from `user` where the columns `email` or `linkedin_url` contain the values we intend to insert.
- If the `SELECT` returns no rows, run the `INSERT` to create the new row (or the `UPDATE`, in the email change case).
Problems with this anti-pattern:
- We have to run additional SQL queries. This will increase the workload for the database, and will increase the latency of the sign-up process.
- It’s easy to forget to check one column for duplicates, especially if there are several `UNIQUE` keys.
- Even if we perform all the needed checks, it’s easy to forget to create a `UNIQUE` index. Duplicates might then be entered in other ways.
The good solution: error checking
In most cases, you don’t need to check whether a row already exists. All you need to do is to try to create it. If the operation fails, you’ll handle the error. In this way, you’ll run at least one less query.
Consider this example:
> INSERT INTO user (name, surname, email, linkedin_url) VALUES
-> ('Tom', 'Baker', 'do****@******rs.com', NULL)
-> ;
Query OK, 1 row affected (0.002 sec)
> INSERT INTO user (name, surname, email, linkedin_url) VALUES
-> ('Peter', 'Capaldi', 'do****@******ey.com', NULL)
-> ;
ERROR 1062 (23000): Duplicate entry 'do****@******ey.com' for key 'unq_email'
So all you need to do is check the error you obtain. If the error number is 1062, you tried to violate a `UNIQUE` index or the primary key. The last word of the error message is the quoted name of the violated constraint. For an example of how to do this in PHP, see my article Check constraints in MariaDB.
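The pattern can be sketched in a few lines of Python. SQLite is used here only for portability: it raises `IntegrityError` where a MariaDB connector would report error number 1062, but the logic – just try the `INSERT` and handle the failure – is the same. The table, function, and email value are hypothetical.

```python
import sqlite3

# A sketch of "try the INSERT, handle the error" (no prior SELECT needed).
# With MariaDB you would catch your connector's duplicate-key error (1062)
# instead of sqlite3.IntegrityError.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")

def sign_up(email):
    try:
        conn.execute("INSERT INTO user (email) VALUES (?)", (email,))
        return "ok"
    except sqlite3.IntegrityError:
        # Duplicate email (or another constraint violation): report it
        return "duplicate"

print(sign_up("someone@example.invalid"))   # ok
print(sign_up("someone@example.invalid"))   # duplicate
```

Note that this needs no extra round trip to the database, and it is race-free: two concurrent sign-ups with the same email cannot both succeed, which the SELECT-then-INSERT approach cannot guarantee without locking.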
Caveats
With this method, the error message will only tell you the name of one of the constraints that you tried to violate. It could be a `UNIQUE` index, but it could also be a `CHECK` constraint or the primary key.
Duplicate creation should silently fail
Use case: batch import from an external source
Maybe the contents of a table are periodically imported from an external source, like a CSV file or a REST API. In this case, the import might contain rows that we already have, and we might want to ignore them. But if an `INSERT` returns an error, the whole transaction will roll back.
The anti-pattern: many INSERTs, many transactions
The intuitive solution is to run many INSERTs, and each of them will be a separate transaction:
INSERT ... ;
INSERT ... ;
INSERT ... ;
This is slower than necessary, and when importing many rows it can be very slow.
The good solution: INSERT IGNORE
Using a single `INSERT`, when possible, is always better. The syntax is the following:
INSERT IGNORE INTO table_name (a, b, c) VALUES
('A', 1, TRUE)
, ('B', 2, TRUE)
, ('C', 3, FALSE)
;
When that’s not possible, we can probably at least use a single transaction:
START TRANSACTION;
INSERT IGNORE INTO table_name (a, b, c) VALUES ('A', 1, TRUE);
INSERT IGNORE INTO table_name (a, b, c) VALUES ('B', 2, TRUE);
INSERT IGNORE INTO table_name (a, b, c) VALUES ('C', 3, FALSE);
COMMIT;
As you can see, in both cases we used the `IGNORE` option. In this way, even if some rows can’t be inserted because they violate a `UNIQUE` key (or any other constraint), the other rows will be inserted correctly. For each row that can’t be inserted, we will receive a warning. This will allow us to handle the duplicates, if we need to.
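For a quick way to experiment with this behaviour, here is a sketch using SQLite’s `INSERT OR IGNORE`, the equivalent of MariaDB’s `INSERT IGNORE` (table and data are made up for the example):

```python
import sqlite3

# Simulate a batch import where one of the rows already exists.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT PRIMARY KEY, price INT)")
conn.execute("INSERT INTO product VALUES ('A', 1)")

rows = [('A', 99), ('B', 2), ('C', 3)]  # 'A' is a duplicate
conn.executemany(
    "INSERT OR IGNORE INTO product (name, price) VALUES (?, ?)", rows
)

# The duplicate 'A' was silently skipped: its price is untouched,
# while 'B' and 'C' were inserted normally.
print(conn.execute("SELECT COUNT(*) FROM product").fetchone()[0])          # 3
print(conn.execute("SELECT price FROM product WHERE name='A'").fetchone()[0])  # 1
```

With MariaDB, after the statement you could run `SHOW WARNINGS` to see which rows were skipped.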
Caveats
We might fail to insert some rows because they violate other types of constraints, like a `CHECK` constraint. We might not want to ignore those errors. In this case, our application should check the warnings, or the warnings should be logged so that they can be examined later.
Automatically modifying existing rows
Use case: Docker containers autodiscovery
We write in a table a container that might be newly discovered or not, and its current status. If the container already exists in the table, its status must be updated.
The anti-pattern: SELECT + (INSERT or UPDATE)
The most common anti-pattern consists in checking whether the container exists (with a `SELECT`) and, based on the result, running an `INSERT` to create the row, or an `UPDATE` to change properties that might have changed, like the container state. Two SQL statements will be sent for each container found. Also, a `DELETE` will be sent to remove containers that don’t exist anymore. All these statements should happen in a single transaction.
In this way, many more queries than necessary will be run. The code will also have to implement the upsert logic itself. Bugs are not unlikely.
The good solution: INSERT … ON DUPLICATE KEY UPDATE
MariaDB and MySQL’s command for an upsert operation is `INSERT ... ON DUPLICATE KEY UPDATE`. This command can try to insert many rows in one shot. For duplicate rows, it will update the columns that we wish to update. Here’s an example:
START TRANSACTION;
INSERT INTO container (id, image, status, last_seen)
VALUES
('146e9e7e6fc5', 'mariadb', 'UP', NOW()),
('32895c057cae', 'apache', 'UP', NOW()),
('2d2cd5859434', 'prometheus', 'UP', NOW())
ON DUPLICATE KEY UPDATE status = VALUE(status), last_seen = VALUE(last_seen);
DELETE FROM container WHERE last_seen < NOW() - INTERVAL 1 HOUR;
COMMIT;
`VALUE(column_name)` means: the value that I tried to insert.
We might also use the value that already exists in the column. For example, we can say: if the row exists, increment `num` by 1:
... ON DUPLICATE KEY UPDATE num = num + 1;
We can also mix bare column names and `VALUE()` calls:
... ON DUPLICATE KEY UPDATE num = num + VALUE(num);
When we try to create a row with `num=5`, if the row already exists, its value will be incremented by 5.
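The same counter-increment upsert can be tried out with SQLite, whose `ON CONFLICT ... DO UPDATE` clause plays the role of MariaDB’s `ON DUPLICATE KEY UPDATE`, with the `excluded` pseudo-table standing in for `VALUE()` (table and column names here are invented for the sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counter (name TEXT PRIMARY KEY, num INT)")

def add(name, n):
    # Upsert: insert the row, or add n to the existing value.
    # excluded.num is "the value I tried to insert" (MariaDB's VALUE(num));
    # the bare num refers to the value already stored in the row.
    conn.execute(
        "INSERT INTO counter (name, num) VALUES (?, ?) "
        "ON CONFLICT(name) DO UPDATE SET num = num + excluded.num",
        (name, n),
    )

add("hits", 5)   # row doesn't exist: inserted with num=5
add("hits", 5)   # row exists: num becomes 5 + 5
print(conn.execute("SELECT num FROM counter WHERE name='hits'").fetchone()[0])  # 10
```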
Caveats
None.
Replacing existing rows
Use case: An adventure game
Imagine that our MariaDB instance backs a game. It’s a D&D-like game, and the player manages a team of four warriors (or fewer, if some warriors die). The player also has objects that were found during the adventures. Each object can be discarded, or given to a warrior. So, we have three tables: `warrior`, `object`, `warrior_object`. The player may also encounter a new warrior and welcome them into the team, but in this case one of the other warriors will say goodbye. All the objects carried by the leaver will be lost.
To recap, these are the actions that might imply replacing an existing row:
- Assign an object to a warrior. If the object was already assigned to another warrior, a row in `warrior_object` needs to be deleted. In both cases, a new row needs to be inserted.
- Welcome a new warrior. Since the number of warriors per player is limited to four, warriors can have a `player_id` and a progressive number from 1 to 4. So, for example, we can insert a `warrior` table row with `player_id=24` and `warrior_number=1`, after deleting it if it already exists.
The anti-patterns
- We can use a `SELECT` to find out whether a row exists, optionally `DELETE` it, and then `INSERT` its replacement. All these statements must happen in a transaction.
- A slightly better version of this operation is to skip the `SELECT` and run the `DELETE` in any case: it might delete a row (if it exists) or do nothing (if the row doesn’t exist). In the first case, we’ll run one less statement. If we have proper foreign keys, they will delete the related rows in other tables for us.
- Another option is to use `UPDATE`. So, to logically insert or replace a warrior, we can use `INSERT` in case the player has fewer than four warriors, or `UPDATE` to change all its attributes. Then we should use `DELETE` to delete the leaver’s objects. Again, these statements should occur in a single transaction.
- A better version consists in using `INSERT ... ON DUPLICATE KEY UPDATE` instead of choosing between `INSERT` and `UPDATE`. Then, we’ll run a `DELETE` on `warrior_object`. We’ll use a single transaction.
In general, all these options imply re-inventing the wheel in the application and running more queries than we should. But we’ll see that we might have to use one of these options. In this case, the second and the fourth options are usually the best. The second should be preferred in most cases.
The good solution: REPLACE
Some DBMSs, including MariaDB and MySQL, support the `REPLACE` statement. It’s essentially an `INSERT`, preceded by a `DELETE` in case a duplicate row exists.
How is it different from an `UPDATE` or an `INSERT ... ON DUPLICATE KEY UPDATE`?
- Foreign keys will follow the `ON DELETE` action. If it’s `CASCADE`, the linked rows in other tables will be deleted. If it’s `RESTRICT`, the `REPLACE` statement will fail if linked rows exist.
- `BEFORE` and `AFTER DELETE` triggers will run, if a duplicate row exists and we’re deleting it. Then, `BEFORE` and `AFTER INSERT` triggers will run.
The API has an `affected_rows()` function that tells us how many rows were written by the last statement. Similarly, SQL has a `ROW_COUNT()` function. Both will return 2 if a row was deleted and a new row was inserted, and 1 if a row was inserted and no row was deleted.
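The delete-then-insert semantics are easy to demonstrate with SQLite, which also supports `REPLACE` (the table and data are invented for the sketch; note that SQLite’s `rowcount` reporting differs from MariaDB’s `affected_rows()`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warrior (name TEXT PRIMARY KEY, level INT, note TEXT)")
conn.execute("INSERT INTO warrior VALUES ('Conan', 9, 'veteran')")

# REPLACE deletes the duplicate row, then inserts the new one.
# Columns we don't mention are NOT preserved: they get their defaults
# (NULL here), unlike with UPDATE or ON DUPLICATE KEY UPDATE.
conn.execute("REPLACE INTO warrior (name, level) VALUES ('Conan', 10)")

row = conn.execute(
    "SELECT level, note FROM warrior WHERE name='Conan'"
).fetchone()
print(row)  # (10, None) -- the old 'note' value is gone
```

This is the key difference to keep in mind: `REPLACE` replaces the whole row, it does not merge the new values into the old ones.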
Caveats
We might have several foreign keys. If at least one of them is of type `ON DELETE RESTRICT`, our `REPLACE` will fail. Of course, `ON DELETE CASCADE` will propagate the deletion to another table, and this might be what we want. But be aware that all foreign keys of this type will run the `CASCADE` action. There is no way to selectively propagate the deletion to some tables, and not to others.
Triggers might also be a problem. Depending on why they exist and what they do, we might not want to run `DELETE` triggers, `INSERT` triggers, or any triggers at all, when we run `REPLACE`. But this is simply not possible. We might set a session variable like `@skip_triggers`, and the triggers might exit immediately when this variable is set to `TRUE`. But it’s a good practice to keep triggers as simple as possible or, even better, to avoid them.
On non-transactional tables (e.g. MEMORY or MyISAM), `REPLACE` is risky: it might succeed in deleting the old row, but fail at creating the new one. This is just another reason to use InnoDB, except in exceptional cases when data consistency is not vital.
A note on creating UNIQUE indexes
All the above hints will only work if `UNIQUE` indexes exist in your table. But sometimes, when we create a table, we don’t know that a column should be unique. Later we are told that duplicates should not be allowed. But duplicates already exist, so attempting to create a `UNIQUE` index will fail: we should first remove them, or handle them somehow.
To find existing duplicates, see my article How to delete duplicate rows in MariaDB.
Conclusions
With very basic SQL statements and some coding skills, you can handle pretty much any situation. But this doesn’t mean that always using basic SQL is a good idea. You’ll have to write more code to implement some boring logic, which is a good recipe for buggy applications. And you’ll have to run more SQL statements, which is bad for performance in many ways. First of all, the overhead of client-server communication between your application and the database will increase the latency of your application. You will also lock rows for more time, making scalability problems more likely. And you will consume more resources on your database, increasing your costs in the cloud.
Did you know that we hold an SQL Optimisation training for developers? Subscribe, and you will learn how to write highly performant, scalable transactions and SQL queries!