Schedule Meeting

a

Handling duplicate rows in MariaDB and MySQL

by | May 8, 2025 | MariaDB, MySQL

Need Help?  Click Here for Expert Support

Most tables have a concept of “duplicate rows”, where duplicates should never be allowed. Typical examples are:

  • We have a product table, the name column is UNIQUE. If I’ll try to insert a row with a name that already exists, the product I’m trying to insert is a duplicate.
  • Or maybe duplicate names are allowed, as long as the products are in different categories. So “The Avengers” can be the title of a film and the title of a game. The UNIQUE index is on (name, category).

But should happen when the application tries to insert a duplicate? This depends from case to case. The possible effects are the following:

  • The INSERT fails with an error. This is the most common case: there is an error that we must handle. For example, a user tries to sign up with an email that is already in our database.
  • The INSERT silently fails. This might be the case for a batch insertion from a CSV file that happens every night. We are aware that some products already exist, and we don’t care. We’ll just ignore duplicates.
  • The existing row should be automatically modified. This is the case if existing duplicates might hold old information – in all columns, or just in some columns.
  • The row should be completely replaced. Foreign keys and triggers will remove related rows in other tables, and the rows will be re-inserted.

Let’s see how to handle each of these cases in SQL, in a simple way, without re-inventing the wheel. You will save yourself some useless effort, and will avoid bugs in some corner cases.

Duplicate creation should fail with an error

Use cases: new user sign-up, email change

A user is attempting to sign up. The user will entry their email and, optionally, their LinkedIn profile. If they match existing emails or LinkedIn profiles, we want the sign-up process to fail, and we want to show the user an error.

A very similar use case is when a user changes their email. In this case we’ll use an UPDATE, rather than an INSERT.

The anti-pattern: INSERT after SELECT

The anti-pattern is the following:

  • SELECT rows from user where the columns email or linkedin_url contain the values we intend to insert.
  • If the SELECT returns no rows, INSERT the new row, or we run an UPDATE to do so.

Problems with this anti-pattern:

  • We have to run additional SQL queries. This will increase the workload for the database, and will increase the latency for the sign-up process.
  • It’s easy to forget to check one column for duplicates, especially if there are UNIQUE keys.
  • But if we perform all the needed checks, it’s easy to forget to create a UNIQUE index. Duplicates might be entered in other ways.

The good solution: error checking

In most cases, you don’t need to check whether a row already exists. All you need to do is to try to create it. If the operation fails, you’ll handle the error. In this way, you’ll run at least one less query.

Consider this example:

> INSERT INTO user (name, surname, email, linkedin_url) VALUES
    ->     ('Tom', 'Baker', 'do****@******rs.com', NULL)
    -> ;
Query OK, 1 row affected (0.002 sec)

> INSERT INTO user (name, surname, email, linkedin_url) VALUES
    ->     ('Peter', 'Capaldi', 'do****@******ey.com', NULL)
    -> ;
ERROR 1062 (23000): Duplicate entry 'do****@******ey.com' for key 'unq_email'

So all you need to do is to check the error you obtain. If the error number is 1062, you tried to violate a UNIQUE index or the primary key. The last word in the quoted name of the constraint. For an example of how to do this in PHP, see my article Check constraints in MariaDB.

Caveats

With this method, the error message will only tell you the name of one of the constraints that you tried to violate. It could be UNIQUE index, but it could also be a CHECK constraint or a primary key.

Duplicate creation should silently fail

Use case: batch import from an external source

Maybe the contents of a table are periodically imported from an external source, like a CSV file or a REST API. In this case, the import might contain rows that we already have, and we might want to ignore them. But if an INSERT returns an error, the whole transaction will rollback.

The anti-pattern: many INSERTs, many transactions

The intuitive solution is to run many INSERTs, and each of them will be a separate transaction:

INSERT ... ;
INSERT ... ;
INSERT ... ;

This is slower than necessary, and when importing many rows it can be very slow.

The good solution: INSERT IGNORE

Using a single INSERT, when possible, is always better. The syntax is the following:

INSERT IGNORE INTO table_name (a, b, c) VALUES
      ('A', 1, TRUE)
    , ('B', 2, TRUE)
    , ('C', 3, FALSE)
;

When it’s not possible, probably we can at least use a single transaction:

START TRANSACTION;
INSERT IGNORE INTO table_name (a, b, c) VALUES ('A', 1, TRUE);
INSERT IGNORE INTO table_name (a, b, c) VALUES ('B', 2, TRUE);
INSERT IGNORE INTO table_name (a, b, c) VALUES ('C', 3, FALSE);
COMMIT;

As you can see, in both cases we used the IGNORE option. In this way, even if some rows can’t be inserted because they violate a UNIQUE key (or any other constraint), other rows will be inserted correctly. For each row that can’t be inserted, we will receive a warning. This will allow us to handle the duplicates, if we need to.

Caveats

We fail to insert some rows because they violate other types of constraints, like CHECK constraint. We might not want to ignore these errors. In this case, our application should check the warnings, or they should be logged so that they can be examined later.

Automatically modifying existing rows

Use case: Docker containers autodiscovery

We write in a table a container that might be newly discovered or not, and its current status. If the container already exists in the table, its status must be updated.

The anti-pattern: SELECT + (INSERT or UPDATE)

The most common anti-pattern consists in checking whether the container exists (with a SELECT) and, based on this, running an INSERT to create the row, or running an UPDATE to change properties that might have been changed, like the container state. Two SQL statements will be sent for each container found. Also, a DELETE will be sent to delete containers that don’t exist anymore. All these statements should happen in a single transaction.

In this way, many more queries than necessary will be run. The code will also have to implement the upset logic. Bugs are not unlikely.

The good solution: INSERT … ON DUPLICATE KEY UPDATE

MariaDB and MySQL’s command for an upsert operation is INSERT ... ON DUPLICATE KEY UPDATE. This command can try to insert many rows in one shot. For duplicate rows, it will update the columns that we wish to update. Here’s an example:

START TRANSACTION;
INSERT INTO container (id, image, status, last_seen)
    VALUES
    ('146e9e7e6fc5',  'mariadb',     'UP',  NOW()),
    ('32895c057cae',  'apache',      'UP',  NOW()),
    ('2d2cd5859434',  'prometheus',  'UP',  NOW())
  ON DUPLICATE KEY UPDATE status = VALUE(status), last_checked = VALUE(last_seen);
DELETE FROM container WHERE last_seen < (INTERVAL last_seen = 1 HOUR);
COMMIT;

VALUE(column_name) means: the value that I tried to insert.

We might also use the value that already exists in the column. For example, we can say: if the row exists, increment num by 1:

... ON DUPLICATE KEY UPDATE num = num + 1;

We can also mix bare column names and VALUE() calls:

... ON DUPLICATE KEY UPDATE num = num + VALUE(num);

When we try to create a row with num=5, if the row already exists, its value will be incremented by 5.

Caveats

None.

Replacing existing rows

Use case: An adventure game

Imagine that our MariaDB instance backs a game. It’s an D&D-like game, and the player manages a team of four warriors (or less, if some warriors die). The player also has objects that were found during the adventures. Each object can be discarded, or given to a warrior. So, we have three tables: warrior, object, warrior_object. The player may also encounter a new warrior and welcome it in the team, but in this case one of the other warriors will say goodbye. All the objects carried by the leaver will be lost.

To resume, these are the actions that possibly imply replacing an existing row:

  • Assign an object to a warrior. If the object was already assigned to another warrior, a row in warrior_object needs be deleted. In both cases, a new row needs be inserted.
  • Welcome a new warrior. Since the number of warriors per player is limited to four, warriors can have a player_id and a progressive number from 1 to 4. So, for example, we can insert a warrior table row with player_id=24 and warrior_number=1, after deleting it if it already exists.

The anti-patterns

  1. We can use a SELECT to know if a row exists, optionally DELETE it, and then INSERT its replacement. All these statements must happen in a transaction.
  2. A slightly better version of this operation is to skip the SELECT and run DELETE in any case: it might delete a row (if it exists) or it might do nothing (if the row doesn’t exist). In the first case, we’ll run one less statement. If we have proper foreign keys, they will delete the related rows in other tables for us.
  3. Another option is to use UPDATE. So, to logically insert or replace a warrior, we can use INSERT in case the player has less than four warriors; or UPDATE to change all its attributes. Then we should use DELETE to delete the leaver’s objects. Again, these statements should occur in a single transaction.
  4. A better version consists in using INSERT ... ON DUPLICATE KEY UPDATE instead of choosing between INSERT or UPDATE. Then, we’ll run a DELETE on warrior_object. We’ll use a single transaction.

In general, all these options imply re-inventing the wheel in the application and running more queries than we should. But we’ll see that we might have to use one of these options. In this case, the second and the fourth options are usually the best. The second should be preferred in most cases.

The good solution: REPLACE

Most DBMSs, including MariaDB and MySQL, support the REPLACE statement. It’s essentially an INSERT, preceded by a DELETE in case a duplicate row exists.

How is it different from an UPDATE or an INSERT ... ON DUPLICATE KEY UPDATE?

  • Foreign keys will follow the ON DELETE action. If it’s CASCADE, the linked rows in other tables will be deleted. If it’s RESTRICT, the REPLACE statement will fail if linked rows exist.
  • BEFORE and AFTER DELETE triggers will run, if a duplicate row exists and we’re deleting it. Then, BEFORE and AFTER INSERT triggers will run.

The API has an affected_rows() function that tells us how many rows were written by the last statement. Similarly, SQL has a ROW_COUNT() statement. Both these methods will return 2 if a row was deleted and a new row was inserted, and will return 1 if a row was inserted and no row is deleted.

Caveats

We might have several foreign keys. If at least one of them is of type ON DELETE RESTRICT, our REPLACE will fail. Of course, ON DELETE CASCADE will propagate the deletion to another table, and this might be what we want. But be aware that all foreign keys of this type will run the CASCADE action. There is no way to selectively propagate the deletion to some tables, and not to others.

Triggers might also be a problem. Depending on why they exist and what they do, we might not want to run DELETE triggers, INSERT triggers, or any triggers at all, when we run REPLACE. But this is just impossible. We might set a session variable like @skip_triggers, and the triggers might exit immediately when this variable is set to TRUE. But it’s a good practice to keep triggers as simple as possible or, even better, avoid them.

On non-transactional tables (eg, MEMORY or MyISAM), REPLACE is risky: it might succeed in deleting an old row, but fail at creating the new row. But this is just another reason to use InnoDB, except exceptional cases when data consistency is not vital.

The foreign key and the trigger caveats are often good reasons to avoid REPLACE.

A note on creating UNIQUE indexes

All the above hints will only work if UNIQUE indexes exist in your table. But sometimes, when we create a table we don’t know that a column should be unique. Later we are told that duplicates should not be allowed. But they already exist, so attempting to create a UNIQUE index will fail: we should first remove them, or handle them somehow.

To find exiting duplicates, see my article How to delete duplicate rows in MariaDB.

Conclusions

With very basic SQL statement and some coding skills, you can handle pretty much any situation. But this doesn’t mean that always using basic SQL is a good idea. You’ll have to produce more code to implement some boring logic, which is a good receipt to develop buggy applications. And you’ll have to run more SQL statements, which is bad for performance in many ways. First of all, the overhead of client-server communications between your application and the database will increase the latency of your application. You will also lock rows for more time, making scalability problems more likely. And you will consume more resources on your database, increasing the costs on the cloud.

Did you know that we hold an SQL Optimisation training for developers? Subscribe, and you will learn how to write highly performant, scalable transactions and SQL queries!

 

All content in this blog is distributed under the CreativeCommons Attribution-ShareAlike 4.0 International license. You can use it for your needs and even modify it, but please refer to Vettabase and the author of the original post. Read more about the terms and conditions: https://creativecommons.org/licenses/by-sa/4.0/

About Federico Razzoli
Federico Razzoli is a database professional, with a preference for open source databases, who has been working with DBMSs since year 2000. In the past 20+ years, he served in a number of companies as a DBA, Database Engineer, Database Consultant and Software Developer. In 2016, Federico summarized his extensive experience with MariaDB in the “Mastering MariaDB” book published by Packt. Being an experienced database events speaker, Federico speaks at professional conferences and meetups and conducts database trainings. He is also a supporter and advocate of open source software. As the Director of Vettabase, Federico does business worldwide but prefers to do it from Scotland where he lives.

Recent Posts

When indexes make SQL queries slower

When indexes make SQL queries slower

During our SQL optimisation training courses, I always stress the need of understanding how databases run SQL queries internally. Which is easier and more intuitive than you'd think, anyway. But if you overlook this aspect, there is always a risk of sticking to simple...

MariaDB SEQUENCE: a Simple Approach to Synthetic Data

MariaDB SEQUENCE: a Simple Approach to Synthetic Data

MariaDB SEQUENCE is a storage engine that generates a sequence of positive integer numbers. However, in this article I will show you that it's easy to use SEQUENCE to generate more complex sequences, that are not necessarily numeric. This is a very convenient way to...

Stored Procedures in MariaDB: Smarter, Easier and More Powerful

Stored Procedures in MariaDB: Smarter, Easier and More Powerful

MariaDB supports stored procedures written in procedural SQL. Which essentially means "SQL with IF's and loops". Most DBMSs do the same, but every project supports different, incompatible syntaxes. MariaDB implemented Oracle's dialect, called PL/SQL. The base of the...

Services

Need Help?  Click Here for Expert Support

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *