Deleting duplicate values can be necessary, for example, when we realise that a column (or a combination of columns) should be UNIQUE
. If the column existed in production for some time, it’s possible that it now contains some duplicate values, so trying to create a UNIQUE
index will fail:
MariaDB [test]> ALTER TABLE person ADD UNIQUE unq_email (email);
ERROR 1062 (23000): Duplicate entry 'jo**@sm***.com' for key 'unq_email'
There are two cases here.
- We can delete duplicate values without any criteria – that is: if there are two rows with
email='jo**@sm***.com'
we want to delete one of them, and we don’t have a reason to choose one or another. - Or we may need to delete rows based on some criteria – for example, we may choose to delete the newest rows, or prefer to delete the rows where a column is empty.
Forcing a UNIQUE index creation
Forcing the creation of a UNIQUE
index is the easiest way to delete duplicates without any criteria. In other words, you’ll remove duplicates, but you don’t know which rows will survive. We can do it this way:
ALTER IGNORE TABLE person ADD UNIQUE unq_email (email);
The IGNORE
keyword tells MariaDB to ignore (delete) duplicate rows and create the index.
Selectively deleting duplicate rows
First I’ll explain what we need to do, so that the logic will be clear. Then I’ll show example queries.
Theory
We want to do the following:
- Order the rows by the column that contains duplicates that we need to eliminate. The duplicates will be “grouped” together.
- Add a secondary order so that, the duplicate we choose to preserve appears first.
- Add a progressive number to the duplicates, so we can delete those with a progressive number greater than 1.
Practice
First, we need a query that makes it clear which rows are duplicate:
SELECT
email, full_name,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY email, full_name
) AS row_number
FROM person
GROUP BY email, full_name
;
ROW_NUMBER()
is a window function. In the simplest case, it will return a progressive number for all returned rows, following the specified order. With PARTITION BY email
, it will reset the count every time it finds a different value for email
. For example:
SELECT
p.id, p.email,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY p.email, p.id
) AS row_number
FROM person p
GROUP BY p.email, p.id
;
For example:
+----+----------------+------------+
| id | email | row_number |
+----+----------------+------------+
| 4 | do****@wh*.com | 1 |
| 5 | do****@wh*.com | 2 |
| 1 | jo**@sm***.com | 1 |
| 2 | jo**@sm***.com | 2 |
| 3 | jo**@sm***.com | 3 |
+----+----------------+------------+
We can delete the rows with a row_number
greater than 1:
DELETE person
FROM person
INNER JOIN (
SELECT
p.id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY p.email, p.id
) AS row_number
FROM person p
GROUP BY p.email, p.id
) dup
ON person.id = dup.id
WHERE dup.row_number > 1
;
Now we want to add some logic.
Delete newest rows
To delete the newest rows and preserve the oldest, we can modify the window function ORDER BY
clause:
DELETE person
FROM person
INNER JOIN (
SELECT
p.id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY
p.email,
p.registration_date,
p.id
) AS row_number
FROM person p
GROUP BY p.email, p.id
) dup
ON person.id = dup.id
WHERE dup.row_number > 1
;
Delete non-empty values first
To try preserving a row with a non-empty full_name
:
DELETE person
FROM person
INNER JOIN (
SELECT
p.id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY
p.email,
p.full_name > '' DESC,
p.id
) AS row_number
FROM person p
GROUP BY p.email, p.id
) dup
ON person.id = dup.id
WHERE dup.row_number > 1
;
Note that p.full_name > '' DESC
will cause the values that are not NULL
and not empty to be returned first.
Before MariaDB 10.2
Window functions support was added to MariaDB 10.2. Older versions do not support window functions.
When working with older versions, we can use a query like this to find values that occur multiple times:
SELECT email, COUNT(*) AS count
FROM person
GROUP BY email
HAVING COUNT(*) > 1
;
If we want to get more information about duplicate rows, we can use a query like this:
SELECT p.*
FROM person p
INNER JOIN (
SELECT email, COUNT(*) AS count
FROM person
GROUP BY email
HAVING COUNT(*) > 1
) dup
ON p.id = dup.id
;
Conclusions
Deleting duplicate values is a common problem when creating a UNIQUE
index. While MariaDB allows to “brutally” delete duplicates without any criteria, that is not always acceptable. Before version 10.2 window functions were not supported, so there were no easy solutions.
With modern MariaDB versions, we can order the rows so that the one we care about appears first, and then thanks to ROW_NUMBER()
we can delete the duplicates.
To master advanced SQL and query optimisation, consider our SQL optimisation training for teams.
Federico Razzoli
0 Comments