Need some help to clean duplicates in MySQL table which didn’t have constraint

I’ve inherited some MySQL table that was designed without correct constraint so it gets filled with some duplicate rows which I need to remove. The problem that across duplicate rows data isn’t usually consistent, see example below:

idrequest_idguid_iddetailsflag
110fh82ENhelp me1
211fh82EN
312fh82ENassistance required1
412fh82ENassistance required1
513fh82EN
613fh82ENassist me.1
713fh82EN
814fh82EN

Records with id: 1,2,8 perfectly fine. For duplicate records with id 3, 4 I have designed the query below which works fine and removes all duplicates without an issue:

DELETE IR.*
FROM platform.temp IR
WHERE id IN (
    SELECT maxId AS id FROM (
        SELECT MAX(id) as maxId, request_id, guid_id
        FROM platform.temp
        GROUP BY request_id, guid_id
        HAVING COUNT(*) > 1
    ) AS T
);

The problem is records with id 5,6,7. You can see that the same record by (guid_id and request_id) is not consistent. So, my previous query will delete records with content too because of MAX(id). I have designed a query that fixes these records, but we are talking about a huge database and this query is painfully slow:

UPDATE platform.temp AS DEST_T
INNER JOIN (
    SELECT request_id, guid_id, details, flag FROM platform.temp WHERE details IS NOT NULL AND details != ''
) AS SOURCE_T
SET DEST_T.details = SOURCE_T.details, DEST_T.flag = SOURCE_T.flag
    WHERE DEST_T.guid_id = SOURCE_T.guid_id AND DEST_T.request_id = SOURCE_T.request_id;

How can I change my delete query that it will order my subgroup by field details and will select not MAX(id) but first id, so I will be sure that last row in subgroup will always be populated with value and will left?

MySQL version: 5.6.40-log

UPDATE1:
The desired outcome after cleaning the table should be as follow:

idrequest_idguid_iddetailsflag
110fh82ENhelp me1
211fh82EN
312fh82ENassistance required1
613fh82ENassist me.1
814fh82EN

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Use a self join of the table:

DELETE t1
FROM tablename t1 INNER JOIN tablename t2
ON t2.request_id = t1.request_id AND t2.guid_id = t1.guid_id
WHERE (t2.id < t1.id AND COALESCE(t1.details, '') = '')
      OR
      (t2.id > t1.id AND COALESCE(t2.details, '') <> '');

This will keep 1 row for each request_id and guid_id combination, not necessarily the one with the min id.

See the demo.

Another way to do it, with conditional aggregation:

DELETE t1
FROM tablename t1 INNER JOIN (
  SELECT request_id, guid_id,
         MIN(id) min_id,
         MIN(CASE WHEN COALESCE(details, '') <> '' THEN id END) min_id_not_null
  FROM tablename
  GROUP BY request_id, guid_id
) t2 ON t2.request_id = t1.request_id AND t2.guid_id = t1.guid_id
WHERE t1.id <> COALESCE(t2.min_id_not_null, t2.min_id);

This will keep the row with the min id under your conditions, but maybe its performance would not be that good compared to the 1st query.

See the demo.

Method 2

Another way is to emulate the ROW_NUMBER ad then perform the delete operation.

DELETE FROM test
 WHERE id NOT IN (select id
                    from (SELECT id, 
                                 @row_number := CASE WHEN @last_request_id <> x.request_id + x.guid_id
                                                          THEN 1 ELSE @row_number + 1 END AS row_num,
                                 @last_request_id := x.request_id + x.guid_id
                            FROM test x
                           CROSS JOIN (SELECT @row_number := 0, @last_request_id := null, @last_guid_id := null) y
                           ORDER BY request_id, guid_id, details DESC) temp
                   where row_num = 1);

Demo.

Method 3

As i said in the comments i would use it with row_numbers, which in mysql 8 would look much more nicer

CREATE TABLE temp
    (`id` varchar(4), `request_id` varchar(12), `guid_id` varchar(9), `details` varchar(21), `flag` varchar(6))
;
    
INSERT INTO temp
    (`id`, `request_id`, `guid_id`, `details`, `flag`)
VALUES
 
    ('1', '10', 'fh82EN', 'help me', '1'),
    ('2', '11', 'fh82EN', NULL, NULL),
    ('3', '12', 'fh82EN', 'assistance required', '1'),
    ('4', '12', 'fh82EN', 'assistance required', '1'),
    ('5', '13', 'fh82EN', NULL, NULL),
    ('6', '13', 'fh82EN', 'assistance required', '1'),
    ('7', '13', 'fh82EN', NULL, NULL),
    ('8', '14', 'fh82EN', NULL, NULL)
;
DELETE t1
FROM temp t1 INNER JOIN 
(SELECT `id`
, IF(@request = `request_id` AND @guid = guid_id, @rn:= @rn+1,@rn := 1) rn
,@request := `request_id` as request_id
,@guid := guid_id as guid_id
fROM temp,(SELECT @request := 0, @guid := '',@rn := 0) t1
ORDER BY  `guid_id`,`request_id`,`details` DESC, id) t2 ON 
t1.`id` = t2.`id` AND rn > 1
SELECT * FROM temp
id | request_id | guid_id | details             | flag
:- | :--------- | :------ | :------------------ | :---
1  | 10         | fh82EN  | help me             | 1   
2  | 11         | fh82EN  | null                | null
3  | 12         | fh82EN  | assistance required | 1   
6  | 13         | fh82EN  | assistance required | 1   
8  | 14         | fh82EN  | null                | null

db<>fiddle here


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x