Finding duplicate records in a database: the SQL HAVING clause

One issue I run across occasionally is a table with duplicate entries such as two entries for the same company in an accounts payable system. This can create embarrassing problems with billing if ACME Inc #1 is 90 days overdue because someone credited a payment, and now a credit, to ACME Inc #2.

I ran across this problem recently on a health records system which was missing a UNIQUE constraint on a secondary patient ID field: student ID. The application was originally built without that student ID but the client decided later that it wanted it added to the patient record. That was done but without any constraints placed on it because it was an optional field that most clinics probably wouldn’t use. Or so the client thought.

A few months later, the client was so happy with the software that it decided to move all its student records over to the application so that the New Patient flow could import data from the data warehouse as a preload, saving the operator time and errors. The problem was that during those interim months, operators had been manually putting bad data into those student IDs resulting in New Patient importing patients from the warehouse which already existed in the patient table. Voila, two identical patients with differing student IDs as well as two very different patients with the same student ID.

The only correctlive option in this case is to merge the two patients to one, but before you can do that you have to decide which patient record is the “authority” and what data you want to merge from the other record(s). And before you can do that you have to identify those duplicates.

This query will accomplish that. It employs the HAVING filter companion of the GROUP BY result set aggregator. I don’t see many queries employing HAVING, which is a shame because it’s a powerful way to filter grouped data, almost like a WHERE clause.

Here’s an example of a query on a patient table returning all duplicate patients with the same first and last name and student ID:

        SELECT
	    COUNT(*),
	    first_name,
	    last_name,
	    student_id
	FROM patients
	GROUP BY first_name, last_name, student_id
	HAVING COUNT(*) > 1;

Result:

          count | first_name  |   last_name    | student_id
	--------+-------------+----------------+------------
	     2  | KENNETH     | RIVERA         | 268098363
	     2  | ERIC        | SALLES         | 217257892
	     2  | MICHAEL     | LAWRENCE       | 274295927
	     2  | LOUISE      | ROSARIO        |
	     2  | KENDRA      | PRICE          | 224434348

You can, of course, add other conditons to that query to find, for instance, all patients with the same last_name and student_id. But that just gives you a list of the duplicated fields in the table. Most of these results do indeed look like dupes but in the case of LOUISE ROSARIO, is she a dupe or is she two different patients without a student ID? You need to eyeball their records to make sure. What would be nicer is a pointer to the actual records where those dupes exist. That’s a tougher challenge because you can’t add patient_id to this query without breaking the GROUP BY logic.

That can be accomplished with a nested, or sub, query with the above query serving us that search data.

SELECT
    PD.person_id,
    PD.first_name,
    PD.last_name,
    PD.dob,
    PD.student_id
FROM patients AS PD,
(SELECT
        first_name,
        last_name,
        student_id
    FROM patients
    GROUP BY first_name, last_name, student_id
    HAVING COUNT(*) > 1) AS SUB1
WHERE
    PD.first_name = SUB1.first_name
AND PD.last_name = SUB1.last_name
AND PD.student_id = SUB1.student_id
ORDER BY PD.student_id, PD.last_name, PD.first_name;

Results:

patient_id | first_name |   last_name    |    dob     | student_id
-----------+------------+----------------+------------+-----------
     27914 | VICTOR     | TORREDO        | 1995-02-14 | 432019685
     27915 | VICTOR     | TORREDO        | 1995-02-14 | 432019685
      8191 | ESTHER     | WILLIAMS       | 1995-09-23 | 206567040
      4519 | ESTHER     | WILLIAMS       | 1995-09-23 | 206567040
     17236 | THOMAS     | ASHCROFT       | 1995-02-13 | 333279517
      1288 | THOMAS     | ASHCROFT       | 1995-02-13 | 333279517
     17888 | LOUISE     | ROSARIO        | 1987-06-17 |
     11368 | LOUISE     | ROSARIO        | 1972-01-30 |

Aha, they’re obviously two different LOUISE ROSARIOs.

By the way, SQL, being as flexible as it is, would let you write this another way, such as with an explicit JOIN. I leave that as an exercise to the interested user <g>.