How to Find and Delete Duplicate Rows in MySQL with GROUP BY and HAVING Clauses

Shahid Ali
Technical writer
MySQL
19.11.2024
Reading time: 8 min

Databases store vast amounts of structured data, and duplicate entries can accumulate in them inadvertently. These duplicates can appear for a number of reasons, including system errors, data migration mistakes, or repeated user submissions. A database with duplicate entries may suffer from inconsistencies, sluggish performance, and erroneous reporting. This guide covers two methods for locating and removing duplicate rows in MySQL: one based on the GROUP BY and HAVING clauses, and one that uses a temporary table. These techniques help keep your data accurate, clean, and well-organized.

Duplicate rows in MySQL tables clutter your data, leading to inaccurate analytics and wasted storage. Finding and removing them is an important database maintenance task, and this guide walks through how to do it step by step.

A duplicate row is a row whose values match those of another row in every column that is supposed to identify a record. For instance, rows that have the same values in both the userName and userEmail columns of a userDetails table may be considered duplicates, even if their userId values differ.

Benefits of Removing Duplicate Data

Duplicate entries can slow down query performance, take up extra storage space, and produce misleading results in reports and analytics. Removing them keeps the database clean and improves the accuracy and speed of data processing, which is particularly important for databases that are growing or that back critical applications.

Requirements

Before starting, make sure you have MySQL installed on your computer or access to a MySQL database. You should also know the fundamentals of general database concepts and SQL queries, and have a MySQL client or command-line interface available for executing SQL commands.

To gain practical experience, you can create a sample database and table that contains duplicate records so that you can test and comprehend the techniques for eliminating them.

Creating a Test Database

Launch the MySQL command-line tool to create a Test Database.

mysql -u your_username -p

Create a new database called test_dev_db after entering your MySQL credentials.

CREATE DATABASE test_dev_db;

Then, switch to this newly created database:

USE test_dev_db;

Add several rows, including duplicates, to the userDetails table after creating it with the CREATE TABLE query and INSERT query below.

CREATE TABLE userDetails (
    userId INT AUTO_INCREMENT PRIMARY KEY,
    userName VARCHAR(100),
    userEmail VARCHAR(100)
);
INSERT INTO userDetails (userName, userEmail) VALUES
    ('Alisha', '[email protected]'),
    ('Bobita', '[email protected]'),
    ('Alisha', '[email protected]'),
    ('Alisha', '[email protected]');

Using GROUP BY and HAVING to Locate Duplicates

Grouping rows according to duplicate-defining columns and using HAVING to filter groups with more than one record is the simplest method for finding duplicates.

Now that the table contains duplicate data, you can use SQL to determine which rows are duplicated. MySQL's GROUP BY and HAVING clauses make this easier by letting you count the occurrences of each distinct value. The examples use the userDetails table, which contains the columns userId, userName, and userEmail. The GROUP BY clause groups records according to the specified column values, which makes it possible to count the rows in each group. The HAVING clause then filters the groups formed by GROUP BY based on a condition, such as a row count greater than one, so that only the groups containing duplicates remain.

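Structure of the userDetails Table

+--------+----------+-------------------+
| userId | userName | userEmail         |
+--------+----------+-------------------+
|      1 | Alisha   | [email protected] |
|      2 | Bobita   | [email protected] |
|      3 | Alisha   | [email protected] |
|      4 | Alisha   | [email protected] |
+--------+----------+-------------------+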

In the above table userDetails, records with identical userName and userEmail values are considered duplicates.
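Before filtering for duplicates, it can help to see what GROUP BY produces on its own. The query below is a minimal sketch against the sample userDetails table; the alias row_count is an illustrative name, not part of the original schema.

```sql
-- Count how many rows share each (userName, userEmail) combination.
-- row_count is a hypothetical alias chosen for this illustration.
SELECT userName, userEmail, COUNT(*) AS row_count
FROM userDetails
GROUP BY userName, userEmail;
```

With the sample data, this should return one group for Alisha with a row_count of 3 and one group for Bobita with a row_count of 1. Adding a HAVING condition on the count is what narrows the result to the duplicated groups.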

Finding Duplicates

Query to find the duplicate entries:

SELECT userName, userEmail, COUNT(*) as count
FROM userDetails
GROUP BY userName, userEmail
HAVING count > 1;

This query groups rows by userName and userEmail, counts the entries within each group, and filters out groups that contain a single entry (no duplicates).

Explanation:

  • SELECT userName, userEmail, COUNT(*) as count: Retrieves each distinct combination of userName and userEmail together with the number of rows that share it.
  • GROUP BY userName, userEmail: Groups the records by userName and userEmail, so rows with the same values in both columns fall into the same group.
  • COUNT(*): Tallies the rows in each group.
  • HAVING count > 1: Keeps only the groups with more than one record, which are the duplicated entries.

This query will return groups of duplicate records based on the selected columns.

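+----------+-------------------+-------+
| userName | userEmail         | count |
+----------+-------------------+-------+
| Alisha   | [email protected] |     3 |
+----------+-------------------+-------+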
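The grouped result shows which combinations are duplicated, but not which individual rows are involved. As a sketch, the grouped query can be joined back to the table to list every affected row with its userId; the aliases u and d are illustrative.

```sql
-- List every row that belongs to a duplicated (userName, userEmail) group.
-- u and d are hypothetical aliases used only in this example.
SELECT u.userId, u.userName, u.userEmail
FROM userDetails AS u
JOIN (
    SELECT userName, userEmail
    FROM userDetails
    GROUP BY userName, userEmail
    HAVING COUNT(*) > 1
) AS d
  ON u.userName = d.userName
 AND u.userEmail = d.userEmail
ORDER BY u.userName, u.userEmail, u.userId;
```

With the sample data, this should list the three Alisha rows (userId 1, 3, and 4), which makes it easier to decide which copy to keep.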

Eliminating Duplicate Rows

After finding duplicates, you may need to eliminate some records while keeping the unique ones. Joining the table to itself and removing rows with higher userId values is one effective method that preserves the lowest userId for every duplicate.

Use the following SQL query to remove duplicate rows while keeping the entry with the lowest userId.

DELETE u1
FROM userDetails u1
JOIN userDetails u2
  ON u1.userName = u2.userName
 AND u1.userEmail = u2.userEmail
 AND u1.userId > u2.userId;

Explanation:

  • u1 and u2: Aliases for the userDetails table that make the self-join possible.
  • ON u1.userName = u2.userName AND u1.userEmail = u2.userEmail: Matches rows with identical userName and userEmail values.
  • AND u1.userId > u2.userId: Restricts the match to rows that have a higher userId than another row in the same group, so every row except the one with the smallest userId is deleted.
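Before running the DELETE, the same join conditions can be used in a SELECT to preview the rows that would be removed. This is a sketch that reuses the aliases from the DELETE statement; DISTINCT is added because a row can match more than one lower-userId row in its group.

```sql
-- Preview the rows the self-join DELETE would remove, without changing any data.
SELECT DISTINCT u1.userId, u1.userName, u1.userEmail
FROM userDetails u1
JOIN userDetails u2
  ON u1.userName = u2.userName
 AND u1.userEmail = u2.userEmail
 AND u1.userId > u2.userId
ORDER BY u1.userId;
```

With the sample data, this should return the rows with userId 3 and 4, leaving userId 1 as the copy that is kept.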

Because this action cannot be undone once it has been committed, it is advisable to back up your data before beginning the deletion procedure.
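One simple way to keep a safety copy, sketched here under the assumption that a second table in the same database is acceptable as a backup, is to clone the table before deleting anything. The name userDetails_backup is hypothetical.

```sql
-- Create an empty copy with the same columns and indexes as userDetails.
-- userDetails_backup is a hypothetical name chosen for this example.
CREATE TABLE userDetails_backup LIKE userDetails;

-- Copy every row, including the duplicates, into the backup table.
INSERT INTO userDetails_backup
SELECT * FROM userDetails;
```

If the deletion removes more than intended, the original rows are still available in the backup table. For production data, a full dump or snapshot is usually a safer choice than an in-database copy.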

Confirming Duplicate Removal

To confirm that all duplicates have been removed, rerun the identification query from the Finding Duplicates section.

SELECT userName, userEmail, COUNT(*) as count
FROM userDetails
GROUP BY userName, userEmail
HAVING count > 1;

All duplicates have been successfully eliminated if this query yields no rows.

Benefits of Employing GROUP BY and HAVING

The GROUP BY and HAVING clauses serve as vital instruments for the aggregation of data and the filtration of grouped outcomes. These functionalities are especially useful for detecting and handling duplicate entries or for condensing extensive datasets. Below are the primary benefits of employing these clauses.

  • Efficient Identification of Duplicates
  • Data Aggregation and Summarization
  • Filtering Aggregated Results with Precision
  • Versatility Across Multiple Scenarios
  • Compatibility and Simplicity
  • Enhanced Query Readability
  • Support for Complex Aggregations

Together, these qualities make GROUP BY and HAVING a practical default for finding duplicates and summarizing data in everyday database management and data analysis work.

Identifying Duplicates Using a Temporary Table

When dealing with large datasets, it can be easier and more efficient to separate duplicates using a temporary table before deleting them.

Creating the Table

Create a temporary table that stores the duplicate groups, defined by the columns that identify a duplicate (for example, userName together with userEmail).

CREATE TEMPORARY TABLE temp_view_duplicates AS
SELECT userName, userEmail, MIN(userId) AS minuid
FROM userDetails
GROUP BY userName, userEmail
HAVING COUNT(*) > 1;

Explanation:

  • CREATE TEMPORARY TABLE temp_view_duplicates AS: Creates a temporary table named temp_view_duplicates from the result of the SELECT that follows.
  • SELECT userName, userEmail, MIN(userId) AS minuid: Returns each userName and userEmail combination along with the smallest userId in its group, stored as minuid. This is the row that will be kept.
  • GROUP BY userName, userEmail: Groups rows by userName and userEmail.
  • HAVING COUNT(*) > 1: Keeps only groups with more than one row, which identifies the duplicates.

This temporary table will now contain one representative row per duplicate group, holding the smallest userId of that group.
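To check what was captured before deleting anything, the temporary table can be queried like any other table. This is a minimal sketch that assumes temp_view_duplicates was created in the current session.

```sql
-- Inspect the duplicate groups and the userId that will be kept for each.
SELECT userName, userEmail, minuid
FROM temp_view_duplicates
ORDER BY minuid;
```

With the sample data, this should return a single row for Alisha with a minuid of 1.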

Deleting Duplicates from the Main Table

Now that the temp_view_duplicates table lists each duplicated combination once, together with the userId to keep, we can use it to remove the duplicates from userDetails while keeping only the rows with the smallest userId.

Use the following DELETE command:

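DELETE u
FROM userDetails u
JOIN temp_view_duplicates d
  ON u.userName = d.userName
 AND u.userEmail = d.userEmail
WHERE u.userId <> d.minuid;

Explanation:

  • JOIN temp_view_duplicates d ON u.userName = d.userName AND u.userEmail = d.userEmail: Targets only rows that belong to the duplicate groups identified in temp_view_duplicates.
  • WHERE u.userId <> d.minuid: Ensures that only the extra copies (those with higher userId values) are deleted, while the row with the smallest userId is kept.

The statement joins to the temporary table once because MySQL does not allow a TEMPORARY table to be referenced more than once in the same query. A DELETE that reads temp_view_duplicates in two separate subqueries fails with the error "Can't reopen table".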

Verifying Results

To confirm that duplicates have been removed, query the userDetails table:

SELECT * FROM userDetails;

Only unique rows should remain.
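On a larger table, reading every row is impractical, so a count-based check may be more useful. The sketch below compares the total number of rows with the number of distinct userName and userEmail combinations; the aliases total_rows and distinct_pairs are illustrative.

```sql
-- If both numbers match, no (userName, userEmail) combination appears twice.
-- Note: COUNT(DISTINCT ...) skips rows where either column is NULL.
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT userName, userEmail) AS distinct_pairs
FROM userDetails;
```

With the sample data, both values should be 2 once the duplicates are gone. If the columns can contain NULL values, the two numbers can differ even when no duplicates remain, so the GROUP BY and HAVING query remains the more reliable check.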

Temporary tables (CREATE TEMPORARY TABLE) are automatically dropped when the session ends, so they don’t persist beyond the current session.

When making extensive deletions, think about utilizing a transaction to safely commit or undo changes as necessary.
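A transaction-wrapped deletion might look like the following sketch. It assumes userDetails uses a transactional storage engine such as InnoDB, since changes to non-transactional tables such as MyISAM cannot be rolled back. The self-join DELETE from earlier in this guide is used for illustration.

```sql
-- Assumes userDetails is stored in a transactional engine such as InnoDB.
START TRANSACTION;

-- Remove every duplicate except the row with the smallest userId.
DELETE u1
FROM userDetails u1
JOIN userDetails u2
  ON u1.userName = u2.userName
 AND u1.userEmail = u2.userEmail
 AND u1.userId > u2.userId;

-- Inspect the result before making it permanent.
SELECT * FROM userDetails;

-- Keep the changes. If the result looks wrong, run ROLLBACK; instead.
COMMIT;
```

Until COMMIT runs, the deletion is visible only to the current session and can still be undone with ROLLBACK.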

Key Advantages of Using a Temporary Table

  • Lower Complexity: By isolating duplicates, the removal process is simpler and clearer.
  • Enhanced Efficiency: It can be faster for large datasets, because the duplicate groups are computed once and then reused instead of being recalculated during the delete.
  • Improved Readability: Using a temporary table makes the process more modular and easier to understand.

Conclusion

Eliminating duplicate records is essential for maintaining a well-organized database, improving performance, and ensuring accurate reporting. This guide presented two approaches:

  • Direct Method with GROUP BY and HAVING Clauses: Well suited to small datasets, using GROUP BY and HAVING to find duplicates and a self-join to delete them.
  • Temporary Table Approach: Often more efficient for larger datasets, using temporary storage to streamline deletion.

Choose the method that best fits your data size and complexity to keep your database clean and efficient.
