Table of Contents
- 1. Introduction
- 2. Three Basic Patterns for Removing Duplicate Elements in Python
- 3. Processing when elements such as lists are “non-hashable”
- 4. Advanced Topics: Extract Duplicates Only or Count Occurrences
- 5. Removing Duplicates in Pandas DataFrames
- 6. Performance Comparison and Choosing the Optimal Solution
- 7. Summary
- FAQ (Frequently Asked Questions)
  - Q1. Which is faster, set or dict.fromkeys?
  - Q2. How can I remove duplicates while preserving the original list order?
  - Q3. What should I do when a list contains other lists or dictionaries, causing errors with set or dict.fromkeys?
  - Q4. How can I remove duplicates in pandas based on a specific column?
  - Q5. How can I delete all duplicate elements and keep only values that appear once?
  - Q6. How can I check for duplicate rows in a DataFrame instead of deleting them?
  - Q7. What should I watch out for when removing duplicates from data sets of hundreds of thousands of records?
  - Q8. Why is duplicate removal necessary?
1. Introduction
In data analysis and programming contexts, removing duplicate data is a fundamental, practically unavoidable operation. The need to remove duplicate elements from lists, arrays, or DataFrames in Python arises for a wide range of users, from beginners to advanced practitioners. For example, after scraping large amounts of data from the web or when loading a CSV file, it is not uncommon for the same values or rows to appear repeatedly. Leaving such duplicates unchecked can lead to various problems, such as inaccurate aggregation results and unnecessary processing. Python provides a variety of methods for removing duplicates, using both built-in features and external libraries. This article covers everything from basic techniques for deduplication in Python to advanced usage and pitfalls. It is written to be clear for programming beginners while also being useful for advanced users in professional settings, with concrete examples and key points for lists, arrays, and DataFrames. We answer questions such as “Which method should I choose?” and “Why does the order change?” while presenting practical know-how that can be applied immediately in real-world scenarios. If you want to remove duplicate data efficiently with Python, use this article as a reference.
2. Three Basic Patterns for Removing Duplicate Elements in Python
When removing duplicate elements from lists, arrays, and similar structures in Python, three main methods are commonly used. Each has its own characteristics, so it’s important to choose the appropriate one based on your goals and situation.
2.1 Removing Duplicates with set (order-nonpreserving method)
The simplest and most intuitive approach is to use Python’s built-in type “set”. A set is a collection that does not allow duplicates, so converting a list to a set automatically removes duplicate elements.
numbers = [1, 2, 2, 3, 4, 4, 5]
unique_numbers = list(set(numbers))
print(unique_numbers) # Example output: [1, 2, 3, 4, 5]
However, a set does not preserve element order. Therefore, if you need to retain the original list order, you’ll need a different approach.
2.2 Order-preserving removal using dict.fromkeys
Since Python 3.7, dictionaries preserve the insertion order of their keys. This enables a duplicate-removal trick using dict.fromkeys(). With this method, you can eliminate duplicates without disturbing the original list order.
numbers = [1, 2, 2, 3, 4, 4, 5]
unique_numbers = list(dict.fromkeys(numbers))
print(unique_numbers) # Example output: [1, 2, 3, 4, 5]
Because it provides order-preserving duplicate removal in a single concise line, it’s widely used.
2.3 List comprehension + set for order preservation and flexibility
Another common technique combines a list comprehension with a set to remove duplicates while preserving order. This approach is advantageous because it easily accommodates more flexible conditions and complex data structures.
numbers = [1, 2, 2, 3, 4, 4, 5]
seen = set()
unique_numbers = [x for x in numbers if not (x in seen or seen.add(x))]
print(unique_numbers) # Example output: [1, 2, 3, 4, 5]
With this method, you can explicitly express the “skip elements that have already appeared” logic, which makes it easy to add extra conditions to the duplicate check. Note that the seen set itself still requires hashable values; non-hashable elements are handled in the next section.
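For example, here is a minimal sketch of that kind of conditional duplicate removal, treating strings that differ only in letter case as the same value; the word list is made up purely for illustration.
words = ['Apple', 'apple', 'Banana', 'BANANA', 'cherry']
seen = set()
unique_words = []
for word in words:
    key = word.lower()  # normalize before the duplicate check
    if key not in seen:
        seen.add(key)
        unique_words.append(word)  # keep the first spelling encountered
print(unique_words)  # Example output: ['Apple', 'Banana', 'cherry']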
3. Processing when elements such as lists are “non-hashable”
The duplicate-removal techniques using set or dict introduced above work for hashable elements such as numbers, strings, and tuples. However, lists and dictionaries are non-hashable objects, so a list that contains other lists (e.g., a two-dimensional list) or a list of dictionaries cannot be converted to a set or used as dict keys directly. A bit of extra work is therefore required. For example, consider the following two-dimensional list.
data = [[1, 2], [3, 4], [1, 2], [5, 6]]
In this case, using set or dict.fromkeys() will raise a TypeError (unhashable type). Instead, a method that manually checks whether each element has already appeared and keeps only the first occurrence is used.
3.1 Method using a for loop and an “already seen” list
The simplest approach is to keep a separate list of already-seen items and add them one by one.
data = [[1, 2], [3, 4], [1, 2], [5, 6]]
unique_data = []
for item in data:
    if item not in unique_data:
        unique_data.append(item)
print(unique_data) # Example output: [[1, 2], [3, 4], [5, 6]]
With this approach, you can remove duplicates while preserving order even for non-hashable elements such as lists, dictionaries, and sets.
3.2 Generalizing into a function
Defining it as a function is convenient for making the process more generic.
def remove_duplicates(seq):
    result = []
    for item in seq:
        if item not in result:
            result.append(item)
    return result
data = [[1, 2], [3, 4], [1, 2], [5, 6]]
print(remove_duplicates(data)) # [[1, 2], [3, 4], [5, 6]]
3.3 Performance and considerations
While this method has the advantage of preserving order, the membership check item not in result scans the whole result list, so the cost grows quickly as the list gets larger. It is practical for a few thousand elements, but for data exceeding tens of thousands of items you should consider other approaches (e.g., converting elements to tuples and managing them with a set).
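As a rough sketch of that tuple-based idea (the variable names are chosen only for illustration), each inner list is converted to a tuple, which is hashable, so a set can handle fast membership checks while the output order is preserved.
data = [[1, 2], [3, 4], [1, 2], [5, 6]]
seen = set()
unique_data = []
for item in data:
    key = tuple(item)  # tuples are hashable, lists are not
    if key not in seen:
        seen.add(key)
        unique_data.append(item)
print(unique_data)  # Example output: [[1, 2], [3, 4], [5, 6]]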
4. Advanced Topics: Extract Duplicates Only or Count Occurrences
Removing duplicates is a basic step in data preprocessing, but in real-world scenarios you often need to “extract only the duplicated elements” or “count how many times each element appears.” Python can handle these tasks easily using the standard library.
4.1 Extract Only Duplicated Elements
If you want to extract only the “elements that appear multiple times” from a list, the Counter class from the collections module is handy.
from collections import Counter
data = [1, 2, 2, 3, 4, 4, 4, 5]
counter = Counter(data)
duplicates = [item for item, count in counter.items() if count > 1]
print(duplicates) # Example output: [2, 4]
With this approach, you can pull out “elements that have duplicates” at a glance. It also works with strings and any other hashable objects.
4.2 Count the Occurrence Frequency of Each Element
Counter also makes it easy to obtain the occurrence count of each element. Because it can be used for aggregation and frequency analysis, it’s highly valued in data analysis work.
from collections import Counter
data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counter = Counter(data)
print(counter) # Example output: Counter({'apple': 3, 'banana': 2, 'orange': 1})
Since a Counter object behaves like a dictionary, you can easily check how many times a specific element appears.
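For example, continuing with the counter built above (indexing and most_common are standard parts of the Counter API):
print(counter['apple'])        # 3
print(counter.most_common(2))  # [('apple', 3), ('banana', 2)]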
4.3 Remove All Duplicates (Keep Only Unique Elements)
You can also use Counter to remove all duplicated elements and keep only the “elements that appeared once.”
from collections import Counter
data = [1, 2, 2, 3, 4, 4, 5]
counter = Counter(data)
unique_items = [item for item, count in counter.items() if count == 1]
print(unique_items) # Example output: [1, 3, 5]
In this way, depending on your data’s purpose, you can flexibly perform various extractions and aggregations such as “duplicates only,” “occurrence counts,” or “unique elements.”
5. Removing Duplicates in Pandas DataFrames
DataFrames in pandas are frequently used in data analysis and machine learning. pandas, which efficiently handles tabular data, includes handy features specialized for detecting and removing duplicate data.
5.1 Removing Duplicate Rows with drop_duplicates()
The most commonly used tool in pandas is the drop_duplicates() method. It easily removes duplicate rows from a DataFrame or duplicate values from a Series.
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'David'],
    'age': [24, 27, 24, 32]
})
df_unique = df.drop_duplicates()
print(df_unique)
# Example output:
# name age
# 0 Alice 24
# 1 Bob 27
# 3 David 32
In this example, when rows are completely identical, only the first occurrence is kept.
5.2 Determining Duplicates Based on Specific Columns (subset argument)
If you want to determine duplicates based on specific columns only, use the subset argument.
df_unique_name = df.drop_duplicates(subset=['name'])
print(df_unique_name)
# Example output:
# name age
# 0 Alice 24
# 1 Bob 27
# 3 David 32
In this case, only duplicates in the ‘name’ column are considered.
5.3 Marking Duplicate Rows with duplicated()
If you want to identify which rows are duplicates rather than delete them, use the duplicated() method.
print(df.duplicated())
# Example output:
# 0 False
# 1 False
# 2 True
# 3 False
# dtype: bool
Rows that are duplicates are indicated by True, and non-duplicates by False. You can use this to extract or remove only the duplicate rows.
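For example, with the df defined in 5.1, boolean indexing on the result of duplicated() extracts or drops the duplicated rows (a small sketch, not a separate pandas feature):
print(df[df.duplicated()])   # only row 2 (the second 'Alice', 24)
print(df[~df.duplicated()])  # keeps non-duplicated rows, equivalent to drop_duplicates()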
5.4 Commonly Used Options (keep, inplace, ignore_index)
- keep: 'first' (default; keeps only the first occurrence), 'last' (keeps only the last occurrence), or False (removes all duplicated rows)
- inplace: when set to True, updates the original DataFrame directly
- ignore_index: when set to True, the resulting index is reset to 0, 1, 2, ...
df.drop_duplicates(subset=['name'], keep=False, inplace=True)
In this example, all rows where ‘name’ is duplicated are removed.
6. Performance Comparison and Choosing the Optimal Solution
When removing duplicates in Python, the method you choose can lead to significant differences in execution speed and memory efficiency. Here we explain the performance characteristics of commonly used techniques and how to choose the optimal one based on your use case.
6.1 Speed Comparison of set and dict.fromkeys
In general, duplicate removal using set or dict.fromkeys is extremely fast.
- set: The simplest and fastest. However, it does not preserve order.
- dict.fromkeys: Ideal when order preservation is needed. Since Python 3.7, it maintains insertion order while still being fast.
Note that neither set nor dict.fromkeys can be used with unhashable elements (values that cannot serve as set members or dictionary keys).
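If you want to compare the two on your own data, the standard timeit module gives a rough measurement; the sample data and repetition count below are arbitrary, and the actual figures depend on your environment.
import timeit

data = list(range(10_000)) * 10  # 100,000 elements with many duplicates
t_set = timeit.timeit(lambda: list(set(data)), number=100)
t_dict = timeit.timeit(lambda: list(dict.fromkeys(data)), number=100)
print(f"set:           {t_set:.3f} s")
print(f"dict.fromkeys: {t_dict:.3f} s")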
6.2 Comparison with List Comprehension + set
Using a list comprehension combined with a set (the method that employs seen) retains order while offering relatively high speed. However, it is slightly slower than set or dict.fromkeys alone, and the gap tends to widen as the number of elements grows. This approach is especially effective when you need flexibility, such as conditionally removing duplicates or handling complex data structures.
6.3 When Elements Are Unhashable
If a list contains other lists or dictionaries, each item not in result check scans the result list, so the overall work is O(n²) and the process slows down noticeably once you exceed a few thousand items. When you need to preserve order with large datasets, consider revising the data structure itself (e.g., converting elements to tuples, managing records by IDs).
6.4 Performance of pandas’ drop_duplicates
pandas’ drop_duplicates is internally optimized and runs quickly even on data sets ranging from hundreds of thousands to millions of rows. However, when complex conditions or multiple column specifications are involved, processing time can increase somewhat, and memory usage should also be monitored.
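As a rough way to confirm this on synthetic data (the column names, sizes, and random seed below are made up purely for illustration):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_large = pd.DataFrame({
    'user_id': rng.integers(0, 50_000, size=1_000_000),
    'item_id': rng.integers(0, 1_000, size=1_000_000),
})  # about one million rows containing many duplicate combinations
deduped = df_large.drop_duplicates(subset=['user_id', 'item_id'])
print(len(df_large), '->', len(deduped))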
6.5 Summary of Guidelines for Choosing
- When the data volume is large and order doesn’t matter: set is the fastest
- When preserving order is also important: dict.fromkeys or list comprehension + set
- When dealing with unhashable elements or complex conditions: a loop with an “already seen” list, or encapsulate the logic in a function
- For data analysis, large datasets, CSV files, etc.: pandas’ drop_duplicates
- When you also need to count occurrences or aggregate duplicate elements: collections.Counter
7. Summary
Duplicate removal in Python is an essential technique in data processing and analysis. In this article, we systematically explained methods for removing duplicates from lists, arrays, and pandas DataFrames, covering everything from basics to advanced applications. Summarizing the main points:
- Using set provides the simplest and fastest way to remove duplicates, though it does not preserve order.
- With dict.fromkeys or a list comprehension + set, you can eliminate duplicates while keeping the original order.
- When you have non-hashable elements (such as lists or dictionaries), a loop that checks item not in result is effective.
- By using collections.Counter, you can handle more detailed tasks such as extracting only duplicate items or aggregating occurrence counts.
- pandas’ drop_duplicates and duplicated offer powerful duplicate-removal capabilities that scale to large datasets, making them indispensable in data analysis.
FAQ (Frequently Asked Questions)
Q1. Which is faster, set or dict.fromkeys?
Generally, removing duplicates with a set is the fastest. However, a set does not preserve element order. If you need to keep order, use dict.fromkeys. In Python 3.7 and later, dict.fromkeys is also quite fast, so you can choose based on your use case.
Q2. How can I remove duplicates while preserving the original list order?
dict.fromkeys or a list comprehension combined with a set (using a seen collection) works. Both preserve order. Using only a set will disrupt the order.
Q3. What should I do when a list contains other lists or dictionaries, causing errors with set or dict.fromkeys?
Unhashable elements (lists, dictionaries, etc.) cannot be used as set elements or dict keys. In that case, the most reliable approach is to manually append items while checking a list of “already added” ones (using a for loop with an if statement, or encapsulating the logic in a function).
Q4. How can I remove duplicates in pandas based on a specific column?
Use the subset argument of drop_duplicates. For example, df.drop_duplicates(subset=['name']) will consider only the ‘name’ column when determining duplicates.
Q5. How can I delete all duplicate elements and keep only values that appear once?
Use collections.Counter and extract only the elements whose occurrence count is one.
from collections import Counter
data = [1, 2, 2, 3, 4, 4, 5]
counter = Counter(data)
unique_items = [item for item, count in counter.items() if count == 1]
# unique_items = [1, 3, 5]