Table of Contents
- 1. Introduction
- 2. Three Basic Patterns for Removing Duplicate Elements in Python
- 3. Processing when elements such as lists are “non-hashable”
- 4. Advanced Topics: Extract Duplicates Only or Count Occurrences
- 5. Removing Duplicates in Pandas DataFrames
- 6. Performance Comparison and Choosing the Optimal Solution
- 7. Summary
- FAQ (Frequently Asked Questions)
  - Q1. Which is faster, set or dict.fromkeys?
  - Q2. How can I remove duplicates while preserving the original list order?
  - Q3. What should I do when a list contains other lists or dictionaries, causing errors with set or dict.fromkeys?
  - Q4. How can I remove duplicates in pandas based on a specific column?
  - Q5. How can I delete all duplicate elements and keep only values that appear once?
  - Q6. How can I check for duplicate rows in a DataFrame instead of deleting them?
  - Q7. What should I watch out for when removing duplicates from data sets of hundreds of thousands of records?
  - Q8. Why is duplicate removal necessary?
1. Introduction
In data analysis and programming contexts, removing duplicate data is a fundamental, practically unavoidable operation. The need to remove duplicate elements from lists, arrays, or DataFrames in Python arises for a wide range of users, from beginners to advanced practitioners. For example, after scraping large amounts of data from the web or when loading a CSV file, it is not uncommon for the same values or rows to appear repeatedly. Leaving such duplicates unchecked can lead to various problems, such as inaccurate aggregation results and unnecessary processing. Python provides a variety of methods for removing duplicates, using both built-in features and external libraries. This article covers everything from basic techniques for deduplication in Python to advanced usage and pitfalls. It is written to be clear for programming beginners while also being useful for advanced users in professional settings, with concrete examples and key points for lists, arrays, and DataFrames. We answer questions such as “Which method should I choose?” and “Why does the order change?” while presenting practical know-how that can be applied immediately in real-world scenarios. If you want to remove duplicate data efficiently with Python, use this article as a reference.
2. Three Basic Patterns for Removing Duplicate Elements in Python
When removing duplicate elements from lists, arrays, and similar structures in Python, three main methods are commonly used. Each has its own characteristics, so it’s important to choose the appropriate one based on your goals and situation.
2.1 Removing Duplicates with set (order-nonpreserving method)
The simplest and most intuitive approach is to use Python’s built-in type “set”. A set is a collection that does not allow duplicates, so converting a list to a set automatically removes duplicate elements.
numbers = [1, 2, 2, 3, 4, 4, 5]
unique_numbers = list(set(numbers))
print(unique_numbers) # Example output: [1, 2, 3, 4, 5]
However, a set does not preserve element order. Therefore, if you need to retain the original list order, you’ll need a different approach.
2.2 Order-preserving removal using dict.fromkeys
Since Python 3.7, dictionaries preserve the insertion order of their keys. This enables a duplicate-removal trick using dict.fromkeys(). With this method, you can eliminate duplicates without disturbing the original list order.
numbers = [1, 2, 2, 3, 4, 4, 5]
unique_numbers = list(dict.fromkeys(numbers))
print(unique_numbers) # Example output: [1, 2, 3, 4, 5]
Because it provides order-preserving duplicate removal in a single concise line, it’s widely used.
2.3 List comprehension + set for order preservation and flexibility
Another common technique combines a list comprehension with a set to remove duplicates while preserving order. This approach is advantageous because it easily accommodates more flexible conditions and complex data structures.
numbers = [1, 2, 2, 3, 4, 4, 5]
seen = set()
unique_numbers = [x for x in numbers if not (x in seen or seen.add(x))]
print(unique_numbers) # Example output: [1, 2, 3, 4, 5]
With this method, you can explicitly express the “skip elements that have already appeared” logic, which makes it easy to add extra conditions to the duplicate check. Note that the seen set itself still requires hashable values; non-hashable elements are handled in the next section.
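For example, here is a minimal sketch of that kind of conditional duplicate removal, treating strings that differ only in letter case as the same value; the word list is made up purely for illustration.
words = ['Apple', 'apple', 'Banana', 'BANANA', 'cherry']
seen = set()
unique_words = []
for word in words:
    key = word.lower()  # normalize before the duplicate check
    if key not in seen:
        seen.add(key)
        unique_words.append(word)  # keep the first spelling encountered
print(unique_words)  # Example output: ['Apple', 'Banana', 'cherry']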
3. Processing when elements such as lists are “non-hashable”
The duplicate-removal techniques using set or dict introduced above work for hashable elements such as numbers, strings, and tuples. However, lists and dictionaries are non-hashable objects, so a list that contains other lists (e.g., a two-dimensional list) or a list of dictionaries cannot be converted to a set or used as dict keys directly. A bit of extra work is therefore required. For example, consider the following two-dimensional list.
data = [[1, 2], [3, 4], [1, 2], [5, 6]]
In this case, using set or dict.fromkeys() will raise a TypeError (unhashable type). Instead, a method that manually checks whether each element has already appeared and keeps only the first occurrence is used.
3.1 Method using a for loop and an “already seen” list
The simplest approach is to keep a separate list of already-seen items and add them one by one.
data = [[1, 2], [3, 4], [1, 2], [5, 6]]
unique_data = []
for item in data:
    if item not in unique_data:
        unique_data.append(item)
print(unique_data) # Example output: [[1, 2], [3, 4], [5, 6]]
With this approach, you can remove duplicates while preserving order even for non-hashable elements such as lists, dictionaries, and sets.
3.2 Generalizing into a function
Defining it as a function is convenient for making the process more generic.
def remove_duplicates(seq):
    result = []
    for item in seq:
        if item not in result:
            result.append(item)
    return result
data = [[1, 2], [3, 4], [1, 2], [5, 6]]
print(remove_duplicates(data)) # [[1, 2], [3, 4], [5, 6]]
3.3 Performance and considerations
While this method has the advantage of preserving order, the membership check item not in result scans the whole result list, so the cost grows quickly as the list gets larger. It is practical for a few thousand elements, but for data exceeding tens of thousands of items you should consider other approaches (e.g., converting elements to tuples and managing them with a set).
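As a rough sketch of that tuple-based idea (the variable names are chosen only for illustration), each inner list is converted to a tuple, which is hashable, so a set can handle fast membership checks while the output order is preserved.
data = [[1, 2], [3, 4], [1, 2], [5, 6]]
seen = set()
unique_data = []
for item in data:
    key = tuple(item)  # tuples are hashable, lists are not
    if key not in seen:
        seen.add(key)
        unique_data.append(item)
print(unique_data)  # Example output: [[1, 2], [3, 4], [5, 6]]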
4. Advanced Topics: Extract Duplicates Only or Count Occurrences
Removing duplicates is a basic step in data preprocessing, but in real-world scenarios you often need to “extract only the duplicated elements” or “count how many times each element appears.” Python can handle these tasks easily using the standard library.
4.1 Extract Only Duplicated Elements
If you want to extract only the “elements that appear multiple times” from a list, the Counter class from the collections module is handy.
from collections import Counter
data = [1, 2, 2, 3, 4, 4, 4, 5]
counter = Counter(data)
duplicates = [item for item, count in counter.items() if count > 1]
print(duplicates) # Example output: [2, 4]
With this approach, you can pull out “elements that have duplicates” at a glance. It also works with strings and any other hashable objects.
4.2 Count the Occurrence Frequency of Each Element
Counter also makes it easy to obtain the occurrence count of each element. Because it can be used for aggregation and frequency analysis, it’s highly valued in data analysis work.
from collections import Counter
data = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
counter = Counter(data)
print(counter) # Example output: Counter({'apple': 3, 'banana': 2, 'orange': 1})
Since a Counter object behaves like a dictionary, you can easily check how many times a specific element appears.
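For example, continuing with the counter built above (indexing and most_common are standard parts of the Counter API):
print(counter['apple'])        # 3
print(counter.most_common(2))  # [('apple', 3), ('banana', 2)]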
4.3 Remove All Duplicates (Keep Only Unique Elements)
You can also use Counter to remove all duplicated elements and keep only the “elements that appeared once.”
from collections import Counter
data = [1, 2, 2, 3, 4, 4, 5]
counter = Counter(data)
unique_items = [item for item, count in counter.items() if count == 1]
print(unique_items) # Example output: [1, 3, 5]
In this way, depending on your data’s purpose, you can flexibly perform various extractions and aggregations such as “duplicates only,” “occurrence counts,” or “unique elements.”
5. Removing Duplicates in Pandas DataFrames
DataFrames in pandas are frequently used in data analysis and machine learning. pandas, which efficiently handles tabular data, includes handy features specialized for detecting and removing duplicate data.
5.1 Removing Duplicate Rows with drop_duplicates()
The most commonly used tool in pandas is the drop_duplicates() method. It easily removes duplicate rows from a DataFrame or duplicate values from a Series.
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'David'],
    'age': [24, 27, 24, 32]
})
df_unique = df.drop_duplicates()
print(df_unique)
# Example output:
# name age
# 0 Alice 24
# 1 Bob 27
# 3 David 32
In this example, when rows are completely identical, only the first occurrence is kept.
5.2 Determining Duplicates Based on Specific Columns (subset argument)
If you want to determine duplicates based on specific columns only, use the subset argument.
df_unique_name = df.drop_duplicates(subset=['name'])
print(df_unique_name)
# Example output:
# name age
# 0 Alice 24
# 1 Bob 27
# 3 David 32
In this case, only duplicates in the ‘name’ column are considered.
5.3 Marking Duplicate Rows with duplicated()
If you want to identify which rows are duplicates rather than delete them, use the duplicated() method.
print(df.duplicated())
# Example output:
# 0 False
# 1 False
# 2 True
# 3 False
# dtype: bool
Rows that are duplicates are indicated by True, and non-duplicates by False. You can use this to extract or remove only the duplicate rows.
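For example, with the df defined in 5.1, boolean indexing on the result of duplicated() extracts or drops the duplicated rows (a small sketch, not a separate pandas feature):
print(df[df.duplicated()])   # only row 2 (the second 'Alice', 24)
print(df[~df.duplicated()])  # keeps non-duplicated rows, equivalent to drop_duplicates()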
5.4 Commonly Used Options (keep, inplace, ignore_index)
- keep: 'first' (default; keeps only the first occurrence), 'last' (keeps only the last occurrence), or False (removes all duplicated rows)
- inplace: when set to True, updates the original DataFrame directly
- ignore_index: when set to True, the resulting index is reset to 0, 1, 2, ...
df.drop_duplicates(subset=['name'], keep=False, inplace=True)
In this example, all rows where ‘name’ is duplicated are removed.
6. Performance Comparison and Choosing the Optimal Solution
When removing duplicates in Python, the method you choose can lead to significant differences in execution speed and memory efficiency. Here we explain the performance characteristics of commonly used techniques and how to choose the optimal one based on your use case.
6.1 Speed Comparison of set and dict.fromkeys
In general, duplicate removal using set or dict.fromkeys is extremely fast.
- set: The simplest and fastest. However, it does not preserve order.
- dict.fromkeys: Ideal when order preservation is needed. Since Python 3.7, it maintains insertion order while still being fast.
Note that neither set nor dict.fromkeys can be used with unhashable elements (values that cannot serve as set members or dictionary keys).
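If you want to compare the two on your own data, the standard timeit module gives a rough measurement; the sample data and repetition count below are arbitrary, and the actual figures depend on your environment.
import timeit

data = list(range(10_000)) * 10  # 100,000 elements with many duplicates
t_set = timeit.timeit(lambda: list(set(data)), number=100)
t_dict = timeit.timeit(lambda: list(dict.fromkeys(data)), number=100)
print(f"set:           {t_set:.3f} s")
print(f"dict.fromkeys: {t_dict:.3f} s")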
6.2 Comparison with List Comprehension + set
Using a list comprehension combined with a set (the method that employs seen) retains order while offering relatively high speed. However, it is slightly slower than set or dict.fromkeys alone, and the gap tends to widen as the number of elements grows. This approach is especially effective when you need flexibility, such as conditionally removing duplicates or handling complex data structures.
6.3 When Elements Are Unhashable
If a list contains other lists or dictionaries, each item not in result check scans the result list, so the overall work is O(n²) and the process slows down noticeably once you exceed a few thousand items. When you need to preserve order with large datasets, consider revising the data structure itself (e.g., converting elements to tuples, managing records by IDs).
6.4 Performance of pandas’ drop_duplicates
pandas’ drop_duplicates is internally optimized and runs quickly even on data sets ranging from hundreds of thousands to millions of rows. However, when complex conditions or multiple column specifications are involved, processing time can increase somewhat, and memory usage should also be monitored.
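As a rough way to confirm this on synthetic data (the column names, sizes, and random seed below are made up purely for illustration):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_large = pd.DataFrame({
    'user_id': rng.integers(0, 50_000, size=1_000_000),
    'item_id': rng.integers(0, 1_000, size=1_000_000),
})  # about one million rows containing many duplicate combinations
deduped = df_large.drop_duplicates(subset=['user_id', 'item_id'])
print(len(df_large), '->', len(deduped))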
6.5 Summary of Guidelines for Choosing
- When the data volume is large and order doesn’t matter: set is the fastest
- When preserving order is also important: dict.fromkeys or list comprehension + set
- When dealing with unhashable elements or complex conditions: a loop with an “already seen” list, or encapsulate the logic in a function
- For data analysis, large datasets, CSV files, etc.: pandas’ drop_duplicates
- When you also need to count occurrences or aggregate duplicate elements: collections.Counter
7. Summary
Duplicate removal in Python is an essential technique in data processing and analysis. In this article, we systematically explained methods for removing duplicates from lists, arrays, and pandas DataFrames, covering everything from basics to advanced applications. Summarizing the main points:
- Using set provides the simplest and fastest way to remove duplicates, though it does not preserve order.
- With dict.fromkeys or a list comprehension + set, you can eliminate duplicates while keeping the original order.
- When you have non-hashable elements (such as lists or dictionaries), a loop that checks item not in result is effective.
- By using collections.Counter, you can handle more detailed tasks such as extracting only duplicate items or aggregating occurrence counts.
- pandas’ drop_duplicates and duplicated offer powerful duplicate-removal capabilities that scale to large datasets, making them indispensable in data analysis.
FAQ (Frequently Asked Questions)
Q1. Which is faster, set or dict.fromkeys?
Generally, removing duplicates with a set is the fastest. However, a set does not preserve element order. If you need to keep order, use dict.fromkeys. In Python 3.7 and later, dict.fromkeys is also quite fast, so you can choose based on your use case.
Q2. How can I remove duplicates while preserving the original list order?
dict.fromkeys or a list comprehension combined with a set (using a seen collection) works. Both preserve order. Using only a set will disrupt the order.
Q3. What should I do when a list contains other lists or dictionaries, causing errors with set or dict.fromkeys?
Unhashable elements (lists, dictionaries, etc.) cannot be used as set elements or dict keys. In that case, the most reliable approach is to manually append items while checking a list of “already added” ones (using a for loop with an if statement, or encapsulating the logic in a function).
Q4. How can I remove duplicates in pandas based on a specific column?
Use the subset argument of drop_duplicates. For example, df.drop_duplicates(subset=['name']) will consider only the ‘name’ column when determining duplicates.
Q5. How can I delete all duplicate elements and keep only values that appear once?
Use collections.Counter and extract only the elements whose occurrence count is one.
from collections import Counter
data = [1, 2, 2, 3, 4, 4, 5]
counter = Counter(data)
unique_items = [item for item, count in counter.items() if count == 1]
# unique_items = [1, 3, 5]