Removing Duplicate Files Using Python: A Comprehensive Guide
Duplicate files can consume valuable storage space and clutter your file system. In this tutorial, we will explore how to efficiently remove duplicate files using Python. Whether the files have different names or are located in nested folders, this step-by-step guide will help you identify and eliminate duplicates, freeing up disk space and improving file organization.
1. File Comparison Based on Content
To compare files, we will calculate a hash value for each file’s content. The hashlib module in Python provides various hash algorithms like MD5, SHA-1, and SHA-256. By calculating and comparing the hash values, we can determine whether two files have the same content.
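As a quick, standalone sketch of the idea (the two file names here are placeholders, not files from this project), two files have identical content exactly when their digests match:
import hashlib

def md5_of(path):
    """Return the MD5 hex digest of a file's content."""
    hasher = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b''):
            hasher.update(chunk)
    return hasher.hexdigest()

# 'report.txt' and 'report_copy.txt' are placeholder file names.
if md5_of('report.txt') == md5_of('report_copy.txt'):
    print('These two files have identical content.')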
2. Traversing Through Folders
To handle files located in nested folders, we will use the os module in Python. The os.walk() function allows us to recursively traverse a directory and its subdirectories, providing access to the files and folders within them.
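To see how this works on its own, here is a minimal sketch (the folder name is a placeholder) that simply prints every file path found under a root folder and its subfolders:
import os

# 'my_folder' is a placeholder; replace it with any directory on your machine.
for folder_path, _, file_names in os.walk('my_folder'):
    for file_name in file_names:
        # os.path.join builds the full path to each file that os.walk finds
        print(os.path.join(folder_path, file_name))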
3. Identifying Duplicate Files
As we traverse the folders, we will maintain a dictionary in which the hash values are the keys and the corresponding file paths are the values. For each file encountered, we will calculate its hash value and check whether it already exists in the dictionary. If it does, the file is a duplicate, and its path is added to that hash’s list so it can be removed later.
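Sketched on its own, that bookkeeping looks roughly like this (calculate_file_hash() is the helper defined in the full code below, and root_folder is a placeholder for your directory):
files_by_hash = {}
for folder_path, _, file_names in os.walk(root_folder):
    for file_name in file_names:
        file_path = os.path.join(folder_path, file_name)
        file_hash = calculate_file_hash(file_path)
        # every path with the same hash ends up in the same list
        files_by_hash.setdefault(file_hash, []).append(file_path)

# any hash that maps to more than one path is a group of duplicates
duplicate_groups = [paths for paths in files_by_hash.values() if len(paths) > 1]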
4. Removing Duplicate Files
Once we have identified the duplicate files, we can prompt the user for confirmation before removing them. Using the os.remove() function, we can then delete the duplicate files permanently from the file system.
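In isolation, that confirm-then-delete step might look like the following sketch (extra_copy_path is a hypothetical path standing in for one of the duplicate copies):
import os

# Hypothetical placeholder for a file already identified as an extra copy.
extra_copy_path = 'C:/temp/extra_copy.txt'

answer = input(f"Delete {extra_copy_path}? [y/N] ")
if answer.lower() == 'y':
    os.remove(extra_copy_path)  # permanently deletes the file
    print(f"{extra_copy_path} has been deleted.")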
Let’s dive into the code:
import os
import hashlib


def calculate_file_hash(file_path):
    """Calculates the hash value of a file's content."""
    hasher = hashlib.md5()
    with open(file_path, 'rb') as file:
        # Read in 4 KB chunks so large files never have to fit in memory at once.
        for chunk in iter(lambda: file.read(4096), b''):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicate_files(root_folder):
    """Traverses through the root folder and identifies duplicate files."""
    duplicates = {}
    for folder_path, _, file_names in os.walk(root_folder):
        for file_name in file_names:
            file_path = os.path.join(folder_path, file_name)
            file_hash = calculate_file_hash(file_path)
            if file_hash in duplicates:
                duplicates[file_hash].append(file_path)
            else:
                duplicates[file_hash] = [file_path]
    return duplicates


def remove_duplicate_files(duplicates):
    """Removes duplicate files from the file system after user confirmation."""
    for file_paths in duplicates.values():
        if len(file_paths) > 1:
            print(f"Duplicate files found:\n{file_paths}\n")
            # Keep the first copy and ask before deleting each additional one.
            for file_path in file_paths[1:]:
                answer = input(f"Delete {file_path}? [y/N] ")
                if answer.lower() == 'y':
                    os.remove(file_path)
                    print(f"{file_path} has been deleted.\n")


if __name__ == '__main__':
    root_folder = 'C:/Users/teja/Desktop/Projects/DupFileRemove'
    duplicates = find_duplicate_files(root_folder)
    remove_duplicate_files(duplicates)
In this example, we define three functions:
- calculate_file_hash(file_path): This function calculates the hash value of a file’s content using the MD5 algorithm. It reads the file in chunks to efficiently handle large files.
- find_duplicate_files(root_folder): This function traverses the root folder and identifies duplicate files. It uses the os.walk() function to recursively iterate over all files in the directory and its subdirectories, calculates the hash value of each file’s content, and stores the file paths that share a hash value in a dictionary.
- remove_duplicate_files(duplicates): This function takes the dictionary of duplicate files and prompts the user for confirmation before removing them. For each group of duplicates, it presents the file paths and asks whether the additional copies should be deleted. If the user confirms, the os.remove() function is called to delete the file permanently.
The provided code can work for media files as well. The process of removing duplicate files remains the same regardless of the file type. The code calculates the hash value based on the file’s content, so it is not dependent on the file extension or type.
Whether you have text files, images, audio files, or videos, the code will correctly identify and remove duplicate files based on their content. The hashing algorithm used (MD5 in the example) works for any file type.
Thank you for reading. Support me by clapping and sharing this blog.
Let’s catch up on LinkedIn.