
How to Find Duplicate Files by Content Hash in Ubuntu

This guide explains how to identify and remove duplicate files on Ubuntu using content hashing. We will cover command-line tools like fdupes and rmlint, as well as manual methods using md5sum. By comparing file hashes rather than names or sizes, you ensure accurate detection of redundant data to free up disk space safely.

Why Use Content Hashing

File names and sizes can be misleading. Two files might have different names but identical data, or the same name but different content. Content hashing generates a fixed-length string of characters (a digest) computed from the data inside the file. If two files produce the same hash value, their contents are, for all practical purposes, identical. This makes hashing far more reliable than comparing names or sizes when cleaning up your storage.
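As a quick illustration (the file names here are invented for the demo), two files with different names but identical bytes produce the same digest:

```bash
# Work in a throwaway directory; file names below are illustrative.
cd "$(mktemp -d)"
printf 'hello world\n' > notes.txt
printf 'hello world\n' > notes_backup.txt

# Identical content yields identical digests, regardless of the name.
hash_a=$(md5sum notes.txt | awk '{print $1}')
hash_b=$(md5sum notes_backup.txt | awk '{print $1}')

if [ "$hash_a" = "$hash_b" ]; then
    echo "duplicate content"
else
    echo "different content"
fi
```

Running this prints "duplicate content", because the digest depends only on the bytes inside each file.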

Using fdupes

fdupes is a popular command-line program for identifying and deleting duplicate files. It compares file sizes and then generates hashes for further verification.

  1. Install fdupes: Open your terminal and run the following commands:

     ```bash
     sudo apt update
     sudo apt install fdupes
     ```

  2. Scan for duplicates: To search a specific directory, such as your home folder, use:

     ```bash
     fdupes -r ~/
     ```

     The -r flag tells the tool to search recursively through subdirectories.

  3. Delete duplicates: Once you have verified the list, you can delete the files interactively:

     ```bash
     fdupes -rd ~/
     ```

     The -d flag enables deletion; the tool will ask which copy to keep for every set of duplicates found.
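fdupes compares file sizes first and only hashes the candidates that survive that cheap filter. That two-pass strategy can be approximated with plain coreutils; the following is a rough sketch (GNU find assumed, and the demo files are invented):

```bash
# Build a small demo tree in a temporary directory.
cd "$(mktemp -d)"
printf 'duplicate content\n' > a.txt
printf 'duplicate content\n' > b.txt
printf 'unique content\n'    > c.txt

# Pass 1: find sizes that occur more than once (the cheap filter).
dup_sizes=$(find . -type f -printf '%s\n' | sort -n | uniq -d)

# Pass 2: hash only files of those sizes, then print repeated digests
# (the first 32 characters of each md5sum line are the hash).
dupes=$(for size in $dup_sizes; do
            find . -type f -size "${size}c" -exec md5sum {} +
        done | sort | uniq -w32 -D)
echo "$dupes"
```

Only a.txt and b.txt are hashed and reported; c.txt has a unique size and is never read in full, which is what makes this approach fast on large trees.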

Using rmlint

rmlint is a faster alternative designed to find duplicates and other filesystem lint. It is highly efficient for large datasets.

  1. Install rmlint:

     ```bash
     sudo apt install rmlint
     ```

  2. Scan your directory: Run the scanner on your target folder:

     ```bash
     rmlint ~/
     ```

     This command does not delete files immediately. Instead, it writes a shell script named rmlint.sh to the current directory.

  3. Review and execute: Open the generated script to review what will be deleted. Once you are confident, execute it to remove the duplicates:

     ```bash
     bash ./rmlint.sh
     ```

Manual Verification with md5sum

For single files or custom scripts, you can use the md5sum utility (part of GNU coreutils, installed by default on Ubuntu) to compare hashes manually.

  1. Generate the hash: Run the following command on a file:

     ```bash
     md5sum filename.txt
     ```

  2. Compare outputs: Run the command on a second file. If the resulting hash strings are identical, the files are duplicates.
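The manual approach scales to an entire directory tree by sorting the md5sum output and keeping only repeated digests. A minimal sketch (the directory and its contents are examples made up for the demo):

```bash
# Build a small example tree; names and contents are illustrative.
target=$(mktemp -d)
printf 'shared data\n'   > "$target/one.txt"
printf 'shared data\n'   > "$target/two.txt"
printf 'distinct data\n' > "$target/three.txt"

# Hash every file, sort by digest, and keep only repeated digests
# (the first 32 characters of each md5sum line are the hash).
result=$(find "$target" -type f -exec md5sum {} + | sort | uniq -w32 -D)
echo "$result"
```

Unlike the two-pass size filter, this hashes every file, so it is simpler but slower on large trees.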

Safety Precautions

Always back up your data before running bulk deletion commands. Hash collisions are theoretically possible but extremely rare, even with an older algorithm like MD5; in practice, human error during the deletion step is the far more common risk. Lean on the tools' non-destructive defaults: fdupes only lists matches unless you pass -d, and rmlint only generates a script, so review the output carefully before any changes are finalized.
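Before a bulk cleanup, a simple archive of the target directory is cheap insurance. A sketch using tar (the directory and file here are examples):

```bash
# Create an example directory to stand in for the folder being cleaned.
demo=$(mktemp -d)
printf 'important\n' > "$demo/data.txt"

# Archive it before letting any deduplication tool delete from it.
backup="$demo.tar.gz"
tar -czf "$backup" -C "$(dirname "$demo")" "$(basename "$demo")"

# Confirm the archive exists and lists the file before proceeding.
tar -tzf "$backup"
```

If a deletion run goes wrong, the originals can be restored with tar -xzf on the archive.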