find and remove duplicates in a directory
Written by: J Dawg
I have a directory with multiple image files, some of which are identical but all with different names. I need to remove the duplicates, with no external tools, only a bash script. I'm a beginner in Linux. I tried a nested `for` loop to compare md5 sums and remove a file depending on the result, but something is wrong with the syntax and it doesn't work. Any help?
What I've tried is:
```
for i in directory_path; do
    sum1='find $i -type f -iname "*.jpg" -exec md5sum '{}' \;'
    for j in directory_path; do
        sum2='find $j -type f -iname "*.jpg" -exec md5sum '{}' \;'
        if test $sum1=$sum2 ; then
            rm $j
        fi
    done
done
```
I get: `test: too many arguments`
There are quite a few problems in your script.
- First, in order to assign the result of a command to a variable, you need to enclose it either in backticks (`` `command` ``) or, preferably, `$(command)`. You have it in single quotes (`'command'`), which, instead of assigning the result of your command to your variable, assigns the command itself as a string. Therefore, your `test` is actually:

```
$ echo "test $sum1=$sum2"
test find $i -type f -iname "*.jpg" -exec md5sum {} \;=find $j -type f -iname "*.jpg" -exec md5sum {} \;
```
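To see the difference between the two, here is a minimal sketch (using `/etc/fstab` only as an example file):

```
sum='md5sum /etc/fstab'     # assigns the literal string, not the checksum
echo "$sum"                 # prints: md5sum /etc/fstab
sum=$(md5sum /etc/fstab)    # runs the command and captures its output
echo "$sum"                 # prints the checksum followed by the file name
```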
- The next issue is that the command `md5sum` returns more than just the hash:

```
$ md5sum /etc/fstab
46f065563c9e88143fa6fb4d3e42a252  /etc/fstab
```
You only want to compare the first field, so you should parse the `md5sum` output by passing it through a command that only prints the first field:

```
find $i -type f -iname "*.jpg" -exec md5sum '{}' \; | cut -f 1 -d ' '
```

or

```
find $i -type f -iname "*.jpg" -exec md5sum '{}' \; | awk '{print $1}'
```
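For example, applied to the `md5sum` output shown above, `cut` keeps only the hash:

```
$ md5sum /etc/fstab | cut -f 1 -d ' '
46f065563c9e88143fa6fb4d3e42a252
```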
- Also, the `find` command will return many matches, not just one, and each of those matches will be duplicated by the second `find`. This means that at some point you will be comparing the same file to itself, the md5sums will be identical, and you will end up deleting all your files (I ran this on a test dir containing `a.jpg` and `b.jpg`):

```
for i in $(find . -iname "*.jpg"); do
    for j in $(find . -iname "*.jpg"); do
        echo "i is: $i and j is: $j"
    done
done
i is: ./a.jpg and j is: ./a.jpg   ## BAD, will delete a.jpg
i is: ./a.jpg and j is: ./b.jpg
i is: ./b.jpg and j is: ./a.jpg
i is: ./b.jpg and j is: ./b.jpg   ## BAD, will delete b.jpg
```
- You don't want to run `for i in directory_path` unless you are passing an array of directories. If all these files are in the same directory, you want to run `for i in $(find directory_path -iname "*.jpg")` to go through all the files.

- It is a bad idea to use `for` loops with the output of `find`. You should use `while` loops or globbing:

```
find . -iname "*.jpg" | while read i; do [...] ; done
```

or, if all your files are in the same directory:

```
for i in *.jpg; do [...]; done
```
Depending on your shell and the options you have set, you can use globbing even for files in subdirectories but let’s not get into that here.
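(If you are curious anyway, here is a minimal sketch, assuming bash 4 or later, where the `globstar` option makes `**` match files in subdirectories:)

```
shopt -s globstar nullglob      # bash 4+: ** recurses; nullglob drops empty matches
for i in directory_path/**/*.jpg; do
    echo "$i"
done
```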
- Finally, you should also quote your variables, or else directory paths with spaces will break your script.
File names can contain spaces, newlines, backslashes and other weird characters; to deal with those correctly in a `while` loop you'll need to add some more options. What you want to write is something like:
```
find dir_path -type f -iname "*.jpg" -print0 | while IFS= read -r -d '' i; do
    find dir_path -type f -iname "*.jpg" -print0 | while IFS= read -r -d '' j; do
        if [ "$i" != "$j" ]; then
            sum1=$(md5sum "$i" | cut -f 1 -d ' ')
            sum2=$(md5sum "$j" | cut -f 1 -d ' ')
            [ "$sum1" = "$sum2" ] && rm "$j"
        fi
    done
done
```
An even simpler way would be:
```
find directory_path -name "*.jpg" -exec md5sum '{}' + |
    perl -ane '$k{$F[0]}++; system("rm $F[1]") if $k{$F[0]}>1'
```
A better version that can deal with spaces in file names:

```
find directory_path -name "*.jpg" -exec md5sum '{}' + |
    perl -ane '$k{$F[0]}++; system("rm \"@F[1 .. $#F]\"") if $k{$F[0]}>1'
```
This little Perl script will run through the results of the `find` command (i.e. the md5sum and file name). The `-a` option for `perl` splits input lines at whitespace and saves the fields in the `@F` array, so `$F[0]` will be the md5sum and `$F[1]` the file name. The md5sum is saved as a key in the hash `%k`, and the script checks whether the hash has already been seen (`if $k{$F[0]}>1`) and deletes the file if it has (`system("rm $F[1]")`).
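Since the question asked for bash only, here is a sketch of the same keep-the-first, delete-later-duplicates logic in pure bash, assuming bash 4 or later for associative arrays:

```
#!/bin/bash
declare -A seen                            # maps md5sum => first file seen with it
while IFS= read -r -d '' f; do
    sum=$(md5sum "$f" | cut -f 1 -d ' ')
    if [[ -n ${seen[$sum]} ]]; then
        rm -- "$f"                         # duplicate of an earlier file: remove it
    else
        seen[$sum]=$f                      # first file with this hash: keep it
    fi
done < <(find directory_path -type f -iname "*.jpg" -print0)
```

Note the process substitution (`< <(find ...)`) rather than a pipe, so that the `seen` array persists in the current shell across loop iterations.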
While these approaches will work, they will be very slow for large image collections, and you cannot choose which files to keep. There are programs that handle this in a more elegant way. One of them is the nifty `fdupes`, which simplifies the whole process and prompts the user before deleting duplicates. I think it is worth checking out:
```
$ fdupes --delete DIRECTORY_WITH_DUPLICATES
[1] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz
[2] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz.1

Set 1 of 1, preserve files [1 - 2, all]: 1

   [+] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz
   [-] DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz.1
```
Basically, it prompted me for which file to keep, I typed 1, and it removed the second.
Other interesting options are:
```
-r --recurse      for every directory given follow subdirectories
                  encountered within
-N --noprompt     when used together with --delete, preserve the first
                  file in each set of duplicates and delete the others
                  without prompting the user
```
From your example, you probably want to run it as:
```
fdupes --recurse --delete --noprompt DIRECTORY_WITH_DUPLICATES
```
See `man fdupes` for all the available options.
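If you want to preview the matches before deleting anything, running `fdupes` without `--delete` simply lists each set of duplicates (shown here reusing the example files from above):

```
$ fdupes --recurse DIRECTORY_WITH_DUPLICATES
DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz
DIRECTORY_WITH_DUPLICATES/package-0.1-linux.tar.gz.1
```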