Knowledge Base

Preserving for the future: Shell scripts, AoC, and more

Shell: find duplicate files and replace most with symlinks

This is probably not the best solution, but it's the one I wanted to work with. I wanted to reduce the disk space used by three similar projects. Here is my shell script, which finds duplicate files (by md5sum) and replaces every secondary copy with a symlink pointing to the relative path of the original file.

#!/usr/bin/env sh
# File: set-symlinks.sh
# License: CC-BY-SA 3.0
# Author: bgstack15
# Startdate: 2020-02-07 14:03
# Title: Script that Replaces Duplicate Files with Symlinks
# Purpose:
# History:
# Usage:
# Reference:
#    https://stackoverflow.com/questions/2564634/convert-absolute-path-into-relative-path-given-a-current-directory-using-bash
# Improve:
# Dependencies:
#    coreutils >= 8.23

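INDIR=~/src/project

# List every file that is not a directory or symlink with its md5sum, group the
# filenames by checksum (colon-separated), and treat the first name in each
# group as the original.
# Limitations: filenames containing whitespace or colons are not handled, and
# groups with more than four duplicates are not fully processed.
#results="$( find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort )"
find "${INDIR}" ! -type d ! -type l -exec md5sum {} + | sort | \
   awk '{a[$1]=a[$1]":"$2} END {for (i in a){print a[i]}}' | sed -r -e 's/^://;' | \
   while IFS=':' read -r main child1 child2 child3 child4 ;
   do
      x=0
      while test $x -lt 4 ;
      do
         x=$(( x + 1 ))
         eval thischild="\${child$x}"
         if test -n "${thischild}" ;
         then
            # Path of the original, relative to the directory holding the duplicate
            linkname="$( realpath --relative-to "$( dirname "${thischild}" )" "${main}" 2>/dev/null )"
            test -n "${DEBUG}" && echo "ln -sf ${linkname} ${thischild}" 1>&2
            # Skip the link if realpath failed, so ln never gets an empty target
            test -z "${DRYRUN}" && test -n "${linkname}" && ln -sf "${linkname}" "${thischild}"
         fi
      done
   done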

I was going to do this task by hand, but a quick investigation turned up 73 duplicate files. Because the project is small, I decided to just do it in shell rather than resort to Python. I don't need efficiency; I really only need to run it once. The tricky bits are at the very front of the logic. The awk associative array builds, for each md5sum, a colon-separated list of all the filenames that share it. After sed strips the leading colon (the separator), I pipe the output to a while read loop for easy variable naming. The inner loop then runs a fixed number of times (hard-coded to 4): if child number X exists, it gets the relative path from that child's directory to the main file and force-creates the symlink.
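
If a checksum ever had more than four duplicates, the hard-coded loop would not be enough. Here is a rough sketch of one way around that, not what I actually ran: split each colon-separated group into positional parameters and loop over however many there are. The function name plan_links is made up for this example, and it only prints the ln commands instead of running them. It also assumes GNU realpath, and uses its -m flag so the sample paths do not have to exist.

```shell
#!/usr/bin/env sh
# Sketch only: same idea as the script above, but without the limit of four
# duplicates per group. plan_links is a hypothetical helper name.
# It reads "main:child1:child2:..." lines on stdin and prints the ln commands
# it would run, without touching the filesystem.
plan_links() {
   while IFS= read -r group ;
   do
      # Split the group on colons into $1, $2, ... with globbing disabled
      oldifs="${IFS}"
      IFS=':'
      set -f
      set -- ${group}
      set +f
      IFS="${oldifs}"
      main="$1"
      shift
      for child in "$@" ;
      do
         # -m lets realpath work on paths that do not exist (GNU coreutils)
         linkname="$( realpath -m --relative-to "$( dirname "${child}" )" "${main}" )"
         echo "ln -sf ${linkname} ${child}"
      done
   done
}

# Made-up sample group: one original followed by two duplicates
echo 'proj1/img/logo.png:proj2/img/logo.png:proj3/logo.png' | plan_links
# prints:
#   ln -sf ../../proj1/img/logo.png proj2/img/logo.png
#   ln -sf ../proj1/img/logo.png proj3/logo.png
```

Like the original, this still breaks on filenames that contain colons, and feeding it from the awk step would still break on whitespace.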
