Knowledge Base

Preserving for the future: Shell scripts, AoC, and more

Find approximate filenames

I have a collection of movie files. In the past I was less discriminating about file types, but I have since standardized on the .mkv format, because it can hold subtitles, additional audio tracks, and even additional video tracks. Over time I replaced some of the non-mkv files, but I never cleaned up all of the old .avi and .divx files. So some videos exist in multiple file formats, and one of the copies can just go away.

To find duplicates, I wrote a script that uses tre-agrep for fuzzy-searching.

My script find-approx.sh:

#!/bin/sh
# Startdate: 2022-09-25 17:07
# Goal: for each non-mkv video file, see if I have a fuzzy-match mkv file for it
# Dependencies:
#    tre-agrep
# Usage: build the file list first, then run this script from the same directory:
#    find . ! -type d ! -name '*.nfo' ! -name '*.jpg' ! -name '*.png' ! -name '*.srt' ! -name '*.pdf' ! -name '*.txt' ! -name '*.csv' ! -name '*.sh' > approx1
INFILE=approx1
_func() {
   _in="${1}"
   # we know to always strip file ending
   _match="$( basename "${_in}" | sed -r -e 's@\.....?$@@;' )"
   # should probably also strip (YYYY)
   _match="$( echo "${_match}" | sed -r -e 's@ ?\((19|20)[0-9]{2}\) ?$@@;' )"
   # and remove ", The" and similar for good measure
   _match="$( echo "${_match}" | sed -r -e 's@, (The|A|An)$@@;' )"
   #echo "Do something with \"${_match}\""
   tre-agrep -e "${_match}.*\.mkv" < "${INFILE}" | sed -r -e "s@^@FOUND \"${_in}\": @;"
}
# drop any carriage returns, keep only non-mkv entries, and read each line verbatim
tr -d '\r' < "${INFILE}" | grep -viE '\.mkv$' | while IFS= read -r foo ;
do
   _func "${foo}"
done

Not every printed line is an actual match, so the output still needs a manual review. For example, "The Godfather.avi" matched "Godfather, The, II (1974).mkv", so clearly I need to ignore some. But I was able to clean up a few old-format files that I no longer need!
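
To see what search term the script builds for a given file, the three sed steps from _func can be tried on their own. This is only a sketch: the function name normalize_title and the sample paths are made up for illustration, and it uses sed -E instead of sed -r (the same extended-regex mode, also accepted by BSD sed).

```shell
#!/bin/sh
# Sketch of the title normalization done inside _func.
# normalize_title is a hypothetical helper name, not part of find-approx.sh.
normalize_title() {
   # 1. strip a 3- or 4-character file extension
   # 2. strip a trailing "(YYYY)" release year
   # 3. strip a trailing ", The", ", A" or ", An"
   basename "${1}" | sed -E \
      -e 's@\.....?$@@' \
      -e 's@ ?\((19|20)[0-9]{2}\) ?$@@' \
      -e 's@, (The|A|An)$@@'
}

# Sample paths are invented for this demonstration.
normalize_title './old/Godfather, The (1972).avi'   # prints: Godfather
normalize_title './old/Alien (1979).divx'           # prints: Alien
normalize_title './old/The Godfather.avi'           # prints: The Godfather
```

The last example shows why a leading "The" stays in the search term: only the trailing ", The" form is removed.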
