I have a movie file collection. In the past, I was less discriminating about filetypes, but now I have standardized on the .mkv format, because it can hold subtitles, additional audio tracks, and even additional video tracks. Over time, I have replaced some of the non-mkv files, but I never cleaned up all the old .avi or .divx files. So I might have the same video in multiple file formats, one of which can just go away.
To find duplicates, I wrote a script that uses tre-agrep for fuzzy-searching.
My script, find-approx.sh:
#!/bin/sh
# Startdate: 2022-09-25 17:07
# Goal: for each non-mkv video file, see if I have a fuzzy-match mkv file for it
# Dependencies:
#    tre-agrep
# Generate the input list first:
# find . ! -type d ! -name '*.nfo' ! -name '*.jpg' ! -name '*.png' ! -name '*.srt' ! -name '*.pdf' ! -name '*.txt' ! -name '*.csv' ! -name '*.sh' > approx1

INFILE=approx1

_func() {
   _in="${1}"
   # we know to always strip file ending
   _match="$( basename "${_in}" | sed -r -e 's@\.....?$@@;' )"
   # should probably also strip (YYYY)
   _match="$( echo "${_match}" | sed -r -e 's@ ?\((19|20)[0-9]{2}\) ?$@@;' )"
   # and remove ", The" and similar for good measure
   _match="$( echo "${_match}" | sed -r -e 's@, (The|A|An)$@@;' )"
   #echo "Do something with \"${_match}\""
   # fuzzy-search the full list for an mkv file with a similar name
   tre-agrep -e "${_match}.*\.mkv" < "${INFILE}" | sed -r -e "s@^@FOUND \"${_in}\": @;"
}

# take every file that does not end in .mkv, drop stray carriage returns,
# and read one file name per line without mangling backslashes or whitespace
grep -viE '\.mkv$' < "${INFILE}" | tr -d '\r' | while IFS= read -r foo ;
do
   _func "${foo}"
done
Not every printed line is an actual match, though. "The Godfather.avi" matched "Godfather, The, II (1974).mkv", so clearly I need to review the output and ignore some of the hits. But I was able to clean up a few old-format files that I no longer need!
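One way to weed out false positives like the Godfather one might be a stricter second pass over the hits. The following is only a minimal sketch, not part of my script: the function names normalize_title and same_title are made up for illustration, and stripping a leading article and lowercasing are extra assumptions on top of the three sed steps above.

```sh
#!/bin/sh
# Hypothetical helper: reduce a file name to a bare, lowercase title.
# Mirrors the stripping steps of find-approx.sh (extension, "(YYYY)",
# trailing ", The"), and additionally drops a leading article (assumption).
normalize_title() {
   basename "$1" \
      | sed -E -e 's@\.[^.]{3,4}$@@' \
               -e 's@ ?\((19|20)[0-9]{2}\) ?$@@' \
               -e 's@, (The|A|An)$@@' \
               -e 's@^(The|A|An) @@' \
      | tr '[:upper:]' '[:lower:]'
}

# Hypothetical helper: succeed only if both names reduce to the same title.
same_title() {
   [ "$(normalize_title "$1")" = "$(normalize_title "$2")" ]
}

# Small demonstration with the names from this post.
for candidate in "Godfather, The (1972).mkv" "Godfather, The, II (1974).mkv" ; do
   if same_title "The Godfather.avi" "${candidate}" ; then
      echo "same title: ${candidate}"
   else
      echo "different title: ${candidate}"
   fi
done
```

This exact comparison is deliberately strict, so it would only make sense as a filter on what tre-agrep already found. It would throw away genuine matches whose names differ by a typo, which is the whole reason for fuzzy-searching in the first place.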