Find approximate filenames

bgstack15

2022-10-08 15:20

I have a movie file collection. In the past I was less discriminating about file types, but I have since standardized on the .mkv format, because a single file can include subtitles, additional audio tracks, and even additional video tracks. Over time I have replaced some of the non-mkv files, but I never cleaned up all the old .avi or .divx files. So I might have the same video in multiple file formats, one of which can just go away.

To find duplicates, I wrote a script that uses tre-agrep for fuzzy-searching.

My script find-approx.sh:

	#!/bin/sh
	# Startdate: 2022-09-25 17:07
	# Goal: for each non-mkv video file, see if I have a fuzzy-match mkv file for it
	# Dependencies:
	#    tre-agrep
	# Build the input list first:
	# find . ! -type d ! -name '*.nfo' ! -name '*.jpg' ! -name '*.png' ! -name '*.srt' ! -name '*.pdf' ! -name '*.txt' ! -name '*.csv' ! -name '*.sh' > approx1
	INFILE=approx1
	_func() {
	   _in="${1}"
	   # we know to always strip file ending
	   _match="$( basename "${_in}" | sed -r -e 's@\.....?$@@;' )"
	   # should probably also strip (YYYY)
	   _match="$( echo "${_match}" | sed -r -e 's@ ?\((19|20)[0-9]{2}\) ?$@@;' )"
	   # and remove ", The" and similar for good measure
	   _match="$( echo "${_match}" | sed -r -e 's@, (The|A|An)$@@;' )"
	   #echo "Do something with \"${_match}\""
	   # fuzzy-search the list for an mkv whose name resembles the stripped title
	   tre-agrep -e "${_match}.*\.mkv" < "${INFILE}" | sed -r -e "s@^@FOUND \"${_in}\": @;"
	}
	# read one path per line; IFS= and -r keep spaces and backslashes intact
	grep -viE '\.mkv' < "${INFILE}" | while IFS= read -r foo ;
	do
	   _func "${foo}"
	done
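
To see what the three stripping steps do to a filename before it reaches tre-agrep, here is a minimal sketch that runs only the sed part. The function name `normalize_title` and the sample paths are made up for illustration, and it uses `sed -E`, which GNU sed treats the same as the `-r` used in the script.

```shell
#!/bin/sh
# normalize_title is a hypothetical helper name, not part of find-approx.sh.
# It applies the same three substitutions, in the same order, to one path.
normalize_title() {
    basename "$1" \
        | sed -E -e 's@\.....?$@@' \
                 -e 's@ ?\((19|20)[0-9]{2}\) ?$@@' \
                 -e 's@, (The|A|An)$@@'
}

# Sample paths are invented for the demonstration.
normalize_title './old/Godfather, The (1972).avi'   # prints: Godfather
normalize_title './old/Alien (1979).divx'           # prints: Alien
```

Note that the article is only removed when it sits at the very end of the name after the year is stripped, so "Godfather, The, II (1974).mkv" would come out as "Godfather, The, II".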

Not every printed line is an actual match. For example, "The Godfather.avi" matched "Godfather, The, II (1974).mkv", so clearly I need to review the output and ignore some lines. But I was able to clean up a few old-format files that I no longer need!
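
One possible way to sort the FOUND lines into "likely duplicate" and "needs a look" is a stricter second comparison on a crude title key. This is only a sketch and not something the script above does: `canon_title` and `same_title` are hypothetical names, and the key (lowercase, no extension, no year, no articles, no punctuation) is an assumption about what counts as the same title. It would miss typos that the fuzzy search catches, so it works better as a sanity check than as a replacement.

```shell
#!/bin/sh
# canon_title and same_title are hypothetical helpers, not part of find-approx.sh.
# canon_title reduces a path to a comparison key: no extension, no (YYYY),
# lowercase, no article at the start or after a comma, no punctuation.
canon_title() {
    basename "$1" \
        | sed -E -e 's@\.....?$@@' \
                 -e 's@ ?\((19|20)[0-9]{2}\) ?$@@' \
        | tr '[:upper:]' '[:lower:]' \
        | sed -E -e 's@^(the|a|an) @@' \
                 -e 's@, (the|a|an)(,|$)@\2@' \
                 -e 's@[^a-z0-9 ]@@g' \
                 -e 's@  +@ @g'
}

# same_title succeeds only when both paths reduce to the same key.
same_title() {
    [ "$(canon_title "$1")" = "$(canon_title "$2")" ]
}

same_title 'The Godfather.avi' 'Godfather, The (1972).mkv'     && echo "likely duplicate"
same_title 'The Godfather.avi' 'Godfather, The, II (1974).mkv' || echo "needs review"
```

With this key, the false positive from the post is flagged for review because "godfather" and "godfather ii" differ.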
