Knowledge Base

Preserving for the future: Shell scripts, AoC, and more

Modify html file to use correctly-cased href and src filenames

I was trying to read the included index.html file for the Civilization 2 scenario Star Trek: Battle for the Alpha Quadrant by Kobayashi. When I exploded the 7z file on my GNU/Linux system, the html file pointed to invalid files for some images and links because of the case-sensitive filesystem. Rather than reading through the file manually to work out how I should rename things, I decided to modify the html file to point to the extant files. Here is my mostly general solution for that.

#!/usr/bin/env python3
# Startdate: 2021-07-17
# Purpose: fix incorrectly-cased links in html file
# Usage:
#    ~/dev/fix-links/fix_links.py ~/Downloads/BAQ/Index.HTM ~/Downloads/BAQ/asdf.html
from bs4 import BeautifulSoup
import sys, os, subprocess

def fix_links(contents,outfile=None):
   # Paths are checked relative to the current directory, so run this from the directory that holds the html file.
   soup = BeautifulSoup(contents,'html.parser')
   changed=0
   # src is mostly for img; href matches A and any other tag that has one.
   for attr in ('src', 'href'):
      for item in soup.find_all(attrs={attr: True}):
         if os.path.exists(item[attr]):
            continue
         print(f"need to fix {item[attr]}")
         newvalue = find_case_insensitive(item[attr])
         #print(f"Is it {newvalue}?")
         if newvalue: # no match (URL, anchor, truly missing file): leave the value alone
            item[attr] = newvalue
            changed += 1
   if changed > 0:
      print("Made a change!")
      if outfile:
         with open(outfile,"w") as o:
            o.write(soup.prettify())
      else:
         print(soup)

def find_case_insensitive(filename, dirname = os.path.curdir):
   # screw doing it the pythonic way. Just do it the real way.
   # major flaw: this only works for current-path files, i.e., no directory name is in the searched filename.
   # Passing an argument list (no shell) keeps spaces and quotes in filenames from breaking the command.
   result = subprocess.run(["find", dirname, "-iname", filename, "-printf", "%f\n"],
      capture_output=True, text=True)
   matches = result.stdout.splitlines()
   # Return the first match only; None means find found nothing.
   return matches[0] if matches else None

if __name__ == "__main__":
   contents="<html>dummy text</html>"
   if len(sys.argv) >= 2:
      infile= sys.argv[1]
      with open(infile,"r") as i:
         contents=i.read()
   outfile = None
   if len(sys.argv) >= 3:
      outfile = sys.argv[2]
   fix_links(contents, outfile)
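
The os.path.exists check in the script runs on every src and href value, so external URLs, mailto: links, and in-page anchors get reported as needing a fix too. One way to filter those out first might look like the sketch below. The helper name local_path is made up for illustration and is not part of the script; it assumes that anything with a URL scheme or a host is not a local file.

```python
from urllib.parse import urlsplit, unquote

def local_path(ref):
    """Return the local file path a src/href value refers to, or None.

    Hypothetical helper, not part of the original script.
    """
    parts = urlsplit(ref)
    # A scheme (http:, mailto:, ...) or a host means this is not a local file.
    if parts.scheme or parts.netloc:
        return None
    # urlsplit has already separated ?query and #fragment; decode %20 and friends.
    path = unquote(parts.path)
    # A bare "#anchor" leaves an empty path, which is not a file either.
    return path or None
```

With something like this, the loop could call local_path(item['href']) and skip the element whenever it gets None back.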

Yes, I don't do much validation of the parameters, and I don't even handle files in subdirectories. That wasn't in my use case, so I didn't have to solve for it. Stay tuned in case I ever need to add that! I use the Beautiful Soup Python library to find all elements that have a src or href attribute, and then ensure each one points to a file. Unfortunately, I only know an easy way to find a file case-insensitively with low-level tools, not some fancy-pants pythonic way.
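
For what it's worth, a pure-Python lookup does not have to be fancy. Here is a minimal sketch that walks the path one component at a time and compares names case-insensitively with os.listdir, which would also cover files in subdirectories. The function name resolve_case_insensitive and its root parameter are made up for illustration and are not part of the script. It assumes forward-slash separators as used in html, and it gives up on ".." components.

```python
import os

def resolve_case_insensitive(ref, root=os.path.curdir):
    """Return ref re-spelled with the on-disk casing, or None if nothing matches.

    Hypothetical helper, not part of the original script.
    """
    current = root
    resolved = []
    for part in ref.split("/"):
        if part in ("", "."):
            continue  # skip empty pieces such as a doubled slash
        try:
            entries = os.listdir(current)
        except OSError:
            return None  # current is missing or is not a directory
        # Prefer an exact match, then fall back to the first case-insensitive one.
        matches = [e for e in entries if e == part] or \
                  sorted(e for e in entries if e.lower() == part.lower())
        if not matches:
            return None
        resolved.append(matches[0])
        current = os.path.join(current, matches[0])
    return "/".join(resolved) if resolved else None
```

If two files differ only by case, this picks the alphabetically first one, which is an arbitrary choice made for the sketch.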
