I was trying to read the included index.html file for the Civilization 2
scenario Star Trek: Battle for the Alpha Quadrant by Kobayashi. When I
extracted the 7z file on my GNU/Linux system, the html file pointed to
nonexistent files for some images and links, because the filesystem is
case-sensitive and the references did not match the case of the real
filenames. Rather than reading through the file manually to work out how I
should rename things, I decided to modify the html file to point to the
extant files. Here is my mostly general solution for that.
#!/usr/bin/env python3
# Startdate: 2021-07-17
# Purpose: fix incorrectly-cased links in html file
# Usage:
#    ~/dev/fix-links/fix_links.py ~/Downloads/BAQ/Index.HTM ~/Downloads/BAQ/asdf.html
from bs4 import BeautifulSoup
import sys, os, glob

def fix_links(contents,outfile=None):
    soup = BeautifulSoup(contents,'html.parser')
    changed=0
    for item in soup.find_all(src=True): # mostly for img
        if not os.path.exists(item['src']):
            print(f"need to fix {item['src']}")
            item['src'] = find_case_insensitive(item['src'])
            changed += 1
    for item in soup.find_all(href=True): # finds anything with an href, so A and other tags.
        if not os.path.exists(item['href']):
            print(f"need to fix {item['href']}")
            newvalue=find_case_insensitive(item['href'])
            #print(f"Is it {newvalue}?")
            item['href'] = newvalue
            changed += 1
    if changed > 0:
        print("Made a change!")
    if outfile:
        with open(outfile,"w") as o:
            o.write(str(soup.prettify()))
    else:
        print(soup)

def find_case_insensitive(filename, dirname = os.path.curdir):
    # screw doing it the pythonic way. Just do it the real way.
    # major flaw: this only works for current-path files, i.e., no directory name is in the searched filename.
    output = os.popen(f"find {dirname} -iname {filename} -printf '%f'").read()
    return output

if __name__ == "__main__":
    contents="<html>dummy text</html>"
    if len(sys.argv) >= 2:
        infile= sys.argv[1]
        with open(infile,"r") as i:
            contents=i.read()
    outfile = None
    if len(sys.argv) >= 3:
        outfile = sys.argv[2]
    fix_links(contents, outfile)
Yes, I don't do much validation of the parameters, and I don't even handle
files in subdirectories. That wasn't in my use case, so I didn't have to
solve for it. Stay tuned in case I ever need to add that! I use the
Beautiful Soup Python library to find all elements that have a src or href
attribute, and then ensure each one points to a file that exists. Because
that check and the find call are both relative to the current directory, the
script needs to be run from the directory that holds the html file.
Unfortunately, the only easy way I know to find a file case-insensitively is
with low-level tools (find -iname), not some fancy-pants pythonic way.
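
For what it's worth, here is a rough sketch of what a pure-Python lookup might look like, in case the subdirectory limitation ever needs solving. It is not what the script above does: the function name resolve_case_insensitive is made up for this example, and it assumes links use forward slashes and are relative to dirname. It walks the path one component at a time and compares each component against os.listdir() output case-insensitively.

```python
import os

def resolve_case_insensitive(relpath, dirname=os.path.curdir):
    """Return relpath re-cased to match what is on disk, or None if nothing matches.

    Hypothetical helper, not part of the original script.
    """
    current = dirname
    resolved = []
    for part in relpath.split("/"):
        if part in ("", "."):
            continue  # skip empty components and "current directory"
        if part == "..":
            match = part  # pass parent-directory references through as-is
        else:
            try:
                names = os.listdir(current)
            except OSError:
                return None  # current is missing or is not a directory
            if part in names:
                match = part  # exact match wins
            else:
                candidates = sorted(n for n in names if n.lower() == part.lower())
                if not candidates:
                    return None
                match = candidates[0]  # arbitrary but repeatable choice
        resolved.append(match)
        current = os.path.join(current, match)
    return "/".join(resolved) if resolved else None
```

Because it returns None when nothing matches (which includes external URLs like http://...), a caller could leave the attribute untouched in that case instead of overwriting it with an empty string.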