Knowledge Base

Preserving for the future: Shell scripts, AoC, and more

Scraping Pharaoh walkthroughs

Every so often I set out to play through the old Pharaoh computer game: just the base game, no Cleopatra expansion. I have the original disc and don't need some mangled version.

But I don't get very far before I lose interest in the game. Sometimes I don't even get all the way to Giza and the big, famous pyramids. So I need guidance to play the game, and for that I use the classic walkthroughs on Heavengames.

In case the site ever disappears, I mirrored the pages locally with wget:

wget --limit-rate=50k -E -k -r -N -p -l 7 --reject-regex '.*caesar3.*|.*archives.*|.*strategy.*|.*downloads.*|.*faqs.*|.*buildings.*|.*links.*|.*comps/|.*egypt/|.*interviews/|.*gallery/|.*sitemap/|.*lympis/|.*\<e3/|.*cgi-bin/.*newsfiles/'

I built that exclusion regex through a fair amount of trial and error. I only wanted the files needed to display the roughly 21 level walkthroughs, including their images and CSS.
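
A quick way to shorten that trial and error is to try the pattern on a few URLs by hand before starting a long crawl. The sketch below is not part of the original workflow. The URLs are made up, `would_reject` is a hypothetical helper, and it assumes that GNU grep with `-E` behaves closely enough to the POSIX regex matching wget applies to each complete URL. The `\<` word-start escape in particular is a GNU extension, so other grep implementations may treat it differently.

```shell
#!/bin/sh
# Reject pattern copied unchanged from the wget command above.
reject='.*caesar3.*|.*archives.*|.*strategy.*|.*downloads.*|.*faqs.*|.*buildings.*|.*links.*|.*comps/|.*egypt/|.*interviews/|.*gallery/|.*sitemap/|.*lympis/|.*\<e3/|.*cgi-bin/.*newsfiles/'

# would_reject URL (hypothetical helper): exit status 0 if the pattern
# matches somewhere in URL. Assumes grep -E approximates wget's matcher.
would_reject() {
    printf '%s\n' "$1" | grep -Eq -- "$reject"
}

# Made-up URLs, for illustration only; they are never fetched.
for url in \
    'http://example.com/walkthroughs/giza.shtml' \
    'http://example.com/strategy/housing.shtml' \
    'http://example.com/e3/report.shtml' \
    'http://example.com/home3/page.shtml' \
    'http://example.com/cgi-bin/search.cgi'
do
    if would_reject "$url"; then
        echo "reject $url"
    else
        echo "keep   $url"
    fi
done
```

The `home3/` URL is there to exercise `\<e3/`, which should only match when `e3` starts a word. The last URL exercises the final alternative, which as written only matches a URL containing `cgi-bin/` followed somewhere later by `newsfiles/`.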