#### Metadata
Startdate: 2020-05-30 15:51
References:
Everything on this page, for jq filtering. https://stedolan.github.io/jq/manual/#Basicfilters
# Flow
1. Use gitlablib to list all issue web URLs, then remove the "build", "buildmodify", and similar CI/CD issues.
. gitlablib.sh
list_all_issues | tee output/issues.all
<output/issues.all jq '.[]| if(.title|test("build-?(a(ll)?|mod(ify)?|add|del)?$")) then empty else . end | .web_url' | sed -r -e 's/"//g;' > output/issues.all.web_url
Manually reorder output/issues.all.web_url so that devuan/devuan-project/issues/20 is on top.
2. Use fetch-issue-webpages.py to fetch all of those web pages.
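For readers who prefer Python to jq, the filtering in step 1 can be sketched roughly as follows. This is an illustration only, not part of the original tooling: the function names are hypothetical, and it assumes output/issues.all holds one or more JSON arrays of issue objects that each carry `title` and `web_url` keys, as the jq command implies.

```python
import json
import re

# Same pattern as the jq test() call: titles ending in "build", "build-all",
# "buildmodify", "build-add", "builddel", and similar.
CI_TITLE = re.compile(r"build-?(a(ll)?|mod(ify)?|add|del)?$")


def iter_json_documents(text):
    """Yield each top-level JSON value from text, like jq reading a stream."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value


def issue_web_urls(issues):
    """Return web_url for every issue whose title is not a CI/CD build request."""
    return [i["web_url"] for i in issues if not CI_TITLE.search(i["title"])]


def filter_file(in_path, out_path):
    """Hypothetical wrapper: read the issue dump, write one URL per line."""
    with open(in_path, encoding="utf-8") as f:
        text = f.read()
    urls = []
    for array in iter_json_documents(text):
        urls.extend(issue_web_urls(array))
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(url + "\n" for url in urls)
```

Calling `filter_file("output/issues.all", "output/issues.all.web_url")` would then stand in for the jq and sed pipeline, with the URLs already unquoted.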
ln -s issues.all.web_url output/files-to-fetch.txt
./fetch-issue-webpages.py
3. Munge the downloaded HTML.
All of the following is performed by `flow-part2.sh`.
* Fix newlines: replace each literal `\n` sequence with a real newline character.
sed -i -r -e 's/\\n/\n/g;' /mnt/public/www/issues/*.html
* Find each data-original-title attribute and replace the contents of its <time> tag with the attribute's value. This step also pretty-prints the HTML with BeautifulSoup, which some of the following commands depend on.
ls -1 /mnt/public/www/issues/*.html > output/files-for-timestamps.txt
./fix-timestamps.py
* Download all relevant images, then fix the references to them in the HTML.
./fetch-images.sh
sed -i -f fix-images-in-html.sed /mnt/public/www/issues/*.html
* Download all stylesheets, then fix the references to them in the HTML.
mkdir -p /mnt/public/www/issues/css
./fetch-css.sh
sed -i -f fix-css-in-html.sed /mnt/public/www/issues/*.html
* Fix some encoding oddities.
sed -i -f remove-useless.sed /mnt/public/www/issues/*.html
* Remove HTML components that are not necessary.
./remove-useless.py
* Fix links that point to the defunct domain without-systemd.org.
sed -i -r -f fix-without-systemd-links.sed /mnt/public/www/issues/*.html
* Build some sort of index?
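The first two munging bullets of step 3 (the newline fix and the timestamp fix) can be sketched in Python as follows. This is an illustration under stated assumptions, not the contents of `fix-timestamps.py`: it uses a regular expression instead of BeautifulSoup, it does not pretty-print the HTML, and the function names are hypothetical. It also assumes each <time> tag carries its data-original-title attribute in double quotes.

```python
import re


def fix_newlines(html):
    """Turn each literal backslash-n pair into a real newline, like the sed call."""
    return html.replace("\\n", "\n")


# Capture the opening <time ...> tag, the data-original-title value,
# the current tag contents, and the closing tag.
TIME_TAG = re.compile(
    r'(<time\b[^>]*\bdata-original-title="([^"]*)"[^>]*>)(.*?)(</time>)',
    re.DOTALL,
)


def fix_timestamps(html):
    """Replace the contents of each <time> tag with its data-original-title value."""
    return TIME_TAG.sub(lambda m: m.group(1) + m.group(2) + m.group(4), html)
```

A regular expression is fragile against unusual markup, which is one reason a real HTML parser such as BeautifulSoup is the more robust choice for the actual script.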