#### Metadata
Startdate: 2020-05-30 15:51
References:
Everything on this page, for jq filtering. https://stedolan.github.io/jq/manual/#Basicfilters


# Flow

1. Use gitlablib to list all issues, then drop the "build", "buildmodify" and similar CI/CD issues and keep the web URL of each remaining issue.

    . gitlablib.sh
    list_all_issues | tee output/issues.all
    <output/issues.all jq -r '.[] | select(.title | test("build-?(a(ll)?|mod(ify)?|add|del)?$") | not) | .web_url' > output/issues.all.web_url

   Manually reorder output/issues.all.web_url so that the devuan/devuan-project/issues/20 URL is on top.

2. Use fetch-issue-webpages.py to fetch all those webpages.

    ln -s issues.all.web_url output/files-to-fetch.txt
    ./fetch-issue-webpages.py

3. Munge the downloaded HTML.
   All of the following is performed by `flow-part2.sh`.

  * Fix newlines: replace each literal `\n` sequence with a real line break.

    sed -i -r -e 's/\\n/\n/g;' /mnt/public/www/gitlab-issues/*.html

  * Replace the contents of each `<time>` tag with the value of its `data-original-title` attribute. This step also pretty-prints the HTML with BeautifulSoup, which some of the following commands need in order to work correctly.

    ls -1 /mnt/public/www/gitlab-issues/*.html > output/files-for-timestamps.txt
    ./fix-timestamps.py

  * Download all relevant images, then fix the references to them in the HTML.

    ./fetch-images.sh
    sed -i -f fix-images-in-html.sed /mnt/public/www/gitlab-issues/*.html

  * Download all stylesheets, then fix the references to them in the HTML.

    mkdir -p /mnt/public/www/gitlab-issues/css
    ./fetch-css.sh
    sed -i -f fix-css-in-html.sed /mnt/public/www/gitlab-issues/*.html

  * Fix some encoding oddities.

    sed -i -f remove-useless.sed /mnt/public/www/gitlab-issues/*.html

  * Remove HTML components that are not necessary.

    ./remove-useless.py

  * Fix links that point to the defunct domain without-systemd.org.

    sed -i -r -f fix-without-systemd-links.sed /mnt/public/www/gitlab-issues/*.html

  * Perform a final encoding conversion to remove any remaining broken characters.

    ./conversion.sh /mnt/public/www/gitlab-issues/*.html

  * Fix images whose `src="data:"` value does not load; for these, the `data-src` attribute holds the proper link.

    ./use-datasrc-instead-src.py

  * build some sort of index?
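
The title filter from step 1 can also be sketched in Python, which makes it easier to check what the pattern actually catches. This is only an illustration and not part of the original flow: the function name `non_build_urls` and the sample issues are made up, and the input is assumed to be a list of objects with `title` and `web_url` keys, as the jq filter implies.

```python
import json
import re

# Same pattern as the jq filter in step 1. It is anchored only at the end,
# so any title that ends in "build", "buildmodify", "build-add", etc. matches.
BUILD_TITLE = re.compile(r"build-?(a(ll)?|mod(ify)?|add|del)?$")


def non_build_urls(issues):
    """Return the web_url of every issue whose title is not a build request."""
    return [
        issue["web_url"]
        for issue in issues
        if not BUILD_TITLE.search(issue["title"])
    ]


if __name__ == "__main__":
    # Hypothetical sample data, shaped like the output of list_all_issues.
    sample = json.loads("""[
      {"title": "buildmodify", "web_url": "https://example.invalid/a/issues/1"},
      {"title": "Package fails to install", "web_url": "https://example.invalid/a/issues/2"},
      {"title": "build-all", "web_url": "https://example.invalid/a/issues/3"}
    ]""")
    for url in non_build_urls(sample):
        print(url)
```

Because the pattern has no anchor at the start, a title such as "rebuild" would be dropped as well; whether that matters depends on the issue titles in the actual data.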
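
Two of the HTML rewrites in step 3 (the `<time>` contents and the `data-src` images) can be sketched as follows. The real `fix-timestamps.py` and `use-datasrc-instead-src.py` are not shown in these notes, and the former is described as using BeautifulSoup; this sketch swaps in standard-library regular expressions as a stand-in, does no pretty-printing, and assumes double-quoted attributes. The function names are hypothetical.

```python
import re

# A <time> tag that carries a data-original-title attribute.
TIME_TAG = re.compile(
    r'(<time\b[^>]*\bdata-original-title="([^"]*)"[^>]*>)(.*?)(</time>)',
    re.DOTALL,
)

IMG_TAG = re.compile(r"<img\b[^>]*>")
DATA_SRC = re.compile(r'\bdata-src="([^"]*)"')
# A src attribute holding an inline "data:" URI. The lookbehind keeps the
# pattern from matching the "src" inside "data-src".
INLINE_SRC = re.compile(r'(?<![-\w])src="data:[^"]*"')


def use_original_titles(html):
    """Replace the contents of each <time> tag with its data-original-title."""
    return TIME_TAG.sub(lambda m: m.group(1) + m.group(2) + m.group(4), html)


def prefer_data_src(html):
    """Point src at the data-src link when src is an inline data: URI."""
    def fix(match):
        tag = match.group(0)
        real = DATA_SRC.search(tag)
        if real is None:
            return tag  # nothing better to use, leave the tag alone
        return INLINE_SRC.sub(lambda _: 'src="' + real.group(1) + '"', tag)

    return IMG_TAG.sub(fix, html)


if __name__ == "__main__":
    page = (
        '<time data-original-title="May 30, 2020 3:51pm">1 year ago</time>'
        '<img src="data:image/gif;base64,R0lGOD" data-src="/uploads/a.png">'
    )
    print(prefer_data_src(use_original_titles(page)))
```

Regular expressions are fragile on arbitrary HTML, so a parser-based approach like the BeautifulSoup one mentioned above is likely the safer choice for the real pages.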