The GitLab Issue Preservation (GLIP) Project
Introduction
Welcome to my huge hack of a project, which saves static webpages of the git issues from a private gitlab instance. As part of the Devuan migration from a self-hosted gitlab instance to a self-hosted gitea instance, the topic of preserving the historical information in the git issues came up. I volunteered. The end results are not on the public Internet yet, but will be eventually.
Overview of the process
For merely executing the steps, these are the instructions. Most of them invoke scripts that will be explained in the annotated process.
- Use gitlablib to list all issue web urls, and then remove all the "build", "buildmodify" and similar CI/CD issues.
. gitlablib.sh
list_all_issues | tee output/issues.all
- Use fetch-issue-webpages.py to fetch all those webpages.
ln -s issues.all.web_url output/files-to-fetch.txt
./fetch-issue-webpages.py
- Munge the downloaded html. All of this is available in flow-part2.sh, which I needed to separate from the fetch-pages task.
  - Fix newlines
sed -i -r -e 's/\\n/\n/g;' /mnt/public/www/issues/*.html
  - Find data-original-titles and replace the tag contents with the value of its data-original-title. Also, this will have BeautifulSoup pretty-print the html so some of the following commands work correctly.
ls -1 /mnt/public/www/issues/*.html > output/files-for-timestamps.txt
./fix-timestamps.py
  - Download all relevant images, and then fix them.
./fetch-images.sh
sed -i -f fix-images-in-html.sed /mnt/public/www/issues/*.html
  - Download all stylesheets and then fix them.
mkdir -p /mnt/public/www/issues/css
./fetch-css.sh
sed -i -f fix-css-in-html.sed /mnt/public/www/issues/*.html
  - Fix some encoding oddities
sed -i -f remove-useless.sed /mnt/public/www/issues/*.html
  - Remove html components that are not necessary
./remove-useless.py
  - Fix links that point to the defunct domain without-systemd.org.
sed -i -r -f fix-without-systemd-links.sed /mnt/public/www/issues/*.html
The annotated process
1. Use gitlablib to list all issue web urls, and then remove all the "build", "buildmodify" and similar CI/CD issues.
. gitlablib.sh
list_all_issues | tee output/issues.all
I wrote a brief shell script library for interacting with the gitlab REST API. I had more functions than I ended up needing for the whole project; I really only needed the one private function for pagination, of course, and then list_all_issues. Once I figured out how to list all issues (and how to handle the pagination), I didn't need to bother with the per-project or per-repo perspective.
#!/bin/sh
# Startdate: 2020-05-29
# Dependencies:
# jq
# my private token
# Library for interacting with Gitlab API
# For manual work:
# curl --header "${authheader}" "https://git.devuan.org/api/v4/projects/devuan%2Fdevuan-project/issues"
# References:
# https://docs.gitlab.com/ee/api/README.html#pagination
# handle transforming the / in the path_with_namespace to %2F per https://docs.gitlab.com/ee/api/README.html#namespaced-path-encoding
# https://docs.gitlab.com/ee/api/issues.html
export token="$( cat /mnt/public/work/devuan/git.devuan.org.token.txt )"
export authheader="Private-Token: ${token}"
export server=git.devuan.org
export GLL_TMPDIR="$( mktemp -d )"
clean_gitlablib() {
rm -rf "${GLL_TMPDIR:-NOTHINGTODELETE}"/*
}
# PRIVATE
_handle_gitlab_pagination() {
# call: list_all_projects "${startUri}"
___hgp_starturi="${1}"
test -n "${GLL_DEBUG}" && set -x
# BEGIN
rhfile="$( TMPDIR="${GLL_TMPDIR}" mktemp -t "headers.XXXXXXXXXX" )"
done=0
size=-1
uri="${___hgp_starturi}"
# LOOP
while test ${done} -eq 0 ;
do
response="$( curl -v -L --header "${authheader}" "${uri}" 2>"${rhfile}" )"
#grep -iE "^< link" "${rhfile}"
# determine size
if test "${size}" = "-1" ; then # run only if size is still undefined
tmpsize="$( awk '$2 == "x-total:" {print $3}' "${rhfile}" 2>/dev/null )"
test -n "${tmpsize}" && size="${tmpsize}"
echo "Number of items: ${size}" 1>&2
fi
tmpnextpage="$( awk '$2 == "x-next-page:" {print $3}' "${rhfile}" 2>/dev/null )"
# if x-next-page is blank, that means we are on the last page. Also, we could try x-total-pages compared to x-page.
test -z "${tmpnextpage}" && done=1
# so if we have a next page, get that link
nextUri="$( awk '{$1="";$2="";print}' "${rhfile}" | tr ',' '\n' | awk -F';' '/rel="next"/{print $1}' | sed -r -e 's/^\s*<//;' -e 's/>\s*$//;' )"
if test -n "${nextUri}" ; then
uri="${nextUri}"
else
echo "No next page provided! Error." 1>&2
done=1
fi
# show contents
echo "${response}"
done
# cleanup
rm "${rhfile}"
set +x
}
list_all_projects() {
_handle_gitlab_pagination "https://${server}/api/v4/projects"
}
list_all_issues() {
_handle_gitlab_pagination "https://${server}/api/v4/issues?scope=all&status=all"
}
list_issues_for_project() {
___lifp_project="${1}"
___lifp_htmlencode_bool="${2}"
istruthy "${___lifp_htmlencode_bool}" && ___lifp_project="$( echo "${___lifp_project}" | sed -r -e 's/\//%2F/g;' )"
_handle_gitlab_pagination "https://${server}/api/v4/projects/${___lifp_project}/issues"
}
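Since the whole library really boils down to "follow the x-next-page header until it is blank," here is a minimal sketch of the same loop in Python using the requests module. This is not part of the project; it only illustrates the pagination logic, and it assumes the same token file and server that the shell library above uses.
#!/usr/bin/env python3
# Minimal pagination sketch (illustrative only, not part of the project).
# It follows the x-next-page response header just like _handle_gitlab_pagination does.
import requests
server = "git.devuan.org"
with open("/mnt/public/work/devuan/git.devuan.org.token.txt") as f:
    token = f.read().strip()
def list_all_issues():
    issues = []
    params = {"scope": "all", "state": "all", "per_page": 100, "page": 1}
    while True:
        r = requests.get(f"https://{server}/api/v4/issues",
                         headers={"Private-Token": token}, params=params)
        r.raise_for_status()
        issues.extend(r.json())
        next_page = r.headers.get("x-next-page", "")
        if not next_page:  # blank means this was the last page
            break
        params["page"] = int(next_page)
    return issues
for issue in list_all_issues():
    print(issue["web_url"])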
2. Use fetch-issue-webpages.py to fetch all those webpages.
ln -s issues.all.web_url output/files-to-fetch.txt
./fetch-issue-webpages.py
This script is where the bulk of my learning occurred for this project. Did you know that headless browsers can scroll down a webpage and basically force the AJAX content to load? That's the annoying stuff that doesn't load when you do a nice, simple wget.
#!/usr/bin/env python3
# Startdate: 2020-05-29 16:22
# History:
# Usage:
# ln -s issues.all.web_url output/files-to-fetch.txt
# ./fetch-issue-webpages.py
# How to make this work:
# apt-get install python3-pyvirtualdisplay
# download this geckodriver, place in /usr/local/bin
# References:
# basic guide https://web.archive.org/web/20191031110759/http://scraping.pro/use-headless-firefox-scraping-linux/
# https://stackoverflow.com/questions/40302006/no-such-file-or-directory-geckodriver-for-a-python-simple-selenium-applicatio
# geckodriver https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
# https://www.selenium.dev/selenium/docs/api/py/index.html?highlight=get
# page source https://www.selenium.dev/selenium/docs/api/py/webdriver_remote/selenium.webdriver.remote.webdriver.html?highlight=title#selenium.webdriver.remote.webdriver.WebDriver.title
# make sure all comments load https://stackoverflow.com/questions/26566799/wait-until-page-is-loaded-with-selenium-webdriver-for-python/44998503#44998503
# https://crossbrowsertesting.com/blog/test-automation/automate-login-with-selenium/
# Improve:
from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import re, time, getpass
def ask_password(prompt):
    #return input(prompt+": ")
    return getpass.getpass(prompt+": ")

def scrollDown(driver, value):
    driver.execute_script("window.scrollBy(0,"+str(value)+")")

# Scroll down the page
def scrollDownAllTheWay(driver):
    old_page = driver.page_source
    while True:
        #logging.debug("Scrolling loop")
        for i in range(2):
            scrollDown(driver, 500)
            time.sleep(2)
        new_page = driver.page_source
        if new_page != old_page:
            old_page = new_page
        else:
            break
    return True

server_string="https://git.devuan.org"
outdir="/mnt/public/www/issues"

with open("output/files-to-fetch.txt") as f:
    lines=[line.rstrip() for line in f]

# ask password now instead of after the delay
password = ask_password("Enter password for "+server_string)

display = Display(visible=0, size=(800, 600))
display.start()
browser = webdriver.Firefox()

# log in to gitlab instance
browser.get(server_string+"/users/sign_in")
browser.find_element_by_id("user_login").send_keys('bgstack15')
browser.find_element_by_id("user_password").send_keys(password)
browser.find_element_by_class_name("qa-sign-in-button").click()
browser.get(server_string+"/profile") # always needs the authentication
scrollDownAllTheWay(browser)

for thisfile in lines:
    destfile=re.sub("\.+",".",re.sub("\/|issues",".",re.sub("^"+re.escape(server_string)+"\/","",thisfile)))+".html"
    print("Saving",thisfile,outdir+"/"+destfile)
    browser.get(thisfile)
    scrollDownAllTheWay(browser)
    with open(outdir+"/"+destfile,"w") as text_file:
        print(browser.page_source.encode('utf-8'),file=text_file)

# done with loop
browser.quit()
display.stop()
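The nested re.sub calls that build destfile are dense, so here is a worked example of what they do to a single, hypothetical issue URL (the URL is made up; only the transformation is the point):
import re
server_string = "https://git.devuan.org"
thisfile = server_string + "/devuan/devuan-project/issues/314"  # hypothetical issue URL
# same expression as in the loop above
destfile = re.sub("\.+",".",re.sub("\/|issues",".",re.sub("^"+re.escape(server_string)+"\/","",thisfile)))+".html"
print(destfile)  # devuan.devuan-project.314.html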
3. Munge the downloaded html. All of this is available in flow-part2.sh, which I needed to separate from the fetch-pages task.
* Fix newlines
sed -i -r -e 's/\\n/\n/g;' /mnt/public/www/issues/*.html
Nothing fancy here. I guess my encoding choice for saving the output was a little... wrong. So I'm sure this is a crutch that isn't used by the professionals.
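For the curious, this, along with the b' prefix and the \xNN escapes handled further down, is a direct consequence of how the fetch script wrote the page out: print() on an encoded bytes object writes the repr of the bytes, escapes and all. A quick demonstration:
# print() on a bytes object writes its repr: the b'...' quoting, literal \n
# sequences, and \xNN escapes that the later sed scripts have to clean up.
print("<p>\nhello\u2019s page</p>".encode("utf-8"))
# output: b'<p>\nhello\xe2\x80\x99s page</p>'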
* Find data-original-titles and replace the tag contents with the value of its data-original-title. Also, this will have BeautifulSoup pretty-print the html so some of the following commands work correctly.
ls -1 /mnt/public/www/issues/*.html > output/files-for-timestamps.txt
./fix-timestamps.py
I'm really fond of this one, partially because it's entirely my own solution (used exactly as written in another project), and because it depends on an amazing little piece of metadata that gitlab provides in the web pages! The timestamps for relevant items are included, so while the rendered html shows "1 week ago," we can convert the text to show the absolute timestamp. The script is as follows:
#!/usr/bin/env python3
# Startdate: 2020-05-29 20:40
# Purpose: convert timestamps on gitlab issue web page into UTC
# History:
# 2020-05-30 09:24 add loop through files listed in output/files-for-timestamps.txt
# Usage:
# ls -1 /mnt/public/www/issues/output*.html > output/files-for-timestamps.txt
# ./fix-timestamps.py
# References:
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#pretty-printing
# https://gitlab.com/bgstack15/vooblystats/-/blob/master/vooblystats.py
# /posts/2020/02/16/python3-convert-relative-date-to-utc-timestamp/
# Improve:
# this is hardcoded to work when the pages are shown in EDT.
from bs4 import BeautifulSoup
from datetime import timedelta
from parsedatetime import Calendar
from pytz import timezone
def fix_timestamps(page_text):
    soup = BeautifulSoup(page_text,"html.parser")
    cal = Calendar()
    x = 0
    for i in soup.find_all(name='time'):
        x = x + 1
        j = i.attrs["data-original-title"]
        if 'EDT' == j[-3:] or 'EST' == j[-3:]:
            tzobject=timezone("US/Eastern")
        else:
            tzobject=timezone("UTC")
        dto, _ = cal.parseDT(datetimeString=j,tzinfo=tzobject)
        add_hours = int((str(dto)[-6:])[:3])
        j = (timedelta(hours=-add_hours) + dto).strftime('%Y-%m-%dT%H:%MZ')
        # second precision %S is not needed for this use case.
        i.string = j
    return soup

with open("output/files-for-timestamps.txt") as f:
    lines = [line.rstrip() for line in f]

for thisfile in lines:
    print("Fixing timestamps in file",thisfile)
    with open(thisfile) as tf:
        output=fix_timestamps(tf.read())
    with open(thisfile,"w",encoding='utf-8') as tf:
        tf.write(str(output.prettify()))
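To see what the conversion does, here is an illustrative call. The <time> tag is hand-written in the shape gitlab emits (not copied from a real issue page), and it assumes fix_timestamps() from the script above is in scope:
# Illustrative only: a hand-made <time> tag shaped like the ones gitlab emits.
sample = '<time data-original-title="May 29, 2020 4:22pm EDT">1 week ago</time>'
print(fix_timestamps(sample))
# the relative "1 week ago" text is replaced with an absolute stamp,
# e.g. <time data-original-title="...">2020-05-29T20:22Z</time>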
* Download all relevant images, and then fix them.
./fetch-images.sh
sed -i -f fix-images-in-html.sed /mnt/public/www/issues/*.html
I wrote this script first, because the images were the most important item for this whole project.
#!/bin/sh
# startdate 2020-05-29 20:04
# After running this, be sure to do the sed.
# sed -i -f fix-images-in-html.sed /mnt/public/www/issues/*.html
# Improve:
# It is probably an artifact of the weird way the asset svgs are embedded, but I cannot get them to display at all even though they are downloaded successfully. I have seen this before: the little embedded images you cannot easily download and simply display.
INDIR=/mnt/public/www/issues
INGLOB=*.html
SEDSCRIPT=/mnt/public/work/devuan/fix-images-in-html.sed
INSERVER=https://git.devuan.org
cd "${INDIR}"
# could use this line to get all the assets, but they do not display regardless due to html weirdness
#orig_src="$( grep -oE '(\<src|xlink:href)="?\/[^"]*"' ${INGLOB} | grep -vE '\.js' | awk -F'"' '!x[$0]++{print $2}' )"
orig_src="$( grep -oE '\<src="?\/[^"]*"' ${INGLOB} | grep -vE '\.js' | awk -F'"' '!x[$2]++{print $2}' )"
cat /dev/null > "${SEDSCRIPT}"
echo "${orig_src}" | while read line ; do
getpath="${INSERVER}${line}"
outdir="$( echo "${line}" | awk -F'/' '{print $2}' )"
test ! -d "${outdir}" && mkdir -p "${outdir}"
targetfile="${outdir}/$( basename "${line}" )"
test -n "${DEBUG}" && echo "process ${getpath} and save to ${targetfile}" 1>&2
test -z "${DRYRUN}" && wget --quiet --content-disposition -O "${targetfile}" "${getpath}"
# dynamically build a sed script
echo "s:${line}:${targetfile##/}:g;" | tee -a "${SEDSCRIPT}"
done
* Download all stylesheets and then fix them.
mkdir -p /mnt/public/www/issues/css
./fetch-css.sh
sed -i -f fix-css-in-html.sed /mnt/public/www/issues/*.html
This is basically a rehash of the previous script.
#!/bin/sh
# Startdate: 2020-05-29 20:18
INDIR=/mnt/public/www/issues
INGLOB=*.html
SEDSCRIPT=/mnt/public/work/devuan/fix-css-in-html.sed
INSERVER=https://git.devuan.org
cd "${INDIR}"
# OUTDIR will be made in INDIR, because of the `cd` above.
OUTDIR=css
test ! -d "${OUTDIR}" && mkdir -p "${OUTDIR}"
orig_css="$( sed -n -r -e 's/^.*<link.*(href="[^"]+\.css").*/\1/p' ${INGLOB} | awk -F'"' '!x[$2]++{print $2}' )"
cat /dev/null > "${SEDSCRIPT}"
echo "${orig_css}" | while read line ; do
getpath="${INSERVER}${line}"
targetfile="${OUTDIR}/$( basename "${line}" )"
test -n "${DEBUG}" && echo "process ${getpath} and save to ${targetfile}" 1>&2
test -z "${DRYRUN}" && wget --quiet --content-disposition -O "${targetfile}" "${getpath}"
# dynamically build a sed script
echo "s:${line}:${targetfile##/}:g;" | tee -a "${SEDSCRIPT}"
done
* Fix some encoding oddities
sed -i -f remove-useless.sed /mnt/public/www/issues/*.html
This is definitely because of my choice of encoding. In fact, I bet my copy-paste of the script contents is entirely messed up for this blog post. You'll have to check it out in the git repo. Also, this is probably the hackiest part of the whole project.
$ {s/^'//}
1 {s/^b'//}
s/Â·/·/g # do not ask how I made this one
s/Ã//g
s/\\'/'/g
s/\xc2(\x91|\x82|\x)//g
s/\\xc2\\xb7/·/g # two characters here
s/\\xc3\\xab/ë/g
s/\\xe1\\xb4\\x84\\xe1\\xb4\\xa0\\xe1\\xb4\\x87/CVE/g
s/\\xe2\\x80\\x99/'/g
s/\\xe2\\x80\\xa6/.../g
s/(\\x..)*\\xb7/·/g # two characters here
* Remove html components that are not necessary
./remove-useless.py
Thankfully, I know enough BeautifulSoup to be dangerous. In fact, I went with the scrape-and-delete method because we wanted readable issue contents with minimal work. And yes, this was my best-case scenario for "minimal work." And yes, I know this has way too much duplicated code. It works. Please submit any optimizations as a comment below, or as a PR on the git repo.
#!/usr/bin/env python3
# Startdate: 2020-05-30 19:30
# Purpose: remove key, useless html elements from slurped pages
from bs4 import BeautifulSoup
import sys
def remove_useless(contents):
    soup = BeautifulSoup(contents,"html.parser")
    try:
        sidebar = soup.find(class_="nav-sidebar")
        sidebar.replace_with("")
    except:
        pass
    try:
        navbar = soup.find(class_="navbar-gitlab")
        navbar.replace_with("")
    except:
        pass
    try:
        rightbar = soup.find(class_="issuable-context-form")
        rightbar.replace_with("")
    except:
        pass
    try:
        rightbar = soup.find(class_="js-issuable-sidebar")
        rightbar.replace_with("")
    except:
        pass
    try:
        rightbar = soup.find(class_="js-issuable-actions")
        rightbar.replace_with("")
    except:
        pass
    try:
        rightbar = soup.find(class_="js-noteable-awards")
        rightbar.replace_with("")
    except:
        pass
    try:
        rightbar = soup.find(class_="disabled-comment")
        rightbar.replace_with("")
    except:
        pass
    try:
        rightbar = soup.find(class_="notes-form")
        rightbar.replace_with("")
    except:
        pass
    try:
        rightbar = soup.find(class_="btn-edit")
        rightbar.replace_with("")
    except:
        pass
    try:
        rightbar = soup.find(class_="js-issuable-edit")
        rightbar.replace_with("")
    except:
        pass
    try:
        mylist = soup.find_all(class_="note-actions")
        for i in mylist:
            i.replace_with("")
    except:
        pass
    try:
        mylist = soup.find_all(class_="emoji-block")
        for i in mylist:
            i.replace_with("")
    except:
        pass
    return soup

with open("output/files-for-timestamps.txt") as f:
    lines = [line.rstrip() for line in f]

for thisfile in lines:
    print("Removing useless html in file",thisfile)
    with open(thisfile) as tf:
        output=remove_useless(tf.read())
    with open(thisfile,"w",encoding='utf-8') as tf:
        tf.write(str(output.prettify()))
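In the spirit of that request for optimizations, one possible consolidation (a sketch, not what the project ships) would drive the same replace_with() calls from two lists of class names:
# Sketch of a consolidated remove_useless(); behavior should match the version above.
from bs4 import BeautifulSoup
REMOVE_FIRST_MATCH = ["nav-sidebar", "navbar-gitlab", "issuable-context-form",
                      "js-issuable-sidebar", "js-issuable-actions", "js-noteable-awards",
                      "disabled-comment", "notes-form", "btn-edit", "js-issuable-edit"]
REMOVE_ALL_MATCHES = ["note-actions", "emoji-block"]
def remove_useless(contents):
    soup = BeautifulSoup(contents, "html.parser")
    for classname in REMOVE_FIRST_MATCH:
        tag = soup.find(class_=classname)
        if tag is not None:
            tag.replace_with("")
    for classname in REMOVE_ALL_MATCHES:
        for tag in soup.find_all(class_=classname):
            tag.replace_with("")
    return soup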
* Fix links that point to the defunct domain without-systemd.org.
sed -i -r -f fix-without-systemd-links.sed /mnt/public/www/issues/*.html
This requirement came in late during the development phase. I called this one "scope creep," but thankfully it was easy enough to automate changing out links to the web.archive.org versions.
/without-systemd\.org/{
/archive\.org/!{
s@(http://without-systemd\.org)@https://web.archive.org/web/20190208013412/\1@g;
}
}
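The nested sed addresses read a little cryptically: the outer /without-systemd\.org/ selects lines that mention the dead domain, the inner /archive\.org/! skips any that already point at the Internet Archive, and only then does the substitution run. Here is the same logic spelled out in Python, with hypothetical links:
import re
# Hypothetical links, only to illustrate which ones the sed rules touch.
links = [
    "http://without-systemd.org/wiki/index.php/Main_Page",
    "https://web.archive.org/web/20190208013412/http://without-systemd.org/wiki/index.php/Main_Page",
]
for link in links:
    if "without-systemd.org" in link and "archive.org" not in link:
        link = re.sub(r"(http://without-systemd\.org)",
                      r"https://web.archive.org/web/20190208013412/\1", link)
    print(link)
# only the first link gets rewritten; the second already points at the archive and is left alone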
Conclusions
I learned how to use a headless browser for this project! I had already dabbled with BeautifulSoup and jq, and of course I already knew the GNU coreutils. Thankfully, I already had a function for fixing relative timestamps!
References
Weblinks
Obviously my scripts listed here also contain the plain URLs of the references, but this is the list of them in html format:
- API Docs | GitLab #pagination
- API Docs | GitLab #namespaced-path-encoding
- Issues API | GitLab
- basic guide to headless browser Tutorial: How to use Headless Firefox for Scraping in Linux (Web Archive)
- linux - No such file or directory: 'geckodriver' for a Python simple Selenium application - Stack Overflow
- geckodriver binary https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
- Selenium Client Driver — Selenium 3.14 documentation
- page source selenium.webdriver.remote.webdriver — Selenium 3.14 documentation
- make sure all comments load Wait until page is loaded with Selenium WebDriver for Python - Stack Overflow
- Selenium 101: How To Automate Your Login Process | CrossBrowserTesting.com
- Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation
- relative timestamp to absolute vooblystats.py · master · B Stack / vooblystats · GitLab
- Python3: convert relative date to UTC timestamp