
The GitLab Issue Preservation (GLIP) Project

Introduction

Welcome to my huge hack of a project, which saves down static webpages of the git issues from a private gitlab instance. As a part of the Devuan migration from a self-hosted gitlab instance to a self-hosted gitea instance, the topic came up of saving the historical information in the git issues. I volunteered. The end results are not on the public Internet yet, but will be eventually.

Overview of the process

For merely executing the steps, these are the instructions. Most of them invoke scripts, which are explained in the annotated process below.

  1. Use gitlablib to list all issue web urls, and then remove all the "build", "buildmodify" and similar CI/CD issues.

    . gitlablib.sh
    

    list_all_issues | tee output/issues.all output/issues.all.web_url

  2. Use fetch-issue-webpages.py to fetch all those webpages.

    ln -s issues.all.web_url output/files-to-fetch.txt
    

    ./fetch-issue-webpages.py

  3. Munge the downloaded html. All of this is available in flow-part2.sh, which I needed to separate from the fetch-pages task.

    • Fix newlines

          sed -i -r -e 's/\\n/\n/g;' /mnt/public/www/issues/*.html
      
    • find time tags with a data-original-title attribute and replace the tag contents with the value of that data-original-title. Also, this will BeautifulSoup pretty-print the html so some of the following commands work correctly.

          ls -1 /mnt/public/www/issues/*.html > output/files-for-timestamps.txt
      

      ./fix-timestamps.py

    • Download all relevant images, and then fix them.

          ./fetch-images.sh
      

      sed -i -f fix-images-in-html.sed /mnt/public/www/issues/*.html

    • Download all stylesheets and then fix them.

          mkdir -p /mnt/public/www/issues/css
      

      ./fetch-css.sh

      sed -i -f fix-css-in-html.sed /mnt/public/www/issues/*.html

    • Fix some encoding oddities

          sed -i -f remove-useless.sed /mnt/public/www/issues/*.html
      
    • Remove html components that are not necessary

          ./remove-useless.py
      
    • Fix links that point to defunct domain without-systemd.org.

          sed -i -r -f fix-without-systemd-links.sed /mnt/public/www/issues/*.html
      

The annotated process

    1. Use gitlablib to list all issue web urls, and then remove all the "build", "buildmodify" and similar CI/CD issues.
          . gitlablib.sh
      

      list_all_issues | tee output/issues.all output/issues.all.web_url

I wrote a brief shell script library for interacting with the gitlab rest api. I ended up with more functions than the project needed; really, all I used was the private pagination helper and list_all_issues. Once I figured out how to list all issues (and how to handle the pagination) I didn't need to bother with the per-project or per-repo perspective. The web_url extraction and the CI/CD issue filtering are sketched just after the library listing.

    #!/bin/sh
    # Startdate: 2020-05-29
    # Dependencies:
    #    jq
    #    my private token
    # Library for interacting with Gitlab API
    # For manual work:
    #    curl --header "${authheader}" "https://git.devuan.org/api/v4/projects/devuan%2Fdevuan-project/issues"
    # References:
    #    https://docs.gitlab.com/ee/api/README.html#pagination
    #    handle transforming the / in the path_with_namespace to %2F per https://docs.gitlab.com/ee/api/README.html#namespaced-path-encoding
    #    https://docs.gitlab.com/ee/api/issues.html

    export token="$( cat /mnt/public/work/devuan/git.devuan.org.token.txt )"
    export authheader="Private-Token: ${token}"

    export server=git.devuan.org

    export GLL_TMPDIR="$( mktemp -d )"

    clean_gitlablib() {
       rm -rf "${GLL_TMPDIR:-NOTHINGTODELETE}"/*
    }

    # PRIVATE
    _handle_gitlab_pagination() {
       # call: list_all_projects "${startUri}"
       ___hgp_starturi="${1}"
       test -n "${GLL_DEBUG}" && set -x
       # BEGIN
       rhfile="$( TMPDIR="${GLL_TMPDIR}" mktemp -t "headers.XXXXXXXXXX" )"
       done=0
       size=-1
       uri="${___hgp_starturi}"

       # LOOP
       while test ${done} -eq 0 ;
       do
          response="$( curl -v -L --header "${authheader}" "${uri}" 2>"${rhfile}" )"
          #grep -iE "^< link" "${rhfile}"
          # determine size
          if test "${size}" = "-1" ; then
             # run only if size is still undefined
             tmpsize="$( awk '$2 == "x-total:" {print $3}' "${rhfile}" 2>/dev/null )"
             test -n "${tmpsize}" && size="${tmpsize}"
             echo "Number of items: ${size}" 1>&2
          fi

          tmpnextpage="$( awk '$2 == "x-next-page:" {print $3}' "${rhfile}" 2>/dev/null )"
          # if x-next-page is blank, that means we are on the last page. Also, we could try x-total-pages compared to x-page.
          test -z "${tmpnextpage}" && done=1
          # so if we have a next page, get that link
          nextUri="$( awk '{$1="";$2="";print}' "${rhfile}" | tr ',' '\n' | awk -F';' '/rel="next"/{print $1}' | sed -r -e 's/^\s*<//;' -e 's/>\s*$//;' )"
          if test -n "${nextUri}" ; then
             uri="${nextUri}"
          else
             echo "No next page provided! Error." 1>&2
             done=1
          fi

          # show contents
          echo "${response}"
       done

       # cleanup
       rm "${rhfile}"
       set +x
    }

    list_all_projects() {
       _handle_gitlab_pagination "https://${server}/api/v4/projects"
    }

    list_all_issues() {
       _handle_gitlab_pagination "https://${server}/api/v4/issues?scope=all&status=all"
    }

    list_issues_for_project() {
       ___lifp_project="${1}"
       ___lifp_htmlencode_bool="${2}"
       istruthy "${___lifp_htmlencode_bool}" && ___lifp_project="$( echo "${___lifp_project}" | sed -r -e 's/\//%2F/g;' )"
       _handle_gitlab_pagination "https://${server}/api/v4/projects/${___lifp_project}/issues"
    }
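
The tee command above only captures the raw JSON pages; the actual web_url extraction and the removal of the CI/CD noise issues are not shown. A minimal sketch with jq, assuming the throwaway issues can be recognized by titles starting with "build" or "buildmodify", might look like this:

      # Pull the web_url of every remaining issue out of the saved JSON pages,
      # skipping issues whose titles look like CI/CD noise. The title pattern is
      # an assumption; adjust it to whatever the bots actually file.
      jq -r '.[] | select(.title | test("^build(modify)?") | not) | .web_url' \
         output/issues.all > output/issues.all.web_url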



2. Use fetch-issue-webpages.py to fetch all those webpages.

    ln -s issues.all.web_url output/files-to-fetch.txt
    ./fetch-issue-webpages.py

The script is where the bulk of my learning occurred for this project. Did you know that headless browsers can scroll down a webpage and basically force the AJAX to load? That's the annoying stuff that doesn't load when you do a nice, simple wget.

    #!/usr/bin/env python3
    # Startdate: 2020-05-29 16:22
    # History:
    # Usage:
    #    ln -s issues.all.web_url output/files-to-fetch.txt
    #    ./fetch-issue-webpages.py
    # How to make this work:
    #    apt-get install python3-pyvirtualdisplay
    #    download this geckodriver, place in /usr/local/bin
    # References:
    #    basic guide https://web.archive.org/web/20191031110759/http://scraping.pro/use-headless-firefox-scraping-linux/
    #    https://stackoverflow.com/questions/40302006/no-such-file-or-directory-geckodriver-for-a-python-simple-selenium-applicatio
    #    geckodriver https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
    #    https://www.selenium.dev/selenium/docs/api/py/index.html?highlight=get
    #    page source https://www.selenium.dev/selenium/docs/api/py/webdriver_remote/selenium.webdriver.remote.webdriver.html?highlight=title#selenium.webdriver.remote.webdriver.WebDriver.title
    #    make sure all comments load https://stackoverflow.com/questions/26566799/wait-until-page-is-loaded-with-selenium-webdriver-for-python/44998503#44998503
    #    https://crossbrowsertesting.com/blog/test-automation/automate-login-with-selenium/
    # Improve:
    from pyvirtualdisplay import Display
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    import re, time, getpass

    def ask_password(prompt):
        #return input(prompt+": ")
        return getpass.getpass(prompt+": ")

    def scrollDown(driver, value):
       driver.execute_script("window.scrollBy(0,"+str(value)+")")

    # Scroll down the page
    def scrollDownAllTheWay(driver):
       old_page = driver.page_source
       while True:
          #logging.debug("Scrolling loop")
          for i in range(2):
             scrollDown(driver, 500)
             time.sleep(2)
          new_page = driver.page_source
          if new_page != old_page:
             old_page = new_page
          else:
             break
       return True

    server_string="https://git.devuan.org"
    outdir="/mnt/public/www/issues"

    with open("output/files-to-fetch.txt") as f:
       lines=[line.rstrip() for line in f]

    # ask password now instead of after the delay
    password = ask_password("Enter password for "+server_string)

    display = Display(visible=0, size=(800, 600))
    display.start()

    browser = webdriver.Firefox()

    # log in to gitlab instance
    browser.get(server_string+"/users/sign_in")
    browser.find_element_by_id("user_login").send_keys('bgstack15')
    browser.find_element_by_id("user_password").send_keys(password)
    browser.find_element_by_class_name("qa-sign-in-button").click()
    browser.get(server_string+"/profile") # always needs the authentication
    scrollDownAllTheWay(browser)

    for thisfile in lines:
       destfile=re.sub("\.+",".",re.sub("\/|issues",".",re.sub("^"+re.escape(server_string)+"\/","",thisfile)))+".html"
       print("Saving",thisfile,outdir+"/"+destfile)
       browser.get(thisfile)
       scrollDownAllTheWay(browser)
       with open(outdir+"/"+destfile,"w") as text_file:
          print(browser.page_source.encode('utf-8'),file=text_file)

    # done with loop
    browser.quit()
    display.stop()
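
The "How to make this work" comments at the top boil down to a one-time setup. A sketch of that setup, assuming a Debian/Devuan-style system (selenium and Xvfb are needed alongside the packages named in the comments):

    # install the python pieces and a virtual X server for the headless display
    sudo apt-get install python3-pyvirtualdisplay python3-selenium xvfb
    # fetch the geckodriver release referenced in the script comments and put
    # it on the PATH so selenium can drive Firefox
    wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
    tar -xzf geckodriver-v0.24.0-linux64.tar.gz
    sudo install -m 0755 geckodriver /usr/local/bin/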


3. Munge the downloaded html. All of this is available in flow-part2.sh, which I needed to separate from the fetch-pages task. 
  * Fix newlines

        sed -i -r -e 's/\\n/\n/g;' /mnt/public/www/issues/*.html

Nothing fancy here. I guess my encoding choice for saving the output was a little... wrong. So I'm sure this is a crutch that isn't used by the professionals.
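
To see what the substitution does, here is a throwaway example (the real files just have these literal \n sequences scattered through the saved markup, courtesy of how the page source was written out):

        # build a sample file containing a literal backslash-n, then unescape it
        printf 'first line\\nsecond line\n' > /tmp/sample.html
        sed -r -e 's/\\n/\n/g;' /tmp/sample.html
        # first line
        # second line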

  * find time tags with a data-original-title attribute and replace the tag contents with the value of that data-original-title. Also, this will BeautifulSoup pretty-print the html so some of the following commands work correctly.

        ls -1 /mnt/public/www/issues/*.html > output/files-for-timestamps.txt
        ./fix-timestamps.py

I'm really fond of this one, partially because it's entirely my own solution (reused exactly as written from another project), and because it depends on an amazing little piece of metadata that gitlab provides in the web pages! The timestamps for relevant items are included, so while the rendered html shows "1 week ago," we can convert the text to show the absolute timestamp. The script is as follows:

        #!/usr/bin/env python3
        # Startdate: 2020-05-29 20:40
        # Purpose: convert timestamps on gitlab issue web page into UTC
        # History:
        #    2020-05-30 09:24 add loop through files listed in output/files-for-timestamps.txt
        # Usage:
        #    ls -1 /mnt/public/www/issues/output*.html > output/files-for-timestamps.txt
        #    ./fix-timestamps.py
        # References:
        #    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#pretty-printing
        #    https://gitlab.com/bgstack15/vooblystats/-/blob/master/vooblystats.py
        #    /posts/2020/02/16/python3-convert-relative-date-to-utc-timestamp/
        # Improve:
        #    assumes the pages are rendered in US Eastern time (EDT/EST) or UTC.
        from bs4 import BeautifulSoup
        from datetime import timedelta
        from parsedatetime import Calendar
        from pytz import timezone

        def fix_timestamps(page_text):
           soup = BeautifulSoup(page_text,"html.parser")
           cal = Calendar()
           x = 0
           for i in soup.find_all(name='time'):
              x = x + 1
              j = i.attrs["data-original-title"]
              if 'EDT' == j[-3:] or 'EST' == j[-3:]:
                 tzobject=timezone("US/Eastern")
              else:
                 tzobject=timezone("UTC")
              dto, _ = cal.parseDT(datetimeString=j,tzinfo=tzobject)
              add_hours = int((str(dto)[-6:])[:3])
              j = (timedelta(hours=-add_hours) + dto).strftime('%Y-%m-%dT%H:%MZ')
              # second precision %S is not needed for this use case.
              i.string = j
           return soup

        with open("output/files-for-timestamps.txt") as f:
           lines = [line.rstrip() for line in f]

        for thisfile in lines:
           print("Fixing timestamps in file",thisfile)
           with open(thisfile) as tf:
              output=fix_timestamps(tf.read())
           with open(thisfile,"w",encoding='utf-8') as tf:
              tf.write(str(output.prettify()))
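
A quick spot check after the run confirms the relative phrasing is gone; the file name below is hypothetical, following the naming scheme from the fetch script:

        # show the first few rewritten <time> elements
        grep -A1 -m3 '<time' /mnt/public/www/issues/devuan.devuan-project.1.html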


  * Download all relevant images, and then fix them.

        ./fetch-images.sh
        sed -i -f fix-images-in-html.sed /mnt/public/www/issues/*.html

I wrote this script first, because it was the images that were the most important item for this whole project.

        #!/bin/sh
        # startdate 2020-05-29 20:04
        # After running this, be sure to do the sed.
        #    sed -i -f fix-images-in-html.sed /mnt/public/www/issues/*.html
        # Improve:
        #    It is probably an artifact of the weird way the asset svgs are embedded, but I cannot get them to display at all even though they are downloaded successfully. I have seen this before, the little embedded images you cannot easily download and simply display.

        INDIR=/mnt/public/www/issues
        INGLOB=*.html

        SEDSCRIPT=/mnt/public/work/devuan/fix-images-in-html.sed

        INSERVER=https://git.devuan.org

        cd "${INDIR}"

        # could use this line to get all the assets, but they do not display regardless due to html weirdness
        #orig_src="$( grep -oE '(\<src|xlink:href)="?\/[^"]*"' ${INGLOB} | grep -vE '\.js' | awk -F'"' '!x[$0]++{print $2}' )"
        orig_src="$( grep -oE '\<src="?\/[^"]*"' ${INGLOB} | grep -vE '\.js' | awk -F'"' '!x[$2]++{print $2}' )"

        cat /dev/null > "${SEDSCRIPT}"

        echo "${orig_src}" | while read line ; do
           getpath="${INSERVER}${line}"
           outdir="$( echo "${line}" | awk -F'/' '{print $2}' )"
           test ! -d "${outdir}" && mkdir -p "${outdir}"
           targetfile="${outdir}/$( basename "${line}" )"
           test -n "${DEBUG}" && echo "process ${getpath} and save to ${targetfile}" 1>&2
           test -z "${DRYRUN}" && wget --quiet --content-disposition -O "${targetfile}" "${getpath}"
           # dynamically build a sed script
           echo "s:${line}:${targetfile##/}:g;" | tee -a "${SEDSCRIPT}"
        done
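
The script honors two environment variables: DEBUG echoes each URL and destination, and DRYRUN skips the wget, so the downloads and the generated sed rules can be previewed first. The sample path in the comment below is hypothetical:

        # preview which images would be fetched and which sed rules get built
        DEBUG=1 DRYRUN=1 ./fetch-images.sh
        # each emitted rule rewrites an absolute path to the local copy, e.g.:
        #    s:/uploads/abcd1234/screenshot.png:uploads/screenshot.png:g;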


  * Download all stylesheets and then fix them.

        mkdir -p /mnt/public/www/issues/css
        ./fetch-css.sh
        sed -i -f fix-css-in-html.sed /mnt/public/www/issues/*.html

This is basically a rehash of the previous script.

        #!/bin/sh
        # Startdate: 2020-05-29 20:18

        INDIR=/mnt/public/www/issues
        INGLOB=*.html

        SEDSCRIPT=/mnt/public/work/devuan/fix-css-in-html.sed

        # OUTDIR will be made in INDIR, because of the `cd` below.
        OUTDIR=css

        INSERVER=https://git.devuan.org

        cd "${INDIR}"
        test ! -d "${OUTDIR}" && mkdir -p "${OUTDIR}"

        orig_css="$( sed -n -r -e 's/^.*<link.*(href="[^"]+\.css").*/\1/p' ${INGLOB} | awk -F'"' '!x[$2]++{print $2}' )"

        cat /dev/null > "${SEDSCRIPT}"

        echo "${orig_css}" | while read line ; do
           getpath="${INSERVER}${line}"
           targetfile="${OUTDIR}/$( basename "${line}" )"
           test -n "${DEBUG}" && echo "process ${getpath} and save to ${targetfile}" 1>&2
           test -z "${DRYRUN}" && wget --quiet --content-disposition -O "${targetfile}" "${getpath}"
           # dynamically build a sed script
           echo "s:${line}:${targetfile##/}:g;" | tee -a "${SEDSCRIPT}"
        done

  * Fix some encoding oddities

        sed -i -f remove-useless.sed /mnt/public/www/issues/*.html

This is definitely because of my choice of encoding. In fact, I bet my copy-paste of the script contents is entirely messed up for this blog post. You'll have to check it out in the git repo. Also, this is probably the hackiest part of the whole project.

        $ {s/^'//}
        1 {s/^b'//}
        s/·/·/g # do not ask how I made this one
        s/Ã//g
        s/\\'/'/g
        s/\xc2(\x91|\x82|\x)//g
        s/\\xc2\\xb7/·/g # two characters here
        s/\\xc3\\xab//g
        s/\\xe1\\xb4\\x84\\xe1\\xb4\\xa0\\xe1\\xb4\\x87/CVE/g
        s/\\xe2\\x80\\x99/'/g
        s/\\xe2\\x80\\xa6/.../g
        s/(\\x..)*\\xb7/·/g # two characters here


  * Remove html components that are not necessary

        ./remove-useless.py

Thankfully, I know enough BeautifulSoup to be dangerous. In fact, I went with the scrape-and-delete method because we wanted readable issue contents with minimal work. And yes, this was my best-case scenario for "minimal work." And yes, I know this has way too much duplicated code. It works. Please submit any optimizations as a comment below, or as a PR on the git repo.

        #!/usr/bin/env python3
        # Startdate: 2020-05-30 19:30
        # Purpose: remove key, useless html elements from slurped pages
        from bs4 import BeautifulSoup
        import sys

        def remove_useless(contents):
           soup = BeautifulSoup(contents,"html.parser")
           try:
              sidebar = soup.find(class_="nav-sidebar")
              sidebar.replace_with("")
           except:
              pass
           try:
              navbar = soup.find(class_="navbar-gitlab")
              navbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="issuable-context-form")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="js-issuable-sidebar")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="js-issuable-actions")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="js-noteable-awards")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="disabled-comment")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="notes-form")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="btn-edit")
              rightbar.replace_with("")
           except:
              pass
           try:
              rightbar = soup.find(class_="js-issuable-edit")
              rightbar.replace_with("")
           except:
              pass
           try:
              mylist = soup.find_all(class_="note-actions")
              for i in mylist:
                 i.replace_with("")
           except:
              pass
           try:
              mylist = soup.find_all(class_="emoji-block")
              for i in mylist:
                 i.replace_with("")
           except:
              pass
           return soup

        with open("output/files-for-timestamps.txt") as f:
           lines = [line.rstrip() for line in f]

        for thisfile in lines:
           print("Removing useless html in file",thisfile)
           with open(thisfile) as tf:
              output=remove_useless(tf.read())
           with open(thisfile,"w",encoding='utf-8') as tf:
              tf.write(str(output.prettify()))


  * Fix links that point to defunct domain without-systemd.org.

        sed -i -r -f fix-without-systemd-links.sed /mnt/public/www/issues/*.html

This requirement came in late during the development phase. I called this one "scope creep," but thankfully it was easy enough to automate changing out links to the web.archive.org versions.

        /without-systemd\.org/{
           /archive\.org/!{
              s@(http://without-systemd\.org)@https://web.archive.org/web/20190208013412/\1@g;
           }
        }
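
As a sanity check, the rule can be exercised on a throwaway line (the wiki path here is made up) before running it in-place with -i:

        echo '<a href="http://without-systemd.org/wiki/index.php/Main_Page">wiki</a>' | sed -r -f fix-without-systemd-links.sed
        # <a href="https://web.archive.org/web/20190208013412/http://without-systemd.org/wiki/index.php/Main_Page">wiki</a>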

Conclusions

I learned how to use a headless browser for this project! I had already dabbled with BeautifulSoup and jq, and of course I already know the GNU coreutils. Thankfully, I already had a function on hand for fixing relative timestamps!

References

Weblinks

Obviously my scripts listed here also contain the plain URLs of the references, but here is the list of them again:

  1. API Docs | GitLab #pagination
  2. API Docs | GitLab #namespaced-path-encoding
  3. Issues API | GitLab
  4. basic guide to headless browser Tutorial: How to use Headless Firefox for Scraping in Linux (Web Archive)
  5. linux - No such file or directory: 'geckodriver' for a Python simple Selenium application - Stack Overflow
  6. geckodriver binary https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
  7. Selenium Client Driver — Selenium 3.14 documentation
  8. page source selenium.webdriver.remote.webdriver — Selenium 3.14 documentation
  9. make sure all comments load Wait until page is loaded with Selenium WebDriver for Python - Stack Overflow
  10. Selenium 101: How To Automate Your Login Process | CrossBrowserTesting.com
  11. Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation
  12. relative timestamp to absolute vooblystats.py · master · B Stack / vooblystats · GitLab
  13. Python3: convert relative date to UTC timestamp
