Story
I run an OBS repository for all my packages; it is available at the main
site: https://build.opensuse.org/project/show/home:bgstack15. But I wanted
to mirror it for myself, so I don't have to configure all my systems to
point outward for updates. I already host a Devuan ceres mirror for
myself, so mirroring this Open Build Service repository is the
last step to being entirely self-hosted for all systems except the mirror
server itself. I first dabbled with debmirror, but it kept trying to use rsync
despite my best configuration efforts, and it insists on the dists/
directory layout, which the OBS deb repo design does not use. So I researched
scraping down a whole site and found httrack,
which exists to serve a local copy of an Internet site. Bingo! After a few
hours of work, here is my solution for mirroring an OBS deb repo locally.
Solution
Create a user who will own the files and execute the httrack command, because
httrack does not want to be run as root. As a bonus, this dedicated user
cannot munge any other data.
useradd obsmirror
Configure a script (available at gitlab):
#!/bin/sh
# File: /etc/installed/obsmirror.sh
# License: CC-BY-SA 4.0
# Author: bgstack15
# Startdate: 2020-01-05 18:01
# Title: Script that scrapes down OBS site to serve a copy to intranet
# Purpose: save down my OBS site so I can serve it locally
# History:
# Usage:
# in a cron job: /etc/cron.d/mirror.cron
# 50 12 * * * root /etc/installed/obsmirror.sh 1>/dev/null 2>&1
# Reference:
# https://unix.stackexchange.com/questions/114044/how-to-make-wget-download-recursive-combining-accept-with-exclude-directorie?rq=1
# man 1 httrack
# https://software.opensuse.org//download.html?project=home%3Abgstack15&package=freefilesync
# Improve:
# use some text file as a list of recently-synced URLs, and if today's URL matches a recent one, then run the httrack with the --update flag. Probably keep a running list forever.
# Documentation:
# Download the release key and trust it.
# curl -s http://repo.example.com/mirror/obs/Release.key | apt-key add -
# Use a sources.list.d/ file with contents:
# deb https://repo.example.com/mirror/obs/ /
# Dependencies:
# binaries: curl httrack grep head tr sed awk chmod chown find rm ln
# user: obsmirror
logfile="/var/log/obsmirror/obsmirror.$( date "+%FT%H%M%S" ).log"
{
test "${DEBUG:-NONE}" = "FULL" && set -x
inurl="http://download.opensuse.org/repositories/home:/bgstack15/Debian_Unstable"
workdir=/tmp/obs-stage
outdir=/var/www/mirror/obs
thisuser=obsmirror
echo "logfile=${logfile}"
mkdir -p "${workdir}" ; chmod "0711" "${workdir}" ; chown "${thisuser}:$( id -gn obsmirror )" "${workdir}"
cd "${workdir}"
# get page contents
step1="$( curl -s -L "${inurl}/all" )"
# get first listed package
step2="$( echo "${step1}" | grep -oE 'href="[a-zA-Z0-9_.+\-]+\.deb"' | head -n1 | grep -oE '".*"' | tr -d '"' )"
# get full url to a package
step3="$( curl -s -I "${inurl}/all/${step2}" | awk '/Location:/ {print $2}' )"
# get directory of the mirror to save down
step4="$( echo "${step3}" | sed -r -e "s/all\/${step2}//;" -e 's/\s*$//;' )"
# get domain of full url
domainname="$( echo "${step3}" | grep -oE '(ht|f)tps?:\/\/[^\/]+\/' | cut -d'/' -f3 )"
echo "TARGET URL: ${step4}"
test -z "${DRYRUN}" && {
# clean workdir of specific domain name in use right now.
echo su "${thisuser}" -c "rm -rf \"${workdir:-SOMETHING}/${domainname:-SOMETHING}\""
su "${thisuser}" -c "rm -rf \"${workdir:-SOMETHING}/${domainname:-SOMETHING}\"*"
# have to skip the orig.tar.gz files because they are large and slow down the sync process significantly.
echo su "${thisuser}" -c "httrack \"${step4}\" -*.orig.t* -v --mirror --update -s0 -r3 -%e0 \"${workdir}\""
time su "${thisuser}" -c "httrack \"${step4}\" -*.orig.t* -v --mirror --update -s0 -r3 -%e0 \"${workdir}\""
}
# -s0 ignore robots.txt
# -r3 only go down 3 links
# -%e0 follow 0 links to external sites
# find most recent directory of that level
levelcount="$(( $( printf "%s" "${inurl}" | tr -dc '/' | wc -c ) - 1 ))"
subdir="$( find "${workdir}" -mindepth "${levelcount}" -maxdepth "${levelcount}" -type d -name 'Debian_Unstable' -printf '%T@ %p\n' | sort -rn -k1 | head -n1 | awk '{print $2}' )"
# if the work directory actually synced
if test -n "${subdir}" ;
then
printf "%s " "DIRECTORY SIZE:"
du -sxBM "${subdir:-.}"
mkdir -p "$( dirname "${outdir}" )"
# get current target of symlink
current_target="$( find "${outdir}" -maxdepth 0 -type l -printf '%l\n' )"
# if the current link is pointing to a different directory than this subdir
if test "${current_target}" != "${subdir}" ;
then
# then replace it with a link to this one
test -L "${outdir}" && unlink "${outdir}"
echo ln -sf "${subdir}" "${outdir}"
ln -sf "${subdir}" "${outdir}"
fi
else
echo "ERROR: No subdir found, so cannot update the symlink."
fi
# disable the index.html with all the httrack comments and original site links
find "${workdir}" -iname '*index.html' -exec rm {} +
} 2>&1 | tee -a "${logfile}"
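The publish step near the end of the script swaps a symlink rather than copying files, so the web root flips to the newly synced tree in one step. That swap can be exercised in isolation; the temp-directory paths below are throwaway stand-ins, not the script's real locations:

```shell
#!/bin/sh
# Sketch of the script's publish-by-symlink step, using throwaway
# directories under a temp dir instead of /tmp/obs-stage and
# /var/www/mirror/obs.
stage="$( mktemp -d )"
mkdir -p "${stage}/sync-old" "${stage}/sync-new"
outdir="${stage}/obs"            # stands in for /var/www/mirror/obs
ln -s "${stage}/sync-old" "${outdir}"

subdir="${stage}/sync-new"       # pretend this is the freshly synced tree
# Read where the link currently points (same find invocation as the script).
current_target="$( find "${outdir}" -maxdepth 0 -type l -printf '%l\n' )"
if test "${current_target}" != "${subdir}" ; then
    # Re-point the link only when the target actually changed.
    test -L "${outdir}" && unlink "${outdir}"
    ln -sf "${subdir}" "${outdir}"
fi
readlink "${outdir}"             # now points at the sync-new directory
```

Because only the link changes, clients never see a half-copied repository; the old tree stays intact until the next cleanup pass.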
And place this in cron, e.g. in /etc/cron.d/mirror.cron:
50 12 * * * root /etc/installed/obsmirror.sh 1>/dev/null 2>&1
Explanation of script
The logic is a little convoluted, because the OBS front page redirects
downloads to the various mirrors where the files are actually kept. So I needed
to learn what the actual mirror site is, and then pull down that whole site. I
couldn't just use httrack --getfiles, because that dumps everything into one
flat directory, which breaks the package paths recorded in the Packages
file. But I didn't want to serve the mirror's whole deep directory structure
either, just the repository itself, so I make a symlink to the synced
repository directory from my actual web content location.
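The mirror-discovery chain at the top of the script can be illustrated with a canned HTTP response in place of the live curl -I call; the mirror hostname and .deb filename below are invented for illustration:

```shell
#!/bin/sh
# Sketch of the script's step3/step4 logic. A real run does:
#   curl -s -I "${inurl}/all/${step2}"
# and reads the Location header of the 302 redirect; here we fake the
# response. The hostname and package name are made up.
step2="freefilesync_10.19-1_amd64.deb"
response="HTTP/1.1 302 Found
Location: http://mirror.example.net/pub/obs/home:/bgstack15/Debian_Unstable/all/${step2}
Content-Length: 0"

# step3: pull the redirect target out of the headers.
step3="$( printf '%s\n' "${response}" | awk '/Location:/ {print $2}' )"
# step4: strip "all/<package>" to recover the repository root on the mirror.
step4="$( printf '%s\n' "${step3}" | sed -e "s|all/${step2}\$||" )"
# domainname: the mirror host, which the script uses to clean its workdir.
domainname="$( printf '%s\n' "${step3}" | cut -d'/' -f3 )"
echo "TARGET URL: ${step4}"
echo "MIRROR HOST: ${domainname}"
```

step4 is then the URL handed to httrack, and domainname matches the top-level directory httrack creates inside the work directory.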