Story
I run an OBS repository for all my packages; it is available at the main
site: https://build.opensuse.org/project/show/home:bgstack15. But I wanted
to mirror it for myself, so I don't have to configure all my systems to
point outward for updates. I already host a Devuan ceres mirror for
myself, so mirroring this Open Build Service repository is the
last step to being entirely self-hosted for all systems except the mirror
server itself. I first dabbled with debmirror, but it kept trying to use rsync
despite my best configuration efforts, and it insists on the dists/
directory layout, which the OBS deb repo design does not use. So I researched
scraping down a whole site and found httrack,
which exists to serve a local copy of an Internet site. Bingo! After a few
hours of work, here is my solution for mirroring an OBS deb repo locally.
Solution
Create a user who will own the files and execute the httrack command, because
httrack does not want to be run as root. As a bonus, this dedicated user
cannot munge any other data.
useradd obsmirror
Configure a script (available at gitlab):
#!/bin/sh
# File: /etc/installed/obsmirror.sh
# License: CC-BY-SA 4.0
# Author: bgstack15
# Startdate: 2020-01-05 18:01
# Title: Script that scrapes down OBS site to serve a copy to intranet
# Purpose: save down my OBS site so I can serve it locally
# History:
# Usage:
# in a cron job: /etc/cron.d/mirror.cron
# 50 12 * * * root /etc/installed/obsmirror.sh 1>/dev/null 2>&1
# Reference:
# https://unix.stackexchange.com/questions/114044/how-to-make-wget-download-recursive-combining-accept-with-exclude-directorie?rq=1
# man 1 httrack
# https://software.opensuse.org//download.html?project=home%3Abgstack15&package=freefilesync
# Improve:
# use some text file as a list of recently-synced URLs, and if today's URL matches a recent one, then run the httrack with the --update flag. Probably keep a running list forever.
# Documentation:
# Download the release key and trust it.
# curl -s http://repo.example.com/mirror/obs/Release.key | apt-key add -
# Use a sources.list.d/ file with contents:
# deb https://repo.example.com/mirror/obs/ /
# Dependencies:
# binaries: curl httrack grep head tr sed awk chmod chown find rm ln
# user: obsmirror
logfile="/var/log/obsmirror/obsmirror.$( date "+%FT%H%M%S" ).log"
{
test "${DEBUG:-NONE}" = "FULL" && set -x
inurl="http://download.opensuse.org/repositories/home:/bgstack15/Debian_Unstable"
workdir=/tmp/obs-stage
outdir=/var/www/mirror/obs
thisuser=obsmirror
echo "logfile=${logfile}"
mkdir -p "${workdir}" ; chmod "0711" "${workdir}" ; chown "${thisuser}:$( id -gn obsmirror )" "${workdir}"
cd "${workdir}"
# get page contents
step1="$( curl -s -L "${inurl}/all" )"
# get first listed package
step2="$( echo "${step1}" | grep -oE 'href="[a-zA-Z0-9_.+\-]+\.deb"' | head -n1 | grep -oE '".*"' | tr -d '"' )"
# get full url to a package
step3="$( curl -s -I "${inurl}/all/${step2}" | awk '/Location:/ {print $2}' )"
# get directory of the mirror to save down
step4="$( echo "${step3}" | sed -r -e "s/all\/${step2}//;" -e 's/\s*$//;' )"
# get domain of full url
domainname="$( echo "${step3}" | grep -oE '(ht|f)tps?:\/\/[^\/]+\/' | cut -d'/' -f3 )"
echo "TARGET URL: ${step4}"
test -z "${DRYRUN}" && {
# clean workdir of specific domain name in use right now.
echo su "${thisuser}" -c "rm -rf \"${workdir:-SOMETHING}/${domainname:-SOMETHING}\""
su "${thisuser}" -c "rm -rf \"${workdir:-SOMETHING}/${domainname:-SOMETHING}\"*"
# have to skip the orig.tar.gz files because they are large and slow down the sync process significantly.
echo su "${thisuser}" -c "httrack \"${step4}\" -*.orig.t* -v --mirror --update -s0 -r3 -%e0 \"${workdir}\""
time su "${thisuser}" -c "httrack \"${step4}\" -*.orig.t* -v --mirror --update -s0 -r3 -%e0 \"${workdir}\""
}
# -s0 ignore robots.txt
# -r3 only go down 3 links
# -%e0 follow 0 links to external sites
# find most recent directory of that level
levelcount="$(( $( printf "%s" "${inurl}" | tr -dc '/' | wc -c ) - 1 ))"
subdir="$( find "${workdir}" -mindepth "${levelcount}" -maxdepth "${levelcount}" -type d -name 'Debian_Unstable' -printf '%T@ %p\n' | sort -rn -k1 | head -n1 | awk '{print $2}' )"
# if the work directory actually synced
if test -n "${subdir}" ;
then
printf "%s " "DIRECTORY SIZE:"
du -sxBM "${subdir:-.}"
mkdir -p "$( dirname "${outdir}" )"
# get current target of symlink
current_target="$( find "${outdir}" -maxdepth 0 -type l -printf '%l\n' )"
# if the current link is pointing to a different directory than this subdir
if test "${current_target}" != "${subdir}" ;
then
# then replace it with a link to this one
test -L "${outdir}" && unlink "${outdir}"
echo ln -sf "${subdir}" "${outdir}"
ln -sf "${subdir}" "${outdir}"
fi
else
echo "ERROR: No subdir found, so cannot update the symlink."
fi
# disable the index.html with all the httrack comments and original site links
find "${workdir}" -iname '*index.html' -exec rm {} +
} 2>&1 | tee -a "${logfile}"
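The publish step near the end of the script swaps a symlink rather than copying files, so the web root flips to the newly synced tree in one step. That swap can be exercised in isolation; the temp-directory paths below are throwaway stand-ins, not the script's real locations:

```shell
#!/bin/sh
# Sketch of the script's publish-by-symlink step, using throwaway
# directories under a temp dir instead of /tmp/obs-stage and
# /var/www/mirror/obs.
stage="$( mktemp -d )"
mkdir -p "${stage}/sync-old" "${stage}/sync-new"
outdir="${stage}/obs"            # stands in for /var/www/mirror/obs
ln -s "${stage}/sync-old" "${outdir}"

subdir="${stage}/sync-new"       # pretend this is the freshly synced tree
# Read where the link currently points (same find invocation as the script).
current_target="$( find "${outdir}" -maxdepth 0 -type l -printf '%l\n' )"
if test "${current_target}" != "${subdir}" ; then
    # Re-point the link only when the target actually changed.
    test -L "${outdir}" && unlink "${outdir}"
    ln -sf "${subdir}" "${outdir}"
fi
readlink "${outdir}"             # now points at the sync-new directory
```

Because only the link changes, clients never see a half-copied repository; the old tree stays intact until the next cleanup pass.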
And place this in cron, e.g. in /etc/cron.d/mirror.cron:
50 12 * * * root /etc/installed/obsmirror.sh 1>/dev/null 2>&1
Explanation of script
The logic is a little convoluted, because the OBS front page redirects
downloads to the various mirrors where the files are actually kept. So I needed
to learn what the actual mirror site is, and then pull down that whole site. I
couldn't just use httrack --getfiles, because that dumps everything into one
flat directory, which breaks the package paths recorded in the Packages
file. But I didn't want to serve the mirror's whole deep directory structure
either, just the repository itself, so I make a symlink to the synced
repository directory from my actual web content location.
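The mirror-discovery chain at the top of the script can be illustrated with a canned HTTP response in place of the live curl -I call; the mirror hostname and .deb filename below are invented for illustration:

```shell
#!/bin/sh
# Sketch of the script's step3/step4 logic. A real run does:
#   curl -s -I "${inurl}/all/${step2}"
# and reads the Location header of the 302 redirect; here we fake the
# response. The hostname and package name are made up.
step2="freefilesync_10.19-1_amd64.deb"
response="HTTP/1.1 302 Found
Location: http://mirror.example.net/pub/obs/home:/bgstack15/Debian_Unstable/all/${step2}
Content-Length: 0"

# step3: pull the redirect target out of the headers.
step3="$( printf '%s\n' "${response}" | awk '/Location:/ {print $2}' )"
# step4: strip "all/<package>" to recover the repository root on the mirror.
step4="$( printf '%s\n' "${step3}" | sed -e "s|all/${step2}\$||" )"
# domainname: the mirror host, which the script uses to clean its workdir.
domainname="$( printf '%s\n' "${step3}" | cut -d'/' -f3 )"
echo "TARGET URL: ${step4}"
echo "MIRROR HOST: ${domainname}"
```

step4 is then the URL handed to httrack, and domainname matches the top-level directory httrack creates inside the work directory.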