Overview
GitLab is the current source of truth for all repos, except for the secure data
stored locally at /mnt/public/packages. To prepare my on-prem git
repos, I need to be able to synchronize my repositories from the
sources of truth. This process sets up the bare repos on the main git and web
server, and sets up the sync-location git repos on any VM.
Preparing list assets
We need the csv lists that map each repository's old (GitLab) location to its
new on-prem location.
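Each finished row carries three comma-separated fields: the repo path, the
GitLab origin URL, and the on-prem destination URL. For example, one row from
the finished repos.csv looks like this:
ansible01,https://gitlab.com/bgstack15/ansible01,https://www.example.com/git/ansible01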
Generating gitlab token
Visit gitlab.com in a web browser, sign in, and open "Edit profile" from the
user icon menu. Open "Access Tokens" in the left-side menu, and create a
personal access token that is read-only. Save the token, which will resemble
the string MnrEnTVfA-7kujMarjsG, to file
/mnt/public/work/gitlab/gitlab.com.personal_access_token.
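To confirm the token works before running the scripts, a quick manual call
against the same API endpoint that generate-list.sh uses (a sketch, assuming
the file holds just the bare token string) should return JSON rather than an
authentication error:
TOKEN="$( cat /mnt/public/work/gitlab/gitlab.com.personal_access_token )"
curl --silent --header "PRIVATE-TOKEN: ${TOKEN}" "https://gitlab.com/api/v4/users/bgstack15/projects?per_page=1" | head -c 200 ; echo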
Generating list of all relevant projects
Use generate-list.sh
to pull the list of all my personal projects from
GitLab. The list is not complete: the "Your projects" page on gitlab.com shows
70 projects in total, but the API returns only 60. So after running that
script, manually add any additional entries that should be synced.
./generate-list.sh > list.csv
# manually add any additional repos to list.csv.
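A quick count makes it easy to see how many entries still need to be added by
hand (a minimal check; subtract one if the header row is present):
wc -l < list.csv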
To add the destination links, run add-dest.sh
with redirection in and out.
< list.csv ./add-dest.sh > repos.csv
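As a sanity check, every data row in repos.csv should now have three
comma-separated fields; this quick filter (a sketch) prints any rows that do
not:
awk -F',' 'NR>1 && NF!=3' repos.csv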
Preparing main git destinations
With all the lists created, now prepare the final destinations of the
repositories.
Preparing blank repositories on git server
You need to make the blank repositories ahead of time due to how git-over-http
works: it only accepts pushes to projects that already exist. For this task,
run make-blank.sh.
time < repos.csv sh -x ./make-blank.sh
On main web server, fix the permissions of these new git repos.
sudo chgrp apache -R /mnt/public/www/git ; sudo chmod g+rwX -R /mnt/public/www/git ;
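To spot-check that the blank repositories exist where expected (a sketch,
assuming the default GIT_TOP_DIR from make-blank.sh), list a few of the bare
repos' HEAD files:
find /mnt/public/www/git -maxdepth 3 -name HEAD | head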
Setting up sync locations and synchronizing
Now that the destinations are prepared, use a temporary (or at least
alternate) location to pull and then push the repositories.
Initialize sync-location git repos
VM d2-03a is the main sync-location. This is the system that performs the bulk
of the work: pulling all git repos and their contents down, and pushing them
up to the server.
time OUTDIR=~/dev/sync-git INFILE=/mnt/public/Support/Programs/cgit/populate/repos.csv /mnt/public/Support/Programs/cgit/populate/populate-git-remotes.sh
The above command will need APPLY=1 as a variable when ready for real
execution.
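For the real run, the same command with APPLY=1 set looks like this:
time APPLY=1 OUTDIR=~/dev/sync-git INFILE=/mnt/public/Support/Programs/cgit/populate/repos.csv /mnt/public/Support/Programs/cgit/populate/populate-git-remotes.sh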
Perform synchronization of all git repos
This is the main operation of this whole process. It could take some time to
execute.
INDIR=~/dev/sync-git /mnt/public/Support/Programs/cgit/populate/sync-all.sh
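A single repository can also be synchronized by hand with sync.sh, which
operates on the current directory (see its listing in Appendix A); for
example, from a checkout of ansible01 under the sync directory:
cd ~/dev/sync-git/ansible01 && /mnt/public/Support/Programs/cgit/populate/sync.sh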
Build initial permissions list
To restrict push access to all the new repos, run this command, and save its
output inside /etc/git_access.conf
.
find /var/www/git -mindepth 1 -maxdepth 1 -type d -printf '%f\n' | awk '{print "Use Project "$0" \"user bgstack15\" \"all granted\""}' | sort
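Each top-level repository directory becomes one line of output; for example,
the ansible01 repo produces:
Use Project ansible01 "user bgstack15" "all granted"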
And of course, run an httpd -t and then reload the web server.
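On a systemd host that would look roughly like this (a sketch; adjust the
service name to match the local httpd setup):
sudo httpd -t && sudo systemctl reload httpd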
Appendix A: File listings
generate-list.sh
#!/bin/sh
# Startdate: 2021-04-16 20:16
# Goal: generate csv to stdout of directory,origin,dest
# STEP 1
# Notes:
# the gitlab api doesn't show the "contributed" projects in the API at all, so the output from this is incomplete. The output file will need to be curated with additional entries for projects not from my userspace.
# Dependencies:
# Gitlab token at /mnt/public/work/gitlab/gitlab.com.personal_access_token
# jq
# Reference:
# https://stackoverflow.com/questions/57242240/jq-object-cannot-be-csv-formatted-only-array
# https://stackoverflow.com/questions/32960857/how-to-convert-arbitrary-simple-json-to-csv-using-jq/32965227#32965227
test -z "${TOKEN_FILE}" && TOKEN_FILE="/mnt/public/work/gitlab/gitlab.com.personal_access_token"
test ! -r "${TOKEN_FILE}" && { echo "Fatal! Cannot find token file ${TOKEN_FILE}. Aborted." 1>&2 ; exit 1 ; }
TOKEN="$( cat "${TOKEN_FILE}" )"
echo "${TOKEN}" | grep -qE "token:" && { TOKEN="$( echo "${TOKEN}" | awk '/^token:/{print $NF}' )" ; }
GUSER=bgstack15
# Functions
handle_pagination() {
# call: handle_pagination "https://gitlab.com/api/v4/users/${GUSER}/projects"
# return: stdout: a single json list of of all returned objects from all pages
# GLOBALS USED: TOKEN
___hp_url="${1}"
___hp_next_link="dummy value"
___hp_json=""
___hp_MAX=30 # safety valve
x=0
___hp_thisurl="${___hp_url}"
while test -n "${___hp_next_link}" && test ${x} -lt ${___hp_MAX} ;
do
x=$((x+1))
raw="$( curl --include --header "PRIVATE-TOKEN: ${TOKEN}" "${___hp_thisurl}" )"
set +x ; links="$( echo "${raw}" | awk '/^link:/' )" ; set -x
___hp_next_link="$( echo "${links}" | tr ',' '\n' | sed -n -r -e '/rel="next"/{s/.*<//;s/>;.*$//;p;}' )" # extract the URL between < and > on the rel="next" entry; will be blank if there is no next link, that is, if this is the last page.
___hp_thisurl="${___hp_next_link}"
set +x ; ___hp_json="${___hp_json}$( echo "${raw}" | awk '/^\[/' )" ; set -x
done
# combine all json lists into one
# ref: https://stackoverflow.com/a/34477713/3569534
echo "${___hp_json}" | jq --compact-output --null-input 'reduce inputs as $in (null; . + $in)'
}
# MAIN
json="$( handle_pagination "https://gitlab.com/api/v4/users/${GUSER}/projects" )"
echo "${json}" | jq '.[] | [{ web_url, path }]' | jq -r '(map(keys) | add | unique) as $cols | map(. as $row|$cols|map($row[.])) as $rows | $cols, $rows[] | @csv' | awk 'NR == 1 {print} NR >1 && !/"web_url"/{print}'
list.csv
Just a snippet:
"ansible01","https://gitlab.com/bgstack15/ansible01"
"ansible-ssh-tunnel-for-proxy","https://gitlab.com/bgstack15/ansible-ssh-tunnel-for-proxy"
el7-gnupg2-debmirror/gnupg2,https://gitlab.com/el7-gnupg2-debmirror/gnupg2
el7-gnupg2-debmirror/libksba,https://gitlab.com/el7-gnupg2-debmirror/libksba
el7-gnupg2-debmirror/libassuan,https://gitlab.com/el7-gnupg2-debmirror/libassuan
add-dest.sh
#!/bin/sh
# Startdate: 2021-04-17
# STEP 2
# Goal: fix column names, and also parse to add dest column.
test -z "${GIT_URL_BASE}" && GIT_URL_BASE="https://www.example.com/git"
{
# fix column name, and then add the dest link
sed -r -e '1s/web_url/origin/;' | tr -d '"' | \
awk -v "topurl=${GIT_URL_BASE}" -F',' 'BEGIN{OFS=","} NR==1 {print $0",dest"} NR>1 {$NF=$NF","topurl"/"$1;print}'
}
repos.csv
The final list asset. Just a snippet:
ansible01,https://gitlab.com/bgstack15/ansible01,https://www.example.com/git/ansible01
ansible-ssh-tunnel-for-proxy,https://gitlab.com/bgstack15/ansible-ssh-tunnel-for-proxy,https://www.example.com/git/ansible-ssh-tunnel-for-proxy
el7-gnupg2-debmirror/gnupg2,https://gitlab.com/el7-gnupg2-debmirror/gnupg2,https://www.example.com/git/el7-gnupg2-debmirror/gnupg2
el7-gnupg2-debmirror/libksba,https://gitlab.com/el7-gnupg2-debmirror/libksba,https://www.example.com/git/el7-gnupg2-debmirror/libksba
el7-gnupg2-debmirror/libassuan,https://gitlab.com/el7-gnupg2-debmirror/libassuan,https://www.example.com/git/el7-gnupg2-debmirror/libassuan
make-blank.sh
#!/bin/sh
# Startdate: 2021-04-17 16:13
# Goal: given the repos.csv output from STEP 2 add-dest.sh script, make the blank git repos for each of those on the final destination server
# STEP 3
test -z "${GIT_URL_BASE}" && GIT_URL_BASE="https://www.example.com/git"
test -z "${GIT_TOP_DIR}" && GIT_TOP_DIR="/mnt/public/www/git"
cd "${GIT_TOP_DIR}"
# this awk will read stdin, and skip the first line which is the headers for the columns
for word in $( awk -F',' -v "topurl=${GIT_URL_BASE%%/}/" 'NR>1 {gsub(topurl,"",$3);print $3}' ) ;
do
# if OVERWRITE and the dir already exists, then delete it
test -d "${word}" && test -n "${OVERWRITE}" && rm -r "${word}"
# If inside a namespace, then perform a few extra steps.
echo "${word}" | grep -qE "\/" && {
# make any subdirs between here and there
mkdir -p "${word}"
# DISABLED; can just use section-from-path=1 in main cgitrc
## if in a subdir, add a cgitrc file for this repo that indicates its section.
# section="$( echo "${word}" | awk -F'/' 'BEGIN{OFS="/"} {$NF="";print}' )"
# if ! grep -qE "section=.+" "${word}/cgitrc" 2>/dev/null ;
# then
# echo "section=${section%%/}" >> "${word}/cgitrc"
# fi
}
# actually make the blank git repo
git init --bare "${word}" &
done
wait
sync-all.sh
#!/bin/sh
# STEP 5 repeating
# Startdate: 2020-05-21
# Goal: download every single git repository in full from bitbucket cloud for migration to bitbucket on-prem.
# History:
# 2021-04-17 forked from gituser.tgz to Support/Programs/cgit/populate project
# Usage:
# INDIR=~/dev/sync-git
# References:
# git-sync-all.ps1
# Dependencies:
SYNCSCRIPT=/mnt/public/Support/Programs/cgit/populate/sync.sh
if test -z "${INDIR}" || ! test -r "${INDIR}" ;
then
echo "Fatal! Invalid INDIR ${INDIR} which is either absent or unreadable. Aborted" 1>&2
exit 1
fi
cd "${INDIR}"
x=0
for dir in $( find . -maxdepth 2 -mindepth 1 -type d -name '.git' -printf '%h\n' ) ;
do
x=$(( x + 1 ))
lecho "Starting repo $x ${dir}" # lecho is a local logging helper; substitute echo if it is unavailable
cd "${INDIR}/${dir}"
"${SYNCSCRIPT}"
done
sync.sh
#!/bin/sh
# STEP 5 or manual
# Startdate: 2020-05-21
# Goal: sync just this one directory git repo.
# History:
# 2021-04-17 forked from gituser.tgz to Support/Programs/cgit/populate project
# References:
# How to actually pull all branches from the remote https://stackoverflow.com/questions/67699/how-to-clone-all-remote-branches-in-git/16563327#16563327
# Dependencies:
# $PWD is the git repo in question.
git pull --all
{
git branch -a | sed -n "/\/HEAD /d; /remotes\/origin/p;" | xargs -L1 git checkout -t
} 2>&1 | grep -vE 'fatal:.* already exists'
git push dest --all