
Bug#908678: Some more thoughts and some tests on the security-tracker git repo



On 2018-11-09 16:05:06, Antoine Beaupré wrote:
>  2. do a crazy filter-branch to send commits to the right
>     files. considering how long an initial clone takes, i can't even
>     begin to imagine how long *that* would take. but it would be the
>     most accurate simulation.
>
> Short of that, I think it's somewhat dishonest to compare a clean
> repository with split files against a repository with history over 14
> years and thousands of commits. Intuitively, I think you're right and
> that "sharding" the data in yearly packets would help a lot git's
> performance. But we won't know until we simulate it, and if hit that
> problem again 5 years from now, all that work will have been for
> nothing. (Although it *would* give us 5 years...)

So I've done that craaaazy filter-branch, on a shallow clone (1000
commits). The original clone is about 30MB, but the split repo is only
4MB.
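
To reproduce, a 1000-commit test repository like this can be made with
a shallow clone, something like:

git clone --depth 1000 file://$PWD/security-tracker security-tracker-1000.orig

(Note that --depth needs the file:// URL; git ignores it for plain
local path clones.)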

Cloning the original repo takes a solid 30+ seconds:

[1221]anarcat@curie:src130$ time git clone file://$PWD/security-tracker-1000.orig security-tracker-1000.orig-test
Cloning into 'security-tracker-1000.orig-test'...
remote: Enumerating objects: 5291, done.
remote: Counting objects: 100% (5291/5291), done.
remote: Compressing objects: 100% (1264/1264), done.
remote: Total 5291 (delta 3157), reused 5291 (delta 3157)
Receiving objects: 100% (5291/5291), 8.80 MiB | 19.47 MiB/s, done.
Resolving deltas: 100% (3157/3157), done.
64.35user 0.44system 0:34.32elapsed 188%CPU (0avgtext+0avgdata 200056maxresident)k
0inputs+58968outputs (0major+48449minor)pagefaults 0swaps

Cloning the split repo takes less than a second:

[1223]anarcat@curie:src$ time git clone file://$PWD/security-tracker-1000-filtered security-tracker-1000-filtered-test
Cloning into 'security-tracker-1000-filtered-test'...
remote: Enumerating objects: 2214, done.
remote: Counting objects: 100% (2214/2214), done.
remote: Compressing objects: 100% (1190/1190), done.
remote: Total 2214 (delta 936), reused 2214 (delta 936)
Receiving objects: 100% (2214/2214), 1.25 MiB | 22.78 MiB/s, done.
Resolving deltas: 100% (936/936), done.
0.25user 0.04system 0:00.38elapsed 79%CPU (0avgtext+0avgdata 8200maxresident)k
0inputs+8664outputs (0major+3678minor)pagefaults 0swaps

So this is clearly a win, and I think it would be possible to rewrite
the history using the filter-branch command. Commit IDs would change,
but we would keep all commits, so annotate and all that good stuff
would still work.
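
For example, after such a rewrite, per-line history on one of the split
files should still be available with something like this (assuming the
list.YYYY naming used by the attached scripts):

git annotate data/CVE/list.2018
git log -- data/CVE/list.2018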

The split-by-year bash script was too slow for my purposes: it was
taking a solid 15 seconds per commit, which meant it would have taken
9 *days* to process the entire repository.

So I tried to see if this could be optimized, so that we could split
the file while keeping history without having to shut down the whole
system for days. I first rewrote it in Python, which processed the 1000
commits in 801 seconds. That gives an estimate of 15 hours for the
68278 commits I had locally. Concerned about the Python startup time, I
then tried golang, which processed the tree in 262 seconds, giving a
final estimate of 4.8 hours.

Attached are both implementations, for those who want to reproduce my
results. Note that they differ from the original implementation in that
they (naturally) have to remove the data/CVE/list file itself,
otherwise it would be kept in history.

Here's how to call it (shown here with the Python version):

git -c commit.gpgSign=false filter-branch --tree-filter '/home/anarcat/src/security-tracker/bin/split-by-year.py data/CVE/list' HEAD
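
If we run this for real, the same filter would presumably have to be
applied to all branches and tags ("-- --all" instead of "HEAD"), and
the refs/original/ backup refs and old objects cleaned up afterwards,
something along these lines:

git for-each-ref --format='%(refname)' refs/original/ | xargs -rn1 git update-ref -d
git reflog expire --expire=now --all
git gc --prune=now --aggressive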

Also observe how all GPG commit signatures are (obviously) lost. I have
explicitly disabled signing here because those signatures actually take
a long time to compute...

I haven't tested whether a graft would improve performance, but I
suspect it would not, given the sheer size of the repository that would
effectively need to be carried over anyway.
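
(If someone wants to test that, one way would presumably be to graft a
cut-off point and make it permanent with filter-branch, something like:

git replace --graft $(git rev-parse HEAD~1000)
git filter-branch -- --all

I haven't tried that on this repository.)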

A.

-- 
Man really attains the state of complete humanity when he produces,
without being forced by physical need to sell himself as a commodity.
                        - Ernesto "Che" Guevara
package main

// Split data/CVE/list into per-year files (data/CVE/list.YYYY). This is
// meant to be run from a `git filter-branch --tree-filter` on each commit.

import (
	"bufio"
	"bytes"
	"io"
	"log"
	"os"
	"strconv"
	"strings"
)

func main() {
	file, err := os.Open("data/CVE/list")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	var (
		line     []byte
		cve      []byte
		year     uint64
		year_str string
		target   *os.File
		ok       bool
		header   bool
	)
	// one open file descriptor per year, reused across entries
	fds := make(map[uint64]*os.File, 20)
	scanner := bufio.NewReader(file)
	for {
		line, err = scanner.ReadBytes('\n')

		if bytes.HasPrefix(line, []byte("CVE-")) {
			// header line: remember it and the year it belongs to
			cve = line
			year_str = strings.Split(string(line), "-")[1]
			year, _ = strconv.ParseUint(year_str, 10, 0)
			header = true
		} else {
			// continuation line: append it to the right per-year file,
			// opening that file on first use
			target, ok = fds[year]
			if !ok {
				target, err = os.OpenFile("data/CVE/list."+year_str, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
				if err != nil {
					log.Fatal(err)
				}
				fds[year] = target
			}
			if header {
				// write the CVE header before its continuation lines
				target.Write(cve)
				header = false
			}
			target.Write(line)
		}
		if err != nil {
			break
		}
	}
	if err != io.EOF {
		log.Fatal(err)
	}
	for _, fd := range fds {
		fd.Close()
	}
	// remove the original file so it does not survive in the rewritten tree
	os.Remove("data/CVE/list")
}
#!/usr/bin/python3

# Split data/CVE/list into per-year files (data/CVE/list.YYYY). This is
# meant to be run from a `git filter-branch --tree-filter` on each commit.

import os

data = 'data/CVE/list'

# one open file descriptor per year, reused across entries
fds = {}
cve = None

with open(data) as source:
    for line in source:
        if line.startswith('CVE-'):
            # header line: remember it and the year it belongs to
            cve = line
            year = int(line.split('-')[1])
        else:
            # continuation line: append it to the right per-year file,
            # opening that file on first use
            yearly = 'data/CVE/list.{:d}'.format(year)
            target = fds.get(year, None)
            if target is None:
                fds[year] = target = open(yearly, 'a')
            if cve:
                # write the CVE header before its continuation lines
                target.write(cve)
                cve = None
            target.write(line)

for year, fd in fds.items():
    fd.close()
# remove the original file so it does not survive in the rewritten tree
os.unlink(data)
