
Bug#908678: Testing the filter-branch scripts



> The Python job finished successfully here after 10 hours.
It took 6h40 here, since I ported your improved logic to the python2 version :).

# git filter-branch --tree-filter '/usr/bin/python2 /split-by-year.pyc' HEAD
Rewrite 1169d256b27eb7244273671582cc08ba88002819 (68356/68357) (24226 seconds passed, remaining 0 predicted)
Ref 'refs/heads/master' was rewritten

The tree-filter blows up the .git/objects store to 13G, though.
That is nothing a git gc can't fix.

> 
> I did some tests on the new git repository. Cloning the repository from
> scratch takes around 2 minutes (the original repo: 21 minutes).
Confirmed.

> So that's about it. I have not done a thorough job at checking the
> actual *integrity* of the results. It's difficult, considering CVE
> identifiers are not sequential in the data/CVE/list file, so a naive
> diff like this will fail:
> 
> $ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} ) data/CVE/list | diffstat
>  list |106562 +++++++++++++++++++++++++++++++++----------------------------------
>  1 file changed, 53281 insertions(+), 53281 deletions(-)
> 
> But at least the numbers add up: it looks like no line is lost. And
> indeed, it looks like all CVEs add up:
> 
> $ diff -u <(cat ../security-tracker-full-test-filtered-bis/data/CVE/list.{2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999} | grep ^CVE | sort -n ) <( grep ^CVE data/CVE/list | sort -n  ) | diffstat
>  0 files changed
> 
> A cursory look at the diff seems to indicate it is clean, however.

I uploaded "my" version to https://people.debian.org/~dlange/
so people can poke at the log and diffs and see whether there are any
issues left.
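
For anyone who wants to repeat the "no line is lost" check without
depending on line order, here is a minimal sketch. It compares the
lines of the original file and of the split files as multisets, so
reordering is ignored but a dropped or duplicated line shows up. The
function names are made up for illustration; nothing like this exists
in the repository as far as I know.

```python
from collections import Counter


def line_counts(paths):
    """Count how often each line occurs across the given files."""
    counts = Counter()
    for path in paths:
        # surrogateescape keeps odd bytes intact instead of failing on them
        with open(path, encoding="utf-8", errors="surrogateescape") as handle:
            counts.update(line.rstrip("\n") for line in handle)
    return counts


def compare(original, split_files):
    """Return (missing, extra) as Counters.

    missing: lines present in the original but absent from the split files.
    extra:   lines present in the split files but absent from the original.
    Both empty means the split neither lost nor added any line.
    """
    before = line_counts([original])
    after = line_counts(split_files)
    return before - after, after - before
```

Calling compare("data/CVE/list", [...the list.YYYY files...]) and
getting two empty Counters back is the order-independent equivalent of
the diffstat numbers adding up. It does not prove that each line ended
up in the right year file.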

> I looked at splitting that file per CVE. That did not scale and just
> created new problems. But splitting by *year* seems like a very
> efficient switch, and I think it would be worth pursuing that idea
> forward.
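
To make the idea concrete, here is a rough sketch of a by-year split.
This is not the split-by-year script used in the run above. It assumes
that every entry starts with a "CVE-YYYY-" header at the beginning of a
line and that all following lines up to the next header belong to that
entry; split_by_year is a hypothetical name.

```python
import re
from collections import defaultdict

# Assumption: entry headers look like "CVE-2018-1234 ..." at line start.
HEADER = re.compile(r"^CVE-(\d{4})-")


def split_by_year(lines):
    """Group lines into per-year buckets, keeping entries intact.

    A header line switches the current year; every other line is
    appended to the bucket of the most recent header.
    """
    buckets = defaultdict(list)
    year = None
    for line in lines:
        match = HEADER.match(line)
        if match:
            year = match.group(1)
        if year is None:
            raise ValueError("content found before the first CVE header")
        buckets[year].append(line)
    return dict(buckets)
```

Each bucket would then be written to data/CVE/list.YYYY. Within a
bucket the original relative order of the entries is preserved.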

The tools in bin/ would need a going-over, i.e. throw away the
unused ones and teach the ones that operate on data/CVE/* about
the split files.
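
As a sketch of what "learning about the split files" could mean for a
tool that currently reads data/CVE/list in one go: a small helper that
finds the per-year files and yields their lines, newest year first (the
same order as in the diff commands quoted above). The helper names are
hypothetical and not taken from bin/; the list.YYYY naming is the one
used in this thread.

```python
from pathlib import Path


def split_list_paths(cve_dir):
    """Return the list.YYYY files in cve_dir, newest year first."""
    paths = [
        path
        for path in Path(cve_dir).glob("list.*")
        if path.suffix[1:].isdigit()  # skips e.g. list.orig or list.2019.bak
    ]
    return sorted(paths, key=lambda path: int(path.suffix[1:]), reverse=True)


def iter_cve_lines(cve_dir):
    """Yield every line of every split file as if it were one list."""
    for path in split_list_paths(cve_dir):
        with open(path, encoding="utf-8", errors="surrogateescape") as handle:
            yield from handle
```

A tool that only iterates over lines could switch from opening
data/CVE/list to iter_cve_lines("data/CVE") with few other changes;
tools that write back would need to know which year file to touch.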

