
GIT for pdiff generation (was: Meeting Minutes, FTPMaster meeting March 2011)



On Sun, 27 Mar 2011, Joerg Jaspert wrote:
> - As there have been intermittent problems with the current tool which
>   generates the pdiff files (on occasion causing us to have to restart
>   the whole diff series), we looked into improving the situation. We
>   finally came up with the idea to store the affected files (Packages,
>   Sources, Contents) uncompressed in a local git repository and use git's
>   ability to output the needed ed scripts (which pdiffs are). The basic
>   idea would be to save the git commit id relating to the mirror pushes we
>   did and then use that, combined with a call to "git difftool --extcmd
>   'diff --ed' --no-prompt" to output the ed scripts. This ought to be
>   more stable, and even better, we can replay whole series of pdiffs in
>   case there is some bug in them again.
> 
>   As we are no git gurus ourselves: Does anyone out there see any trouble
>   doing it this way? It means storing something around 7GB of
>   uncompressed text files in git, plus the daily changes happening to
>   them, then diffing them in the way described above, however the
>   archive will only need to go back for a couple of weeks and therefore
>   we should be able to apply git gc --prune (coupled with whatever way
>   to actually tell git that everything before $DATE can be removed) to
>   keep the size down.
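
For reference, I read the generation step described above as something
along these lines (a sketch only; OLD and NEW stand in for the commit
ids saved at two consecutive mirror pushes, and the path is just an
example):

  # produce the ed-script pdiff for one file between two recorded
  # mirror pushes; the ed script comes out on stdout
  git difftool --no-prompt --extcmd 'diff --ed' \
      OLD NEW -- dists/sid/main/binary-i386/Packages > Packages.diff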

AFAIK, there can be trouble.  It all depends on how you're structuring
the data in git, and on the size of the largest data object you will
want to commit to the repository.

If you want to shed old data from a git repository [easily], the
commits you want to drop must not be parents of commits you want to
keep.  There is a way around that, but it is too ugly and cumbersome
(store the time series as separate branches, not as a parent-child
commit chain).
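
If it helps, one way to read that workaround (branch names made up,
nothing here is tested) would be roughly:

  # one unconnected (orphan) branch per mirror push, so an old
  # snapshot can later be dropped just by deleting its branch;
  # the working tree already holds the freshly updated files
  git checkout --orphan push-20110327
  git add Packages Sources Contents
  git commit -m 'mirror push 2011-03-27'

  # later: drop an old snapshot and reclaim the space
  git branch -D push-20110201
  git reflog expire --expire=now --all
  git gc --prune=now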

There is an alternative: git can rewrite the entire history
(invalidating, in the process, all commit ids from the rewrite point
up to all the branch heads).  You can use that facility to drop old
commits.  Given the intended use, where you do not seem to need the
commit ids to be constant across runs, and where you would rewrite the
history of the entire repo at once and drop everything that was not
rewritten, this is likely the least ugly way of doing what you want.
Refer to git filter-branch.
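
One recipe for that (untested, and all commit ids change) would look
roughly like the following, where CUTOFF stands in for the oldest
commit you still want to keep:

  # make CUTOFF look like a root commit via a graft, rewrite history
  # so the graft becomes permanent, then expire everything older
  git rev-parse CUTOFF > .git/info/grafts
  git filter-branch -- --all
  rm .git/info/grafts
  rm -rf .git/refs/original
  git reflog expire --expire=now --all
  git gc --prune=now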

Other than that, git loads entire objects into memory to manipulate
them, which AFAIK CAN cause problems in datasets with very large files
(the problem is not usually the size of the repository, but rather the
size of the largest object).  You probably want to test your use case
with several worst-case files AND a large safety margin, using
something to track git's memory usage, to ensure it won't break on us
anytime soon.
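
Something as simple as GNU time should do for the tracking part; a
rough sketch (the file name is only an example):

  # watch git's peak memory while it handles a worst-case file;
  # "Maximum resident set size" in the -v output is the figure to watch
  /usr/bin/time -v git add Packages
  /usr/bin/time -v git commit -m 'worst-case test'
  /usr/bin/time -v git gc --aggressive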

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

