Re: GIT for pdiff generation

To: Henrique de Moraes Holschuh <hmh@debian.org>
Cc: d-d <debian-devel@lists.debian.org>
Subject: Re: GIT for pdiff generation
From: Joerg Jaspert <joerg@debian.org>
Date: Sun, 27 Mar 2011 19:07:37 +0200
Message-id: <87lj00a3vq.fsf@gkar.ganneff.de>
Mail-followup-to: Henrique de Moraes Holschuh <hmh@debian.org>, d-d <debian-devel@lists.debian.org>
In-reply-to: <20110327132911.GA31834@khazad-dum.debian.net> (Henrique de Moraes Holschuh's message of "Sun, 27 Mar 2011 10:29:11 -0300")
References: <874o6prlfn.fsf@delenn.ganneff.de> <20110327132911.GA31834@khazad-dum.debian.net>

>>   As we are no git gurus ourselves: Does anyone out there see any trouble
>>   doing it this way? It means storing something around 7GB of
>>   uncompressed text files in git, plus the daily changes happening to
>>   them, then diffing them in the way described above, however the
>>   archive will only need to go back for a couple of weeks and therefore
>>   we should be able to apply git gc --prune (coupled with whatever way
>>   to actually tell git that everything before $DATE can be removed) to
>>   keep the size down.
> AFAIK, there can be trouble.  It all depends on how you're structuring
> the data in git, and the size of the largest data object you will want
> to commit to the repository.

Right now the source Contents of unstable is 220MB unpacked. (Packed
with gzip it's 28MB, while the binary Contents files are each 18MB
packed.)

Let's add a safety margin: 350MB is a good guess for the largest object.
A Packages file barely counts compared to them; unpacked it's just
some 34MB.

> There is an alternative: git can rewrite the entire history
> (invalidating all commit IDs from the start pointing up to all the
> branch heads in the process).  You can use that facility to drop old
> commits.  Given the intended use, where you don't seem to need the
> commit ids to be constant across runs and you will rewrite the history
> of the entire repo at once and drop everything that was not rewritten,
> this is likely the less ugly way of doing what you want.  Refer to git
> filter-branch.

It's the one and only case I have ever seen where "history rewrite" is
actually something one wants to do.
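
A rough sketch of how such a rewrite could be scripted, in Python purely
for illustration. Everything in it is an assumption rather than a tested
setup: a single branch (hypothetically named master), a git recent enough
to offer "git replace --graft", and made-up helper names (find_cutoff,
truncation_plan, run_plan). The idea is to turn the newest commit older
than the cutoff into a parentless root, let git filter-branch make that
permanent, and then drop the leftover refs so that git gc can actually
delete the old objects.

```python
import subprocess


def find_cutoff(repo, before="3 weeks ago", branch="master"):
    """Return the id of the newest commit on branch committed before the date."""
    out = subprocess.run(
        ["git", "rev-list", "-1", "--before=" + before, branch],
        cwd=repo, check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def truncation_plan(cutoff, branch="master"):
    """Build the git commands that make cutoff the new root of branch."""
    return [
        # Pretend cutoff has no parents (this creates a replacement ref).
        ["git", "replace", "--graft", cutoff],
        # Rewrite the branch so the parentless commit becomes real history.
        ["git", "filter-branch", "--", branch],
        # filter-branch keeps a backup of the old tip; remove it.
        ["git", "update-ref", "-d", "refs/original/refs/heads/" + branch],
        # The replacement ref is no longer needed after the rewrite.
        ["git", "replace", "-d", cutoff],
        # Drop reflog entries that still reference the old commits.
        ["git", "reflog", "expire", "--expire=now", "--all"],
        # Only now are the old objects unreachable and prunable.
        ["git", "gc", "--prune=now"],
    ]


def run_plan(plan, repo):
    """Run each command in order inside repo, stopping at the first failure."""
    for cmd in plan:
        subprocess.run(cmd, cwd=repo, check=True)
```

If find_cutoff returns an empty string, nothing is old enough to drop and
the plan should not be run at all. All commit ids from the cutoff onwards
change, which is fine here since nothing depends on them staying constant.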

> Other than that, git loads entire objects to memory to manipulate them,
> which AFAIK CAN cause problems in datasets with very large files (the
> problem is not usually the size of the repository, but rather the size
> of the largest object).  You probably want to test your use case with
> several worst-case files AND a large safety margin to ensure it won't
> break on us anytime soon, using something to track git memory usage.

Well, yes.
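
Something along these lines could serve as a first test harness, again in
Python and again only a sketch. The helper names and the use of random hex
lines as stand-in data are assumptions; real Contents files compress far
better, so this is deliberately pessimistic. It relies on
getrusage(RUSAGE_CHILDREN), whose ru_maxrss is the high-water mark over
all child processes waited for so far (kilobytes on Linux), not a
per-command figure.

```python
import os
import resource
import subprocess


def write_test_file(path, size_mb):
    """Write roughly size_mb MiB of poorly compressible text lines to path.

    Returns the number of bytes written.
    """
    target = size_mb * 1024 * 1024
    written = 0
    with open(path, "w", newline="\n") as fh:
        while written < target:
            # 64 random hex characters plus a newline per line.
            line = os.urandom(32).hex() + "\n"
            fh.write(line)
            written += len(line)
    return written


def peak_child_rss(cmd, cwd=None):
    """Run cmd to completion and return the children's ru_maxrss so far.

    The value only ever grows: it is the largest resident set size of any
    child process this script has waited for, in kilobytes on Linux.
    """
    subprocess.run(cmd, cwd=cwd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
```

In a scratch repository one would call write_test_file() with 350 for the
size, then peak_child_rss(["git", "add", "."], cwd=repo), followed by the
same for git commit and git gc, and watch how far the reported figure
climbs after each step.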

-- 
bye, Joerg
Some NM:
> FTBFS=Fails to Build from Start
Err, yes? How do you start in the middle?
