
Re: archive rebuilds wrt Lucas' victory



On Tue, Apr 16, 2013 at 11:29:43AM +0200, Goswin von Brederlow wrote:
> On Mon, Apr 15, 2013 at 12:30:43AM +0200, Adam Borowski wrote:
> > Too bad, I see what seems to be most of the time being spent in dpkg
> > installing dependencies -- how could this be avoided?  One of ideas would
> > be to reformat as btrfs (+eatmydata) and find some kind of a tree of
> > packages with similar build-depends, snapshotting nodes of the tree to
> > quickly reset to a wanted state
> 
> I think snapshotting is a good idea there. Or rather, forking the
> filesystem. Say you have 2 packages:
> 
> Package: A
> Build-Depends: X, Y
> 
> Package: B
> Build-Depends: X, Z
> 
> You would start with the bare build chroot and install X. Then you
> create snapshots SA and SB from that. In SA you install Y and in SB
> you install Z. Now both packages can be built.

So you would include intermediate states as nodes in the graph as well? 
Interesting -- this could indeed optimize cases like that, at the cost of
making the problem a good deal harder algorithmically.

> BUT:
> 
> - Easy with 2 packages. But how do you do that with 30000?

You mean, an algorithmic challenge?  In our kind of crowd?  That's the fun
stuff!

If we reduce the problem by two simplifications:

* can snapshot only before building a package (no intermediate states)
* the cost of purging a package is the same as installing it

a solution is to find a minimum spanning tree, possibly with a
constraint on the tree's height.  And with the graph not being malicious,
I have a hunch the tree would behave nicely, not requiring too many
snapshots (random graphs tend to produce shallow trees).
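
For the simplified version, a rough sketch -- build_deps is a hypothetical
map from source package to the set of packages its build needs, networkx
is only used for the MST, and the height constraint is ignored:

  import networkx as nx

  def snapshot_tree(build_deps):
      """build_deps: {source package: frozenset of packages to install}."""
      g = nx.Graph()
      g.add_node("base")                    # the bare build chroot
      for pkg, deps in build_deps.items():
          g.add_edge("base", pkg, weight=len(deps))
      pkgs = list(build_deps)
      for i, a in enumerate(pkgs):
          for b in pkgs[i + 1:]:
              # simplification 2: purging costs the same as installing,
              # so the edge weight is just the symmetric difference
              g.add_edge(a, b, weight=len(build_deps[a] ^ build_deps[b]))
      # root the result at "base": every tree edge then means "clone the
      # parent's snapshot and apply the install/purge delta"
      return nx.minimum_spanning_tree(g)

Quadratic in the number of packages, so for 30000 of them you'd prune to
pairs that actually share something, but it shows the shape of the
simplified problem.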

The full problem may take a bit more thinking.

> - Y and Z may both depend on W. So initially we should have
>   installed X and W.
> - Package C may Build-Conflicts: X but depend on most of the stuff X
>   depends on. So taking the filesystem with X installed and purging X
>   will be faster than starting from scratch.

I.e., edges that purge should have a lower cost than edges that install.

> - Doing multiple apt/dpkg runs is more expensive than a combined one.
>   A single run will save startup time and triggers.

Again, parameters to the edge cost function.
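
Something like this, perhaps -- a sketch only; the constants are
placeholders I invented, not measurements, and size_of is a hypothetical
map from package to its installed size, standing in for a real
per-package cost (a better estimate is sketched further down):

  RUN_OVERHEAD = 1.0    # fixed cost of an apt/dpkg run: startup, triggers
  PURGE_FACTOR = 0.3    # guess: purging is cheaper than installing

  def edge_cost(installed, wanted, size_of):
      """Cost of turning one set of installed packages into another."""
      to_install = wanted - installed
      to_purge = installed - wanted
      if not to_install and not to_purge:
          return 0.0
      return (RUN_OVERHEAD
              + sum(size_of[p] for p in to_install)
              + PURGE_FACTOR * sum(size_of[p] for p in to_purge))

Note that this makes the edges directed (A -> B and B -> A differ), so the
plain minimum spanning tree above no longer applies directly -- part of
why the full problem needs more thinking.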

> - Could we install packages without running triggers and only trigger
>   them at the end of each chain? Or somewhere in the middle?

Could be worth looking at.  Not sure how many triggers can work
incrementally, and how many rebuild everything every time like man-db.
Of course, this is moot if we snapshot only before package builds.

> - There will be multiple ways to build the tree. We might install U
>   first and then V, or V first and then U. Also we might have to install
>   V in multiple branches and V cannot be installed in a common root.
>   Unless we install V in a common root and then uninstall V again for a
>   subtree. This probably needs a heuristic for how long installing (or
>   uninstalling) a package takes. Package size will be a major factor
>   but postinst scripts can take a long time to run (update-texmf anyone?).

What about something akin to log(size) + size/X?  For smallish packages,
dpkg churn is the dominant factor; for big ones it's the actual I/O and
registering of individual files.
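
In code, a minimal sketch -- X is just a knob, and the value below is a
pure guess, not a measurement.  This would replace the raw size used as
the per-package cost in the sketch above:

  import math

  X = 4 * 1024 * 1024   # bytes of payload per unit of dpkg churn -- a guess

  def install_cost(installed_size):
      # log(size): per-package dpkg overhead, grows only slowly with size
      # size/X:    actual I/O and registering of individual files
      return math.log(max(installed_size, 1)) + installed_size / X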

> - Build chroots, even as snapshots, take space. You can only have so
>   many of them in parallel. A depth-first traversal would be best there.

I guess the tree won't be tall enough for disk space to be a concern.

>   Building packages against locally built packages (instead of the
>   existing official ones) gives a better test. But that would require a
>   more breadth-first ordering. Some compromise between the two would be
>   needed.

Bootstrapping is a whole new can of worms, and I'm afraid I haven't followed
Wookey and co closely enough.

> With multiple cores it is better to run multiple builds in parallel,
> given enough RAM, than to try to build a single package on multiple
> cores. Most packages don't support parallel building (even if they
> could) and some break if you force the issue.

And others spend a good deal of the time on a single core, even if they
allow concurrency in parts.  Watching libreoffice, it spent more time at
25% utilization than at 100%.

"Given enough ram" is another problem: a vast majority of packages need
quite little, yet some can be pretty hungry.  I don't think we have
per-package data about memory needed, do we?  In my case, libreoffice
-j4 OOMed near the end.

The box has only 2GB, non-extensible.  Someone had a bright idea of shipping
a kernel with CONFIG_SWAP unset; the source for all needed out-of-tree
drivers is there, but I need to figure out why that uboot thingy doesn't
work, without being able to view the screen.

> Now if you have multiple builds that are based around the same snapshot
> then common header files and libraries will be cached only once.  So cache
> locality (and therefore efficiency) should increase.

Bad news: both btrfs and lvm (and I'm 99% sure zfs) fail to tell the VM
system that a page shared on disk can also be shared in memory.  This
means snapshots and even cp --reflink won't exploit the similarity.  The
limitation seems quite fundamental on lvm; in btrfs, thanks to rampant
layering violations, it's fixable but hasn't been fixed yet.  Somewhere
among the vserver patches there's IUNLINK -- a file attribute that
duplicates hardlinked files when you try to open them for writing; that's
the only CoW scheme I know of that lets the page cache work.

One could run builds with exactly the same Build-Depends together, but as
they would need to run on the same mount, this doesn't seem to be worth the
hassle:
grep ^Build-Depends […]debian_dists_unstable_main_source_Sources|sort|uniq|wc -l
15564
grep ^Package: […]debian_dists_unstable_main_source_Sources|sort|uniq|wc -l
18555
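
(The grep counts distinct Build-Depends strings, so reordered or re-spaced
lists inflate the number a bit; counting distinct *sets* would be a fairer
measure, though I doubt it changes the conclusion.  A throwaway sketch,
looking at single-line fields only, like the grep:)

  def distinct_builddep_sets(path):
      seen = set()
      for line in open(path):
          if line.startswith("Build-Depends:"):
              field = line.split(":", 1)[1]
              seen.add(frozenset(d.strip() for d in field.split(",")))
      return len(seen)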


-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ

