
Re: Speeding up dpkg, a proposal



On Thu, 03 Mar 2011, Marius Vollmer wrote:
> ext Raphael Hertzog <hertzog@debian.org> writes:
> 
> > On Wed, 02 Mar 2011, Marius Vollmer wrote:
> >> - Instead, we move all packages that are to be unpacked into
> >>   half-installed / reinstreq before touching the first one, and put a
> >>   big sync() right before carefully writing /var/lib/dpkg/status.
> >
> > The big sync() doesn't work. It means dpkg never finishes its work on
> > systems with lots of unrelated I/O.
> 
> Ok, understood.  It's now clear to me that the big sync should be
> replaced with deferred fsyncs.  (I would defer the fsync of the content
> of all packages until modstatdb_checkpoint, not just until
> tar_deferred_extract.)

This assumes you don't use --force-unsafe-io. With that option, the
package contents are not synced at all.
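
To make the deferred-fsync idea concrete, here is a minimal Python sketch.
It is not dpkg code (dpkg is written in C); the function name and the
unsafe_io parameter are made up for illustration. It writes all files
first, fsyncs them in one later pass, and skips the sync entirely when
unsafe_io is set, which is roughly what --force-unsafe-io means for
package contents:

```python
import os
import tempfile


def unpack_files(directory, files, unsafe_io=False):
    """Write every file first, then fsync them in a single later pass.

    Hypothetical helper for illustration only, not part of dpkg.
    Returns the number of fsync() calls issued.
    """
    written = []
    for name, data in files.items():
        path = os.path.join(directory, name)
        with open(path, "wb") as fh:
            fh.write(data)          # no fsync here: it is deferred
        written.append(path)

    if unsafe_io:
        return 0                    # analogous to --force-unsafe-io: no sync

    synced = 0
    for path in written:
        fd = os.open(path, os.O_WRONLY)
        try:
            os.fsync(fd)            # deferred fsync, after all the writes
            synced += 1
        finally:
            os.close(fd)
    return synced


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(unpack_files(tmp, {"a": b"1", "b": b"2"}))
        print(unpack_files(tmp, {"a": b"1"}, unsafe_io=True))
```

The point of the split is that the kernel gets a chance to write the data
back on its own schedule before the fsync pass asks for it.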

> With that change, do you think the approach is sound?

It looks like it could work in principle. But it might have unexpected
complications in case of interruptions. You said it yourself:
"it leaves its database behind in a correct but quite outdated and not so
friendly state"

The "reinstreq" flag is usually present on a single package only, and we know
that this single package is (likely) broken. So we reinstall it and we can go ahead.

Now with your scheme, we have many packages in that state and we don't know which
ones are really broken. At a minimum, the one that was being processed at the time
of the interruption (as in power loss) is.

Are we sure there are no cases where this brokenness leads to failures in the
preinst of some of the other packages to be reinstalled? How is the package
manager supposed to order the reinstallations?

> To understand our troubles, you need to know that we have around 2500
> packages with just a single file in it.  For those packages, dpkg spends
> the largest part of its time in writing the nine journal entries to
> /var/lib/dpkg/updates.

Nine? I haven't reviewed the code but that's quite a lot indeed. Maybe there's
room for optimization here.

A quick review indeed reveals this sequence (for an upgrade):
- half_installed + reinstreq
- unpacked + reinstreq
- half_installed + reinstreq
- unpacked + reinstreq
- unpacked
- unpacked (again at start of configure, don't know why)
- half_configured
- installed
- the final installation in the status file

Indeed, your scenario is very particular. Usually you have many files and thus
the fsync() of all the files is what takes the most time (compared to the 9
fsync() for the status information), and in that case --force-unsafe-io shows
a sizable improvement.
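
A rough count shows why the two situations differ. The Python sketch below
assumes one fsync() per unpacked file and one per status write, which is a
simplification; the figure of 200 files per package is a made-up example,
while 2500 single-file packages and nine writes come from this thread:

```python
# Status writes for one upgrade, as listed above.
STATUS_WRITES = [
    "half_installed + reinstreq",
    "unpacked + reinstreq",
    "half_installed + reinstreq",
    "unpacked + reinstreq",
    "unpacked",
    "unpacked (start of configure)",
    "half_configured",
    "installed",
    "final write to the status file",
]


def fsync_counts(packages, files_per_package, status_writes=len(STATUS_WRITES)):
    """Return (content fsyncs, status fsyncs) under the one-fsync-each assumption."""
    return packages * files_per_package, packages * status_writes


if __name__ == "__main__":
    # 2500 single-file packages: status writes dominate.
    print(fsync_counts(2500, 1))
    # Hypothetical 200 files per package: file contents dominate.
    print(fsync_counts(2500, 200))
```

With one file per package the status writes outnumber the content syncs
nine to one, so skipping the content syncs cannot help much there.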

> We will reduce the number of our packages, so this issue might solve
> itself that way, but I had good success in reducing the per-package
> overhead of dpkg, and if it is correct and works for us, why not use the
> 'reckless' option as well?

I don't think we're interested in adding more options that make it even more
difficult to understand what dpkg does. Either there's a better way of doing it
and we use it all the time, or we keep it like it is.

For instance, I wonder if we could not get rid of two modstatdb_note calls in the
above list:
- the first "unpacked + reinstreq" could be directly brought back to
"half_installed + reinstreq" with minimal consequences (the only difference
comes when one of the conflictor/to-be-deconfigured packages fails to be
deconfigured).
- the other one at the start of the configure process
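
As a quick check of what that would leave, here is a small Python sketch.
It only models the list of writes from this mail, not the dpkg code, and
the choice of which two entries to drop is my reading of the two points
above (an assumption):

```python
# The nine status writes for an upgrade, in order, as listed earlier.
UPGRADE_WRITES = [
    "half_installed + reinstreq",
    "unpacked + reinstreq",            # index 1: first candidate for removal
    "half_installed + reinstreq",
    "unpacked + reinstreq",
    "unpacked",
    "unpacked (start of configure)",   # index 5: second candidate for removal
    "half_configured",
    "installed",
    "final write to the status file",
]

# Assumed positions of the two writes that might be dropped.
CANDIDATES = {1, 5}


def reduced(writes, drop):
    """Return the sequence of writes without the entries at the given indexes."""
    return [w for i, w in enumerate(writes) if i not in drop]


if __name__ == "__main__":
    remaining = reduced(UPGRADE_WRITES, CANDIDATES)
    print(len(UPGRADE_WRITES), "->", len(remaining))
    for entry in remaining:
        print("-", entry)
```

That takes the count from nine to seven. It also leaves two consecutive
"half_installed + reinstreq" entries; whether those could be merged is not
something I have checked.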

Cheers,
-- 
Raphaël Hertzog ◈ Debian Developer

Follow my Debian News ▶ http://RaphaelHertzog.com (English)
                      ▶ http://RaphaelHertzog.fr (Français)

