
Speed improvements



Hello,

I would like to begin some discussion about dpkg.

I have written some patches that make it run much faster (on my
computer, the new dpkg -s is over 700 times faster; that is not a
mistake :)).  However, the changes make the dpkg database
backwards-incompatible with the current version, so I'll try to justify
my decisions below.  Please tell me what you think about it.  Oh, and
forgive my awful English. :)

I have the feeling that dpkg wasn't designed to hold as many packages as
there are in Debian today -- it keeps everything in large, non-indexed
text files.  It also lacks some useful features, e.g. the ability to
make sophisticated queries against the package database.  Even simple
queries take a very long time (tens of seconds on slower machines).

In this case a complete rewrite would probably be the best solution.
But, as I found in the Debian archives, people have wanted to write
dpkgv2 since 1999, and there have been no results :)  So I decided to
find solutions to the things that annoy me most:

- There should be no such thing as /var/lib/dpkg/available -- I think
  that dpkg shouldn't know anything about packages that are not
  installed.  Higher-level package management tools, such as apt,
  already have that information, and that is enough.  Moreover, parsing
  the available file needlessly takes a lot of time in most dpkg
  operations.  Dselect could also use the apt database instead of the
  available file.

- Parsing /var/lib/dpkg/status also takes a lot of time, so there
  should be a better way of storing that information.  As putting it
  into a binary database might be controversial, I thought that
  splitting the file would be the best solution -- every installed
  package should have its own *.status file in the /var/lib/dpkg/info
  directory.  This makes recreating the original status file (for
  backward compatibility) very simple: just

# cat /var/lib/dpkg/info/*.status >/var/lib/dpkg/status

- As dpkg -S is very slow (I know there is dlocate, but it is only a
  workaround, not a real solution), there should be a binary database
  that records which package every file belongs to.  It would be
  generated from the *.list files, so the primary information would
  still live in good old text files.

- There should be the ability to make more complicated queries.  I like
  grep-dctrl very much.  I also think that many features could be
  borrowed from rpm...

- As /var/lib/dpkg/info contains a lot of files (and if we add status
  files there it will contain even more), maybe there should be the
  _possibility_ (but not the _need_) to use one large indexed file to
  hold its contents.  When I archived my info dir using ar, the output
  file took only half of the space occupied by the directory.  Maybe
  there should also be optional on-the-fly (de)compression of these
  files.  There are servers where filesystem space is limited (e.g.
  systems on flash), so that would give Debian more flexibility.
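
The file-ownership database from the third point can be sketched roughly
as follows.  This is only an illustration, not the patch: the proposal
does not name a database format, so SQLite is an assumption here, and
the names build_index, owners and demo are hypothetical.  The index is
derived entirely from the *.list files, so it can be thrown away and
rebuilt at any time.

```python
import sqlite3
import tempfile
from pathlib import Path


def build_index(info_dir, db_path):
    """Rebuild the path -> package index from every <package>.list file."""
    conn = sqlite3.connect(db_path)
    with conn:  # commit on success, roll back on error
        conn.execute("DROP TABLE IF EXISTS files")
        conn.execute(
            "CREATE TABLE files (path TEXT NOT NULL, package TEXT NOT NULL)"
        )
        for list_file in sorted(Path(info_dir).glob("*.list")):
            package = list_file.stem  # "foo.list" -> "foo"
            rows = [
                (line, package)
                for line in list_file.read_text().splitlines()
                if line
            ]
            conn.executemany("INSERT INTO files VALUES (?, ?)", rows)
        # The index on path is what replaces scanning every text file.
        conn.execute("CREATE INDEX files_by_path ON files (path)")
    return conn


def owners(conn, path):
    """Return the sorted names of all packages that list this exact path."""
    cur = conn.execute(
        "SELECT package FROM files WHERE path = ? ORDER BY package", (path,)
    )
    return [row[0] for row in cur]


def demo():
    """Index two made-up packages and look up two paths."""
    with tempfile.TemporaryDirectory() as tmp:
        info = Path(tmp)
        (info / "foo.list").write_text("/.\n/usr\n/usr/bin\n/usr/bin/foo\n")
        (info / "bar.list").write_text("/.\n/usr\n/usr/bin\n/usr/bin/bar\n")
        conn = build_index(info, ":memory:")
        try:
            return owners(conn, "/usr/bin/foo"), owners(conn, "/usr/bin")
        finally:
            conn.close()
```

Note that this sketch only does exact path lookups, whereas dpkg -S
matches a search pattern, so a real implementation would need more than
a plain index on the full path.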

So here is what I have done so far.  In short, I have implemented the
features described in the first two points of the list above.  The
current version of the patch may be obtained here:

  http://nh.pl/~michau/proj/dpkg/

It modifies dpkg and dselect so that they do the following:

- dselect reads /var/lib/apt/lists/*Packages instead of
  /var/lib/dpkg/available
- dpkg doesn't read or write /var/lib/dpkg/available anymore,
- dpkg reads and writes /var/lib/dpkg/info/(package name).status files
  instead of /var/lib/dpkg/status.
    - Records for purged, not-installed packages are considered
      uninformative and aren't saved.
    - A new field, maybedirty, has been added to the pkginfo structure
      -- a record is considered for dumping to its status file only if
      this field is set to 1 (so when we install a package, only one
      status file is written, not all of them).
    - The query code has been modified so that it no longer reads the
      whole package database if it doesn't have to.
    - As there is no available database, dpkg -p runs apt-cache show,
      which should provide the same information.
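
To make the per-package status files and the maybedirty flag more
concrete, here is a rough Python model of the idea.  It is not the
actual C code from the patch: PkgInfo, parse_record, format_record,
flush and demo are hypothetical names, and the check for a Status value
ending in "not-installed" is only an assumed reading of the rule about
purged packages.  Each record is written with a trailing blank line so
that concatenating the files gives a status file with the usual
blank-line separators.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PkgInfo:
    name: str
    fields: dict              # control fields, e.g. {"Status": "..."}
    maybedirty: bool = False  # set when the record may need rewriting


def parse_record(text):
    """Parse one control-file style record into a dict of fields."""
    fields, last = {}, None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and last is not None:
            fields[last] += "\n" + line  # continuation of previous field
        elif ":" in line:
            last, _, value = line.partition(":")
            fields[last] = value.strip()
    return fields


def format_record(fields):
    """Serialize a record, ending with the blank separator line."""
    return "".join(f"{key}: {value}\n" for key, value in fields.items()) + "\n"


def flush(info_dir, packages):
    """Rewrite status files only for records flagged maybedirty."""
    touched = []
    for pkg in packages:
        if not pkg.maybedirty:
            continue  # unchanged record: leave its file alone
        path = Path(info_dir) / f"{pkg.name}.status"
        if pkg.fields.get("Status", "").endswith("not-installed"):
            path.unlink(missing_ok=True)  # such records are not saved
        else:
            tmp = path.with_name(path.name + ".new")
            tmp.write_text(format_record(pkg.fields))
            tmp.replace(path)  # rename over the old file in one step
        pkg.maybedirty = False
        touched.append(pkg.name)
    return touched


def demo():
    """Install two made-up packages, change one, then purge the other."""
    with tempfile.TemporaryDirectory() as tmp:
        installed = "install ok installed"
        foo = PkgInfo("foo", {"Package": "foo", "Status": installed}, True)
        bar = PkgInfo("bar", {"Package": "bar", "Status": installed}, True)
        first = flush(tmp, [foo, bar])
        foo.fields["Status"] = "install ok half-configured"
        foo.maybedirty = True
        second = flush(tmp, [foo, bar])  # only foo.status is rewritten
        status = parse_record((Path(tmp) / "foo.status").read_text())["Status"]
        combined = "".join(
            p.read_text() for p in sorted(Path(tmp).glob("*.status"))
        )
        bar.fields["Status"] = "purge ok not-installed"
        bar.maybedirty = True
        flush(tmp, [foo, bar])
        bar_exists = (Path(tmp) / "bar.status").exists()
        return first, second, status, combined.count("Package:"), bar_exists
```

In demo(), the second flush touches only foo, which is the whole point
of the flag: installing or reconfiguring one package rewrites one small
file instead of the entire database.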

As a result dpkg is much faster -- simple execution time comparisons
(available at http://nh.pl/~michau/proj/dpkg/time.txt) show that the
patched dpkg may be about 6 times faster than the original (when
installing and removing packages).  For some queries, such as dpkg -s
and dpkg -L, the patched version of dpkg is over 700 times faster.

What do you think about it?  Is this the way dpkg should go?  Yes, my
code needs several improvements, and I want to implement the rest of the
features I mentioned above, but I'd like to hear your opinions first...

Cheers

-- 
michau@
 Oh no I've set too much / I haven't set enough
 I thought that I straced you sleeping / I thought that I straced you run
 I think I thought I saw core dumped [ R.A.M., "Loosing my revision" ]


