
Idea for system installation/maintenance



Hello debian developers,


I'd like to present you a new idea for installation/maintenance of complete
installations. Please take a bit of time, read this mail, and if you've got
any comments, questions or suchlike, don't hesitate to reply.
(This is also a plea for help, in case you might wonder.)


Now, you might ask this question:
  Why a new installation system? Doesn't dpkg work?

Well, the short introduction above might be too short, in that it allows for
a much broader interpretation than desired. No, dpkg is not dead - it will
still be needed. This is just an additional idea for helping administrators.


Here are some points where I think my proposal should apply:
- You have several independent groups, each maintaining some part of a
  machine's installation, and these groups should nonetheless get their
  data merged into a single installation?
- You'd like to test software updates on a test machine and, if successful,
  do a rollout to a few hundred nearly identical servers, which should
  nonetheless run *without* a central authority (1)?
- You changed a configuration file and now some strange things happen? Hmmm,
  what was the change again? Was it this setting? Or that?
- You're doing backups of user data only? Then you've got to re-install the
  machine if the hard disk fails, which can take some time. Oh, you're backing
  up your full installation? How many old versions do you keep? Enough that
  the spreadsheet-eating worm (2) that you've had for the last two months
  doesn't bother you? How do you find out which data files might have been
  changed *without* your say-so? Do you compare all your backups regularly?


We all know these kinds of problems -- and yet hardly any administrators do
something about them.


Imagine for a moment that the processes you are using for software
development could be used for system installation too. (3)
You'd have the whole system under version control; if a package breaks Xorg
or kdm, you'd simply "update" (to use the CVS or subversion terminology) to
an older version of that package - or just restore the whole machine to its
last versioned state.


So, to scratch that itch, I started to write some patches for subversion
to store at least the modification time; later I posted patches for owner,
group, and mode, too.
In the subversion repository the branches still exist; you can take a look
at http://svn.collab.net/viewvc/svn/branches/meta-data-versioning/.

But the subversion client libraries were not really suitable for this
usage; storing local copies of the unmodified files needed a lot of space,
the ".svn" directories with 4 files per versioned item used too many
inodes, and subversion was simply too slow to show a status of 150,000
files. And so on and so on. (4)


So I started a new frontend for subversion repositories: fsvs.
It uses a subversion backend, stores meta-data (5), accepts most file
types (6), and has some speed improvements over subversion (7).

The current status is:
- It has (with doxygen documentation still in progress) 17,000 lines (*.h,
  *.c); without comments, brace-only lines ("{" and "}") and empty lines,
  6,500 lines of code remain. So it's still a maintainable project.
- It is currently at version 1.1.0; there are 40 subscribers on freshmeat,
  each release gets a few hundred downloads, and some other people are
  helping more or less actively (mostly with bug reports).
- The easiest case is unidirectional usage -- doing only commits (backup) or
  updates (restore).
- It's always possible to "export" some part of a repository to an arbitrary
  directory - so the backup machine can always restore some parts, too.
- Doing updates/commits in a working copy works.
- The known problems are with multi-url updates (8), i.e. overlaying
  multiple URLs into a single destination directory.

The wanted/needed features are:
- Extensions to a few standard commands ("diff" and "revert" for trees)
- Handling of user-defined properties, with a few special features (9)
- Improvements to the ignore patterns (syntax, and features)
- The subversion libraries could make use of some features already in the
  issue database (10)
- And the most important thing: full multi-url operations.


This last point is where I expect fsvs to be of *real* value.
I want fsvs to be able to merge a set of URLs locally (one for each package),
just like unionfs would do (11). Higher priority URLs would override the
lower priority ones; changes would get applied as difference, so a locally
changed configuration file would normally simply be updated.

But: there can be conflicts; handling these is likely to be a lot of work.
My first solution is to *not* merge, but simply keep the "highest" version,
with locally changed files being above all others.

The set of files *not* being in any of the source-URLs would be committable
to another repository URL, so that the "local" configuration could be stored
and re-used.
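
To make these semantics a bit more concrete, here is a minimal Python sketch
of the priority-overlay idea (the names and data structures are made up for
illustration only - this is not how fsvs is implemented):

  # Each URL contributes a mapping of path -> content; higher-priority URLs
  # override lower-priority ones, and local modifications override everything.
  def overlay(url_trees, local_changes):
      merged = {}
      for tree in url_trees:          # ordered from lowest to highest priority
          merged.update(tree)         # later (higher-priority) URLs win
      merged.update(local_changes)    # locally changed files win over all URLs
      return merged

  base      = {"/etc/motd": "from base",  "/bin/sh": "dash"}
  pkg_httpd = {"/etc/motd": "from httpd", "/etc/httpd.conf": "Listen 80"}
  local     = {"/etc/motd": "edited by the admin",
               "/etc/local.conf": "site-specific"}

  print(overlay([base, pkg_httpd], local))
  # {'/etc/motd': 'edited by the admin', '/bin/sh': 'dash',
  #  '/etc/httpd.conf': 'Listen 80', '/etc/local.conf': 'site-specific'}

  # Files that exist only locally (in none of the source URLs) are the
  # candidates for the separate "local configuration" URL mentioned above.
  only_local = {p: c for p, c in local.items()
                if not any(p in t for t in (base, pkg_httpd))}
  print(only_local)                   # {'/etc/local.conf': 'site-specific'}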


So, why am I writing this?
- First of all, it's always good to take a step back and look at what
  has been done and what is left to do.
- Furthermore, I am looking forward to hearing ideas, comments, critique and
  so on from people who have written large-scale installation systems;
  although I believe I have some experience, I'd like to hear others'
  opinions.
- Finally, if you think that this system could prove advantageous, I
  ask for your help. I have invested about 700 hours over the last 21 months;
  I see the program grow, but I wish I could make faster progress.



Summary:
Please tell me what you think about this; if you know it won't work, tell
me that too, so that I can think about aspects I may have forgotten.
If you want to take a further look at fsvs, you can start at
http://fsvs.tigris.org/.
If you'd like to help (or know someone who could help), please don't
hesitate to contact me at philipp@marek.priv.at or us at dev@fsvs.tigris.org.


Thank you for your patience!


Regards,

Phil


~~~~~~~

Ad 1: If they ran *with* some central server, you could simply let them
net-boot via DHCP and differentiate them based on their MAC addresses.
And yes, I know about FAI.

Ad 2: No, AFAIK there's no such worm for Linux. But one of your users had
this share mounted, and he "accidentally" put the cable of his Windows box
into the wrong plug ...

Ad 3: I know -- some administrators use CVS, subversion or svk for keeping
track of /etc; but all of these packages have some drawbacks.
I tried all three, but none of them really works. Of course, you can at least
keep the configuration data under control - but even the missing ownership
and permission information on the files is a big problem. (Try updating to
an older version!)

Ad 4: Something is being done about these points - but my project has been
active since May 2005, and other performance problems are still present in
subversion.
And there's much ... well, let's say cruft in the subversion client
libraries, because they are designed to handle source code (with keyword
expansion) and not simple data blobs.

Ad 5: Modification time, mode, owner and group are currently stored; owner
and group as (name, id) pairs, in case the id changes or no corresponding
passwd/group entry is found.
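For illustration, such a record can be gathered with a few standard calls;
this is only a sketch of the information listed above, not fsvs's actual
storage format:

  import os, pwd, grp, stat

  def meta_record(path):
      st = os.lstat(path)                    # lstat: don't follow symlinks
      try:
          owner_name = pwd.getpwuid(st.st_uid).pw_name
      except KeyError:
          owner_name = None                  # no passwd entry - keep the id only
      try:
          group_name = grp.getgrgid(st.st_gid).gr_name
      except KeyError:
          group_name = None                  # no group entry - keep the id only
      return {
          "mtime": st.st_mtime,
          "mode":  stat.S_IMODE(st.st_mode),
          "owner": (owner_name, st.st_uid),  # stored as (name, id) pair
          "group": (group_name, st.st_gid),
      }

  print(meta_record("/etc/passwd"))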

Ad 6: Currently files, symlinks, and character and block devices get
versioned. I see no need for versioning sockets and pipes - they'll be
re-created when the server process starts.
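Expressed as a simple check (a Python sketch of the rule above; only the
entry types mentioned in this mail are considered, and this is not fsvs
code):

  import os, stat

  def is_versioned_type(path):
      mode = os.lstat(path).st_mode
      if stat.S_ISREG(mode) or stat.S_ISLNK(mode):
          return True                   # regular files and symlinks
      if stat.S_ISCHR(mode) or stat.S_ISBLK(mode):
          return True                   # character and block devices
      return False                      # sockets and pipes are skipped

  print(is_versioned_type("/dev/null"))   # True - a character device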

Ad 7: fsvs does not translate files (CR/LF conversion, keyword substitution) -
it stores them as binary blobs, so a changed size is seen as a change.
If the mtime has changed, the file is marked as "possibly changed" - depending
on the given operation (and its parameters), an MD5 check is optionally done.
In addition to its MD5 checksum, each file has checksums stored on a per-block
basis (Manber blocks), which means fsvs knows whether a file has changed while
using about half the disk space (compared to subversion).
All working copy data is stored in a single file sorted by inode number, so
a status run stat()s the entries in disk order, which makes it faster than
find (on cold caches).
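Roughly, the status logic can be pictured like this (a much-simplified Python
sketch; the entry-list format is invented here, and fsvs's real working-copy
store looks different):

  import os, hashlib

  def md5_of(path):
      h = hashlib.md5()
      with open(path, "rb") as f:
          for block in iter(lambda: f.read(65536), b""):
              h.update(block)
      return h.hexdigest()

  def status(entries, verify_md5=False):
      # entries: dicts with 'path', 'inode', 'mtime', 'size', 'md5'
      changed = []
      for e in sorted(entries, key=lambda x: x["inode"]):  # stat() in disk order
          st = os.lstat(e["path"])
          if st.st_size != e["size"]:
              changed.append((e["path"], "changed (size)"))
          elif st.st_mtime != e["mtime"]:
              if verify_md5 and md5_of(e["path"]) == e["md5"]:
                  continue              # content identical, only mtime differs
              changed.append((e["path"], "possibly changed (mtime)"))
      return changed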

Ad 8: This is the main new feature in 1.1.0.
I'm working on stabilizing the multi-url operations.

Ad 9: I'm planning to implement two special properties.
The first one would be named something like "fsvs:commit-pipe" and would
define a command that, given the filename as its first parameter, pipes the
data to be put into the repository to STDOUT, where fsvs would take it. So
fsvs could safely be used for backups - just give something like
"gpg -e -r backupkey", and the *needed* things get encrypted (gpg keys,
ssh keys, kwallet, etc.).
The second would complement the first and be called something like
"fsvs:install-cmd".
For more details, please see the doc/TODO file in the repository or in the
distributed binaries.
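The effect of the first property could be sketched like this (the property
handling around it is made up by me; only the property name and the gpg
example come from the description above):

  import shlex, subprocess

  def data_for_commit(path, properties):
      pipe_cmd = properties.get("fsvs:commit-pipe")
      if pipe_cmd is None:
          with open(path, "rb") as f:
              return f.read()             # normal case: store the file as-is
      # Run the configured command with the filename as its first parameter
      # and take whatever it writes to STDOUT as the data to commit.
      cmd = shlex.split(pipe_cmd) + [path]
      return subprocess.run(cmd, stdout=subprocess.PIPE, check=True).stdout

  # e.g. data_for_commit("/root/secret",
  #                      {"fsvs:commit-pipe": "gpg -e -r backupkey"})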

Ad 10: A very interesting point is to automatically "link" identical binaries
within a repository. As soon as the repository layer sees that a binary which
already exists in the repository gets committed in some other path (easy with
MD5-indexing), it simply stores a link to the "old" data.
So 500 clients each committing their full root-filesystem take more or less
the same space as one - because 499 copies are only pointers in the database.
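As a toy illustration of the idea (not the subversion repository layer
itself), a content-addressed store behaves like this:

  import hashlib

  class DedupStore:
      def __init__(self):
          self.blobs = {}   # md5 -> data, each distinct content stored once
          self.paths = {}   # (client, path) -> md5, just a pointer per commit

      def commit(self, client, path, data):
          digest = hashlib.md5(data).hexdigest()
          if digest not in self.blobs:          # unseen content: store it
              self.blobs[digest] = data
          self.paths[(client, path)] = digest   # known content: only a pointer

  store = DedupStore()
  for i in range(500):
      store.commit("host%d" % i, "/bin/sh", b"...identical binary...")
  print(len(store.blobs), "stored blob(s) for", len(store.paths), "commits")
  # -> 1 stored blob(s) for 500 commits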

Ad 11: A possible solution would be to update into a set of local
directories and use unionfs to "overlay" them. But a statistic I remember
shows that unionfs has an access-time penalty of O(sqrt(overlays)), and a
slowdown by a factor of 30 or more (sqrt(1000) is about 32) for 1000
installed packages is not ok.
Especially as the "upgrade" process via fsvs runs maybe once a day or once a
month, while the penalty would apply to *every* access.


-- 
Versioning your /etc, /home or even your whole installation?
             Try fsvs (fsvs.tigris.org)!


