
Re: System-critical package management



Hello,

The lack of any system of recognition for packages that are critical to system operation impedes the reliability of Debian-based systems. For example, a reboot during a background upgrade of critical system packages, started without the user's knowledge, may leave the system unable to boot as expected, with little readily available feedback to the user as to the cause.

Locking out reboots while the package manager is active is a policy that needs to be provided by the policy layer that allows ordinary users to reboot -- so this is the responsibility of the desktop environment.

In the base system, both rebooting and invoking the package manager require superuser privileges. For single-user systems, it is the responsibility of the administrator not to issue a reboot command while a package upgrade is in progress, which is not an onerous requirement because the package upgrade must be manually commanded as well.

Packages are often installed in environments where no control over reboots is possible and where system services usually found on desktops are unavailable, such as inside containers during preparation of container images.

There is no appropriate place to implement such a lockout at a low level. The kernel is informed of the intention to reboot only after system shutdown is complete, so this is the wrong place. Above the kernel, users have a choice of different policy layers that fit their use case best, including "none".

But: because background updates on desktop systems are implemented as a system service that is run through a policy layer, it is possible to implement such a lockout on this layer.

Other operating systems like Windows and macOS manage this by updating system-critical components separately from user-land during shutdown, while clearly giving the user feedback that critical updates are taking place and that, for example, the system should not be turned off.

No, these systems make no distinction between system and user components. The reason upgrades are performed through a reboot is a historical shortcoming in the file system implementation that Unix does not share: Unix separates the contents of a file from its file name, so if a file is open, its name can still be changed or removed, while the file contents are kept until no more names point to them *and* no more open file handles exist.
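The Unix behaviour is easy to observe. A minimal sketch in Python, assuming a Unix-like system (the function name is made up for illustration):

```python
import os
import tempfile


def read_after_unlink(payload):
    """Write payload to a file, remove the file's only name while it is
    still open, then read the contents back through the open handle."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as handle:
        handle.write(payload)
        handle.flush()
        os.unlink(path)                      # the name is gone ...
        name_exists = os.path.exists(path)
        handle.seek(0)
        data = handle.read()                 # ... but the contents are still there
    return data, name_exists
```

On Unix the read succeeds and the name no longer exists. The storage is released only once the handle is closed as well.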

On Windows, open files cannot be renamed or deleted unless the program that opened them has specifically allowed this, which (for historical reasons) few programs do. The upgrade process therefore works by unpacking the new files under temporary names, making a note to rename them, then rebooting and performing the renames while no users are logged on and no services are running, and only then starting the system.

This process is the same even for user programs, so if you update WinRAR while it is open (so the file cannot be updated), the installation process will ask for a reboot to complete the upgrade.

A potential middle-ground solution to this is to allow packages to be marked as "system-critical" to DPKG by external system components - for example a standard desktop Ubuntu system might mark the Gnome Display Manager, Networking drivers, and others in this way during installation.  These system-critical packages could then be protected by DPKG in the following ways:

	- They are automatically reverted to a known good state on upgrade failure (e.g. previous version)

Generally, packages are expected to go from one functional state to another in a very quick operation after verifying that the operation can be performed.

For example, grub installed into the MBR will check that all components are present, prepare the image to be written in memory, and only in the last step, write the first and second stage bootloaders in one go. Any failure at this stage would be "hardware error", which would also apply to the old version, and until that point, the old version would still work.

It is much more likely for a package to indicate success and subsequently fail on reboot because of a missing check, but this is not something the package manager can help with.

What already exists is automatic revert if a package fails to unpack because of an I/O error (or the disk being full).
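The general "prepare everything first, switch in one step" pattern can be sketched as follows. This is only an illustration in Python, not dpkg's actual code; the function name and the ".staged-" prefix are invented for the example.

```python
import os
import tempfile


def replace_atomically(path, new_contents):
    """Stage new_contents next to path, then switch over with one rename.
    If anything fails before the rename, the old file stays untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, staged = tempfile.mkstemp(dir=directory, prefix=".staged-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(new_contents)
            handle.flush()
            os.fsync(handle.fileno())   # make sure the data is on disk first
        os.replace(staged, path)        # the only step that changes `path`
    except BaseException:
        # "Revert" here just means discarding the staged copy.
        try:
            os.unlink(staged)
        except FileNotFoundError:
            pass
        raise
```

An I/O error or a full disk surfaces while writing the staged copy, at which point the old version is still in place and still works.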

	- They cannot be removed without being unmarked as "system-critical"

We have "Essential: yes" packages, which dpkg protects, and "Protected: yes" packages, which are protected by apt. The latter category is where bootloaders fall (it also helps that the main author of apt is also a grub maintainer).

The dpkg program will allow you to remove the bootloader, because that is what allows changing bootloaders easily. The "Essential" set is basically just what is required for dpkg to function -- so dpkg cannot self-destruct.
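As a rough illustration of how a tool could read these two markers from a control stanza, here is a small Python sketch. The parser is deliberately simplified, the function names are invented, and the mapping of field to guarding tool simply follows the description above.

```python
def parse_stanza(text):
    """Parse one control stanza into a dict with lowercased field names.
    Continuation lines (starting with whitespace) extend the previous field."""
    fields = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and last is not None:
            fields[last] += "\n" + line.strip()
            continue
        name, _, value = line.partition(":")
        last = name.strip().lower()
        fields[last] = value.strip()
    return fields


def removal_guard(stanza_text):
    """Return which tool guards the package against removal, or None."""
    fields = parse_stanza(stanza_text)
    if fields.get("essential", "no").lower() == "yes":
        return "dpkg"
    if fields.get("protected", "no").lower() == "yes":
        return "apt"
    return None
```

For example, a stanza containing "Protected: yes" but no "Essential: yes" yields "apt", and a stanza with neither field yields None.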

	- The system could check during every shutdown that system-critical packages are in a consistent state, reverting to a known good state if not

Again, this would need to be inside the policy layer that defines "shutdown" -- there are many of those, and most of them are outside the Debian system (e.g. if you run Debian in a container under Kubernetes, then Kubernetes is the policy layer that would be responsible for that).

On desktop systems, systemd is the appropriate policy layer to decide about reboots, and (if I remember correctly) packagekit is the policy layer that invokes dpkg, so packagekit would need to inhibit reboots while it is working, and it can do so easily because it can assume systemd to be present and running.
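For illustration, systemd ships a systemd-inhibit command that holds such a lock for the lifetime of a child process. A minimal Python sketch that wraps a command this way follows; the function names and the default "who" string are made up, and I am not claiming this is how packagekit does it.

```python
import subprocess


def inhibited_command(command, why, who="package-updater"):
    """Build an argv that runs `command` under a shutdown inhibitor lock."""
    if not command:
        raise ValueError("command must not be empty")
    return [
        "systemd-inhibit",
        "--what=shutdown",   # covers reboot and power-off requests
        "--mode=block",      # block the request rather than merely delay it
        f"--who={who}",
        f"--why={why}",
        *command,
    ]


def run_inhibited(command, why):
    """Run the command under the lock and return its exit status."""
    return subprocess.run(inhibited_command(command, why), check=False).returncode
```

The lock is released when the wrapped command exits. How strictly a blocking lock is enforced depends on the systemd version and on who requests the reboot, so treat this as a sketch of the mechanism rather than a guarantee.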

I am interested in knowing the community's thoughts on this, and whether these ideas have any merit to them.

On the lower levels, what can be reasonably implemented already is. The lockout you describe belongs in the desktop system, but it would require new UI to be developed to be useful -- rejecting the reboot is easy, but indicating to the user why the reboot was rejected or disabling the option requires a new communication channel, and without that functionality, the user experience would be "I tried to reboot and it didn't do anything."

Breaking the layer separation would be a horribly complicated mess -- adding new low-level errors means adding appropriate error handlers to all intermediate layers until the error can bubble up to the user. This is something component systems have historically struggled with -- every time Windows displays some "error code c0312313" type dialog, this is a missing handler chain.

   Simon

