[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

dpkg removes conffiles when purging a package that no longer owns them



tags 10879 patch pending
thanks

There's quite a few people on this list, so I'll start out by saying
that if you don't care about this bug anymore, or are just glad I think
I've fixed it, then you don't really need to read this mail.  There's a
patch at the bottom, the rest is going to be pretty technical :-)


First a summary of this bug, picking on bind as it's a handy package
that can be used to reliably produce the symptoms.

 1. Install bind
 2. Install bind9, this will remove bind.
 3. Purge bind

You'll now find that a random selection of bind9's configuration files
have been removed.


The clue to solving this bug occurred when we realised that the
selection wasn't random at all, and in fact was the same on every
machine.

Here's the important bit of debug output concerning the purging of bind:

D000200: removal_bulk set to new conffile `/etc/init.d/bind'
D000020: removal_bulk conffile not ours any more `/etc/bind/named.conf'
D000020: removal_bulk conffile not ours any more `/etc/bind/named.conf.local'
D000020: removal_bulk conffile not ours any more `/etc/bind/named.conf.options'
D000020: removal_bulk conffile not ours any more `/etc/bind/db.0'
D000020: removal_bulk conffile not ours any more `/etc/bind/db.127'
D000020: removal_bulk conffile not ours any more `/etc/bind/db.255'
D000020: removal_bulk conffile not ours any more `/etc/bind/db.local'
D000020: removal_bulk conffile not ours any more `/etc/bind/db.root'

dpkg is correctly realising that the only conffile it needs to remove is
‘/etc/init.d/bind’ and that the rest should be kept because bind9 has
taken them over.

However it's removing some of them, and fortunately the ones it removes
is constant.  It removes ‘named.conf.local’, ‘db.0’, ‘db.255’ and
‘db.root’.

If we number the configuration files that it shouldn't remove, starting
at 1, it's removing #2, #4, #6 and #8.


That's a pretty damned specific pattern!  It also explains why the
single-conffile test package pairs don't exhibit the bug, if they had
two conffiles the second would get wrongly deleted.

A more extreme example (a test package I created):

 #1 should be kept
 #2 needs deleting
 #3 should be kept
 #4 should be kept
 #5 needs deleting

#4 gets incorrectly deleted by dpkg.

To put the problem into words; every second conffile in a consecutive
subset that need to be kept get incorrectly deleted.


Old C hands should already be thinking ‘linked list bug’ at this point.


We now cast our attention on the removal_bulk_remove_configfiles()
function in main/remove.c.  What it does can be summed up as follows:

 1. Iterate the conffiles list, removing nodes if the conffile isn't
    ours anymore or is part of a diversion.

 2. Iterate the conffiles list again (which will have the nodes removed)
    and do the actual work.


Reading the debug output, during the second iteration there are nodes
that were supposed to have been removed that weren't.  We're really
thinking ‘linked list bug’ at this point, almost certainly a classic
‘bad node removal’ problem.

So we focus our attention on the first loop, and walk through it.  I'm
going to assume you've got the code in front of you at this point, feel
free to grab it if you're interested.

Our players in this are:

pkg->installed.conffiles
	A pointer to the first node in a linked list of conffile
	structures.  Each structure has a ‘next’ pointer which either
	points to the next one in the list, or NULL.

lconffp
	At the start of the loop, this points to the linked list pointer
	itself (pkg->installed.conffiles), otherwise it points to the
	‘next’ member of the previous node.

conff
	A pointer to a node.  This is set at the start of each iteration
	to the contents of whatever lconffp points to, so the first node
	for the first iteration, and each subsequent node.


Let's go back to using bind/bind9 as our example pair, because that's
quite an elegant example.

The first conffile needs to be deleted, so nothing interesting happens
until the loop's step statement:

	lconffp= &conff->next;

This alters lconffp to point to the current node's ‘next’ pointer, and
when the list begins, conff will point at the next node.


The second conffile needs to be kept, so the node needs to be removed
from the linked list.  Here's the code that does that:

	*lconffp= conff->next;

This alters either the linked list pointer, or the previous node's
‘next’ pointer (which is pointed to by lconffp) to point to the next
node, not the current one.  In other words, it removes the current node
from the linked list.

Now we get to the step instruction:

	lconffp= &conff->next;

This alters lconffp to point to the current node's ‘next‘ pointer. 
Hopefully your brain just threw an exception there.  That's right,
lconffp is now pointing to *the wrong thing*; if the next node needs to
be removed as well, the ‘next’ pointer of a node that's been removed
from the linked list will be changed.

We start off with:

	[  |--->[  |--->[  |x]
	   ^     ^
	[**]	[--]

([**] is lconffp, [--] is conff)

We alter the pointer to remove the node:

	[  |----------->[  |x]
	   ^    [  |--->
	   |     ^
	[**]    [--]

Then we end up pointing at the wrong next node:

	[  |----------->[  |x]
	        [  |---> ^
	           ^     |
	        [**]    [--]

If we need to remove the next node as well, we'll end up altering the
wrong pointer:

	[  |----------->[  |x]
	        [  |x]
	           ^
	        [**]    [--]

In fact, we didn't need to move lconffp at all.  If we remove a node,
the linked list pointer itself or the previous ‘next’ pointer is still
the previous node pointer.

In other words, we only need to run the step statement if we don't
remove the node.  If we modify the code to do that, and run that
simulation again:

We start off with:

	[  |--->[  |--->[  |x]
	   ^     ^
	[**]	[--]

([**] is lconffp, [--] is conff)

We alter the pointer to remove the node:

	[  |----------->[  |x]
	   ^    [  |--->
	   |     ^
	[**]    [--]

And we begin the new iteration of the loop with no step condition:

	[  |----------->[  |x]
	   ^    [  |---> ^
	   |             |
	[**]            [--]

The removed node has now become irrelevant.  First let's simulate
removing the next node, as that's what's failing.  We alter the pointer
to remove the node:

	[  |x]          [  |x]
	   ^             
	[**]    [--]

That's done the right thing, both nodes have been removed.  Now let's
assume that node needed keeping.  We don't alter the pointer to remove
the node, instead we *do* run the loop's step condition:

	[  |----------->[  |x]
	                   ^
	                [**]    [--]

This correctly removes the second node, but keeps the third in the
list.  If the next node was to be removed, the right pointer would be
modified.


So in summary, what was causing this bug was some incorrect code to
remove nodes from a linked list.  The step condition caused a pointer to
be moved that should've been kept where it was.  By only running this
step condition when we haven't removed a node, the right pointers are
modified and the list can be traversed properly the second time around.

Patch attached and committed.


I'm a little nervous about trying to push this change for sarge without
some extensive testing.

Scott
-- 
Have you ever, ever felt like this?
Had strange things happen?  Are you going round the twist?

Index: ChangeLog
===================================================================
--- ChangeLog	(revision 1897)
+++ ChangeLog	(working copy)
@@ -1,3 +1,13 @@
+Fri Mar 12 19:02:21 GMT 2004 Scott James Remnant <scott@netsplit.com>
+
+  * main/remove.c (removal_bulk_remove_configfiles): Don't change the
+  "previous pointer" pointer if we remove the node from the linked list,
+  ensuring that if the next node is to be removed the right thing will
+  happen.
+
+  This corrects the bug where every second shared or diverted conffile
+  would be incorrectly deleted by dpkg.
+ 
 Fri Mar 12 15:05:52 GMT 2004 Scott James Remnant <scott@netsplit.com>
 
   * utils/start-stop-daemon.c: Don't require an argument for -V (version).
Index: debian/changelog
===================================================================
--- debian/changelog	(revision 1897)
+++ debian/changelog	(working copy)
@@ -1,6 +1,10 @@
 dpkg (1.10.21) unstable; urgency=low
 
   * Don't require argument for start-stop-daemon -V.  Closes: #237589.
+  * Fix incorrect linked list node removal code that caused every second
+    shared or diverted conffile to be deleted by dpkg.
+    Closes: #10879, #33046, #47267, #90623, #98210, #109691, #146167.
+    Closes: #155456, #204275.
 
  -- Scott James Remnant <scott@netsplit.com>  UNRELEASED
 
Index: main/remove.c
===================================================================
--- main/remove.c	(revision 1891)
+++ main/remove.c	(working copy)
@@ -403,9 +403,7 @@
      * are involved in diversions, except if we are the package doing the
      * diverting.
      */
-    for (lconffp= &pkg->installed.conffiles;
-         (conff= *lconffp) != 0;
-         lconffp= &conff->next) {
+    for (lconffp= &pkg->installed.conffiles; (conff= *lconffp) != 0; ) {
       for (searchfile= pkg->clientdata->files;
            searchfile && strcmp(searchfile->namenode->name,conff->name);
            searchfile= searchfile->next);
@@ -422,6 +420,7 @@
       } else {
         debug(dbg_conffdetail,"removal_bulk set to new conffile `%s'",conff->name);
         conff->hash= NEWCONFFILEFLAG; /* yes, cast away const */
+        lconffp= &conff->next;
       }
     }
     modstatdb_note(pkg);

Attachment: signature.asc
Description: This is a digitally signed message part


Reply to: