[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Bug#982002: buster-pu: package openafs/1.8.2-1



Package: release.debian.org
Severity: normal
Tags: buster
User: release.debian.org@packages.debian.org
Usertags: pu
X-Debbugs-Cc: kaduk@mit.edu

[ Reason ]
All upstream openafs releases from the 1.8.x series, prior to 1.8.7,
contain a "time bomb" bug that activates when the unix epoch time
passes 0x60000000 (Thu 14 Jan 2021 08:25:36 AM UTC).  The bug
(actually a combination of bugs) results in all outgoing Rx RPC
connections having a fixed connection ID value of 0x80000002.
In addition to the problems caused by duplicate CIDs, this initial
value is also not even valid as the first CID for the first call on
a connection, so all RPC attempts time out and fail.
The combination of bugs includes insufficient initial randomness
(tv.sec ^ tv.usec) that overwrote an actual random initial value, code
to prevent signed integer overflow when incrementing the global "next CID"
value, and a separate change to convert the field in question to an
unsigned type.  (The changes were written and merged in a different order.)
After the cutover time, when an initial CID value is produced, it is
seen as being larger than the overflow threshold, and assigned to hold
what would be the smallest negative value.  That value, in turn, when
interpreted as an unsigned value, is also larger than the overflow threshold,
so successive updates produce the same value for the connection ID.

[ Impact ]
Both AFS clients and AFS servers are affected.
Unpatched clients started after the cutover time are unable to perform any
filesystem access (the error "connection timed out" is reported).
Unpatched file servers started after the cutover time are unable to connect to
protection servers and verify user group membership to enforce ACLs, and are
unable to connect to other file (volume) servers to move volumes.
Unpatched database servers started after the cutover time are unable to
connect to each other, resulting in a breakdown of the ubik distributed
consensus protocol in deployments that use more than one database server
(three databaser servers is common).

[ Tests ]
Rx RPCs are core to essentially all AFS operation, so running any activity
that goes over the network (including file system access, administrative
commands, and group modification commands) will exercise the code in question.
There are also automated tests that cover basic Rx behaviors, which failed
until the code was patched.
Significant manual testing has been done, with multiple sites running the
patched code in production, and I have performed targeted manual tests
with mixed environments (patched client, unpatched server, etc.) including
rolling back to the broken version to confirm that everything works with
the patched software and doesn't work with the unpatched software.

[ Risks ]
The changes are very targeted and easy to understand fixes.
Risks are, accordingly, limited, and the risks of not updating are
catastrophic (any process restart will introduce the broken condition).

[ Checklist ]
  [x] *all* changes are documented in the d/changelog
  [x] I reviewed all changes and I approve them
  [x] attach debdiff against the package in (old)stable
  [x] the issue is verified as fixed in unstable

[ Changes ]
There are two conceptual changes: removing the stale not-very-random initial
value to leave the actually random initial value, and removing the
signed-integer-overflow-detection code entirely, since unsigned integer
overflow will give the desired behavior.  The latter is split across two
patches, since an intermediate state was committed upstream during the crisis.

As additional context for the former change,
https://salsa.debian.org/debian/openafs/-/blob/debian/1.8.2-1/src/rx/rx.c#L591-604
highlights the current code's call to RAND_bytes() that is subsequently
overwritten by a "*slightly* random start time".


-- System Information:
Debian Release: bullseye/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-2-amd64 (SMP w/12 CPU threads)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
diff -Nru openafs-1.8.2/debian/changelog openafs-1.8.2/debian/changelog
--- openafs-1.8.2/debian/changelog	2018-09-11 22:53:43.000000000 -0700
+++ openafs-1.8.2/debian/changelog	2021-01-26 20:13:14.000000000 -0800
@@ -1,3 +1,10 @@
+openafs (1.8.2-1+deb10u1) buster; urgency=high
+
+  * Pull in upstream patches to fix outgoing connections after unix
+    epoch time 0x60000000 (Closes: #980115, #980116)
+
+ -- Benjamin Kaduk <kaduk@mit.edu>  Tue, 26 Jan 2021 20:13:14 -0800
+
 openafs (1.8.2-1) unstable; urgency=high
 
   * New upstream release 1.8.1.1:
diff -Nru openafs-1.8.2/debian/patches/0003-rx-rx_InitHost-do-not-overwrite-RAND_bytes-rx_nextCi.patch openafs-1.8.2/debian/patches/0003-rx-rx_InitHost-do-not-overwrite-RAND_bytes-rx_nextCi.patch
--- openafs-1.8.2/debian/patches/0003-rx-rx_InitHost-do-not-overwrite-RAND_bytes-rx_nextCi.patch	1969-12-31 16:00:00.000000000 -0800
+++ openafs-1.8.2/debian/patches/0003-rx-rx_InitHost-do-not-overwrite-RAND_bytes-rx_nextCi.patch	2021-01-26 20:09:33.000000000 -0800
@@ -0,0 +1,44 @@
+From: Jeffrey Altman <jaltman@auristor.com>
+Date: Thu, 14 Jan 2021 09:41:39 -0500
+Subject: rx: rx_InitHost do not overwrite RAND_bytes rx_nextCid
+
+39b165cdda941181845022c183fea1c7af7e4356 ("Move epoch and cid
+generation into the rx core") introduced the use of RAND_bytes()
+to generate the initial 'rx_nextCid' but failed to remove the
+
+  rx_nextCid = ((tv.tv_sec ^ tv.tv_usec) << RX_CIDSHIFT;
+
+assignment inherited from IBM/Transarc.
+
+At Thu, 14 Jan 2021 08:25:36 GMT the IBM inherited calculation
+overflows the value CID range.   This triggers broken overflow
+logic in update_nextCid().
+
+Change-Id: Ib7283def1ded9792d394133a3969a6d86f3a6123
+Reviewed-on: https://gerrit.openafs.org/14491
+Reviewed-by: Andrew Deason <adeason@sinenomine.net>
+Tested-by: Andrew Deason <adeason@sinenomine.net>
+Reviewed-by: Jeffrey Hutzelman <jhutz@cmu.edu>
+Reviewed-by: Cheyenne Wills <cwills@sinenomine.net>
+Tested-by: Mark Vitale <mvitale@sinenomine.net>
+Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
+(cherry picked from commit a3bc7ff1501d51ceb3b39d9caed62c530a804473)
+(cherry picked from commit bb6f29732ec25de832b3366b6914be51c021f67b)
+---
+ src/rx/rx.c | 3 ---
+ 1 file changed, 3 deletions(-)
+
+diff --git a/src/rx/rx.c b/src/rx/rx.c
+index 651136a..78a2a54 100644
+--- a/src/rx/rx.c
++++ b/src/rx/rx.c
+@@ -597,9 +597,6 @@ rx_InitHost(u_int host, u_int port)
+     MUTEX_ENTER(&rx_quota_mutex);
+     rxi_dataQuota += rx_extraQuota; /* + extra pkts caller asked to rsrv */
+     MUTEX_EXIT(&rx_quota_mutex);
+-    /* *Slightly* random start time for the cid.  This is just to help
+-     * out with the hashing function at the peer */
+-    rx_nextCid = ((tv.tv_sec ^ tv.tv_usec) << RX_CIDSHIFT);
+     rx_connHashTable = (struct rx_connection **)htable;
+     rx_peerHashTable = (struct rx_peer **)ptable;
+ 
diff -Nru openafs-1.8.2/debian/patches/0004-rx-update_nextCid-overflow-handling-is-broken.patch openafs-1.8.2/debian/patches/0004-rx-update_nextCid-overflow-handling-is-broken.patch
--- openafs-1.8.2/debian/patches/0004-rx-update_nextCid-overflow-handling-is-broken.patch	1969-12-31 16:00:00.000000000 -0800
+++ openafs-1.8.2/debian/patches/0004-rx-update_nextCid-overflow-handling-is-broken.patch	2021-01-26 20:09:33.000000000 -0800
@@ -0,0 +1,58 @@
+From: Jeffrey Altman <jaltman@auristor.com>
+Date: Thu, 14 Jan 2021 09:57:13 -0500
+Subject: rx: update_nextCid overflow handling is broken
+
+The overflow handling in update_nextCid() produces a rx_nextCid
+value of 0x80000001 which itself is out of the valid range.   When
+used to construct the first call of a new connection the connection
+id for the call becomes 0x80000002, and all subsequent connections
+also trigger the overflow handling and thus also receive connection
+id 0x80000002.
+
+If the same connection id is used for multiple connections from
+the same endpoint the accepting rx peer will be very confused.
+
+When authenticated connections are used, the CHALLENGE/RESPONSE
+will fail because of a mismatch in the connection's callNumber
+array.
+
+If an initiator makes only a single connection to a given rx peer,
+that connection would succeed, but once multiple connections are
+initiated all communication from a broken initiator to any rx peer
+will fail.
+
+The incorrect overflow calculation was introduced by
+39b165cdda941181845022c183fea1c7af7e4356 ("Move epoch and cid
+generation into the rx core").
+
+This change corrects the overflow value to become
+
+  1 << RX_CIDSHIFT
+
+Change-Id: If36e3aa581d557cc0f4d2d478f84a6593224c3cc
+Reviewed-on: https://gerrit.openafs.org/14492
+Reviewed-by: Andrew Deason <adeason@sinenomine.net>
+Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
+Tested-by: Benjamin Kaduk <kaduk@mit.edu>
+(cherry picked from commit 2c0a3901cbfcb231b7b67eb0899a3133516f33c8)
+(cherry picked from commit fbb854873d84271b4e32f19106f1f3d9c2d0f897)
+---
+ src/rx/rx.c | 5 ++---
+ 1 file changed, 2 insertions(+), 3 deletions(-)
+
+diff --git a/src/rx/rx.c b/src/rx/rx.c
+index 78a2a54..03eb946 100644
+--- a/src/rx/rx.c
++++ b/src/rx/rx.c
+@@ -6609,9 +6609,8 @@ update_nextCid(void)
+ {
+     /* Overflow is technically undefined behavior; avoid it. */
+     if (rx_nextCid > MAX_AFS_INT32 - (1 << RX_CIDSHIFT))
+-	rx_nextCid = -1 * ((MAX_AFS_INT32 / RX_CIDSHIFT) * RX_CIDSHIFT);
+-    else
+-	rx_nextCid += 1 << RX_CIDSHIFT;
++	rx_nextCid = 0;
++    rx_nextCid += 1 << RX_CIDSHIFT;
+ }
+ 
+ static void
diff -Nru openafs-1.8.2/debian/patches/0005-Remove-overflow-check-from-update_nextCid.patch openafs-1.8.2/debian/patches/0005-Remove-overflow-check-from-update_nextCid.patch
--- openafs-1.8.2/debian/patches/0005-Remove-overflow-check-from-update_nextCid.patch	1969-12-31 16:00:00.000000000 -0800
+++ openafs-1.8.2/debian/patches/0005-Remove-overflow-check-from-update_nextCid.patch	2021-01-26 20:09:33.000000000 -0800
@@ -0,0 +1,33 @@
+From: Benjamin Kaduk <kaduk@mit.edu>
+Date: Thu, 14 Jan 2021 10:20:59 -0800
+Subject: Remove overflow check from update_nextCid
+
+The rx_nextCid global has been an unsigned type since
+http://gerrit.openafs.org/11106 (which was actually merged before
+the refactoring of overflow check to avoid signed integer overflow)
+and thus there is no need to avoid signed overflow.  The per-connection
+cid has been unsigned since the IBM import.
+
+The natural unsigned behavior on overflow of wrapping is the desired
+behvaior here, so just remove the extra logic and always increment.
+
+Change-Id: I2d9fd24082b762eb871199da3ac1cc0983764585
+(cherry picked from commit 53a0119abd6bb7c460386d6f6d19c049f40fe71d)
+---
+ src/rx/rx.c | 3 ---
+ 1 file changed, 3 deletions(-)
+
+diff --git a/src/rx/rx.c b/src/rx/rx.c
+index 03eb946..4c1ac6f 100644
+--- a/src/rx/rx.c
++++ b/src/rx/rx.c
+@@ -6607,9 +6607,6 @@ rxi_CancelGrowMTUEvent(struct rx_call *call)
+ static void
+ update_nextCid(void)
+ {
+-    /* Overflow is technically undefined behavior; avoid it. */
+-    if (rx_nextCid > MAX_AFS_INT32 - (1 << RX_CIDSHIFT))
+-	rx_nextCid = 0;
+     rx_nextCid += 1 << RX_CIDSHIFT;
+ }
+ 
diff -Nru openafs-1.8.2/debian/patches/series openafs-1.8.2/debian/patches/series
--- openafs-1.8.2/debian/patches/series	2018-09-11 22:49:48.000000000 -0700
+++ openafs-1.8.2/debian/patches/series	2021-01-26 20:09:33.000000000 -0800
@@ -1,2 +1,5 @@
 0003-Catch-up-to-roken-s-rename-of-base64-symbols.patch
 butc-repair-build-error.patch
+0003-rx-rx_InitHost-do-not-overwrite-RAND_bytes-rx_nextCi.patch
+0004-rx-update_nextCid-overflow-handling-is-broken.patch
+0005-Remove-overflow-check-from-update_nextCid.patch

Reply to: