
Re: OpenSSH: cause of random kex_exchange_identification errors?



On Tue, 14 Jun 2022, Vincent Lefevre wrote:

On 2022-06-07 17:19:12 +0100, Tim Woodall wrote:
On Tue, 7 Jun 2022, Vincent Lefevre wrote:
I eventually did a packet capture on the client side as I was able to
reproduce the problem. When it occurs, I get the following sequence:

Client → Server: [SYN] Seq=0
Server → Client: [SYN, ACK] Seq=0
Client → Server: [ACK] Seq=1
Server → Client: [FIN, ACK] Seq=1
Client → Server: Client: Protocol (SSH-2.0-OpenSSH_9.0p1 Debian-1)
Server → Client: [RST] Seq=2
Client → Server: [FIN, ACK] Seq=33
Server → Client: [RST] Seq=2

So the issue comes from the server, which sends [FIN, ACK] to terminate
the connection. In OpenSSH's sshd.c, this could be due to

                       if (unset_nonblock(*newsock) == -1 ||
                           drop_connection(*newsock, startups) ||
                           pipe(startup_p) == -1) {
                               close(*newsock);
                               continue;
                       }

At least 2 kinds of errors are not logged:

* In unset_nonblock(), a "fcntl(fd, F_SETFL, val) == -1" condition.

* the "pipe(startup_p) == -1" condition.
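For illustration, the two silent failure paths listed above could be made visible with wrappers along these lines. This is only a sketch: clear_nonblock_logged() and pipe_logged() are hypothetical names, not OpenSSH functions, and they write to stderr where sshd would use its own logging routines.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not OpenSSH's unset_nonblock()): clear O_NONBLOCK
 * on fd and report any fcntl() failure together with its errno text.
 * Returns 0 on success, -1 on error.
 */
int
clear_nonblock_logged(int fd)
{
	int val = fcntl(fd, F_GETFL);

	if (val == -1) {
		fprintf(stderr, "fcntl(%d, F_GETFL): %s\n", fd, strerror(errno));
		return -1;
	}
	if (!(val & O_NONBLOCK))
		return 0;	/* already blocking, nothing to change */
	val &= ~O_NONBLOCK;
	if (fcntl(fd, F_SETFL, val) == -1) {
		fprintf(stderr, "fcntl(%d, F_SETFL): %s\n", fd, strerror(errno));
		return -1;
	}
	return 0;
}

/*
 * Hypothetical helper: pipe() with the failure reason reported.
 * Returns 0 on success, -1 on error.
 */
int
pipe_logged(int fds[2])
{
	if (pipe(fds) == -1) {
		fprintf(stderr, "pipe: %s\n", strerror(errno));
		return -1;
	}
	return 0;
}
```

With wrappers like these, the errno text would show whether, for example, pipe() is failing with EMFILE or ENFILE because a file descriptor limit has been reached.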

I'm not sure about drop_connection(), which is related to MaxStartups.


I've not seen the start of this thread but is this occasional or always?

Occasional. Someone else at my lab could reproduce the issue.
But the admins can't.

If occasional, how many concurrent connections do you have starting all
at once?

I'm not sure what you mean by "concurrent connections". The server
is an SSH gateway, so many users connect to it. But for the
client host above (my personal machine at my lab), this was the
only connection from this machine; note that I made this connection only
for testing, as there is no need to connect to this SSH gateway
from the lab.


It doesn't matter if they're from the same machine; the problem happens
if the target machine has too many connections that haven't finished
authenticating (but from what you say below I doubt this is the problem).

The default ssh config has a super-annoying default that
randomly kills sessions if too many are handshaking at once.

It's the MaxStartups setting you allude to. I've been bitten by this
where cron jobs all start at the same time and ssh to the same host.

MaxStartups was increased in February, after I initially reported
the problem.

So long as they've increased the first parameter then that should have
fixed it if it was the cause.
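To make the role of the first parameter concrete, here is a small sketch of the random early drop that sshd_config(5) documents for MaxStartups start:rate:full. It is written from the man page description rather than copied from sshd.c, and drop_percent() is a hypothetical name.

```c
/*
 * Probability, in percent, that a new connection is refused when
 * `startups` unauthenticated connections already exist, for a
 * MaxStartups setting of start:rate:full. Sketch of the documented
 * behaviour: no drops below `start`, `rate` percent at `start`,
 * rising linearly to 100 percent at `full`.
 */
int
drop_percent(int startups, int start, int rate, int full)
{
	if (startups < start)
		return 0;
	if (startups >= full || rate == 100)
		return 100;
	return rate + (100 - rate) * (startups - start) / (full - start);
}
```

With the documented default of 10:30:100, nothing is dropped while fewer than 10 connections are unauthenticated, 30% of new connections are refused at exactly 10, and about 65% at 55. Raising the first value moves the point where drops begin.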

Since this is a Debian 10 machine with OpenSSH_7.9p1 Debian-10+deb10u2,
I should have quoted the code from this sshd.c version. Thus the
connection close issue should occur in

	if (unset_nonblock(*newsock) == -1) {
		close(*newsock);
		continue;
	}
	if (drop_connection(startups) == 1) {
		char *laddr = get_local_ipaddr(*newsock);
		char *raddr = get_peer_ipaddr(*newsock);

		verbose("drop connection #%d from [%s]:%d "
		    "on [%s]:%d past MaxStartups", startups,
		    raddr, get_peer_port(*newsock),
		    laddr, get_local_port(*newsock));
		free(laddr);
		free(raddr);
		close(*newsock);
		continue;
	}
	if (pipe(startup_p) == -1) {
		close(*newsock);
		continue;
	}

Now, it appears that verbose() logs at SYSLOG_LEVEL_VERBOSE, and it
is just below the default SYSLOG_LEVEL_INFO, so that nothing would be
logged by default concerning MaxStartups, if I understand correctly.
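The level comparison described above can be sketched as follows. The enum names are illustrative only; they follow the order of the LogLevel values listed in sshd_config(5) (QUIET, FATAL, ERROR, INFO, VERBOSE, DEBUG1, DEBUG2, DEBUG3) and are not the identifiers used in OpenSSH's log.h.

```c
/* Illustrative levels, ordered from least to most verbose. */
enum level {
	LVL_QUIET, LVL_FATAL, LVL_ERROR, LVL_INFO,
	LVL_VERBOSE, LVL_DEBUG1, LVL_DEBUG2, LVL_DEBUG3
};

/*
 * Hypothetical filter: a message is emitted only if its level is at
 * most as verbose as the configured LogLevel.
 */
int
would_log(enum level configured, enum level msg)
{
	return msg <= configured;
}
```

Under this model, a message issued at the VERBOSE level is suppressed with the default LogLevel INFO and appears once LogLevel is set to VERBOSE or any DEBUG level.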

But the admins changed the log level to one of the debug levels a few days
ago, and debug messages do indeed appear, but nothing concerning my case
(I had sent the exact time of the failures to the admins).

BTW, the issue also occurs at night, when there should be very few
connections in the handshaking state.


In the case where I hit it, it was a cron job starting an ssh connection
from multiple machines - 'out of hours' where 'convenience' was more
valuable than 'performance'.

I don't have any more suggestions, sorry. Do you know how unset_nonblock
can fail? Other than building a patched version with more logging I
don't know what else to try that you haven't already done.

Tim.

