[snip]
I did some more narrowing down. The problem was almost certainly with
LDAP. Our LDAP server was heavily overloaded (19! Never seen a
15-minute load average that high before...) because we had an index on
the wrong key (uid instead of uidNumber, and all the queries used
uidNumber as their search term).
19 is workable. But it starts to hurt around there. I once had one of my
boxes up to 78 (didn't want to reboot as this would lose both uptime
counter and a diagnostic opportunity).
So apparently what was happening was that name lookups were taking too
long (a few seconds?). Adding a proper index to slapd made the problem
go away. It's probably a bug, though, that ls and friends would abort if
they couldn't resolve the name within a certain amount of time - is that
intended behavior?
I suspect that this is *not* the intended behaviour of ls :-) Sounds
like there's an obscure bug somewhere there...
Wouldn't it be better to log a timeout and use the numeric ID or
something?
I concur. But setting up a testcase for it might require a bit of
work - might not be worth it for such an obscure bug...
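
For what it's worth, the "log a timeout and use the numeric ID" idea
above can be sketched in a few lines of Python. This is only an
illustration of the fallback, not how ls actually does its lookups:
owner_name, the 2-second default and the injectable lookup parameter
are all made up for the example. Only pwd.getpwuid is the real
standard-library call, and it goes through whatever name service
(LDAP included) the box is configured to use.

```python
import logging
import pwd
from concurrent.futures import ThreadPoolExecutor, TimeoutError as LookupTimeout

# Small shared pool so a stuck lookup does not block the caller.
_pool = ThreadPoolExecutor(max_workers=4)


def owner_name(uid, timeout=2.0, lookup=pwd.getpwuid):
    """Return the user name for uid, or the numeric ID as a string.

    Hypothetical helper: 'timeout' and 'lookup' are assumptions made
    for this sketch. 'lookup' exists so a slow directory can be faked.
    """
    future = _pool.submit(lookup, uid)
    try:
        return future.result(timeout=timeout).pw_name
    except LookupTimeout:
        # Log the timeout and fall back instead of aborting.
        logging.warning("name lookup for uid %d timed out after %.1fs",
                        uid, timeout)
        return str(uid)
    except KeyError:
        # getpwuid raises KeyError when there is no such entry.
        return str(uid)
```

Passing a deliberately slow function as lookup would also give a cheap
test case for the timeout path without having to overload a real LDAP
server. Note that the worker thread stays blocked until the slow lookup
returns; the caller just stops waiting for it.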