Re: ls aborts due to free()ing an invalid pointer
Karl E. Jorgensen wrote:
On Wed, May 02, 2007 at 11:23:21AM -0700, Steven Schlansker wrote:
I'm having a rather strange error while trying to ls a large directory.
The setup is as follows:
/home is nfs-mounted from a BSD box
nsswitch is set to use LDAP for passwd, shadow, and group info
nscd is running to cache the responses from LDAP
When I run ls -l /home, I get this error:
steven@soda:~$ ls -l /home
*** glibc detected *** free(): invalid pointer: 0xa7f9ad38 ***
Questions that might help narrow it down:
- Do other commands (find, shell wildcard expansion) behave strangely too?
- Do you get the same error if you omit "-l"?
- What about "ls --numeric-uid-gid /home"? (This might implicate or rule out LDAP.)
- Does the same happen if you run the commands on the actual (BSD?) box?
This would implicate or rule out NFS...
- Any out-of-the-ordinary options in /etc/fstab for /home?
It would be nice to narrow it down to one of:
- specific users/groups
- specific files
- network trouble (unlikely...)
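For what it's worth, the narrowing-down to specific users/groups could be scripted roughly as follows. This is only a sketch in Python, not something from the original report: the function names and the 0.5-second threshold are made up for illustration, and it assumes that Python's pwd and grp modules go through the same C library lookups (and therefore nsswitch, nscd and LDAP) that ls -l would use.

```python
import grp
import os
import pwd
import time


def collect_owner_ids(directory):
    """Return (uids, gids): the numeric owners of every entry in directory."""
    uids, gids = set(), set()
    with os.scandir(directory) as entries:
        for entry in entries:
            # Do not follow symlinks, so we look at the entry itself.
            info = entry.stat(follow_symlinks=False)
            uids.add(info.st_uid)
            gids.add(info.st_gid)
    return uids, gids


def time_lookup(lookup, numeric_id):
    """Time a single name lookup; return (numeric_id, name_or_None, seconds)."""
    start = time.monotonic()
    try:
        name = lookup(numeric_id)[0]  # field 0 is pw_name / gr_name
    except KeyError:
        name = None  # no passwd/group entry for this id
    return numeric_id, name, time.monotonic() - start


def report(directory, slow_after=0.5):
    """Print one line per uid/gid, flagging slow or unresolved lookups.

    slow_after is an arbitrary threshold in seconds (an assumption).
    """
    uids, gids = collect_owner_ids(directory)
    for kind, lookup, ids in (("uid", pwd.getpwuid, uids),
                              ("gid", grp.getgrgid, gids)):
        for numeric_id in sorted(ids):
            _, name, seconds = time_lookup(lookup, numeric_id)
            flag = ""
            if name is None:
                flag = "  <-- unresolved"
            elif seconds > slow_after:
                flag = "  <-- slow"
            print(f"{kind} {numeric_id:>8} {name or '?':<20} {seconds:.3f}s{flag}")
```

Calling report("/home") would then print one timed line per owner. If a single id is consistently slow or unresolved, that points at specific users/groups; if every lookup is slow, the directory service as a whole is the more likely suspect.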
Looks like it's error reporting from here on...
Does the same happen if you "ls -l" any of joew's files? The strace
output might reveal this, but it may have been a few hundred lines
before the interesting bit...
And finally the trace
Sorry about the rather verbose debugging information, I don't really
know how to proceed from here. Any help would be much appreciated!
verbose is good - especially when it's not random ramblings :-)
Hope this helps
I did some more narrowing down. The problem was almost certainly with
LDAP. Our LDAP server was heavily overloaded (19! I've never seen a
15-minute load average that high before...) because we had an index on
the wrong attribute: uid was indexed instead of uidNumber, while all the
queries used uidNumber as their search term.
So what was apparently happening was that name lookups were taking too
long (a few seconds?). Adding a proper index to slapd made the problem go
away. It's probably still a bug, though, that ls and friends abort when
they can't resolve a name within a certain amount of time. Is that
intended behavior? Wouldn't it be better to log a timeout and use the
numeric ID or something?
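To make the suggestion concrete, here is a minimal sketch of "log a timeout and use the numeric ID", written in Python. It is not how ls or glibc actually behave; owner_label, its 2-second default and the injectable lookup parameter are all hypothetical names and values chosen for illustration.

```python
import logging
import pwd
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as LookupTimeout


def owner_label(uid, timeout=2.0, lookup=pwd.getpwuid):
    """Return the user name for uid, or the numeric id as a string
    if the lookup fails or does not finish within timeout seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(lookup, uid)
    try:
        return future.result(timeout=timeout)[0]  # field 0 is the name
    except LookupTimeout:
        logging.warning("name lookup for uid %d timed out after %.1fs",
                        uid, timeout)
        return str(uid)
    except KeyError:
        # No such user: fall back to the number without treating it as an error.
        return str(uid)
    finally:
        # Do not block on a lookup that is still running in the worker thread.
        pool.shutdown(wait=False)
```

One caveat with this approach: the slow lookup keeps running in its worker thread after the timeout, so the process may still wait for it at exit. A real implementation would more likely set the timeout in the LDAP client itself.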