
Lockups on stadler machine



  Hi everyone,

  I just subscribed to this mailing list.

  I am a member of the core team of the Free Pascal compiler,
and we are currently trying to add support for the sparc64 architecture.

  To help us with this, we got access to the stadler machine,
and I have already managed to crash that machine several times,
but I have not yet found a simple, reproducible way
to do so.

  I am trying to run nightly testsuite runs for both the sparc and sparc64 CPUs.
These testsuite runs seem to have disturbed the kernel enough
to make the system crash.

  The sparc64 programs are of course "experimental", as we are
in the development phase, but the 32-bit sparc architecture
is not new for Free Pascal (although there were no regular
testsuite runs for that CPU before stadler...).

  Even with 32-bit-only sparc programs,
we run into specific problems, like the one below:
the main compiler, called "ppcsparc", becomes a zombie process,
which leads to a complete halt of the testsuite.

  Is there a standard way to deal with zombie processes inside
bash scripts or GNU make?
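
  On the zombie question, here is a minimal sketch, assuming procps `ps` and a POSIX `awk`. A zombie has already exited and cannot be killed; its entry disappears only once its parent (fpmake in the listing below) waits for it, or once the parent itself exits. The helper names `filter_zombies` and `list_zombies` are hypothetical and not part of fpmake or the testsuite; they only report zombies together with their parent PID, so that a script can decide what to do with the parent.

```shell
#!/bin/bash

# Read "PID PPID STAT COMMAND" lines on stdin and print
# "PID PPID COMMAND" for every process whose state starts with Z.
filter_zombies() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# List the zombies currently on the system with their parent PIDs.
# The empty "=" headers suppress the ps header line.
list_zombies() {
    ps -e -o pid= -o ppid= -o stat= -o comm= | filter_zombies
}
```

  Calling `list_zombies` from a watchdog script prints one line per zombie; the second column is the process that has not yet reaped it.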

  Below you will find a description of the problems I encountered.

  These also include the fact that the cron daemon seemed to stop
working for an unknown reason, at least for my concurrent script,
which is supposed to kill test executables after 4 minutes.
  I also found "RT WatchDog" reports in syslog for sparc64 executables,
and reports of segfaults, probably related to wrong
code generated for libraries, which is not really position independent.
  I don't really know what these "WatchDog" reports are...
  What does "error 30001" mean for the segfault?

  Any advice would be most welcome,

Pierre Muller
member of Free Pascal core development team


On 20/07/2017 at 14:03, John Paul Adrian Glaubitz wrote:
> On Thu, Jul 20, 2017 at 01:58:58PM +0200, Pierre Muller wrote:
>> pierre@stadler:~/pas/fixes/fpcsrc$ ps xf
>>   PID TTY      STAT   TIME COMMAND
>>  5468 ?        Ss     0:00 /bin/sh -c /home/pierre/bin/kill_tests.sh > /home/pierre/.kill.log 2>&1
>>  5469 ?        S      0:00  \_ bash /home/pierre/bin/kill_tests.sh
>>  5485 ?        S      0:00      \_ sleep 240
>>  2079 ?        Ss     0:00 /bin/sh -c /home/pierre/bin/linux-fpcallup.sh
>>  2080 ?        S      0:00  \_ /bin/bash /home/pierre/bin/linux-fpcallup.sh
>> 18859 ?        S      0:00      \_ /bin/bash /home/pierre/bin/linux-fpccommonup.sh
>> 22934 ?        S      0:00          \_ make -j 16 distclean fulldb TEST_USER=pierre TEST_HOSTNAME=fpc-sparc64 TEST_OPT=-Cg -ao-32 -Fo/usr/lib32 -Fl/usr/lib32 -Fl/usr/sparc64-linux-gnu/lib32 -Fl/hom
>> 23072 ?        S      0:00              \_ make allexectests
>> 23114 ?        S      0:01                  \_ make gparmake_allexectests
>> 23120 ?        S      0:00                      \_ make -C tstunits FPC_VERSION= FPC=/home/pierre/pas/fpc-3.1.1/bin/ppcsparc CPU_TARGET=sparc OS_TARGET=linux SUBARCH= OPT=-Cg -ao-32 -Fo/usr/lib32 -
>> 23178 ?        S      0:00                          \_ make -C ../../packages all OPT=-Cg -ao-32 -Fo/usr/lib32 -Fl/usr/lib32 -Fl/usr/sparc64-linux-gnu/lib32 -Fl/home/pierre/local/lib32 -Fd -n CROSS
>> 23183 ?        S      0:01                              \_ ./fpmake compile --localunitdir=.. --os=linux --cpu=sparc -o -Cg -o -ao-32 -o -Fo/usr/lib32 -o -Fl/usr/lib32 -o -Fl/usr/sparc64-linux-gnu/
>> 23188 ?        Z      0:07                                  \_ [ppcsparc] <defunct>
>>
>>   The compiler ppcsparc seems to become a zombie quite fast...
>>
>>   Is there nothing inside fpmake that tests for zombies?
>>
>>   Any idea howto handle such troubles?
>
> If this is 100% reproducible, it would be a good idea to send the
> information to the sparclinux mailing list [1].

Hi Adrian,
  I tried to build a script by adding elements one after another,
but I ended up with something that contained almost the whole nightly testsuite
crontab, without getting any zombie process...

  By the way, all the nightly testsuites seem to have completed
without errors tonight, and I am currently running...

  The testsuite stopped inside the -O1 test,
but mainly because the other cron job that I added,
called kill_tests.sh, whose purpose is quite obvious from its name
and which should run every 5 minutes, was no longer being called...
This led to the same test hanging for over an hour, and to a general
failure of the testsuite.

  I have no idea why cron stopped launching this script;
I ran `crontab -e`,
and it started again every 5 minutes as it should.
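
  Since the kill script depends on cron continuing to launch it, one possible alternative is to put the time limit on each test run itself. This is only a sketch, assuming GNU coreutils `timeout` is available; the wrapper name `run_with_limit` is hypothetical and is not how the testsuite currently runs its executables. Note that it cannot help with a process that is already a zombie.

```shell
#!/bin/bash

# Run a command with a time limit in seconds.
# timeout sends SIGTERM at the limit and SIGKILL 10 seconds later
# if the command is still alive.
run_with_limit() {
    local limit=$1
    shift
    local status=0
    timeout --kill-after=10 "$limit" "$@" || status=$?
    # GNU timeout exits with 124 when the limit was reached.
    if [ "$status" -eq 124 ]; then
        echo "TIMEOUT after ${limit}s: $*" >&2
    fi
    return "$status"
}
```

  With the 4 minute limit mentioned above, a test would be started as `run_with_limit 240 ./sometest`, where `./sometest` stands for any test executable.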

  Logged in as root, I found this:
root@stadler:~# grep -in segfault  /var/log/syslog
64:Jul 21 07:26:22 stadler kernel: [87979.372370] tlib1b[7782]: segfault at 2a1f30 ip ffff80010014ab70 (rpc ffff800100011998) sp 000007feff8ba591 error 30001 in libtlib1a.so[ffff800100128000+70000]
65:Jul 21 07:26:30 stadler kernel: [87986.945139] tlibrary2[8102]: segfault at 299bc8 ip ffff800100148050 (rpc ffff800100011998) sp 000007feff854591 error 30001 in libtlibrary1.so[ffff800100128000+68000]
120:Jul 21 07:30:27 stadler kernel: [88224.142276] tw12704b[16856]: segfault at 3347f8 ip ffff800100171f80 (rpc ffff800100011998) sp 000007feffcea591 error 30001 in libtw12704a.so[ffff800100128000+f2000]
199:Jul 21 07:33:17 stadler kernel: [88394.358561] tw8730c[20952]: segfault at 413ec0 ip ffff8001002c04e0 (rpc ffff800100011998) sp 000007feffc82591 error 30001 in libtw8730a.so[ffff8001002a0000+6a000]
211:Jul 21 07:34:02 stadler kernel: [88397.854698] tw8730d[21018]: segfault at 67bec0 ip ffff8001005284e0 (rpc ffff800100011998) sp 000007feffe72591 error 30001 in libtw8730a.so[ffff800100508000+6a000]
226:Jul 21 07:34:10 stadler kernel: [88446.653939] tw16949b[21811]: segfault at 29be38 ip ffff800100148470 (rpc ffff800100011998) sp 000007feffe22591 error 30001 in libtw16949a.so[ffff800100128000+6a000]
227:Jul 21 07:34:30 stadler kernel: [88466.811468] tweaklib2[22197]: segfault at 299bd8 ip ffff800100148060 (rpc ffff800100011998) sp 000007feff806591 error 30001 in libtweaklib1.so[ffff800100128000+68000]
272:Jul 21 07:39:09 stadler kernel: [88746.414196] tw9089c[23212]: segfault at 413e68 ip ffff8001002c04c0 (rpc ffff800100011998) sp 000007feff898591 error 30001 in libtw9089a.so[ffff8001002a0000+6a000]
290:Jul 21 07:39:53 stadler kernel: [88789.566650] tw6586b[23970]: segfault at 29c038 ip ffff800100148800 (rpc ffff800100011998) sp 000007feffee4591 error 30001 in libtw6586a.so[ffff800100128000+6a000]
437:Jul 21 07:44:52 stadler kernel: [89089.294884] tw3964b[24539]: segfault at ffffffff85f2b340 ip ffffffff85f2b340 (rpc 00000000002921a0) sp 0000000025410c00 error 30001
1063:Jul 21 07:56:18 stadler kernel: [89775.364564] tlib1b[28474]: segfault at 2a1f30 ip ffff80010014ab70 (rpc ffff800100011998) sp 000007feff8cc561 error 30001 in libtlib1a.so[ffff800100128000+70000]
1064:Jul 21 07:56:26 stadler kernel: [89782.683184] tlibrary2[28767]: segfault at 299bc8 ip ffff800100148050 (rpc ffff800100011998) sp 000007feff97e561 error 30001 in libtlibrary1.so[ffff800100128000+68000]
1199:Jul 21 08:01:28 stadler kernel: [90084.546224] tw12704b[6315]: segfault at 3347f8 ip ffff800100171f80 (rpc ffff800100011998) sp 000007feffefe561 error 30001 in libtw12704a.so[ffff800100128000+f2000]
1315:Jul 21 08:04:10 stadler kernel: [90246.757176] tw16949b[8839]: segfault at 29be38 ip ffff800100148470 (rpc ffff800100011998) sp 000007feff816561 error 30001 in libtw16949a.so[ffff800100128000+6a000]
1356:Jul 21 08:09:05 stadler kernel: [90541.405817] tw6586b[9747]: segfault at 29c038 ip ffff800100148800 (rpc ffff800100011998) sp 000007feffa92561 error 30001 in libtw6586a.so[ffff800100128000+6a000]
1371:Jul 21 08:09:29 stadler kernel: [90565.548370] tweaklib2[10049]: segfault at 299bd8 ip ffff800100148060 (rpc ffff800100011998) sp 000007feff9b6561 error 30001 in libtweaklib1.so[ffff800100128000+68000]
1416:Jul 21 08:14:40 stadler kernel: [90877.032652] tw3964b[10588]: segfault at ffffffff85f2b340 ip ffffffff85f2b340 (rpc 00000000002921a0) sp 0000000025410c00 error 30001
1685:Jul 21 08:27:06 stadler kernel: [91622.732191] tlib1b[14398]: segfault at 2a1f30 ip ffff80010014ab70 (rpc ffff800100011998) sp 000007feffb00561 error 30001 in libtlib1a.so[ffff800100128000+70000]
1686:Jul 21 08:27:13 stadler kernel: [91629.951692] tlibrary2[14636]: segfault at 299bc8 ip ffff800100148050 (rpc ffff800100011998) sp 000007feffb1a561 error 30001 in libtlibrary1.so[ffff800100128000+68000]
1880:Jul 21 08:39:07 stadler kernel: [92344.026902] tw16949b[27632]: segfault at 29be38 ip ffff800100148470 (rpc ffff800100011998) sp 000007feff918561 error 30001 in libtw16949a.so[ffff800100128000+6a000]
1961:Jul 21 08:44:28 stadler kernel: [92664.517210] tweaklib2[28322]: segfault at 299bd8 ip ffff800100148060 (rpc ffff800100011998) sp 000007feff874561 error 30001 in libtweaklib1.so[ffff800100128000+68000]
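
  To see which shared libraries the segfaults cluster in, the log lines above can be summarised per library. This is a minimal sketch assuming a POSIX `awk`; the function name `segfaults_by_library` is hypothetical. Lines without a trailing `in lib...[` part (such as the tw3964b entries) are counted under "(no mapping)".

```shell
#!/bin/bash

# Read syslog lines on stdin and print "COUNT LIBRARY" for each
# library named in a kernel segfault report, most frequent first.
segfaults_by_library() {
    awk '/segfault at/ {
        lib = "(no mapping)"
        # The library name sits between " in " and the next "[".
        if (match($0, / in [^[]+\[/)) {
            lib = substr($0, RSTART + 4, RLENGTH - 5)
        }
        count[lib]++
    }
    END { for (lib in count) print count[lib], lib }' | sort -k1,1nr -k2
}
```

  For example, `segfaults_by_library < /var/log/syslog` prints one line per library.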

  The last cron job that started kill_tests ran at 8:40...
  I also found "RT Watchdog Timeout" messages, but I don't really know what those are:

1973-Jul 21 08:45:00 stadler kernel: [92696.872780]  [000000000048bfa4] kthread+0xe4/0x120
1974-Jul 21 08:45:00 stadler kernel: [92696.936256]  [0000000000406064] ret_from_fork+0x1c/0x2c
1975-Jul 21 08:45:00 stadler kernel: [92697.005638]  [0000000000000000]           (null)
1976:Jul 21 08:47:34 stadler kernel: [92850.577708] RT Watchdog Timeout (hard): tparray16[28126]
1977:Jul 21 08:48:32 stadler kernel: [92908.360558] RT Watchdog Timeout (hard): twide2[28334]
1978:Jul 21 08:51:35 stadler kernel: [93091.784209] RT Watchdog Timeout (hard): tparray17[28351]
1979-Jul 21 08:52:41 stadler kernel: [93147.223245] INFO: rcu_sched detected stalls on CPUs/tasks:
1980-Jul 21 08:52:41 stadler kernel: [93147.296195]     2-...: (33 GPs behind) idle=270/0/0 softirq=611204/611206 fqs=0
1981-Jul 21 08:52:41 stadler kernel: [93147.390887]     3-...: (439 GPs behind) idle=3bc/0/0 softirq=529429/529431 fqs=0

  Any advice?

Pierre

