Problem with cpu time
Hi,
We have observed strange behaviour on our compute nodes, both after the
upgrade to squeeze and on new nodes freshly installed with squeeze.
Every process that runs for longer than about 24.8 days ends up with a
"nonsense" CPU time. Below is example output of "ps -u username f" over time:
[2012-05-29 05:49:33] 30590 ? R 35793:27 ./fortran_kdis 2 25
[2012-05-29 05:49:38] 30590 ? R 35793:32 ./fortran_kdis 2 25
[2012-05-29 05:49:43] 30590 ? R 35793:37 ./fortran_kdis 2 25
[2012-05-29 05:49:48] 30590 ? R 11129636:45 ./fortran_kdis 2 25
[2012-05-29 05:49:53] 30590 ? R 11129636:45 ./fortran_kdis 2 25
[2012-05-29 05:49:58] 30590 ? R 11129636:45 ./fortran_kdis 2 25
[2012-05-29 11:20:36] 30590 ? R 11129636:45 ./fortran_kdis 2 25
Several days later, the accumulated cpu time value remains the same.
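To put the ps values into days, here is a small sketch of the conversion. It assumes the TIME column is in "minutes:seconds" form, which fits the 24.8 days mentioned above; the helper name ps_time_to_seconds is made up for this illustration.

```python
SECONDS_PER_DAY = 86400  # 1 day == 86400 sec


def ps_time_to_seconds(value: str) -> int:
    """Convert a ps TIME field of the form 'minutes:seconds' to seconds."""
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)


# Values copied from the ps output above (PID 30590).
last_sane = ps_time_to_seconds("35793:37")
after_jump = ps_time_to_seconds("11129636:45")

last_sane_days = last_sane / SECONDS_PER_DAY
after_jump_days = after_jump / SECONDS_PER_DAY

print(f"last sane value: {last_sane} s = {last_sane_days:.2f} days")
print(f"after the jump:  {after_jump} s = {after_jump_days:.2f} days")
```

Under that assumption the last sane value is about 24.86 days of CPU time, while the value after the jump corresponds to more than 7700 days, which is clearly impossible for this process.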
The daily report of the "Grid Engine 2011.11" job scheduler for this
job shows:
...
...:86412.030000:6925734.008546:...
...:86380.140000:6923160.882762:...
...:86423.790000:6926644.923450:...
...:30016546.230000:2405779823.509468:... <---- day with jump
...:0.000000:0.000000:...
...:0.000000:0.000000:...
...:0.000000:0.000000:...
...:0.000000:0.000000:...
...:0.000000:0.000000:...
...:0.000000:0.000000::...
...:0.000000:0.000000:...
...:17112.340000:1371446.414438:...
...:86395.480000:6924655.520745:...
...:86411.810000:6926306.216313:...
...:86415.170000:6926575.536817:...
...:85071.220000:6818939.616130:...
...
The two numbers between the "..." are the values for "ru_utime" and
"ru_stime". The ru_utime values for the days before the "jump" are
correct, but afterwards they are nonsense for some days and then OK
again (this job was running with 100% CPU usage the whole time!). All
ru_stime values, however, look strange. Keep in mind:
1 day == 86400 sec.
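As a rough way to quantify "strange", the sketch below converts each non-zero ru_utime to days and compares ru_stime with ru_utime. It assumes both fields are in seconds; the numbers are copied from the report above and nothing else is taken from Grid Engine itself.

```python
SECONDS_PER_DAY = 86400  # 1 day == 86400 sec

# (ru_utime, ru_stime) pairs copied from the non-zero rows of the report.
rows = [
    (86412.030000, 6925734.008546),
    (86380.140000, 6923160.882762),
    (86423.790000, 6926644.923450),
    (30016546.230000, 2405779823.509468),  # day with jump
    (17112.340000, 1371446.414438),
    (86395.480000, 6924655.520745),
    (86411.810000, 6926306.216313),
    (86415.170000, 6926575.536817),
    (85071.220000, 6818939.616130),
]

# Ratio of the second field to the first, one entry per row.
ratios = [stime / utime for utime, stime in rows]

for (utime, stime), ratio in zip(rows, ratios):
    days = utime / SECONDS_PER_DAY
    print(f"ru_utime = {utime:16.2f} s = {days:8.3f} days   ru_stime/ru_utime = {ratio:.3f}")
```

For these rows the ratio stays close to 80, including on the day of the jump, and the ru_utime reported for that day corresponds to roughly 347 days within a single daily report.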
In addition, every job shows this behaviour after reaching 35793:37, but
the accumulated CPU time it jumps to differs from job to job:
[2012-05-29 05:18:32] 30591 ? R 10557290:44 ./fortran_kdis 2 27
[2012-05-29 05:34:42] 30636 ? R 11129626:19 ./fortran_kdis 2 31
[2012-05-29 05:58:20] 30637 ? R 12274089:59 ./fortran_kdis 2 30
[2012-05-29 06:02:37] 30630 ? R 12274256:17 ./fortran_kdis 2 28
[2012-05-29 06:03:12] 30634 ? R 11129641:38 ./fortran_kdis 2 29
[2012-05-29 06:09:44] 30638 ? R 12274280:17 ./fortran_kdis 2 32
[2012-05-29 06:23:55] 30587 ? R 11701990:44 ./fortran_kdis 2 26
The kernel and architecture in use are:
# uname -a
Linux warg09 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64 GNU/Linux
Any help in getting rid of this issue would be highly appreciated.
Thanks in advance...
--
Uwe Bolick
Zentrum für Astronomie und Astrophysik
Technische Universität Berlin
EW 8-1, Hardenbergstr. 36, D-10623 Berlin (Germany)