rumpdisk device timeouts
I've been using the 32-bit Hurd for my stress-ng testing, so as to stay
isolated from developments in the 64-bit version. The virtual machines
use the 'QEMU HARDDISK' device, and I have repeatedly hit read and write
timeouts when running with rumpdisk, but have seen no evidence of any
similar failures when using the Linux block driver without rumpdisk.
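For reference, the disk-heavy runs I mean are along these lines (an
illustrative invocation only; the exact stressor mix and duration vary
between runs):

   stress-ng --hdd 4 --hdd-bytes 1g --io 2 --timeout 90m --verbose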
The timeouts look like this:
[ 2011.9300050] wd0d: device timeout reading fsbn 10184704 of 10184704-10184711 (wd0 bn 10184704; cn 10103 tn 13 sn 61), xfer d20, retry 0
[ 2011.9300050] wd0d: device timeout reading fsbn 449552 of 449552-449559 (wd0 bn 449552; cn 445 tn 15 sn 47), xfer ec0, retry 0
[ 2011.9300050] wd0d: device timeout reading fsbn 440176 of 440176-440183 (wd0 bn 440176; cn 436 tn 10 sn 58), xfer b80, retry 0
[ 2011.9300050] wd0d: device timeout reading fsbn 1502104 of 1502104-1502111 (wd0 bn 1502104; cn 1490 tn 2 sn 58), xfer f90, retry 0
[ 10508.3700050] wd0d: device timeout writing fsbn 176616 of 176616-176623 (wd0 bn 176616; cn 175 tn 3 sn 27), xfer d88, retry 0
[ 10508.3700050] wd0d: device timeout writing fsbn 176624 of 176624-176631 (wd0 bn 176624; cn 175 tn 3 sn 35), xfer f90, retry 0
[ 10518.8700050] wd0d: device timeout writing fsbn 176624 of 176624-176631 (wd0 bn 176624; cn 175 tn 3 sn 35), xfer f90, retry 1
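For what it's worth, the cn/tn/sn fields in those messages decode back
to the block numbers if you assume the usual 16-head, 63-sector
translated geometry, so they at least look internally consistent; a
quick sanity check (assuming that geometry):

   #!/usr/bin/env python3
   # Decode the wd(4)-style cn/tn/sn fields back to a block number,
   # assuming a 16-head, 63-sectors-per-track translated geometry.
   HEADS, SECTORS = 16, 63

   def chs_to_bn(cn, tn, sn):
       return (cn * HEADS + tn) * SECTORS + sn

   assert chs_to_bn(10103, 13, 61) == 10184704
   assert chs_to_bn(445, 15, 47) == 449552
   assert chs_to_bn(175, 3, 27) == 176616

The affected blocks are also scattered across the disk rather than
clustered, so it doesn't look like a single bad region of the image.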
I've also had a number of occasions where the rumpdisk task was
seemingly the central figure in a system-wide lockup, with the kernel
debugger's user-space stack trace showing, for example:
thread: 32
Continuation mach_msg_continue
>>>>> user space <<<<<
mach_msg_trap 0x822daec(0x81b2360(7b518b4,2,0,18,ae)
__pthread_block 0x81e0369(7c04320,20118840,7b518f8,8120e02,0)
__pthread_cond_timedwait_internal 0x81e0ce4(7c04260,7c04b40,ffffffff,0,81e0e09)
pthread_cond_wait 0x81e0e21(7c04260,7c04b40,7c04b40,812c2bd,0)
rumpuser_cv_wait 0x812cadd(7c04260,7c04b40,7b519e8,811bd11,7c04b40)
0x811bd7c(200aefa0,200aef9c,1000,1,0)
rumpns_physio 0x818a37c(806f110,0,303,0,0)
0x8070512(303,0,7b51c64,10,1)
rumpns_cdev_write 0x80d666b(303,0,7b51c64,10,812c2a9)
rumpns_spec_write 0x8160ccb(7b51bac,8354d74,819c98b,8354d74,0)
0x819b5ba(20092000,7b51c64,10,20048040,0)
0x816d399(2011e0c0,7b51cd8,7b51c64,20048040,0)
rumpns_dofilewrite 0x8096a44(3,2011e0c0,71fb000,1000,7b51cd8)
0x817d105(20118840,7b51d58,7b51d50,8124fbe,0)
0x812500e(ae,7b51d58,18,7b51d50,0)
0x8117dac(3,71fb000,1000,20b7b000,0)
rumpdisk_device_write 0x804a413(20013f80,b2,12,0,105bd8)
_Xdevice_write 0x804d371(7b51ee0,7b53ef0,819febb,cb87f3c,7b53ef0)
0x804ab9b(7b51ee0,7b53ef0,7b51e94,0,7b53ef0)
0x804e21b(7b51ee0,7b53ef0,0,0,80001712)
0x81b26da(7b55f98,2000,10,900,1d4c0)
0x804e34b(0,8354d74,7b55fe8,819e545,0)
0x819e589(7c04320,cb87f78,0,0,
The above might not be abnormal in itself; it appears to be a
device_write RPC blocked in rumpns_physio, waiting on a rump condition
variable, and I haven't looked through the code yet. Nevertheless the
task was stalled, and the cause was not page-in, which is often what
stalls other tasks during this test case (page wiring does seem to be
functioning correctly). I have a virtual machine snapshot with rumpdisk
in the above state if more information would be helpful.
With an additional improvement to libports interruptions (which I'll
mail about separately), I switched off the non-rumpdisk Hurd test case
after 20 successful hours, whereas with rumpdisk I can only sometimes
manage 90 minutes at most.
Is a 64-bit rumpdisk virtual machine likely to be any more stable than
the 32-bit one?
There's no indication that my host machine has any hardware issues, but
I suppose there could be a bug in QEMU's q35/SATA emulation rather than
in the i440FX/IDE setup used by the non-rumpdisk guest; the difference
in disk attachment is sketched below. I did try a 64-bit Hurd install
on an old physical PC, but without any success so far. My second PC has
UEFI-only firmware, so I don't believe that will get very far.
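For concreteness, the two attachments compare roughly like this
(illustrative command lines; the image name, memory size and remaining
options are placeholders rather than my exact invocations):

   # non-rumpdisk guest: i440FX ('pc') machine, disk on the PIIX IDE controller
   qemu-system-x86_64 -M pc -m 2G \
       -drive id=hd0,file=hurd32.img,format=raw,if=none \
       -device ide-hd,drive=hd0,bus=ide.0

   # rumpdisk guest: q35 machine, disk on the ICH9 AHCI (SATA) controller,
   # whose ports also show up as ide.0 ... ide.5
   qemu-system-x86_64 -M q35 -m 2G \
       -drive id=hd0,file=hurd32.img,format=raw,if=none \
       -device ide-hd,drive=hd0,bus=ide.0

In both cases the guest reports the disk model as 'QEMU HARDDISK'.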