Re: rx2660 + debian
On 26.04.22 17:01, Pedro Miguel Justo wrote:
On 2022/Apr/26, at 06:34, Frank Scheiner <email@example.com> wrote:
So maybe best to give `hardened_usercopy=off` a try on your rx2660, too.
From my testing on rx2660 and rx2620 this seems to unbreak the kernel
boot and maybe also makes it less likely to hit the problem post boot. I
don't know why Adrian's rx2660 seems to be unaffected by this, though.
I did. That is why I ended up compiling 5.17 with the entire thing turned off. With 5.17, on my rx2660 Montvale with 8 cores the machine can’t get past early boot even with hardened_usercopy=off.
Those 'warnings' are actually processes being killed. And they depend on the direction in which the bad copy was happening.
Thanks for the clarification.
If you look at my prior responses, with the 4.19 kernel I was also running along fine for hours and, after some time building the kernel (a benchmark in itself), it would start producing these warnings and would not allow compilation to continue any further. I would reboot the machine and that gave me a few more hours. When I tried 'hardened_usercopy=off' on the 4.19 kernel, that worked. I no longer got these process terminations after a few hours and the machine was able to build the entire kernel from beginning to end.
So, 4.19 and 5.17 are different in many ways (symptom-wise):
- I never got a bugcheck (panic) level failure on the 4.19. They were all process termination level.
- On the 4.19 these took quite some time to show up. Seemed to depend on the number of processes created in the past and was mitigated by a reboot. On the 5.17 it was very aggressive, showing up early in boot, even on system threads like the crypto boot self test. Disabling the crypto boot self test made it go farther but not much. If the error is detected on a system thread, there is no process to terminate: it is game over.
- hardened_usercopy=off was observed by 4.19 but ignored by 5.17
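One way to rule out a typo or a boot loader dropping the option is to check that the parameter actually reached the kernel. Below is a minimal shell sketch of such a check. `cmdline_param` is a hypothetical helper name, not an existing tool, and it only reports what is on the command line. It cannot tell whether the kernel honoured the setting.

```shell
# Print the value of a kernel command-line parameter, or nothing if absent.
# Usage: cmdline_param NAME [CMDLINE_STRING]
# cmdline_param is a hypothetical helper, written here for illustration only.
cmdline_param() {
    name=$1
    # Fall back to the running kernel's command line when no string is given.
    cmdline=${2-$(cat /proc/cmdline)}
    # Unquoted expansion splits the command line on whitespace; this assumes
    # the command line contains no shell glob characters.
    for word in $cmdline; do
        case $word in
            "$name"=*) printf '%s\n' "${word#*=}" ;;
        esac
    done
}
```

On a running system, `cmdline_param hardened_usercopy` with no second argument reads /proc/cmdline and prints `off` if the option was passed, or nothing if it is missing.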
Well, it seems to make a difference for my rx2660, maybe because of
Montecitos instead of Montvales, I don't know. Or it depends on the
available memory (i.e. maybe it happens more/less often with less/more
memory available). Mine has 32 GiB in total.
I don’t exclude the possibility of human error in conducting all these experiments (some of the process is error prone), but I did run these experiments more than just a few times, so it would have to be a heck of a coincidence to end up with consistent results.
Sure, my test results are also more anecdotal as it takes so much time
to boot and run things (`openssl speed -elapsed` takes around 23 mins).
I'll now look at my other Itanium gear, rx2800 i2 first,
First testing with 5.17.0-1-mckinley on my rx2800 i2 interestingly shows
no issues with memcopy at all, not during kernel boot, nor post boot. My
kernel cmdline is as follows:
root@rx2800-i2:~# cat /proc/cmdline
BOOT_IMAGE=net0:/AC10027B.vmlinuz root=/dev/nfs ip=:::::enp8s0f0:dhcp
It could well be that the Tukwilas behave differently in that case. In
the end they have their memory controller included in the processor and
not in the chipset like the older Montecitos or Montvales.
[rx2800-i2-mp-ilo] CM:hpiLO-> sysrev
Revisions          Active      Pending
iLO FW           : 01.54.03
System FW        : 01.93
MHW FPGA         : 02.02
Power Mon FW     : 02.09
PRS HW           : 02.06
IOH HW           : 02.02
Power Supply 1   : 02.01
Power Supply 2   : 02.01
root@rx2800-i2:~# uname -a
Linux rx2800-i2 5.17.0-1-mckinley #1 SMP Debian 5.17.3-1 (2022-04-18)
CPU op-mode(s): 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Itanium(R) Processor 9320
Model name: Intel(R) Itanium(R) Processor 9320
BIOS Model name: Intel(R) Itanium(R) Processor 9320
CPU family: 32
Thread(s) per core: 2
Core(s) per socket: 4
Flags: branchlong, 16-byte atomic ops, 0x8
Caches (sum of all):
L1d: 64 KiB (4 instances)
L1i: 64 KiB (4 instances)
L2d: 1 MiB (4 instances)
L2i: 4 MiB (8 instances)
L3: 32 MiB (8 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
root@rx2800-i2:~# free -m
               total        used        free      shared  buff/cache
Mem:           24218         138       23983           2          96
Swap:              0           0           0