Re: rx2660 + debian

To: Pedro Miguel Justo <pmsjt@texair.net>
Cc: Anton Borisov <anton.borisov@gmail.com>, John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>, debian-ia64 <debian-ia64@lists.debian.org>, Sergei Trofimovich <slyich@gmail.com>
Subject: Re: rx2660 + debian
From: Frank Scheiner <frank.scheiner@web.de>
Date: Tue, 26 Apr 2022 18:04:31 +0200
Message-id: <3937eb1b-b88c-5db8-adf0-b7b819c8afdb@web.de>
In-reply-to: <9161309F-1243-44C9-8B17-696B873017E9@texair.net>

Hi Pedro,

On 26.04.22 17:01, Pedro Miguel Justo wrote:

On 2022/Apr/26, at 06:34, Frank Scheiner <frank.scheiner@web.de> wrote:
@Anton:
So maybe best to give `hardened_usercopy=off` a try on your rx2660, too.
From my testing on rx2660 and rx2620 this seems to unbreak the kernel
boot and maybe also makes it less likely to hit the problem post boot. I
don't know why Adrian's rx2660 seems to be unaffected by this, though.


I did. That is why I ended up compiling 5.17 with the entire thing turned off. With 5.17, on my rx2660 Montvale with 8 cores the machine can’t get past early boot even with hardened_usercopy=off.

Those 'warnings' are actually processes being killed, and they depend on the direction in which the bad copy was happening.


Thanks for the clarification.


If you look at my prior responses, with the 4.19 kernel I was also running along fine for hours and, after some time building the kernel (a benchmark in itself), it would start producing these warnings and would not allow compilation to continue any further. I would reboot the machine and that gave me a few more hours. When I tried 'hardened_usercopy=off' on the 4.19 kernel, that worked. I no longer got these process terminations after a few hours and the machine was able to build the entire kernel from beginning to end.

So, 4.19 and 5.17 are different in many ways (symptom-wise):
- I never got a bugcheck (panic) level failure on 4.19. They were all process-termination level.
- On 4.19 these took quite some time to show up. They seemed to depend on the number of processes created in the past and were mitigated by a reboot. On 5.17 it was very aggressive, showing up early in boot, even on system threads like the crypto boot self test. Disabling the crypto boot self test made it go farther, but not by much. If the error is detected on a system thread, there is no process to terminate: it is game over.
- hardened_usercopy=off was honored by 4.19 but ignored by 5.17


Well, it seems to make a difference for my rx2660, maybe because of
Montecitos instead of Montvales, I don't know. Or it depends on the
available memory (i.e. maybe it happens more/less often with less/more
memory available). Mine has 32 GiB in total.

I don’t exclude the possibility of human error in conducting all these experiments (some of the process is error prone), but I did run these experiments more than just a few times, so it would have to be a heck of a coincidence to end up with consistent results.


Sure, my test results are also more anecdotal as it takes so much time
to boot and run things (`openssl speed -elapsed` takes around 23 mins).

I'll now look at my other Itanium gear, rx2800 i2 first,


First testing with 5.17.0-1-mckinley on my rx2800 i2 interestingly shows
no issues with memcopy at all, neither during kernel boot nor post boot.
My kernel cmdline is as follows:

```
root@rx2800-i2:~# cat /proc/cmdline
BOOT_IMAGE=net0:/AC10027B.vmlinuz  root=/dev/nfs ip=:::::enp8s0f0:dhcp modprobe.blacklist=hpsa,radeon
```
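To compare boot parameters across machines (for example, to confirm whether `hardened_usercopy=off` was actually passed), the command line can be split into key/value pairs. The following is only a minimal Python sketch: the helper name `parse_cmdline` is made up for illustration, and it uses plain whitespace splitting, so quoted values containing spaces are not handled.

```python
from __future__ import annotations


def parse_cmdline(text: str) -> dict[str, str | None]:
    """Split a kernel command line into {parameter: value}.

    Parameters without '=' map to None. Quoted values that contain
    spaces are NOT handled; this is a deliberate simplification.
    """
    params: dict[str, str | None] = {}
    for token in text.split():
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# The command line quoted above, as a single string.
example = (
    "BOOT_IMAGE=net0:/AC10027B.vmlinuz  root=/dev/nfs "
    "ip=:::::enp8s0f0:dhcp modprobe.blacklist=hpsa,radeon"
)
params = parse_cmdline(example)

# The parameter is absent here, so the lookup yields None.
print("hardened_usercopy =", params.get("hardened_usercopy"))
```

On a running system the input would come from reading `/proc/cmdline` instead of the hard-coded string.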

It could well be that the Tukwilas behave differently in that case. After
all, they have their memory controller integrated in the processor and
not in the chipset like the older Montecitos or Montvales.

For reference:

firmware info:
```
[rx2800-i2-mp-ilo] CM:hpiLO-> sysrev


SYSREV

 Revisions        Active    Pending
 -------------------------------------
 iLO FW         : 01.54.03
 System FW      : 01.93
 MHW FPGA       : 02.02
 Power Mon FW   : 02.09
 PRS HW         : 02.06
 IOH HW         : 02.02
 Power Supply 1 : 02.01
 Power Supply 2 : 02.01
```

hardware info:
```
root@rx2800-i2:~# uname -a
Linux rx2800-i2 5.17.0-1-mckinley #1 SMP Debian 5.17.3-1 (2022-04-18) ia64 GNU/Linux

root@rx2800-i2:~# lscpu
Architecture:           ia64
  CPU op-mode(s):       64-bit
  Byte Order:           Little Endian
CPU(s):                 8
  On-line CPU(s) list:  0-7
Vendor ID:              GenuineIntel
  BIOS Vendor ID:       Intel(R)  Itanium(R)  Processor 9320
  Model name:           Intel(R)  Itanium(R)  Processor 9320
    BIOS Model name:    Intel(R)  Itanium(R)  Processor 9320
    CPU family:         32
    Model:              4
    Thread(s) per core: 2
    Core(s) per socket: 4
    Socket(s):          1
    BogoMIPS:           2920.44
    Flags:              branchlong, 16-byte atomic ops, 0x8
Caches (sum of all):
  L1d:                  64 KiB (4 instances)
  L1i:                  64 KiB (4 instances)
  L2d:                  1 MiB (4 instances)
  L2i:                  4 MiB (8 instances)
  L3:                   32 MiB (8 instances)
NUMA:
  NUMA node(s):         1
  NUMA node0 CPU(s):    0-7

root@rx2800-i2:~# free -m
               total        used        free      shared  buff/cache   available
Mem:           24218         138       23983           2          96       23871
Swap:              0           0           0
```
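Since installed memory is one of the suspected factors, it may help to report it in the same unit on every machine. A small Python sketch, assuming the usual `/proc/meminfo` layout where `MemTotal` is given in kB (KiB); the function name `mem_total_gib` and the sample value are made up for illustration and are not taken from any of the machines discussed here.

```python
from __future__ import annotations


def mem_total_gib(meminfo_text: str) -> float:
    """Return MemTotal from /proc/meminfo-style text, converted to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Expected shape: "MemTotal:       <number> kB"
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("no MemTotal line found")


# Made-up sample input; on a real system, read /proc/meminfo instead.
sample = "MemTotal:       33554432 kB\nMemFree:        1234567 kB\n"
print(f"{mem_total_gib(sample):.1f} GiB")
```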

Cheers,
Frank
