
Re: rx2660 + debian




> On 2022/Apr/26, at 06:34, Frank Scheiner <frank.scheiner@web.de> wrote:
> 
> Hi Pedro, Anton, all
> 
> so I did some first testing on my Montecito driven rx2660:
> 
> firmware info:
> ```
> [rx2660-mp-ilo] MP:CM> sysrev
> 
> 
> SYSREV
> 
> Current firmware revisions
> 
> MP FW     : F.02.17
> BMC FW    : 05.23
> EFI FW    : ROM A 07.12, ROM B 07.12
> System FW : ROM A 04.04, ROM B 04.04, Boot ROM A
> UCIO FW   : 03.0b
> PRS FW    : 00.08 UpSeqRev: 02, DownSeqRev: 01
> ```
> 
> hardware info:
> ```
> root@rx2660:~# uname -a
> Linux rx2660 4.19.0-5-mckinley #1 SMP Debian 4.19.37-5 (2019-06-19) ia64
> GNU/Linux
> 
> root@rx2660:~# lscpu
> Architecture:        ia64
> CPU op-mode(s):      64-bit
> Byte Order:          Little Endian
> CPU(s):              8
> On-line CPU(s) list: 0-7
> Thread(s) per core:  2
> Core(s) per socket:  2
> Socket(s):           2
> NUMA node(s):        1
> Vendor ID:           GenuineIntel
> CPU family:          32
> Model:               7
> Model name:          Dual-Core Intel(R) Itanium(R) Processor 9050
> CPU MHz:             1594.639
> BogoMIPS:            3182.59
> L1d cache:           16K
> L1i cache:           16K
> L2d cache:           256K
> L2i cache:           1024K
> L3 cache:            12288K
> NUMA node0 CPU(s):   0-7
> Flags:               branchlong, 16-byte atomic ops
> 
> ## 8 CPUs (or better hardware threads) => SMT enabled!
> 
> root@rx2660:~# free -m
>              total        used        free      shared  buff/cache   available
> Mem:          32574         394       31054          17        1125       31869
> Swap:             0           0           0
> ```
> 
> ...and after successfully upgrading my root FS (last touched in 2019!)
> with 4.19.0-5-mckinley w/o a problem, on first boot with
> 5.17.0-1-mckinley I also get those usercopy related problem(s), despite
> having two Montecitos installed:
> 
> ```
>  Booting `Debian GNU/Linux Sid (diskless)'
> 
> Loading Linux kernel ...
> Loading initial ramdisk ...
> [    0.000000] Linux version 5.17.0-1-mckinley
> (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.2.0-20) 11.2.0, GNU
> ld (GNU Binutils for Debian) 2.38) #1 SMP Debian 5.17.3-1 (2022-04-18)
> [    0.000000] efi: EFI v2.00 by HP
> [    0.000000] efi: SALsystab=0x3ee7a000 ACPI 2.0=0x3fde6000
> ESI=0x3ee7b000 SMBIOS=0x3ee7c000 HCDP=0x3fde4000
> [    0.000000] PCDP: v3 at 0x3fde4000
> [...]
> [    1.199313] zbud: loaded
> [    1.199313] integrity: Platform Keyring initialized
> [    1.199313] Key type asymmetric registered
> [    1.199313] Asymmetric key parser 'x509' registered
> [    1.927433] Freeing initrd memory: 26688kB freed
> [    1.930079] usercopy: Kernel memory overwrite attempt detected to
> linear kernel text (offset 450555, size 4)!
> [    1.930079] kernel BUG at mm/usercopy.c:100!
> [    1.930079] kworker/u16:1[71]: bugcheck! 0 [1]
> [    1.930079] Modules linked in:
> [    1.930079]
> [    1.930079] CPU: 3 PID: 71 Comm: kworker/u16:1 Not tainted
> 5.17.0-1-mckinley #1  Debian 5.17.3-1
> [    1.930079] Hardware name: hp server rx2660                   , BIOS
> 04.04                                                            07/15/2008
> [    1.930079] psr : 00001010084a6010 ifs : 8000000000000410 ip  :
> [<a000000101353690>]    Not tainted (5.17.0-1-mckinley Debian 5.17.3-1)
> [    1.930079] ip is at usercopy_abort+0x120/0x130
> [...]
> ```
> 
> It wasn't dead in the water there, but continued the kernel boot for a
> while until it panicked.
> 
> Trying the 5.16.0-6-mckinley kernel on this rx2660 shows problems
> similar to the above, though a little later in the kernel boot process:
> 
> ```
> [    0.000000] Linux version 5.16.0-6-mckinley
> (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.2.0-19) 11.2.0, GNU
> ld (GNU Binutils for Debian) 2.38) #1 SMP Debian 5.16.18-1 (2022-03-29)
> [    0.000000] efi: EFI v2.00 by HP
> [    0.000000] efi: SALsystab=0x3ee7a000 ACPI 2.0=0x3fde6000
> ESI=0x3ee7b000 SMBIOS=0x3ee7c000 HCDP=0x3fde4000
> [    0.000000] PCDP: v3 at 0x3fde4000
> [...]
> [    1.213851] zbud: loaded
> [    1.217851] integrity: Platform Keyring initialized
> [    1.217851] Key type asymmetric registered
> [    1.217851] Asymmetric key parser 'x509' registered
> [    1.217851] Block layer SCSI generic (bsg) driver version 0.4 loaded
> (major 250)
> [    1.217859] io scheduler mq-deadline registered
> [    1.222505] input: Power Button as
> /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> [    1.222505] ACPI: button: Power Button [PWRF]
> [    1.222505] input: Sleep Button as
> /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> [    1.222505] ACPI: button: Sleep Button [SLPF]
> [...]
> [    2.025839] rtc-efi rtc-efi.0: setting system clock to
> 2022-04-26T11:59:48 UTC (1650974388)
> [    2.026516] ledtrig-cpu: registered to indicate activity on CPUs
> [    2.026516] NET: Registered PF_INET6 protocol family
> [    2.030517] usercopy: Kernel memory overwrite attempt detected to
> linear kernel text (offset 450555, size 4)!
> [    2.030517] kernel BUG at mm/usercopy.c:99!
> [    2.030517] kworker/u16:0[81]: bugcheck! 0 [1]
> [    2.030517] Modules linked in:
> [    2.030517]
> [    2.030517] CPU: 2 PID: 81 Comm: kworker/u16:0 Not tainted
> 5.16.0-6-mckinley #1  Debian 5.16.18-1
> [    2.031443] Hardware name: hp server rx2660                   , BIOS
> 04.04                                                            07/15/2008
> [    2.031443] psr : 00001010084a6010 ifs : 8000000000000410 ip  :
> [<a000000101336b70>]    Not tainted (5.16.0-6-mckinley Debian 5.16.18-1)
> [    2.031443] ip is at usercopy_abort+0x120/0x130
> [...]
> ```
> 
> With `hardened_usercopy=off` added to the kernel commandline I get
> 5.16.0-6-mckinley to boot the rx2660 to the login prompt, though I still
> see:
> 
> ```
> [...]
> [    1.915245] Freeing initrd memory: 27200kB freed
> [    1.917530] usercopy: Kernel memory overwrite attempt detected to
> linear kernel text (offset 450555, size 4)!
> [    1.917739] kernel BUG at mm/usercopy.c:99!
> [    1.917739] kworker/u16:1[82]: bugcheck! 0 [1]
> [    1.917739] Modules linked in:
> 
> [    1.917739] CPU: 7 PID: 82 Comm: kworker/u16:1 Not tainted
> 5.16.0-6-mckinley #1  Debian 5.16.18-1
> [    1.917739] Hardware name: hp server rx2660                   , BIOS
> 04.04                                                            07/15/2008
> [    1.921739] psr : 00001010084a6010 ifs : 8000000000000410 ip  :
> [<a000000101336b70>]    Not tainted (5.16.0-6-mckinley Debian 5.16.18-1)
> [    1.921739] ip is at usercopy_abort+0x120/0x130
> [...]
> ```
> 
> ...in the boot process. Running some benchmarks (7z, openssl) didn't
> print any errors to the system console.
> 
> It's similar with 5.17.0-1-mckinley, though with many more error
> messages during kernel boot, but it succeeds. Again, during benchmark
> runs no additional errors were logged to the system console.
> 
> So maybe `hardened_usercopy=off` works more like changing "errors" to
> "warnings" or so.
> 
> ****
> 
> BTW, checking the bootloader configuration of my rx2620 I noticed
> that it has been using `hardened_usercopy=off` since April 2019, which
> would explain why booting 5.16.0-6-mckinley and benchmarking it in early
> April 2022 worked well. :-/
> 
> Until I found that out, I suspected a difference between the zx1 (rx2620)
> and zx2 (rx2660) chipsets with regard to these memcopy issues, but the
> chipset could be unrelated then.
> 
> @Anton:
> So maybe best to give `hardened_usercopy=off` a try on your rx2660, too.
> From my testing on rx2660 and rx2620 this seems to unbreak the kernel
> boot and maybe also makes it less likely to hit the problem post boot. I
> don't know why Adrian's rx2660 seems to be unaffected by this, though.
> 

I did. That is why I ended up compiling 5.17 with the entire thing turned off. With 5.17, on my rx2660 Montvale with 8 cores the machine can’t get past early boot even with hardened_usercopy=off.

Those 'warnings' are actually processes being killed, and they depend on the direction in which the bad copy was happening.
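
To compare these reports across kernels, the direction can be read straight from the message text: the kernel prints "overwrite ... detected to" for a copy into kernel memory and "exposure ... detected from" for a copy out of it. What follows is only a minimal sketch, assuming raw dmesg text as input (not mail-quoted text with `>` markers); the function name `parse_usercopy_reports` is made up for illustration.

```python
import re

# Matches reports such as:
#   usercopy: Kernel memory overwrite attempt detected to
#   linear kernel text (offset 450555, size 4)!
# \s+ is used between words so that a report wrapped over two lines still matches.
REPORT = re.compile(
    r"usercopy:\s+Kernel\s+memory\s+(?P<kind>exposure|overwrite)\s+attempt\s+"
    r"detected\s+(?:to|from)\s+(?P<region>.+?)\s+"
    r"\(offset\s+(?P<offset>\d+),\s+size\s+(?P<size>\d+)\)!",
    re.DOTALL,
)


def parse_usercopy_reports(log_text):
    """Return one dict per usercopy report found in log_text (hypothetical helper)."""
    reports = []
    for match in REPORT.finditer(log_text):
        reports.append({
            # "overwrite": copy into kernel memory; "exposure": copy out of it
            "kind": match.group("kind"),
            # collapse any line wrap inside the region description
            "region": " ".join(match.group("region").split()),
            "offset": int(match.group("offset")),
            "size": int(match.group("size")),
        })
    return reports
```

Feeding it the saved console output of two boots makes it easy to see whether the same region, offset and size show up each time, as they do in the logs quoted above.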

If you look at my prior responses, with the 4.19 kernel I was also running along fine for hours and, after some time building the kernel (a benchmark in itself), it would start producing these warnings and would not allow compilation to continue any further. I would reboot the machine and that gave me a few more hours. When I tried 'hardened_usercopy=off' on the 4.19 kernel, that worked. I no longer got these process terminations after a few hours and the machine was able to build the entire kernel from beginning to end.

So, 4.19 and 5.17 are different in many ways (symptom-wise):
- I never got a bugcheck (panic) level failure on 4.19. They were all process termination level.
- On 4.19 these took quite some time to show up. They seemed to depend on the number of processes created in the past and were mitigated by a reboot. On 5.17 it was very aggressive, showing up early in boot, even on system threads like the crypto boot self test. Disabling the crypto boot self test made it go farther, but not much. If the error is detected on a system thread, there is no process to terminate: it is game over.
- hardened_usercopy=off was observed by 4.19 but ignored by 5.17
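
One way to rule out a bootloader mix-up for the last point is to check what the running kernel actually received on its command line, which Linux exposes in /proc/cmdline. This is a rough sketch only; the helper name `hardened_usercopy_setting` is hypothetical, and it assumes that when the parameter appears more than once the last occurrence is the one that matters.

```python
def hardened_usercopy_setting(cmdline):
    """Return the value given for hardened_usercopy= in cmdline, or None if absent."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "hardened_usercopy" and sep:
            # Assumption: a later occurrence overrides an earlier one.
            value = val
    return value


def read_boot_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On the booted machine, `hardened_usercopy_setting(read_boot_cmdline())` returning `None` would mean the option never reached the kernel, which is a different situation from the kernel seeing it and not acting on it.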

I don’t exclude the possibility of human error in conducting all these experiments (some of the process is error prone), but I did run these experiments more than just a few times, so it would have to be a heck of a coincidence to end up with consistent results.

> I'll now look at my other Itanium gear, rx2800 i2 first,
> 
> Cheers,
> Frank

