
Re: rx2660 + debian

> On 2022/Apr/25, at 14:09, Sergei Trofimovich <slyich@gmail.com> wrote:
> On Mon, 25 Apr 2022 15:07:58 +0000
> Pedro Miguel Justo <pmsjt@texair.net> wrote:
>>> On 2022/Apr/25, at 01:22, Pedro Miguel Justo <pmsjt@texair.net> wrote:
>>>> On 2022/Apr/25, at 01:14, Frank Scheiner <frank.scheiner@web.de> wrote:
>>>> Hi guys,
>>>> On 25.04.22 10:09, John Paul Adrian Glaubitz wrote:  
>>>>>> From what I can understand by the information in the bugcheck, this is somewhat related to a violation
>>>>>> in a parameter copy from user to kernel during some boot-time crypto self-test. Does that sound right?
>>>>>> If that is the case, how would this be related to FW?  
>>>>> I'm not claiming that it must be related to the firmware, I'm just saying that I don't see this problem
>>>>> on my RX2660 at all and I have even reinstalled it recently with one of the latest firmware images
>>>>> without having to pass any parameter to the command line.  
>>>> A difference between Adrian's rx2660 and Pedro's rx2660 is Montecito
>>>> left and Montvale right.
>>>> But could still be multiple other reasons we haven't looked at yet in
>>>> detail:
>>>> * amount of memory installed
>>>> * SMT enabled or not
>>>> * number of processor modules installed
>>>> It might be possible for me to check on my rx2660s (one with Montvale
>>>> and one with Montecito(s)) tomorrow. I will then also look at my other
>>>> Itanium gear and gather relevant information.
>>> Yes, this sounds more likely to me too.
>>> The crypto self-tests seem to be an innocent bystander here. I tried booting the most recent kernel with the option “cryptomgr.notests” and it went much farther. Alas it still failed with another buffer copy validation for a different caller altogether:
>>> [    3.836466]  [<a000000101353690>] usercopy_abort+0x120/0x130
>>> [    3.836466]                                 sp=e0000001000cfdf0 bsp=e0000001000c9388
>>> [    3.836466]  [<a0000001004c5660>] __check_object_size+0x3c0/0x420
>>> [    3.836466]                                 sp=e0000001000cfe00 bsp=e0000001000c9350
>>> [    3.836466]  [<a000000100570030>] sys_getcwd+0x250/0x420
>>> [    3.836466]                                 sp=e0000001000cfe00 bsp=e0000001000c92c8
>>> [    3.836466]  [<a00000010000c860>] ia64_ret_from_syscall+0x0/0x20
>>> [    3.836466]                                 sp=e0000001000cfe30 bsp=e0000001000c92c8
>>> [    3.836466]  [<a000000000040720>] ia64_ivt+0xffffffff00040720/0x400
>>> [    3.836466]                                 sp=e0000001000d0000 bsp=e0000001000c92c8
>>> This suggests the bug might be in the logic validating these buffers against the allocations (heap, span, etc).
>>> I don’t know why hardened_usercopy=off is not being observed by the kernel. As a workaround I am compiling myself a new kernel with CONFIG_HARDENED_USERCOPY disabled at the source.
>> Even with kernel "Linux debian 4.19.0-5-mckinley #1 SMP Debian 4.19.37-5 (2019-06-19) ia64 GNU/Linux",
>> things are still not 100%. A few hours into building the kernel, it started crashing, also with usercopy validations but, this time, the other way around. Because it was the other way around, it led to process termination instead of a full-blown bugcheck. This could be related or not. It could very well be a different bug that happens to manifest itself around the same validation.
>>  CC [M]  drivers/net/wireless/realtek/rtw88/rtw8822be.o
>>  LD [M]  drivers/net/wireless/realtek/rtw88/rtw88_8822be.o
>>  CC [M]  drivers/net/wireless/realtek/rtw88/rtw8822c.o
>> Segmentation fault
>> make[5]: *** [scripts/Makefile.build:293: drivers/net/wireless/realtek/rtw88/rtw8822c.o] Error 139
>> make[5]: *** Deleting file 'drivers/net/wireless/realtek/rtw88/rtw8822c.o'
>> make[4]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek/rtw88] Error 2
>> make[3]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek] Error 2
>> make[2]: *** [scripts/Makefile.build:555: drivers/net/wireless] Error 2
>> make[1]: *** [scripts/Makefile.build:555: drivers/net] Error 2
>> make: *** [Makefile:1855: drivers] Error 2
>> pmsjt@debian:~/linux-source-5.17$ make
>> Message from syslogd@debian at Apr 25 07:58:08 ...
>> kernel:[23420.984012] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1916912, size 8)!
>> Message from syslogd@debian at Apr 25 07:58:08 ...
>> kernel:[23421.268009] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1818608, size 8)!
>>  HOSTCC  scripts/sign-file
>>  CALL    scripts/checksyscalls.sh
>> <stdin>:1517:2: warning: #warning syscall clone3 not implemented [-Wcpp]
>>  CALL    scripts/atomic/check-atomics.sh
>>  CHK     include/generated/compile.h
>> make[2]: *** [scripts/Makefile.build:294: arch/ia64/kernel/signal.o] Segmentation fault
>> Message from syslogd@debian at Apr 25 07:58:11 ...
>> kernel:[23423.626254] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1933296, size 8)!
>> make[1]: *** [scripts/Makefile.build:555: arch/ia64/kernel] Error 2
>> make: *** [Makefile:1855: arch/ia64] Error 2

Hi Sergei

> In my understanding hardened_usercopy=on is completely broken on ia64
> today. It can't run any userspace. Even init process would not survive
> machine boot. At least that's what I experienced on rx3600.
> Thus I think if your system survives that much time I would guess
> that you have hardened_usercopy=off in full effect at least at boot.

I want to make sure there is no confusion here. My system only ‘survives’ this long when I am using the 4.19 kernel (even when hardened_usercopy=off is not present). With kernels more recent than that, the system bugchecks very early in boot even if hardened_usercopy=off is present.
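
One cheap thing to rule out is the boot loader dropping the option before the kernel ever sees it. The following is a minimal Python sketch, not a diagnosis: the helper name `hardened_usercopy_arg` is made up for illustration, and it only reports what was passed on the command line (as exposed in /proc/cmdline), not whether the kernel honored it.

```python
from pathlib import Path
from typing import Optional


def hardened_usercopy_arg(cmdline: str) -> Optional[str]:
    """Return the value of the last hardened_usercopy= token, or None if absent.

    Hypothetical helper: it only inspects the command-line string and says
    nothing about the kernel's internal state.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "hardened_usercopy" and sep:
            value = val  # keep scanning so a later token overrides an earlier one
    return value


if __name__ == "__main__" and Path("/proc/cmdline").exists():
    # /proc/cmdline holds the command line the running kernel was booted with.
    setting = hardened_usercopy_arg(Path("/proc/cmdline").read_text())
    print("hardened_usercopy on command line:", setting)
```

If this prints `off` on a kernel that still bugchecks in usercopy_abort(), the option is reaching the kernel and the problem lies in how it is (or is not) being applied.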

> I would speculate it's some kind of memory corruption around
> 'bypass_usercopy_checks' key.
> Worth adding a few printk()s to mm/usercopy.c into 'usercopy_abort()'
> and into 'set_hardened_usercopy()' just to make sure 'bypass_usercopy_checks'
> has expected 'true' setting at boot time and at crash time.

Right, we definitely need more context about the root cause and characteristics of the bug. When the failure happens, is the (pointer, range) of the copy really out of whack, or is the validation code misreading the boundaries and failing overzealously?
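
To make the two possibilities concrete, here is a small user-space model of the kind of range test behind the "linear kernel text" message. It is a sketch under assumptions: the function names follow a reading of mm/usercopy.c rather than quoting it, and TEXT_LOW / TEXT_HIGH are hypothetical placeholder bounds, not values taken from this machine.

```python
# Hypothetical bounds of the kernel text mapping (placeholders for illustration).
TEXT_LOW = 0xA000000100000000
TEXT_HIGH = TEXT_LOW + 0x1400000


def overlaps(ptr: int, n: int, low: int, high: int) -> bool:
    """True if the half-open range [ptr, ptr + n) intersects [low, high)."""
    check_low = ptr
    check_high = ptr + n
    # No overlap only if the copy lies entirely above or entirely below.
    return not (check_low >= high or check_high <= low)


def classify(ptr: int, n: int, text_low: int, text_high: int):
    """Model of the decision: reject copies that touch the text range.

    Returns ("rejected", offset) with offset = ptr - text_low, which is the
    assumed meaning of the "offset" in the log message, else ("allowed", None).
    """
    if overlaps(ptr, n, text_low, text_high):
        return ("rejected", ptr - text_low)
    return ("allowed", None)


# A pointer 1916912 bytes into the (hypothetical) text range, size 8, is
# rejected with the same offset/size pair that appears in the first log line.
print(classify(TEXT_LOW + 1916912, 8, TEXT_LOW, TEXT_HIGH))
```

In this model a rejection is only as trustworthy as the bounds: a sane pointer is still rejected if text_low or text_high is computed wrongly for the architecture. Printing the real bounds next to the faulting pointer, along the lines of the printk()s suggested above, would show which of the two cases applies.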

> -- 
>  Sergei
