
Re: Recurrent alerts "Package temperature above threshold, cpu clock throttled"



Hi L0f4r0,

Ben (or other team members), if you're reading this and are short on
time, would you please skip to the question at the bottom and reply to
it?

<l0f4r0@tuta.io> writes:

> At least, Internet resources indicate that it's safer this way with
> HT deactivated (regarding MDS attacks) but I don't know (and really
> not anyone as it really depends on current CPU tasks) how my
> performances are impacted now...  In your case, it seems to be a
> benefit but surely it's not always the case...

You're absolutely right.  My approach is to sacrifice peak performance
for consistency, but other people may prefer (or require) peak
performance.

> Maybe I can just continue like this and restore HT if this is
> impossible to live with ;)
>

Anecdotally, I found HT makes interactivity under high load slightly
better.  Someone who knows about things like CPU context switches and
cache misses might be able to say why.  As you noted, though, it comes
with an increased risk of MDS attacks.  Too bad disabling it didn't help
with heat.

>> Have you tried disabling CPU freq boost?  When the ambient
>> temperature is above 27°C my X220 and X230 need to have boost
>> disabled to avoid overheating/throttling.
>>
> I've seen at least 2 ways to deactivate turbo boost:
> 1) echo "1" to /sys/devices/system/cpu/intel_pstate/no_turbo
> Visibly, as sysctl only works with /proc/sys (and not /sys), this
> needs to be set permanently via a systemd service. What do you think
> about this procedure:
> https://blog.christophersmart.com/2017/02/08/manage-intel-turbo-boost-with-systemd/?
> 2) modify MSR registers via wrmsr (https://askubuntu.com/a/619881). I don't know if there is persistence here...
>
> Do you use any of these? Something else?
>

I used the systemd method on my sister's old Macbook.  It seems to help
with heat and fan noise, and everything is still consistently smooth, so
we count it as a win.
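
For reference, the sysfs method boils down to writing a single
character to that file.  Here is a minimal Python sketch of it,
assuming the intel_pstate driver is active (the file does not exist
otherwise) and that it runs as root.  The function names are my own,
for illustration only:

```python
from pathlib import Path

# Path quoted earlier in this thread; present only when intel_pstate is active.
NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")


def turbo_disabled(path: Path = NO_TURBO) -> bool:
    """Return True if the file contains "1", i.e. turbo boost is disabled."""
    return path.read_text().strip() == "1"


def set_turbo_disabled(disabled: bool, path: Path = NO_TURBO) -> None:
    """Write "1" to disable turbo boost or "0" to re-enable it (needs root)."""
    path.write_text("1" if disabled else "0")
```

The value does not survive a reboot, which is why something like a
systemd service is needed to reapply it at boot.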

> PS: My CPU is Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz.

Oooh.  Earlier you wrote that this is an X390, right?  The problem of a
powerful CPU in a thin, tiny case with a lightweight cooling solution
may apply.

BTW, have you checked for Lenovo-provided firmware/BIOS/EC updates?
I've seen temperature and fan profile-related fixes in a couple of them
(for other Lenovo models).

> Actually my main questions are: 
>
> i) Are those temperature warnings legitimate?  I mean maybe those
> warnings have been wrongly triggered because probes are not accurate?
> I'm not speaking about HW failure (my laptop is brand new) but maybe I
> just don't have the right driver somewhere...  Indeed, it would be too
> bad to decrease all my performances with HT/turbo-boost disabled
> (security apart) whereas the journalctl warnings are wrong
> initially... ;)
>

Honestly, I'm not sure.  This seems plausible.  In your other email you wrote:

> I've just upgraded my kernel 4.19.0-6-amd64 to backports longterm
> 5.4.0-0.bpo.2-amd64 because I had issues with light-docker.  The
> aforementioned journalctl entries have been reassigned from crit/2 to
> warning/4!  Maybe it's not so serious? ^^

This severity change, plus other reports I've read of newer laptops
overheating with recent kernels, makes me wonder if there might be some
churn in the Intel P-State driver.  On Reddit and Stack Exchange some
people have reported success disabling it and using the older
acpi-cpufreq driver instead.  From what I've read the P-State driver
bypasses the ACPI hints, so if I had to guess, the reported success of
this method would depend on not-buggy ACPI firmware that provides
hardware-specific hints that work better than the Intel solution for the
general case.  That said, the new solution is supposed to be better in
every way (read on).
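
If you want to see which driver is currently in charge before
experimenting, the kernel exposes it in sysfs.  A small sketch,
assuming the usual cpufreq sysfs layout (the helper name is
hypothetical):

```python
from pathlib import Path
from typing import Optional

# Standard cpufreq sysfs attribute for the first CPU (assumed present).
SCALING_DRIVER = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver")


def active_cpufreq_driver(path: Path = SCALING_DRIVER) -> Optional[str]:
    """Return the scaling driver name, or None if the file is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

It typically reports a name such as intel_pstate or acpi-cpufreq.  As
far as I know, the switch itself is made by booting with
intel_pstate=disable on the kernel command line.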

> PS: I have installed intel-microcode 3.20191115.2~deb10u1. Still the
> same issues.  I don't know if it's an issue but microcode module is
> blacklisted in /etc/modprobe.d/intel-microcode-blacklist.conf (it
> seems to be a precaution regarding unsafe updates).
>

OT: IIRC this is to prevent updates at an unsafe time, and to use the
newer early microcode loading method rather than the older method
(which runs later in the boot process).

> ii) Is it risky to do nothing about these temperature warnings?  I
> have no idea what EC means (Embedded Controller?) but you said EC
> eventually shuts down the laptop if need be. I presume it's not really
> beneficial from the user point of view as the current tasks will be
> shut down and some work/data might be lost during the process.
>

Yes, "EC" means embedded controller :-)  Intel hardware is excellent
about shutting itself down before damage occurs.
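
To see how hot things really get between warnings, the kernel's thermal
zones can be read directly.  A rough sketch, assuming the standard
/sys/class/thermal layout, where each zone's temp file is in
millidegrees Celsius (the helper name is mine):

```python
from pathlib import Path
from typing import List, Tuple

# Standard location of kernel thermal zones (assumed layout).
THERMAL_ROOT = Path("/sys/class/thermal")


def read_zones(root: Path = THERMAL_ROOT) -> List[Tuple[str, float]]:
    """Return (zone type, temperature in degrees Celsius) per thermal zone."""
    zones = []
    for zone in sorted(root.glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            milli_c = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
        zones.append((kind, milli_c / 1000.0))
    return zones
```

Comparing these readings with the timestamps of the journalctl entries
might help tell whether the warnings match what the sensors report.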

So anyways, given that you have a new ultrabook with a powerful CPU, I
think thermald is probably the best solution to try.  You can read about
how it combines many other methods and aims to solve the problems
inherent to ultrabooks here (01.org is the Intel open source project):

  https://01.org/linux-thermal-daemon/documentation/introduction-thermal-daemon

Ben (and team), is there any reason why thermald isn't part of
task-laptop?  If it solves L0f4r0's issue it might be worth adding it,
and if it's true that laptops...these new ultrabooks...are being
designed to require it (e.g. because of an insufficient cooling
solution) then it would make sense to enable it for the general case
that it purportedly solves.


Best,
Nicholas
