Saturday, 11 April 2020

Overclock works in Windows but hangs in Linux.

So I have this old Intel Lynnfield system from 2010. It's overclocked and has been for many years. It works perfectly in Windows, even under high load for several days. Ubuntu and the other distributions I tried would freeze randomly, though. Even more strangely, it never happened under heavy load.

Maybe Linux uses some CPU instruction Windows doesn't and that particular instruction doesn't work with the overclock? I clocked the CPU down, but Linux would still hang randomly.
I upgraded the motherboard BIOS, I upgraded the GPU BIOS. I replaced the sound card. I checked SMART data for the drives. I ran memtest. I tested with a GPU from a different vendor.
But I couldn't find what was going on. Why was it freezing, so randomly, when there was almost no load on the system?

But then, for some reason, I checked the CPU frequency using /proc/cpuinfo. The CPU wasn't running at the speed I had specified in the BIOS, it was clocking down. That could certainly explain why the system froze in low-load scenarios. CPU scaling and overclocking seldom work well together, but I had disabled the C-states in the BIOS. I had disabled turbo in the BIOS. I knew for a fact that those settings were applied, since I had checked it numerous times in Windows.
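
For reference, a minimal sketch of that check. The function name cpu_mhz and its optional file argument are my own additions for illustration; it only assumes the "cpu MHz" lines that x86 kernels print in /proc/cpuinfo.

```shell
#!/bin/sh
# Print the current clock of every logical CPU, one value per line.
# The optional argument (hypothetical, for trying this on a saved copy)
# defaults to /proc/cpuinfo.
cpu_mhz() {
    awk -F': *' '/^cpu MHz/ { print $2 }' "${1:-/proc/cpuinfo}"
}

# Only run against the live file when it actually exists.
if [ -r /proc/cpuinfo ]; then
    cpu_mhz
fi
```

If the values sit well below the clock configured in the BIOS while the machine is idle, the CPU is scaling down.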

So I started searching about C-states and the Linux kernel. As far as I can tell, the kernel (more precisely its intel_idle driver) ignores whether the BIOS has turned off the C-states or not. Thus, the C-states I had disabled were enabled anyway. Confusing and frustrating.
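
A quick way to see this for yourself is to list the idle states the kernel exposes through sysfs. This is only a sketch: the function name list_cstates and its directory argument are made up for illustration, and it assumes the usual cpuidle layout under /sys/devices/system/cpu/cpu0/cpuidle.

```shell
#!/bin/sh
# List the idle states the kernel exposes for one CPU as "stateN: NAME".
# The optional argument (hypothetical) lets you point it at another CPU
# directory; it defaults to cpu0.
list_cstates() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        # Skip silently when the directory or the name file is missing.
        [ -r "$state/name" ] || continue
        printf '%s: %s\n' "${state##*/}" "$(cat "$state/name")"
    done
}

list_cstates
```

The driver that is currently in charge can be read from /sys/devices/system/cpu/cpuidle/current_driver. If states deeper than C1 show up here even though they are disabled in the BIOS, the kernel is not following the BIOS setting.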

I tried to fix this by disabling CPU scaling within Ubuntu, but for some reason it never worked as it was supposed to. Then I found that there is a kernel parameter that specifies the deepest C-state that is allowed:
intel_idle.max_cstate=1

So I opened up /etc/default/grub and added the parameter to GRUB_CMDLINE_LINUX_DEFAULT:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=1"
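
If you would rather preview the change than edit the file by hand, something like the following sketch could work. The function name with_kernel_param is hypothetical, and it assumes the GRUB_CMDLINE_LINUX_DEFAULT line is a single double-quoted string, as on a stock Ubuntu install. It only prints the result and never writes to the file.

```shell
#!/bin/sh
# Print a copy of a grub defaults file with one parameter appended to
# GRUB_CMDLINE_LINUX_DEFAULT. The file itself is never modified.
# $1 = kernel parameter, $2 = file (defaults to /etc/default/grub).
with_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$file"; then
        # Parameter already present: print the file unchanged.
        cat "$file"
    else
        # Append the parameter just before the closing quote.
        sed "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"/" "$file"
    fi
}
```

Running with_kernel_param intel_idle.max_cstate=1 | diff /etc/default/grub - shows exactly which line would change.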

Then I ran
sudo update-grub2
sudo reboot
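
After the reboot it is worth checking that the parameter really reached the kernel. A small sketch, assuming the running kernel's command line is available in /proc/cmdline; the helper name has_param and its file argument are my own, for illustration only.

```shell
#!/bin/sh
# Succeed if the given parameter appears as a whole word on the kernel
# command line. $1 = parameter, $2 = file (defaults to /proc/cmdline).
has_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

if has_param intel_idle.max_cstate=1; then
    echo "intel_idle.max_cstate=1 is active"
else
    echo "intel_idle.max_cstate=1 is NOT on the kernel command line"
fi
```

The value the driver ended up with can also be read from /sys/module/intel_idle/parameters/max_cstate, as long as the intel_idle driver is in use.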

Finally, I confirmed that the CPU was no longer clocking down using
cat /proc/cpuinfo

If I understand it correctly (haven't verified), setting max_cstate to 0 disables the intel_idle driver altogether, so the kernel falls back to the ACPI idle driver, which should follow the BIOS/UEFI settings instead.

Sources:
Info from Dell
Info from IBM