Hi everyone,

I have been experiencing some weird problems lately, starting with the OS becoming unresponsive “randomly”. After reinstalling multiple times (different filesystems, tried XFS and Btrfs; different NVMe drives in different M.2 slots; same results), I have narrowed it down to heavy I/O operations on the NVMe drive. Most of the time I can’t even pull up dmesg or force a shutdown, as zsh returns an input/output error for every command. A couple of times I was lucky enough for the system to stay somewhat responsive, so that I could pull up dmesg.

It shows a “controller is down; will reset” message, which I’ve seen on the ArchWiki for some older Kingston and Samsung NVMe drives, along with kernel parameters to try (they didn’t help much; they pretty much disable ASPM on PCIe).
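For reference, the ArchWiki workaround in question is kernel parameters along these lines (the surrounding options are placeholders; only the two NVMe/PCIe parameters are the relevant part):

```
# /etc/default/grub (or your bootloader's kernel command line)
# nvme_core.default_ps_max_latency_us=0 disables NVMe APST power-state
# transitions; pcie_aspm=off disables PCIe Active State Power Management.
GRUB_CMDLINE_LINUX_DEFAULT="quiet nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
```

After editing, regenerate the bootloader config (e.g. grub-mkconfig) and reboot for the parameters to take effect.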

What did help a bit was reverting a recent BIOS upgrade on my MSI Z490 Tomahawk: the system no longer crashes immediately under heavy I/O but instead remounts the filesystem read-only, so the issue still persists. I have additionally run MemTest86 for 8 passes with no issues there.

I have also tried running the LTS kernel, but this didn’t help. The strange thing is that this error does not happen on Windows 11.

Has anyone experienced this before, and can give some pointers on what to try next? I’m at my wits’ end here.

Here are hardware and software details:

  • Arch with the latest kernel (6.7.4, I believe); the issue also happened with older kernels, and with the lts and zen kernels
  • Btrfs on LUKS
  • i9-10850K
  • MSI Z490 Tomahawk
  • G.Skill 32 GB DDR4-3200 RAM
  • Samsung 970 Evo 1 TB & Kioxia Exceria G2 1 TB (tested both drives, in both slots each, over multiple installs)
  • Vega 56 GPU
  • be quiet! Straight Power 11 750 W PSU
d3Xt3r@lemmy.nz · 5 months ago

    Could you post your full (relevant/redacted) journalctl logs prior to the crash? It would be interesting to see if there was something else going on at the time which could’ve triggered it.

    journalctl -b -1 will show all messages from the previous boot.
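    To zero in on the NVMe messages from that boot, something like this works (`-k` restricts the output to kernel messages):

    ```shell
    # Kernel messages from the previous boot, filtered for nvme
    journalctl -k -b -1 --no-pager | grep -i nvme
    ```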

    Also, what’s your /etc/fstab like? Just wondering whether changing any of the mount options could help (e.g. commit=, discard, etc.).
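    As a concrete example, this is the kind of entry I mean (device name and option values are illustrative, not a recommendation):

    ```
    # /etc/fstab: btrfs root on LUKS. commit=120 flushes less often than
    # the 30 s default, and leaving out "discard" avoids inline TRIM,
    # which has been implicated in controller resets on some drives.
    /dev/mapper/cryptroot  /  btrfs  rw,noatime,compress=zstd,commit=120  0 0
    ```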

    Finally, have you checked fwupdmgr for firmware updates for your NVMe controller/drives? You should also check the respective manufacturer’s website, since not everyone publishes their firmware to the LVFS.
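    The fwupd check would be roughly (standard fwupdmgr subcommands):

    ```shell
    # Refresh metadata from the LVFS, then list and apply pending updates
    fwupdmgr refresh
    fwupdmgr get-updates
    fwupdmgr update
    ```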

    Also, I found this thread where someone with a similar issue solved it by swapping out their PSU, so it might be worth swapping yours out, if you can, to see if it makes a difference.