
Nouveau cpu soft/hard lockup


#1

Hello,

I recently bought a new ASUS laptop with an i7-8550U processor.
I tried Fedora, but I eventually gave up and switched back to Void, which was what I was using on my older laptop.

On Fedora I was getting system failure reports that mentioned both the kernel and the nouveau driver, due to a CPU lockup. There were occasional freezes, but it mostly worked fine. Under Xorg, though, I sometimes had to force a reboot because of these freezes.

After installing Void and the nouveau driver, I’ve been having similar issues, and I’m not sure how to fix them, since I can’t find any similar reports.

If I grep dmesg for nouveau, I can find some info:

WARNING: CPU: 1 PID: 18 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgf100.c:207 gf100_vmm_flush_+0x14e/0x190 [nouveau]

WARNING: CPU: 2 PID: 952 at drivers/gpu/drm/nouveau/nvkm/engine/gr/gf100.c:1507 gf100_gr_init_ctxctl+0x7fa/0x990 [nouveau]

WARNING: CPU: 4 PID: 110 at drivers/gpu/drm/nouveau/nvkm/subdev/pmu/base.c:86 nvkm_pmu_reset+0x14c/0x160 [nouveau]

These are some of the different messages I was able to find. There are also some other messages that look strange to me and that I wasn’t getting on the older laptop. At the beginning of dmesg, I have:

[    1.540108] nouveau: detected PR support, will not use DSM
[    1.540134] nouveau 0000:01:00.0: enabling device (0006 -> 0007)
[    1.540562] nouveau 0000:01:00.0: NVIDIA GP108 (138000a1)
[    1.581820] nouveau 0000:01:00.0: bios: version 86.08.0e.00.55
[    1.661816] nouveau 0000:01:00.0: fb: 2048 MiB GDDR5
[    3.665805] nouveau 0000:01:00.0: timeout
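
To see each distinct nouveau warning site once, with a count, instead of reading through the full traces, the log can be filtered. This is only a rough sketch: the function name nouveau_warning_sites is made up for illustration, and it assumes nothing beyond the WARNING line format shown above.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative): read kernel log text on stdin
# and print each distinct nouveau WARNING site, i.e. "source-file:line function",
# prefixed with the number of times it occurred.
nouveau_warning_sites() {
    grep 'WARNING: CPU:' \
        | grep '\[nouveau\]' \
        | sed -e 's/^.* at \([^ ]*\) \([^+ ]*\)+.*$/\1 \2/' \
        | sort | uniq -c | sort -rn
}

# Usage (reading dmesg may need root, depending on kernel.dmesg_restrict):
#   dmesg | nouveau_warning_sites
```

Repeated warnings from the same source line collapse into a single row, so a long log reduces to a handful of distinct sites.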

I also found that sometimes when I try to reboot, I get a hard CPU lockup, which led to a kernel panic once, though I don’t have the log (it has only happened once so far).

Here’s the dmesg output from those warnings:

[   50.948505] nouveau 0000:01:00.0: timeout
[   50.948567] WARNING: CPU: 1 PID: 18 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgf100.c:207 gf100_vmm_flush_+0x14e/0x190 [nouveau]
[   50.948568] Modules linked in: ctr ccm 8021q garp mrp stp llc nls_iso8859_1 nls_cp437 vfat fat hid_multitouch arc4 iwlmvm snd_soc_skl mac80211 snd_soc_skl_ipc coretemp snd_hda_ext_core snd_hda_codec_hdmi intel_rapl snd_soc_sst_dsp snd_soc_sst_ipc x86_pkg_temp_thermal snd_soc_acpi spi_pxa2xx_platform iTCO_wdt intel_powerclamp iTCO_vendor_support 8250_dw snd_hda_codec_generic snd_soc_core kvm_intel i2c_designware_platform iwlwifi i2c_designware_core snd_compress kvm snd_pcm_dmaengine ac97_bus irqbypass crct10dif_pclmul crc32_pclmul cfg80211 ghash_clmulni_intel pcbc snd_hda_intel snd_hda_codec aesni_intel uvcvideo aes_x86_64 crypto_simd glue_helper snd_hda_core videobuf2_vmalloc cryptd videobuf2_memops snd_hwdep intel_cstate videobuf2_v4l2 asus_nb_wmi intel_rapl_perf asus_wmi videobuf2_common snd_pcm
[   50.948588]  pcspkr sparse_keymap i2c_i801 wmi_bmof idma64 input_leds joydev virt_dma videodev tpm_crb mei_me btusb shpchp media btrtl processor_thermal_device tpm_tis mei btbcm intel_lpss_pci intel_soc_dts_iosf tpm_tis_core btintel int3403_thermal intel_lpss intel_pch_thermal battery ac int340x_thermal_zone tpm int3400_thermal thermal acpi_thermal_rel rng_core evdev asus_wireless acpi_pad mac_hid snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost tap uhid hci_vhci bluetooth ecdh_generic rfkill vfio_iommu_type1 vfio dm_mod uinput userio ppp_generic slhc tun loop btrfs xor zstd_compress raid6_pq zstd_decompress xxhash cuse fuse ext4 crc32c_generic crc16 mbcache jbd2 sd_mod hid_generic usbkbd usbmouse usbhid nouveau i915 intel_gtt hwmon
[   50.948613]  i2c_algo_bit drm_kms_helper ahci syscopyarea libahci sysfillrect ttm sysimgblt xhci_pci fb_sys_fops libata xhci_hcd drm crc32c_intel scsi_mod mxm_wmi usbcore serio_raw i2c_hid agpgart hid wmi video button
[   50.948621] CPU: 1 PID: 18 Comm: kworker/1:0 Tainted: G        W        4.16.7_1 #1
[   50.948622] Hardware name: ASUSTeK COMPUTER INC. X510UNR/X510UNR, BIOS X510UNR.301 09/25/2017
[   50.948624] Workqueue: pm pm_runtime_work
[   50.948635] RIP: 0010:gf100_vmm_flush_+0x14e/0x190 [nouveau]
[   50.948635] RSP: 0018:ffffbab540d6f668 EFLAGS: 00010282
[   50.948636] RAX: 0000000000000000 RBX: ffff93b1e4b80b28 RCX: ffffffffae057e48
[   50.948636] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000246
[   50.948637] RBP: ffff93b1de5d3468 R08: 0000000000000001 R09: 0000000000000503
[   50.948637] R10: 0000000000000009 R11: 0000000000000000 R12: ffff93b1dc98c0d0
[   50.948638] R13: 0000000b658ee3a0 R14: ffff93b1de12b0e0 R15: ffff93b1dcbbd860
[   50.948639] FS:  0000000000000000(0000) GS:ffff93b1eec40000(0000) knlGS:0000000000000000
[   50.948639] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   50.948640] CR2: 00007f2b84353b04 CR3: 000000021100a006 CR4: 00000000003606e0
[   50.948640] Call Trace:
[   50.948652]  nvkm_vmm_ptes_get_map+0x24b/0x410 [nouveau]
[   50.948662]  ? gm20b_vmm_new_fixed+0x40/0x40 [nouveau]
[   50.948671]  nvkm_vmm_map+0x1f5/0x3a0 [nouveau]
[   50.948680]  nvkm_mem_map_dma+0x56/0x80 [nouveau]
[   50.948690]  nvkm_uvmm_mthd+0x747/0x880 [nouveau]
[   50.948696]  nvkm_ioctl+0x10a/0x240 [nouveau]
[   50.948701]  nvif_object_mthd+0x108/0x130 [nouveau]
[   50.948704]  ? __slab_alloc.isra.23+0x27/0x40
[   50.948705]  ? __kmalloc+0x121/0x220
[   50.948709]  nvif_vmm_map+0x81/0xb0 [nouveau]
[   50.948718]  nouveau_mem_map+0x87/0x100 [nouveau]
[   50.948727]  nouveau_bo_move_m2mf.constprop.14+0x1cf/0x1f0 [nouveau]
[   50.948735]  nouveau_bo_move+0xac/0x480 [nouveau]
[   50.948740]  ? nvif_vmm_unmap+0x38/0x60 [nouveau]
[   50.948748]  ? nouveau_vma_unmap+0x20/0x30 [nouveau]
[   50.948751]  ttm_bo_handle_move_mem+0x28e/0x5b0 [ttm]
[   50.948753]  ttm_bo_evict+0x153/0x330 [ttm]
[   50.948755]  ttm_mem_evict_first+0x189/0x200 [ttm]
[   50.948757]  ttm_bo_force_list_clean+0x8e/0x160 [ttm]
[   50.948760]  ? pci_pm_runtime_resume+0xa0/0xa0
[   50.948767]  nouveau_do_suspend+0x7b/0x2a0 [nouveau]
[   50.948775]  nouveau_pmops_runtime_suspend+0x54/0xb0 [nouveau]
[   50.948777]  pci_pm_runtime_suspend+0x61/0x160
[   50.948778]  __rpm_callback+0xbc/0x1f0
[   50.948780]  ? __switch_to_asm+0x34/0x70
[   50.948781]  ? pci_pm_runtime_resume+0xa0/0xa0
[   50.948782]  rpm_callback+0x1f/0x70
[   50.948783]  ? pci_pm_runtime_resume+0xa0/0xa0
[   50.948784]  rpm_suspend+0x163/0x690
[   50.948785]  pm_runtime_work+0x64/0xa0
[   50.948787]  process_one_work+0x15b/0x3c0
[   50.948788]  worker_thread+0x2e/0x380
[   50.948790]  ? process_one_work+0x3c0/0x3c0
[   50.948791]  kthread+0x113/0x130
[   50.948792]  ? kthread_create_on_node+0x70/0x70
[   50.948793]  ret_from_fork+0x35/0x40
[   50.948794] Code: 41 5f e9 e6 e3 1a ed 48 8b 7d 10 48 8b 5f 50 48 85 db 74 46 e8 84 30 f5 ec 48 89 da 48 89 c6 48 c7 c7 f2 d9 67 c0 e8 92 e9 b3 ec <0f> 0b eb c2 48 8b 7d 10 48 8b 5f 50 48 85 db 74 24 e8 5c 30 f5 
[   50.948811] ---[ end trace 06574576f19915d3 ]---
[   83.377403] ------------[ cut here ]------------
[   83.377404] nouveau 0000:01:00.0: timeout
[   83.377462] WARNING: CPU: 1 PID: 18 at drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgf100.c:207 gf100_vmm_flush_+0x14e/0x190 [nouveau]
[   83.377463] Modules linked in: ctr ccm 8021q garp mrp stp llc nls_iso8859_1 nls_cp437 vfat fat hid_multitouch arc4 iwlmvm snd_soc_skl mac80211 snd_soc_skl_ipc coretemp snd_hda_ext_core snd_hda_codec_hdmi intel_rapl snd_soc_sst_dsp snd_soc_sst_ipc x86_pkg_temp_thermal snd_soc_acpi spi_pxa2xx_platform iTCO_wdt intel_powerclamp iTCO_vendor_support 8250_dw snd_hda_codec_generic snd_soc_core kvm_intel i2c_designware_platform iwlwifi i2c_designware_core snd_compress kvm snd_pcm_dmaengine ac97_bus irqbypass crct10dif_pclmul crc32_pclmul cfg80211 ghash_clmulni_intel pcbc snd_hda_intel snd_hda_codec aesni_intel uvcvideo aes_x86_64 crypto_simd glue_helper snd_hda_core videobuf2_vmalloc cryptd videobuf2_memops snd_hwdep intel_cstate videobuf2_v4l2 asus_nb_wmi intel_rapl_perf asus_wmi videobuf2_common snd_pcm
[   83.377485]  pcspkr sparse_keymap i2c_i801 wmi_bmof idma64 input_leds joydev virt_dma videodev tpm_crb mei_me btusb shpchp media btrtl processor_thermal_device tpm_tis mei btbcm intel_lpss_pci intel_soc_dts_iosf tpm_tis_core btintel int3403_thermal intel_lpss intel_pch_thermal battery ac int340x_thermal_zone tpm int3400_thermal thermal acpi_thermal_rel rng_core evdev asus_wireless acpi_pad mac_hid snd_seq snd_seq_device snd_timer snd soundcore vhost_vsock vmw_vsock_virtio_transport_common vsock vhost_net vhost tap uhid hci_vhci bluetooth ecdh_generic rfkill vfio_iommu_type1 vfio dm_mod uinput userio ppp_generic slhc tun loop btrfs xor zstd_compress raid6_pq zstd_decompress xxhash cuse fuse ext4 crc32c_generic crc16 mbcache jbd2 sd_mod hid_generic usbkbd usbmouse usbhid nouveau i915 intel_gtt hwmon
[   83.377510]  i2c_algo_bit drm_kms_helper ahci syscopyarea libahci sysfillrect ttm sysimgblt xhci_pci fb_sys_fops libata xhci_hcd drm crc32c_intel scsi_mod mxm_wmi usbcore serio_raw i2c_hid agpgart hid wmi video button
[   83.377520] CPU: 1 PID: 18 Comm: kworker/1:0 Tainted: G        W        4.16.7_1 #1
[   83.377520] Hardware name: ASUSTeK COMPUTER INC. X510UNR/X510UNR, BIOS X510UNR.301 09/25/2017
[   83.377523] Workqueue: pm pm_runtime_work
[   83.377534] RIP: 0010:gf100_vmm_flush_+0x14e/0x190 [nouveau]
[   83.377535] RSP: 0018:ffffbab540d6f7f8 EFLAGS: 00010282
[   83.377536] RAX: 0000000000000000 RBX: ffff93b1e4b80b28 RCX: ffffffffae057e48
[   83.377536] RDX: 0000000000000001 RSI: 0000000000000096 RDI: 0000000000000246
[   83.377537] RBP: ffff93b1de5d3468 R08: 0000000000000001 R09: 000000000000053d
[   83.377537] R10: ffffffffc0542f00 R11: 0000000000000000 R12: ffff93b1dc98c0d0
[   83.377538] R13: 00000012f277a920 R14: ffff93b1de12b0e0 R15: ffff93b1e13f4e38
[   83.377538] FS:  0000000000000000(0000) GS:ffff93b1eec40000(0000) knlGS:0000000000000000
[   83.377539] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   83.377540] CR2: 00007f9ddb6aa440 CR3: 000000021100a005 CR4: 00000000003606e0
[   83.377540] Call Trace:
[   83.377553]  nvkm_vmm_map+0x1ac/0x3a0 [nouveau]
[   83.377563]  ? gp100_vmm_pgt_dma+0x220/0x220 [nouveau]
[   83.377572]  nvkm_vram_map+0x56/0x80 [nouveau]
[   83.377583]  nvkm_uvmm_mthd+0x747/0x880 [nouveau]
[   83.377588]  nvkm_ioctl+0x10a/0x240 [nouveau]
[   83.377593]  nvif_object_mthd+0x108/0x130 [nouveau]
[   83.377596]  ? __slab_alloc.isra.23+0x27/0x40
[   83.377597]  ? __kmalloc+0x121/0x220
[   83.377602]  nvif_vmm_map+0x81/0xb0 [nouveau]
[   83.377611]  nouveau_mem_map+0x87/0x100 [nouveau]
[   83.377619]  nouveau_vma_map+0x44/0x70 [nouveau]
[   83.377627]  nouveau_bo_move_ntfy+0x70/0xd0 [nouveau]
[   83.377630]  ttm_bo_handle_move_mem+0x3e3/0x5b0 [ttm]
[   83.377632]  ttm_bo_evict+0x153/0x330 [ttm]
[   83.377634]  ttm_mem_evict_first+0x189/0x200 [ttm]
[   83.377636]  ttm_bo_force_list_clean+0x8e/0x160 [ttm]
[   83.377638]  ? pci_pm_runtime_resume+0xa0/0xa0
[   83.377646]  nouveau_do_suspend+0x7b/0x2a0 [nouveau]
[   83.377654]  nouveau_pmops_runtime_suspend+0x54/0xb0 [nouveau]
[   83.377656]  pci_pm_runtime_suspend+0x61/0x160
[   83.377658]  __rpm_callback+0xbc/0x1f0
[   83.377660]  ? __switch_to_asm+0x34/0x70
[   83.377661]  ? pci_pm_runtime_resume+0xa0/0xa0
[   83.377662]  rpm_callback+0x1f/0x70
[   83.377663]  ? pci_pm_runtime_resume+0xa0/0xa0
[   83.377664]  rpm_suspend+0x163/0x690
[   83.377665]  pm_runtime_work+0x64/0xa0
[   83.377667]  process_one_work+0x15b/0x3c0
[   83.377668]  worker_thread+0x2e/0x380
[   83.377669]  ? process_one_work+0x3c0/0x3c0
[   83.377671]  kthread+0x113/0x130
[   83.377672]  ? kthread_create_on_node+0x70/0x70
[   83.377673]  ret_from_fork+0x35/0x40
[   83.377674] Code: 41 5f e9 e6 e3 1a ed 48 8b 7d 10 48 8b 5f 50 48 85 db 74 46 e8 84 30 f5 ec 48 89 da 48 89 c6 48 c7 c7 f2 d9 67 c0 e8 92 e9 b3 ec <0f> 0b eb c2 48 8b 7d 10 48 8b 5f 50 48 85 db 74 24 e8 5c 30 f5 
[   83.377691] ---[ end trace 06574576f19915d4 ]---
[   83.377708] [TTM] Buffer eviction failed
[   83.377798] DMAR: DRHD: handling fault status reg 3
[   83.377809] DMAR: [DMA Write] Request device [01:00.0] fault addr fffa7000 [fault reason 05] PTE Write access is not set
[   83.377814] DMAR: DRHD: handling fault status reg 3
[   83.377826] DMAR: [DMA Write] Request device [01:00.0] fault addr fff95000 [fault reason 05] PTE Write access is not set
[   83.377832] DMAR: DRHD: handling fault status reg 3
[   83.377842] DMAR: [DMA Write] Request device [01:00.0] fault addr fff92000 [fault reason 05] PTE Write access is not set
[   83.377851] DMAR: DRHD: handling fault status reg 3
[   83.377862] DMAR: [DMA Write] Request device [01:00.0] fault addr fff8e000 [fault reason 05] PTE Write access is not set
[   83.377868] DMAR: DRHD: handling fault status reg 3
[   83.377878] DMAR: [DMA Write] Request device [01:00.0] fault addr fff8a000 [fault reason 05] PTE Write access is not set
[   98.376999] nouveau 0000:01:00.0: DRM: failed to idle channel 0 [DRM]

(Erin) #2

Do you have the Intel firmware installed and an up to date BIOS?


#3

I have linux-firmware-intel installed (not sure if it’s relevant, but I also have mesa-intel-dri and xf86-video-intel installed).

I haven’t tried to update the BIOS yet, but running dmidecode I get this:

# dmidecode 3.1
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
	Vendor: American Megatrends Inc.
	Version: X510UNR.301
	Release Date: 09/25/2017
	Address: 0xF0000
	Runtime Size: 64 kB
	ROM Size: 8192 kB
	Characteristics:
		PCI is supported
		BIOS is upgradeable
		BIOS shadowing is allowed
		Boot from CD is supported
		Selectable boot is supported
		BIOS ROM is socketed
		EDD is supported
		5.25"/1.2 MB floppy services are supported (int 13h)
		3.5"/720 kB floppy services are supported (int 13h)
		3.5"/2.88 MB floppy services are supported (int 13h)
		Print screen service is supported (int 5h)
		8042 keyboard services are supported (int 9h)
		Serial services are supported (int 14h)
		Printer services are supported (int 17h)
		ACPI is supported
		USB legacy is supported
		Smart battery is supported
		BIOS boot specification is supported
		Targeted content distribution is supported
		UEFI is supported
	BIOS Revision: 5.12

Meanwhile, I’ll check whether there is a newer version.


(Erin) #4

How about intel-ucode? xbps-query -Rs intel


#5

I updated the microcode, but the issues are still there. Here’s what dmesg shows for the microcode:

$ sudo dmesg | grep microcode
[    0.000000] microcode: microcode updated early to revision 0x84, date = 2018-01-21
[    1.196719] microcode: sig=0x806ea, pf=0x80, revision=0x84
[    1.196906] microcode: Microcode Update Driver: v2.2.

(I followed this to update the microcode, not sure if it makes a difference)

I’ve noticed that if I reboot or shut down, it freezes and I have to force the PC off. It doesn’t seem to happen if I don’t start the X server, though.

BIOS version seems to be the latest available as well, version 301.


#6

As of late, I have been having the exact same problems, which weren’t happening with the NVIDIA binary driver.
I’m also using an asus motherboard.


#7

I didn’t mention it, but this is one of those Optimus setups, with an NVIDIA MX150. I don’t really intend to use the NVIDIA GPU, though, at least for now; I haven’t even tried.

I was able to get the kernel panic output, by changing to tty2 and trying to reboot. (Kinda low quality, sorry about that)


#8

I read your comment and tried installing the NVIDIA binary driver, just to check, and the problem does seem to go away.

I’d noticed that even running lspci would freeze the system, just like when I tried to shut down, though I’m not sure whether that was a kernel panic as well.


#9

Just an update:

I’ve been trying to research the problem further, and it seems to be related to NVIDIA’s GP108M. I found some articles mentioning that nouveau’s support for some NVIDIA graphics cards wasn’t great, which I guess could explain the issues I was having (the lspci and Xorg freezes).

Installing the NVIDIA proprietary driver seems to fix it, but I didn’t really want to, so I kept trying other stuff (I only really need the Intel GPU).

I’m not sure exactly what the fix was, but I tried compiling the kernel with a custom config without loadable modules, with the needed firmware built into the kernel, and I tried disabling the nouveau driver as well as the laptop’s GPU switching mechanism. Disabling the driver seems to have fixed it, and I haven’t noticed any problems since.

I actually found something that seemed to help with the freezes on the Arch Wiki page for the nouveau driver, under “Random lockups with kernel error messages”, although I wasn’t really happy with that solution, so I tried completely disabling nouveau, which seems to be working for now.
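
For anyone on a stock kernel with loadable modules, one common way to keep nouveau from loading is a modprobe.d blacklist. This is only a sketch of that approach, not the custom kernel config described above; the function name and the file name blacklist-nouveau.conf are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative): write a modprobe.d fragment
# that keeps the nouveau module from being loaded.
# $1 is the target directory; it defaults to /etc/modprobe.d (needs root).
write_nouveau_blacklist() {
    dir="${1:-/etc/modprobe.d}"
    # "blacklist" stops automatic loading via device aliases; the "install"
    # line makes an explicit `modprobe nouveau` fail instead of loading it.
    printf '%s\n' \
        'blacklist nouveau' \
        'install nouveau /bin/false' \
        > "$dir/blacklist-nouveau.conf"
}

# Usage (as root):
#   write_nouveau_blacklist
```

If the module is also included in the initramfs, the initramfs would likely need to be regenerated before the change takes effect at boot.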


#10

For the record, I’m using a GM107 and I still occasionally get hard lockups on shutdown when using nouveau.