nvidia using wbinvd (an x86 instruction) causes huge latency spikes
Hi nvidia devs, users and friends.

Recently, I have been (further) investigating performance issues and hacking on the nvidia driver to improve it for PREEMPT_RT_FULL use (i.e. 'realtime linux').

Through a combination of tests and googling, I traced the issue down to the driver's use of the wbinvd instruction; http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc323.htm

WBINVD flushes the internal cache, then signals the external cache to write back current data, followed by a signal to flush the external cache. When the nvidia driver issues wbinvd, the caches of *ALL* CPUs get invalidated, forcing them to flush their caches and read everything back in. ~ This literally *stalls all of the cpus* -> leading to fairly substantial latencies / poor performance on a system that otherwise should be quite deterministic (and *is* deterministic when not making that call).
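
(To illustrate the mechanism - this is a minimal sketch, with names and body that are illustrative and NOT nvidia's actual code - the usual Linux pattern for a system-wide flush is to IPI every online CPU and have each one execute wbinvd locally, which is exactly why the whole machine stalls:)

#include <linux/smp.h>

/* wbinvd is serializing and uninterruptible; with large caches it can
 * take on the order of milliseconds - here on every CPU at once. */
static void do_wbinvd(void *unused)
{
    asm volatile("wbinvd" ::: "memory");
}

static void flush_all_cpu_caches(void)
{
    on_each_cpu(do_wbinvd, NULL, 1);   /* IPI all CPUs and wait for each */
}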

On a (vanilla) linux kernel the problem is less apparent, but still present. On a PREEMPT_RT_FULL system (where latency is critical), nvidia ends up choking the entire system. Here is the patch I am using to get around the issue;

diff -Naur a/nv-linux.h b/nv-linux.h
--- a/nv-linux.h 2013-12-03 23:24:48.484495874 +0100
+++ b/nv-linux.h 2013-12-03 23:27:44.684030888 +0100
@@ -392,8 +392,13 @@
 #endif
 
 #if defined(NVCPU_X86) || defined(NVCPU_X86_64)
+#if 0
 #define CACHE_FLUSH() asm volatile("wbinvd":::"memory")
 #define WRITE_COMBINE_FLUSH() asm volatile("sfence":::"memory")
+#else
+#define CACHE_FLUSH()
+#define WRITE_COMBINE_FLUSH() asm volatile("sfence":::"memory")
+#endif
 #elif defined(NVCPU_ARM)
 #define CACHE_FLUSH() cpu_cache.flush_kern_all()
 #define WRITE_COMBINE_FLUSH() \
diff -Naur a/nv-pat.c b/nv-pat.c
--- a/nv-pat.c 2013-12-03 23:24:33.987007640 +0100
+++ b/nv-pat.c 2013-12-03 23:26:57.615744800 +0100
@@ -34,7 +34,9 @@
 {
     unsigned long cr0 = read_cr0();
     write_cr0(((cr0 & (0xdfffffff)) | 0x40000000));
+#if 0
     wbinvd();
+#endif
     *cr4 = read_cr4();
     if (*cr4 & 0x80) write_cr4(*cr4 & ~0x80);
     __flush_tlb();
@@ -43,7 +45,9 @@
 static inline void nv_enable_caches(unsigned long cr4)
 {
     unsigned long cr0 = read_cr0();
+#if 0
     wbinvd();
+#endif
     __flush_tlb();
     write_cr0((cr0 & 0x9fffffff));
     if (cr4 & 0x80) write_cr4(cr4);


As you can see, I have it disabled (when building for RT kernels) in my build of the nvidia driver. (For reference, the magic numbers in the nv-pat.c hunks: 0x40000000 sets CR0.CD to disable caching, 0xdfffffff / 0x9fffffff clear CR0.NW and CR0.CD, and 0x80 is CR4.PGE - the wbinvd calls bracket that cache-disable toggle.) From what I am told, the Intel OSS driver on linux 3.10 also had this problem - however, they have since removed that code / corrected the problem. (I am using linux 3.12.1 + the rt patch.)

I am wondering if anyone at nvidia could tell me whether this might be a reasonable workaround (maybe even suitable for inclusion for -rt users?), and/or whether any work is being done in this area. Are nvidia linux devs aware that this is a (potential) problem on mainline linux, and that it leads to horrible performance on any linux system with 'hard-realtime' requirements?

It would be nice to get some feedback on this. * For me, it appears safe - two days of torturing nvidia on linux-rt, and no real problems.

---[Well, with one exception; the semaphore code in nvidia, when used on linux-rt, does lead to some scheduling bugs (but is otherwise non-fatal). That being said, the semaphores can be replaced by mutexes, but that is a different topic altogether.]---

If anyone wants to look at the patch(es) or test to verify, you can download my Archlinux package, extract the needed patches and apply them to your own nvidia driver (*requires a PREEMPT_RT_FULL kernel). The patches apply over nvidia-331.20, but the wbinvd problem exists in ALL versions of nvidia. Package/tarball here; https://aur.archlinux.org/packages/nvidia-l-pa/

You will need to apply these two patches;

- nvidia-rt_explicit.patch (sets PREEMPT_RT_FULL)
- nvidia-rt_no_wbinvd.patch (disables wbinvd for PREEMPT_RT_FULL).

1. cd into /kernel (sub-folder of the nvidia driver/installer)

2. apply the two patches above

3. make IGNORE_PREEMPT_RT_PRESENCE=1 SYSSRC=/usr/lib/modules/"${_kernver}/build" module

4. install the compiled binary

* Don't ask me for distro-specific help - I only use Archlinux (which I DO package for).

----

You can verify what I am talking about by using a tool that can measure latency. I use Cyclictest, which is part of the 'rt-tests' suite for linux-rt; https://rt.wiki.kernel.org/index.php/Cyclictest - you will see huge latency spikes when launching videos (on youtube, for example) and possibly when using things like CUDA. Disabling the calls results in no spikes.
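
(If you want to reproduce this, a typical invocation is something like: cyclictest -m -t -p 99 -n -i 1000 - i.e. lock memory, one measurement thread per CPU at RT priority 99, using clock_nanosleep at a 1000us interval - then watch the Max column while a video plays. Check the rt-tests docs if your version's flags differ.)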

It would be nice if nvidia found a way to avoid this call altogether, as the Intel OSS developers have done.

----

BTW - * The last patch [nvidia-rt_mutexes.patch] has nothing to do with the WBINVD issue - that one replaces nvidia's semaphores with mutexes. -> I'm still testing it, hence it isn't even enabled in my Archlinux package. It needs review, but I thought I would hit the linux-rt list to get help there - as I am not a programmer, but I do hack / understand some coding/languages to varying degrees.

Any insights, help or feedback would be nice, as I would like to avoid wbinvd calls on linux-rt / see nvidia improve their driver.

cheerz

Jordan

#1
Posted 12/05/2013 06:11 PM   
No takers eh?

I still have yet to see any issues with the no_wbinvd patch enabled. I asked on the linux-rt-users list, but only got a bit of feedback from a user, NOT a developer - so that wasn't exactly helpful. I hope someone at nvidia reads my post, looks into it and replies here.

---------

OT: but here is the kind of thing you see on linux-rt (3.12.1-rt currently, but it happens all throughout the 3.x-rt series) when nvidia is using semaphores;

[197972.079574] BUG: scheduling while atomic: irq/42-nvidia/18410/0x00000002
[197972.079596] Modules linked in: nvidia(PO) snd_seq_midi snd_seq_midi_event snd_seq_dummy snd_hrtimer snd_seq isofs fuse joydev hid_generic usbhid snd_usb_audio snd_usbmidi_lib hid snd_rawmidi snd_seq_device wacom snd_hda_codec_hdmi forcedeth snd_hda_codec snd_hwdep snd_pcm snd_page_alloc snd_timer snd soundcore edac_core edac_mce_amd k10temp evdev serio_raw i2c_nforce2 video wmi asus_atk0110 button processor drm i2c_core microcode ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod ata_generic pata_acpi ahci libahci pata_amd ohci_pci ohci_hcd ehci_pci libata ehci_hcd firewire_ohci firewire_core crc_itu_t scsi_mod usbcore usb_common [last unloaded: nvidia]
[197972.079599] CPU: 3 PID: 18410 Comm: irq/42-nvidia Tainted: P W O 3.12.1-rt4-3-l-pa #1
[197972.079599] Hardware name: System manufacturer System Product Name/M4N75TD, BIOS 1701 04/14/2011
[197972.079601] ffff8801efb5fa70 ffff8801efb5f910 ffffffff814ebb7d ffff88022fcd1b40
[197972.079602] ffff8801efb5f920 ffffffff814e8f00 ffff8801efb5fa28 ffffffff814eeb49
[197972.079603] 0000000000011b40 ffff8801efb5ffd8 ffff8801efb5ffd8 0000000000011b40
[197972.079603] Call Trace:
[197972.079608] [<ffffffff814ebb7d>] dump_stack+0x54/0x9a
[197972.079609] [<ffffffff814e8f00>] __schedule_bug+0x48/0x56
[197972.079611] [<ffffffff814eeb49>] __schedule+0x629/0x7e0
[197972.079613] [<ffffffff814f0d4e>] ? _raw_spin_unlock_irqrestore+0xe/0x60
[197972.079615] [<ffffffff810bba55>] ? task_blocks_on_rt_mutex+0x1f5/0x260
[197972.079616] [<ffffffff814eed2a>] schedule+0x2a/0x80
[197972.079618] [<ffffffff814efc2b>] rt_spin_lock_slowlock+0x177/0x2ac
[197972.079726] [<ffffffffa1538401>] ? _nv014994rm+0x1395/0x25f4 [nvidia]
[197972.079732] [<ffffffff814f08a5>] rt_spin_lock+0x25/0x40
[197972.079734] [<ffffffff81083129>] __wake_up+0x29/0x60
[197972.079768] [<ffffffffa17468dd>] nv_post_event+0xdd/0x120 [nvidia]
[197972.079807] [<ffffffffa171e369>] _nv013270rm+0xed/0x144 [nvidia]
[197972.079843] [<ffffffffa122fc7e>] ? _nv013107rm+0x9/0xb [nvidia]
[197972.079906] [<ffffffffa1433868>] ? _nv005358rm+0xbe/0xe7 [nvidia]
[197972.079968] [<ffffffffa1433b42>] ? _nv012422rm+0xdf/0xf8 [nvidia]
[197972.080035] [<ffffffffa1433ac2>] ? _nv012422rm+0x5f/0xf8 [nvidia]
[197972.080107] [<ffffffffa15721ec>] ? _nv009896rm+0xb0d/0xd40 [nvidia]
[197972.080182] [<ffffffffa1572204>] ? _nv009896rm+0xb25/0xd40 [nvidia]
[197972.080235] [<ffffffffa1607a73>] ? _nv011894rm+0x4df/0x709 [nvidia]
[197972.080286] [<ffffffffa160618a>] ? _nv001242rm+0x21e/0x2a7 [nvidia]
[197972.080337] [<ffffffffa1606570>] ? _nv011911rm+0x3d/0x14b [nvidia]
[197972.080372] [<ffffffffa122fc7e>] ? _nv013107rm+0x9/0xb [nvidia]
[197972.080422] [<ffffffffa160da98>] ? _nv011891rm+0x38/0x59 [nvidia]
[197972.080459] [<ffffffffa172127a>] ? _nv000818rm+0xcd/0x133 [nvidia]
[197972.080492] [<ffffffffa1725691>] ? rm_isr_bh+0x23/0x73 [nvidia]
[197972.080523] [<ffffffffa1743a1b>] ? nvidia_isr_bh+0x3b/0x60 [nvidia]
[197972.080525] [<ffffffff81055a89>] ? __tasklet_action.isra.11+0x69/0x120
[197972.080526] [<ffffffff81055bfe>] ? tasklet_action+0x5e/0x60
[197972.080527] [<ffffffff8105555c>] ? do_current_softirqs+0x19c/0x3a0
[197972.080529] [<ffffffff810a6300>] ? irq_thread_fn+0x60/0x60
[197972.080530] [<ffffffff810557be>] ? local_bh_enable+0x5e/0x80
[197972.080531] [<ffffffff810a633b>] ? irq_forced_thread_fn+0x3b/0x80
[197972.080532] [<ffffffff810a657f>] ? irq_thread+0x11f/0x160
[197972.080533] [<ffffffff810a65c0>] ? irq_thread+0x160/0x160
[197972.080534] [<ffffffff810a6460>] ? wake_threads_waitq+0x60/0x60
[197972.080536] [<ffffffff81074f12>] ? kthread+0xb2/0xc0
[197972.080537] [<ffffffff81074e60>] ? kthread_worker_fn+0x1a0/0x1a0
[197972.080539] [<ffffffff814f8bac>] ? ret_from_fork+0x7c/0xb0
[197972.080540] [<ffffffff81074e60>] ? kthread_worker_fn+0x1a0/0x1a0


...with mutexes replacing the semaphore code, you don't see this kind of ugliness in the kernel ring buffer.
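
(To show the shape of that conversion, here is a hypothetical sketch - identifiers invented for illustration, the real nvidia source differs. On PREEMPT_RT a kernel mutex is a priority-inheriting rt_mutex, so this path can sleep safely instead of scheduling while atomic:)

#include <linux/mutex.h>

/* was: static struct semaphore nv_event_sem;  sema_init(&nv_event_sem, 1); */
static DEFINE_MUTEX(nv_event_lock);

static void nv_post_event_example(void)
{
    mutex_lock(&nv_event_lock);     /* may sleep; PI-aware on -rt */
    /* ... update shared event state and wake any waiters ... */
    mutex_unlock(&nv_event_lock);
}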

cheers

#2
Posted 12/08/2013 03:40 AM   
Could someone from nvidia PLEASE! f-ing respond to my post!

I don't expect any users will be able to clarify the wbinvd issue, and I would like some info from someone 'in the know'.... It's freaking annoying that as a long-time nvidia (linux) user/customer - who buys nvidia for every PC and recommends nvidia to others - I can't even get ONE post answered in these developer forums.

It really makes me question whether I should continue to take my hard-earned cash and spend it on nvidia, when I could be giving it all to Intel ~ who ARE helpful, DO respond to inquiries, etc...

(so far) thanks for nothing.

#3
Posted 12/15/2013 06:26 PM   
I'm sure your loyalty is appreciated, but the sort of feedback you're looking for is unheard of here -- not just in the driver forum but across all of the dev forums. To get that kind of information you would need to engage an engineer who, you know, is already overworked and can't spare the time to read these forums religiously and respond to every post. There are even serious bug reports here that don't get any attention (publicly).

Maybe things are different at Intel but, in the end, I'm not sure that they're making a better product.

I suggest that you might have more success getting an engineer's attention through a non-public method, e.g. through a proxy.

#4
Posted 12/15/2013 11:26 PM   
ninez: Have you tested your patch with some of CUDA's more unusual memory options? Namely, write-combined memory? I think CUDA may also support uncached host-side memory. Take a look at kernel/nv-vm.c: You'll see uncached pages used all over the place (and calls to flush the CPU cache). I get the feeling that what you're doing could be dangerous for some applications. Cache inconsistencies can sometimes be really hard to trigger/detect.
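
(For reference, a hedged example of the kind of allocation I mean - write-combined, GPU-mapped host memory via the CUDA runtime; the function itself is mine, for illustration:)

#include <cuda_runtime.h>

/* WC host memory: fast for the GPU to read, but the CPU side largely
 * bypasses its caches - the memory type those driver flushes protect. */
static float *alloc_wc_host_buffer(size_t bytes)
{
    float *p = NULL;
    if (cudaHostAlloc((void **)&p, bytes,
                      cudaHostAllocWriteCombined | cudaHostAllocMapped) != cudaSuccess)
        return NULL;
    return p;
}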

Do you have a sense of what the driver is doing when it calls wbinvd? Is it in an interrupt handler? Tasklet? Workqueue item? System call (e.g., ioctl())?

#5
Posted 12/17/2013 06:33 PM   
@Arakageeta - I just saw your message (very helpful, thanx), but unfortunately it's early morning here and I am just about to leave for work, so I don't have time to get into this right now... I'll have a (more detailed) look through nv-vm.c when I get home + try to get some other details together.

cheerz

EDIT: this may take me a few days to get back to; I forgot it was my father's b-day yesterday && the holiday season / shopping for all the kids in my family / extended family is not only cracking my piggy bank - but also taking up most of my time (for the next few days anyway). But I'll free up some time soon to sort this out.

#6
Posted 12/18/2013 01:24 PM   
I've applied the patch and I haven't seen any difference on my system - a non-RT desktop with an AMD CPU + 650GTX. And I'm suffering from some terrible lag in certain applications, though it clearly appears to be unrelated to this cache clear instruction.

ninez, you are not alone in frustration with nVidia. They seem to have spoken clearly as of late that their priorities are elsewhere. I have old hardware (8800GTS or 9400GT) that works flawlessly and smoothly with old drivers. Then you get Kepler, and it's a terrible experience. The lag is even easy to duplicate:

1. install wine
2. install trial of Eve Online
3. start it up, undock and "warp" your ship or just look around.

And the only difference between the smooth experience with the old video card and the jagged, unusable one is the video card + drivers. With older drivers, it even produced Xid driver errors 59 and 8 and others. nVidia fixed those errors - it took them a year to address that. ....

#7
Posted 12/21/2013 04:08 AM   
[quote="Franster"]I'm suffering from some [b]terrible[/b] lag in certain applications though it clearly appears to be unrelated to this cache clear instruction.[/quote] Is this an graphics or CUDA application?
Franster said: I'm suffering from some terrible lag in certain applications though it clearly appears to be unrelated to this cache clear instruction.


Is this a graphics or CUDA application?

#8
Posted 12/25/2013 09:30 AM   
[quote="Arakageeta"][quote="Franster"]I'm suffering from some [b]terrible[/b] lag in certain applications though it clearly appears to be unrelated to this cache clear instruction.[/quote] Is this an graphics or CUDA application?[/quote] Graphics. I suspect something finicky with some shaders, but that is speculation at present. For computation, I don't use CUDA, just simple OpenCL kernels (program) and I've had no problems with OpenCL.
Arakageeta said:
Franster said: I'm suffering from some terrible lag in certain applications though it clearly appears to be unrelated to this cache clear instruction.


Is this a graphics or CUDA application?


Graphics. I suspect something finicky with some shaders, but that is speculation at present.

For computation, I don't use CUDA, just simple OpenCL kernels (programs), and I've had no problems with OpenCL.

#9
Posted 12/28/2013 06:09 AM   
Hey guys - Sorry for the extremely late reply. My holiday season was an absolute mess;

- massive ice storm, no power for several days over the holidays
- huge amount of cleanup + helping sort out other friends'/families' issues
- problems at work due to some nasty data loss (actually H/W failure)
- (and now) I am sick as a dog; which, ironically, should free me up some time over the weekend / into next week, hopefully.

I'm @ work today - but I'm pretty sure I will be taking off several days next week, since I am almost 'caught up' @ work... and as the person who I caught this bug from has been extremely sick for well over a week, I doubt I will be going into work (next week) if I can avoid it.

anyway, I'll try to set aside some time to delve further into this, asaic.

#10
Posted 01/03/2014 03:44 PM   
The CACHE_FLUSH macro is used in nv-vm.c, which contains this comment:
/*
* Cache flushes and TLB invalidation
*
* Allocating new pages, we may change their kernel mappings' memory types
* from cached to UC to avoid cache aliasing. One problem with this is
* that cache lines may still contain data from these pages and there may
* be then stale TLB entries.
*
* The Linux kernel's strategy for addressing the above has varied since
* the introduction of change_page_attr(): it has been implicit in the
* change_page_attr() interface, explicit in the global_flush_tlb()
* interface and, as of this writing, is implicit again in the interfaces
* replacing change_page_attr(), i.e. set_pages_*().
*
* In theory, any of the above should satisfy the NVIDIA graphics driver's
* requirements. In practise, none do reliably:
*
* - most Linux 2.6 kernels' implementations of the global_flush_tlb()
* interface fail to flush caches on all or some CPUs, for a
* variety of reasons.
*
* Due to the above, the NVIDIA Linux graphics driver is forced to perform
* heavy-weight flush/invalidation operations to avoid problems due to
* stale cache lines and/or TLB entries.
*/

I'll defer to the kernel experts, but my understanding is that this is required to avoid problems with cache consistency, which can be extremely difficult to track down. I'm sorry that this introduces problems with your -rt-patched kernel's latency guarantees. This is part of why -rt kernels are not officially supported.

Aaron Plattner
NVIDIA Linux Graphics

#11
Posted 01/03/2014 06:07 PM   
1st. Thanks for replying aplattner. I was actually surprised to see the 'elusive' olive green background of an nvidia dev's comment ;)

aplattner said: The CACHE_FLUSH macro is used in nv-vm.c, which contains this comment:

<...snip...>


Yes, I have come across that bit (in recent days, I have also been reading up a little on cache coherence / consistency). It's interesting, because afaict none of my systems have had issues with removing that call. (Both systems' MOBOs support nvidia features (core calibration, SLI, etc), both are SMP (one 4-core, one 8-core) and run PREEMPT_RT_FULL... and I've never had performance this smooth / deterministic.)

aplattner said:
I'll defer to the kernel experts, but my understanding is that this is required to avoid problems with cache consistency, which can be extremely difficult to track down. I'm sorry that this introduces problems with your -rt-patched kernel's latency guarantees. This is part of why -rt kernels are not officially supported.


Yeah, I got that impression from reading the code comment - obviously, I had been running the patched nvidia (long) before seeing that bit ~ and hadn't/haven't noticed any problems. I wonder, do you have any suggestions as to how to go about tracking down problems with cache consistency? In the (weeks?) that I have been running nvidia without wbinvd I haven't experienced any odd behavior (aside from my system working as it should). I've run some (linux-related) diagnostic tools, cuda_memtest, all cuda examples/checks. My H/W-accelerated VMs in vmware work great, unigine-* GFX benchmarks (and others) work fine... It would really be nice to NOT have to go back to a driver that introduces severe latency spikes ;)

I wonder if using mutexes is having any impact here(?) (mutexes imply more than just synchronization - also memory barriers, correct ordering, etc., afaict). I also use a few other non-standard (linux) bits, like UKSM (in-kernel deduplication of memory), a few other tweaks like MAX_READAHEAD multiplied/increased (MM subsystem), and linking nvidia with ld.bfd instead of ld.gold (and obviously no wbinvd, mutexes, etc)... Maybe that is grasping at straws, but I would tend to think: if wbinvd is so critical, why haven't I experienced any problems? (Only benefits.)

Anyway, it would be nice if you could defer this to someone who is more in the know. Maybe there is a better / different way to handle the caches that could be worth exploring... this I don't know. But thanks regardless.

Arakageeta said:
Franster said: I'm suffering from some terrible lag in certain applications though it clearly appears to be unrelated to this cache clear instruction.


Is this an graphics or CUDA application?


Hi, Arakageeta - CUDA on my system without wbinvd works just fine; all of the examples are smooth, I pass all tests, etc... I didn't have a problem with any of them. :) I also downloaded some other cuda apps/demos from around the web to test; aside from the odd one that didn't compile (code rot, most likely), all of them worked great. ~ The only observation I had was that certain cuda demos put a little 'strain' on Compiz, * but that only involved windows moving slightly slower - that's it. (The stock/unpatched kernel/driver does the same thing for me.)

#12
Posted 01/06/2014 06:18 AM   
I've found some tools for stress testing (including *testing cache coherency*, among other things). Right now I am using Google's "Stressful Application Test"; http://code.google.com/p/stressapptest/ . On the cache coherency test(s) I get a *PASSING grade*, zero errors/issues. - I also went to the trouble of using some CUDA at the same time, then VMware after that (rerunning the test during each use) - still I get a 'passing grade' / my caches are fine...
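
(For reference, the invocation I mean is along the lines of: stressapptest -s 60 -M 2048 --cc_test - i.e. run for 60 seconds against 2048 MB of memory with the cache-coherency test enabled. Check its --help output, since I'm quoting flags from memory.)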

Since I am sick and not working, I am going to use a nice chunk of the day to see if I can find any test / debugging mechanism in the kernel that will actually report an issue with cache coherency or consistency.

EDIT: I've added rdtsc usage and cachegrind (part of valgrind) to my list of tools. So far so good (in any tests that I have run). I believe Intel's VTune should potentially be useful too, except that while I do have their compiler suite installed, I don't think I ever actually got VTune. (On my list of things to do today.)

#13
Posted 01/06/2014 05:19 PM   
[quote="aplattner"]This is part of why -rt kernels are not officially supported.[/quote] @aplattner: This has been NVIDIA's policy for quite some time, but it may have to change: SteamOS (at least the beta) runs the PREEMPT_RT Linux kernel. (NVIDIA could easily not fix ninez's bug and still claim full support---this is a latency issue, not a functional issue.) @ninez: I'm glad that your patch seems stable. However, "exhaustive" testing such as this can only give you warm-fuzzies about the patch on your particular software/hardware system. How do you know what you're doing is correct for other CPU models, each operating at different speeds, and with different clock ratios between the CPU and various buses? Cache errors may be the most vile and hard-to-diagnose race conditions out there. I think what you're doing is great. I'm glad your sharing your code and experience with everyone. But as for a general solution, merely removing wbinvdt() sounds very dangerous. Is there anything that can [i]replace[/i] the instruction instead of remove it? I'll grep around the Linux kernel to see how they handle the situation...
aplattner said: This is part of why -rt kernels are not officially supported.


@aplattner: This has been NVIDIA's policy for quite some time, but it may have to change: SteamOS (at least the beta) runs the PREEMPT_RT Linux kernel. (NVIDIA could easily not fix ninez's bug and still claim full support---this is a latency issue, not a functional issue.)

@ninez: I'm glad that your patch seems stable. However, "exhaustive" testing such as this can only give you warm fuzzies about the patch on your particular software/hardware system. How do you know what you're doing is correct for other CPU models, each operating at different speeds and with different clock ratios between the CPU and various buses? Cache errors may be the most vile and hard-to-diagnose race conditions out there. I think what you're doing is great, and I'm glad you're sharing your code and experience with everyone. But as a general solution, merely removing wbinvd sounds very dangerous. Is there anything that can replace the instruction instead of removing it? I'll grep around the Linux kernel to see how they handle the situation...

#14
Posted 01/09/2014 04:05 PM   
The code comments aplattner posted from nv-vm.c talk about change_page_attr(). Digging into arch/x86/mm/pageattr.c (3.0.x kernel), we find this function calls cpa_flush_*():
    /*
     * On success we use clflush, when the CPU supports it to
     * avoid the wbindv. If the CPU does not support it and in the
     * error case we fall back to cpa_flush_all (which uses
     * wbindv):
     */
    if (!ret && cpu_has_clflush) {
        if (cpa.flags & (CPA_PAGES_ARRAY | CPA_ARRAY)) {
            cpa_flush_array(addr, numpages, cache,
                            cpa.flags, pages);
        } else
            cpa_flush_range(baddr, numpages, cache);
    } else
        cpa_flush_all(cache);


Seems like this code is sensitive to CPU capabilities. It appears that cpa_flush_range/array() use a lighter-weight method of cache invalidation: clflush() (see arch/x86/include/asm/system.h) instead of wbinvd.

If we assume Linux is working properly, then there is no need to flush the cache in nv-vm.c::nv_flush_cache(). NVIDIA says it doesn't always work, hence the wbinvd. Unfortunately, nv-vm.c doesn't give us any more information.
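
(To make the contrast concrete, here is a minimal sketch of the clflush-based approach, modeled loosely on the kernel's clflush_cache_range(); the body is illustrative, not the kernel's exact code:)

#include <asm/processor.h>   /* boot_cpu_data */
#include <asm/system.h>      /* clflush(), mb() on kernels of this era */

/* Flush only the cache lines covering [vaddr, vaddr + size), and only on
 * the calling CPU, instead of writing back and invalidating every cache
 * in the system the way wbinvd does. */
static void flush_range_with_clflush(void *vaddr, unsigned int size)
{
    const int line = boot_cpu_data.x86_clflush_size;
    char *p = vaddr;
    char *end = p + size;

    mb();                      /* order earlier stores before the flushes */
    for (; p < end; p += line)
        clflush(p);            /* write back + invalidate a single line */
    mb();                      /* ensure the flushes have completed */
}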

Here are the code comments in nv-vm.c from an older 27x-era driver. They're a little different:
/*
* Cache flushes and TLB invalidation
*
* Allocating new pages, we may change their kernel mappings' memory types
* from cached to UC to avoid cache aliasing. One problem with this is
* that cache lines may still contain data from these pages and there may
* be then stale TLB entries.
*
* The Linux kernel's strategy for addressing the above has varied since
* the introduction of change_page_attr(): it has been implicit in the
* change_page_attr() interface, explicit in the global_flush_tlb()
* interface and, as of this writing, is implicit again in the interfaces
* replacing change_page_attr(), i.e. set_pages_*().
*
* In theory, any of the above should satisfy the NVIDIA graphics driver's
* requirements. In practise, none do reliably:
*
* - some Linux 2.4 kernels (e.g. vanilla 2.4.27) did not flush caches
* on CPUs with Self Snoop capability, but this feature does not
* interact well with AGP.
*
* - most Linux 2.6 kernels' implementations of the global_flush_tlb()
* interface fail to flush caches on all or some CPUs, for a
* variety of reasons.
*
* Due to the above, the NVIDIA Linux graphics driver is forced to perform
* heavy-weight flush/invalidation operations to avoid problems due to
* stale cache lines and/or TLB entries.
*/

Here, the comments state that the 2.6 kernel only needs a TLB flush. This implies to me that commenting out the call to the CACHE_FLUSH() macro in nv_flush_cache() should be safe.* I think this is a better solution than changing CACHE_FLUSH() into a noop. Why did the comments in nv-vm.c change? Did an engineer get overly zealous in cleaning up comments when AGP or 2.4 kernel support was dropped? Did NVIDIA learn of other instances where the 2.6 kernel also needed a cache flush? We'll never know.

I think the best that you can do is register a bug with NVIDIA and hope that they task an engineer to reevaluate the situation. This is such a low-level and fundamental part of memory management that I could see NVIDIA being very (extremely) hesitant to make any official changes unless wbinvd starts to create serious problems for important customers (AAA games on Linux). I'm not surprised that you've hit a bug that relates to latency: the Linux driver hasn't really had to support low-latency operations until recently. Hopefully SteamOS will help motivate a change.

In the meantime, I think the best that you can do is test your system with CACHE_FLUSH() commented out from nv_flush_cache() and hope for the best. I'll test it out on my system as well.

* It should be safe, assuming that nv_flush_cache()'s callers invoke it because they changed memory attributes via change_page_attr() or set_pages_*(), and not for some other reason. This appears to be the case: we can limit the scope of code that needs to be reviewed to nv-vm.c, since nv_flush_cache() is a static function.
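
(For concreteness, a hypothetical sketch of that narrower change - the guard macro is invented and the function body is illustrative, since the real nv-vm.c differs:)

static void nv_flush_cache(void *p)
{
#if !defined(NV_SKIP_WBINVD)    /* hypothetical opt-out for -rt builds */
    CACHE_FLUSH();              /* wbinvd on x86: the heavy hammer */
#endif
    __flush_tlb();              /* per the 27x-era comment, the TLB flush
                                 * is what 2.6+ kernels actually need */
}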

#15
Posted 01/09/2014 07:36 PM   