Wednesday

[Bug 1798165] Re: Vulkan applications cause permanent memory leak with Intel GPU

Launchpad has imported 25 comments from the remote bug at
https://bugs.freedesktop.org/show_bug.cgi?id=107899.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2018-09-11T08:19:25+00:00 yurikoles wrote:

It was discovered that ANV doesn't free allocated memory. Original
discussion: https://github.com/doitsujin/dxvk/issues/632

Affected configuration: ANV + DXVK + Wine.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/0

------------------------------------------------------------------------
On 2018-09-11T08:21:14+00:00 yurikoles wrote:

Created attachment 141521
Top 10 stacks with outstanding allocations:

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/1

------------------------------------------------------------------------
On 2018-09-11T14:23:55+00:00 2-jason-v wrote:

From the GitHub issue, it sounded like these allocations are never freed
even after closing the app; is this correct?

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/2

------------------------------------------------------------------------
On 2018-09-11T15:00:36+00:00 Leonardo Müller wrote:

Yes, this is correct. Even after closing the application, the memory is
not freed. If the application is opened again, it does not reuse the
memory it used before; it allocates new memory for itself.

Note: I'm FurretUber on GitHub.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/3

------------------------------------------------------------------------
On 2018-09-14T15:21:10+00:00 Leonardo Müller wrote:

Note: Dota 2 is causing GPU hangs, I have to set INTEL_DEBUG=nohiz to
make it work.

It seems it's not a DXVK-only issue. Dota 2 has, after 12 hours,
allocated 719 MB of i915_request cache too. What I have noticed is that
it was not allocating any i915_request cache for 20 minutes, then it
started to allocate the memory too but at rates much slower than the
DXVK applications.

As a comparison, Forsaken Castle, in the same 20 minutes, allocated 213
MB.

I already tried to debug this (see the GitHub issue) using perf and the
only thing that looked strange in the DXVK application was one of the
variables having values way off compared to the value of the variable in
other applications. Using perf to debug @<i915_gem_do_execbuffer+3647>
and looking at the values of in_fence and fences:

Forsaken_Castle 16943 [003] 17521.309896: probe:i915_gem_do_execbuffer: (ffffffffc07690af) in_fence=0x0 fences=0xffff952366b7bb00
Xorg 2215 [000] 17521.309986: probe:i915_gem_do_execbuffer: (ffffffffc07690af) in_fence=0x0 fences=0x2d28
rolc.exe 15796 [001] 17521.311725: probe:i915_gem_do_execbuffer: (ffffffffc07690af) in_fence=0x0 fences=0x1d8


I can do more tests if needed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/4

------------------------------------------------------------------------
On 2018-09-14T20:39:48+00:00 Chris Wilson wrote:

The accumulation of i915_request implies there is a fence leak. Assuming
it is not internal (an unmatched dma_fence_get/dma_fence_put), all
userspace owners would be tied to an fd and eventually one would notice
the fd exhaustion (after a few million depending on rlimit). But for the
fd to stick around requires the process to be kept alive, which would
imply the fence fd being passed to a display server. I don't think that
is how fences are handled under X, which makes the likelihood of it
being a singular userspace fence leak less likely.

'ls -1 /proc/$suspect/fd/ | wc -l' might be interesting to watch.
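
The same check can be scripted. What follows is a minimal sketch in Python, assuming a Linux /proc filesystem and enough permission to read the suspect's fd directory. The function names `count_open_fds` and `watch_fds` are illustrative and do not come from any tool mentioned in this thread.

```python
import os
import time


def count_open_fds(pid):
    """Return the number of entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def watch_fds(pid, interval=2.0, samples=5):
    """Sample the fd count several times, like re-running `ls | wc -l`."""
    counts = []
    for i in range(samples):
        counts.append(count_open_fds(pid))
        if i < samples - 1:
            time.sleep(interval)  # wait before taking the next sample
    return counts
```

Calling `watch_fds(pid)` with the suspect's PID returns a list of counts. A steadily growing count would point to fences held through file descriptors, while a flat count next to a growing i915_request cache would suggest the references are held elsewhere.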

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/5

------------------------------------------------------------------------
On 2018-09-15T22:06:47+00:00 2-jason-v wrote:

I suspect this is some sort of fence leak with syncobj which would
explain why only anv hits it (the GL driver doesn't use syncobj yet).

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/6

------------------------------------------------------------------------
On 2018-09-16T16:14:47+00:00 Richard Yao wrote:

> The accumulation of i915_request implies there is a fence leak.
Assuming it is not internal (an unmatched dma_fence_get/dma_fence_put),
all userspace owners would be tied to an fd and eventually one would
notice the fd exhaustion (after a few million depending on rlimit).

It is not tied to a file descriptor because I915_EXEC_FENCE_OUT is not
set in args->flags (that was worked out by working backward from a perf
trace). This means that DEFINE_DRM_GEM_FOPS->drm_release is never
called, and we never get dma_fence_put() from this (hypothetical) stack:

dma_fence_put
drm_syncobj_free
kref_put
drm_syncobj_put
drm_syncobj_release_handle
drm_syncobj_release
drm_release

The trace at github indirectly shows that out_fence_fd == -1:

https://github.com/doitsujin/dxvk/issues/632#issuecomment-420485691

My system is also affected by this. I have a Xeon E3-1276v3. I am
running Gentoo with Linux 4.18.0-rc8, Xorg 1.19.5, mesa 18.1.6 and
vulkan-loader 1.1.77.0. I have killed the Xorg server and the
i915_request objects were not freed from the SLAB cache. This implies
that the objects are not tied to a file descriptor.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/7

------------------------------------------------------------------------
On 2018-09-16T16:25:42+00:00 Richard Yao wrote:

Disregard that remark. That path involves sync_file_fops() and it has a
matching dma_fence_get()/dma_fence_put() in
sync_file_create()/sync_file_release().

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/8

------------------------------------------------------------------------
On 2018-09-16T17:46:37+00:00 Richard Yao wrote:

Created attachment 141589
Dump of long lived i915_request object

I captured one of these long lived SLAB objects from my system using the
crash utility. I am not a graphics developer, but here is what stood out
to me:

1. The dma_fence refcount is 1.
2. The dma_fence seqno is 82576060.
3. The dma_fence ops is i915_fence_ops.
4. The global_seqno is 82576036.

I'll post my capture of the referenced intel_engine_cs next, although
the interesting thing there is that the timeline seqno is 82620551,
which implies that this particular object is old.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/9

------------------------------------------------------------------------
On 2018-09-16T17:47:03+00:00 Richard Yao wrote:

Created attachment 141590
Dump of intel_engine_cs to match i915_request dump

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/10

------------------------------------------------------------------------
On 2018-09-17T17:09:14+00:00 Leonardo Müller wrote:

Created attachment 141603
slabtop, /proc/slabinfo, wc -l proposed and one screenshot

I'm sorry for not being more active on this report before.

The attached file has three text files and one image.

I tested playing Forsaken Castle while watching
ls -1 /proc/$suspect/fd/ | wc -l (in my case, $suspect=24003); the
output file is called proc24003fdwcl.txt. I created it using:

while true ; do ls -1 /proc/24003/fd/ | wc -l >> proc24003fdwcl.txt ; sleep 2; done

There is /proc/slabinfo and the output of slabtop -s c too. There is
iGVT-g load in the logs because I was using a virtual machine when I
tested this (it's not Forsaken Castle with iGVT-g load, don't worry).

The screenshot shows the end of the game demo, with the approximate
time I played it. Notice that in ~12 minutes (I needed some seconds to
set up the logging) it allocated around 110 MB of memory.
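
To put the figures quoted so far side by side, here is a small worked calculation in Python. It only divides the amounts reported in this thread by the observation times; the helper name `leak_rate_mb_per_min` is made up for this illustration, and the results are averages that ignore any idle periods.

```python
def leak_rate_mb_per_min(leaked_mb, minutes):
    """Average growth of the slab cache in MB per minute."""
    return leaked_mb / minutes


# Figures quoted in this thread: (label, MB allocated, minutes observed).
observations = [
    ("Forsaken Castle (this run)", 110, 12),
    ("Forsaken Castle (2018-09-14 comment)", 213, 20),
    ("Dota 2 (2018-09-14 comment)", 719, 12 * 60),
]

for label, mb, minutes in observations:
    print(f"{label}: {leak_rate_mb_per_min(mb, minutes):.2f} MB/min")
```

Both Forsaken Castle measurements come out at roughly 9 to 11 MB per minute, while Dota 2 averages about 1 MB per minute over its 12 hours.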

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/11

------------------------------------------------------------------------
On 2018-09-18T08:16:01+00:00 2-jason-v wrote:

Created attachment 141619
DRM syncobj fence leak fix

Mind trying a kernel patch? I'm not in a position to experiment with
kernel patches at the moment but I think I found the bug.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/12

------------------------------------------------------------------------
On 2018-09-18T12:41:29+00:00 Richard Yao wrote:

I am able to patch my kernel and rebuild to test, but I will not have
access to the workstation that reproduces the problem for another 12
hours.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/13

------------------------------------------------------------------------
On 2018-09-18T14:40:53+00:00 Leonardo Müller wrote:

Created attachment 141641
Logs and screenshot

Great news! Looks like the memory leak is no longer happening or is
very, very slow (see later). I played the Forsaken Castle demo for 10
minutes, a game I became skilled at thanks to this bug, and the object
count started at 100 and finished at 100. The attached file has dmesg,
/proc/slabinfo and a screenshot.

Compared to the previous situation, where 12 minutes of playing
allocated 110 MB, it's perfect. I'm not sure if it's a placebo effect,
but it looked like the game was running faster too, 125-130 FPS instead
of 100-120 FPS.

Later I started Dota 2 (INTEL_DEBUG=nohiz used due to the hang and crash
bug) and the i915_request object count rose to 650; when I closed it,
the count dropped to 350. I restarted the X session and there is no sign
of i915_request in slabtop; /proc/slabinfo has:

i915_request 350 450 640 25 4 : tunables 0 0 0 : slabdata 18 18 0

The values seem pretty small, but they did not decrease after restarting
X; I'm not sure how relevant this is.
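
For a sense of scale, the /proc/slabinfo line above can be turned into byte counts. This is a minimal sketch in Python, assuming the slabinfo 2.1 column layout (name, active_objs, num_objs, objsize, objperslab, pagesperslab, then the tunables and slabdata groups) and a 4096-byte page size, which is an assumption about this machine. The function name `parse_slabinfo_line` is illustrative.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (typical for x86_64)


def parse_slabinfo_line(line):
    """Parse one data line of /proc/slabinfo (assumed format version 2.1)."""
    stats, _tunables, slabdata = line.split(":")
    name, active_objs, num_objs, objsize, objperslab, pagesperslab = stats.split()
    # The slabdata group reads: "slabdata <active_slabs> <num_slabs> <sharedavail>"
    active_slabs, num_slabs, _sharedavail = slabdata.split()[1:]
    info = {
        "name": name,
        "active_objs": int(active_objs),
        "num_objs": int(num_objs),
        "objsize": int(objsize),
        "objperslab": int(objperslab),
        "pagesperslab": int(pagesperslab),
        "active_slabs": int(active_slabs),
        "num_slabs": int(num_slabs),
    }
    # Bytes held by live objects, and bytes of pages backing the whole cache.
    info["active_bytes"] = info["active_objs"] * info["objsize"]
    info["slab_bytes"] = info["num_slabs"] * info["pagesperslab"] * PAGE_SIZE
    return info


line = "i915_request 350 450 640 25 4 : tunables 0 0 0 : slabdata 18 18 0"
info = parse_slabinfo_line(line)
print(info["active_bytes"], info["slab_bytes"])
```

Under those assumptions, the 350 remaining objects of 640 bytes each account for 224000 bytes, and the 18 slabs of 4 pages each occupy 294912 bytes, far below the hundreds of megabytes seen before the patch.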

This patch looks great.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/14

------------------------------------------------------------------------
On 2018-09-18T16:25:54+00:00 Richard Yao wrote:

We would need to dump the objects from the slab cache with crash to
confirm that the objects are long lived and should not be there, but it
sounds like there might be a small leak remaining. Unfortunately, I do
not have a handy method available for dumping all objects in a slab
cache via the crash utility. The best that I have done so far is get a
list of object addresses and then dump each one manually.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/15

------------------------------------------------------------------------
On 2018-09-18T18:18:41+00:00 Leonardo Müller wrote:

I could test a bit further. I opened another DXVK game (DiRT 3 Complete
Edition), also using INTEL_DEBUG=nohiz, let one replay run for one hour
and checked i915_request again. The count rose to 575 objects and
stayed at that value when I closed the game.

What I noticed is that after I opened a Wine OpenGL game (rolc.exe) the
value reduced to 100 again, the same value that was at boot.

Looks like the leak as before no longer exists, as that cache was
cleaned by the OpenGL application. Before this patch the cache was never
cleaned, requiring a reboot.

The curious thing is that the Vulkan/DXVK applications aren't cleaning
part of the i915_request cache after they close, which the OpenGL
application is doing for them.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/16

------------------------------------------------------------------------
On 2018-09-19T17:55:54+00:00 Richard Yao wrote:

Jason, applying your patch against Linux 4.18.0-rc8 (yes, I know I
should upgrade) resolves the issue in Rise of Nations: Extended Edition.
In my cursory test, i915_request allocations do not exceed 163 and alt-
tabbing back to the KDE 5 desktop drops them to 15. Prior to the patch,
it would have thousands of allocations by now. I do not see the issue
that leozinho29_eu reported, although I did not try `INTEL_DEBUG=nohiz`.

Your patch looks good to me. Feel free to add my Reviewed-by and my
Tested-by. To be clear, I know what saying to add my Reviewed-by means.
I followed the "Reviewer's statement of oversight" before offering it:

https://www.kernel.org/doc/html/v4.18/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/17

------------------------------------------------------------------------
On 2018-09-19T19:13:19+00:00 Richard Yao wrote:

This might just be my lack of familiarity with the codebase, but why is
ANV affected by this while RADV is not? I do not have any hardware to
use to study how RADV works, but at a glance, it is relatively easy to
see the fence bits being handled in the i915 GEM code while I am unable
to tell at a glance how the amdgpu GEM code uses them. It seems to be
done in a very abstract way in the amdgpu_cs.c file and it is not clear
to me how that gets invoked by RADV. I am curious if the difference
might point to the possibility that ANV is overly aggressive at fencing.
I believe that DXVK should be making the same API calls on both.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/18

------------------------------------------------------------------------
On 2018-09-26T07:28:53+00:00 2-jason-v wrote:

I've submitted the kernel patch to the mailing list. Hopefully, it will
land fairly soon and we'll make sure it gets back-ported as far as
needed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/19

------------------------------------------------------------------------
On 2018-09-27T07:49:06+00:00 2-jason-v wrote:

The fix has landed in drm-misc-fixes; it will propagate to a kernel
release near you shortly.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/20

------------------------------------------------------------------------
On 2018-10-15T08:28:03+00:00 Lakshminarayana-vudum wrote:

Yurii, do you still have the issue? Based on your confirmation, I can
close the bug.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/21

------------------------------------------------------------------------
On 2018-10-15T09:11:52+00:00 yurikoles wrote:

Hi Lakshmi!
I just forwarded the issue here; I don't even have a Linux desktop now. I have also asked the same question on GitHub.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/22

------------------------------------------------------------------------
On 2018-10-15T15:49:18+00:00 yurikoles wrote:

In the original report users say that this was fixed.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/23

------------------------------------------------------------------------
On 2018-10-16T06:16:18+00:00 Lakshminarayana-vudum wrote:

Thanks for your feedback, Yurii. I consider this bug fixed.
Closing this bug.

Reply at:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1798165/comments/24


** Changed in: linux
Status: Unknown => Fix Released

** Changed in: linux
Importance: Unknown => Medium

** Bug watch added: github.com/doitsujin/dxvk/issues #632
https://github.com/doitsujin/dxvk/issues/632

--
You received this bug notification because you are subscribed to linux
in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/1798165

Title:
Vulkan applications cause permanent memory leak with Intel GPU

Status in Linux:
Fix Released
Status in linux package in Ubuntu:
In Progress
Status in linux source package in Bionic:
Fix Committed
Status in linux source package in Cosmic:
Fix Committed

Bug description:

== SRU Justification ==
Vulkan applications, such as Dota 2 and DXVK games, cause a memory leak
where memory is never freed, which can crash the system if the
applications are used for long enough. Certain applications can leak as
much as 10 MB/minute.

This commit has been cc'd to upstream stable, but it has not landed in
Bionic or Cosmic as of yet.

Details about the upstream bug can be seen at:
https://github.com/doitsujin/dxvk/issues/632
https://bugs.freedesktop.org/show_bug.cgi?id=107899

== Fix ==
337fe9f5c1e7 ("drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set")

== Regression Potential ==
Low. This commit has been cc'd to stable, so it has had additional
upstream review.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.


Vulkan applications, such as Dota 2 and DXVK games, cause a memory leak
where memory is never freed, which can crash the system if the
applications are used for long enough. Certain applications can leak
as much as 10 MB/minute.

Details about this bug can be seen at
https://github.com/doitsujin/dxvk/issues/632 and
https://bugs.freedesktop.org/show_bug.cgi?id=107899

This bug was fixed in 4.19-rc6 and was backported to 4.14 and 4.18.
The particular commit is:

commit a2cef7d049f07995406b403605119a54881daf15
Author: Jason Ekstrand <jason@jlekstrand.net>
Date: Wed Sep 26 02:17:03 2018 -0500

    drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set

    commit 337fe9f5c1e7de1f391c6a692531379d2aa2ee11 upstream.

    We attempt to get fences earlier in the hopes that everything will
    already have fences and no callbacks will be needed. If we do succeed
    in getting a fence, getting one a second time will result in a duplicate
    ref with no unref. This is causing memory leaks in Vulkan applications
    that create a lot of fences; playing for a few hours can, apparently,
    bring down the system.

    Cc: stable@vger.kernel.org
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107899
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Jason Ekstrand <jason@jlekstrand.net>
    Signed-off-by: Sean Paul <seanpaul@chromium.org>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180926071703.15257-1-jason.ekstrand@intel.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
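
The reference-counting mistake described in the commit message can be illustrated with a toy model. The Python sketch below is not the kernel code: the `Fence` class, the `wait_once` function and the `skip_if_already_held` switch are hypothetical names, and the idea that the fix amounts to skipping the second lookup when a fence is already held is an assumption drawn from the commit message alone.

```python
class Fence:
    """Toy stand-in for a reference-counted kernel fence."""

    def __init__(self):
        self.refcount = 1  # the reference owned by the sync object

    def get(self):
        self.refcount += 1
        return self

    def put(self):
        self.refcount -= 1


def wait_once(fence, skip_if_already_held):
    """Model one wait call and return the fence refcount afterwards."""
    # Early, optimistic pass: in this toy the fence always exists,
    # so a reference is taken right away.
    entry = fence.get()

    # Later pass, modelling the path taken when WAIT_FOR_SUBMIT is set.
    if not skip_if_already_held:
        # Buggy behaviour: a second get() replaces the slot, so the
        # first reference can never be dropped.
        entry = fence.get()

    # Cleanup drops exactly one reference per slot.
    entry.put()
    return fence.refcount


leaky = Fence()
for _ in range(3):
    wait_once(leaky, skip_if_already_held=False)
print(leaky.refcount)  # grows by one per wait in the buggy model

fixed = Fence()
for _ in range(3):
    wait_once(fixed, skip_if_already_held=True)
print(fixed.refcount)  # stays at the owner's single reference
```

In this model every buggy wait leaves one extra reference behind, so the fence can never reach a refcount of zero and be freed. That matches the symptom reported here of i915_request objects that outlive the application.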

On Ubuntu 18.04 with 4.15.0-36 it appears in slabtop as:
https://i.imgur.com/qMAvuwl.png

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-36-generic 4.15.0-36.39
ProcVersionSignature: Ubuntu 4.15.0-36.39-generic 4.15.18
Uname: Linux 4.15.0-36-generic x86_64
ApportVersion: 2.20.9-0ubuntu7.4
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: usuario 4655 F.... pulseaudio
 /dev/snd/seq: usuario 4640 F.... timidity
CurrentDesktop: XFCE
Date: Tue Oct 16 14:16:54 2018
HibernationDevice: RESUME=UUID=0946602f-3ca2-4379-9012-7a5171928de7
InstallationDate: Installed on 2017-06-13 (489 days ago)
InstallationMedia: Xubuntu 17.04 "Zesty Zapus" - Release amd64 (20170412)
MachineType: LENOVO 80UG
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-36-generic root=UUID=6b4ae5c0-c78c-49a6-a1ba-029192618a7a ro quiet ro kvm.ignore_msrs=1 kvm.halt_poll_ns=0 kvm.halt_poll_ns_grow=0 intel_iommu=on iommu=pt i915.enable_gvt=1 i915.fastboot=1 resume=UUID=0946602f-3ca2-4379-9012-7a5171928de7 mtrr_gran_size=2M mtrr_chunk_size=64M cgroup_enable=memory swapaccount=1 zswap.enabled=1 log_buf_len=16M usbhid.quirks=0x0079:0x0006:0x100000
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-36-generic N/A
 linux-backports-modules-4.15.0-36-generic N/A
 linux-firmware 1.173.1
SourcePackage: linux
UpgradeStatus: Upgraded to bionic on 2017-10-20 (361 days ago)
dmi.bios.date: 08/09/2018
dmi.bios.vendor: LENOVO
dmi.bios.version: 0XCN45WW
dmi.board.asset.tag: NO Asset Tag
dmi.board.name: Toronto 4A2
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40679 WIN
dmi.chassis.asset.tag: NO Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Lenovo ideapad 310-14ISK
dmi.modalias: dmi:bvnLENOVO:bvr0XCN45WW:bd08/09/2018:svnLENOVO:pn80UG:pvrLenovoideapad310-14ISK:rvnLENOVO:rnToronto4A2:rvrSDK0J40679WIN:cvnLENOVO:ct10:cvrLenovoideapad310-14ISK:
dmi.product.family: IDEAPAD
dmi.product.name: 80UG
dmi.product.version: Lenovo ideapad 310-14ISK
dmi.sys.vendor: LENOVO

To manage notifications about this bug go to:
https://bugs.launchpad.net/linux/+bug/1798165/+subscriptions
