I see, thanks for the info.
I'll report my findings so far here, in case they turn out to be useful to
someone landing on this bug later:
The call trace of the deadlocked btrfs-cleaner kthread is as follows.
Tainted: P OE 4.15.0-45-generic #48-Ubuntu
btrfs-cleaner D 0 7969 2 0x80000000
Call Trace:
__schedule+0x291/0x8a0
schedule+0x2c/0x80
btrfs_tree_read_lock+0xcc/0x120 [btrfs]
? wait_woken+0x80/0x80
find_parent_nodes+0x295/0xe90 [btrfs]
? _cond_resched+0x19/0x40
btrfs_find_all_roots_safe+0xb0/0x120 [btrfs]
? btrfs_find_all_roots_safe+0xb0/0x120 [btrfs]
btrfs_find_all_roots+0x61/0x80 [btrfs]
btrfs_qgroup_trace_extent_post+0x37/0x60 [btrfs]
[...]
I'm not including the bottom of the call trace because it varies; the part from btrfs_qgroup_trace_extent_post upwards is common, however. The caller of btrfs_qgroup_trace_extent_post can be one of:
- btrfs_qgroup_trace_extent+0xee/0x110 [btrfs], or
- btrfs_add_delayed_tree_ref+0x1c6/0x1f0 [btrfs], or
- btrfs_add_delayed_data_ref+0x30a/0x340 [btrfs]
This happens on 4.15 (Ubuntu flavor), 4.18 (Ubuntu "HWE" flavor), and
4.20.13 (vanilla).
On 4.20.0 (vanilla) and 5.0-rc8 (vanilla), there is also a deadlock
under similar conditions, but the call trace of the deadlocked btrfs-
transaction kthread looks different:
Tainted: P OE 4.20.0-042000-generic #201812232030
btrfs-transacti D 0 8665 2 0x80000000
Call Trace:
__schedule+0x29e/0x840
? btrfs_free_path+0x13/0x20 [btrfs]
schedule+0x2c/0x80
btrfs_commit_transaction+0x715/0x840 [btrfs]
? wait_woken+0x80/0x80
transaction_kthread+0x15c/0x190 [btrfs]
kthread+0x120/0x140
? btrfs_cleanup_transaction+0x560/0x560 [btrfs]
? __kthread_parkme+0x70/0x70
ret_from_fork+0x35/0x40
Other userspace threads are blocked at the same time.
So we seem to be dealing with at least two different deadlock cases, both
of which occur with lots of subvolumes and/or snapshots and with quotas
enabled. All of this disappears when quotas are disabled.
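In case someone wants to check whether they are hitting the same signature: the stacks of the blocked kernel threads can be captured without any special tooling. A rough sketch, assuming root and a kernel with /proc/<pid>/stack and sysrq available (the btrfs-cleaner name is taken from the trace above):

  # stack of the stuck cleaner thread
  cat /proc/$(pgrep -x btrfs-cleaner)/stack

  # or dump all tasks stuck in uninterruptible (D) state to the kernel log
  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 300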
For the record, the main btrfs qgroups developer seems to have a number of
pending changes/fixes in this area, expected for 5.1 or 5.2.
Stay tuned...
I have disabled quota for now. I only enable it for a short period of
time when I need to get size information about my subvols and snapshots.
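For completeness, the workaround looks roughly like this on my side; a minimal sketch, assuming the filesystem is mounted at /mnt/pool (substitute the real mount point) and stock btrfs-progs:

  # keep quotas off during normal operation
  btrfs quota disable /mnt/pool

  # only when size information is needed:
  btrfs quota enable /mnt/pool
  btrfs quota rescan -w /mnt/pool    # -w waits for the rescan to finish
  btrfs qgroup show /mnt/pool        # referenced/exclusive sizes per subvol/snapshot
  btrfs quota disable /mnt/pool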
--
https://bugs.launchpad.net/bugs/1765998
Title:
FS access deadlock with btrfs quotas enabled
Status in linux package in Ubuntu:
Triaged
Status in linux source package in Bionic:
Triaged
Bug description:
I'm running into an issue on Ubuntu Bionic (but not Xenial) where
shortly after boot, under heavy load from many LXD containers starting
at once, access to the btrfs filesystem that the containers are on
deadlocks.
The issue is quite hard to reproduce on other systems, quite likely
related to the size of the filesystem involved (4 devices with a total
of 8TB, millions of files, ~20 subvolumes with tens of snapshots each)
and the access pattern from many LXD containers at once. It definitely
goes away when disabling btrfs quotas though. Another prerequisite to
trigger this bug may be the container subvolumes sharing extents (from
their parent image or due to deduplication).
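In case it helps anyone trying to reproduce this elsewhere, both factors can be checked with plain btrfs-progs; a rough sketch, assuming the filesystem is mounted at /mnt/containers (the path is only an example) and a btrfs-progs recent enough to have "btrfs filesystem du":

  # number of subvolumes and snapshots
  btrfs subvolume list /mnt/containers | wc -l
  btrfs subvolume list -s /mnt/containers | wc -l   # snapshots only

  # per-subvolume usage, including data shared with other subvolumes
  btrfs filesystem du -s /mnt/containers/*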
I can only reliably reproduce it on a production system on which I can do very limited testing; however, I have been able to gather the following information:
- Many threads are stuck, trying to acquire locks on various tree roots, which are never released by their current holders.
- There always seem to be (at least) two threads executing rmdir syscalls that create the circular dependency: one of them is in btrfs_cow_block => ... => btrfs_qgroup_trace_extent_post => ... => find_parent_nodes and wants to acquire a lock that was already acquired by btrfs_search_slot of the other rmdir.
- Reverting this patch seems to prevent it from happening: https://patchwork.kernel.org/patch/9573267/