Monday

[Bug 2146310] Re: NFSv4 client hang in OPEN reclaim path waiting for RPC completion

** Attachment added: "flock_test.py"
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146310/+attachment/5956857/+files/flock_test.py

** Description changed:

NFSv4 client stuck during state recovery on Ubuntu 22.04 (kernel 5.15)

1. Environment

- Client OS: Ubuntu 22.04
- Kernel version: 5.15.0-113-generic
- NFS protocol: NFSv4.0

Mount options: 10.59.62.51:/ on /nfs type nfs4
(rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.59.254.244,local_lock=none,addr=10.59.62.51)

------------------------------------------------------------------------

2. Problem Description

We observed two types of abnormal behaviors related to NFSv4 client
state recovery.

------------------------------------------------------------------------

Case 1: No recovery after NFS4ERR_STALE_CLIENTID

Expected behavior (per RFC7530): - Client should re-establish client
identity via SETCLIENTID / SETCLIENTID_CONFIRM - Client should reclaim
state (open/lock)

Actual behavior: - Client keeps retrying normal requests - No recovery
process is triggered - No SETCLIENTID observed

------------------------------------------------------------------------

Case 2: Client stuck during reclaim after lease expiration

Scenario

1. Client stops sending RENEW due to network issue
2. Server considers the lease expired
3. After network recovery:
- - Client sends RENEW
- - Server responds with NFS4ERR_EXPIRED
+     - Client sends RENEW
+     - Server responds with NFS4ERR_EXPIRED
4. Client starts recovery:
- - SETCLIENTID succeeds
- - SETCLIENTID_CONFIRM succeeds
+     - SETCLIENTID succeeds
+     - SETCLIENTID_CONFIRM succeeds
5. Client enters reclaim phase(with open rpc reclaim=false)
-

Client gets stuck during reclaim phase.

- Stack trace:
- [<0>] rpc_wait_bit_killable
+ Stack trace:
+ [<0>] rpc_wait_bit_killable
[<0>] __rpc_wait_for_completion_task
- [<0>] nfs4_run_open_task
- [<0>] nfs4_open_recover_helper
+ [<0>] nfs4_run_open_task
+ [<0>] nfs4_open_recover_helper
[<0>] nfs4_open_recover
- [<0>] nfs4_do_open_expired
- [<0>] nfs40_open_expired
+ [<0>] nfs4_do_open_expired
+ [<0>] nfs40_open_expired
[<0>] __nfs4_reclaim_open_state
- [<0>] nfs4_reclaim_open_state
- [<0>] nfs4_do_reclaim
+ [<0>] nfs4_reclaim_open_state
+ [<0>] nfs4_do_reclaim
[<0>] nfs4_state_manager
-

------------------------------------------------------------------------

3. Reproduction Steps

1. Mount NFS filesystem (see above)
- 2. Run workload scripts:
- - create_and_open.sh
- - flock_test.py
+ 2. Run workload scripts(attachment below):
+     - create_and_open.sh
+     - flock_test.py
3. Restart NFS server during workload to cause the client lease to expire
4. Issue reproduces reliably

------------------------------------------------------------------------

Additional stack traces

create_and_open.sh
```
[<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
[<0>] nfs4_do_close+0x2d7/0x380 [nfsv4]
[<0>] __nfs4_close.constprop.0+0x11f/0x1f0 [nfsv4]
[<0>] nfs4_close_sync+0x13/0x20 [nfsv4]
[<0>] nfs4_close_context+0x35/0x60 [nfsv4]
[<0>] __put_nfs_open_context+0xc7/0x150 [nfs]
[<0>] nfs_file_clear_open_context+0x4c/0x60 [nfs]
[<0>] nfs_file_release+0x3e/0x50 [nfs]
[<0>] __fput+0x9c/0x280
[<0>] ____fput+0xe/0x20
[<0>] task_work_run+0x6a/0xb0
[<0>] exit_to_user_mode_loop+0x157/0x160
[<0>] exit_to_user_mode_prepare+0xa0/0xb0
[<0>] syscall_exit_to_user_mode+0x27/0x50
[<0>] do_syscall_64+0x63/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1
```

-
------------------------------------------------------------------------

flock_test.py
```
[<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
[<0>] _nfs4_do_setlk+0x290/0x410 [nfsv4]
[<0>] nfs4_proc_setlk+0x78/0x160 [nfsv4]
[<0>] nfs4_retry_setlk+0x1dd/0x250 [nfsv4]
[<0>] nfs4_proc_lock+0x9d/0x1b0 [nfsv4]
[<0>] do_setlk+0x64/0x100 [nfs]
[<0>] nfs_lock+0xb3/0x180 [nfs]
[<0>] do_lock_file_wait+0x4f/0x120
[<0>] fcntl_setlk+0x127/0x2e0
[<0>] do_fcntl+0x4ce/0x5a0
[<0>] __x64_sys_fcntl+0xa9/0xd0
[<0>] x64_sys_call+0x1f5c/0x1fa0
[<0>] do_syscall_64+0x56/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1
```

-
------------------------------------------------------------------------
-

[10.59.62.51-man]
```
[<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
[<0>] nfs4_run_open_task+0x152/0x1e0 [nfsv4]
[<0>] nfs4_open_recover_helper+0x155/0x210 [nfsv4]
[<0>] nfs4_open_recover+0x22/0xd0 [nfsv4]
[<0>] nfs4_do_open_reclaim+0x128/0x220 [nfsv4]
[<0>] nfs4_open_reclaim+0x42/0xa0 [nfsv4]
[<0>] __nfs4_reclaim_open_state+0x25/0x110 [nfsv4]
[<0>] nfs4_reclaim_open_state+0xd1/0x2c0 [nfsv4]
[<0>] nfs4_do_reclaim+0x12f/0x230 [nfsv4]
[<0>] nfs4_state_manager+0x6d9/0x870 [nfsv4]
[<0>] nfs4_run_state_manager+0xa8/0x1a0 [nfsv4]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
```

------------------------------------------------------------------------

4. Kernel Version Comparison

Affected:
Ubuntu 22.04 5.15.0-113-generic

Not affected:
Ubuntu 20.04 5.4.0-48-generic
Ubuntu 22.04 6.8.0-60-generic
Ubuntu 24.04 6.8.0-31-generic
Centos 7.9 4.19.188-10.el7.ucloud.x86_64
Centos 7.9 3.10.0-1062.9.1.el7.x86_64
Centos 8.3 4.18.0-240.1.1.el8_3.x86_64

-
------------------------------------------------------------------------

5. Questions

1. Is it expected that no recovery is triggered after
- NFS4ERR_STALE_CLIENTID?
+     NFS4ERR_STALE_CLIENTID?
2. During reclaim, should OPEN be sent with reclaim=true?
3. Could reclaim=false cause reclaim failure?
4. Why is client stuck in rpc_wait_bit_killable?
5. Is this a known issue in kernel 5.15?
6. Are there any related patches or fixes?

https://bugs.launchpad.net/bugs/2146310

Title:
NFSv4 client hang in OPEN reclaim path waiting for RPC completion

Status in linux package in Ubuntu:
New

Bug description:
NFSv4 client stuck during state recovery on Ubuntu 22.04 (kernel 5.15)

1. Environment

- Client OS: Ubuntu 22.04
- Kernel version: 5.15.0-113-generic
- NFS protocol: NFSv4.0

Mount options: 10.59.62.51:/ on /nfs type nfs4
(rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.59.254.244,local_lock=none,addr=10.59.62.51)
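
For triage, the option string above can be split mechanically. The following is a minimal sketch; the helper name `parse_mount_options` is hypothetical. Per nfs(5), `timeo` is expressed in tenths of a second, so `timeo=600` corresponds to a 60-second timeout, and with `hard` the client keeps retrying requests rather than returning an error to the application.

```python
def parse_mount_options(opts: str) -> dict:
    """Split a comma-separated mount option string into a dict.

    Flags without a value (e.g. "hard") map to True.
    """
    parsed = {}
    for item in opts.strip("()").split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


# Option string copied from the report.
OPTS = (
    "(rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,"
    "proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.59.254.244,"
    "local_lock=none,addr=10.59.62.51)"
)

opts = parse_mount_options(OPTS)
timeo_seconds = int(opts["timeo"]) / 10  # timeo is in deciseconds
print(opts["vers"], opts["proto"], timeo_seconds)
```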

------------------------------------------------------------------------

2. Problem Description

We observed two kinds of abnormal behavior related to NFSv4 client
state recovery.

------------------------------------------------------------------------

Case 1: No recovery after NFS4ERR_STALE_CLIENTID

Expected behavior (per RFC 7530):
- Client should re-establish client identity via SETCLIENTID / SETCLIENTID_CONFIRM
- Client should reclaim state (open/lock)

Actual behavior:
- Client keeps retrying normal requests
- No recovery process is triggered
- No SETCLIENTID observed

------------------------------------------------------------------------

Case 2: Client stuck during reclaim after lease expiration

Scenario

1. Client stops sending RENEW due to network issue
2. Server considers the lease expired
3. After network recovery:
    - Client sends RENEW
    - Server responds with NFS4ERR_EXPIRED
4. Client starts recovery:
    - SETCLIENTID succeeds
    - SETCLIENTID_CONFIRM succeeds
5. Client enters reclaim phase (OPEN RPCs are sent with reclaim=false)

The client gets stuck during the reclaim phase.

Stack trace:
[<0>] rpc_wait_bit_killable
[<0>] __rpc_wait_for_completion_task
[<0>] nfs4_run_open_task
[<0>] nfs4_open_recover_helper
[<0>] nfs4_open_recover
[<0>] nfs4_do_open_expired
[<0>] nfs40_open_expired
[<0>] __nfs4_reclaim_open_state
[<0>] nfs4_reclaim_open_state
[<0>] nfs4_do_reclaim
[<0>] nfs4_state_manager
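
One reading of RFC 7530 is that reclaim-type requests (OPEN with CLAIM_PREVIOUS, LOCK with reclaim=true) belong to recovery after a server restart, inside the grace period, whereas after a lease expiry the client re-establishes state with ordinary, non-reclaim requests. If that reading applies here, reclaim=false in step 5 would be consistent with the nfs40_open_expired frame in the trace above. The sketch below only encodes that reading as a lookup table; the names `RECOVERY_PLANS` and `plan_recovery` are hypothetical, and this is a simplified model, not the kernel's state machine.

```python
# Simplified, illustrative model of the two recovery flavours discussed in
# this report. It is an assumption-laden summary, not kernel code.
RECOVERY_PLANS = {
    "NFS4ERR_STALE_CLIENTID": {
        "cause": "server restarted",
        "open_claim": "CLAIM_PREVIOUS",  # reclaim during the grace period
        "lock_reclaim": True,
    },
    "NFS4ERR_EXPIRED": {
        "cause": "lease expired",
        "open_claim": "CLAIM_NULL",      # ordinary OPEN, no reclaim
        "lock_reclaim": False,
    },
}


def plan_recovery(error: str) -> dict:
    """Return the assumed recovery sequence for a given error code."""
    plan = RECOVERY_PLANS.get(error)
    if plan is None:
        raise ValueError(f"no recovery plan modelled for {error}")
    reclaim = str(plan["lock_reclaim"]).lower()
    steps = [
        "SETCLIENTID",
        "SETCLIENTID_CONFIRM",
        f"OPEN ({plan['open_claim']})",
        f"LOCK (reclaim={reclaim})",
    ]
    return {**plan, "steps": steps}


for err in RECOVERY_PLANS:
    print(err, "->", ", ".join(plan_recovery(err)["steps"]))
```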

------------------------------------------------------------------------

3. Reproduction Steps

1. Mount the NFS filesystem (see mount options above)
2. Run the workload scripts (attached to this bug):
    - create_and_open.sh
    - flock_test.py
3. Restart the NFS server during the workload to cause the client lease to expire
4. The issue reproduces reliably
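
For readers without access to the attachments, a lock workload of the general kind described in step 2 might look like the sketch below. This is not the attached flock_test.py; the function name `lock_cycle` and the default path are hypothetical. It uses `fcntl.lockf`, which takes POSIX byte-range locks through fcntl(2), the same syscall family as the fcntl_setlk frames in the flock_test.py trace.

```python
import fcntl
import sys


def lock_cycle(path: str, iterations: int = 3) -> int:
    """Repeatedly take and release an exclusive POSIX lock on `path`.

    Returns the number of completed lock/unlock cycles.
    """
    completed = 0
    with open(path, "a+") as f:
        for i in range(iterations):
            fcntl.lockf(f, fcntl.LOCK_EX)      # blocks until the lock is granted
            try:
                f.write(f"iteration {i}\n")
                f.flush()
            finally:
                fcntl.lockf(f, fcntl.LOCK_UN)  # always release the lock
            completed += 1
    return completed


if __name__ == "__main__":
    # Hypothetical target on the NFS mount; pass another path as argv[1].
    target = sys.argv[1] if len(sys.argv) > 1 else "/nfs/lock_cycle.tmp"
    print(lock_cycle(target, iterations=1000))
```

Running it against a file under /nfs while the server is restarted would exercise the LOCK path; on a local filesystem it simply completes.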

------------------------------------------------------------------------

Additional stack traces

create_and_open.sh
```
[<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
[<0>] nfs4_do_close+0x2d7/0x380 [nfsv4]
[<0>] __nfs4_close.constprop.0+0x11f/0x1f0 [nfsv4]
[<0>] nfs4_close_sync+0x13/0x20 [nfsv4]
[<0>] nfs4_close_context+0x35/0x60 [nfsv4]
[<0>] __put_nfs_open_context+0xc7/0x150 [nfs]
[<0>] nfs_file_clear_open_context+0x4c/0x60 [nfs]
[<0>] nfs_file_release+0x3e/0x50 [nfs]
[<0>] __fput+0x9c/0x280
[<0>] ____fput+0xe/0x20
[<0>] task_work_run+0x6a/0xb0
[<0>] exit_to_user_mode_loop+0x157/0x160
[<0>] exit_to_user_mode_prepare+0xa0/0xb0
[<0>] syscall_exit_to_user_mode+0x27/0x50
[<0>] do_syscall_64+0x63/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1
```

------------------------------------------------------------------------

flock_test.py
```
[<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
[<0>] _nfs4_do_setlk+0x290/0x410 [nfsv4]
[<0>] nfs4_proc_setlk+0x78/0x160 [nfsv4]
[<0>] nfs4_retry_setlk+0x1dd/0x250 [nfsv4]
[<0>] nfs4_proc_lock+0x9d/0x1b0 [nfsv4]
[<0>] do_setlk+0x64/0x100 [nfs]
[<0>] nfs_lock+0xb3/0x180 [nfs]
[<0>] do_lock_file_wait+0x4f/0x120
[<0>] fcntl_setlk+0x127/0x2e0
[<0>] do_fcntl+0x4ce/0x5a0
[<0>] __x64_sys_fcntl+0xa9/0xd0
[<0>] x64_sys_call+0x1f5c/0x1fa0
[<0>] do_syscall_64+0x56/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1
```

------------------------------------------------------------------------

[10.59.62.51-man] (NFSv4 state manager kernel thread)
```
[<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
[<0>] nfs4_run_open_task+0x152/0x1e0 [nfsv4]
[<0>] nfs4_open_recover_helper+0x155/0x210 [nfsv4]
[<0>] nfs4_open_recover+0x22/0xd0 [nfsv4]
[<0>] nfs4_do_open_reclaim+0x128/0x220 [nfsv4]
[<0>] nfs4_open_reclaim+0x42/0xa0 [nfsv4]
[<0>] __nfs4_reclaim_open_state+0x25/0x110 [nfsv4]
[<0>] nfs4_reclaim_open_state+0xd1/0x2c0 [nfsv4]
[<0>] nfs4_do_reclaim+0x12f/0x230 [nfsv4]
[<0>] nfs4_state_manager+0x6d9/0x870 [nfsv4]
[<0>] nfs4_run_state_manager+0xa8/0x1a0 [nfsv4]
[<0>] kthread+0x127/0x150
[<0>] ret_from_fork+0x1f/0x30
```
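
All three traces above block in the same two sunrpc frames; the first [nfsv4] frame is what distinguishes the operation being waited on (CLOSE, LOCK, OPEN recovery). A small parser can pull that out when comparing many tasks. This is a sketch assuming the `[<0>] func+0xOFF/0xSIZE [module]` line format shown above; `parse_stack` and `first_frame_in` are hypothetical helper names, and frames without a module suffix are labelled "vmlinux" here purely as a convention.

```python
import re

# One frame per line: "[<0>] func+0xoff/0xsize [module]" (module optional).
FRAME_RE = re.compile(
    r"^\[<0>\]\s+(?P<func>[^\s+]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>[^\]]+)\])?$"
)


def parse_stack(text: str) -> list:
    """Return (function, module) tuples for each recognised frame."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((m["func"], m["mod"] or "vmlinux"))
    return frames


def first_frame_in(frames: list, module: str):
    """Return the innermost function belonging to `module`, or None."""
    for func, mod in frames:
        if mod == module:
            return func
    return None


# Excerpt copied from the state manager trace in this report.
SAMPLE = """\
[<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
[<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
[<0>] nfs4_run_open_task+0x152/0x1e0 [nfsv4]
[<0>] kthread+0x127/0x150
"""

frames = parse_stack(SAMPLE)
print(first_frame_in(frames, "nfsv4"))
```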

------------------------------------------------------------------------

4. Kernel Version Comparison

Affected:
Ubuntu 22.04 5.15.0-113-generic

Not affected:
Ubuntu 20.04 5.4.0-48-generic
Ubuntu 22.04 6.8.0-60-generic
Ubuntu 24.04 6.8.0-31-generic
CentOS 7.9 4.19.188-10.el7.ucloud.x86_64
CentOS 7.9 3.10.0-1062.9.1.el7.x86_64
CentOS 8.3 4.18.0-240.1.1.el8_3.x86_64

------------------------------------------------------------------------

5. Questions

1. Is it expected that no recovery is triggered after
    NFS4ERR_STALE_CLIENTID?
2. During reclaim, should OPEN be sent with reclaim=true?
3. Could reclaim=false cause reclaim failure?
4. Why is the client stuck in rpc_wait_bit_killable?
5. Is this a known issue in kernel 5.15?
6. Are there any related patches or fixes?
