** Description changed:
- Hi,
+ NFSv4 client stuck during state recovery on Ubuntu 22.04 (kernel 5.15)
- We are seeing an NFSv4.0 client hang on Linux kernel 5.15 (Ubuntu
- 22.04).
+ 1. Environment
- The issue starts when the server returns NFS4ERR_EXPIRED. The client
- then enters recovery, but reclaim never completes.
+ - Client OS: Ubuntu 22.04
+ - Kernel version: 5.15.0-113-generic
+ - NFS protocol: NFSv4.0
- The state manager thread is stuck with the following stack:
+ Mount options: 10.59.62.51:/ on /nfs type nfs4
+ (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.59.254.244,local_lock=none,addr=10.59.62.51)
+ ------------------------------------------------------------------------
+
+ 2. Problem Description
+
+ We observed two abnormal behaviors related to NFSv4 client state
+ recovery.
+
+ ------------------------------------------------------------------------
+
+ Case 1: No recovery after NFS4ERR_STALE_CLIENTID
+
+ Expected behavior (per RFC 7530):
+ - Client should re-establish client identity via SETCLIENTID / SETCLIENTID_CONFIRM
+ - Client should reclaim state (open/lock)
+
+ Actual behavior:
+ - Client keeps retrying normal requests
+ - No recovery process is triggered
+ - No SETCLIENTID observed
+
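The absence of recovery can be checked mechanically from a packet capture. A minimal sketch, assuming the capture has been rendered as a text summary (e.g. by tshark) in which the NFSv4 operation and error names appear literally:

```python
# Sketch: verify from a text packet summary (e.g. tshark output) that the
# client never attempted recovery after NFS4ERR_STALE_CLIENTID. Assumes the
# NFSv4 operation and error names appear literally in the summary lines.

def recovery_attempted(summary_lines):
    """True if a SETCLIENTID appears after the first STALE_CLIENTID error."""
    saw_stale = False
    for line in summary_lines:
        if "STALE_CLIENTID" in line:
            saw_stale = True
        elif saw_stale and "SETCLIENTID" in line:
            return True
    return False

if __name__ == "__main__":
    sample = [
        "NFS V4 Reply OPEN Status: NFS4ERR_STALE_CLIENTID",
        "NFS V4 Call OPEN",  # client just retries; no SETCLIENTID follows
    ]
    print(recovery_attempted(sample))  # prints False (matches the bad behavior)
```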
+ ------------------------------------------------------------------------
+
+ Case 2: Client stuck during reclaim after lease expiration
+
+ Scenario:
+
+ 1. Client stops sending RENEW due to network issue
+ 2. Server considers the lease expired
+ 3. After network recovery:
+ - Client sends RENEW
+ - Server responds with NFS4ERR_EXPIRED
+ 4. Client starts recovery:
+ - SETCLIENTID succeeds
+ - SETCLIENTID_CONFIRM succeeds
+ 5. Client enters the reclaim phase (with OPEN RPC reclaim=false)
+
+
+ The client gets stuck during the reclaim phase.
+
+ Stack trace:
+ [<0>] rpc_wait_bit_killable
+ [<0>] __rpc_wait_for_completion_task
+ [<0>] nfs4_run_open_task
+ [<0>] nfs4_open_recover_helper
+ [<0>] nfs4_open_recover
+ [<0>] nfs4_do_open_expired
+ [<0>] nfs40_open_expired
+ [<0>] __nfs4_reclaim_open_state
+ [<0>] nfs4_reclaim_open_state
+ [<0>] nfs4_do_reclaim
+ [<0>] nfs4_state_manager
+
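The stack above was taken from /proc/<pid>/stack. A sketch of the collection step we used (the helper names are ours; reading these files requires root, and unreadable entries are skipped):

```python
# Sketch: find tasks blocked in rpc_wait_bit_killable by scanning
# /proc/<pid>/stack, the same source the traces in this report came from.
# Reading these files needs root; unreadable or vanished entries are skipped.
import os

def is_stuck_in_rpc_wait(stack_text):
    """True if a captured kernel stack shows the task waiting in the sunrpc layer."""
    return "rpc_wait_bit_killable" in stack_text

def find_stuck_tasks(proc="/proc"):
    """Map pid -> stack text for tasks currently blocked in rpc_wait_bit_killable."""
    stuck = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stack")) as f:
                text = f.read()
        except OSError:
            continue  # no permission, or the task exited meanwhile
        if is_stuck_in_rpc_wait(text):
            stuck[int(entry)] = text
    return stuck

if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for pid, text in find_stuck_tasks().items():
            print(pid, text.splitlines()[0])
```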
+
+ ------------------------------------------------------------------------
+
+ 3. Reproduction Steps
+
+ 1. Mount NFS filesystem (see above)
+ 2. Run workload scripts (attached below):
+ - create_and_open.sh
+ - flock_test.py
+ 3. Restart NFS server during workload to cause the client lease to expire
+ 4. Issue reproduces reliably
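
The steps above can be sketched as a dry-run driver. Mount options and workload names are from this report; the ssh/systemctl command for restarting the server is an assumption about the server setup, and in practice the workloads run in the background while the server restarts. Nothing executes unless dry_run=False:

```python
# Dry-run sketch of the reproduction sequence. The mount options and workload
# names are taken from this report; the ssh/systemctl restart command is an
# assumption about the server side. run_steps() only prints the plan by default.
import subprocess

STEPS = [
    ["mount", "-t", "nfs4", "-o", "vers=4.0,hard,proto=tcp,timeo=600,retrans=2",
     "10.59.62.51:/", "/nfs"],
    ["sh", "create_and_open.sh"],   # workload (attached)
    ["python3", "flock_test.py"],   # workload (attached)
    ["ssh", "10.59.62.51", "systemctl", "restart", "nfs-server"],  # expire the lease (assumed command)
]

def run_steps(steps, dry_run=True):
    """Return the command lines as strings; execute them only when dry_run is False."""
    plan = []
    for cmd in steps:
        plan.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return plan

if __name__ == "__main__":
    for line in run_steps(STEPS):
        print(line)
```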
+
+ ------------------------------------------------------------------------
+
+ Additional stack traces
+
+ create_and_open.sh
```
- rpc_wait_bit_killable
- __rpc_wait_for_completion_task
- nfs4_run_open_task
- nfs4_open_recover_helper
- nfs4_open_recover
- nfs4_do_open_expired
- nfs40_open_expired
- __nfs4_reclaim_open_state
- nfs4_reclaim_open_state
- nfs4_do_reclaim
- nfs4_state_manager
+ [<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
+ [<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
+ [<0>] nfs4_do_close+0x2d7/0x380 [nfsv4]
+ [<0>] __nfs4_close.constprop.0+0x11f/0x1f0 [nfsv4]
+ [<0>] nfs4_close_sync+0x13/0x20 [nfsv4]
+ [<0>] nfs4_close_context+0x35/0x60 [nfsv4]
+ [<0>] __put_nfs_open_context+0xc7/0x150 [nfs]
+ [<0>] nfs_file_clear_open_context+0x4c/0x60 [nfs]
+ [<0>] nfs_file_release+0x3e/0x50 [nfs]
+ [<0>] __fput+0x9c/0x280
+ [<0>] ____fput+0xe/0x20
+ [<0>] task_work_run+0x6a/0xb0
+ [<0>] exit_to_user_mode_loop+0x157/0x160
+ [<0>] exit_to_user_mode_prepare+0xa0/0xb0
+ [<0>] syscall_exit_to_user_mode+0x27/0x50
+ [<0>] do_syscall_64+0x63/0xb0
+ [<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1
```
- Meanwhile:
- - The server repeatedly returns NFS4ERR_EXPIRED
- - The client does not successfully reclaim state
- - IO continues and repeatedly fails
- RPC stats show:
- - ~30M calls
- - very low retransmissions (94)
+ ------------------------------------------------------------------------
- This suggests the issue is unlikely to be caused by network loss or
- server unresponsiveness.
+ flock_test.py
+ ```
+ [<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
+ [<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
+ [<0>] _nfs4_do_setlk+0x290/0x410 [nfsv4]
+ [<0>] nfs4_proc_setlk+0x78/0x160 [nfsv4]
+ [<0>] nfs4_retry_setlk+0x1dd/0x250 [nfsv4]
+ [<0>] nfs4_proc_lock+0x9d/0x1b0 [nfsv4]
+ [<0>] do_setlk+0x64/0x100 [nfs]
+ [<0>] nfs_lock+0xb3/0x180 [nfs]
+ [<0>] do_lock_file_wait+0x4f/0x120
+ [<0>] fcntl_setlk+0x127/0x2e0
+ [<0>] do_fcntl+0x4ce/0x5a0
+ [<0>] __x64_sys_fcntl+0xa9/0xd0
+ [<0>] x64_sys_call+0x1f5c/0x1fa0
+ [<0>] do_syscall_64+0x56/0xb0
+ [<0>] entry_SYSCALL_64_after_hwframe+0x67/0xd1
+ ```
- Additionally, we have verified that:
- - Network connectivity is stable
- - The NFS server is operating normally (no restart or failover observed)
- Importantly:
- - We do observe that RENEW/SEQUENCE-related traffic is being sent from the client
- - However, the client still ends up with an expired lease (NFS4ERR_EXPIRED)
+ ------------------------------------------------------------------------
- This raises the question whether the lease renewal is not being properly
- processed or completed on the client side.
- Given that we are using NFSv4.1 (where lease renewal is implicit via
- SEQUENCE), we would like to understand:
+ [10.59.62.51-man] (the NFSv4 state manager kthread)
+ ```
+ [<0>] rpc_wait_bit_killable+0x25/0xb0 [sunrpc]
+ [<0>] __rpc_wait_for_completion_task+0x2d/0x40 [sunrpc]
+ [<0>] nfs4_run_open_task+0x152/0x1e0 [nfsv4]
+ [<0>] nfs4_open_recover_helper+0x155/0x210 [nfsv4]
+ [<0>] nfs4_open_recover+0x22/0xd0 [nfsv4]
+ [<0>] nfs4_do_open_reclaim+0x128/0x220 [nfsv4]
+ [<0>] nfs4_open_reclaim+0x42/0xa0 [nfsv4]
+ [<0>] __nfs4_reclaim_open_state+0x25/0x110 [nfsv4]
+ [<0>] nfs4_reclaim_open_state+0xd1/0x2c0 [nfsv4]
+ [<0>] nfs4_do_reclaim+0x12f/0x230 [nfsv4]
+ [<0>] nfs4_state_manager+0x6d9/0x870 [nfsv4]
+ [<0>] nfs4_run_state_manager+0xa8/0x1a0 [nfsv4]
+ [<0>] kthread+0x127/0x150
+ [<0>] ret_from_fork+0x1f/0x30
+ ```
- 1. Under what conditions could the client still hit NFS4ERR_EXPIRED despite ongoing renew/SEQUENCE activity and a healthy server/network?
- 2. Is it possible that RPC completion, session slot handling, or sequence handling issues could prevent the lease from being effectively renewed?
- 3. Could this be a known issue in the NFSv4.1 recovery or session handling path in 5.15?
+ ------------------------------------------------------------------------
- It appears the client is stuck in the OPEN reclaim path waiting for RPC
- completion, and recovery cannot make forward progress.
+ 4. Kernel Version Comparison
- Are there known fixes or patches in newer kernels (e.g., 5.19 or 6.x)
- that address this class of issue?
+ Affected:
+ Ubuntu 22.04 5.15.0-113-generic
- Any pointers or suggestions would be greatly appreciated.
+ Not affected:
+ Ubuntu 20.04 5.4.0-48-generic
+ Ubuntu 22.04 6.8.0-60-generic
+ Ubuntu 24.04 6.8.0-31-generic
+ CentOS 7.9 4.19.188-10.el7.ucloud.x86_64
+ CentOS 7.9 3.10.0-1062.9.1.el7.x86_64
+ CentOS 8.3 4.18.0-240.1.1.el8_3.x86_64
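
A quick self-check of a kernel release string against these lists (the lists are taken verbatim from this section and are not exhaustive; anything else reports "unknown"):

```python
# Sketch: classify a kernel release string against the affected / not-affected
# lists from this report. The lists are copied verbatim and not exhaustive.
import platform

AFFECTED = {"5.15.0-113-generic"}
NOT_AFFECTED = {
    "5.4.0-48-generic", "6.8.0-60-generic", "6.8.0-31-generic",
    "4.19.188-10.el7.ucloud.x86_64", "3.10.0-1062.9.1.el7.x86_64",
    "4.18.0-240.1.1.el8_3.x86_64",
}

def classify(release):
    """Return 'affected', 'not affected', or 'unknown' for a kernel release string."""
    if release in AFFECTED:
        return "affected"
    if release in NOT_AFFECTED:
        return "not affected"
    return "unknown"

if __name__ == "__main__":
    print(platform.release(), "->", classify(platform.release()))
```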
- Thanks
+
+ ------------------------------------------------------------------------
+
+ 5. Questions
+
+ 1. Is it expected that no recovery is triggered after
+ NFS4ERR_STALE_CLIENTID?
+ 2. During reclaim, should OPEN be sent with reclaim=true?
+ 3. Could reclaim=false cause reclaim failure?
+ 4. Why is the client stuck in rpc_wait_bit_killable?
+ 5. Is this a known issue in kernel 5.15?
+ 6. Are there any related patches or fixes?
** Attachment added: "create_and_open.sh"
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2146310/+attachment/5956856/+files/create_and_open.sh
--
https://bugs.launchpad.net/bugs/2146310
Title:
NFSv4 client hang in OPEN reclaim path waiting for RPC completion
Status in linux package in Ubuntu:
New