Hi jmorete,

Thanks for taking the time to submit a bug report and helping to improve Ubuntu! Please try the latest mainline build and see if you are able to reproduce the same issue while running that kernel: https://kernel.ubuntu.com/mainline/v7.1.2/.

For instructions on how to use mainline builds, refer to this wiki page: https://wiki.ubuntu.com/Kernel/MainlineBuilds.

** Changed in: linux (Ubuntu)
       Status: New => Incomplete

--
You received this bug notification because you are subscribed to linux in Ubuntu.
Matching subscriptions: Bgg, Bmail, Nb
https://bugs.launchpad.net/bugs/2157924

Title:
  iscsi_tcp 40-50% sequential read performance regression (5.15 vs 6.8/7.0) due to release_sock serialization bottleneck

Status in linux package in Ubuntu:
  Incomplete

Bug description:

1) === System Metadata ===
OS Release:
Description: Ubuntu 26.04 LTS
Release: 26.04
Kernel Version: 7.0.0-22-generic
CPU Model: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
Loaded iscsi modules: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
Network controller:
5e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
5e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]

2) # apt-cache policy linux-image-generic
linux-image-generic:
  Installed: 7.0.0-22.22

3) Expected same or better performance over iSCSI LUN volume mounts.

4) Performance is 40% to 50% worse than the baseline test on Ubuntu 22.04.5 LTS with kernel 5.15.

# Kernel Bug Report: iscsi_tcp Sequential Throughput Regression (5.15 → 6.8 / 7.0)

## Summary

A significant sequential I/O throughput regression has been identified in the `iscsi_tcp` kernel module when comparing Ubuntu kernel 5.15 (Ubuntu 22.04) against kernels 6.8 (Ubuntu 24.04) and 7.0 (Ubuntu 26.04). Sequential read throughput drops by approximately 40-50% on the newer kernels under identical hardware, network, and storage backend conditions.

All externally-tunable parameters have been exhaustively tested and eliminated as the cause.

## Affected Versions

| Distribution | Kernel Version | Status |
|-------------|---------------|--------|
| Ubuntu 22.04 LTS | 5.15.0-143-generic | Working (baseline) |
| Ubuntu 24.04 LTS | 6.8.0-88-generic | **Regressed** |
| Ubuntu 26.04 LTS | 7.0.0-22-generic | **Regressed** |

## Hardware Configuration (identical across all hosts)

- **CPU**: 96 cores (dual-socket Intel/AMD server)
- **Memory**: 750 GiB
- **NIC**: Mellanox ConnectX (25 GbE, dual-port, LACP bond, mlx5 driver)
- **Storage Backend**: NetApp ONTAP SAN (iSCSI, ontap-san-economy driver via Trident CSI 25.06)
- **Network**: Jumbo frames (MTU 9000), VLAN-tagged storage network
- **Multipath**: dm-multipath with ALUA, service-time path selector, 2 paths per LUN

## Test Environment

- **Volume**: 100 GiB iSCSI LUN provisioned via Trident CSI (PVC with `Filesystem` volumeMode)
- **Test Pod**: Kubernetes pod with fio 3.6, volume mounted at `/mnt/data`
- **fio Parameters**: `--direct=1 --ioengine=libaio --iodepth=64 --numjobs=1 --runtime=30 --time_based --group_reporting --directory=/mnt/data`
- **Block Sizes Tested**: 128K, 256K, 1M

## Results

### Sequential Read Throughput (256K block size)

| Kernel | Bandwidth | IOPS | Avg Latency |
|--------|-----------|------|-------------|
| 5.15.0-143 | **1542 MiB/s** | 6168 | 10.2 ms |
| 6.8.0-88 | **740 MiB/s** | 2958 | 21.3 ms |
| 7.0.0-22 | **1034 MiB/s** | 4136 | 15.4 ms |

### Sequential Read Throughput (1M block size)

| Kernel | Bandwidth | IOPS | Avg Latency |
|--------|-----------|------|-------------|
| 5.15.0-143 | **1444 MiB/s** | 1444 | 43.6 ms |
| 6.8.0-88 | **815 MiB/s** | 814 | 77.3 ms |
| 7.0.0-22 | **851 MiB/s** | 851 | 73.9 ms |

### Regression Magnitude

- **Kernel 6.8 vs 5.15**: 44-52% throughput reduction (sequential reads)
- **Kernel 7.0 vs 5.15**: 33-41% throughput reduction (sequential reads)

## iSCSI / SCSI Parameters (verified identical across all kernels)

| Parameter | Value |
|-----------|-------|
| `can_queue` (scsi_host) | 113 |
| `cmd_per_lun` (scsi_host) | 32 |
| `sg_tablesize` (scsi_host) | 4096 |
| `queue_depth` (per LUN) | 32 |
| `max_hw_sectors_kb` | 32767 |
| iSCSI `MaxRecvDataSegmentLength` | 262144 |
| iSCSI `FirstBurstLength` | 65536 (negotiated by target) |
| iSCSI `MaxBurstLength` | 1048576 |
| iSCSI `MaxOutstandingR2T` | 1 |
| iSCSI `ImmediateData` | Yes |
| iSCSI `InitialR2T` | Yes (negotiated by target) |
| TCP congestion control | BBR |
| MTU | 9000 |
| TCP rmem_max / wmem_max | 134217728 |

## Eliminated Causes

The following parameters/settings were systematically tuned on the 7.0 kernel with no measurable impact on throughput:

| Tuning Attempted | Result |
|-----------------|--------|
| Disable WBT (`wbt_lat_usec=0`) | No change |
| Increase read-ahead (`read_ahead_kb=16384`) | No change (expected: direct IO) |
| IO scheduler: `none` (passthrough) | No change |
| IO scheduler: `mq-deadline` (match 5.15 default) | No change |
| Reduce `max_sectors_kb` to 64 (match 5.15 value) | No change |
| Increase `nr_requests` to 512 | No change |
| Enable `recv_from_iscsi_q=Y` (kernel 7.0 parameter) | No change |
| Increase `netdev_budget` (1200→4800) and `netdev_budget_usecs` (8000→32000) | No change |
| Renice `iscsid` process to -20 | No change |
| Enable RPS on storage VLAN interface (`rps_cpus=ffffffff`) | No change |
| Enable RFS (`rps_sock_flow_entries=32768`) | No change |
| Enable `quickack` on iSCSI storage routes | No change |
| Set `tcp_low_latency=1` | No change |
| Increase `gro_max_size` / `gso_max_size` | Failed (not supported on VLAN interface) |
| Multiple fio jobs (numjobs=2) | No change / slightly worse |

## Analysis

### Per-IO Latency Comparison

With `queue_depth=32` and `iodepth=64` (saturating the device queue), throughput is governed by:

```
throughput = queue_depth / avg_latency_per_IO
```

For 256K sequential reads:

- **Kernel 5.15**: 32 / 0.0052s = ~6150 IOPS → 1538 MiB/s (matches observed)
- **Kernel 6.8**: 32 / 0.0108s = ~2962 IOPS → 741 MiB/s (matches observed)
- **Kernel 7.0**: 32 / 0.0077s = ~4156 IOPS → 1039 MiB/s (matches observed)

The per-IO latency for the same 256K read operation is:

- **5.15**: ~5.2 ms average
- **6.8**: ~10.8 ms average (2.1x higher)
- **7.0**: ~7.7 ms average (1.5x higher)

### TCP Connection Health (not the bottleneck)

TCP socket statistics captured during testing confirm the network path is not limiting:

- All connections show healthy `cwnd`, full-speed `delivery_rate`, and sub-0.2ms RTT
- The 25 GbE NIC is operating well below capacity (~6-12 Gbps observed vs 25 Gbps available)
- No retransmissions or congestion events during testing

### CPU Utilization (not the bottleneck)

- Softirq CPU usage remains low during testing
- RPS/multi-queue distribution does not improve throughput
- The `iscsi_tcp` workqueue threads run at nice -20 (highest priority)
- Switching between softirq and workqueue processing (`recv_from_iscsi_q`) has no effect

### Conclusion

The regression is internal to the `iscsi_tcp` / `libiscsi` kernel module data path. The per-SCSI-command processing latency is 50-110% higher on kernels 6.8 and 7.0 compared to 5.15, for identical iSCSI PDU sizes, network conditions, and queue depths. This suggests changes in the iSCSI receive/transmit path, SCSI mid-layer command completion, or interaction with the block layer's multi-queue infrastructure introduced between 5.15 and 6.8 are adding overhead per I/O operation.
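
The queue-depth arithmetic from the Per-IO Latency Comparison can be re-derived with a few lines of shell. This is a minimal sketch, not part of the original report: the helper name `estimate` is hypothetical, and it assumes `awk` is installed. It takes the queue depth, the per-IO latency in milliseconds, and the block size in KiB.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the report): applies
#   IOPS = queue_depth / avg_latency_per_IO
# and converts IOPS to MiB/s for a given block size.
# Args: <queue_depth> <latency_ms> <block_size_KiB>
estimate() {
  awk -v qd="$1" -v lat_ms="$2" -v bs_kib="$3" 'BEGIN {
    iops = qd / (lat_ms / 1000)          # latency converted from ms to s
    printf "%.0f IOPS, %.0f MiB/s\n", iops, iops * bs_kib / 1024
  }'
}

estimate 32 5.2 256     # kernel 5.15 -> 6154 IOPS, 1538 MiB/s
estimate 32 10.8 256    # kernel 6.8  -> 2963 IOPS, 741 MiB/s
estimate 32 7.7 256     # kernel 7.0  -> 4156 IOPS, 1039 MiB/s
```

The results agree with the figures quoted in the analysis to within rounding, which is what makes the per-IO latency estimates (5.2 ms, 10.8 ms, 7.7 ms) consistent with the measured fio bandwidth.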

## Steps to Reproduce

### Prerequisites

- Two bare-metal hosts: one running kernel 5.15 (Ubuntu 22.04), one running kernel 6.8+ (Ubuntu 24.04 or 26.04)
- iSCSI target (e.g., NetApp ONTAP, LIO, or targetcli) accessible via 10GbE+ network with jumbo frames
- `open-iscsi` package installed with default configuration
- `dm-multipath` configured (or single-path is sufficient to reproduce)
- Kubernetes with Trident CSI is NOT required; direct iSCSI LUN attachment reproduces the issue

### Reproduction Steps

1. **Provision a 100 GiB iSCSI LUN** on the target and present it to both hosts.

2. **Discover and login** on both hosts:

```bash
iscsiadm -m discovery -t sendtargets -p <target_ip>:3260
iscsiadm -m node --login
```

3. **Identify the device**:

```bash
# For multipath:
multipath -ll    # Note the dm-X device

# For single path:
lsblk --scsi
```

4. **Create a filesystem and mount**:

```bash
mkfs.ext4 /dev/dm-0    # or /dev/sdX for single path
mkdir /mnt/iscsi-test
mount /dev/dm-0 /mnt/iscsi-test
```

5. **Run fio benchmark** (identical on both hosts):

```bash
# 256K Sequential Read
fio --name=seq-read-256k \
    --rw=read \
    --bs=256k \
    --size=1G \
    --numjobs=1 \
    --iodepth=64 \
    --direct=1 \
    --ioengine=libaio \
    --runtime=30 \
    --time_based \
    --group_reporting \
    --directory=/mnt/iscsi-test

# 1M Sequential Read
fio --name=seq-read-1M \
    --rw=read \
    --bs=1M \
    --size=1G \
    --numjobs=1 \
    --iodepth=64 \
    --direct=1 \
    --ioengine=libaio \
    --runtime=30 \
    --time_based \
    --group_reporting \
    --directory=/mnt/iscsi-test
```

6. **Compare results**: The host running kernel 6.8+ will show 40-50% lower sequential read bandwidth compared to the host running kernel 5.15.

### Verification

Confirm the test is hitting the iSCSI device (not page cache or overlay):

```bash
# During the test, verify disk utilization:
iostat -x 1 | grep dm-0
# Should show ~99% util

# Verify the mount is on iSCSI:
lsblk -o NAME,TYPE,TRAN,SIZE,MOUNTPOINT | grep -A5 dm-0
```

## Additional Diagnostics

### 1. Transparent Huge Pages (THP)

**Default state**: `enabled=madvise`, `defrag=madvise`

| THP Setting | 256K Seq Read (kernel 7.0) | Change |
|-------------|---------------------------|--------|
| madvise (default) | 1034 MiB/s | baseline |
| never | 882 MiB/s | -15% (worse) |

**Conclusion**: THP is not contributing to the regression. Disabling it slightly reduces throughput, likely due to losing THP benefits for fio's memory allocations.

### 2. NUMA Locality Impact

**Hardware topology**:

- 96 cores: node 0 = even CPUs (0,2,4,...,94), node 1 = odd CPUs (1,3,5,...,95)
- NIC `ens3f0np0` (active iSCSI bond slave): **NUMA node 0**
- NIC `ens5f1np1` (secondary bond slave): **NUMA node 1**

**NUMA-pinned fio results** (256K sequential read, direct on `/dev/dm-0`):

| CPU Pinning | Throughput | Avg Latency | Delta |
|-------------|-----------|-------------|-------|
| Node 0 (NIC-local) | **1148 MiB/s** | 13.9 ms | baseline |
| Node 1 (cross-socket) | **943 MiB/s** | 16.9 ms | -18% |

**Finding**: The iSCSI MSI-X interrupt (IRQ 338, `mlx5_comp62`) is pinned to **CPU29 (NUMA node 1)** despite the NIC residing on **NUMA node 0**. The server has two 25 GbE Mellanox NICs, one per NUMA node, bonded via 802.3ad LACP (`xmit_hash_policy=layer2`). The storage VLAN runs on top of this bond. Due to LACP hashing, iSCSI traffic flows through the node-0 NIC but the receive interrupt is processed on a node-1 CPU, adding ~3 ms of per-IO latency from cross-socket memory access.

However, even the optimally-pinned node-0 result (1148 MiB/s) is still **25% below kernel 5.15** (1542 MiB/s), confirming the regression is not solely NUMA-related.

### 3. perf Profiling (Kernel CPU Time Breakdown)

Captured via `perf record -a -g -- sleep 10` during 256K sequential reads at iodepth=64:

```
Top-level call stack (kernel 7.0.0-22-generic):

47.97% ret_from_fork_asm
└─ 47.96% kthread
   └─ 46.05% worker_thread
      └─ 44.97% process_one_work
         ├─ 42.77% iscsi_xmitworker                                  ← TX path
         │  └─ 42.74% iscsi_data_xmit
         │     └─ 42.24% iscsi_xmit_task
         │        └─ 41.79% iscsi_tcp_task_xmit
         │           └─ 41.74% iscsi_sw_tcp_pdu_xmit
         │              └─ 41.62% iscsi_sw_tcp_xmit_segment
         │                 └─ 41.34% sock_sendmsg
         │                    └─ 41.08% tcp_sendmsg
         │                       ├─ 33.73% release_sock              ← CRITICAL
         │                       │  └─ 33.57% __release_sock
         │                       │     └─ 33.42% tcp_v4_do_rcv
         │                       │        └─ 33.29% tcp_rcv_established
         │                       │           └─ 32.24% tcp_data_queue
         │                       │              └─ 31.84% tcp_data_ready
         │                       │                 └─ 31.78% iscsi_sw_tcp_data_ready
         │                       │                    ├─ 28.03% tcp_read_sock
         │                       │                    │  └─ 23.45% iscsi_sw_tcp_recv
         │                       │                    │     └─ 23.15% iscsi_tcp_recv_skb
         │                       │                    │        └─ 2.35% iscsi_tcp_segment_recv
         │                       │                    └─ 4.32% native_queued_spin_lock_slowpath  ← LOCK CONTENTION
         │                       └─ (tcp_sendmsg_locked, etc.)
         └─ 1.78% blk_mq_run_work_fn
```

**Critical Findings**:

1. **TX/RX path serialization via `release_sock`**: 33.73% of total CPU time is spent inside `release_sock()` during the transmit path. When the xmit worker sends a SCSI command via `tcp_sendmsg`, the socket lock release triggers processing of all queued incoming data in the **same thread context**; this includes `iscsi_sw_tcp_data_ready` → `iscsi_tcp_recv_skb` (23.15% of CPU).

2. **Spinlock contention**: `native_queued_spin_lock_slowpath` accounts for **4.32%** of total CPU time, indicating measurable contention on the socket lock between the transmit workqueue and network softirq receive path.

3. **Single-threaded bottleneck**: The entire iSCSI IO path (TX command + RX data) is serialized through a single `iscsi_xmitworker` workqueue thread. Data receive happens as a side-effect of the transmit path's lock release, not in parallel.

### 4. Interrupt Footprint

During a 5-second sample of active 256K sequential reads:

| IRQ | Queue | CPU | Delta (5s) | Rate |
|-----|-------|-----|-----------|------|
| 338 | `mlx5_comp62@0000:5e:00.0` | CPU29 | +15,979 | 3,196/s |
| 325 | `mlx5_comp49@0000:5e:00.0` | CPU3 | +8 | ~0/s |

**Finding**: All iSCSI receive traffic is concentrated on a single MSI-X vector (IRQ 338 → CPU29). This is expected for a single TCP flow (RSS hashes to one queue), but confirms that the entire iSCSI data path is single-CPU-bound. The interrupt is NOT bottlenecking on CPU0; it is correctly distributed via MSI-X, but still limited to one core.

### 5. Lock Contention (lockstat)

`CONFIG_LOCK_STAT` is **not enabled** in the Ubuntu 7.0.0-22-generic kernel. This data point is not available without a custom kernel build with `CONFIG_LOCK_STAT=y`.

### 6. blktrace / iostat Latency Decomposition

Captured via `blktrace` and `iostat -xmt` on dm-0 and underlying sdb during active 256K sequential reads:

**Raw blktrace** (10s capture on `/dev/dm-0`): Completions arrive at 30-40 µs intervals on the SCSI device, confirming fast backend response.

**iostat latency breakdown** (steady-state averages over 10 samples):

| Device | r_await (ms) | aqu-sz | Throughput |
|--------|-------------|--------|-----------|
| dm-0 (multipath) | **13.3 ms** | 243 | ~1.15 GB/s |
| sdb (SCSI/iSCSI) | **1.7 ms** | 31 | ~1.15 GB/s |

**Latency decomposition**:

| Component | Time | % of Total |
|-----------|------|-----------|
| DM/block queue wait (Q2D) | **11.6 ms** | 87% |
| SCSI device service time (D2C) | **1.7 ms** | 13% |
| **Total r_await** | **13.3 ms** | 100% |

**Interpretation**: The actual iSCSI network + storage backend time is only 1.7 ms per 64KB read. The remaining 87% of per-IO latency is spent **waiting in the device-mapper queue** because the SCSI device queue depth is limited to 32 (`cmd_per_lun=32`). With `iodepth=64` from fio, the DM layer queues 243 outstanding IOs but can only dispatch 32 at a time to the underlying SCSI device.

On kernel 5.15, the same `cmd_per_lun=32` limit exists but achieves 1542 MiB/s at 256K, implying either:

- The DM/blk-mq queue management has higher overhead in 6.8+ (longer Q2D time per IO)
- The SCSI device service time was lower on 5.15 (different TCP/socket handling in `release_sock`)
- Or both, as the perf profile showing 33% of time in `release_sock` correlates with the serialized TX/RX pattern adding per-IO overhead

**Comparative iostat from kernel 5.15 (U22) under identical test**:

| Metric | U22 (kernel 5.15) | U26 (kernel 7.0) | Ratio |
|--------|-------------------|-------------------|-------|
| IOPS (256K) | **~49,000** | ~18,200 | U22 2.7× more |
| Throughput | **~3,050 MiB/s** | ~1,150 MiB/s | U22 2.7× faster |
| r_await (dm) | **5.5 ms** | 13.3 ms | U26 2.4× slower per IO |
| aqu-sz (dm) | 268-286 | 243 | Similar queue depth |

The per-IO latency through the DM layer is **2.4× higher on kernel 7.0** (13.3 ms vs 5.5 ms), directly explaining the throughput difference. Both kernels saturate at 100% device utilization with similar application queue depths (~260-280), confirming the bottleneck is per-IO processing efficiency in the kernel iSCSI/SCSI stack, not device or network capacity.

---

## Suggested Investigation Areas

1. **`release_sock()` in `tcp_sendmsg` processing RX data inline (kernel 7.0)**: The perf profile shows that 33.73% of time is spent in `release_sock` → `tcp_v4_do_rcv` → `iscsi_sw_tcp_data_ready` during the **transmit** path. This serializes TX and RX. Investigate whether kernel 5.15 handled the receive callback differently (e.g., via softirq/tasklet rather than inline in `release_sock`).

2. **Socket lock contention in `iscsi_sw_tcp_data_ready`**: The 4.32% `native_queued_spin_lock_slowpath` indicates the socket lock is contended. In kernel 5.15, the receive path may have used `sk->sk_data_ready` differently or with less contention.

3. **`iscsi_xmitworker` single-threaded design**: All SCSI command dispatch and completion happens through one workqueue worker. If kernel 6.8+ changed workqueue scheduling (e.g., unbound → bound, or different WQ flags), this would add latency.

4. **SCSI mid-layer `blk-mq` tag allocation on NUMA**: On 96-core dual-socket systems, cross-NUMA blk-mq tag allocation adds measurable latency. The 18% NUMA penalty observed may be amplified by changes in how the SCSI mid-layer allocates tags in 6.8+.

5. **TCP small-queue (TSQ) or pacing changes**: `tcp_sendmsg` in newer kernels may hold the socket lock longer due to TSQ or pacing changes, increasing the window during which `release_sock` processes RX data inline.

## Environment Details

```
# Modules involved:
iscsi_tcp              24576
libiscsi_tcp           32768
libiscsi               81920
scsi_transport_iscsi  176128

# Module parameters (kernel 7.0):
iscsi_tcp: max_lun, recv_from_iscsi_q, debug_iscsi_tcp
```

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2157924/+subscriptions