[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Resolved: ConnectX4-LX Tainted Kernel and Slow Performance



Hi Adrian,

As promised, I researched the Connect4-LX NIC issues with Debian that we experienced during our initial S7-2 kernel testing. Recommend you pick up this patch, fixes numerous Mellanox ConnectX4-LX related Tainted Kernel messages and extremely slow performance on the S2-2.   Modified my 6.17.2 Linux kernel on the S7-2 to test this patch, and it works.  Also, I believe this patch may resolve this issue for other architectures/systems as well. No more tainted kernel messages.

The problem: Incorrect comparision calculation with "WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE); "

tonyr@s7-debian:~$ getconf PAGE_SIZE
8192

Example:

DEFAULT_FRAG_SIZE is 2048

"WARN_ON(8192 != 2 * 2048); "  == 4096 doesn't match 8192 and creates an issue.

To resolve:

-#define DEFAULT_FRAG_SIZE (2048)
+#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2)

Tests Performed:

iperf3

ssh and sftp large file transfers

No Tainted kernel messages and fast file transfers

PS. - The Connextx4-LX 10/25GB NIC also works with both Solaris 11.4 and Debian for sparc64, so it definitely recommended.

Test-By: Tony Rodriguez   unixpro1970@gmail.com


Regards,

Tony Rodriguez


On 10/14/25 10:18 AM, Tony Rodriguez wrote:
https://lkml.org/lkml/2025/9/2/1794


Date    Tue, 2 Sep 2025 11:56:06 -0700
From    Saeed Mahameed <>
Subject    Re: [PATCH] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size
On 02 Sep 21:00, Mingrui Cui wrote:
>When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
>fragments per WQE, odd-indexed WQEs always share the same page with
>their subsequent WQE. However, this relationship does not hold for page
>sizes larger than 8K. In this case, wqe_index_mask cannot guarantee that
>newly allocated WQEs won't share the same page with old WQEs.
>
>If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
>page with its subsequent WQE, allocating a page for that WQE will
>overwrite mlx5e_frag_page, preventing the original page from being
>recycled. When the next WQE is processed, the newly allocated page will
>be immediately recycled.
>
>In the next round, if these two WQEs are handled in the same bulk,
>page_pool_defrag_page() will be called again on the page, causing
>pp_frag_count to become negative.
>
>Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
>size.
>
>Signed-off-by: Mingrui Cui <mingruic@outlook.com>
CC:  Dragos Tatulea <dtatulea@nvidia.com>

Dragos is on a mission to improve page_size support in mlx5.

Dragos, please look into this, I am not sure making DEFAULT_FRAG_SIZE
dependant on PAGE_SIZE is the correct way to go,
see mlx5e_build_rq_frags_info()

I believe we don't do page flipping for > 4k pages, but I might be wrong,
anyway the code also should throw a warn_on:

/* The last fragment of WQE with index 2*N may share the page with the
  * first fragment of WQE with index 2*N+1 in certain cases. If WQE
  * 2*N+1
  * is not completed yet, WQE 2*N must not be allocated, as it's
  * responsible for allocating a new page.
  */
if (frag_size_max == PAGE_SIZE) {
    /* No WQE can start in the middle of a page. */
    info->wqe_index_mask = 0;
} else {
    /* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
     * because there would be more than MLX5E_MAX_RX_FRAGS of
     * them.
     */
    WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
    /* Odd number of fragments allows to pack the last fragment of
     * the previous WQE and the first fragment of the next WQE
     * into
     * the same page.
     * As long as DEFAULT_FRAG_SIZE is 2048, and
     * MLX5E_MAX_RX_FRAGS
     * is 4, the last fragment can be bigger than the rest only
     * if
     * it's the fourth one, so WQEs consisting of 3 fragments
     * will
     * always share a page.
     * When a page is shared, WQE bulk size is 2, otherwise
     * just 1.
     */
    info->wqe_index_mask = info->num_frags % 2;
}

Looking at the above makes me think that this patch is correct, but a more careful look is needed to be taken, a Fixes tag is also required and target
'net' branch.

Thanks,
Saeed.

>---
> drivers/net/ethernet/mellanox/mlx5/core/en/params.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
>index 3cca06a74cf9..d96a3cbea23c 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
>@@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params,
>     info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor);
> }
>
>-#define DEFAULT_FRAG_SIZE (2048)
>+#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2)
>
> static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>                      struct mlx5e_params *params,
>--
>2.43.0
>
>



Reply to: