
NFS-Kernel-Server issues with tiered storage on the same host



Hello,

Given are ordinary (HP brand) physical file servers (not VMs) serving as NFS backing store for eight ESXi hosts, which in turn run around 120 Linux and Windows VMs. The servers feature Xeon CPUs with 64 GiB of RAM. The infrastructure is 10GbE with 9k jumbo frames. Over the day, the load average is usually below 10.

The backing store consists of SSDs and SAS disks, organized as two mdadm-based RAID5 volumes: a fast one (SSD) and a slower one (SAS). Each md device carries an ext4 filesystem and is mounted separately.
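
Roughly like this, for illustration only (device names, member counts and mount points are placeholders, not our actual layout):

    # fast tier: SSDs as mdadm RAID5, formatted ext4
    mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mkfs.ext4 /dev/md0
    mount /dev/md0 /srv/nfs/ssd

    # slow tier: SAS disks, same scheme
    mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
    mkfs.ext4 /dev/md1
    mount /dev/md1 /srv/nfs/sas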

The NFS kernel server exports these mount points to the aforementioned ESXi hosts.
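
In /etc/exports this boils down to something along these lines (paths and network are placeholders; ESXi mounts as root, hence no_root_squash):

    /srv/nfs/ssd  192.168.10.0/24(rw,sync,no_root_squash,no_subtree_check)
    /srv/nfs/sas  192.168.10.0/24(rw,sync,no_root_squash,no_subtree_check)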

We're using VEEAM as the backup solution for the VMs themselves. VEEAM issues massively parallel accesses to the vmdk resources to get the backups done as fast as possible.

This leads to frequent messages like the following in syslog:

rpc-srv/tcp: nfsd: sent only 68468 when sending 131204 bytes - shutting down socket

Unfortunately, ESXi doesn't support NFS over UDP.

Extensive research and testing showed that requests to the slower SAS storage block nfsd threads, preventing them from serving requests to the faster SSD storage. When not mixing storage tiers on the same server, 64 threads are just fine, for both SSD and SAS. When mixing tiers, we need to go up to 1024 threads to suppress these messages completely. I'm not sure what implications such a high thread count might have. Once, when the machine was heavily loaded with I/O, raising the thread count via /proc made the SSH session unresponsive for almost a minute.
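
For reference, the thread count is changed like this (assuming a Debian-style nfs-kernel-server installation; 1024 is the value from our tests):

    # runtime change, takes effect immediately
    echo 1024 > /proc/fs/nfsd/threads

    # persistent setting in /etc/default/nfs-kernel-server
    RPCNFSDCOUNT=1024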

Of course, the NFS server can't know how fast the backing store can handle incoming requests. The only thing we can try is to split the backup jobs, so that backups run in one batch for SSD-only VMs and in another for VMs mixing SAS ("cheap bulk storage") and SSD (so OS upgrades go fast). Not very convenient, and even less "suitable for everyday use".

As far as I understand, if a thread is already waiting for I/O, new requests are dispatched to the next free thread. If all threads are exhausted and waiting for I/O to complete, what exactly happens when new requests come in? I can't tell from the error message above.

I'm not entirely sure if this is the right place to ask. But then, I have to start somewhere. :-)

Which other possibilities do you see to avoid this excessive number of threads? I guess the most important point is to separate requests between storage tiers. Ideas (not really refined!)…

- A given thread could hand work back to the dispatcher after a certain timeout and request the next batch of work. There's a chance that the next batch is for the SSDs and can thus be served within the timeout. Requests for SAS would then stack up in the dispatcher, but the whole thing stays responsive for quickly serviceable SSD requests.

- Maybe make the nfs-kernel-server mountpoint-aware (a multi-queue approach)? While parsing /etc/exports, entries for the same mountpoint would be assigned an initial percentage of the available thread pool (so threads are shared between different exports of the same mountpoint, but not across mountpoints). This prevents work for different storage tiers from being mixed. How to distribute the threads? Statically, evenly? Dynamically, according to what? (See the sketch after this list.)

- Do nothing about it. The problem is rare in the real world, fixing it means a lot of work for the kernel devs, the changes would probably affect other installations in a bad way, and mechanical disks are a thing of the past anyway.
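
To make the second idea a bit more concrete, a purely hypothetical /etc/exports annotation (the threads= option does not exist; it's only meant to illustrate the per-mountpoint split):

    # hypothetical syntax: pin a share of the global nfsd thread pool per export tree
    /srv/nfs/ssd  192.168.10.0/24(rw,sync,no_root_squash,no_subtree_check,threads=70%)
    /srv/nfs/sas  192.168.10.0/24(rw,sync,no_root_squash,no_subtree_check,threads=30%)

Whether such a split should be static or follow the observed per-export latency is exactly the open question above.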

Thoughts, anyone?

:wq! PoC

PGP-Key: DDD3 4ABF 6413 38DE - https://www.pocnet.net/poc-key.asc


