[Date Prev][Date Next] [Thread Prev][Thread Next] [Date Index] [Thread Index]

Bug#987599: linux-image-5.10.0-6-amd64: regression, diskstats show unrealistic values



Package: src:linux
Version: 5.10.28-1
Severity: normal

Hello,

after upgrading some of our servers to Debian testing I noticed that their reported disk utilization increased significantly. Since there was no other indication of them being under load and since I had a prior knowledge of similar bug reported few years ago, I investigated the source of the data and discovered that the values shown are inflated.

Steps to reproduce:

Using FIO I tried to find out some kind of maximum IOPS the drives (two SSDs in RAID1) can perform:

[global]
name=fio-rand-write
filename=fio-rand-write
rw=randwrite
bs=4K
direct=1
numjobs=1
time_based=1
runtime=900

[file1]
size=1G
ioengine=libaio
iodepth=16

FIO output while running was showing about 30k IOPS:

Jobs: 1 (f=1): [w(1)][0.8%][w=116MiB/s][w=29.7k IOPS][eta 14m:53s]

Drive utilization was 100% as expected, 10th column (time spent doing I/O in milliseconds) was increasing by 1000 each second

# while sleep 1 ; do cat /sys/block/sda/stat  ; done
4845 2532 264597 1489 5509215 237957 56950332 268397 0 334780 274445 11847 0 161240352 3823 17304 734 4845 2532 264597 1489 5529013 237960 57108732 269200 0 335780 275249 11847 0 161240352 3823 17306 734 4845 2532 264597 1489 5548918 237960 57267972 270008 0 336784 276056 11847 0 161240352 3823 17306 734 4845 2532 264597 1489 5568868 237960 57427572 270815 0 337788 276863 11847 0 161240352 3823 17306 734

After that I added IOPS limit to FIO configuration (limit max reading, writing, discarding IOPS):

rate_iops=100,100,100

This took effect in performance reported by FIO

Jobs: 1 (f=1), 0-100 IOPS: [w(1)][1.2%][w=400KiB/s][w=100 IOPS][eta 14m:49s]

And increments in time spent doing I/O dropped to 400ms each second.

# while sleep 1 ; do cat /sys/block/sda/stat  ; done
5067 2602 265364 1511 6587946 239193 65578139 314735 0 428856 320973 11931 0 161241296 3847 20712 879 5067 2602 265364 1511 6588046 239193 65578939 314742 0 429256 320981 11931 0 161241296 3847 20712 879 5067 2602 265364 1511 6588146 239193 65579739 314750 0 429656 320989 11931 0 161241296 3847 20712 879

I tried to increase IOPS limit by factor of 2.5 to 250 and at that point the drive load went back to full (ie. time spent doing I/O in milliseconds incrementing by 1000 each second.) Both of those values seem way off, it isn't likely that an SSD would need 4ms to write a small chunk of data. That's rotating drive level of performance.

All drives I tested showed the same behaviour, regardless of them being single drives or part of an MD array. Drives are two different brands.

This is a regression from Debian Stable kernel, where insignificant load (from the perspective of an SSD) didn't increase the "time spent doing I/O" counter at all. Which is IMO the sensible thing to do as you usually don't care if the drive reports zero load when the actual load is slightly more than zero. Current (testing) kernel version on the other hand causes "drives fully loaded" alerts when the drives are in reality almost idle.

Meaning of the 10th column is taken from https://www.kernel.org/doc/Documentation/iostats.txt

This issue is similar to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=927184 , but unlike that bug - which could be worked around by choosing mq-deadline scheduler - it affects mq-deadline as well. This is why I reported this separately as a regression.

(I removed PCI information included by reportbug since I don't think this is a hardware-related issue. If needed, I can provide it.)


-- Package-specific info:
** Version:
Linux version 5.10.0-6-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.28-1 (2021-04-09)

** Command line:
BOOT_IMAGE=/boot/vmlinuz-5.10.0-6-amd64 root=UUID=bc4aa9fd-9c63-445e-aeae-2147798879f4 ro amd_iommu=on intel_iommu=on net.ifnames=0

** Not tainted

** Kernel log:
[ 4.185348] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    4.191508] PTP clock support registered
[    4.194579] ccp 0000:42:00.1: SEV API:0.24 build:3
[ 4.264204] checking generic (b6000000 300000) vs hw (b6000000 1000000)
[    4.264207] fb0: switching to astdrmfb from EFI VGA
[    4.268024] Console: switching to colour dummy device 80x25
[ 4.268247] ast 0000:c4:00.0: [drm] P2A bridge disabled, using default configuration
[    4.268250] ast 0000:c4:00.0: [drm] AST 2500 detected
[    4.268257] ast 0000:c4:00.0: [drm] Analog VGA only
[ 4.268259] ast 0000:c4:00.0: [drm] dram MCLK=800 Mhz type=1 bus_width=16 [ 4.268804] [TTM] Zone kernel: Available graphics memory: 131929092 KiB [ 4.268807] [TTM] Zone dma32: Available graphics memory: 2097152 KiB
[    4.268808] [TTM] Initializing pool allocator
[    4.268812] [TTM] Initializing DMA pool allocator
[ 4.269010] [drm] Initialized ast 0.1.0 20120228 for 0000:c4:00.0 on minor 0
[    4.273003] fbcon: astdrmfb (fb0) is primary device
[    4.284407] Console: switching to colour frame buffer device 128x48
[    4.286804] ast 0000:c4:00.0: [drm] fb0: astdrmfb frame buffer device
[ 4.295928] tg3 0000:c6:00.0 eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address ac:1f:6b:e5:e1:7e [ 4.296009] tg3 0000:c6:00.0 eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1]) [ 4.296060] tg3 0000:c6:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1] [ 4.296102] tg3 0000:c6:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit] [ 4.312469] tg3 0000:c6:00.1 eth1: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address ac:1f:6b:e5:e1:7f [ 4.312529] tg3 0000:c6:00.1 eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1]) [ 4.312582] tg3 0000:c6:00.1 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1] [ 4.312625] tg3 0000:c6:00.1 eth1: dma_rwctrl[00000001] dma_mask[64-bit] [ 4.320397] input: PC Speaker as /devices/platform/pcspkr/input/input3 [ 4.329707] RAPL PMU: API unit is 2^-32 Joules, 1 fixed counters, 163840 ms ovfl timer
[    4.329951] RAPL PMU: hw unit of domain package 2^-16 Joules
[    4.336927] pstore: ignoring unexpected backend 'efi'
[    4.339850] cryptd: max_cpu_qlen set to 1000
[    4.350152] AVX2 version of gcm_enc/dec engaged.
[    4.350175] AES CTR mode by8 optimization enabled
[    4.543242] kvm: Nested Virtualization enabled
[    4.543417] SVM: kvm: Nested Paging enabled
[    4.543430] SVM: Virtual VMLOAD VMSAVE supported
[    4.543444] SVM: Virtual GIF supported
[    4.548630] MCE: In-kernel MCE decoding enabled.
[    4.551186] EDAC amd64: F17h_M30h detected (node 0).
[    4.560369] EDAC amd64: Node 0: DRAM ECC enabled.
[    4.561135] EDAC amd64: MCT channel count: 8
[ 4.562760] EDAC MC0: Giving out device to module amd64_edac controller F17h_M30h: DEV 0000:00:18.3 (INTERRUPT)
[    4.566733] EDAC MC: UMC0 chip selects:
[    4.566735] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[    4.567578] EDAC amd64: MC: 2:     0MB 3:     0MB
[    4.569941] EDAC MC: UMC1 chip selects:
[    4.569942] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[    4.570596] EDAC amd64: MC: 2:     0MB 3:     0MB
[    4.574730] EDAC MC: UMC2 chip selects:
[    4.574730] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[    4.575524] EDAC amd64: MC: 2:     0MB 3:     0MB
[    4.576192] EDAC MC: UMC3 chip selects:
[    4.576193] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[    4.576827] EDAC amd64: MC: 2:     0MB 3:     0MB
[    4.578239] EDAC MC: UMC4 chip selects:
[    4.578239] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[    4.578800] EDAC amd64: MC: 2:     0MB 3:     0MB
[    4.582730] EDAC MC: UMC5 chip selects:
[    4.582730] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[    4.583439] EDAC amd64: MC: 2:     0MB 3:     0MB
[    4.586536] EDAC MC: UMC6 chip selects:
[    4.586536] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[    4.587094] EDAC amd64: MC: 2:     0MB 3:     0MB
[    4.590729] EDAC MC: UMC7 chip selects:
[    4.590730] EDAC amd64: MC: 0: 16384MB 1: 16384MB
[    4.591410] EDAC amd64: MC: 2:     0MB 3:     0MB
[    4.591964] EDAC amd64: using x16 syndromes.
[ 4.592496] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[    4.593049] AMD64 EDAC driver v3.5.0
[ 4.658759] ipmi_si dmi-ipmi-si.0: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed. [ 4.734200] ipmi_si dmi-ipmi-si.0: IPMI message handler: Found new BMC (man_id: 0x002a7c, prod_id: 0x1b2b, dev_id: 0x20)
[    4.776164] ipmi_si dmi-ipmi-si.0: IPMI kcs interface initialized
[    4.781864] ipmi_ssif: IPMI SSIF Interface driver
[ 6.762142] EXT4-fs (md0): re-mounted. Opts: discard,errors=remount-ro
[    7.103635] loop: module loaded
[    7.156394] tun: Universal TUN/TAP device driver, 1.6
[ 7.205055] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[    7.214580] Bridge firewalling registered
[    7.274324] VFIO - User Level meta-driver version: 0.3
[    7.590700] br0: port 1(eth0) entered blocking state
[    7.591734] br0: port 1(eth0) entered disabled state
[    7.594832] device eth0 entered promiscuous mode
[ 11.342993] tg3 0000:c6:00.0 eth0: Link is up at 1000 Mbps, full duplex [ 11.344061] tg3 0000:c6:00.0 eth0: Flow control is on for TX and on for RX
[   11.345061] tg3 0000:c6:00.0 eth0: EEE is disabled
[   11.346066] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   11.347147] br0: port 1(eth0) entered blocking state
[   11.348115] br0: port 1(eth0) entered forwarding state
[   11.357391] IPv6: ADDRCONF(NETDEV_CHANGE): br0: link becomes ready
[   12.677107] br1: port 1(eth1) entered blocking state
[   12.678069] br1: port 1(eth1) entered disabled state
[   12.679060] device eth1 entered promiscuous mode
[ 16.220975] tg3 0000:c6:00.1 eth1: Link is up at 1000 Mbps, full duplex [ 16.221879] tg3 0000:c6:00.1 eth1: Flow control is off for TX and off for RX
[   16.222719] tg3 0000:c6:00.1 eth1: EEE is disabled
[   16.223562] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[   16.224486] br1: port 1(eth1) entered blocking state
[   16.225316] br1: port 1(eth1) entered forwarding state
[   16.234489] IPv6: ADDRCONF(NETDEV_CHANGE): br1: link becomes ready
[   18.474085]  sdj:

** Model information
sys_vendor: Supermicro
product_name: Super Server
product_version: 0123456789
chassis_vendor: Supermicro
chassis_version: 0123456789
bios_vendor: American Megatrends Inc.
bios_version: 1.2
board_vendor: Supermicro
board_name: H12SSW-iN
board_version: 1.01

** Loaded modules:
nf_conntrack_netlink
nft_chain_nat
nf_nat
nft_reject_inet
nf_reject_ipv4
nf_reject_ipv6
nft_reject
nf_log_ipv6
nf_log_ipv4
nf_log_common
nft_log
nft_limit
nft_counter
nft_ct
nls_ascii
nls_cp437
vfat
fat
nf_tables
nfnetlink
sch_hfsc
sch_fq_codel
efivarfs
vfio_pci
vfio_virqfd
vfio_iommu_type1
vfio
nf_conntrack_ftp
br_netfilter
bridge
stp
llc
nf_conntrack
nf_defrag_ipv6
nf_defrag_ipv4
vhost_net
tun
vhost
vhost_iotlb
tap
bonding
loop
ipmi_ssif
amd64_edac_mod
edac_mce_amd
amd_energy
kvm_amd
kvm
irqbypass
crc32_pclmul
ghash_clmulni_intel
aesni_intel
libaes
crypto_simd
cryptd
glue_helper
rapl
efi_pstore
pcspkr
ast
drm_vram_helper
drm_ttm_helper
ttm
drm_kms_helper
tg3
cec
libphy
drm
ptp
pps_core
i2c_algo_bit
sp5100_tco
ccp
watchdog
sg
rng_core
joydev
i2c_piix4
k10temp
acpi_ipmi
ipmi_si
button
ipmi_devintf
ipmi_msghandler
acpi_cpufreq
ext4
crc16
mbcache
jbd2
raid10
raid456
libcrc32c
crc32c_generic
async_raid6_recov
async_memcpy
async_pq
async_xor
xor
async_tx
evdev
hid_generic
usbhid
hid
raid6_pq
raid0
multipath
linear
raid1
md_mod
sd_mod
t10_pi
crc_t10dif
crct10dif_generic
crct10dif_pclmul
crct10dif_common
crc32c_intel
xhci_pci
xhci_hcd
ahci
libahci
libata
usbcore
scsi_mod
usb_common

** PCI devices:
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex [1022:1480] Subsystem: Super Micro Computer Inc Starship/Matisse Root Complex [15d9:1b2b] Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-


** USB devices:
not available


-- System Information:
Debian Release: bullseye/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-6-amd64 (SMP w/64 CPU threads)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)

Versions of packages linux-image-5.10.0-6-amd64 depends on:
ii  initramfs-tools [linux-initramfs-tool]  0.140
ii  kmod                                    28-1
ii  linux-base                              4.6

Versions of packages linux-image-5.10.0-6-amd64 recommends:
pn  apparmor             <none>
pn  firmware-linux-free  <none>

Versions of packages linux-image-5.10.0-6-amd64 suggests:
pn  debian-kernel-handbook  <none>
ii  grub-efi-amd64          2.04-17
pn  linux-doc-5.10          <none>

Versions of packages linux-image-5.10.0-6-amd64 is related to:
pn  firmware-amd-graphics     <none>
pn  firmware-atheros          <none>
pn  firmware-bnx2             <none>
pn  firmware-bnx2x            <none>
pn  firmware-brcm80211        <none>
pn  firmware-cavium           <none>
pn  firmware-intel-sound      <none>
pn  firmware-intelwimax       <none>
pn  firmware-ipw2x00          <none>
pn  firmware-ivtv             <none>
pn  firmware-iwlwifi          <none>
pn  firmware-libertas         <none>
pn  firmware-linux-nonfree    <none>
pn  firmware-misc-nonfree     <none>
pn  firmware-myricom          <none>
pn  firmware-netxen           <none>
pn  firmware-qlogic           <none>
pn  firmware-realtek          <none>
pn  firmware-samsung          <none>
pn  firmware-siano            <none>
pn  firmware-ti-connectivity  <none>
pn  xen-hypervisor            <none>

-- no debconf information


Reply to: