
Bug#833170: linux-image-3.16.0-4-amd64: Reproducible XFS filesystem corruption, possibly connected with ACLs



Hi,

On Mon, Aug 01, 2016 at 11:25:28AM -0600, Will Aoki wrote:
> Package: src:linux
> Version: 3.16.7-ckt25-2+deb8u3
> Severity: important
> 
> I hit a nasty filesystem corruption bug while restoring some backups. I'm able
> to reliably reproduce this on every system where I tried to restore to XFS and
> on a brand-new VM created for testing (which I can share as an OVF, although
> it's pretty big).
> 
> Everything's running on hardware with ECC RAM, so memory errors are unlikely.
> I've reproduced it on VMs spread across three different storage arrays from two
> different vendors. Everything has at least two virtual CPUs.
> 
> All systems tested use XFS on LVM.
> 
> In my test VM, the only packages not from Debian stable are custom stow and tar
> packages (to fix bugs in the versions in the stable release) and a backport of
> burp, because the version in Debian was very old. These are all userland tools
> and none should be able to cause filesystem corruption.
> 
> I'm using xfsprogs from Debian jessie. On one affected system, I tried version
> 4.3.0+nmu1 and saw no difference in what xfs_repair found.
> 
> 
> At this point, my test case uses the burp backup software to create the I/O
> activity which triggers this bug. I have not been able to make tar trigger this
> problem.
> 
> 
> Steps to reproduce:
> 
> 1: Create new XFS filesystem & mount it on /srv/src
> 
> 2: Create some directories in /srv/src & set ACLs (including default ACLs) on
>    them
> 
> 3: Generate deep tree of files in each of the directories from step #2. For
>    testing, I used a script which created random files & subdirectories. Total
>    bulk was about 2.5 gigabytes. (A sketch follows the steps below.)
> 
> 4: Take backup of /srv/src with burp:
> 
>    # burp -a b
> 
> 5: Unmount /srv/src
> 
> 6: Create new XFS filesystem & mount it on /srv/src
> 
> 7: Run restore to /srv/src:
> 
>    # burp -a r -r ^/srv/src
> 
>    Do not suspend the restore process: the bug appears to require sustained I/O
>    to trigger. In trials where I suspended it multiple times during a restore,
>    corruption did not surface.
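> 
>    For concreteness, a sketch of steps 1-2 (the LV name, user, and ACL
>    entries here are illustrative placeholders, not the exact originals):
> 
>    # mkfs.xfs /dev/mapper/xfsbugtest--vg-src
>    # mount /dev/mapper/xfsbugtest--vg-src /srv/src
>    # mkdir /srv/src/aaaaa
>    # setfacl -m u:backup:rwx -m d:u:backup:rwx /srv/src/aaaaa
> 
>    and a rough stand-in for the step 3 generator (bash; names, sizes and
>    counts are arbitrary, not the original script):
> 
>    for i in $(seq 1 100000); do
>        # three levels of randomly named directories, then one random file
>        d="/srv/src/aaaaa/$((RANDOM % 50))/$((RANDOM % 50))/$((RANDOM % 50))"
>        mkdir -p "$d"
>        head -c "$RANDOM" /dev/urandom > "$d/f$i"
>    done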
> 
> 
> Expected outcome (observed when restoring to e.g. ext4):
> 
> 1: Can create files (permissions permitting) in every directory under
>    /srv/src
> 
> 2: Default ACL on every directory is the same as the backup utility wrote
> 
> 3: If filesystem is unmounted and xfs_repair is run on it, no errors will be
>    found
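> 
>    For example, outcomes 1-3 can be checked with the following (the
>    getfacl snapshot assumes the ACLs were also dumped from the source
>    tree before the backup, for comparison; paths are illustrative):
> 
>    # getfacl -R -p /srv/src > /tmp/acls.restored
>    # diff /tmp/acls.source /tmp/acls.restored
>    # find /srv/src -type d -exec touch {}/asdf \;
>    # umount /srv/src
>    # xfs_repair -n /dev/mapper/xfsbugtest--vg-dst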
> 
> 
> Actual outcome (observed when restoring to XFS):
> 
> 1: Some files & directories cannot be written to. The easiest way to find
>    problem directories is:
> 
>    # find . -type d -exec touch {}/asdf \;
>    touch: cannot touch ‘./aaaaa/-BIz/asdf’: Cannot allocate memory
>    touch: cannot touch ‘./aaaaa/-BIz/Zp.NyvX0guz./asdf’: Cannot allocate memory
>    touch: cannot touch ‘./aaaaa/-BIz/Zp.NyvX0guz./TWDU/asdf’: Cannot allocate memory
>    [etc]
> 
>    Giving VMs more RAM has no effect on this. Clearing the ACL on the directory
>    has no effect.
> 
>    Affected directories are not always the same between different runs.
> 
> 2: The default ACL has not been restored to problem directories. Directories
>    which I can write to have had the default ACL restored. (A sketch for
>    listing such directories follows this section.)
> 
> 3: If filesystem is unmounted and xfs_repair is run on it, many errors are
>    reported:
> 
>    # xfs_repair -n /dev/mapper/xfsbugtest--vg-dst 2>&1 | head -90
>    Phase 1 - find and verify superblock...
>    Phase 2 - using internal log
>            - scan filesystem freespace and inode maps...
>            - found root inode chunk
>    Phase 3 - for each AG...
>            - scan (but don't clear) agi unlinked lists...
>            - process known inodes and perform inode discovery...
>            - agno = 0
>    Too many ACL entries, count -2010719080
>    entry contains illegal value in attribute named SGI_ACL_FILE or SGI_ACL_DEFAULT
>    bad security value for attribute entry 1 in attr block 0, inode 133
>    problem with attribute contents in inode 133
>    would clear attr fork
>    bad nblocks 2 for inode 133, would reset to 1
>    bad anextents 1 for inode 133, would reset to 0
>    Too many ACL entries, count -2010719080
>    entry contains illegal value in attribute named SGI_ACL_FILE or SGI_ACL_DEFAULT
>    bad security value for attribute entry 1 in attr block 0, inode 134
>    problem with attribute contents in inode 134
>    would clear attr fork
>    bad nblocks 2 for inode 134, would reset to 1
>    bad anextents 1 for inode 134, would reset to 0
>    [...]
>    bad nblocks 1 for inode 52741928, would reset to 0
>    bad anextents 1 for inode 52741928, would reset to 0
>            - process newly discovered inodes...
>    Phase 4 - check for duplicate blocks...
>            - setting up duplicate extent list...
>            - check for inodes claiming duplicate blocks...
>            - agno = 0
>            - agno = 1
>            - agno = 2
>            - agno = 3
>    No modify flag set, skipping phase 5
>    Phase 6 - check inode connectivity...
>            - traversing filesystem ...
>            - traversal finished ...
>            - moving disconnected inodes to lost+found ...
>    Phase 7 - verify link counts...
>    No modify flag set, skipping filesystem flush and exiting.
> 
>    On my production VMs, running xfs_repair without '-n' typically left many
>    files (the highest was 148k) in /lost+found and left many directories
>    without ACLs.
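> 
>    Directories whose default ACL is missing can be listed with something
>    like the following sketch (it assumes every directory in the backup
>    carried at least one default ACL entry):
> 
>    find /srv/src -type d | while read -r d; do
>        getfacl -p "$d" 2>/dev/null | grep -q '^default:' || echo "$d"
>    done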
> 
> 
> 
> xfs_info output on a corrupted filesystem on the test VM:
> 
> meta-data=/dev/mapper/xfsbugtest--vg-dst isize=256    agcount=4, agsize=655360 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=0        finobt=0
> data     =                       bsize=4096   blocks=2621440, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> log      =internal               bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> 
> xfs_metadump of filesystem is at ftp://ftp.umnh.utah.edu/general-temporary/xfs/corrupted.metadump
> 
> 
> Giant (5.9 GB uncompressed) trace-cmd output is at ftp://ftp.umnh.utah.edu/general-temporary/xfs/trace_report.xz

Is this issue reproducible with current supported Debian versions? If
not, we might want to close this bug, as Jessie (and with it v3.16.y)
is EOL'ed.

Regards,
Salvatore

