Bug#833170: linux-image-3.16.0-4-amd64: Reproducable XFS filesystem corruption, possibly connected with ACLs
Hi,
On Mon, Aug 01, 2016 at 11:25:28AM -0600, Will Aoki wrote:
> Package: src:linux
> Version: 3.16.7-ckt25-2+deb8u3
> Severity: important
>
> I hit a nasty filesystem corruption bug while restoring some backups. I'm able
> to reliably reproduce this on every system where I tried to restore to XFS and
> on a brand-new VM created for testing (which I can share as an OVF, although
> it's pretty big).
>
> Everything's running on hardware with ECC RAM, so memory errors are unlikely.
> I've reproduced it on VMs spread across three different storage arrays from two
> different vendors. Everything has at least two virtual CPUs.
>
> All systems tested use XFS on LVM.
>
> In my test VM, the only packages not from Debian stable are custom stow and tar
> packages to fix bugs in the versions in the stable release and a backport of
> burp because the version in Debian was very old. These are all userland tools
> and none should be able to cause filesystem corruption.
>
> I'm using xfsprogs from Debian jessie. On one affected system, I tried version
> 4.3.0+nmu1 and saw no difference in what xfs_repair found.
>
>
> At this point, my test case uses the burp backup software to create the I/O
> activity which triggers this bug. I have not been able to make tar trigger this
> problem.
>
>
> Steps to reproduce:
>
> 1: Create new XFS filesystem & mount it on /srv/src
>
> 2: Create some directories in /srv/src & set ACLs (including default ACLs) on
> them
>
> 3: Generate deep tree of files in each of the directories from step #2. For
> testing, I used a script which created random files & subdirectories. Total
> bulk was about 2.5 gigabytes.
>
> 4: Take backup of /srv/src with burp:
>
> # burp -a b
>
> 5: Unmount /srv/src
>
> 6: Create new XFS filesystem & mount it on /srv/src
>
> 7: Run restore to /srv/src:
>
> # burp -a r -r ^/srv/src
>
> Do not suspend the restore process: the bug appears to require sustained I/O
> to trigger. In trials where I suspended it multiple times during a restore,
> corruption did not surface.
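Steps 2 and 3 above could be scripted roughly as follows. This is only a sketch and not the reporter's actual script: the function names, the tree shape (depth, fanout, file sizes) and the ACL entry passed to setfacl are all assumptions made for illustration. The report only says that random files and subdirectories totalling about 2.5 gigabytes were created under directories carrying access and default ACLs.

```python
import os
import random
import string
import subprocess


def set_acls(path, spec="u:nobody:rwx"):
    """Set an access ACL and a default ACL on path via setfacl(1).

    The ACL entry is a placeholder; the report does not say which
    entries were used.
    """
    subprocess.run(["setfacl", "-m", spec, path], check=True)
    subprocess.run(["setfacl", "-d", "-m", spec, path], check=True)


def make_tree(root, depth=4, fanout=3, files_per_dir=5, max_size=4096, seed=0):
    """Create a random tree under root; return (dirs_created, files_created)."""
    rng = random.Random(seed)  # seeded so that names and sizes are repeatable
    alphabet = string.ascii_letters + string.digits
    dirs = files = 0
    stack = [(root, 0)]
    while stack:
        path, level = stack.pop()
        os.makedirs(path, exist_ok=True)
        dirs += 1
        for _ in range(files_per_dir):
            name = "f_" + "".join(rng.choices(alphabet, k=8))
            with open(os.path.join(path, name), "wb") as fh:
                fh.write(os.urandom(rng.randint(0, max_size)))
            files += 1
        if level < depth:
            for _ in range(fanout):
                name = "d_" + "".join(rng.choices(alphabet, k=8))
                stack.append((os.path.join(path, name), level + 1))
    return dirs, files
```

A run would create each top-level directory under /srv/src, call set_acls() on it first so that the default ACL is inherited by everything created below, and then call make_tree() on it with parameters large enough to reach the desired total size.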
>
>
> Expected outcome (observed when restoring to e.g. ext4):
>
> 1: Can create files (permissions notwithstanding) in every directory under
> /srv/src
>
> 2: Default ACL on every directory is the same as the backup utility wrote
>
> 3: If filesystem is unmounted and xfs_repair is run on it, no errors will be
> found
>
>
> Actual outcome (observed when restoring to XFS):
>
> 1: Some files & directories cannot be written. The easiest way to find problem
> directories is:
>
> # find . -type d -exec touch {}/asdf \;
> touch: cannot touch ‘./aaaaa/-BIz/asdf’: Cannot allocate memory
> touch: cannot touch ‘./aaaaa/-BIz/Zp.NyvX0guz./asdf’: Cannot allocate memory
> touch: cannot touch ‘./aaaaa/-BIz/Zp.NyvX0guz./TWDU/asdf’: Cannot allocate memory
> [etc]
>
> Giving VMs more RAM has no effect on this. Clearing the ACL on the directory
> has no effect.
>
> Affected directories are not always the same between different runs.
>
> 2: The default ACL has not been restored to problem directories. Directories
> which I can write to have had the default ACL restored.
>
> 3: If filesystem is unmounted and xfs_repair is run on it, many errors are
> reported:
>
> # xfs_repair -n /dev/mapper/xfsbugtest--vg-dst 2>&1 | head -90
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
> - scan filesystem freespace and inode maps...
> - found root inode chunk
> Phase 3 - for each AG...
> - scan (but don't clear) agi unlinked lists...
> - process known inodes and perform inode discovery...
> - agno = 0
> Too many ACL entries, count -2010719080
> entry contains illegal value in attribute named SGI_ACL_FILE or SGI_ACL_DEFAULT
> bad security value for attribute entry 1 in attr block 0, inode 133
> problem with attribute contents in inode 133
> would clear attr fork
> bad nblocks 2 for inode 133, would reset to 1
> bad anextents 1 for inode 133, would reset to 0
> Too many ACL entries, count -2010719080
> entry contains illegal value in attribute named SGI_ACL_FILE or SGI_ACL_DEFAULT
> bad security value for attribute entry 1 in attr block 0, inode 134
> problem with attribute contents in inode 134
> would clear attr fork
> bad nblocks 2 for inode 134, would reset to 1
> bad anextents 1 for inode 134, would reset to 0
> [...]
> bad nblocks 1 for inode 52741928, would reset to 0
> bad anextents 1 for inode 52741928, would reset to 0
> - process newly discovered inodes...
> Phase 4 - check for duplicate blocks...
> - setting up duplicate extent list...
> - check for inodes claiming duplicate blocks...
> - agno = 0
> - agno = 1
> - agno = 2
> - agno = 3
> No modify flag set, skipping phase 5
> Phase 6 - check inode connectivity...
> - traversing filesystem ...
> - traversal finished ...
> - moving disconnected inodes to lost+found ...
> Phase 7 - verify link counts...
> No modify flag set, skipping filesystem flush and exiting.
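To compare runs, the xfs_repair -n output above can be reduced to a list of affected inodes. The following is a minimal sketch that relies only on the message wording visible in the quoted output; the function name and the returned structure are made up for illustration, and other xfsprogs versions may word these messages differently.

```python
import re
from collections import Counter

INODE_RE = re.compile(r"\binode (\d+)\b")


def summarize_repair_output(text):
    """Return (sorted inodes with attribute problems, Counter of message kinds)."""
    attr_inodes = set()
    kinds = Counter()
    for raw in text.splitlines():
        # Drop mail quoting ("> ") and indentation before matching.
        line = raw.lstrip("> ").strip()
        match = INODE_RE.search(line)
        if match is None:
            continue
        inode = int(match.group(1))
        if line.startswith("problem with attribute contents"):
            attr_inodes.add(inode)
            kinds["attr"] += 1
        elif line.startswith("bad nblocks"):
            kinds["nblocks"] += 1
        elif line.startswith("bad anextents"):
            kinds["anextents"] += 1
    return sorted(attr_inodes), kinds
```

Feeding it the saved output of two different restore runs would show whether the set of damaged inodes changes between runs, in the same way the affected directories do.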
>
> On my production VMs, running xfs_repair without '-n' typically left many
> files (the highest was 148k) in /lost+found and left many directories
> without ACLs.
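The per-directory write check from actual outcome 1 can also be scripted so that the errno of each failure is recorded, which makes it easier to diff the affected directories between runs. This is a sketch under assumptions: the function name and the probe file name are invented here, and unlike the find/touch one-liner it removes the probe file again after a successful create.

```python
import errno
import os


def probe_dirs(root, probe_name=".write_probe"):
    """Try to create a file in every directory under root.

    Returns a list of (directory, errno name) for each failed create.
    A pre-existing probe file is reported as EEXIST.
    """
    failures = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        probe = os.path.join(dirpath, probe_name)
        try:
            fd = os.open(probe, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            failures.append((dirpath, name))
            continue
        os.close(fd)
        os.unlink(probe)  # leave the tree as it was found
    return failures
```

On an affected filesystem the "Cannot allocate memory" failures from touch would be expected to show up here as ENOMEM entries.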
>
>
>
> xfs_info output on a corrupted filesystem on the test VM:
>
> meta-data=/dev/mapper/xfsbugtest--vg-dst isize=256    agcount=4, agsize=655360 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=0        finobt=0
> data     =                       bsize=4096   blocks=2621440, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
> log      =internal               bsize=4096   blocks=2560, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
>
> xfs_metadump of filesystem is at ftp://ftp.umnh.utah.edu/general-temporary/xfs/corrupted.metadump
>
>
> Giant (5.9 GB uncompressed) trace-cmd output is at ftp://ftp.umnh.utah.edu/general-temporary/xfs/trace_report.xz
Is this issue still reproducible with a currently supported Debian release? If
not, we might want to close this bug, since Jessie and the v3.16.y kernel
series have reached end of life.
Regards,
Salvatore