
Re: [PATCH] ext4/jbd2: don't wait (forever) for stale tid caused by wraparound



On Monday, 18 March 2013 at 1:54 PM, Theodore Ts'o wrote:
> Thanks for reporting this. I thought we had fixed this in 3.0.
> Before then, when we had a tid wrap, it would result in kjournald
> spinning forever. I suspect this was your "spontaneous reboots" that
> you mentioned when you were using 2.6.39 --- did you
> have a hardware or software watchdog timer enabled by any chance?

Thank you for your prompt attention on this.  It's greatly appreciated!

We believe our previous spontaneous reboots were caused by https://bugzilla.kernel.org/show_bug.cgi?id=16991 which was resolved by our move to a 3.2 kernel (we were on a 2.6.38-bpo kernel ^1).  We do not presently use any watchdogs.
> Since we didn't have a good way of reproducing the problem at the
> time, I didn't realize that the problem had not been fully fixed;
> while jbd2_log_start_commit() would no longer cause kjournald to
> spin forever, a subsequent call to jbd2_log_wait_commit() with a
> stale transaction id would wait for a very long time (possibly until
> the heat death of the universe :-)

This mirrors what we've seen, although our ops guys haven't been waiting around for any universes to die :)
> I think a patch like this should fix things; I've run a stress test
> with a hack to increment the transaction id by 1 << 24 after each
> commit, to more quickly cause a tid wrap, and the regression tests
> seem to be passing without complaint.

Excellent news.  Again, thank you for your help in this regard.

@Ben - could you let me know what your preferred course of action would be here?  As I'm sure you can understand, I do not wish to maintain a forked kernel from Debian upstream.  Is this something you would be prepared to integrate into the 3.2 BPO kernels?

Best regards,

George

1. We moved to 2.6.38 in order to get access to the packet steering patches, which were merged into the kernel around 2.6.33 or 2.6.34 from memory.  This gave us quite a nice bump in storage performance, so we didn't want to lose it by going back to .32 to get the 200-day uptime bug fix.
