
Bug#764162: Regression with kernel 3.16.7-ckt2-1



Dear Ian,

First of all, thank you very much for the reply. I wasn't able to reply
earlier due to end-of-year activities, but now I will try to be as quick as
possible.

On Dec 31 2014, Ian Campbell wrote:
> On Wed, 2014-12-31 at 06:08 -0200, Rogério Brito wrote:
> > I have a Kurobox Pro that I use as a NAS and I was affected by the network
> > corruption when the TSO was enabled in versions 3.16 before the version with
> > the "workaround" on the mv643xx_eth (not having seen the code, from a user's
> > perspective, this workaround was more like a fix than a dirty hack).
> 
> The workaround was just turning off the feature.

Exactly. This is what I did with ethtool.

> Please can you clarify which of these kernels did/didn't work (or for
> which you have no data):
>       * 3.16.7-1 (has the bug)

I had the bug with this one. I even put the last 3.14 kernel that I had
available here on hold, and all this time I was running

,----
| flash-kernel --force 3.14-2-orion5x
`----

to prevent problems in case of a power outage here and my wife having to
boot the NAS, as there are some educational programs that my little son
watches every day.

I even thought that the days of that device were numbered, given that some
newer userspace is likely to require newer kernel versions, and that this
device's life would be cut short (this was before I knew what the problem
was; I was only seeing the symptoms).

I did not report the problem because I thought that I would have little
success in explaining it (and doing git bisects on this thing would take
many weeks).  I was so happy that I wasn't the only person seeing
corruption with the 3.16.7-1 kernel!

>       * 3.16.7-2 (with the hack/workaround of disabling TSO by default)

With this, I had *no* problems and I was relieved that things went back to
working just fine, without data corruption. (I also use this NAS to back up
some of my data; if there is silent data corruption, then I would be in
trouble).

>       * 3.16.7-ckt2-1 (with the supposed proper fix, 2c2a9cb from
>         upstream, backported via the -ckt tree)

This brought back the problematic situation of 3.16.7-1. To avoid forcing
flash-kernel with the command above, I turned TSO off with ethtool, and
since then I have seen no signs of corruption.
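
For anyone wanting to reproduce the manual workaround, here is a minimal
sketch of turning TSO off and checking the result. The interface name eth0
and the helper name tso_state are assumptions made for illustration; they
are not taken from this report.

```shell
# Sketch only: "eth0" is an assumed interface name.
IFACE="${IFACE:-eth0}"

# Read `ethtool -k` output on stdin and print the TSO state ("on" or "off").
tso_state() {
    awk -F': *' '/^tcp-segmentation-offload:/ { split($2, a, " "); print a[1]; exit }'
}

# Turn TSO off and confirm the change (needs root and the ethtool package):
#   ethtool -K "$IFACE" tso off
#   ethtool -k "$IFACE" | tso_state
```

A setting made this way does not survive a reboot, so it would have to be
reapplied at boot, for example from the network configuration.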

> FWIW I am running 3.16.7-ckt2-1 on my kirkwood based ts-419 right now
> and it seems fine. It's possible that your system has a separate issue
> or is somehow more susceptible to the original (Which IIRC was cache
> based, so could affect different platforms differently).

I have read neither the code of the commit nor the context of the fix,
unfortunately.

> Please can you also confirm that flash-kernel has been run and is
> picking up the correct kernel image, i.e. it hasn't installed an old
> kernel for you or something like that. "uname -v" includes the actual
> running version.

Sure. Here you go:

,----[ uname -a ]
| Linux lattes 3.16.0-4-orion5x #1 Debian 3.16.7-ckt2-1 (2014-12-08) armv5tel GNU/Linux
`----
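
The Debian package version is the word right after "Debian" in that output.
A small sketch of extracting it follows, assuming the version always comes
directly after the literal word "Debian" in the `uname -v` string; the
function name is hypothetical.

```shell
# Hypothetical helper: print the Debian package version of the running kernel.
# Assumes the version is the word right after "Debian" in `uname -v`.
running_kernel_pkg_version() {
    # Accept the version string as an argument so it can be checked offline;
    # default to the live value from uname.
    printf '%s\n' "${1:-$(uname -v)}" |
        awk '{ for (i = 1; i < NF; i++) if ($i == "Debian") { print $(i + 1); exit } }'
}

# With the string reported above, this prints 3.16.7-ckt2-1
running_kernel_pkg_version '#1 Debian 3.16.7-ckt2-1 (2014-12-08)'
```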

> > Can we get a fix for this in time for jessie?
> 
> If one can be found of course we will try and apply it.

Thank you very much for being open to this possibility.

> Since I can't reproduce it would be useful if you could take this issue
> to the upstream developers who were involved in the original bug report
> and work with them directly to find a cure.

I may try, but I am not confident that I will have any success. :(

> If we can't find one then I suppose we will fall back to just disabling
> TSO by default on these systems.

Yes. In the absence of further data, between data corruption and a
performance hit, the choice is quite easy.


Thanks,

-- 
Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://cynic.cc/blog/ : github.com/rbrito : profiles.google.com/rbrito
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br

