Re: "external abort on linefetch (0x814)" on Kirkwood 6282 SoC
On 24-05-18 14:30, Andrew Lunn wrote:
> On Thu, May 24, 2018 at 12:40:06PM +0300, Timo Jyrinki wrote:
>> 2018-04-25 14:16 GMT+03:00 Martin Michlmayr <firstname.lastname@example.org>:
>>> Timo Jyrinki is happy to run some tests. He's affected and has a
>>> serial console. The bug is still there in the 4.9 kernel we're
>>> shipping with Debian kernel.
>>> Andrew, what information or access do you need so this can be tracked
>> Yesterday I tried booting with mem=512M added to the u-boot's setenv
>> bootargs, and wasn't able to reproduce the problem. Booting again
>> without the parameter it was there again. I repeated a couple of times
>> with same results, although sometimes it took some time for the
>> problem to occur in the normal 1GB RAM use case so I'm not 100% sure
>> of how bullet proof the workaround is. I tried to use at least some
>> memory by starting Debian installer fetching, logging into it via ssh
>> Could someone else try it out? Double-check the parameter worked with
>> 'free'. I'm tempted to make a backup of my current / + flash
>> partitions and dist-upgrade to stretch. On that note, what would be
>> the easiest way to set the mem=512M as the default for normal boots?
>> Andrew wasn't able to reproduce the problem on his 6282 machine. Would
>> it be that he has QNAP TS-219P+ or similar that has only 512MB RAM?
> Hi Timo
> root@qnap:~# cat /proc/meminfo
> MemTotal: 511516 kB
> So let's think about what this could mean...
> Is the 1G implemented using two RAM chips? Do you have photos of your
> board? Can you identify the chips? Does u-boot say anything useful
> about the RAM?
> Could the u-boot you have not be correctly initialising the second RAM
> chip? Are you using the stock QNAP/marvell u-boot, or have you
> upgraded u-boot?
> Is there a hole in the address range between the two RAMs? The kernel
> should be able to handle that, but I don't know if you have to tell
> it, or if it can figure it out itself. Can you see anything about this
> in the kernel logs, or u-boot?
> Do we see the physical address being accessed when we get the abort?
> Is it in the top 1/2 of the RAM? Could it be a DMA operation which has
> gone over the border between the end of the first RAM and the
> beginning of the second RAM? Seems a bit unlikely....
Timo's remark about memory prompted me to run some tests.
I am not convinced it is related to u-boot or the memory chips,
specifically because the jessie kernel 4.3.0-0.bpo.1-kirkwood
(4.3.5-1~bpo8+1) does not have these issues. For me the issues started
after the flavour change from kirkwood to marvell.
I tried running stretch 4.16.0-0.bpo.1-marvell (4.16.5-1~bpo9+1) with
mem=512M, which was stable for more than 24 hours. Comparing the dmesg
output, one interesting line was missing in the 512M version:
HighMem zone: 65536 pages, LIFO batch:15
With mem=768M the kernel also boots without any bugs or error reports.
768M is the border where (according to dmesg) HighMem starts. With no
mem= (i.e. using the full 1024M), just booting already prints a lot of
error messages for me.
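As a quick sanity check of the numbers above, this small sketch converts
the page count from the HighMem dmesg line into MiB. It assumes the usual
4 KiB page size for 32-bit ARM; the function name is only illustrative.

```python
# Assumption: 4 KiB pages, the usual size on 32-bit ARM kernels.
PAGE_SIZE_BYTES = 4096


def pages_to_mib(pages: int, page_size: int = PAGE_SIZE_BYTES) -> int:
    """Convert a page count (as printed by dmesg) to MiB."""
    return pages * page_size // (1024 * 1024)


# "HighMem zone: 65536 pages" from the dmesg output with the full 1024M.
highmem_mib = pages_to_mib(65536)
total_mib = 1024
lowmem_mib = total_mib - highmem_mib

print(f"HighMem: {highmem_mib} MiB, lowmem ends at {lowmem_mib} MiB")
```

Under that assumption the HighMem zone is 256 MiB, so lowmem ends at
768M. That is consistent with mem=768M being the largest setting that
does not create a HighMem zone.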
I think changes in the handling of HighMem between the kirkwood and
marvell flavours are the cause, though I have no way other than the test
above to confirm it. Maybe the information displayed in the error
messages can help confirm that the issue is related to HighMem?
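To double-check on a running system whether a given mem= setting leaves
any HighMem in use, something like the following sketch could read
/proc/meminfo. It assumes a kernel built with CONFIG_HIGHMEM, which
exposes a HighTotal field; the helper names are hypothetical.

```python
def read_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def highmem_in_use(text: str) -> bool:
    """True if the kernel reports a non-empty HighMem zone."""
    # HighTotal is absent on kernels without CONFIG_HIGHMEM; treat as 0.
    return read_meminfo(text).get("HighTotal", 0) > 0


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            info = f.read()
    except OSError:
        info = ""  # not on Linux; nothing to report
    fields = read_meminfo(info)
    print("MemTotal :", fields.get("MemTotal"), "kB")
    print("HighTotal:", fields.get("HighTotal", 0), "kB")
    print("HighMem in use:", highmem_in_use(info))
```

If the hypothesis holds, HighTotal should be 0 with mem=512M or mem=768M
and non-zero when booting with the full 1024M.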
If there is anything I can test, please let me know.