Bug#925334: vc4: CMA fills up and screen not updated anymore on raspi3
* Fabian Pietsch (Wed, 03 Apr 2019 12:30:01 +0200):
> As suggested, I tested more compact cmdline.txt, though:
> | root=/dev/mmcblk0p2
> | console=tty0 root=/dev/mmcblk0p2
> | console=tty0 root=/dev/mmcblk0p2 cma=128M
> With the first two, cma defaulted to 64M and already the lightdm login
> screen stops updating after a few seconds to minutes. With the 3rd,
> the bug initially didn't happen so I used X a bit, then logged out and
> let the system fall idle; the bug then seems to have happened
> 9868 seconds after boot (according to dmesg --follow).
Another round, with the system mostly idle (apart from manually renaming
enx[...] to eth0 after boot and restarting wicd, which came with the lxde
meta-package to manage the networking, in order to get a correct system
time via NTP): the bug happened 3960 seconds after boot (according to
dmesg). That suggests to me that the cmdline.txt built by raspi3-firmware
was not the issue here.
I still don't know whether it's possible or sensible to disable the
"weird" initial cmdline passed by the firmware, though.
In any case, the tile binning error ...
| kernel: vc4_v3d 3fc00000.v3d: Failed to allocate memory for tile binning: -12. You may need to enable CMA or give it more memory.
... together with a preceding, e.g., ...
| kernel: [drm:vc4_bo_create [vc4]] *ERROR* Failed to allocate from CMA:
| kernel: [drm] V3D: 23468kb BOs (10)
| kernel: [drm] V3D shader: 144kb BOs (35)
| kernel: [drm] dumb: 8148kb BOs (4)
... is usually surrounded (in the journal) by nothing but many lines like:
| kernel: alloc_contig_range: [36400, 37500) PFNs busy
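As an aside, to get a feeling for how large the busy range in such a
line is, here is a minimal Python sketch. It assumes that the PFNs are
printed in hexadecimal (which I believe is what the kernel's format
string does) and that the page size is 4 KiB; both are assumptions on my
part, and the helper name busy_range_kib is made up for illustration.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def busy_range_kib(line, page_size=PAGE_SIZE):
    """Return the size in KiB of the PFN range in an alloc_contig_range line.

    Assumes the two PFNs are hexadecimal; returns None if the line
    does not look like an alloc_contig_range message.
    """
    m = re.search(
        r"alloc_contig_range: \[([0-9a-f]+), ([0-9a-f]+)\) PFNs busy", line
    )
    if m is None:
        return None
    start, end = (int(x, 16) for x in m.groups())
    return (end - start) * page_size // 1024
```

Under those assumptions, the line quoted above would describe a range of
0x1100 pages, i.e. 17408 KiB (17 MiB).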
What I'm trying to say is that there seems to be no concurrent activity
visible in the logs. Watching the CMA use via /proc/meminfo suggests
that well over half of it is free most of the time.
At this point, the tile binning error seems to occur entirely at random.
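For reference, watching the CMA use can be scripted roughly like this.
It is only a sketch: it relies on the CmaTotal and CmaFree fields of
/proc/meminfo, and the function name cma_usage is my own choice.

```python
import re


def cma_usage(meminfo_text):
    """Return (total_kb, free_kb) for CMA from /proc/meminfo text.

    Returns None if the kernel does not report CmaTotal/CmaFree.
    """
    # Collect every "Name:   <number> kB" line into a dict.
    fields = dict(
        re.findall(r"^(\w+):\s+(\d+) kB$", meminfo_text, re.MULTILINE)
    )
    if "CmaTotal" not in fields or "CmaFree" not in fields:
        return None
    return int(fields["CmaTotal"]), int(fields["CmaFree"])
```

Calling cma_usage(open("/proc/meminfo").read()) once a second and
printing the result next to a timestamp should be enough to correlate
the free amount with the kernel messages.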
Looking at the source, vc4_allocate_bin_bo() seems to use a strategy
of successively allocating memory until it happens to get a block that
fits certain requirements. Maybe sometimes no such block is available,
so it fails with -ENOMEM. It's not clear to me, though, when and why
the function is called, seemingly at arbitrary times on an idle system.
It reads to me like an initialization, not something that is repeated
at random. (?)
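To illustrate the strategy as I read it, here is a toy model in Python.
It is not the kernel code: the alloc/free callbacks, the 16 MiB block
size and the no_boundary_cross check (a block must not straddle a
256 MiB boundary) are all assumptions made up for this illustration.

```python
SIZE = 16 << 20  # hypothetical block size: 16 MiB


def no_boundary_cross(start, size=SIZE, boundary=256 << 20):
    """Hypothetical requirement: block must not straddle a boundary."""
    return start // boundary == (start + size - 1) // boundary


def allocate_until_fit(alloc, free, fits, max_tries=16):
    """Keep allocating blocks until one satisfies fits(); free the rejects.

    Returns the accepted block, or None if no attempt succeeded
    (the analogue of failing with -ENOMEM).
    """
    rejected = []
    accepted = None
    for _ in range(max_tries):
        block = alloc()
        if block is None:        # the allocator itself ran out of memory
            break
        if fits(block):
            accepted = block
            break
        rejected.append(block)   # keep it so the next attempt lands elsewhere
    for block in rejected:
        free(block)
    return accepted
```

In this model, whether a suitable block turns up depends only on what
the allocator happens to hand out, which would be consistent with the
failure looking random from the outside.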
Again, it would be nice if the resulting device error state could
somehow be reset, or the function retried at a later point during the
same boot when more/different CMA is free. Perhaps that's already
possible (maybe even in a general way?) but I don't know how.
(Trying to unbind the driver (vc4_v3d) via sysfs led to a kernel oops,
IIRC a paging fault(?), and an attempt to bind it again was rejected
without any noticeable action.)
Fabian "canvon" Pietsch - https://www.canvon.de/