Profile of m68k kernel
Hi,
I did a quick profile of (my minimal) m68k kernel bootup in Hatari.
For that, one needs a file that tells Hatari to start profiling
immediately, and when to stop (at a breakpoint):
--- profile-boot.ini ---
profile on
b pc = sys_brk
------------------------
Then add "--parse profile-boot.ini" to the Hatari options.
When init is started, i.e. the debugger gets invoked on the "sys_brk"
function breakpoint, one can save the profile with:
> profile save profile.txt
This file can then be post-processed with:
$ hatari_profile -st -a System.map profile.txt
Producing:
------------------------
...
Time spent in profile = 26.63618s.
...
Used cycles:
6.62% 28281452 memset
6.39% 27307156 bit_putcs
6.08% 25960166 atafb_mfb_linefill
5.08% 21687056 arch_cpu_idle
3.43% 14647394 atafb_imageblit
2.50% 10664268 memmap_init_zone
2.38% 10149374 memcpy
1.72% 7348390 format_decode
1.61% 6888748 update_wall_time
1.60% 6832538 idr_get_free
...
Executed instructions:
12.51% 6937613 bit_putcs
12.17% 6749440 atafb_mfb_linefill
8.09% 4485638 memset
4.11% 2276859 atari_keyb_init
3.34% 1851300 atafb_imageblit
3.27% 1815464 fbcon_redraw.isra.13
2.75% 1527042 memmap_init_zone
2.62% 1450424 strlen
2.11% 1170158 memcpy
...
Instruction cache misses:
4.37% 436093 atafb_mfb_linefill
3.70% 369549 memset
3.65% 363946 memmap_init_zone
3.33% 332754 atafb_imageblit
2.85% 284450 bit_putcs
2.52% 251835 fbcon_redraw.isra.13
2.26% 225335 memcpy
...
Data cache hits:
22.65% 2250343 bit_putcs
11.67% 1158746 atafb_mfb_linefill
9.82% 975549 atari_keyb_init
5.02% 498853 atafb_imageblit
...
Visits/calls:
10.64% 106920 console_conditional_schedule
8.67% 87120 atafb_mfb_linefill
4.80% 48284 memset
...
------------------------
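The per-function sections above follow a simple "<percent>% <count> <symbol>"
line format, so they are easy to post-process further with a small script.
Here's a sketch (the format is assumed from the excerpt above and may
differ between hatari_profile versions):

```python
import re

# Match "<percent>% <count> <symbol>" lines as seen in the
# hatari_profile output above (format assumed from this excerpt).
LINE_RE = re.compile(r"^\s*([\d.]+)%\s+(\d+)\s+(\S+)\s*$")

def parse_section(lines):
    """Return (symbol, percent, count) tuples for matching lines."""
    result = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            result.append((m.group(3), float(m.group(1)), int(m.group(2))))
    return result

sample = [
    "  6.62%   28281452  memset",
    "  6.39%   27307156  bit_putcs",
]
print(parse_section(sample))
```

This makes it straightforward to e.g. diff the top cost functions between
two bootup profiles.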
Adding the "-pg" option to the hatari_profile command line tries to
propagate costs up the function call-tree, and also generates
callgraphs from that. However, those callgraphs are so large
(10K nodes) for the whole bootup that they aren't really readable,
so I'm not attaching them.
The text-only output showed the following functions to cause the most costs:
1.53% 19.99% 0.41s 5.32s fbcon_redraw.isra.13
1.04% 19.50% 0.28s 5.19s fbcon_putcs
6.39% 16.94% 1.70s 4.51s bit_putcs
Columns are:
1. own cost (of instructions just in this function)
2. + costs from other functions called from this = total
3. own cost in seconds
4. total cost in seconds
5. name of function
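As a sanity check, the seconds columns should just be the percentage
columns applied to the total profile time (26.63618s above). A quick
check of the bit_putcs row:

```python
# "Time spent in profile" from the hatari_profile output above.
TOTAL_S = 26.63618

def pct_to_seconds(pct):
    """Convert a cost percentage into seconds of total profile time."""
    return pct / 100.0 * TOTAL_S

# bit_putcs: 6.39% own and 16.94% total cost,
# reported as 1.70s and 4.51s respectively.
print(round(pct_to_seconds(6.39), 2))
print(round(pct_to_seconds(16.94), 2))
```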
According to the partial callgraph, the call sequence for them is the following:
-----------------------------------------------------
con_scroll
|
fbcon_scroll
|
| (1.5/20%)
fbcon_redraw -- (0.6%) console_conditional_schedule
|
| (1/19.5%)
fbcon_putcs ---(6.4/16.9%)
| \
| (1.1/1.6%) ,--- bit_putcs
get_color / |
| (0.2%) | (3.4/9.6%)
| (0.4%) / atafb_imageblit
fb_get_color_depth |
| (6.1%)
atafb_mfb_linefill
-----------------------------------------------------
I-cache misses in memset() seemed a bit high, so I looked into that.
The whole function should fit into the I-cache, so the caller determines
where the cache misses happen.
Here's the memset inner loop:
------------------------
<instr>% (<instr sum>, <cycles sum>, <i-cache misses>, <d-cache hits>)
...
$249efa : neg.w d3 0.09% (48030, 148, 12, 0)
$249efc : jmp $249f10(pc,d3.w*2) 0.09% (48030, 514884, 9178, 0)
$249f00 : move.l d0,(a1)+ 0.42% (234146, 1882622, 44, 0)
$249f02 : move.l d0,(a1)+ 0.42% (234274, 1875396, 52, 0)
$249f04 : move.l d0,(a1)+ 0.42% (234490, 1877226, 70, 0)
$249f06 : move.l d0,(a1)+ 0.43% (235846, 1887664, 24, 0)
$249f08 : move.l d0,(a1)+ 0.43% (236500, 1971834, 2344, 0)
$249f0a : move.l d0,(a1)+ 0.44% (242029, 1938204, 40, 0)
$249f0c : move.l d0,(a1)+ 0.44% (242421, 1940448, 32, 0)
$249f0e : move.l d0,(a1)+ 0.50% (279036, 2234144, 46, 0)
$249f10 : dbra d2,$249f00 0.51% (282176, 15370, 246, 0)
$249f14 : clr.w d2 0.09% (48030, 0, 0, 0)
------------------------
=> I-cache misses (3rd number in parentheses) look fine; the misses
were actually in the longish amount of code outside of this loop.
The instruction execution counts differ within the loop because the
"jmp" before it jumps into the middle of the loop.
By comparing the execution counts (1st number in parentheses) for each
instruction address in the above loop, and in the rest of the memset
code (dealing with unaligned memset start & end addresses), one can get
a pretty accurate picture of what sized memset calls there were during
bootup, and how well aligned they were.
Assuming I interpreted the above "jmp" correctly, one can see from the
first move.l instruction that there were 234146 memset calls which had
an at least 32-byte long, long-aligned area to set.
And I think that from the dbra and last move.l counts, one can see
that there were 282176-279036=3140 memset calls that didn't fill
even a single 4-byte long-aligned area.
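The arithmetic above, written out explicitly (counts taken from the
annotated loop; the interpretation of the jump table entry points is my
assumption):

```python
# Execution counts from the annotated memset inner loop above.
first_move = 234146  # move.l at $249f00: executed only when all 8 longs
                     # (32 bytes) of a loop iteration are needed
last_move  = 279036  # move.l at $249f0e: executed on every jump-table
                     # entry that fills at least one long
dbra_count = 282176  # dbra at $249f10: every pass through the loop bottom

# Calls with at least a 32-byte long-aligned area to set:
print(first_move)

# Calls that reached the loop bottom without executing even the last
# move.l, i.e. with no 4-byte long-aligned area to fill:
print(dbra_count - last_move)
```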
After I've had time to get a full Debian kernel working, I'll provide
similar info for that.
	- Eero