
Profile of m68k kernel



Hi,

I did a quick profile of (my minimal) m68k kernel bootup in Hatari.

For that, one needs a file that tells Hatari to start profiling
immediately, and when to stop (at a breakpoint):
--- profile-boot.ini ---
profile on
b pc = sys_brk
------------------------

And then add "--parse profile-boot.ini" to the Hatari options.

When init is started, i.e. the debugger gets invoked on the "sys_brk"
function breakpoint, one can save the profile with:
	> profile save profile.txt

This file can then be post-processed with:
	$ hatari_profile -st -a System.map profile.txt

Producing:
------------------------
...
Time spent in profile = 26.63618s.
...
Used cycles:
   6.62%    28281452   memset
   6.39%    27307156   bit_putcs
   6.08%    25960166   atafb_mfb_linefill
   5.08%    21687056   arch_cpu_idle
   3.43%    14647394   atafb_imageblit
   2.50%    10664268   memmap_init_zone
   2.38%    10149374   memcpy
   1.72%     7348390   format_decode
   1.61%     6888748   update_wall_time
   1.60%     6832538   idr_get_free
...
Executed instructions:
  12.51%     6937613   bit_putcs
  12.17%     6749440   atafb_mfb_linefill
   8.09%     4485638   memset
   4.11%     2276859   atari_keyb_init
   3.34%     1851300   atafb_imageblit
   3.27%     1815464   fbcon_redraw.isra.13
   2.75%     1527042   memmap_init_zone
   2.62%     1450424   strlen
   2.11%     1170158   memcpy
...
Instruction cache misses:
   4.37%      436093   atafb_mfb_linefill
   3.70%      369549   memset
   3.65%      363946   memmap_init_zone
   3.33%      332754   atafb_imageblit
   2.85%      284450   bit_putcs
   2.52%      251835   fbcon_redraw.isra.13
   2.26%      225335   memcpy
...
Data cache hits:
  22.65%     2250343   bit_putcs
  11.67%     1158746   atafb_mfb_linefill
   9.82%      975549   atari_keyb_init
   5.02%      498853   atafb_imageblit
...
Visits/calls:
  10.64%      106920   console_conditional_schedule
   8.67%       87120   atafb_mfb_linefill
   4.80%       48284   memset
...
------------------------
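For scripted post-processing, sections like the above can be parsed
with a few lines of Python.  This is just my own sketch, assuming the
"percent  count  symbol" layout shown above; hatari_profile itself
already produces these summaries:

```python
import re

# Sketch: parse "percent  count  symbol" rows from hatari_profile
# text output (e.g. the "Used cycles:" section) into tuples.
LINE_RE = re.compile(r'^\s*(\d+\.\d+)%\s+(\d+)\s+(\S+)\s*$')

def parse_section(lines):
    """Return a list of (percent, count, symbol) tuples;
    non-matching lines (headers, '...') are skipped."""
    rows = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            rows.append((float(m.group(1)), int(m.group(2)), m.group(3)))
    return rows

sample = """\
Used cycles:
   6.62%    28281452   memset
   6.39%    27307156   bit_putcs
"""
rows = parse_section(sample.splitlines())
print(rows[0])   # -> (6.62, 28281452, 'memset')
```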

Adding the "-pg" options to the hatari_profile command line makes it
propagate costs up the function call-tree, and also generate
callgraphs out of that.  However, for the whole bootup those
callgraphs are so large (10K nodes) that they aren't really
readable, so I'm not attaching them.

The text-only output showed the following functions to cause the most costs:
  1.53%   19.99%   0.41s  5.32s  fbcon_redraw.isra.13
  1.04%   19.50%   0.28s  5.19s  fbcon_putcs
  6.39%   16.94%   1.70s  4.51s  bit_putcs

Columns are:
1. own cost (of the instructions just in this function)
2. own cost + costs from the functions called from it = total cost
3. own cost in seconds
4. total cost in seconds
5. name of the function
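As a sanity check of the columns (my own arithmetic, using the
bit_putcs row and the total profile time quoted earlier), the seconds
columns are simply the percentage columns applied to the total
profiled time:

```python
# Sanity check (my own arithmetic): seconds column should be roughly
# the percentage column applied to the total profiled time.
total_time  = 26.63618   # "Time spent in profile"
own_percent = 6.39       # bit_putcs, own cost
own_seconds = 1.70       # bit_putcs, own cost in seconds

estimate = total_time * own_percent / 100.0
print(round(estimate, 2))   # close to the reported 1.70s
```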


According to the partial callgraph, the call sequence for them is the following:
-----------------------------------------------------
  con_scroll
    |
  fbcon_scroll
    |
    | (1.5/20%)
  fbcon_redraw -- (0.6%) console_conditional_schedule
    |
    | (1/19.5%)
  fbcon_putcs ---(6.4/16.9%)
    |                    \
    | (1.1/1.6%)   ,--- bit_putcs
   get_color      /       |
    |          (0.2%)     | (3.4/9.6%)
    | (0.4%)    /      atafb_imageblit
fb_get_color_depth        |
                          | (6.1%)
                    atafb_mfb_linefill

-----------------------------------------------------


I-cache misses in memset() seemed a bit high, so I looked into that.

The whole function should fit into the I-cache, so the caller
determines where the cache misses occur.

Here's the memset inner loop:
------------------------
<instr>% (<instr sum>, <cycles sum>, <i-cache misses>, <d-cache hits>)
...
$249efa : neg.w   d3                 0.09% (48030, 148, 12, 0)
$249efc : jmp     $249f10(pc,d3.w*2) 0.09% (48030, 514884, 9178, 0)
$249f00 : move.l  d0,(a1)+           0.42% (234146, 1882622, 44, 0)
$249f02 : move.l  d0,(a1)+           0.42% (234274, 1875396, 52, 0)
$249f04 : move.l  d0,(a1)+           0.42% (234490, 1877226, 70, 0)
$249f06 : move.l  d0,(a1)+           0.43% (235846, 1887664, 24, 0)
$249f08 : move.l  d0,(a1)+           0.43% (236500, 1971834, 2344, 0)
$249f0a : move.l  d0,(a1)+           0.44% (242029, 1938204, 40, 0)
$249f0c : move.l  d0,(a1)+           0.44% (242421, 1940448, 32, 0)
$249f0e : move.l  d0,(a1)+           0.50% (279036, 2234144, 46, 0)
$249f10 : dbra    d2,$249f00         0.51% (282176, 15370, 246, 0)
$249f14 : clr.w   d2                 0.09% (48030, 0, 0, 0)
------------------------

=> The I-cache misses (3rd number in parentheses) look fine; the
misses were actually in the longish amount of code outside of this loop.

The instructions within the loop have different execution counts
because the "jmp" before it jumps into the middle of the loop.
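To illustrate, here's a rough Python model of my reading of that
computed jump (my own sketch, not kernel code): the jmp enters the
eight unrolled move.l slots so that the first, partial pass sets
n % 8 longs and every full pass sets 8, which is why the later slots
accumulate higher counts.  The real dbra edge cases are not modelled:

```python
def slot_counts(sizes_in_longs):
    """Model execution counts of the 8 unrolled move.l slots.
    The computed jmp makes the first (partial) pass run only the
    last n % 8 slots; each full pass runs all 8.  Slot index 0 is
    the first move.l, index 7 the last one before the dbra."""
    counts = [0] * 8
    for n in sizes_in_longs:
        full, rem = divmod(n, 8)
        for slot in range(8):
            counts[slot] += full        # full passes run every slot
            if slot >= 8 - rem:         # partial pass runs last 'rem'
                counts[slot] += 1
    return counts

# e.g. memsets of 12 longs (1 full pass + 4-slot partial)
# and 7 longs (7-slot partial only):
print(slot_counts([12, 7]))   # -> [1, 2, 2, 2, 3, 3, 3, 3]
```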

By comparing the execution counts (1st number in parentheses) for
each instruction address in the above loop with those in the rest of
the memset code (which deals with unaligned memset start & end
addresses), one can get a pretty accurate picture of what sizes of
memset calls there were during bootup, and how well aligned they were.

Assuming I interpreted the above "jmp" correctly, one can see from
the first move.l instruction that there were 234146 memset calls
which had at least a 32-byte-long, long-aligned area to set.

And I think that from the dbra and last move.l counts one can see
that there were 282176-279036=3140 memset calls that didn't fill
even a single 4-byte, long-aligned area.
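That arithmetic, written out as a small Python sketch (the counts are
the ones quoted above; the interpretation of the jmp landing directly
on the dbra is my assumption):

```python
# Counts from the annotated disassembly above:
first_movel = 234146   # first move.l at $249f00: full 8-long passes
last_movel  = 279036   # last move.l at $249f0e
dbra        = 282176   # dbra at $249f10

# The dbra runs once more than the last move.l only when the jmp
# landed directly on it, i.e. there was not even one aligned long
# to set (assumption based on my reading of the jump):
tiny_calls = dbra - last_movel
print(tiny_calls)   # -> 3140
```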


After I've had time to get a full Debian kernel working, I'll provide
similar info for that.

	- Eero

