
Re: Memory usage: buffer and cache



On Sun, 2004-10-24 at 10:46 +0200, Jan Kesten wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hi all!
> 
> |> I'm not sure but what do you mean when you say "Linux doesn't see
> |> the CPU cache."
> 
> I think he meant that the CPU cache isn't directly accessible for linux
> as for any other OS. It is inside the chip and there it stores recently
> accessed memory pages (and their addresses) to speed up following
> accesses to that memory page (indeed, since cpu cache runs at cpu clock
> speed and is sram it is much, much faster than fetching a memory
> page from main memory or even from a disk).

CPU cache holds data or instructions, not whole pages. It doesn't just
store memory addresses, because that would still mean fetching the
actual data; it stores the data itself in small fixed-size lines,
tagged with enough of the address to tell a hit from a miss. A "page"
is a concept that the OS uses; the cache itself has no concept of
paging, although a set associative cache behaves much like paging in
the OS on a conceptual level.

> 
> |> On my Intel Centrino laptop the /proc/cpuinfo file does point out
> |> that the processor has a 2mb capacity of L2 Cache.
> 
> Right, it also will have some layer 1 cache. Both are on chip; if a
> requested memory page isn't in layer 1 cache, next the cpu (not the OS)
> will look into layer 2 cache. 2MB doesn't seem very large in view of
> sizes of main memory - but often programs have some variables (data)
> that are read/written quite often (a counter in a loop for a simple
> example) and these variables could fit into one memory page. So if this
> page is in the cache, while running the cpu doesn't need to fetch pages
> from memory too often.

This is referred to as "temporal locality": the idea that if an item is
accessed once, chances are it will be accessed again soon. Although CPU
cache does exploit temporal locality, at the CPU level it leans even
more heavily on "spatial locality": the idea that if we access some
data in memory, chances are we will soon need the surrounding data.
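
To make the data side of that concrete, here is a small hypothetical C
example; the two loops do identical work, but the first walks the array
in the order it is laid out in memory while the second keeps jumping
across it (the array size and the 64-byte line size mentioned in the
comments are just assumptions):

#include <stdio.h>

#define N 1024

static int a[N][N];

int main(void)
{
    int i, j;
    long row_sum = 0, col_sum = 0;

    /* Row-major walk: consecutive accesses touch adjacent bytes, so
     * (assuming 64-byte lines and 4-byte ints) each line fetched from
     * memory is used 16 times before the next fetch is needed. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            row_sum += a[i][j];

    /* Column-major walk: the same work, but every access jumps
     * N * sizeof(int) bytes, so the spatial locality is wasted and
     * the cache misses far more often. */
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            col_sum += a[i][j];

    printf("sums: %ld %ld\n", row_sum, col_sum);
    return 0;
}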

Spatial locality is so important in CPU caches because of the linear,
sequential fashion in which instructions are arranged. The CPU has three
main stages: fetch, decode, and execute. The fetch stage is expensive
because it needs to access external memory. Spatial locality tells us
that if we are going out for an instruction, we might as well bring back
the next few and store them on the chip (the L1 cache).

Now all would be perfect except for those darn branch instructions
(if-then-else). Branching causes cache misses because we suddenly jump
to some instruction we were not anticipating. This may not seem like
such a big deal, but it's a fundamental computer science problem that
computer engineers have spent years trying to solve. Things are a bit
more complicated than just a simple cache miss.

If we were to operate on an instruction from start to finish, we would
first fetch it, then decode it, and finally execute it, then move on to
the next instruction and repeat. Newer processors use what is known as
"pipelining". The concept of a pipeline is that we work on instructions
as if on an assembly line: as soon as we fetch an instruction we pass
it on to the next stage and go fetch another. The reason has to do with
clock timing. A complex instruction may take 10 or 12 clock cycles to
complete. Each micro-operation (micro-instruction) takes one clock
cycle, and a single instruction may break down into several
micro-operations (e.g. a push or a pop). For that reason the concept of
pipelines was introduced.

If it takes a number of clock cycles to complete each instruction, then
you can see that, on average, we probably complete an instruction only
about every 10th clock cycle. With a pipeline we actually complete an
instruction every clock cycle. This may seem odd, but picture a 100 foot
hose. Let's say that we need to fill a glass using this hose. You stand
at the faucet and I stand at the glass. I shout to you to turn the water
on and when the glass is full I shout for you to turn the water off. I
go get another glass and we repeat the process until 100 glasses have
been filled.

That seems silly though. Where is all our time being lost? It is lost on
moving the water from the faucet to the glass (filling the hose). It may
take a few seconds until water from the faucet is actually leaving the
other end of the hose. 

A pipeline solves this type of problem. Once we have the pipeline
filled, we are theoretically executing an instruction on every clock
edge. As one instruction comes into the front of the pipe, an
instruction is pushed out the back. Anything coming out the back is a
completed instruction. Just think of an assembly line in some factory.
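
To put made-up numbers on that (the 10 stages and million instructions
below are assumptions, chosen only for the arithmetic):

#include <stdio.h>

int main(void)
{
    const long stages = 10;           /* pipeline depth (assumed)  */
    const long insns  = 1000000;      /* instructions to execute   */

    /* Without a pipeline every instruction occupies the whole CPU for
     * all of its stages before the next one can start. */
    long unpipelined = insns * stages;

    /* With a pipeline we pay the full latency once, to fill the pipe,
     * and then retire one instruction on every following clock edge. */
    long pipelined = stages + (insns - 1);

    printf("unpipelined: %ld cycles\n", unpipelined);
    printf("pipelined:   %ld cycles (~%ldx faster)\n",
           pipelined, unpipelined / pipelined);
    return 0;
}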

Now back to our cache problem. We can see that initially filling the
pipeline is the most expensive part. Once it's full we are in good
shape. To keep it full we rely on CPU cache and spatial locality,
assuming that the next instruction into the pipe is the one following
the last. When a branch breaks that assumption (and possibly misses the
cache as well), we most likely have to flush the pipeline, because what
we assumed to be the next sequence of instructions has changed. This is
a major problem.

The P4 has a very deep pipeline (around 31 stages on recent cores),
which makes a mispredicted branch extremely expensive, so it invests
heavily in branch prediction hardware. Newer versions of GCC support
branch prediction hinting, which lets the programmer tell the compiler
(and ultimately the CPU) which way a branch is expected to go.
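
For the curious, the hook GCC exposes for this is __builtin_expect();
the likely()/unlikely() wrappers below mirror the ones used in the
Linux kernel, and process() is just a made-up function for
illustration:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int process(char *buf)                  /* hypothetical function */
{
    /* Tell GCC the error path is rare, so it keeps the common case
     * on the straight-line (fall-through) path the pipeline prefers. */
    if (unlikely(buf == NULL))
        return -1;

    /* ... normal processing here ... */
    return 0;
}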

L1 cache is relatively small because it sits right next to the
execution core and has to be extremely fast. L2 cache is cheaper per
byte and larger; historically it lived on a separate chip, though on
current processors it is usually on the die as well. The general flow
of a cache fill is that we fill L2 cache and then from that cache we
fill L1 with a smaller subset.
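
If you want to see the actual sizes on a running box, newer glibc
exposes them through sysconf() (the _SC_LEVEL*_CACHE_SIZE names are a
glibc extension and may not exist everywhere); /proc/cpuinfo reports
much the same information. A small sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* glibc extension; returns -1 or 0 where the value isn't known */
    printf("L1 data cache:        %ld bytes\n",
           sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1 instruction cache: %ld bytes\n",
           sysconf(_SC_LEVEL1_ICACHE_SIZE));
    printf("L2 cache:             %ld bytes\n",
           sysconf(_SC_LEVEL2_CACHE_SIZE));
    return 0;
}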

Most modern CPUs don't use a direct mapped cache. Direct mapped means
that each memory address maps to exactly one possible cache location.
The Pentium based processors use a set associative cache, which means
that one address maps to a set of several locations (ways), and on a
lookup we check the whole set. This may seem odd, but it works in much
the same fashion as paging in the OS. Think about how we map page
tables: one entry in the table points to an entire page (4k of data),
yet we are in search of a single piece of data. Set associative cache
mapping is analogous to page table mapping, where a certain number of
bits are used as an index into the table and then more bits are used
to index into the page itself, and so on...
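
To make the index/tag idea concrete, here is how a lookup might carve
up a 32-bit address for a made-up cache geometry (32KB, 4-way set
associative, 64-byte lines - not any particular CPU):

#include <stdio.h>

#define LINE_SIZE   64u     /* bytes per cache line (assumed)          */
#define NUM_SETS    128u    /* 32KB / 64-byte lines / 4 ways (assumed) */
#define OFFSET_BITS 6u      /* log2(LINE_SIZE)                         */
#define INDEX_BITS  7u      /* log2(NUM_SETS)                          */

int main(void)
{
    unsigned int addr   = 0xdeadbeefu;    /* arbitrary example address */
    unsigned int offset = addr & (LINE_SIZE - 1);
    unsigned int index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    unsigned int tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    /* The index selects one set, the tag is compared against all four
     * ways in that set, and the offset picks the byte within the line. */
    printf("addr 0x%08x -> tag 0x%05x, set %u, byte offset %u\n",
           addr, tag, index, offset);
    return 0;
}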

Lastly, you have a form of address-mapping cache called a "Translation
Lookaside Buffer" or TLB. The TLB caches virtual-to-physical address
mappings so that the MMU (memory management unit) on the CPU doesn't
have to walk the page tables on every access. Remember that the OS and
its processes deal in virtual addresses, so every virtual address must
be converted into a physical address before memory can actually be
touched. The TLB caches these translations for fast reference.
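
The translation being cached is essentially the result of the page
table walk: with 4k pages the low 12 bits of a virtual address pass
straight through as the offset, and the TLB maps the remaining bits
(the virtual page number) to a physical frame. A toy illustration,
with an invented frame number:

#include <stdio.h>

#define PAGE_SHIFT 12u                      /* 4k pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

int main(void)
{
    unsigned int vaddr = 0x08049f30u;       /* example virtual address  */
    unsigned int vpn   = vaddr >> PAGE_SHIFT;  /* what the TLB is keyed on */
    unsigned int pfn   = 0x1a2u;            /* invented TLB lookup result  */
    unsigned int paddr = (pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK);

    printf("vaddr 0x%08x -> vpn 0x%05x, offset 0x%03x, paddr 0x%08x\n",
           vaddr, vpn, vaddr & PAGE_MASK, paddr);
    return 0;
}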

I've heard a few of you state that the OS doesn't see these types of
caches. Although it has no direct control over them, it definitely
considers them and the value they offer. Things like processor affinity
exist to reduce cache misses (the CPUs in an SMP system do not
necessarily share their caches). The compiler also plays a major part
in how well the CPU cache is used; many optimizations exist mainly to
improve cache behavior.
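
Processor affinity, for instance, can be requested from user space on
Linux with sched_setaffinity(); a minimal sketch (the prototype has
shifted between glibc versions, so treat this as current-glibc
flavoured):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);        /* start with an empty CPU set      */
    CPU_SET(0, &mask);      /* allow this process only on CPU 0 */

    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... anything done from here on stays on CPU 0 and keeps
     * re-using that CPU's caches ... */
    return 0;
}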

This is probably more than most of you care to know about CPU cache,
but I wanted to clear up some common misconceptions as well as test my
own retention. My computer architecture professor would be proud ;-)

-- 
Eric Gaumer <gaumerel@ecs.fullerton.edu>
