
32-bit memory limits IN DETAIL (Was: perspectives on 32 bit vs 64 bit)



This seems to come up every now and then, so let me explain.
None of this is new information, but it can be a bit confusing.

First, i386 memory addressing.

The i386 is unlike all other processors in that there are two levels
of address translation that take place.

First, we have a 16-bit segment selector + 32-bit offset VIRTUAL address.
Now, 3 bits of that selector are sort of "taken" (2 bits of RPL and
1 local/global table bit), so you really only get 8192 segments per process.
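
In C, the breakdown of a selector value looks roughly like this (a
sketch only; the variable names are made up, and 0x002B is just an
example value):

    #include <stdio.h>

    /* Sketch: how a 16-bit segment selector value breaks down.
     * The three low bits are the "taken" ones mentioned above. */
    int main(void)
    {
        unsigned short sel = 0x002B;        /* example selector value */
        unsigned index = sel >> 3;          /* 13-bit table index -> 8192 descriptors */
        unsigned ti    = (sel >> 2) & 1;    /* table indicator: 0 = global, 1 = local */
        unsigned rpl   = sel & 3;           /* requested privilege level */

        printf("index=%u ti=%u rpl=%u\n", index, ti, rpl);
        return 0;
    }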

This VIRTUAL address is then translated into a 32-bit LINEAR address by
checking the offset against the segment limit and adding the segment base.

Then this 32-bit LINEAR address is fed to a standard page-based MMU,
producing a 32- or 36-bit PHYSICAL address.

Most processors go VIRTUAL --page tables--> PHYSICAL.
i386 goes VIRTUAL --segments--> LINEAR --page tables--> PHYSICAL.
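
As a rough C model of the segmentation step (the struct and function
names are invented, and permission/type checks are omitted):

    #include <stdlib.h>

    /* Invented names; a real descriptor also carries type and permission bits. */
    struct seg_desc {
        unsigned long base;     /* added to every offset */
        unsigned long limit;    /* highest valid offset in the segment */
    };

    static unsigned long virt_to_linear(const struct seg_desc *seg,
                                        unsigned long offset)
    {
        if (offset > seg->limit)
            abort();                    /* stands in for the #GP fault */
        return seg->base + offset;      /* the 32-bit LINEAR address */
    }
    /* That LINEAR address then goes through the ordinary page tables
     * (two-level, or three-level with PAE) to become PHYSICAL. */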


The bottleneck is the 32-bit LINEAR address space.  A process can have
at most 2^32 bytes addressable at any one time without the operating system
rewriting the page tables.

Note first of all that, if you actively use more than one segment at a
time (such as for code, stack and data), this limits your maximum segment
size to less than 2^32 bytes each, since the TOTAL of the simultaneously
accessible segments has to fit within 2^32 bytes.  So, for example,
if you had two segments of 4G, you could not have them both resident at
the same time, and so you could not get a MOV instruction from one to
the other to complete.  (And the MOV instruction itself would have
to go somewhere.)

Thus, you can not actually reach the 2^45-byte addressing limit that
"up to 2^13 segments of up to 2^32 bytes each" implies.


Secondly, even if you do "demand segmentation", bringing segments into
and out of the 32-bit LINEAR address space, this still requires that
the operating system rewrite the page tables (and invalidate the TLB entries)
in response to segment faults in order to access the relevant bits of 
PHYSICAL memory.

This is exactly the SAME operating system and hardware overhead as
using mmap or mremap to remap bits of a linear address space.  The only
difference would be if it were much easier for the user program to deal
with segments than to deal with explicit dynamic mmaps.  And it's not
at all clear that it is.
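
For comparison, here is roughly what user-space "demand segmentation
by hand" looks like with mmap: keep a fixed window of LINEAR addresses
and slide it over a large file.  Every slide rewrites page tables and
costs TLB invalidations, exactly like a segment fault would.  (The file
name and window size are made-up examples; error handling is omitted.)

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define WINDOW (256UL << 20)        /* 256 MB window of LINEAR space */

    /* MAP_FIXED replaces the old mapping in place, so the window's LINEAR
     * addresses stay constant while the backing data changes underneath. */
    static void *slide_window(void *win, int fd, off_t file_offset)
    {
        return mmap(win, WINDOW, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, file_offset);
    }

    int main(void)
    {
        int fd = open("bigfile.dat", O_RDWR);   /* hypothetical large file */
        void *win = mmap(NULL, WINDOW, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        win = slide_window(win, fd, WINDOW);    /* a "segment fault", by hand */

        munmap(win, WINDOW);
        close(fd);
        return 0;
    }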


For these reasons, 32-bit x86 operating systems tend to ignore the
segmentation features and just use paging.  It just isn't worth the
complexity, and for multi-platform operating systems like Linux, it
isn't worth the portability hassles.  In fact, this has in turn led to
x86 designers de-emphasizing segment register loading speed, so "large
model" programs that use multiple segments take a significant speed hit.


Now, for why the Linux kernel takes 1 GB of virtual address space...

Every time a user-space program does a read() or write() call, or
makes any similar system call that moves a buffer of data, the kernel
has to copy between the user buffers and its own private file cache.

For this to be possible, the source and destination buffers must
be in the same VIRTUAL address space.  And for it to be remotely efficient,
they have to be in the same LINEAR address space as well.
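
Schematically, the kernel side of a read() looks something like this
(heavily simplified: the real path goes through the VFS and the page
cache, and find_cached_data() below is a made-up placeholder for the
page-cache lookup):

    #include <linux/fs.h>
    #include <linux/uaccess.h>

    /* copy_to_user() is essentially a memcpy with fault handling; it only
     * works because the user buffer and the kernel's cached data are both
     * visible in the same LINEAR address space at the same time. */
    static ssize_t example_read(struct file *file, char __user *buf,
                                size_t count, loff_t *pos)
    {
        void *cached = find_cached_data(file, *pos, count); /* placeholder */

        if (copy_to_user(buf, cached, count))
            return -EFAULT;
        *pos += count;
        return count;
    }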

Now, it is possible to have a separate kernel address space, and demand-map
user-space buffers into it to do the copying.  That's what the 4G+4G patches
do.  But that means that on EVERY system call, you have to change the
page tables around, which results in flushing the TLB and a lot of
overhead.

The default Linux config arranges for the kernel's address space and the
user's address space to both be present at the same time.  Page table
entries have a permission bit that makes them inaccessible to user
mode but accessible from kernel mode, without having to reload the TLB.
This is very fast.  But it results in the classic "split" between 3G of
user address space and 1G of kernel address space.
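
Concretely, on i386 the split is a compile-time constant and the
permission bit is the user/supervisor flag in each page table entry.
The values below are roughly as in the i386 headers, shown only for
illustration:

    /* The classic i386 3G/1G layout. */
    #define PAGE_OFFSET  0xC0000000UL   /* kernel LINEAR space starts at 3 GB */
    #define TASK_SIZE    PAGE_OFFSET    /* user space is everything below it  */

    /* Pages mapped without this PTE flag can only be touched in kernel
     * (supervisor) mode, so both address spaces stay resident and no TLB
     * flush is needed when crossing the user/kernel boundary. */
    #define _PAGE_USER   0x004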

It could be done in different ways, but *any alternative would be much slower*
for typical programs that don't need more than 3G of address space.


The thing that's causing a real problem is that common physical
memory sizes are approaching the 4G address space.  Thus, it's no
longer guaranteed that the 1G of kernel space is big enough to hold
all of physical memory, so kernel access to some parts of it has to be
"bank-switched" (the CONFIG_HIGHMEM options).  By careful design, this
has been kept reasonably fast, but there is overhead.
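
The "bank switching" is what kmap()/kunmap() do: a page above the
lowmem line has no permanent kernel mapping, so the kernel gives it a
temporary LINEAR address in a small reserved window for just as long
as it needs to touch it.  A minimal sketch (the function name here is
made up; for lowmem pages kmap() just returns the permanent mapping):

    #include <linux/highmem.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    /* Zero a page that may live in HIGHMEM. */
    static void zero_any_page(struct page *page)
    {
        void *addr = kmap(page);        /* may have to set up a mapping */
        memset(addr, 0, PAGE_SIZE);
        kunmap(page);                   /* give the temporary slot back */
    }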

Because the kernel address space has to hold more than just RAM (in
particular, it also has to hold memory-mapped PCI devices like video
cards), if you have 1G of physical memory, the kernel will by default
only use 896M of it, leaving 128M of kernel address space for PCI devices.
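
The 896M figure is just the 1G kernel window minus a 128M reserve for
vmalloc/ioremap (i.e. PCI) mappings; the names below loosely follow the
i386 setup code:

    #define PAGE_OFFSET      0xC0000000UL                    /* 3G/1G split  */
    #define VMALLOC_RESERVE  (128UL << 20)                   /* vmalloc/PCI  */
    #define KERNEL_WINDOW    (0x100000000ULL - PAGE_OFFSET)  /* 1 GB         */
    #define MAXMEM           (KERNEL_WINDOW - VMALLOC_RESERVE)
    /* MAXMEM == 0x38000000 == 896 MB of RAM with a permanent kernel mapping */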

A different user/kernel split can help there.  I use 2.75/1.25G on 1G RAM
machines, but if you use PAE or NX, the split has to be on a 1G boundary.


But these are all workarounds.  The real solution is a larger virtual
address space, so that the original, efficient technique still works:
the user's address space and the kernel's address space (basically a
direct map of physical memory) both fit at the same time.


