Session 22: Industrial-strength memory management

Textbook: Sections 4.5.1, 4.5.2, and 4.6.3

Multiprocess paging
  working sets
  local v global allocation
x86 memory management
  segment tables
  paging

Multiprocess paging

Thus far, we've been talking about paging assuming a single process. But, as with swapping, paging efficiently in a multiprocessing OS requires some attention to multiple processes.

Working sets

A running process tends to have a working set of pages - that is, a small set of pages that it is currently using. This set tends to be small relative to the overall memory allocated to the process. For example, a Web browser might cache several Web pages in memory. But only one Web page would actually be in use at any point in time. The memory for this Web page would be in the working set, while the other cached pages would not be. The concept of working set is a little fuzzy - exactly how frequently does the page have to be accessed to be in the working set? It's a matter of your expectations of what's a reasonable frequency for a page fault.

If a process cannot have all of its working set in memory, the process will frequently incur page faults: there is always a page from its working set out of memory, and every time that page is accessed, another page from the working set has to be ejected. We would say that the process is thrashing.

In prepaging, the OS tracks the working set of every process in the system. Before the multiprocessing part of the OS switches a process into the CPU, it first ensures that all the pages of its working set are in memory. (Incidentally, when the OS loads the page, it would probably want to mark the page as being accessed recently too, so that it isn't immediately ejected.)

In contrast, demand paging simply waits until the page is accessed to load the page into memory. Since servicing a page fault takes some time, demand paging can be less efficient than prepaging. On the other hand, prepaging is less efficient than demand paging if the working set loads some pages that aren't accessed.

Local v global allocation

A related issue that arises with multiple processes is deciding how the OS chooses how many pages to allocate to each process. It's tempting, perhaps, to allocate each process an equal share of the pages. But that doesn't make much sense, since some processes are tiny relative to others. Processes whose working set is larger than the equal share will thrash, while others won't even use all of their allocation.

Probably the simplest easy technique is to treat all pages as equal, regardless of process. This is somewhat problematic, as a process that has been in the ready queue for quite a while could easily have all the oldest pages in memory - and as a result, they would all be ejected, even if that process is toward the front of the ready queue.

A more sophisticated approach would be to try to balance the pages allocated to each process so that each process has a similar page fault frequency.

Overall, it's a complicated and interesting issue that arises in industrial-strength operating systems.

Pentium memory management

Ok, so let's look at a real-world example: What the Pentium chip does. Previous versions of the Intel architecture have used different schemes, and of course the Pentium has to be backward-compatible with all of them. But we'll only look at the scheme used by the Pentium itself.

Segment tables

The Pentium has segment registers on its chip. The most important among these are cs (the code segment registers) and ds (the data segment register). (There are four others so that programs can work across segments, but they're not nearly as important.)

These segment registers are 16-bit registers, that look like the following.

bits 0-1: privilege level bit 2 : GDT/LDT bits 3-15: segment index

The segment index is actually an index into the segment descriptor table, which contains information about each possible segment.

The Intel chip supports two segment descriptor table: the LDT (local descriptor table) and the GDT (global descriptor table). The idea is that the GDT is shared among all processes, while the LDT will change for each process.

When the CPU loads an instruction, it uses the cs to determine which segment to load. The LDT/GDT bit says which descriptor table to use, and the segment index say which descriptor in that table to use.

A segment descriptor for the Intel chip contains a lot of information, including the following.

32 bits give the base of the memory address space, which is added to the eip to get actual memory address being accessed.
20 bits give the limit of the memory address. Normally, this will be in 4K increments. If the upper twenty bits of the eip exceed the limit given in the descriptor, a segmentation fault is raised, as the eip is beyond the end of the segment.
1 bit says whether the limit part of the descriptor is in 4K increments or in single bytes. This enables short segments to have a precise limit.
1 bit says whether the segment is currently in memory. This enables the chip to provide some form of segment swapping, similar to paging.
2 bits give the privilege level of the segment. The CPU maintains 4 privilege levels, always running in one of the four levels. As it runs, it can access any segment at the same or higher privilege level as the CPU's current level.

Paging

When paging is enabled, the Pentium can support a virtual address space of 4GB. It manages the large tables using the page directories we studied earlier, breaking each memory address into three pieces as follows.

__TABLE___ ___PAGE___ ___OFFSET___
 10 bits    10 bits     12 bits

So each page is 2¹²=4K large. Each page table entry and page directory entry has 2¹⁰=1024 entries. Since it happens that each page table entry and page directory entry is 4 bytes long, a directory and a table both occupy exactly one page.

Page table and page directory entries look like the following.

bits 12-31: upper 20 bits of page's actual memory address (the lower 12 bits this address are all 0).
bit 6: the referenced bit.
bit 5: the dirty bit.
bit 4: disables caching of the page.
bit 3: says that any writes to the page while in the cache must immediately be saved in memory (write-through caching).
bit 2: don't worry about this one.
bit 1: whether the page is writeable.
bit 0: whether the page is loaded.

You could have each program use a different page directory, so that each program gets its own virtual address space. But the CPU only handles one directory at a time, so you can't have a different address space for each segment. (How do you switch address spaces? I looked hard, but I couldn't find a precise answer.)

It's entirely possible (and common) to use both segments and registers. If so, the processor first decodes a memory reference into a virtual address using the segment descriptor, and then it decodes this virtual address into an actual address using the page directory and table.