
Virtual Memory

For virtual memory we want to be less restrictive and remove the assumptions we previously made. If the logical memory space of a process is larger than physical memory, or if the program is executed on a computer with less physical memory, keeping those assumptions will cause problems. It can also limit the degree of multiprogramming.

So, let's try to run a process without it being entirely in memory. The idea here is that we have secondary storage, whose capacity is much greater than physical memory. By splitting the logical address space into small chunks, some reside in physical memory while others are stored on secondary storage. The most popular approach is an extension of the paging scheme.

Paging

vmPaging

The basic idea of paging remains unchanged: we use a page table to convert the virtual address to a physical address. However, there are now 2 page types - memory resident, where the page is in physical memory, and non-memory resident, where the page is in secondary storage. We use an "is memory resident?" bit in the page table entry (sometimes called the valid bit) to distinguish them.

The CPU can only access memory resident pages. When the CPU tries to access a non-memory resident page, it leads to a page fault and the OS needs to bring that page into physical memory.
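As a rough sketch (the dictionary-based page table and field names here are illustrative, not from any real OS), a lookup with a resident bit might look like:

```python
PAGE_SIZE = 4096  # 4 KiB pages

# Hypothetical page table: each entry has a frame number and a resident bit.
page_table = {
    0: {"frame": 7, "resident": True},     # memory resident
    1: {"frame": None, "resident": False}, # still on secondary storage
}

def translate(vaddr):
    page, offset = divmod(vaddr, PAGE_SIZE)
    entry = page_table[page]
    if not entry["resident"]:
        # In a real OS this trap makes the OS load the page, then retry.
        raise RuntimeError(f"page fault on page {page}")
    return entry["frame"] * PAGE_SIZE + offset

print(translate(100))   # resident: 7 * 4096 + 100 = 28772
```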

Lookup

When accessing a page x, we follow the general steps:

accessPage

Page Fault

illustration

pageFault1

When the OS asks for page 2, it is a quick lookup since the page is already in memory.

pageFault2

When the OS asks for page 3, since it is not in physical memory, we need to bring the page in and update the page table. It then looks for page 3 again, which is now in physical memory, and retrieves it.

So, the process looks something like this, illustrating how we can fit more processes in memory since only parts of each process need to be resident.

vmAccess

Of course, this comes at the tradeoff of longer access time. If memory accesses result in page faults most of the time, this is known as thrashing. To prevent this, we can take advantage of locality: when we bring in a page, it is likely the next access is to the same instruction or the next one.

Locality

Most programs spend most of their time in a relatively small part of their code, and within a time period, accesses are made to a relatively small part of their data.

Recall from CS2100: we have temporal locality, where a memory address that was used is likely to be used again, and spatial locality, where memory addresses close to a used address are likely to be used. In the context of pages, after a page is loaded into physical memory, it is likely to be accessed in the near future, and it contains contiguous locations that are likely to be accessed in the near future.

So with virtual memory, we completely separate the logical address space from physical memory, so the amount of physical memory no longer restricts the size of the logical address space. Physical memory is used more efficiently since pages currently not needed can stay on secondary storage. This allows more processes to reside in memory, improving CPU utilization as there are more processes to choose from.

Demand Paging

When a process is launched, we may consider allocating all text and data pages in RAM, but that can lead to a large startup cost, and we want to reduce the footprint of a process in physical memory so that more processes can be memory resident.

To achieve this, the idea is that a process starts with no memory-resident pages and we only load a page when there is a page fault. This leads to fast startup for new processes due to the small memory footprint, but a process can appear sluggish at the start due to multiple page faults, which can have a cascading effect on other processes (thrashing).

Comparison

demandPaging

In the old scheme, we would load all 5 pages of process P into memory (if there is space), update the page table and proceed.

demandPaging3

In the demand paging scheme, we don't load the pages first. The process starts with all its pages in secondary storage.

demandPaging2

When the process needs page 3, we bring in page 3 and update the page table; the same happens with page 0.

As seen, we now only load the pages the process is actually using, leaving us more space in the physical memory.

Virtual Memory Management

This brings us to the basic concept of virtual memory in terms of what needs to be loaded. But if we take a step back, at a higher level there are 3 fundamental problems arising from what we have done so far:

  • How do we structure the page table for efficiency (especially with huge page tables from a large logical memory space)?
  • Which page should be replaced when needed (due to the limited number of memory-resident pages)?
  • How do we distribute the limited frames among processes?

To address these, we have page table structures, page replacement algorithms and frame allocation policies.

Page Table Structure

Page table information is kept with the process information and takes up physical memory space. Modern computer systems provide a huge logical memory space (4GiB (32-bit) is the norm, and 8TiB is possible now). With a huge logical memory space, we have a huge number of pages, and with each page having a page table entry, we get a huge page table.

This leads to high overhead and a fragmented page table, since the page table itself occupies several memory pages.

Direct Paging

With direct paging, we keep all entries in a single table (1 large page table holds the information for the whole logical address space). With \(2^p\) pages in the logical memory space, we need p bits to specify 1 unique page, and we have \(2^p\) page table entries, each with a physical frame number + additional information bits.

Example

Given a virtual address of 32 bits, page size 4KiB (\(2^{12}\)) -> 12 bits for offset
So p = 32 - 12 = 20, giving \(2^{20}\) pages
Size of PTE = 4 bytes
Page Table Size = \(2^{20} \times 4\) bytes = 4MiB
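The arithmetic in the example can be checked in a few lines:

```python
# Direct paging overhead from the example: 32-bit addresses, 4 KiB pages,
# 4-byte page table entries.
offset_bits = 12                       # 4 KiB = 2^12 bytes
num_pages = 2 ** (32 - offset_bits)    # 2^20 pages
table_size = num_pages * 4             # bytes
print(table_size // 2**20, "MiB")      # 4 MiB
```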

2-Level Paging

Remember that a process may not use its entire virtual memory space, so a full page table is a waste. We only need to hold the parts of the page table being used, so we split the full page table into regions, where only a few are in use. As memory usage grows, new regions can be allocated (similar to splitting the logical memory space into pages). This is also known as a hierarchical paging system.

Thus, we also need a directory to keep track of the regions, analogous to a page table pointing to pages.

We start by splitting the page table into smaller page tables, each with a page table number. If the original table has \(2^p\) entries and we split it into \(2^m\) smaller page tables, m bits are needed to uniquely identify 1 smaller page table, and each smaller page table contains \(2^{p-m}\) entries. To keep track of the smaller page tables, a single page directory is needed, containing \(2^m\) indices to locate each of them.

Comparison

Originally, our page system looked like this

no2lvl

We had the page number and offset, where we used the page number to look up the page table, got the frame and added the offset to get the physical address.

Now, we don't want everything in memory, only the parts we need. So we break the page table into different parts.

2lvl

We have the page directory (aka the top-level page table) to keep track of which part is where, so it points to each part. This is usually kept in memory to speed things up. So here, we look up the page directory first before proceeding.
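A sketch of the 2-level lookup: the 9/11/12 bit split matches the worked example in this section, and the dictionaries stand in for the page directory and the small page tables.

```python
OFFSET_BITS = 12   # 4 KiB pages
TABLE_BITS = 11    # 2^11 entries per small page table
                   # (the remaining 9 bits index the page directory)

def translate(vaddr, page_directory):
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    table_idx = (vaddr >> OFFSET_BITS) & ((1 << TABLE_BITS) - 1)
    dir_idx = vaddr >> (OFFSET_BITS + TABLE_BITS)
    small_table = page_directory[dir_idx]  # an unallocated region -> fault
    frame = small_table[table_idx]
    return frame * (1 << OFFSET_BITS) + offset

# Only one region allocated: directory entry 0 -> one small page table.
directory = {0: {2: 5}}                 # page 2 of region 0 is in frame 5
print(translate((2 << 12) | 0x34, directory))   # 5 * 4096 + 0x34 = 20532
```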

Advantage

Having 2-level paging gives a huge advantage in terms of space. Going back to the earlier example and assuming only 3 page tables are in use,

Example

Given a virtual address of 32 bits, page size 4KiB (\(2^{12}\)) -> 12 bits for offset
So p = 32 - 12 = 20, giving \(2^{20}\) pages
Size of PTE = 4 bytes
Page Table Size = \(2^{20} \times 4\) bytes = 4MiB

In this paging scheme, the overhead is the size of the page directory plus the 3 page tables we actually need. Assume each smaller page table has \(2^{11}\) (\(2^m\)) entries. Since we have 12 offset bits and \(2^{20}\) pages, the directory will have \(2^{20-11} = 2^9\) entries.

2lvlAdv

As seen, the overhead is much smaller than before. Note that we can have empty entries in the page directory, whose corresponding page tables need not be allocated.
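The saving can be computed directly (a 2 KiB directory plus three 8 KiB page tables):

```python
pte_size = 4                           # bytes per entry
directory = 2**9 * pte_size            # 2^9 directory entries -> 2 KiB
in_use = 3 * 2**11 * pte_size          # 3 tables of 2^11 entries -> 24 KiB
total = directory + in_use
print(total // 1024, "KiB")            # 26 KiB, vs 4 MiB for direct paging
```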

Inverted Page Table

Here, we think of paging in a slightly different way. Keeping track of page tables is good, but fundamentally we want to know what is in physical memory. So, we keep track of frames in physical memory rather than pages (which are virtual).

A page table is per-process information (with M processes in memory, there are M independent page tables). However, only N physical memory frames can be occupied. Every process needs to keep track of all its pages, but only some pages are in memory (out of the M page tables, only N entries are valid) and physical memory is much smaller than virtual memory. This is a huge waste, since N is much smaller than the combined overhead of the M page tables.

The idea here is to keep a single mapping from physical frame to <pid, page#>. The page# is not unique among processes, but together with the pid it uniquely identifies a memory page.

In a normal page table, the entries are ordered by page number, and to look up page X we simply access the \(X^{th}\) entry. An inverted table is ordered by frame number, so we must search the whole table. So although we get a huge saving (1 table for all processes), translation is slow.

illustration

invTable

We have the PID and page number in the table. After the lookup, we get the frame number, from which we get the physical address so we can access the frame.
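A sketch of the linear search (the frame contents are made up for illustration):

```python
# One entry per physical frame, holding (pid, page#).
inverted = [
    (1, 0),   # frame 0 holds page 0 of process 1
    (2, 0),   # frame 1 holds page 0 of process 2
    (1, 3),   # frame 2 holds page 3 of process 1
]

def lookup(pid, page):
    # Ordered by frame number, so we must scan every entry.
    for frame, owner in enumerate(inverted):
        if owner == (pid, page):
            return frame
    return None   # not resident: page fault

print(lookup(1, 3))   # frame 2
```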

Page Replacement Algorithms

When there is no free physical memory frame during a page fault, we need to evict (free) a memory page. To do this, we check whether the page is clean (not modified, no need to write back) or dirty (modified, must be written back).

Some algorithms that help us find a suitable replacement page include Optimum (OPT), FIFO, Least Recently Used (LRU), Second-Chance (Clock) etc.

Modeling Memory References

In an actual memory reference, the logical address is (page number + offset). However, for page replacement algorithms we are only interested in the page number. Thus memory references are often modeled as memory reference strings, i.e. sequences of page numbers being accessed. We also fix the number of frames available.

Evaluation

We have the following formula for Memory Access Time:

\[ T_{access} = (1 - p) \times T_{mem} + p \times T_{page\_fault}\]
  • \(p\) = probability of a page fault
  • \(T_{mem}\) = access time for a memory resident page
  • \(T_{page\_fault}\) = access time if a page fault occurs

Since \(T_{page\_fault} \gg T_{mem}\), we need to keep \(p\) small for \(T_{access}\) to remain reasonable.
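Plugging in some illustrative numbers (the 100 ns / 10 ms figures below are assumptions, not from these notes) shows how strongly \(p\) dominates:

```python
T_MEM = 100e-9      # assumed: 100 ns for a memory resident access
T_FAULT = 10e-3     # assumed: 10 ms to service a page fault

def t_access(p):
    return (1 - p) * T_MEM + p * T_FAULT

print(t_access(0))       # 1e-07 s: no faults
print(t_access(0.001))   # ~1.01e-05 s: 0.1% faults -> ~100x slowdown
```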

A good algorithm should reduce the total number of page faults.

Optimal Page Replacement

The idea behind this is to replace the page that will not be used again for the longest period of time (theoretically similar to shortest job first). This guarantees the minimum number of page faults. Unfortunately, it is not realizable since it needs future knowledge of memory references. However, it is still useful as a base of comparison for other algorithms: the closer they are to OPT, the better.

opt

Following the algorithm, we get 6 page faults.
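The figure's reference string isn't reproduced in the text, so here is a generic OPT simulator run on a classic example string (a sketch, not tied to the figure's numbers):

```python
def opt_faults(refs, nframes):
    frames, faults = [], 0
    for i, page in enumerate(refs):
        if page in frames:
            continue
        faults += 1
        if len(frames) < nframes:
            frames.append(page)
        else:
            # Evict the resident page whose next use is farthest (or never).
            rest = refs[i + 1:]
            def next_use(p):
                return rest.index(p) if p in rest else len(rest)
            frames.remove(max(frames, key=next_use))
    return faults

print(opt_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))   # 7 faults
```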

FIFO

The general idea is that memory pages are evicted based on their loading time: we evict the oldest page. To do this, the OS maintains a queue of resident page numbers and removes the first page in the queue when a replacement is needed. During a page fault trap, the OS updates the queue. This is simple to implement and needs no hardware support.

algoFIFIO

Here, we get 9 page faults. Notice it now looks at the "Loaded at Time".

Something interesting here is an anomaly. Normally, when an algorithm produces page faults like this, we may want to increase the number of frames to reduce page faults. But since FIFO has no understanding of temporal locality, increasing the number of frames can increase the number of page faults. This is known as Belady's Anomaly.
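FIFO and the anomaly can be demonstrated with the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5 (not the string in the figure above):

```python
from collections import deque

def fifo_faults(refs, nframes):
    frames, order, faults = set(), deque(), 0
    for page in refs:
        if page in frames:
            continue                        # hit: FIFO order is unchanged
        faults += 1
        if len(frames) == nframes:
            frames.remove(order.popleft())  # evict the oldest page
        frames.add(page)
        order.append(page)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(refs, 3))   # 9 faults
print(fifo_faults(refs, 4))   # 10 faults: more frames, yet MORE faults
```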

To address this, we try other algorithms that capture locality better, e.g. by tracking the time of access.

Least Recently Used (LRU)

The idea for this is to make use of temporal locality: we replace the page that has not been used for the longest time. This follows the expectation that a page will be reused within a short time window, while pages not used for some time will most likely not be used again.

This attempts to approximate the OPT algorithm and generally gives good results and does not suffer from Belady's Anomaly.

algoLRU

With this, we get 7 page faults. Notice we now look at the "Last Used Time", which means we have to update it when we read the page again.

Implementation

However, this isn't easy to implement. Since we need to keep track of the "last access time", substantial hardware support is needed. There are 2 approximations we can try:

  • Use a Counter

Calculating absolute time may not be accurate, so we use a logical "time" counter which is incremented on every memory reference. The page table entry has a time-of-use field which stores the counter value whenever a reference occurs, and we replace the page with the smallest time-of-use.

However, this means we need to search through all pages, and since time-of-use is always increasing, there is a risk of overflow. Also, pages that have been in memory for a long time may carry a boosted value compared to a newly entered page.

  • Use a "Stack"

This isn't exactly a stack (we are just stacking up pages): we maintain a stack of page numbers where, if page X is referenced, it is removed from the stack (if it is an existing entry) and pushed on top. Since the entries at the top are the most recently accessed, we replace the page at the bottom of the stack, so we don't have to search through all the entries.

However, this is not a pure stack, since entries can be removed from anywhere in it, and it is hard to implement in hardware.
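In software, the stack idea maps naturally onto an ordered dictionary (a sketch of the bookkeeping only, not of the hardware; the reference string is the classic one, not the figure's):

```python
from collections import OrderedDict

def lru_faults(refs, nframes):
    stack, faults = OrderedDict(), 0       # most recently used at the end
    for page in refs:
        if page in stack:
            stack.move_to_end(page)        # re-reference: move to the top
        else:
            faults += 1
            if len(stack) == nframes:
                stack.popitem(last=False)  # evict from the bottom (LRU)
            stack[page] = True
    return faults

print(lru_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))   # 10 faults
```

Note that with 4 frames the same string gives 8 faults: unlike FIFO, more frames never means more faults.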

CLOCK

This is a modified FIFO that gives a second chance to pages that have been accessed. Each page table entry now maintains a reference bit: 1 if the page has been accessed, 0 otherwise. The oldest FIFO page is selected; if its reference bit is 0 it is replaced, otherwise it is given a second chance: the bit is cleared to 0 and the next FIFO page is selected.

pseudocode

page = circularQueue.getOldest();
while (page.refBit == 1) {
    page.refBit = 0;                  // clear: the page spent its second chance
    circularQueue.moveToNext();       // advance the clock hand
    page = circularQueue.getOldest(); // consider the next oldest page
}
replace(page);                        // refBit == 0: evict this page

clockPic

When we load a page, we move the pointer to the next location (if nothing is loaded, the pointer doesn't move). Notice that if all reference bits are 1, it degenerates into FIFO. When a page is loaded, its reference bit remains 0, but when it is read while in the queue, its reference bit changes to 1.

algoClock

Now, we managed to reduce it to 6 page faults!
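The pseudocode above can be fleshed out into a small simulator (a sketch; the figure's reference string isn't in the text, so the classic string is used again):

```python
def clock_faults(refs, nframes):
    frames = [None] * nframes     # circular buffer of resident pages
    refbit = [0] * nframes
    hand, faults = 0, 0
    for page in refs:
        if page in frames:
            refbit[frames.index(page)] = 1   # accessed: earn a 2nd chance
            continue
        faults += 1
        while refbit[hand] == 1:             # spend the second chance
            refbit[hand] = 0
            hand = (hand + 1) % nframes
        frames[hand] = page                  # victim found: replace it
        refbit[hand] = 0                     # newly loaded page starts at 0
        hand = (hand + 1) % nframes
    return faults

print(clock_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3))   # 10 faults
```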

Frame Allocation

Now, we look at how frames are allocated: which processes get how many frames? If there are N physical frames with M processes competing for them, what is the best way to distribute them? The simplest policies are:

  • equal allocation - each process gets N / M frames
  • proportional allocation - processes differ in size (and thus memory usage):
      • let \(size_p\) = size of process p
      • let \(size_{total}\) = total size of all processes
      • each process gets \(size_p / size_{total} \times N\) frames
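Proportional allocation is a one-liner per process (the sizes below are illustrative; note that rounding down may leave a few frames unassigned):

```python
def proportional(sizes, nframes):
    total = sum(sizes)
    # Each process gets size_p / size_total * N frames (rounded down).
    return [size * nframes // total for size in sizes]

print(proportional([10, 30, 60], 50))   # [5, 15, 30]
```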

Local and Global

For local replacement, the victim page is selected among the pages of the process that caused the page fault. This is the implicit assumption in the page replacement algorithms above. For global replacement, the victim page can be chosen from all physical frames: process P can take a frame from process Q by evicting Q's frame during replacement.

Local

Here, the frames allocated to a process remain constant, making performance stable between multiple runs. However, if the frames allocated are not enough, this can hinder the progress of the process.

Global

This allows self-adjustment between processes, so those that need more frames can take them from other processes. But badly behaved processes can affect others, and the frames allocated to a process can differ between runs, so performance may not be consistent: sometimes faster, sometimes slower.

Thrashing

At the end of the day, we want to reduce thrashing (which occurs due to insufficient physical frames). It incurs heavy I/O to bring non-resident pages into RAM, making it slower.

It is hard to find the right number of frames. If global replacement is used, a thrashing process can steal pages from other processes, causing those processes to thrash too, aka cascading thrashing. If local replacement is used, thrashing can be limited to 1 process, but that single process can hog the I/O and degrade the performance of other processes.

To determine the right number of frames, we notice that the set of pages referenced by a process is relatively constant in a period of time (known as locality) but as time passes, the set of pages can change.

Working Set Model

Based on the observation of locality: in a new locality, a process will page fault for its set of pages, but once that set of pages is in frames, there are no/few page faults until the process transits to a new locality.

This brings us to the working set model. Within an interval of time (the working set window △), there is a set of pages we are likely interested in. This gives us W(t, △), the set of pages actively used in the interval ending at time t. To reduce the possibility of page faults, we allocate enough frames for the pages in W.

illustration

Over time, our page accesses look something like this, and we want our working set to track them. Notice that sometimes there are dramatic changes and sometimes the accesses are more constant.

wsm

The changing region is known as the transient region, where the working set changes in size, while in the stable region the working set stays about the same for a long time. In the stable region, you will likely have fewer page faults, since those are exactly the pages you need.

The accuracy of the model is directly affected by the choice of △: too small and we may miss pages in the current locality; too large and it may contain pages from a different locality.
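W(t, △) itself is easy to state in code (a sketch; the reference string below is made up):

```python
def working_set(refs, t, delta):
    # Distinct pages referenced in the window of the last `delta`
    # accesses ending at time t (inclusive).
    return set(refs[max(0, t - delta + 1): t + 1])

refs = [1, 2, 1, 3, 3, 3, 3, 4, 4, 4]
print(working_set(refs, 3, 4))   # {1, 2, 3} -- transient: set still changing
print(working_set(refs, 6, 4))   # {3}       -- stable region
```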

Example

wsmEg