Multiprocessing Support for Hobby OSes Explained
by Ben L. Titzer
The latest version of this tutorial may be found at http://www.redpants.org/docs/mpx86.php
Introduction

Many hobby operating system projects start out with very modest goals: booting off a floppy and loading a kernel written in a high level language like C or C++. Some progress further, to the point that they can manage virtual memory and multiple processes, but very few of these operating systems ever get to the point that they support multiprocessing with more than one CPU. The reason for this is a general lack of good information on how to accomplish the necessary steps of detecting and initializing other processors in the system.

The design of a multiprocessing operating system must be made very carefully, and many situations must be taken into account to avoid race conditions that undermine the stability and correctness of the OS. Basic locking primitives are needed to protect kernel data structures from concurrent access that can result in corruption, which inevitably leads to instability in the OS kernel itself. This document touches briefly on locking mechanisms, but does not go deeply into the design decisions of a multiprocessing operating system. It is meant for the hobby OS developer who understands virtual memory and multithreading and would like to take their OS project to the next level by beginning to add multiprocessing support.

1 - Multiprocessing in a Nutshell

How does multiprocessing work? The most basic simplification is that multiple processors can execute code simultaneously and independently of each other. Instead of one processor in a system, there are several, from as few as two up to thousands. These processors can either share the same system memory or have separate private memories that only they can access. There can also be "clustered" configurations, in which there are many physical memories with several processors attached to each.
Systems that share system memory between all processors so that all processors see the same physical memory are called UMA for Uniform Memory Access. They are more often called SMP systems for Symmetric Multiprocessing. Systems that have separate, private physical memories are called NUMA for Non-uniform Memory Access. SMP architectures are generally used where the number of processors accessing the same physical memory is at most a dozen or a few dozen. This is because of the law of diminishing returns: as each processor is added, it has to compete with the other processors in the system for memory bandwidth, and so the speed increase from adding more processors becomes much less than linear. NUMA architectures, where there is no central memory for the processors to contend over, offer much greater scalability, often into the thousands of processors. NUMA systems have the disadvantages of larger memory requirements (because the OS and applications are duplicated in many separate memories) and of the extra communication overhead needed to coordinate the system's execution.

SMP and NUMA each have their specific uses. NUMA is used for systems on the scale of supercomputers and for tasks that operate on highly parallel data with little interdependence. SMP is more useful in smaller systems that operate on interdependent data, such as a PC workstation or a server. This document focuses on only one uniform memory access architecture, that of the Intel Pentium family of processors, since the Intel platform is the most common among hobby OSes, and SMP machines with Intel architecture processors are relatively commonplace.

1.1 - Basics of an SMP System

SMP systems share the same physical memory between all the processors in the system. There is one copy of the OS kernel that manages resources such as memory and devices.
The OS kernel can schedule processes to run on different CPUs without the need to copy any of the process's state from one part of physical memory to another. Since all CPUs see identical physical memory, they are all equally capable of running any particular process or interacting with the hardware devices. They are also equally capable of running the OS kernel code.

1.2 - Communication in an SMP System via Shared Memory

Processors in the system can communicate with each other by one of two methods. The first is to read and write the same addresses in physical memory to signal that some condition has been met or that one processor should perform some task. An example of two processors communicating by reading and writing the same address in memory is as follows:

processor 1:
volatile int *what_to_do = SHARED_ADDRESS; // point to some memory
*what_to_do = DO_NOTHING; // default to do nothing
// wait for other processor to set *what_to_do
while ( *what_to_do == DO_NOTHING ) ;
switch ( *what_to_do )
{
...
}
processor 2:
volatile int *what_to_do = SHARED_ADDRESS; // point to some memory
*what_to_do = DO_SOMETHING_ELSE; // notify other processor

In this example, processor 1 and processor 2 communicate by reading and writing address SHARED_ADDRESS, which we assume is some constant, previously agreed upon address. The first processor sets this integer in memory to the constant DO_NOTHING and waits in a loop until that integer becomes any other value. The second processor simply writes a value into that shared memory address, which causes the first to break out of the while loop and enter the switch statement. The second processor could tell the first to do one of several possible things based on what value it wrote to SHARED_ADDRESS.

Cache Coherency and SMP

What about processor caches? What if the shared memory is cached in one of the processors' caches? Left unaddressed, this would cause massive problems when communicating via shared memory, because the memory in question would have to be uncached to ensure that changes made to shared memory by one processor are seen by other processors interested in the same memory range. This problem is solved by a coherency protocol implemented in hardware that ensures that changes made by one processor are seen by other processors. The details of this scheme aren't particularly interesting in this document, and since they make the processor caches appear transparent to software, they are not discussed further.

1.3 - Communicating Better with Interprocessor Interrupts

The above example is a rather clumsy and particularly inefficient way to communicate with other processors. First, the processor "listening" in the while loop isn't doing anything useful while it is waiting for the other processor to signal it. The other problem is that there may in fact be more than two processors in the system (remember that there can be dozens in some SMP machines).
If more than one of these processors is listening and one processor tries to signal one of them to do something, then they will all wake up, not just one. We can reduce these problems by having the listening processor only check the flag periodically and do something useful between checks, but then the processor is less responsive. We could solve the problem of multiple listening processors with a flag for each processor, but the latency and busy polling problems still remain.

If you are an intermediate OS developer, chances are you understand this problem and know the solution already: interrupts. In multiprocessor systems, communication can be made through interprocessor interrupts (IPIs) that allow one processor to send an interrupt to another specified processor or range of processors. The ability to interrupt another processor solves both the latency and polling problems. The processor can be doing useful work, but still stay responsive to interrupts from the other processors in the system.

2 - Intel Multiprocessing Specification

Now that we have discussed the differences between polling and interrupts on SMP systems, it is time to consider the more practical questions of how they work and how to use them. To standardize how Intel processors work in an SMP setting, Intel developed a standard called the Intel Multiprocessing Specification, which sets standards for the interface between the BIOS/firmware level and the system software (OS) level. It is strongly recommended that you download this specification, as it covers some specifics that are important. The specification was introduced with the 486 line of processors, which supported multiprocessing. The 386 processors also supported multiprocessing, but saw almost no use as a multiprocessing platform because there were no standards.

2.1 - The APIC module

The centerpiece of the Intel Multiprocessing Specification is the APIC device, which stands for Advanced Programmable Interrupt Controller.
Even beginning OS developers have probably heard of the PIC (Programmable Interrupt Controller), which delivers IRQs to the processor. The APIC module is similar in function to the PIC, but it accepts and directs interrupts among multiple processors. In Intel multiprocessing systems, there is one local APIC module for each processor and at least one IO APIC that routes interrupt requests among multiple processors. The local APIC module is built into the processor die itself for the Pentium family of processors, but is a separate chip for 486 processors. This local module for 486s was a different model (the 82489DX) and had slightly fewer features than the later modules built into the Pentium line of processors. For that reason it is not discussed, and we focus on multiprocessing with the Pentium and later processors.

The local APIC module serves as the only input of interrupts to the processor. The external PIC and IO APICs send their interrupts to the local APIC of the destination processor, and that local APIC interrupts the processor. The APIC can be programmed to mask these interrupts (vectors 0-255). However, the APIC cannot mask the exceptions 0-21, which are generated internally by the processor. Each local APIC module has a unique ID that is initialized by the BIOS, firmware, or hardware. The OS is guaranteed that the local APIC IDs are unique. Local APICs are also capable of sending IPIs (interprocessor interrupts) to other processors in the system using the local APIC ID of the destination. This is primarily how the OS communicates with other processors: by programming the local APIC of the current processor (whichever processor the OS is running on) to send an IPI to a destination APIC ID.

2.2 - Bootup Sequence

The specification not only defined the APIC as the basic building block of multiple processor systems, but it also had to define some standards on booting the system so that multiple processor systems could remain backwards compatible.
Some guarantee as to the state of the other processors in the system was needed so that a uniprocessor OS could function correctly on one processor. The Multiprocessing Specification defines a standard boot sequence that guarantees the OS that the system is in a state ready for multiprocessor detection and initialization. The specification states that in the standard boot sequence the BIOS, hardware, or firmware (not the OS) will select one of the processors to be designated the BSP, or Bootstrap Processor. The selection of which processor is the BSP can be hardwired to physical location, generated randomly, or made by some other means. The only restriction the specification enforces is that one and only one processor is selected as the BSP and that the other processors, called APs for Application Processors, are initialized to Real Mode and put into a halted state. The APs' local APICs are initialized such that they will not service any interrupts. The system is initialized so that all interrupts are directed to the BSP. The BSP then boots normally, exactly as if the system were a uniprocessor machine.

2.3 - Multiple Processor Detection

The resulting initialization and loading of the OS in uniprocessor mode should be familiar to even beginning OS developers and is not the aim of this document. The aim of this document is to describe the steps the operating system must now take to detect and initialize the APs, which are still in a halted state. In order for the OS to detect the presence of multiple processors, the specification requires that the BIOS or firmware construct two tables in physical memory that describe the configuration of the system, including information about processors, IO APIC modules, IRQ assignments, buses present in the system, and other useful data for the OS. The OS must find these structures and parse them in order to determine what initialization needs to be done.
If the OS does not find these tables, then the OS can assume that the system is not multiprocessor capable and it can continue with uniprocessor initialization. This allows an OS compiled for SMP operation to fall back on default, uniprocessor behavior on a uniprocessor system.

Finding the MP Floating Pointer Structure

The first structure the OS must search for is called the MP Floating Pointer Structure. This table contains some information pertaining to the multiprocessing configuration and indicates that the system is multiprocessing compliant. This structure has the following format:
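As a rough sketch, the 16-byte layout defined in the Intel Multiprocessing Specification can be written as a C struct, together with a scan routine that looks for the signature on 16-byte boundaries and checks that the bytes of the structure sum to zero. The names mp_floating_pointer, mp_byte_sum, and mp_scan are illustrative assumptions, not names taken from the specification, and the routine scans whatever buffer it is handed rather than touching physical memory directly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* MP Floating Pointer Structure (16 bytes). Field names are illustrative. */
struct mp_floating_pointer {
    char     signature[4];  /* the ASCII string "_MP_" */
    uint32_t config_ptr;    /* MPConfig Pointer: physical address of the table */
    uint8_t  length;        /* structure length in 16-byte units (normally 1) */
    uint8_t  spec_rev;      /* specification revision: 1 = 1.1, 4 = 1.4 */
    uint8_t  checksum;      /* chosen so that all bytes sum to zero (mod 256) */
    uint8_t  features[5];   /* MP Features 1 through 5 */
};
_Static_assert(sizeof(struct mp_floating_pointer) == 16, "unexpected padding");

/* Sum of len bytes modulo 256; a valid structure sums to zero. */
static uint8_t mp_byte_sum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    while (len--)
        sum = (uint8_t)(sum + *p++);
    return sum;
}

/*
 * Hypothetical helper: scan a memory region on 16-byte boundaries for a
 * valid structure. Returns 1 and fills *out on success, 0 otherwise.
 */
static int mp_scan(const uint8_t *base, size_t len,
                   struct mp_floating_pointer *out)
{
    for (size_t off = 0; off + 16 <= len; off += 16) {
        const uint8_t *p = base + off;
        if (memcmp(p, "_MP_", 4) != 0)
            continue;
        size_t bytes = (size_t)p[8] * 16;   /* length field, in bytes */
        if (bytes < 16 || off + bytes > len)
            continue;
        if (mp_byte_sum(p, bytes) != 0)
            continue;                       /* signature matched by chance */
        memcpy(out, p, sizeof *out);
        return 1;
    }
    return 0;
}
```

A kernel would call mp_scan once per candidate region, passing a pointer through which that region of physical memory is accessible. Copying the match into a local struct avoids alignment and aliasing concerns at the cost of 16 bytes.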
The MP Floating Pointer Structure is in one of three memory areas: (1) The first kilobyte of the Extended BIOS Data Area (EBDA). (2) The last kilobyte of base memory (639-640k). (3) The BIOS ROM address space (0xF0000-0xFFFFF). The OS should search for the ASCII string "_MP_" in these three areas. If the OS finds this structure, this indicates the system is multiprocessing compliant and multiple processor initialization should continue. If this structure is not present in any of these three areas, then uniprocessor initialization should continue.

Parsing the MP Configuration Table

The MP Floating Pointer Structure indicates whether the MP Configuration Table exists by the value in MP Features 1. If this byte is zero, then the value in MPConfig Pointer is a valid pointer to the physical address of the MP Configuration Table. If MP Features 1 is non-zero, this indicates that the system is one of the default configurations described in Chapter 5 of the Intel Multiprocessing Specification. These default configurations are concisely described in that chapter and we will not discuss them fully here, except to say that they have only two processors and that the local APICs have IDs 0 and 1, among a few other nice properties. If one of these default configurations is specified in the MP Floating Pointer Structure, then the OS need not parse the MP Configuration Table, and can initialize the system based on the information in the specification.

The MP Configuration Table contains information regarding the processors, APICs, and buses in the system. It has a header (called the base table) and a series of variable length entries immediately following it at increasing addresses. The base table has the following format:
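A sketch of the base table header as a C struct is given here, with a validation routine that checks the "PCMP" signature and the checksum over the base table. The field layout follows the Intel Multiprocessing Specification; the names mp_config_header, mp_use_default_config, and mp_config_read are assumptions made for illustration, and the code assumes a little-endian host such as x86.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* MP Configuration Table header (44 bytes). Field names are illustrative. */
struct mp_config_header {
    char     signature[4];    /* the ASCII string "PCMP" */
    uint16_t base_length;     /* length of the base table in bytes */
    uint8_t  spec_rev;        /* specification revision */
    uint8_t  checksum;        /* base table bytes sum to zero (mod 256) */
    char     oem_id[8];
    char     product_id[12];
    uint32_t oem_table_ptr;   /* optional OEM table, 0 if absent */
    uint16_t oem_table_size;
    uint16_t entry_count;     /* Entry Count: entries following the header */
    uint32_t lapic_address;   /* physical address of the local APICs */
    uint16_t ext_length;      /* extended table length */
    uint8_t  ext_checksum;
    uint8_t  reserved;
};
_Static_assert(sizeof(struct mp_config_header) == 44, "unexpected padding");

/* Non-zero MP Features 1 selects a default configuration: no table to parse. */
static int mp_use_default_config(uint8_t features1)
{
    return features1 != 0;
}

/*
 * Hypothetical helper: copy and validate the header found at `table`,
 * where `avail` is the number of readable bytes. Returns 1 if valid.
 */
static int mp_config_read(const uint8_t *table, size_t avail,
                          struct mp_config_header *out)
{
    if (avail < sizeof *out)
        return 0;
    memcpy(out, table, sizeof *out);
    if (memcmp(out->signature, "PCMP", 4) != 0)
        return 0;
    if (out->base_length < sizeof *out || out->base_length > avail)
        return 0;
    uint8_t sum = 0;
    for (size_t i = 0; i < out->base_length; i++)
        sum = (uint8_t)(sum + table[i]);
    return sum == 0;
}
```

The checksum covers the whole base table, header and entries together, which is why the routine sums base_length bytes rather than only the 44 header bytes.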
The MP Configuration Table header is immediately followed by Entry Count entries that describe the configuration of processors, buses, and IO APICs in the system. The first byte of each entry denotes the entry type, e.g. a processor entry or a bus entry. The entries are sorted by entry type in ascending order. The table entry types are summarized as follows:
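The entry types and sizes defined by the Intel Multiprocessing Specification for the base table can be sketched as a C enum with a size lookup, which is what a parser needs in order to step from one entry to the next. The identifiers here are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Base table entry types; the first byte of every entry holds one of these. */
enum mp_entry_type {
    MP_ENTRY_PROCESSOR  = 0,  /* one per processor, 20 bytes */
    MP_ENTRY_BUS        = 1,  /* one per bus, 8 bytes */
    MP_ENTRY_IOAPIC     = 2,  /* one per IO APIC, 8 bytes */
    MP_ENTRY_IO_INTR    = 3,  /* IO interrupt assignment, 8 bytes */
    MP_ENTRY_LOCAL_INTR = 4   /* local interrupt assignment, 8 bytes */
};

/* Size in bytes of an entry of the given type, or 0 if the type is unknown. */
static size_t mp_entry_size(uint8_t type)
{
    if (type == MP_ENTRY_PROCESSOR)
        return 20;
    if (type <= MP_ENTRY_LOCAL_INTR)
        return 8;
    return 0;
}
```

Returning 0 for an unknown type gives the caller a clear signal to stop parsing instead of stepping through the table with a guessed stride.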
Since the entries of the MP Configuration Table are sorted by entry type in ascending order, the first entries will be all the processor entries, followed by all the bus entries, followed by the IO APIC entries, and so on. The OS should parse these entries in order to discover how many processors and IO APICs are present, along with the other information it will need to initialize the system. The processor entries have the format:
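A sketch of the 20-byte processor entry, as laid out in the Intel Multiprocessing Specification, together with a routine that counts the usable processors and records the local APIC ID of the BSP. It relies on the sort order described above, stopping at the first entry that is not a processor entry. The names mp_processor_entry and mp_count_processors are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MP_CPU_ENABLED   0x01u  /* EN: processor is usable */
#define MP_CPU_BOOTSTRAP 0x02u  /* BP: processor is the bootstrap processor */

/* Processor entry (type 0, 20 bytes). Field names are illustrative. */
struct mp_processor_entry {
    uint8_t  type;           /* 0 for a processor entry */
    uint8_t  lapic_id;       /* local APIC ID of this processor */
    uint8_t  lapic_version;
    uint8_t  cpu_flags;      /* MP_CPU_ENABLED, MP_CPU_BOOTSTRAP */
    uint32_t cpu_signature;  /* stepping, model, family */
    uint32_t feature_flags;
    uint32_t reserved[2];
};
_Static_assert(sizeof(struct mp_processor_entry) == 20, "unexpected padding");

/*
 * Hypothetical helper: walk the leading processor entries and return the
 * number of enabled processors. *bsp_apic_id receives the BSP's local APIC
 * ID, or -1 if no entry is flagged as the BSP.
 */
static unsigned mp_count_processors(const uint8_t *entries,
                                    uint16_t entry_count, int *bsp_apic_id)
{
    unsigned usable = 0;
    *bsp_apic_id = -1;
    for (uint16_t i = 0; i < entry_count; i++) {
        struct mp_processor_entry e;
        if (entries[0] != 0)
            break;                      /* sorted: no more processor entries */
        memcpy(&e, entries, sizeof e);
        if (e.cpu_flags & MP_CPU_ENABLED) {
            usable++;
            if (e.cpu_flags & MP_CPU_BOOTSTRAP)
                *bsp_apic_id = e.lapic_id;
        }
        entries += sizeof e;
    }
    return usable;
}
```

Checking the type byte before copying keeps the routine from reading 20 bytes out of an 8-byte entry at the end of the table.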
Bus entries identify the kinds of buses in the system. The BIOS is responsible for assigning each of them a unique ID number. These entries let the BIOS tell the OS which buses are present. The format of the entries is described in Chapter 5 of the Intel Multiprocessing Specification. Because using and initializing buses is beyond the scope of this document, bus entries are not discussed. The configuration table contains at least one IO APIC entry, which provides the OS with the ID of the IO APIC and the base address for communicating with it. The entry for an IO APIC has the following format:
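The 8-byte IO APIC entry from the Intel Multiprocessing Specification can be sketched as follows, with a small routine that decodes one raw entry and reports whether the IO APIC is marked usable. The names mp_ioapic_entry and mp_parse_ioapic are illustrative assumptions, and the code assumes a little-endian host such as x86.

```c
#include <stdint.h>
#include <string.h>

#define MP_IOAPIC_ENABLED 0x01u  /* EN: this IO APIC is usable */

/* IO APIC entry (type 2, 8 bytes). Field names are illustrative. */
struct mp_ioapic_entry {
    uint8_t  type;     /* 2 for an IO APIC entry */
    uint8_t  id;       /* IO APIC ID */
    uint8_t  version;
    uint8_t  flags;    /* MP_IOAPIC_ENABLED */
    uint32_t address;  /* physical base address of this IO APIC */
};
_Static_assert(sizeof(struct mp_ioapic_entry) == 8, "unexpected padding");

/*
 * Hypothetical helper: decode the entry at `raw`. Returns 1 if it is an
 * enabled IO APIC entry (and fills *out), 0 otherwise.
 */
static int mp_parse_ioapic(const uint8_t *raw, struct mp_ioapic_entry *out)
{
    if (raw[0] != 2)
        return 0;
    memcpy(out, raw, sizeof *out);
    return (out->flags & MP_IOAPIC_ENABLED) != 0;
}
```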
3 - Initializing and Using the local APIC

Now that we are able to detect the processors and IO APICs in a system, it is necessary to initialize and configure the bootstrap processor's local APIC so that it can begin to send interrupts to the other processors in the system. Interprocessor interrupts are the best way to communicate between processors in certain situations, and as we will see, they are used by the bootstrap processor to awaken the other processors in the system.

3.1 - Memory Mappings of APIC Modules

Each local APIC module is memory mapped into the address space of its corresponding processor. All local APICs are mapped at the same address, so that when a processor accesses this address range it is accessing its own local APIC. An IO APIC, by contrast, is mapped into the address space of all processors at the same address, so that all processors can address the same IO APIC through the same address range. Multiple IO APICs each have their own address range in which they are mapped, but are, again, mapped globally and accessible from all processors. The address ranges of the APICs are given as follows:
3.2 - The Local APIC's Register Set

In order for the OS to begin to communicate with the other processors present in the system, it must first initialize its own local APIC module. The local APIC module is the means by which the local processor can send interrupts to the other processors and is memory mapped into the address space of the processor at the addresses in the previous table. The APIC uses no IO ports and is configured by writing the appropriate settings into the APIC's registers at the correct memory offsets. The registers' offsets are summarized in the following table:
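As a sketch, a subset of the local APIC register offsets can be captured in a C enum alongside 32 bit accessor functions. The offsets and the default base address 0xFEE00000 are those documented by Intel for the local APIC; the identifier names and the accessors lapic_read and lapic_write are assumptions for illustration. The accessors take the base pointer as a parameter so that the kernel can pass whatever virtual address it has mapped the APIC to.

```c
#include <stdint.h>

#define LAPIC_DEFAULT_BASE 0xFEE00000u  /* default physical base address */

/* Byte offsets of selected local APIC registers from the base address. */
enum lapic_reg {
    LAPIC_ID        = 0x020,  /* Local APIC ID */
    LAPIC_VERSION   = 0x030,  /* Local APIC Version (read-only) */
    LAPIC_TPR       = 0x080,  /* Task Priority */
    LAPIC_EOI       = 0x0B0,  /* End Of Interrupt (write-only) */
    LAPIC_SVR       = 0x0F0,  /* Spurious-Interrupt Vector */
    LAPIC_ESR       = 0x280,  /* Error Status */
    LAPIC_ICR_LOW   = 0x300,  /* Interrupt Command, bits 0-31 */
    LAPIC_ICR_HIGH  = 0x310,  /* Interrupt Command, bits 32-63 */
    LAPIC_LVT_TIMER = 0x320,  /* Local Vector Table: timer */
    LAPIC_LVT_LINT0 = 0x350,  /* Local Vector Table: LINT0 */
    LAPIC_LVT_LINT1 = 0x360,  /* Local Vector Table: LINT1 */
    LAPIC_LVT_ERROR = 0x370   /* Local Vector Table: error */
};

/* Read one register with a single 32 bit access. */
static inline uint32_t lapic_read(volatile uint32_t *base, uint32_t reg)
{
    return base[reg / 4];
}

/* Write one register with a single 32 bit access. */
static inline void lapic_write(volatile uint32_t *base, uint32_t reg,
                               uint32_t value)
{
    base[reg / 4] = value;
}
```

In a kernel, base would point at the mapped register block, for example (volatile uint32_t *) LAPIC_DEFAULT_BASE when physical memory is identity mapped. The volatile qualifier keeps the compiler from caching or reordering away the accesses.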
Note that the local APIC's registers are divided into 32 bit words that are aligned on 16 byte boundaries. Registers that are larger than 32 bits are split into multiple 32 bit words, aligned on successive 16 byte boundaries. The Intel Multiprocessing Specification states that all local APIC registers must be accessed with 32 bit reads and writes.

3.3 - Initializing the BSP's Local APIC

In order for the OS to communicate with the other processors in the system, it first must enable and configure its local APIC. Software must first enable the local APIC by setting a bit in a register and programming other registers with vectors to handle bus and interprocessor interrupts.

Spurious-Interrupt Vector Register

The Spurious-Interrupt Vector Register contains the bit to enable and disable the local APIC. It also has a field to specify the interrupt vector number to be delivered to the processor in the event of a spurious interrupt. This register is 32 bits and has the following format:
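In Intel's documentation the VECTOR field occupies bits 0-7 of this register and the APIC enable bit is bit 8. A minimal sketch of enabling the local APIC under those definitions follows; the function name lapic_enable and the macro names are assumptions, and lapic is assumed to point at the mapped register block.

```c
#include <stdint.h>

#define LAPIC_SVR        0x0F0u      /* Spurious-Interrupt Vector Register */
#define SVR_VECTOR_MASK  0xFFu       /* bits 0-7: spurious interrupt vector */
#define SVR_APIC_ENABLE  (1u << 8)   /* bit 8: software enable of the APIC */

/*
 * Hypothetical helper: set the spurious vector and enable the local APIC,
 * leaving the remaining bits of the register as they were.
 */
static void lapic_enable(volatile uint32_t *lapic, uint8_t spurious_vector)
{
    uint32_t svr = lapic[LAPIC_SVR / 4];   /* single 32 bit read */
    svr &= ~SVR_VECTOR_MASK;
    svr |= spurious_vector;
    svr |= SVR_APIC_ENABLE;
    lapic[LAPIC_SVR / 4] = svr;            /* single 32 bit write */
}
```

On Pentium and P6 family processors the low four bits of the vector field are hardwired to 1, so a vector ending in 0xF, such as 0xFF, is a safe choice for the spurious interrupt handler.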
A spurious interrupt can happen when all pending interrupts are masked or there are no pending interrupts during an internal interrupt acknowledge cycle of the APIC. The APIC module then delivers to its local processor the interrupt vector specified by the value in the VECTOR field of this register. The processor transfers control to the interrupt handler in the IDT at the vector number delivered to it by the APIC. In short, the VECTOR field specifies which interrupt handler to transfer control to in the event of a spurious interrupt. Spurious interrupts happen because of certain interactions within the APIC's hardware itself, and do not reflect any meaningful information. Software can safely ignore these interrupts, and should program this vector to refer to an interrupt handler that ignores the interrupt.

Local APIC Version and Local APIC ID Registers

The Local APIC Version Register is a read-only register through which the APIC reports its version information to software. It also specifies the maximum number of entries in the Local Vector Table (LVT). The Local APIC ID Register stores the ID of the local APIC.
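The fields of these two registers can be decoded with a few shifts and masks. The sketch below assumes the layout in Intel's documentation: the version in bits 0-7 and the Max LVT Entry field (the number of LVT entries minus one) in bits 16-23 of the version register, and the APIC ID in the top byte of the ID register. The function names are illustrative. On Pentium and P6 family processors only bits 24-27 of the ID register are implemented, so the upper four bits of the decoded ID read as zero there.

```c
#include <stdint.h>

/* Version number of the local APIC (bits 0-7 of the version register). */
static inline uint8_t lapic_version_of(uint32_t version_reg)
{
    return (uint8_t)(version_reg & 0xFFu);
}

/* Max LVT Entry field (bits 16-23): number of LVT entries minus one. */
static inline uint8_t lapic_max_lvt_of(uint32_t version_reg)
{
    return (uint8_t)((version_reg >> 16) & 0xFFu);
}

/* Local APIC ID (bits 24-31 of the ID register). */
static inline uint8_t lapic_id_of(uint32_t id_reg)
{
    return (uint8_t)(id_reg >> 24);
}
```

The values passed in would come from 32 bit reads of the registers at offsets 0x30 (version) and 0x20 (ID) from the local APIC base.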
Local Vector Table

The Local Vector Table allows software to program the interrupt vectors that are delivered to the processor in the event of errors, timer events, and LINT0 and LINT1 interrupt inputs. It also allows software to specify status and mode information to the APIC module for the local interrupts.
3.4 - Issuing Interrupt Commands

The local APIC module has a 64 bit register called the Interrupt Command Register that software can use to cause the APIC to issue interrupts to other processors. A write to the low 32 bits of the register causes the command specified in the write operation to be issued. The format of the Interrupt Command Register is as follows:
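A minimal sketch of sending an IPI through this register is shown here. It assumes the field positions in Intel's documentation: the vector in bits 0-7, the delivery mode in bits 8-10, the delivery status in bit 12, the level bit in bit 14, and the destination APIC ID in bits 56-63 (the top byte of the high word). The function lapic_send_ipi and the constant names are assumptions for illustration. Because writing the low word issues the command, the high word holding the destination is written first.

```c
#include <stdint.h>

#define LAPIC_ICR_LOW   0x300u  /* Interrupt Command Register, bits 0-31 */
#define LAPIC_ICR_HIGH  0x310u  /* Interrupt Command Register, bits 32-63 */

/* Delivery mode field, bits 8-10 of the low word. */
enum icr_delivery_mode {
    ICR_FIXED   = 0 << 8,  /* deliver the given vector */
    ICR_NMI     = 4 << 8,  /* deliver a non-maskable interrupt */
    ICR_INIT    = 5 << 8,  /* deliver an INIT request */
    ICR_STARTUP = 6 << 8   /* deliver a start-up IPI */
};

#define ICR_DELIVERY_PENDING (1u << 12)  /* set while a send is in progress */
#define ICR_LEVEL_ASSERT     (1u << 14)  /* level: assert */

/*
 * Hypothetical helper: send an IPI to one processor, addressed by its
 * local APIC ID in physical destination mode.
 */
static void lapic_send_ipi(volatile uint32_t *lapic, uint8_t dest_apic_id,
                           uint32_t mode, uint8_t vector)
{
    /* Wait for any previous command to leave the local APIC. */
    while (lapic[LAPIC_ICR_LOW / 4] & ICR_DELIVERY_PENDING) {
    }
    /* Destination first: the write to the low word triggers the send. */
    lapic[LAPIC_ICR_HIGH / 4] = (uint32_t)dest_apic_id << 24;
    lapic[LAPIC_ICR_LOW / 4]  = mode | ICR_LEVEL_ASSERT | vector;
}
```

For example, lapic_send_ipi(lapic, 1, ICR_FIXED, 0x40) would ask the processor whose local APIC ID is 1 to run the handler installed at vector 0x40 in its IDT.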
4 - Application Processor Startup

5 - MP Detection and Initialization Recap
6 - Locks and IPIs

This article is mirrored here with permission from Ben L. Titzer.