The OS uses a simple two-stage bootloader in the MBR of the disk. It was not designed for compatibility; I haven't tested it on anything other than QEMU.
The bootloader does a few things, in order, ending at `loadKernelHere`. All of this is defined in `mbr.S`, which also defines the GDT and IDT and the various segments. These variables are global and can be used elsewhere in the OS source.
The `kernel_main` function in `lib.rs` is the Rust entry point of the kernel; it is the first Rust code to run. The function initializes the kernel subsystems in order, finishing by starting the init process.
Starting the init process accomplishes a few things. First, it turns on interrupts for the first time. Second, it turns on virtual memory for the first time. Third, it is the first real process to run in the system, allowing it to spawn other processes.
Each process contains a function pointer to the code that it runs.
Unfortunately, because processes are allocated on the kernel heap, Rust will not allow them to use closures instead of function pointers. It would be nice if each process could contain a dynamically bound call using the `FnOnce` trait (essentially a thunk), but `FnOnce`'s `call` method moves the `FnOnce` out of the process, which is not allowed because the process is heap-allocated and has a destructor. There is an open RFC about calling through `Box<FnOnce>`, which would allow the desired behavior, but it is not working yet. If this RFC were implemented, we would be able to create processes like so:
```rust
let resource = Arc::new(Semaphore::new(foo, 1));
let r = resource.clone();
process::make_ready(Process::new("bar_proc", move |this| {
    println!("{:?} says {}", this, r.down());
}));
```
Processes are round-robin scheduled in the order in which `process::make_ready` is called. A process can yield the rest of its quantum by calling `process::proc_yield`. os1 uses preemptive multitasking driven by the PIT; this is largely invisible from the point of view of processes, though.
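A minimal sketch of how a kernel-mode process might interact with the scheduler; the entry-point signature, the exit-code return, and the zero-argument `proc_yield` call are assumptions, not the actual os1 API:

```rust
// Hypothetical process body: do a chunk of work, then voluntarily give up
// the rest of the quantum so the next ready process can run.
fn worker(_this: &Process) -> usize {
    for chunk in 0..10 {
        do_chunk_of_work(chunk); // assumed helper
        process::proc_yield();   // yield the rest of this quantum
    }
    0 // exit code
}

// Scheduling is round-robin in the order of make_ready calls.
process::make_ready(Process::new("worker", worker));
```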
From the point of view of the kernel, memory is split into several sections. The first megabyte of memory contains the kernel code, the VGA buffer, and various constants. The next 3MiB of memory contain the kernel heap. The following 4MiB of memory contain the physical frame data structures (more on these later). These first 8MiB of memory are direct-mapped for convenience. The next 4MiB of the virtual address space are mapped to the page directory. Finally, the next MiB of the virtual address space is reserved for the kernel to map as needed. The rest of the virtual address space is available in user mode.
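Spelled out as constants, the layout looks like this (the names are illustrative, not the actual os1 identifiers; the addresses follow directly from the sizes above):

```rust
const KERNEL_LOW:  usize = 0x0000_0000; // kernel code, VGA buffer, constants (1MiB)
const KERNEL_HEAP: usize = 0x0010_0000; // kernel heap (3MiB)
const FRAME_INFO:  usize = 0x0040_0000; // physical frame descriptors (4MiB)
                                        // everything below 8MiB is direct-mapped
const PAGE_DIR:    usize = 0x0080_0000; // mapped to the page directory (4MiB)
const KMAP_AREA:   usize = 0x00C0_0000; // kernel map-as-needed region (1MiB)
const USER_START:  usize = 0x00D0_0000; // user-mode space up to 0xFFFF_FFFF
```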
The kernel heap is used for kernel data structures only; when a process switches to user mode, its heap and stack are mapped to physical memory frames via page tables. The kernel heap is a first-fit implementation with block coalescing and splitting. In my implementation, each free block has a 3-word header consisting of the size of the block and forward and backward pointers to other free blocks, along with some flags. Each block has a footer that consists only of the size of the block.
Because the Rust compiler is the only entity making calls to allocate or free memory, less sanity checking and metadata are necessary than usual. This allows some simplifying assumptions and gives the heap implementation some nice properties that are not often possible in C:

- The compiler calls `free` with the size of the block in addition to a pointer to the block. This means that we always know as much information as we need to add a block to the free list and find the next block to merge with it if it is free. The compiler guarantees that calls to `free` will only be for legitimate used blocks.
- Every block is at least `4*WORD_SIZE` in size.
- Every block is `(4*WORD_SIZE)`-byte-aligned.

This allows for easy sanity checking and alignment guarantees, while there is also no overhead for metadata.
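A minimal sketch of the free-block metadata described above, assuming a 32-bit machine; the names are mine, not the actual os1 definitions:

```rust
// All metadata lives inside free blocks, so used blocks pay no overhead.
#[repr(C)]
struct FreeBlockHeader {
    size_and_flags: usize,      // block size; the low bits can hold flags
                                // because sizes are (4*WORD_SIZE)-aligned
    next: *mut FreeBlockHeader, // forward pointer on the free list
    prev: *mut FreeBlockHeader, // backward pointer on the free list
}

#[repr(C)]
struct FreeBlockFooter {
    size: usize, // lets a neighbor find this block's header when coalescing
}
```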
The kernel only reserves the first 8MiB of physical memory for its own use. Using the E820 memory map obtained at boot, the physical memory allocator initializes its data structures at startup. Each physical frame in up to 4GiB of memory has a 1-word descriptor (`struct FrameInfo`). If a frame is free, this descriptor holds the index of the next free frame, so that the descriptors of all free frames form a LIFO free list.
A frame is allocated by removing its `FrameInfo` from the free list, and conversely, a frame is freed by adding its `FrameInfo` to the list. When a frame is allocated, its `FrameInfo` is used to hold a pointer to a data structure describing all pages in all processes that are mapped to that frame.
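A sketch of the allocation paths under this scheme, with assumed names; the real allocator must also handle an empty free list and the shared-frame bookkeeping:

```rust
// One word per frame: while the frame is free, the word stores the index
// of the next free frame, threading a LIFO list through the array.
struct FrameInfo(usize);

struct FrameAllocator {
    infos: &'static mut [FrameInfo], // one descriptor per physical frame
    free_head: usize,                // index of the first free frame
}

impl FrameAllocator {
    // Pop a frame off the free list and return its index.
    fn alloc(&mut self) -> usize {
        let frame = self.free_head;
        self.free_head = self.infos[frame].0;
        frame // the caller repurposes infos[frame] for page bookkeeping
    }

    // Push a frame back onto the free list.
    fn free(&mut self, frame: usize) {
        self.infos[frame].0 = self.free_head;
        self.free_head = frame;
    }
}
```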
The virtual address space of a process consists of all addresses from `0xD0_0000` to `0xFFFF_FFFF`. When a page fault occurs (in kernel or user mode), if the faulting address is below this range, the fault is considered an error, and the process is killed. This is because the first 8MiB are shared, direct-mapped memory and should never produce a page fault. The next 4MiB are mapped to the process's page directory, so a fault in this range is fatal, since it means the paging structures have been corrupted. The 13th MiB is claimed by the kernel for use by the `kmap` function, which temporarily maps a page meant to reside in another process's address space; this is needed for creating a new process. Luckily, with Rust, segfaults should be extremely rare.
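A sketch of that check, reusing the hypothetical `USER_START` constant from the layout sketch above (the helper is also assumed):

```rust
// Faults below the user region mean corrupted paging structures or a wild
// pointer, so the process is killed rather than serviced.
fn page_fault(fault_addr: usize) {
    if fault_addr < USER_START {
        kill_current_process(); // assumed helper
        return;
    }
    // the address is legal; fall through to the cases below
}
```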
When a page fault occurs for a legal address, the fault handler checks the following cases:
Turning off interrupts is only an option in kernel mode, and it is the option of last resort. In some cases it is unavoidable, such as in the context-switch code. When I turn on SMP, a spin lock will also be required in the places where turning off interrupts is used today.
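Where it is used, the natural shape in Rust is an RAII guard, so interrupts come back on even on an early return. A minimal sketch, assuming `cli`/`sti` wrappers around the x86 instructions (this is not the actual os1 code, and a real guard would also need to handle nesting):

```rust
struct InterruptGuard;

impl InterruptGuard {
    fn new() -> InterruptGuard {
        unsafe { cli() }; // mask maskable interrupts
        InterruptGuard
    }
}

impl Drop for InterruptGuard {
    fn drop(&mut self) {
        unsafe { sti() }; // unmask when the guard goes out of scope
    }
}
```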
os1 has a semaphore implementation based on Rust's `Mutex` type. The `Semaphore` type has only a `down` method, signifying that the process would like to acquire the resource; this returns a RAII `SemaphoreGuard`. When the guard's lifetime ends, the resource is automatically released.
Like Rust's native `Mutex` (which unfortunately cannot be used because it lives in `libstd` and relies on libc), the `Semaphore` takes ownership of the data it protects. Thus, Rust will prevent illegal or unsafe accesses to the data at compile time.
Semaphores are available to user processes via a system call. I haven't decided yet how this will look in user mode. In kernel mode, it looks something like this:
```rust
// wrap resource in a semaphore with initial count 3
let s = Semaphore::new(resource, 3);

{
    // acquire
    let guard = s.down();
}
// guard dies => release
```
os1 also has two other semaphore-based concurrency primitives. The first is `Barrier`. When the barrier is constructed, it is given a parameter `N`. The first `N-1` processes that reach the barrier will block; when the `N`th process reaches the barrier, all of them will unblock.
```rust
let b = Barrier::new(3); // A barrier for 3 processes

// ...

// In process A
b.reach();
// If this is the 3rd process to reach the barrier, the first two will unblock;
// otherwise, this process will block until the third reaches the barrier.
```
The second primitive is `Event`. When a process waits on an event, it will block until someone "notifies" the event. An unlimited number of processes can wait on an event.
```rust
let e = Event::new(); // An event

// ...

// In process A
e.wait(); // Process A blocks

// ...

// In process B
e.notify(); // Wakes up Process A and any others waiting
```
In os1, IPC is accomplished via shared pages. A process can request to share a page with another process; it will block until the other process accepts the request. Symmetrically, a process can request that another process share a page with it, and it will block until the request is fulfilled.
```rust
// In process A
// This call shares the frame mapped at 0xC000_0000 with the process with PID = 3.
// The process will block until this call returns.
if !(*CURRENT_PROCESS).addr_space.request_share(/* pid */ 3, /* vaddr */ 0xC000_0000) {
    // If the call returns false, then an error has occurred (e.g. process 3 died)
}

// ...

// In process B
// This call accepts a shared page from the process with PID = 2. The accepted frame
// will be mapped at 0xD000_0000 in this address space. Note that this address does not
// have to be the same as in the sharing process. This process will block until the
// call returns.
if !(*CURRENT_PROCESS).addr_space.accept_share(/* pid */ 2, /* vaddr */ 0xD000_0000) {
    // If the call returns false, then an error has occurred (e.g. process 2 died)
}
```
os1 has a primitive display driver for QEMU in VGA mode. It features some very basic abstractions to make drawing and printing strings easier, including bounded boxes and word wrap.

There is a simple keyboard-input mechanism. It mostly supports ASCII alphanumeric characters.

There is also very simple IDE support via PIO (the old-fashioned way). This is used mostly to support the file system.
Like IPC, this is one of my larger departures from the Unix world. OFS is a simple, graph-based file system. It does not have directories. Every object in OFS is a file. These files make up the nodes of a graph. They can be linked together with bidirectional edges.
This FS wasn't designed to be completely practical, but rather for novelty. I don't know of any such graph-based file system in the wild. There is a good reason for this: graphs are pointer-based data structures by nature, and pointers are the bane of file systems. Researchers make extravagant efforts to avoid random disk accesses because they are extremely slow on hard disk drives. That said, the random-access penalty is supposedly not too bad on SSDs, so this might become practical in the future. Anyhow...
One file (inode #0) is designated as the "root" file. This is a bit of a misnomer because the graph is not necessarily a tree; it can have cycles. However, it does serve as a fixed point in the file system from which all other files can be reached. A path is simply a slash-separated list of filenames from the root to the destination; for example, `a/b/c` names a file `c` linked to a file `b`, which is linked to a file `a`, which is linked to the root.
For now, the file system is read-only. I still need to think about how to handle consistency issues. However, for the purposes of documentation, I will describe my end goals here.
OFS has a pretty straightforward API. I might augment it later for more functionality, but the basics should stay the same:
| Operation | Functionality |
|---|---|
| `stat` | Returns a copy of the inode of the specified file. As described below, from an inode you can tell the metadata and links of a file. |
| `new_file` | Creates a new file. |
| `delete_file` | Deletes the file by removing all of its links and freeing its inode and dnodes. |
| `open` | Opens the file, returning a file handle. |
| `link` | Creates a link between two files. |
| `unlink` | Removes the link between two files. |
| `change_meta` | Changes the metadata of a file. |
| `seek` | Seeks to the given location in the file. |
| `read` | Reads from the file. |
| `write` | Writes to the file. |
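A hypothetical use of this API; the exact signatures are assumptions, and since the file system is read-only for now, the sketch only reads:

```rust
// Open a file by a slash-separated path from the root, inspect its inode,
// and read the first bytes of its contents.
let fd = ofs.open("a/b/c");
let inode = ofs.stat(fd);       // copy of the inode: metadata and links

ofs.seek(fd, 0);                // seek to the start of the file
let mut buf = [0u8; 128];
let n = ofs.read(fd, &mut buf); // number of bytes actually read
```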
For simplicity, os1 treats the whole hard drive as a single partition. The first sector contains file system metadata; for now, this consists of 4 magic bytes, the inode count, and the dnode count. The next sectors contain the inode and dnode bitmaps, in that order. The rest of the drive contains the inodes and dnodes themselves.
Each inode represents a single file and contains the file metadata. This includes the filename, permissions, dates modified and created, links, and the dnode number of the first dnode. The list of links can conclude with the index of a dnode containing extra links.
Each dnode contains the contents of the file. The last word of the dnode is the index of the next dnode.
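A sketch of the on-disk structures just described; the field names, sizes, and counts are assumptions rather than the actual OFS definitions:

```rust
const DNODE_SIZE: usize = 512; // assumed: one sector per dnode
const N_LINKS: usize = 8;      // assumed number of inline link slots

#[repr(C)]
struct Inode {
    filename: [u8; 16],     // file name
    permissions: u32,
    created: u32,           // creation date
    modified: u32,          // modification date
    links: [u32; N_LINKS],  // inodes of linked files; the list may end with
                            // the index of a dnode holding extra links
    first_dnode: u32,       // dnode number of the first data block
}

#[repr(C)]
struct Dnode {
    data: [u8; DNODE_SIZE - 4], // file contents
    next: u32,                  // last word: index of the next dnode
}
```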
Deleting a file may disconnect the graph. If this happens, the portion disconnected from the "root" needs to be deleted. This is the job of the OFS Troll: a process that does a breadth-first search from the root and deletes any file it cannot reach, as sketched below.
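A sketch of the Troll's reachability pass, with assumed helpers (`inode_count`, `links_of`) and written against `std` collections for brevity:

```rust
use std::collections::VecDeque;

// Mark every file reachable from the root (inode #0); any file left
// unmarked is in a disconnected component and can be deleted.
fn find_reachable() -> Vec<bool> {
    let mut reachable = vec![false; inode_count()];
    let mut queue = VecDeque::new();

    reachable[0] = true; // inode #0 is the root
    queue.push_back(0);

    while let Some(file) = queue.pop_front() {
        for &link in links_of(file).iter() { // inode numbers this file links to
            if !reachable[link] {
                reachable[link] = true;
                queue.push_back(link);
            }
        }
    }
    reachable
}
```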
Since the Troll already visits all the inodes, it seems advantageous to also have it function as a basic live fsck. I haven't really thought this through too much yet.
(You might also be wondering why I called it a "Troll". Well, operating systems already have "zombies" and "reapers", so why not trolls?)