Booting

The OS uses a simple two-stage bootloader in the MBR of the disk. It was not designed for compatibility; I haven't tested it on anything other than QEMU.

The bootloader does a few things, in this order:

  1. Disable interrupts
  2. Print an 'x' to the console (in QEMU, this shows up on the serial console, not the VGA display)
  3. Load the second-stage bootloader into memory from the next sectors of the disk
  4. Transfer control to the second-stage bootloader, which carries out the rest of these steps:
  5. Find and copy the kernel from the disk into memory, starting at loadKernelHere, which is defined in mbr.S
  6. Execute the E820 BIOS call to get a map of available and reserved physical memory regions.
  7. Do a long jump to start the kernel.

All of this is defined in mbr.S, which also defines the GDT, the IDT, and the various segments. These symbols are global and can be used elsewhere in the OS source.

Kernel startup

The kernel_main function in lib.rs is the Rust entry point of the kernel; it is the first Rust code to run. It initializes the kernel subsystems in the following order:

  1. Initialize the TSS
  2. Initialize the kernel heap manager
  3. Initialize physical memory management
  4. Create the init and idle processes
  5. Initialize the PIC
  6. Initialize the PIT
  7. Load the root file system from HDD
  8. Jump to the init process

Starting the init process accomplishes a few things. First, it turns on interrupts for the first time. Second, it turns on virtual memory for the first time. Third, it is the first real process to run in the system, allowing it to spawn other processes.

Processes and Scheduling

Processes are created with a name and an entry point to run; for now, the entry point is a plain function pointer.

Unfortunately, because processes are allocated on the kernel heap, Rust will not allow them to use closures instead of function pointers. It would be nice if each process could contain a dynamically bound call using the FnOnce trait (essentially a thunk), but FnOnce's call method moves the FnOnce out of the process, which is not allowed because the process is heap-allocated and has a destructor. There is an open RFC about the Box construct that would allow the desired behavior, but it is not implemented yet. If this RFC were implemented, we would be able to create processes like so:


    // Wrap a shared resource in a semaphore (initial count 1)
    let resource = Arc::new(Semaphore::new(foo, 1));
    let r = resource.clone();
    process::make_ready(Process::new("bar_proc", move |this| {
        println!("{:?} says {}", this, r.down());
    }));
    

Processes are round-robin scheduled, according to the order in which process::make_ready is called. A process can yield the rest of its quantum by calling process::proc_yield. os1 uses preemptive multiprocessing based on the PIT. This is largely invisible from the point of view of processes, though.
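
For example, here is a hedged sketch of creating a process and yielding from it, using a plain function for the body since closures are not yet possible. The exact signatures of the process body, Process::new, and process::proc_yield are assumptions, not the actual os1 API:

    // A process body written as a plain function. On each iteration it
    // yields the rest of its quantum, letting the next ready process run.
    fn polite_body(_this: &Process) -> usize {
        for _ in 0..10 {
            process::proc_yield();
        }
        0
    }

    // Add the process to the round-robin ready queue.
    process::make_ready(Process::new("polite_proc", polite_body));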

Memory

From the point of view of the kernel, memory is split into several regions. The first megabyte contains the kernel code, the VGA buffer, and various constants. The next 3MiB contain the kernel heap. The following 4MiB contain the physical frame data structures (more on these later). These first 8MiB of memory are direct-mapped for convenience. The next 4MiB of the virtual address space are mapped to the page directory, and the MiB after that is reserved for the kernel to map pages as needed. The rest of the virtual address space is available in user mode.
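
To make the numbers concrete, the boundaries described above work out roughly as follows. The constant names are made up for illustration; the values simply follow from the sizes given above:

    // Virtual address space boundaries implied by the layout above.
    // Names are illustrative, not actual os1 constants.
    const KERNEL_LOW_START:   usize = 0x0000_0000; // kernel code, VGA buffer, constants (1MiB)
    const KERNEL_HEAP_START:  usize = 0x0010_0000; // kernel heap (3MiB)
    const FRAME_INFO_START:   usize = 0x0040_0000; // physical frame data structures (4MiB)
    const PAGING_AREA_START:  usize = 0x0080_0000; // mapped to the page directory (4MiB)
    const KMAP_AREA_START:    usize = 0x00C0_0000; // kernel kmap region (1MiB)
    const USER_SPACE_START:   usize = 0x00D0_0000; // everything above belongs to user mode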

Heap

The kernel heap is used for kernel data structures only; when a process switches to user mode, its heap and stack are mapped to physical memory frames via page tables. The kernel heap is a first-fit implementation with block coalescing and splitting. In my implementation, each free block has a 3-word header consisting of the size of the block and forward and backward pointers to other free blocks, along with some flags. Each block has a footer that consists only of the size of the block.
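
To make the block layout concrete, the metadata of a free block might look something like the sketch below. The field names and exact types are assumptions; only the 3-word header and 1-word footer come from the description above:

    // Sketch of the free-list metadata described above. Each free block
    // begins with a 3-word header and ends with a 1-word footer.
    #[repr(C)]
    struct FreeHeader {
        size: usize,            // block size; a few low bits double as flags
        next: *mut FreeHeader,  // forward pointer to the next free block
        prev: *mut FreeHeader,  // backward pointer to the previous free block
    }

    // The footer is a single word holding the block size; it lets the
    // allocator find the start of the preceding block when coalescing.
    type Footer = usize;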

Because the Rust compiler is the only entity making calls to allocate or free memory, less sanity checking and metadata are necessary than usual. This allows some simplifying assumptions and gives the heap implementation some nice properties that are not often possible in C: easy sanity checking and alignment guarantees, with no extra overhead for metadata.

Physical memory

The kernel reserves only the first 8MiB of physical memory for its own use. Using the E820 memory map obtained at boot, the physical memory allocator initializes its data structures at startup. Each physical frame in up to 4GiB of memory has a 1-word descriptor (struct FrameInfo). If a frame is free, this descriptor holds the index of the next free frame, so the free frames form a LIFO list threaded through their descriptors.

A frame is allocated by removing its FrameInfo from the free list, and conversely, a frame is freed by adding its FrameInfo to the list. When a frame is allocated, its FrameInfo is used to hold a pointer to a data structure describing all pages in all processes that are mapped to that frame.
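
Here is a self-contained sketch of the free-list mechanics, assuming the descriptors live in a flat array indexed by frame number. The names are hypothetical, and the real FrameInfo also stores the sharing-structure pointer for allocated frames:

    const NO_FRAME: usize = usize::MAX;

    struct FrameAllocator {
        // One word of state per frame: when free, the index of the next
        // free frame (the real FrameInfo reuses this word for a pointer
        // once the frame is allocated).
        frame_info: Vec<usize>,
        free_head: usize,
    }

    impl FrameAllocator {
        fn new(nframes: usize) -> Self {
            assert!(nframes > 0);
            // Initially every frame is free and points to its successor.
            let mut frame_info: Vec<usize> = (1..=nframes).collect();
            frame_info[nframes - 1] = NO_FRAME;
            FrameAllocator { frame_info, free_head: 0 }
        }

        // Allocate: pop a frame off the LIFO free list.
        fn alloc(&mut self) -> Option<usize> {
            if self.free_head == NO_FRAME {
                return None;
            }
            let frame = self.free_head;
            self.free_head = self.frame_info[frame];
            Some(frame)
        }

        // Free: push the frame back onto the free list.
        fn free(&mut self, frame: usize) {
            self.frame_info[frame] = self.free_head;
            self.free_head = frame;
        }
    }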

Virtual Memory

The virtual address space of a process consists of all addresses from 0xD0_0000 to 0xFFFF_FFFF. When a page fault occurs (in kernel or user mode), if the faulting address is outside this range, the fault is considered an error, and the process is killed. This is because the first 8MiB are shared, direct-mapped memory and should never produce a page fault. The next 4MiB are mapped to the process's page directory, so a fault in that range is fatal, since it means the paging structures have been corrupted. The 13th MiB is claimed by the kernel for use by the kmap function, which temporarily maps a page meant to reside in another process's address space; this is needed when creating a new process. Luckily, with Rust, segfaults should be extremely rare.

When a page fault occurs for a legal address, the fault handler checks the following cases:

  1. Is the page paged out to disk? If so, swap it back in
  2. Is the page marked present but read-only? If so, this page is COW, so clone it and mark the new page read/write
  3. Else, allocate a new frame and map the page to it
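
Sketched as pseudocode-like Rust, the decision logic reads as follows. Every helper here is a hypothetical name standing in for the corresponding os1 machinery; only the three cases come from the list above:

    // Hypothetical sketch of the fault handler for a legal faulting address.
    fn handle_page_fault(vaddr: usize) {
        if page_is_swapped_out(vaddr) {
            // Case 1: bring the page back in from the swap space.
            swap_in(vaddr);
        } else if page_is_present_readonly(vaddr) {
            // Case 2: copy-on-write; clone the frame and remap read/write.
            clone_cow_page(vaddr);
        } else {
            // Case 3: demand allocation; back the page with a fresh frame.
            let frame = alloc_frame();
            map_page(vaddr, frame);
        }
    }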

Concurrency Control

os1 uses preemptive multiprocessing via interrupts. os1 has two options for concurrency control (so far). The first is to simply turn off interrupts. The second is to use a semaphore.

Interrupts

Turning off interrupts is only an option in kernel mode. It is the option of last resort, but in some cases it is unavoidable, such as in the context-switching code. When I turn on SMP, a spin lock will also be required in the places where interrupts are currently turned off.
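
As a minimal sketch, this approach amounts to wrapping the critical section in cli/sti. This is illustrative only; the actual os1 helpers are not shown in this document, and a real version would restore the previous interrupt state instead of unconditionally re-enabling:

    use core::arch::asm;

    // Run a closure with maskable interrupts disabled (kernel mode only).
    unsafe fn without_interrupts(f: impl FnOnce()) {
        asm!("cli"); // mask interrupts
        f();         // critical section, e.g. context-switch bookkeeping
        asm!("sti"); // unmask interrupts again
    }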

Semaphores

os1 has a semaphore implementation based on Rust's Mutex type. The Semaphore type has a down method, signifying that the process would like to acquire the resource; it returns an RAII SemaphoreGuard. When the guard's lifetime ends, the resource is automatically released.

Like Rust's native Mutex (which unfortunately cannot be used because it lives in libstd and relies on libc), the Semaphore takes ownership of the data it protects. Thus, Rust will prevent illegal or unsafe accesses to the data at compile time.

Semaphores are available to user processes via a system call, though I haven't decided yet what this will look like in user mode. In kernel mode, it looks something like this:


    // wrap resource in a semaphore with initial count 3
    let s = Semaphore::new(resource, 3);

    {
        // acquire
        let guard = s.down();
    }
    // guard dies => release
    

Barriers and Events

os1 also has two other semaphore-based concurrency primitives. The first is Barrier. When the barrier is constructed, it is given a parameter N. The first N-1 processes that reach the barrier will block. When the Nth process reaches the barrier, all processes will unblock from the barrier.


    let b = Barrier::new(3); // A barrier for 3 processes

    ...

    // In process A

    b.reach();
    // If this is the 3rd process to reach the barrier, the first two will unblock;
    // otherwise, this process will block until the third reaches the barrier.
    

The second primitive is Event. When a process waits on the event, it will block until someone "notifies" the event. An unlimited number of processes can wait on an event.


    let e = Event::new(); // An event

    ...

    // In process A

    e.wait(); // Process A blocks

    ...

    // In process B
    e.notify(); // Wakes up Process A and any others waiting
    

Inter-process Communication (IPC)

In os1, IPC is accomplished via shared pages. A process can request to share a page with another process. It will block until the other process accepts the request. Symmetrically, a process can request that another process share a page with it, and it will block until the request is fulfilled.


    // In process A

    // This call will share with the process with PID = 3 the frame mapped to 0xC000_0000.
    // The process will block until this call returns.
    if !(*CURRENT_PROCESS).addr_space.request_share(/* pid */ 3, /* vaddr */ 0xC000_0000) {
        // If the call returns false, then an error has occurred (e.g. process 3 died)
    }

    ...

    // In process B

    // This call will accept a shared page from the process with PID = 2. The accepted frame
    // will be mapped to 0xD000_0000 in this address space. Note that this address does not
    // have to be the same as the one used by the sharing process. This process will block
    // until the call returns.
    if !(*CURRENT_PROCESS).addr_space.accept_share(/* pid */ 2, /* vaddr */ 0xD000_0000) {
        // If the call returns false, then an error has occurred (e.g. process 2 died)
    }
    

Display driver

os1 has a primitive display driver for QEMU in VGA mode. It features some very basic abstractions to make drawing and printing strings easier, including bounded boxes and word wrap.

I/O

Keyboard

There is a simple keyboard-input mechanism. It mostly supports ASCII alpha-numeric characters.

Block Devices

There is also very simple IDE support via PIO (the old-fashioned way). This is used mostly to support the file system.

os1 File System (OFS)

Like IPC, this is one of my larger departures from the Unix world. OFS is a simple, graph-based file system. It does not have directories. Every object in OFS is a file. These files make up the nodes of a graph. They can be linked together with bidirectional edges.

This FS wasn't designed to be completely practical, but rather for novelty. I don't know of any such graph-based file system in the wild. There is a good reason for this: graphs are pointer-based data structures by nature, and pointers are the bane of file systems. Researchers make extravagant efforts to avoid random disk accesses because they are super slow on hard disk drives. That said, supposedly the random access penalty is not too bad on SSDs, so this might be practical in the future. Anyhow...

One file (inode #0) is designated as the "root" file. This is a bit of a misnomer because the graph is not necessarily a tree; it can have cycles. However, it does serve as a fixed point in the file system from which all other files can be reached. A path is simply a slash-separated list of filenames from the root to the destination.
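
For illustration, resolving such a path is just a walk over the graph, matching one filename per component. The links_of and name_of closures below are hypothetical stand-ins for reading inodes:

    // Hedged sketch: resolve a slash-separated path starting at the
    // "root" file, inode #0.
    fn resolve(
        path: &str,
        links_of: impl Fn(u32) -> Vec<u32>, // inodes linked to a file
        name_of: impl Fn(u32) -> String,    // a file's name
    ) -> Option<u32> {
        let mut cur = 0; // the "root" file
        for part in path.split('/').filter(|p| !p.is_empty()) {
            // Follow the link whose target is named `part`.
            cur = links_of(cur).into_iter().find(|&f| name_of(f) == part)?;
        }
        Some(cur)
    }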

For now, the file system is read-only. I still need to think about how to handle consistency issues. However, for the purposes of documentation, I will describe my end goals here.

API

OFS has a pretty straight-forward API. I might augment it for functionality later, but the basics should be the same:

  Operation      Functionality
  stat           Returns a copy of the inode of the specified file. As described below, from an inode, you can tell the metadata and links of a file.
  new_file       Creates a new file.
  delete_file    Deletes a file by removing all of its links and freeing its inode and dnodes.
  open           Opens a file, returning a file handle.
  link           Creates a link between two files.
  unlink         Removes the link between two files.
  change_meta    Changes the metadata of a file.
  seek           Seeks to the given location in the file.
  read           Reads from the file.
  write          Writes to the file.
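
A hypothetical kernel-mode use of this API might look like the following. None of these signatures are confirmed by anything above; they only sketch the intended flow:

    // Hypothetical usage of the OFS operations listed above.
    let meta = ofs.stat("a/b");       // inspect metadata and links
    let fd = ofs.open("a/b");         // open by path from the root
    ofs.seek(fd, 128);                // jump to byte 128 of the file
    let n = ofs.read(fd, &mut buf);   // read into a buffer
    ofs.write(fd, b"hello");          // write (once the FS is writable)
    ofs.link("a/b", "c");             // add an edge between two files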

Disk-level Representation

For simplicity, os1 treats the whole of the HDD as a single partition. The first sector contains file system metadata; for now, this consists of 4 magic bytes, the inode count, and the dnode count. The next sectors contain the inode and dnode bitmaps, in that order. The rest of the drive contains the inodes and dnodes themselves.

Each inode represents a single file and contains the file metadata. This includes the filename, permissions, dates modified and created, links, and the dnode-number of the first dnode. The list of links can conclude with the index of a dnode containing extra links.

Each dnode contains the contents of the file. The last word of the dnode is the index of the next dnode.
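
Put together, the on-disk records might look roughly like the sketch below, assuming 512-byte sectors and one dnode per sector for the sake of the sketch. Every field width here is an assumption; only the set of fields comes from the description above:

    // Illustrative on-disk layout; the real structures are defined by the
    // os1 source, not this sketch.
    #[repr(C, packed)]
    struct Metadata {
        magic: [u8; 4],    // 4 magic bytes
        num_inodes: u32,   // inode count
        num_dnodes: u32,   // dnode count
    }

    #[repr(C, packed)]
    struct Inode {
        name: [u8; 32],    // filename
        permissions: u32,
        created: u32,      // date created
        modified: u32,     // date modified
        links: [u32; 8],   // linked files; the list may end with the index
                           //   of a dnode holding extra links
        first_dnode: u32,  // dnode number of the first data dnode
    }

    #[repr(C, packed)]
    struct Dnode {
        data: [u8; 508],   // file contents
        next: u32,         // last word: index of the next dnode
    }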

OFS Troll

Deleting a file may disconnect the graph. If this happens, the portion disconnected from the "root" needs to be deleted. This is the job of the OFS Troll. The Troll is a process that does a breadth-first search from the root searching for disconnected components.
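
Here is a self-contained sketch of that search, assuming we can enumerate each file's links. links_of is a hypothetical helper, and a real pass would also consult the inode bitmap so that never-allocated inodes are not reported:

    use std::collections::{HashSet, VecDeque};

    // Hedged sketch of the Troll's pass: BFS from the root file (inode #0),
    // returning every inode that cannot be reached from it.
    fn unreachable_inodes(
        num_inodes: u32,
        links_of: impl Fn(u32) -> Vec<u32>,
    ) -> Vec<u32> {
        let mut seen = HashSet::new();
        let mut queue = VecDeque::new();
        seen.insert(0);
        queue.push_back(0);
        while let Some(inode) = queue.pop_front() {
            for link in links_of(inode) {
                if seen.insert(link) {
                    queue.push_back(link);
                }
            }
        }
        // Anything never visited is disconnected from the root and is a
        // candidate for deletion.
        (0..num_inodes).filter(|i| !seen.contains(i)).collect()
    }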

Since the Troll is already going through all the inodes, it seems advantageous to also have it function as a basic live fsck. I haven't really thought through this too much yet.

(You might also be wondering why I called it a "Troll". Well, operating systems already have "zombies" and "reapers", so why not trolls?)