Two ways into ring 0: system calls and kernel modules

A running Linux system is split in two. User-land code runs in ring 3, unprivileged and sandboxed away from the hardware. The kernel runs in ring 0, with full control over memory, devices, and every process on the machine. The boundary between them is deliberate and well guarded: ring 3 code cannot simply jump into ring 0 and start touching kernel memory.

So how does a process ever get privileged work done? It asks. Every time you open a file, send a packet, or fork a process, your code crosses that boundary in a controlled way, runs some kernel code on your behalf, and comes back with a result. The interesting question for anyone who works close to the kernel is the inverse one: how do you extend the kernel so it offers new privileged entry points of your own?

There are two answers, and they sit at opposite ends of the same axis.

A system call is a static entry point. You write a function in ring 0, bake it into the kernel image, recompile the whole kernel, and boot into it. From then on the call is part of the operating system, addressable by a fixed number.
A kernel module is a dynamic entry point. You write ring 0 code, compile it on its own, and load it into a running kernel at runtime. No recompile, no reboot, and you can unload it just as easily.

In this post we build both, with real code, and we watch them converge. By the end, our module will hand itself its own user/kernel boundary in /dev and answer reads and writes from user-land, which is exactly what a system call does natively. Two mechanisms, one boundary crossing.

Path 1: the system call

A system call (like open, write, or read) is nothing more than a function, or a series of functions, running in ring 0. We call it a system call specifically because it can be invoked from ring 3. So our plan is straightforward: write a function in the kernel, register it in the syscall table, recompile, boot, and call it from user-land.

Our example will take a process ID and return a structure full of information about that process: its name, state, stack pointer, birth time, children, parent, root, and working directory. The body of the function is not the point. The point is the wiring that turns an ordinary C function into a system call.

Preparing the environment

To follow along you need a recent Linux kernel, the usual build tools (gcc, make), and a text editor. Because we are going to recompile the kernel and boot into it, do this on a virtual machine.

Recompiling and replacing your kernel can leave a machine unbootable. Do not do this on a system you care about. Work inside a VM, and take a snapshot before you touch the bootloader so a broken boot costs you a rollback rather than a reinstall.

For this article we used an Ubuntu Server 20.04 VM with bridged networking, 4 GB of RAM, and SSH access. Any distribution and hypervisor will do, with minor differences from what is shown here.

Linux distributions ship kernel headers and object files but not the kernel source. Install it, unpack it, and step into the source tree. The version below is 5.4.0, which you can confirm with uname -r.

sudo apt install linux-source
cd /usr/src/linux-source-5.4.0
sudo bunzip2 linux-source-5.4.0.tar.bz2
sudo tar xf linux-source-5.4.0.tar
cd linux-source-5.4.0/

We only need a configuration to build against. Generate a minimal default one. A minimal config keeps both the build time and the resulting image small, which is all we want for this exercise.

sudo make defconfig

You will also need flex, bison, libelf-dev, and libssl-dev to compile the kernel later. Install them the usual way through APT.

If you ever build on your native system instead of a VM, clone a fresh tree (git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) and check out the version matching your distribution rather than editing the kernel you are currently booted on. If your disk is encrypted with cryptsetup, enable DM_CRYPT (make menuconfig, “Crypt target support”) or you will not be able to unlock it after booting the new kernel.

The structure to return

We start with the header, infopid.h. It declares the structure we will fill in the kernel and copy back to user-land. The comments document each field.

#ifndef INFOPID_H
#define INFOPID_H

#include <linux/sched.h>
#include <linux/limits.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/fs_struct.h>
#include <linux/slab.h>

/*
 * pid: pid of process
 * name: name of the process
 * state: unrunnable, runnable, stopped
 * stack: pointer to the beginning of process's stack
 * age: birth time in nanoseconds
 * child: array of all child processes pid
 * ppid: parent process id
 * root: root path of process
 * pwd: working directory of process
 */

struct info_pid {
    pid_t pid;
    char name[TASK_COMM_LEN];
    long state;
    void *stack;
    uint64_t age;
    pid_t child[256];
    pid_t ppid;
    char root[PATH_MAX];
    char pwd[PATH_MAX];
};

#endif

Writing the call

The function itself lives in infopid.c. Two details matter more than the rest. First, we define it with the SYSCALL_DEFINE2 macro, where the 2 is the number of parameters. The kernel provides one of these macros per arity, and they take arguments as alternating type/name pairs separated by commas, which is why the signature looks unusual. Second, we never write to the user pointer directly: we build the structure in kernel memory and hand it across the boundary with copy_to_user, which is the only safe way for ring 0 to write into a ring 3 buffer.

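#include "infopid.h"
#include <linux/syscalls.h>
#include <linux/sched/task.h>
#include <linux/dcache.h>
#include <linux/err.h>
#include <linux/string.h>
#include <linux/uaccess.h>

SYSCALL_DEFINE2(infopid, struct info_pid __user *, ret_pid, int, pid)
{
    struct task_struct *cur, *child;
    struct info_pid *new;
    struct path root, pwd;
    struct pid *spid;
    char *tmp, *buffer;
    long ret = -ESRCH;
    int i = 0;

    /* find_get_pid() takes a reference on the struct pid; drop it once the task is pinned. */
    spid = find_get_pid(pid);
    if (!spid)
        return -ESRCH;
    cur = get_pid_task(spid, PIDTYPE_PID);
    put_pid(spid);
    if (!cur)
        return -ESRCH;

    /* kzalloc() zeroes the structure, so no stale kernel memory is copied out. */
    new = kzalloc(sizeof(*new), GFP_KERNEL);
    /* PATH_MAX bytes are too much for the kernel stack: allocate the scratch buffer. */
    buffer = kmalloc(PATH_MAX, GFP_KERNEL);
    if (!new || !buffer) {
        ret = -ENOMEM;
        goto out_free;
    }

    /* An exiting task may already have dropped its fs_struct. */
    task_lock(cur);
    if (!cur->fs) {
        task_unlock(cur);
        goto out_free;
    }
    get_fs_root(cur->fs, &root);
    get_fs_pwd(cur->fs, &pwd);
    task_unlock(cur);

    get_task_comm(new->name, cur);
    new->pid = task_pid_nr(cur);
    new->state = cur->state;
    new->stack = cur->stack;
    new->age = cur->start_time;
    new->ppid = task_ppid_nr(cur);

    /* The children list is protected by tasklist_lock. */
    read_lock(&tasklist_lock);
    list_for_each_entry(child, &cur->children, sibling) {
        if (i >= 256)
            break;
        new->child[i++] = task_pid_nr(child);
    }
    read_unlock(&tasklist_lock);

    /* dentry_path_raw() does its own locking and may return an ERR_PTR. */
    tmp = dentry_path_raw(root.dentry, buffer, PATH_MAX);
    if (!IS_ERR(tmp))
        strscpy(new->root, tmp, PATH_MAX);
    tmp = dentry_path_raw(pwd.dentry, buffer, PATH_MAX);
    if (!IS_ERR(tmp))
        strscpy(new->pwd, tmp, PATH_MAX);

    /* Release the references taken by get_fs_root() and get_fs_pwd(). */
    path_put(&root);
    path_put(&pwd);

    /* A bad user pointer is -EFAULT, not -ESRCH. */
    ret = copy_to_user(ret_pid, new, sizeof(*new)) ? -EFAULT : 0;

out_free:
    kfree(buffer);
    kfree(new);
    put_task_struct(cur);
    return ret;
}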

We look the process up by PID, allocate our structure in kernel memory, fill it from the task’s task_struct, walk the children list, resolve the root and working-directory paths, and copy the whole thing back to the caller. On any failure we return a negative errno, the convention every system call follows.

Registering it with the kernel

A function in a .c file is invisible to the kernel until three files in the source tree know about it. This is the static wiring that a module never needs.

First, tell the top-level kernel Makefile to compile our directory by appending it to core-y:

core-y += kernel/ certs/ mm/ fs/ ipc/ security/ crypto/ block/ infopid/

Then declare the prototype in include/linux/syscalls.h, alongside every other syscall prototype. Use your own absolute path to the header, not the one below:

/* Other declarations */
#include "/usr/src/linux-source-5.4.0/linux-source-5.4.0/infopid/infopid.h"
asmlinkage long sys_infopid(struct info_pid *, int);

Finally, give the call a number by adding it to the architecture’s syscall table, arch/x86/entry/syscalls/syscall_64.tbl. Pick an index that is not already taken in your tree (335 was free in ours) and follow the existing nomenclature:

# Index  Arch  Name     Entrypoint
  335    64    infopid  __x64_sys_infopid

That number, 335, is the contract. It is how user-land will name the call, and it is frozen the moment we ship the kernel. Hold on to that thought, because it is the sharpest difference between this path and the next one.

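Our directory holds three files at this point. The Makefile is the usual kbuild one-liner that builds our object into the kernel image (obj-y := infopid.o):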

$ tree infopid/
infopid/
├── Makefile
├── infopid.c
└── infopid.h

0 directories, 3 files

Compiling and booting

With the call written and registered, build the entire kernel from the source root. Pass -j with your core count to parallelise; even so, this takes a while.

sudo make -j"$(nproc)"
sudo make modules_install
sudo make install

After installation, make sure you will actually boot the new kernel. In our case the new build was version 5.4.174, which GRUB ranks above the stock 5.4.0-105 and boots automatically. If yours does not, expose the GRUB menu by editing /etc/default/grub so you can pick the entry, then run sudo update-grub.

#GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=4

With Secure Boot enabled in your firmware, a self-compiled kernel will not boot until it is signed. Signing is out of scope here; disable Secure Boot in the VM or sign the image yourself.

Reboot, and confirm you are on the new kernel:

$ uname -r
5.4.174

Calling it from user-land

There is no libc wrapper for a call we just invented, so we reach it through the generic syscall() function, passing our number 335 and the arguments. The program below fills our structure for a given PID (its own by default) and prints everything, walking up the parent chain recursively.

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <inttypes.h>
#include <stdlib.h>
#include <string.h>

#define TASK_COMM_LEN 16
#define PATH_MAX 4096

struct info_pid {
    pid_t pid;
    char name[TASK_COMM_LEN];
    long state;
    void *stack;
    uint64_t age;
    pid_t child[256];
    pid_t ppid;
    char root[PATH_MAX];
    char pwd[PATH_MAX];
};

void print_parents(pid_t pid)
{
    struct info_pid new;
    static int index = 0;
    printf("\tParent %d : %d\n", index++, pid);

    if (!pid)
        return ;
    int ret = syscall(335, &new, pid);
    if (ret) {
        printf("syscall failed...\n");
        perror("");
        exit(EXIT_FAILURE);
    }
    print_parents(new.ppid);
}

int main(int ac, char **av)
{
    pid_t pid;
    struct info_pid new;
    memset(&new, 0, sizeof(new));
    new.age = 0;

    if (ac == 1)
        pid = getpid();
    else
        pid = atoi(av[1]);

    int ret = syscall(335, &new, pid);
    if (ret) {
        printf("syscall failed...\n");
        perror("");
        return EXIT_FAILURE;
    }

    printf("Printing struct info_pid...\n");

    printf("PID       : %d\n", new.pid);
    printf("Name      : %s\n", new.name);
    printf("State     : %ld\n", new.state);
    printf("Stack     : %p\n", new.stack);
    printf("Birthtime : %" PRIu64 "\n", new.age);

    for (int j = 0; j < 256; j++)
    {
        if (!new.child[j])
            break ;
        printf("\tChild %d  : %d\n", j, new.child[j]);
    }

    print_parents(new.ppid);

    printf("Root      : %s\n", new.root);
    printf("PWD       : %s\n", new.pwd);

    return EXIT_SUCCESS;
}

Compile and run it, once on itself and once on PID 1:

$ gcc test_infopid.c -o test_infopid
$ ./test_infopid # With its own PID by default

Printing struct info_pid...
PID       : 1354
Name      : test_infopid
State     : 0
Stack     : 0xffffb76ac0750000
Birthtime : 1203877852032
    Parent 0 : 776
    Parent 1 : 775
    Parent 2 : 656
    Parent 3 : 551
    Parent 4 : 1
    Parent 5 : 0
Root      : /
PWD       : /home/ech0

$ ./test_infopid 1 # With PID 1

Printing struct info_pid...
PID       : 1
Name      : systemd
State     : 1
Stack     : 0xffffb76ac0010000
Birthtime : 15000000
    Child 0  : 290
    Child 1  : 317
    Child 2  : 500
    Child 3  : 509
    Child 4  : 511
    ...
    Child 18  : 659
    Parent 0 : 0
Root      : /
PWD       : /

The call works. We extended the kernel with a new privileged entry point and reached it from ring 3 by number. The price was steep, though: a full kernel rebuild, a reboot, and a call number that is now fixed forever. That cost is the whole reason the second path exists.

Path 2: the kernel module

A module (a driver, in Windows terms) is a piece of ring 0 code that can be loaded into and unloaded from a running kernel on demand. Your system is already full of them. List them with lsmod:

$ lsmod
Module                  Size  Used by
rfcomm                 81920  4
cdc_mbim               20480  0
cdc_wdm                24576  1 cdc_mbim
cdc_ncm                45056  1 cdc_mbim
cdc_ether              20480  1 cdc_ncm
...

Touchpad, camera, microphone, KVM: all modules. Some of them also expose communication interfaces, often character devices under /dev, the way the KVM module exposes /dev/kvm. We are going to do the same: build a module, load it, and eventually talk to it through /dev.

Everything about this path contrasts with the previous one. No kernel recompile, no new boot, no editing of the source tree. Your default environment is already ready: if you can compile C, you can build a module.

The system is (almost) never at risk here. The one real danger is a fault in ring 0 the kernel cannot recover from, which crashes the machine. If that happens, reboot and the module is gone. That alone is a reason to prototype modules in a VM too.

A minimal module

A single file is enough to start:

$ tree my_module/
my_module/
└── my_module.c

0 directories, 1 file

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Sigreturn Labs");
MODULE_DESCRIPTION("Hello World module");

static int __init hello_init(void) {
    printk(KERN_INFO "Hello World !\n");
    return 0;
}

static void __exit hello_cleanup(void) {
    printk(KERN_INFO "Cleaning up module.\n");
}

module_init(hello_init);
module_exit(hello_cleanup);

Reading top to bottom: we include the kernel headers we need; the MODULE_* macros attach metadata (license, author, description); two functions tagged __init and __exit run when the module is loaded and unloaded; and module_init / module_exit register them with the kernel. For now both functions just print to the kernel log with printk.

Notice there is no syscall table, no core-y, no prototype to declare. The module announces its own entry points to the kernel through module_init and module_exit, at load time, rather than being wired into the kernel image ahead of time.

Compiling and loading

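Modules build against the running kernel’s headers (on Ubuntu, the linux-headers-$(uname -r) package), so the Makefile delegates to the kernel build system rather than calling gcc directly. Recipe lines must be indented with a tab, not spaces: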

FILE := "my_module"
obj-m += my_module.o

all:
    echo "Compiling $(FILE)..."
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:  
    echo "Cleaning modules..."
    make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
    rm -rf .$(FILE).ko.cmd .$(FILE).mod.o.cmd .$(FILE).o.cmd .cache.mk .tmp_versions $(FILE).ko $(FILE).o $(FILE).mod.c $(FILE).mod.o modules.order Module.symvers 2>&-

Build it:

make

The compilation drops several files in the directory. The one that matters is my_module.ko, the kernel object we load with insmod:

sudo insmod my_module.ko

As with a self-compiled kernel, an unsigned module will not load under Secure Boot. Module signing is out of scope here.

dmesg shows the string our init function printed:

[309528.612844] Hello World !

Unload it with rmmod and watch the cleanup function fire:

$ sudo rmmod my_module
[309551.667600] Cleaning up module.

That is the entire load/unload lifecycle, and it took no reboot and no kernel rebuild. We could stop here. But a module that only writes to the log is not talking to user-land yet, and that is where this path catches up with the first one.

Closing the loop: a misc device

Our system call could be reached from user-land because the kernel exposed it through the syscall table. A module gets no such entry in the table. So we give it its own front door: a misc device, a simple character device that appears under /dev and routes reads and writes to functions we define. In other words, the module is about to build the same kind of user/kernel boundary that a syscall enjoys natively, only this time we build it by hand.

To keep the mechanism in focus, the behaviour stays trivial: a write stores exactly ten bytes in a static buffer, and a read returns them. It looks pointless, and it is exactly enough to expose every subtlety that matters.

Registering the device

We declare the device structure as a static global:

static struct miscdevice my_dev;

In hello_init we fill a few fields and register it. A dynamic minor number lets the kernel assign one for us:

my_dev.minor = MISC_DYNAMIC_MINOR; // a dynamic minor number is requested
my_dev.name = "my_module_misc"; // name of the misc device
my_dev.fops = &my_fops; // operations structure
ret = misc_register(&my_dev); // registering the misc device

The fops field points at a file_operations structure, which is the heart of the interface: it tells the kernel which function to call for each operation on the device. We wire up read and write:

static const struct file_operations my_fops = {
    .owner = THIS_MODULE, /* keeps the module loaded while the device is open */
    .read = hello_read,
    .write = hello_write
};

And we deregister the device when the module unloads, in hello_cleanup:

misc_deregister(&my_dev);

This needs three more headers (miscdevice.h, uaccess.h, fs.h). The skeleton now looks like this, still missing the two operation functions:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/miscdevice.h>
#include <linux/uaccess.h>
#include <linux/fs.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Sigreturn Labs");
MODULE_DESCRIPTION("Hello World module");

static struct miscdevice my_dev;

static const struct file_operations my_fops = {
    .owner = THIS_MODULE, /* keeps the module loaded while the device is open */
    .read = hello_read,
    .write = hello_write
};

static int __init hello_init(void) {
    int ret;

    printk(KERN_INFO "Hello World !\n");

    my_dev.minor = MISC_DYNAMIC_MINOR;
    my_dev.name = "my_module_misc";
    my_dev.fops = &my_fops;
    ret = misc_register(&my_dev);

    return ret;
}

static void __exit hello_cleanup(void) {
    printk(KERN_INFO "Cleaning up module.\n");
    misc_deregister(&my_dev);
}

module_init(hello_init);
module_exit(hello_cleanup);

The write operation

static ssize_t hello_write(struct file *f, const char __user *s, size_t n, loff_t *o)
{
    int retval = -EINVAL;

    if (!f || !s)
        return -EFAULT;
    if (n != LEN)
        return -EINVAL;

    retval = copy_from_user(buf, s, LEN);

    if (retval)
        return -EFAULT;

    /* buf is not NUL-terminated, so bound the print to LEN bytes. */
    printk(KERN_INFO "I have successfully written %.*s in buffer.\n", LEN, buf);

    return LEN;
}

The prototype is fixed by the kernel: f points to the struct file the kernel created when the device was opened, s is the user-land pointer to the data the caller wrote, n is how many bytes they offered, and o points to the current offset into the file. We reject null pointers, insist on exactly LEN bytes, and then pull the data across the boundary with copy_from_user, the mirror image of the copy_to_user we used in the syscall. It copies LEN bytes from the user pointer s into our kernel buffer buf and returns the number of bytes it could not copy, so zero means success.

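Return the right byte count. Returning 0 from a write tells the caller that nothing was consumed, and many programs will simply call again, spinning on the function forever. A read has the mirror problem: 0 means end of data, so a read that never returns 0 keeps a reader such as cat going forever. It is easy to pin a CPU this way, until someone kills the offending process.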

The read operation

static ssize_t hello_read(struct file *f, char __user *s, size_t n, loff_t *o)
{
    if (!f || !s || !o)
        return -EFAULT;
    if (*o >= LEN)
        return 0;
    if (n > LEN - *o)
        n = LEN - *o; /* clamp to the bytes left after the offset */
    if (copy_to_user(s, &buf[*o], n))
        return -EFAULT;

    *o += n;

    return n;
}

The checks mirror the write path, with one addition that carries the whole design: the offset *o. When the reader has consumed everything, we must return 0 to signal end of data, and we decide that by comparing the offset against LEN. We also clamp n so a caller can never read past what we stored. Then we copy from buf[*o] (not from the start), advance the offset by the number of bytes sent, and return that count.

Why bother with the offset at all, rather than copying the whole string in one shot? Because cat is not the only reader. Run cat on the device and it requests a large block in a single read, far more than our ten bytes, so a single copy would satisfy it. But a program can issue read() directly, asking for two bytes at a time, and a real buffer could be larger than one page. If a caller asks for fewer bytes than we hold, the offset is what lets us resume from where the last read stopped. Chain enough read() calls and the caller recovers the entire buffer, no matter how small each request is.

Testing it

The final module, with LEN and the static buffer defined:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/miscdevice.h>
#include <linux/uaccess.h>
#include <linux/fs.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Sigreturn Labs");
MODULE_DESCRIPTION("Hello World module");

#define LEN 10

static struct miscdevice my_dev;
static char buf[LEN];

static ssize_t hello_write(struct file *f, const char __user *s, size_t n, loff_t *o)
{
    int retval = -EINVAL;

    if (!f || !s)
        return -EFAULT;
    if (n != LEN)
        return -EINVAL;

    retval = copy_from_user(buf, s, LEN);

    if (retval)
        return -EFAULT;

    /* buf is not NUL-terminated, so bound the print to LEN bytes. */
    printk(KERN_INFO "I have successfully written %.*s in buffer.\n", LEN, buf);

    return LEN;
}

static ssize_t hello_read(struct file *f, char __user *s, size_t n, loff_t *o)
{
    if (!f || !s || !o)
        return -EFAULT;
    if (*o >= LEN)
        return 0;
    if (n > LEN - *o)
        n = LEN - *o; /* clamp to the bytes left after the offset */
    if (copy_to_user(s, &buf[*o], n))
        return -EFAULT;

    *o += n;

    return n;
}

static const struct file_operations my_fops = {
    .owner = THIS_MODULE, /* keeps the module loaded while the device is open */
    .read = hello_read,
    .write = hello_write
};

static int __init hello_init(void) {
    int ret;

    printk(KERN_INFO "Hello World !\n");

    my_dev.minor = MISC_DYNAMIC_MINOR;
    my_dev.name = "my_module_misc";
    my_dev.fops = &my_fops;
    ret = misc_register(&my_dev);

    return ret;
}

static void __exit hello_cleanup(void) {
    printk(KERN_INFO "Cleaning up module.\n");
    misc_deregister(&my_dev);
}

module_init(hello_init);
module_exit(hello_cleanup);

Compile, load, grant write access to the device, and exercise it with ordinary shell tools:

make # compilation

sudo insmod my_module.ko # loading the module

sudo chmod o+w /dev/my_module_misc # writing rights

ls -l /dev/my_module_misc # verification

echo -n "1234567890" > /dev/my_module_misc # writing 10 bytes in our buffer (no sudo: the redirection is done by your own shell, hence the chmod above)

dmesg # verification : [316644.720207] I have successfully written 1234567890 in buffer.

sudo cat /dev/my_module_misc # reading the buffer : 1234567890

echo writes ten bytes and cat reads them back. Now the part that justified the offset: a program that reads two bytes at a time.

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

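int main(void)
{
    char buf[3];
    ssize_t n;
    int fd = open("/dev/my_module_misc", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* read() returns 0 at end of data and -1 on error: stop on both. */
    while ((n = read(fd, buf, 2)) > 0) {
        buf[n] = '\0';
        printf("%s\n", buf);
    }
    close(fd);
    return 0;
}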

$ gcc test.c -o test
$ sudo ./test
12
34
56
78
90

Five chained read() calls, two bytes each, and the whole buffer comes back in order. The offset did its job.

Two mechanisms, one boundary

We reached ring 0 from ring 3 twice, by two routes that could hardly be more different in how they get there, yet end at the same place: user-land code running our privileged code and getting a result back across the boundary.

| | System call | Kernel module |
| --- | --- | --- |
| Entry point | Static, in the kernel image | Dynamic, loaded at runtime |
| Build | Recompile the whole kernel | Compile one `.ko` against headers |
| Activation | Reboot into the new kernel | `insmod`, undone by `rmmod` |
| Addressed by | A fixed syscall number | A path under `/dev` (misc device) |
| User/kernel transfer | `copy_to_user` / `copy_from_user` | `copy_to_user` / `copy_from_user` |
| Risk | Can leave the machine unbootable | A fault crashes the kernel until reboot |

The "Addressed by" and "User/kernel transfer" rows are the point. The syscall got its front door for free, handed to it by the syscall table. The module had to build one under /dev, and once it did, the user/kernel transfer looked identical: the same copy_to_user and copy_from_user, the same careful byte accounting, the same negative-errno discipline. A misc device is a module reconstructing, by hand, the boundary crossing a system call is given.

That symmetry is also why both mechanisms are worth understanding for anyone working on the offensive or defensive side of a Linux system. The syscall table is a fixed, well-known target. Loadable modules are the canonical foothold for ring 0 persistence, and the copy_to_user / copy_from_user boundary is exactly where kernel code mishandles untrusted user input. Build both once, deliberately, and that attack surface stops being abstract.