Implementation
The implementation of system calls is platform dependent, so refer to the platform-specific headers for your architecture. In all these notes, I have used x86/x86_64 as the platform.
Each syscall is assigned a syscall number that uniquely identifies it.
The parameters of a system call are word sized (32 bit or 64 bit).
There can be a maximum of 6 parameters (x86/x86_64).
The return type in the kernel is long, so that the same declaration is word sized on both 32-bit and 64-bit x86.
During a syscall the process runs in kernel mode. The kernel is in process context here, so sleeping is legal. To switch to kernel mode, the user stack pointer is saved and the kernel stack is loaded.
User space needs a mechanism to signal the kernel. This signal is called a software interrupt (on x86_64 the dedicated syscall instruction plays this role).
On receiving this interrupt, the kernel executes the exception handler, which in this case is the syscall handler.
The assembly for this is defined in arch/x86/entry/entry_64.S for x86_64 systems. This file states that a syscall on this architecture can have up to 6 arguments in registers.
The comments on entry_SYSCALL_64 in that file document the entry conditions, including which registers carry the syscall number and the arguments.
The steps to make a system call are:
Swap the GS register using swapgs. The GS base holds data: in user mode it holds a user-space base address, in kernel mode it points to the per-CPU structure.
Save the user stack and registers. This is done using macros: PUSH_AND_CLEAR_REGS rax=$-ENOSYS calls PUSH_REGS and then clears the registers using CLEAR_REGS.
Call the syscall entry function using the instruction call do_syscall_64.
The C code first calls syscall_enter_from_user_mode, which does the following:
Sets up RCU / Context Tracking and Tracing.
Invokes various entry work functions like ptrace, seccomp, audit, syscall tracing, etc.
Once this is done, the syscall handler itself is invoked.
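The function do_syscall_x64 performs the dispatch: it checks that the syscall number is in range and then calls the matching handler through sys_call_table.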
Here sys_call_table is very important.
It is an array that stores the various system call handlers.
It is defined in arch/x86/entry/syscall_64.c.
The handlers are listed in the generated header asm/syscalls_64.h, with one __SYSCALL(number, function) entry per syscall.
Here __SYSCALL just expands to __x64_##function, e.g. __x64_sys_read.
The system calls are declared in include/linux/syscalls.h.
The definitions live in their respective subsystems. For example, sys_read is defined in fs/read_write.c.
The do_syscall_64 and do_syscall_x64 functions mentioned earlier store the return value in the eax/rax register: regs->ax = sys_call_table[unr](regs);
Control then returns to the assembly code in entry_64.S, which restores the registers backed up earlier and returns using sysretq.
The linux source code (linux-6.8.1)
The C code for this is present in arch/x86/entry/common.c. The function is do_syscall_64. This function disables instrumentation using noinstr in its declaration.