Implementation
The implementation of system calls is platform dependent, so refer to the architecture-specific headers for your platform's implementation. These notes use x86/x86_64 as the platform.
Each syscall is assigned a syscall number that uniquely identifies it.
The parameters of a system call are word sized (32-bit or 64-bit).
There can be a maximum of 6 parameters (x86/x86_64).
The return type in the kernel is long, which supports both 32-bit and 64-bit systems.
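As a rough user-space illustration of these points, the sketch below invokes a syscall purely by its number through the C library's syscall() wrapper and receives the result as a long. It assumes a Linux system with glibc-style headers; get_pid_by_number is a hypothetical helper name, not something from the kernel.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid: the syscall number */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: invoke getpid purely by its syscall number. */
static long get_pid_by_number(void)
{
    /* syscall() takes the number first, then the word-sized arguments. */
    return syscall(SYS_getpid);
}
```

Calling get_pid_by_number() should give the same value as the regular getpid() wrapper, since both end up at the same syscall number.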
During a syscall the process runs in kernel mode. Because the kernel is executing in process context, the current pointer is valid here. To switch to kernel mode, the user stack pointer is saved and the kernel stack is loaded.
Step 1 : Assembly code
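The mechanism needs to signal the kernel. This signal is called a software interrupt. On receiving it, the kernel executes the exception handler, which in this case is the syscall handler. On x86_64 the signal is raised by the dedicated syscall instruction, which enters the kernel at entry_SYSCALL_64, rather than by the legacy int $0x80 software interrupt.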
The assembly instructions for this are defined in arch/x86/entry/entry_64.S for x86_64 systems. This file states that a syscall on this architecture can have up to 6 arguments in registers.
The following notes are taken from the comments of the entry_SYSCALL_64 assembly code.
The steps to make a system call are:
Swap the GS register using swapgs. The GS register holds a base address: in user mode it is a value set by user space, and in kernel mode it points to the per-CPU structure.
Save the user stack. This is done using macros: PUSH_AND_CLEAR_REGS rax=$-ENOSYS calls PUSH_REGS and then clears the registers using CLEAR_REGS.
Call the syscall entry function using the instruction call do_syscall_64.
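The user-space side of this entry path can be sketched with inline assembly. The register convention used here (number in rax, arguments in rdi, rsi, rdx, with rcx and r11 clobbered by the syscall instruction) is the one documented in the entry_SYSCALL_64 comments. raw_syscall3 is a hypothetical helper, and the fallback branch for other architectures is an assumption added only so the sketch compiles elsewhere.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: issue a syscall with up to three arguments. */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    /* rax = number, rdi/rsi/rdx = first three arguments; result comes back in rax. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;   /* raw kernel value: a negative errno on failure */
#else
    /* Fallback: the libc wrapper, which reports failure as -1 plus errno instead. */
    return syscall(nr, a1, a2, a3);
#endif
}
```

For example, raw_syscall3(SYS_getpid, 0, 0, 0) should return the same value as getpid().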
Step 2 : Entry function
The C code for this is in arch/x86/entry/common.c. The function is do_syscall_64. This function disables instrumentation using noinstr (noinstr doc) in its declaration.
The C code first calls syscall_enter_from_user_mode, which does the following:
Sets up RCU / Context Tracking and Tracing.
Invokes various entry work functions like ptrace, seccomp, audit, syscall tracing, etc.
Once this is done, it invokes the system call handler.
Step 3 : Finding the syscall
The function do_syscall_x64 has the following code:
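A minimal user-space sketch of this dispatch pattern, not the kernel source itself: the syscall number is range-checked, used as an index into a table of handlers, and the handler's result is stored in the ax slot of the saved registers. All demo_ names are hypothetical; demo_call_table plays the role of sys_call_table and demo_regs is a trimmed-down stand-in for struct pt_regs.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the kernel's struct pt_regs. */
struct demo_regs {
    long ax;             /* return value slot */
    unsigned long di;    /* first argument */
};

typedef long (*demo_call_ptr_t)(const struct demo_regs *);

/* Two toy handlers standing in for real system calls. */
static long demo_sys_double(const struct demo_regs *regs) { return (long)regs->di * 2; }
static long demo_sys_zero(const struct demo_regs *regs)   { (void)regs; return 0; }

/* The table: the array index is the syscall number. */
static const demo_call_ptr_t demo_call_table[] = {
    demo_sys_double,   /* number 0 */
    demo_sys_zero,     /* number 1 */
};
#define DEMO_NR_SYSCALLS (sizeof(demo_call_table) / sizeof(demo_call_table[0]))

/* Bounds-check the number, call the handler, store the result in regs->ax. */
static void demo_do_syscall(struct demo_regs *regs, int nr)
{
    /* A negative nr becomes a huge unsigned value and fails the range check. */
    unsigned int unr = (unsigned int)nr;

    if (unr < DEMO_NR_SYSCALLS)
        regs->ax = demo_call_table[unr](regs);
    else
        regs->ax = -ENOSYS;   /* unknown syscall number */
}
```

Converting the number to unsigned before the comparison means a single check rejects both negative and too-large values.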
Here sys_call_table is very important. It is an array that stores the various system call handlers.
It is defined in arch/x86/entry/syscall_64.c as:
The handlers are defined under asm/syscalls_64.h as:
Here __SYSCALL just expands to __x64_##function, e.g. __x64_sys_read.
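The way one list of entries can be expanded through a macro into a table of prefixed handler names can be sketched in plain C. This is an illustration of the token-pasting technique only: DEMO_SYSCALL_LIST is a hypothetical stand-in for the generated header, DEMO_SYSCALL for __SYSCALL, and the demo_x64_ prefix for __x64_.

```c
#include <stddef.h>

struct demo_regs { unsigned long di; };   /* hypothetical stand-in for pt_regs */
typedef long (*demo_call_ptr_t)(const struct demo_regs *);

/* Hypothetical list standing in for the generated syscall header. */
#define DEMO_SYSCALL_LIST \
    DEMO_SYSCALL(0, sys_read) \
    DEMO_SYSCALL(1, sys_write)

/* Pass 1: expand each entry to a toy handler that returns its own number. */
#define DEMO_SYSCALL(nr, sym) \
    static long demo_x64_##sym(const struct demo_regs *regs) { (void)regs; return nr; }
DEMO_SYSCALL_LIST
#undef DEMO_SYSCALL

/* Pass 2: expand each entry to "demo_x64_<sym>," to fill the table. */
#define DEMO_SYSCALL(nr, sym) demo_x64_##sym,
static const demo_call_ptr_t demo_call_table[] = {
    DEMO_SYSCALL_LIST
};
#undef DEMO_SYSCALL
```

Redefining the macro between passes lets the same list produce both the handler definitions and the table entries, so the two cannot drift apart.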
Step 4 : The system calls
The system calls are declared in include/linux/syscalls.h. The definitions live in their respective subsystems. For example, sys_read is defined in fs/read_write.c as:
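In the kernel, such definitions are written with the SYSCALL_DEFINE family of macros, which produce a register-taking entry point around a typed function body. The sketch below imitates that idea in user space under simplifying assumptions: DEMO_SYSCALL_DEFINE3, demo_regs, and the demo_ prefixes are hypothetical, and the toy read fills the buffer instead of touching a file.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for the kernel's struct pt_regs. */
struct demo_regs {
    unsigned long di, si, dx;   /* first three argument registers */
};

/*
 * Hypothetical three-argument definition macro. It emits
 * demo_x64_sys_<name>(), which unpacks the saved registers, and the
 * typed body function demo_do_sys_<name>().
 */
#define DEMO_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3)            \
    static long demo_do_sys_##name(t1 a1, t2 a2, t3 a3);              \
    static long demo_x64_sys_##name(const struct demo_regs *regs)     \
    {                                                                 \
        return demo_do_sys_##name((t1)regs->di, (t2)regs->si,         \
                                  (t3)regs->dx);                      \
    }                                                                 \
    static long demo_do_sys_##name(t1 a1, t2 a2, t3 a3)

/* Toy "read": fills the buffer with 'x' instead of reading a real file. */
DEMO_SYSCALL_DEFINE3(read, unsigned int, fd, char *, buf, size_t, count)
{
    (void)fd;
    for (size_t i = 0; i < count; i++)
        buf[i] = 'x';
    return (long)count;
}
```

The table entry only ever sees the register block; the casts in the generated wrapper recover the typed arguments for the body.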
Step 5 : Returning from syscalls
The do_syscall_64 and do_syscall_x64 functions mentioned earlier store the return value in the eax/rax register: regs->ax = sys_call_table[unr](regs); as mentioned earlier. Control then returns to the assembly code in entry_64.S, which restores the registers saved earlier and returns using sysretq.
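On Linux the value left in rax is either the result or a negated error code, and the C library wrapper converts the latter into -1 plus errno. A minimal sketch of that conversion, assuming the usual error range of -4095 to -1; demo_to_libc_result is a hypothetical helper, not a real libc function.

```c
#include <errno.h>

/* Hypothetical helper mirroring what a C library wrapper does with the raw value. */
static long demo_to_libc_result(long raw)
{
    /* Raw results in [-4095, -1] are negated error codes. */
    if (raw < 0 && raw >= -4095) {
        errno = (int)-raw;
        return -1;
    }
    return raw;   /* success: pass the value through unchanged */
}
```

For instance, a raw value of -EBADF would surface to the caller as -1 with errno set to EBADF.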
References
The linux source code (linux-6.8.1)