Notes for Linux Kernel
Mis à jour :
Introduction
Features
- Preemptive multitasking
- Virtual memory
- Shared libraries
- Demand loading, dynamic kernel modules
- TCP/IP networking
- Symmetrical Multi-Processing support
- Open source
Two modes in Linux
- User mode: application software(including different libraries)
- Kernel mode: system calls, linux kernel, hardware
What’s a kernel
- executive system monitor
- controls and mediates access to hardware
- implements and supports fundamental abstractions (processes, files, devices)
- schedules/allocates system resources (memory, cpu, disk, descriptors)
- security and protection
- respond to user requests for service (system calls)
Kernel design goal
performance, stability, capability, security and protection, portability, extensibility
Linux source tree (directories)
- /root : the home directory for the root user
- /home : the user’s home directories along with directories for services (ftp, http…)
- /bin : commands needed during booting up that may be needed by normal users
- /sbin : similar to /bin. but not intended for normal users. run by Linux.
- /proc : a virtual file system that exits in the kernels imagination, which is memory. not on a disk.
- a directory with info about process number
- each process has a directory below /proc
- /usr : contain all commands, libraries, man pages, games and static files for normal operations
- /boot : files used bu the bootstrap loader. kernel images are often kept here.
- /lib : shared libraries needed by the programs on the root file system
- /modules : loadable kernel modules, especially those needed to boot the system after disasters
- /dev : device files
- /etc : configuration files specific to the machine
- /skel : when a home directory is created, it is initialized with files from this directory
- /sysconfig : files that configure the linux system for devices
- /var : files that change for mail, news , printers log files, man pages, temp files
- /mnt : mount points for temporary mounts by the system administrator
- /tmp : temporary files. programs running after bootup should use /var/tmp
linux/arch
- subdirectories for each current port
- each contains kernel, lib, mm, boot and other directories whose contents override code stubs in architecture independent code
linux/drivers
- largest amount of code in the kernel tree
- device, bus, platform and general directories
linux/fs
- contains virtual file system framework (VFS) and subdirectories for actual file systems
linux/include
- include/asm-* : architecture-dependent include subdirectories
- include/linux :
- header info needed both by the kernel and user apps
- usually linked to /usr/include/linux
- kernel-only portions guarded by
#ifdef__KERNEL__ /* kernel stuff */ #endif
linux/init
- version.c : contains the version banner that prints at boot
- main.c : architecture-independent boot code (start_kernel is the main entry point)
linux/kernel
- the core kernel code
- sched.c : the main kernel file (scheduler, wait queues, timers, alarms, task queues)
- process control : fork.c, exec.c, signal.c, exit.c etc…
- kernel module support : kmod.c, ksyms.c, module.c
- other operations : time.c, resource.c, dma.c, printk.c, info.c, sys.c …
linux/lib
- kernel code cannot call standard C library routines
linux/mm
- paging and swapping
- allocation and deallocation
- memory mapping
linux/scripts
scripts for:
- menu-based kernel configuration
- kernel patching
- generating kernel documentation
Summary
- Linux is a modular, UNIX-like monolithic kernel
- kernel is the heart of the OS that executes with special hardware permission (kernel mode)
- “core kernel” provides framework, data structures, support for drivers, modules, subsystems
- architecture dependent source subtrees live in /arch
Booting
How computer startup?
- booting is a bootstrapping process that starts operating systems when the user turns on a computer system
- a boot sequence is the set of operations the computer performs when it is switched on that load an operating system
Booting sequence
- Turn on
- CPU jump to address of BIOS (0xFFFF0)
- BIOS runs POST (Power-On Self Test)
- Find bootable devices
- Load and execute boot sector from MBR
- Load OS
BIOS (Basic Input/Output System)
- BIOS refers to the software code run by a computer when first powered on
- the primary function is code program embedded on a chip and controls various devices that make up the computer
MBR (Master Boot Record)
- OS is booted from a hard disk, where MBR contains the primary boot loader
- The MBR is a 512-byte sector, located in the first sector in the disk (sector 1 of cylinder 0, head 0)
- After the MBR is loaded into RAM, the BIOS yields control to it
- The first 446 bytes are the primary boot loader, which contains both executable code and error message
- The next 64 bytes are the partition table, which contains a record for each of four partitions
- The MBR ends with two bytes that are defined as the magic number (0xAA55 or 0x55AA). The magic number serves as a validation check of the MBR.
- To see the contents of MBR, use:
sudo dd if=/dev/sda of=mbr.bin bs=512 count=1 od -xa mbr.bin
Boot loader
- more aptly called the kernel loader. the task at this stage is to load the linux kernel
- optional, initial RAM disk
- GRUB and LILO are the most popular Linux boot loader
- GRUB: GRand Unified Bootloader
- an operating system independent bootloader
- a multiboot software packet from GNU
- flexible command line interface
- file system access
- support multiple executable format
- support diskless system
- download OS from network
- LILO: LInux LOader
- not dependent on a specific file system
- can boot from hard disk and floppy
- up to 16 different images
- must change LILO when kernel image file or config file is changed
- GRUB: GRand Unified Bootloader
Kernel Image
- The kernel is the central part in most OS beacause of its task, which is the management of the system’s resources and the communication between hardware and software components.
- kernel always store in memory until computer is turned off.
- kernel image is not an executable kernel, but a compressed kernel image
- zImage size is less than 512 KB
- bzImage is larger than 512 KB
Task of kernel
- process management
- memory management
- device management
- system call
Init process
- the first thing the kernel does is to execute init process
- init is the root/parent of all processes executing in linux, and is responsible for staring all other processes
- the first process that init starts is a script: /etc/rc.d/rc.sysinit
- based on the appropriate run-level, scripts are executed to start various processes to run the system and make it functional
- the init process id is “1”
- init is responsible for starting system processes defined in /etc/inittab file
- init typically will start multiple instances of “getty”, which waits for console logins which spawn one’s user shell process
- upon shutdown, init control the sequence and processes for shutdown
Runlevels
- a runlevel is a software configuration of the system which allows only a selected group of processes to exist
- init can be in one of seven runlevels: 0-6
rc#.d files
- rc#.d files are the scripts for a given runlevel that run during boot and shutdown
- the scripts are found in /etc/rc.d/rc#.d, where the symbol # represents the run level
init.d
- deamon is a background process
- init.d is a directory that admin can start/stop individual deamons by changing on it
- admin can issue the command and either the satrt, stop, status, restart or reload option
- e.g. to stop the web server
cd /etc/rc.d/init.d/ httpd stop
实模式下的系统初始化
- setup.S连同内核映象由bootsect.S装入。setup.S从BIOS获取计算机系统的参数,放到内存参数区,仍在实模式下运行
- 辅助程序setup为内核映象的执行做好准备,然后跳转到0x10000(小内核),0x100000(大内核)开始内核本身的执行,此后就是内核的初始化过程
- setup.S:
- 版本检查和参数设置
- 为进入保护模式做准备
保护模式下的系统初始化
- 初始化寄存器和数据区
- 核心代码解压缩:调用
mics.c
中的decompress_kernel
- 页表初始化
- 初始化 idt, gdt, ldt
- 启动核心
- 前面是对CPU进行初始化,并启动保护模式
- 现在的任务是初始化内核的核心数据结构,主要涉及:
- 中断管理
- 进程管理
- 内存管理
- 设备管理
- 进入保护模式后,系统从
start_kernel
开始执行,start_kernel
函数变成0号进程,不再返回 Start_kernel
显示版本信息,调用setup_arch()
初始化核心的数据结构- 最后,调用
kernel_thread()
创建1号进程init
进程 - 父进程创建
init
子进程之后,返回执行cpu_idle
Q&A
- Q: 在i386中,内核可执行代码在内存中的首地址是否可随意选择?
A: 从原理上讲是可以的,但实际上要考虑许多因素。例如微机内存的低端1MB的地址空间是不连续的。所以要把内核代码放在低端就要看是否能放的下。如果内核代码模块太大就不能将内核放在低端。 - Q: 主引导扇区位于硬盘的什么位置,如果一个硬盘的主引导扇区有故障此硬盘是否还可以使用?
A: 主引导扇区位于硬盘的0面0道1扇区。一般来讲如果一个硬盘的主引导扇区有故障,此硬盘虽然可以使用,但不能作为引导盘使用了,因为它的主引导扇区不能读出内容。
/proc File System and Kernel Module Programming
What is a kernel module?
- (wiki) an object file that contains code to extend the running kernel
- (RedHat) modules are pieces of code that can be loaded and unloaded into the kernel upon demand
advantages and disadvantages
- advantages:
- allowing the dynamic insertion and removal of code from the kernel at run-time
- save memory cost
- disadvantages: fragmentation penalty -> decrease memory performance
current kernel modules
cd /lib/modules/2.6.32-22-generic/
find . -name "*.ko"
current loaded modules
lsmod
or cat /proc/modules
/proc
- a pseudo file system
- real time, resides in the virtual memory
- tracks the processes running on the machine and the state of the system
- a new /proc file system is created every time your linux machine reboots
- highly dynamic. the size of the proc directory is 0 and the last time of modification is the last bootup time
- /proc file system doesn’t exist on any particular media
- the contents of the /proc file system can be read by anyone who has the requisite permissions
- certain parts of the /proc file system can be read only by the owner of the process and of course root (some not even by root)
- the contents of the /proc are used by many utilities which grab the data from the particular /proc directory and display it. e.g. top, ps, lspci, dmesg etc
Tweak kernel parameters
- /proc/sys: making changes to this directory enables you to make real time changes to certain kernel parameters
- e.g. /proc/sys/net/ipv4/ip_forward, which has value of “0”, which can be seen using
cat
. This can be changed in real time byecho 1 > /proc/sys/net/ipv4/ip_forward
, thus allowing IP forwarding.
Details of some files in /proc
- buddyinfo: contains the number of free areas of each order for the kernel buddy system
- cmdline: kernel command line
- cpuinfo: human-readable information about the processors
- devices: list of device drivers configured into the currently running kernel (block and character)
- dma: shows which DMA channels are being used at the moment
- execdomains: related to security
- fb: frame buffer devices
- filesystems: filesystems configured/supported into/by the kernel
- interrupts: number of interrupts per IRQ on the x86 architecture
- iomem: shows the current map of the system’s memory for its various devices
- ioports: provides a list of currently registered port regions used for input or output communication with a device
- kcore:
- this file represents the physical memory of the system and is stored in the core file format
- unlike most /proc files, kcore does display a size. This value is given in bytes and is equal to the size of physical memory (RAM) used plus 4 KB.
- its content are designed to be examined by a debugger, such as gdb, the GNU Debugger
- only root user has the rights to view this file
- kmsg: used to hold messages generated by the kernel, which are then picked up by other programs, such as
klogd
- loadavg:
- provides a look at load average
- the first three columns measure CPU utilization of the last 1, 5 and 10 minutes periods.
- the fourth column shows the number of currently running processes and the total number of processes
- the last column displays the last process ID used
- locks: displays the files currently locked by the kernel
- mdstat: contains the current information for multiple-disk, RAID configurations
- meminfo:
- one of the most commonly used /proc files
- it reports plenty of valuable information about the current utilization of RAM on the system
- misc: this files lists miscellaneous drivers registered on the miscellaneous major device, which is number 10
- modules: displays a list of all modules that have been loade by the system
- mounts: provides a quick list of all mounts in use by the system
- mtrr: refers to the current Memory Type Range Registers (MTRRs) in use with the system
- partitions: detailed information on the various partitions currently available to the system
- pci: full list of every PCI device on your system
- slabinfo: information about the memory usage on the slab level
- stat: keeps track of a variety of different statistics about the system since it was last restarted
- swap: measure swap space and its uilization
- uptime: contains information about how long the system has on since its last restart
- version: tells the versions of the kinux kernel and gcc, as well as the version of red hat linux installed on the system
The numerical named directories
- the processed that are running at the instant a snapshot of the /proc file system was taken
- the contents of all the directories are the same as these directories contain the carious parameters and status of the corresponding process
- you have full access only to the processes that you have started
A typical process directory
- cmdline: contains the whole command line used to invoke the process. the contents of this file are the command line arguments with all the parameters (without formatting/spaces)
- cwd: symbolic link to the current working directory
- environ: contains all the process-specific environment variables
- exe: symbolic link of the executable
- maps: parts of the process’ address space mapped to a file
- fd: this directory contains the list of file descriptors as opened by the particular process
- root: symbolic link pointing to the directory which is the root file system for the particular process
- status: information about the process
Other subdirectories in /proc
- /proc/self: link to the currently running process
- /proc/bus:
- contains information specific to the various buses available on the system
- for ISA, PCI and USB buses, current data on each is available in /proc/bus/<bus type directory>
- individual bus directories, signified with numbers, contains binary files that refer to the various devices available on that bus
- /proc/driver: specific drivers in use by kernel
- /proc/fs: specific file system, file handle, inode, dentry and quota information
- /proc/ide: information abput IDE devices
- /proc/irq: used to set IRQ to CPU affinity
- /proc/net: networking parameters and statistics
- arp: kernel’s ARP table. useful for connecting hardware address to an IP address on a system
- dev: lists the network devices along with transmit and receive statistics
- route: displays the kernel’s routing table
- /proc/scsi: like /proc/ide, it gives info about scsi devices
- /proc/sys:
- allows you to make configuration changes to a running kernel, by
echo
command - any configuration changes made will disappear when the system is restarted
- allows you to make configuration changes to a running kernel, by
/proc/sys subdirectories
- /proc/sys/dev: provides parameters for particular devices on the system. e.g.
cdrom/info
: many important CD-ROM parameters - /proc/sys/kernel:
- acct: controls the suspension of process accounting based on the percentage of free space available on the filesystem containing the log
- ctrl-alt-del: controls whether [Ctrl]-[Alt]-[Delete] will gracefully restart the computer using init (value 0) or force an immediate reboot without syncing the dirty buffers to disk (value 1).
- domainname: allows you to configure the system’s domain name, such as domain.com.
- hostname: allows you to configure the system’s host name, such as host.domain.com
- threads-max: sets the maximum number of threads to be used in the kernel, with a default value of 4096
- panic: defines the number of seconds the kernel will postpone rebooting the system when a kernel panic is experienced. By default, the value is set to 0, which disables automatic rebooting after a panic.
- /proc/sys/vm: facilitates the configuration of the Linux kernel’s virtual memory subsystem
Advantages and disadvantages of the /proc file system
- advantages:
- coherent, intuitive interface to the kernel
- great for tweaking and collecting status info
- easy to use and programming
- disadvantages:
- certain amount of overhead, must use fs calls, alleviated somewhat by sysctl() interface
- user can possibly cause system instability
/proc file system entries
- to use any of the procfs functions, you have to include the correct header file
#include <linux/proc_fs.h>
struct proc_dir_entry* create_proc_entry(const char* name, mode_t mode, struct proc_dir_entry* parent);
- this function creates a regular file with the name
name
, the modemode
in the directoryparent
- to create a file in the root of the procfs, use
NULL
asparent
parameter - when successful, the function will return a pointer to the freshly created
struct proc_dir_entry
- e.g.
foo_file = create_proc_entry("foo", 0644, example_dir);
- this function creates a regular file with the name
struct proc_dir_entry* proc_mkdir(const char* name, struct proc_dir_entry* parent);
- create a directory
name
in the procfs directoryparent
- create a directory
struct proc_dir_entry* proc_symlink(const char* name, struct proc_dir_entry* parent, const char* dest);
- this creates a symlink in the procfs directory
parent
that points fromname
todest
. this translates in userland to ln -s dest name
- this creates a symlink in the procfs directory
void remove_proc_entry(const char* name, struct proc_dir_entry* parent);
- removes the entry
name
in the directoryparent
from the procfs. - be sure to free the data entry from the struct
proc_dir_entry
before the function is called
- removes the entry
Tips
- modules can only use APIs exported by kernel and other modules
- no libcs
- kernel exports some common APIs
- modules are part of kernel
- modules can control the whole system
- as a result, can damage the whole system
- modules basically can’t be written with C++
- determine which part should be in kernel
Process Management
Processes, Lightweight Processes and Threads
- Process: an instance of a program in execution
- (User) Thread: an execution flow of the process
- Pthread (POSIX thread) library
- Lightweight process (LWP): used to offer better support for multithreaded applications
- LWP may share resources: address space, open files…
- To associate a lightweight process with each thread
Process descriptor
task_struct
data structure:
state
: process statethread_info
: low-level information for the processmm
: pointers to memory area descriptorstty
: tty associated with the processfs
: current directoryfiles
: pointers to file descriptorssignal
: signals received
Process state
- TASK_RUNNING: 该状态表示进程处于可运行状态,也就是说要么正在CPU中运行,要么在runqueue队列中等待运行
- TASK_INTERRUPTABLE: 该状态表示进程处于可中断的睡眠状态。该进程正处在睡眠,但是可以被任何信号唤醒。当信号将该进程唤醒后,进程会去对信号做出响应。
- TASK_UNINTERRUPTABLE: 该状态表示进程处于不可中断的睡眠状态。该进程正处于睡眠,专心等待某一个事件(一般是IO事件),并且不希望被其他信号唤醒。
- TASK_STOPPED
- TASK_TRACED
- EXIT_ZOMBIE: 该状态是该进程变为僵尸进程,即其父进程没有对该进程的结束信号进行处理
- EXIT_DEAD
Identifying a process
- process descriptor pointers: 32 bits
- process ID (PID): 16 bits (~32767 for compatibility)
- linux associates different PID with each process or LWP
- programmers expect threads in the same group to have a common PID
- thread group: a collection of LWPs
- the PID of the first LWP in the group
tgid
field in process descriptor: usinggetpid()
system call
The process list
tasks
field intask_struct
structure- type
list_head
prev
,next
fields point to the previous and the nexttask_struct
- type
- process 0 (swapper):
init_task
Parenthood relationships among processes
- process 0 and 1: created by the kernel
- process 1
init
: the ancestor of all processes
- process 1
- fields in process descriptor for parenthood
- real_parent
- parent
- children
- sibling
Pidhash table and chained lists
- to support the search for the process descriptor of a PID, since sequential search in the process list is inefficient
- the
pid_hash
array contains four hash tables and correspnding field in the process descriptorpid
: PIDTYPE_PIDtgid
: PIDTYPE_TGID (thread group leader)pgrp
: PIDTYPE_PGID (group leader)session
: PIDTYPE_SID (session leader)
- chaining is used to handle PID collisions
- the size of each pidhash table is dependent on the available memory
PID
pids
field of the process descriptor: the pid data structure
nr
: PID numberpid_chain
: links to the previous and the next elements in the hash chain listpid_list
: head of the per-PID list (in thread group)
How processes are organized
- processes in TASK_STOPPED, EXIT_ZOMBIE, EXIT_DEAD: not linked in lists
- processes in TASK_INTERRUPTABLE, TASK_UNINTERRUPTABLE: waiting queues
- two kinds of sleeping processes:
- exclusive process
- nonexclusive process: always woken up by the kernel when the event occurs
Process switch, task switch, context switch
- hardware context switch: a far jmp (in older Linux)
- software context switch: a sequence of mov instructions
- allows better control over the validity of data being loaded
- the amount of time required is about the same
Performing the process switch
- switch the page global directory
- switch the kernel mode stack and the hardware context
Task State Segment
TSS: a specific segment type in x86 architecture to store hardware contexts
Creating processes
- in traditional UNIX, resources owned by parent process are duplicated
- very slow and inefficient
- mechanisms to solve this problem
- Copy on Write: parent and child read the same physical page
- Ligitweight process (LWP): parent and child share per-process kernel data structures
vfork()
system call: parent and child share the memory address space
clone()
, fork()
and vfork()
system calls
clone(fn, arg, flags, child_stack, tls, ptid, ctid)
: creating lightweight process- a wrapper function in C library
- Uses
clone()
system call
fork()
andvfork()
system calls: implemented byclone
with different parameters- each invokes
do_fork()
function
Kernel threads
- kernel threads run only in kernel mode
- they use only linear addresses greater than PAGE_OFFSET
kernel_thread()
: to create a kernel thread- example kernel threads:
- process 0 (swapper process), the ancestor of all processes
- process 1 (init process)
- others: keventd, kapm…
Destrying processes
exit()
library function- two system calls in Linux 2.6
_exit()
system call: handled bydo_exit()
functionexit_group()
system call: handled bydo_group_exit()
function
- two system calls in Linux 2.6
- process removal: releasing the process descriptor of a zombie process by
release_task()
Scheduling policy
- based on time-sharing:
- time slice
- based on priority ranking:
- dynamic
- classification of processes:
- interactive processes: e.g. shells, text editors, GUI applications
- batch processes: e.g. compilers, database search engine, web server
- real-time processes: e.g. audio/video applications, data-collection from physical sensors, robit controllers
Process preemption
- Linux (user) processes are preemptive
- when a new process has higher priority than the current
- when its time quantum expires,
TIF_NEED_RESCHED
in thread_info will be set
- a preempted process is not suspended since it remains in the TASK_RUNNING state
- Linux kernel before 2.6 is nonpreemptive, simpler
How long should a quantum last
- neither too long nor too short:
- too short: overhead for process switch
- too long: process no longer appear to be executed concurrently
- always a compromise: the rule of thumb, to choose a duration as long as possible, while keeping good system response
Scheduling algorithm
- earlier version of Linux: simple and straightforward
- at every process switch, scan the runnable processes, compute the priorities and select the best one to run
- much more complex in Linux 2.6
- scales well with the number of processes and processors
- constant time scheduling
- 3 scheduling classes
- SCHED_FIFO: first-in first-out real-time
- SCHED_RR: round-robin real-time
- SCHED_NORMAL: conventional time-shared processes
Scheduling of conventional processes
Static priority
- static priority
- conventional processes: 100 - 139
- can be changed by
nice()
,setpriority()
system calls
- can be changed by
- base time quantum: the time-quantum assigned by the scheduler if it has exhausted its previous time quantum
- conventional processes: 100 - 139
Dynamic priority and average sleep time
- dynamic priority: 100 - 139
dynamic priority = max(100, min(static_priority - bonus + 5, 139))
- bonus: 0 - 10
- < 5: penalty
- > 5: premium
- dependent on the average sleep time: average number of nanoseconds that the process spent while sleeping
- the more average sleep time, the more bonus:
- bonus: 0 - 10
Determine the status of a process
To determine whether a process is considered to be interactive or batch
- interactive if: dynamic_priority <= 3 * static_priority / 4 + 28
- i.e.
bonux - 5 >= static_priority / 4 - 28
- interactive delta
Active and expired processes
- active processes: runnable processes that have not exhausted their time quantum
- expired processes: runnable processes that have exhausted their time quantum
- periodically, the role of processes changes
Scheduling of real-time processes
- real-time priority: 1 - 99
- real-time processes are always considered active
- can be changed by
sched_setparam()
andsched_setscheduler()
system calls
- can be changed by
- a real-time process is replaced only when:
- another process has higher real-time priority
- put to sleep by blocking operation
- stopped or killed
- voluntarily relinquishes the CPU by
sched_yield()
system call - for round-robin real time (SCHED_RR), time quantum exhausted
Implementation Support of Scheduling
Data structures used by the scheduler
runqueue
data structure for each CPU- two sets of runnable processes in the
runqueue
structure prio_array_t
data structure- nr_active: # of process descriptors in the list
- bitmap: priority bitmap
- queue: the 140 list_heads
Functions used by the scheduler
scheduler_tick()
: keep the time_slice counter up-to-datetry_to_wake_up()
awaken a sleeping processrecalc_task_prio()
: updates the dynamic priorityload_balance()
: keep the runqueue of multiprocess system balancedschedule()
: select a new process to run- Direct invocation:
- when the process must be blocked to wait for resource
- when long iterative tasks are executed in device drivers
- Lazy invocation: by setting the TIF_NEED_RESCHED flag
- when current process has used up its quantum, by
scheduler_tick()
- when a process is woken up, by
try_to_wake_up()
- when a
sched_setscheduler()
system call is issued
- when current process has used up its quantum, by
- Direct invocation:
Runqueue balancing in multiprocessor systems
- 3 types of multiprocessor systems
- Classic multiprocessor architecture
- Hyper-threading
- NUMA
- A CPU can execute only the runnable processes in the corresponding runqueue
- The kernel periodically checks if the workloads of the runqueues are balanced
Scheduling domains
A set of CPUs whose workloads are kept balanced by the kernel
- hierarchically organized
- partitioned into groups: a subset of CPUs
System calls related to scheduling
nice()
: for backward compatability only- allow processes to change their base priority
- replaced by
setpriority()
getpriority()
andsetpriority()
:- act on the base priorities of all processes in a given group
- which: PRIO_PROCESS, PRIO_PGRP, PRIO_USER
- who: the value of pid, pgrp, or uid field to be used for selecting the processes
- niceval: the new base priority value: -20 - +19
sched_getscheduler()
,sched_setscheduler()
: queries/sets the scheduling policysched_getparam()
,sched_setparam()
: queries/sets the scheduling parameterssched_yield()
: to relinquish the CPU voluntarily without being suspendedsched_get_priority_min()
,sched_get_priority_max()
: to return the minimum/maximum real-time static prioritysched_rr_get_interval()
: to write into user space the round-robin time quantum for real-time process
Completely Fair Scheduler (CFS)
- Motivation of CFS: running task gets 100% usage of the CPU, all other tasks get 0% usage of the CPU
- New scheduling algorithm in linux kernel 2.6
- Design:
- the same virtual runtime for each task
- increasing the priority for sleeping task
- Implementation
- stores the records about the planned tasks in a red-black tree
- pick efficiently the process that has used the least amount of time (the leftmost node of the tree)
- the entry of the picked process is then removed form the tree, the spent execution time is updated and the entry is then returned to the tree where it normally takes some other location.
Kernel synchronization
Kernel control paths
- Linux kernel: like a server that answers requests
- parts of the kernel run in an interleaved way
- Kernel control path: a sequence of instructions executed in kernel mode on behalf of current process
- interrupts and exceptions
- lighter than a process (less context)
- Three CPU states are considered:
- User: running a process in user mode
- Excp: running an exception or a system call handler
- Intr: running an interrupt handler
Kernel preemption
- Preemptive kernel: a process running in kernel mode can be replaced by another process while in the middle of a kernel function
- The main motivation for making a kernel preemptive is to reduce the dispatch latency of the user mode process
- dispatch latency: delay between the time they become runnable and the time they actually begin running
- The kernel can be preempted only when it is excuting an exception handler (in particular a system call) and the kernel preemption has not been explicitly disabled
When synchronization is necessary
- A race condition can occur when the outcome of a computation depends on how two or more interleaved kernel control paths are nested
- To identify and protect the critical regions in exception handlers, interrupt handlers, deferrable functions, and kernel threads
- On single CPU, critical region can be implemented by disabling interrupts while accessing shared data
- If the same data is shared only by the service routines of system calls, critical region can be implemented by disabling kernel preemption while accessing shared data
- Things are more complicated on multiprocessor systems
- different synchronization techniques are necessary
When synchronization is not necessary
- The same interrupt cannot occur until the handler terminates
- Interrupt handlers and softirqs are nonpreemptable, non-blocking
- A kernel control path performing interrupt handling cannot be interrupted by a kernel control path executing a deferrable function or a system call service routine
- Softirqs cannot be interleaved
Per-CPU variables
- The simplest and most efficient synchronization technique consists of declaring kernel variables as per-cpu variables
- an array of data structures, one element per CPU in the system
- a CPU should not access the elements of the array corresponding to the other CPUs
- While per-cpu variables provide protection against concurrent accesses from several CPUs, they do not provide protection against accesses from asynchronous functions (interrupt handles and deferrable functions)
- per-cpu variables are prone to race conditions caused by kernel preemption, both in uniprocessor and multiprocessor systems
Memory Barriers
- when dealing with synchronization, instruction reordering must be avoided
- a memory barrier primitive ensures that the operations before the primitive are finished before the operations after the primitive
- all instructions that operate on I/O ports
- all instructions prefixed by lock byte
- all instructions that write into control registers, system registers or debug registers
- a few special instructions, e.g. iret
Spin locks
- Spin locks are a special kind of lock designed to work in a multiprocessor environment
- busy waiting
- very convenient
Read/Write Spin Locks
- To increase the amount of concurrency in the kernel
- multiple reads, one write
- rwlock_t structure
- lock field: 32 bit
Seqlock
- Introduced in Linux 2.6
- similar to read/write spin locks
- except that they give a much higher priority to writers
- a writer is allowed to proceed even when readers are active
Read-Copy Update
- Read-Copy update (RCU): another synchronization technique designed to protect data structures that are mostly accessed for reading by several CPUs
- RCU allows many readers and many writers to proceed concurrently
- RCU is lock free
- Key ideas
- only data structures that are dynamically allocated and referenced via pointers can be protected by RCU
- No kernel control path can sleep inside a critical section protected by RCU
Semaphores
- two kinds of semaphores
- kernel semaphores: by kernel control paths
- System V IPC semaphores: by user processes
- Kernel semaphores
- struct semaphore
- count
- wait
- sleepers
up()
: to release a kernel semaphore (similar to signal)down()
: to acquire kernel semaphore (similar to wait)
- struct semaphore
Read/Write semaphores
- similar to read/write spin locks
- except that waiting processes are suspended instead of spinning
- struct rw_semaphore
- count
- wait_list
- wait_lock
init_rwsem()
down_read()
,down_write
: acquire a read/write semaphoreup_read()
,up_write()
: release a read/write semaphore
Completions
- to solve a subtle race condition in multiprocessor systems
- similar to semaphores
- struct completion
- done
- wait
complete()
: corresponding toup()
wait_for_completion()
: corresponding todown()
Local interrupt disabling
- interrupts can be disabled on a CPU with
cli
instructionslocal_irq_disable()
macro
- interrupts can enabled by
sti
instructionslocal_irq_enable()
macro
Disabling/Enabling deferrable functions
softirq
- the kernel sometimes need to disable deferrable functions without diabling interrupts
local_bh_disable()
macrolocal_bh_enable()
macro
Synchronizing accesses to kernel data structures
- rule of thumb for kernel developers:
- always keep the concurrency level as high as possible in the system
- two factors:
- the number of I/O devices that operate concurrently
- the number of CPUs that do productive work
- a shared data structure consisting of a single integer value can be updated by declaring it as an
atomic_t
type and by using atomic operations - inserting an element into a shared linked list is never atomic since it consists of at leasr pointer assignments
Examples of race condition
- when a program uses two or more semaphores, the potential for deadlock is present because two different paths could wait for each other
- Linux has few problems with deadlocks on semaphore requests since each path usually acquire just one semaphore
- in cases such as
rmdir()
andrename()
system calls, two semaphores requests - to avoid such deadlocks, semaphore requests are performed in address order
- semaphore request are performed in predefined address order
Symmetric multiprocessing (SMP)
Traditional View
- Traditionally, the computer has been viewed as a sequential machine
- a processor executes instructions one at a time in sequence
- each instruction is a sequence of operations
- two popular approached to providing parallelism
- symmetric multiprocesors
- clusters
Categories of computer systems
- single instruction single data (SISD) stream
- single processor executes a single instruction stream to operate on data stored in a single memory
- Single instruction Multiple data (SIMD) stream
- each instruction is executed on a different set of data by the different processors
- multiple instruction single data (MISD) stream
- a sequence of data is transmitted to a set of processors, each execute a different instruction sequence
- multiple instruction multiple data (MIMD)
- a set of processors simultaneously execute different instruction sequences on different data sets
Symmetric multiprocessing
- kernel can execute on any processor
- allowing portions of the kernel to execute in parallel
- typically each processor does self-scheduling from the pool of available process or threads
SMP
- Symmetric multiprocessing (SMP) involves a multiprocessor computer hardware and software where two or more identical processors connect to a single shared main memory, have full access to all I/O devices and are controlled by a single OS instance that treats all processors equally, reserving none for special purposes
- Most multiprocessor systems today use an SMP architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as seperate processors
Linux Support for SMP
- 启动过程:BSP负责操作系统的启动,在启动的最后阶段,BSP通过IPI激活各个AP,在系统的正常运行过程中,BSP和AP基本上是无差别的
- 进程调度:与单处理器系统的主要差别是执行进程切换后,被换下的进程有可能会换到其他CPU上继续运行。在计算优先权时,如果进程上次运行的CPU也是当前CPU,则会适当提高优先权,这样可以更有效地利用Cache
- 中断系统:为了支持SMP,在硬件上需要APIC中断控制系统。Linux定义了各种IPI的中断向量以及传送IPI的函数
NUMA
- Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor
- Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memeory shared between processors)
- Benefits: the benefits of NUMA are limited to particular workloads, notably on servers where the data are often associated strongly with certain tasks or users
SMP 系统启动过程
- BIOS初始化:屏蔽AP(Application Processor),建立系统配置表
- MBR里面的引导程序(LILO,GRUB等)将内核载入内存
- 执行
head.S
中的startup_32
函数,最后调用start_kernel
- 执行
start_kernel
,这个函数相当于应用程序里面的main
函数 start_kernel
进行一系列的初始化工作,最后将执行:smp_init
:关键的一步,启动各APrest_init
,调用init
创建1号init
进程,自身调用cpu_idle
成为0号进程
- 1号进程
init
完成剩下的工作
reschedule_idle
工作过程
- 先检查p进程上一次运行的cpu是否空闲,如果空闲,这是最好的cpu,直接返回。
- 找一个合适的cpu,查看SMP中的每个CPU上运行的进程,与p进程相比的抢先权,把具有最高的抢先权值的进程记录在target_task中,该进程运行的cpu为最合适的CPU。
- 如target_task为空,说明没有找到合适的cpu,直接返回。
- 如果target_task不为空,则说明找到了合适的cpu,因此将target_task->need_resched置为1,如果运行target_task的cpu不是当前运行的cpu,则向运行target_task的cpu发送一个IPI中断,让它重新调度。
Memory Management - Addressing
Memory Addresses
- three kinds of addresses in 80x86 microprocessors
- logical address: included in the machine language instructions
- linear address (virtual address): a single 32-bit unsigned integer that can be used to address up to 4GB
- physical address (32-bit unsigned integers): used to address memory cells in memory chips
- logical address => segmentation unit => linear address => paging unit => physical address
Virtual File System (VFS)
Role of VFS
- a common interface to several kinds of file systems
- File systems supported by the VFS
- Disk-based filesystems
- Network filesystems
- special filesystems (e.g. /proc)
Common file system interface
- enable system calls such as
open()
,read()
andwrite()
to work regardless of file system or storage media
Terminology
- file system: storage of data adhering to a specific structure
- file: ordered string of bytes
- directory: analogous to a folder
- special type of file
- instead of normal data, it contains pointers to other files
- directories are hooked together to create the hierarchical name space
- metadata: information describing a file
Physical file representation
- Inode:
- unique index
- holds file attributes and data block locations pertaining to a file
- Data blocks
- contain file data
- may not be physically contiguous
- File name:
- human-readable identifier for each file
The common file model
- defines common file model conceptual interfaces and data structures
- e.g. user space (
write()
) => VFS (sys_write()
) => file system (file system write methods
) => physical media - object-oriented: data structures and associated operations
- Superblock object: a mounted file system
- Inode object: information about a file, represents a specific file
- File object: interaction between an open file and a process
- Dentry object: directory entry, single component of a path name
VFS operations
- each object contains operations object with methods:
- super_operations: invoked on a specific file system
- inode_operations: invoked on a specific inodes (which point to a file)
- dentry_operations: invoked on a specific directory entry
- file_operations: invoked on a file
- lower file system can implement own version of methods to be called by VFS
- if an operation is not defined by a lower file system (NULL), VFS will often call a generic version of the method
VFS Data Structures
Superblock object
- implemented by each file system
- used to store information describing that specific file system
- often physically written at the beginning of the partition (file system control block)
- found in
linux/fs.h
Inode object
- represent all the information needed to manipulate a file or directory
- constructed in memory, regardless of how file system stores metadata information
- three doubly linked lists
- List of valid unused inodes
- List of in-use inodes
- List of dirty inodes
Power Management - From Linux Kernel to Android
Linux Power Management
- Two popular power management standards in Linux
- APM (Advanced Power Management)
- Power management happens in two ways:
- function calls from APM driver to the BIOS requesting power state change
- automatically based on device activities
- Power states in APM:
- Full on
- APM enabled
- APM standby
- APM suspend
- Sleep/Hibernation
- Off
- Weakness of APM:
- each BIOS has its own policy, different computers have different activities
- unable to know why system is suspend
- BIOS does not know user activities
- does not know add-on devices (USB)
- Power management happens in two ways:
- ACPI (Advanced Configuration and Power Interface)
- control divided between OS and BIOS, decisions managed through OS
- Enables sophisticated power policies for general-purpose computers with standard usage patterns and hardware
- APM (Advanced Power Management)
Android Power Management
- Built on top of Linux Power Management
- Designed for devices which have a “default-off” behaviors
- the phone is not supposed to be on when we do not want to use it
- powered on only when requested to be run, off by default
- unlike PC, which has a “default-on” behavior
- Apps and services must request CPU resource with “wake locks” through the Android application framework (
PowerManager
class) and native Linux libraries in order to keep power on, otherwise Android will shut down the CPU - Android PM uses wake locks and time out mechanism to switch state of system power, so that system power consumption decreases
- If there are no wake locks, CPU will be turned off
- If there are partial wake locks, screen and keyboard will be turned off
Android PM State Machine
- When an user application acquire full wake lock or screen/keyboard touch activity event occur, the machine will enter ‘AWAKE’ state
- If timeout happens or power key is pressed, the machine will enter “notification” state:
- if partial wake lock is acquired, the machine will remain “notification”
- if all partial wake locks are released, the machine will enter “sleep” state
- There is one main wake lock called ‘main’ in the kernel to keep the kernel awake
- It is the last wake lock to be released when system goes to suspend
Acquiring wake locks
- request sent to PowerManager to acquire a wake lock
- PowerManagementService (in the kernel) to take a wake lock
- Add wake lock to the list
- set the power state:
- for a full wake lock, state should be set to “ON”
- For taking Partial wake lock, if it is the first partial wake lock, a kernel wake lock is taken. This will protect all the partial wake locks. For subsequent requests, kernel wake lock is not taken, but just added to the list
- It is the last wake lock to be released when system goes to suspend
System Sleep
- early_suspend:
- Used by drivers that need to handle power mode settings to the device before kernel is suspended
- Used to turn off screen and non-wakeup source input devices
- E.g., consider a display driver
- In early suspend, the screen can be turned off
- In the suspend, other things like closing the driver can be done
- late_resume:
- When system is resumed, resume is called first, followed by resume late
Battery Service
- The BatteryService monitors the battery status, level, temperature etc.
- A Battery Driver in the kernel interacts with the physical battery via ADC to read battery voltage and I2C
- Whenever BatteryService receives information from the BatteryDriver, it will act accordingly
- E.g. if battery level is low, it will ask system to shutdown
- Battery Service will monitor the battery status based on received uevent from the kernel
- uevent : An asynchronous communication channel for kernel