Macros, types and conventions:


long  IS_ERR(const void *ptr)            // pointer has a value in the "error" range (>= (unsigned long) -MAX_ERRNO)
long  IS_ERR_OR_NULL(const void *ptr)
long  PTR_ERR(const void *ptr)
void* ERR_PTR(long error)





BUG_ON(condition)          // can compile away, so do not create side effects in condition
// build-time checking


typecheck(__u64, a)        // build-time checking that a is of type __u64

container_of(member_ptr, container_typename, member_name)
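Illustrative sketch of container_of usage (the structure and names below are made up for the example): recover a pointer to the enclosing structure from a pointer to one of its members.

#include <linux/kernel.h>        /* container_of, pr_info */
#include <linux/list.h>

/* Hypothetical structure embedding a list_head member. */
struct my_device {
        int                id;
        struct list_head   link;
};

static void example(struct list_head *node)
{
        /* Recover the enclosing struct my_device from its 'link' member. */
        struct my_device *dev = container_of(node, struct my_device, link);

        pr_info("device id = %d\n", dev->id);
}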


BITS_PER_LONG              // 32 or 64



No floating point operations in kernel (there are routines to save/restore fp context).


Interrupt stack (optional; present only on some architectures), normally 1 page (on x86/x64: 4 KB).


Kernel stack: on x86/x64 usually 2 pages, i.e. 8 KB.
At the bottom of the stack is struct thread_info.
On stack overflow it will be silently overwritten, with disastrous consequences.


struct thread_info* current_thread_info();

struct task_struct* task = current_thread_info()->task;
struct task_struct* task = current;


/* task->state */

TASK_RUNNING                 0      // currently running or on runqueue waiting to run

TASK_INTERRUPTIBLE           1      // in interruptible sleep

TASK_UNINTERRUPTIBLE         2      // in non-interruptible sleep

__TASK_STOPPED               4      // stopped due to SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU

__TASK_TRACED                8      // being traced by debugger etc. via ptrace


/* in task->exit_state */

EXIT_ZOMBIE                  16        // process has terminated, lingering around to be reaped by parent

EXIT_DEAD          32        // final state after parent collects status with wait4 or waitpid syscalls


/* in task->state again */

TASK_DEAD          64

TASK_WAKEKILL                128       // wake up on fatal signals

TASK_WAKING                  256







set_task_state(task, state)


Iterate parent tasks

(note: if parent terminates, task is reparented to member of the group or to init)


for (task = current;  task != &init_task;  task = task->parent)
    { ... }


Iterate all tasks in the system (expensive):


for_each_process(task)
    { ... }


task = next_task(task);
task = prev_task(task)


Iterate over children tasks:


struct list_head* list;

list_for_each(list, &current->children)

    task = list_entry(list, struct task_struct, sibling);



Annotation for functions that can sleep. This macro will print a stack trace if it is executed in an atomic

context (spinlock, irq-handler, ...):


might_sleep();





Modifiers (annotations):


__init                                                          place code into the section discarded after kernel or module init

__read_mostly                                              place data in “read mostly” section to reduce cache line interference
__write_once                                               same as __read_mostly

__page_aligned_bss                                      align data on page boundary and allocate it from bss-page-aligned section

notrace                                                        do not instrument this function for profiling

__irq_entry                                                  allocate code in section irqentry-text (only if graph tracer is enabled)

__kprobes                                                     allocate code in section kprobes-text (only if CONFIG_KPROBES)

asmlinkage                                                    assembler-level call linkage

__lockfunc                                                    allocate code in section spinlock-text


__kernel                                                      designate as kernel-space address

__user, __iomem, __rcu                               disable dereference (only used by __CHECKER__, see  sparse(1))


__nocast                                                      sparse will issue warning if attempted type-conversion to this var (e.g. signed int value to unsigned var)

__bitwise                                                      sparse will issue warning if __bitwise variables are mixed with other integers

__safe                                                          sparse: not used in the codebase

__force                                                        sparse: disable warning (e.g. when casting int to bitwise)

__le32, __be32                                               sparse: little endian, big endian


__must_hold(x)                                            sparse: the specified lock is held on function entry and exit

__acquires(x)                                                sparse: function acquires the specified lock

__releases(x)                                                sparse: function releases the specified lock






Sparse usage: build and install sparse (make install), then run the kernel make:

make C=1                         # run sparse only on files that need to be recompiled
make C=2                         # run sparse on all files, whether recompiled or not
make C=2 CF="-D__CHECK_ENDIAN__"     # pass optional parameters to sparse



Execution contexts:


·          process (e.g. syscall), preemptible context

·          process, non-preemptible context

·          bottom-half (DPC/fork-like, but may be threaded, aka “soft interrupt” – e.g. tasklet)

·          primary interrupt handler (irq, aka "top half", aka “hard interrupt”)

·          PREEMPT_RT only: secondary (threaded) interrupt handler, per-IRQ thread


There are no IPL/IRQL levels.

SOFTIRQ is executed with hardware interrupts disabled
(but tasklet handler enables interrupts before entering the tasklet routine).


Bottom-half processing can be triggered by:

·          return from hardware interrupt (irq_exit)

·          local_bh_enable(), including indirect calls

·          ksoftirqd kernel thread (per-CPU, runs at low priority)

·          any code that explicitly calls do_softirq(), __do_softirq(), call_softirq() or invoke_softirq(),
such as network code


Note again that Linux is NOT "true" IPL/IRQL-based system, and local_irq_enable() or unlocking a spinlock DO NOT cause SOFTIRQ processing to happen.


Task switch can be triggered by:


·          REI to user-space

·          REI to kernel-space if (current->preempt_count == 0 && need_resched())

·          unlocking a lock, including indirect unlock or any other call to preempt_enable(),
if result is (current->preempt_count == 0 && need_resched())

·          explicit voluntary call to schedule(), including yield() and blocking functions


Preemption mask:



preempt_count()       current_thread_info()->preempt_count

preempt_count()               Returns the preemption disable count; this includes counts added
                                                       by currently active hardirqs, softirqs and NMI. The low byte is the preempt
                                                       disable count (if the kernel is configured preemptible, otherwise 0),
                                                       byte 1 is the softirq count, the next 10 bits are the hardirq count,
                                                       then 1 bit for NMI, then one bit for PREEMPT_ACTIVE.


There is also preempt_lazy_count (useful for PREEMPT_RT)

SOFTIRQ is changed by 1 on entering/leaving softirq processing and is changed by 2 on local_bh_enable/local_bh_disable, to distinguish between softirq-active and softirq-disabled states

PREEMPT_ACTIVE                                                    rescheduling is being performed (so do not reenter the scheduler)


Context control functions:


preempt_disable()                              increments PREEMPT field
preempt_enable()                               decrements PREEMPT; if it drops to 0, checks need_resched()


local_bh_disable()                               increments SOFTIRQ field by 2 (whereas entering softirq increments it
                        only by 1, to distinguish between softirq-active/softirq-disabled states)

local_bh_enable()                                decrements SOFTIRQ field by 2; if it drops to 0, checks for pending bottom halves (softirqs)


local_irq_disable()                              disable hw interrupts unconditionally (on some arch can be lazy)
local_irq_enable()                               enable hw interrupts unconditionally


unsigned long flags;

local_irq_save(flags)                         save previous state into flags and disable hw interrupts
local_irq_restore(flags)                  restore hw-interrupt-enable state from flags

Other functions:


hard_irq_disable()                              Really disable interrupts, rather than relying on lazy-disabling on
                                                                        some architectures.

preempt_enable_no_resched()    Decrement preemption count. If goes to 0, do not check for any
                                                                        pending reschedules.


in_nmi()                                                     Non-zero if executing NMI handler.


in_irq()                                                       Non-zero if currently executing  primary (hardirq) hardware

                                                                        interrupt  handler, 0 if in process context or bottom half or NMI.


in_softirq()                                              Non-zero if processing softirq or have BH disabled.


in_serving_softirq()                         Non-zero if processing softirq.


in_interrupt()                                        Non-zero if in NMI or primary interrupt (hardirq) or in bottom-half context, or if have BH disabled.

irqs_disabled()                                     Non-zero if local interrupt delivery is disabled.


in_atomic()                                              Checks (preempt_count() & ~PREEMPT_ACTIVE), cannot sense if

                                                                        spinlocks are held in a non-preemptible kernel, so use only with
                                                                        great caution. Do not use in drivers.


synchronize_hardirq(irq)            Wait for pending hard IRQ handlers to complete on other CPUs


synchronize_irq(irq)                       Wait for pending hard & threaded IRQ handlers to complete on other CPUs


Warning: on non-preemptible systems (CONFIG_PREEMPT=n) spin_lock does not change preemption mask.


Bottom halves:


top half = primary interrupt context

bottom half = deferred interrupt processing (similar to DPC or fork), can be threaded




Original BH                      removed in 2.5

Task queues                      removed in 2.5

Softirqs                         since 2.3

Tasklets                         since 2.3

Work queues                      since 2.3


Original BH (removed in 2.5):


32-bit request mask and 32 lists of requests, each level globally synchronized across all CPUs.


Task queues  (removed in 2.5):


Set of queues.


Softirqs (BH):


Set of NR_SOFTIRQS statically defined bottom halves allocated at kernel compile time.

Cannot register extra softirq levels dynamically.

Handler can run on any CPU.


Rarely used directly (tasklets are more common); tasklets themselves are layered on top of softirqs.


Defined levels:


HI_SOFTIRQ (0)                        High-priority tasklets

TIMER_SOFTIRQ (1)                     Timers

NET_TX_SOFTIRQ (2)                    Send network packets

NET_RX_SOFTIRQ (3)                    Receive network packets

BLOCK_SOFTIRQ (4)                     Block devices: done

BLOCK_IOPOLL_SOFTIRQ (5)              Block devices: poll

TASKLET_SOFTIRQ (6)                   Normal priority tasklets

SCHED_SOFTIRQ (7)                     Scheduler (just inter-CPU balancing,
                                      not the same as SOFTINT IPL$_RESCHED)

HRTIMER_SOFTIRQ (8)                   High-resolution timer

RCU_SOFTIRQ (9)                       RCU callbacks

Level to name:


extern char *softirq_to_name[NR_SOFTIRQS];


Register handler:


void my_action(struct softirq_action *);

open_softirq(HRTIMER_SOFTIRQ, my_action);


Handler runs with interrupts enabled and cannot sleep.


Raise softirq:


void  raise_softirq(unsigned int nr);                         // e.g. raise_softirq(HRTIMER_SOFTIRQ)


Marks softirq level as pending.
Temporarily disables interrupts internally.


Equivalent to:


                        unsigned long flags;

                        local_irq_save(flags);
                        raise_softirq_irqoff(nr);
                        local_irq_restore(flags);


If interrupts are already disabled, can do:


void  raise_softirq_irqoff(unsigned int nr);

Try to send a softirq to a remote cpu. 

If this cannot be done, the work will be queued to the local cpu.


void  send_remote_softirq(struct call_single_data* cp, int cpu, int softirq);



Softirq processing logic:


do_softirq is invoked:


·          on return from hardware interrupt (irq_exit)

·          local_bh_enable()

·          in the ksoftirqd kernel thread (per-CPU, runs at low priority)

·          in any code that explicitly calls do_softirq(), __do_softirq(), call_softirq() or invoke_softirq(),
  such as network code


void do_softirq(void)
{
    __u32 pending;

    unsigned long flags;

    struct softirq_action* h;


    if (in_interrupt())
        return;

    local_irq_save(flags);

    pending = local_softirq_pending();


    if (pending)
    {
        ... on x86 and some other architectures ...

        ... switch to per-CPU softirq stack ...

        ... clear the pending mask, re-enable interrupts ...


        h = softirq_vec;

        do
        {
            if (pending & 1)
                h->action(h);

            h++;
            pending >>= 1;
        } while (pending);

        ... disable interrupts again ...
    }


    ... recheck if still pending, and if so retry the loop a few times ...

    ... after which wake up ksoftirqd for execution on a worker thread ...


    local_irq_restore(flags);
}







·          Multiple softirqs (of same or different levels) can concurrently execute on different processors.

·          Single pass on the same CPU goes from high-priority (0) to low-priority (9) softirqs, but in a larger scheme of things low-priority softirq can get executed earlier than high-priority softirq.

(E.g. if another processor picks up a lower-priority softirq first, or if a high-priority softirq is added while the scan pass is already in progress, or if tasklets are relegated to threaded execution, or due to ksoftirqd.)

·          softirqs are prioritized for ordering only (and even then are subject to postponing to the ksoftirqd thread).

A high-priority softirq does not interrupt a lower-priority softirq.


"While a (softirq) handler runs, softirqs on the current processor are disabled.
A softirq never preempts another softirq (on the same CPU).
The only event that can preempt a softirq is an interrupt handler."



Tasklets (BH):



·          built on top of softirqs

·          can be scheduled only as one instance at a time (and also only on one processor at a time)

·          if the handler is already running on CPU1 and the tasklet gets scheduled again on CPU2, the second instance does not run concurrently with the first one: it is requeued and runs only after the first instance completes (the tasklet is serialized with respect to itself, see below)

·          handler runs with all interrupts enabled

·          does not have process context (no files etc.)

·          cannot sleep

·          per-processor queues tasklet_vec (for regular tasklets) and tasklet_hi_vec (for high-priority tasklets)



·          if tasklet_schedule is called, then tasklet is guaranteed to be executed on some CPU at least once after this

·          if tasklet is already scheduled, but its execution is still not started, it will be executed only once

·          if tasklet is already running on another CPU (or schedule is called from tasklet itself), it is scheduled again

·          tasklet is strictly serialized with respect to itself, but not to other tasklets;
  for inter-tasklet synchronization, use spinlocks.



      struct tasklet_struct
      {
              struct tasklet_struct *next;

              unsigned long state;           // bits: TASKLET_STATE_SCHED (0), TASKLET_STATE_RUN (1)

              atomic_t count;                // if non-zero, disabled and cannot run
                                             // otherwise enabled and can run if scheduled

              void (*func)(unsigned long);

              unsigned long data;
      };

#define DECLARE_TASKLET(name, func, data) \

                struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }


#define DECLARE_TASKLET_DISABLED(name, func, data) \

               struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }


void   tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data)


int     tasklet_trylock(struct tasklet_struct *t)                                       // sets RUN bit on MP, return  true if was not set

void   tasklet_unlock(struct tasklet_struct *t)                                       // clear RUN bit

void   tasklet_unlock_wait(struct tasklet_struct *t)                                                 // spin while RUN bit is set


These RUN-bit operations are no-ops on uniprocessor kernels.


void   tasklet_schedule(struct tasklet_struct *t)                                  

void   tasklet_hi_schedule(struct tasklet_struct *t)


If SCHED is already set, do nothing.

Otherwise insert into queue and signal softirq.


void   tasklet_kill(struct tasklet_struct *t);


Cannot be called from interrupt.
Acquire SCHED bit (i.e. wait till owner clears it, e.g. after removing from queue), spin while RUN bit is set, clear SCHED bit.


Tasklet state at the time tasklet_kill is called:


not queued                       returns fine

queued, disabled                 will lock up

enabled, queued                  tasklet is removed from the queue, handler completes execution,
                                 attempts to re-schedule it meanwhile do not pass

enabled, running                 waits (spins) until the running handler completes, then returns

void   tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu);


Can be called only for CPU in state CPU_DEAD.

Remove tasklet even if it is in SCHED state.

void   tasklet_disable_nosync(struct tasklet_struct *t)


Disables tasklet by atomic_inc(t->count) and returns immediately.

void   tasklet_disable(struct tasklet_struct *t)


Disables tasklet by atomic_inc(t->count) and calls tasklet_unlock_wait().


void   tasklet_enable(struct tasklet_struct *t)
void   tasklet_hi_enable(struct tasklet_struct *t)


Enables tasklet by atomic_dec(t->count).
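A minimal tasklet usage sketch (handler, variable and IRQ handler names are hypothetical; a real driver would protect the shared state): declare the tasklet at compile time, schedule it from the primary interrupt handler, kill it on teardown.

#include <linux/interrupt.h>

static unsigned long pending_events;          /* state shared with the IRQ handler (simplified) */

static void my_tasklet_func(unsigned long data)
{
        /* Softirq context: interrupts enabled, may not sleep. */
        pending_events = 0;                   /* ... process the deferred work ... */
}

static DECLARE_TASKLET(my_tasklet, my_tasklet_func, 0);

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
        /* Acknowledge the hardware, then defer the heavy work to the tasklet. */
        pending_events++;
        tasklet_schedule(&my_tasklet);
        return IRQ_HANDLED;
}

/* On driver teardown:  tasklet_kill(&my_tasklet); */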


Tasklet block:

flags: sched (“scheduled”), run (“currently running”)

Handler logic:


tasklet_schedule:

        if (!sched)
        {
                sched = 1

                queue + raise softirq
        }


softirq handler:

        copy:  list = per-cpu queue

        clear per-cpu queue


        for (each entry)
        {
                if (run)                        // already running on another CPU
                {
                        put back on the queue

                        request softirq

                        continue
                }

                run = 1

                if (disable_count != 0)
                {
                        run = 0

                        goto requeue
                }

                sched = 0

                call func(data)

                run = 0
        }



Work queues (BH):


·          items executed by a worker thread, in process context and can sleep

·          default threads are per-CPU events/n, but a queue can also have its own dedicated processing thread or threads


wq = create_workqueue("my_queue");
wq = create_singlethread_workqueue("my_queue");
wq = create_freezable_workqueue("my_queue");

NULL if fails.

create_workqueue  creates queue and worker threads, one per CPU, named my_queue/n.


Pre-existing queues:


system_wq                        Used by schedule[_delayed]_work[_on]().
                                 Per-CPU multithreaded.
                                 For short items.

system_long_wq                   Similar to system_wq, but for long items.
                                 Queue flushing may take relatively long time.

system_freezable_wq              Similar to system_wq, but freezable.

system_unbound_wq                Workers are not bound to any specific CPU, not concurrency managed,
                                 and all queued works are executed immediately as long as the
                                 max_active limit is not reached and resources are available.

int       keventd_up(void)


Check if system_wq exists (not NULL).


void  destroy_workqueue( struct workqueue_struct* wq);


void    workqueue_set_max_active(wq, int max_active)


Adjust maximum number of concurrently active works.
Do not call from interrupt context.


struct workqueue_struct*  wq;

work_struct  work;                                                   

void   my_func(struct work_struct *pWork);


DECLARE_WORK(work, my_func);                                                                       // compile-time initialization

INIT_WORK(&work, my_func)                                                       // run-time initialization

INIT_WORK_ONSTACK(&work, my_func)                               // ... same for on-stack work object

PREPARE_WORK(&work, new_my_func)                               // change the routine in an already initialized descriptor


bool    queue_work(wq, pWork)                                                                       // queue on current CPU

bool    schedule_work(pWork)                                                   // same as queue_work  on system_wq


Returns TRUE if was queued, FALSE if was already on the queue.


If the queue is being drained with drain_workqueue, the work will not actually be inserted in the queue unless called from the context of a "chained work", i.e. a work executing on this queue; this is not reflected in the return status.


To pass parameters to the work routine, embed struct work_struct in a larger structure (see the sketch below).
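Sketch of that embedding technique (my_request, my_work_handler and submit_request are illustrative names): the work_struct lives inside a larger request structure and the handler recovers it with container_of.

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/errno.h>

/* Hypothetical request carrying its own work item. */
struct my_request {
        struct work_struct work;
        int                arg;
};

static void my_work_handler(struct work_struct *pWork)
{
        /* Process context: may sleep. Recover the enclosing request. */
        struct my_request *req = container_of(pWork, struct my_request, work);

        /* ... use req->arg ... */
        kfree(req);
}

static int submit_request(int arg)
{
        struct my_request *req = kmalloc(sizeof(*req), GFP_KERNEL);

        if (!req)
                return -ENOMEM;
        req->arg = arg;
        INIT_WORK(&req->work, my_work_handler);
        schedule_work(&req->work);            /* queue on system_wq */
        return 0;
}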


bool    queue_work_on(int cpu, wq, pWork)                                          // queue for processing on  specified CPU

bool    schedule_work_on(int cpu, pWork)                                              // same as queue_work_on  on system_wq


Returns TRUE if was queued, FALSE if was already on the queue.
Caller must ensure the CPU cannot go away.
If the queue is being drained with drain_workqueue, (see above).

struct delayed_work dwork;

struct delayed_work *pDelayedWork;

void   my_delayed_work_timer_func(struct work_struct *pWork);


DECLARE_DELAYED_WORK(dwork, my_delayed_work_timer_func)               // compile-time initialization

INIT_DELAYED_WORK(&dwork, my_delayed_work_timer_func)                                              // run-time initialization

INIT_DELAYED_WORK_ONSTACK(&dwork, my_delayed_work_timer_func)                      // ... same for on-stack dwork object

PREPARE_DELAYED_WORK(&dwork, new_my_delayed_func)                          // change the routine in an already initialized struct


/* Deferrable is similar to delayed, but may postpone waking up CPU if sleeping */

DECLARE_DEFERRABLE_WORK(dwork, my_delayed_work_timer_func)       // compile-time initialization

INIT_DEFERRABLE_WORK(&dwork, my_delayed_work_timer_func)              // run-time initialization

INIT_DEFERRABLE_WORK_ONSTACK(&dwork, my_delayed_work_timer_func)              // ... same for on-stack dwork object

PREPARE_DELAYED_WORK(&dwork, new_my_delayed_func)                          // change the routine in an already initialized struct


bool    queue_delayed_work(wq, struct delayed_work *work, unsigned long delay)

bool    schedule_delayed_work(struct delayed_work *work, unsigned long delay)

bool    queue_delayed_work_on(int cpu, wq, struct delayed_work *work, unsigned long delay)

bool    schedule_delayed_work_on(int cpu, struct delayed_work *work, unsigned long delay)


Similar to queue_work/schedule_work, but the additional parameter delay (number of system ticks) specifies the minimum time to delay before executing the item. If the queue is being drained with drain_workqueue, (see above).


bool    mod_delayed_work(wq, pDelayedWork, unsigned long delay)

bool    mod_delayed_work_on(int cpu, wq, pDelayedWork, unsigned long delay)


Modify delay of or queue a delayed work on specific CPU or any CPU.


If pDelayedWork is idle, equivalent to queue_delayed_work_on(); otherwise, modify pDelayedWork 's timer so that it expires after delay.  If delay is zero, pDelayedWork is guaranteed to be scheduled immediately regardless of its current state.


Returns FALSE if pDelayedWork was idle and queued, TRUE if pDelayedWork was pending and its timer was modified.


Safe to call from any context including interrupt handler.


bool    cancel_work_sync(pWork)


Cancel pWork and wait for its execution to finish. 


Can be used even if the work re-queues itself or migrates to another workqueue. 

On return, pWork is guaranteed to be not pending or executing on any CPU.


Returns TRUE if pWork was pending, FALSE otherwise.


Caller must ensure that the workqueue on which pWork was last queued can't be destroyed before this function returns.

bool    cancel_delayed_work_sync(pDelayedWork)


Cancel a delayed work and wait for it to finish.

Similar to cancel_work_sync(), but for delayed works.


bool    cancel_delayed_work(pDelayedWork)


Cancel a pending delayed work. 

Returns TRUE if pDelayedWork was pending and canceled, FALSE if wasn't pending. 


Work callback function may still be running on return, unless function returns TRUE and the work doesn't re-arm itself.  Explicitly flush or use cancel_delayed_work_sync() to wait on it.


Safe to call from any context including IRQ handler.

bool    flush_work(pWork)


Wait for a work to finish executing the last queueing instance.

Wait until pWork has finished execution. 

pWork is guaranteed to be idle on return if it hasn't been requeued since flush started.


Returns TRUE if flush_work() waited for the work to finish execution, FALSE if it was already idle.


bool    flush_delayed_work(pDelayedWork)


Wait for a delayed work to finish executing the last queueing.


Delayed timer is cancelled and the pending work is queued for immediate execution. 

Like flush_work(), this function only considers the last queueing instance of pDelayedWork.


Returns TRUE if waited for the work to finish execution, FALSE if it was already idle.


void    flush_workqueue(wq)

void    flush_scheduled_work(void)                                        // same as flush_workqueue(system_wq)


Ensure that any scheduled work has run to completion.

Forces execution of the workqueue and blocks until its completion.


Sleep until all works which were queued on entry have been handled, but not livelocked by new incoming ones.


This is typically used in driver shutdown handlers.


void    drain_workqueue(wq)


Wait until the workqueue becomes empty. 


While draining is in progress, only chain queueing is allowed. 

In other words, only currently pending or running work items on wq can queue further work items on it. 


wq is flushed repeatedly until it becomes empty.  The number of flushes is determined by the depth of chaining and should be relatively small.


int       schedule_on_each_cpu(work_func_t  func)


Execute a function synchronously on each online CPU


Executes func on each online CPU using the system workqueue and blocks until all CPUs have completed.

Very slow.


Returns 0 on success, -errno on failure.


int       execute_in_process_context(work_func_t  fn, struct execute_work*  pExecuteWork)


Execute the routine within user context.


Executes the function immediately if process context is available, otherwise schedules the function for delayed execution.

Returns:     0 if function was executed immediately (i.e. was not in interrupt).

Returns:     1 if was in interrupt and attempted to schedule function for execution.


bool    workqueue_congested(unsigned int cpu, wq)


Test whether wq's cpu workqueue for cpu is congested. 


There is no synchronization around this function and the test result is unreliable and only useful as advisory hints or for debugging.

unsigned int    work_cpu(pWork)


Return the last known associated cpu for pWork: CPU number if pWork was ever queued, WORK_CPU_NONE otherwise.


unsigned int    work_busy(pWork)


Test whether a work is currently pending or running


There is no synchronization around this function and the test result is unreliable and only useful as advisory hints or for debugging. Especially for reentrant wqs, the pending state might hide the running state.


Returns or'd bitmask of WORK_BUSY_* bits.



Timers (BH):

Called via TIMER_SOFTIRQ .
Timers are not repetitive (once expires, need to be restarted).

void myfunc(unsigned long mydata)  { ... }

[static] DEFINE_TIMER(tmr, myfunc, 0, mydata);

struct timer_list  tmr;
init_timer(&tmr);
init_timer_on_stack(&tmr);          // if variable is on stack, also call destroy_timer_on_stack(&tmr)
init_timer_deferrable(&tmr);        // will not cause CPU to wake up from power sleep
tmr.function  =  myfunc;
tmr.data  =  mydata;

tmr.expires  =  jiffies + HZ / 2;

setup_timer(&tmr, myfunc, mydata)
setup_timer_on_stack(&tmr, myfunc, mydata)
setup_deferrable_timer_on_stack(&tmr, myfunc, mydata)

void     add_timer(& tmr)                                     // must not be already pending
void     add_timer_on(&tmr, int cpu)

int        mod_timer(&tmr, unsigned long expires)

Equivalent to   del_timer(&tmr);  tmr.expires = expires;  add_timer(&tmr);
The only safe way to queue timer when there are multiple unsynchronized concurrent users of the same timer.
If timer was inactive, returns 0, if was active, returns 1.

int        mod_timer_pending(&tmr, unsigned long expires)

Modify an already pending timer.
For active timers the same as mod_timer().
But will not reactivate or modify an inactive timer.

int        mod_timer_pinned(&tmr, unsigned long expires)


Update the expire field of an active timer (if the timer is inactive it will be activated) and ensure that the timer is scheduled on the current CPU.


This does not prevent the timer from being migrated when the current CPU goes offline.  If this is a problem for
you, use CPU-hotplug notifiers to handle it correctly, for example by cancelling the timer when the corresponding
CPU goes offline.


Equivalent to:   del_timer(timer); timer->expires = expires; add_timer(timer);

int        timer_pending(&tmr)                                 // true if the timer is queued (pending); no inherent synchronization

int        del_timer(&tmr)

Deactivate timer.  OK to call either for active and inactive timers.
If was inactive, return 0, otherwise (if was active) return 1.
Guarantees that the timer won't be executed in the future, but does not wait for completion of an already running handler.


int        del_timer_sync(&tmr)

Deactivate a timer and wait for the handler to finish.


Synchronization rules: Callers must prevent restarting of the timer, otherwise this function is meaningless.

It must not be called from interrupt contexts unless the timer is an irqsafe one.

The caller must not hold locks which would prevent completion of the timer's handler.

The timer's handler must not call add_timer_on().

For non-irqsafe timers, you must not hold locks that are held in interrupt context while calling this function. Even if the lock has nothing to do with the timer in question.  Here's why:


   CPU0                                 CPU1

   ----                                 ----

                                        <SOFTIRQ>
                                          call_timer_fn();
                                            base->running_timer = mytimer;

   spin_lock_irq(somelock);

                                        <IRQ>
                                          spin_lock(somelock);

   del_timer_sync(mytimer);

   while (base->running_timer == mytimer);


Now del_timer_sync() will never return and never release somelock. The interrupt on the other CPU is waiting to grab somelock but it has interrupted the softirq that CPU0 is waiting to finish.

int        try_to_del_timer_sync(&tmr)

Try to deactivate a timer. Upon successful (ret >= 0) exit the timer is not queued and the handler is not running on any CPU.
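A minimal sketch of the (pre-4.15) timer API described above, with hypothetical names; the handler rearms the timer itself because kernel timers are one-shot.

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list my_timer;

static void my_timer_func(unsigned long data)
{
        /* Runs in softirq context (TIMER_SOFTIRQ): may not sleep. */
        /* Timers are one-shot, so rearm explicitly for periodic behavior. */
        mod_timer(&my_timer, jiffies + HZ / 2);
}

static void my_timer_start(void)
{
        setup_timer(&my_timer, my_timer_func, 0);
        mod_timer(&my_timer, jiffies + HZ / 2);    /* first expiration in ~0.5 s */
}

static void my_timer_stop(void)
{
        del_timer_sync(&my_timer);                 /* waits for a running handler to finish */
}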


Hi-res timers:

#include <linux/ktime.h>

#include <linux/hrtimer.h>


/* nanoseconds */

typedef union ktime {  s64 tv64;  }  ktime_t;




ktime_t  ktime_set(s64 sec, unsigned long ns)

ktime_t  timespec_to_ktime(struct timespec ts)

ktime_t  timespec64_to_ktime(struct timespec64 ts)

ktime_t  timeval_to_ktime(struct timeval tv)


s64  ktime_to_ns(kt)

s64  ktime_to_us(kt)

s64  ktime_to_ms(kt)

struct timespec  ktime_to_timespec(kt)

struct timespec  ktime_to_timespec64(kt)

struct timeval  ktime_to_timeval(kt)


ktime_t   ktime_add(kt1, kt2)

ktime_t   ktime_sub(kt1, kt2)

ktime_t   ktime_add_ns(kt, unsigned long ns)

ktime_t   ktime_sub_ns(kt, unsigned long ns)

ktime_t   ktime_add_safe(ktime_t kt1, ktime_t kt2)


ktime_t  ktime_add_ms(kt, u64 ms)

ktime_t  ktime_sub_us(kt, u64 us)


int   ktime_compare(kt1, kt2)

bool  ktime_equal(kt1, kt2)

bool  ktime_after(kt1, kt2)

bool  ktime_before(kt1, kt2)




hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
timer->function = ftimer;

hrtimer_start(timer, ktime_set(0, ns), HRTIMER_MODE_REL);


hrtimer_cancel(timer);  // waits for the handler to finish


static enum hrtimer_restart ftimer(struct hrtimer *timer)


                        u64 missed = hrtimer_forward_now(timer, ktime_set(0, ns));




                        return HRTIMER_NORESTART;  // or HRTIMER_RESTART



struct hrtimer *timer;

clockid_t which_clock =  CLOCK_MONOTONIC or CLOCK_REALTIME;

enum hrtimer_mode mode =

HRTIMER_MODE_ABS                       // absolute

HRTIMER_MODE_REL                        // relative to now

HRTIMER_MODE_ABS_PINNED   // absolute + bound to this CPU

HRTIMER_MODE_REL_PINNED // relative + bound to this CPU


void   hrtimer_init(timer, which_clock, mode)

void   hrtimer_init_on_stack(timer, which_clock, mode)

void  destroy_hrtimer_on_stack(timer)

int   hrtimer_start(timer, kt, mode)

int   hrtimer_start_range_ns(timer, kt, unsigned long range_ns, mode)


u64   hrtimer_forward(timer, ktime_t now, ktime_t interval)

u64  hrtimer_forward_now(timer, ktime_t interval)


ktime_t  now = timer->base->get_time()

ktime_t  now = hrtimer_cb_get_time(timer)


void  hrtimer_set_expires(timer, kt)

void  hrtimer_set_expires_range(timer, kt, ktime_t delta)                                  // delta is extra tolerance

void  hrtimer_set_expires_range_ns(timer, kt, unsigned long delta_ns)

void  hrtimer_add_expires(timer, kt)

void  hrtimer_add_expires_ns(timer, u64 ns)


ktime_t  hrtimer_get_softexpires(timer)                                                   // without added tolerance (i.e. soft)

ktime_t  hrtimer_get_expires(timer)                                                            // with tolerance, i.e. soft + delta


ktime_t  hrtimer_expires_remaining(timer)

void   hrtimer_get_res(which_clock, struct timespec *tp)


Wait queues:


Replacement for sleep/wake.

Exclusive waiters (those with WQ_FLAG_EXCLUSIVE) count against wakeup limit counter, non-exclusive do not.


wait_queue_head_t   wq;

init_waitqueue_head(& wq);


Sample wait cycle:


[static] DEFINE_WAIT(mywait);                                             // create wait queue entry


add_wait_queue(&wq, &mywait);                  // add it to wait queue

while (! (condition))

        /* sets current task state */

        /*     TASK_INTERRUPTIBLE: signal wakes process up */

        /*     TASK_UNINTERRUPTIBLE: does not */

        prepare_to_wait(&wq, &mywait, TASK_INTERRUPTIBLE);

        if (signal_pending(current))
                ... handle signal (e.g. break out of the loop) ...

        schedule();                              // actually sleep until woken up


finish_wait(&wq, &mywait);                              // sets task state to TASK_RUNNING and removes mywait from wq


Note: instead of default_wake_function can supply own wake function in mywait.func(mywait.private).


Helper macros:


void   wait_event(&wq, condition)

Uninterruptible wait.


void   wait_event_lock_irq(&wq, condition, spinlock_t  lock)

void   wait_event_lock_irq_cmd(&wq, condition, spinlock_t  lock,  cmd)


Uninterruptible wait.
Condition checked under the lock.

Expected to be called with the lock held, returns with the lock held.

Command cmd is executed right before going to sleep.


int   wait_event_interruptible(&wq, condition)

int   wait_event_interruptible_lock_irq(&wq, condition, spinlock_t  lock)
int   wait_event_interruptible_lock_irq_cmd(&wq, condition, spinlock_t  lock,  cmd)

Interruptible wait.
Returns 0 if condition evaluated to true, or -ERESTARTSYS if was interrupted.


Condition checked under the lock.

Expected to be called with the lock held, returns with the lock held.

Command cmd is executed right before going to sleep.

long   wait_event_timeout(&wq, condition, timeout)                                                 // timeout is in jiffies


Uninterruptible wait with timeout.
Returns 0 if timeout elapsed, otherwise remaining time.

long   wait_event_interruptible_timeout(&wq, condition, timeout)                // timeout is in jiffies


Interruptible wait with timeout.
Returns -ERESTARTSYS if sleep was interrupted, 0 if timeout elapsed, otherwise remaining time.


int    wait_event_interruptible_exclusive(&wq, condition)


Interruptible exclusive wait.
Returns 0 if condition evaluated to true, or -ERESTARTSYS if was interrupted.


void     __add_wait_queue_exclusive(&wq, &mywait)

void     __add_wait_queue_tail_exclusive(&wq, &mywait)
void     add_wait_queue_exclusive(&wq, &mywait)

void     prepare_to_wait_exclusive(&wq, &mywait, TASK_[UN]INTERRUPTIBLE)


General routines/macros to set up for exclusive wait.


int   wait_event_interruptible_exclusive_locked(&wq, condition)

int   wait_event_interruptible_exclusive_locked_irq(&wq, condition)


Interruptible wait.

Must be called with wq.lock held.
Spinlock is unlocked while sleeping but condition testing is done while the lock

 is held and when this macro exits the lock is held.

      wait_event_interruptible_exclusive_locked uses  spin_lock()/spin_unlock().
wait_event_interruptible_exclusive_locked_irq uses  spin_lock_irq()/spin_unlock_irq().
Returns -ERESTARTSYS if sleep was interrupted, 0 if condition was satisfied.


int   wait_event_killable(&wq, condition)


Semi-interruptible wait (interruptible only by fatal signals).
Returns 0 if condition evaluated to true, or -ERESTARTSYS if was interrupted.


int   wait_on_bit(void *word, int bit, int (*action)(void *), unsigned sleepmode)
int   wait_on_bit_lock(void *word, int bit, int (*action)(void *), unsigned sleepmode)

wait_on_bit : wait for a bit to be cleared.

wait_on_bit_lock : wait for a bit to be cleared when wanting to set it.


action is a function used to sleep.


Wake up:


wake_up_all(&wq);                                                   // wakes up all waiters

wake_up(&wq);                                                           // wakes up up to one waiter, same as wake_up_nr(&wq, 1)
wake_up_nr(&wq,  5);                                             // wakes up up to the specified number of waiters


Only exclusive waiters (with WQ_FLAG_EXCLUSIVE) count against the counter.


/* same but with queue spinlock held */

unsigned long flags;

spin_lock_irqsave(&wq.lock, flags);

wake_up_locked(&wq);                                         // wakes up up to one waiter
wake_up_all_locked(&wq);                                     // wakes up all waiters
spin_unlock_irqrestore(&wq.lock, flags);


void   wake_up_bit(void* word, int bit)


Wake up waiter on a bit.


Remove from wait queue:


void   remove_wait_queue(&wq, &mywait)
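A minimal waiter/waker sketch using the helper macros above (my_wq, data_ready, consumer and producer are illustrative names; the condition should be updated under appropriate locking or atomically).

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int data_ready;                    /* the wait condition */

/* Consumer: sleeps until data_ready becomes non-zero. */
static int consumer(void)
{
        if (wait_event_interruptible(my_wq, data_ready != 0))
                return -ERESTARTSYS;      /* interrupted by a signal */
        /* ... consume the data ... */
        return 0;
}

/* Producer: publishes data, then wakes the waiters. */
static void producer(void)
{
        data_ready = 1;
        wake_up(&my_wq);
}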



Completion variables:


Wait queue plus "done" flag.


                 struct completion {

                             unsigned int done;

                             wait_queue_head_t wait;
                 };



struct completion  comp;

#define  INIT_COMPLETION(x)       ((x).done = 0)                                  // reinitialize for reuse

void   init_completion(& comp)


void                      wait_for_completion(&comp)

int                         wait_for_completion_interruptible(&comp)

int                         wait_for_completion_killable(&comp)

unsigned long   wait_for_completion_timeout(&comp, unsigned long timeout)

long                      wait_for_completion_interruptible_timeout(&comp, unsigned long timeout)

long                      wait_for_completion_killable_timeout(&comp, unsigned long timeout)

bool                     try_wait_for_completion(&comp)

bool                     completion_done(&comp)


void                       complete(&comp)

void                       complete_all(&comp)
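Minimal completion usage sketch (names are illustrative): one side blocks in wait_for_completion(), the other signals with complete(); reuse requires re-initialization (INIT_COMPLETION, see above).

#include <linux/completion.h>

static DECLARE_COMPLETION(my_comp);       /* defined and initialized at compile time */

/* Waiting side: blocks until complete() has been called. */
static void waiter(void)
{
        wait_for_completion(&my_comp);
        /* ... the awaited event has happened ... */
}

/* Signalling side, e.g. called from another thread or a work item. */
static void signaller(void)
{
        complete(&my_comp);               /* wake exactly one waiter */
}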


Atomic operations on 32-bit counters:


#define ACCESS_ONCE(x)    (*(volatile typeof(x) *)&(x))                     // note: does not imply memory barrier


typedef struct { volatile int counter; } atomic_t;


atomic_t v = ATOMIC_INIT(77);
int i;


int       atomic_read(pv)                                            // no implied barriers !

void    atomic_set(pv,  i)                                          // ...

void    atomic_add(i,  pv)                                       // ...
void    atomic_sub(i,  pv)                                        // ...
void    atomic_inc(pv)                                               // ...
void    atomic_dec(pv)                                              // ...


No implied barriers  (compiler or memory) !!
For some operations, use helper functions

void    smp_mb__before_atomic_dec()

void    smp_mb__after_atomic_dec()


void    smp_mb__before_atomic_inc()

void    smp_mb__after_atomic_inc()

For others use  smp_wmb, smp_rmb, smp_mb or barrier.


int       atomic_sub_and_test(i, pv)                 // true if result is zero, performs memory barriers before and after
int       atomic_dec_and_test(pv)                      // ...
int       atomic_inc_and_test(pv)                       // ...


int       atomic_add_return(i, pv)                      // return result, performs memory barriers before and after
int       atomic_sub_return(i, pv)                       // ...
int       atomic_inc_return(pv)                                                           // ...
int       atomic_dec_return(pv)                                                          // ...


int       atomic_add_negative(i, pv)                 // *pv += i;  return true if result is negative, performs memory barriers before and after


int       atomic_xchg(pv, i)                                      // returns previous value, performs memory barriers before and after

int       atomic_cmpxchg(pv, old, new)                                 // CAS, returns actual old value, performs memory barriers before and after


int       atomic_add_unless(pv, int a, int mark)                      // if (*pv != mark) { *pv += a;  return true; }  else { return false; }   and does barriers

void    atomic_clear_mask(bitmask,  pv)

void    atomic_set_mask(bitmask,  pv)


int      _atomic_dec_and_lock(pv , spinlock_t *lock)


Atomically decrement *pv and if it drops to zero, atomically acquire lock and return true.
If does not drop to zero, just decrement, do not acquire lock and return false.
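Sketch of a typical reference-count pattern built on the atomics above (object and function names are hypothetical); atomic_dec_and_test() supplies the barriers that order the final free after all prior accesses.

#include <linux/atomic.h>
#include <linux/slab.h>

/* Hypothetical reference-counted object; initialize refcnt with ATOMIC_INIT(1) or atomic_set(). */
struct my_obj {
        atomic_t refcnt;
        /* ... payload ... */
};

static void my_obj_get(struct my_obj *obj)
{
        atomic_inc(&obj->refcnt);
}

static void my_obj_put(struct my_obj *obj)
{
        /* atomic_dec_and_test() performs memory barriers before and after,
         * so the free is ordered after all earlier uses of the object. */
        if (atomic_dec_and_test(&obj->refcnt))
                kfree(obj);
}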

Atomic operations on 64-bit counters:

Most 32-bit architectures do not support 64-bit atomic operations, although x86_32 does. Where supported, atomic64_t and the atomic64_xxx() functions mirror the 32-bit atomic_t API.



Atomic operations on bits:


void    set_bit(unsigned long nr, volatile unsigned long *addr)                                                  // no barriers implied

void    clear_bit(unsigned long nr, volatile unsigned long *addr)                                              // ...

void    change_bit(unsigned long nr, volatile unsigned long *addr)                                         // ... (flips the bit)


                        void smp_mb__before_clear_bit(void)

                        void smp_mb__after_clear_bit(void)

int test_bit(unsigned long nr, __const__ volatile unsigned long *addr);


int       test_and_set_bit(unsigned long nr, volatile unsigned long *addr)                                  // returns boolean, performs memory barriers

int       test_and_clear_bit(unsigned long nr, volatile unsigned long *addr)

int       test_and_change_bit(unsigned long nr, volatile unsigned long *addr)


int       test_and_set_bit_lock(unsigned long nr, unsigned long *addr)                                                 // acquire-release semantics

void    clear_bit_unlock(unsigned long nr, unsigned long *addr)                                               // ...

Non-atomic operations on bits:


void    __set_bit(unsigned long nr, volatile unsigned long *addr)

void    __clear_bit(unsigned long nr, volatile unsigned long *addr)

void    __change_bit(unsigned long nr, volatile unsigned long *addr)


int       __test_and_set_bit(unsigned long nr, volatile unsigned long *addr)

int       __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr)

int       __test_and_change_bit(unsigned long nr, volatile unsigned long *addr)


void     __clear_bit_unlock(unsigned long nr, unsigned long *addr)                                        // implements release barrier


unsigned long    find_first_bit(const unsigned long *addr, unsigned long size)

unsigned long    find_first_zero_bit(const unsigned long *addr, unsigned long size)

unsigned long    find_next_bit(const unsigned long *addr, unsigned long size, unsigned long offset)


if not found, returns value >= size.


int     ffs(int x)
int     ffz(int x)
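Sketch of a simple slot allocator built on the bit operations above (bitmap name and size are made up): find_first_zero_bit() picks a candidate, test_and_set_bit() claims it atomically, retrying if another CPU raced in between.

#include <linux/bitops.h>

#define MY_NR_SLOTS 64
static unsigned long my_slots[BITS_TO_LONGS(MY_NR_SLOTS)];

static int alloc_slot(void)
{
        int slot;

        do {
                slot = find_first_zero_bit(my_slots, MY_NR_SLOTS);
                if (slot >= MY_NR_SLOTS)
                        return -1;                     /* no free slot */
        } while (test_and_set_bit(slot, my_slots));    /* retry if the bit was taken meanwhile */
        return slot;
}

static void free_slot(int slot)
{
        clear_bit(slot, my_slots);
}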


Memory barriers:

rmb, wmb, mb, read_barrier_depends  –  emit a barrier on both UP and SMP kernels
smp_rmb, smp_wmb, smp_mb, smp_read_barrier_depends  –  emit a barrier on SMP; on UP emit only barrier()
barrier()  –  compiler barrier only

p = ...
read_barrier_depends();                      // a weaker form of read barrier, only affecting dependent data
a = *p;

[can drift down]
LOCK barrier
[cannot drift up]

[cannot drift down]
UNLOCK barrier
[can drift up]

Therefore UNLOCK + LOCK = full MB, but LOCK + UNLOCK is not (may drift into the middle and get reverse-ordered there).

smp_mb__before_spinlock() does not let stores issued before the critical section drift inside it

set_current_state(TASK_UNINTERRUPTIBLE)  does: smp_mb(),  current->state = TASK_UNINTERRUPTIBLE
wake_up does smp_mb

mmiowb()  –  order writes to IO space (PCI etc.)
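Illustrative publish/consume pairing of smp_wmb()/smp_rmb() (variable and function names are hypothetical; depending on kernel version the barrier declarations come via <asm/system.h> or <asm/barrier.h>).

#include <linux/compiler.h>       /* ACCESS_ONCE */
#include <linux/kernel.h>

static int payload;
static int ready;

/* Producer (CPU A): publish the payload before setting the flag. */
static void publish(int value)
{
        payload = value;
        smp_wmb();                /* order the payload store before the flag store */
        ACCESS_ONCE(ready) = 1;
}

/* Consumer (CPU B): observe the flag, then read the payload. */
static int consume(void)
{
        if (!ACCESS_ONCE(ready))
                return -1;        /* illustrative "not yet" indication */
        smp_rmb();                /* order the flag load before the payload load */
        return payload;
}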



Spinlocks:


Not recursive.
Disables preemption on local CPU.
Optionally disables hardware interrupts or bottom-half processing.
While holding a spinlock, must not do anything that will cause sleep. Do not touch pageable (e.g. user) memory, kmalloc(GFP_KERNEL), any semaphore functions or any of the schedule functions, down_interruptible() and down().
Ok to call down_trylock() and up() as they never sleep.

spinlock_t lock = SPIN_LOCK_UNLOCKED;    // obsolete, incompatible with RT
Or:    DEFINE_SPINLOCK(lock);
Or:    void spin_lock_init(&lock);

Disable-restore interrupts on local CPU:

unsigned long flags;
spin_lock_irqsave(&lock, flags);
spin_unlock_irqrestore(&lock, flags);

Performs  local_irq_save(); preempt_disable().

If knew for sure interrupts were enabled, can do instead:

spin_lock_irq(&lock);
spin_unlock_irq(&lock);

Performs  local_irq_disable(); preempt_disable().

If the spinlock is never acquired from an interrupt handler, can use the form that does not disable interrupts:

spin_lock(&lock);
spin_unlock(&lock);
int spin_trylock(&lock);

Performs  preempt_disable().

Disable bottom half processing, but do not disable hardware interrupts:

spin_lock_bh(&lock);
spin_unlock_bh(&lock);
int spin_trylock_bh(&lock);

Performs  local_bh_disable(); preempt_disable().

Nested forms for lockdep:

spin_lock_nested(&lock, subclass)
spin_lock_irqsave_nested(&lock, flags, subclass)
spin_lock_nest_lock(&lock, nest_lock)               // signal: ok to take lock despite already holding nest_lock of the same class
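Minimal sketch of a spinlock shared between process context and a hardirq handler (lock and counter names are illustrative): process context must use the irqsave form, while the interrupt handler, where interrupts on this CPU are already off, can take the plain lock.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);
static unsigned long shared_counter;

/* Process context: disable interrupts, since the IRQ handler takes the same lock. */
static void bump_from_process(void)
{
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);
        shared_counter++;
        spin_unlock_irqrestore(&my_lock, flags);
}

/* Called from the interrupt handler: plain spin_lock is enough here. */
static void bump_from_irq(void)
{
        spin_lock(&my_lock);
        shared_counter++;
        spin_unlock(&my_lock);
}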


RW spinlocks:


Not recursive.
Disables preemption on local CPU.
Optionally disables hardware interrupts or bottom-half processing.
No reader -> writer conversion.
Readers are favored over writers and can starve them; in this case a plain spinlock_t may be preferable to rwlock_t.
While holding a spinlock, must not do anything that will cause sleep. Do not touch pageable (e.g. user) memory, kmalloc(GFP_KERNEL), any semaphore functions or any of the schedule functions, down_interruptible() and down().
Ok to call down_trylock() and up() as they never sleep.
rwlock_t lock = RW_LOCK_UNLOCKED;      // obsolete, incompatible with RT
Or:    DEFINE_RWLOCK(lock)
Or:    void rwlock_init(&lock);
Disable-restore interrupts on local CPU:
unsigned long flags;
read_lock_irqsave(&lock, flags);
read_unlock_irqrestore(&lock, flags);
write_lock_irqsave(&lock, flags);
write_unlock_irqrestore(&lock, flags);
Performs  local_irq_save(); preempt_disable().
If knew for sure interrupts were enabled, can do instead:

read_lock_irq(&lock);
read_unlock_irq(&lock);
write_lock_irq(&lock);
write_unlock_irq(&lock);

Performs  local_irq_disable(); preempt_disable().

If the spinlock is never acquired from an interrupt handler, can use the form that does not disable interrupts:

read_lock(&lock);
read_unlock(&lock);
write_lock(&lock);
write_unlock(&lock);
int read_trylock(&lock);
int write_trylock(&lock);

Performs  preempt_disable().

Disable bottom half processing, but do not disable hardware interrupts:

read_lock_bh(&lock);
read_unlock_bh(&lock);
write_lock_bh(&lock);
write_unlock_bh(&lock);

Performs  local_bh_disable(); preempt_disable().

Local-global spinlock:


·          Global data: very fast and scalable read locking, but very slow write locking.

·          Per-CPU data: fast write locking on local CPU, possible write locking of other CPU’s data, very slow write locking of all CPUs data.


struct lglock   lg;

void    lg_lock_init(&lg, “mylg”);

void    lg_local_lock(&lg);

void    lg_local_unlock(&lg);

void    lg_local_lock_cpu(&lg, int cpu);

void    lg_local_unlock_cpu(&lg, int cpu);

void    lg_global_lock(&lg);

void    lg_global_unlock(&lg);




Mutexes:


Sleepable lock (may sleep when acquiring, can sleep while holding it).
May not use in softirq or hardirq context.
Task may not exit with mutex held.

struct mutex mtx;
[static] DEFINE_MUTEX(mtx);

void   mutex_init(&mtx);
void   mutex_destroy(&mtx);

void   mutex_lock(&mtx)
int      mutex_trylock(&mtx)
void   mutex_unlock(&mtx)                                                                           // must be unlocked by owner only
int      mutex_is_locked(&mtx)


int      mutex_lock_interruptible(&mtx)                                                        // 0 if acquired, otherwise -EINTR

int      mutex_lock_killable(&mtx)                                                                 // …


int      atomic_dec_and_mutex_lock(atomic_t* cnt, &mtx)


If decrements cnt to 0, return true holding mutex.

Otherwise return false not holding mutex.
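A minimal usage sketch for the basic mutex API above (names are illustrative); valid only in process context, since acquiring the lock may sleep.

#include <linux/mutex.h>
#include <linux/errno.h>

static DEFINE_MUTEX(my_mtx);
static int shared_state;

static int update_state(int v)
{
        /* May sleep while waiting for the mutex. */
        if (mutex_lock_interruptible(&my_mtx))
                return -EINTR;            /* interrupted by a signal */
        shared_state = v;
        mutex_unlock(&my_mtx);            /* must be unlocked by the owner */
        return 0;
}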


If locking more than one mutex of the same lock validation class, lockdep will complain. To lock multiple mutexes of the same class, designate their subclasses and lock with the following functions.


unsigned int subclass;

void   mutex_lock_nested(&mtx, subclass)

void   mutex_lock_interruptible_nested(&mtx, subclass)

void   mutex_lock_killable_nested(&mtx, subclass)
void   mutex_lock_nest_lock(&mtx, nest_lock)           
// signal: ok to take mtx despite already holding nest_lock of the same class



Mutexes with priority inheritance:


struct rt_mutex rmx;
[static] DEFINE_RT_MUTEX(rmx);


void    rt_mutex_init(&rmx)

void    rt_mutex_destroy(&rmx)

void    rt_mutex_lock(&rmx)

int       rt_mutex_trylock(&rmx)

void    rt_mutex_unlock(&rmx)

int       rt_mutex_is_locked(&rmx)

int       rt_mutex_timed_lock(&rmx,  struct hrtimer_sleeper *timeout,  int detect_deadlock)


success = 0, otherwise -EINTR or -ETIMEDOUT or -EDEADLK


int       rt_mutex_lock_interruptible(&rmx,  int detect_deadlock)


success = 0, otherwise -EINTR or -EDEADLK




Semaphores:


Sleepable lock (may sleep when acquiring, can sleep while holding it).
May not use in softirq or hardirq context.

struct semaphore sem;
[static] DEFINE_SEMAPHORE(sem);

void    sema_init(&sem, int val)                                                       // val is maximum number of holders

void    down(&sem)                                                                           // acquire

int       down_trylock(&sem)                                                            // inverted result:  0 if acquired, 1 if did not

int       down_interruptible(&sem)                                                 // 0 or -EINTR

int       down_killable(&sem)                                                           // 0 or -EINTR

int       down_timeout(&sem, long jiffies)                                      // 0 or -ETIME

void    up(&sem)                                                                                // release

RW locks (RW semaphores):

Any number of readers, up to 1 writer.
Sleepable lock (may sleep when acquiring, can sleep while holding it).
May not use in softirq or hardirq context.
There is no interruptible/killable wait; all waits are uninterruptible.

struct rw_semaphore rws;
[static] DECLARE_RWSEM(rws);

void    init_rwsem(&rws)

void    down_read(&rws)                                                               // acquire

void    down_write(&rws)


int      down_read_trylock(&rws)                                                  // true = ok, false = failure

int      down_write_trylock(&rws)


void    up_read(&rws)                                                                      // release

void    up_write(&rws)


int      rwsem_is_locked(&rws)                                                      // true if locked


void    downgrade_write(&rws)                                                // downgrade write lock to read lock
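Minimal rwsem usage sketch (names are illustrative): concurrent readers, exclusive writer; both sides may sleep while waiting.

#include <linux/rwsem.h>

static DECLARE_RWSEM(my_rws);
static int config_value;

static int read_config(void)
{
        int v;

        down_read(&my_rws);       /* many readers may hold this concurrently */
        v = config_value;
        up_read(&my_rws);
        return v;
}

static void write_config(int v)
{
        down_write(&my_rws);      /* exclusive */
        config_value = v;
        up_write(&my_rws);
}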


If locking more than one RW semaphore of the same lock validation class, lockdep will complain. To lock multiple RW semaphores of the same class, designate their subclasses and lock with the following functions.


void    down_read_nested(&rws, int subclass)

void    down_write_nested(&rws, int subclass)

void    down_write_nest_lock(&rws, nest_lock)           // signal: ok to take rws despite already holding nest_lock of the same class



Sequential locks.

·          Good when lots of readers, few writers

·          Writers take priority over readers

·          So readers cannot starve writers

"Write" side (in write_seqlock_xxx) internally uses spinlock and runs locked code with spinlock held.
"Read" side does not use spinlock.
"Write" side is subject to lock validator, "read" side is not.

Internally increments a variable inside the lock. When it is odd, write lock is held.

Writer usage:

        write_seqlock(&sq);

        ... write data ...

        write_sequnlock(&sq);


Reader usage:

        do {

                start = read_seqbegin(&sq);

                ... read data ...

        } while (read_seqretry(&sq,  start));

seqlock_t sq;
[static] DEFINE_SEQLOCK(sq);
unsigned int start;

void   seqlock_init(& sq)

void   write_seqlock(&sq)

int      write_tryseqlock(&sq)                                                       // true if locked

void   write_sequnlock(&sq)


void   write_seqlock_irqsave(&sq, flags)

void   write_sequnlock_irqrestore(&sq, flags)


void    write_seqlock_irq(&sq)

void    write_sequnlock_irq(&sq)


void    write_seqlock_bh(&sq)

void    write_sequnlock_bh(&sq)

void     write_seqcount_begin(seqcount_t *s)                   // caller uses his own mutex or other lock, built-in spinlock is not used

void     write_seqcount_end(seqcount_t *s)                        // ...


void     write_seqcount_barrier(seqcount_t *s)              // invalidate in-progress read operations


start  =  read_seqbegin(const &sq)

int          read_seqretry(const &sq, unsigned start)


start  =  read_seqbegin_irqsave(&sq, flags)

int          read_seqretry_irqrestore(&sq, start, flags)


start  =  __read_seqcount_begin(const seqcount_t *s)                                   // like read_seqbegin, but does not have smp_rmb at the end

start  =  read_seqcount_begin(const seqcount_t *s)                                      // equivalent to read_seqbegin

start  =  raw_seqcount_begin(const seqcount_t *s)                                       // get seqcount without waiting for it to go even, plus smp_rmb

int            __read_seqcount_retry(const seqcount_t *s, unsigned start)    
// like read_seqretry, but does not issue smp_rmb
                                                                                                                                  // (caller must issue rmb before the call)

int           read_seqcount_retry(const seqcount_t *s, unsigned start)         // equivalent to read_seqretry




Lock validator (lockdep):


Every lock is assigned a class (= “key address”):

·          For statically allocated lock it is the address of the lock

·          For dynamic locks it is the spot of their init (spin_lock_init, mutex_init etc.) routine

Validator checks for:

·          ClassA lock taken after ClassB if previously locks of these two classes were ever taken in inverse order

·          ClassX lock is taken when ClassX lock is already held.

o    Exemption: use functions like mutex_lock_nested and specify subclass.
Validator treats “class+subclass” as a separate class.

·          Locks released in the order that is not reverse of their acquisition.
(A stack of currently-held locks is maintained, so any lock being released should be at the top of the stack; anything else means that something strange is going on.)

·          Any (spin)lock ever acquired by hardirq can never be acquired when interrupts (hardirqs) are enabled.

o    Conversely: any lock ever acquired with hardirqs enabled is “hardirq-unsafe”
and cannot be acquired in hardirq.

·          Any (spin)lock ever acquired by softirq can never be acquired when softirqs are enabled.

o    Conversely: any lock ever acquired with softirqs enabled is “softirq-unsafe” and cannot be acquired in softirq.

·          Cannot:  (holding) hardirq-safe -> (acquire) hardirq-unsafe.

o    Since hardirq-unsafe -> hardirq-safe (taking a hardirq-safe lock while holding a hardirq-unsafe one) is allowed, the reverse order would create a potential lock inversion.
This may be too strong a rule, but it is checked against nevertheless.

·          Cannot:  (holding) softirq-safe -> (acquire) softirq-unsafe.


·          Unloading of kernel modules causes class ids to become available for reuse and can produce false warnings and leakage of the class table (which can overflow and result in a message).

·          Static initialization of large number of locks (e.g. array of structures with per-entry lock) can cause class table to overflow. Solution: initialize locks dynamically.


·          /proc/lockdep_stats

·          /proc/lockdep

Kernel threads:


struct task_struct* task;


task = kthread_create(thread_func, void* data, const char* namefmt, ...);
if (! IS_ERR(task))  wake_up_process(task);


Initially SCHED_NORMAL, nice 0.
Can alter with sched_setscheduler[_nocheck](…).


// create thread bound to NUMA memory node

task = kthread_create_on_node(thread_func, void* data, int node /* e.g. cpu_to_node(cpu) */,

                              const char* namefmt, ...);
if (! IS_ERR(task))  wake_up_process(task);


task = kthread_run(thread_func, void* data, const char* namefmt, ...);          // create + wake_up_process


int retval = kthread_stop(task);


Returns the thread's return value, or -EINTR if the thread was never woken up (never started).


int thread_func(void* data)
{
    /* loop until kthread_stop has been called */
    while (! kthread_should_stop()) {
        ... do work or sleep ...
    }

    return retval;    /* return status to kthread_stop caller */
}
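
Putting it together (my_task and the "my_worker" name are made up; thread_func is the skeleton above):


struct task_struct* my_task = kthread_run(thread_func, NULL, "my_worker");
if (IS_ERR(my_task))
        return PTR_ERR(my_task);

/* ... later, e.g. on module exit ... */

int ret = kthread_stop(my_task);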






Priority ranges:


task_struct* p;



Overall priority (p->prio):            0 (highest) … MAX_PRIO-1 (lowest, 139)

    MAX_PRIO = 140


Real-time priorities:                  0 (highest) … MAX_RT_PRIO-1 (lowest, 99)


Normal (non-RT) priorities:            MAX_RT_PRIO (100) … MAX_PRIO-1 (139)


Default priority for irq threads (PREEMPT_RT):

    MAX_USER_RT_PRIO/2  (50)






NR_CPUS  –  maximum number of CPUs in the system              // eventually will be a run-time variable nr_cpu_ids (nr_cpu_ids <= NR_CPUS)


unsigned long my_percpu[NR_CPUS];


int cpu = get_cpu();                                                  // disable preemption and get current processor


put_cpu();                                                                 // enable preemption


int   smp_processor_id();                                        // unsafe if preemption is not disabled


Per-CPU data interface, static form usable only in kernel but not in loadable modules:


DECLARE_PER_CPU(vartype, varname)                         // declare per-CPU variable (not required locally, DEFINE_PER_CPU is ok)
DEFINE_PER_CPU(vartype, varname)                           // define and allocate per-CPU variable


DECLARE_PER_CPU_SHARED_ALIGNED(...)                 // cacheline aligned

DECLARE_PER_CPU_ALIGNED(...)                                   // cacheline aligned

DECLARE_PER_CPU_PAGE_ALIGNED(...)                      // page-aligned


EXPORT_PER_CPU_SYMBOL(varname)                      // export for use by modules


get_cpu_var(varname)++;                                     // disable preemption, access variable and increment it

put_cpu_var(varname);                                          // reenable preemption


per_cpu(varname, cpu) = 0;                                // access per-cpu variable instance for particular CPU (does not disable

                                                                                  //    preemption and provides by itself no locking)


Per-CPU data interface, dynamic form usable both in kernel and in loadable modules:


void*   alloc_percpu(vartype);                                             // on failure returns NULL

void*   __alloc_percpu(size_t size, size_t align);

void    free_percpu(void* data);


A* p = alloc_percpu(A);                                                       // same as below

A* p = __alloc_percpu(sizeof(A), __alignof__(A));             // ...


Then access the per-CPU instances with get_cpu_ptr(p) / put_cpu_ptr(p), or per_cpu_ptr(p, cpu) for a specific CPU.


this_cpu_read, this_cpu_write, this_cpu_add, this_cpu_sub etc. – operations on the current CPU's instance, performed safely with respect to preemption and interrupts
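
A minimal per-CPU counter sketch (my_hits is a made-up name):


static DEFINE_PER_CPU(unsigned long, my_hits);

this_cpu_inc(my_hits);                                   // bump the local CPU's instance (preemption/irq safe)

unsigned long total = 0;
int cpu;

for_each_possible_cpu(cpu)                               // sum all instances
        total += per_cpu(my_hits, cpu);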

CPU sets:


typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
cpumask_t  cpumask;

unsigned long* bp = cpumask_bits(& cpumask);


const struct cpumask *const   cpu_possible_mask;   // populatable, e.g. can be hot-plugged, fixed at boot time      

const struct cpumask *const   cpu_present_mask;     // populated

const struct cpumask *const   cpu_online_mask;       // available to scheduler

const struct cpumask *const   cpu_active_mask;        // available to migration (a subset of online, currently identical)


#define   num_online_cpus()                 cpumask_weight(cpu_online_mask)

#define   num_possible_cpus()              cpumask_weight(cpu_possible_mask)

#define   num_present_cpus()               cpumask_weight(cpu_present_mask)

#define   num_active_cpus()                  cpumask_weight(cpu_active_mask)


#define   cpu_online(cpu)                      cpumask_test_cpu((cpu), cpu_online_mask)

#define   cpu_possible(cpu)                   cpumask_test_cpu((cpu), cpu_possible_mask)

#define   cpu_present(cpu)                    cpumask_test_cpu((cpu), cpu_present_mask)
#define   cpu_active(cpu)                       cpumask_test_cpu((cpu), cpu_active_mask)

#define   cpu_is_offline(cpu)                 unlikely(!cpu_online(cpu))


cpumask_t  mask  = { CPU_BITS_NONE };
const struct cpumask* mask = cpu_none_mask;

const struct cpumask* mask = cpu_all_mask;
cpumask_set_cpu(unsigned int cpu, & mask)
cpumask_clear_cpu(unsigned int cpu, & mask)
int   cpumask_test_cpu(unsigned int cpu, & mask)
int   cpumask_test_and_set_cpu(unsigned int cpu, & mask)
int   cpumask_test_and_clear_cpu(unsigned int cpu, & mask)


cpumask_setall(& mask);

cpumask_and(&mask, &m1, &m2);
cpumask_or(&mask, &m1, &m2);
cpumask_xor(&mask, &m1, &m2);
cpumask_andnot(&mask, &m1, &m2);
cpumask_complement(&mask, &m1);

bool  cpumask_empty(& mask);
bool  cpumask_subset(& mask, & superset_mask);

#define cpu_isset(cpu, cpumask) test_bit((cpu), (cpumask).bits)


cpumask_copy(&dst, &src);

bool  cpumask_equal(&m1, &m2)


int    cpumask_parse_user(const char __user *buf, int len, &mask)                                           // 0 = ok, otherwise -errno
int    cpumask_parselist_user(const char __user *buf, int len, &mask)
int    cpulist_parse(const char *buf, & mask)


int    cpumask_scnprintf(char *buf, int len,  &mask)
int    cpulist_scnprintf(char *buf, int len,  &mask)


unsigned int   cpumask_first(&mask)                           // returns >= nr_cpu_ids if none
unsigned int   cpumask_next(int n, &mask)                // returns >= nr_cpu_ids if none
unsigned int   cpumask_next_zero(int n, &mask)      // returns >= nr_cpu_ids if none


for_each_cpu(cpu, mask) {...}                                         // iterate over every cpu in a mask
for_each_cpu_not(cpu, mask) {...}                                // iterate over every cpu not in a mask
for_each_cpu_and(cpu, m1, m2) {...}                           // iterate over every cpu in both masks


#define   for_each_possible_cpu(cpu)                       for_each_cpu((cpu), cpu_possible_mask)

#define   for_each_online_cpu(cpu)                                                   for_each_cpu((cpu), cpu_online_mask)

#define   for_each_present_cpu(cpu)                        for_each_cpu((cpu), cpu_present_mask)




Calling function on other CPUs:


void myfunc(void* info);


The handler function is invoked on remote CPUs in IPI interrupt context;
on the local CPU it is called with interrupts disabled.


The following functions must be called with interrupts enabled (or can deadlock).
They return 0 if successful, otherwise –errno (e.g. –ENXIO if CPU is not online).
If wait is true, waits for completion of the call.


int      smp_call_function_single(int cpuid, myfunc, void *info, int wait);
int      smp_call_function_any(&mask, myfunc, void *info, int wait);


void   smp_call_function_many(&mask, myfunc, void *info, bool wait);

Execute myfunc(info) on all CPUs in mask except current CPU.
Preemption must be disabled when calling smp_call_function_many.

May not call it from bottom-half or interrupt handler.


int      smp_call_function(myfunc, void *info, int wait);


Call myfunc(info) on all CPUs in cpu_online_mask except current CPU.
May not call smp_call_function from bottom-half or interrupt handler.
Temporarily disables preemption internally.


int      on_each_cpu(myfunc, void *info, int wait);


Call myfunc(info) on all CPUs in cpu_online_mask, including the current CPU.
May not call on_each_cpu from bottom-half or interrupt handler.
Temporarily disables preemption internally.

void   on_each_cpu_mask(&mask, myfunc, void *info, bool wait);


Call myfunc(info) on all CPUs in the mask including current CPU if it is in the mask.
May not call on_each_cpu_mask from bottom-half or interrupt handler.

Preemption must be disabled when calling on_each_cpu_mask.


void    on_each_cpu_cond(bool (*cond_func)(int cpu, void *info), myfunc, void *info, bool wait, gfp_t gfp_flags);


For each CPU in cpu_online_mask, invoke cond_func(cpu, info) locally; for those CPUs for which it returns true, call myfunc(info) remotely and optionally wait. gfp_flags are used to allocate a temporary CPU mask.

May not call on_each_cpu_cond from bottom-half or interrupt handler.
Temporarily disables preemption internally.
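
A minimal sketch of running a function on one particular CPU (remote_hello and the CPU number are made up):


static void remote_hello(void* info)
{
        /* runs in IPI (interrupt) context on the target CPU: must not sleep */
        pr_info("hello from CPU %d\n", smp_processor_id());
}

int err = smp_call_function_single(2, remote_hello, NULL, 1 /* wait */);
if (err)
        pr_warn("CPU 2 not available: %d\n", err);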


Stop Machine:


int     stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)


Freeze the machine on all CPUs and execute fn(data).

This schedules a thread to be run on each CPU in the cpus mask (or each online CPU if cpus is NULL) at the highest priority, which threads also disable interrupts on their respective CPUs. The result is that no one is holding a spinlock or inside any other preempt-disable region when fn runs. This is effectively a very heavy lock equivalent to grabbing every spinlock (and more).


fn(data) is executed only on one CPU, with interrupts disabled.
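
A minimal usage sketch (toggle_feature and feature_enabled are hypothetical):


static bool feature_enabled;

static int toggle_feature(void* data)
{
        /* machine is frozen: no other CPU is inside a preempt-disabled region */
        feature_enabled = *(bool*) data;
        return 0;
}

bool on = true;
int err = stop_machine(toggle_feature, &on, NULL);       /* NULL cpumask = run on any online CPU */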

Memory zones:



ZONE_DMA
pages that can be used for DMA

x86-32: below 16 MB


ZONE_DMA32
pages that can be used for DMA,
but by 32-bit devices only


ZONE_NORMAL
normal pages,
permanently mapped into kernel memory

on x86-32: 16 MB to 896 MB
on some other architectures: all memory


ZONE_HIGHMEM
pages not permanently mapped into kernel memory

on x86-32: all memory above 896 MB
on some other architectures: empty


ZONE_MOVABLE
(related to hot-plugging)


Normal allocations are usually satisfied from ZONE_NORMAL, but on shortage can satisfy from ZONE_DMA.

x64 has only ZONE_DMA, ZONE_DMA32 and ZONE_NORMAL.

Memory allocation (kmalloc):

void*   kmalloc(size_t size, gfp_t flags)

void*   kmalloc_array(size_t n, size_t size, gfp_t flags)

void*   kcalloc(size_t n, size_t size, gfp_t flags)                                    // zeroes out

void*   kzalloc(size_t size, gfp_t flags)                                                          // zeroes out


void*   kmalloc_node(size_t size, gfp_t flags, int node)                                          // from specific NUMA node

void*   kzalloc_node(size_t size, gfp_t flags, int node)                                            // zeroes out
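
Typical usage sketch (struct foo is hypothetical):


struct foo* p = kzalloc(sizeof(*p), GFP_KERNEL);
if (! p)
        return -ENOMEM;

/* ... use p ... */

kfree(p);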


Common flag combinations:


GFP_KERNEL
Default, most usual.
Can block and can be called only in process context.
Can do block IO and file system calls inside.


GFP_ATOMIC
High-priority allocation. Does not sleep.
Use in interrupt handlers, bottom halves (softirqs, tasklets), when holding a spinlock
or in other situations when you cannot sleep.
Will use the emergency pool if necessary.


GFP_NOWAIT
Like GFP_ATOMIC, but will not fall back to the emergency pool.


GFP_NOIO
Like GFP_KERNEL, but will not do block IO.
Use in block IO code to avoid recursion.


GFP_NOFS
Like GFP_KERNEL (can block and initiate disk IO), but will not do file system IO.
Use in file system code to avoid recursion.


GFP_USER
Normal allocation, can block.
Use to allocate memory for user-space processes.


GFP_HIGHUSER
Allocate from ZONE_HIGHMEM. Can block.
Used to allocate memory for user-space processes.


GFP_DMA
Allocate from ZONE_DMA.
Should only be used for kmalloc caches; otherwise use a slab cache created with SLAB_DMA.


GFP_DMA32
Allocate from ZONE_DMA32

Their meanings:




























Basic flags:


Allocate from ZONE_DMA


Allocate from ZONE_DMA32


Allocate from ZONE_HIGHMEM


Allocate from ZONE_MOVABLE


Allocator can sleep


High-priority: allocator can access emergency pools


Allocator can start block IO


Allocator can start filesystem IO


Request cold-cache pages instead of trying to return cache-warm pages


Allocator does not print failure warnings


Allocator repeats (once) the allocation if it fails, but the allocation can potentially fail


Allocator repeats allocation indefinitely, the allocation cannot fail.

Deprecated, no new users allowed.


Allocator never retries the allocation if it fails


User reserves


The allocator adds compound page metadata (used internally by hugetlb code)


Zero out page content


Allocator does not fall back on reserves (takes precedence over MEMALLOC if both are specified)


The allocator enforces "hardwall" cpuset boundaries


Allocate NUMA node-local memory only, no fallback


The allocator marks the pages reclaimable


Allocation comes from a memcg-accounted resource


Don't track with kmemcheck




On behalf of other node


Allocator intends to dirty page


Memory allocation (slab allocator):

cache (pool) -> set of slab blocks (each is several pages) -> objects in each slab

struct kmem_cache*   kmem_cache_create(
                                                 const char *name,            // cache name, shown in /proc/slabinfo
                                                 size_t size,                 // size of each object
                                                 size_t align,                // required alignment, often L1_CACHE_BYTES, or 0 if SLAB_HWCACHE_ALIGN
                                                 unsigned long flags,         // SLAB_xxx, see below
                                                 void (*ctor)(void*))         // can be NULL

void   kmem_cache_destroy(struct kmem_cache*  cache)


DEBUG: perform extensive check on free


DEBUG: insert "red zones" around allocated memory to help detect buffer overruns


DEBUG: fill a slab with known value (0xA5) to catch access to uninitialized memory


Align objects within a slab to a cache line


Allocate from ZONE_DMA


DEBUG: store the last owner for bug hunting


Panic if allocation fails


Defer freeing slabs to RCU (read more in header file)


Spread some memory over cpuset


Trace allocations and frees




Avoid kmemleak tracing




Fault injection mark


Objects are reclaimable


Objects are short-lived


void*    kmem_cache_alloc(cache, gfp_t flags);
void*    kmem_cache_alloc_node(cache, gfp_t flags, int node)
void      kmem_cache_free(cache, void* p);
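
A minimal cache lifecycle sketch (struct foo and the cache name are hypothetical):


static struct kmem_cache* foo_cache;

foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
                              SLAB_HWCACHE_ALIGN, NULL);
if (! foo_cache)
        return -ENOMEM;

struct foo* p = kmem_cache_alloc(foo_cache, GFP_KERNEL);
if (p) {
        /* ... use p ... */
        kmem_cache_free(foo_cache, p);
}

kmem_cache_destroy(foo_cache);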

Memory allocation (virtually contiguous pages):

Allocated pages are virtually contiguous and not necessarily physically contiguous.
Less performant than kmalloc() since page tables need to be updated and TLBs flushed.

void*     vmalloc(unsigned long size);

void*     vzalloc(unsigned long size);                                                             // zeroed out

void       vfree(const void* addr);


void*     vmalloc_user(unsigned long size);                                                  // memory for user-space, zeroed out


void*     vmalloc_node(unsigned long size, int node);                                // allocate pages in specific NUMA node

void*     vzalloc_node(unsigned long size, int node);


void*     vmalloc_exec(unsigned long size);                                                 // executable memory

void*     vmalloc_32(unsigned long size);                                                    // 32-bit addressable memory

void*     vmalloc_32_user(unsigned long size);                                          // 32-bit addressable memory for user-space, zeroed-out


void*     vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot);

void        vunmap(const void* addr);                                                            // release mapping


void        vmalloc_sync_all(void);

Memory allocation (physical pages):

struct page*  alloc_pages(gfp_t  flags,  unsigned int order)

Allocate (2 ^ order)  contiguous physical pages, return pointer to the first page's page structure.
On failure returns NULL.

void*                   page_address(struct page*  page)                            // get the page's kernel virtual address

unsigned long     __get_free_pages(gfp_t gfp_flags, unsigned int order)
unsigned long     __get_free_page(flags)

Same as alloc_pages, but return the kernel virtual (logical) address of the first page, or 0 on failure.

unsigned long    get_zeroed_page(gfp_t flags)

                Allocate one zeroed page, return its kernel virtual address or 0.

#define __get_dma_pages(flags, order)    __get_free_pages((flags) | GFP_DMA, (order))

void     __free_pages(struct page *page, unsigned int order)
void     __free_page(struct page *page)

void     free_pages(unsigned long addr, unsigned int order)
void     free_page(unsigned long addr)

void*   alloc_pages_exact(size_t size, gfp_t flags)
void     free_pages_exact(void* virt_addr, size_t size)

Allocate contiguous physical pages to hold size bytes, return the kernel virtual address or NULL on failure.

struct page*    alloc_pages_node(int nid, gfp_t flags, unsigned int order)
struct page*    alloc_pages_exact_node(int nid, gfp_t flags, unsigned int order)
void*                               alloc_pages_exact_nid(int nid, size_t size, gfp_t flags)
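
A small sketch allocating and freeing a single zeroed page:


unsigned long addr = get_zeroed_page(GFP_KERNEL);
if (! addr)
        return -ENOMEM;

/* ... use the page via (void*) addr ... */

free_page(addr);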

Mapping physical pages into kernel space:

void*   kmap(struct page* page)


If page is in low memory (ZONE_NORMAL, ZONE_DMA) its virtual address is simply returned.

If page is in ZONE_HIGHMEM, a mapping is created and its virtual address is returned.
May sleep.


void      kunmap(struct page* page)


Unmap mapping created by kmap.


void*    kmap_atomic(struct page* page)

void      kunmap_atomic(void* kvaddr)

kmap_atomic/kunmap_atomic is significantly faster than kmap/kunmap because no global lock is needed and because the kmap code must perform a global TLB invalidation when the kmap pool wraps.


However when holding an atomic kmap it is not legal to sleep, so atomic kmaps are appropriate for short, tight code paths only.

struct page*   kmap_to_page(void *  kva)
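
A short sketch of copying data out of a (possibly highmem) page:


static void copy_from_page(struct page* page, void* dst, size_t offset, size_t len)
{
        void* vaddr = kmap_atomic(page);         /* usable in atomic context */
        memcpy(dst, vaddr + offset, len);        /* no sleeping between map and unmap */
        kunmap_atomic(vaddr);
}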





unsigned long pfn = page_to_pfn(page);


if (pfn_valid(pfn))

       page = pfn_to_page(pfn);


void get_page(page);


void put_page(page);


void *kva = page_address(page)

       get kernel virtual address for a page in the lowmem zone

       or for a highmem page currently mapped into the kernel address space

       otherwise returns NULL


                               if (virt_addr_valid(kva))                                                                    // for kva in lowmem zone

                                                             page = virt_to_page(kva);


Pin user pages in memory:


       note: obsolete, should use get_user_pages_locked|unlocked|fast


struct page** pages;                                   // array with room for npages entries


long nr = get_user_pages(current, current->mm, (unsigned long) buf, npages, write, force, pages, NULL);


....  modify ...
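
Typical cleanup once done with the pinned pages (a sketch continuing the snippet above):


for (i = 0; i < nr; i++) {
        if (write)
                set_page_dirty_lock(pages[i]);   /* pages were modified */
        put_page(pages[i]);                      /* drop the pin */
}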






Map kernel memory into userspace (e.g. for mmap):



err = remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,

             unsigned long pfn, unsigned long size, pgprot_t prot)


err = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, vma->vm_end - vma->vm_start, vma->vm_page_prot)


Map kernel memory to userspace page-by-page (vm_insert_page, e.g. in an mmap handler):


static struct vm_operations_struct xxx_mmap_ops = {

       .open  =              xxx_mm_open,

       .close  =        xxx_mm_close,



static void xxx_mm_open(struct vm_area_struct *vma)


       struct file *file = vma->vm_file;

       struct socket *sock = file->private_data;

       struct sock *sk = sock->sk;


       if (sk)




static void xxx_mm_close(struct vm_area_struct *vma)


       struct file *file = vma->vm_file;

       struct socket *sock = file->private_data;

       struct sock *sk = sock->sk;


       if (sk)







                               if (vma->vm_pgoff)

                                                             return -EINVAL;


                               size = vma->vm_end - vma->vm_start;


                               start = vma->vm_start;

                               for (i = 0; i < ....; i++) {

                                                             struct page *page = virt_to_page(po->pg_vec[i]);

                                                             int pg_num;


                                                             for (pg_num = 0; pg_num < ....; pg_num++, page++) {

                                                                                            err = vm_insert_page(vma, start, page);

                                                                                            if (unlikely(err))

                                                                                                                          goto out;

                                                                                            start += PAGE_SIZE;



                               vma->vm_ops = &xxx_mmap_ops;



Memory leak detection (kmemleak):


Page allocations and ioremap are not tracked.

Only kmalloc, kmem_cache_alloc (slab allocations) and vmalloc.




To disable by default: build with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y (or boot with kmemleak=off).


To enable at boot: boot with kmemleak=on


mount  -t debugfs  nodev  /sys/kernel/debug

cat  /sys/kernel/debug/kmemleak


Clear the list of all current possible memory leaks:


echo clear > /sys/kernel/debug/kmemleak


Trigger an intermediate memory scan:


echo scan > /sys/kernel/debug/kmemleak

cat  /sys/kernel/debug/kmemleak


Disable default thread that scans every 10 minutes:


echo 'scan=off' > /sys/kernel/debug/kmemleak

Dump the information about object at <addr>:


echo 'dump=<addr>' > /sys/kernel/debug/kmemleak


Also  /proc/slabinfo


HZ ticks (jiffies) per second, typically 100 or 1000; can also be tickless (NOHZ).
User-visible value is USER_HZ, so when communicating to userland scale like x  / (HZ / USER_HZ).

Jiffies comes in unsigned long and 64-bit variants (jiffies, jiffies_64).

unsigned long volatile   jiffies;                           // wraps around if BITS_PER_LONG < 64
u64                                    jiffies_64;

On x86 they are separate (but jiffies can overlay lower word of jiffies_64).
On x64 they are the same thing.

jiffies may be read atomically.
jiffies_64 on 32-bit architectures (BITS_PER_LONG < 64) must be read under seqlock_t  jiffies_lock.

Can always use get_jiffies_64(), on 32-bit architectures it reads data under seqlock, on 64-bit architectures just reads it.

#define  time_after(a,b)                           ((long)(b) - (long)(a) < 0)

#define  time_before(a,b)                       time_after(b,a)


#define  time_after_eq(a,b)                   ((long)(a) - (long)(b) >= 0)

#define  time_before_eq(a,b)                time_after_eq(b,a)


#define  time_in_range(a,b,c)                (time_after_eq(a,b) && time_before_eq(a,c))


#define  time_in_range_open(a,b,c)      (time_after_eq(a,b) && time_before(a,c))


#define  time_after64(a,b)                     ((__s64)(b) - (__s64)(a) < 0)

#define  time_before64(a,b)                  time_after64(b,a)


#define  time_after_eq64(a,b)               ((__s64)(a) - (__s64)(b) >= 0)

#define  time_before_eq64(a,b)            time_after_eq64(b,a)

#define  time_before_eq64(a,b)            time_after_eq64(b,a)


#define  time_is_before_jiffies(a)          time_after(jiffies, a)

#define  time_is_after_jiffies(a)             time_before(jiffies, a)

#define  time_is_before_eq_jiffies(a)    time_after_eq(jiffies, a)

#define  time_is_after_eq_jiffies(a)       time_before_eq(jiffies, a)


unsigned int            jiffies_to_msecs(const unsigned long j);

unsigned int            jiffies_to_usecs(const unsigned long j);

unsigned long         msecs_to_jiffies(const unsigned int m);

unsigned long         usecs_to_jiffies(const unsigned int u);

unsigned long         timespec_to_jiffies(const struct timespec *value);

void                         jiffies_to_timespec(const unsigned long jiffies, struct timespec *value);

unsigned long         timeval_to_jiffies(const struct timeval *value);

void                         jiffies_to_timeval(const unsigned long jiffies, struct timeval *value);


clock_t                    jiffies_to_clock_t(unsigned long x);

clock_t                    jiffies_delta_to_clock_t(long delta)


unsigned long         clock_t_to_jiffies(unsigned long x);

u64                          jiffies_64_to_clock_t(u64 x);

u64                          nsec_to_clock_t(u64 x);

u64                          nsecs_to_jiffies64(u64 n);

unsigned long         nsecs_to_jiffies(u64 n);


struct timespec {

        __kernel_time_t    tv_sec;
        long                          tv_nsec;



long remaining = schedule_timeout(10 * HZ)

unsigned long  wait_till  =  jiffies + 5 * HZ;
while (time_before(jiffies, wait_till))
        cond_resched();                                                                       // invoke scheduler if need_resched is true, e.g. there is some higher-priority task

void   mdelay(unsigned long msecs)              // busy-wait using calibrated loops
void   udelay(unsigned long usecs)
void   ndelay(unsigned long nsecs)

const enum hrtimer_mode mode = HRTIMER_MODE_REL;

ktime_t expires;


int  schedule_hrtimeout(&expires, mode)

int  schedule_hrtimeout_range(&expires, unsigned long delta, mode)

int  schedule_hrtimeout_range_clock(&expires, unsigned long delta, mode, int clock)
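
A minimal relative high-resolution sleep of ~2 ms (a sketch):


ktime_t expires = ktime_set(0, 2 * NSEC_PER_MSEC);       /* 2 ms */

set_current_state(TASK_UNINTERRUPTIBLE);
schedule_hrtimeout(&expires, HRTIMER_MODE_REL);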

On  timer  tick:



 * Event handler for periodic ticks


void tick_handle_periodic(struct clock_event_device *dev)


          int cpu = smp_processor_id();

          ktime_t next;




          if (dev->mode != CLOCK_EVT_MODE_ONESHOT)



           * Setup the next period for devices, which do not have

           * periodic mode:


          next = ktime_add(dev->next_event, tick_period);

          for (;;) {

                   if (!clockevents_program_event(dev, next, false))



                    * Have to be careful here. If we're in oneshot mode,

                    * before we call tick_periodic() in a loop, we need

                    * to be sure we're using a real hardware clocksource.

                    * Otherwise we could get trapped in an infinite

                    * loop, as the tick_periodic() increments jiffies,

                     * which then will increment time, possibly causing

                    * the loop to trigger again and again.


                   if (timekeeping_valid_for_hres())


                   next = ktime_add(next, tick_period);





 * Periodic tick


static void tick_periodic(int cpu)


          if (tick_do_timer_cpu == cpu) {



                   /* Keep track of the next tick event */

                   tick_next_period = ktime_add(tick_next_period, tick_period);










 * Must hold jiffies_lock


void do_timer(unsigned long ticks)


          jiffies_64 += ticks;






 * Called from the timer interrupt handler to charge one tick to the current

 * process.  user_tick is 1 if the tick is user time, 0 for system.


void update_process_times(int user_tick)


          struct task_struct *p = current;

          int cpu = smp_processor_id();


          /* Note: this timer irq context must be accounted for as well. */

          account_process_tick(p, user_tick);


          rcu_check_callbacks(cpu, user_tick);



          if (in_irq())








 * Called by the local, per-CPU timer interrupt on SMP.


void run_local_timers(void)







 * Account a single tick of cpu time.

 * @p: the process that the cpu time gets accounted to

 * @user_tick: indicates if the tick is a user or a system tick


void account_process_tick(struct task_struct *p, int user_tick)


          cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);

          struct rq *rq = this_rq();


          if (sched_clock_irqtime) {

                   irqtime_account_process_tick(p, user_tick, rq);




          if (steal_account_process_tick())



          if (user_tick)

                   account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);

          else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))

                   account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,





Print messages, dump stack etc.:


printk(KERN_WARNING "format string", args...)





circular buffer, default size 16K (LOG_BUF_LEN), configurable with CONFIG_LOG_BUF_SHIFT








#include <linux/ratelimit.h>

// deprecated since rate-limiting state is global for all printk_ratelimit() call sites

if (printk_ratelimit())



// better way: limit for a particular call site

printk_ratelimited(KERN_WARNING "format-string", args)


pr_xxx(format, args):
















pr_devel if DEBUG is defined



KERN_CONT = continue line with no NL char




to enable all messages: echo 8  >/proc/sys/kernel/printk

default is "warning (4) and up"

content of /proc/sys/kernel/printk (full):  <current>  <default>  <minimum>  <boot-time-default>

on boot: loglevel=8



printk_once(...), pr_emerg_once(...) and other pr_xxx_once(...)











For drivers:


#include <linux/device.h>


dev_xxx(const struct device *dev, fmt, args)

e.g. dev_dbg(...)


dev_xxx_ratelimited(dev, fmt, args)


dev_WARN(...), dev_WARN_ONCE(...) => includes file/lineno and backtrace


How to selectively enable/disable pr_debug/dev_dbg call sites (via debugfs):
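
Typical control commands (a sketch; requires CONFIG_DYNAMIC_DEBUG and a mounted debugfs; mydriver.c and mymodule are placeholders):


echo 'file mydriver.c +p'  > /sys/kernel/debug/dynamic_debug/control      # enable pr_debug/dev_dbg in one file
echo 'module mymodule -p'  > /sys/kernel/debug/dynamic_debug/control      # disable them for a whole module
cat /sys/kernel/debug/dynamic_debug/control                               # list all call sites and their state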






printk formats:


                                     int                            %d  or %x
                                     unsigned int                   %u  or %x
                                     long                           %ld or %lx
                                     unsigned long                  %lu or %lx
                                     long long                      %lld or %llx
                                     unsigned long long             %llu or %llx
                                     size_t                         %zu or %zx
                                     ssize_t                        %zd or %zx
                                     s32                            %d  or %x
                                     u32                            %u  or %x
                                     s64                            %lld or %llx
                                     u64                            %llu or %llx
                                     pointer                        %p


Also has formats for:


symbolic decoding of pointers (symbol + offset)

physical addresses (phys_addr_t)

DMA addresses (dma_addr_t)

raw buffers as hex string

MAC/FDDI addresses

IPv4/IPv6 addresses


dentry names

struct clk



Printing to ftrace buffer:


#include <linux/kernel.h>


trace_puts(str)  => extra fast






Red-black tree – self-balancing (semi-balanced) binary search tree:


·          each non-leaf node has 1 or 2 children

·          value(left child) < value(node) < value(right child)

·          depth(deepest path) <= 2 * depth(shallowest path)


struct rb_root {

          struct rb_node *rb_node;


struct rb_node {

          unsigned long  __rb_parent_color;

          struct rb_node *rb_right;

          struct rb_node *rb_left;



Typical user node:


struct mytype {

    struct rb_node node;

    char* mystring;


struct rb_root root = RB_ROOT;                  // { NULL }


#define rb_parent(r)   ((struct rb_node *)((r)->__rb_parent_color & ~3))


#define rb_entry(ptr, type, member) container_of(ptr, type, member)


void  rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root)

void  rb_link_node(struct rb_node * node, struct rb_node * parent, struct rb_node ** rb_link)




  struct mytype *my_search(struct rb_root *root, char *string)


                       struct rb_node *node = root->rb_node;


                       while (node) {

                                               struct mytype *data = container_of(node, struct mytype, node);

                                                int result;


                                                result = strcmp(string, data->mystring);


                                                if (result < 0)

                                                                       node = node->rb_left;

                                                else if (result > 0)

                                                                       node = node->rb_right;

                                                else

                                                                       return data;


                        return NULL;



Remove an existing node:


  struct mytype *data = my_search(&mytree, "walrus");


  if (data) {

                       rb_erase(&data->node, &mytree);           // void rb_erase(struct rb_node *victim, struct rb_root *tree)




To replace an existing node with the new one with the same key:

void rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root);


                        Inserting into tree:


  int my_insert(struct rb_root *root, struct mytype *data)


                       struct rb_node **new = &(root->rb_node), *parent = NULL;


                       /* Figure out where to put new node */

                       while (*new) {

                                               struct mytype *this = container_of(*new, struct mytype, node);

                                               int result = strcmp(data->mystring, this->mystring);


                                                parent = *new;

                                               if (result < 0)

                                                                       new = &((*new)->rb_left);

                                                else if (result > 0)

                                                                       new = &((*new)->rb_right);

                                                else

                                                                       return FALSE;



                       /* Add new node and rebalance tree. */

                       rb_link_node(&data->node, parent, new);

                       rb_insert_color(&data->node, root);


                        return TRUE;



Iterating through the tree:


struct rb_node*  rb_first(const struct rb_root*)

struct rb_node*  rb_last(const struct rb_root*)

struct rb_node*  rb_next(const struct rb_node*)

struct rb_node*  rb_prev(const struct rb_node*)


  struct rb_node *xnode;

  for (xnode = rb_first(&mytree);  xnode;  xnode = rb_next(xnode))

           printk("key=%s\n", rb_entry(xnode, struct mytype, node)->mystring);




Radix tree: maps "unsigned long" to "void*"


 #include <linux/radix-tree.h>


RADIX_TREE(name, gfp_mask);


struct radix_tree_root my_tree;

INIT_RADIX_TREE(my_tree, gfp_mask);


struct radix_tree_root *tree;

unsigned long key;

void *item;


int  radix_tree_insert(tree, key, item);

item = radix_tree_lookup(tree, key);

item = radix_tree_delete(tree, key);
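
A small insertion sketch with preloading (my_tree, my_lock, key and item are placeholders; the caller must provide its own locking around tree updates):


if (radix_tree_preload(GFP_KERNEL) == 0) {       /* preallocate nodes while sleeping is still allowed */
        spin_lock(&my_lock);
        err = radix_tree_insert(&my_tree, key, item);
        spin_unlock(&my_lock);
        radix_tree_preload_end();
}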



struct kobject {

          const char*                  name;

          struct list_head             entry;

          struct kobject*              parent;

          struct kset*                 kset;

          struct kobj_type*            ktype;

          struct sysfs_dirent*         sd;

          struct kref                  kref;

          unsigned int    state_initialized:1;

          unsigned int    state_in_sysfs:1;

          unsigned int    state_add_uevent_sent:1;

          unsigned int    state_remove_uevent_sent:1;

          unsigned int    uevent_suppress:1;



          struct kobj_type {

                   void (*release)(struct kobject *kobj);

                   const struct sysfs_ops *sysfs_ops;

                   struct attribute **default_attrs;

                   const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);

                   const void *(*namespace)(struct kobject *kobj);


struct kobject*   kobject_get(struct kobject *kobj)


If kobj is not NULL, increment reference counter and return kobj.


void    kobject_put(struct kobject *kobj)


If kobj is not NULL, decrement reference counter, if goes to zero then release the object.


kobject/kset sample: LINUX_ROOT/samples/kobject



kobj_map – handles device numbers.
There are two kobj_maps: one for block devices and one for character devices.
Each struct probe describes device number range starting at dev with length range.

struct kobj_map *    kobj_map_init(kobj_probe_t *base_probe, struct subsystem *s)

int      kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
                              struct module *owner, kobj_probe_t *get, int (*lock)(dev_t, void *), void *data)

void    kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range)

struct kobject *    kobj_lookup(struct kobj_map *domain, dev_t dev, int *index)

Finds the given device number dev on the given map domain. If the owner field is set, we temporarily take a reference on the corresponding module via try_module_get(owner) in order to protect the lock and get calls. If the lock function is present it will be called, and the present probe skipped if it returns an error. Then the get function is called to get the kobject for the given device number. The resulting kobject is returned as value, and the offset in the interval of device numbers is returned via index; module is released via module_put.


Lockup detection:





Default is 10 seconds, meaning 10 seconds for hard lockup detection and 20 seconds for soft lockup detection.


Hard lockup is lock up in kernel mode with interrupts disabled.

Soft lockup is lock up in kernel mode not letting other tasks to run.


Code is in kernel/watchdog.c

Docs in Documentation/lockup-watchdogs.txt


Using floating point in kernel:

#include <asm/i387.h>
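
The usual pattern (FP/SIMD use must stay between the two calls and must not sleep):


kernel_fpu_begin();              // saves FPU state, disables preemption
/* ... FP / SSE / AVX instructions here ... */
kernel_fpu_end();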




For read-mostly data, when transient reader/writer inconsistency ok.
Decreases read-side overhead, increases write-side overhead.


#include <linux/rcupdate.h>


// read-side RCU critical section:

//   - cannot sleep, but can be preempted if CONFIG_PREEMPT_RCU

//   - can be nested


static struct xxx_struct* __rcu xptr;



rcu_read_lock();

p = rcu_dereference(xptr);

// p is valid here

rcu_read_unlock();

// p is invalid here


// update section (synchronous, waiting for RCU sync)

// use spinlocks, semaphores etc. to interlock updaters

// cannot be called from irq or bh context



p = xptr;

rcu_assign_pointer(xptr, new_p);


synchronize_rcu();                                                       // wait for grace period

kfree(p);                                                                // old copy may now be freed



// update section (asynchronous,non-waiting for RCU sync)

// use spinlocks, semaphores etc. to interlock updaters



p = xptr;

rcu_assign_pointer(xptr, new_p);


call_rcu(&p->x_rcu, xxx_rcu_func);


struct xxx_struct {

        struct rcu_head x_rcu;



void xxx_rcu_func(struct rcu_head *p_rcu_head)
{
       struct xxx_struct *p = container_of(p_rcu_head, struct xxx_struct, x_rcu);

       kfree(p);
}




Note:  RCU callbacks can be invoked in softirq context. Therefore they cannot block, and any lock acquired within an RCU callback must be

acquired elsewhere with spin_lock_irq or spin_lock_bh, to avoid self-deadlock. Watch out for locks indirectly acquired

by API functions (e.g. kfree etc.) called by the callback!


Any lock acquired by RCU callback must be acquired elsewhere with softirq disabled (e.g. spin_lock_bh, spin_lock_irqsave), or self-deadlock will result.


RCU callbacks are usually executed on the same CPU, as scheduled by call_rcu_xxx, but this is not guaranteed (for example, if CPU goes offline, pending callbacks will be transferred to another CPU for execution).


RCU read-side primitives do not necessarily contain memory barriers.  Therefore CPU and compiler may reorder code into and out of RCU

read-side critical sections.  It is the responsibility of the RCU update-side primitives to deal with this.


What is RCU, Fundamentally?

What is RCU? Part 2: Usage

RCU part 3: the RCU API

The RCU API, 2010 Edition


"Is parallel programming hard..."


enumerate list entries


list_for_each_entry(p, &xxx_listhead, list) {





list_for_each_entry_rcu(p, &xxx_listhead, list) {




add list entry


if (p->condition)

                        list_add(&p->list, xxx_listhead);


                        list_add_tail(&p->list, xxx_listhead);



if (p->condition)

                        list_add_rcu(&p->list, xxx_listhead);


                        list_add_tail_rcu(&p->list, xxx_listhead);


delete list entry


list_for_each_entry(p, &xxx_listhead, list) {

                        if (p->condition) {








list_for_each_entry_rcu(p, &xxx_listhead, list)  {

                        if (p->condition) {



                                                list_del_rcu(&p->list);

                                                call_rcu(&p->rcu, xxx_free);





in-place updates


list_for_each_entry(p, &xxx_listhead, list) {

                        if (p->condition) {

                                                p->a = new_a;

                                                p->b = new_b;







list_for_each_entry_rcu(p, &xxx_listhead, list)  {

                        if (p->condition) {

                                                np = copy(p);

                                                if (np == NULL)   ...

                                                np->a = new_a;

                                                np->b = new_b;

                                                list_replace_rcu(&p->list, &np->list);


                                                call_rcu(&p->rcu, xxx_free);





eliminating stale data

If stale data cannot be tolerated, use per-entry “deleted” flag and per-entry spinlock, and re-verify the validity of data.


RCU flavors compared (RCU Classic, RCU BH, RCU Sched, Realtime RCU, SRCU, QRCU):


Purpose:

·          RCU Classic:     wait for RCU read-side critical sections

·          RCU BH:          wait for RCU-BH read-side critical sections & irqs

·          RCU Sched:       wait for RCU-Sched read-side critical sections, preempt-disable regions, hardirqs & NMIs

·          Realtime RCU:    realtime response

·          SRCU:            wait for SRCU read-side critical sections, allow sleeping readers

·          QRCU:            sleeping readers and fast grace periods


Read-side primitives:                                        …  (and friends)

Update-side primitives:                                      …

Update-side primitives (synchronous expedited):              synchronize_srcu_expedited() for SRCU;  …

Update-side primitives (wait for callbacks):                 …


Read-side constraints:

·          RCU Classic:     no blocking except preemption and spinlock acquisition

·          RCU BH:          no bh enabling

·          RCU Sched:       no blocking

·          Realtime RCU:    no blocking except preemption and lock acquisition

·          SRCU:            no synchronize_srcu()

·          QRCU:            no synchronize_qrcu()


Read-side overhead:

·          RCU Classic:     preempt disable/enable (free on non-PREEMPT)

·          RCU BH:          BH disable/enable

·          RCU Sched:       preempt disable/enable (free on non-PREEMPT)

·          Realtime RCU:    simple instructions, irq disable/enable

·          SRCU:            simple instructions, preempt disable/enable

·          QRCU:            atomic increment and decrement of a shared variable


Asynchronous update-side overhead (for example, call_rcu()):    …

Grace-period latency:    10s of milliseconds for all flavors, except QRCU – 10s of nanoseconds in absence of readers


Non-PREEMPT_RT implementation:    RCU Classic,  …,  RCU Classic,  …

PREEMPT_RT implementation:        …,  Realtime RCU,  Forced Schedule on all CPUs,  Realtime RCU,  …

Critical sections

Grace period







rcu_read_lock_bh_held