Macros, types and conventions:


long  IS_ERR(const void *ptr)            // pointer has a value in the "error" range (>= (unsigned long) -MAX_ERRNO)
long  IS_ERR_OR_NULL(const void *ptr)
long  PTR_ERR(const void *ptr)
void* ERR_PTR(long error)





BUG_ON(condition)          // can compile away, so do not create side effects in condition
// build-time checking


typecheck(__u64, a)        // build-time checking that a is of type __u64

container_of(member_ptr, container_typename, member_name)
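Illustrative sketch of container_of usage (the structure and names below are made up for the example): recover a pointer to the enclosing structure from a pointer to one of its members.

#include <linux/kernel.h>        /* container_of, pr_info */
#include <linux/list.h>

/* Hypothetical structure embedding a list_head member. */
struct my_device {
        int                id;
        struct list_head   link;
};

static void example(struct list_head *node)
{
        /* Recover the enclosing struct my_device from its 'link' member. */
        struct my_device *dev = container_of(node, struct my_device, link);

        pr_info("device id = %d\n", dev->id);
}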


BITS_PER_LONG              // 32 or 64



No floating point operations in kernel (there are routines to save/restore fp context).


Interrupt stack (optional; present only on some architectures), normally 1 page (on x86/x64: 4 KB).


Kernel stack: on x86/x64 usually 2 pages, i.e. 8 KB.
At the bottom of the stack is struct thread_info.
On stack overflow it will be silently overwritten, with disastrous consequences.


struct thread_info* current_thread_info();

struct task_struct* task = current_thread_info()->task;
struct task_struct* task = current;


/* task->state */

TASK_RUNNING                 0      // currently running or on runqueue waiting to run

TASK_INTERRUPTIBLE           1      // in interruptible sleep

TASK_UNINTERRUPTIBLE         2      // in non-interruptible sleep

__TASK_STOPPED               4      // stopped due to SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU

__TASK_TRACED                8      // being traced by debugger etc. via ptrace


/* in task->exit_state */

EXIT_ZOMBIE                  16        // process has terminated, lingering around to be reaped by parent

EXIT_DEAD          32        // final state after parent collects status with wait4 or waitpid syscalls


/* in task->state again */

TASK_DEAD          64

TASK_WAKEKILL                128       // wake up on fatal signals

TASK_WAKING                  256







set_task_state(task, state)


Iterate parent tasks

(note: if parent terminates, task is reparented to member of the group or to init)


for (task = current;  task != &init_task;  task = task->parent)
    { ... }


Iterate all tasks in the system (expensive):


for_each_process(task)
    { ... }


task = next_task(task);
task = prev_task(task)


Iterate over children tasks:


struct list_head* list;

list_for_each(list, &current->children)

    task = list_entry(list, struct task_struct, sibling);



Annotation for functions that can sleep. This macro will print a stack trace if it is executed in an atomic

context (spinlock, irq-handler, ...):


might_sleep();





Modifiers (annotations):


__init                                                          place code into the section discarded after kernel or module init

__read_mostly                                              place data in “read mostly” section to reduce cache line interference
__write_once                                               same as __read_mostly

__page_aligned_bss                                      align data on page boundary and allocate it from bss-page-aligned section

notrace                                                        do not instrument this function for profiling

__irq_entry                                                  allocate code in section irqentry-text (only if graph tracer is enabled)

__kprobes                                                     allocate code in section kprobes-text (only if CONFIG_KPROBES)

asmlinkage                                                    assembler-level call linkage

__lockfunc                                                    allocate code in section spinlock-text


__kernel                                                      designate as kernel-space address

__user, __iomem, __rcu                               disable dereference (only used by __CHECKER__, see  sparse(1))


__nocast                                                      sparse will issue warning if attempted type-conversion to this var (e.g. signed int value to unsigned var)

__bitwise                                                      sparse will issue warning if __bitwise variables are mixed with other integers

__safe                                                          sparse: not used in the codebase

__force                                                        sparse: disable warning (e.g. when casting int to bitwise)

__le32, __be32                                               sparse: little endian, big endian


__must_hold(x)                                            sparse: the specified lock is held on function entry and exit

__acquires(x)                                                sparse: function acquires the specified lock

__releases(x)                                                sparse: function releases the specified lock






Sparse usage: build and install sparse (make install), then run the kernel make:

make C=1                         # run sparse only on files that need to be recompiled
make C=2                         # run sparse on all files, whether recompiled or not
make C=2 CF="-D__CHECK_ENDIAN__"     # pass optional parameters to sparse



Execution contexts:


·          process (e.g. syscall), preemptible context

·          process, non-preemptible context

·          bottom-half (DPC/fork-like, but may be threaded, aka “soft interrupt” – e.g. tasklet)

·          primary interrupt handler (irq, aka "top half", aka “hard interrupt”)

·          PREEMPT_RT only: secondary (threaded) interrupt handler, per-IRQ thread


There are no IPL/IRQL levels.

SOFTIRQ is executed with hardware interrupts disabled
(but tasklet handler enables interrupts before entering the tasklet routine).


Bottom-half processing can be triggered by:

·          return from hardware interrupt (irq_exit)

·          local_bh_enable(), including indirect calls

·          ksoftirqd kernel thread (per-CPU, runs at low priority)

·          any code that explicitly calls do_softirq(), __do_softirq(), call_softirq() or invoke_softirq(),
such as network code


Note again that Linux is NOT "true" IPL/IRQL-based system, and local_irq_enable() or unlocking a spinlock DO NOT cause SOFTIRQ processing to happen.


Task switch can be triggered by:


·          REI to user-space

·          REI to kernel-space if (current->preempt_count == 0 && need_resched())

·          unlocking a lock, including indirect unlock or any other call to preempt_enable(),
if result is (current->preempt_count == 0 && need_resched())

·          explicit voluntary call to schedule(), including yield() and blocking functions


Preemption mask:



preempt_count()       current_thread_info()->preempt_count

preempt_count()               Returns the preemption disable count; this includes counts added
                                                       by currently active hardirqs, softirqs and NMI. The low byte is the preempt
                                                       disable count (if the kernel is configured preemptible, otherwise 0),
                                                       byte 1 is the softirq count, the next 10 bits are the hardirq count,
                                                       then 1 bit for NMI, then one bit for PREEMPT_ACTIVE.


There is also preempt_lazy_count (useful for PREEMPT_RT)

SOFTIRQ is changed by 1 on entering/leaving softirq processing and is changed by 2 on local_bh_enable/local_bh_disable, to distinguish between softirq-active and softirq-disabled states

PREEMPT_ACTIVE                                                    rescheduling is being performed (so do not reenter the scheduler)


Context control functions:


preempt_disable()                              increments PREEMPT field
preempt_enable()                               decrements PREEMPT; if it drops to 0, checks need_resched()


local_bh_disable()                               increments SOFTIRQ field by 2 (whereas entering softirq increments it
                        only by 1, to distinguish between softirq-active/softirq-disabled states)

local_bh_enable()                                decrements SOFTIRQ field by 2; if it drops to 0, checks for pending bottom halves (softirqs)


local_irq_disable()                              disable hw interrupts unconditionally (on some arch can be lazy)
local_irq_enable()                               enable hw interrupts unconditionally


unsigned long flags;

local_irq_save(flags)                         save previous state into flags and disable hw interrupts
local_irq_restore(flags)                  restore hw-interrupt-enable state from flags

Other functions:


hard_irq_disable()                              Really disable interrupts, rather than relying on lazy-disabling on
                                                                        some architectures.

preempt_enable_no_resched()    Decrement preemption count. If goes to 0, do not check for any
                                                                        pending reschedules.


in_nmi()                                                     Non-zero if executing NMI handler.


in_irq()                                                       Non-zero if currently executing  primary (hardirq) hardware

                                                                        interrupt  handler, 0 if in process context or bottom half or NMI.


in_softirq()                                              Non-zero if processing softirq or have BH disabled.


in_serving_softirq()                         Non-zero if processing softirq.


in_interrupt()                                        Non-zero if in NMI or primary interrupt (hardirq) or in bottom-half context, or if have BH disabled.

irqs_disabled()                                     Non-zero if local interrupt delivery is disabled.


in_atomic()                                              Checks (preempt_count() & ~PREEMPT_ACTIVE), cannot sense if

                                                                        spinlocks are held in a non-preemptible kernel, so use only with
                                                                        great caution. Do not use in drivers.


synchronize_hardirq(irq)            Wait for pending hard IRQ handlers to complete on other CPUs


synchronize_irq(irq)                       Wait for pending hard & threaded IRQ handlers to complete on other CPUs


Warning: on non-preemptible systems (CONFIG_PREEMPT=n) spin_lock does not change preemption mask.


Bottom halves:


top half = primary interrupt context

bottom half = deferred interrupt processing (similar to DPC or fork), can be threaded




Original BH                      removed in 2.5

Task queues                      removed in 2.5

Softirqs                         since 2.3

Tasklets                         since 2.3

Work queues                      since 2.3


Original BH (removed in 2.5):


32-bit request mask and 32 lists of requests, each level globally synchronized across all CPUs.


Task queues  (removed in 2.5):


Set of queues.


Softirqs (BH):


Set of NR_SOFTIRQS statically defined bottom halves allocated at kernel compile time.

Cannot register extra softirq levels dynamically.

Handler can run on any CPU.


Rarely used directly (tasklets are more common); tasklets themselves are layered on top of softirqs.


Defined levels:


HI_SOFTIRQ (0)                        High-priority tasklets

TIMER_SOFTIRQ (1)                     Timers

NET_TX_SOFTIRQ (2)                    Send network packets

NET_RX_SOFTIRQ (3)                    Receive network packets

BLOCK_SOFTIRQ (4)                     Block devices: done

BLOCK_IOPOLL_SOFTIRQ (5)              Block devices: poll

TASKLET_SOFTIRQ (6)                   Normal priority tasklets

SCHED_SOFTIRQ (7)                     Scheduler (just inter-CPU balancing,
                                      not the same as SOFTINT IPL$_RESCHED)

HRTIMER_SOFTIRQ (8)                   High-resolution timer

RCU_SOFTIRQ (9)                       RCU callbacks

Level to name:


extern char *softirq_to_name[NR_SOFTIRQS];


Register handler:


void my_action(struct softirq_action *);

open_softirq(HRTIMER_SOFTIRQ, my_action);


Handler runs with interrupts enabled and cannot sleep.


Raise softirq:


void  raise_softirq(unsigned int nr);                         // e.g. raise_softirq(HRTIMER_SOFTIRQ)


Marks softirq level as pending.
Temporarily disables interrupts internally.


Equivalent to:


                        unsigned long flags;

                        local_irq_save(flags);
                        raise_softirq_irqoff(nr);
                        local_irq_restore(flags);


If interrupts are already disabled, can do:


void  raise_softirq_irqoff(unsigned int nr);

Try to send a softirq to a remote cpu. 

If this cannot be done, the work will be queued to the local cpu.


void  send_remote_softirq(struct call_single_data* cp, int cpu, int softirq);



Softirq processing logic:


do_softirq is invoked:


·          on return from hardware interrupt (irq_exit)

·          local_bh_enable()

·          in the ksoftirqd kernel thread (per-CPU, runs at low priority)

·          in any code that explicitly calls do_softirq(), __do_softirq(), call_softirq() or invoke_softirq(),
  such as network code


void do_softirq(void)
{
    __u32 pending;

    unsigned long flags;

    struct softirq_action* h;


    if (in_interrupt())
        return;

    local_irq_save(flags);

    pending = local_softirq_pending();


    if (pending)
    {
        ... on x86 and some other architectures ...

        ... switch to per-CPU softirq stack ...

        ... clear the pending mask, re-enable interrupts ...


        h = softirq_vec;

        do
        {
            if (pending & 1)
                h->action(h);

            h++;
            pending >>= 1;
        } while (pending);

        ... disable interrupts again ...
    }


    ... recheck if still pending, and if so retry the loop a few times ...

    ... after which wake up ksoftirqd for execution on a worker thread ...


    local_irq_restore(flags);
}







·          Multiple softirqs (of same or different levels) can concurrently execute on different processors.

·          Single pass on the same CPU goes from high-priority (0) to low-priority (9) softirqs, but in a larger scheme of things low-priority softirq can get executed earlier than high-priority softirq.

(E.g. if another processor picks up a lower-priority softirq first, or if a high-priority softirq is added while the scan pass is already in progress, or if tasklets are relegated to threaded execution, or due to ksoftirqd.)

·          softirqs are prioritized for ordering only (and even then are subject to postponing to the ksoftirqd thread).

A high-priority softirq does not interrupt a lower-priority softirq.


"While a (softirq) handler runs, softirqs on the current processor are disabled.
A softirq never preempts another softirq (on the same CPU).
The only event that can preempt a softirq is an interrupt handler."



Tasklets (BH):



·          built on top of softirqs

·          can be scheduled only as one instance at a time (and also only on one processor at a time)

·          if the handler is already running on CPU1 and the tasklet gets scheduled again on CPU2, the second instance does not run concurrently with the first one: it is requeued and runs only after the first instance completes (the tasklet is serialized with respect to itself, see below)

·          handler runs with all interrupts enabled

·          does not have process context (no files etc.)

·          cannot sleep

·          per-processor queues tasklet_vec (for regular tasklets) and tasklet_hi_vec (for high-priority tasklets)



·          if tasklet_schedule is called, then tasklet is guaranteed to be executed on some CPU at least once after this

·          if tasklet is already scheduled, but its execution is still not started, it will be executed only once

·          if tasklet is already running on another CPU (or schedule is called from tasklet itself), it is scheduled again

·          tasklet is strictly serialized with respect to itself, but not to other tasklets;
  for inter-tasklet synchronization, use spinlocks.



      struct tasklet_struct
      {
              struct tasklet_struct *next;

              unsigned long state;           // bits: TASKLET_STATE_SCHED (0), TASKLET_STATE_RUN (1)

              atomic_t count;                // if non-zero, disabled and cannot run
                                             // otherwise enabled and can run if scheduled

              void (*func)(unsigned long);

              unsigned long data;
      };

#define DECLARE_TASKLET(name, func, data) \

                struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }


#define DECLARE_TASKLET_DISABLED(name, func, data) \

               struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }


void   tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data)


int     tasklet_trylock(struct tasklet_struct *t)                                       // sets RUN bit on MP, return  true if was not set

void   tasklet_unlock(struct tasklet_struct *t)                                       // clear RUN bit

void   tasklet_unlock_wait(struct tasklet_struct *t)                                                 // spin while RUN bit is set


These RUN-bit operations are no-ops on uniprocessor kernels.


void   tasklet_schedule(struct tasklet_struct *t)                                  

void   tasklet_hi_schedule(struct tasklet_struct *t)


If SCHED is already set, do nothing.

Otherwise insert into queue and signal softirq.


void   tasklet_kill(struct tasklet_struct *t);


Cannot be called from interrupt.
Acquire SCHED bit (i.e. wait till owner clears it, e.g. after removing from queue), spin while RUN bit is set, clear SCHED bit.


Tasklet state at the time tasklet_kill is called:


not queued                       returns fine

queued, disabled                 will lock up

enabled, queued                  tasklet is removed from the queue, handler completes execution,
                                 attempts to re-schedule it meanwhile do not pass

enabled, running                 waits (spins) until the running handler completes, then returns

void   tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu);


Can be called only for CPU in state CPU_DEAD.

Remove tasklet even if it is in SCHED state.

void   tasklet_disable_nosync(struct tasklet_struct *t)


Disables tasklet by atomic_inc(t->count) and returns immediately.

void   tasklet_disable(struct tasklet_struct *t)


Disables tasklet by atomic_inc(t->count) and calls tasklet_unlock_wait().


void   tasklet_enable(struct tasklet_struct *t)
void   tasklet_hi_enable(struct tasklet_struct *t)


Enables tasklet by atomic_dec(t->count).
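A minimal tasklet usage sketch (handler, variable and IRQ handler names are hypothetical; a real driver would protect the shared state): declare the tasklet at compile time, schedule it from the primary interrupt handler, kill it on teardown.

#include <linux/interrupt.h>

static unsigned long pending_events;          /* state shared with the IRQ handler (simplified) */

static void my_tasklet_func(unsigned long data)
{
        /* Softirq context: interrupts enabled, may not sleep. */
        pending_events = 0;                   /* ... process the deferred work ... */
}

static DECLARE_TASKLET(my_tasklet, my_tasklet_func, 0);

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
        /* Acknowledge the hardware, then defer the heavy work to the tasklet. */
        pending_events++;
        tasklet_schedule(&my_tasklet);
        return IRQ_HANDLED;
}

/* On driver teardown:  tasklet_kill(&my_tasklet); */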


Tasklet block:

flags: sched (“scheduled”), run (“currently running”)

Handler logic:


tasklet_schedule:

        if (!sched)
        {
                sched = 1

                queue + raise softirq
        }


softirq handler:

        copy:  list = per-cpu queue

        clear per-cpu queue


        for (each entry)
        {
                if (run)                        // already running on another CPU
                {
                        put back on the queue

                        request softirq

                        continue
                }

                run = 1

                if (disable_count != 0)
                {
                        run = 0

                        goto requeue
                }

                sched = 0

                call func(data)

                run = 0
        }



Work queues (BH):


·          items executed by a worker thread, in process context and can sleep

·          default threads are per-CPU events/n, but a queue can also have its own dedicated processing thread or threads


wq = create_workqueue("my_queue");
wq = create_singlethread_workqueue("my_queue");
wq = create_freezable_workqueue("my_queue");

NULL if fails.

create_workqueue  creates queue and worker threads, one per CPU, named my_queue/n.


Pre-existing queues:


system_wq                        Used by schedule[_delayed]_work[_on]().
                                 Per-CPU multithreaded.
                                 For short items.

system_long_wq                   Similar to system_wq, but for long items.
                                 Queue flushing may take relatively long time.

system_freezable_wq              Similar to system_wq, but freezable.

system_unbound_wq                Workers are not bound to any specific CPU, not concurrency managed,
                                 and all queued works are executed immediately as long as the
                                 max_active limit is not reached and resources are available.

int       keventd_up(void)


Check if system_wq exists (not NULL).


void  destroy_workqueue( struct workqueue_struct* wq);


void    workqueue_set_max_active(wq, int max_active)


Adjust maximum number of concurrently active works.
Do not call from interrupt context.


struct workqueue_struct*  wq;

work_struct  work;                                                   

void   my_func(struct work_struct *pWork);


DECLARE_WORK(work, my_func);                                                                       // compile-time initialization

INIT_WORK(&work, my_func)                                                       // run-time initialization

INIT_WORK_ONSTACK(&work, my_func)                               // ... same for on-stack work object

PREPARE_WORK(&work, new_my_func)                               // change the routine in an already initialized descriptor


bool    queue_work(wq, pWork)                                                                       // queue on current CPU

bool    schedule_work(pWork)                                                   // same as queue_work  on system_wq


Returns TRUE if was queued, FALSE if was already on the queue.


If the queue is being drained with drain_workqueue, the work will not actually be inserted in the queue unless called from the context of a "chained work", i.e. a work executing on this queue; this is not reflected in the return status.


To pass parameters to the work routine, embed struct work_struct in a larger structure (see the sketch below).
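Sketch of that embedding technique (my_request, my_work_handler and submit_request are illustrative names): the work_struct lives inside a larger request structure and the handler recovers it with container_of.

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/errno.h>

/* Hypothetical request carrying its own work item. */
struct my_request {
        struct work_struct work;
        int                arg;
};

static void my_work_handler(struct work_struct *pWork)
{
        /* Process context: may sleep. Recover the enclosing request. */
        struct my_request *req = container_of(pWork, struct my_request, work);

        /* ... use req->arg ... */
        kfree(req);
}

static int submit_request(int arg)
{
        struct my_request *req = kmalloc(sizeof(*req), GFP_KERNEL);

        if (!req)
                return -ENOMEM;
        req->arg = arg;
        INIT_WORK(&req->work, my_work_handler);
        schedule_work(&req->work);            /* queue on system_wq */
        return 0;
}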


bool    queue_work_on(int cpu, wq, pWork)                                          // queue for processing on  specified CPU

bool    schedule_work_on(int cpu, pWork)                                              // same as queue_work_on  on system_wq


Returns TRUE if was queued, FALSE if was already on the queue.
Caller must ensure the CPU cannot go away.
If the queue is being drained with drain_workqueue, (see above).

struct delayed_work dwork;

struct delayed_work *pDelayedWork;

void   my_delayed_work_timer_func(struct work_struct *pWork);


DECLARE_DELAYED_WORK(dwork, my_delayed_work_timer_func)               // compile-time initialization

INIT_DELAYED_WORK(&dwork, my_delayed_work_timer_func)                                              // run-time initialization

INIT_DELAYED_WORK_ONSTACK(&dwork, my_delayed_work_timer_func)                      // ... same for on-stack dwork object

PREPARE_DELAYED_WORK(&dwork, new_my_delayed_func)                          // change the routine in an already initialized struct


/* Deferrable is similar to delayed, but may postpone waking up CPU if sleeping */

DECLARE_DEFERRABLE_WORK(dwork, my_delayed_work_timer_func)       // compile-time initialization

INIT_DEFERRABLE_WORK(&dwork, my_delayed_work_timer_func)              // run-time initialization

INIT_DEFERRABLE_WORK_ONSTACK(&dwork, my_delayed_work_timer_func)              // ... same for on-stack dwork object

PREPARE_DELAYED_WORK(&dwork, new_my_delayed_func)                          // change the routine in an already initialized struct


bool    queue_delayed_work(wq, struct delayed_work *work, unsigned long delay)

bool    schedule_delayed_work(struct delayed_work *work, unsigned long delay)

bool    queue_delayed_work_on(int cpu, wq, struct delayed_work *work, unsigned long delay)

bool    schedule_delayed_work_on(int cpu, struct delayed_work *work, unsigned long delay)


Similar to queue_work/schedule_work, but the additional parameter delay (number of system ticks) specifies the minimum time to delay before executing the item. If the queue is being drained with drain_workqueue, (see above).


bool    mod_delayed_work(wq, pDelayedWork, unsigned long delay)

bool    mod_delayed_work_on(int cpu, wq, pDelayedWork, unsigned long delay)


Modify delay of or queue a delayed work on specific CPU or any CPU.


If pDelayedWork is idle, equivalent to queue_delayed_work_on(); otherwise, modify pDelayedWork 's timer so that it expires after delay.  If delay is zero, pDelayedWork is guaranteed to be scheduled immediately regardless of its current state.


Returns FALSE if pDelayedWork was idle and queued, TRUE if pDelayedWork was pending and its timer was modified.


Safe to call from any context including interrupt handler.


bool    cancel_work_sync(pWork)


Cancel pWork and wait for its execution to finish. 


Can be used even if the work re-queues itself or migrates to another workqueue. 

On return, pWork is guaranteed to be not pending or executing on any CPU.


Returns TRUE if pWork was pending, FALSE otherwise.


Caller must ensure that the workqueue on which pWork was last queued can't be destroyed before this function returns.

bool    cancel_delayed_work_sync(pDelayedWork)


Cancel a delayed work and wait for it to finish.

Similar to cancel_work_sync(), but for delayed works.


bool    cancel_delayed_work(pDelayedWork)


Cancel a pending delayed work. 

Returns TRUE if pDelayedWork was pending and canceled, FALSE if wasn't pending. 


Work callback function may still be running on return, unless function returns TRUE and the work doesn't re-arm itself.  Explicitly flush or use cancel_delayed_work_sync() to wait on it.


Safe to call from any context including IRQ handler.

bool    flush_work(pWork)


Wait for a work to finish executing the last queueing instance.

Wait until pWork has finished execution. 

pWork is guaranteed to be idle on return if it hasn't been requeued since flush started.


Returns TRUE if flush_work() waited for the work to finish execution, FALSE if it was already idle.


bool    flush_delayed_work(pDelayedWork)


Wait for a delayed work to finish executing the last queueing.


Delayed timer is cancelled and the pending work is queued for immediate execution. 

Like flush_work(), this function only considers the last queueing instance of pDelayedWork.


Returns TRUE if waited for the work to finish execution, FALSE if it was already idle.


void    flush_workqueue(wq)

void    flush_scheduled_work(void)                                        // same as flush_workqueue(system_wq)


Ensure that any scheduled work has run to completion.

Forces execution of the workqueue and blocks until its completion.


Sleep until all works which were queued on entry have been handled, but not livelocked by new incoming ones.


This is typically used in driver shutdown handlers.


void    drain_workqueue(wq)


Wait until the workqueue becomes empty. 


While draining is in progress, only chain queueing is allowed. 

In other words, only currently pending or running work items on wq can queue further work items on it. 


wq is flushed repeatedly until it becomes empty.  The number of flushes is determined by the depth of chaining and should be relatively small.


int       schedule_on_each_cpu(work_func_t  func)


Execute a function synchronously on each online CPU


Executes func on each online CPU using the system workqueue and blocks until all CPUs have completed.

Very slow.


Returns 0 on success, -errno on failure.


int       execute_in_process_context(work_func_t  fn, struct execute_work*  pExecuteWork)


Execute the routine within user context.


Executes the function immediately if process context is available, otherwise schedules the function for delayed execution.

Returns:     0 if function was executed immediately (i.e. was not in interrupt).

Returns:     1 if was in interrupt and attempted to schedule function for execution.


bool    workqueue_congested(unsigned int cpu, wq)


Test whether wq's cpu workqueue for cpu is congested. 


There is no synchronization around this function and the test result is unreliable and only useful as advisory hints or for debugging.

unsigned int    work_cpu(pWork)


Return the last known associated cpu for pWork: CPU number if pWork was ever queued, WORK_CPU_NONE otherwise.


unsigned int    work_busy(pWork)


Test whether a work is currently pending or running


There is no synchronization around this function and the test result is unreliable and only useful as advisory hints or for debugging. Especially for reentrant wqs, the pending state might hide the running state.


Returns or'd bitmask of WORK_BUSY_* bits.



Timers (BH):

Called via TIMER_SOFTIRQ .
Timers are not repetitive (once expires, need to be restarted).

void myfunc(unsigned long mydata)  { ... }

[static] DEFINE_TIMER(tmr, myfunc, 0, mydata);

struct timer_list  tmr;
init_timer(&tmr);
init_timer_on_stack(&tmr);          // if variable is on stack, also call destroy_timer_on_stack(&tmr)
init_timer_deferrable(&tmr);        // will not cause CPU to wake up from power sleep
tmr.function  =  myfunc;
tmr.data  =  mydata;

tmr.expires  =  jiffies + HZ / 2;

setup_timer(&tmr, myfunc, mydata)
setup_timer_on_stack(&tmr, myfunc, mydata)
setup_deferrable_timer_on_stack(&tmr, myfunc, mydata)

void     add_timer(& tmr)                                     // must not be already pending
void     add_timer_on(&tmr, int cpu)

int        mod_timer(&tmr, unsigned long expires)

Equivalent to   del_timer(&tmr);  tmr.expires = expires;  add_timer(&tmr);
The only safe way to queue timer when there are multiple unsynchronized concurrent users of the same timer.
If timer was inactive, returns 0, if was active, returns 1.

int        mod_timer_pending(&tmr, unsigned long expires)

Modify an already pending timer.
For active timers the same as mod_timer().
But will not reactivate or modify an inactive timer.

int        mod_timer_pinned(&tmr, unsigned long expires)


Update the expire field of an active timer (if the timer is inactive it will be activated) and ensure that the timer is scheduled on the current CPU.


This does not prevent the timer from being migrated when the current CPU goes offline.  If this is a problem for
you, use CPU-hotplug notifiers to handle it correctly, for example by cancelling the timer when the corresponding
CPU goes offline.


Equivalent to:   del_timer(timer); timer->expires = expires; add_timer(timer);

int        timer_pending(&tmr)                                 // true if the timer is queued (pending); no inherent synchronization

int        del_timer(&tmr)

Deactivate timer.  OK to call either for active and inactive timers.
If was inactive, return 0, otherwise (if was active) return 1.
Guarantees that the timer won't be executed in the future, but does not wait for completion of an already running handler.


int        del_timer_sync(&tmr)

Deactivate a timer and wait for the handler to finish.


Synchronization rules: Callers must prevent restarting of the timer, otherwise this function is meaningless.

It must not be called from interrupt contexts unless the timer is an irqsafe one.

The caller must not hold locks which would prevent completion of the timer's handler.

The timer's handler must not call add_timer_on().

For non-irqsafe timers, you must not hold locks that are held in interrupt context while calling this function. Even if the lock has nothing to do with the timer in question.  Here's why:


   CPU0                                 CPU1

   ----                                 ----

                                        <SOFTIRQ>
                                          call_timer_fn();
                                            base->running_timer = mytimer;

   spin_lock_irq(somelock);

                                        <IRQ>
                                          spin_lock(somelock);

   del_timer_sync(mytimer);

   while (base->running_timer == mytimer);


Now del_timer_sync() will never return and never release somelock. The interrupt on the other CPU is waiting to grab somelock but it has interrupted the softirq that CPU0 is waiting to finish.

int        try_to_del_timer_sync(&tmr)

Try to deactivate a timer. Upon successful (ret >= 0) exit the timer is not queued and the handler is not running on any CPU.
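A minimal sketch of the (pre-4.15) timer API described above, with hypothetical names; the handler rearms the timer itself because kernel timers are one-shot.

#include <linux/timer.h>
#include <linux/jiffies.h>

static struct timer_list my_timer;

static void my_timer_func(unsigned long data)
{
        /* Runs in softirq context (TIMER_SOFTIRQ): may not sleep. */
        /* Timers are one-shot, so rearm explicitly for periodic behavior. */
        mod_timer(&my_timer, jiffies + HZ / 2);
}

static void my_timer_start(void)
{
        setup_timer(&my_timer, my_timer_func, 0);
        mod_timer(&my_timer, jiffies + HZ / 2);    /* first expiration in ~0.5 s */
}

static void my_timer_stop(void)
{
        del_timer_sync(&my_timer);                 /* waits for a running handler to finish */
}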


Hi-res timers:

#include <linux/ktime.h>

#include <linux/hrtimer.h>


/* nanoseconds */

typedef union ktime {  s64 tv64;  }  ktime_t;




ktime_t  ktime_set(s64 sec, unsigned long ns)

ktime_t  timespec_to_ktime(struct timespec ts)

ktime_t  timespec64_to_ktime(struct timespec64 ts)

ktime_t  timeval_to_ktime(struct timeval tv)


s64  ktime_to_ns(kt)

s64  ktime_to_us(kt)

s64  ktime_to_ms(kt)

struct timespec  ktime_to_timespec(kt)

struct timespec  ktime_to_timespec64(kt)

struct timeval  ktime_to_timeval(kt)


ktime_t   ktime_add(kt1, kt2)

ktime_t   ktime_sub(kt1, kt2)

ktime_t   ktime_add_ns(kt, unsigned long ns)

ktime_t   ktime_sub_ns(kt, unsigned long ns)

ktime_t   ktime_add_safe(ktime_t kt1, ktime_t kt2)


ktime_t  ktime_add_ms(kt, u64 ms)

ktime_t  ktime_sub_us(kt, u64 us)


int   ktime_compare(kt1, kt2)

bool  ktime_equal(kt1, kt2)

bool  ktime_after(kt1, kt2)

bool  ktime_before(kt1, kt2)




hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
timer->function = ftimer;

hrtimer_start(timer, ktime_set(0, ns), HRTIMER_MODE_REL);


hrtimer_cancel(timer);  // waits for the handler to finish


static enum hrtimer_restart ftimer(struct hrtimer *timer)


                        u64 missed = hrtimer_forward_now(timer, ktime_set(0, ns));




                        return HRTIMER_NORESTART;  // or HRTIMER_RESTART



struct hrtimer *timer;

clockid_t which_clock =  CLOCK_MONOTONIC or CLOCK_REALTIME;

enum hrtimer_mode mode =

HRTIMER_MODE_ABS                       // absolute

HRTIMER_MODE_REL                        // relative to now

HRTIMER_MODE_ABS_PINNED   // absolute + bound to this CPU

HRTIMER_MODE_REL_PINNED // relative + bound to this CPU


void   hrtimer_init(timer, which_clock, mode)

void   hrtimer_init_on_stack(timer, which_clock, mode)

void  destroy_hrtimer_on_stack(timer)

int   hrtimer_start(timer, kt, mode)

int   hrtimer_start_range_ns(timer, kt, unsigned long range_ns, mode)


u64   hrtimer_forward(timer, ktime_t now, ktime_t interval)

u64  hrtimer_forward_now(timer, ktime_t interval)


ktime_t  now = timer->base->get_time()

ktime_t  now = hrtimer_cb_get_time(timer)


void  hrtimer_set_expires(timer, kt)

void  hrtimer_set_expires_range(timer, kt, ktime_t delta)                                  // delta is extra tolerance

void  hrtimer_set_expires_range_ns(timer, kt, unsigned long delta_ns)

void  hrtimer_add_expires(timer, kt)

void  hrtimer_add_expires_ns(timer, u64 ns)


ktime_t  hrtimer_get_softexpires(timer)                                                   // without added tolerance (i.e. soft)

ktime_t  hrtimer_get_expires(timer)                                                            // with tolerance, i.e. soft + delta


ktime_t  hrtimer_expires_remaining(timer)

void   hrtimer_get_res(which_clock, struct timespec *tp)


Wait queues:


Replacement for sleep/wake.

Exclusive waiters (those with WQ_FLAG_EXCLUSIVE) count against wakeup limit counter, non-exclusive do not.


wait_queue_head_t   wq;

init_waitqueue_head(& wq);


Sample wait cycle:


[static] DEFINE_WAIT(mywait);                                             // create wait queue entry


add_wait_queue(&wq, &mywait);                  // add it to wait queue

while (! (condition))

        /* sets current task state */

        /*     TASK_INTERRUPTIBLE: signal wakes process up */

        /*     TASK_UNINTERRUPTIBLE: does not */

        prepare_to_wait(&wq, &mywait, TASK_INTERRUPTIBLE);

        if (signal_pending(current))
                ... handle signal (e.g. break out of the loop) ...

        schedule();                              // actually sleep until woken up


finish_wait(&wq, &mywait);                              // sets task state to TASK_RUNNING and removes mywait from wq


Note: instead of default_wake_function can supply own wake function in mywait.func(mywait.private).


Helper macros:


void   wait_event(&wq, condition)

Uninterruptible wait.


void   wait_event_lock_irq(&wq, condition, spinlock_t  lock)

void   wait_event_lock_irq_cmd(&wq, condition, spinlock_t  lock,  cmd)


Uninterruptible wait.
Condition checked under the lock.

Expected to be called with the lock held, returns with the lock held.

Command cmd is executed right before going to sleep.


int   wait_event_interruptible(&wq, condition)

int   wait_event_interruptible_lock_irq(&wq, condition, spinlock_t  lock)
int   wait_event_interruptible_lock_irq_cmd(&wq, condition, spinlock_t  lock,  cmd)

Interruptible wait.
Returns 0 if condition evaluated to true, or -ERESTARTSYS if was interrupted.


Condition checked under the lock.

Expected to be called with the lock held, returns with the lock held.

Command cmd is executed right before going to sleep.

long   wait_event_timeout(&wq, condition, timeout)                                                 // timeout is in jiffies


Uninterruptible wait with timeout.
Returns 0 if timeout elapsed, otherwise remaining time.

long   wait_event_interruptible_timeout(&wq, condition, timeout)                // timeout is in jiffies


Interruptible wait with timeout.
Returns -ERESTARTSYS if sleep was interrupted, 0 if timeout elapsed, otherwise remaining time.


int    wait_event_interruptible_exclusive(&wq, condition)


Interruptible exclusive wait.
Returns 0 if condition evaluated to true, or -ERESTARTSYS if was interrupted.


void     __add_wait_queue_exclusive(&wq, &mywait)

void     __add_wait_queue_tail_exclusive(&wq, &mywait)
void     add_wait_queue_exclusive(&wq, &mywait)

void     prepare_to_wait_exclusive(&wq, &mywait, TASK_[UN]INTERRUPTIBLE)


General routines/macros to set up for exclusive wait.


int   wait_event_interruptible_exclusive_locked(&wq, condition)

int   wait_event_interruptible_exclusive_locked_irq(&wq, condition)


Interruptible wait.

Must be called with wq.lock held.
Spinlock is unlocked while sleeping but condition testing is done while the lock

 is held and when this macro exits the lock is held.

      wait_event_interruptible_exclusive_locked uses  spin_lock()/spin_unlock().
wait_event_interruptible_exclusive_locked_irq uses  spin_lock_irq()/spin_unlock_irq().
Returns -ERESTARTSYS if sleep was interrupted, 0 if condition was satisfied.


int   wait_event_killable(&wq, condition)


Semi-interruptible wait (interruptible only by fatal signals).
Returns 0 if condition evaluated to true, or -ERESTARTSYS if was interrupted.


int   wait_on_bit(void *word, int bit, int (*action)(void *), unsigned sleepmode)
int   wait_on_bit_lock(void *word, int bit, int (*action)(void *), unsigned sleepmode)

wait_on_bit : wait for a bit to be cleared.

wait_on_bit_lock : wait for a bit to be cleared when wanting to set it.


action is a function used to sleep.


Wake up:


wake_up_all(&wq);                                                   // wakes up all waiters

wake_up(&wq);                                                           // wakes up up to one waiter, same as wake_up_nr(&wq, 1)
wake_up_nr(&wq,  5);                                             // wakes up up to the specified number of waiters


Only exclusive waiters (with WQ_FLAG_EXCLUSIVE) count against the counter.


/* same but with queue spinlock held */

unsigned long flags;

spin_lock_irqsave(&wq.lock, flags);

wake_up_locked(&wq);                                         // wakes up up to one waiter
wake_up_all_locked(&wq);                                     // wakes up all waiters
spin_unlock_irqrestore(&wq.lock, flags);


void   wake_up_bit(void* word, int bit)


Wake up waiter on a bit.


Remove from wait queue:


void   remove_wait_queue(&wq, &mywait)
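A minimal waiter/waker sketch using the helper macros above (my_wq, data_ready, consumer and producer are illustrative names; the condition should be updated under appropriate locking or atomically).

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int data_ready;                    /* the wait condition */

/* Consumer: sleeps until data_ready becomes non-zero. */
static int consumer(void)
{
        if (wait_event_interruptible(my_wq, data_ready != 0))
                return -ERESTARTSYS;      /* interrupted by a signal */
        /* ... consume the data ... */
        return 0;
}

/* Producer: publishes data, then wakes the waiters. */
static void producer(void)
{
        data_ready = 1;
        wake_up(&my_wq);
}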



Completion variables:


Wait queue plus "done" flag.


                 struct completion {

                             unsigned int done;

                             wait_queue_head_t wait;
                 };



struct completion  comp;

#define  INIT_COMPLETION(x)       ((x).done = 0)                                  // reinitialize for reuse

void   init_completion(& comp)


void                      wait_for_completion(&comp)

int                         wait_for_completion_interruptible(&comp)

int                         wait_for_completion_killable(&comp)

unsigned long   wait_for_completion_timeout(&comp, unsigned long timeout)

long                      wait_for_completion_interruptible_timeout(&comp, unsigned long timeout)

long                      wait_for_completion_killable_timeout(&comp, unsigned long timeout)

bool                     try_wait_for_completion(&comp)

bool                     completion_done(&comp)


void                       complete(&comp)

void                       complete_all(&comp)
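Minimal completion usage sketch (names are illustrative): one side blocks in wait_for_completion(), the other signals with complete(); reuse requires re-initialization (INIT_COMPLETION, see above).

#include <linux/completion.h>

static DECLARE_COMPLETION(my_comp);       /* defined and initialized at compile time */

/* Waiting side: blocks until complete() has been called. */
static void waiter(void)
{
        wait_for_completion(&my_comp);
        /* ... the awaited event has happened ... */
}

/* Signalling side, e.g. called from another thread or a work item. */
static void signaller(void)
{
        complete(&my_comp);               /* wake exactly one waiter */
}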


Atomic operations on 32-bit counters:


#define ACCESS_ONCE(x)    (*(volatile typeof(x) *)&(x))                     // note: does not imply memory barrier


typedef struct { volatile int counter; } atomic_t;


atomic_t v = ATOMIC_INIT(77);
int i;


int       atomic_read(pv)                                            // no implied barriers !

void    atomic_set(pv,  i)                                          // ...

void    atomic_add(i,  pv)                                       // ...
void    atomic_sub(i,  pv)                                        // ...
void    atomic_inc(pv)                                               // ...
void    atomic_dec(pv)                                              // ...


No implied barriers  (compiler or memory) !!
For some operations, use helper functions

void    smp_mb__before_atomic_dec()

void    smp_mb__after_atomic_dec()


void    smp_mb__before_atomic_inc()

void    smp_mb__after_atomic_inc()

For others use  smp_wmb, smp_rmb, smp_mb or barrier.


int       atomic_sub_and_test(i, pv)                 // true if result is zero, performs memory barriers before and after
int       atomic_dec_and_test(pv)                      // ...
int       atomic_inc_and_test(pv)                       // ...


int       atomic_add_return(i, pv)                      // return result, performs memory barriers before and after
int       atomic_sub_return(i, pv)                       // ...
int       atomic_inc_return(pv)                                                           // ...
int       atomic_dec_return(pv)                                                          // ...


int       atomic_add_negative(i, pv)                 // *pv += i;  return true if result is negative, performs memory barriers before and after


int       atomic_xchg(pv, i)                                      // returns previous value, performs memory barriers before and after

int       atomic_cmpxchg(pv, old, new)                                 // CAS, returns actual old value, performs memory barriers before and after


int       atomic_add_unless(pv, int a, int mark)                      // if (*pv != mark) { *pv += a;  return true; }  else { return false; }   and does barriers

void    atomic_clear_mask(bitmask,  pv)

void    atomic_set_mask(bitmask,  pv)


int      _atomic_dec_and_lock(pv , spinlock_t *lock)


Atomically decrement *pv and if it drops to zero, atomically acquire lock and return true.
If does not drop to zero, just decrement, do not acquire lock and return false.
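Sketch of a typical reference-count pattern built on the atomics above (object and function names are hypothetical); atomic_dec_and_test() supplies the barriers that order the final free after all prior accesses.

#include <linux/atomic.h>
#include <linux/slab.h>

/* Hypothetical reference-counted object; initialize refcnt with ATOMIC_INIT(1) or atomic_set(). */
struct my_obj {
        atomic_t refcnt;
        /* ... payload ... */
};

static void my_obj_get(struct my_obj *obj)
{
        atomic_inc(&obj->refcnt);
}

static void my_obj_put(struct my_obj *obj)
{
        /* atomic_dec_and_test() performs memory barriers before and after,
         * so the free is ordered after all earlier uses of the object. */
        if (atomic_dec_and_test(&obj->refcnt))
                kfree(obj);
}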

Atomic operations on 64-bit counters:

Most 32-bit architectures do not support 64-bit atomic operations, although x86_32 does. Where supported, atomic64_t and the atomic64_xxx() functions mirror the 32-bit atomic_t API.



Atomic operations on bits:


void    set_bit(unsigned long nr, volatile unsigned long *addr)                                                  // no barriers implied

void    clear_bit(unsigned long nr, volatile unsigned long *addr)                                              // ...

void    change_bit(unsigned long nr, volatile unsigned long *addr)                                         // ... (flips the bit)


                        void smp_mb__before_clear_bit(void)

                        void smp_mb__after_clear_bit(void)

int test_bit(unsigned long nr, __const__ volatile unsigned long *addr);


int       test_and_set_bit(unsigned long nr, volatile unsigned long *addr)                                  // returns boolean, performs memory barriers

int       test_and_clear_bit(unsigned long nr, volatile unsigned long *addr)

int       test_and_change_bit(unsigned long nr, volatile unsigned long *addr)


int       test_and_set_bit_lock(unsigned long nr, unsigned long *addr)                                                 // acquire-release semantics

void    clear_bit_unlock(unsigned long nr, unsigned long *addr)                                               // ...

Non-atomic operations on bits:


void    __set_bit(unsigned long nr, volatile unsigned long *addr)

void    __clear_bit(unsigned long nr, volatile unsigned long *addr)

void    __change_bit(unsigned long nr, volatile unsigned long *addr)


int       __test_and_set_bit(unsigned long nr, volatile unsigned long *addr)

int       __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr)

int       __test_and_change_bit(unsigned long nr, volatile unsigned long *addr)


void     __clear_bit_unlock(unsigned long nr, unsigned long *addr)                                        // implements release barrier


unsigned long    find_first_bit(const unsigned long *addr, unsigned long size)

unsigned long    find_first_zero_bit(const unsigned long *addr, unsigned long size)

unsigned long    find_next_bit(const unsigned long *addr, unsigned long size, unsigned long offset)


if not found, returns value >= size.


int     ffs(int x)
int     ffz(int x)
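Sketch of a simple slot allocator built on the bit operations above (bitmap name and size are made up): find_first_zero_bit() picks a candidate, test_and_set_bit() claims it atomically, retrying if another CPU raced in between.

#include <linux/bitops.h>

#define MY_NR_SLOTS 64
static unsigned long my_slots[BITS_TO_LONGS(MY_NR_SLOTS)];

static int alloc_slot(void)
{
        int slot;

        do {
                slot = find_first_zero_bit(my_slots, MY_NR_SLOTS);
                if (slot >= MY_NR_SLOTS)
                        return -1;                     /* no free slot */
        } while (test_and_set_bit(slot, my_slots));    /* retry if the bit was taken meanwhile */
        return slot;
}

static void free_slot(int slot)
{
        clear_bit(slot, my_slots);
}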


Memory barriers:

rmb, wmb, mb, read_barrier_depends  –  emit a barrier on both UP and SMP kernels
smp_rmb, smp_wmb, smp_mb, smp_read_barrier_depends  –  emit a barrier on SMP; on UP emit only barrier()
barrier()  –  compiler barrier only

p = ...
read_barrier_depends();                      // a weaker form of read barrier, only affecting dependent data
a = *p;

[can drift down]
LOCK barrier
[cannot drift up]

[cannot drift down]
UNLOCK barrier
[can drift up]

Therefore UNLOCK + LOCK = full MB, but LOCK + UNLOCK is not (may drift into the middle and get reverse-ordered there).

smp_mb__before_spinlock() does not let stores issued before the critical section drift inside it

set_current_state(TASK_UNINTERRUPTIBLE)  does: smp_mb(),  current->state = TASK_UNINTERRUPTIBLE
wake_up does smp_mb

mmiowb()  –  order writes to IO space (PCI etc.)
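Illustrative publish/consume pairing of smp_wmb()/smp_rmb() (variable and function names are hypothetical; depending on kernel version the barrier declarations come via <asm/system.h> or <asm/barrier.h>).

#include <linux/compiler.h>       /* ACCESS_ONCE */
#include <linux/kernel.h>

static int payload;
static int ready;

/* Producer (CPU A): publish the payload before setting the flag. */
static void publish(int value)
{
        payload = value;
        smp_wmb();                /* order the payload store before the flag store */
        ACCESS_ONCE(ready) = 1;
}

/* Consumer (CPU B): observe the flag, then read the payload. */
static int consume(void)
{
        if (!ACCESS_ONCE(ready))
                return -1;        /* illustrative "not yet" indication */
        smp_rmb();                /* order the flag load before the payload load */
        return payload;
}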



Spinlocks:


Not recursive.
Disables preemption on local CPU.
Optionally disables hardware interrupts or bottom-half processing.
While holding a spinlock, must not do anything that will cause sleep. Do not touch pageable (e.g. user) memory, kmalloc(GFP_KERNEL), any semaphore functions or any of the schedule functions, down_interruptible() and down().
Ok to call down_trylock() and up() as they never sleep.

spinlock_t lock = SPIN_LOCK_UNLOCKED;    // obsolete, incompatible with RT
Or:    DEFINE_SPINLOCK(lock);
Or:    void spin_lock_init(&lock);

Disable-restore interrupts on local CPU:

unsigned long flags;
spin_lock_irqsave(&lock, flags);
spin_unlock_irqrestore(&lock, flags);

Performs  local_irq_save(); preempt_disable().

If knew for sure interrupts were enabled, can do instead:

spin_lock_irq(&lock);
spin_unlock_irq(&lock);

Performs  local_irq_disable(); preempt_disable().

If the spinlock is never acquired from an interrupt handler, can use the form that does not disable interrupts:

spin_lock(&lock);
spin_unlock(&lock);
int spin_trylock(&lock);

Performs  preempt_disable().

Disable bottom half processing, but do not disable hardware interrupts:

spin_lock_bh(&lock);
spin_unlock_bh(&lock);
int spin_trylock_bh(&lock);

Performs  local_bh_disable(); preempt_disable().

Nested forms for lockdep:

spin_lock_nested(&lock, subclass)
spin_lock_irqsave_nested(&lock, flags, subclass)
spin_lock_nest_lock(&lock, nest_lock)               // signal: ok to take lock despite already holding nest_lock of the same class
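Minimal sketch of a spinlock shared between process context and a hardirq handler (lock and counter names are illustrative): process context must use the irqsave form, while the interrupt handler, where interrupts on this CPU are already off, can take the plain lock.

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);
static unsigned long shared_counter;

/* Process context: disable interrupts, since the IRQ handler takes the same lock. */
static void bump_from_process(void)
{
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);
        shared_counter++;
        spin_unlock_irqrestore(&my_lock, flags);
}

/* Called from the interrupt handler: plain spin_lock is enough here. */
static void bump_from_irq(void)
{
        spin_lock(&my_lock);
        shared_counter++;
        spin_unlock(&my_lock);
}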


RW spinlocks:


Not recursive.
Disables preemption on local CPU.
Optionally disables hardware interrupts or bottom-half processing.
No reader -> writer conversion.
Readers are favored over writers and can starve them; in this case a plain spinlock_t may be preferable to rwlock_t.
While holding a spinlock, must not do anything that will cause sleep. Do not touch pageable (e.g. user) memory, kmalloc(GFP_KERNEL), any semaphore functions or any of the schedule functions, down_interruptible() and down().
Ok to call down_trylock() and up() as they never sleep.
rwlock_t lock = RW_LOCK_UNLOCKED;      // obsolete, incompatible with RT
Or:    DEFINE_RWLOCK(lock)
Or:    void rwlock_init(&lock);
Disable-restore interrupts on local CPU:
unsigned long flags;
read_lock_irqsave(&lock, flags);
read_unlock_irqrestore(&lock, flags);
write_lock_irqsave(&lock, flags);
write_unlock_irqrestore(&lock, flags);
Performs  local_irq_save(); preempt_disable().
If knew for sure interrupts were enabled, can do instead:

read_lock_irq(&lock);
read_unlock_irq(&lock);
write_lock_irq(&lock);
write_unlock_irq(&lock);

Performs  local_irq_disable(); preempt_disable().

If the spinlock is never acquired from an interrupt handler, can use the form that does not disable interrupts:

read_lock(&lock);
read_unlock(&lock);
write_lock(&lock);
write_unlock(&lock);
int read_trylock(&lock);
int write_trylock(&lock);

Performs  preempt_disable().

Disable bottom half processing, but do not disable hardware interrupts:

read_lock_bh(&lock);
read_unlock_bh(&lock);
write_lock_bh(&lock);
write_unlock_bh(&lock);

Performs  local_bh_disable(); preempt_disable().

Local-global spinlock:


·          Global data: very fast and scalable read locking, but very slow write locking.

·          Per-CPU data: fast write locking on local CPU, possible write locking of other CPU’s data, very slow write locking of all CPUs data.


struct lglock   lg;

void    lg_lock_init(&lg, “mylg”);

void    lg_local_lock(&lg);

void    lg_local_unlock(&lg);

void    lg_local_lock_cpu(&lg, int cpu);

void    lg_local_unlock_cpu(&lg, int cpu);

void    lg_global_lock(&lg);

void    lg_global_unlock(&lg);




Mutexes:


Sleepable lock (may sleep when acquiring, can sleep while holding it).
May not use in softirq or hardirq context.
Task may not exit with mutex held.

struct mutex mtx;
[static] DEFINE_MUTEX(mtx);

void   mutex_init(&mtx);
void   mutex_destroy(&mtx);

void   mutex_lock(&mtx)
int      mutex_trylock(&mtx)
void   mutex_unlock(&mtx)                                                                           // must be unlocked by owner only
int      mutex_is_locked(&mtx)


int      mutex_lock_interruptible(&mtx)                                                        // 0 if acquired, otherwise -EINTR

int      mutex_lock_killable(&mtx)                                                                 // …


int      atomic_dec_and_mutex_lock(atomic_t* cnt, &mtx)


If decrements cnt to 0, return true holding mutex.

Otherwise return false not holding mutex.
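A minimal usage sketch for the basic mutex API above (names are illustrative); valid only in process context, since acquiring the lock may sleep.

#include <linux/mutex.h>
#include <linux/errno.h>

static DEFINE_MUTEX(my_mtx);
static int shared_state;

static int update_state(int v)
{
        /* May sleep while waiting for the mutex. */
        if (mutex_lock_interruptible(&my_mtx))
                return -EINTR;            /* interrupted by a signal */
        shared_state = v;
        mutex_unlock(&my_mtx);            /* must be unlocked by the owner */
        return 0;
}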


If locking more than one mutex of the same lock validation class, lockdep will complain. To lock multiple mutexes of the same class, designate their subclasses and lock with the following functions.


unsigned int subclass;

void   mutex_lock_nested(&mtx, subclass)

void   mutex_lock_interruptible_nested(&mtx, subclass)

void   mutex_lock_killable_nested(&mtx, subclass)
void   mutex_lock_nest_lock(&mtx, nest_lock)           
// signal: ok to take mtx despite already holding nest_lock of the same class



Mutexes with priority inheritance:


struct rt_mutex rmx;
[static] DEFINE_RT_MUTEX(rmx);


void    rt_mutex_init(&rmx)

void    rt_mutex_destroy(&rmx)

void    rt_mutex_lock(&rmx)

int       rt_mutex_trylock(&rmx)

void    rt_mutex_unlock(&rmx)

int       rt_mutex_is_locked(&rmx)

int       rt_mutex_timed_lock(&rmx,  struct hrtimer_sleeper *timeout,  int detect_deadlock)


success = 0, otherwise -EINTR or -ETIMEDOUT or -EDEADLK


int       rt_mutex_lock_interruptible(&rmx,  int detect_deadlock)


success = 0, otherwise -EINTR or -EDEADLK




Semaphores:


Sleepable lock (may sleep when acquiring, can sleep while holding it).
May not use in softirq or hardirq context.

struct semaphore sem;
[static] DEFINE_SEMAPHORE(sem);

void    sema_init(&sem, int val)                                                       // val is maximum number of holders

void    down(&sem)                                                                           // acquire

int       down_trylock(&sem)                                                            // inverted result:  0 if acquired, 1 if did not

int       down_interruptible(&sem)                                                 // 0 or -EINTR

int       down_killable(&sem)                                                           // 0 or -EINTR

int       down_timeout(&sem, long jiffies)                                      // 0 or -ETIME

void    up(&sem)                                                                                // release

RW locks (RW semaphores):

Any number of readers, up to 1 writer.
Sleepable lock (may sleep when acquiring, can sleep while holding it).
May not use in softirq or hardirq context.
There is no interruptible/killable wait; all waits are uninterruptible.

struct rw_semaphore rws;
[static] DECLARE_RWSEM(rws);

void    init_rwsem(&rws)

void    down_read(&rws)                                                               // acquire

void    down_write(&rws)


int      down_read_trylock(&rws)                                                  // true = ok, false = failure

int      down_write_trylock(&rws)


void    up_read(&rws)                                                                      // release

void    up_write(&rws)


int      rwsem_is_locked(&rws)                                                      // true if locked


void    downgrade_write(&rws)                                                // downgrade write lock to read lock
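Minimal rwsem usage sketch (names are illustrative): concurrent readers, exclusive writer; both sides may sleep while waiting.

#include <linux/rwsem.h>

static DECLARE_RWSEM(my_rws);
static int config_value;

static int read_config(void)
{
        int v;

        down_read(&my_rws);       /* many readers may hold this concurrently */
        v = config_value;
        up_read(&my_rws);
        return v;
}

static void write_config(int v)
{
        down_write(&my_rws);      /* exclusive */
        config_value = v;
        up_write(&my_rws);
}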


If locking more than one RW semaphore of the same lock validation class, lockdep will complain. To lock multiple RW semaphores of the same class, designate their subclasses and lock with the following functions.


void    down_read_nested(&rws, int subclass)

void    down_write_nested(&rws, int subclass)

void    down_write_nest_lock(&rws, nest_lock)           // signal: ok to take rws despite already holding nest_lock of the same class



Sequential locks.

·          Good when lots of readers, few writers

·          Writers take priority over readers

·          So readers cannot starve writers

"Write" side (in write_seqlock_xxx) internally uses spinlock and runs locked code with spinlock held.
"Read" side does not use spinlock.
"Write" side is subject to lock validator, "read" side is not.

Internally increments a variable inside the lock. When it is odd, write lock is held.

Writer usage:

        write_seqlock(&sq);

        ... write data ...

        write_sequnlock(&sq);


Reader usage:

        do {

                start = read_seqbegin(&sq);

                ... read data ...

        } while (read_seqretry(&sq,  start));

seqlock_t sq;
[static] DEFINE_SEQLOCK(sq);
unsigned int start;

void   seqlock_init(& sq)

void   write_seqlock(&sq)

int      write_tryseqlock(&sq)                                                       // true if locked

void   write_sequnlock(&sq)


void   write_seqlock_irqsave(&sq, flags)

void   write_sequnlock_irqrestore(&sq, flags)


void    write_seqlock_irq(&sq)

void    write_sequnlock_irq(&sq)


void    write_seqlock_bh(&sq)

void    write_sequnlock_bh(&sq)

void     write_seqcount_begin(seqcount_t *s)                   // caller uses his own mutex or other lock, built-in spinlock is not used

void     write_seqcount_end(seqcount_t *s)                        // ...


void     write_seqcount_barrier(seqcount_t *s)              // invalidate in-progress read operations


start  =  read_seqbegin(const &sq)

int          read_seqretry(const &sq, unsigned start)


start  =  read_seqbegin_irqsave(&sq, flags)

int          read_seqretry_irqrestore(&sq, start, flags)


start  =  __read_seqcount_begin(const seqcount_t *s)                                   // like read_seqbegin, but does not have smp_rmb at the end

start  =  read_seqcount_begin(const seqcount_t *s)                                      // equivalent to read_seqbegin

start  =  raw_seqcount_begin(const seqcount_t *s)                                       // get seqcount without waiting for it to go even, plus smp_rmb

int            __read_seqcount_retry(const seqcount_t *s, unsigned start)    
// like read_seqretry, but does not issue smp_rmb
                                                                                                                                  // (caller must issue rmb before the call)

int           read_seqcount_retry(const seqcount_t *s, unsigned start)         // equivalent to read_seqretry




Lock validator (lockdep):


Every lock is assigned a class (= “key address”):

·          For statically allocated lock it is the address of the lock

·          For dynamic locks it is the spot of their init (spin_lock_init, mutex_init etc.) routine

Validator checks for:

·          ClassA lock taken after ClassB if previously locks of these two classes were ever taken in inverse order

·          ClassX lock is taken when ClassX lock is already held.

o    Exemption: use functions like mutex_lock_nested and specify subclass.
Validator treats “class+subclass” as a separate class.

·          Locks released in the order that is not reverse of their acquisition.
(A stack of currently-held locks is maintained, so any lock being released should be at the top of the stack; anything else means that something strange is going on.)

·          Any (spin)lock ever acquired by hardirq can never be acquired when interrupts (hardirqs) are enabled.

o    Conversely: any lock ever acquired with hardirqs enabled is “hardirq-unsafe”
and cannot be acquired in hardirq.

·          Any (spin)lock ever acquired by softirq can never be acquired when softirqs are enabled.

o    Conversely: any lock ever acquired with softirqs enabled is “softirq-unsafe” and cannot be acquired in softirq.

·          Cannot:  (holding) hardirq-safe -> (acquire) hardirq-unsafe.

o    Since hardirq-unsafe -> hardirq-safe (taking a hardirq-safe lock while holding a hardirq-unsafe one) is allowed, the reverse order would create a potential lock inversion.
This may be too strong a rule, but it is checked against nevertheless.

·          Cannot:  (holding) softirq-safe -> (acquire) softirq-unsafe.


·          Unloading of kernel modules causes class ids to become available for reuse and can produce false warnings and leakage of the class table (which can overflow and result in a message).

·          Static initialization of large number of locks (e.g. array of structures with per-entry lock) can cause class table to overflow. Solution: initialize locks dynamically.


·          /proc/lockdep_stats

·          /proc/lockdep

Kernel threads:


struct task_struct* task;


task = kthread_create(thread_func, void* data, const char* namefmt, ...);
if (! IS_ERR(task))  wake_up_process(task);


Initially SCHED_NORMAL, nice 0.
Can alter with sched_setscheduler[_nocheck](…).


// create thread bound to NUMA memory node

task = kthread_create_on_node(thread_func, void* data, int node /* e.g. cpu_to_node(cpu) */,

                              const char* namefmt, ...);
if (! IS_ERR(task))  wake_up_process(task);


task = kthread_run(thread_func, void* data, const char* namefmt, ...);          // create + wake_up_process


int retval = kthread_stop(task);


Returns the thread's return value, or -EINTR if the thread was never woken up (never started).


int thread_func(void* data)
{
    /* loop until kthread_stop has been called */
    while (! kthread_should_stop()) {
        ... do work or sleep ...
    }

    return retval;    /* return status to kthread_stop caller */
}
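
Putting it together (my_task and the "my_worker" name are made up; thread_func is the skeleton above):


struct task_struct* my_task = kthread_run(thread_func, NULL, "my_worker");
if (IS_ERR(my_task))
        return PTR_ERR(my_task);

/* ... later, e.g. on module exit ... */

int ret = kthread_stop(my_task);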






Priority ranges:


task_struct* p;



Overall priority (p->prio):            0 (highest) … MAX_PRIO-1 (lowest, 139)

    MAX_PRIO = 140


Real-time priorities:                  0 (highest) … MAX_RT_PRIO-1 (lowest, 99)


Normal (non-RT) priorities:            MAX_RT_PRIO (100) … MAX_PRIO-1 (139)


Default priority for irq threads (PREEMPT_RT):

    MAX_USER_RT_PRIO/2  (50)






NR_CPUS  –  maximum number of CPUs in the system              // eventually will be a run-time variable nr_cpu_ids (nr_cpu_ids <= NR_CPUS)


unsigned long my_percpu[NR_CPUS];


int cpu = get_cpu();                                                  // disable preemption and get current processor


put_cpu();                                                                 // enable preemption


int   smp_processor_id();                                        // unsafe if preemption is not disabled


Per-CPU data interface, static form usable only in kernel but not in loadable modules:


DECLARE_PER_CPU(vartype, varname)                         // declare per-CPU variable (not required locally, DEFINE_PER_CPU is ok)
DEFINE_PER_CPU(vartype, varname)                           // define and allocate per-CPU variable


DECLARE_PER_CPU_SHARED_ALIGNED(...)                 // cacheline aligned

DECLARE_PER_CPU_ALIGNED(...)                                   // cacheline aligned

DECLARE_PER_CPU_PAGE_ALIGNED(...)                      // page-aligned


EXPORT_PER_CPU_SYMBOL(varname)                      // export for use by modules


get_cpu_var(varname)++;                                     // disable preemption, access variable and increment it

put_cpu_var(varname);                                          // reenable preemption


per_cpu(varname, cpu) = 0;                                // access per-cpu variable instance for particular CPU (does not disable

                                                                                  //    preemption and provides by itself no locking)


Per-CPU data interface, dynamic form usable both in kernel and in loadable modules:


void*   alloc_percpu(vartype);                                             // on failure returns NULL

void*   __alloc_percpu(size_t size, size_t align);

void    free_percpu(void* data);


A* p = alloc_percpu(A);                                                       // same as below

A* p = __alloc_percpu(sizeof(A), __alignof__(A));             // ...


Then access the per-CPU instances with get_cpu_ptr(p) / put_cpu_ptr(p), or per_cpu_ptr(p, cpu) for a specific CPU.


this_cpu_read, this_cpu_write, this_cpu_add, this_cpu_sub etc. – operations on the current CPU's instance, performed safely with respect to preemption and interrupts
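
A minimal per-CPU counter sketch (my_hits is a made-up name):


static DEFINE_PER_CPU(unsigned long, my_hits);

this_cpu_inc(my_hits);                                   // bump the local CPU's instance (preemption/irq safe)

unsigned long total = 0;
int cpu;

for_each_possible_cpu(cpu)                               // sum all instances
        total += per_cpu(my_hits, cpu);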

CPU sets:


typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
cpumask_t  cpumask;

unsigned long* bp = cpumask_bits(& cpumask);


const struct cpumask *const   cpu_possible_mask;   // populatable, e.g. can be hot-plugged, fixed at boot time      

const struct cpumask *const   cpu_present_mask;     // populated

const struct cpumask *const   cpu_online_mask;       // available to scheduler

const struct cpumask *const   cpu_active_mask;        // available to migration (a subset of online, currently identical)


#define   num_online_cpus()                 cpumask_weight(cpu_online_mask)

#define   num_possible_cpus()              cpumask_weight(cpu_possible_mask)

#define   num_present_cpus()               cpumask_weight(cpu_present_mask)

#define   num_active_cpus()                  cpumask_weight(cpu_active_mask)


#define   cpu_online(cpu)                      cpumask_test_cpu((cpu), cpu_online_mask)

#define   cpu_possible(cpu)                   cpumask_test_cpu((cpu), cpu_possible_mask)

#define   cpu_present(cpu)                    cpumask_test_cpu((cpu), cpu_present_mask)
#define   cpu_active(cpu)                       cpumask_test_cpu((cpu), cpu_active_mask)

#define   cpu_is_offline(cpu)                 unlikely(!cpu_online(cpu))


cpumask_t  mask  = { CPU_BITS_NONE };
const struct cpumask* mask = cpu_none_mask;

const struct cpumask* mask = cpu_all_mask;
cpumask_set_cpu(unsigned int cpu, & mask)
cpumask_clear_cpu(unsigned int cpu, & mask)
int   cpumask_test_cpu(unsigned int cpu, & mask)
int   cpumask_test_and_set_cpu(unsigned int cpu, & mask)
int   cpumask_test_and_clear_cpu(unsigned int cpu, & mask)


cpumask_setall(& mask);

cpumask_and(&mask, &m1, &m2);
cpumask_or(&mask, &m1, &m2);
cpumask_xor(&mask, &m1, &m2);
cpumask_andnot(&mask, &m1, &m2);
cpumask_complement(&mask, &m1);

bool  cpumask_empty(& mask);
bool  cpumask_subset(& mask, & superset_mask);

#define cpu_isset(cpu, cpumask) test_bit((cpu), (cpumask).bits)


cpumask_copy(&dst, &src);

bool  cpumask_equal(&m1, &m2)


int    cpumask_parse_user(const char __user *buf, int len, &mask)                                           // 0 = ok, otherwise -errno
int    cpumask_parselist_user(const char __user *buf, int len, &mask)
int    cpulist_parse(const char *buf, & mask)


int    cpumask_scnprintf(char *buf, int len,  &mask)
int    cpulist_scnprintf(char *buf, int len,  &mask)


unsigned int   cpumask_first(&mask)                           // returns >= nr_cpu_ids if none
unsigned int   cpumask_next(int n, &mask)                // returns >= nr_cpu_ids if none
unsigned int   cpumask_next_zero(int n, &mask)      // returns >= nr_cpu_ids if none


for_each_cpu(cpu, mask) {...}                                         // iterate over every cpu in a mask
for_each_cpu_not(cpu, mask) {...}                                // iterate over every cpu not in a mask
for_each_cpu_and(cpu, m1, m2) {...}                           // iterate over every cpu in both masks


#define   for_each_possible_cpu(cpu)                       for_each_cpu((cpu), cpu_possible_mask)

#define   for_each_online_cpu(cpu)                                                   for_each_cpu((cpu), cpu_online_mask)

#define   for_each_present_cpu(cpu)                        for_each_cpu((cpu), cpu_present_mask)




Calling function on other CPUs:


void myfunc(void* info);


The handler function is invoked on remote CPUs in IPI interrupt context;
on the local CPU it is called with interrupts disabled.


The following functions must be called with interrupts enabled (or can deadlock).
They return 0 if successful, otherwise –errno (e.g. –ENXIO if CPU is not online).
If wait is true, waits for completion of the call.


int      smp_call_function_single(int cpuid, myfunc, void *info, int wait);
int      smp_call_function_any(&mask, myfunc, void *info, int wait);


void   smp_call_function_many(&mask, myfunc, void *info, bool wait);

Execute myfunc(info) on all CPUs in mask except current CPU.
Preemption must be disabled when calling smp_call_function_many.

May not call it from bottom-half or interrupt handler.


int      smp_call_function(myfunc, void *info, int wait);


Call myfunc(info) on all CPUs in cpu_online_mask except current CPU.
May not call smp_call_function from bottom-half or interrupt handler.
Temporarily disables preemption internally.


int      on_each_cpu(myfunc, void *info, int wait);


Call myfunc(info) on all CPUs in cpu_online_mask, including the current CPU.
May not call on_each_cpu from bottom-half or interrupt handler.
Temporarily disables preemption internally.

void   on_each_cpu_mask(&mask, myfunc, void *info, bool wait);


Call myfunc(info) on all CPUs in the mask including current CPU if it is in the mask.
May not call on_each_cpu_mask from bottom-half or interrupt handler.

Preemption must be disabled when calling on_each_cpu_mask.


void    on_each_cpu_cond(bool (*cond_func)(int cpu, void *info), myfunc, void *info, bool wait, gfp_t gfp_flags);


For each CPU in cpu_online_mask, invoke cond_func(cpu, info) locally; for those CPUs for which it returns true, call myfunc(info) remotely and optionally wait. gfp_flags are used to allocate a temporary CPU mask.

May not call on_each_cpu_cond from bottom-half or interrupt handler.
Temporarily disables preemption internally.
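
A minimal sketch of running a function on one particular CPU (remote_hello and the CPU number are made up):


static void remote_hello(void* info)
{
        /* runs in IPI (interrupt) context on the target CPU: must not sleep */
        pr_info("hello from CPU %d\n", smp_processor_id());
}

int err = smp_call_function_single(2, remote_hello, NULL, 1 /* wait */);
if (err)
        pr_warn("CPU 2 not available: %d\n", err);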


Stop Machine:


int     stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)


Freeze the machine on all CPUs and execute fn(data).

This schedules a thread to be run on each CPU in the cpus mask (or each online CPU if cpus is NULL) at the highest priority, which threads also disable interrupts on their respective CPUs. The result is that no one is holding a spinlock or inside any other preempt-disable region when fn runs. This is effectively a very heavy lock equivalent to grabbing every spinlock (and more).


fn(data) is executed only on one CPU, with interrupts disabled.
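
A minimal usage sketch (toggle_feature and feature_enabled are hypothetical):


static bool feature_enabled;

static int toggle_feature(void* data)
{
        /* machine is frozen: no other CPU is inside a preempt-disabled region */
        feature_enabled = *(bool*) data;
        return 0;
}

bool on = true;
int err = stop_machine(toggle_feature, &on, NULL);       /* NULL cpumask = run on any online CPU */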

Memory zones:



ZONE_DMA
pages that can be used for DMA

x86-32: below 16 MB


ZONE_DMA32
pages that can be used for DMA,
but by 32-bit devices only


ZONE_NORMAL
normal pages,
permanently mapped into kernel memory

on x86-32: 16 MB to 896 MB
on some other architectures: all memory


ZONE_HIGHMEM
pages not permanently mapped into kernel memory

on x86-32: all memory above 896 MB
on some other architectures: empty


ZONE_MOVABLE
(related to hot-plugging)


Normal allocations are usually satisfied from ZONE_NORMAL, but on shortage can satisfy from ZONE_DMA.

x64 has only ZONE_DMA, ZONE_DMA32 and ZONE_NORMAL.

Memory allocation (kmalloc):

void*   kmalloc(size_t size, gfp_t flags)

void*   kmalloc_array(size_t n, size_t size, gfp_t flags)

void*   kcalloc(size_t n, size_t size, gfp_t flags)                                    // zeroes out

void*   kzalloc(size_t size, gfp_t flags)                                                          // zeroes out


void*   kmalloc_node(size_t size, gfp_t flags, int node)                                          // from specific NUMA node

void*   kzalloc_node(size_t size, gfp_t flags, int node)                                            // zeroes out
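
Typical usage sketch (struct foo is hypothetical):


struct foo* p = kzalloc(sizeof(*p), GFP_KERNEL);
if (! p)
        return -ENOMEM;

/* ... use p ... */

kfree(p);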


Common flag combinations:


GFP_KERNEL
Default, most usual.
Can block and can be called only in process context.
Can do block IO and file system calls inside.


GFP_ATOMIC
High-priority allocation. Does not sleep.
Use in interrupt handlers, bottom halves (softirqs, tasklets), when holding a spinlock
or in other situations when you cannot sleep.
Will use the emergency pool if necessary.


GFP_NOWAIT
Like GFP_ATOMIC, but will not fall back to the emergency pool.


GFP_NOIO
Like GFP_KERNEL, but will not do block IO.
Use in block IO code to avoid recursion.


GFP_NOFS
Like GFP_KERNEL (can block and initiate disk IO), but will not do file system IO.
Use in file system code to avoid recursion.


GFP_USER
Normal allocation, can block.
Use to allocate memory for user-space processes.


GFP_HIGHUSER
Allocate from ZONE_HIGHMEM. Can block.
Used to allocate memory for user-space processes.


GFP_DMA
Allocate from ZONE_DMA.
Should only be used for kmalloc caches; otherwise use a slab cache created with SLAB_DMA.


GFP_DMA32
Allocate from ZONE_DMA32

Their meanings:




























Basic flags:


Allocate from ZONE_DMA


Allocate from ZONE_DMA32


Allocate from ZONE_HIGHMEM


Allocate from ZONE_MOVABLE


Allocator can sleep


High-priority: allocator can access emergency pools


Allocator can start block IO


Allocator can start filesystem IO


Request cold-cache pages instead of trying to return cache-warm pages


Allocator does not print failure warnings


Allocator repeats (once) the allocation if it fails, but the allocation can potentially fail


Allocator repeats allocation indefinitely, the allocation cannot fail.

Deprecated, no new users allowed.


Allocator never retries the allocation if it fails


User reserves


The allocator adds compound page metadata (used internally by hugetlb code)


Zero out page content


Allocator does not fall back on reserves (takes precedence over MEMALLOC if both are specified)


The allocator enforces "hardwall" cpuset boundaries


Allocate NUMA node-local memory only, no fallback


The allocator marks the pages reclaimable


Allocation comes from a memcg-accounted resource


Don't track with kmemcheck




On behalf of other node


Allocator intends to dirty page


Memory allocation (slab allocator):

cache (pool) -> set of slab blocks (each is several pages) -> objects in each slab

struct kmem_cache*   kmem_cache_create(
                                                 const char *name,            // cache name, shown in /proc/slabinfo
                                                 size_t size,                 // size of each object
                                                 size_t align,                // required alignment, often L1_CACHE_BYTES, or 0 if SLAB_HWCACHE_ALIGN
                                                 unsigned long flags,         // SLAB_xxx, see below
                                                 void (*ctor)(void*))         // can be NULL

void   kmem_cache_destroy(struct kmem_cache*  cache)


DEBUG: perform extensive check on free


DEBUG: insert "red zones" around allocated memory to help detect buffer overruns


DEBUG: fill a slab with known value (0xA5) to catch access to uninitialized memory


Align objects within a slab to a cache line


Allocate from ZONE_DMA


DEBUG: store the last owner for bug hunting


Panic if allocation fails


Defer freeing slabs to RCU (read more in header file)


Spread some memory over cpuset


Trace allocations and frees




Avoid kmemleak tracing




Fault injection mark


Objects are reclaimable


Objects are short-lived


void*    kmem_cache_alloc(cache, gfp_t flags);
void*    kmem_cache_alloc_node(cache, gfp_t flags, int node)
void      kmem_cache_free(cache, void* p);
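
A minimal cache lifecycle sketch (struct foo and the cache name are hypothetical):


static struct kmem_cache* foo_cache;

foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
                              SLAB_HWCACHE_ALIGN, NULL);
if (! foo_cache)
        return -ENOMEM;

struct foo* p = kmem_cache_alloc(foo_cache, GFP_KERNEL);
if (p) {
        /* ... use p ... */
        kmem_cache_free(foo_cache, p);
}

kmem_cache_destroy(foo_cache);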

Memory allocation (virtually contiguous pages):

Allocated pages are virtually contiguous and not necessarily physically contiguous.
Less performant than kmalloc() since page tables need to be updated and TLBs flushed.

void*     vmalloc(unsigned long size);

void*     vzalloc(unsigned long size);                                                             // zeroed out

void       vfree(const void* addr);


void*     vmalloc_user(unsigned long size);                                                  // memory for user-space, zeroed out


void*     vmalloc_node(unsigned long size, int node);                                // allocate pages in specific NUMA node

void*     vzalloc_node(unsigned long size, int node);


void*     vmalloc_exec(unsigned long size);                                                 // executable memory

void*     vmalloc_32(unsigned long size);                                                    // 32-bit addressable memory

void*     vmalloc_32_user(unsigned long size);                                          // 32-bit addressable memory for user-space, zeroed-out


void*     vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot);

void        vunmap(const void* addr);                                                            // release mapping


void        vmalloc_sync_all(void);

Memory allocation (physical pages):

struct page*  alloc_pages(gfp_t  flags,  unsigned int order)

Allocate (2 ^ order)  contiguous physical pages, return pointer to the first page's page structure.
On failure returns NULL.

void*                   page_address(struct page*  page)                            // get the page's kernel virtual address

unsigned long     __get_free_pages(gfp_t gfp_flags, unsigned int order)
unsigned long     __get_free_page(flags)

Same as alloc_pages, but return the kernel virtual (logical) address of the first page, or 0 on failure.

unsigned long    get_zeroed_page(gfp_t flags)

                Allocate one zeroed page, return its kernel virtual address or 0.

#define __get_dma_pages(flags, order)    __get_free_pages((flags) | GFP_DMA, (order))

void     __free_pages(struct page *page, unsigned int order)
void     __free_page(struct page *page)

void     free_pages(unsigned long addr, unsigned int order)
void     free_page(unsigned long addr)

void*   alloc_pages_exact(size_t size, gfp_t flags)
void     free_pages_exact(void* virt_addr, size_t size)

Allocate contiguous physical pages to hold size bytes, return the kernel virtual address or NULL on failure.

struct page*    alloc_pages_node(int nid, gfp_t flags, unsigned int order)
struct page*    alloc_pages_exact_node(int nid, gfp_t flags, unsigned int order)
void*                               alloc_pages_exact_nid(int nid, size_t size, gfp_t flags)
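
A small sketch allocating and freeing a single zeroed page:


unsigned long addr = get_zeroed_page(GFP_KERNEL);
if (! addr)
        return -ENOMEM;

/* ... use the page via (void*) addr ... */

free_page(addr);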

Mapping physical pages into kernel space:

void*   kmap(struct page* page)


If page is in low memory (ZONE_NORMAL, ZONE_DMA) its virtual address is simply returned.

If page is in ZONE_HIGHMEM, a mapping is created and its virtual address is returned.
May sleep.


void      kunmap(struct page* page)


Unmap mapping created by kmap.


void*    kmap_atomic(struct page* page)

void      kunmap_atomic(void* kvaddr)

kmap_atomic/kunmap_atomic is significantly faster than kmap/kunmap because no global lock is needed and because the kmap code must perform a global TLB invalidation when the kmap pool wraps.


However when holding an atomic kmap it is not legal to sleep, so atomic kmaps are appropriate for short, tight code paths only.

struct page*   kmap_to_page(void *  kva)
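
A short sketch of copying data out of a (possibly highmem) page:


static void copy_from_page(struct page* page, void* dst, size_t offset, size_t len)
{
        void* vaddr = kmap_atomic(page);         /* usable in atomic context */
        memcpy(dst, vaddr + offset, len);        /* no sleeping between map and unmap */
        kunmap_atomic(vaddr);
}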





unsigned long pfn = page_to_pfn(page);


if (pfn_valid(pfn))

       page = pfn_to_page(pfn);


void get_page(page);


void put_page(page);


void *kva = page_address(page)

       get kernel virtual address for a page in the lowmem zone

       or for a highmem page currently mapped into the kernel address space

       otherwise returns NULL


                               if (virt_addr_valid(kva))                                                                    // for kva in lowmem zone

                                                             page = virt_to_page(kva);


Pin user pages in memory:


       note: obsolete, should use get_user_pages_locked|unlocked|fast


struct page** pages;                                   // array with room for npages entries


long nr = get_user_pages(current, current->mm, (unsigned long) buf, npages, write, force, pages, NULL);


....  modify ...
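
Typical cleanup once done with the pinned pages (a sketch continuing the snippet above):


for (i = 0; i < nr; i++) {
        if (write)
                set_page_dirty_lock(pages[i]);   /* pages were modified */
        put_page(pages[i]);                      /* drop the pin */
}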






Map kernel memory into userspace (e.g. for mmap):



err = remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,

             unsigned long pfn, unsigned long size, pgprot_t prot)


err = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, vma->vm_end - vma->vm_start, vma->vm_page_prot)


Map kernel memory to userspace page-by-page (vm_insert_page, e.g. in an mmap handler):


static struct vm_operations_struct xxx_mmap_ops = {

       .open  =              xxx_mm_open,

       .close  =        xxx_mm_close,



static void xxx_mm_open(struct vm_area_struct *vma)


       struct file *file = vma->vm_file;

       struct socket *sock = file->private_data;

       struct sock *sk = sock->sk;


       if (sk)




static void xxx_mm_close(struct vm_area_struct *vma)


       struct file *file = vma->vm_file;

       struct socket *sock = file->private_data;

       struct sock *sk = sock->sk;


       if (sk)







                               if (vma->vm_pgoff)

                                                             return -EINVAL;


                               size = vma->vm_end - vma->vm_start;


                               start = vma->vm_start;

                               for (i = 0; i < ....; i++) {

                                                             struct page *page = virt_to_page(po->pg_vec[i]);

                                                             int pg_num;


                                                             for (pg_num = 0; pg_num < ....; pg_num++, page++) {

                                                                                            err = vm_insert_page(vma, start, page);

                                                                                            if (unlikely(err))

                                                                                                                          goto out;

                                                                                            start += PAGE_SIZE;



                               vma->vm_ops = &xxx_mmap_ops;



Memory leak detection (kmemleak):


Page allocations and ioremap are not tracked.

Only kmalloc, kmem_cache_alloc (slab allocations) and vmalloc.




To disable by default: build with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y (or boot with kmemleak=off).


To enable at boot: boot with kmemleak=on


mount  -t debugfs  nodev  /sys/kernel/debug

cat  /sys/kernel/debug/kmemleak


Clear the list of all current possible memory leaks:


echo clear > /sys/kernel/debug/kmemleak


Trigger an intermediate memory scan:


echo scan > /sys/kernel/debug/kmemleak

cat  /sys/kernel/debug/kmemleak


Disable default thread that scans every 10 minutes:


echo 'scan=off' > /sys/kernel/debug/kmemleak

Dump the information about object at <addr>:


echo 'dump=<addr>' > /sys/kernel/debug/kmemleak


Also  /proc/slabinfo


HZ ticks (jiffies) per second, typically 100 or 1000; can also be tickless (NOHZ).
User-visible value is USER_HZ, so when communicating to userland scale like x  / (HZ / USER_HZ).

Jiffies comes in unsigned long and 64-bit variants (jiffies, jiffies_64).

unsigned long volatile   jiffies;                           // wraps around if BITS_PER_LONG < 64
u64                                    jiffies_64;

On x86 they are separate (but jiffies can overlay lower word of jiffies_64).
On x64 they are the same thing.

jiffies may be read atomically.
jiffies_64 on 32-bit architectures (BITS_PER_LONG < 64) must be read under seqlock_t  jiffies_lock.

Can always use get_jiffies_64(), on 32-bit architectures it reads data under seqlock, on 64-bit architectures just reads it.

#define  time_after(a,b)                           ((long)(b) - (long)(a) < 0)

#define  time_before(a,b)                       time_after(b,a)


#define  time_after_eq(a,b)                   ((long)(a) - (long)(b) >= 0)

#define  time_before_eq(a,b)                time_after_eq(b,a)


#define  time_in_range(a,b,c)                (time_after_eq(a,b) && time_before_eq(a,c))


#define  time_in_range_open(a,b,c)      (time_after_eq(a,b) && time_before(a,c))


#define  time_after64(a,b)                     ((__s64)(b) - (__s64)(a) < 0)

#define  time_before64(a,b)                  time_after64(b,a)


#define  time_after_eq64(a,b)               ((__s64)(a) - (__s64)(b) >= 0)

#define  time_before_eq64(a,b)            time_after_eq64(b,a)

#define  time_before_eq64(a,b)            time_after_eq64(b,a)


#define  time_is_before_jiffies(a)          time_after(jiffies, a)

#define  time_is_after_jiffies(a)             time_before(jiffies, a)

#define  time_is_before_eq_jiffies(a)    time_after_eq(jiffies, a)

#define  time_is_after_eq_jiffies(a)       time_before_eq(jiffies, a)


unsigned int            jiffies_to_msecs(const unsigned long j);

unsigned int            jiffies_to_usecs(const unsigned long j);

unsigned long         msecs_to_jiffies(const unsigned int m);

unsigned long         usecs_to_jiffies(const unsigned int u);

unsigned long         timespec_to_jiffies(const struct timespec *value);

void                         jiffies_to_timespec(const unsigned long jiffies, struct timespec *value);

unsigned long         timeval_to_jiffies(const struct timeval *value);

void                         jiffies_to_timeval(const unsigned long jiffies, struct timeval *value);


clock_t                    jiffies_to_clock_t(unsigned long x);

clock_t                    jiffies_delta_to_clock_t(long delta)


unsigned long         clock_t_to_jiffies(unsigned long x);

u64                          jiffies_64_to_clock_t(u64 x);

u64                          nsec_to_clock_t(u64 x);

u64                          nsecs_to_jiffies64(u64 n);

unsigned long         nsecs_to_jiffies(u64 n);


struct timespec {

        __kernel_time_t    tv_sec;
        long                          tv_nsec;



long remaining = schedule_timeout(10 * HZ)

unsigned long  wait_till  =  jiffies + 5 * HZ;
while (time_before(jiffies, wait_till))
        cond_resched();                                                                       // invoke scheduler if need_resched is true, e.g. there is some higher-priority task

void   mdelay(unsigned long msecs)              // busy-wait using calibrated loops
void   udelay(unsigned long usecs)
void   ndelay(unsigned long nsecs)

const enum hrtimer_mode mode = HRTIMER_MODE_REL;

ktime_t expires;


int  schedule_hrtimeout(&expires, mode)

int  schedule_hrtimeout_range(&expires, unsigned long delta, mode)

int  schedule_hrtimeout_range_clock(&expires, unsigned long delta, mode, int clock)
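
A minimal relative high-resolution sleep of ~2 ms (a sketch):


ktime_t expires = ktime_set(0, 2 * NSEC_PER_MSEC);       /* 2 ms */

set_current_state(TASK_UNINTERRUPTIBLE);
schedule_hrtimeout(&expires, HRTIMER_MODE_REL);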

On  timer  tick:



 * Event handler for periodic ticks


void tick_handle_periodic(struct clock_event_device *dev)


          int cpu = smp_processor_id();

          ktime_t next;




          if (dev->mode != CLOCK_EVT_MODE_ONESHOT)



           * Setup the next period for devices, which do not have

           * periodic mode:


          next = ktime_add(dev->next_event, tick_period);

          for (;;) {

                   if (!clockevents_program_event(dev, next, false))



                    * Have to be careful here. If we're in oneshot mode,

                    * before we call tick_periodic() in a loop, we need

                    * to be sure we're using a real hardware clocksource.

                    * Otherwise we could get trapped in an infinite

                    * loop, as the tick_periodic() increments jiffies,

                     * which then will increment time, possibly causing

                    * the loop to trigger again and again.


                   if (timekeeping_valid_for_hres())


                   next = ktime_add(next, tick_period);





 * Periodic tick


static void tick_periodic(int cpu)


          if (tick_do_timer_cpu == cpu) {



                   /* Keep track of the next tick event */

                   tick_next_period = ktime_add(tick_next_period, tick_period);










 * Must hold jiffies_lock


void do_timer(unsigned long ticks)


          jiffies_64 += ticks;






 * Called from the timer interrupt handler to charge one tick to the current

 * process.  user_tick is 1 if the tick is user time, 0 for system.


void update_process_times(int user_tick)


          struct task_struct *p = current;

          int cpu = smp_processor_id();


          /* Note: this timer irq context must be accounted for as well. */

          account_process_tick(p, user_tick);


          rcu_check_callbacks(cpu, user_tick);



          if (in_irq())








 * Called by the local, per-CPU timer interrupt on SMP.


void run_local_timers(void)







 * Account a single tick of cpu time.

 * @p: the process that the cpu time gets accounted to

 * @user_tick: indicates if the tick is a user or a system tick


void account_process_tick(struct task_struct *p, int user_tick)


          cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);

          struct rq *rq = this_rq();


          if (sched_clock_irqtime) {

                   irqtime_account_process_tick(p, user_tick, rq);




          if (steal_account_process_tick())



          if (user_tick)

                   account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);

          else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))

                   account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,





Print messages, dump stack etc.:


printk(KERN_WARNING "format string", args...)





circular buffer, default size 16K (LOG_BUF_LEN), configurable with CONFIG_LOG_BUF_SHIFT








#include <linux/ratelimit.h>

// deprecated since rate-limiting state is global for all printk_ratelimit() call sites

if (printk_ratelimit())



// better way: limit for a particular call site

printk_ratelimited(KERN_WARNING "format-string", args)


pr_xxx(format, args):
















pr_devel if DEBUG is defined



KERN_CONT = continue line with no NL char




to enable all messages: echo 8  >/proc/sys/kernel/printk

default is "warning (4) and up"

content of /proc/sys/kernel/printk (full):  <current>  <default>  <minimum>  <boot-time-default>

on boot: loglevel=8



printk_once(...), pr_emerg_once(...) and other pr_xxx_once(...)











For drivers:


#include <linux/device.h>


dev_xxx(const struct device *dev, fmt, args)

e.g. dev_dbg(...)


dev_xxx_ratelimited(dev, fmt, args)


dev_WARN(...), dev_WARN_ONCE(...) => includes file/lineno and backtrace


How to selectively enable/disable pr_debug/dev_dbg call sites (via debugfs):
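
Typical control commands (a sketch; requires CONFIG_DYNAMIC_DEBUG and a mounted debugfs; mydriver.c and mymodule are placeholders):


echo 'file mydriver.c +p'  > /sys/kernel/debug/dynamic_debug/control      # enable pr_debug/dev_dbg in one file
echo 'module mymodule -p'  > /sys/kernel/debug/dynamic_debug/control      # disable them for a whole module
cat /sys/kernel/debug/dynamic_debug/control                               # list all call sites and their state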






printk formats:


                                     int                            %d  or %x
                                     unsigned int                   %u  or %x
                                     long                           %ld or %lx
                                     unsigned long                  %lu or %lx
                                     long long                      %lld or %llx
                                     unsigned long long             %llu or %llx
                                     size_t                         %zu or %zx
                                     ssize_t                        %zd or %zx
                                     s32                            %d  or %x
                                     u32                            %u  or %x
                                     s64                            %lld or %llx
                                     u64                            %llu or %llx
                                     pointer                        %p


Also has formats for:


symbolic decoding of pointers (symbol + offset)

physical addresses (phys_addr_t)

DMA addresses (dma_addr_t)

raw buffers as hex string

MAC/FDDI addresses

IPv4/IPv6 addresses


dentry names

struct clk



Printing to ftrace buffer:


#include <linux/kernel.h>


trace_puts(str)  => extra fast






Red-black tree – self-balancing (semi-balanced) binary search tree:


·          each non-leaf node has 1 or 2 children

·          value(left child) < value(node) < value(right child)

·          depth(deepest path) <= 2 * depth(shallowest path)


struct rb_root {

          struct rb_node *rb_node;


struct rb_node {

          unsigned long  __rb_parent_color;

          struct rb_node *rb_right;

          struct rb_node *rb_left;



Typical user node:


struct mytype {

    struct rb_node node;

    char* mystring;


struct rb_root root = RB_ROOT;                  // { NULL }


#define rb_parent(r)   ((struct rb_node *)((r)->__rb_parent_color & ~3))


#define rb_entry(ptr, type, member) container_of(ptr, type, member)


void  rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root)

void  rb_link_node(struct rb_node * node, struct rb_node * parent, struct rb_node ** rb_link)




  struct mytype *my_search(struct rb_root *root, char *string)


                       struct rb_node *node = root->rb_node;


                       while (node) {

                                               struct mytype *data = container_of(node, struct mytype, node);

                                                int result;


                                                result = strcmp(string, data->mystring);


                                                if (result < 0)

                                                                       node = node->rb_left;

                                                else if (result > 0)

                                                                       node = node->rb_right;

                                                else

                                                                       return data;


                        return NULL;



Remove an existing node:


  struct mytype *data = my_search(&mytree, "walrus");


  if (data) {

                       rb_erase(&data->node, &mytree);           // void rb_erase(struct rb_node *victim, struct rb_root *tree)




To replace an existing node with the new one with the same key:

void rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root);


                        Inserting into tree:


  int my_insert(struct rb_root *root, struct mytype *data)


                       struct rb_node **new = &(root->rb_node), *parent = NULL;


                       /* Figure out where to put new node */

                       while (*new) {

                                               struct mytype *this = container_of(*new, struct mytype, node);

                                               int result = strcmp(data->mystring, this->mystring);


                                                parent = *new;

                                               if (result < 0)

                                                                       new = &((*new)->rb_left);

                                                else if (result > 0)

                                                                       new = &((*new)->rb_right);

                                                else

                                                                       return FALSE;



                       /* Add new node and rebalance tree. */

                       rb_link_node(&data->node, parent, new);

                       rb_insert_color(&data->node, root);


                        return TRUE;



Iterating through the tree:


struct rb_node*  rb_first(const struct rb_root*)

struct rb_node*  rb_last(const struct rb_root*)

struct rb_node*  rb_next(const struct rb_node*)

struct rb_node*  rb_prev(const struct rb_node*)


  struct rb_node *xnode;

  for (xnode = rb_first(&mytree);  xnode;  xnode = rb_next(xnode))

           printk("key=%s\n", rb_entry(xnode, struct mytype, node)->mystring);




Radix tree: maps "unsigned long" to "void*"


 #include <linux/radix-tree.h>


RADIX_TREE(name, gfp_mask);


struct radix_tree_root my_tree;

INIT_RADIX_TREE(my_tree, gfp_mask);


struct radix_tree_root *tree;

unsigned long key;

void *item;


int  radix_tree_insert(tree, key, item);

item = radix_tree_lookup(tree, key);

item = radix_tree_delete(tree, key);
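
A small insertion sketch with preloading (my_tree, my_lock, key and item are placeholders; the caller must provide its own locking around tree updates):


if (radix_tree_preload(GFP_KERNEL) == 0) {       /* preallocate nodes while sleeping is still allowed */
        spin_lock(&my_lock);
        err = radix_tree_insert(&my_tree, key, item);
        spin_unlock(&my_lock);
        radix_tree_preload_end();
}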



struct kobject {

          const char*                  name;

          struct list_head             entry;

          struct kobject*              parent;

          struct kset*                 kset;

          struct kobj_type*            ktype;

          struct sysfs_dirent*         sd;

          struct kref                  kref;

          unsigned int    state_initialized:1;

          unsigned int    state_in_sysfs:1;

          unsigned int    state_add_uevent_sent:1;

          unsigned int    state_remove_uevent_sent:1;

          unsigned int    uevent_suppress:1;



          struct kobj_type {

                   void (*release)(struct kobject *kobj);

                   const struct sysfs_ops *sysfs_ops;

                   struct attribute **default_attrs;

                   const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);

                   const void *(*namespace)(struct kobject *kobj);


struct kobject*   kobject_get(struct kobject *kobj)


If kobj is not NULL, increment reference counter and return kobj.


void    kobject_put(struct kobject *kobj)


If kobj is not NULL, decrement reference counter, if goes to zero then release the object.


kobject/kset sample: LINUX_ROOT/samples/kobject



kobj_map – handles device numbers.
There are two kobj_maps: one for block devices and one for character devices.
Each struct probe describes device number range starting at dev with length range.

struct kobj_map *    kobj_map_init(kobj_probe_t *base_probe, struct subsystem *s)

int      kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
                              struct module *owner, kobj_probe_t *get, int (*lock)(dev_t, void *), void *data)

void    kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range)

struct kobject *    kobj_lookup(struct kobj_map *domain, dev_t dev, int *index)

Finds the given device number dev on the given map domain. If the owner field is set, we temporarily take a reference on the corresponding module via try_module_get(owner) in order to protect the lock and get calls. If the lock function is present it will be called, and the present probe skipped if it returns an error. Then the get function is called to get the kobject for the given device number. The resulting kobject is returned as value, and the offset in the interval of device numbers is returned via index; module is released via module_put.


Lockup detection:





Default is 10 seconds, meaning 10 seconds for hard lockup detection and 20 seconds for soft lockup detection.


Hard lockup is lock up in kernel mode with interrupts disabled.

Soft lockup is lock up in kernel mode not letting other tasks to run.


Code is in kernel/watchdog.c

Docs in Documentation/lockup-watchdogs.txt


Using floating point in kernel:

#include <asm/i387.h>
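
The usual pattern (FP/SIMD use must stay between the two calls and must not sleep):


kernel_fpu_begin();              // saves FPU state, disables preemption
/* ... FP / SSE / AVX instructions here ... */
kernel_fpu_end();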




For read-mostly data, when transient reader/writer inconsistency ok.
Decreases read-side overhead, increases write-side overhead.


#include <linux/rcupdate.h>


// read-side RCU critical section:

//   - cannot sleep, but can be preempted if CONFIG_PREEMPT_RCU

//   - can be nested


static struct xxx_struct* __rcu xptr;



rcu_read_lock();

p = rcu_dereference(xptr);

// p is valid here

rcu_read_unlock();

// p is invalid here


// update section (synchronous, waiting for RCU sync)

// use spinlocks, semaphores etc. to interlock updaters

// cannot be called from irq or bh context



p = xptr;

rcu_assign_pointer(xptr, new_p);


synchronize_rcu();                                                       // wait for grace period

kfree(p);                                                                // old copy may now be freed



// update section (asynchronous,non-waiting for RCU sync)

// use spinlocks, semaphores etc. to interlock updaters



p = xptr;

rcu_assign_pointer(xptr, new_p);


call_rcu(&p->x_rcu, xxx_rcu_func);


struct xxx_struct {

        struct rcu_head x_rcu;



void xxx_rcu_func(struct rcu_head *p_rcu_head)
{
       struct xxx_struct *p = container_of(p_rcu_head, struct xxx_struct, x_rcu);

       kfree(p);
}




Note:  RCU callbacks can be invoked in softirq context. Therefore they cannot block, and any lock acquired within an RCU callback must be

acquired elsewhere with spin_lock_irq or spin_lock_bh, to avoid self-deadlock. Watch out for locks indirectly acquired

by API functions (e.g. kfree etc.) called by the callback!


Any lock acquired by RCU callback must be acquired elsewhere with softirq disabled (e.g. spin_lock_bh, spin_lock_irqsave), or self-deadlock will result.


RCU callbacks are usually executed on the same CPU, as scheduled by call_rcu_xxx, but this is not guaranteed (for example, if CPU goes offline, pending callbacks will be transferred to another CPU for execution).


RCU read-side primitives do not necessarily contain memory barriers.  Therefore CPU and compiler may reorder code into and out of RCU

read-side critical sections.  It is the responsibility of the RCU update-side primitives to deal with this.


What is RCU, Fundamentally?

What is RCU? Part 2: Usage

RCU part 3: the RCU API

The RCU API, 2010 Edition


"Is parallel programming hard..."


enumerate list entries


list_for_each_entry(p, &xxx_listhead, list) {





list_for_each_entry_rcu(p, &xxx_listhead, list) {




add list entry


if (p->condition)

                        list_add(&p->list, xxx_listhead);


                        list_add_tail(&p->list, xxx_listhead);



if (p->condition)

                        list_add_rcu(&p->list, xxx_listhead);


                        list_add_tail_rcu(&p->list, xxx_listhead);


delete list entry


list_for_each_entry(p, &xxx_listhead, list) {

                        if (p->condition) {








list_for_each_entry_rcu(p, &xxx_listhead, list)  {

                        if (p->condition) {



                                                list_del_rcu(&p->list);

                                                call_rcu(&p->rcu, xxx_free);





in-place updates


list_for_each_entry(p, &xxx_listhead, list) {

                        if (p->condition) {

                                                p->a = new_a;

                                                p->b = new_b;







list_for_each_entry_rcu(p, &xxx_listhead, list)  {

                        if (p->condition) {

                                                np = copy(p);

                                                if (np == NULL)   ...

                                                np->a = new_a;

                                                np->b = new_b;

                                                list_replace_rcu(&p->list, &np->list);


                                                call_rcu(&p->rcu, xxx_free);





eliminating stale data

If stale data cannot be tolerated, use per-entry “deleted” flag and per-entry spinlock, and re-verify the validity of data.


RCU flavors compared (RCU Classic, RCU BH, RCU Sched, Realtime RCU, SRCU, QRCU):


Purpose:

·          RCU Classic:     wait for RCU read-side critical sections

·          RCU BH:          wait for RCU-BH read-side critical sections & irqs

·          RCU Sched:       wait for RCU-Sched read-side critical sections, preempt-disable regions, hardirqs & NMIs

·          Realtime RCU:    realtime response

·          SRCU:            wait for SRCU read-side critical sections, allow sleeping readers

·          QRCU:            sleeping readers and fast grace periods


Read-side primitives:                                        …  (and friends)

Update-side primitives:                                      …

Update-side primitives (synchronous expedited):              synchronize_srcu_expedited() for SRCU;  …

Update-side primitives (wait for callbacks):                 …


Read-side constraints:

·          RCU Classic:     no blocking except preemption and spinlock acquisition

·          RCU BH:          no bh enabling

·          RCU Sched:       no blocking

·          Realtime RCU:    no blocking except preemption and lock acquisition

·          SRCU:            no synchronize_srcu()

·          QRCU:            no synchronize_qrcu()


Read-side overhead:

·          RCU Classic:     preempt disable/enable (free on non-PREEMPT)

·          RCU BH:          BH disable/enable

·          RCU Sched:       preempt disable/enable (free on non-PREEMPT)

·          Realtime RCU:    simple instructions, irq disable/enable

·          SRCU:            simple instructions, preempt disable/enable

·          QRCU:            atomic increment and decrement of a shared variable


Asynchronous update-side overhead (for example, call_rcu()):    …

Grace-period latency:    10s of milliseconds for all flavors, except QRCU – 10s of nanoseconds in absence of readers


Non-PREEMPT_RT implementation:    RCU Classic,  …,  RCU Classic,  …

PREEMPT_RT implementation:        …,  Realtime RCU,  Forced Schedule on all CPUs,  Realtime RCU,  …

Critical sections

Grace period







rcu_read_lock_bh_held