Macros, types and conventions:

 

long  IS_ERR(const void *ptr)            // pointer has value in “error” range (>= (unsigned long) -MAX_ERRNO)
long  IS_ERR_OR_NULL(const void *ptr)
long  PTR_ERR(const void *ptr)
void* ERR_PTR(long error)
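
Typical calling convention these macros support (a sketch; mydev_alloc and struct mydev are invented for illustration):

struct mydev { int id; };

/* Return either a valid pointer or an errno encoded in the pointer via ERR_PTR(). */
struct mydev* mydev_alloc(int id)
{
        struct mydev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);

        if (!dev)
                return ERR_PTR(-ENOMEM);        /* encode errno in the pointer value */
        dev->id = id;
        return dev;
}

/* Caller side: */
int mydev_probe(void)
{
        struct mydev *dev = mydev_alloc(0);

        if (IS_ERR(dev))
                return PTR_ERR(dev);            /* decode back to -ENOMEM etc. */
        /* ... use dev ... */
        return 0;
}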

 

likely(cond)

unlikely(cond)

 

BUG_ON(condition)          // can compile away, so do not create side effects in condition
BUILD_BUG_ON(condition)   
// build-time checking

 

typecheck(__u64, a)        // build-time checking that a is of type __u64

container_of(member_ptr, container_typename, member_name)
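
A short sketch of container_of(): recover the enclosing structure from a pointer to one of its members (structure and field names invented for illustration):

struct mydev {
        int              id;
        struct list_head node;          /* member embedded in the container */
};

/* Given a pointer to the embedded member (e.g. from list iteration),
 * get back the enclosing mydev. */
struct mydev* mydev_from_node(struct list_head *pos)
{
        return container_of(pos, struct mydev, node);
}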

 

BITS_PER_LONG              // 32 or 64

 


 

No floating point operations in kernel (there are routines to save/restore fp context).

 

Interrupt stack (optional; present only on some architectures), normally 1 page (on x86/x64 – 4 KB).

 

Kernel stack: on x86/x64 usually 2 pages, i.e. 8 KB.
At the bottom of the stack is struct thread_info.
A stack overflow will silently overwrite it, with disastrous consequences.

 

struct thread_info* current_thread_info();
current_thread_info()->preempt_count

struct task_struct* task = current_thread_info()->task;
struct task_struct* task = current;

 

/* task->state */

TASK_RUNNING                 0      // currently running or on runqueue waiting to run

TASK_INTERRUPTIBLE           1      // in interruptible sleep

TASK_UNINTERRUPTIBLE         2      // in non-interruptible sleep

__TASK_STOPPED               4      // stopped due to SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU

__TASK_TRACED                8      // being traced by debugger etc. via ptrace

 

/* in task->exit_state */

EXIT_ZOMBIE                  16        // process has terminated, lingering around to be reaped by parent

EXIT_DEAD          32        // final state after parent collects status with wait4 or waitpid syscalls

 

/* in task->state again */

TASK_DEAD          64

TASK_WAKEKILL                128       // wake up on fatal signals

TASK_WAKING                  256

 

#define TASK_KILLABLE        (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

#define TASK_STOPPED         (TASK_WAKEKILL | __TASK_STOPPED)

#define TASK_TRACED          (TASK_WAKEKILL | __TASK_TRACED)

#define TASK_NORMAL          (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

 

set_task_state(task, state)

 

Iterate parent tasks

(note: if the parent terminates, the task is reparented to another member of its thread group or to init)

 

for (task = current;  task != &init_task;  task = task->parent)
    { ... }

 

Iterate all tasks in the system (expensive):

for_each_process(task)

    { ... }

 

task = next_task(task);
task = prev_task(task)

 

Iterate over children tasks:

 

struct list_head* list;

list_for_each(list, &current->children)
{

    task = list_entry(list, struct task_struct, sibling);

}

 

Annotation for functions that can sleep. This macro will print a stack trace if it is executed in an atomic

context (spinlock, irq-handler, ...):

 

might_sleep()

 


 

Modifiers (annotations):

 

__init                          place code into the section discarded after kernel or module init

__read_mostly                   place data in “read mostly” section to reduce cache line interference
__write_once                    same as __read_mostly

__page_aligned_bss              align data on page boundary and allocate it from the page-aligned bss section

notrace                         do not instrument this function for profiling

__irq_entry                     allocate code in section irqentry-text (only if the graph tracer is enabled)

__kprobes                       allocate code in section kprobes-text (only if CONFIG_KPROBES)

asmlinkage                      assembler-level call linkage

__lockfunc                      allocate code in section spinlock-text

 

__kernel                        designate as kernel-space address

__user, __iomem, __rcu          disable dereference (only used by __CHECKER__, see sparse(1))

 

__nocast                        sparse will issue a warning on attempted type conversion to this var (e.g. a signed int value to an unsigned var)

__bitwise                       sparse will issue a warning if __bitwise variables are mixed with other integers

__safe                          sparse: not used in the codebase

__force                         sparse: disable warning (e.g. when casting int to bitwise)

__le32, __be32                  sparse: little endian, big endian

 

__must_hold(x)                                            sparse: the specified lock is held on function entry and exit

__acquires(x)                                                sparse: function acquires the specified lock

__releases(x)                                                sparse: function releases the specified lock

__acquire(x)

__release(x)

__cond_lock(x,c)

 

Sparse:

git://git.kernel.org/pub/scm/devel/sparse/sparse.git
make
make install

then run kernel make:

make C=1                         # run sparse only on files that need to be recompiled
make C=2                         # run sparse on all files, whether recompiled or not
make C=2 CF="-D__CHECK_ENDIAN__"     # pass optional parameters to sparse



Contexts:

 

·          process (e.g. syscall), preemptible context

·          process, non-preemptible context

·          bottom-half (DPC/fork-like, but may be threaded, aka “soft interrupt” – e.g. tasklet)

·          primary interrupt handler (irq, aka "top half", aka “hard interrupt”)

·          PREEMPT_RT only: secondary (threaded) interrupt handler, per-IRQ thread

 

There are no IPL/IRQL levels.

Softirq dispatch is entered with hardware interrupts disabled, but interrupts are re-enabled
before the individual softirq handlers (including tasklet handlers) are invoked.

 

Bottom-half processing can be triggered by:

·          return from hardware interrupt (irq_exit)

·          local_bh_enable(), including indirect calls

·          ksoftirqd kernel thread (per-CPU, runs at low priority)

·          any code that explicitly calls do_softirq(), __do_softirq(), call_softirq() or invoke_softirq(),
such as network code

 

Note again that Linux is NOT a "true" IPL/IRQL-based system, and local_irq_enable() or unlocking a spinlock DOES NOT cause SOFTIRQ processing to happen.

 

Task switch can be triggered by:

 

·          REI to user-space

·          REI to kernel-space if (current->preempt_count == 0 && need_resched())

·          unlocking a lock, including indirect unlock or any other call to preempt_enable(),
if result is (current->preempt_count == 0 && need_resched())

·          explicit voluntary call to schedule(), including yield() and blocking functions

 

Preemption mask:

 

[figure: preempt_count bit-field layout (preempt-mask.gif)]

preempt_count()               current_thread_info()->preempt_count

preempt_count()               Returns the preemption disable count, including the counts added
                                                       by currently active hardirqs, softirqs and NMI. The low byte is the preempt
                                                       disable count (if the kernel is configured preemptible, otherwise 0),
                                                       byte 1 is the softirq count, the next 10 bits are the hardirq count,
                                                       then 1 bit for NMI, then one bit for PREEMPT_ACTIVE.

 

There is also preempt_lazy_count (useful for PREEMPT_RT)

SOFTIRQ is changed by 1 on entering/leaving softirq processing and is changed by 2 on local_bh_enable/local_bh_disable, to distinguish between softirq-active and softirq-disabled states

PREEMPT_ACTIVE                                                    rescheduling is being performed (so do not reenter the scheduler)

 

Context control functions:

 

preempt_disable()                              increments PREEMPT field
preempt_enable()                               decrements PREEMPT field; if it goes to 0, will check for need_resched()

 

local_bh_disable()                               increments SOFTIRQ field by 2 (whereas entering softirq increments it
                        only by 1, to distinguish between softirq-active/softirq-disabled states)

local_bh_enable()                                decrements SOFTIRQ field by 2; if it goes to 0, will check for pending bottom halves (softirqs)

 

local_irq_disable()                              disable hw interrupts unconditionally (on some arch can be lazy)
local_irq_enable()                               enable hw interrupts unconditionally

 

unsigned long flags;

local_irq_save(flags)                         save previous state into flags and disable hw interrupts
local_irq_restore(flags)                  restore hw-interrupt-enable state from flags

Other functions:

 

hard_irq_disable()                              Really disable interrupts, rather than relying on lazy-disabling on
                                                                        some architectures.

preempt_enable_no_resched()    Decrement preemption count. If goes to 0, do not check for any
                                                                        pending reschedules.

 

in_nmi()                                                     Non-zero if executing NMI handler.

 

in_irq()                                                       Non-zero if currently executing  primary (hardirq) hardware

                                                                        interrupt  handler, 0 if in process context or bottom half or NMI.

 

in_softirq()                                              Non-zero if processing softirq or have BH disabled.

 

in_serving_softirq()                         Non-zero if processing softirq.

 

in_interrupt()                                        Non-zero if in NMI or primary interrupt (hardirq) or in bottom-half context, or if have BH disabled.

irqs_disabled()                                     Non-zero if local interrupt delivery is disabled.

 

in_atomic()                                              Checks (preempt_count() & ~PREEMPT_ACTIVE), cannot sense if

                                                                        spinlocks are held in non-preemptable kernel, so use only with
                                                                        great caution. Do not use in drivers.

 

synchronize_hardirq(irq)            Wait for pending hard IRQ handlers to complete on other CPUs

 

synchronize_irq(irq)                       Wait for pending hard & threaded IRQ handlers to complete on other CPUs

 

Warning: on non-preemptible systems (CONFIG_PREEMPT=n) spin_lock does not change preemption mask.
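
A sketch (not from the source) of the typical pairing of the context-control calls above when touching data shared with a hardirq handler or a bottom half; all names are illustrative:

static unsigned long my_counter;        /* shared with an IRQ handler / softirq */

void my_update(void)
{
        unsigned long flags;

        preempt_disable();              /* per-CPU work: prevent migration/preemption */
        /* ... per-CPU accounting ... */
        preempt_enable();               /* may reschedule if need_resched() */

        local_irq_save(flags);          /* data also touched by a hardirq handler */
        my_counter++;
        local_irq_restore(flags);

        local_bh_disable();             /* data only shared with a softirq/tasklet */
        /* ... touch data used by the bottom half ... */
        local_bh_enable();              /* may run pending softirqs */
}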

 




Bottom halves:

 

top half = primary interrupt context

bottom half = deferred interrupt processing (similar to DPC or fork), can be threaded

 

Methods:

 

original BH          removed in 2.5
Task queues          removed in 2.5
Softirq              since 2.3
Tasklet              since 2.3
Work queues          since 2.5 (replaced task queues)
Timers

 

 

Original BH (removed in 2.5):

 

32-bit request mask and 32 lists of requests, each level globally synchronized across all CPUs.

 

Task queues  (removed in 2.5):

 

Set of queues.

 




Softirqs (BH):

 

Set of NR_SOFTIRQS statically defined bottom halves allocated at kernel compile time.

Cannot register extra softirq levels dynamically.

Handler can run on any CPU.

 

Rarely used directly (tasklets are more common), but tasklets are layered on Softirq.

 

Defined levels:

 

HI_SOFTIRQ                0     High-priority tasklets
TIMER_SOFTIRQ             1     Timers
NET_TX_SOFTIRQ            2     Send network packets
NET_RX_SOFTIRQ            3     Receive network packets
BLOCK_SOFTIRQ             4     Block devices: done
BLOCK_IOPOLL_SOFTIRQ      5     Block devices: poll
TASKLET_SOFTIRQ           6     Normal-priority tasklets
SCHED_SOFTIRQ             7     Scheduler (just inter-CPU balancing, not the same as SOFTINT IPL$_RESCHED)
HRTIMER_SOFTIRQ           8     High-resolution timers
RCU_SOFTIRQ               9     RCU callbacks

 

 

Level to name:

 

extern char *softirq_to_name[NR_SOFTIRQS];

 

Register handler:

 

void my_action(struct softirq_action *);

open_softirq(HRTIMER_SOFTIRQ, my_action);

 

Handler runs with interrupts enabled and cannot sleep.

 

Raise softirq:

 

Marks softirq level as pending.
Temporarily disables interrupts internally.

 

raise_softirq(NET_TX_SOFTIRQ);

 

Equivalent to:

 

                        unsigned long flags;

 

                        local_irq_save(flags);

                        raise_softirq_irqoff(nr);

                        local_irq_restore(flags);

If interrupts are already disabled, can do:

 

raise_softirq_irqoff(NET_TX_SOFTIRQ);

 

Try to send a softirq to a remote cpu. 

If this cannot be done, the work will be queued to the local cpu.

 

void  send_remote_softirq(struct call_single_data* cp, int cpu, int softirq);
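
Putting the pieces above together, a minimal sketch (illustrative only; new softirq levels cannot be added at run time, and real drivers normally use tasklets rather than raw softirqs):

/* Handler: runs with interrupts enabled, must not sleep. */
static void my_net_tx_action(struct softirq_action *h)
{
        /* ... drain a per-CPU queue of pending transmit work ... */
}

/* Registration, e.g. at subsystem init (reusing an existing level): */
open_softirq(NET_TX_SOFTIRQ, my_net_tx_action);

/* Later, typically from a hardirq handler: mark the level pending. */
raise_softirq(NET_TX_SOFTIRQ);          /* or raise_softirq_irqoff() if IRQs are already off */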

 


 

Softirq processing logic:

 

do_softirq is invoked:

 

·          on return from hardware interrupt (irq_exit)

·          local_bh_enable()

·          in the ksoftirqd kernel thread (per-CPU, runs at low priority)

·          in any code that explicitly calls do_softirq(), __do_softirq(), call_softirq() or invoke_softirq(),
  such as network code

 


void do_softirq(void)

{

    __u32 pending;

    unsigned long flags;

    struct softirq_action* h;

 

     if (in_interrupt())

        return;

 

    local_irq_save(flags);

 

    pending = local_softirq_pending();

 

    if (pending)

    {

        ... on x86 and some other architectures ...

        ... switch to per-CPU softirq stack ...

 

        set_softirq_pending(0);

        local_irq_enable();

 

        h = softirq_vec;

 

        do

        {

           if (pending & 1)

               h->action(h);

            h++;

            pending >>= 1;

        } while (pending);   

    }

 

    ... recheck if still pending, and if so retry the loop few times ...

    ... after which wakeup softirqd for execution on a worker thread ...

 

    local_irq_restore(flags);

}

 

Thus:

 

·          Multiple softirqs (of same or different levels) can concurrently execute on different processors.

·          Single pass on the same CPU goes from high-priority (0) to low-priority (9) softirqs, but in a larger scheme of things low-priority softirq can get executed earlier than high-priority softirq.

(E.g. if another processor picks up a lower-priority softirq first, or if a high-priority softirq is added while a scan pass is already in progress, or if tasklets are relegated to threaded execution, or due to ksoftirqd.)

·          softirqs are prioritized for ordering only (and even then are subject to being postponed to the ksoftirqd thread);
a high-priority softirq does not interrupt a lower-priority softirq.

 

"While a (softirq) handler runs, softirqs on the current processor are disabled.
A softirq never preempts another softirq (on the same CPU).
The only event that can preempt a softirq is an interrupt handler."

 

 



Tasklets (BH):

 

Tasklets

·          built on top of softirqs

·          can be scheduled only as one instance at a time (and also only on one processor at a time)

·          if the handler is already running on CPU1 and the tasklet gets scheduled again (e.g. from CPU2), the handler will run again (possibly on CPU2) after the first invocation completes; it never runs concurrently with itself

·          handler runs with all interrupts enabled

·          does not have process context (no files etc.)

·          cannot sleep

·          per-processor queues tasklet_vec (for regular tasklets) and tasklet_hi_vec (for high-priority tasklets)

 

Rules:

·          if tasklet_schedule is called, then tasklet is guaranteed to be executed on some CPU at least once after this

·          if tasklet is already scheduled, but its execution is still not started, it will be executed only once

·          if tasklet is already running on another CPU (or schedule is called from tasklet itself), it is scheduled again

·          tasklet is strictly serialized with respect to itself, but not with respect to other tasklets;
  for inter-tasklet synchronization, use spinlocks.

 

 

      struct tasklet_struct
      {
              struct tasklet_struct *next;
              unsigned long state;           // bits: TASKLET_STATE_SCHED (0), TASKLET_STATE_RUN (1)
              atomic_t count;                // if non-zero, tasklet is disabled and cannot run;
                                             // if zero, it is enabled and can run when scheduled
              void (*func)(unsigned long);
              unsigned long data;
      };

 

#define DECLARE_TASKLET(name, func, data) \

                struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }

 

#define DECLARE_TASKLET_DISABLED(name, func, data) \

               struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }

 

void   tasklet_init(struct tasklet_struct *t, void (*func)(unsigned long), unsigned long data)

 

int     tasklet_trylock(struct tasklet_struct *t)                                       // sets RUN bit on MP, return  true if was not set

void   tasklet_unlock(struct tasklet_struct *t)                                       // clear RUN bit

void   tasklet_unlock_wait(struct tasklet_struct *t)                                                 // spin while RUN bit is set

 

No-op on uniprocessor.

 

void   tasklet_schedule(struct tasklet_struct *t)                                  

void   tasklet_hi_schedule(struct tasklet_struct *t)

 

If SCHED is already set, do nothing.

Otherwise insert into queue and signal softirq.

 

void   tasklet_kill(struct tasklet_struct *t);

 

Cannot be called from interrupt.
Acquire SCHED bit (i.e. wait till owner clears it, e.g after removing from queue), spin while RUN bit is set, clear SCHED bit.

 

tasklet state at call to tasklet_kill        outcome

not queued                                   returns fine
queued, disabled                             will lock up
enabled, queued                              removed from queue; handler completes execution;
                                             attempts to re-schedule will not pass
enabled, running                             waits (spins on the RUN bit) until the running handler finishes

 

void   tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu);

 

Can be called only for CPU in state CPU_DEAD.

Remove tasklet even if it is in SCHED state.

void   tasklet_disable_nosync(struct tasklet_struct *t)

 

Disables tasklet by atomic_inc(t->count) and returns immediately.

void   tasklet_disable(struct tasklet_struct *t)

 

Disables tasklet by atomic_inc(t->count) and calls tasklet_unlock_wait().

 

void   tasklet_enable(struct tasklet_struct *t)
void   tasklet_hi_enable(struct tasklet_struct *t)

 

Enables tasklet by atomic_dec(t->count).
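
A minimal usage sketch built on the API above (the IRQ handler and names are invented for illustration; data is passed as 0 here):

static void my_tasklet_fn(unsigned long data)
{
        /* deferred work: runs in softirq context, interrupts enabled, must not sleep */
}

static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

/* From the primary interrupt handler: */
static irqreturn_t my_isr(int irq, void *cookie)
{
        /* ... acknowledge the hardware ... */
        tasklet_schedule(&my_tasklet);          /* defer the rest to softirq time */
        return IRQ_HANDLED;
}

/* On driver teardown: */
void my_teardown(void)
{
        tasklet_kill(&my_tasklet);              /* ensure it is neither queued nor running */
}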

 

Tasklet block:

func(data)
disable_count
flags: sched (“scheduled”), run (“currently running”)

Handler logic:

 

tasklet_schedule:

        if (!sched)
        {
                sched = 1
                queue + softirq
        }

softirq handler:

        copy:  list = per-cpu queue
        clear per-cpu queue

        for (each entry)
        {
                if (run)
                {
        requeue:
                        put back on the queue
                        request softirq
                }
                else
                {
                        run = 1

                        if (disable_count != 0)
                        {
                                run = 0
                                goto requeue
                        }
                        else
                        {
                                sched = 0
                                func(data)
                                run = 0
                        }
                }
        }

 



Work queues (BH):

 

·          items executed by a worker thread, in process context and can sleep

·          default threads are per-CPU events/n, but a queue can also have a dedicated processing thread or threads

 

wq = create_workqueue("my_queue");
wq = create_singlethread_workqueue("my_queue");
wq = create_freezable_workqueue("my_queue");

NULL if fails.

create_workqueue  creates queue and worker threads, one per CPU, named my_queue/n.

 

pre-existing queues:

 

system_wq

Used by schedule[_delayed]_work[_on]().
Per-CPU multithreaded.

For short items.

system_long_wq

Similar to system_wq, but for long items.
Queue flushing may take relatively long time.

system_freezable_wq

Similar to system_wq, but freezable.

system_unbound_wq

Workers are not bound to any specific CPU, not concurrency managed, and all queued works are executed immediately as long as max_active limit is not reached and resources are available.

 

int       keventd_up(void)

 

Check if system_wq exists (not NULL).

 

void  destroy_workqueue( struct workqueue_struct* wq);

 

void    workqueue_set_max_active(wq, int max_active)

 

Adjust maximum number of concurrently active works.
Do not call from interrupt context.

 

struct workqueue_struct*  wq;

struct work_struct  work;

void   my_func(struct work_struct *pWork);

 

DECLARE_WORK(work, my_func);                                                                       // compile-time initialization

INIT_WORK(&work, my_func)                                                       // run-time initialization

INIT_WORK_ONSTACK(&work, my_func)                               // ... same for on-stack work object

PREPARE_WORK(&work, new_my_func)                               // change the routine in an already initialized descriptor

 

bool    queue_work(wq, pWork)                                                                       // queue on current CPU

bool    schedule_work(pWork)                                                   // same as queue_work  on system_wq

 

Returns TRUE if was queued, FALSE if was already on the queue.

 

If the queue is being drained with drain_workqueue(), the work will not actually be inserted unless it is queued from the context of a "chained work" (i.e. a work item executing on this queue); this is not reflected in the return status.

 

To pass parameters to the work routine, embed struct work_struct in a larger structure.
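
A sketch of that embedding pattern (illustrative names), using container_of() in the handler to recover the enclosing structure:

struct my_request {
        int                arg;         /* parameters for the handler */
        struct work_struct work;        /* embedded work item */
};

static void my_work_func(struct work_struct *pWork)
{
        struct my_request *req = container_of(pWork, struct my_request, work);
        /* ... use req->arg; free req here if it was kmalloc'ed ... */
}

/* Producer side: */
int my_submit(int arg)
{
        struct my_request *req = kmalloc(sizeof(*req), GFP_KERNEL);

        if (!req)
                return -ENOMEM;
        req->arg = arg;
        INIT_WORK(&req->work, my_work_func);
        schedule_work(&req->work);              /* or queue_work(wq, &req->work) */
        return 0;
}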

 

bool    queue_work_on(int cpu, wq, pWork)                                          // queue for processing on  specified CPU

bool    schedule_work_on(int cpu, pWork)                                              // same as queue_work_on  on system_wq

 

Returns TRUE if was queued, FALSE if was already on the queue.
Caller must ensure the CPU cannot go away.
If the queue is being drained with drain_workqueue(), see above.


struct delayed_work dwork

struct delayed_work *pDelayedWork

void   my_delayed_work_timer_func(struct work_struct *pWork);

 

DECLARE_DELAYED_WORK(dwork, my_delayed_work_timer_func)               // compile-time initialization

INIT_DELAYED_WORK(&dwork, my_delayed_work_timer_func)                                              // run-time initialization

INIT_DELAYED_WORK_ONSTACK(&dwork, my_delayed_work_timer_func)                      // ... same for on-stack dwork object

PREPARE_DELAYED_WORK(&dwork, new_my_delayed_func)                          // change the routine in an already initialized struct

 

/* Deferrable is similar to delayed, but may postpone waking up CPU if sleeping */

DECLARE_DEFERRABLE_WORK(dwork, my_delayed_work_timer_func)       // compile-time initialization

INIT_DEFERRABLE_WORK(&dwork, my_delayed_work_timer_func)              // run-time initialization

INIT_DEFERRABLE_WORK_ONSTACK(&dwork, my_delayed_work_timer_func)              // ... same for on-stack dwork object

PREPARE_DELAYED_WORK(&dwork, new_my_delayed_func)                          // change the routine in an already initialized struct

 

bool    queue_delayed_work(wq, struct delayed_work *work, unsigned long delay)

bool    schedule_delayed_work(struct delayed_work *work, unsigned long delay)

bool    queue_delayed_work_on(int cpu, wq, struct delayed_work *work, unsigned long delay)

bool    schedule_delayed_work_on(int cpu, struct delayed_work *work, unsigned long delay)

 

Similar to queue_work/schedule_work, but the additional parameter delay (number of system ticks) specifies the minimum time to delay before executing the item. If the queue is being drained with drain_workqueue(), see above.
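
For example, a self-rearming poll every second or so (a sketch; names invented, delay expressed in jiffies via HZ):

static void my_poll_func(struct work_struct *pWork);
static DECLARE_DELAYED_WORK(my_poll, my_poll_func);

static void my_poll_func(struct work_struct *pWork)
{
        /* ... periodic polling, may sleep (process context) ... */
        schedule_delayed_work(&my_poll, HZ);    /* re-arm: run again in ~1 second */
}

/* start: */
void my_start_polling(void)
{
        schedule_delayed_work(&my_poll, HZ);
}

/* stop (e.g. on module exit): */
void my_stop_polling(void)
{
        cancel_delayed_work_sync(&my_poll);
}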

 

bool    mod_delayed_work(wq, pDelayedWork, unsigned long delay)

bool    mod_delayed_work_on(int cpu, wq, pDelayedWork, unsigned long delay)

 

Modify delay of or queue a delayed work on specific CPU or any CPU.

 

If pDelayedWork is idle, equivalent to queue_delayed_work_on(); otherwise, modify pDelayedWork 's timer so that it expires after delay.  If delay is zero, pDelayedWork is guaranteed to be scheduled immediately regardless of its current state.

 

Returns FALSE if pDelayedWork was idle and queued, TRUE if pDelayedWork was pending and its timer was modified.

 

Safe to call from any context including interrupt handler.

 

bool    cancel_work_sync(pWork)

 

Cancel pWork and wait for its execution to finish. 

 

Can be used even if the work re-queues itself or migrates to another workqueue. 

On return, pWork is guaranteed to be not pending or executing on any CPU.

 

Returns TRUE if pWork was pending, FALSE otherwise.

 

Caller must ensure that the workqueue on which pWork was last queued can't be destroyed before this function returns.


bool    cancel_delayed_work_sync(pDelayedWork)

 

Cancel a delayed work and wait for it to finish.

Similar to cancel_work_sync(), but for delayed works.

 

bool    cancel_delayed_work(pDelayedWork)

 

Cancel a pending delayed work. 

Returns TRUE if pDelayedWork was pending and canceled, FALSE if wasn't pending. 

 

Work callback function may still be running on return, unless function returns TRUE and the work doesn't re-arm itself.  Explicitly flush or use cancel_delayed_work_sync() to wait on it.

 

Safe to call from any context including IRQ handler.

bool    flush_work(pWork)

 

Wait for a work to finish executing the last queueing instance.

Wait until pWork has finished execution. 

pWork is guaranteed to be idle on return if it hasn't been requeued since flush started.

 

Returns TRUE if flush_work() waited for the work to finish execution, FALSE if it was already idle.

 

bool    flush_delayed_work(pDelayedWork)

 

Wait for a delayed work to finish executing the last queueing.

 

Delayed timer is cancelled and the pending work is queued for immediate execution. 

Like flush_work(), this function only considers the last queueing instance of pDelayedWork.

 

Returns TRUE if waited for the work to finish execution, FALSE if it was already idle.

 

void    flush_workqueue(wq)

void    flush_scheduled_work(void)                                        // same as flush_workqueue(system_wq)

 

Ensure that any scheduled work has run to completion.

Forces execution of the workqueue and blocks until its completion.

 

Sleep until all works which were queued on entry have been handled, but not livelocked by new incoming ones.

 

This is typically used in driver shutdown handlers.

 

void    drain_workqueue(wq)

 

Wait until the workqueue becomes empty. 

 

While draining is in progress, only chain queueing is allowed. 

In other words, only currently pending or running work items on wq can queue further work items on it. 

 

wq is flushed repeatedly until it becomes empty.  The number of flushes is determined by the depth of chaining and should be relatively small.

 

int       schedule_on_each_cpu(work_func_t  func)

 

Execute a function synchronously on each online CPU

 

Executes func on each online CPU using the system workqueue and blocks until all CPUs have completed.

Very slow.

 

Returns 0 on success, -errno on failure.

 

int       execute_in_process_context(work_func_t  fn, struct execute_work*  pExecuteWork)

 

Execute the routine within user context.

 

Executes the function immediately if process context is available, otherwise schedules the function for delayed execution.

Returns:     0 if function was executed immediately (i.e. was not in interrupt).

Returns:     1 if was in interrupt and attempted to schedule function for execution.

 

bool    workqueue_congested(unsigned int cpu, wq)

 

Test whether wq's cpu workqueue for cpu is congested. 

 

There is no synchronization around this function and the test result is unreliable and only useful as advisory hints or for debugging.

unsigned int    work_cpu(pWork)

 

Return the last known associated cpu for pWork: CPU number if pWork was ever queued, WORK_CPU_NONE otherwise.

 

unsigned int    work_busy(pWork)

 

Test whether a work is currently pending or running

 

There is no synchronization around this function and the test result is unreliable and only useful as advisory hints or for debugging. Especially for reentrant wqs, the pending state might hide the running state.

 

Returns or'd bitmask of WORK_BUSY_* bits.

 


 

Timers (BH):

Called via TIMER_SOFTIRQ.
Timers are not repetitive (once a timer expires, it must be re-armed).

void myfunc(unsigned long mydata)  { ... }

[static] DEFINE_TIMER(tmr, myfunc, 0, mydata);

struct timer_list  tmr;
init_timer(& tmr);
init_timer_on_stack(& tmr);        // if variable is on stack, also call destroy_timer_on_stack(& tmr)
init_timer_deferrable(& tmr);      // will not cause CPU to wake up from power sleep
tmr.function  =  myfunc;
tmr.data  =  mydata;

tmr.expires  =  jiffies + HZ / 2;

setup_timer(&tmr, myfunc, mydata)
setup_timer_on_stack(&tmr, myfunc, mydata)
setup_deferrable_timer_on_stack(&tmr, myfunc, mydata)

void     add_timer(& tmr)                                     // must not be already pending
void     add_timer_on(&tmr, int cpu)

int        mod_timer(&tmr, unsigned long expires)

Equivalent to   del_timer(tmr); tmr.expires  = expires; add_timer(tmr);
The only safe way to queue timer when there are multiple unsynchronized concurrent users of the same timer.
If timer was inactive, returns 0, if was active, returns 1.

int        mod_timer_pending(&tmr, unsigned long expires)

Modify an already pending timer.
For active timers the same as mod_timer().
But will not reactivate or modify an inactive timer.

int        mod_timer_pinned(&tmr, unsigned long expires)

 

Update the expire field of an active timer (if the timer is inactive it will be activated) and ensure that the timer is scheduled on the current CPU.

 

This does not prevent the timer from being migrated when the current CPU goes offline.  If this is a problem for

you, use CPU-hotplug notifiers to handle it correctly, for example, cancelling the timer when the corresponding CPU goes

offline.

 

Equivalent to:   del_timer(timer); timer->expires = expires; add_timer(timer);

int        timer_pending(&tmr)    {  return   tmr.entry.next != NULL;  }                                 // no inherent synchronization

int        del_timer(&tmr)

Deactivate timer.  OK to call either for active and inactive timers.
If was inactive, return 0, otherwise (if was active) return 1.
Guarantees that the timer won't be executed in the future, but does not wait for completion of an already running handler.


 

int        del_timer_sync(&tmr)

Deactivate a timer and wait for the handler to finish.

 

Synchronization rules: Callers must prevent restarting of the timer, otherwise this function is meaningless.

It must not be called from interrupt contexts unless the timer is an irqsafe one.

The caller must not hold locks which would prevent completion of the timer's handler.

The timer's handler must not call add_timer_on().

For non-irqsafe timers, you must not hold locks that are held in interrupt context while calling this function. Even if the lock has nothing to do with the timer in question.  Here's why:

 

   CPU0                                 CPU1
   ----                                 ----
                                        <SOFTIRQ>
                                        call_timer_fn();
                                           base->running_timer = mytimer;
   spin_lock_irq(somelock);
                                        <IRQ>
                                           spin_lock(somelock);
   del_timer_sync(mytimer);
      while (base->running_timer == mytimer);

 

Now del_timer_sync() will never return and never release somelock. The interrupt on the other CPU is waiting to grab somelock but it has interrupted the softirq that CPU0 is waiting to finish.

int        try_to_del_timer_sync(&tmr)

Try to deactivate a timer. Upon successful (ret >= 0) exit the timer is not queued and the handler is not running on any CPU.
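
A minimal sketch combining the calls above (classic timer_list API as described in this section; names invented):

static struct timer_list my_tmr;

static void my_timeout(unsigned long data)
{
        /* runs in softirq context (TIMER_SOFTIRQ): must not sleep */
        /* re-arm here with mod_timer(&my_tmr, jiffies + HZ) for a periodic timer */
}

static void my_start(void)
{
        setup_timer(&my_tmr, my_timeout, 0);    /* handler + data argument */
        mod_timer(&my_tmr, jiffies + HZ / 2);   /* fire in ~half a second */
}

static void my_stop(void)
{
        del_timer_sync(&my_tmr);                /* process context, no conflicting locks held */
}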


 

Hi-res timers:


#include <linux/ktime.h>

#include <linux/hrtimer.h>

 

/* nanoseconds */

typedef union ktime {  s64 tv64;  }  ktime_t;

 

KTIME_SEC_MAX

 

ktime_t  ktime_set(s64 sec, unsigned long ns)

ktime_t  timespec_to_ktime(struct timespec ts)

ktime_t  timespec64_to_ktime(struct timespec64 ts)

ktime_t  timeval_to_ktime(struct timeval tv)

 

s64  ktime_to_ns(kt)

s64  ktime_to_us(kt)

s64  ktime_to_ms(kt)

struct timespec  ktime_to_timespec(kt)

struct timespec  ktime_to_timespec64(kt)

struct timeval  ktime_to_timeval(kt)

 

ktime_t   ktime_add(kt1, kt2)

ktime_t   ktime_sub(kt1, kt2)

ktime_t   ktime_add_ns(kt, unsigned long ns)

ktime_t   ktime_sub_ns(kt, unsigned long ns)

ktime_t   ktime_add_safe(ktime_t kt1, ktime_t kt2)

 

ktime_t  ktime_add_ms(kt, u64 ms)

ktime_t  ktime_sub_us(kt, u64 us)

 

int   ktime_compare(kt1, kt2)

bool  ktime_equal(kt1, kt2)

bool  ktime_after(kt1, kt2)

bool  ktime_before(kt1, kt2)

 

 

hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);

timer->function = ftimer;

hrtimer_start(timer, ktime_set(0, ns), HRTIMER_MODE_REL);

...

hrtimer_cancel(timer);  // waits for the handler to finish

 

static enum hrtimer_restart ftimer(struct hrtimer *timer)

{

missed = hrtimer_forward_now(timer, ktime_set(0, ns));

 

                        ....

 

                        return HRTIMER_NORESTART;  // or HRTIMER_RESTART

}

 


struct hrtimer *timer;

clockid_t which_clock =  CLOCK_MONOTONIC or CLOCK_REALTIME;

enum hrtimer_mode mode =

HRTIMER_MODE_ABS                       // absolute

HRTIMER_MODE_REL                        // relative to now

HRTIMER_MODE_ABS_PINNED   // absolute + bound to this CPU

HRTIMER_MODE_REL_PINNED // relative + bound to this CPU

 

void   hrtimer_init(timer, which_clock, mode)

void   hrtimer_init_on_stack(timer, which_clock, mode)

void  destroy_hrtimer_on_stack(timer)

int   hrtimer_start(timer, kt, mode)

int   hrtimer_start_range_ns(timer, kt, unsigned long range_ns, mode)

 

u64   hrtimer_forward(timer, ktime_t now, ktime_t interval)

u64  hrtimer_forward_now(timer, ktime_t interval)

 

ktime_t  now = timer->base->get_time()

ktime_t  now = hrtimer_cb_get_time(timer)

 

void  hrtimer_set_expires(timer, kt)

void  hrtimer_set_expires_range(timer, kt, ktime_t delta)                                  // delta is extra tolerance

void  hrtimer_set_expires_range_ns(timer, kt, unsigned long delta_ns)

void  hrtimer_add_expires(timer, kt)

void  hrtimer_add_expires_ns(timer, u64 ns)

 

ktime_t  hrtimer_get_softexpires(timer)                                                   // without added tolerance (i.e. soft)

ktime_t  hrtimer_get_expires(timer)                                                            // with tolerance, i.e. soft + delta

 

ktime_t  hrtimer_expires_remaining(timer)

void   hrtimer_get_res(which_clock, struct timespec *tp)

 



Wait queues:

 

Replacement for sleep/wake.

Exclusive waiters (those with WQ_FLAG_EXCLUSIVE) count against wakeup limit counter, non-exclusive do not.

 

wait_queue_head_t   wq;

init_waitqueue_head(& wq);

 

Sample wait cycle:

 

[static] DEFINE_WAIT(mywait);                                             // create wait queue entry

 

add_wait_queue(&wq, &mywait);                  // add it to wait queue

while (! (condition))
{

        /* sets current task state */

        /*     TASK_INTERRUPTIBLE: signal wakes process up */

        /*     TASK_UNINTERRUPTIBLE: does not */

        prepare_to_wait(&wq, &mywait, TASK_INTERRUPTIBLE);
       

        if (signal_pending(current))
                ... handle signal ...

 

        schedule();

}

finish_wait(&wq, &mywait);                              // sets task state to TASK_RUNNING and removes mywait from wq

 

Note: instead of default_wake_function, you can supply your own wake function via mywait.func (called with mywait.private).

 

Helper macros:

 

void   wait_event(&wq, condition)

Uninterruptible wait.

 

void   wait_event_lock_irq(&wq, condition, spinlock_t  lock)

void   wait_event_lock_irq_cmd(&wq, condition, spinlock_t  lock,  cmd)

 

Uninterruptible wait.
Condition checked under the lock.

Expected to be called with the lock held, returns with the lock held.

Command cmd is executed right before going to sleep.

 

int   wait_event_interruptible(&wq, condition)

int   wait_event_interruptible_lock_irq(&wq, condition, spinlock_t  lock)
int   wait_event_interruptible_lock_irq_cmd(&wq, condition, spinlock_t  lock,  cmd)

Interruptible wait.
Returns 0 if condition evaluated to true, or -ERESTARTSYS if was interrupted.

 

Condition checked under the lock.

Expected to be called with the lock held, returns with the lock held.

Command cmd is executed right before going to sleep.

long   wait_event_timeout(&wq, condition, timeout)                                                 // timeout is in jiffies

 

Uninterruptible wait with timeout.
Returns 0 if timeout elapsed, otherwise remaining time.

long   wait_event_interruptible_timeout(&wq, condition, timeout)                // timeout is in jiffies

 

Interruptible wait with timeout.
Returns -ERESTARTSYS if sleep was interrupted, 0 if timeout elapsed, otherwise remaining time.

 

int    wait_event_interruptible_exclusive(&wq, condition)

 

Interruptible exclusive wait.
Returns 0 if condition evaluated to true, or -ERESTARTSYS if was interrupted.

 

void     __add_wait_queue_exclusive(&wq, &mywait)

void     __add_wait_queue_tail_exclusive(&wq, &mywait)
void     add_wait_queue_exclusive(&wq, &mywait)

void     prepare_to_wait_exclusive(&wq, &mywait, TASK_[UN]INTERRUPTIBLE)

 

General routines/macros to set up for exclusive wait.

 

int   wait_event_interruptible_exclusive_locked(&wq, condition)

int   wait_event_interruptible_exclusive_locked_irq(&wq, condition)

 

Interruptible exclusive wait.

Must be called with wq.lock held.
The spinlock is unlocked while sleeping, but condition testing is done with the lock held,
and the lock is held when the macro exits.

wait_event_interruptible_exclusive_locked uses  spin_lock()/spin_unlock().
wait_event_interruptible_exclusive_locked_irq uses  spin_lock_irq()/spin_unlock_irq().
Returns -ERESTARTSYS if sleep was interrupted, 0 if condition was satisfied.

 

int   wait_event_killable(&wq, condition)

 

Semi-interruptible wait (interruptible only by fatal signals), no timeout.
Returns 0 if condition evaluated to true, or -ERESTARTSYS if was interrupted.

 

int   wait_on_bit(void *word, int bit, int (*action)(void *), unsigned sleepmode)
int   wait_on_bit_lock(void *word, int bit, int (*action)(void *), unsigned sleepmode)

wait_on_bit : wait for a bit to be cleared.

wait_on_bit_lock : wait for a bit to be cleared when wanting to set it.

 

action is a function used to sleep.

 

Wake up:

 

wake_up_all(&wq);                                                   // wakes up all waiters

wake_up(&wq);                                                           // wakes up up to one waiter, same as wake_up_nr(&wq, 1)
wake_up_nr(&wq,  5);                                             // wakes up up to the specified number of waiters

 

Only exclusive waiters (with WQ_FLAG_EXCLUSIVE) count against the counter.

 

/* same but with queue spinlock held */

unsigned long flags;

spin_lock_irqsave(&wq.lock, flags);

wake_up_locked(&wq);                                         // wakes up up to one waiter
wake_up_all_locked(&wq);                                     // wakes up all waiters
spin_unlock_irqrestore(&wq.lock, flags);

 

void   wake_up_bit(void* word, int bit)

 

Wake up waiter on a bit.

 

Remove from wait queue:

 

void   remove_wait_queue(&wq, &mywait)
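
A producer/consumer sketch using the helper macros above (illustrative names; the shared flag is the wait condition):

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int my_data_ready;                       /* the wait condition */

/* Consumer (process context): */
int my_wait(void)
{
        int rc = wait_event_interruptible(my_wq, my_data_ready != 0);
        if (rc)                                 /* -ERESTARTSYS: interrupted by a signal */
                return rc;
        /* ... consume the data ... */
        my_data_ready = 0;
        return 0;
}

/* Producer (any context, e.g. an interrupt handler): */
void my_post(void)
{
        my_data_ready = 1;                      /* set the condition before waking */
        wake_up(&my_wq);
}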

 

 



Completion variables:

 

Wait queue plus "done" flag.

 

                 struct completion {

                             unsigned int done;

                             wait_queue_head_t wait;

          };

 

struct completion  comp;

DECLARE_COMPLETION(comp)
DECLARE_COMPLETION_ONSTACK(comp)
#define  INIT_COMPLETION(x)       ((x).done = 0)                                  // reinitialize for reuse

void   init_completion(& comp)

 

void                      wait_for_completion(&comp)

int                         wait_for_completion_interruptible(&comp)

int                         wait_for_completion_killable(&comp)

unsigned long   wait_for_completion_timeout(&comp, unsigned long timeout)

long                      wait_for_completion_interruptible_timeout(&comp, unsigned long timeout)

long                      wait_for_completion_killable_timeout(&comp, unsigned long timeout)

bool                     try_wait_for_completion(&comp)

bool                     completion_done(&comp)

 

void                       complete(&comp)

void                       complete_all(&comp)
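
A typical sketch, e.g. waiting for an asynchronous operation started elsewhere (names invented):

static DECLARE_COMPLETION(my_done);

/* Waiter (process context): */
int my_wait_for_init(void)
{
        if (wait_for_completion_interruptible(&my_done))
                return -ERESTARTSYS;            /* interrupted by a signal */
        return 0;
}

/* Completer (another thread, interrupt handler, callback, ...): */
void my_init_finished(void)
{
        complete(&my_done);                     /* wake one waiter */
        /* complete_all(&my_done) would release every current waiter */
}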


 



Atomic operations on 32-bit counters:

 

#define ACCESS_ONCE(x)    (*(volatile typeof(x) *)&(x))                     // note: does not imply memory barrier

 

typedef struct { volatile int counter; } atomic_t;

 

atomic_t v = ATOMIC_INIT(77);
int i;

 

int       atomic_read(pv)                                          // no implied barriers !

void    atomic_set(pv,  i)                                          // ...

void    atomic_add(i,  pv)                                       // ...
void    atomic_sub(i,  pv)                                        // ...
void    atomic_inc(pv)                                               // ...
void    atomic_dec(pv)                                              // ...

 

No implied barriers  (compiler or memory) !!
For some operations, use helper functions

void    smp_mb__before_atomic_dec()

void    smp_mb__after_atomic_dec()

 

void    smp_mb__before_atomic_inc()

void    smp_mb__after_atomic_inc()


For others use  smp_wmb, smp_rmb, smp_mb or barrier.

 

int       atomic_sub_and_test(i, pv)                 // true if result is zero, performs memory barriers before and after
int       atomic_dec_and_test(pv)                      // ...
int       atomic_inc_and_test(pv)                       // ...

 

int       atomic_add_return(i, pv)                      // return result, performs memory barriers before and after
int       atomic_sub_return(i, pv)                       // ...
int       atomic_inc_return(pv)                                                           // ...
int       atomic_dec_return(pv)                                                          // ...

 

int       atomic_add_negative(i, pv)                 // *pv += i;  return true if result is negative, performs memory barriers before and after

 

int       atomic_xchg(pv, i)                                      // returns previous value, performs memory barriers before and after

int       atomic_cmpxchg(pv, old, new)                                 // CAS, returns actual old value, performs memory barriers before and after

 

int       atomic_add_unless(pv, int a, int mark)                      // if (*pv != mark) { *pv += a;  return true; }  else { return false; }   and does barriers

void    atomic_clear_mask(bitmask,  pv)

void    atomic_set_mask(bitmask,  pv)

 

int      _atomic_dec_and_lock(pv , spinlock_t *lock)

 

Atomically decrement *pv and if it drops to zero, atomically acquire lock and return true.
If does not drop to zero, just decrement, do not acquire lock and return false.
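
A common sketch built on these primitives: a reference count whose final drop frees the object (illustrative; the kernel also provides kref for exactly this):

struct my_obj {
        atomic_t refcnt;
        /* ... payload ... */
};

void my_obj_get(struct my_obj *o)
{
        atomic_inc(&o->refcnt);                 /* no implied barriers */
}

void my_obj_put(struct my_obj *o)
{
        /* atomic_dec_and_test() implies memory barriers around the decrement */
        if (atomic_dec_and_test(&o->refcnt))
                kfree(o);                       /* last reference gone */
}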

Atomic operations on 64-bit counters:

Most 32-bit architectures do not support 64-bit atomic operations, although x86_32 does.

atomic64_t
long
ATOMIC64_INIT(...)
atomic64_read(...)

etc.

Atomic operations on bits:

 

void    set_bit(unsigned long nr, volatile unsigned long *addr)                                                  // no barriers implied

void    clear_bit(unsigned long nr, volatile unsigned long *addr)                                              // ...

void    change_bit(unsigned long nr, volatile unsigned long *addr)                                         // ... (flips the bit)

 

                        void smp_mb__before_clear_bit(void)

                        void smp_mb__after_clear_bit(void)

int test_bit(unsigned long nr, __const__ volatile unsigned long *addr);

 

int       test_and_set_bit(unsigned long nr, volatile unsigned long *addr)                                  // returns boolean, performs memory barriers

int       test_and_clear_bit(unsigned long nr, volatile unsigned long *addr)

int       test_and_change_bit(unsigned long nr, volatile unsigned long *addr)

 

int       test_and_set_bit_lock(unsigned long nr, unsigned long *addr)                                                 // acquire-release semantics

void    clear_bit_unlock(unsigned long nr, unsigned long *addr)                                               // ...

Non-atomic operations on bits:

 

void    __set_bit(unsigned long nr, volatile unsigned long *addr)

void    __clear_bit(unsigned long nr, volatile unsigned long *addr)

void    __change_bit(unsigned long nr, volatile unsigned long *addr)

 

int       __test_and_set_bit(unsigned long nr, volatile unsigned long *addr)

int       __test_and_clear_bit(unsigned long nr, volatile unsigned long *addr)

int       __test_and_change_bit(unsigned long nr, volatile unsigned long *addr)

 

void     __clear_bit_unlock(unsigned long nr, unsigned long *addr)                                        // implements release barrier

 

unsigned long    find_first_bit(const unsigned long *addr, unsigned long size)

unsigned long    find_first_zero_bit(const unsigned long *addr, unsigned long size)

unsigned long    find_next_bit(const unsigned long *addr, unsigned long size, unsigned long offset)

 

if not found, returns value >= size.

 

int     ffs(int x)
                        int     ffz(int x)
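
A small sketch (illustrative) using these primitives as a simple ID allocator over a bitmap; the spinlock protects the find/set pair:

#define MY_MAX_IDS 128
static DECLARE_BITMAP(my_ids, MY_MAX_IDS);      /* array of unsigned long */
static DEFINE_SPINLOCK(my_ids_lock);

int my_alloc_id(void)
{
        int id;

        spin_lock(&my_ids_lock);
        id = find_first_zero_bit(my_ids, MY_MAX_IDS);
        if (id >= MY_MAX_IDS) {
                spin_unlock(&my_ids_lock);
                return -ENOSPC;                 /* bitmap full */
        }
        __set_bit(id, my_ids);                  /* non-atomic is fine under the lock */
        spin_unlock(&my_ids_lock);
        return id;
}

void my_free_id(int id)
{
        clear_bit(id, my_ids);                  /* atomic clear */
}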

 



Memory barriers:

rmb, wmb, mb, read_barrier_depends  – emitted on both UP and MP
smp_rmb, smp_wmb, smp_mb, smp_read_barrier_depends  – emitted on MP; on UP emit only barrier()
barrier()  –  compiler barrier

p = ...
read_barrier_depends();                      // a weaker form of read barrier, only affecting dependent data
a = *p;

[can drift down]
LOCK barrier
[cannot drift up]

[cannot drift down]
UNLOCK barrier
[can drift up]

Therefore UNLOCK + LOCK = full MB, but LOCK + UNLOCK is not (may drift into the middle and get reverse-ordered there).

smp_mb__before_spinlock() does not let stores issued before the critical section drift into it

set_current_state(TASK_UNINTERRUPTIBLE)  does: smp_mb(),  current->state = TASK_UNINTERRUPTIBLE
wake_up does smp_mb

mmiowb()  –  order writes to IO space (PCI etc.)
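
A sketch of the classic publish/consume pairing these barriers are for (struct my_msg and the names are invented for illustration):

struct my_msg { int payload; };

static struct my_msg *my_msg_ptr;
static int my_msg_ready;

/* Writer (CPU A): publish the data, then the flag */
void my_publish(struct my_msg *m)
{
        my_msg_ptr = m;
        smp_wmb();                      /* order the data store before the flag store */
        my_msg_ready = 1;
}

/* Reader (CPU B): observe the flag, then the data */
struct my_msg* my_consume(void)
{
        if (!my_msg_ready)
                return NULL;
        smp_rmb();                      /* order the flag load before the data load */
        return my_msg_ptr;
}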



Spinlocks:

 

Not recursive.
 
Disables preemption on local CPU.
 
Optionally disables hardware interrupts or bottom-half processing.
 
While holding a spinlock, must not do anything that will cause sleep. Do not touch pageable (e.g. user) memory, kmalloc(GFP_KERNEL), any semaphore functions or any of the schedule functions, down_interruptible() and down().
 
Ok to call down_trylock() and up() as they do never sleep.
 
Initialization:
 
spinlock_t lock = SPIN_LOCK_UNLOCKED;    // obsolete, incompatible with RT
 
Or:    DEFINE_SPINLOCK(name)
 
Or:    void spin_lock_init(&lock);
 
Disable-restore interrupts on local CPU:
 
unsigned long flags;
spin_lock_irqsave(&lock, flags);
spin_unlock_irqrestore(&lock, flags);
 
Performs  local_irq_save(); preempt_disable().
 
If knew for sure interrupts were enabled, can do instead:
 
spin_lock_irq(&lock);
spin_unlock_irq(&lock);
 
Performs  local_irq_disable(); preempt_disable().
 
If the spinlock is never acquired from an interrupt handler, can use the form that does not disable interrupts:
 
spin_lock(&lock);
spin_unlock(&lock);
 
int spin_trylock(&lock);
 
Performs  preempt_disable().
 
Disable bottom half processing, but do not disable hardware interrupts:
 
spin_lock_bh(&lock);
spin_unlock_bh(&lock);
 
int spin_trylock_bh(&lock);
 
Performs  local_bh_disable(); preempt_disable().
 
Nested forms for lockdep:
 
spin_lock_nested(&lock, subclass)
spin_lock_irqsave_nested(&lock, flags, subclass)
 
spin_lock_nest_lock(&lock, nest_lock)               // signal: ok to take lock despite already holding nest_lock of the same class
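 
A short sketch of the irqsave variant protecting data shared with an interrupt handler (struct my_req and the names are invented):
 
struct my_req { struct list_head node; };
 
static DEFINE_SPINLOCK(my_lock);
static LIST_HEAD(my_queue);
 
/* Process context: */
void my_submit(struct my_req *req)
{
        unsigned long flags;
 
        spin_lock_irqsave(&my_lock, flags);     /* also keeps the local IRQ handler out */
        list_add_tail(&req->node, &my_queue);
        spin_unlock_irqrestore(&my_lock, flags);
}
 
/* Interrupt handler: interrupts already off on this CPU, lock still needed for SMP */
void my_isr_part(void)
{
        spin_lock(&my_lock);
        /* ... pop entries from my_queue ... */
        spin_unlock(&my_lock);
}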
 


 

RW spinlocks:

 

Not recursive.
Disables preemption on local CPU.
Optionally disables hardware interrupts or bottom-half processing.
 
No reader -> writer conversion.
Readers are favored over writers and can starve them; in this case plain spinlock_t may be preferable to rwlock_t.
 
While holding a spinlock, must not do anything that will cause sleep. Do not touch pageable (e.g. user) memory, kmalloc(GFP_KERNEL), any semaphore functions or any of the schedule functions, down_interruptible() and down().
 
Ok to call down_trylock() and up() as they do never sleep.
 
Initialization:
 
rwlock_t lock = RW_LOCK_UNLOCKED;      // obsolete, incompatible with RT
 
Or:    DEFINE_RWLOCK(lock)
 
Or:    void rwlock_init(&lock);
 
Disable-restore interrupts on local CPU:
 
unsigned long flags;
read_lock_irqsave(&lock, flags);
read_unlock_irqrestore(&lock, flags);
 
write_lock_irqsave(&lock, flags);
write_unlock_irqrestore(&lock, flags);
 
Performs  local_irq_save(); preempt_disable().
 
If knew for sure interrupts were enabled, can do instead:
 
read_lock_irq(&lock);
read_unlock_irq(&lock);
 
write_lock_irq(&lock);
write_unlock_irq(&lock);
 
Performs  local_irq_disable(); preempt_disable().
 
If the spinlock is never acquired from an interrupt handler, can use the form that does not disable interrupts:
 
read_lock(&lock);
read_unlock(&lock);
 
write_lock(&lock);
write_unlock(&lock);
 
int read_trylock(&lock);
int write_trylock(&lock);
 
Performs  preempt_disable().
 
Disable bottom half processing, but do not disable hardware interrupts:
 
read_lock_bh(&lock);
read_unlock_bh(&lock);
 
write_lock_bh(&lock);
write_unlock_bh(&lock);
 
Performs  local_bh_disable(); preempt_disable().
 
 


Local-global spinlock:

 

·          Global data: very fast and scalable read locking, but very slow write locking.

·          Per-CPU data: fast write locking on local CPU, possible write locking of other CPU’s data, very slow write locking of all CPUs data.

 

struct lglock   lg;
DEFINE_STATIC_LGLOCK(lg);

void    lg_lock_init(&lg, “mylg”);


void    lg_local_lock(&lg);

void    lg_local_unlock(&lg);


void    lg_local_lock_cpu(&lg, int cpu);

void    lg_local_unlock_cpu(&lg, int cpu);


void    lg_global_lock(&lg);

void    lg_global_unlock(&lg);

 



Mutexes:

 

Sleepable lock (may sleep when acquiring, can sleep while holding it).
May not use in softirq or hardirq context.
Non-recursive.
Task may not exit with mutex held.

struct mutex mtx;
[static] DEFINE_MUTEX(mtx);

void   mutex_init(&mtx);
void   mutex_destroy(&mtx);

void   mutex_lock(&mtx)
int      mutex_trylock(&mtx)
void   mutex_unlock(&mtx)                                                                           // must be unlocked by owner only
int      mutex_is_locked(&mtx)

 

int      mutex_lock_interruptible(&mtx)                                                        // 0 if acquired, otherwise -EINTR

int      mutex_lock_killable(&mtx)                                                                 // …

 

int      atomic_dec_and_mutex_lock(atomic_t* cnt, &mtx)

 

If decrements cnt to 0, return true holding mutex.

Otherwise return false not holding mutex.

 

If locking more than one mutex of the same lock validation class, lockdep will complain. To lock multiple mutexes of the same class, designate their subclasses and lock with the following functions.

 

unsigned int subclass;

void   mutex_lock_nested(&mtx, subclass)

void   mutex_lock_interruptible_nested(&mtx, subclass)

void   mutex_lock_killable_nested(&mtx, subclass)
void   mutex_lock_nest_lock(&mtx, nest_lock)           
// signal: ok to take mtx despite already holding nest_lock of the same class

 

 



Mutexes with priority inheritance:

 

struct rt_mutex rmx;
[static] DEFINE_RT_MUTEX(rmx);

 

void    rt_mutex_init(&rmx)

void    rt_mutex_destroy(&rmx)

void    rt_mutex_lock(&rmx)

int       rt_mutex_trylock(&rmx)

void    rt_mutex_unlock(&rmx)

int       rt_mutex_is_locked(&rmx)

int       rt_mutex_timed_lock(&rmx,  struct hrtimer_sleeper *timeout,  int detect_deadlock)

 

success = 0, otherwise -EINTR or -ETIMEDOUT or -EDEADLK

 

int       rt_mutex_lock_interruptible(&rmx,  int detect_deadlock)

 

success = 0, otherwise -EINTR or -EDEADLK

 


 

Semaphores:

Sleepable lock (may sleep when acquiring, can sleep while holding it).
May not use in softirq or hardirq context.

struct semaphore sem;
[static] DEFINE_SEMAPHORE(sem);

void    sema_init(&sem, int val)                                                       // val is maximum number of holders

void    down(&sem)                                                                           // acquire

int       down_trylock(&sem)                                                            // inverted result:  0 if acquired, 1 if did not

int       down_interruptible(&sem)                                                 // 0 or -EINTR

int       down_killable(&sem)                                                           // 0 or -EINTR

int       down_timeout(&sem, long jiffies)                                      // 0 or -ETIME

void    up(&sem)                                                                                // release



RW locks (RW semaphores):

Any number of readers, up to 1 writer.
Non-recursive.
Sleepable lock (may sleep when acquiring, can sleep while holding it).
May not use in softirq or hardirq context.
There is no interruptible/killable wait; all waits are uninterruptible.

struct rw_semaphore rws;
[static] DECLARE_RWSEM(rws);

void    init_rwsem(&rws)

void    down_read(&rws)                                                               // acquire

void    down_write(&rws)

 

int      down_read_trylock(&rws)                                                  // true = ok, false = failure

int      down_write_trylock(&rws)

 

void    up_read(&rws)                                                                      // release

void    up_write(&rws)

 

int      rwsem_is_locked(&rws)                                                      // true if locked

 

void    downgrade_write(&rws)                                                // downgrade write lock to read lock

 

If locking more than one RW semaphore of the same lock validation class, lockdep will complain. To lock multiple RW semaphores of the same class, designate their subclasses and lock with the following functions.

 

void    down_read_nested(&rws, int subclass)

void    down_write_nested(&rws, int subclass)

void    down_write_nest_lock(&rws, nest_lock)           // signal: ok to take rws despite already holding nest_lock of the same class

 



Seqlocks:

Sequential locks.

·          Good when lots of readers, few writers

·          Writers take priority over readers

·          So readers cannot starve writers

"Write" side (in write_seqlock_xxx) internally uses spinlock and runs locked code with spinlock held.
"Read" side does not use spinlock.
"Write" side is subject to lock validator, "read" side is not.

Internally increments a variable inside the lock. When it is odd, write lock is held.

Writer usage:

        write_seqlock(&sq);
        ... write data ...
        write_sequnlock(&sq);

Reader usage:

        do {
                start = read_seqbegin(&sq);
                ... read data ...
        } while (read_seqretry(&sq, start));

seqlock_t sq;
[static] DEFINE_SEQLOCK(sq);
unsigned int start;

void   seqlock_init(& sq)

void   write_seqlock(&sq)

int      write_tryseqlock(&sq)                                                       // true if locked

void   write_sequnlock(&sq)

 

void   write_seqlock_irqsave(&sq, flags)

void   write_sequnlock_irqrestore(&sq, flags)

 

void    write_seqlock_irq(&sq)

void    write_sequnlock_irq(&sq)

 

void    write_seqlock_bh(&sq)

void    write_sequnlock_bh(&sq)

void     write_seqcount_begin(seqcount_t *s)                   // caller uses his own mutex or other lock, built-in spinlock is not used

void     write_seqcount_end(seqcount_t *s)                        // ...

 

void     write_seqcount_barrier(seqcount_t *s)              // invalidate in-progress read operations

 

start  =  read_seqbegin(const &sq)

int          read_seqretry(const &sq, unsigned start)

 

start  =  read_seqbegin_irqsave(&sq, flags)

int          read_seqretry_irqrestore(&sq, start, flags)

 

start  =  __read_seqcount_begin(const seqcount_t *s)                                   // like read_seqbegin, but does not have smp_rmb at the end

start  =  read_seqcount_begin(const seqcount_t *s)                                      // equivalent to read_seqbegin

start  =  raw_seqcount_begin(const seqcount_t *s)                                       // get seqcount without waiting for it to go even, plus smp_rmb


int            __read_seqcount_retry(const seqcount_t *s, unsigned start)    
// like read_seqretry, but does not issue smp_rmb
                                                                                                                                  // (caller must issue rmb before the call)

int           read_seqcount_retry(const seqcount_t *s, unsigned start)         // equivalent to read_seqretry
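
A sketch of the seqcount_t form, where the writer is serialized by its own lock (names xxx_seq, xxx_lock, xxx_value are illustrative):

    static seqcount_t xxx_seq;                   /* seqcount_init(&xxx_seq); */
    static DEFINE_SPINLOCK(xxx_lock);            /* caller-provided writer lock */
    static u64 xxx_value;

    /* writer: serialized by xxx_lock */
    spin_lock(&xxx_lock);
    write_seqcount_begin(&xxx_seq);
    xxx_value++;
    write_seqcount_end(&xxx_seq);
    spin_unlock(&xxx_lock);

    /* reader: lock-free, retries if a write raced with it */
    unsigned int seq;
    u64 v;
    do {
            seq = read_seqcount_begin(&xxx_seq);
            v = xxx_value;
    } while (read_seqcount_retry(&xxx_seq, seq));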

 


 

 

Lock validator (lockdep):

http://lwn.net/Articles/185666/
Documentation/lockdep-design.txt

Enable CONFIG_DEBUG_LOCK_ALLOC.

Every lock is assigned a class (= “key address”):

·          For statically allocated lock it is the address of the lock

·          For dynamic locks it is the spot of their init (spin_lock_init, mutex_init etc.) routine

Validator checks for:

·          ClassA lock taken after ClassB if previously locks of these two classes were ever taken in inverse order

·          ClassX lock is taken when ClassX lock is already held.

o    Exemption: use functions like mutex_lock_nested and specify a subclass (see the sketch after this list).
Validator treats “class+subclass” as a separate class.

·          Locks released in the order that is not reverse of their acquisition.
(A stack of currently-held locks is maintained, so any lock being released should be at the top of the stack; anything else means that something strange is going on.)

·          Any (spin)lock ever acquired by hardirq can never be acquired when interrupts (hardirqs) are enabled.

o    Conversely: any lock ever acquired with hardirqs enabled is “hardirq-unsafe”
and cannot be acquired in hardirq.

·          Any (spin)lock ever acquired by softirq can never be acquired when softirqs are enabled.

o    Conversely: any lock ever acquired with softirqs enabled is “softirq-unsafe” and cannot be acquired in hardirq.

·          Cannot:  (holding) hardirq-safe -> (acquire) hardirq-unsafe.

o    Since hardirq-unsafe -> hardirq-safe is valid, so reverse is invalid, as it would cause lock inversion.
May be too strong, but it is checked against nevertheless.

·          Cannot:  (holding) softirq-safe -> (acquire) softirq-unsafe.
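
Sketch of the subclass exemption mentioned above (obj1/obj2 are illustrative objects whose embedded mutexes share a lockdep class; down_read_nested/down_write_nested work the same way for RW semaphores):

    mutex_lock(&obj1->lock);                                  /* subclass 0 */
    mutex_lock_nested(&obj2->lock, SINGLE_DEPTH_NESTING);     /* subclass 1: tells lockdep this nesting is intentional */
    /* ... */
    mutex_unlock(&obj2->lock);
    mutex_unlock(&obj1->lock);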

Gotchas:

·          Unloading of kernel modules makes class ids available for reuse, which can produce false warnings and leak class-table entries (the table can overflow, resulting in a message).

·          Static initialization of large number of locks (e.g. array of structures with per-entry lock) can cause class table to overflow. Solution: initialize locks dynamically.

Aids:

·          /proc/lockdep_stats

·          /proc/lockdep




Kernel threads:

 

struct task_struct* task;

 

task = kthread_create(thread_func, void* data, const char* namefmt, ...);
if (! IS_ERR(task))  wake_up_process(task);

 

Initially SCHED_NORMAL, nice 0.
Can alter with sched_setscheduler[_nocheck](…).

 

// create thread bound to NUMA memory node

task = kthread_create_on_node(thread_func, void* data, cpu_to_node(cpu),

                              const char* namefmt, ...);
if (! IS_ERR(task))  wake_up_process(task);

 

task = kthread_run(thread_func, void* data, const char* namefmt, ...);                // create and wake up in one call

 

int retval = kthread_stop(task);

 

kthread_stop returns the thread's return value, or -EINTR if the thread was never woken up (never started running).

 

int thread_func(void* data)

{

    /* run until kthread_stop() is called */

    while (!kthread_should_stop()) {

        /* ... do work, then sleep or schedule ... */

    }

    /* a kernel thread may also terminate itself with do_exit(retval) */

    return retval;    /* return status to kthread_stop caller */

}

 

 

Priority ranges:

 

task_struct* p;
p->prio

 

All:

    0 (highest) … MAX_PRIO-1 (lowest, 139)

    MAX_PRIO = 140

 

Realtime:

    0 (highest) … MAX_RT_PRIO-1 (lowest, 99)
    MAX_RT_PRIO = MAX_USER_RT_PRIO = 100

 

Timeshared:

    MAX_RT_PRIO (100) … MAX_PRIO-1 (139)

 

Default priority for irq threads (PREEMPT_RT):

    MAX_USER_RT_PRIO/2  (50)

 

 


 

CPUs:

 

NR_CPUS  –  maximum number in the system              // eventually will be run-time variable nr_cpu_ids (nr_cpu_ids <= NR_CPUS)

 

unsigned long my_percpu[NR_CPUS];

 

int cpu = get_cpu();                                                  // disable preemption and get current processor

my_percpu[cpu]++;

put_cpu();                                                                 // enable preemption

 

int   smp_processor_id();                                        // unsafe if preemption is not disabled

 

Per-CPU data interface, static form usable only in kernel but not in loadable modules:

 

DECLARE_PER_CPU(vartype, varname)                         // declare per-CPU variable (not required locally, DEFINE_PER_CPU is ok)
DEFINE_PER_CPU(vartype, varname)                           // define and allocate per-CPU variable

 

DECLARE_PER_CPU_SHARED_ALIGNED(...)                 // cacheline aligned
DEFINE_PER_CPU_SHARED_ALIGNED(...)

DECLARE_PER_CPU_ALIGNED(...)                                   // cacheline aligned
DEFINE_PER_CPU_ALIGNED(...)

DECLARE_PER_CPU_PAGE_ALIGNED(...)                      // page-aligned
DEFINE_PER_CPU_PAGE_ALIGNED(...)

 

EXPORT_PER_CPU_SYMBOL(varname)                      // export for use by modules
EXPORT_PER_CPU_SYMBOL_GPL(varname)

 

get_cpu_var(varname)++;                                     // disable preemption, access variable and increment it

put_cpu_var(varname);                                          // reenable preemption

 

per_cpu(varname, cpu) = 0;                                // access per-cpu variable instance for particular CPU (does not disable

                                                                                  //    preemption and provides by itself no locking)
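
Usage sketch for the static form (xxx_count is illustrative):

    DEFINE_PER_CPU(unsigned long, xxx_count);

    /* fast path: touch this CPU's instance with preemption disabled */
    get_cpu_var(xxx_count)++;
    put_cpu_var(xxx_count);

    /* slow path: sum over all CPUs (no implicit locking) */
    unsigned long total = 0;
    int cpu;
    for_each_possible_cpu(cpu)
            total += per_cpu(xxx_count, cpu);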

 

Per-CPU data interface, dynamic form usable both in kernel and in loadable modules:

 

void*   alloc_percpu(vartype);                                             // on failure returns NULL

void*   __alloc_percpu(size_t size, size_t align);

void    free_percpu(void* data);

 

A* p = alloc_percpu(A);                                                       // same as below

A* p = __alloc_percpu(sizeof(A), __alignof__(A));             // ...

 

Then access per-CPU instances with get_cpu_ptr(p) / put_cpu_ptr(p), or with per_cpu_ptr(p, cpu) for a particular CPU.

 

this_cpu_read , this_cpu_write, this_cpu_add, this_cpu_sub etc. – operations performed with interrupts or preemption disabled
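
Usage sketch for the dynamic form (struct xxx_stats is illustrative):

    struct xxx_stats { u64 packets; u64 bytes; };

    struct xxx_stats __percpu *stats = alloc_percpu(struct xxx_stats);
    if (!stats)
            return -ENOMEM;

    /* fast path on the local CPU (disables preemption) */
    struct xxx_stats *s = get_cpu_ptr(stats);
    s->packets++;
    put_cpu_ptr(stats);

    /* read every CPU's instance */
    u64 total = 0;
    int cpu;
    for_each_possible_cpu(cpu)
            total += per_cpu_ptr(stats, cpu)->packets;

    free_percpu(stats);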

CPU sets:

 

typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;
cpumask_t  cpumask;


unsigned long* bp = cpumask_bits(&cpumask);

 

const struct cpumask *const   cpu_possible_mask;   // populatable, e.g. can be hot-plugged, fixed at boot time      

const struct cpumask *const   cpu_present_mask;     // populated

const struct cpumask *const   cpu_online_mask;       // available to scheduler

const struct cpumask *const   cpu_active_mask;        // available to migration (a subset of online, currently identical)

 

#define   num_online_cpus()                 cpumask_weight(cpu_online_mask)

#define   num_possible_cpus()              cpumask_weight(cpu_possible_mask)

#define   num_present_cpus()               cpumask_weight(cpu_present_mask)

#define   num_active_cpus()                  cpumask_weight(cpu_active_mask)

 

#define   cpu_online(cpu)                      cpumask_test_cpu((cpu), cpu_online_mask)

#define   cpu_possible(cpu)                   cpumask_test_cpu((cpu), cpu_possible_mask)

#define   cpu_present(cpu)                    cpumask_test_cpu((cpu), cpu_present_mask)
#define   cpu_active(cpu)                       cpumask_test_cpu((cpu), cpu_active_mask)


#define   cpu_is_offline(cpu)                 unlikely(!cpu_online(cpu))

 

cpumask_t  mask  = { CPU_BITS_NONE };
const struct cpumask* maskp = cpu_none_mask;

const struct cpumask* maskp = cpu_all_mask;
cpumask_set_cpu(unsigned int cpu, & mask)
cpumask_clear_cpu(unsigned int cpu, & mask)
int   cpumask_test_cpu(unsigned int cpu, & mask)
int   cpumask_test_and_set_cpu(unsigned int cpu, & mask)
int   cpumask_test_and_clear_cpu(unsigned int cpu, & mask)

 

cpumask_clear(&mask);
cpumask_setall(& mask);

cpumask_and(&mask, &m1, &m2);
cpumask_or(&mask, &m1, &m2);
cpumask_xor(&mask, &m1, &m2);
cpumask_andnot(&mask, &m1, &m2);
cpumask_complement(&mask, &m1);

bool  cpumask_[is]empty(& mask);
bool  cpumask_[is]subset(& mask, & superset_mask);

#define cpu_isset(cpu, cpumask) test_bit((cpu), (cpumask).bits)

 

cpumask_copy(&dst, &src);

bool  cpumask_equal(&m1, &m2)

 

int    cpumask_parse_user(const char __user *buf, int len, &mask)                                           // 0 = ok, otherwise -errno
int    cpumask_parselist_user(const char __user *buf, int len, &mask)
int    cpulist_parse(const char *buf, & mask)

 

int    cpumask_scnprintf(char *buf, int len,  &mask)
int    cpulist_scnprintf(char *buf, int len,  &mask)

 

unsigned int   cpumask_first(&mask)                           // returns >= nr_cpu_ids if none
unsigned int   cpumask_next(int n, &mask)                // returns >= nr_cpu_ids if none
unsigned int   cpumask_next_zero(int n, &mask)      // returns >= nr_cpu_ids if none

 

for_each_cpu(cpu, mask) {...}                                         // iterate over every cpu in a mask
for_each_cpu_not(cpu, mask) {...}                                // iterate over every cpu not in a mask
for_each_cpu_and(cpu, m1, m2) {...}                           // iterate over every cpu in both masks

 

#define   for_each_possible_cpu(cpu)                       for_each_cpu((cpu), cpu_possible_mask)

#define   for_each_online_cpu(cpu)                                                   for_each_cpu((cpu), cpu_online_mask)

#define   for_each_present_cpu(cpu)                        for_each_cpu((cpu), cpu_present_mask)

 

cpumask_of_node(cpu_to_node(cpu))

 

Calling function on other CPUs:

 

void myfunc(void* info);

 

Handler function is invoked on remote CPUs in IPI interrupt context,
on the local CPU – with interrupts disabled.

 

The following functions must be called with interrupts enabled (or can deadlock).
They return 0 if successful, otherwise –errno (e.g. –ENXIO if CPU is not online).
If wait is true, waits for completion of the call.

 

int      smp_call_function_single(int cpuid, myfunc, void *info, int wait);
int      smp_call_function_any(&mask, myfunc, void *info, int wait);

 

void   smp_call_function_many(&mask, myfunc, void *info, bool wait);


Execute myfunc(info) on all CPUs in mask except current CPU.
Preemption must be disabled when calling smp_call_function_many.

May not call it from bottom-half or interrupt handler.

 

int      smp_call_function(myfunc, void *info, int wait);

 

Call myfunc(info) on all CPUs in cpu_online_mask except the current CPU.
May not call smp_call_function from a bottom-half or interrupt handler.
Temporarily disables preemption internally.

 

int      on_each_cpu(myfunc, void *info, int wait);

 

Call myfunc(info) on all CPUs in cpu_online_mask, including the current CPU.
May not call on_each_cpu from a bottom-half or interrupt handler.
Temporarily disables preemption internally.


void   on_each_cpu_mask(&mask, myfunc, void *info, bool wait);

 

Call myfunc(info) on all CPUs in the mask, including the current CPU if it is in the mask.
May not call on_each_cpu_mask from a bottom-half or interrupt handler.

Preemption must be disabled when calling on_each_cpu_mask.

 

void    on_each_cpu_cond(bool (*cond_func)(int cpu, void *info), myfunc, void *info, bool wait, gfp_t gfp_flags);

 

For each CPU in cpu_online_mask, invoke cond_func(cpu, info) locally, and for those CPUs for which it returns true, call myfunc(info) remotely and optionally wait. Flags gfp_flags are used to allocate a temporary CPU mask block.

May not call on_each_cpu_cond from a bottom-half or interrupt handler.
Temporarily disables preemption internally.
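
A sketch of a cross-CPU call (xxx_count_local and target_cpu are illustrative; with wait = 1 a stack variable can safely be passed as info):

    static void xxx_count_local(void *info)      /* runs on the target CPU in IPI context */
    {
            unsigned long *counter = info;
            (*counter)++;
    }

    unsigned long counter = 0;
    int err = smp_call_function_single(target_cpu, xxx_count_local, &counter, 1 /* wait */);
    if (err)
            pr_warn("cpu %d not online (%d)\n", target_cpu, err);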

 

Stop Machine:

 

int     stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)

 

Freeze the machine on all CPUs and execute fn(data).


This schedules a thread to be run on each CPU in the cpus mask (or each online CPU if cpus is NULL) at the highest priority, which threads also disable interrupts on their respective CPUs. The result is that no one is holding a spinlock or inside any other preempt-disable region when fn runs. This is effectively a very heavy lock equivalent to grabbing every spinlock (and more).

 

fn(data) is executed only on one CPU, with interrupts disabled.




Memory zones:

 

ZONE_DMA             pages that can be used for DMA
                     x86-32: below 16 MB

ZONE_DMA32           pages that can be used for DMA, but by 32-bit devices only

ZONE_NORMAL          normal pages, permanently mapped into kernel memory
                     on x86-32: 16 MB to 896 MB; on some other architectures: all memory

ZONE_HIGHMEM         pages not permanently mapped into kernel memory
                     on x86-32: all memory above 896 MB; on some other architectures: empty

ZONE_MOVABLE         (related to hot-plugging)


Normal allocations are usually satisfied from ZONE_NORMAL, but on shortage can satisfy from ZONE_DMA.

x64 has only ZONE_DMA, ZONE_DMA32 and ZONE_NORMAL.

Memory allocation (kmalloc):

void*   kmalloc(size_t size, gfp_t flags)
kfree(p)

void*   kmalloc_array(size_t n, size_t size, gfp_t flags)

void*   kcalloc(size_t n, size_t size, gfp_t flags)                                    // zeroes out

void*   kzalloc(size_t size, gfp_t flags)                                                          // zeroes out

 

void*   kmalloc_node(size_t size, gfp_t flags, int node)                                          // from specific NUMA node

void*   kzalloc_node(size_t size, gfp_t flags, int node)                                            // zeroes out
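
Typical usage (struct xxx_dev is illustrative):

    struct xxx_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);    /* process context, may sleep */
    if (!dev)
            return -ENOMEM;
    /* ... */
    kfree(dev);

    /* in interrupt context or under a spinlock */
    buf = kmalloc(len, GFP_ATOMIC);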

 

Common flag combinations:

GFP_KERNEL

Default, most usual.
Can block and can be called only in process context.
Can do block IO and file system calls inside.

GFP_ATOMIC

High-priority allocation. Does not sleep.
Use in interrupt handlers, bottom halves (softirqs, tasklets) or when holding a spinlock
or in other situations when cannot sleep.
Will use emergency pool if necessary.

GFP_NOWAIT

Like GFP_ATOMIC, but will not fallback to emergency pool.

GFP_NOIO

Like GFP_KERNEL, but will not do block IO.
Use in block IO code when want to avoid recursion.

GFP_NOFS

Like GFP_KERNEL (can block and initiate disk IO), but will not do file system IO.
Use in file system code when want to avoid recursion.

GFP_USER

Normal allocation, can block.
Use to allocate memory for user-space processes.

GFP_HIGHUSER

Allocate from ZONE_HIGHMEM. Can block.
Used to allocate memory for user-space processes.

GFP_HIGHUSER_MOVABLE

 

GFP_TRANSHUGE

 

GFP_DMA

Allocate from ZONE_DMA.
Should only be used with kmalloc; for your own slab caches, create the cache with SLAB_CACHE_DMA instead.

GFP_DMA32

Allocate from ZONE_DMA32


Their meanings:

GFP_KERNEL                    __GFP_WAIT | __GFP_IO | __GFP_FS

GFP_ATOMIC                    __GFP_HIGH

GFP_NOWAIT                    GFP_ATOMIC & ~__GFP_HIGH

GFP_NOIO                      __GFP_WAIT

GFP_NOFS                      __GFP_WAIT | __GFP_IO

GFP_USER                      __GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL

GFP_HIGHUSER                  __GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | __GFP_HIGHMEM

GFP_HIGHUSER_MOVABLE          __GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL | __GFP_HIGHMEM | __GFP_MOVABLE

GFP_TRANSHUGE                 GFP_HIGHUSER_MOVABLE | __GFP_COMP | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD

GFP_DMA                       __GFP_DMA

GFP_DMA32                     __GFP_DMA32

GFP_TEMPORARY                 __GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_RECLAIMABLE

GFP_IOFS                      __GFP_IO | __GFP_FS

 

Basic flags:

___GFP_DMA

Allocate from ZONE_DMA

___GFP_DMA32

Allocate from ZONE_DMA32

___GFP_HIGHMEM

Allocate from ZONE_HIGHMEM

___GFP_MOVABLE

Allocate from ZONE_MOVABLE

___GFP_WAIT

Allocator can sleep

___GFP_HIGH

High-priority: allocator can access emergency pools

___GFP_IO

Allocator can start block IO

___GFP_FS

Allocator can start filesystem IO

___GFP_COLD

Request cold-cache pages instead of trying to return cache-warm pages

___GFP_NOWARN

Allocator does not print failure warnings

___GFP_REPEAT

Allocator repeats (once) the allocation if it fails, but the allocation can potentially fail

___GFP_NOFAIL

Allocator repeats allocation indefinitely, the allocation cannot fail.

Deprecated, no new users allowed.

___GFP_NORETRY

Allocator never retries the allocation if it fails

___GFP_MEMALLOC

User reserves

___GFP_COMP

The allocator adds compound page metadata (used internally by hugetlb code)

___GFP_ZERO

Zero out page content

___GFP_NOMEMALLOC

Allocator does not fall back on reserves (takes precedence over MEMALLOC if both are specified)

___GFP_HARDWALL

The allocator enforces "hardwall" cpuset boundaries

___GFP_THISNODE

Allocate NUMA node-local memory only, no fallback

___GFP_RECLAIMABLE

The allocator marks the pages reclaimable

___GFP_KMEMCG

Allocation comes from a memcg-accounted resource

___GFP_NOTRACK

Don't track with kmemcheck

___GFP_NO_KSWAPD

 

___GFP_OTHER_NODE

On behalf of other node

___GFP_WRITE

Allocator intends to dirty page

 


Memory allocation (slab allocator):

cache (pool) -> set of slab blocks (each is several pages) -> objects in each slab

struct kmem_cache*   kmem_cache_create(
                                                 "my_struct",                         // cache name
                                                 sizeof(my_struct),                   // object size
                                                 size_t align,                        // required alignment, often L1_CACHE_BYTES, or 0 if SLAB_HWCACHE_ALIGN
                                                 unsigned long flags,                 // SLAB_xxx, see below
                                                 void (*ctor)(void*))                 // object constructor, can be NULL

void   kmem_cache_destroy(struct kmem_cache*  cache)

SLAB_DEBUG_FREE

DEBUG: perform extensive check on free

SLAB_RED_ZONE

DEBUG: insert "red zones" around allocated memory to help detect buffer overruns

SLAB_POISON

DEBUG: fill a slab with known value (0xA5) to catch access to uninitialized memory

SLAB_HWCACHE_ALIGN

Align objects within a slab to a cache line

SLAB_CACHE_DMA

Allocate from ZONE_DMA

SLAB_STORE_USER

DEBUG: store the last owner for bug hunting

SLAB_PANIC

Panic if allocation fails

SLAB_DESTROY_BY_RCU

Defer freeing slabs to RCU (read more in header file)

SLAB_MEM_SPREAD

Spread some memory over cpuset

SLAB_TRACE

Trace allocations and frees

SLAB_DEBUG_OBJECTS

 

SLAB_NOLEAKTRACE

Avoid kmemleak tracing

SLAB_NOTRACK

 

SLAB_FAILSLAB

Fault injection mark

SLAB_RECLAIM_ACCOUNT

Objects are reclaimable

SLAB_TEMPORARY

Objects are short-lived

 

void*    kmem_cache_alloc(cache, gfp_t flags);
void*    kmem_cache_alloc_node(cache, gfp_t flags, int node)
void      kmem_cache_free(cache, void* p);
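
Typical cache lifetime (names are illustrative):

    static struct kmem_cache *xxx_cachep;

    xxx_cachep = kmem_cache_create("xxx_struct", sizeof(struct xxx_struct),
                                   0, SLAB_HWCACHE_ALIGN, NULL);
    if (!xxx_cachep)
            return -ENOMEM;

    struct xxx_struct *obj = kmem_cache_alloc(xxx_cachep, GFP_KERNEL);
    /* ... */
    kmem_cache_free(xxx_cachep, obj);

    kmem_cache_destroy(xxx_cachep);              /* only after all objects have been freed */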

Memory allocation (virtually contiguous pages):

Allocated pages are virtually contiguous but not necessarily physically contiguous.
Slower than kmalloc() since it must set up page-table entries, and the extra mappings cause more TLB pressure.

void*     vmalloc(unsigned long size);

void*     vzalloc(unsigned long size);                                                             // zeroed out

void       vfree(const void* addr);

 

void*     vmalloc_user(unsigned long size);                                                  // memory for user-space, zeroed out

 

void*     vmalloc_node(unsigned long size, int node);                                // allocate pages in specific NUMA node

void*     vzalloc_node(unsigned long size, int node);

 

void*     vmalloc_exec(unsigned long size);                                                 // executable memory

void*     vmalloc_32(unsigned long size);                                                    // 32-bit addressable memory

void*     vmalloc_32_user(unsigned long size);                                          // 32-bit addressable memory for user-space, zeroed-out

 

void*     vmap(struct page **pages, unsigned int count, unsigned long flags, pgprot_t prot);

void        vunmap(const void* addr);                                                            // release mapping

 

void        vmalloc_sync_all(void);
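
Typical usage for a large buffer that does not need to be physically contiguous (names are illustrative):

    void *buf = vzalloc(nr_entries * sizeof(struct xxx_entry));  /* process context, may sleep */
    if (!buf)
            return -ENOMEM;
    /* ... */
    vfree(buf);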

Memory allocation (physical pages):

struct page*  alloc_pages(gfp_t  flags,  unsigned int order)

Allocate (2 ^ order)  contiguous physical pages, return pointer to the first page's page structure.
On failure returns NULL.

void*                    page_address(struct page*  page)                            // get kernel virtual address (NULL if the page is not mapped)

unsigned long     __get_free_pages(gfp_t gfp_flags, unsigned int order)
unsigned long     __get_free_page(flags)

Same as alloc_pages, but return the kernel virtual (logical) address or 0.

unsigned long    get_zeroed_page(gfp_t flags)

                Allocate one zeroed page, return the kernel virtual address or 0.

#define __get_dma_pages(flags, order)    __get_free_pages((flags) | GFP_DMA, (order))

void     __free_pages(struct page *page, unsigned int order)
void     __free_page(struct page *page)

void     free_pages(unsigned long addr, unsigned int order)
void     free_page(unsigned long addr)

void*   alloc_pages_exact(size_t size, gfp_t flags)
void     free_pages_exact(void* virt_addr, size_t size)

Allocate contiguous physical pages to hold size bytes, return the kernel virtual address or NULL on failure.

struct page*    alloc_pages_node(int nid, gfp_t flags, unsigned int order)
struct page*    alloc_pages_exact_node(int nid, gfp_t flags, unsigned int order)
void*                               alloc_pages_exact_nid(int nid, size_t size, gfp_t flags)
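
Usage sketch (order 2 = 4 contiguous pages):

    struct page *page = alloc_pages(GFP_KERNEL, 2);
    if (!page)
            return -ENOMEM;
    void *addr = page_address(page);             /* kernel virtual address (lowmem) */
    /* ... */
    __free_pages(page, 2);

    /* or by address */
    unsigned long p = __get_free_pages(GFP_KERNEL, 2);
    if (!p)
            return -ENOMEM;
    free_pages(p, 2);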


Mapping physical pages into kernel space:

void*   kmap(struct page* page)

 

If page is in low memory (ZONE_NORMAL, ZONE_DMA) its virtual address is simply returned.

If page is in ZONE_HIGHMEM, a mapping is created and its virtual address is returned.
May sleep.

 

void      kunmap(struct page* page)

 

Unmap mapping created by kmap.

 

void*    kmap_atomic(struct page* page)

void      kunmap_atomic(void* kvaddr)

kmap_atomic/kunmap_atomic is significantly faster than kmap/kunmap because no global lock is needed and because the kmap code must perform a global TLB invalidation when the kmap pool wraps.

 

However when holding an atomic kmap it is not legal to sleep, so atomic kmaps are appropriate for short, tight code paths only.


struct page*   kmap_to_page(void *  kva)
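
Usage sketch for a page that may live in highmem (data/len are illustrative):

    void *kva = kmap(page);                      /* may sleep */
    memcpy(kva, data, len);
    kunmap(page);

    /* atomic context: no sleeping between map and unmap */
    void *akva = kmap_atomic(page);
    memcpy(akva, data, len);
    kunmap_atomic(akva);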

 

 

PFN:

 

unsigned long pfn = page_to_pfn(page);

 

if (pfn_valid(pfn))

       page = pfn_to_page(pfn);

 

void get_page(page);

....

void put_page(page);

 

void *kva = page_address(page)

       get kernel virtual address for a page in the lowmem zone,

       or for a page in the highmem zone currently mapped into kernel address space;

       otherwise returns NULL

 

       if (virt_addr_valid(kva))                                                                    // for kva in lowmem zone
               page = virt_to_page(kva);

 

Pin user pages in memory:

 

       note: obsolete, should use get_user_pages_locked|unlocked|fast

 

struct page *pages[NPAGES];

down_read(&current->mm->mmap_sem);

long nr = get_user_pages(current, current->mm, (unsigned long) buf, npages,
                         write, force, pages, NULL);

up_read(&current->mm->mmap_sem);

....  modify page contents via the pages[] array ...

/* for each modified page: */

lock_page(pages[i]);

set_page_dirty(pages[i]);

unlock_page(pages[i]);

page_cache_release(pages[i]);

 

Map kernel memory into userspace (e.g. for mmap):

 

down_read(&current->mm->mmap_sem);

err = remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,

             unsigned long pfn, unsigned long size, pgprot_t prot)

 

err = remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff, vma->vm_end - vma->vm_start, vma->vm_page_prot)

up_read(&current->mm->mmap_sem);

Map kernel memory to userspace:

 

static struct vm_operations_struct xxx_mmap_ops = {

       .open  =              xxx_mm_open,

       .close  =        xxx_mm_close,

};

 

static void xxx_mm_open(struct vm_area_struct *vma)

{

       struct file *file = vma->vm_file;

       struct socket *sock = file->private_data;

       struct sock *sk = sock->sk;

      

       if (sk)

                                     atomic_inc(&pkt_sk(sk)->mapped);

}

 

static void xxx_mm_close(struct vm_area_struct *vma)

{

       struct file *file = vma->vm_file;

       struct socket *sock = file->private_data;

       struct sock *sk = sock->sk;

      

       if (sk)

                                     atomic_dec(&pkt_sk(sk)->mapped);

}

 

.......

 

 

        if (vma->vm_pgoff)
                return -EINVAL;

        size = vma->vm_end - vma->vm_start;

        start = vma->vm_start;
        for (i = 0; i < ....; i++) {
                struct page *page = virt_to_page(po->pg_vec[i]);
                int pg_num;

                for (pg_num = 0; pg_num < ....; pg_num++, page++) {
                        err = vm_insert_page(vma, start, page);
                        if (unlikely(err))
                                goto out;
                        start += PAGE_SIZE;
                }
        }
        vma->vm_ops = &xxx_mmap_ops;

 

 

Memory leak detection (kmemleak):

 

http://www.kernel.org/doc/Documentation/kmemleak.txt

 

Page allocations and ioremap are not tracked.

Only kmalloc, kmem_cache_alloc (slab allocations) and vmalloc.

 

Build with CONFIG_DEBUG_KMEMLEAK

 

To disable it by default (but keep it available):

build with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF

re-enable at boot time with kmemleak=on

 

mount  -t debugfs  nodev  /sys/kernel/debug

cat  /sys/kernel/debug/kmemleak

 

Clear the list of all current possible memory leaks:

 

echo clear > /sys/kernel/debug/kmemleak

 

Trigger an intermediate memory scan:

 

echo scan > /sys/kernel/debug/kmemleak

cat  /sys/kernel/debug/kmemleak

 

Disable default thread that scans every 10 minutes:

 

echo 'scan=off' > /sys/kernel/debug/kmemleak

Dump the information about object at <addr>:

 

echo 'dump=<addr>' > /sys/kernel/debug/kmemleak

 

Also  /proc/slabinfo




Time:


HZ ticks (jiffies) per second, typically 100 or 1000; the tick can also be stopped (NOHZ).
User-visible value is USER_HZ, so when communicating to userland scale like x / (HZ / USER_HZ).

Jiffies exists as an unsigned long and as a 64-bit value (jiffies, jiffies_64).

unsigned long volatile   jiffies;                           // wraps around if BITS_PER_LONG < 64
u64                                    jiffies_64;

On x86 they are separate (but jiffies can overlay lower word of jiffies_64).
On x64 they are the same thing.

jiffies may be read atomically.
jiffies_64 on 32-bit architectures (BITS_PER_LONG < 64) must be read under seqlock_t  jiffies_lock.

Can always use get_jiffies_64(), on 32-bit architectures it reads data under seqlock, on 64-bit architectures just reads it.

#define  time_after(a,b)                           ((long)((b) - (a)) < 0)

#define  time_before(a,b)                       time_after(b,a)

#define  time_after_eq(a,b)                   ((long)((a) - (b)) >= 0)

#define  time_before_eq(a,b)                time_after_eq(b,a)

#define  time_in_range(a,b,c)                (time_after_eq(a,b) && time_before_eq(a,c))

#define  time_in_range_open(a,b,c)      (time_after_eq(a,b) && time_before(a,c))

#define  time_after64(a,b)                     ((__s64)((b) - (a)) < 0)

#define  time_before64(a,b)                  time_after64(b,a)

#define  time_after_eq64(a,b)               ((__s64)((a) - (b)) >= 0)

#define  time_before_eq64(a,b)            time_after_eq64(b,a)

 

#define  time_is_before_jiffies(a)          time_after(jiffies, a)

#define  time_is_after_jiffies(a)             time_before(jiffies, a)

#define  time_is_before_eq_jiffies(a)    time_after_eq(jiffies, a)

#define  time_is_after_eq_jiffies(a)       time_before_eq(jiffies, a)

 

unsigned int            jiffies_to_msecs(const unsigned long j);

unsigned int            jiffies_to_usecs(const unsigned long j);

unsigned long         msecs_to_jiffies(const unsigned int m);

unsigned long         usecs_to_jiffies(const unsigned int u);

unsigned long         timespec_to_jiffies(const struct timespec *value);

void                         jiffies_to_timespec(const unsigned long jiffies, struct timespec *value);

unsigned long         timeval_to_jiffies(const struct timeval *value);

void                         jiffies_to_timeval(const unsigned long jiffies, struct timeval *value);

 

clock_t                    jiffies_to_clock_t(unsigned long x);

clock_t                    jiffies_delta_to_clock_t(long delta)

 

unsigned long         clock_t_to_jiffies(unsigned long x);

u64                          jiffies_64_to_clock_t(u64 x);

u64                          nsec_to_clock_t(u64 x);

u64                          nsecs_to_jiffies64(u64 n);

unsigned long         nsecs_to_jiffies(u64 n);

 

struct timespec {

        __kernel_time_t    tv_sec;
        long                          tv_nsec;
}

 



Delays:


set_current_state(TASK_[UN]INTERRUPTIBLE)
long remaining = schedule_timeout(10 * HZ)


unsigned long  wait_till  =  jiffies + 5 * HZ;
while (time_before(jiffies, wait_till))
        cond_resched();                                                                       // invoke scheduler if need_resched is true, e.g. there is some higher-priority task


void   mdelay(unsigned long msecs)              // busy-wait using calibrated loops
void   udelay(unsigned long usecs)
void   ndelay(unsigned long nsecs)

const enum hrtimer_mode mode = HRTIMER_MODE_REL;

ktime_t expires;

 

int  schedule_hrtimeout(&expires, mode)

int  schedule_hrtimeout_range(&expires, unsigned long delta, mode)

int  schedule_hrtimeout_range_clock(&expires, unsigned long delta, mode, int clock)
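
A sketch of a short high-resolution sleep (~100 us with 50 us of slack); the task state must be set before calling:

    ktime_t expires = ktime_set(0, 100 * NSEC_PER_USEC);

    set_current_state(TASK_UNINTERRUPTIBLE);
    schedule_hrtimeout_range(&expires, 50 * NSEC_PER_USEC, HRTIMER_MODE_REL);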



On  timer  tick:

 

/*

 * Event handler for periodic ticks

 */

void tick_handle_periodic(struct clock_event_device *dev)

{

          int cpu = smp_processor_id();

          ktime_t next;

 

          tick_periodic(cpu);

 

          if (dev->mode != CLOCK_EVT_MODE_ONESHOT)

                   return;

          /*

           * Setup the next period for devices, which do not have

           * periodic mode:

           */

          next = ktime_add(dev->next_event, tick_period);

          for (;;) {

                   if (!clockevents_program_event(dev, next, false))

                             return;

                   /*

                    * Have to be careful here. If we're in oneshot mode,

                    * before we call tick_periodic() in a loop, we need

                    * to be sure we're using a real hardware clocksource.

                    * Otherwise we could get trapped in an infinite

                    * loop, as the tick_periodic() increments jiffies,

                    * when then will increment time, posibly causing

                    * the loop to trigger again and again.

                    */

                   if (timekeeping_valid_for_hres())

                             tick_periodic(cpu);

                   next = ktime_add(next, tick_period);

          }

}

 

/*

 * Periodic tick

 */

static void tick_periodic(int cpu)

{

          if (tick_do_timer_cpu == cpu) {

                   write_seqlock(&jiffies_lock);

 

                   /* Keep track of the next tick event */

                   tick_next_period = ktime_add(tick_next_period, tick_period);

 

                   do_timer(1);

                   write_sequnlock(&jiffies_lock);

          }

 

          update_process_times(user_mode(get_irq_regs()));

          profile_tick(CPU_PROFILING);
}

 


/*

 * Must hold jiffies_lock

 */

void do_timer(unsigned long ticks)

{

          jiffies_64 += ticks;

          update_wall_time();

          calc_global_load(ticks);

}

 

/*

 * Called from the timer interrupt handler to charge one tick to the current

 * process.  user_tick is 1 if the tick is user time, 0 for system.

 */

void update_process_times(int user_tick)

{

          struct task_struct *p = current;

          int cpu = smp_processor_id();

 

          /* Note: this timer irq context must be accounted for as well. */

          account_process_tick(p, user_tick);

          run_local_timers();

          rcu_check_callbacks(cpu, user_tick);

          printk_tick();

#ifdef CONFIG_IRQ_WORK

          if (in_irq())

                   irq_work_run();

#endif

          scheduler_tick();

          run_posix_cpu_timers(p);

}

 


/*

 * Called by the local, per-CPU timer interrupt on SMP.

 */

void run_local_timers(void)

{

          hrtimer_run_queues();

          raise_softirq(TIMER_SOFTIRQ);

}

 


/*

 * Account a single tick of cpu time.

 * @p: the process that the cpu time gets accounted to

 * @user_tick: indicates if the tick is a user or a system tick

 */

void account_process_tick(struct task_struct *p, int user_tick)

{

          cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);

          struct rq *rq = this_rq();

 

          if (sched_clock_irqtime) {

                   irqtime_account_process_tick(p, user_tick, rq);

                   return;

          }

 

          if (steal_account_process_tick())

                   return;

 

          if (user_tick)

                   account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);

          else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))

                   account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,

                                          one_jiffy_scaled);

          else

                   account_idle_time(cputime_one_jiffy);

}




Print messages, dump stack etc.:

 

printk(KERN_WARNING "format string", args...)

 

KERN_EMERG
KERN_ALERT
KERN_CRIT
KERN_ERR

KERN_WARNING
KERN_NOTICE
KERN_INFO
KERN_DEBUG

 

circular buffer, default size 16K (LOG_BUF_LEN), configurable with CONFIG_LOG_BUF_SHIFT

 

dump_stack()

__schedule_bug(task)

debug_show_held_locks(task)

print_modules()

print_irqtrace_events(task)

 

#include <linux/ratelimit.h>

// deprecated since rate-limiting state is global for all printk_ratelimit() call sites

if (printk_ratelimit())
        printk(...)

 

// better way: limit for a particular call site

printk_ratelimited(KERN_WARNING "format-string", args)

 

pr_xxx(format, args):

 

pr_emerg                                    KERN_EMERG (0)

pr_alert                                    KERN_ALERT (1)

pr_crit                                     KERN_CRIT (2)

pr_err                                      KERN_ERR (3)

pr_warn, pr_warning                         KERN_WARNING (4)

pr_notice                                   KERN_NOTICE (5)

pr_info                                     KERN_INFO (6)

pr_debug, pr_devel (if DEBUG is defined)    KERN_DEBUG (7)

pr_cont                                     KERN_CONT = continue line with no NL char

 

KERN_DEFAULT (d)

 

to enable all messages: echo 8  >/proc/sys/kernel/printk

default is "warning (4) and up"

content of /proc/sys/kernel/printk (full):  <current>  <default>  <minimum>  <boot-time-default>

on boot: loglevel=8

 

printk_once(...)

pr_emerg_once(...) and other pr_xxx_once(...)

pr_xxx_ratelimited(....)

 

vprintk(...)

vprintk_emit(...)

 

hex_dump_to_buffer(...)

print_hex_dump(...)

print_hex_dump_bytes(...)

print_hex_dump_debug(...)
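
Example of dumping a buffer at KERN_DEBUG, 16 bytes per line with offsets and an ASCII column (buf/len are illustrative):

    print_hex_dump(KERN_DEBUG, "xxx rx: ", DUMP_PREFIX_OFFSET,
                   16, 1, buf, len, true);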

 

For drivers:

 

#include <linux/device.h>

 

dev_xxx(const struct device *dev, fmt, args)

e.g. dev_dbg(...)

dev_xxx_once(...)

dev_xxx_ratelimited(dev, fmt, args)

dev_vprintk_emit(...)

dev_WARN(...), dev_WARN_ONCE(...) => includes file/lineno and backtrace

 

http://elinux.org/Debugging_by_printing

 

How to selectively enable/disable pr_debug/dev_dbg call sites (via debugfs):

https://www.kernel.org/doc/Documentation/dynamic-debug-howto.txt

 

/proc/sys/kernel/printk_delay

/proc/sys/kernel/printk_ratelimit

/proc/sys/kernel/printk_ratelimit_burst

http://www.ibm.com/developerworks/linux/library/l-kernel-logging-apis/index.html

 

printk formats: https://www.kernel.org/doc/Documentation/printk-formats.txt

 

                                      int                        %d or %x
                                      unsigned int               %u or %x
                                      long                       %ld or %lx
                                      unsigned long              %lu or %lx
                                      long long                  %lld or %llx
                                      unsigned long long         %llu or %llx
                                      size_t                     %zu or %zx
                                      ssize_t                    %zd or %zx
                                      s32                        %d or %x
                                      u32                        %u or %x
                                      s64                        %lld or %llx
                                      u64                        %llu or %llx
                                      pointer                    %p

 

Also has formats for:

 

symbolic decoding of pointers (symbol + offset)

physical addresses (phys_addr_t)

DMA addresses (dma_addr_t)

raw buffers as hex string

MAC/FDDI addresses

IPv4/IPv6 addresses

UUID/GUID

dentry names

struct clk

bitmaps                        

      

Printing to ftrace buffer:

      

#include <linux/kernel.h>

trace_printk(...)

trace_puts(str)  => extra fast

trace_dump_stack()

ftrace_vprintk(...)

 



Trees:

 

Red-black tree – self-balancing (semi-balanced) binary search tree:

 

·          each non-leaf node has 1 or 2 children

·          value(left child) < value(node) < value(right child)

·          depth(deepest path) <= 2 * depth(shallowest path)

 

struct rb_root {

          struct rb_node *rb_node;
};

 

struct rb_node {

          unsigned long  __rb_parent_color;

          struct rb_node *rb_right;

          struct rb_node *rb_left;

};

 

Typical user node:

 

struct mytype {

    struct rb_node node;

    char* mystring;
};

 

struct rb_root root = RB_ROOT;                  // { NULL }

 

#define rb_parent(r)   ((struct rb_node *)((r)->__rb_parent_color & ~3))

 

#define rb_entry(ptr, type, member) container_of(ptr, type, member)

 

void  rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root)

void  rb_link_node(struct rb_node * node, struct rb_node * parent, struct rb_node ** rb_link)

 

Searching:

 

  struct mytype *my_search(struct rb_root *root, char *string)
  {
        struct rb_node *node = root->rb_node;

        while (node) {
                struct mytype *data = container_of(node, struct mytype, node);
                int result;

                result = strcmp(string, data->mystring);

                if (result < 0)
                        node = node->rb_left;
                else if (result > 0)
                        node = node->rb_right;
                else
                        return data;
        }
        return NULL;
  }

 

Remove an existing node:

 

  struct mytype *data = my_search(&mytree, "walrus");

 

  if (data) {

                       rb_erase(&data->node, &mytree);           // void rb_erase(struct rb_node *victim, struct rb_root *tree)

                       myfree(data);

  }

 

To replace an existing node with the new one with the same key:


void rb_replace_node(struct rb_node *victim, struct rb_node *new, struct rb_root *root);

 


Inserting into tree:

 

  int my_insert(struct rb_root *root, struct mytype *data)
  {
        struct rb_node **new = &(root->rb_node), *parent = NULL;

        /* Figure out where to put new node */
        while (*new) {
                struct mytype *this = container_of(*new, struct mytype, node);
                int result = strcmp(data->mystring, this->mystring);

                parent = *new;
                if (result < 0)
                        new = &((*new)->rb_left);
                else if (result > 0)
                        new = &((*new)->rb_right);
                else
                        return FALSE;
        }

        /* Add new node and rebalance tree. */
        rb_link_node(&data->node, parent, new);
        rb_insert_color(&data->node, root);

        return TRUE;
  }

 

Iterating through the tree:

 

struct rb_node*  rb_first(const struct rb_root*)

struct rb_node*  rb_last(const struct rb_root*)

struct rb_node*  rb_next(const struct rb_node*)

struct rb_node*  rb_prev(const struct rb_node*)

 

  struct rb_node *xnode;

  for (xnode = rb_first(&mytree);  xnode;  xnode = rb_next(xnode))

           printk("key=%s\n", rb_entry(xnode, struct mytype, node)->mystring);

 

 

 

Radix tree: maps "unsigned long" to "void*"

 

 #include <linux/radix-tree.h>

 

RADIX_TREE(name, gfp_mask);

 

struct radix_tree_root my_tree;

INIT_RADIX_TREE(my_tree, gfp_mask);

 

struct radix_tree_root *tree;

unsigned long key;

void *item;

 

int  radix_tree_insert(tree, key, item);

item = radix_tree_lookup(tree, key);

item = radix_tree_delete(tree, key);
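
Usage sketch (xxx_tree and struct xxx_item are illustrative; insertion can fail with -ENOMEM or -EEXIST):

    static RADIX_TREE(xxx_tree, GFP_ATOMIC);     /* gfp_mask is used for internal node allocation */

    struct xxx_item *item;                       /* allocated elsewhere */
    int err = radix_tree_insert(&xxx_tree, key, item);
    if (err)
            return err;

    item = radix_tree_lookup(&xxx_tree, key);
    item = radix_tree_delete(&xxx_tree, key);    /* returns the deleted item or NULL */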




Kobjects:

 

struct kobject {

          const char*                  name;

          struct list_head             entry;

          struct kobject*              parent;

          struct kset*                 kset;

          struct kobj_type*            ktype;

          struct sysfs_dirent*         sd;

          struct kref                  kref;

          unsigned int    state_initialized:1;

          unsigned int    state_in_sysfs:1;

          unsigned int    state_add_uevent_sent:1;

          unsigned int    state_remove_uevent_sent:1;

          unsigned int    uevent_suppress:1;

};

 

          struct kobj_type {

                   void (*release)(struct kobject *kobj);

                   const struct sysfs_ops *sysfs_ops;

                   struct attribute **default_attrs;

                   const struct kobj_ns_type_operations *(*child_ns_type)(struct kobject *kobj);

                   const void *(*namespace)(struct kobject *kobj);

};

struct kobject*   kobject_get(struct kobject *kobj)

 

If kobj is not NULL, increment reference counter and return kobj.

 

void    kobject_put(struct kobject *kobj)

 

If kobj is not NULL, decrement reference counter, if goes to zero then release the object.

 

kobject/kset sample: LINUX_ROOT/samples/kobject

 

kobj_map:

Handles device numbers.
There are two kobj_maps: one for block devices and one for character devices.
Each struct probe describes device number range starting at dev with length range.

struct kobj_map *    kobj_map_init(kobj_probe_t *base_probe, struct subsystem *s)

int      kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
                              struct module *owner, kobj_probe_t *get, int (*lock)(dev_t, void *), void *data)

void    kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range)

struct kobject *    kobj_lookup(struct kobj_map *domain, dev_t dev, int *index)

Finds the given device number dev on the given map domain. If the owner field is set, we temporarily take a reference on the corresponding module via try_module_get(owner) in order to protect the lock and get calls. If the lock function is present it will be called, and the present probe skipped if it returns an error. Then the get function is called to get the kobject for the given device number. The resulting kobject is returned as value, and the offset in the interval of device numbers is returned via index; module is released via module_put.


 

Lockup detection:

 

/proc/sys/kernel/nmi_watchdog

/proc/sys/kernel/watchdog_thresh

 

Default is 10 seconds, meaning 10 seconds for hard lockup detection and 20 seconds for soft lockup detection.

 

Hard lockup is lock up in kernel mode with interrupts disabled.

Soft lockup is lock up in kernel mode not letting other tasks to run.

 

Code is in kernel/watchdog.c

Docs in Documentation/lockup-watchdogs.txt

 



Using floating point in kernel:

#include <asm/i387.h>


kernel_fpu_begin();
.....
kernel_fpu_end();



RCU:

 

For read-mostly data, when transient reader/writer inconsistency is acceptable.
Decreases read-side overhead, increases write-side overhead.

 

#include <linux/rcupdate.h>

 

// read-side RCU critical section:

//   - cannot sleep, but can be preempted if CONFIG_PREEMPT_RCU

//   - can be nested

 

static struct xxx_struct* __rcu xptr;

 

rcu_read_lock();

p = rcu_dereference(xptr);

// p is valid here

do_something_with(p);

rcu_read_unlock();

// p is invalid here

 

// update section (synchronous, waiting for RCU sync)

// use spinlocks, semaphores etc. to interlock updaters

// cannot be called from irq or bh context

 

spin_lock(&updater_lock);

p = xptr;

rcu_assign_pointer(xptr, new_p);

spin_unlock(&updater_lock);

synchronize_rcu();                                                       // wait for grace period

kfree(p);

 

// update section (asynchronous,non-waiting for RCU sync)

// use spinlocks, semaphores etc. to interlock updaters

 

spin_lock(&updater_lock);

p = xptr;

rcu_assign_pointer(xptr, new_p);

spin_unlock(&updater_lock);

call_rcu(&p->x_rcu, xxx_rcu_func);

 

struct xxx_struct {

        struct rcu_head x_rcu;

};

 

void xxx_rcu_func(struct rcu_head *p_rcu_head)

{

       struct xxx_struct *p = container_of(p_rcu_head, struct xxx_struct, x_rcu);

       destroy_xxx_struct(p);

}
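
When the callback would do nothing but kfree() the enclosing structure, kfree_rcu can be used instead of call_rcu plus a callback (sketch reusing xptr, updater_lock and xxx_struct from above):

    spin_lock(&updater_lock);
    p = xptr;
    rcu_assign_pointer(xptr, new_p);
    spin_unlock(&updater_lock);
    kfree_rcu(p, x_rcu);                         /* second argument is the name of the rcu_head member */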

 

Note:  RCU callbacks can be invoked in softirq context. Therefore it cannot block, and any lock acquired within RCU callback must be

acquired elsewhere with spin_lock_irq or spin_lock_bh, to avoid self-deadlock. Watch out for locks indirectly acquired

by API functions (e.g. kfree etc.) called by the callback!

 

Any lock acquired by RCU callback must be acquired elsewhere with softirq disabled (e.g. spin_lock_bh, spin_lock_irqsave), or self-deadlock will result.

 

RCU callbacks are usually executed on the same CPU, as scheduled by call_rcu_xxx, but this is not guaranteed (for example, if CPU goes offline, pending callbacks will be transferred to another CPU for execution).

 

RCU read-side primitives do not necessarily contain memory barriers.  Therefore CPU and compiler may reorder code into and out of RCU

read-side critical sections.  It is the responsibility of the RCU update-side primitives to deal with this.

 


What is RCU, Fundamentally? 
http://lwn.net/Articles/262464

What is RCU? Part 2: Usage   http://lwn.net/Articles/263130

RCU part 3: the RCU API      http://lwn.net/Articles/264090

The RCU API, 2010 Edition    http://lwn.net/Articles/418853

 

"Is parallel programming hard..."

http://www.kernel.org/pub/linux/kernel/people/paulmck/perfbook/perfbook.html

 

http://www.rdrop.com/users/paulmck/RCU

 

http://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt

http://www.kernel.org/doc/Documentation/RCU/UP.txt

http://www.kernel.org/doc/Documentation/RCU/listRCU.txt

http://www.kernel.org/doc/Documentation/RCU/trace.txt

 

enumerate list entries

    with a reader-writer lock:

        read_lock(&xxx_lock);
        list_for_each_entry(p, &xxx_listhead, list) {
                ...
        }
        read_unlock(&xxx_lock);

    with RCU:

        rcu_read_lock();
        list_for_each_entry_rcu(p, &xxx_listhead, list) {
                ...
        }
        rcu_read_unlock();

add list entry

    with a reader-writer lock:

        write_lock(&xxx_lock);
        if (p->condition)
                list_add(&p->list, xxx_listhead);
        else
                list_add_tail(&p->list, xxx_listhead);
        write_unlock(&xxx_lock);

    with RCU (any updater-side lock):

        lock(&xxx_lock);
        if (p->condition)
                list_add_rcu(&p->list, xxx_listhead);
        else
                list_add_tail_rcu(&p->list, xxx_listhead);
        unlock(&xxx_lock);

delete list entry

    with a reader-writer lock:

        write_lock(&xxx_lock);
        list_for_each_entry(p, &xxx_listhead, list) {
                if (p->condition) {
                        list_del(&p->list);
                        write_unlock(&xxx_lock);
                        return;
                }
        }
        write_unlock(&xxx_lock);

    with RCU:

        lock(&xxx_lock);
        list_for_each_entry_rcu(p, &xxx_listhead, list) {
                if (p->condition) {
                        list_del_rcu(&p->list);
                        unlock(&xxx_lock);
                        call_rcu(&p->rcu, xxx_free);
                        return;
                }
        }
        unlock(&xxx_lock);

in-place updates

    with a reader-writer lock:

        write_lock(&xxx_lock);
        list_for_each_entry(p, &xxx_listhead, list) {
                if (p->condition) {
                        p->a = new_a;
                        p->b = new_b;
                        write_unlock(&xxx_lock);
                        return;
                }
        }
        write_unlock(&xxx_lock);

    with RCU (copy, update, replace):

        lock(&xxx_lock);
        list_for_each_entry_rcu(p, &xxx_listhead, list) {
                if (p->condition) {
                        np = copy(p);
                        if (np == NULL)   ...
                        np->a = new_a;
                        np->b = new_b;
                        list_replace_rcu(&p->list, &np->list);
                        unlock(&xxx_lock);
                        call_rcu(&p->rcu, xxx_free);
                        return;
                }
        }
        unlock(&xxx_lock);

eliminating stale data

    If stale data cannot be tolerated, use a per-entry “deleted” flag and per-entry spinlock, and re-verify the validity of data.

 

https://www.kernel.org/doc/Documentation/RCU/checklist.txt

https://www.kernel.org/doc/Documentation/RCU/NMI-RCU.txt

https://www.kernel.org/doc/Documentation/RCU/arrayRCU.txt

https://www.kernel.org/doc/Documentation/RCU/rcu_dereference.txt

https://www.kernel.org/doc/Documentation/RCU/rcubarrier.txt

https://www.kernel.org/doc/Documentation/RCU/lockdep.txt

https://www.kernel.org/doc/Documentation/RCU/lockdep-splat.txt

Comparison of RCU flavors (RCU Classic, RCU BH, RCU Sched, Realtime RCU, SRCU, QRCU), attribute by attribute:

Purpose
    Classic:        Wait for RCU read-side critical sections
    BH:             Wait for RCU-BH read-side critical sections & irqs
    Sched:          Wait for RCU-Sched read-side critical sections, preempt-disable regions, hardirqs & NMIs
    Realtime:       Realtime response
    SRCU:           Wait for SRCU read-side critical sections, allow sleeping readers
    QRCU:           Sleeping readers and fast grace periods

Read-side primitives
    Classic:        rcu_read_lock(), rcu_read_unlock()
    BH:             rcu_read_lock_bh(), rcu_read_unlock_bh()
    Sched:          preempt_disable(), preempt_enable() (and friends)
    Realtime:       rcu_read_lock(), rcu_read_unlock()
    SRCU:           srcu_read_lock(), srcu_read_unlock()
    QRCU:           qrcu_read_lock(), qrcu_read_unlock()

Update-side primitives (synchronous)
    Classic:        synchronize_rcu(), synchronize_net()
    BH:             synchronize_rcu_bh()
    Sched:          synchronize_sched()
    Realtime:       synchronize_rcu(), synchronize_net()
    SRCU:           synchronize_srcu()
    QRCU:           synchronize_qrcu()

Update-side primitives (synchronous expedited)
    Classic:        synchronize_rcu_expedited()
    BH:             synchronize_rcu_bh_expedited()
    Sched:          synchronize_rcu_sched_expedited()
    Realtime:       (none)
    SRCU:           synchronize_srcu_expedited()
    QRCU:           (none)

Update-side primitives (asynchronous/callback)
    Classic:        call_rcu()
    BH:             call_rcu_bh()
    Sched:          call_rcu_sched()
    Realtime:       call_rcu()
    SRCU:           N/A
    QRCU:           N/A

Update-side primitives (wait for callbacks)
    Classic:        rcu_barrier()
    BH:             rcu_barrier_bh()
    Sched:          rcu_barrier_sched()
    Realtime:       rcu_barrier()
    SRCU:           N/A
    QRCU:           N/A

Read side constraints
    Classic:        No blocking except preemption and spinlock acquisition
    BH:             No bh enabling
    Sched:          No blocking
    Realtime:       No blocking except preemption and lock acquisition
    SRCU:           No synchronize_srcu()
    QRCU:           No synchronize_qrcu()

Read side overhead
    Classic:        Preempt disable/enable (free on non-PREEMPT)
    BH:             BH disable/enable
    Sched:          Preempt disable/enable (free on non-PREEMPT)
    Realtime:       Simple instructions, irq disable/enable
    SRCU:           Simple instructions, preempt disable/enable
    QRCU:           Atomic increment and decrement of shared variable

Asynchronous update-side overhead (for example, call_rcu())
    Classic:        sub-microsecond
    BH:             sub-microsecond
    Sched:          sub-microsecond
    Realtime:       sub-microsecond
    SRCU:           N/A
    QRCU:           N/A

Grace-period latency
    Classic:        10s of milliseconds
    BH:             10s of milliseconds
    Sched:          10s of milliseconds
    Realtime:       10s of milliseconds
    SRCU:           10s of milliseconds
    QRCU:           10s of nanoseconds in absence of readers

Non-PREEMPT_RT implementation
    Classic:        RCU Classic
    BH:             RCU BH
    Sched:          RCU Classic
    Realtime:       N/A
    SRCU:           SRCU
    QRCU:           N/A

PREEMPT_RT implementation
    Classic:        N/A
    BH:             Realtime RCU
    Sched:          Forced Schedule on all CPUs
    Realtime:       Realtime RCU
    SRCU:           SRCU
    QRCU:           N/A

 

RCU API by flavor (critical sections / grace period / barrier):

Classic
    Critical sections:   rcu_read_lock, rcu_read_unlock, rcu_dereference,
                         rcu_read_lock_held, rcu_dereference_check, rcu_dereference_protected
    Grace period:        synchronize_net, synchronize_rcu, synchronize_rcu_expedited, call_rcu, kfree_rcu
    Barrier:             rcu_barrier

BH
    Critical sections:   rcu_read_lock_bh, rcu_read_unlock_bh, rcu_dereference_bh,
                         rcu_read_lock_bh_held, rcu_dereference_bh_check, rcu_dereference_bh_protected
    Grace period:        call_rcu_bh, synchronize_rcu_bh, synchronize_rcu_bh_expedited
    Barrier:             rcu_barrier_bh

Sched
    Critical sections:   rcu_read_lock_sched, rcu_read_unlock_sched, rcu_dereference_sched,
                         rcu_read_lock_sched_held, rcu_dereference_sched_check, rcu_dereference_sched_protected,
                         rcu_read_lock_sched_notrace, rcu_read_unlock_sched_notrace
                         [preempt_disable and friends]
    Grace period:        synchronize_sched, call_rcu_sched, synchronize_sched_expedited
    Barrier:             rcu_barrier_sched

SRCU
    Critical sections:   srcu_read_lock, srcu_read_unlock, srcu_dereference,
                         srcu_dereference_check, srcu_read_lock_held
    Grace period:        synchronize_srcu, call_srcu, synchronize_srcu_expedited
    Barrier:             srcu_barrier

 

SRCU initialization/cleanup:

init_srcu_struct
cleanup_srcu_struct

 

If the updater uses                          then readers must use

synchronize_rcu, call_rcu                    rcu_read_lock / rcu_read_unlock

synchronize_rcu_bh, call_rcu_bh              rcu_read_lock_bh / rcu_read_unlock_bh
                                             (exception: rcu_read_lock()/rcu_read_unlock() may be used instead if local bottom
                                             halves are already known to be disabled, for example in irq or softirq context)

synchronize_sched, call_rcu_sched            disable preemption, possibly by calling
                                             rcu_read_lock_sched / rcu_read_unlock_sched

synchronize_srcu, call_srcu                  srcu_read_lock / srcu_read_unlock (with the same srcu_struct)

 

·          RCU list traversal and pointer dereference primitives (list_xxx_rcu, rcu_dereference_xxx) must be used:

o    inside RCU read-side critical section

o    or protected by update-side lock.

·          synchronize_rcu will only wait for all currently executing rcu_read_lock sections. It does not (necessarily) wait for irqs, NMIs, preempt_disable sections, or idle loops to complete.

·          synchronize_irq(irq) waits for pending IRQ handlers on other CPUs

·          synchronize_sched waits for pending preempt_disable code sequences, including NMI and non-threaded hardware-interrupt handlers.  However, this does not guarantee that softirq handlers will have completed, since in some kernels, these handlers can run in process context, and can block.

 

Category / primitive / purpose:

List traversal:
    list_for_each_entry_rcu()              – Iterate over an RCU-protected list from the beginning.
    list_for_each_entry_continue_rcu()     – Iterate over an RCU-protected list from the specified element.
    list_entry_rcu()                       – Given a pointer to a raw list_head in an RCU-protected list, return a pointer to the enclosing element.
    list_first_entry_rcu()                 – Return the first element of an RCU-protected list.

List update:
    list_add_rcu()                         – Add an element to the head of an RCU-protected list.
    list_add_tail_rcu()                    – Add an element to the tail of an RCU-protected list.
    list_del_rcu()                         – Delete the specified element from an RCU-protected list, poisoning the ->prev pointer but not the ->next pointer.
    list_replace_rcu()                     – Replace the specified element in an RCU-protected list with the specified element.
    list_splice_init_rcu()                 – Move all elements from an RCU-protected list to another RCU-protected list.

Hlist traversal:
    hlist_for_each_entry_rcu()             – Iterate over an RCU-protected hlist from the beginning.
    hlist_for_each_entry_rcu_bh()          – Iterate over an RCU-bh-protected hlist from the beginning.
    hlist_for_each_entry_continue_rcu()    – Iterate over an RCU-protected hlist from the specified element.
    hlist_for_each_entry_continue_rcu_bh() – Iterate over an RCU-bh-protected hlist from the specified element.

Hlist update:
    hlist_add_after_rcu()                  – Add an element after the specified element in an RCU-protected hlist.
    hlist_add_before_rcu()                 – Add an element before the specified element in an RCU-protected hlist.
    hlist_add_head_rcu()                   – Add an element at the head of an RCU-protected hlist.
    hlist_del_rcu()                        – Delete the specified element from an RCU-protected hlist, poisoning the ->pprev pointer but not the ->next pointer.
    hlist_del_init_rcu()                   – Delete the specified element from an RCU-protected hlist, initializing the element's reverse pointer after deletion.
    hlist_replace_rcu()                    – Replace the specified element in an RCU-protected hlist with the specified element.

Hlist nulls traversal:
    hlist_nulls_for_each_entry_rcu()       – Iterate over an RCU-protected hlist-nulls list from the beginning.

Hlist nulls update:
    hlist_nulls_del_init_rcu()             – Delete the specified element from an RCU-protected hlist-nulls list, initializing the element after deletion.
    hlist_nulls_del_rcu()                  – Delete the specified element from an RCU-protected hlist-nulls list, poisoning the ->pprev pointer but not the ->next pointer.
    hlist_nulls_add_head_rcu()             – Add an element to the head of an RCU-protected hlist-nulls list.
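
A sketch of an RCU-protected list combining the primitives above (hypothetical struct item, item_list and item_lock):

struct item {
        int key;
        struct list_head link;
        struct rcu_head rcu;
};

static LIST_HEAD(item_list);
static DEFINE_SPINLOCK(item_lock);                /* serializes updaters only */

bool item_present(int key)                        /* reader */
{
        struct item *p;
        bool found = false;

        rcu_read_lock();
        list_for_each_entry_rcu(p, &item_list, link) {
                if (p->key == key) {
                        found = true;
                        break;
                }
        }
        rcu_read_unlock();
        return found;
}

void item_add(struct item *p)                     /* updater */
{
        spin_lock(&item_lock);
        list_add_rcu(&p->link, &item_list);
        spin_unlock(&item_lock);
}

void item_del(struct item *p)                     /* updater */
{
        spin_lock(&item_lock);
        list_del_rcu(&p->link);
        spin_unlock(&item_lock);
        kfree_rcu(p, rcu);                        /* free after a grace period */
}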

 

Category / primitive / purpose:

Pointer update:
    rcu_assign_pointer()          – Assign to an RCU-protected pointer.

Pointer access:
    rcu_dereference()             – Fetch an RCU-protected pointer, giving a lockdep-RCU error message if not in an RCU read-side critical section.
    rcu_dereference_bh()          – Fetch an RCU-protected pointer, giving a lockdep-RCU error message if not in an RCU-bh read-side critical section.
    rcu_dereference_sched()       – Fetch an RCU-protected pointer, giving a lockdep-RCU error message if not in an RCU-sched read-side critical section.
    srcu_dereference()            – Fetch an RCU-protected pointer, giving a lockdep-RCU error message if not in the specified SRCU read-side critical section.
    rcu_dereference_protected()   – Fetch an RCU-protected pointer with no protection against concurrent updates, giving a lockdep-RCU error message if the specified lockdep condition does not hold. Normally used when the update-side lock is held.
    rcu_dereference_check()       – Fetch an RCU-protected pointer, giving a lockdep-RCU error message if (1) the specified lockdep condition does not hold and (2) not under the protection of rcu_read_lock().
    rcu_dereference_bh_check()    – Fetch an RCU-bh-protected pointer, giving a lockdep-RCU error message if (1) the specified lockdep condition does not hold and (2) not under the protection of rcu_read_lock_bh() (2.6.37 or later).
    rcu_dereference_sched_check() – Fetch an RCU-sched-protected pointer, giving a lockdep-RCU error message if (1) the specified lockdep condition does not hold and (2) not under the protection of rcu_read_lock_sched() or friend (2.6.37 or later).
    srcu_dereference_check()      – Fetch an SRCU-protected pointer, giving a lockdep-RCU error message if (1) the specified lockdep condition does not hold and (2) not under the protection of the specified srcu_read_lock() (2.6.37 or later).
    rcu_dereference_index_check() – Fetch an RCU-protected integral index, giving a lockdep-RCU error message if the specified lockdep condition does not hold.
    rcu_access_pointer()          – Fetch an RCU-protected value (pointer or index) with no protection against concurrent updates. Normally used for pointer comparisons, for example checking for NULL.
    rcu_dereference_raw()         – Fetch an RCU-protected pointer with no lockdep-RCU checks. Strongly discouraged; if you must use it, add a comment stating why, just as you would with smp_mb().
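
A sketch of how the checked variants are typically combined on one __rcu pointer (gbl_foo and gbl_lock as in the earlier sketch):

/* Path shared by readers and updaters: either rcu_read_lock() or holding gbl_lock suffices. */
p = rcu_dereference_check(gbl_foo, lockdep_is_held(&gbl_lock));

/* Update-side only: gbl_lock must be held; no read-side ordering overhead is generated. */
p = rcu_dereference_protected(gbl_foo, lockdep_is_held(&gbl_lock));

/* No dereference, just a NULL test, outside any critical section: */
if (rcu_access_pointer(gbl_foo))
        ...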

 

 

Free an object:

 

kfree_rcu(ptr, rcu_head_member)                       // macro: kfree() the object after a grace period; equivalent to call_rcu() with a kfree-only callback. The second argument is the name of the struct rcu_head field inside *ptr.
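
A minimal sketch (hypothetical struct foo with an embedded rcu_head named rcu):

struct foo {
        int a;
        struct rcu_head rcu;                      /* the second argument to kfree_rcu() names this field */
};

/* after removing all reader-visible references to fp: */
kfree_rcu(fp, rcu);                               /* kfree(fp) after a grace period; no callback function needed */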

 

Initializers:

 

RCU_INIT_POINTER(ptr, NULL);

 

If you are creating an RCU-protected linked structure that is accessed by a single external-to-structure RCU-protected pointer, then you may use RCU_INIT_POINTER to initialize the internal RCU-protected pointers, but you must use rcu_assign_pointer to initialize the external-to-structure pointer -after- you have completely initialized the reader-accessible portions of the linked structure.
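
A sketch of that rule (hypothetical two-node structure published through a single external pointer head; error handling omitted):

struct node {
        int val;
        struct node __rcu *next;
};

static struct node __rcu *head;                   /* external, reader-visible pointer */

struct node *a = kmalloc(sizeof(*a), GFP_KERNEL);
struct node *b = kmalloc(sizeof(*b), GFP_KERNEL);

b->val = 2;
RCU_INIT_POINTER(b->next, NULL);                  /* internal pointers: structure not yet reader-visible */
a->val = 1;
RCU_INIT_POINTER(a->next, b);

rcu_assign_pointer(head, a);                      /* publish last, with the required ordering */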

 

struct x x = {
        .group_leader = &tsk,
        RCU_POINTER_INITIALIZER(real_cred, &init_cred),
        RCU_POINTER_INITIALIZER(cred, &init_cred),
        .comm         = INIT_TASK_COMM,
};

 

 

Unloading a module:

 

RCU and Unloadable Modules

 

void rcu_barrier()                                                         // call from module unload, waits for callbacks to complete

void rcu_barrier_bh()
void rcu_barrier_sched()

 

This primitive does not necessarily wait for an RCU grace period to complete. For example, if there are no RCU callbacks queued anywhere in the system, then rcu_barrier() is within its rights to return immediately, without waiting for anything, much less an RCU grace period.

 

If a callback can re-post itself, the module unload function should employ a global flag to disable the re-posting before calling rcu_barrier().
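
A sketch of that pattern (hypothetical names; the actual work and error handling omitted):

static bool shutting_down;
static struct rcu_head repost_head;

static void repost_cb(struct rcu_head *head)
{
        /* ... do the periodic work ... */
        if (!READ_ONCE(shutting_down))
                call_rcu(head, repost_cb);        /* re-post itself */
}

static void __exit my_module_exit(void)
{
        WRITE_ONCE(shutting_down, true);          /* stop the re-posting first */
        rcu_barrier();                            /* then wait for outstanding callbacks */
}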

 

Debugging:

CONFIG_PROVE_RCU: check that accesses to RCU-protected data structures are carried out under the proper RCU read-side critical section, while holding the right combination of locks, or whatever other conditions are appropriate.

 

CONFIG_DEBUG_OBJECTS_RCU_HEAD: check that you don't pass the same object to call_rcu() (or friends) before an RCU grace period has elapsed since the last time that you passed that same object to call_rcu() (or friends).

 

__rcu sparse checks: tag the pointer to the RCU-protected data structure with __rcu, and sparse will warn you if you access that pointer without the services of one of the variants of rcu_dereference().
 

RCU-BH:

RCU-BH has a faster grace period than classic RCU (helpful, for example, under DDoS attack).

void  call_rcu_bh (struct rcu_head * head, void (*func) (struct rcu_head *head))

 

call_rcu_bh assumes that the read-side critical sections end on completion of a softirq handler. This means that read-side critical sections in process context must not be interrupted by softirqs. This interface is to be used when most of the read-side critical sections are in softirq context. RCU read-side critical sections are delimited by:
  - rcu_read_lock and rcu_read_unlock, if in interrupt context, OR
  - rcu_read_lock_bh and rcu_read_unlock_bh, if in process context.
These may be nested.

 

void rcu_read_lock_bh ()

 

An equivalent of rcu_read_lock, but to be used when updates are being done using call_rcu_bh or synchronize_rcu_bh.

 

Since both call_rcu_bh and synchronize_rcu_bh consider completion of a softirq handler to be a quiescent state, a process in RCU read-side critical section must be protected by disabling softirqs. Read-side critical sections in interrupt context can use just rcu_read_lock, though this should at least be commented to avoid confusing people reading the code.

 

Note that rcu_read_lock_bh and the matching rcu_read_unlock_bh must occur in the same context, for example, it is illegal to invoke rcu_read_unlock_bh from one task if the matching rcu_read_lock_bh was invoked from some other task.
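
A sketch (gbl_foo as in the earlier sketch; old, do_something and reclaim_cb are hypothetical):

/* Reader in process context (softirqs blocked for the duration): */
rcu_read_lock_bh();
p = rcu_dereference_bh(gbl_foo);
if (p)
        do_something(p);                          /* must not block */
rcu_read_unlock_bh();

/* Updater, after unpublishing the old version under the update-side lock: */
synchronize_rcu_bh();                             /* or: call_rcu_bh(&old->rcu, reclaim_cb); */
kfree(old);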


RCU-SCHED:

void  call_rcu_sched (struct rcu_head * head, void (*func) (struct rcu_head *rcu))

call_rcu_sched assumes that the read-side critical sections end on enabling of preemption or on voluntary preemption. RCU read-side critical sections are delimited by:
  - rcu_read_lock_sched and rcu_read_unlock_sched, OR
  - anything that disables preemption.
These may be nested.

 

void rcu_read_lock_sched(void)

 

An equivalent of rcu_read_lock, but to be used when updates are being done using call_rcu_sched or synchronize_rcu_sched. Read-side critical sections can also be introduced by anything that disables preemption, including local_irq_disable and friends.

 

Note that rcu_read_lock_sched and the matching rcu_read_unlock_sched must occur in the same context, for example, it is illegal to invoke rcu_read_unlock_sched from process context if the matching rcu_read_lock_sched was invoked from an NMI handler.

 

void   synchronize_sched() 

 

Wait until an rcu-sched grace period has elapsed.

 

Control will return to the caller some time after a full rcu-sched grace period has elapsed, in other words after all currently executing rcu-sched read-side critical sections have completed.   These read-side critical sections are delimited by rcu_read_lock_sched() and rcu_read_unlock_sched(), and may be nested.  Note that preempt_disable(), local_irq_disable(), and so on may be used in place of rcu_read_lock_sched().

 

This means that all preempt_disable code sequences, including NMI and non-threaded hardware-interrupt handlers, in progress on entry will have completed before this primitive returns.  However, this does not guarantee that softirq handlers will have completed, since in some kernels, these handlers can run in process context, and can block.

 

Note that this guarantee implies further memory-ordering guarantees. On systems with more than one CPU, when synchronize_sched() returns, each CPU is guaranteed to have executed a full memory barrier since the end of its last RCU-sched read-side critical section whose beginning preceded the call to synchronize_sched().  In addition, each CPU having an RCU read-side critical section that extends beyond the return from synchronize_sched() is guaranteed to have executed a full memory barrier after the beginning of synchronize_sched() and before the beginning of that RCU read-side critical section.  Note that these guarantees include CPUs that are offline, idle, or executing in user mode, as well as CPUs that are executing in the kernel.

 

Furthermore, if CPU A invoked synchronize_sched(), which returned to its caller on CPU B, then both CPU A and CPU B are guaranteed to have executed a full memory barrier during the execution of synchronize_sched() -- even if CPU A and CPU B are the same CPU (but again only if the system has more than one CPU).

 

This primitive provides the guarantees made by the (now removed) synchronize_kernel() API.  In contrast, synchronize_rcu() only guarantees that rcu_read_lock() sections will have completed. In "classic RCU", these two guarantees happen to be one and the same, but can differ in realtime RCU implementations.
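
A sketch (gbl_foo as in the earlier sketch; old and do_something are hypothetical):

/* Reader: any non-preemptible region qualifies; the explicit form documents the intent. */
rcu_read_lock_sched();                            /* essentially preempt_disable() */
p = rcu_dereference_sched(gbl_foo);
if (p)
        do_something(p);                          /* must not block */
rcu_read_unlock_sched();

/* Updater: */
synchronize_sched();                              /* waits for all preempt-disabled regions, hard irqs, NMIs */
kfree(old);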

 

 

RCU Misc:

 

void call_rcu_tasks (struct rcu_head * head, void (*func) (struct rcu_head *head))

 

call_rcu_tasks assumes that the read-side critical sections end at a voluntary context switch (not a preemption!), entry into idle, or transition to usermode execution. As such, there are no read-side primitives analogous to rcu_read_lock and rcu_read_unlock, because this primitive is intended to determine that all tasks have passed through a safe state, not so much for data-structure synchronization.

 

 

synchronize_rcu_expedited()
synchronize_sched_expedited()

 

Complete faster, but more expensive.
Cannot be called from a CPU-hotplug notifier or while holding a lock that is acquired by CPU-hotplug notifier.

 

 

SRCU:

 

srcu_xxx may only be invoked from a process context.

 

However, a read-side critical section may be entered in a hardirq or exception handler with srcu_read_lock_raw and then exited in the interrupted task with srcu_read_unlock_raw.

 

SRCU overhead is amortized only over those updates sharing a given srcu_struct, rather than being globally amortized as they are for other forms of RCU. Therefore, SRCU should be used in preference to rw_semaphore only in extremely read-intensive situations, or in situations requiring SRCU's read-side deadlock immunity or low read-side realtime latency.
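
A sketch (hypothetical my_srcu, my_data, new_p/old_p; readers may sleep):

static struct srcu_struct my_srcu;
static struct foo __rcu *my_data;

init_srcu_struct(&my_srcu);                       /* e.g. in module init */

/* Reader: */
int idx = srcu_read_lock(&my_srcu);
p = srcu_dereference(my_data, &my_srcu);
if (p)
        do_something_that_may_sleep(p);
srcu_read_unlock(&my_srcu, idx);

/* Updater: */
rcu_assign_pointer(my_data, new_p);
synchronize_srcu(&my_srcu);                       /* waits only for readers of my_srcu */
kfree(old_p);

cleanup_srcu_struct(&my_srcu);                    /* e.g. in module exit (after srcu_barrier() if call_srcu() was used) */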




PREEMPT_RT patch

 

CONFIG_PREEMPT_NONE                 server

CONFIG_PREEMPT_VOLUNTARY     desktop (preemption in select points only)
CONFIG_PREEMPT                              low-latency desktop (preemption anywhere except spinlocks and “no preempt” sections)
CONFIG_PREEMPT_RT_BASE
CONFIG_PREEMPT_RT_FULL

If PREEMPT_RT, or if CONFIG_IRQ_FORCED_THREADING is set and the kernel is booted with the threadirqs parameter, interrupt handlers are threaded, unless IRQF_NO_THREAD is set in struct irqaction.flags.

Flag IRQF_NO_SOFTIRQ_CALL disables softirq processing when exiting the primary handler – the handler must not request BH/softirq processing.

 

local_irq_disable()
local_irq_enable()
local_irq_save(flags)
local_irq_restore(flags)
irqs_disabled()
irqs_disabled_flags()
local_save_flags(flags)

In the RT kernel, on some architectures these do not actually disable interrupts.
Must not be used in IRQF_NO_THREAD handlers.

 

To actually disable interrupts, use raw_local_xxx() version.
Note: cannot acquire normal spinlock_t with interrupts disabled.

Still disables preemption.

spinlock_t
rwlock_t

In the RT kernel, implemented via rt_mutex.
Critical sections are (or may be) preemptible, but must not schedule voluntarily.
spin_lock_irqsave() etc. do not disable interrupts.
Priority inheritance is used to prevent priority inversion.

Traditional behavior is offered by raw_spinlock_t and raw_rwlock_t.
Use the “regular” method calls (such as spin_lock()) on the raw_xxx types.

seqlock_t

In RT kernel, critical sections are preemptible.

Priority inheritance on write side.

Read side is non-blocking, so no heavy priority inversion issue here.

struct semaphore

In RT kernel, implements priority inheritance.

Traditional behavior is offered by compat_semaphore.

struct rw_semaphore

In RT kernel, subject to priority inheritance, and only one task can hold reader lock, but that task can acquire read lock recursively.

Traditional behavior is offered by compat_rw_semaphore.

DEFINE_PER_CPU_LOCKED(type, name)
DECLARE_PER_CPU_LOCKED(type, name)
    Associate a spinlock_t with a per-cpu variable.

get_per_cpu_locked(var, cpu)
put_per_cpu_locked(var, cpu)
    Get the variable, but first acquire the spinlock.

per_cpu_lock(var, cpu)
    Get the associated spinlock.

per_cpu_locked(var, cpu)
    Get the variable without acquiring the lock, either because it was already acquired, or because an RCU read-side reference to the variable is being made, so no lock is needed.

 

See more:
http://lwn.net/Articles/146861/?format=printable

 

In RT mode:

o    spinlocks (spinlock_t and rwlock_t) are preemptible, see <linux/spinlock_rt.h>

o    RCU read-side critical sections (rcu_read_lock()) are preemptible

o    therefore cannot acquire spinlock with preemption or interrupts disabled (trylock is ok)

o    spin_lock_irqsave(spinlock_t) does not disable interrupts

o    to disable preemption and/or interrupts use spin_lock(raw_spinlock_t) and spin_lock_irqsave(raw_spinlock_t)

o    avoid the use of raw_spinlock_t unless really necessary

o    preemption can be disabled via preempt_disable(), get_cpu_var() or disabling interrupts

o    since spinlocks can now sleep, be careful when accessing current->state and other scheduling state data


Lazy preemption was introduced to mitigate the throughput penalty that the RT kernel imposes on SCHED_OTHER tasks:

 

RT hurts throughput for ordinary SCHED_OTHER tasks. One major issue is the wakeup of a task which immediately preempts the waking task while the waking task holds a lock on which the woken task will block right after having preempted it. In mainline this is prevented by the implicit preemption disabling of spin/rw_lock-held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks.

 

For a SCHED_OTHER task preempting another SCHED_OTHER task, however, this is not a correctness issue: RT is concerned about SCHED_FIFO/RR preemption, not about the purely fairness-driven SCHED_OTHER preemption latencies.

 

So a lazy preemption mechanism was introduced which only applies to a SCHED_OTHER task preempting another SCHED_OTHER task. Besides the existing preempt_count, each task now sports a preempt_lazy_count which is manipulated on lock acquisition and release. This is slightly imprecise, as it is coupled to migrate_disable/enable, so some other mechanisms get the same treatment (e.g. get_cpu_light).

 

On the scheduler side, instead of setting NEED_RESCHED, this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefore allows the waking task to exit the lock-held region before the woken task preempts it. That also works better for cross-CPU wakeups, as the other side can stay in the adaptive spinning loop.

 

For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter.

 

migrate_disable(…)

 

if RT_FULL && SMP, migrate_disable() =

              preempt_lazy_disable();              // lazy-disable preemption (if CONFIG_PREEMPT_LAZY, and for SCHED_OTHER/SCHED_OTHER only)

                 …                                                    // disable current CPU hot-removal (pin_current_cpu)

              p->migrate_disable = 1;              // disable task migration to another CPU

           

if RT_FULL && UP, migrate_disable() = no-op (just a barrier)

in all other cases, migrate_disable() = preempt_disable()

 

spin_lock(lock)

 

migrate_disable();                  // disable task migration to another CPU (or if not RT_FULL, simply preempt_disable)
rt_spin_lock(lock);                 // acquire underlying rt_mutex, can sleep

 

spin_lock_bh(lock)

 

local_bh_disable();                // disable BH processing

migrate_disable();                  // disable task migration to another CPU (or if not RT_FULL, simply preempt_disable)
rt_spin_lock(lock);                 // acquire underlying rt_mutex, can sleep

 

 




Sequential files:

 

struct file (readonly, sequential access) -> seq_file -> provider

 

seq_file serves as an adapter between consumer and producer.

Reading is driven by the consumer (i.e. struct file* object).

seq_file translates read requests into the calls to producer walking the structure.

File read operations pull in data from the provider.

 

Every file "read" operation starts by consuming the existing buffer first.

By default, buffer size is one page.

Once the buffer is empty, "read" then fills in the buffer by calling a producer.

 

Approximate sequence on every buffer fill-in:

 

start

show

next - show

...

next - show

stop

 

#ifdef CONFIG_PROC_FS

 

static const struct file_operations xxx_file_operations = {
        .open    = xxx_open,
        .read    = seq_read,
        .llseek  = seq_lseek,
        .release = seq_release
};

static int xxx_open(struct inode *inode, struct file *file)
{
        return seq_open(file, &xxx_op);
}

static const struct seq_operations xxx_op = {
        .start = xxx_start,
        .next  = xxx_next,
        .stop  = xxx_stop,
        .show  = xxx_show
};

 

static void *xxx_start(struct seq_file *m, loff_t *pos)
{
        // lock the structure
        // use *pos as input and advance to the *pos-th element
        // may save context in m->private for use by next/show/stop
        // return the element to be passed as arg to next/show, or NULL when done
}

static void *xxx_next(struct seq_file *m, void *arg, loff_t *pos)
{
        // ++*pos;
        // return the next element to be passed as arg to next/show, or NULL when done
}

static void xxx_stop(struct seq_file *m, void *arg)
{
        // unlock the structure
}

static int xxx_show(struct seq_file *m, void *arg)
{
        seq_printf(m, "0x%p", ...);
        seq_puts(m, "abc");
        seq_putc(m, '\n');
        return 0;
}

 

 

One-item (simplified) case on top of seq_file:

 

static const struct file_operations xxx_fops = {
        .open    = xxx_open,
        .read    = seq_read,
        .llseek  = seq_lseek,
        .release = single_release,
};

static int xxx_open(struct inode *inode, struct file *file)
{
        return single_open(file, xxx_show, NULL);
}

static int xxx_show(struct seq_file *m, void *v)
{
        seq_printf(m, "0x%p", ...);
        seq_puts(m, "abc");
        seq_putc(m, '\n');
        return 0;
}

 

Procfs:

#include <linux/proc_fs.h>

struct proc_dir_entry *pde, *mde, *spde;

pde  = proc_create_data("pde", S_IRWXU, NULL, &xxx_file_operations, NULL);

mde  = proc_mkdir("dir", NULL);

spde = proc_create_data("spde1", S_IRWXU, mde, &xxx_file_operations, NULL);       // created under "dir"

...

remove_proc_entry("spde1", mde);

remove_proc_entry("dir", NULL);

remove_proc_entry("pde", NULL);

 


 

Kprobes:

 

#include <linux/kprobes.h>

 

struct kprobe kp = {

       // .addr             = do_fork,

       // .symbol_name                      = "do_fork",

       .pre_handler                               = pre_handler,

       .post_handler                             = post_handler

};

 

ret = register_kprobe(&kp);

if (ret < 0) {

       ...

       return ret;

}

 

....

 

unregister_kprobe(&kp);

 

static int __kprobes pre_handler(struct kprobe *p, struct pt_regs *regs)

{

       ...

       return 0;

}

 

static void __kprobes post_handler(struct kprobe *p, struct pt_regs *regs, unsigned long flags)

{

       ...

}

 

kprobes sample:  LINUX_ROOT/samples/kprobes

 

Jprobes:

 

struct jprobe jp = {

       .entry                                               = jp_handler_func,

};

 

jp.kp.addr = (kprobe_opcode_t *)func;

ret = register_jprobe(&jp);

if (ret < 0) {

       ...

       return ret;

}

 

...

 

unregister_jprobe(&jp);

 

static void __kprobes jp_handler_func(long r0, long r1)

{

       if (r0 == FUNC_ARG1 && r1 == FUNC_ARG2)

                                     ...

       jprobe_return();

}

 

Return probes:

 

static int __kprobes

kretprobe_handler(struct kretprobe_instance *ri, struct pt_regs *regs)

{

       if (regs_return_value(regs) == ...)

                                     ...

       return 0;

}

 

 struct kretprobe rp = {

       .handler           = kretprobe_handler,

};

 

rp.kp.addr = (kprobe_opcode_t *)func;

ret = register_kretprobe(&rp);

if (ret < 0) {

       ...

       return ret;

}

...

unregister_kretprobe(&rp);

 


Hardware breakpoints to intercept data access/code execution:

 

sample: LINUX_ROOT/samples/hw_breakpoint

register_wide_hw_breakpoint(...), unregister_wide_hw_breakpoint(...)
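
A sketch along the lines of samples/hw_breakpoint/data_breakpoint.c (the symbol name is hypothetical, and the handler prototype has varied across kernel versions – check the sample for the exact signature):

#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

static struct perf_event * __percpu *wp;

static void wp_handler(struct perf_event *bp, struct perf_sample_data *data,
                       struct pt_regs *regs)
{
        pr_info("watched symbol was accessed\n");
        dump_stack();
}

struct perf_event_attr attr;

hw_breakpoint_init(&attr);
attr.bp_addr = kallsyms_lookup_name("some_variable");     /* hypothetical symbol */
attr.bp_len  = HW_BREAKPOINT_LEN_4;
attr.bp_type = HW_BREAKPOINT_R | HW_BREAKPOINT_W;

wp = register_wide_hw_breakpoint(&attr, wp_handler, NULL);
if (IS_ERR((void __force *)wp))
        /* handle error */;

...

unregister_wide_hw_breakpoint(wp);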


 

Notifiers:

struct atomic_notifier_head {

       spinlock_t lock;

       struct notifier_block __rcu *head;

};

 

struct blocking_notifier_head {

       struct rw_semaphore rwsem;

       struct notifier_block __rcu *head;

};

 

struct notifier_block {

       notifier_fn_t notifier_call;

       struct notifier_block __rcu *next;

       int priority;

};

 

ATOMIC_NOTIFIER_HEAD(h);

BLOCKING_NOTIFIER_HEAD(h);

 

ATOMIC_INIT_NOTIFIER_HEAD(h);

BLOCKING_INIT_NOTIFIER_HEAD(h);

 

int atomic_notifier_chain_register(nh, nb);

int atomic_notifier_chain_unregister(nh, nb);

int atomic_notifier_call_chain(nh, unsigned long val, void *v);

 

int blocking_notifier_chain_register(nh, nb);

int blocking_notifier_chain_unregister(nh, nb);

int blocking_notifier_call_chain(nh, unsigned long val, void *v);

 

int notifier_fn(struct notifier_block *nb, unsigned long action, void *data)

{

       // return NOTIFY_OK

       // return NOTIFY_STOP

}

 

SRCU notifiers use sleepable RCU (SRCU) instead of an rwsem.
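
A usage sketch (my_chain, MY_EVENT, handle_event and data_ptr are hypothetical):

static BLOCKING_NOTIFIER_HEAD(my_chain);

static int my_notifier_fn(struct notifier_block *nb, unsigned long action, void *data)
{
        if (action == MY_EVENT)                   /* hypothetical event code */
                handle_event(data);
        return NOTIFY_OK;                         /* NOTIFY_STOP would end the chain walk */
}

static struct notifier_block my_nb = {
        .notifier_call = my_notifier_fn,
        .priority      = 0,                       /* higher-priority callbacks run first */
};

/* subscriber: */
blocking_notifier_chain_register(&my_chain, &my_nb);
...
blocking_notifier_chain_unregister(&my_chain, &my_nb);

/* publisher: */
blocking_notifier_call_chain(&my_chain, MY_EVENT, data_ptr);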

 


 

debugfs:

#include <linux/debugfs.h>

 

#if defined(CONFIG_DEBUG_FS)

 

static struct dentry *dir, *junk;

static u32 hello;

 

dir = debugfs_create_dir("test", NULL);

if (!dir) ...

 

junk = debugfs_create_u32("hello", 0666, dir, &hello);

if (!junk) ...

 

...

 

debugfs_remove_recursive(dir);

 

==================================

 

static u64 sum;                                   // accumulated via writes to the debugfs file

static int xxx_write_op(void *data, u64 value)
{
        sum += value;
        return 0;
}

 

DEFINE_SIMPLE_ATTRIBUTE(xxx_fops, NULL, xxx_write_op, "%llu\n");

 

dir = debugfs_create_dir("test", NULL);

if (!dir) ...

 

junk = debugfs_create_file( "xxx", 0666, dir, NULL, &xxx_fops);

if (!junk) ...

 

....

 

debugfs_remove_recursive(dir);

 

debugfs_create_u8(name, mode, parent, u8 *value);

debugfs_create_u16(name, mode, parent, u16 *value);

debugfs_create_u32(name, mode, parent, u32 *value);

debugfs_create_u64(name, mode, parent, u64 *value);

debugfs_create_x8(name, mode, parent, u8 *value);

debugfs_create_x16(name, mode, parent, u16 *value);

debugfs_create_x32(name, mode, parent, u32 *value);

debugfs_create_x64(name, mode, parent, u64 *value);

debugfs_create_size_t(name, mode, parent, size_t *value);

debugfs_create_atomic_t(name, mode, parent, atomic_t *value);

debugfs_create_bool(name, mode, parent, u32 *value);

debugfs_create_blob(name, mode, parent, struct debugfs_blob_wrapper *blob);

debugfs_create_regset32(name, mode, parent, struct debugfs_regset32 *regset)

void debugfs_print_regs32(struct seq_file *s, const struct debugfs_reg32 *regs, int nregs, void __iomem *base, char *prefix);

debugfs_create_u32_array(name, mode, parent, u32 *array, u32 elements);

debugfs_create_devm_seqfile(struct device *dev, const char *name, parent, int (*read_fn)(struct seq_file *s, void *data));

 


 

Userspace/kernel communications:

 

·          standard file systems: procfs, sysfs, debugfs, sysctl, configfs

·          custom device

·          custom file system

·          kernel socket, netlink socket

·          kfifo

LINUX_ROOT/samples/kfifo

LINUX_ROOT/include/linux/kfifo.h

·          mmap + other shared memory (map user page to kernel space or vice versa)

·          futex, signal

·          syscall

 



http://www.ibm.com/developerworks/linux/library/l-tasklets/index.html
http://www.linuxjournal.com/article/5833

http://web.archive.org/web/20051023100157/http://kernelnewbies.org/documents/kdoc/kernel-locking/lklockingguide.html
http://www.crashcourse.ca/wiki/index.php/Kernel_topics
http://www.crashcourse.ca/wiki/index.php/Linux_kernel
https://www.kernel.org/doc/htmldocs/kernel-api
http://www.ibm.com/developerworks/linux/library/l-task-killable

http://www.makelinux.net/ldd3 ch 5
https://www.kernel.org/pub/linux/kernel/people/rusty/kernel-locking/c214.html

### kthreads, waiting for completion, priorities
### lists 85-96
### signal processing by thread (especially kill and stop)