FreeBSD kernel interfaces

http://www.freebsd.org/doc/en/books/arch-handbook/smp-design.html

http://www.freebsd.org/cgi/man.cgi?query=(9)&sektion=&apropos=1&manpath=FreeBSD+9.0-RELEASE&title=Section%20

“Introduction to Multithreading and Multiprocessing in the FreeBSD SMPng Network Stack”

pcpu.h

man 3 queue tree
man 3 err warn sysexits
man 4 witness

man 9 intro style
man 9 locking mutex mtx_pool lock rwlock rmlock sx condvar sema LOCK_PROFILING
man 9 sleep
man 9 BUS_SETUP_INTR
man 9 malloc
man 9 …

Context	spin mtx	mutex	sx	rwlock	rmlock	sleep
interrupt filter	+
interrupt thread	+	+		+	+
callout	+	+		+
syscall	+	+	+	+	+	+
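
The table above can be encoded as a small lookup, which is handy when reasoning about what a given code path may do. This is a minimal sketch in plain C; the enum names and may_use() are made up for illustration and are not kernel identifiers. An empty cell in the table is treated as "not allowed".

```c
#include <assert.h>
#include <stdbool.h>

/* Execution contexts: the rows of the table (names are hypothetical). */
enum ctx { CTX_FILTER, CTX_ITHREAD, CTX_CALLOUT, CTX_SYSCALL, CTX_COUNT };

/* Primitives: the columns of the table (names are hypothetical). */
enum prim { P_SPIN, P_MUTEX, P_SX, P_RWLOCK, P_RMLOCK, P_SLEEP, P_COUNT };

/* true where the table has "+", false where the cell is empty. */
static const bool allowed[CTX_COUNT][P_COUNT] = {
	/*                spin   mutex  sx     rwlock rmlock sleep */
	[CTX_FILTER]  = { true,  false, false, false, false, false },
	[CTX_ITHREAD] = { true,  true,  false, true,  true,  false },
	[CTX_CALLOUT] = { true,  true,  false, true,  false, false },
	[CTX_SYSCALL] = { true,  true,  true,  true,  true,  true  },
};

/* Return whether primitive p may be used in context c, per the table. */
static bool
may_use(enum ctx c, enum prim p)
{
	assert((int)c >= 0 && c < CTX_COUNT);
	assert((int)p >= 0 && p < P_COUNT);
	return (allowed[c][p]);
}
```

For example, may_use(CTX_ITHREAD, P_SX) is false, matching the rule that an interrupt thread cannot do unbounded sleep.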

callout:

· implemented as swi(9) threads

· by default run on primary CPU, unless specific CPU is requested;
but this is undocumented and subject to change

critical section:

· critical_enter() … critical_exit()

· while in critical section, thread cannot be preempted or migrated to another CPU

· useful for protecting per-CPU data structures

· cannot protect system-wide data structures

· elevating to PI_REALTIME in FIFO mode prevents preemption by anything except interrupt thread

affinity:

· permanently bind any thread to any CPU: sched_bind() … sched_unbind()

· temporarily bind current thread to current CPU: sched_pin() … sched_unpin()

Turnstile locks vs. sleep-queue locks:

· Turnstile locks (aka performing bounded sleep): regular mutexes (incl. pool version), rwlock, rmlock

· Sleep locks (aka performing unbounded sleep): sx, condvar, lockmgr, semaphores

· Turnstile locks are short-held

· May not request a sleep-queue lock while holding a turnstile lock

· May not sleep while holding a turnstile lock

· Turnstile locks track current lock holder

· Turnstile locks do priority propagation from waiter to holder; sleep-queue locks do not

· When holding spin mutex, cannot acquire a turnstile or sleep-queue lock

· Inside interrupt filter, cannot block at all (i.e. cannot request a turnstile or sleep-queue lock, and cannot sleep)

· Inside interrupt thread, cannot do unbounded sleep

· malloc with M_WAITOK may sleep, so it cannot be called while holding turnstile locks; with M_NOWAIT it does not sleep but still takes regular (non-spin) mutexes, so it is allowed under a regular mutex but not under a spin mutex

Spin mutex:

· Do not sleep while waiting, only spin.

· Interrupts are disabled (or deferred) while held.

· Because interrupts are being disabled, more costly than regular mutexes.

· Cannot acquire any other locking primitive (except other spin mutex) while holding spin mutex.

· Cannot access userspace while holding spin mutex.

· Options: allow recursive acquisition, no witness, no profiling, no logging.

Mutex:

· If busy, thread will block on turnstile (not sleep).

· Adaptive: will spin if owner is running, will block on turnstile if owner is on run queue.

· Supports priority propagation.

· Cannot sleep while holding a mutex (except for Giant).

· Cannot access userspace while holding a mutex (except for Giant).

· Giant (if required) must be acquired before any other mutexes.

· Options: allow recursive acquisition, no witness, no profiling, no logging.

Pool mutex:

· Pool can contain spin mutexes or regular mutexes.

· Mutex is either associated with address of some structure (void*) or allocatable from the pool.

· Used for protecting short-lived data structures.

· Two standard pools: mtxpool_sleep and mtxpool_lockbuilder.
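
A rough userland analogue of the "associated with address" case, assuming (as mtx_pool(9) describes) that the address is hashed to select one mutex from a fixed array. It uses pthreads instead of kernel mutexes; POOL_SIZE, the mixing step and all function names are hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 16	/* hypothetical; power of two so a mask can be used */

struct lock_pool {
	pthread_mutex_t locks[POOL_SIZE];
};

static void
pool_init(struct lock_pool *p)
{
	for (size_t i = 0; i < POOL_SIZE; i++)
		pthread_mutex_init(&p->locks[i], NULL);
}

/* Map an object's address to one of the pool's mutexes. */
static pthread_mutex_t *
pool_find(struct lock_pool *p, const void *addr)
{
	uintptr_t h = (uintptr_t)addr;

	assert(p != NULL);
	h ^= h >> 8;	/* illustrative mixing only, not the kernel's hash */
	return (&p->locks[h & (POOL_SIZE - 1)]);
}

/* Lock and unlock the mutex that the address maps to. */
static void
pool_lock(struct lock_pool *p, const void *addr)
{
	pthread_mutex_lock(pool_find(p, addr));
}

static void
pool_unlock(struct lock_pool *p, const void *addr)
{
	pthread_mutex_unlock(pool_find(p, addr));
}
```

In this sketch nothing is stored per object: the same address always maps to the same mutex, and two different addresses may share one.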

Reader-writer locks:

· Similar to regular mutexes, but with reader-writer semantics.

· Priority propagation to exclusive (writer) holder only, but not to shared (read) holders.

· Writers cannot recurse.

· Readers can optionally recurse (via an option flag?), but this may go away in the future.

· Options: no witness, no profiling, no logging.

· Can upgrade and downgrade lock between exclusive and shared.

· Unlike sx, rwlock can be acquired while holding regular mutex.

· Unlike sx, rwlock cannot be held while sleeping.

Read-mostly locks:

· Similar to rwlock, but optimized for infrequent write locking.

· Priority propagation for shared and exclusive holders.

· Readers cannot sleep, writers optionally can.

· Options: no witness, allow recursion, allow writers to sleep.

· Unlike sx, rmlock can be acquired while holding regular mutex.

· When recursing, must not change lock mode

Shared/exclusive locks:

· Can be held during unbounded sleep.

· No priority propagation.

· Less efficient than mutex, rwlock and rmlock.

· Can upgrade and downgrade lock between exclusive and shared.

· Waiting can optionally be interruptible by a signal.

· Options: allow recursive acquisition, no witness, no profiling, no logging.

· When recursing, must not change lock mode

Counting semaphores:

· Less efficient than mutexes and condvars, so deprecated.

Condition variables:

· Used in conjunction with regular mutex (not spin mutex), rwlock or sx.

· Wait can be interruptible or not, timeout or not.

· Can signal one, broadcast or broadcast with priority.

Giant:

· Recursive mutex

· Automatically acquired if driver or file system is not marked MPSAFE

· Can sleep with it (will be released and reacquired by sleep), including waiting on sleepable locks

· Must be locked (if wanted) before any other turnstile locks

· May be acquired before or after sleepable locks

Lockmgr locks:

· Most full-featured of sleepable locks

· Shared and exclusive access

· Options: recurse

· May request a timeout and interruption by a signal

· Allows downgrade, upgrade and exclusive upgrade

· Ownership of the lock can be passed from a thread to the kernel

· No priority propagation

· Can drain all accessing threads in preparation for being deallocated

have \ want:	spin mtx	mutex	sx	rwlock	rmlock	sleep
spin mtx	+1					no-3
mutex	+	+1		+	+	no-3
sx	+	+	+2	+	+	+4
rwlock	+	+		+2	+	no-3
rmlock	+	+	+5	+	+2	+5

1. Recursion is defined per lock. Lock order is important.

2. Readers can recurse but writers cannot. Lock order is important.

3. There are calls that automatically release the lock going to sleep and reacquire it on wakeup.

4. Can sleep holding sx lock. Can also call sx_sleep that will temporarily release the lock.

5. Read-mostly locks can be initialized to support sleeping while holding a write lock.

witness:

· cannot acquire locks in opposite sequence order (AB vs. BA); the sequencing history is recorded and checked on every acquisition

· cannot acquire multiple locks of the same type, unless DUPOK is set

o lock type can be specified for mutexes (defaults to mutex name)

o for all other kinds of locks, lock type is the same as the lock name

· even if DUPOK is set, can acquire them only sequentially (i.e. for types A, B as AA but not ABA)
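
The AB vs. BA rule can be illustrated with a toy order checker. This is only a sketch of the idea, not how witness(4) is implemented: it tracks direct pairs of lock classes only, has no DUPOK, and every name in it (order_checker, oc_acquire, oc_release, NCLASS, MAXHELD) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCLASS	8	/* hypothetical number of lock classes */
#define MAXHELD	8	/* hypothetical limit on locks held at once */

struct order_checker {
	bool before[NCLASS][NCLASS];	/* before[a][b]: b was taken while a was held */
	int held[MAXHELD];		/* classes currently held, oldest first */
	size_t nheld;
};

/*
 * Record an acquisition of class cls.  Returns false if it reverses an
 * order seen earlier or repeats a class that is already held.  The lock
 * is recorded as held either way.
 */
static bool
oc_acquire(struct order_checker *oc, int cls)
{
	bool ok = true;

	assert(cls >= 0 && cls < NCLASS && oc->nheld < MAXHELD);
	for (size_t i = 0; i < oc->nheld; i++) {
		int h = oc->held[i];

		if (h == cls)
			ok = false;		/* same class twice, no DUPOK here */
		else if (oc->before[cls][h])
			ok = false;		/* saw cls then h earlier; now h then cls */
		else
			oc->before[h][cls] = true;	/* remember the order h then cls */
	}
	oc->held[oc->nheld++] = cls;
	return (ok);
}

/* Drop the most recent acquisition of class cls, if any. */
static void
oc_release(struct order_checker *oc, int cls)
{
	for (size_t i = oc->nheld; i > 0; i--) {
		if (oc->held[i - 1] != cls)
			continue;
		/* Close the gap left by the released entry. */
		for (size_t j = i; j < oc->nheld; j++)
			oc->held[j - 1] = oc->held[j];
		oc->nheld--;
		return;
	}
}
```

Taking class 0 then class 1 records the order; after releasing both, taking class 1 and then class 0 makes oc_acquire() return false.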

Questions:

· How is callout implemented? Is it called by an interrupt, softclock, or a separate thread?

· When acquiring multiple mutexes, is it OK to sleep while acquiring the second?

· mtx_pool_unlock: dissociates from address so mutex is reusable?

· rw_rlock/rw_wlock: can recurse?

· rm_rlock/rm_wlock: what can recurse?

· sx_xlock/sx_slock: what can recurse?

malloc:

· malloc, free, realloc

· MALLOC_DECLARE(M_MYTYPE);
MALLOC_DEFINE(M_MYTYPE, "mytype", "my type long description");
bp = malloc(size, M_MYTYPE, M_NOWAIT);

· M_NOWAIT (return NULL) or M_WAITOK (can sleep)

· interrupt handlers should not call malloc, free or realloc

· interrupt threads should use M_NOWAIT

· malloc(9) and uma(9) use non-spin mutex, so cannot call them while holding spin mutex

· contiguous memory allocation: contigmalloc(9)

· debugging: memguard(9) and redzone(9)

· may not call malloc while holding vnode(9) interlock, not even with M_NOWAIT

· Although malloc(9) is not ordered against the vnode locks in the lock order, it is still good practice not to allocate while holding a vnode lock. The reason is that pagedaemon might need to clean and write pages, in particular pages belonging to vnodes. Generally, if you hold a vnode lock and do something that might kick pagedaemon, you create trouble for it: pagedaemon will attempt to lock the vnode with a timeout and waste precious time waiting for a lock it cannot obtain. This is less of an issue where new vnodes are instantiated, because such a vnode cannot have resident pages yet, but it is good to follow the same pattern everywhere.

### cdev: DEV_MODULE (9), device (9)

vnode and malloc:

The natural code and order is:

o // no lock held here

o allocate vp

o allocate ip and any other stuff needed to complete vp

o acquire exclusive lock

o check that vp, ip and other stuff didn't go away

o wire ip and other stuff into vp // shouldn't do more allocations here

o release lock on return (?)
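
The "natural" order above (allocate with no lock held, then lock, re-check, wire) can be sketched in userland C. This is an analogue only: a pthread mutex stands in for the exclusive vnode lock, the struct and function names are hypothetical, and the re-check shown is just "did another thread already wire this slot", which is one simple form of the "didn't go away" check.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct inode_data {		/* stands in for "ip" */
	int inum;
};

struct vnode_slot {		/* stands in for "vp" */
	pthread_mutex_t lock;	/* stand-in for the exclusive lock */
	struct inode_data *ip;	/* NULL until wired */
};

/* Returns the inode data wired into vp, or NULL if allocation failed. */
static struct inode_data *
slot_attach(struct vnode_slot *vp, int inum)
{
	struct inode_data *ip, *res;

	assert(vp != NULL);

	/* No lock held here: the allocation may block or fail freely. */
	ip = malloc(sizeof(*ip));
	if (ip == NULL)
		return (NULL);
	ip->inum = inum;

	/* Take the lock only after all allocations are done. */
	pthread_mutex_lock(&vp->lock);
	if (vp->ip == NULL) {
		/* Re-check passed: wire ip into vp, no allocation here. */
		vp->ip = ip;
		ip = NULL;
	}
	res = vp->ip;
	pthread_mutex_unlock(&vp->lock);

	/* Lost the race: drop our unused copy (free(NULL) is a no-op). */
	free(ip);
	return (res);
}
```

The cost of this order is an occasional wasted allocation when the re-check fails; the benefit is that nothing allocates while the lock is held.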

but the current order for at least ffs is:

o // no lock held here

o // oops, actually we always (?) hold at least a shared lock here. Often we need to promote to an exclusive lock.

o allocate ip

o allocate vp

o acquire exclusive lock

// no check that nothing went away?? A comment earlier still says that we intentionally allow races earlier. Is that correct when we hold at least a shared lock throughout?

o wire ip into vp. Includes further initializations of both. Hopefully these can't block.

o bread() the inode. Can block now. No worse than blocking for write(2) (?).

o further zalloc()s for dinodes. Breaks kib's rule. It would be painful to have to release all the zalloc()ated data on earlier failures. At this point, we can still fail, but have initialized enough of the vnode to just fail without cleaning it up -- ufs_reclaim() will clean it.

o further initializations, mostly not involving operations that can block.

o release lock on return (?)

The allocation of ip is clearly special -- it contains state info that hopefully records the state of the initialization, and it must be wired into vp first so that destructors can see this info.