Membalance

Membalance
A dynamic memory balancer for Xen virtual machines

(version 0.1)

Sergey Oboguev
oboguev@yahoo.com

https://github.com/oboguev/membalance

Membalance dynamically balances memory allocation between Xen virtual machines (domains) in response to varying memory demand (memory pressure) within guests.

CONTENT:

RATIONALE
USING MEMBALANCE
REBALANCING ALGORITHM
DEFICIENCIES OF XEN API
FUTURE WORK / ROADMAP
REFERENCES

RATIONALE

One of the purposes of virtualization is to co-deploy virtual machines on the same physical host to improve the utilization of physical resources, such as CPU cycles, storage devices, IO bandwidth etc., by multiplexing these resources among the VMs, very much like a generation earlier their utilization was improved by a transition to multi-tasking and multi-user time-shared operating systems and multiplexing physical resources between users and concurrent processes.

Like any computing entity reflecting human activity (and some of the automated activity as well), virtual machines more often than not exhibit an uneven load pattern significantly varying over the time. Production system are likely to exhibit load depending on time of the day, business cycles, schedule for generation of reports, unpredictable factors like interactive and market customer behavior etc. Development and test/staging systems and standby/failover instances may exhibit even more uneven load pattern.

Hypervisors normally allow dynamic and adaptive reallocation of resources between the VMs to meet these fluctuating workload demands, including bursty and peak loads. This applies to reallocation of CPU cycles between the VMs, sharing of IO bandwidth and (possibly thinly provisioned) storage. A similar need also exists for dynamic reallocation of main memory between the VMs. Higher loads may require an increased amount of physical memory available to a virtual machine. Insufficient memory is likely to result in increased paging rate and increased file system cache miss rate, due to insufficient cache size, leading to lower performance and the waste of IO bandwidth to re-read virtual memory pages and file data from storage devices. Therefore it normally pays off to increase physical memory allocation for a virtual machine that has significant memory pressure, at the cost of shrinking other machines with lower current demand for main memory.

Such an adjustment very much represents a carry-over of a concept of dynamic adjustment of process working (resident) sets into the era when a locus of “multiuserness” largely moved or extended from processes/jobs within an OS to VMs within a hypervisor.

Broadly, there are two complementary approaches to dynamic partitioning of memory between the VMs.

Dynamically adjusting memory size of individual VMs, according to current memory demand of the VMs.

This approach parallels dynamic adjustment of process’ working/resident set in a traditional OS.

Using Transcendent Memory.

This approach allocates a region of hypervisor memory for VMs to use as an “extended memory” accessible to VMs only indirectly on a block basis (remember IBM PC, DOS and “extended memory” drivers?) and used as a shared area for page and file system caches. When a VM does not have sufficient main memory for virtual memory or file system caches, it can use Transcendent Memory as an overflow area.

There are two basic forms of Transcendent Memory: clean data pools and modified data pools.

In a traditional OS, a memory is (broadly) divided between per-process working sets (or sometimes single global working set in operating systems that do not have a notion of per-process working sets, like e.g. Linux), Modified Page List and Free Page List, with the latter two maintained in a LRU-like order.

When a page is evicted from a process’ working/resident set, it experiences the following transitions:

An evicted page that has not been modified is put on the free list:

process working set -> Free Page List

Pages on the Free List are considered to be available for grabbing and reuse in case memory is needed, since their content can be reconstructed by reading it again from a backing store (such as a paging or image file). On the other hand, if page on the Free List is referenced by a process that needs it, it can be refaulted into that process’ working set, at which point the page is removed from the Free List.

An evicted page that has been modified by a process is put on the Modified List first. If referenced by the process soon again, it can be re-faulted into that process’ working set and removed from the Modified List. If a page on the Modified List stays unreferenced for some time (and thus not returned back to process working set) it may be flushed to a backing storage like a page file or original file (if originates from a memory-mapped file mapped as writable), at which point the page becomes “clean” and can be moved to the Free List:

process working set -> Modified Page List -> Free Page List

Transcendent Memory translates this traditional OS paradigm into the realm of hypervisors.

TMEM clean pools represent an equivalent of Free List.
TMEM modified data pools represent an equivalent of Modified List.

Guest operating systems are free to use TMEM pools for caching both virtual memory page and file cache pages. In guest OS’es with integrated VM/FS cache this happens more or less automatically.

Very much like with pages cached in a traditional OS’es Free List, TMEM is free to drop cached pages from TMEM clean pools at any time to reclaim memory when a hypervisor needs memory, since the content of these pages can be reconstructed by reading it from the backing store. Pages from modified-data pools are not dropped.

Current TMEM design for Xen (and TMEM interface specification) fall somewhat short of their OS-level antecedents, by not defining a way for a dirty page to transition from modified-data TMEM pool to clean-data cache. Such a transition would require TMEM manager to notify guest OS of hypervisor’s increasing need for memory, and its insistence that guest OS writes out pages kept in modified-data TMEM pool and transitions them to clean pages (held either inside the guest or in TMEM clean-data pool), ready to be reclaimed. Guest OS cannot make a decision that such a transition is needed on its own, since the need for it is steered by the knowledge available only to the hypervisor and by the policy managed by hypervisor – thus a transition can be requested only by the hypervisor. Yet TMEM currently does not define a guest/hypervisor interface to facilitate such a communication.

Otherwise TMEM design parallels traditional OS memory split-up between process working sets, Modified Page List and Free Page List.

Transcendent Memory is a viable technique for improving overall system performance. Beyond acting as a global cache aggregating pages from multiple VMs (clean-pool pages are retained and dropped in LRU-like fashion, optionally adjusted by per-VM share weights and caps), and thus providing more extra memory to those VMs that currently are in a heavier need of memory, it also offers less obvious modes of optimizations. Pages can be compressed and de-duplicated. Also TMEM is potentially able to use the kinds of memory less suitable as regular main memory, for example memory blades built of slow cheap RAM (yet much faster than Flash memory) located in far NUMA nodes.

Yet, Transcendent Memory is not suitable as a total replacement for dynamic adjustment of memory size of individual VMs.

Indeed, if host system were to be configured to run undersized VMs (i.e. VMs statically configured with memory size well below that VM peak memory utilization), and relying on TMEM pool to keep their “overflow pages” during the periods of elevated load, such a configuration would induce the following kinds of overhead:

high rate for invocation of intra-guest algorithms determining which pages to evict
copyout of evicted pages to TMEM (hypercall + memcpy + list management)
copyin of cached pages from TMEM (hypercall + list management + memcpy)
updates to virtual memory structures if the page represents an address space page (rather than FS cache page)
updates to file system cache structures (for pages in FS cache)
if TMEM does not define preswap pool, then premature writing out of dirty pages (to move them to the clean list, in the conditions of intra-guest main memory shortage)
if TMEM does define preswap pool, then this raises an issue about its handling mentioned earlier (lack of guest notification when preswap pool must be shrunk)
if domain experiences high need in mapped virtual memory (as opposed to FS cache), TMEM is not the most satisfactory solution since the access to TMEM pages is only indirect and they cannot be mapped directly into process address space, thus every reference to a missing page (held in TMEM) is liable to cause a page fault with all the overhead of page fault processing

If follows then that an optimal approach to dynamic memory partitioning would involve both basic approaches as complementary – dynamic adjustment of memory size of individual VMs coupled with the use of Transcendent Memory.

Generally, the size of VMs would be adjusted to represent longer-term envelope of their memory need, estimated on the basis of some indicators and predictive logics, and policy-balanced against the needs of other co-running VMs.

Whereas Transcendent Memory would:

Accommodate shorter-term overflows beyond this envelope (intra adjustment ticks).
Dampen the errors in prediction/estimates of memory needs of individual VMs compared to their actual needs.

This is crucial, as behavior of real-world programs and systems does not fit analytic models.
All models and heuristics for estimation of VM “true working set” and prediction of its change over forecasted time interval have only limited predictive capacity, so mis-predictions are a given.

Without Transient Memory, if a domain size is mis-predicted on lower-than-actual side, evicted pages are totally lost and would have to be re-read.

True, it it possible to get by without TMEM pools and instead to divide memory otherwise allocated to Transient Memory pools among individual VMs (on some predicted ratio basis) and thus buttress the available memory in the domains. However misprediction means that in some of the domains this extra memory will be under-utilized, meaning in turn that other, more needy domains will effectively receive less memory compared to available for them in a globally pooled configuration. If a memory pool is liquidated and its memory is split among the domains, the efficiency of this memory use is vulnerable to split ratio misprediction. On the contrary, when it is held as a globally used single whole, with pooled pages used on actual global LRU-like basis, the need to predict and the impact of misprediction is eliminated.

The intent of Membalance is to (eventually) provide such an integrated memory balancing facility, along with other features such as pluggable policy module that can be customized according to site-specific needs. The roadmap and work plan for Membalance is outlined in FUTURE WORK/ROADMAP section below. The scope of initial version of Membalance (0.1, implemented so far) is to:

Create a harness that hosts the rebalancing algorithm and provides it with runtime facilities to monitor hosted domains, manage their properties, dynamically resize them, respond to sysadmin management interface etc.
Provide an initial version of rebalancing algorithm that would be sophisticated and practical enough to be usable as a foundation for production-grade memory rebalancing facility.
Identify the shortcomings of Xen domain and memory management API that are an obstacle for any production-grade memory balancer.

There are two principal ways a memory balancer can collect the information on which to base its memory allocation decisions.

One approach is application-agnostic and is based on monitoring low-level indicators of operating system or observed on OS/hypervisor boundary.
Another approach, potentially more suitable for VM deployments running a single primary-importance application, is to rely on the data supplied by such an application about its memory needs. For example, a database engine is likely to be better positioned and more qualified to make a more accurate reasoning about its current and predicted memory needs than an application-agnostic probe monitoring the whole VM without the knowledge and understanding of application internals.

The initial version of Membalance uses the former approach, but later may explore the second option as well and provide an interface for the latter option (in practice, this would require to combine data from application-specific probes with data reported by application-agnostic probe, to cover kernel-managed resource usage that an application cannot easily account for).

There is a number of approaches a balancer can use to measure memory pressure inside a domain. (See section REFERENCES below.) For example, VMware ESX estimates VM’s memory usage intensity by sampling page reference frequency. Every sampling interval a certain number of pages in a VM (in the order of 100 to 300 pages) selected at random are marked by the hypervisor as inaccessible, and the hypervisor then counts the percentage of these pages that are referenced by the guest over the sampling interval. The advantage of this technique is that it does not need to rely on OS-specific probes and indicators. The obvious disadvantage is that although the sampling of this kind helps to assess the activity within the VM’s active working set (interpreted by ESX to infer the estimate of idle memory percentage and impose “idle memory tax”), it does not provide a reliable and reliably comparable indicator of domain expansion pressure.

Membalance therefore opts to rely on OS-specific probes. Each VM subject to being managed (dynamically resized) by Membalance runs a probe that collects memory pressure data based on OS-specific indicators and reports this data to Membalance controller (daemon) running in Dom0. The controller relies on reported memory pressure data (e.g. readings of reported memory pressure within all the managed domains) plus policy settings to make a reasoning about the desired change in allocation of host memory between the domains.

In this respect, participation by a domain in Membalance is semi-voluntary – as with any scheme relying on in-guest probes.

An in-guest actor with guest-level administrator rights can suspend or terminate the execution of the probe. If Membalance does not receive probes reading, at first it temporarily stops managing memory allocation of this domain, except for the situations of significant memory need by the host, in which case the domain can be shrunk to meet these needs. Furthermore, if a domain has not reported memory pressure data for some time (established by one of Memprobe policy settings), and the domain is sized above its quota, it is shrunk down to its quota.

An in-guest actor with guest-level administrator rights can also fake probe data (or OS data the probe relies on, such as hard page fault rate), making the probe to falsely report high memory pressure inside the domain, and prompting Membalance to increase memory allocation to the domain. This reliance and limitation are inherent with any scheme relying on in-guest probes. This normally is not an issue when increased domain memory allocation results in increased billable utilization for higher memory usage, but may be an issue when resource use is not accounted or charged for.

A better indicator of memory pressure is the rate for eviction of old pages from virtual memory and file system caches, in order to replace them with missing data read from storage devices holding paging/swap space and data files. (When Transcendent Memory is used, this would also include reads from Transcendent Memory pools.) Most operating systems do not expose this kind of statistics in their stock uninstrumented kernels, however they usually do expose a close proxy:

aggregate hard page fault rate + block read rate into the file system cache
monitored by Membalance under an additional filter requirement that a percentage of free memory inside guest must be low

as there is no point expanding guest that has substantial free memory inside it
when there is substantial memory available inside guest, even high IO and paging rates do not indicate high memory pressure

This is the construction of OS-level indicator that Membalance probe currently reports.

When percentage of free memory inside a guest is below a threshold, Membalance probe reports aggregate hard page fault rate + block IO rate into file system cache.

When percentage of free memory inside a guest is above a threshold, reported rate is 0.

Note that the described indicator very much parallels the traditional indicator of process paging rate within a conventional OS, with the following correspondences:

OS	→	Hypervisor
process	→	virtual machine
process working set	→	virtual machine memory allocation
page faults	→	hard page faults + block read into file system cache

and with aggregate block/page IO statistics at VM/Hypervisor boundary taking the role of page fault statistics at process/OS boundary.

There is a situation however, in which this simple indicator may be thrown off significantly and yield false reading of high memory pressure. If a VM performs the scan of a large data set (e.g. full-disk search) using cached IO primitives, i.e. accessing data that it does not really want or intend to cache, but indicating caching intent through the use of cached IO access, the probe will likely observe very high read rate into file system cache, and since the application misuses cached IO primitives for its data scan, data scanned by the application will fill guest file system cache, resulting in low guest free memory. This will create a combination of the conditions that will cause the indicator to falsely report high memory pressure within a domain. A proper way to address this situation is to avoid it in the first place by application using O_DIRECT and/or posix_fadvise(POSIX_FADV_DONTNEED) on Linux or FILE_FLAG_NO_BUFFERING on Windows. Still it may be possible that a VM is processing a data set with a stock application developed without handling large data sets in mind, e.g. running grep on a multi-gigabyte file set. The indicator will then report a false reading of high memory pressure. An indicator with more complex construction would be able to discern this situation by looking at cache use pattern (such as perhaps excluding from its report pages that are evicted before ever being re-referenced), but the implementation of such a refined indicator would be invasive for guest OS.

With Linux probe, there is currently also another issue. IO statistics in stock Linux kernel does not disaggregate cached and non-cached reads (i.e. block reads performed on files opened with O_DIRECT and without O_DIRECT), so an application such as database engine performing non-cached reads may cause the indicator to falsely report high memory pressure. The likelihood of this is reduced by free-memory percentage filter condition, which mitigates the problem, however in the longer run it would be desirable to include Linux kernel patch that accounts for page-in operations due to O_DIRECT files, and thus allows to exclude those operations from the total statistics, leaving only cached reads and page faults.

A construction of Membalance algorithm for reallocating memory between the domains is described below in section REBALANCING ALGORITHM.

USING MEMBALANCE

Membalance build process is described in howto-build.txt.
Installation procedure is described in howto-install.txt.

Virtual machines running on a particular Xen instance can be designated as to be managed (automatically sized) by Membalance, or to be left alone and not subject to automatic memory allocation adjustment by Membalance. They are termed, correspondingly, as managed and unmanaged virtual machines.

For a machine to be designated as managed, Membalance-related settings need to be added to its Xen domain configuration file.

Overall Membalance setup process is as follows:

A memory pressure probe daemon (memprobe) needs to be installed into every virtual machine to be managed (automatically resized) by Membalance.
Memprobe does not require a configuration.
A master Membalance daemon needs to be installed in Dom0.

After installing Membalance, system administator needs to configure Membalance by specifiying its control parameters in two locations:

In Xen domain configuration file for every domain that is to be managed (automatically sized) by membalace.
In system-wide Membalance configuration file /etc/membalance.conf.
This file contains settings for Membalance daemon.
It also allows to specify default values for Membalance parameters in domain configuration files.

When Membalance is started, it scans all running virtual machines to determine if they should be managed. Afterwards, when a new virtual machine is started, Membalance daemon notices it and begins similarly examining whether it should be managed by Membalance.

While virtual machine is being examined, it is in the state termed pending.

Possible outcomes of the pending state are:

Membalance daemon detects that machine is to be managed, and its configuration is valid, and consistent, and begins managing it.
The VM is said to transition to managed state.
Membalance daemon detects that machine is not to be managed (i.e. has no Membalance settings in its Xen domain configuration file).
The VM is said to transition to unmanaged state.
Membalance daemon detects that machine has Membalance-related settings in its configuration file, but these settings are incomplete or inconsistent.
A diagnostic message is issued, and the VM istransitioned to unmanaged state.

While VM is running, it is possible for it to transition from the managed to unmanaged state in the following events:

Domain settings are updated (e.g. by changing defaults in /etc/membalance.conf) and new domain settings become incomplete or inconsistent.
Hard unexpected error inside Membalance or Xen.

It is also possible to instruct Membalance to re-examine a previously unmanaged domain, and to start managing it assuming the settings for the domain are valid.

Domain state transition diagram:

The following membalane-related settings can be specified in a domain configuration file:

Parameter name

Explanation

membalance_dmem_min

Minimum size that domain can be shrunk to in the event of memory shortage.

Low-data-rate domains are likely to be sized between dmem_min and dmem_quota.

This setting can be though of as a practical domain size for very low-rate domains in the event of a memory-constrained system.

membalance_dmem_quota

Memory quota for a domain

Quota setting approximates domain size on a memory-constrained system in the event a domain keeps creating data read-in load.

This is a practical domain size limit for low- or mid-rate domains.

membalance_dmem_max

Maximum size a domain can be expanded to.

This approximates domain size on a memory-abundant system in the event a domain keeps creating data read-in load.

This is a practical domain size limit for high-rate domains.

membalance_dmem_incr

Membalance can automatically vary a domain’s memory size in range dmem_min to dmem_max.

The values of dmem_quota, rate_high, rate_low, dmem_incr and dmem_decr control this automatic adjustment of domain memory size.

Domain has a greater claim to memory if its data read-in rate (i.e. virtual memory hard paging rate plus file system page cache block read-in rate) exceeds rate_high, the smallest claim if its rate is below rate_low, and intermediate claim if its rate is in between rate_low and rate_high.

The strength of the claim also depends on the current size of a domain, most importantly its relationship to dmem_quota.

Slightly simplifying, the strength of domain’s claim to memory is definied by its classification within the following categories:

domain size	domain rate
<= dmem_quota	>= rate_high
<= dmem_quota	rate_low ... rate_high
> dmem_quota	>= rate_high
> dmem_quota	rate_low ... rate_high
(any)	<= rate_low

(More rigorous explanation is given below in section REBALANCING ALGORITHM,
see the definition of memory pressure functions there.)

Domains in higher tiers have stronger claim to memory.
Within a tier, domains with higher rate have a stronger claim to memory.

Every interval seconds, Membalance performs an adjustment in the sizes of managed domains.

If domain rate and size justify its expansion, Membalance will try to expand the the domain at the cost of available free memory, or if it is exhausted then at the cost of shrinking down domains with weaker claim to memory. Within a single interval, Membalance will try to expand domain that needs to grow by dmem_incr percent of its current size.

If a domain is to be trimmed down in size, Membalance will normally try to limit trim amount to at most dmem_decr percent of its current size within a single interval. However in the event of dire free memory shortage and pressing needs by other domains with stronger memory claim, Membalance can trim domains by more than dmem_decr, even all way down to dmem_min.

Defaults values (unless overriden on system-wide basis in membalance.conf):

rate_high = 200 kb/sec
dmem_incr = 6% (valid range: 0.5 - 30%)

rate_low = 0 kb/sec
dmem_decr = 4% (valid range: 0.5 - 10%)

It is recommended to leave dmem_incr and dmem_decr fairly small so as to prevent excessive Membalance over-adjustment, but still substantial. Normally default values are adequate.

Generally with default interval value of 5 seconds, reasonable value for dmem_incr ought not exceed 10%. It is also reasonable to keep the value for dmem_decr somewhat below that of dmem_incr, so domains would inflate fast when they need memory but deflate slower.

membalance_dmem_decr

membalance_rate_high

membalance_rate_low

membalance_rate_zero

If guest reported data read-in rate is <= rate_zero, the rate is considered to be zero regardless of the value reported by the guest.
This value represents a limen of Membalance sensitivity, and rate values below it are considered to have no significance.

Default is 30 kb/s unless overriden on system-wide basis in membalance.conf.

membalance_guest_free_threshold

If guest system has more than guest_free_threshold percent of free guest system memory, its data read-in rate is considered to be zero regardless of the reported rate value.

Default is 15% unless overriden on system-wide basis in membalance.conf.

membalance_startup_time

Estimated threshold on domain guest OS startup time.

If domain uptime is less than startup_time and domain did not start to report rate data yet, it may be given a benefit of the doubt that it did not have a good-faith chance to start memprobe daemon yet and accorded a somewhat more lenient treatment in certain memory shortage situations compared to older non-reporting domains.

Default is 300 seconds unless overriden on system-wide basis in membalance.conf.

membalance_trim_unresponsive

Membalance will trim domain memory allocation down to dmem_quota if a domain has not been reporting its data read-in rate data (e.g. memprobe daemon stopped running in the guest) for trim_unresponsive seconds, while domain has been staying runnable over this time, and domain size is over dmem_quota.

Default is 200 seconds unless overriden on system-wide basis in membalance.conf.

If set to 0, no trimming of non-reporting domain is performed.

membalance_trim_unmanaged

Membalance will trim domain memory allocation down to dmem_quota when a domain is transitioning from managed to unmanaged state, in case its size is above dmem_quota.

Possible values: yes/true or no/false.

Default is yes unless overriden on system-wide basis in membalance.conf.

The value for a domain parameter can be set in its Xen configuration file, or defaulted to system-wide default value configured in /etc/membalance.conf, or defaulted to a hardwired Membalance default:

Parameter name in Xen domain config file	Parameter name in /etc/membalance.conf	Hardwired default
membalance_dmem_min	(none)	(none)
membalance_dmem_quota	(none)	(none)
membalance_dmem_max	(none)	(none)
membalance_dmem_incr	dmem_incr	6%
membalance_dmem_decr	dmem_decr	4%
membalance_rate_high	rate_high	200 kb/s
membalance_rate_low	rate_low	0 kb/s
membalance_rate_zero	rate_zero	30 kb/s
membalance_guest_free_threshold	guest_free_threshold	15%
membalance_startup_time	startup_time	300 sec
membalance_trim_unresponsive	trim_unresponsive	200 sec
membalance_trim_unmanaged	trim_unmanaged	yes

On a large installation managing many domains a sound policy would be to define Membalance settings for a given category of domains in a separate file and use Python import statement to include this file into a particular domain’s config file, overriding the settings in the domain config file only when necessary. This Membalance definition file may in turn derive from a higher-level file and this way a hierarchical inheritance of Membalance settings is possible.

If membalace_dmem_max is not specified in a domain configuration file, it is defaulted to the setting of maxmem (and if maxmem is not specified, then to memory).

If membalace_dmem_quota is not specified in a domain configuration file, it is defaulted to the setting of memory.

If membalace_dmem_min is not specified in a domain configuration file, it is defaulted to the setting of memory.

The following validity conditions should hold, or Membalance would refuse to manage a domain with approproate diagnostic message:

dmem_min <= dmem_quota <= dmem_max <= maxmem
dmem_min < dmem_max
rate_low < rate_high

A common sense should be exercised to avoid setting the value of maxmem much higher than it realistically needs to be. High values of maxmem come at a cost. Guest operating system typically preallocates a descriptor for every physical page (PFN) it manages, whether owned by the balloon driver or not. These structures (in Linux case, struct page) occupy approximately 1-2% of physical memory being managed. Therefore, if maxmem is arbitrarily set let us say 10 times higher than initial domain memory size, then 10-20% of domain’s initial size would be lost to these structures, with corresponding impact on guest’s available free memory at initial domain size. (Hopefully, hotplug memory interface would allow to deal with this issue better.) This is especially important to keep in mind when dmem_min is less than initial memory setting: if a domain is shrunk down to dmem_min, the share of page management structures in total memory allocated to a domain can do even higher. For instance, if dmem_min is 3 times less than memory, and maxmem is 10 times higher than memory, than after shrinking the domain to dmem_min, page management structures can start occupying 30-60% of total domain’s memory, severely reducing free memory available within a domain, with an impact on a domain’s guest OS stabillity. In particular, low values of dmem_min (especially coupled with high value of maxmem) increase the likelyhood of Out-Of-Memory (OOM) process crashes in Linux guests.

Here is a sample Xen domain config file with Membalance settings in it:

name="ub64a"

builder="hvm"



vcpus=2

acpi=1

apic=1



memory="2048"

maxmem="49152"



disk = [

   
'file:/xenvirt/vm/ub64a/disk.img,ioemu:hda,w',

   
'file:/xenvirt/iso/ubuntu-desktop-14.04-x64.iso,ioemu:hdc:cdrom,r',

]



serial='pty'

usbdevice='mouse'

vif = [ 'type=ioemu, ip=192.168.122.211, mac=00:26:B9:48:74:da,
bridge=virbr0' ]



sdl=1

vnc=0



vga="stdvga"

videoram=16

boot='c'

membalance_dmem_min='2 gb'

membalance_dmem_quota='4 gb'

membalance_rate_high='1 mb/s'

Once a domain is configured for Membalance (and also appropriate settings are selected in file /etc/membalance.conf as described below), domain should be restarted to activate it as managed by Membalance. In a future versions of Membalance, it would be possible to activate domain as managed without having to restart it.

File /etc/membalance.conf also defines several Membalance configuration settings that apply not to individual domains, but control Membalance operation as a whole:

interval	Membalance daemon will perform automatic memory adjustment of managed domain sizes every interval seconds. This settings is also communicated to memprobe daemon instances running in guest domains and instructs them to take memory pressure reading every interval seconds and reports their readings to Membalance daemon running in the hypervisor domain. Default: 5 seconds. Valid range: 2 - 30 seconds.
host_reserved_hard	Membalance will try to leave alone Xen host system memory in the amount specified by host_reserved_hard. Membalance will not expand domains if there is less than host_reserved_hard memory left, nor expand domains by an amount that would leave less than host_reserved_hard available. In addition, if host system’s free memory drops below host_reserved_hard, Membalance will try to shrink managed domains to recover enough memory to bring the amount of host system available memory back to host_reserved_hard. This amount is in addition to (on top of) Xen free memory slack, which is by default calculated by Xen to be 15% of total physical memory available, unless is set otherwise. Use command xenstore-read /local/domain/0/memory/freemem-slack to find current Xen memory slack (in KBs). Format: <amount>[optional-space][optional-unit] Examples: 1024 (default unit is mb) 2000 mb 2000m 3G 3 GB Default: 0
host_reserved_soft	Membalance will try to set the amount of host free memory defined by host_reserved_soft aside for use only by domains in substantial need of memory. These are domains below their dmem_quota and with rate > rate_low, or domains any size with rate >= rate_high. This amount is in addition to (on top of) Xen free memory slack, as described above. Format: <amount>[optional-space][optional-unit] Examples: 1024 (default unit is mb) 2000 mb 2000m 3G 3 GB Default is calculated as: host_reserved_hard + 10% of (Xen physical memory - Xen free memory slack - Dom0 minimal size) When using Membalance, it may be a good practice to keep the size Xen free memory slack to a minimum (instead of its default size of 15% of physical memory), and instead reallocate most of slack memory to host_reserved_soft area. Memory required for starting new virtual machines can be obtained on demand by membalancectl free-memory command, as described below.

Whenever a setting in membalance.conf is changed, execute the following command to make the daemon to re-read the configuration file:

# service membalance reload

Operation of the daemon can be controlled with membalancectl command. Its usage summary:

Usage:

   
list                
display domains and memory status

   
pause               
suspend automatic domain memory balancing

       
[--quiet]       
do not print pause level

   
resume              
resume automatic domain memory balancing

       
[--force]       
drop remaining
pause level to zero

       
[--quiet]       
do not print
remaining pause level

   
free-memory         
ensure sufficient amount of free memory

       
<n>[k|m|g]      
specify required
amount explicitly

       
--config <x.cfg> use the amount requested by domain
config file

       
[--above-slack]  the amount is on top of Xen free memory slack

       
[--use-reserved-hard]   draw on host_reserved_hard if
necessary

       
[--must]        
terminate with
failure status if could not ...

                        
... allocate the full requested amount

    manage-domain
<id>   request to manage domain that
was in an unmanaged state

    manage-domain
--all  request to manage all currently unmanaged domains

    log-level
[n]       
set and/or show logging level

    log-sink
[which]     set logging sink
to "syslog" or "logfile"

   
dump-debug          
dump daemon internal state to /etc/membalance.conf

   
show-debug          
dump daemon internal state to stdout

   
--version           
print version (membalanced 0.1)

   
--help              
print this text

Common options:

    --verbose,
-v       
verbose output

   
-vv                 
even more verbose output

   
-vvv                
most verbose output

   
--quiet             
quiet operation

    --human,
-h         
display data in human-readable format

    --timeout
<sec>     
time to wait for ack from the daemon (default: 10)

To list managed domains, their current memory allocation, rate and other statistics, execute:

# membalancectl list

Should a system adminstartor (for whatever reason) want to perform a manual intervention, such as manual resizing of one of the managed domains, or perform other operation that may interfere with automatic memory control, a good practice is to temporarily suspend automatic memory rebalancing:

# membalancectl pause
.... manual operation ...
# membalancectl resume

In the future this will also be augumented by commands to dynamically change Membalance setting for a domain after its start-up, such as to dynamically change domain’s dmem_quota or to enable/disable automatic memory adjustment for the domain etc.

It may also be a good practice to reduce Xen free slack to a minumum, and put this memory reserve under Membalance management instead, keeping it under host_reserved_soft (and thus making it available to domains in need of memory instead of being kept as a pure standby). To start a new domain then:

# membalancectl pause
# membalance free-memory --config xyz.cfg
# xl create xyz.cfg
# membalancectl resume --quiet

Membalance daemon can be configured to write log messages either to system log or to /var/log/membalanced.log.
The following command line options can be added to /etc/init.d/membalance (after the --daemon option):

--debug-level
<n>    set debug level

--log               
instead of syslog, log to /var/log/membalanced.log

Logging can also be dynamically controlled with membalancectl, for example:

# membalancectl log-sink logfile
# membalancectl log-level 10

REBALANCING ALGORITHM

Memory scheduling algorithm dynamically balances host memory allocation between participating virtual machines according to their current memory demand and according to the settings established by system administrator.

In general, any dynamic resource allocation algorithm should rebalance resources taking into the account:

Dynamically reported current and also projected need of a computing entity (virtual machine) in a given kind of a resource, and the impact a shortage in the resource would have on the performance and forward progress of the entity, and on the waste of other resources (for instance, if memory allocation to a VM is insufficient, it may cause high paging rate, or high data read rate due to insufficient cache size, leading to the waste in CPU and IO bandwidth resources required to process such paging or data read requests).
Relative importance of a given entity compared to other similar entities.

Some entities (VMs) are more important than others and may be assigned a greater weight or share, so their claim to the resource would have greater strength compared to the claim of less important VMs.

When all VMs have the same assigned weight or share, the algorithm is geared towards overall peformance optimizaton of the executing mix.

More often however, not all virtual machines are equal in imporance, and system administator would want, in the event of a resource shortage, to let less important virtual machnines suffer more heavily, in order to maintain a satisfactory performance of more important virtual machines. At the extreme, resource allocation is defined purely by assigned shares, without taking into account actual current demand by the VMs in the resource – likely leading to resource allocation to high-weight VMs even in the event they actually may not need this much of a resource currently, and the resource stays idle or little-used by those VMs or some of them, while lower-weight VMs sufferes a severe shortage of the resource hampering their performance.
Therefore, a system administrator should be able to assign a relative importance to share-based allocation vs. overall performance-based allocation.

The goal of maintaining best performance of high-importance machines and the goal of maintaining overall mix performance and resource utilization are contradictory, and an administrator should be able to strike a balance between the two – an adjustuble balance.

Administrator shoud be provided with a control that would force high-share VMs to release some of the resource they hold – and thus being allocated less than their share value – but only in the event they do not currently have heavy need in this resource and some of lower-weight VMs do have such a need.

To summarize, an algorithm for resource allocation between computing entities should be driven by:

Share-based requirements representing human-defined importance of entities.
Actual (current/projected and dynamically shifting) resource demand by individual entities.
An adjustuble control creating a hybrid aim of these two goals, i.e. balancing the importance of overall performance and overall host resource utilization efficiency vs. share-prioritized allocation.

The initial version of Membalance algorithm, as implemented in Membalance version 0.1 and as described below, embodies these requirements only partially. A fuller implementation is to be provided by a subsequent version of Membalance. Current version of the algorithm:

Responds to real-time memory demand by individual Xen domains and embodies few very basic demand projections.
Has a form of share-based allocation represented by memory quotas (DMEM_QUOTA) assignable for Xen domains. Currently the value of quota is assigned by system administrator manually, is a fixed value, and represents an absolute allocation size rather than a share. In the future, it may be made dynamically adjustuble (within min-max’ed range) to move it closer to a traditional share-based allocation.
The design of the algorithm tries to balance overall performance vs. granting some share-like guarantees to individual domains. The balancing control for the two categories of requirements however is not yet made explicit in a single adjustable parameter, and is rather embodied in several parameters and logics than binds them. Adding more integrated and simplified control is an area for future work.

We will now proceed to the description of memory rebalancing algorithm implemented by Membalance version 0.1.

The algorithm reasons about current memory demand of individual Xen domains chiefly on the basis of domain’s data read-in rate (aggregate of hard page faults rate + file system cache block read-in rate, kb/sec) over current sampling interval, with moving averages computed as necessary. The assumption being that reads reflect insufficient size of page cache or file system cache in a guest. The intention (not implemented yet) is to exclude non-cached reads such as direct IO (O_DIRECT) from the statistics of data read-in rate for Membalance purpose. As a further heuristics, if guest OS has ample free memory size (e.g. over 15% of total guest memory size), then its read rate is not considered as indicative of memory shortage. The scheme is vulnerable to mis-interpreting guests that perform large IO transfers using cached read functions without actual intention to cache and reuse the data.

The algorithm is invoked every interval seconds – but only if there is at least one currently managed domain. If there are no managed domains, the algorithm is not invoked. (Once there appear managed domains, tick count is incremented accordingly to refelect elapsed sleep time.)

If automatic domain memory allocation is paused by system administrator, the algorithm is invoked only to collect data (stage 1 as described below), but does not perform any adjustments (stages 2-4 are not invoked).

The algorithm tries to provide good memory-supply responsivness for domains that need to expand, while avoiding premature domain contraction and upsize-downsize thrashing. The algorithm’s motto is “expand fast, contract slow” (wherever possible at all). To this end:

Algorithm allows domains to grow in a time of plenty and trims domains only when there is actual demand for memory. When there is no demand, domains are allowed to keep memory allocated to them. As long as available memory stays above HOST_RESERVED_SOFT, there is currently no proactive reclamation (albeit some forms of it can be introduced later).
Algorithm reserves a part of host memory (defined by HOST_RESERVED_SOFT) for domains in a substantial need of memory, otherwise keeping it free and immediatelly available for expansion of a domain when it comes in a substantial need of extra memory.
Algorithm employs SLOW_RATE moving average to track recent history of domain’s data read-in rate in order to avoid premature contraction of a domain just because it did not show memory demand for a tick or two.
Maximum domain expansion amount at a normal interval (dmem_incr) is larger by default than trimming amount (dmem_decr), unless the memory is in very short supply, albeit the values are aqjustable by system administrator.

The algorithm uses the following data items.

Global dynamic values:

HOST_FREE

Current amount of host free memory.

Global settings:

INTERVAL

Interval between successive invocations of the algorithm.

HOST_RESERVED_HARD

Host memory to leave alone and never consume for allocation during domain size auto-adjustment by Membalance:

Membalance will not expand existing domains if such an expansion would leave HOST_FREE below HOST_RESERVED_HARD.
If Membalance detects that HOST_FREE dropped below HOST_RESERVED_HARD (e.g. because a new virtual machine has been launched or a domain was expanded manually by the system administrator) Membalance will try to reclaim memory from domains managed by Membalance to bring HOST_FREE up to HOST_RESERVED_HARD again or as close to it as possible for Membalance

This amount is in addition to (on top of) Xen free memory slack.

HOST_RESERVED_SOFT

Host memory between HOST_RESERVED_SOFT and HOST_RESERVED_HARD is made available only for domains in significant need of memory:

Domains with DMEM_SIZE < DMEM_QUOTA and RATE > RATE_LOW
Domains any size with data read-in rate >= RATE_HIGH

As follows:

If HOST_FREE <= HOST_RESERVED_SOFT, domain can only expand beyond DMEM_QUOTA when its data read-in RATE >= RATE_HIGH.

This is the only constraint HOST_RESERVED_SOFT imposes on domain expansion if domain size is below DMEM_QUOTA, domain still can expand up to DMEM_QUOTA if its RATE > RATE_LOW, regardless of HOST_RESERVED_SOFT.
If HOST_FREE drops below (<) HOST_RESERVED_SOFT, Membalance will try to shrink those domains that have rate < RATE_HIGH and size above their DMEM_QUOTA down to but not below DMEM_QUOTA, and will also try to shrink domains sized below QUOTA and having RATE <= RATE_LOW.

In other words, a domain can expand above DMEM_QUOTA at the expense of free memory only if its RATE >= RATE_HIGH and FREE > FREE_RESERVED_SOFT.

Note that if FREE memory drops down to FREE_RESERVED_SOFT, a domain can still expand at the expense of shrinking other domains (rather than consuming free memory), as explained below in the description of stage 4.

Also, if free memory drops down to FREE_RESERVED_SOFT, membalanced will initiate shrinking of domains, as explained in (2) above and in stage 3 description below.

As a cumulative outcome, it ensures that memory beyond HOST_RESERVED_SOFT (but constrained by HOST_RESERVED_HARD) will be allocated only to domains

(a) within their DMEM_QUOTA once they needs it (RATE > RATE_LOW)

and

(b) domains beyond DMEM_QUOTA but only having RATE >= RATE_HIGH

This amount is in addition to (on top of) Xen free memory slack.

Per-domain dynamic values:

DMEM_SIZE

Currently allocated memory size

RATE

Reported data read-in rate for the current cycle.

Note that memory demand reporting by domains is not exactly synchronized between domains (synchronized only up to approximately INTERVAL seconds) and not synchronized with execution of memory balancing algorithm in membalanced daemon (also synchronized only up to approximately INTERVAL seconds), therefore when membalanced executes its next cycle, RATE value may have not been reported by a domain yet, or reported value may reflect RATE value reported on the basis of paging/file cache read partly before the previous adjustment and partly after it (if domain size adjustment has been performed during the previous algorithm tick).

Therefore:

If no value has been reported for the current tick, value reported in the previous cycle is re-used.

For this reason values of DMEM_INCR and DMEM_DECR should be moderate, typically not bigger than 5% and certainly not over 10%, to prevent rapid swings and overreaction of Membalance mechanism based on incomplete input.

If domain reported no value for over two cycles, it is excluded from normal participation in Membalance (consideration for shrinking or expanding) until it starts reporting rate value again.

Membalance can still shrink non-reporting domain but only as the last resort measure when trying to meet the goals of HOST_RESERVED_HARD or manual memory request by system administrator and all other means of memory extraction have been exhausted (i.e. all participating domains have been shrunk to their DMEM_MIN size).

Per-domain settings:

DMEM_MIN	Minimum size that domain can be shrunk to in the event of memory shortage. Low-rate domains are likely to be sized between DMEM_MIN and DMEM_QUOTA. This setting can be though of as a practical domain size for very low-RATE domains.
DMEM_QUOTA	Memory quota for a domain Quota setting approximates domain size on a memory-constrained system in the event a domain keeps creating data read-in load. This is a practical domain size limit for low- or mid-RATE domains.
DMEM_MAX	Maximum size a domain can be expanded to. This approximates domain size on a memory-abundant system in the event a domain keeps creating data read-in load. This is a practical domain size limit for high-RATE domains.
RATE_HIGH	High-rate threshold for data read-in rate (kb/s).
RATE_LOW	Low-rate threshold for data read-in rate (kb/s).
RATE_ZERO	Data read-in rate <= RATE_ZERO is considered to be the same as zero rate and shall not cause domain expansion.
DMEM_INCR	Memory amount to expand domain by (as % of current allocation).
DMEM_DECR	Memory amount to shrink domain by (as % of current allocation).
GUEST_FREE_THRESHOLD	If guest free memory (as a percentage of guest total system memory) is > GUEST_FREE_THRESHOLD, regard domain’s data read-in rate as zero, regardless of the reported rate.

Consistency check condtitions:

RATE_HIGH > RATE_LOW
DMEM_HIGH >= DMEM_QUOTA
DMEM_QUOTA >= DMEM_MIN
HOST_RESERVED_SOFT >= HOST_RESERVED_HARD
GUEST_FREE_THRESHOLD in range 0%...100%

Memory balancing algorithm runs periodically (as defined by the INTERVAL setting, unless paused by system manager or there are no domains to manage).

At each algorithm tick, the algorithm executes in four stages:

Stage 1 (data collection):

Collect data from the domains: for each domain managed by Membalance read its xenstore “report” key and reset the key.
If in-domain guest operating system has plenty of free memory (reported guest free memory > GUEST_FREE_THRESHOLD percent of total guest system memory), treat reported data read-in rate as 0, regardless of the reported rate value.
If domain data read-in rate is <= RATE_ZERO, treat reported data read-in rate as 0, regardless of the reported rate value.
Disregard domains in the following states: paused, crashed, dying, shutdown. Only manage domains in the states: running, blocked.
Disregard non-participating domains (membalance = off).
Disregard currently non-participating domains (that have not reported data within more than one cycle/tick).
If domain has not reported data for the current cycle, presume reading to be the same as for the last cycle.
If parameter TRIM_UNRESPONSIVE is set and domain did not provide rate data reports for over TRIM_UNRESPONSIVE seconds, and current domain size is over DMEM_QUOTA, the domain is trimmed town to DMEM_QUOTA.
Calculate values of SLOW_RATE and FAST_RATE.

These are the values of RATE that have the history of RATE factored into them and are used further to compute domain pressure force to expand, and also domain force of resistance to contraction. We use these effective values of rate with history factored into them instead of just the latest reading of RATE so the algorithm stays tuned to domain rate trend and is not thrown off by temporarily and transient abrupt in rate reading for just one or two ticks.

FAST_RATE is used to calculate domain pressure force to expand.
SLOW_RATE is used to calculate domain force to resist contraction.

The value of SLOW_RATE is defined as:

SLOW_RATE = max(slow_moving_average(RATE), RATE)

Slow moving average is taken over a number of recent samples.
In no event SLOW_RATE is less than the current RATE reading.

Right now moving average for SLOW_RATE is hardwired to be taken over the readings for 5 recent intervals, with decreasing weights for older intervals. In the future it may be made tunable both on global and per-domain basis to cover a longer history than currently used five recent ticks data.

Currently, heuristically, FAST_RATE is just RATE, i.e. most recent reading of RATE. In the future the formula for calculating FAST_RATE may me modified to temper down the initial response for infrequent one-interval intermittent spikes.

If automatic memory adjustment is paused by system administrator, further stages are not performed.

Stage 2 (free memory HARD constraint):

Meet HOST_RESERVED_HARD constraint.

If HOST_FREE dropped below HOST_RESERVED_HARD, try to squeeze managed domains to bring HOST_FREE back to HOST_RESERVED_HARD or as close to it as possible.

This stage is performed in several rounds, executed until either the target for HOST_FREE getting back to HOST_RESERVED_HARD is met, or all rounds are exhausted and membalanced is still unable to meet the target for HOST_RESERVED_HARD.

Rounds start with trimming down domains that are least likely to suffer as the result of the trimming and proceed towards domain that are likely to suffer more heavily if trimmed.

Round 1. Select domains with RATE <= RATE_LOW and sort them by the time they had rate <= RATE_LOW. Starting from domain that had low rate for longest time and towards domain that had it for shortest time, trim each domain by up to DMEM_DECR (but never below DMEM_MIN).

Round 2. Select domains with RATE < RATE_HIGH and size > DMEM_QUOTA, excepting domains already trimmed in the previous round, and sort them by the time they had rate < RATE_HIGH. Starting from domain that had rate < RATE_HIGH for longest time and towards domain that had it for shortest time, trim each domain by up to DMEM_DECR (but never below DMEM_QUOTA).

Round 3. If still cannot meet the target, select domains with RATE < RATE_HIGH and size > DMEM_QUOTA, regardless of whether they were already trimmed in the previous rounds, and sort them by the time they had rate < RATE_HIGH. Starting from domain that had rate < RATE_HIGH for longest time and towards domain that had it for shortest time, and regarless of any previous trimming amount, trim each domain [additionally] by up to DMEM_DECR (but never below DMEM_QUOTA) on top of any previous trimming.

Round 4. Calculate pressure-resistance function (as described below in the write-up for stage 4) for every domain above DMEM_QUOTA (the advisory value of RATE we use in this calculation is made not perfectly valid by previous contractions, but this is the best we can do at this point). Sort the list by the value of the function. Starting from lowest-pressure domain towards highest-pressure domains, trim each domain by up to DMEM_DECR (but not brining it below DMEM_QUOTA) until the deficit is satisfied.

Treat domains that have not reported their rate lately as having zero rate.

If was unable to satisfy the deficit within one pass, repeat passes until either the deficit is satisfied or all domains are brought down to DMEM_QUOTA.

Round 5. If the deficit is still not satisfied at this point, try to satisfy it by gradually trimming domains from DMEM_QUOTA towards DMEM_MIN. Recalculate pressure-resistance function for each domain using the last sampled RATE value for this domain (this is suboptimal, but the best we can do at this point). Starting from the lowest-pressure domain towards highest-pressure domain, trim each domain by up to DMEM_DECR (but not brining it below DMEM_MIN) until the deficit is satisfied.

Treat domains that have not reported their rate lately as having zero rate, except if a domain is very young (domain uptime is less than startup_time – a parameter assignable on a per-domain basis but also having global default), in which case the domain is given a benefit of the doubt and ascribed the rate just above rate_high.

If was unable to satisfy the deficit within one pass, repeat passes until either the deficit is satisfied or all domains are brought down to DMEM_MIN.

If deficit is still not satisfied at this point, there is nothing further Membalance can do, since all domains are now down to their DMEM_MIN size, and Membalance may not try to trim domains any further. If all domains are already at their DMEM_MIN or below, quit the algorithm.

Stage 3 (free memory SOFT constraint):

Try to meet HOST_RESERVED_SOFT constraint.

If HOST_FREE dropped below HOST_RESERVED_SOFT, try to trim managed domains to bring HOST_FREE back to HOST_RESERVED_SOFT or as close to it as possible.

Unlike the adjustment for HOST_RESERVED_HARD, the adjustment to HOST_RESERVED_SOFT does not have to be performed instantaneously and can be performed gradually over a number of cycles (algorithm ticks).

Unlike in HOST_RESERVED_HARD stage, trimming of domains within a single tick is limited and no excessive trimming is performed to reach the HOST_RESERVED_SOFT target. Rather, the intent is to reach it gradually over a number of ticks by only a moderate trimming during each tick.

This stage is executed in up to three rounds.

Round 1. Select domains with RATE <= RATE_LOW and size > DMEM_QUOTA. Sort the list by the time domain RATE was <= RATE_LOW. Starting from the domains that had low rate for longest time, and towards domains that had it for shortest time, trim each by up to DMEM_DECR (but not below DMEM_QUOTA) until the deficit is satisfied (i.e. HOST_FREE is increased up to HOST_RESERVED_SOFT).

If domain has already been previously trimmed in current algorithm tick, it can only be trimmed at this stage to the extent that the total trim does not exceed DMEM_DECR.

Round 2. If deficit is still not satisifed, repeat in a similar fashion with domains with RATE <= RATE_LOW but size <= DMEM_QUOTA. In this round domains are trimmed by up to DMEM_DECR but not below DMEM_MIN. Similarly to the previous round, trimming is limited by DMEM_DECR including the trim applied previously within the same tick.

Round 3. If deficit is still not satisifed, repeat in a similar fashion with domains with RATE < RATE_HIGH and size > DMEM_QUOTA.

If still unable to satisfy the deficit, leave it off at this point and try again at next tick.

To minimize the probability of a jitter of reallocating memory back and forth between very similar domains (domain upsize/downsize jitter), a domain that has been expanded not more than shrink_protection_time ticks back is not considered elgigible for shrinking at stages 3 and 4 (stages “meeting soft free memory constraint” and “domain rebalancing”). However this protection is not in effect for stage 2 (stage “meeting hard free memory constraint”).

Outlined stage 3 algorithm can also be expressed in terms of effective pressure function for free memory region (see below in stage 4 write-up).

Future enhancements to the algorithm may also take into account longer term RATE history for a domain in a more sophisticated way, e.g. its averaging over long-term period and absolute values of rate.

Stage 4 (domain balancing):

Perform dynamic domain expansion or contraction, according to current memory demand (pressure) of specific domains.

Domains with high memory pressure inside them can be expanded by using free memory if available or at the cost of shrinking domains with low memory pressure.

Domains are always expanded at the expense of available free memory first if free memory is available, and are expanded at the cost of shrinking other domains only if free memory is unavailable (including due to the constraints imposed by HOST_RESERVED_SOFT and HOST_RESERVED_HARD).

There are two memory pressure functions calculated for each domain:

Pressure-out (expansion) force indicates “outward-directed” force for the expansion of this domain.
Pressure-resistance force indicates how strongly a domain resists an attempt to contract it.

(By the way of an example, consider an analogy in the physical world: a stone rock would have very high pressure-resistance force, but zero pressure-out force. Same applies to a domain at DMEM_MIN and rate <= RATE_LOW).

When domain X wants to be expanded, but free memory is unavailable for the expansion of X, it can expand at the cost of shrinking another domain Y if pressure-out(X) > pressure-res(Y).

Pressure functions are calculated according to domain size vs. its DMEM_MIN and DMEM_QUOTA, and its data read-in rate vs. its RATE_LOW and RATE_HIGH values as follows:

RATE	DMEM_SIZE	(force_resist) pressure-res	(force_expand) pressure-out
>= RATE_HIGH	> QUOTA	50 + x	50 + x
>= RATE_HIGH	MIN ... QUOTA	100 + x	100 + x
>= RATE_HIGH	<= MIN	500	300

LOW ... HIGH	> QUOTA	30 + x	30 + x
LOW ... HIGH	MIN ... QUOTA	60 + x	60 + x
LOW ... HIGH	<= MIN	500	200

<= RATE_LOW	> QUOTA	0	0
<= RATE_LOW	MIN ... QUOTA	40	0
<= RATE_LOW	<= MIN	500	0

Value of x ranges from 0.0 to 1.0 and is calculated depending on domain RATE. The exact formula for calculating x is unimportant, since its only purpose is to represent relative ordering of domains within the same tier (i.e. having the same pre-“+” base function value base, or that is to say in the same RATE and DMEM_SIZE category in the table). Currently x is calculated as x = RATE / RMAX where RMAX is a maximum rate for all managed domains being balanced.

Currently x is calculated on the basis of latest reported RATE values for each domain, however in the future the algorithm may take into account longer-term history of RATE, for instance domains that had low RATE for long period of time may be more eligible for shrinking than domains that had low RATE for only a short period of time, and before that exhibited a spike of higher RATE activity.

Future version of the algorithm may also introduce the following fine tuning: among domains with roughly the same RATE, domains with substantially larger DMEM_SIZE may be more eligible for shrinking and less eligible for expansion than domains with smaller DMEM_SIZE and similar RATE.

Important: if a domain has been contracted within the current algorithm tick to a full value of its DMEM_DECR (or even beyond it for the sake of meeting HOST_RESERVED_HARD target), its pressure-resistance function goes to 500, making the domain ineligible for further shrinking within the current tick.

Important: Domain expansion force is defined only for domains that have recently reported rate data. Domain contraction-resistance force however is defined also for domains that did not report data recently. Although such domains do not participate in rebalancing (stage 4), nor SOFT-targeting (stage 3), nor most rounds of HARD-targeting stage 2, rounds 1-3), but they do participate in HARD-targeting (stage2) rounds 4 and 5. For this special use case, such domains do have contraction-resistance force defined as follows:

DMEM_SIZE	(force_resist) pressure-res
> QUOTA	32
MIN ... QUOTA	62
<= MIN	500

Host free memory area also has effective pressure function ascribed to it, defined as:

HOST_FREE	(force_resist) pressure-res	(force_expand) pressure-out
> RESERVED_SOFT	0	0
SOFT ... HARD	45	35 - 45
<= RESERVED_HARD	1000	450

During stage 3, when HOST_FREE is in SOFT...HARD range, expansion-pressure function is effectively applied in two steps.

First it is effectively applied with value of 35, i.e. able to squeeze domains with RATE = LOW...HIGH and DMEM_SIZE > DMEM_QUOTA.

If it were impossible to satisfy HOST_RESERVED_SOFT target this way, then expansion-pressure function is applied again, now with value of 45 and able to contract domains in the category RATE <= RATE_LOW and with size DMEM_MIN...DMEM_QUOTA.

Stage 4 of the algorithm runs as follows.

First find out if there are any domains wishing to expand. A domain wants to expand if its pressure-out function is > 0. In no event expand domain above DMEM_MAX. If there are no domains to expand, quit.

Sort domains wishing to expand by their pressure-out function and try to accommodate their expansion requests in the sort order, starting from the domain with the highest value of pressure-out function and moving towards domains with lower pressure-out value.

Domain expansion target is normally indicated by DMEM_INCR. However if domain’s DMEM_SIZE < DMEM_MIN, then the target is expansion to DMEM_MIN.

Free memory provides the first source for request accommodation.

Free memory supply for all domains is constrained by HOST_RESERVED_HARD (which is on top of Xen free memory slack, so memory reserved for Xen free memory slack is left untouched as well).

Free memory supply for domains with RATE < RATE_HIGH and executing above their DMEM_QUOTA is also constrained by HOST_RESERVED_SOFT.

A domain expands into free memory to the extent possible, as indicated by pressure-resistance function of host free memory area (note that if allocation request causes remaining free memory size to cross HOST_RESERVED_SOFT or HOST_RESERVED_HARD threshold, then the allocation is performed accounting for such a crossing, i.e. domain pressure-out may be high enough to push to the threshold, but not necessarily beyond the threshold.)

If domain expansion request cannot be satisfied at the expense of free memory, try to satisfy it at the cost of shrinking other domains.

Sort domains (using another vector, distinct from expansion-order vector) in the order of their pressure-resistance function, starting from the lowest value towards the highest. Going from the first (lowest-resistance) element in this vector towards last (highest resistance), perform the following:

If shrink-candidates-vec-domain == expand-candidates-vec-domain, skip this shrink-candidates-vec-domain.
If pressure-resistance(shrink-candidates-vec-domain) >= pressure-out(expand-candidates-vec-domain), terminate the rebalancing process, as even the weakest shrink candidate will not yield its memory to the strongest expansion candidate.
Expand expand-candidate-vec-domain at the cost of shrinking shrink-candidate-vec-domain.

Total expansion of expand-vec-domain in the tick cycle cannot exceed DMEM_INCR, except for domains sized below DMEM_MIN and expanding to DMEM_MIN.

Total contraction of shrink-vec-domain in the whole algorithm tick cannot exceed DMEM_DECR.

Since pressure-resistance function of a domain can increase while taking memory out of it if the domain reaches the thresholds of DMEM_QUOTA or DMEM_MIN, pressure-resistance function may need to be recalculated at this point and shrink-vector re-sorted (and shrink candidate scanning re-started from the lowest index again); and similarly for expansion candidates vector if the expansion crosses the thresholds of DMEM_QUOTA or DMEM_MIN of the domain being expanded.

Once domain has been shrunk by DMEM_DECR or more during the current algorithm tick (this also includes shrinking at stages 2 and 3), its pressure-resistance function value increases to 500 to reflect this domain is no longer eligible to be shrunk during the current tick, and shrink-candidates-vector must be re-sorted, and scanning re-started from the lowest index again.

To reduce calculations, domains that are not eligible to be expanded or contracted are removed from the corresponding vectors.

To minimize the probability of a jitter of reallocating memory back and forth between very similar domains (domain upsize/downsize jitter), a domain that has been expanded not more than shrink_protection_time ticks back is not considered elgigible for shrinking at stages 3 and 4 (stages “meeting soft free memory constraint” and “domain rebalancing”). However this protection is not in effect for stage 2 (stage “meeting hard free memory constraint”).

In all stages of the algorithm:

DMEM_INCR and DMEM_DECR are rounded to the closest memory allocation quant (typically, Xen page size).
Domain is never expanded above DMEM_MAX and never contracted below DMEM_MIN.
If domain has been contracted or expanded by amount that causes its DMEM_SIZE to reach or cross DMEM_MIN or DMEM_QUOTA, its pressure functions are re-calculated at this point.
If domain has been shrunk by DMEM_DECR or more in the current cycle, its pressure-resistance function goes up to 500.

If domain has not recently provided data, it is largely left alone and does not participate in rebalancing (stage 4), SOFT-targeted trimming (stage 3) and most of HARD-targeted trimming (rounds 1, 2 and 3 of stage 2). It does however participate in HARD-targeted rounds 4 and 5. Also if domain has setting trim_unresponsive defined (or inherited from global configuration), then after staying non-reporting for trim_unresponsive seconds, Membalance will trim it down to dmem_quota.

Membalancectl utility provides free-memory command that is used to reclaim requested amount of memory for purposes such as starting new virtual machines. This command invokes procedure sched_freemem(...) inside Membalance daemon which in turn executes the routine implementing stage 2 of the algorithm (hard trimming), except it trims the domain to reclaim the amount of memory determined not by HOST_RESERVED_HARD, but rather requested by free-memory command. If domain was shrunk in the recent tick by membalancectl free-memory command (or multiple executions of free-memory command), the aggregate amount it was shrunk by is used during the next tick time to adjust the amount calculated by DMEM_DECR, in an attempt to ensure a domain does not get shrunk by more than DMEM_DECR total within a single tick, except in dire memory shortage.

After all stages of the algorithm calculate scheduled expansions and contractions for managed domains, these contractions and expansions are executed. First all the contractions are executed. Then Membalance tries to execute the scheduled expansions. The ability of Membalance to execute the expansions is contingent on available free memory, which in turn is contingent on domains ordered to shrink executing their de-ballooning promptly enough. When executing an adjustment cycle, Membalance will wait for domains ordered to shrink to release their memory for up to a timeout interval derived from configuration parameters (see domain_expansion_timeout_xxx). If they fail to release required amont of memory within this time and Xen stays short of free memory, Membalance will only expand domains scheduled to expand to the extent available memory allows it, and will leave the adjustment of remaining domains till the next tick.

The algorithm may be refined in the future.
Some possible future changes to consider are:

As free memory is getting in short supply (e.g. crossing into HOST_RESERVED_SOFT zone), dynamically scale up DMEM_DECR, making the algorithm a more agressive stripper of the domains with lower claim to memory. Likewise, if free memory is ample, dynamically scale up DMEM_INCR making for a faster expansion of the domains in need of memory.
When calculating pressure-resistance function, account for a time a domain have been in the same category of RATE (<= RATE_LOW, RATE_LOW ... RATE_HIGH, >= RATE_HIGH). For example, domains that had low rate for a long time must be squeezable more easily than domains that had a burst of activity and had been inactive only for a short time.

Also account for long-term rate history during stage 3 (shrinking domains to meet HOST_RESERVED_SOFT target) in a more sophisticated way, such as tracking down average rate over long-time peiod and taking into account absolute value of rate.
When calculating pressure-resistance function, account for domain size. Among domains with roughly the same RATE, domains with substantially larger DMEM_SIZE may be more eligible for shrinking and less eligible for expansion than domains with smaller DMEM_SIZE and similar RATE.
Shrink long waiters: In severe memory shortage, shrink domains with low % of execution time, i.e. mostly sleeping (cpu exec time < 1% of total time, over recent long interval). Reduce size by 25%?
Proactively shrink domains that for a long time had reported high percentage of free system memory inside the domain (as reported by guest OS).
Make moving average used to calculate SLOW_RATE tunable on global and per-domain basis to cover a longer history than currently used five recent ticks data.
Record domain size at which it last time had rate >= RATE_LOW or RATE_HIGH. When in need of a large block of memory e.g. for membalancectl free-memory operation, may try to trim towards this size either by the whole amount or a significant fraction of it.
Follow stage 4 (domain rebalancing) by another round of stage 3 (soft memory constraint). If stage 4 caused an expansion of some domains at the cost of host_reserved_soft, then follow-up repeat of stage 3 would allow an immediate compenatory contraction of weak-pressure domains in order to reclaim memory from them to restore free memory supply back up to host_reserved_soft if possible. Currently such a reclamation is delayed until the next tick.
Consider the mertis of proactively squeezing the domains that for a long time reported a low or very low data read-in rate.

On a larger scale, future refinements to the algorithm should:

Incorporate in more explicit form a share-based allocation, perhaps by making DMEM_QUOTA dynamically computable according to share values assigned to all managed domains, and then further thresholded against quota-min anx quota-max absolute values (defined as per-domain properties inheritable from higher-level templates). Similarly, DMEM_INCR, DMEM_DECR and RATE_xxx may be subject for share-derived adjustments.
Present an explicit and simplified adjustment control for balancing between share-based allocation criterion and overall performance-based criterion.

DEFICIENCIES OF XEN API

“Testing the waters” with Membalance 0.1 indicated that Xen currently (4.4) has few functional omissions in the API that need to be addressed for any automated memory allocation application to work reliably in a production environment.

First.

Memory allocated to Xen domain is composed of three main parts: domain pseudo-physical memory, videoram and Xen internal per-domain structures (hereafter we designate them as M, V and X correspondingly).

X part contains items like shadow page tables, device data, buffers used by split drivers, LIBXL_MAXMEM_CONSTANT, LIBXL_PV_EXTRA_MEMORY or LIBXL_HVM_EXTRA_MEMORY etc.

To properly control memory allocation to domains, the balancer needs two basic functions: (1) find current memory allocation, (2) set new memory allocation.

As it happens, Xen provides two functions (or rather sets of functions): one returns current M+V+X, the other sets new target for M+V.

Thus in effect Xen API makes use of two distinct memory scales, one representing the “visible” size of a domain (M+V scale, as indicated by domain size target recorded in Xenstore or by xc_domain_get/set_pod_target plus videoram size), the other representing “total” (M+V+X) size. Unfortunately Xen does not expose the value of X – there is no Xen hypervisor call that would allow to query it, and it is not returned as a field in structures like xc_domaininfo_t. Thus there is no way to relate two scales together and make reasoning about M+V allocation target pn the basis of M+V+X allocation data.

To make things worse, allocation data available for the “visible” scale represents only a desired (target) domain size and tells nothing about its actual size. A domain can have large memory target value recorded in Xenstore but being unable to expand, therefore the value of target by itself tells nothing about domain’s actual current allocation. Therefore a reasoning about actual current allocation can be performed only using xc_domaininfo_t data (which is on M+V+X scale scale), whereas adjustments to the allocation can be performed only on the M+V scale – yet Xen provides no way to translate between the scales.

One might try to capture the offset between the two scales (as Membalance 0.1 does), but there is no way to do it on a “spot” basis, it is possible at all (and then only in an unreliable way) after domain has been runnable and size-stable for some time, i.e. had both Xenstore-recorded memory target and xc_domaininfo_t.tot_pages unchanged for some time, and there should be free memory available in the system, so domain does not stay under target being unable to expand towards it and domain should be runnable and it should get enough execution time to let the balloon driver to perform the expansion and guest OS should be in the state that does not block balloon driver execution – a set of conditions that balancing application cannot reliably control.

Furthermore, Xen internal data area can and does vary in size in response to domain activity, therefore it is not enough to capture it just once.

Altogether it makes it impossible to implement a memory balancing application of production-grade reliability on top of existing Xen API.

The easiest amendment to Xen API to address this issue would be to add to xc_domaininfo_t (or extended version of it) a field representing X, i.e. the size of Xen internal per-domain allocation other than pseudo-physical memory and videoram.

Second.

Unfortunately Xen does not provide an adequate facility for tracking current memory allocation and outstanding memory commitments/claims that would be usable by outside applications.

Specifically, Xen does not provide an adequate way to figure out outstanding memory lien for a domain in the process of being expanded, i.e. the difference between its current allocation and target allocation, nor the sum of all outstanding liens weighing against the remaining free memory. Xen does have the notion of outstanding_pages, but it is used only during the initial domain creation, not in the process of domain expansion.

A proper memory allocation/commitments tracking facility ought to provide the following data:

Currently allocated M+V+X size for every domain.
Separately, X part of current allocation.
Target M+V+X size for every domain (distinct from item #1 for domains in the process of expansion or contraction).
Separately, X part of target allocation if different from (2).
Summary (1) and (3) for all the domains hosted by hypervisor.
Liens data should be lockable (perhaps for lien increase only – see below).

Unfortunately as of current version (4.4) Xen does not provide this data.

There are target and videoram sizes recorded in Xenstore, but they constitute only M+V part of domain size that does not include the X part – the size of a domain’s Xen internal data area which is not stored or published anywhere. Therefore capturing target and videoram values for all the domains

(overhead issues aside -- but this can be worked around by reading watch-updates from Xenstore /local/domain key and its subkeys, rather than reading Xenstore keys every time the values are needed)

does not provide us means to find out “true free memory size” because the X part cannot be accounted for. Speaking in terms of Xen interface structures, knowing target + videoram does not let us reason about xc_domaininfo_t.tot_pages and vice versa. In effect, Xen/XL operate two disjoint scales of memory sizing with no conversion between them possible for an outside application.

Domains may be in the process of shrinking or expansion (including domains not managed by Membalance) and their current size can be distinct from their allocation targets and in the process of moving towards the targets. Thus capturing just current size or free memory does not provide us “true free size” until such movement is completed, since Xen does not let us know the outstanding commitments, and there is no reliable way to relate M+V scale to M+V+X scale – in fact no way at all while the expansion is in progress.

Thus the only way left for us to find or rather pray to approximate “true free size” is to wait for domain resizings in progress to complete by observing free memory size to stabilize over some time.

This is very unreliable because:

A domain in the process of expansion or contraction may temporary stall, either because it is not alloted enough CPU time, or because some intra-guest activity temporarily preempted the balloon driver.
A domain can be paused.
Multiple domains can be expanding and shrinking simultaneously compensating each others group impact on free memory and creating an illusion that a stabilization had been achieved.

Once these conditions clear, domain will resume its expansion or contraction, or multi-domain expansion and contraction will get out of mutual balance, and the assumption of having acquired “true free memory size” based on an apparent stability of free memory amount on the host will prove wrong.

Thus any reading of “Xen free memory” is, strictly speaking, meaningless since it represents merely a spot reading and there may be domains, including domains outside of our control, that are expanding or shrinking, and it is not possible for us to know the extent of their outstanding lien on free memory.

The easiest amendment to Xen API to address this issue would be:

Add to xc_domaininfo_t (or extended version of it) a field representing X.
Add to xc_domaininfo_t (or extended version of it) a field representing target (target POD size), so we do not need to rely on reading it from Xenstore – nor suffer from possibly delayed Xenstore watch events and thus stale cache data. This would represent a lien for M+V part.
If domain is currently expanding and its expansion to target is likely to result in an increased value of X, then an esitmate of new X once target is reached. The difference with current value of X would represent a lien for X part.
Alternatively, a summary lien for an expanding domain can be reported via existing field xc_domaininfo_t.outstanding_pages.
A system-wide sum of all outstanding liens. It can be reported via existing field xc_physinfo_t.outstanding_pages, it is just not currently used to track domain expansions.
While memory balancer is performing memory rebalancing cycle, it is desirable to lock out an execution of all new memory expansion requests, so no new memory liens can be put out while memory balancer is running its calculations. This ensures that memory balancer can perform calculations on valid data and well-defined amount “true free memory” (less outstanding liens), and avoid overcommittment of remaining free memory due to concurrent claims being made by other applications. This implies that functions like xc_domain_set_pod_target or xc_domain_claim_pages would have to stall if they result in increasing domain’s lien on memory while the lock on lien increase is engaged. Decreasing the lien is fine and does not have to cause a wait. In practical terms it means that while a balancer runs a cycle determining new allocations, it would temporarily block concurrent creation of new domains (such as with xl create) or manual expansion of domain sizes (such as with xl mem-set), however domains can terminate and release resources without a delay.

Third.

XL library should export routine parse_config_data or provide some other means for an application outside of XL to be able to evaluate how much memory (need_memkb) is required to create a domain identified by supplied domain confiig file.

Fourth.

Current implementaion of libxl_set_memory_target is such that its return status is flaky and very little can be inferred from it.

FUTURE WORK / ROADMAP

Implement changes to Xen memory allocation reporting API described in section XEN API DEFICIENCIES.
Without these changes, no production-grade memory balancing application can be implemented.
Implement memprobe daemons for Windows and other relevant guests such as perhaps BSD.
Implement a patch for Linux kernel (and a corresponding change for Linux version of memprobe) to keeps a statistics of non-cached block IO operations caused by direct IO (O_DIRECT), so it can be excepted from reported data read-in rate.
Think over and implement the refinements to the balancing algorithm listed at the end of section REBALANCING ALGORITHM.
Think over and integrate the support for Xen shared (deduplicated) pages and handling of copy-on-write for shared pages.
Shared page support is currently disabled in Xen, but one day it may be (re)enabled.
Integrate the support for Transcendent Memory pools.
Implement membalancectl commands to dynamically set (and display) Membalance settings for a domain. Currently the settings are read from a domain configuration file and cannot be changed through the lifetime of a domain. Suggested addition would allow to keep the values in Xenstore, subject to dynamic modifications, including the values quota/shares, whether automatic memory allocation adjustment is allowed for a domain, and any other Membalance domain-specific settings etc. This command would also allow to start managing a previously unmanaged domain without having to restart it.
Think over an impact of in-guest hotplug memory support.
(Low priority:) Think over and integrate the support for paged pages.
Consider the merits of hierarchical resource allocation, similar to Linux cgroups.
Design and provide a support for a version of memprobe protocol that would explicitly instruct Membalance about current and projected memory needs of the virtual machine. Such a probe can be useful for VMs that are dedicated to running a principal application such as a database engine, and where such an application is in a position to make better reasoning about its memory needs than application-agnostic probes. This may possibly
result in more than one probe running inside a VM, and a need to coalesce data from a set of probes inside a single VM.
Make policy module pluggable. Reimplement it in a somewhat higher-level language (such as Java or OCaml/Haskell), so it can be used as a template for a development of custom policy modules.
Develop regular documentation, man packages, and distro-specific installation packages.

REFERENCES

Carl Waldspurger, “Memory Resource Management in VMware ESX server”

Discusses page reclamation issues, ballooning, host-based paging, content-based page sharing/deduplication, share-based resource management, and a composite resource management formula accounting both for shares and for memory pressure or “resource under-use tax”. The use factor in ESX is assessed based on page reference sampling (every sampling interval some number of pages are marked inaccessible by the hypervisor which then monitors for the percentage of those pages being hit by the guest).

Transcendent memory:

Main project page: https://oss.oracle.com/projects/tmem
Dan Mangenheimer et al., “Transcendent Memory and Linux”
Dan Mangenheimer , “Advances in memory management in a virtual environment”

Jui-Hao Chiang, “Working Set-based Physical Memory Ballooning”
Jui-Hao Chiang, “Optimization Techniques for Memory Virtualization-based Resource Management”

Describes the scheme for dynamically tracking “true working set” of the domain by monitoring its swapin/refault rate. Dynamically expands the domain on increasing the rate, or starts to squeeze it when paging rate drops down, then stops squeezing it and bounces off as an increase in the paging rate is detected again. This way the size of the domain is hoped to approximate its TWS.

Scheme description does not (?) account for VFS cache re-read rate, but this is easily amendable.

More importantly, the scheme does not attempt to correlate intra-domain memory pressure with pressure in other domains, in making the sizing decisions.

Also it is geared towards using short time horizon during the squeeze, likely resulting in elimination of pages (such as e.g. cached file data) that are likely to be needed later, but beyond this time horizon. Proactive memory reclamation should take into the account the costs of memory vs. the cost of IO bandwidth, in the spirit of Gray’s “5-minute rule” (or rather its modern editions).

Weiming Zhao, Zhenlin Wang, “Dynamic Memory Balancing for Virtual Machines”

Tries to determine “true working set” of the virtual machine by monitoring its page access LRU histogram. Pages that are accessed infrequently are not part of the TWS and can be reclaimed from the domain without a significant hit on the performance.

While this scheme (overhead and intrusiveness issues aside) gives an insight in what part of currently allocated memory is used, it does not provide unambiguous and comparable data about memory expansion pressure in the domain to expand beyond its current size.

Authors circumvent it by either monitoring OS indicators (without elaborating which, how and response logics), or by expanding LRU list to monitor VM references to page in storage devices – bound to be prohibitively expensive for large data set.

The scheme (or its discussion anyway) also does not address time horizon and premature reclamation issue (vs. “5-minute rule”) mentioned earlier.

Martin Schwidefsky et al. (IBM), “Collaborative Memory Management in Hosted Linux Environments”

Qmemman, Qubes OS memory manager (http://www.qubes-os.org/wiki/Qmemman)

KVM Memory Overcommitment Manager

http://www.ovirt.org/MoM
https://aglitke.wordpress.com/2011/03/03/automatic-memory-ballooning-with-mom

Parts of memory rebalancing algorithm currently in use by Membalance were inspired by OpenVMS working set management algorithms, see e.g. Ruth Goldenberg, “OpenVMS Alpha Internals and Data Structures: Memory Management“, HP Technologies, 2002