Linux Cluster setup

Helpful reading: Cluster_Tutorial_2

Red Hat 7 documentation

RHEL 7 High Availability Add-On Administration

High Availability Add-On Reference

Global File System 2

Load Balancer Administration

Clusters from Scratch
Pacemaker Explained (Reference)

SUSE documentation

"Pro Linux High Availability Clustering" (Kindle)

"CentOS High Availability" (Kindle)
google: corosync totem
google: OpenAIS

Older (CMAN-based) clusters included:

/etc/cluster/cluster.conf => corosync.conf + cib.xml
system-config-cluster or conga (luci + ricci) configuration UI => replaced by (still deficient) pcs-gui on port 2224
rgmanager => pacemaker
ccs => pcs

Set up Corosync/Pacemaker cluster named vc composed of three nodes (vc1, vc2, vc3)

Based on Fedora Server 22.

Warning: a bug in the virt-manager Clone command may destroy the AppArmor profile on both the source and the target virtual machine.
Replicate virtual machines manually, or at least back up the source machine's profile (located in /etc/apparmor.d/libvirt).

Network set-up:

It is desirable to set up separate network cards for general internet traffic, SAN traffic and cluster backchannel traffic.
Ideally, interfaces should be link-aggregated (bonded or teamed) pairs, with each link in a pair connected to separate stacked switches.
Bonded interfaces are slightly preferable to teamed interfaces for clustering, as all link management for bonded interfaces happens in the kernel and does not involve user-land processes (unlike in the teamed set-up).

It makes sense to use dual-port network cards and scatter general/SAN/cluster traffic ports between them, so a card failure does not bring down the whole network category.

If interfaces are bonded or teamed (rather than configured for separate sub-nets), switches should allow cross-traffic, i.e. be either stackable (preferably) or have ISL/IST (inter-switch link/trunking, aka SMLT/DSMLT/R-SMLT). 802.1aq (Shortest Path Bridging) support may be desirable. See here.

Note that an IPMI (AMT/SOL) interface cannot be included in a bond or team without losing its IPMI capability, since it ceases to be individually addressable (having its own IP address).
Thus if IPMI is to be used for fencing or remote management, the IPMI port is to be left alone.

For a real physical NIC, can identify port with

ethtool --identify ethX [10] => flashes LED 10 times

When hosting cluster nodes in KVM, create KVM macvtap interfaces (virtio/Bridge).

Bond interfaces:

About bonding
RHEL7 documentation
more about bonding

Note that bonded/teamed interfaces in most setups do not provide increased data speed or increased bandwidth from one node to another. They provide a failover and may provide an increased aggregate bandwidth for concurrent connections to multiple target hosts (but not to the same target host). However, see further down below.

Use network manager GUI:

"+" -> select Bond
Add->Create->Ethernet->select eth0
Add->Create->Ethernet->select eth1
Link Monitoring: MII => check media state
                 ARP => use ARP to "ping" specified IP addresses (comma-separated),
                        at least one responds -> link ok (can also configure to require all to respond)
Mode = 802.3ad => if linked to a real switch (802.3ad-compliant peer)
                  Adaptive load balancing => otherwise (if connected directly or via a hub, not a switch)
Monitoring frequency = 100 ms      

Or create files:


#BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=balance-rr"
BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=balance-alb"


NAME="bond0 slave 1"


NAME="bond0 slave 2"

nmcli device disconnect ifname
nmcli connection reload [ifname]
nmcli connection up ifname

route -n => must go to bond, not slaves

also make sure default route is present
if not, add to /etc/sysconfig/network:  GATEWAY=xx.xx.xx.xx
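
The same bond can also be created from the command line with nmcli instead of the GUI or hand-written files. The following is only a sketch: the function name bond_commands and the connection names are made up for illustration, and the "mode"/"miimon" shorthand and the bond-slave connection type are assumed to be accepted by the installed NetworkManager version. The function only prints the commands, so they can be reviewed before being piped to sh.

```shell
# Print (do not execute) nmcli commands that create a bond
# and enslave the given interfaces to it.
# Usage: bond_commands <bond-name> <bonding-mode> <nic>...
bond_commands() {
    bond=$1; mode=$2; shift 2
    echo "nmcli connection add type bond con-name $bond ifname $bond mode $mode miimon 100"
    for nic in "$@"; do
        echo "nmcli connection add type bond-slave con-name $bond-$nic ifname $nic master $bond"
    done
    # bring up the slaves first, then the bond itself
    for nic in "$@"; do
        echo "nmcli connection up $bond-$nic"
    done
    echo "nmcli connection up $bond"
}

bond_commands bond0 balance-alb eth0 eth1
```

After applying, check with route -n that routes go to the bond and not to the slaves, as described above.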

To team interfaces:

dnf install -y teamd NetworkManager-team

then configure team interface with NetworkManager GUI

Bonded/teamed interfaces in most setups do not provide increased data speed or increased bandwidth from one node to another. They provide failover and may provide increased aggregate bandwidth for concurrent connections to multiple target hosts (but not to the same target host). However, there are a couple of workarounds:

Use bonding mode=4 (802.3ad) with xmit_hash_policy=layer3+4.

The latter hashes using src-(ip,port) and dst-(ip,port).
Still not good for a single connection.
Create a separate VLAN for each port (on each of the nodes) and use bonding mode = Adaptive load balancing.

Then an LACP-compliant bridge will consider the links separate and won't try to correlate the traffic and direct it via a single link according to xmit_hash_policy.
However, this somewhat reduces failover capacity: for example, if Node1.LinkVLAN1 and Node2.LinkVLAN2 both fail, the two nodes no longer share a working VLAN.
It also requires that all peer systems (such as iSCSI servers, iSNS, etc.) have their interfaces configured according to the same VLAN scheme.

Remember to enable jumbo frames: ifconfig ethX mtu 9000.
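
Jumbo frames only help if every hop between the nodes accepts them. A minimal sketch for checking this, assuming IPv4 (20-byte IP header plus 8-byte ICMP header) and the Linux iputils ping; the function names and the peer name vc2 are illustrative only:

```shell
# Largest ICMP payload that fits into one unfragmented IPv4 packet:
# MTU minus 20 bytes (IPv4 header) minus 8 bytes (ICMP header).
mtu_payload() {
    echo $(( $1 - 28 ))
}

# Send 3 pings with the "don't fragment" flag set. The pings fail
# if any hop on the path has an MTU smaller than the one given.
# Usage: check_jumbo <peer> [mtu]
check_jumbo() {
    peer=$1
    mtu=${2:-9000}
    ping -M do -c 3 -s "$(mtu_payload "$mtu")" "$peer"
}

# Example (not run here): check_jumbo vc2 9000
```

For an MTU of 9000 this pings with a payload of 8972 bytes.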


Names vc1, vc2 and vc3 below are for cluster backchannel.

On each node:

# set node name
hostnamectl set-hostname vcx

# disable "captive portal" detection in Fedora
dnf install -y crudini
crudini --set /etc/NetworkManager/conf.d/21-connectivity-local.conf connectivity interval 0
systemctl restart NetworkManager

Cluster shells


dnf install -y pdsh clustershell

To use pdsh:

pdsh -R exec -f 1 -w vc1,vc2,vc3 cmd | dshbak
pdsh -R exec -f 1 -w vc[1-3]  cmd | dshbak

pdsh -R exec -f 1 -w vc1,vc2,vc3
pdsh -R exec -f 1 -w vc[1-3]

cmd substitution:

%h  => remote host name
%u  => remote user name
%n  => 0, 1, 2, 3 ...
%%  => %

To set up for clush, first enable password-less ssh.
Clumsy way:

ssh vc1
ssh-keygen -t rsa

ssh vc1 mkdir -p .ssh
ssh vc2 mkdir -p .ssh
ssh vc3 mkdir -p .ssh

ssh vc1 chmod 700 .ssh
ssh vc2 chmod 700 .ssh
ssh vc3 chmod 700 .ssh

cat .ssh/ | ssh vc1 'cat >> .ssh/authorized_keys'
cat .ssh/ | ssh vc2 'cat >> .ssh/authorized_keys'
cat .ssh/ | ssh vc3 'cat >> .ssh/authorized_keys'

ssh vc2
ssh-keygen -t rsa
cat .ssh/ | ssh vc1 'cat >> .ssh/authorized_keys'
cat .ssh/ | ssh vc2 'cat >> .ssh/authorized_keys'
cat .ssh/ | ssh vc3 'cat >> .ssh/authorized_keys'

ssh vc3
ssh-keygen -t rsa
cat .ssh/ | ssh vc1 'cat >> .ssh/authorized_keys'
cat .ssh/ | ssh vc2 'cat >> .ssh/authorized_keys'
cat .ssh/ | ssh vc3 'cat >> .ssh/authorized_keys'

Cleaner way:

Create id_rsa and authorized_keys on one node,
then replicate them to other nodes in the cluster.
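
A sketch of the cleaner way, assuming the same account is used on all nodes and that sharing one key pair across the cluster is acceptable. The function name ssh_replicate_commands is made up for illustration; it only prints the commands so they can be reviewed first. Each rsync will ask for the password once, since key-based login is not yet in place on the target.

```shell
# Print the commands that create one key pair on the current node,
# authorize it locally, and copy the whole ~/.ssh directory to the
# other nodes, so all nodes share the same identity and authorized_keys.
# Usage: ssh_replicate_commands <other-node>...
ssh_replicate_commands() {
    echo "ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa"
    echo "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys"
    echo "chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"
    for node in "$@"; do
        echo "rsync -av ~/.ssh/ $node:.ssh/"
    done
}

ssh_replicate_commands vc2 vc3
```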

To use clush:

clush -w vc1,vc2,vc3 -b  [cmd]
clush -w vc[1-3] -b  [cmd]

Basic cluster install:

On each node:

dnf install -y pcs fence-agents-all fence-agents-virsh resource-agents pacemaker

Optional: dnf install -y dlm lvm2-cluster gfs2-utils iscsi-initiator-utils lsscsi httpd wget

systemctl start firewalld.service
firewall-cmd --permanent --add-service=high-availability
firewall-cmd --add-service=high-availability
systemctl stop firewalld.service
iptables --flush

## optionally disable SELinux:
#setenforce 0
#edit /etc/selinux/config and change
#SELINUX=enforcing => SELINUX=permissive

passwd hacluster

systemctl start pcsd.service
systemctl enable pcsd.service

# make sure no http_proxy exported
pcs cluster auth -u hacluster -p xxxxx --force
e.g. pcs cluster auth vc1 vc2 vc3 -u hacluster -p abc123 --force

# created auth data is stored in /var/lib/pcsd

On one node:

pcs cluster setup [--force] --name vc vc1 vc2 vc3

pcs cluster start --all

to stop:   pcs cluster stop --all

On each node:

# to auto-start cluster on reboot
# alternatively can manually do "pcs cluster start" on each reboot
pcs cluster enable --all

to disable: pcs cluster disable --all

View status:

pcs status
pcs cluster status
pcs cluster pcsd-status
systemctl status corosync.service
journalctl -xe

cibadmin --query
pcs property list [--all] [--defaults]
corosync-quorumtool -oi [-i]
corosync-cmapctl  [  |  grep members]
corosync-cfgtool -s
pcs cluster cib

Verify current configuration

crm_verify --live --verbose

Start/stop node

pcs cluster stop vc2
pcs status
pcs cluster start vc2

Disable/enable hosting resources on the node (standby state)

pcs cluster standby vc2
pcs status
pcs cluster unstandby vc2

"Transactional" configuration:

pcs cluster cib my.xml                    # get a copy of CIB to my.xml
pcs -f my.xml  ... change command ...     # make changes of config in my.xml
crm_verify --verbose --xml-file=my.xml    # verify config
pcs cluster cib-push my.xml               # push config from my.xml to CIB
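
The four steps can be wrapped into one helper so that a change is pushed only if it applies cleanly and passes verification. This is a sketch, not a tested tool: the function name cib_transaction is made up, and it uses a temporary file instead of my.xml.

```shell
# Apply one pcs change "transactionally": stage it in a scratch copy
# of the CIB, verify the result, and push only if every step succeeded.
# Usage: cib_transaction <pcs sub-command and arguments>
cib_transaction() {
    tmp=$(mktemp) || return 1
    pcs cluster cib "$tmp" &&
        pcs -f "$tmp" "$@" &&
        crm_verify --verbose --xml-file="$tmp" &&
        pcs cluster cib-push "$tmp"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Example (not run here):
#   cib_transaction property set no-quorum-policy=freeze
```

If any step fails, the live CIB is left untouched and the function returns a non-zero status.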

Configure STONITH

All agents:

fence_virsh - fences a machine by connecting via ssh to the VM host and executing sudo virsh destroy <vmid> or sudo virsh reboot <vmid>

Alternative to virsh: fence_virt/fence_xvm

dnf install -y fence-virt

STONITH is needed:

Configure virsh STONITH

On the vm host:

define user stonithmgr
add it to sudoers as
stonithmgr ALL=(ALL)  NOPASSWD: ALL

On a cluster node:

pcs stonith list
pcs stonith describe fence_virsh
man fence_virsh
fence_virsh -h

# test
fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --action=metadata
fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=status
fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=list
fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=monitor
fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=off

create file /root/ as
echo "vc-cluster-passwd"

chmod 755 /root/
rsync -av /root/ vc2:/root
rsync -av /root/ vc3:/root

for node in vc1 vc2 vc3; do
    pcs stonith delete fence_${node}_virsh
    pcs stonith create fence_${node}_virsh \
        fence_virsh \
        priority=10 \
        ipaddr=${node}-vmhost \
        login=stonithmgr passwd_script="/root/" sudo=1 \
        port=${node} \
        pcmk_host_list=${node}
done

pcmk_host_list => cluster node(s) that this device can fence
port => vm name in virsh
ipaddr => name of machine hosting vm
delay=15 => delay for execution of fencing action

STONITH commands:

pcs stonith show --full
pcs stonith fence vc2 --off
pcs stonith confirm vc2
pcs stonith delete fence_vc1_virsh


Management GUI:


log in as hacluster

Management GUI, Hawk:   SUSE on Hawk


Essential files:


/var/lib/pacemaker/cib/cib.xml  (do not edit manually)


/var/log/corosync.log  (but by default sent to syslog)

or on new Fedora:
journalctl --boot -x
journalctl --list-boots
journalctl --follow -x
journalctl --all -x
journalctl -xe

Man pages:

man corosync.conf
man corosync.xml
man corosync-xmlproc

man corosync_overview
man corosync
man corosync-cfgtool

man quorum_overview        // quorum library
man votequorum_overview    // ...
man votequorum             // quorum configuration
man corosync-quorumtool

man cibadmin

man cmap_overview        // corosync config registry
man cmap_keys
man corosync-cmapctl

man sam_overview         // library to register process for a restart on failure

man cpg_overview         // closed group messaging library w/virtual synchrony
man corosync-cpgtool

man corosync-blackbox    // dump protocol "blackbox" data
man qb-blackbox
man ocf-tester
man crmadmin

man gfs2
man tunegfs2

Essential processes:

corosync      totem, membership and quorum manager, messaging
cib           cluster information base
stonithd      fencing daemon
crmd          cluster resource management daemon
lrmd          local resource management daemon
pengine       policy engine
attrd         co-ordinates updates to cib, as an intermediary
dlm_controld  distributed lock manager
clvmd         clustered LVM daemon

Alternatives to corosync: CMAN or CCM + HEARTBEAT

DC ≡ Designated Controller. One of the CRMd instances is elected to act as the master. Should the elected CRMd process or its node fail, a new master is elected. The DC carries out PEngine's instructions by passing them to the LRMd on the local node, or to CRMd peers on other nodes, which in turn pass them to their LRMds. Peers then report the results of execution to the DC.

Resource categories:

LSB      Services from /etc/init.d
Systemd  systemd units
Upstart  upstart jobs
OCF      Open Cluster Framework scripts
Nagios   Nagios monitoring plugins
STONITH  fence agents

pcs resource standards
pcs resource providers
pcs resource agents ocf:heartbeat
pcs resource agents ocf:pacemaker
pcs resource agents systemd
pcs resource agents service
pcs resource agents lsb
pcs resource agents stonith

Resource constraints:

location    Which nodes the resource can run on
order       The order in which the resource is launched
colocation  Where the resource will be placed relative to other resources

Connect to iSCSI drives:

See iSCSI page.

Briefly, on each cluster node:

Install the open-iscsi package. The package is also known as the Linux Open-iSCSI Initiator.


apt-get install open-iscsi lsscsi
gedit /etc/iscsi/iscsid.conf
/etc/init.d/open-iscsi restart


dnf install -y iscsi-initiator-utils lsscsi
systemctl enable iscsid.service
systemctl start iscsid.service

Display/edit initiator name, ensure it is unique in the landscape (especially if cloned the system)

cat /etc/iscsi/initiatorname.iscsi

e.g. => =>

Optional: edit configuration

gedit /etc/iscsi/iscsid.conf

restart the service

Discover the iSCSI targets on a specific host

iscsiadm -m discovery -t sendtargets -p qnap1x:3260 \
    --name discovery.sendtargets.auth.authmethod --value CHAP \
    --name discovery.sendtargets.auth.username --value sergey \
    --name discovery.sendtargets.auth.password --value abc123abc123 

Check the available iSCSI node(s) to connect to.

iscsiadm -m node

Delete node(s) you don’t want to connect to when the service is on with the following command:

iscsiadm -m node --op delete --targetname <target_iqn>

Configure authentication for the remaining targets:

iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --op=update --name node.session.auth.authmethod --value=CHAP
iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --op=update --name node.session.auth.username --value=sergey
iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --op=update --name node.session.auth.password --value=abc123abc123
iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --login

iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --op=update --name node.session.auth.authmethod --value=CHAP
iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --op=update --name node.session.auth.username --value=sergey
iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --op=update --name node.session.auth.password --value=abc123abc123
iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --login

iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --op=update --name node.session.auth.authmethod --value=CHAP
iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --op=update --name node.session.auth.username --value=sergey
iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --op=update --name node.session.auth.password --value=abc123abc123
iscsiadm   --mode node  --targetname ""  -p qnap1x:3260 --login

You should be able to see the login message as below:

Login session [iface: default, target:, portal:,3260] [ OK ]

Restart open-iscsi to login to all of the available nodes.

Fedora:  systemctl restart iscsid.service
Ubuntu:  /etc/init.d/open-iscsi restart

Check the device status with dmesg.

dmesg | tail -30

List available devices:

lsscsi -s
lsscsi -dg
lsscsi -c
lsscsi -Lvl

iscsiadm -m session [-P 3] [-o show]

For multipathing, see a section below.

Format volume with cluster LVM

See RHEL7 LVM Administration, chapters 1.4, 3.1, 4.3.3, 4.3.8, 4.7, 5.5.

On each node:

lvmconf --enable-cluster
systemctl stop lvm2-lvmetad.service
systemctl disable lvm2-lvmetad.service

To revert (if desired later):
lvmconf --disable-cluster

edit /etc/lvm/lvm.conf
    change use_lvmetad to 1

systemctl start lvm2-lvmetad.service
systemctl enable lvm2-lvmetad.service

On one node (cluster must be running):

pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs resource create clvmd ocf:heartbeat:clvm with_cmirrord=true op monitor interval=30s on-fail=fence clone interleave=true ordered=true
pcs constraint order start dlm-clone then clvmd-clone
pcs constraint colocation add clvmd-clone with dlm-clone

pcs constraint show
pcs resource show

If clvmd was already configured earlier, but without cmirrord, can enable the latter with:

pcs resource update clvmd with_cmirrord=true

Identify the drive

iscsiadm -m session -P 3 | grep Target
iscsiadm -m session -P 3 | grep scsi | grep Channel
tree /dev/disk

Partition the drive and create volume group

fdisk /dev/disk/by-path/

respond:  n, p, ...., w, p, q

Refresh partition table view on all other nodes:


pvcreate /dev/disk/by-path/
vgcreate [--clustered y] vg1 /dev/disk/by-path/
vgdisplay vg1

Create logical volume:

lvcreate vg1 --name lv1 --size 9500M
lvcreate vg1 --name lv1 --extents 2544   # find the number of free extents from vgdisplay
lvcreate vg1 --name lv1 --extents 100%FREE

ls -l /dev/vg1/lv1


See here.


Red Hat GFS2 documentation
Configure cluster no-quorum-policy as freeze

pcs property set no-quorum-policy=freeze

By default, the value of no-quorum-policy is set to stop, indicating that once quorum is lost, all the resources on the remaining (minority) partition will immediately be stopped. Typically this default is the safest and most optimal option, but unlike most resources, GFS2 and OCFS2 require quorum to function. When quorum is lost both the applications using the GFS2 mounts and the GFS2 mount itself cannot be correctly stopped in a partition that has become non-quorate. Any attempts to stop these resources without quorum will fail which will ultimately result in the entire cluster being fenced every time quorum is lost.

To address this situation, set the no-quorum-policy=freeze when GFS2 is in use. This means that when quorum is lost, the remaining (minority) partition will do nothing until quorum is regained.

If majority partition remains, it will fence the minority partition.
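
Which partition counts as the majority follows from simple vote arithmetic. A sketch, assuming one vote per node and none of the special votequorum options (two_node, last_man_standing and similar); the function names are made up for illustration:

```shell
# Votes needed for quorum in a cluster of N nodes with one vote each.
# Usage: quorum_votes <total-nodes>
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds if a partition with the given number of live nodes is quorate.
# Usage: has_quorum <total-nodes> <live-nodes-in-partition>
has_quorum() {
    [ "$2" -ge "$(quorum_votes "$1")" ]
}

# For the three-node cluster vc: two nodes keep quorum, one does not.
has_quorum 3 2 && echo "2 of 3: quorate"
has_quorum 3 1 || echo "1 of 3: not quorate"
```

The actual values for a running cluster are reported by corosync-quorumtool.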

To be verified: whether the majority partition can launch a failover replica of a service (that was running inside the minority partition) before fencing the minority partition, or only after fencing it. If before, the two replicas can conflict when no-quorum-policy is freeze (and even when it is stop).

Create file system and Pacemaker resource for it:

mkfs.gfs2 -j 3 -p lock_dlm -t vc:cfs1 /dev/vg1/lv1

-j 3 => pre-create journals for three cluster nodes
-t value => locking table name (must be ClusterName:FilesystemName)
-O => do not ask for confirmation
-J 256 => create journal with size of 256 MB (default: 128, min: 8)
-r <mb> => size of allocation "resource group", usually 256 MB

# view settings
tunegfs2 /dev/vg1/lv1

# change label (note: label is also the lock table name)
tunegfs2 -L vc:cfs1 /dev/vg1/lv1

# some other settings can also later be changed with tunegfs2

pcs resource create cfs1 Filesystem device="/dev/vg1/lv1" directory="/var/mnt/cfs1" fstype=gfs2 \
    run_fsck=no \
    op monitor interval=10s on-fail=fence clone interleave=none

Mount options:
acl         enable ACLs
discard     when on SSD or SCSI devices, enable UNMAP function for blocks being freed
quota=on    enforce quota
quota=account maintain quota, but do not enforce it
noatime     disable update of access time
nodiratime  same for directories

lockproto=lock_nolock => mounting out of cluster (no DLM)

pcs constraint order start clvmd-clone then cfs1-clone
pcs constraint colocation add cfs1-clone with clvmd-clone
mount | grep /var/mnt/cfs1

To suspend write activity on file system (e.g. to create LVM snapshot)

dmsetup suspend /dev/vg1/lv1
[... use LVM to create a snapshot ...]
dmsetup resume /dev/vg1/lv1

To run fsck, stop the resource to unmount file systems from all the nodes:

pcs resource disable cfs1 [--wait=60]    # default wait time is 60 seconds
fsck.gfs2 -y /dev/vg1/lv1
pcs resource enable cfs1

To expand file system:

lvextend  ... vg1/lv1
gfs2_grow /var/mnt/cfs1

When adding node to cluster, provide enough journals first:

# find out how many journals are available
# must unmount file system first
pcs resource disable cfs1
gfs2_edit -p jindex /dev/vg1/lv1 | grep journal
pcs resource enable cfs1

# add one more journal, sized 128 MB
gfs2_jadd /var/mnt/cfs1

# add two more journals sized 256 MB
gfs2_jadd -j 2 -J 256 /var/mnt/cfs1

[... add the node ...]
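
Each node that mounts the file system needs its own journal, so the number of journals to add is the planned node count minus the journals already present. A small sketch of that arithmetic; the function name journals_to_add is made up for illustration:

```shell
# Number of journals that must be added so that every node has one.
# Usage: journals_to_add <existing-journals> <planned-node-count>
journals_to_add() {
    have=$1
    nodes=$2
    need=$(( nodes - have ))
    [ "$need" -gt 0 ] || need=0   # never negative: surplus journals are harmless
    echo "$need"
}

# Example (not run here): growing from 3 to 5 nodes
#   n=$(journals_to_add 3 5)
#   [ "$n" -gt 0 ] && gfs2_jadd -j "$n" /var/mnt/cfs1
```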

Optional – Performance tuning – Increase DLM table sizes

echo 1024 > /sys/kernel/config/dlm/cluster/lkbtbl_size
echo 1024 > /sys/kernel/config/dlm/cluster/rsbtbl_size
echo 1024 > /sys/kernel/config/dlm/cluster/dirtbl_size

Optional – Performance tuning – Tune VFS

# percentage of system memory that can be filled with “dirty” pages before the pdflush kicks in
sysctl -n vm.dirty_background_ratio    # default is 5-10
sysctl -w vm.dirty_background_ratio=20

# discard inodes and directory entries from cache more aggressively
sysctl -n vm.vfs_cache_pressure        # default is 100
sysctl -w vm.vfs_cache_pressure=500

# can be permanently changed in /etc/sysctl.conf
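
As an alternative to editing /etc/sysctl.conf directly, the two settings can be kept in a separate drop-in file. A sketch, assuming the distribution reads /etc/sysctl.d/*.conf at boot; the file name 90-gfs2-tuning.conf and the function name are made up for illustration:

```shell
# Write the VFS tuning values from above into a sysctl drop-in file.
# Usage: write_vfs_tuning [file]
write_vfs_tuning() {
    conf=${1:-/etc/sysctl.d/90-gfs2-tuning.conf}
    cat > "$conf" <<'EOF'
vm.dirty_background_ratio = 20
vm.vfs_cache_pressure = 500
EOF
}

# Example (not run here, needs root):
#   write_vfs_tuning
#   sysctl -p /etc/sysctl.d/90-gfs2-tuning.conf   # load without rebooting
```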

Optional – Tuning


To enable data journaling on a file (default: disabled)

chattr +j /var/mnt/cfs1/path/file    #enable
chattr -j /var/mnt/cfs1/path/file    #disable

Program optimizations:
To drop the cache (after large backups etc.)

echo 3 > /proc/sys/vm/drop_caches

View lock etc. status:

/sys/kernel/debug/gfs2/vc:cfs1/glocks    # decoded here

dlm_tool ls [-n] [-s] [-v] [-w]

dlm_tool plocks lockspace-name [options]

dlm_tool dump [options]
dlm_tool log_plock [options]

dlm_tool lockdump lockspace-name [options]
dlm_tool lockdebug lockspace-name [options]

tunegfs2   /dev/vg1/lv1

Quota manipulations:

mount with "quota=on"

to create quota files:   quotacheck -cug /var/mnt/cfs1

to edit user quota:      export EDITOR=`which nano` ; edquota username
to edit group quota:     export EDITOR=`which nano` ; edquota -g groupname

grace periods:           edquota -t

verify user quota:       quota -u username
verify group quota:      quota -g groupname

report quota:            repquota /var/mnt/cfs1

synchronize quota data between nodes:    quotasync -ug /var/mnt/cfs1

NFS over GFS2: see here


### multipath: man mpathpersist

### LVM: fsfreeze

misc, iscsi:

### add node (also GFS2 journals)
### virtual-ip
### httpd
### nfs
### fence_scsi
### GFS2
### OCFS2
### DRBD
### interface bonding/teaming
### quorum disk, qdiskd, mkqdisk
### GlusterFS
### Lustre
### hawk GUI