Linux Cluster setup
Helpful reading:
Older (CMAN-based) clusters included:
    /etc/cluster/cluster.conf => corosync.conf + cib.xml
    system-config-cluster or conga (luci + ricci) configuration UI => replaced by (still deficient) pcs-gui on port 2224
    rgmanager => pacemaker
    ccs => pcs
Set up a Corosync/Pacemaker cluster named vc, composed of three nodes (vc1, vc2, vc3), based on Fedora Server 22.
Warning: a bug in the virt-manager Clone command may destroy the AppArmor profiles of both the source and target virtual machines.
Replicate virtual machines manually, or at least back up the source machine's profile (located in /etc/apparmor.d/libvirt).
Network set-up:
It is desirable to set up separate network cards for general internet traffic, SAN traffic and cluster backchannel traffic.
Ideally, interfaces should be link-aggregated (bonded or teamed) pairs, with each link in a pair connected to separate stacked switches.
- backchannel/cluster network
    - can be two sub-nets (on separate interfaces) with a corosync redundant ring configured across them (see the corosync sketch after this list)
    - however a bonded interface is easier to set up, more resilient to failures, and allows traffic for other components to be fail-safe too
    - it is also possible to bind multiple addresses to the bonded interface and set up a corosync redundant ring among them, but it does not make much sense (both ring addresses would share the same physical links)
- SAN network
    - can be two sub-nets (on separate interfaces), with iSCSI multipathing configured between them
    - however it can also be bonded: either utilizing one sub-net for all SAN traffic (with disks dual-ported between iSCSI portals within the same sub-net, but different addresses), or binding multiple sub-nets to the bonded interface (with disks dual-ported between iSCSI portals located on different sub-nets)
- general network
    - better be bonded, so each node can be conveniently accessed by a single IP address
    - however a load balancer can instead be configured to use multiple addresses for a node
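A minimal sketch of what a redundant-ring totem section in corosync.conf could look like, assuming two backchannel sub-nets 192.168.10.0/24 and 192.168.20.0/24 (placeholder addresses; pcs normally generates corosync.conf, so this is for illustration only):

    totem {
        version: 2
        cluster_name: vc
        rrp_mode: passive           # redundant ring protocol: passive or active
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.10.0
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.20.0
            mcastport: 5407
        }
    }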
Bonded interfaces are slightly preferable to teamed interfaces for clustering, as all link management for bonded interfaces happens in the kernel and does not involve user-land processes (unlike in the teamed-interface set-up).
It makes sense to use dual-port network cards and scatter the general/SAN/cluster traffic ports between them, so that a single card failure does not bring down an entire traffic category.
If interfaces are bonded or teamed (rather than configured for separate sub-nets), switches should allow cross-traffic, i.e. be either stackable (preferably) or have ISL/IST (inter-switch link/trunking, aka SMLT/DSMLT/R-SMLT). 802.1aq (Shortest Path Bridging) support may be desirable. See here.
Note that an IPMI (AMT/SOL) interface cannot be included in the bond or team without losing its IPMI capability, since it ceases to be individually addressable (having its own IP address).
Thus if IPMI is to be used for fencing or remote management, the IPMI port is to be left alone.
For a real physical NIC, can identify port with
ethtool --identify ethX [10] => flashes LED 10 times
When hosting cluster nodes in KVM, create KVM macvtap interfaces (virtio/Bridge).
To bond interfaces:
Note that bonded/teamed interfaces in most setups do not provide increased data speed or increased bandwidth from one node to another. They provide failover and may provide increased aggregate bandwidth for concurrent connections to multiple target hosts (but not to the same target host). However, see further down below.
Use the NetworkManager GUI:
    "+" -> select Bond
    Add -> Create -> Ethernet -> select eth0
    Add -> Create -> Ethernet -> select eth1
    Link Monitoring:
        MII => check media state
        ARP => use ARP to "ping" specified IP addresses (comma-separated); at least one responds -> link ok (can also configure to require all to respond)
    Mode:
        802.3ad => if linked to a real switch (802.3ad-compliant peer)
        Adaptive load balancing => otherwise (if connected directly or via a hub, not a switch)
    Monitoring frequency = 100 ms
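Alternatively, a rough nmcli equivalent (a sketch only; connection names and the address mirror the ifcfg files below):

    nmcli con add type bond con-name bond0 ifname bond0 mode balance-alb miimon 100
    nmcli con add type bond-slave con-name "bond0 slave 1" ifname eth0 master bond0
    nmcli con add type bond-slave con-name "bond0 slave 2" ifname eth1 master bond0
    nmcli con mod bond0 ipv4.method manual ipv4.addresses 223.100.0.10/24
    nmcli con up bond0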
Or create files:

/etc/sysconfig/network-scripts/ifcfg-bond0:
DEVICE=bond0
NAME=bond0
TYPE=Bond
ONBOOT=yes
BONDING_MASTER=yes
BOOTPROTO=none
#DEFROUTE=yes
#IPV4_FAILURE_FATAL=no
#UUID=9d1c6d47-2246-4c74-9c62-adf260d3fcfc
#BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=balance-rr"
BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=balance-alb"
IPADDR=223.100.0.10
PREFIX=24
#IPV6INIT=yes
#IPV6_AUTOCONF=yes
#IPV6_DEFROUTE=yes
#IPV6_FAILURE_FATAL=no
#IPV6_PEERDNS=yes
#IPV6_PEERROUTES=yes
#IPV6_PRIVACY=no
/etc/sysconfig/network-scripts/ifcfg-bond0_slave_1:
HWADDR=52:54:00:9C:32:50
TYPE=Ethernet
NAME="bond0 slave 1"
#UUID=97b83c1b-de26-43f0-91e7-885ef758d0ec
ONBOOT=yes
MASTER=bond0
#MASTER=9d1c6d47-2246-4c74-9c62-adf260d3fcfc
SLAVE=yes
/etc/sysconfig/network-scripts/ifcfg-bond0_slave_2:
HWADDR=52:54:00:CE:B6:91
TYPE=Ethernet
NAME="bond0 slave 2"
#UUID=2bf74af0-191a-4bf3-b9df-36b930e2cc2f
ONBOOT=yes
MASTER=bond0
#MASTER=9d1c6d47-2246-4c74-9c62-adf260d3fcfc
SLAVE=yes
nmcli device disconnect ifname
nmcli connection reload [ifname]
nmcli connection up ifname
route -n => must go to bond, not slaves
also make sure default route is present
if not, add to /etc/sysconfig/network: GATEWAY=xx.xx.xx.xx
To team interfaces:
    dnf install -y teamd NetworkManager-team
then configure the team interface with the NetworkManager GUI.
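A rough nmcli alternative to the GUI (a sketch; the runner choice and connection names are assumptions):

    nmcli con add type team con-name team0 ifname team0 config '{"runner": {"name": "activebackup"}}'
    nmcli con add type team-slave con-name team0-slave1 ifname eth0 master team0
    nmcli con add type team-slave con-name team0-slave2 ifname eth1 master team0
    nmcli con mod team0 ipv4.method manual ipv4.addresses 223.100.0.10/24
    nmcli con up team0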
As noted above, bonded/teamed interfaces in most setups do not increase bandwidth from one node to another, only provide failover and increased aggregate bandwidth across multiple target hosts. There are a couple of workarounds:
- bonding mode=4 (802.3ad) with lacp_rate=0 and xmit_hash_policy=layer3+4
  The latter hashes using src-(ip,port) and dst-(ip,port). Still not good for a single connection. (See the BONDING_OPTS sketch after this list.)
- Create a separate VLAN for each port (on each of the nodes) and use bonding mode = Adaptive load balancing.
  Then an LACP-compliant bridge will consider the links separate and won't try to correlate the traffic and direct it via a single link according to xmit_hash_policy.
  However this somewhat reduces failover capacity: for example, if Node1.LinkVLAN1 and Node2.LinkVLAN2 both fail.
  It also requires that all peer systems (such as iSCSI servers, iSNS, etc.) have their interfaces configured according to the same VLAN scheme.
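For the first workaround, the bonding options could look like this in ifcfg syntax as used above (a sketch; lacp_rate=0 means "slow", values are illustrative):

    BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=0 xmit_hash_policy=layer3+4"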
Remember to enable jumbo frames: ifconfig ethX mtu 9000 (to make this persistent, add MTU=9000 to the interface's ifcfg file).
Prepare:
Names vc1, vc2 and vc3 below are for the cluster backchannel.
On each node:
    # set node name
    hostnamectl set-hostname vcx
    # disable "captive portal" detection in Fedora
    dnf install -y crudini
    crudini --set /etc/NetworkManager/conf.d/21-connectivity-local.conf connectivity interval 0
    systemctl restart NetworkManager
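If the backchannel names are not resolvable via DNS, they can be listed in /etc/hosts on every node (a sketch; addresses are placeholders):

    # /etc/hosts
    223.100.0.11  vc1
    223.100.0.12  vc2
    223.100.0.13  vc3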
Cluster shells
Install:
    dnf install -y pdsh clustershell
To use pdsh:
#non-interactive:
pdsh -R exec -f 1 -w vc1,vc2,vc3 cmd | dshbak
pdsh -R exec -f 1 -w vc[1-3] cmd | dshbak
#interactive:
pdsh -R exec -f 1 -w vc1,vc2,vc3
pdsh -R exec -f 1 -w vc[1-3]
cmd substitution:
%h => remote host name
%u => remote user name
%n => 0, 1, 2, 3 ...
%% => %
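For example (a sketch), the exec module substitutions can be used to run a command on every node over ssh:

    pdsh -R exec -f 1 -w vc[1-3] ssh -x -l %u %h uptime | dshbak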
To set up for clush, first enable password-less ssh.
Clumsy way:
ssh vc1
    ssh-keygen -t rsa
    ssh vc1 mkdir -p .ssh
    ssh vc2 mkdir -p .ssh
    ssh vc3 mkdir -p .ssh
    ssh vc1 chmod 700 .ssh
    ssh vc2 chmod 700 .ssh
    ssh vc3 chmod 700 .ssh
    cat .ssh/id_rsa.pub | ssh vc1 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc2 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc3 'cat >> .ssh/authorized_keys'
    Ctrl-D
ssh vc2
    ssh-keygen -t rsa
    cat .ssh/id_rsa.pub | ssh vc1 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc2 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc3 'cat >> .ssh/authorized_keys'
    Ctrl-D
ssh vc3
    ssh-keygen -t rsa
    cat .ssh/id_rsa.pub | ssh vc1 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc2 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc3 'cat >> .ssh/authorized_keys'
    Ctrl-D
Cleaner way:
Create id_rsa.pub, id_rsa and authorized_keys on one node, then replicate them to the other nodes in the cluster.
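For example (a sketch, run on vc1; assumes sharing one key pair across all nodes is acceptable):

    ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys
    # will prompt for a password once per node; target ~/.ssh must already exist
    for node in vc2 vc3; do
        rsync -av ~/.ssh/id_rsa ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys ${node}:.ssh/
    done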
To use clush:
clush -w vc1,vc2,vc3 -b [cmd]
clush -w vc[1-3] -b [cmd]
Basic cluster install:
On each node:
    dnf install -y pcs fence-agents-all fence-agents-virsh resource-agents pacemaker
    Optional: dnf install -y dlm lvm2-cluster gfs2-utils iscsi-initiator-utils lsscsi httpd wget

    # either allow high-availability traffic through the firewall:
    systemctl start firewalld.service
    firewall-cmd --permanent --add-service=high-availability
    firewall-cmd --add-service=high-availability
    # or (simpler, less secure) disable the firewall altogether:
    systemctl stop firewalld.service
    iptables --flush

    ## optionally disable SELinux:
    #setenforce 0
    #edit /etc/selinux/config and change SELINUX=enforcing => SELINUX=permissive

    passwd hacluster
    systemctl start pcsd.service
    systemctl enable pcsd.service

    # make sure no http_proxy is exported
    pcs cluster auth vc1.example.com vc2.example.com vc3.example.com -u hacluster -p xxxxx --force
    e.g. pcs cluster auth vc1 vc2 vc3 -u hacluster -p abc123 --force
    # the created auth data is stored in /var/lib/pcsd
On one node:
    pcs cluster setup [--force] --name vc vc1.example.com vc2.example.com vc3.example.com
    pcs cluster start --all
    to stop: pcs cluster stop --all
On each node:
    # to auto-start cluster on reboot
    # alternatively can manually do "pcs cluster start" on each reboot
    pcs cluster enable --all
    to disable: pcs cluster disable --all
View status:
    pcs status
    pcs cluster status
    pcs cluster pcsd-status
    systemctl status corosync.service
    journalctl -xe
    cibadmin --query
    pcs property list [--all] [--defaults]
    corosync-quorumtool -oi [-i]
    corosync-cpgtool
    corosync-cmapctl [ | grep members]
    corosync-cfgtool -s
    pcs cluster cib
Verify current configuration
crm_verify --live --verbose
Start/stop node
pcs cluster stop vc2
pcs status
pcs cluster start vc2
Disable/enable hosting resources on the node (standby state)
pcs cluster standby vc2
pcs status
pcs cluster unstandby vc2
"Transactional" configuration:
    pcs cluster cib my.xml                    # get a copy of the CIB into my.xml
    pcs -f my.xml ... change command ...      # make configuration changes in my.xml
    crm_verify --verbose --xml-file=my.xml    # verify the config
    pcs cluster cib-push my.xml               # push the config from my.xml to the CIB
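For instance (a sketch; the ocf:pacemaker:Dummy agent is used purely for illustration):

    pcs cluster cib my.xml
    pcs -f my.xml resource create test-dummy ocf:pacemaker:Dummy op monitor interval=30s
    crm_verify --verbose --xml-file=my.xml
    pcs cluster cib-push my.xml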
Configure STONITH
fence_virsh - fences a machine by ssh-ing to the VM host and executing sudo virsh destroy <vmid> or sudo virsh reboot <vmid>.
Alternative to virsh: fence_virt/fence_xvm
    dnf install -y fence-virt
STONITH is needed:
- In resource-based (non-quorum) clusters, for obvious reasons.
- In two-node clusters without a quorum disk (a special case of the above), for obvious reasons.
- In quorum-based clusters, because Linux clustering solutions including Corosync and CMAN run as user-level processes and are unable to interdict user-level and kernel-level activity on the node when the cluster node loses connection to the majority-votes partition. By comparison, in VMS CNXMAN is a kernel component which makes all CPUs spin in IOPOST by requeueing the request to the tail of the IOPOST queue until quorum is restored and the node re-joins the majority partition. During this time, no user-level processes can execute, and no new IO can be initiated, except the controlled IO to the quorum disk and SCS datagrams by CNXMAN. When connection to the majority partition is restored, mount verification is further executed, and all file system requests are held off until mount verification completes. If a node restores connection to the majority partition and detects a new incarnation of the cluster, the node executes a bugcheck to reboot.
Configure virsh STONITH
On the vm host:
    define user stonithmgr
    add it to sudoers as: stonithmgr ALL=(ALL) NOPASSWD: ALL
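For example (a sketch; the password must match the one used by the fence agent below):

    useradd -m stonithmgr
    passwd stonithmgr
    echo 'stonithmgr ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/stonithmgr
    chmod 440 /etc/sudoers.d/stonithmgr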
On a cluster node:
    pcs stonith list
    pcs stonith describe fence_virsh
    man fence_virsh
    fence_virsh -h

    # test
    fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --action=metadata
    fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=status
    fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=list
    fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=monitor
    fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=off
Create file /root/stonithmgr-passwd.sh as:
    #!/bin/sh
    echo "vc-cluster-passwd"

chmod 755 /root/stonithmgr-passwd.sh
rsync -av /root/stonithmgr-passwd.sh vc2:/root
rsync -av /root/stonithmgr-passwd.sh vc3:/root

for node in vc1 vc2 vc3; do
    pcs stonith delete fence_${node}_virsh
    pcs stonith create fence_${node}_virsh \
        fence_virsh \
        priority=10 \
        ipaddr=${node}-vmhost \
        login=stonithmgr passwd_script="/root/stonithmgr-passwd.sh" \
        sudo=1 \
        port=${node} \
        pcmk_host_list=${node}
done
pcmk_host_list => vc1.example.com
port => vm name in virsh
ipaddr => name of machine hosting vm
delay=15 => delay for execution of fencing action
STONITH commands:
    pcs stonith show --full
    pcs stonith fence vc2 --off
    pcs stonith confirm vc2
    pcs stonith delete fence_vc1_virsh
Reading:
Management GUI:
    https://vc1:2224
    log in as hacluster
Management GUI, Hawk:
Essential files:
    /etc/corosync/corosync.conf
    /etc/corosync/corosync.xml
    /etc/corosync/authkey
    /var/lib/pacemaker/cib/cib.xml (do not edit manually)
    /etc/sysconfig/corosync
    /etc/sysconfig/corosync-notifyd
    /etc/sysconfig/pacemaker
    /var/log/pacemaker.log
    /var/log/corosync.log (but by default sent to syslog)
    /var/log/pcsd/...
    /var/log/cluster/...
    /var/log/syslog
or on new Fedora:
    journalctl --boot -x
    journalctl --list-boots
    journalctl --follow -x
    journalctl --all -x
    journalctl -xe
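To narrow the output to cluster components only (a sketch):

    journalctl -u corosync -u pacemaker --since "1 hour ago" -x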
Man pages:
    man corosync.conf
    man corosync.xml
    man corosync-xmlproc
    man corosync_overview
    man corosync
    man corosync-cfgtool
    man quorum_overview        // quorum library
    man votequorum_overview    // ...
    man votequorum             // quorum configuration
    man corosync-quorumtool
    man cibadmin
    man cmap_overview          // corosync config registry
    man cmap_keys
    man corosync-cmapctl
    man sam_overview           // library to register a process for a restart on failure
    man cpg_overview           // closed group messaging library w/ virtual synchrony
    man corosync-cpgtool
    man corosync-blackbox      // dump protocol "blackbox" data
    man qb-blackbox
    man ocf-tester
    man crmadmin
    man gfs2
    man tunegfs2
Essential processes:
    corosync     | totem, membership and quorum manager, messaging
    cib          | cluster information base
    stonithd     | fencing daemon
    crmd         | cluster resource management daemon
    lrmd         | local resource management daemon
    pengine      | policy engine
    attrd        | co-ordinates updates to cib, as an intermediary
    dlm_controld | distributed lock manager
    clvmd        | clustered LVM daemon
Alternatives to corosync:
CMAN or CCM + HEARTBEAT
DC ≡ Designated
Controller. One of CRMd instances elected to act as a master. Should
the elected CRMd process or its node fail, a new master is elected. DC
carries out PEngine's instructions by passing them to LRMd on a local
node, or to CRMd peers on other nodes, which in turn pass them to their
LRMd's. Peers then report the results of execution to DC.
Resource categories:
    LSB     | services from /etc/init.d
    Systemd | systemd units
    Upstart | upstart jobs
    OCF     | Open Cluster Framework scripts
    Nagios  | Nagios monitoring plugins
    STONITH | fence agents
pcs resource standards
pcs resource providers
pcs resource agents ocf:heartbeat
pcs resource agents ocf:pacemaker
pcs resource agents systemd
pcs resource agents service
pcs resource agents lsb
pcs resource agents stonith
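For illustration, resources from different categories are created with the same command form (a sketch; resource names and the address are placeholders):

    # OCF agent: a floating IP address
    pcs resource create vip ocf:heartbeat:IPaddr2 ip=223.100.0.100 cidr_netmask=24 op monitor interval=30s
    # systemd unit: Apache web server
    pcs resource create web systemd:httpd op monitor interval=30s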
Resource constraints:
    location   | which nodes the resource can run on
    order      | the order in which the resource is launched
    colocation | where the resource will be placed relative to other resources
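Examples (a sketch, using the placeholder resources above):

    pcs constraint location web prefers vc1=50
    pcs constraint order start vip then web
    pcs constraint colocation add web with vip INFINITY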
Connect to iSCSI drives:
See iSCSI page.
Briefly, on each cluster node:
Install the open-iscsi package. The package is also known as the Linux Open-iSCSI Initiator.
Ubuntu:
    apt-get install open-iscsi
    lsscsi
    gedit /etc/iscsi/iscsid.conf
    /etc/init.d/open-iscsi restart
Fedora:
    dnf install -y iscsi-initiator-utils lsscsi
    systemctl enable iscsid.service
    systemctl start iscsid.service
Display/edit the initiator name; ensure it is unique in the landscape (especially if the system was cloned):
    cat /etc/iscsi/initiatorname.iscsi
    e.g.
    InitiatorName=iqn.1994-05.com.redhat:cbf2ba2dff2 => iqn.1994-05.com.redhat:mynode1
    InitiatorName=iqn.1993-08.org.debian:01:16c1be18eee8 => iqn.1993-08.org.debian:01:myhost2
Optional: edit configuration
gedit /etc/iscsi/iscsid.conf
restart the service
Discover the iSCSI targets on a specific host:
    iscsiadm -m discovery -t sendtargets -p qnap1x:3260 \
        --name discovery.sendtargets.auth.authmethod --value CHAP \
        --name discovery.sendtargets.auth.username --value sergey \
        --name discovery.sendtargets.auth.password --value abc123abc123
Check the available iSCSI node(s) to connect to:
    iscsiadm -m node
Delete node(s) you don't want to connect to when the service is on with the following command:
    iscsiadm -m node --op delete --targetname <target_iqn>
Configure authentication for the remaining targets:
# identical settings for each of the three targets (xs1, xs2, xs3):
# set CHAP auth method, username and password, then log in
for target in xs1 xs2 xs3; do
    iqn="iqn.2004-04.com.qnap:ts-569l:iscsi.${target}.e4cd7c"
    iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --op=update --name node.session.auth.authmethod --value=CHAP
    iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --op=update --name node.session.auth.username --value=sergey
    iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --op=update --name node.session.auth.password --value=abc123abc123
    iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --login
done
You should be able to see the login message as below:
    Login session [iface: default, target: iqn.2004-04.com:NAS:iSCSI.ForUbuntu.B9281B, portal: 10.8.12.31,3260] [ OK ]
Restart open-iscsi to login to all of the available nodes.
    Fedora: systemctl restart iscsid.service
    Ubuntu: /etc/init.d/open-iscsi restart
Check the device status with dmesg:
    dmesg | tail -30
List available devices:
    lsscsi
    lsscsi -s
    lsscsi -dg
    lsscsi -c
    lsscsi -Lvl
    iscsiadm -m session [-P 3] [-o show]
For multipathing, see a section below.
Format volume with cluster LVM
See RHEL7 LVM Administration, chapters 1.4, 3.1, 4.3.3, 4.3.8, 4.7, 5.5.
On each node:
    lvmconf --enable-cluster
    systemctl stop lvm2-lvmetad.service
    systemctl disable lvm2-lvmetad.service
To revert (if desired later):
    lvmconf --disable-cluster
    edit /etc/lvm/lvm.conf and change use_lvmetad to 1
    systemctl start lvm2-lvmetad.service
    systemctl enable lvm2-lvmetad.service
On one node (cluster must be running):
    pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
    pcs resource create clvmd ocf:heartbeat:clvm with_cmirrord=true op monitor interval=30s on-fail=fence clone interleave=true ordered=true
    pcs constraint order start dlm-clone then clvmd-clone
    pcs constraint colocation add clvmd-clone with dlm-clone
    pcs constraint show
    pcs resource show
If clvmd was already configured earlier, but without cmirrord, the latter can be enabled with:
    pcs resource update clvmd with_cmirrord=true
Identify the drive:
    iscsiadm -m session -P 3 | grep Target
    iscsiadm -m session -P 3 | grep scsi | grep Channel
    lsscsi
    tree /dev/disk
Partition the drive and create a volume group:
    fdisk /dev/disk/by-path/ip-192.168.73.2:3260-iscsi-iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c-lun-0
        respond: n, p, ...., w, p, q
    Refresh the partition table view on all other nodes:
        partprobe
    pvcreate /dev/disk/by-path/ip-192.168.73.2:3260-iscsi-iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c-lun-0-part1
    vgcreate [--clustered y] vg1 /dev/disk/by-path/ip-192.168.73.2:3260-iscsi-iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c-lun-0-part1
    vgdisplay vg1
    pvdisplay
    vgs
Create logical volume:
    lvcreate vg1 --name lv1 --size 9500M
    lvcreate vg1 --name lv1 --extents 2544        # find the number of free extents from vgdisplay
    lvcreate vg1 --name lv1 --extents 100%FREE
    lvdisplay
    ls -l /dev/vg1/lv1
Multipathing
See here.
GFS2
Red Hat GFS2 documentation
- File system name must be unique in a cluster (DLM lock names derive from it).
- The file system hosts journal files. One journal is required for each cluster node that mounts this file system.
  Default journal size: 128 MB (per journal). Minimum journal size: 8 MB.
  For large file systems, increase to 256 MB.
  If the journal is too small, requests will have to wait for journal space, and performance will suffer.
- Do not use SELinux with GFS2. SELinux stores information in every file's extended attributes, which will cause significant GFS2 slowdown.
- If a GFS2 file system is mounted manually (rather than through a Pacemaker resource), unmount it manually.
  Otherwise the shutdown script will kill cluster processes and will then try to unmount the GFS2 file system, but without those processes the unmount will fail and the system will hang (and a hardware reboot will be required).
Configure cluster no-quorum-policy as freeze:
    pcs property set no-quorum-policy=freeze
By default, no-quorum-policy is set to stop, indicating that once quorum is lost, all the resources on the remaining (minority) partition will immediately be stopped. Typically this default is the safest and most optimal option, but unlike most resources, GFS2 and OCFS2 require quorum to function. When quorum is lost, both the applications using the GFS2 mounts and the GFS2 mount itself cannot be correctly stopped in a partition that has become non-quorate. Any attempt to stop these resources without quorum will fail, which will ultimately result in the entire cluster being fenced every time quorum is lost.
To address this situation, set no-quorum-policy=freeze when GFS2 is in use. This means that when quorum is lost, the remaining (minority) partition will do nothing until quorum is regained.
If a majority partition remains, it will fence the minority partition.
Find out for sure: whether the majority partition can launch a failover replica of a service (that was running inside a minority partition) before fencing the minority partition, or only after fencing it. If before, two replicas can conflict when no-quorum-policy is freeze (and even when it is stop).
Create the file system and a Pacemaker resource for it:
    mkfs.gfs2 -j 3 -p lock_dlm -t vc:cfs1 /dev/vg1/lv1
        -j 3 => pre-create journals for three cluster nodes
        -t value => locking table name (must be ClusterName:FilesystemName)
        -O => do not ask for confirmation
        -J 256 => create journals with a size of 256 MB (default: 128, min: 8)
        -r <mb> => size of allocation "resource group", usually 256 MB

    # view settings
    tunegfs2 /dev/vg1/lv1
    # change label (note: the label is also the lock table name)
    tunegfs2 -L vc:cfs1 /dev/vg1/lv1
    # some other settings can also later be changed with tunegfs2

    pcs resource create cfs1 Filesystem device="/dev/vg1/lv1" directory="/var/mnt/cfs1" fstype=gfs2 \
        options="noatime,nodiratime" run_fsck=no \
        op monitor interval=10s on-fail=fence clone interleave=none
Mount options:
    acl                     enable ACLs
    discard                 when on SSD or SCSI devices, enable the UNMAP function for blocks being freed
    quota=on                enforce quota
    quota=account           maintain quota, but do not enforce it
    noatime                 disable update of access time
    nodiratime              same for directories
    lockproto=lock_nolock   mounting out of cluster (no DLM)
pcs constraint order start clvmd-clone then cfs1-clone
pcs constraint colocation add cfs1-clone with clvmd-clone
mount | grep /var/mnt/cfs1
To suspend write activity on the file system (e.g. to create an LVM snapshot):
    dmsetup suspend /dev/vg1/lv1
    [... use LVM to create a snapshot ...]
    dmsetup resume /dev/vg1/lv1
To run fsck, stop the resource to unmount the file system from all the nodes:
    pcs resource disable cfs1 [--wait=60]    # default wait time is 60 seconds
    fsck.gfs2 -y /dev/vg1/lv1
    pcs resource enable cfs1
To expand file system:
lvextend ... vg1/lv1
gfs2_grow /var/mnt/cfs1
When adding a node to the cluster, provide enough journals first:
    # find out how many journals are available
    # must unmount the file system first
    pcs resource disable cfs1
    gfs2_edit -p jindex /dev/vg1/lv1 | grep journal
    pcs resource enable cfs1
    # add one more journal, sized 128 MB
    gfs2_jadd /var/mnt/cfs1
    # add two more journals sized 256 MB
    gfs2_jadd -j 2 -J 256 /var/mnt/cfs1
    [... add the node ...]
Optional – Performance tuning – Increase DLM table sizes
    echo 1024 > /sys/kernel/config/dlm/cluster/lkbtbl_size
    echo 1024 > /sys/kernel/config/dlm/cluster/rsbtbl_size
    echo 1024 > /sys/kernel/config/dlm/cluster/dirtbl_size
Optional – Performance tuning – Tune VFS
    # percentage of system memory that can be filled with "dirty" pages before pdflush kicks in
    sysctl -n vm.dirty_background_ratio    # default is 5-10
    sysctl -w vm.dirty_background_ratio=20
    # discard inodes and directory entries from cache more aggressively
    sysctl -n vm.vfs_cache_pressure        # default is 100
    sysctl -w vm.vfs_cache_pressure=500
    # can be permanently changed in /etc/sysctl.conf
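For example, to persist these values (a sketch; apply afterwards with "sysctl -p"):

    # append to /etc/sysctl.conf
    vm.dirty_background_ratio = 20
    vm.vfs_cache_pressure = 500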
Optional – Tuning
/sys/fs/gfs2/vc:cfs1/tune/...
To enable data journaling on a file (default: disabled):
    chattr +j /var/mnt/cfs1/path/file    # enable
    chattr -j /var/mnt/cfs1/path/file    # disable
Program optimizations:
- preallocate file space – use fallocate(...) if possible
- flock(...) is faster than fcntl(...) with GFS2
- with fcntl(...), l_pid may refer to a process on a different node
To drop the cache (after large backups etc.):
    echo 3 > /proc/sys/vm/drop_caches
View lock etc. status:
    /sys/kernel/debug/gfs2/vc:cfs1/glocks    # decoded here
    dlm_tool ls [-n] [-s] [-v] [-w]
    dlm_tool plocks lockspace-name [options]
    dlm_tool dump [options]
    dlm_tool log_plock [options]
    dlm_tool lockdump lockspace-name [options]
    dlm_tool lockdebug lockspace-name [options]
    tunegfs2 /dev/vg1/lv1
Quota manipulations:
    mount with "quota=on"
    to create quota files:                  quotacheck -cug /var/mnt/cfs1
    to edit user quota:                     export EDITOR=`which nano` ; edquota username
    to edit group quota:                    export EDITOR=`which nano` ; edquota -g groupname
    grace periods:                          edquota -t
    verify user quota:                      quota -u username
    verify group quota:                     quota -g groupname
    report quota:                           repquota /var/mnt/cfs1
    synchronize quota data between nodes:   quotasync -ug /var/mnt/cfs1
NFS over GFS2: see here
=========
### multipath: man mpathpersist
https://www.suse.com/documentation/sles-12/stor_admin/data/sec_multipath_mpiotools.html
### LVM: fsfreeze
misc, iscsi:
https://www.ibm.com/developerworks/community/blogs/mhhaque/entry/configure_two_node_highly_available_cluster_using_kvm_fencing_on_rhel7?lang=en
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/IPaddr2
### add node (also GFS2 journals)
### virtual-ip
### httpd
### nfs
### fence_scsi
### GFS2
### OCFS2
### DRBD
### interface bonding/teaming
### quorum disk, qdiskd, mkqdisk
### GlusterFS
### Lustre
### hawk GUI https://github.com/ClusterLabs/hawk
### http://www.spinics.net/lists/cluster/threads.html