Linux Cluster setup
Helpful reading:
Older (CMAN-based) clusters included:
    /etc/cluster/cluster.conf => corosync.conf + cib.xml
    system-config-cluster or conga (luci + ricci) configuration UI => replaced by (still deficient) pcs-gui on port 2224
    rgmanager => pacemaker
    ccs => pcs
Set up a Corosync/Pacemaker cluster named vc, composed of three nodes (vc1, vc2, vc3), based on Fedora Server 22.
Warning: a bug in the virt-manager Clone command may destroy the AppArmor profiles of both the source and target virtual machines.
Replicate virtual machines manually, or at least back up the source machine's profile (located in /etc/apparmor.d/libvirt).
Network set-up:
It is desirable to set up separate network cards for general internet traffic, SAN traffic and cluster backchannel traffic.
Ideally, interfaces should be link-aggregated (bonded or teamed) pairs, with each link in a pair connected to separate stacked switches.
- backchannel/cluster network
    - can be two sub-nets (on separate interfaces) with a corosync redundant ring configured across them (see the corosync sketch after this list)
    - however a bonded interface is easier to set up, more resilient to failures, and allows traffic for other components to be fail-safe too
    - it is also possible to bind multiple addresses to the bonded interface and set up a corosync redundant ring among them, but it does not make much sense (both ring addresses would share the same physical links)
- SAN network
    - can be two sub-nets (on separate interfaces), with iSCSI multipathing configured between them
    - however it can also be bonded: either utilizing one sub-net for all SAN traffic (with disks dual-ported between iSCSI portals within the same sub-net, but different addresses), or binding multiple sub-nets to the bonded interface (with disks dual-ported between iSCSI portals located on different sub-nets)
- general network
    - better be bonded, so each node can be conveniently accessed by a single IP address
    - however a load balancer can instead be configured to use multiple addresses for a node
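A minimal sketch of what a redundant-ring totem section in corosync.conf could look like, assuming two backchannel sub-nets 192.168.10.0/24 and 192.168.20.0/24 (placeholder addresses; pcs normally generates corosync.conf, so this is for illustration only):

    totem {
        version: 2
        cluster_name: vc
        rrp_mode: passive           # redundant ring protocol: passive or active
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.10.0
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.20.0
            mcastport: 5407
        }
    }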
Bonded interfaces are slightly preferable to teamed interfaces for clustering, as all link management for bonded interfaces happens in the kernel and does not involve user-land processes (unlike in the teamed-interface set-up).
It makes sense to use dual-port network cards and scatter the general/SAN/cluster traffic ports between them, so that a single card failure does not bring down an entire traffic category.
If interfaces are bonded or teamed (rather than configured for separate sub-nets), switches should allow cross-traffic, i.e. be either stackable (preferably) or have ISL/IST (inter-switch link/trunking, aka SMLT/DSMLT/R-SMLT). 802.1aq (Shortest Path Bridging) support may be desirable. See here.
Note that an IPMI (AMT/SOL) interface cannot be included in the bond or team without losing its IPMI capability, since it ceases to be individually addressable (having its own IP address).
Thus if IPMI is to be used for fencing or remote management, the IPMI port is to be left alone.
For a real physical NIC, can identify port with
ethtool --identify ethX [10] => flashes LED 10 times
When hosting cluster nodes in KVM, create KVM macvtap interfaces (virtio/Bridge).
To bond interfaces:
Note that bonded/teamed interfaces in most setups do not provide increased data speed or increased bandwidth from one node to another. They provide failover and may provide increased aggregate bandwidth for concurrent connections to multiple target hosts (but not to the same target host). However, see further down below.
Use the NetworkManager GUI:
    "+" -> select Bond
    Add -> Create -> Ethernet -> select eth0
    Add -> Create -> Ethernet -> select eth1
    Link Monitoring:
        MII => check media state
        ARP => use ARP to "ping" specified IP addresses (comma-separated); at least one responds -> link ok (can also configure to require all to respond)
    Mode:
        802.3ad => if linked to a real switch (802.3ad-compliant peer)
        Adaptive load balancing => otherwise (if connected directly or via a hub, not a switch)
    Monitoring frequency = 100 ms
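Alternatively, a rough nmcli equivalent (a sketch only; connection names and the address mirror the ifcfg files below):

    nmcli con add type bond con-name bond0 ifname bond0 mode balance-alb miimon 100
    nmcli con add type bond-slave con-name "bond0 slave 1" ifname eth0 master bond0
    nmcli con add type bond-slave con-name "bond0 slave 2" ifname eth1 master bond0
    nmcli con mod bond0 ipv4.method manual ipv4.addresses 223.100.0.10/24
    nmcli con up bond0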
Or create files:

/etc/sysconfig/network-scripts/ifcfg-bond0:
DEVICE=bond0
NAME=bond0
TYPE=Bond
ONBOOT=yes
BONDING_MASTER=yes
BOOTPROTO=none
#DEFROUTE=yes
#IPV4_FAILURE_FATAL=no
#UUID=9d1c6d47-2246-4c74-9c62-adf260d3fcfc
#BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=balance-rr"
BONDING_OPTS="miimon=100 updelay=0 downdelay=0 mode=balance-alb"
IPADDR=223.100.0.10
PREFIX=24
#IPV6INIT=yes
#IPV6_AUTOCONF=yes
#IPV6_DEFROUTE=yes
#IPV6_FAILURE_FATAL=no
#IPV6_PEERDNS=yes
#IPV6_PEERROUTES=yes
#IPV6_PRIVACY=no
/etc/sysconfig/network-scripts/ifcfg-bond0_slave_1:
HWADDR=52:54:00:9C:32:50
TYPE=Ethernet
NAME="bond0 slave 1"
#UUID=97b83c1b-de26-43f0-91e7-885ef758d0ec
ONBOOT=yes
MASTER=bond0
#MASTER=9d1c6d47-2246-4c74-9c62-adf260d3fcfc
SLAVE=yes
/etc/sysconfig/network-scripts/ifcfg-bond0_slave_2:
HWADDR=52:54:00:CE:B6:91
TYPE=Ethernet
NAME="bond0 slave 2"
#UUID=2bf74af0-191a-4bf3-b9df-36b930e2cc2f
ONBOOT=yes
MASTER=bond0
#MASTER=9d1c6d47-2246-4c74-9c62-adf260d3fcfc
SLAVE=yes
nmcli device disconnect ifname
nmcli connection reload [ifname]
nmcli connection up ifname
route -n => must go to bond, not slaves
also make sure default route is present
if not, add to /etc/sysconfig/network: GATEWAY=xx.xx.xx.xx
To team interfaces:
    dnf install -y teamd NetworkManager-team
then configure the team interface with the NetworkManager GUI.
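A rough nmcli alternative to the GUI (a sketch; the runner choice and connection names are assumptions):

    nmcli con add type team con-name team0 ifname team0 config '{"runner": {"name": "activebackup"}}'
    nmcli con add type team-slave con-name team0-slave1 ifname eth0 master team0
    nmcli con add type team-slave con-name team0-slave2 ifname eth1 master team0
    nmcli con mod team0 ipv4.method manual ipv4.addresses 223.100.0.10/24
    nmcli con up team0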
As noted above, bonded/teamed interfaces in most setups do not increase bandwidth from one node to another, only provide failover and increased aggregate bandwidth across multiple target hosts. There are a couple of workarounds:
- bonding mode=4 (802.3ad) with lacp_rate=0 and xmit_hash_policy=layer3+4
  The latter hashes using src-(ip,port) and dst-(ip,port). Still not good for a single connection. (See the BONDING_OPTS sketch after this list.)
- Create a separate VLAN for each port (on each of the nodes) and use bonding mode = Adaptive load balancing.
  Then an LACP-compliant bridge will consider the links separate and won't try to correlate the traffic and direct it via a single link according to xmit_hash_policy.
  However this somewhat reduces failover capacity: for example, if Node1.LinkVLAN1 and Node2.LinkVLAN2 both fail.
  It also requires that all peer systems (such as iSCSI servers, iSNS, etc.) have their interfaces configured according to the same VLAN scheme.
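For the first workaround, the bonding options could look like this in ifcfg syntax as used above (a sketch; lacp_rate=0 means "slow", values are illustrative):

    BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=0 xmit_hash_policy=layer3+4"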
Remember to enable jumbo frames: ifconfig ethX mtu 9000 (to make this persistent, add MTU=9000 to the interface's ifcfg file).
Prepare:
Names vc1, vc2 and vc3 below are for the cluster backchannel.
On each node:
    # set node name
    hostnamectl set-hostname vcx
    # disable "captive portal" detection in Fedora
    dnf install -y crudini
    crudini --set /etc/NetworkManager/conf.d/21-connectivity-local.conf connectivity interval 0
    systemctl restart NetworkManager
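If the backchannel names are not resolvable via DNS, they can be listed in /etc/hosts on every node (a sketch; addresses are placeholders):

    # /etc/hosts
    223.100.0.11  vc1
    223.100.0.12  vc2
    223.100.0.13  vc3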
Cluster shells
Install:
    dnf install -y pdsh clustershell
To use pdsh:
#non-interactive:
pdsh -R exec -f 1 -w vc1,vc2,vc3 cmd | dshbak
pdsh -R exec -f 1 -w vc[1-3] cmd | dshbak
#interactive:
pdsh -R exec -f 1 -w vc1,vc2,vc3
pdsh -R exec -f 1 -w vc[1-3]
cmd substitution:
%h => remote host name
%u => remote user name
%n => 0, 1, 2, 3 ...
%% => %
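For example (a sketch), the exec module substitutions can be used to run a command on every node over ssh:

    pdsh -R exec -f 1 -w vc[1-3] ssh -x -l %u %h uptime | dshbak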
To set up for clush, first enable password-less ssh.
Clumsy way:
ssh vc1
    ssh-keygen -t rsa
    ssh vc1 mkdir -p .ssh
    ssh vc2 mkdir -p .ssh
    ssh vc3 mkdir -p .ssh
    ssh vc1 chmod 700 .ssh
    ssh vc2 chmod 700 .ssh
    ssh vc3 chmod 700 .ssh
    cat .ssh/id_rsa.pub | ssh vc1 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc2 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc3 'cat >> .ssh/authorized_keys'
    Ctrl-D
ssh vc2
    ssh-keygen -t rsa
    cat .ssh/id_rsa.pub | ssh vc1 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc2 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc3 'cat >> .ssh/authorized_keys'
    Ctrl-D
ssh vc3
    ssh-keygen -t rsa
    cat .ssh/id_rsa.pub | ssh vc1 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc2 'cat >> .ssh/authorized_keys'
    cat .ssh/id_rsa.pub | ssh vc3 'cat >> .ssh/authorized_keys'
    Ctrl-D
Cleaner way:
Create id_rsa.pub, id_rsa and authorized_keys on one node, then replicate them to the other nodes in the cluster.
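For example (a sketch, run on vc1; assumes sharing one key pair across all nodes is acceptable):

    ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 700 ~/.ssh; chmod 600 ~/.ssh/authorized_keys
    # will prompt for a password once per node; target ~/.ssh must already exist
    for node in vc2 vc3; do
        rsync -av ~/.ssh/id_rsa ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys ${node}:.ssh/
    done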
To use clush:
clush -w vc1,vc2,vc3 -b [cmd]
clush -w vc[1-3] -b [cmd]
Basic cluster install:
On each node:
    dnf install -y pcs fence-agents-all fence-agents-virsh resource-agents pacemaker
    Optional: dnf install -y dlm lvm2-cluster gfs2-utils iscsi-initiator-utils lsscsi httpd wget

    # either allow high-availability traffic through the firewall:
    systemctl start firewalld.service
    firewall-cmd --permanent --add-service=high-availability
    firewall-cmd --add-service=high-availability
    # or (simpler, less secure) disable the firewall altogether:
    systemctl stop firewalld.service
    iptables --flush

    ## optionally disable SELinux:
    #setenforce 0
    #edit /etc/selinux/config and change SELINUX=enforcing => SELINUX=permissive

    passwd hacluster
    systemctl start pcsd.service
    systemctl enable pcsd.service

    # make sure no http_proxy is exported
    pcs cluster auth vc1.example.com vc2.example.com vc3.example.com -u hacluster -p xxxxx --force
    e.g. pcs cluster auth vc1 vc2 vc3 -u hacluster -p abc123 --force
    # the created auth data is stored in /var/lib/pcsd
On one node:
    pcs cluster setup [--force] --name vc vc1.example.com vc2.example.com vc3.example.com
    pcs cluster start --all
    to stop: pcs cluster stop --all
On each node:
    # to auto-start cluster on reboot
    # alternatively can manually do "pcs cluster start" on each reboot
    pcs cluster enable --all
    to disable: pcs cluster disable --all
View status:
    pcs status
    pcs cluster status
    pcs cluster pcsd-status
    systemctl status corosync.service
    journalctl -xe
    cibadmin --query
    pcs property list [--all] [--defaults]
    corosync-quorumtool -oi [-i]
    corosync-cpgtool
    corosync-cmapctl [ | grep members]
    corosync-cfgtool -s
    pcs cluster cib
Verify current configuration
crm_verify --live --verbose
Start/stop node
pcs cluster stop vc2
pcs status
pcs cluster start vc2
Disable/enable hosting resources on the node (standby state)
pcs cluster standby vc2
pcs status
pcs cluster unstandby vc2
"Transactional" configuration:
    pcs cluster cib my.xml                    # get a copy of the CIB into my.xml
    pcs -f my.xml ... change command ...      # make configuration changes in my.xml
    crm_verify --verbose --xml-file=my.xml    # verify the config
    pcs cluster cib-push my.xml               # push the config from my.xml to the CIB
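For instance (a sketch; the ocf:pacemaker:Dummy agent is used purely for illustration):

    pcs cluster cib my.xml
    pcs -f my.xml resource create test-dummy ocf:pacemaker:Dummy op monitor interval=30s
    crm_verify --verbose --xml-file=my.xml
    pcs cluster cib-push my.xml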
Configure STONITH
fence_virsh - fences a machine by ssh-ing to the VM host and executing sudo virsh destroy <vmid> or sudo virsh reboot <vmid>.
Alternative to virsh: fence_virt/fence_xvm
    dnf install -y fence-virt
STONITH is needed:
- In resource-based (non-quorum) clusters, for obvious reasons.
- In two-node clusters without a quorum disk (a special case of the above), for obvious reasons.
- In quorum-based clusters, because Linux clustering solutions including Corosync and CMAN run as user-level processes and are unable to interdict user-level and kernel-level activity on the node when the cluster node loses connection to the majority-votes partition. By comparison, in VMS CNXMAN is a kernel component which makes all CPUs spin in IOPOST by requeueing the request to the tail of the IOPOST queue until quorum is restored and the node re-joins the majority partition. During this time, no user-level processes can execute, and no new IO can be initiated, except the controlled IO to the quorum disk and SCS datagrams by CNXMAN. When connection to the majority partition is restored, mount verification is further executed, and all file system requests are held off until mount verification completes. If a node restores connection to the majority partition and detects a new incarnation of the cluster, the node executes a bugcheck to reboot.
Configure virsh STONITH
On the vm host:
    define user stonithmgr
    add it to sudoers as: stonithmgr ALL=(ALL) NOPASSWD: ALL
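For example (a sketch; the password must match the one used by the fence agent below):

    useradd -m stonithmgr
    passwd stonithmgr
    echo 'stonithmgr ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/stonithmgr
    chmod 440 /etc/sudoers.d/stonithmgr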
On a cluster node:
    pcs stonith list
    pcs stonith describe fence_virsh
    man fence_virsh
    fence_virsh -h

    # test
    fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --action=metadata
    fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=status
    fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=list
    fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=monitor
    fence_virsh --ip=vc2-vmhost --username=stonithmgr --password=vc-cluster --verbose --plug=vc2 --use-sudo --action=off
Create file /root/stonithmgr-passwd.sh as:
    #!/bin/sh
    echo "vc-cluster-passwd"

chmod 755 /root/stonithmgr-passwd.sh
rsync -av /root/stonithmgr-passwd.sh vc2:/root
rsync -av /root/stonithmgr-passwd.sh vc3:/root

for node in vc1 vc2 vc3; do
    pcs stonith delete fence_${node}_virsh
    pcs stonith create fence_${node}_virsh \
        fence_virsh \
        priority=10 \
        ipaddr=${node}-vmhost \
        login=stonithmgr passwd_script="/root/stonithmgr-passwd.sh" \
        sudo=1 \
        port=${node} \
        pcmk_host_list=${node}
done
pcmk_host_list => vc1.example.com
port => vm name in virsh
ipaddr => name of machine hosting vm
delay=15 => delay for execution of fencing action
STONITH commands:
    pcs stonith show --full
    pcs stonith fence vc2 --off
    pcs stonith confirm vc2
    pcs stonith delete fence_vc1_virsh
Reading:
Management GUI:
    https://vc1:2224
    log in as hacluster
Management GUI, Hawk:
Essential files:
    /etc/corosync/corosync.conf
    /etc/corosync/corosync.xml
    /etc/corosync/authkey
    /var/lib/pacemaker/cib/cib.xml (do not edit manually)
    /etc/sysconfig/corosync
    /etc/sysconfig/corosync-notifyd
    /etc/sysconfig/pacemaker
    /var/log/pacemaker.log
    /var/log/corosync.log (but by default sent to syslog)
    /var/log/pcsd/...
    /var/log/cluster/...
    /var/log/syslog
or on new Fedora:
    journalctl --boot -x
    journalctl --list-boots
    journalctl --follow -x
    journalctl --all -x
    journalctl -xe
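To narrow the output to cluster components only (a sketch):

    journalctl -u corosync -u pacemaker --since "1 hour ago" -x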
Man pages:
    man corosync.conf
    man corosync.xml
    man corosync-xmlproc
    man corosync_overview
    man corosync
    man corosync-cfgtool
    man quorum_overview        // quorum library
    man votequorum_overview    // ...
    man votequorum             // quorum configuration
    man corosync-quorumtool
    man cibadmin
    man cmap_overview          // corosync config registry
    man cmap_keys
    man corosync-cmapctl
    man sam_overview           // library to register a process for a restart on failure
    man cpg_overview           // closed group messaging library w/ virtual synchrony
    man corosync-cpgtool
    man corosync-blackbox      // dump protocol "blackbox" data
    man qb-blackbox
    man ocf-tester
    man crmadmin
    man gfs2
    man tunegfs2
Essential processes:
    corosync     | totem, membership and quorum manager, messaging
    cib          | cluster information base
    stonithd     | fencing daemon
    crmd         | cluster resource management daemon
    lrmd         | local resource management daemon
    pengine      | policy engine
    attrd        | co-ordinates updates to cib, as an intermediary
    dlm_controld | distributed lock manager
    clvmd        | clustered LVM daemon
Alternatives to corosync:
CMAN or CCM + HEARTBEAT
DC ≡ Designated
Controller. One of CRMd instances elected to act as a master. Should
the elected CRMd process or its node fail, a new master is elected. DC
carries out PEngine's instructions by passing them to LRMd on a local
node, or to CRMd peers on other nodes, which in turn pass them to their
LRMd's. Peers then report the results of execution to DC.
Resource categories:
    LSB     | services from /etc/init.d
    Systemd | systemd units
    Upstart | upstart jobs
    OCF     | Open Cluster Framework scripts
    Nagios  | Nagios monitoring plugins
    STONITH | fence agents
pcs resource standards
pcs resource providers
pcs resource agents ocf:heartbeat
pcs resource agents ocf:pacemaker
pcs resource agents systemd
pcs resource agents service
pcs resource agents lsb
pcs resource agents stonith
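For illustration, resources from different categories are created with the same command form (a sketch; resource names and the address are placeholders):

    # OCF agent: a floating IP address
    pcs resource create vip ocf:heartbeat:IPaddr2 ip=223.100.0.100 cidr_netmask=24 op monitor interval=30s
    # systemd unit: Apache web server
    pcs resource create web systemd:httpd op monitor interval=30s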
Resource constraints:
    location   | which nodes the resource can run on
    order      | the order in which the resource is launched
    colocation | where the resource will be placed relative to other resources
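Examples (a sketch, using the placeholder resources above):

    pcs constraint location web prefers vc1=50
    pcs constraint order start vip then web
    pcs constraint colocation add web with vip INFINITY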
Connect to iSCSI drives:
See iSCSI page.
Briefly, on each cluster node:
Install the open-iscsi package. The package is also known as the Linux Open-iSCSI Initiator.
Ubuntu:
    apt-get install open-iscsi
    lsscsi
    gedit /etc/iscsi/iscsid.conf
    /etc/init.d/open-iscsi restart
Fedora:
    dnf install -y iscsi-initiator-utils lsscsi
    systemctl enable iscsid.service
    systemctl start iscsid.service
Display/edit the initiator name; ensure it is unique in the landscape (especially if the system was cloned):
    cat /etc/iscsi/initiatorname.iscsi
    e.g.
    InitiatorName=iqn.1994-05.com.redhat:cbf2ba2dff2 => iqn.1994-05.com.redhat:mynode1
    InitiatorName=iqn.1993-08.org.debian:01:16c1be18eee8 => iqn.1993-08.org.debian:01:myhost2
Optional: edit configuration
gedit /etc/iscsi/iscsid.conf
restart the service
Discover the iSCSI targets on a specific host:
    iscsiadm -m discovery -t sendtargets -p qnap1x:3260 \
        --name discovery.sendtargets.auth.authmethod --value CHAP \
        --name discovery.sendtargets.auth.username --value sergey \
        --name discovery.sendtargets.auth.password --value abc123abc123
Check the available iSCSI node(s) to connect to:
    iscsiadm -m node
Delete node(s) you don't want to connect to when the service is on with the following command:
    iscsiadm -m node --op delete --targetname <target_iqn>
Configure authentication for the remaining targets:
# identical settings for each of the three targets (xs1, xs2, xs3):
# set CHAP auth method, username and password, then log in
for target in xs1 xs2 xs3; do
    iqn="iqn.2004-04.com.qnap:ts-569l:iscsi.${target}.e4cd7c"
    iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --op=update --name node.session.auth.authmethod --value=CHAP
    iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --op=update --name node.session.auth.username --value=sergey
    iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --op=update --name node.session.auth.password --value=abc123abc123
    iscsiadm --mode node --targetname "$iqn" -p qnap1x:3260 --login
done
You should be able to see the login message as below:
    Login session [iface: default, target: iqn.2004-04.com:NAS:iSCSI.ForUbuntu.B9281B, portal: 10.8.12.31,3260] [ OK ]
Restart open-iscsi to login to all of the available nodes.
    Fedora: systemctl restart iscsid.service
    Ubuntu: /etc/init.d/open-iscsi restart
Check the device status with dmesg:
    dmesg | tail -30
List available devices:
    lsscsi
    lsscsi -s
    lsscsi -dg
    lsscsi -c
    lsscsi -Lvl
    iscsiadm -m session [-P 3] [-o show]
For multipathing, see a section below.
Format volume with cluster LVM
See RHEL7 LVM Administration, chapters 1.4, 3.1, 4.3.3, 4.3.8, 4.7, 5.5.
On each node:
    lvmconf --enable-cluster
    systemctl stop lvm2-lvmetad.service
    systemctl disable lvm2-lvmetad.service
To revert (if desired later):
    lvmconf --disable-cluster
    edit /etc/lvm/lvm.conf and change use_lvmetad to 1
    systemctl start lvm2-lvmetad.service
    systemctl enable lvm2-lvmetad.service
On one node (cluster must be running):
    pcs resource create dlm ocf:pacemaker:controld op monitor interval=30s on-fail=fence clone interleave=true ordered=true
    pcs resource create clvmd ocf:heartbeat:clvm with_cmirrord=true op monitor interval=30s on-fail=fence clone interleave=true ordered=true
    pcs constraint order start dlm-clone then clvmd-clone
    pcs constraint colocation add clvmd-clone with dlm-clone
    pcs constraint show
    pcs resource show
If clvmd was already configured earlier, but without cmirrord, the latter can be enabled with:
    pcs resource update clvmd with_cmirrord=true
Identify the drive:
    iscsiadm -m session -P 3 | grep Target
    iscsiadm -m session -P 3 | grep scsi | grep Channel
    lsscsi
    tree /dev/disk
Partition the drive and create a volume group:
    fdisk /dev/disk/by-path/ip-192.168.73.2:3260-iscsi-iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c-lun-0
        respond: n, p, ...., w, p, q
    Refresh the partition table view on all other nodes:
        partprobe
    pvcreate /dev/disk/by-path/ip-192.168.73.2:3260-iscsi-iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c-lun-0-part1
    vgcreate [--clustered y] vg1 /dev/disk/by-path/ip-192.168.73.2:3260-iscsi-iqn.2004-04.com.qnap:ts-569l:iscsi.xs1.e4cd7c-lun-0-part1
    vgdisplay vg1
    pvdisplay
    vgs
Create logical volume:
    lvcreate vg1 --name lv1 --size 9500M
    lvcreate vg1 --name lv1 --extents 2544        # find the number of free extents from vgdisplay
    lvcreate vg1 --name lv1 --extents 100%FREE
    lvdisplay
    ls -l /dev/vg1/lv1
Multipathing
See here.
GFS2
Red Hat GFS2 documentation
- File system name must be unique in a cluster (DLM lock names derive from it).
- The file system hosts journal files. One journal is required for each cluster node that mounts this file system.
  Default journal size: 128 MB (per journal). Minimum journal size: 8 MB.
  For large file systems, increase to 256 MB.
  If the journal is too small, requests will have to wait for journal space, and performance will suffer.
- Do not use SELinux with GFS2. SELinux stores information in every file's extended attributes, which will cause significant GFS2 slowdown.
- If a GFS2 file system is mounted manually (rather than through a Pacemaker resource), unmount it manually.
  Otherwise the shutdown script will kill cluster processes and will then try to unmount the GFS2 file system, but without those processes the unmount will fail and the system will hang (and a hardware reboot will be required).
Configure cluster no-quorum-policy as freeze:
    pcs property set no-quorum-policy=freeze
By default, no-quorum-policy is set to stop, indicating that once quorum is lost, all the resources on the remaining (minority) partition will immediately be stopped. Typically this default is the safest and most optimal option, but unlike most resources, GFS2 and OCFS2 require quorum to function. When quorum is lost, both the applications using the GFS2 mounts and the GFS2 mount itself cannot be correctly stopped in a partition that has become non-quorate. Any attempt to stop these resources without quorum will fail, which will ultimately result in the entire cluster being fenced every time quorum is lost.
To address this situation, set no-quorum-policy=freeze when GFS2 is in use. This means that when quorum is lost, the remaining (minority) partition will do nothing until quorum is regained.
If a majority partition remains, it will fence the minority partition.
Find out for sure: whether the majority partition can launch a failover replica of a service (that was running inside a minority partition) before fencing the minority partition, or only after fencing it. If before, two replicas can conflict when no-quorum-policy is freeze (and even when it is stop).
Create the file system and a Pacemaker resource for it:
    mkfs.gfs2 -j 3 -p lock_dlm -t vc:cfs1 /dev/vg1/lv1
        -j 3 => pre-create journals for three cluster nodes
        -t value => locking table name (must be ClusterName:FilesystemName)
        -O => do not ask for confirmation
        -J 256 => create journals with a size of 256 MB (default: 128, min: 8)
        -r <mb> => size of allocation "resource group", usually 256 MB

    # view settings
    tunegfs2 /dev/vg1/lv1
    # change label (note: the label is also the lock table name)
    tunegfs2 -L vc:cfs1 /dev/vg1/lv1
    # some other settings can also later be changed with tunegfs2

    pcs resource create cfs1 Filesystem device="/dev/vg1/lv1" directory="/var/mnt/cfs1" fstype=gfs2 \
        options="noatime,nodiratime" run_fsck=no \
        op monitor interval=10s on-fail=fence clone interleave=none
Mount options:
    acl                     enable ACLs
    discard                 when on SSD or SCSI devices, enable the UNMAP function for blocks being freed
    quota=on                enforce quota
    quota=account           maintain quota, but do not enforce it
    noatime                 disable update of access time
    nodiratime              same for directories
    lockproto=lock_nolock   mounting out of cluster (no DLM)
pcs constraint order start clvmd-clone then cfs1-clone
pcs constraint colocation add cfs1-clone with clvmd-clone
mount | grep /var/mnt/cfs1
To suspend write activity on the file system (e.g. to create an LVM snapshot):
    dmsetup suspend /dev/vg1/lv1
    [... use LVM to create a snapshot ...]
    dmsetup resume /dev/vg1/lv1
To run fsck, stop the resource to unmount the file system from all the nodes:
    pcs resource disable cfs1 [--wait=60]    # default wait time is 60 seconds
    fsck.gfs2 -y /dev/vg1/lv1
    pcs resource enable cfs1
To expand file system:
lvextend ... vg1/lv1
gfs2_grow /var/mnt/cfs1
When adding a node to the cluster, provide enough journals first:
    # find out how many journals are available
    # must unmount the file system first
    pcs resource disable cfs1
    gfs2_edit -p jindex /dev/vg1/lv1 | grep journal
    pcs resource enable cfs1
    # add one more journal, sized 128 MB
    gfs2_jadd /var/mnt/cfs1
    # add two more journals sized 256 MB
    gfs2_jadd -j 2 -J 256 /var/mnt/cfs1
    [... add the node ...]
Optional – Performance tuning – Increase DLM table sizes
    echo 1024 > /sys/kernel/config/dlm/cluster/lkbtbl_size
    echo 1024 > /sys/kernel/config/dlm/cluster/rsbtbl_size
    echo 1024 > /sys/kernel/config/dlm/cluster/dirtbl_size
Optional – Performance tuning – Tune VFS
    # percentage of system memory that can be filled with "dirty" pages before pdflush kicks in
    sysctl -n vm.dirty_background_ratio    # default is 5-10
    sysctl -w vm.dirty_background_ratio=20
    # discard inodes and directory entries from cache more aggressively
    sysctl -n vm.vfs_cache_pressure        # default is 100
    sysctl -w vm.vfs_cache_pressure=500
    # can be permanently changed in /etc/sysctl.conf
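For example, to persist these values (a sketch; apply afterwards with "sysctl -p"):

    # append to /etc/sysctl.conf
    vm.dirty_background_ratio = 20
    vm.vfs_cache_pressure = 500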
Optional – Tuning
/sys/fs/gfs2/vc:cfs1/tune/...
To enable data journaling on a file (default: disabled):
    chattr +j /var/mnt/cfs1/path/file    # enable
    chattr -j /var/mnt/cfs1/path/file    # disable
Program optimizations:
- preallocate file space – use fallocate(...) if possible
- flock(...) is faster than fcntl(...) with GFS2
- with fcntl(...), l_pid may refer to a process on a different node
To drop the cache (after large backups etc.):
    echo 3 > /proc/sys/vm/drop_caches
View lock etc. status:
    /sys/kernel/debug/gfs2/vc:cfs1/glocks    # decoded here
    dlm_tool ls [-n] [-s] [-v] [-w]
    dlm_tool plocks lockspace-name [options]
    dlm_tool dump [options]
    dlm_tool log_plock [options]
    dlm_tool lockdump lockspace-name [options]
    dlm_tool lockdebug lockspace-name [options]
    tunegfs2 /dev/vg1/lv1
Quota manipulations:
    mount with "quota=on"
    to create quota files:                  quotacheck -cug /var/mnt/cfs1
    to edit user quota:                     export EDITOR=`which nano` ; edquota username
    to edit group quota:                    export EDITOR=`which nano` ; edquota -g groupname
    grace periods:                          edquota -t
    verify user quota:                      quota -u username
    verify group quota:                     quota -g groupname
    report quota:                           repquota /var/mnt/cfs1
    synchronize quota data between nodes:   quotasync -ug /var/mnt/cfs1
NFS over GFS2: see here
=========
### multipath: man mpathpersist
https://www.suse.com/documentation/sles-12/stor_admin/data/sec_multipath_mpiotools.html
### LVM: fsfreeze
misc, iscsi:
https://www.ibm.com/developerworks/community/blogs/mhhaque/entry/configure_two_node_highly_available_cluster_using_kvm_fencing_on_rhel7?lang=en
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/_moving_resources_due_to_connectivity_changes.html
https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/IPaddr2
### add node (also GFS2 journals)
### virtual-ip
### httpd
### nfs
### fence_scsi
### GFS2
### OCFS2
### DRBD
### interface bonding/teaming
### quorum disk, qdiskd, mkqdisk
### GlusterFS
### Lustre
### hawk GUI https://github.com/ClusterLabs/hawk
### http://www.spinics.net/lists/cluster/threads.html