Clustered KVM setup for Live Migration with gfs2 on CentOS 6

Let’s start by configuring networking, then install a bunch of packages for clustering and KVM. After that we’ll configure the services and create a gfs2 cluster file system. Finally we’ll relocate libvirt to the clustered storage so live migration will just work right out of the box.
Ready set go.
Networking
If you only have one or two network interfaces, you can make this part pretty simple by skipping the bonding setup. You should have at least two interfaces so all internal cluster and KVM migration communication can stay separate on its own network, but even this isn’t a necessity for testing purposes.
If your switch supports LACP, use bonding mode 802.3ad. I found it to be the best for performance and redundancy. If your switch does not have specialized channel bonding support, consider using the adaptive load balancing mode instead.
For the nitty-gritty bonding details, see http://www.linuxfoundation.org/collaborate/workgroups/networking/bonding
Loading the bonding module and specifying some options:
    # cat /etc/modprobe.d/bonding.conf
    alias bond0 bonding
    options bonding mode=4 miimon=100 updelay=200 downdelay=200
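Once the bond is up, the kernel reports its state in /proc/net/bonding/bond0, and with mode=4 the "Bonding Mode" line should read "IEEE 802.3ad Dynamic link aggregation". The helper below is only a sketch for checking that; the function name and the optional file argument are made up for illustration and are not part of the original setup.

```shell
# Print the bonding mode the kernel reports for a bond interface.
# $1: bond name (default bond0)
# $2: status file to read (hypothetical hook for testing;
#     defaults to the real /proc/net/bonding/<bond> file)
bond_mode() {
    iface=${1:-bond0}
    status_file=${2:-/proc/net/bonding/$iface}
    sed -n 's/^Bonding Mode: //p' "$status_file"
}
```

Running bond_mode with no arguments on a node reads the live status of bond0.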
Configure the files in /etc/sysconfig/network-scripts/ to bring up the network at boot time. My configuration uses three interfaces: two (eth0, eth1) are bonded, with the bond (bond0) attached to a bridge (br0) for the public network, and a single interface (eth2) carries private communication between the nodes in the cluster.
    # cat ifcfg-eth0
    # public network
    DEVICE=eth0
    ONBOOT=yes
    MASTER=bond0
    SLAVE=yes

    # cat ifcfg-eth1
    # public network
    DEVICE=eth1
    ONBOOT=yes
    MASTER=bond0
    SLAVE=yes

    # cat ifcfg-bond0
    # public network
    DEVICE=bond0
    BRIDGE=br0
    ONBOOT=yes

    # cat ifcfg-br0
    # public network
    DEVICE=br0
    ONBOOT=yes
    TYPE=Bridge
    IPADDR=192.168.1.101
    NETMASK=255.255.0.0
    GATEWAY=192.168.1.1

    # cat ifcfg-eth2
    # private network
    DEVICE=eth2
    ONBOOT=yes
    IPADDR=172.16.1.101
    NETMASK=255.255.0.0
That was for node1. For the rest of your nodes, do the same thing but increment the IP addresses. I use 101, 102, and 103 for a three-node cluster.
Set up your hosts file so the nodes can talk to each other on your private network. This will be the same on each host, so just make it once and copy it to the other nodes. For all configurations from here on, make it on one node and copy the file to the rest.
    # cat /etc/hosts
    127.0.0.1 localhost localhost.localdomain
    ::1 localhost localhost.localdomain

    172.16.1.101 host1
    172.16.1.102 host2
    172.16.1.103 host3
    172.16.1.201 host1-ipmi
    172.16.1.202 host2-ipmi
    172.16.1.203 host3-ipmi
The second name and IP address for each host is for the dedicated baseboard management controller (BMC). I use the “-ipmi” names for fencing in the cluster.conf later.
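Because the addresses follow a fixed pattern (100 plus the node number for the host, 200 plus the node number for its BMC), the cluster entries can be generated rather than typed. This is a sketch that assumes the same 172.16.1.x private network and hostN naming used above; the function name is hypothetical.

```shell
# Print /etc/hosts entries for an N-node cluster on the private network.
# Assumes the numbering used in this article: host<i> at 172.16.1.(100+i)
# and host<i>-ipmi at 172.16.1.(200+i).
gen_cluster_hosts() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        printf '172.16.1.%d host%d\n' "$((100 + i))" "$i"
        i=$((i + 1))
    done
    i=1
    while [ "$i" -le "$count" ]; do
        printf '172.16.1.%d host%d-ipmi\n' "$((200 + i))" "$i"
        i=$((i + 1))
    done
}
```

For a three-node cluster, gen_cluster_hosts 3 prints the same six cluster lines shown in the hosts file above.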
Install Cluster Packages
This list will pull all packages and dependencies needed for getting up and running. You can add things like snmp and foghorn later. Skip OpenIPMI and ipmitool if you don’t want to use a BMC like HP’s integrated lights out controller for fencing.
    # yum install ntpdate ntp openssh-askpass libsmi gfs2-utils libtool-ltdl \
        cluster-glue-libs libibverbs corosync clusterlib pacemaker-cluster-libs \
        resource-agents cluster-glue pacemaker qemu-kvm nc libvirt-client libvirt \
        qemu-kvm-tools OpenIPMI virt-manager virt-top openais modcluster fence-virt \
        ipmitool ricci fence-agents cman lvm2-cluster rgmanager fence-virtd sanlock augeas \
        libvirt-lock-sanlock
I avoid selinux like the plague and I don’t have qlogic hardware, yet these things try to get in my way, so I’ll remove a few packages and be done with it.
    # yum remove selinux-policy selinux-policy-targeted fcoe-utils ql2100-firmware \
        ql2200-firmware ql23xx-firmware ql2400-firmware ql2500-firmware
Disable selinux and the firewall. You can add the firewall back later when you’re not testing on a private network.
    # sed -i 's/SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
    # iptables -F
    # chkconfig iptables off
    # chkconfig ip6tables off
    # mv /etc/sysconfig/iptables /root/
Sanlock and watchdog
When running augtool, give each host a unique host_id between 1 and 2000. See http://libvirt.org/locking.html for more information. I let sanlock’s dependency (wdmd) start on its own using chkconfig, but I don’t want sanlock starting by itself. I put this in the cluster stack instead.
    # yum install libvirt-lock-sanlock
    # chkconfig wdmd on
    host1 # augtool -s set /files/etc/libvirt/qemu-sanlock.conf/host_id 1
    host2 # augtool -s set /files/etc/libvirt/qemu-sanlock.conf/host_id 2
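If your hostnames end in the node number, as host1 and host2 do here, the host_id can be derived from the name instead of being typed on each node. The following is a sketch under that naming assumption; sanlock_host_id is a hypothetical helper, not something shipped with sanlock or libvirt.

```shell
# Derive a sanlock host_id from the trailing digits of a hostname.
# Assumes names such as host1, host2, ...; fails if there is no numeric
# suffix or the result falls outside the 1..2000 range mentioned above.
sanlock_host_id() {
    name=${1:-$(hostname -s)}
    digits=${name##*[!0-9]}                       # keep only trailing digits
    id=$(printf '%s' "$digits" | sed 's/^0*//')   # drop leading zeros
    if [ -z "$id" ] || [ "$id" -gt 2000 ]; then
        echo "cannot derive a host_id between 1 and 2000 from '$name'" >&2
        return 1
    fi
    echo "$id"
}

# Possible use on each node:
#   augtool -s set /files/etc/libvirt/qemu-sanlock.conf/host_id "$(sanlock_host_id)"
```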
The softdog module must be loaded at boot for sanlock to work. Make a script, put it in the /etc/sysconfig/modules directory, and make it executable so it runs at boot.
    # cat /etc/sysconfig/modules/softdog.modules
    #!/bin/sh
    modprobe -b softdog >/dev/null 2>&1
    exit 0
If you’re using an IPMI interface for fencing, the modules must load for ipmitool to work. I had to do this to configure the BMC’s network interface and for general probing from the host.
    # cat ipmi.modules
    #!/bin/sh
    modprobe -b ipmi_devintf >/dev/null 2>&1
    modprobe -b ipmi_si >/dev/null 2>&1
    exit 0
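After a reboot it is worth confirming that these modules really did load. The kernel lists loaded modules in /proc/modules, one per line with the module name first, so a small check is possible. This is a sketch; module_loaded and its optional file argument are hypothetical names used for illustration.

```shell
# Return success if a kernel module appears in the loaded-module list.
# $1: module name as the kernel reports it (underscores, e.g. ipmi_si)
# $2: list to read (hypothetical hook for testing; default /proc/modules)
module_loaded() {
    name=$1
    list=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$list"
}

# Possible use on a node:
#   for m in softdog ipmi_devintf ipmi_si; do
#       module_loaded "$m" || echo "$m is not loaded"
#   done
```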
Set up IPMI with a user and password for cluster fencing. Using ipmitool, configure an IP address. I don’t plan on touching the BMC from the public side at all, so I put it on my private network.
    # ipmitool lan set 1 ipaddr x.x.x.x
    # ipmitool user set name 1 root
    # ipmitool user set password 1 secret
    # ipmitool user enable 1
    # ipmitool channel setaccess 1 1 ipmi=on link=on privilege=4
    # ipmitool user test 1 16 secret
    # ipmitool user test 1 20 secret
For more details on ipmitool, see the project’s home page at http://ipmitool.sourceforge.net/
Turn on the services you want started at boot to bring up the cluster:
    # chkconfig ricci on
    # chkconfig cman on
    # chkconfig modclusterd on
    # chkconfig rgmanager on
    # chkconfig messagebus on
    # chkconfig corosync-notifyd on
Turn off services with chkconfig for anything the cluster will handle starting. It’s also important to include services that keep files open on your storage cluster. If you don’t let cman handle start/stop of dnsmasq for example, the storage will not be able to unmount while the service is still running.
    # chkconfig clvmd off
    # chkconfig gfs2 off
    # chkconfig sanlock off
    # chkconfig dnsmasq off
    # chkconfig libvirtd off
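Since every node needs the same on/off split, it can help to keep both lists in one place and generate the commands. The sketch below only prints the chkconfig lines from the two lists above; the function name is hypothetical, and nothing is changed until the output is run by a shell.

```shell
# Print the chkconfig commands for the boot-time split described above:
# services started by init itself are turned on, services the cluster
# starts are turned off.
cluster_chkconfig_plan() {
    for svc in ricci cman modclusterd rgmanager messagebus corosync-notifyd; do
        echo "chkconfig $svc on"
    done
    for svc in clvmd gfs2 sanlock dnsmasq libvirtd; do
        echo "chkconfig $svc off"
    done
}
```

Review the output first, then apply it as root with cluster_chkconfig_plan | sh.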
Reconfigure lvm for clustering. You could create the file system with local locking and change it later, but why? I do this before creating the gfs2 volumes so I know they’re configured right from the beginning. Change the locking type to 3 for built-in clustered locking. I disable fallback to local locking to avoid any kind of split-brain problem with two hosts writing independently and screwing up the gfs2 volume. If a host can’t play nice with others, it’s safer to not allow the storage to mount at all.
    # cd /etc/lvm/
    # sed -i 's/^ *locking_type.*/locking_type = 3/' lvm.conf
    # sed -i 's/fallback_to_local.*/fallback_to_local_locking = 0/' lvm.conf
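A quick way to confirm that both settings took effect is to look for them on uncommented lines of lvm.conf. This is a sketch, not part of the original procedure; lvm_cluster_locking_ok is a hypothetical name and the file argument exists so the check can be pointed at a copy.

```shell
# Succeed only if lvm.conf sets locking_type = 3 and
# fallback_to_local_locking = 0 on uncommented lines.
# $1: config file to check (default /etc/lvm/lvm.conf)
lvm_cluster_locking_ok() {
    conf=${1:-/etc/lvm/lvm.conf}
    grep -Eq '^[[:space:]]*locking_type[[:space:]]*=[[:space:]]*3([[:space:]]|$)' "$conf" &&
    grep -Eq '^[[:space:]]*fallback_to_local_locking[[:space:]]*=[[:space:]]*0([[:space:]]|$)' "$conf"
}

# Possible use:
#   lvm_cluster_locking_ok || echo "lvm.conf is not set up for clustered locking"
```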
Cluster Config
cluster.conf controls the way your cluster stack loads, unloads, fences, etc. Each time you make a change to cluster.conf you must increment the config_version to have your changes take effect.
    <?xml version="1.0"?>
    <cluster name="hacluster" config_version="1">

      <!-- two_node mode requires expected_votes to be 1 -->
      <cman two_node="1" expected_votes="1"/>

      <clusternodes>
        <clusternode name="host1" votes="1" nodeid="2">
          <fence>
            <method name="ipmi">
              <device name="ipmi_host1" action="reboot" />
            </method>
          </fence>
        </clusternode>
        <clusternode name="host2" votes="1" nodeid="3">
          <fence>
            <method name="ipmi">
              <device name="ipmi_host2" action="reboot" />
            </method>
          </fence>
        </clusternode>
      </clusternodes>

      <fencedevices>
        <fencedevice name="ipmi_host1" agent="fence_ipmilan" ipaddr="host1-ipmi" login="root" passwd="secret" />
        <fencedevice name="ipmi_host2" agent="fence_ipmilan" ipaddr="host2-ipmi" login="root" passwd="secret" />
      </fencedevices>

      <fence_daemon post_join_delay="30" />

      <rm log_level="5">
        <resources>
          <script file="/etc/init.d/clvmd" name="clvmd" />
          <script file="/etc/init.d/gfs2" name="gfs2" />
          <script file="/etc/init.d/sanlock" name="sanlock" />
          <script file="/etc/init.d/dnsmasq" name="dnsmasq" />
          <script file="/etc/init.d/libvirtd" name="libvirtd" />
        </resources>
        <failoverdomains>
          <failoverdomain name="only_host1" nofailback="1" ordered="0" restricted="1">
            <failoverdomainnode name="host1" />
          </failoverdomain>
          <failoverdomain name="only_host2" nofailback="1" ordered="0" restricted="1">
            <failoverdomainnode name="host2" />
          </failoverdomain>
        </failoverdomains>
        <service name="storage_host1" autostart="1" domain="only_host1" exclusive="0" recovery="restart">
          <script ref="clvmd">
            <script ref="gfs2">
              <script ref="sanlock">
                <script ref="dnsmasq">
                  <script ref="libvirtd" />
                </script>
              </script>
            </script>
          </script>
        </service>
        <service name="storage_host2" autostart="1" domain="only_host2" exclusive="0" recovery="restart">
          <script ref="clvmd">
            <script ref="gfs2">
              <script ref="sanlock">
                <script ref="dnsmasq">
                  <script ref="libvirtd" />
                </script>
              </script>
            </script>
          </script>
        </service>
      </rm>
    </cluster>
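Forgetting to raise config_version after an edit is an easy mistake, so the bump can be scripted. The function below is a sketch: bump_config_version is a hypothetical helper that rewrites the first config_version="N" it finds to N+1 and prints the new number. It assumes the attribute is written with double quotes, as in the file above.

```shell
# Increment config_version in a cluster.conf file and print the new value.
# $1: file to edit (default /etc/cluster/cluster.conf)
bump_config_version() {
    conf=${1:-/etc/cluster/cluster.conf}
    current=$(sed -n 's/.*config_version="\([0-9][0-9]*\)".*/\1/p' "$conf" | head -n 1)
    if [ -z "$current" ]; then
        echo "config_version not found in $conf" >&2
        return 1
    fi
    next=$((current + 1))
    tmp=$(mktemp) || return 1
    # Write through a temp file, then copy back so the original file
    # keeps its owner and permissions.
    if sed "s/config_version=\"$current\"/config_version=\"$next\"/" "$conf" > "$tmp" &&
       cat "$tmp" > "$conf"; then
        rm -f "$tmp"
        echo "$next"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

On a running cluster the new version still has to be pushed to the other nodes; on this stack that is normally done with cman_tool version -r.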
Check the config for errors with ccs_config_validate.
    # ccs_config_validate
And just for future reference: to dump the raw values the running cluster is using, much like sysctl -a does for kernel parameters, run corosync-objctl. Since your cluster isn’t running yet, ignore this for now.
    # corosync-objctl
Cluster Storage
Create clustered volumes for storage with clvmd running so you know it’s going to work properly. Start cman and clvmd manually for now.
    # /etc/init.d/cman start
    # /etc/init.d/clvmd start
Create a physical volume, a volume group, then a logical volume, and finally the gfs2 file system in that order.
    # pvcreate /dev/sdb
    # vgcreate vg_vol1 /dev/sdb
    # lvcreate -l 100%FREE vg_vol1 -n lv_vol1
    # mkfs.gfs2 -j 3 -p lock_dlm -t hacluster:vol1 /dev/vg_vol1/lv_vol1
    # mkdir /vol1
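Two details of the mkfs.gfs2 line are easy to get wrong: the part of -t before the colon has to match the cluster name in cluster.conf, and -j needs at least one journal per node that will mount the volume. The sketch below builds the lock table name from cluster.conf so the two cannot drift apart; gfs2_lock_table is a hypothetical helper name.

```shell
# Print the "<clustername>:<fsname>" lock table string for mkfs.gfs2 -t,
# reading the cluster name from cluster.conf.
# $1: file system name (e.g. vol1)
# $2: cluster.conf path (default /etc/cluster/cluster.conf)
gfs2_lock_table() {
    fsname=$1
    conf=${2:-/etc/cluster/cluster.conf}
    cluster=$(sed -n 's/.*<cluster [^>]*name="\([^"]*\)".*/\1/p' "$conf" | head -n 1)
    if [ -z "$cluster" ]; then
        echo "no cluster name found in $conf" >&2
        return 1
    fi
    printf '%s:%s\n' "$cluster" "$fsname"
}

# Possible use:
#   mkfs.gfs2 -j 3 -p lock_dlm -t "$(gfs2_lock_table vol1)" /dev/vg_vol1/lv_vol1
```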
Add entries to /etc/fstab so gfs2 will mount the volume when the cluster tells it to.
    # echo "/dev/vg_vol1/lv_vol1 /vol1 gfs2 rw,relatime 0 0" >> /etc/fstab
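Appending with >> adds a second copy of the line if the command is ever run twice. A small guard avoids that. This is a sketch; add_fstab_entry is a hypothetical helper, and the second argument lets you try it on a scratch file before touching /etc/fstab.

```shell
# Append a line to fstab only if that exact line is not already present.
# $1: the full fstab entry
# $2: file to edit (default /etc/fstab)
add_fstab_entry() {
    entry=$1
    fstab=${2:-/etc/fstab}
    grep -qxF "$entry" "$fstab" 2>/dev/null || printf '%s\n' "$entry" >> "$fstab"
}

# Possible use:
#   add_fstab_entry "/dev/vg_vol1/lv_vol1 /vol1 gfs2 rw,relatime 0 0"
```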
You should be able to mount it manually now or let the cluster start it after rebooting.
With the storage mounted, move the libvirt directory to the cluster storage from one node and create a link to it. You must do this for live migration to work. Delete it from the rest of the nodes and just create the link. The other option is to change the location where libvirt keeps all its files.
From host1:
    # mv /var/lib/libvirt /vol1/
    # cd /var/lib && ln -s /vol1/libvirt
From host2 and on:
    # cd /var/lib
    # rm -rf libvirt
    # ln -s /vol1/libvirt
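Every node should end up with /var/lib/libvirt as a symbolic link to /vol1/libvirt, so a quick check on each node can save debugging later. The following is a sketch; libvirt_link_ok is a hypothetical name, and both paths can be overridden.

```shell
# Succeed only if the libvirt directory is a symlink to the shared copy.
# $1: link path (default /var/lib/libvirt)
# $2: expected target (default /vol1/libvirt)
libvirt_link_ok() {
    link=${1:-/var/lib/libvirt}
    target=${2:-/vol1/libvirt}
    [ -L "$link" ] && [ "$(readlink "$link")" = "$target" ]
}

# Possible use:
#   libvirt_link_ok || echo "/var/lib/libvirt does not point at /vol1/libvirt"
```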
All done!
Reboot all your nodes and check to make sure they come up. Use clustat to verify that services are running. If you see your cluster storage services listed with the state shown as “started”, then everything worked. If not, check each service individually to figure out where the problem is. Make sure the storage is mounted, then check libvirt, dnsmasq, sanlock, gfs2, and then clvmd.
    host1# clustat
    Cluster Status for hacluster @ Fri Dec 21 11:11 2012
    Member Status: Quorate

     Member Name               ID   Status
     ------ ----               ---- ------
     host1                        1 Online, Local, rgmanager
     host2                        2 Online, rgmanager
     host3                        3 Online, rgmanager

     Service Name              Owner (Last)     State
     ------- ----              ----- ------     -----
     service:storage_host1     host1            started
     service:storage_host2     host2            started
     service:storage_host3     host3            started
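With more than a few nodes it is easier to filter the clustat output than to read it. The sketch below assumes the layout shown above, where service lines start with "service:" and the state is the last column; services_not_started is a hypothetical helper that reads clustat output on standard input.

```shell
# Read clustat output on stdin and print every service whose state
# (the last column) is anything other than "started".
services_not_started() {
    awk '$1 ~ /^service:/ && $NF != "started" { print $1 " " $NF }'
}

# Possible use:
#   clustat | services_not_started
```

No output means every listed service reported "started".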
If you find the services are missing or hanging up somehow, use rg_test to do a dry run and make sure all of the individual components of your service are loading, and in the proper order. You’re not going to get gfs2 mounted if clvmd isn’t started first. And in the reverse order for shutdown, the cluster isn’t going to unmount if libvirtd doesn’t stop first.
    # rg_test test cluster.conf start service ...
    Starting storage_host1...
    <info> Executing /etc/init.d/clvmd start
    Activating VG(s): 1 logical volume(s) in volume group "vg_vol1" now active
    [ OK ]
    <info> Executing /etc/init.d/gfs2 start
    Mounting GFS2 filesystem (/vol1): already mounted
    [ OK ]
    <info> Executing /etc/init.d/sanlock start
    <info> Executing /etc/init.d/dnsmasq start
    Starting dnsmasq:
    <info> Executing /etc/init.d/libvirtd start
    Starting libvirtd daemon:
    Start of storage_host1 complete
If everything looks good so far but the cluster won’t unmount cleanly, try using lsof and grepping for anything in vol1. If a file is open and you’re waiting for a gfs2 unmount, your cluster is gonna have a bad time.
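As a starting point for that hunt, lsof lists every open file on a file system when it is given the mount point, and fuser -vm reports the processes using it. The wrapper below is a sketch that uses whichever tool is installed; who_holds is a hypothetical name.

```shell
# Show what is keeping a mounted file system busy.
# $1: mount point (default /vol1)
who_holds() {
    mnt=${1:-/vol1}
    if command -v lsof >/dev/null 2>&1; then
        lsof "$mnt"
    elif command -v fuser >/dev/null 2>&1; then
        fuser -vm "$mnt"
    else
        echo "neither lsof nor fuser is installed" >&2
        return 127
    fi
}
```

Run who_holds /vol1 as root on the node that refuses to unmount, then stop whatever it reports before retrying.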