Cluster service won’t start with rgmanager (fixed)

A cluster service won’t start when it’s in a failed state. Either the service tried to start and failed too many times in too short a period (based on the retry and remember policy in your cluster.conf) or, more likely, it failed to shut down cleanly.

Manually enabling the service fails, right?

# clusvcadm -e host1service

Check syslog and you will see the service refusing to start, along with the reason why. In this case it complains that it failed to stop cleanly, but the real roadblock is that it’s in a failed state.

# tail /var/log/messages
rgmanager: Service:host1service has failed; can not start.
rgmanager: Service:host1service failed to stop cleanly
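
To pull just the relevant lines out of a busy log, a small filter can help. This is a minimal sketch: the function name rgmanager_lines is made up for illustration, and it assumes the log lines contain the literal text "rgmanager" as in the excerpt above.

```shell
# Hypothetical helper (not part of rgmanager): print the rgmanager log lines
# that mention one service. The log is read from stdin, so any source works.
rgmanager_lines() {
    # First keep rgmanager lines, then keep those naming the service.
    # -F treats the service name as a fixed string, not a regex.
    grep 'rgmanager' | grep -F -- "$1"
}

# Example: rgmanager_lines host1service < /var/log/messages
```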

Assuming cluster.conf is valid because you ran ccs_config_validate and it told you so, the problem is probably pretty simple. If a service is in a failed state, you can’t enable it unless you disable it first!

Perhaps this behavior exists to stop you from starting a failed service: it has already tried to start several times and couldn’t, so what’s the point of another try? But in my case, what’s the harm? It has already failed again and again. Why not just retry when I step in and manually ask for the service to start?!

Oh well, it’s just one extra step.

Disable the service, then enable it. If you really have fixed the problem that caused the failure in the first place, the service should start right up.

# clusvcadm -d host1service
# clusvcadm -e host1service
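
If you do this often, the two steps could be wrapped in a function. This is only a sketch: restart_failed_service and the CLUSVCADM override variable are hypothetical names, and it assumes clusvcadm returns a nonzero exit status when the disable fails.

```shell
# Hypothetical wrapper: clear the failed state by disabling the service,
# then enable it again.
# CLUSVCADM is an assumed override hook so a stub can stand in for the
# real command during a dry run; it defaults to clusvcadm.
restart_failed_service() {
    svc=$1
    cmd=${CLUSVCADM:-clusvcadm}
    # Disable first; a failed service cannot be enabled directly.
    $cmd -d "$svc" || return 1
    # Only reached if the disable succeeded.
    $cmd -e "$svc"
}

# Example: restart_failed_service host1service
```

The function stops after a failed disable, since enabling would be refused anyway.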

Checking syslog once again, you should see the service disable, re-enable, and fire off the startup script.

Stopping service service:host1service
rgmanager: [script] Executing /etc/init.d/ClusterService stop
rgmanager: [script] Executing /etc/init.d/ClusterService start
rgmanager: Service service:host1service started

Check the cluster status to make sure it started.

# clustat
Cluster Status for hacluster @ Fri Nov 9 02:15:00 2012
Member Status: Quorate

Member Name              ID   Status
------ ----              --   ------
host1                    1    Online, Local, rgmanager
host2                    2    Online, rgmanager
host3                    3    Online, rgmanager

Service Name             Owner    State
------- ----             -----    -----
service:host1service     host1    started
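
For scripted checks, the state can be picked out of the clustat output. A minimal sketch, assuming the state is the last whitespace-separated field on the service row, as in the output above; service_state is a made-up name.

```shell
# Hypothetical helper: read clustat output on stdin and print the state
# (last field) of the row whose first field is service:<name>.
service_state() {
    awk -v svc="service:$1" '$1 == svc { print $NF }'
}

# Example: clustat | service_state host1service
```

It prints nothing if the service row is not found, so an empty result is worth treating as a failure.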

If the service is relocatable, now is a good time to test relocation to make sure it works properly and doesn’t fail on the other available nodes. The example here relocates a service named dnsmasq to host2.

# clusvcadm -r dnsmasq -m host2
Trying to relocate service:dnsmasq to host2…Success
service:dnsmasq is now running on host2
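
To try every candidate node in turn, the relocation test above could be looped. This is a sketch under assumptions: relocate_through and the CLUSVCADM override are hypothetical, the node list is whatever you pass in, and it relies on clusvcadm exiting nonzero when a relocation fails.

```shell
# Hypothetical helper: relocate a service to each listed node in order,
# stopping at the first node where relocation fails.
# CLUSVCADM is an assumed override hook; it defaults to clusvcadm.
relocate_through() {
    svc=$1
    shift
    cmd=${CLUSVCADM:-clusvcadm}
    for node in "$@"; do
        if ! $cmd -r "$svc" -m "$node"; then
            echo "relocation of $svc to $node failed" >&2
            return 1
        fi
    done
}

# Example: relocate_through dnsmasq host2 host3
```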

If relocation fails on any particular node, fix the problem or take that node out of the failoverdomain in cluster.conf.