Managing Heartbeat and DRBD

Basic management commands

  • # crm status # Show current status
  • # crm_mon # Monitor status
  • # crm resource status # Show all resources' status
  • # crm resource {start,stop,status} RESOURCE_NAME # Start, stop or show RESOURCE_NAME status
  • # tail -f /var/log/messages ; tail -f /var/log/heartbeat.log # Show logs
  • # crm_resource --resource RESOURCE_NAME --cleanup --node NODE_NAME # Clean up resource after multiple failures on node; useful when a resource doesn't restart.
  • # crm_resource -P # Re-check all resources and start/stop them accordingly (i.e. like a "reset")
  • # crm resource promote RESOURCE_NAME # Promote resource or group to Master (e.g. for MS_DRBD, switches the current DRBD to Primary)
  • # crm resource migrate RESOURCE_NAME NODE_NAME # Move resource or group to new node
  • # crm resource unmanage RESOURCE_NAME # Unmanage the resource, so Pacemaker won't try to do anything with it. If you have a resource that fails to start, and there's nothing obvious in the logs, you can try starting it manually to diagnose the problem further. Likewise for failed stop and monitor ops.

Configuration commands

  • # crm configure show # Show current configuration as per crm commands.
  • # crm configure save crm.conf # Saves current configuration to crm.conf
  • # crm configure load {replace,update} crm.conf # Loads configuration from file. Always use update if the current node already has a configuration.
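
As a minimal sketch of how the save and load commands above can be combined, the following saves the running configuration to a timestamped file before applying an update. The names backup_name and safe_update and the crm-DATE.conf file pattern are hypothetical, chosen only for illustration; the crm subcommands are the ones listed above.

```shell
#!/bin/sh
# backup_name [DIR]: build a timestamped backup file name (hypothetical helper).
backup_name() {
    printf '%s/crm-%s.conf\n' "${1:-.}" "$(date +%Y%m%d-%H%M%S)"
}

# safe_update NEW_CONFIG [BACKUP_DIR]: save the running configuration first,
# then load NEW_CONFIG with "update". Stops if the save step fails.
safe_update() {
    backup="$(backup_name "${2:-.}")"
    crm configure save "$backup" || return 1
    crm configure load update "$1"
}

# Usage (as root on a cluster node):
#   safe_update crm.conf /root/crm-backups
```

If the update turns out to be wrong, the saved file can be loaded again with the same load command.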

Some tricks

  • # crm configure load replace /dev/null # Useful for resetting configuration
  • # /etc/init.d/heartbeat stop; killall -9 heartbeat; rm -fr /var/lib/heartbeat/crm/* # Hard way to reset heartbeat configuration

Managing DRBD ( without Heartbeat )

DRBD is used to replicate a disk device between two machines over the network (you can think of it as RAID 1 over the network, with some major differences).

Issue the following commands only when heartbeat/pacemaker is not working or is stopped.

  1. # cat /proc/drbd # Show drbd status (e.g. "0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----")
  2. # drbdadm {primary,secondary} RESOURCE_NAME # Makes current node DRBD resource primary/secondary. Example: drbdadm primary r0
  3. # drbdadm {connect,disconnect} RESOURCE_NAME # Connects / disconnects resource. Example: drbdadm disconnect r0
  4. # drbdadm -- --discard-my-data connect RESOURCE_NAME # Discards ALL LOCAL CHANGES and connects resource. Example: drbdadm -- --discard-my-data connect r0. Issue this command only on Secondary nodes.
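
The status line shown in step 1 packs several fields into one line. As a small sketch, the helper below pulls a single field (cs, ro or ds) out of such a line. The name drbd_field is hypothetical and not part of DRBD; the sample line is the one quoted above.

```shell
#!/bin/sh
# drbd_field FIELD LINE: print the value of one "name:value" field
# from a /proc/drbd status line (hypothetical helper).
drbd_field() {
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1://p" | head -n 1
}

line='0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----'
drbd_field cs "$line"   # prints: Connected
drbd_field ro "$line"   # prints: Primary/Secondary
drbd_field ds "$line"   # prints: UpToDate/UpToDate
```

On a live node the line could come from /proc/drbd instead, for example with grep ' cs:' /proc/drbd.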

Recovering secondary node after a split-brain situation

Typically it involves issuing just four commands on the Secondary node (the one with out-of-date data), followed by two checks:

  1. # drbdadm secondary r0
  2. # drbdadm disconnect r0
  3. # drbdadm -- --discard-my-data connect r0
  4. # drbdadm connect r0
  5. # cat /proc/drbd # Should show "ro:Secondary/Primary" and data synchronization progress.
  6. # tail -f /var/log/messages # Always useful.
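
The recovery commands above can be wrapped in a small script. This is only a sketch: the names run, recover_secondary and DRY_RUN are hypothetical, and it defaults to printing the commands instead of executing them, because the third step throws away local data. When executing for real, it stops at the first command that fails.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN=1 (the default here),
# execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# recover_secondary [RESOURCE]: the four recovery commands from the list,
# in order, stopping at the first failure. RESOURCE defaults to r0.
recover_secondary() {
    res="${1:-r0}"
    run drbdadm secondary "$res" &&
    run drbdadm disconnect "$res" &&
    run drbdadm -- --discard-my-data connect "$res" &&
    run drbdadm connect "$res"
}

# Usage, on the Secondary node only:
#   recover_secondary r0              # dry run, only prints the commands
#   DRY_RUN=0 recover_secondary r0    # really runs them
```

After a real run, check /proc/drbd and the logs as in the last two steps of the list.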

TIPS

If all of the above fails, make sure /etc/drbd.conf is the same on both nodes (split-brain actions and so on).
Both DRBD nodes must be configured with exactly the same actions for split-brain situations (the after-sb-0pri, after-sb-1pri and after-sb-2pri options).
If not, connection/synchronization will fail. So configure both nodes accordingly and run # /etc/init.d/drbd reload as needed.
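
One way to check that the split-brain actions match is to extract the after-sb-* lines from both configuration files and compare them. The sketch below assumes the peer's file has been copied to the local node first, and that the options sit one per line in /etc/drbd.conf itself (not in included files). The names sb_policies and same_sb_policies are hypothetical.

```shell
#!/bin/sh
# sb_policies FILE: print the after-sb-* lines of a DRBD config file,
# with whitespace normalized and sorted, so two files can be compared.
sb_policies() {
    grep -E '^[[:space:]]*after-sb-[012]pri[[:space:]]' "$1" |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
            -e 's/[[:space:]][[:space:]]*/ /g' |
        sort
}

# same_sb_policies FILE_A FILE_B: succeed only if both files contain the
# same after-sb-* settings. Two files with no such lines also compare equal.
same_sb_policies() {
    [ "$(sb_policies "$1")" = "$(sb_policies "$2")" ]
}

# Usage:
#   scp NODE_NAME:/etc/drbd.conf /tmp/peer-drbd.conf
#   same_sb_policies /etc/drbd.conf /tmp/peer-drbd.conf && echo same || echo DIFFERENT
```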


