Nagios and `service restart`
Have you tried turning it off and on again?
This is usually the go-to solution for failing or frozen services: something’s wrong, so let’s just restart whatever it is. It gets as absurd as running a restart cron job every hour or two! Of course restarting helps, but it’s not the best solution. It’s often better to find out what’s wrong to begin with: why the thing that is supposed to work is not working anymore.
Nagios is an open-source host and service monitoring solution that not only monitors but also, through event handlers, makes it possible to solve issues before bugging sleeping sysadmins and devops people. But more often than not the event handler responsible for trying to make things right is set to run `/etc/init.d/service restart`, right?
That’s pretty convenient, but it’s cheating. We wake up the next morning, happy as if nothing had gone wrong. Instead of merely restarting, why don’t we dump a post-mortem debugging context to go through later, so we can find out what actually went wrong and why the drastic restart was needed?
Here’s a simple script to get you started:
    #!/bin/bash

    # By default we dump to a directory named after the current time;
    # this can be overridden by setting the DUMPDIR envvar.
    # ($HOME is used because a quoted "~" is not expanded by the shell.)
    DUMPDIR=${DUMPDIR:-"$HOME/.postmortems/$(date +%Y%m%d%H%M%S)"}

    # Read the PROCESS, SERVICE and LOGFILE envvars
    # and do dump, restart and read log respectively.
    # These can be used separately.

    # Dump process memory
    if [ -n "$PROCESS" ]; then
        mkdir -p "$DUMPDIR"
        # Coredump processes and threads
        for pid in $(pidof "$PROCESS"); do
            for tid in $(ls "/proc/$pid/task"); do
                gcore -o "$DUMPDIR/$PROCESS.$tid.$(date +%Y%m%d%H%M%S).core" "$tid"
            done
        done
    fi

    # Dump logfiles (can be a space-delimited list of files)
    if [ -n "$LOGFILE" ]; then
        mkdir -p "$DUMPDIR"
        for l in $LOGFILE; do
            cp "$l" "$DUMPDIR/$PROCESS.$(date +%Y%m%d%H%M%S).$(basename "$l")"
        done
    fi

    # Restart service
    if [ -n "$SERVICE" ]; then
        service "$SERVICE" restart
    fi
This code is available on GitHub. Meow!
Say for instance our website is down and we’re fast asleep and can’t be bothered. What should be run?
    PROCESS=nginx SERVICE=nginx LOGFILE="/var/log/nginx/error.log /var/log/access.log" ./dumpstart
    PROCESS=mysqld SERVICE=mysql LOGFILE=/var/log/mysql.err ./dumpstart
    PROCESS=php5-fpm SERVICE=php5-fpm LOGFILE=/var/log/php5-fpm.log ./dumpstart
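To have Nagios run this for us, the script can sit behind a small event handler. Nagios fires event handlers on every state change, including SOFT states that are still being retried, so it probably makes sense to dump and restart only once a failure is confirmed. Here’s a rough sketch of such a wrapper. The `handle_event` function name, the argument order and the default `./dumpstart` path are all assumptions for illustration; the three arguments are meant to be filled from the Nagios macros `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` in your command definition, and `PROCESS`, `SERVICE` and `LOGFILE` are assumed to be set in the environment already.

```shell
#!/bin/bash
# Hypothetical Nagios event handler wrapper around the dumpstart script.
# Assumed arguments: $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$

# Path to the dump-and-restart script; this default location is an assumption.
DUMPSTART=${DUMPSTART:-./dumpstart}

handle_event() {
    state=$1
    state_type=$2
    attempt=$3

    # Only act once Nagios has confirmed the failure (HARD CRITICAL);
    # SOFT states are still being retried and may recover on their own.
    if [ "$state" = "CRITICAL" ] && [ "$state_type" = "HARD" ]; then
        echo "dumping and restarting (attempt $attempt)"
        "$DUMPSTART"
        return $?
    fi

    echo "ignoring $state/$state_type"
    return 0
}

# Run the handler only when arguments were actually passed in.
if [ "$#" -gt 0 ]; then
    handle_event "$@"
fi
```

Whether to also react to SOFT states (say, on the last retry before the state goes HARD) is a judgment call; this sketch takes the conservative route.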
Enjoy the dumps in the morning and uptime throughout the night (hopefully).
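If you wake up to several dumps, a tiny helper can point you at the latest one. This is only a sketch: the `latest_dump` name is made up, and it assumes the default layout from the script above, where each dump directory is named with a `%Y%m%d%H%M%S` timestamp, so sorting the names alphabetically also sorts them by time.

```shell
# Print the name of the most recent dump directory under a post-mortem root.
# Assumes directory names are timestamps (YYYYmmddHHMMSS), so the
# lexicographically last entry is also the newest one.
latest_dump() {
    root=${1:-"$HOME/.postmortems"}
    ls -1 "$root" 2>/dev/null | sort | tail -n 1
}
```

From there, a core file can be opened with something like `gdb /usr/sbin/nginx path/to/core`, where the binary path is an assumption that depends on your system.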