Nagios and `service restart`

Have you tried turning it off and on again?

This is usually the go-to solution for failing or frozen services: something’s wrong, so let’s just restart whatever it is. It gets as absurd as running a restart cronjob every hour or two! Of course restarting helps, but it’s not the best solution. It’s often better to find out what’s wrong to begin with, and why the thing that is supposed to work is not working anymore.

Nagios is an open-source host and service monitoring solution that not only monitors but also makes it possible to solve issues before bugging sleeping sysadmins and devops people. But more often than not the event handler responsible for trying to make things right is set to run /etc/init.d/service restart, right?

That’s pretty convenient, but it’s cheating. We wake up the next morning, happy as if nothing had gone wrong. Instead of merely restarting, why don’t we dump a post-mortem debugging context that we can go through to find out what actually went wrong and why the drastic restart was needed?

Here’s a simple script to get you started:

#!/bin/bash

# By default we dump to a directory named after the current time;
# this can be overridden by setting the DUMPDIR envvar.
# $HOME is used because ~ is not expanded inside double quotes.
DUMPDIR=${DUMPDIR:-"$HOME/.postmortems/$(date +%Y%m%d%H%M%S)"}

# Read PROCESS, SERVICE and LOGFILE envvars
# and do dump, restart and read log respectively
# These can be called separately.

# Dump process memory
if [ -n "$PROCESS" ]; then
        mkdir -p "$DUMPDIR"

        # Coredump every matching process. A core of a process already
        # contains all of its threads, so one dump per PID is enough.
        # gcore appends ".<pid>" to the name given with -o.
        for pid in $(pidof "$PROCESS"); do
                gcore -o "$DUMPDIR/$PROCESS.$(date +%Y%m%d%H%M%S).core" "$pid"
        done
fi

# Dump logfiles (can be a space-delimited list of files)
if [ -n "$LOGFILE" ]; then
        mkdir -p "$DUMPDIR"

        # $LOGFILE is deliberately unquoted so the list splits on spaces
        for l in $LOGFILE; do
                cp "$l" "$DUMPDIR/$PROCESS.$(date +%Y%m%d%H%M%S).$(basename "$l")"
        done
fi

# Restart service
if [ -n "$SERVICE" ]; then
        service "$SERVICE" restart
fi

This code is available on GitHub. Meow!

Say, for instance, our website is down while we’re fast asleep and can’t be bothered. What should be run?

PROCESS=nginx SERVICE=nginx LOGFILE="/var/log/nginx/error.log /var/log/nginx/access.log" ./dumpstart
PROCESS=mysqld SERVICE=mysql LOGFILE=/var/log/mysql.err ./dumpstart
PROCESS=php5-fpm SERVICE=php5-fpm LOGFILE=/var/log/php5-fpm.log ./dumpstart
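
To have Nagios run this instead of a bare restart, the event handler can be a thin wrapper around the script. What follows is only a sketch, assuming Nagios passes its $SERVICESTATE$ and $SERVICESTATETYPE$ macros as the first two arguments. The handle_event function, its argument order and the DUMPSTART variable are made up for illustration and are not part of the script above.

```shell
#!/bin/bash

# Path to the dump-and-restart script (assumed location, override as needed)
DUMPSTART=${DUMPSTART:-./dumpstart}

# handle_event STATE STATETYPE PROCESS SERVICE [LOGFILE]
# (hypothetical argument order, chosen for this sketch)
handle_event() {
        local state=$1 state_type=$2

        # Only act once Nagios has confirmed the failure (HARD CRITICAL);
        # SOFT states are still being retried and may recover on their own.
        if [ "$state" = "CRITICAL" ] && [ "$state_type" = "HARD" ]; then
                PROCESS=$3 SERVICE=$4 LOGFILE=$5 "$DUMPSTART"
        fi
}

handle_event "$@"
```

A Nagios command definition could then call the wrapper with something like `$SERVICESTATE$ $SERVICESTATETYPE$ nginx nginx /var/log/nginx/error.log` as its arguments. Keep in mind that the user Nagios runs as would need enough privileges to dump the process and restart the service.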

Enjoy the dumps in the morning and uptime throughout the night (hopefully).
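
For the morning after, a first look at the dumps might be a backtrace of every thread in each core file. This is a rough sketch, assuming gdb is installed and the path to the dumped program's binary is known. The backtrace_cores helper and the GDB variable are hypothetical names used only here.

```shell
#!/bin/bash

# Debugger to use (overridable, e.g. to point at a specific gdb build)
GDB=${GDB:-gdb}

# backtrace_cores DUMPDIR BINARY
# Print a backtrace of all threads for every core file found in DUMPDIR.
backtrace_cores() {
        local dir=$1 binary=$2 core

        for core in "$dir"/*.core*; do
                # Skip the unexpanded pattern when no core files exist
                [ -e "$core" ] || continue
                echo "== $core =="
                "$GDB" -batch -ex "thread apply all bt" "$binary" "$core"
        done
}
```

It would be called with a dump directory and a binary, for example `backtrace_cores "$DUMPDIR" /usr/sbin/nginx`, where the binary path is an assumption that depends on the system.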