Writing a Simple Shinken Log Check over SSH

Following the wonderful Alfresco issues I’ve been working on lately, I decided that being proactive is the way to go to ensure the best uptime for our clients. So I had to add this to monitoring. It is something we can easily check, because every Java heap memory error is logged on the Alfresco server. So let’s go through the anatomy of a check in Shinken.

I pretty much copied the linux-ssh pack, then modified or removed everything unnecessary:

Setup

cd /etc/shinken/packs
mkdir alfresco
rsync -avz linux-ssh/ alfresco/

Setup Command: /etc/shinken/packs/alfresco/commands.cfg

# -----------------------------------------------------------------
#
#      Alfresco standard check
#
# -----------------------------------------------------------------
define command {
       command_name     check_alfresco_heap_errors
       command_line     $PLUGINSDIR$/check_alfresco_heap_errors_by_ssh.py -H $HOSTADDRESS$ -p $_HOSTSSH_PORT$ -u root -i $_HOSTSSH_KEY$
}

Above we created a new command to be run by our services. This one calls /var/lib/shinken/libexec/check_alfresco_heap_errors_by_ssh.py, which I wrote to execute a short bash script on one of our Alfresco servers.
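
To make the macros in that command_line concrete, here is a rough sketch of what the command ends up looking like once they are filled in. This is not Shinken's macro resolver, just plain string substitution; `expand_macros` and `example_macros` are made-up names, and the values are the plugin directory mentioned above plus the host definition used later in this post. The one real rule it illustrates is that a custom host variable such as _SSH_PORT is referenced as $_HOSTSSH_PORT$.

```python
# Illustrative only: plain string substitution, NOT Shinken's real macro resolver.
COMMAND_LINE = (
    "$PLUGINSDIR$/check_alfresco_heap_errors_by_ssh.py "
    "-H $HOSTADDRESS$ -p $_HOSTSSH_PORT$ -u root -i $_HOSTSSH_KEY$"
)


def expand_macros(command_line, macros):
    """Replace each $NAME$ macro with its value from the macros dict."""
    for name, value in macros.items():
        command_line = command_line.replace("$%s$" % name, str(value))
    return command_line


# Assumed example values, taken from the paths and host definition in this post.
example_macros = {
    "PLUGINSDIR": "/var/lib/shinken/libexec",
    "HOSTADDRESS": "1.2.3.4",
    "_HOSTSSH_PORT": 9999,  # custom host variable _SSH_PORT
    "_HOSTSSH_KEY": "/home/shinken/.ssh/SuperSecretKey.pem",  # custom variable _SSH_KEY
}

expanded = expand_macros(COMMAND_LINE, example_macros)
print(expanded)
```

Running the printed command by hand as the shinken user is a quick way to test the plugin before wiring it into a service.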

Setup templates: /etc/shinken/packs/alfresco/templates.cfg

define host{
   name             alfresco
   check_command            check_alfresco_heap_errors
   register         0
   _SSH_KEY         $SSH_KEY$
   _SSH_KEY_PASSPHRASE      $SSH_KEY_PASSPHRASE$
   _SSH_USER            $SSH_USER$
   _SSH_PORT            $SSH_PORT$
}
define service{
  name              alfresco-log-service
  use               generic-service
  register                      0
  aggregation           system
}

Here we define the host template and the service template that our checks will use, which we will see next.

Setup the check: /etc/shinken/packs/alfresco/services/heap.cfg

define service{
   service_description    AlfrescoLogCheck
   use                alfresco-log-service
   register           0
   host_name          alfresco
   check_command      check_alfresco_heap_errors
}

Make sure all of the names match (alfresco-log-service), or else Shinken will error out on the config check. All that is left now is to add our new check to a host and to set up the actual check script.
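
A mismatch like that is easy to catch before Shinken does. The sketch below is a hypothetical helper (`parse_blocks` and `missing_templates` are my own names, not part of Shinken) that scans config text for `use` entries that do not match any `name`. It is a deliberately naive parser: it assumes one directive per line, lumps host and service templates together, and needs to be told about templates defined elsewhere, such as generic-service.

```python
import re


def parse_blocks(text):
    """Return one {directive: value} dict per `define ... { ... }` block."""
    blocks = []
    for match in re.finditer(r"define\s+\w+\s*\{(.*?)\}", text, re.S):
        directives = {}
        for line in match.group(1).splitlines():
            parts = line.strip().split(None, 1)
            # Skip blank lines, comment lines and directives without a value.
            if len(parts) == 2 and not parts[0].startswith("#"):
                directives[parts[0]] = parts[1].strip()
        blocks.append(directives)
    return blocks


def missing_templates(text, known=()):
    """List every template named in `use` that is neither defined nor known."""
    blocks = parse_blocks(text)
    defined = set(known) | set(b["name"] for b in blocks if "name" in b)
    missing = set()
    for block in blocks:
        for used in block.get("use", "").split(","):
            if used.strip() and used.strip() not in defined:
                missing.add(used.strip())
    return sorted(missing)


# Sample config with a deliberate typo in the second block's `use` line.
SAMPLE = """
define service{
  name              alfresco-log-service
  use               generic-service
  register          0
}
define service{
  service_description    AlfrescoLogCheck
  use                alfresco-log-servic
  register           0
}
"""

print(missing_templates(SAMPLE, known=["generic-service"]))
```

With the typo in place this prints ['alfresco-log-servic']; once the name is fixed the list comes back empty.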

Add to host: /etc/shinken/hosts/device.cfg

define host{
        use                     generic-host, http, alfresco
        contact_groups          admins
        host_name               alfresco-prod-oct
        address                 1.2.3.4
        _SSH_PORT               9999
        _SSH_KEY                /home/shinken/.ssh/SuperSecretKey.pem
}

For the check itself, I just modified the linux-ssh uptime check:

Check Script

#!/usr/bin/env python
'''
 This script is a check to see if alfresco has any java heap errors
 over ssh without having an agent on the other side
'''
import os
import sys
import optparse
# Ok try to load our directory to load the plugin utils.
my_dir = os.path.dirname(__file__)
sys.path.insert(0, my_dir)
try:
    import schecks
except ImportError:
    print "ERROR : this plugin needs the local schecks.py lib. Please install it"
    sys.exit(2)
VERSION = "0.1"
DEFAULT_WARNING = '1' # A single heap error exits with the warning code
DEFAULT_CRITICAL = '2' # More than one heap error is critical
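def get_java_errors(client):
    # Run the remote helper script; it prints the number of heap errors on one line
    raw = r"""/root/bin/check_alfresco_memory_errors.sh"""
    stdin, stdout, stderr = client.exec_command(raw)
    lines = [l.strip() for l in stdout]
    client.close()
    if not lines:
        # No output at all: report UNKNOWN (exit code 3) instead of crashing
        print "Unknown: no output from %s" % raw
        sys.exit(3)
    return lines[0]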
parser = optparse.OptionParser(
    "%prog [options]", version="%prog " + VERSION)
parser.add_option('-H', '--hostname',
    dest="hostname", help='Hostname to connect to')
parser.add_option('-p', '--port',
    dest="port", type="int", default=22,
    help='SSH port to connect to. Default : 22')
parser.add_option('-i', '--ssh-key',
    dest="ssh_key_file",
    help='SSH key file to use. By default will take ~/.ssh/id_rsa.')
parser.add_option('-u', '--user',
    dest="user", help='Remote user to use. By default shinken.')
parser.add_option('-P', '--passphrase',
                  dest="passphrase", help='SSH key passphrase. By default will use void')
parser.add_option('-c', '--critical',
                  dest="critical", help='Critical value for the number of heap errors (currently parsed but not applied). Default : 2')
if __name__ == '__main__':
    # Ok first job : parse args
    opts, args = parser.parse_args()
    if args:
        parser.error("Does not accept any argument.")
    hostname = opts.hostname or ''
    port = opts.port
    ssh_key_file = opts.ssh_key_file or os.path.expanduser('~/.ssh/id_rsa')
    user = opts.user or 'shinken'
    passphrase = opts.passphrase or ''
    # Try to get numeric warning/critical values
    s_warning  = DEFAULT_WARNING
    s_critical = opts.critical or DEFAULT_CRITICAL
    _, critical = schecks.get_warn_crit(s_warning, s_critical)
    # Ok now connect, and try to get the number of heap errors
    client = schecks.connect(hostname, port, ssh_key_file, passphrase, user)
    errors = int(get_java_errors(client))
    # Note: the thresholds below are hard-coded; the parsed critical value is not applied
    if errors > 1:
        print "Critical: There are %d java heap errors" % (errors)
        sys.exit(2)
    elif errors == 1:
        # Exit code 1 is WARNING, so the message says so
        print "Warning: There is %d java heap error" % (errors)
        sys.exit(1)
    else:
        print "Ok: There are no java heap errors: %d" % (errors)
        sys.exit(0)

The script really just SSHes in, executes /root/bin/check_alfresco_memory_errors.sh, and reads the first line of its STDOUT. So it’s about as simple as you can get, and you can handle a lot of the heavy (or not so heavy) lifting on the remote server.
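
Since all the plugin does with that output is turn a number into an exit code, the decision can be pulled out into a small function that is easy to test without SSH. This is only a sketch: `evaluate` is a made-up name, the defaults mirror the script's exit codes (one error exits 1, more than one exits 2), and returning UNKNOWN (3) for unparseable output is my assumption, following the usual Nagios plugin convention.

```python
# Standard Nagios/Shinken plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(raw_output, warning=1, critical=2):
    """Turn the remote script's stdout into (exit_code, message)."""
    try:
        errors = int(raw_output.strip())
    except ValueError:
        # Empty output or an error message instead of a number.
        return UNKNOWN, "Unknown: unexpected output %r" % raw_output
    if errors >= critical:
        return CRITICAL, "Critical: There are %d java heap errors" % errors
    if errors >= warning:
        return WARNING, "Warning: %d java heap error(s)" % errors
    return OK, "Ok: There are no java heap errors"


# Try a few outputs the remote script could plausibly produce.
for sample in ["0\n", "1\n", "7\n", ""]:
    code, message = evaluate(sample)
    print("%d %s" % (code, message))
```

In the plugin itself you would print the message and pass the code to sys.exit().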

Simple Log Checker for Heap Errors

And here is a quick little bash script I whipped up to check the logs for heap errors so we can alert on those as well.

#!/bin/bash
# Purpose: To ensure that there are no recent memory errors
# and to notify upon memory errors
# to ensure uptime
# used by monitoring - shinken
logfile='/opt/bitnami/apache-tomcat/logs/alfresco.log'
logerror='Caused by: java.lang.OutOfMemoryError: Java heap space'
# Count the lines that contain the heap error; quote variables to avoid word splitting
number_of_errors=$(grep -c "$logerror" "$logfile")
echo "$number_of_errors"
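
If you want to try the counting logic without touching a real log, here is a rough Python equivalent of the grep -c line. `count_heap_errors` is a made-up helper and the sample log lines are invented for illustration. One difference to be aware of: grep treats the pattern as a regular expression (so each dot matches any character), while this sketch does a plain substring match.

```python
HEAP_ERROR = "Caused by: java.lang.OutOfMemoryError: Java heap space"


def count_heap_errors(lines, needle=HEAP_ERROR):
    """Count the lines containing the heap error text, like `grep -c` does."""
    return sum(1 for line in lines if needle in line)


# Invented sample lines, not taken from a real alfresco.log.
sample_log = [
    "2014-01-01 10:00:00 ERROR something failed",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
    "2014-01-01 10:05:00 INFO all good",
    "Caused by: java.lang.OutOfMemoryError: Java heap space",
]

print(count_heap_errors(sample_log))  # prints 2
```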

And tada, now we have monitoring for Alfresco throwing errors and can proactively fix things before anyone notices!
