We've never had a great monitoring solution. If something goes really wrong and I'm not there to deal with it, I usually find out when someone texts me. I've long wanted automatic downtime notifications, and I tried using something like Nagios for the monitoring, but it ended up being overkill for monitoring 1 service on 1 server. So I decided to write my own solution.
I believe there may be some logical complications here, so I wanna talk out my logic and ask you, the readers, to poke holes in my thinking. This script isn't done yet, by far.
The server lives as a node on my home network, which is behind my own firewall, which is behind my router. First, we need to monitor whether a computer elsewhere on the internet can reach my home network. Second, if we can get to my home network, we need to make sure that the machine the game runs on is online; to do that, we monitor it from my always-on home server (home desktop assistant, or HDA). Finally, assuming that machine is running, we need to make sure that the Minecraft server application is running and that it's properly communicating (not hung).
I have some simple solutions for these tasks, but I'm not convinced they're logically sound. To avoid repeating myself, I've outlined some of my concerns in the comments of the script. To test that my home network is reachable, I ping my DynDNS hostname...this just checks that my router can be reached, which should indicate that the internet is up at my house. My home server auto-boots if the power goes out, so I'm operating under the assumption that if the network is reachable, the HDA is up and therefore running its checks. The HDA then pings the Minecraft server machine to make sure it's running. That machine also auto-boots, so if it's up, its own checks should be running. Finally, the on-server script first checks that a java process is running...if it is, it sends a command to the server to produce a log entry, then immediately grabs the system time. If the command succeeded, indicating the server is running properly, the log entry's timestamp should match the system time to within a minute. If it doesn't, the log isn't being written to, and even though the process is running, something is wrong.
So that's the deal. Like I said, there are a few things I'm concerned about, but I think this should really help us stay on top of downtime. I really need feedback, error checking, and thoughts on improving this one.
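One way to sidestep the date and hour rollover cases described above is to compare the two timestamps as seconds since the epoch rather than field by field. The following is only a sketch, not part of the script below: it assumes GNU `date` (for the `-d` option) and that log lines start with a `YYYY-MM-DD HH:MM:SS` timestamp, as the script expects. The function name `log_is_fresh` and the 120-second default tolerance are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): succeeds if the
# timestamp at the start of a log line is within MAXAGE seconds of a
# reference time. Assumes GNU date and a "YYYY-MM-DD HH:MM:SS" prefix.
log_is_fresh() {
    local logline="$1"
    # Tolerance in seconds; 120 is an arbitrary illustrative default
    local maxage="${2:-120}"
    # Optional reference time in epoch seconds, defaults to now
    local now="${3:-$(date +%s)}"
    # The first 19 characters hold the date and time
    local stamp="${logline:0:19}"
    local logsecs
    # Convert the log timestamp to epoch seconds; fail if it doesn't parse
    logsecs=$(date -d "$stamp" +%s 2>/dev/null) || return 1
    local age=$(( now - logsecs ))
    # Compare the absolute difference so slight clock jitter still passes
    [ "${age#-}" -le "$maxage" ]
}
```

With something like this, the three nested date/hour/minute comparisons could collapse into a single `if log_is_fresh "$LOGEND"; then ... fi`, and a check that straddles midnight or the top of the hour would no longer raise a false alarm.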
#!/bin/bash
#
# Title: script.monitor.sh
# Description: Script that monitors the teh3l3m3nts minecraft server
# Author: Joseph Gullo (surfrock66) (surfrock66@surfrock66.com)
#
# This is a comprehensive monitoring script to use for the teh3l3m3nts
# minecraft server. It can be used in 3 configurations for 3 types
# of monitoring:
#
# 0 - Monitoring the server process from inside the same machine.
# This configuration first checks that a java process is running;
# this would indicate that the server is running. If there is no
# java process, then the server is offline. If java is running,
# then attempt to send a command to the server initiating a "ping"
# which will print a log entry. Then, generate the current date and
# time. Parse out the most recent line of the minecraft log file,
# and parse the date and time out of that. Now we're checking to
# see if the last log entry is basically the same time as the current
# check, indicating that the server is currently accepting commands.
# Through my testing, they are usually off by about 9 seconds. First
# check that the date is the same (this thing is gonna be cron'd to
# run every 5 minutes...if it REALLY catches the server at the precise
# moment the date rolls over...fine, whatever). Next, check that the
# hours are the same; again, if the 9 second delay JUST HAPPENS to
# catch it at the rollover, whatever, possible unnecessary log file.
# Finally, check that the minutes are the same...OR WITHIN 1 OF EACH
# OTHER! I believe this will alleviate most of the erroneous error alerts.
# 1 - Monitoring the server from inside the same network; the impetus
# is that if the machine is down, it won't be able to alert us to its
# status. This basically pings the server machine to see if it's running.
# 2 - Monitoring the entire network from the outside; what if the internet
# goes out at the house the server is being run at? Check to see if the
# network is reachable...mostly using a DynDNS resolver since we're all
# on residential internet.
#
# As the script proceeds, it builds a message that will be the body of
# a message should one be triggered. All errors are added to provide
# the best diagnostic steps.
#
# Potential Concerns:
# 1) Let's say the server reboots and gets stuck in the
# BIOS...will it survive the ping (passing check 1) but not send an error
# message from inside the server, meaning cron and the server are down?
# We wouldn't be alerted in this case. Investigate triggers that the
# OS is running to a point that we can be satisfied cron (and the
# internal monitoring script) is running.
# 2) What if the power goes out, but within a 5 minute cron window it
# comes back on, making the router come alive but not any of the servers
# for some reason. Ideally with the modem reset we'd get a new IP so the
# DynDNS wouldn't resolve, but it's POSSIBLE that if the IP doesn't change,
# the ping from outside the network hits the router even though the
# servers are all down. Possibly investigate pinging a specific service,
# in this setup, on surfrock66's HDA (such as subsonic)
#
# Variable that chooses which set of script commands to run, based on location
# Must be one of the following:
# 0 = Monitoring from inside the same machine
# 1 = Monitoring from different machine inside same network
# 2 = Monitoring from different machine outside the network
MONITORLOC="0"
# This is the name of the screen holding the server process
NAME="Bukkit"
# This is the description of the server, used in sending in-game notices
DESC="teh3l3m3nts Minecraft Server"
# E-mail subject for any error e-mails going out
SUBJECT="TEH3L3M3NTS MINECRAFT SERVER MONITORING ALERT!"
# Team e-mail addresses to send alerts to
TOJOE="surfrock66@surfrock66.com"
TOMIKE="landers.mike@gmail.com"
TONATE="nathan.lowell@arvixe.com"
TOSERVER="minecraftpisspants@gmail.com"
# Temporary file to store alert message as it's constructed
MESSAGE="/tmp/message.txt"
# Address to ping the server's network from outside the network
HOSTMC="surfrock66.yourhda.com"
# Address to ping the server from inside the network
SERVERIP="192.168.1.50"
# Processname to check to see if the server is running
SERVERPROCESS="java"
# Location of the server's logfile
LOGFILE="/var/bukkit/server.log"
# Errorflag, defaults to 0 for No Error, can be set to 1 for Error
ERRORFLAG=0
# Initiate the message file that becomes the content of an alert e-mail
echo "MONITORING ALERT FOR TEH3L3M3NTS MINECRAFT SERVER!" > $MESSAGE
if [ "$MONITORLOC" == "2" ]; then
    #####
    #
    # Code for testing server access from another box outside the network
    #
    # Ping the home network from outside of the network
    ping -c 1 "$HOSTMC" &> /dev/null
    # If the ping produces no response, then initiate an error message
    if [ $? -ne 0 ]; then
        echo "Ping to teh3l3m3nts server failed, network is down!" >> "$MESSAGE"
        # Change the flag which triggers the e-mail alert
        ERRORFLAG=1
    else
        # If the ping succeeds, add a message to the error message
        # indicating successful steps
        echo "Ping to teh3l3m3nts server succeeded, network is up!" >> "$MESSAGE"
    fi
    #
    # End of outside-network access test.
    #
    #####
elif [ "$MONITORLOC" == "1" ]; then
    #####
    #
    # Code for testing server access from another box on the same network
    #
    # Ping the server machine from inside the network
    ping -c 1 "$SERVERIP" &> /dev/null
    # If the ping produces no response, then initiate an error message
    if [ $? -ne 0 ]; then
        echo "Ping to teh3l3m3nts server failed, server is down!" >> "$MESSAGE"
        # Change the flag which triggers the e-mail alert
        ERRORFLAG=1
    else
        # If the ping succeeds, add a message to the error message
        # indicating successful steps
        echo "Ping to teh3l3m3nts server succeeded, server is up!" >> "$MESSAGE"
    fi
    #
    # End of same-network access test.
    #
    #####
elif [ "$MONITORLOC" == "0" ]; then
    #####
    #
    # Code for testing if the server is up from the same machine
    #
    # Poll running processes for the server's process, stores the PID(s)
    SERVERID=$(pgrep "$SERVERPROCESS")
    # Check that the server's process is running; the test is quoted
    # because pgrep prints one PID per matching process
    if [ -n "$SERVERID" ]; then
        # Add an indicator that the process is running to the message
        echo "teh3l3m3nts minecraft server process is running!" >> "$MESSAGE"
        # Check that the server's logfile exists
        if [ -f "$LOGFILE" ]; then
            # Send a ping command to the server to generate a new timecode line
            screen -dr "$NAME" -p 0 -X stuff "$(printf "ping\r")"
            # Give the server a moment to write the log entry before reading
            # it (the 2 seconds is a guess, tune as needed)
            sleep 2
            # Grab the last line of the log file...likely the above ping
            LOGEND=$(tail -n 1 "$LOGFILE")
            # Add the date and time to a variable in the same format as
            # the minecraft log timestamp
            DATESTR=$(date +"%Y-%m-%d %H:%M:%S")
            # Extract the date to a variable from the logfile line
            TIMECODE=${LOGEND:0:10}
            # Extract the date to a variable from the current timecode
            DATECODE=${DATESTR:0:10}
            # Check to make sure the last log entry is on the same day as
            # the current timecode
            if [ "$TIMECODE" == "$DATECODE" ]; then
                # Indicate at least the datecodes match in the e-mail message
                echo "Log being written to, datecodes match!" >> "$MESSAGE"
                # Parse out the hour from the log entry
                TIMECODE=${LOGEND:11:2}
                # Parse out the hour from the current timecode
                DATECODE=${DATESTR:11:2}
                # Check to make sure the last log entry is the same hour as
                # the current timecode
                if [ "$TIMECODE" == "$DATECODE" ]; then
                    # Indicate at least the hours match in the e-mail message
                    echo "Log being written to, timecode hours match!" >> "$MESSAGE"
                    # Parse out the minute from the log entry
                    TIMECODE=${LOGEND:14:2}
                    # Parse out the minute from the current timecode
                    DATECODE=${DATESTR:14:2}
                    # Convert the log minute to a number from a string
                    # (10# forces base 10 so "08" and "09" aren't read as octal)
                    LOGMIN=$(( 10#$TIMECODE ))
                    # Convert the current minute to a number from a string
                    DATEMIN=$(( 10#$DATECODE ))
                    # Add a second variable for 1 more than the log minute
                    LOGMIN1=$(( LOGMIN + 1 ))
                    # Check to see if the log and current minutes are within
                    # 1 of each other, which is close-enough
                    if [ "$LOGMIN" -eq "$DATEMIN" ] || [ "$LOGMIN1" -eq "$DATEMIN" ]; then
                        # Indicate that the minutes basically match in the message
                        echo "Log being written to, timecode minutes match!" >> "$MESSAGE"
                    else
                        # If the last log line is too different of a minute
                        # than the current timecode, trigger an error
                        echo "Log not being written to, timecode minutes mismatch, server is OFFLINE!" >> "$MESSAGE"
                        # Change the flag which triggers the e-mail alert
                        ERRORFLAG=1
                    fi
                else
                    # If the last log line is a different hour than now,
                    # trigger an error
                    echo "Log not being written to, timecode hours mismatch, server is OFFLINE!" >> "$MESSAGE"
                    # Change the flag which triggers the e-mail alert
                    ERRORFLAG=1
                fi
            else
                # If the last log line is a different date than now, trigger
                # an error.
                echo "Log not being written to, datecode mismatch, server is OFFLINE!" >> "$MESSAGE"
                # Change the flag which triggers the e-mail alert
                ERRORFLAG=1
            fi
        else
            # If there is no log file, trigger an error message
            echo "There appears to be no logfile!" >> "$MESSAGE"
            # Change the flag which triggers the e-mail alert
            ERRORFLAG=1
        fi
    else
        # If the server's process isn't running, then initiate an error message
        echo "teh3l3m3nts minecraft server process is NOT running!" >> "$MESSAGE"
        # Change the flag which triggers the e-mail alert
        ERRORFLAG=1
    fi
    #
    # End of same-machine access test
    #
    #####
fi
# Check if the error is indicated, then initiate sending messages to
# all owners.
if [ "$ERRORFLAG" == "1" ]; then
    echo "ERROR DETECTED, SENDING MAIL!"
    mail -s "$SUBJECT" "$TOJOE" < "$MESSAGE"
    mail -s "$SUBJECT" "$TOMIKE" < "$MESSAGE"
    mail -s "$SUBJECT" "$TONATE" < "$MESSAGE"
    mail -s "$SUBJECT" "$TOSERVER" < "$MESSAGE"
fi
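
Concern 2 in the script header suggests probing a specific service on the HDA instead of just pinging the router. Here is a minimal sketch of that idea, under some assumptions: it relies on bash's `/dev/tcp` redirection (a bash feature that some builds disable, not a real device file) and on coreutils `timeout`. The function name `port_is_open` is hypothetical, and the port in the usage note is a placeholder for whatever port the router actually forwards to the service.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): succeeds if a TCP
# connection to HOST:PORT can be opened within SECS seconds.
port_is_open() {
    local host="$1" port="$2" secs="${3:-5}"
    # Run the connection attempt in a child bash so timeout can kill it;
    # inside the child, $0 is the host and $1 is the port
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

In the outside-network branch, the ping could then be replaced or supplemented with something like `port_is_open "$HOSTMC" 4040` (4040 being a placeholder port). Unlike a ping answered by the router, this only succeeds if the machine behind the router is up and the service is accepting connections.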
This entry was posted on Friday, October 26th, 2012 at 9:54 pm by surfrock66 and is filed under Owner's Corner.