Owner’s Corner: Monitoring Script

We've never had a great monitoring solution. If something goes really wrong and I'm not there to deal with it, I usually just wait for someone to text me. I've long wanted automatic downtime notifications, and I tried using something like Nagios for the monitoring, but it ended up being overkill for watching one service on one server. So I decided to write my own solution.

I believe there may be some logical complications here, so I wanna talk through my logic and ask you, the readers, to poke holes in my thinking. This script is far from done.

The server lives as a node on my home network, which is behind my own firewall, which is behind my router. First, we need to monitor whether a computer elsewhere on the internet can contact my home network. Second, if we can get to my home network, we need to make sure that the server the game runs on is online. To do that, we monitor it from my always-on home server (home desktop assistant, or HDA). Finally, assuming the server is running, we need to make sure that the Minecraft server application is running and that it's properly communicating (and not hung).
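The first two layers both come down to "can machine A reach machine B," so here is a minimal sketch of that piece, assuming a plain ping counts as a good enough reachability test. The function names and the hostname in the usage note are made up for illustration, and the `-c`/`-W` flags assume the Linux (iputils) version of `ping`.

```python
import subprocess
from typing import Callable, List


def ping_command(host: str, timeout_seconds: int = 2) -> List[str]:
    """Build a one-shot ping command (assumes Linux iputils ping)."""
    # -c 1: send a single echo request
    # -W N: wait at most N seconds for a reply
    return ["ping", "-c", "1", "-W", str(timeout_seconds), host]


def host_is_up(host: str, run: Callable = subprocess.run) -> bool:
    """Return True if a single ping to `host` gets a reply.

    `run` is injectable so the check can be exercised without a network.
    """
    result = run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # ping exits with 0 only when it received a reply.
    return result.returncode == 0
```

The outside machine would call something like `host_is_up("myhome.example.net")` (a placeholder name), and the HDA would call the same function with the game server's LAN address. A failed ping only tells you that nothing answered, not which hop dropped it.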

I have some simple solutions for these tasks, but I'm not convinced they're logically sound. To avoid repeating myself here, I outlined some of my concerns in the comments of the script.

To test that you can reach my home network, I have a ping going to my DynDNS hostname. This just checks that my router can be reached, which should be an indicator that the internet is up at my house. My home server auto-boots if the power goes out, so I'm operating under the assumption that if the network is reachable, the HDA is up and therefore running its checks. The HDA then pings the Minecraft server to make sure it's running. That machine also auto-boots, so if it's up, its checks should be running too.

Finally, the on-server script first checks that a Java process is running. If it is, the script sends a command to the server to produce a log entry, then immediately pulls a system timecode. If the command succeeded, indicating the server is running properly, the timecode on the new log entry should match the system timecode. If they don't match, the log isn't being written to, and even though the process is running, something is wrong.
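Here is a rough sketch of just the timecode comparison, not the actual script. It assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` stamp (adjust the format for your server version), and it allows a few seconds of slack instead of an exact match, since the log write and the system clock read can land on different seconds. The names and the 5-second default are my own placeholders. Checking for the Java process and sending the command are left out because they depend on how the server is launched.

```python
from datetime import datetime, timedelta

# Assumed log line layout: "2012-06-01 20:15:03 [INFO] ..."
# Change these two constants if your server logs a different format.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19  # number of characters in "YYYY-MM-DD HH:MM:SS"


def log_timestamp(line: str) -> datetime:
    """Parse the timestamp at the start of a log line."""
    return datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)


def log_is_fresh(
    last_line: str,
    now: datetime,
    tolerance: timedelta = timedelta(seconds=5),
) -> bool:
    """Return True if the last log line was written within `tolerance` of `now`."""
    try:
        stamp = log_timestamp(last_line)
    except ValueError:
        # No parseable timestamp: treat the log as not updated.
        return False
    return abs(now - stamp) <= tolerance
```

In use, the script would send its command, read the last line of the server log, and call `log_is_fresh(last_line, datetime.now())`. A `False` result while the Java process is still alive is the "running but hung" case.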

So that's the deal. Like I said, there are a few things I'm concerned about, but I think this should really help us stay on top of downtime. I really need feedback, error checking, and thoughts on improving this one.
