Thursday, November 19, 2009

repeating your mistakes

We ran into an issue at work again where poor planning ended up biting us in the ass. The computer does not have bugs - the program written by the human has bugs. In this case our monitoring agent couldn't send alerts from individual hosts because the MTA wasn't running, and we had no check to ensure the MTA was running.

This should have been fixed in the past. When /var would fill up, the MTA couldn't deliver mail. We added checks to alert before /var fills up (which is really stupid if you ask me; create a file and seek to the end of the filesystem and write something and /var is filled up, so it's possible this alert could be missed too).

So the fix here is to add a check on another host if the MTA isn't running. Great. Now we just need to assume nothing else prevents the MTA from delivering the message and we're all good. But what's the alternative? Remote syslog and a remote check to see if the host is down and when it's back up determine why it was down & to reap the unreceived syslog entry? I could be crazy, but something based on Spread seems a little more lightweight and just about as reliable, though because you're removing the requirement of a mail spool (you keep the logs on the client if it can't deliver the message) it reduces the complexity a tad.

At the end of the day we should have learned from our mistake the first time. Somebody should have sat down and thought of all the ways we may miss alerts in the future and work out solutions to them, document it and assign someone to implement it. But our architect didn't work this way and now we lack any architect. Nobody is tending the light and we're doomed to repeat our mistakes over and over.

Also we shouldn't have reinvented a whole monitoring agent when cron scripts, Spread (or collectd) and Nagios could maintain alerts just as well and a lot easier/quicker.

No comments:

Post a Comment