Monday, September 7, 2009

personal preferences for managing large sets of host configuration

after dealing with a rather befuddling cfengine configuration for several years, i can honestly say that nobody should ever be allowed to maintain one by themselves. by the same token, a large group should never be allowed to maintain it without a very rigorous method of verifying changes, and without policies that will *actually* be enforced by a manager, team lead or other designated person(s).

the problem with managing large host configurations in something like cfengine is its flexibility: there are many ways to express the same configuration, and people differ in how they go about applying it. say you want to configure apache on 1000 hosts, many of them with differences in their configuration. generally one method will be set down and *most* people will do it the same way. this keeps things simple when you need to update an existing config. but what about edge cases? those few, weird, new changes that don't fit the existing model? perhaps one requires a much newer or older version of the software, one whose configuration method differs drastically.

how can you trust your admins to make it work 'the right way' when they need to make a major change? the fact is, you can't. it's not that they're setting out to create a bad precedent. most of the time people just want to get something working and don't have the foggiest idea how, but they try anyway. the result is something quite different from, and slightly more broken than, your old working model, and since it's new, nobody else knows how it works.

i don't have a time-tested solution to this, but i know how i'd do it if i had to set down the original model. it comes down to the same logic behind managing open source software. it's important that every contributed change works, yes? and though you trust your committers, you want to ensure no problems crop up. you want to make sure no drastic design changes slip in and that, in general, things are being done the right way. the only way to do that involves two policies.

1. managerial oversight. this is similar to peer review, except there are only one or two peers whose task is to look at every single commit, understand why it was done the way it was, and fix or remove the commit if it violates the accepted working model. this takes a fair amount of an employee's time, so a team lead makes more sense than a manager. it's not a critical role but it is an important one, and anyone who understands the model and the inner workings of your configuration management language should be capable of it.

2. strict procedures for changing the configuration management. this means you take the time to enumerate every way your admins are expected to modify your configuration management. typically this also includes a "catch-all" instruction to get sign-off from a manager, team lead, or other person in charge before making a change outside the scope of the original procedures. this requires a delicate touch; bark too hard at misguided changes and people will avoid contacting you in the future. on the other hand, if you don't make people follow the original procedures and use good judgment, you'll get called all the time.
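a minimal sketch of how part of policy 2 could be automated, assuming your configuration lives in a version-controlled tree: each documented procedure is expressed as a path pattern, and any changed file that matches none of them gets flagged for team-lead sign-off. the patterns and paths are hypothetical, made up for illustration, and not part of any standard cfengine layout.

```python
from fnmatch import fnmatch

# hypothetical documented procedures: each glob covers paths an admin may
# change under an existing procedure without extra sign-off.
APPROVED_PATTERNS = [
    "services/web/apache/vhosts/*",
    "services/web/apache/modules/*",
    "hosts/*/overrides/*",
]


def needs_signoff(changed_paths, patterns=APPROVED_PATTERNS):
    """return the changed paths that fall outside every documented procedure."""
    return [
        path for path in changed_paths
        if not any(fnmatch(path, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    changes = [
        "services/web/apache/vhosts/shop.conf",  # covered by a procedure
        "services/web/apache/httpd.conf",        # outside every procedure
    ]
    print(needs_signoff(changes))  # ['services/web/apache/httpd.conf']
```

a check like this only catches *where* a change lands, not whether it's done well, so it complements the review in policy 1 rather than replacing it.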

at the end of the day it all comes down to how you lay out your config management. it needs to be simple and user-friendly while at the same time extensible and flexible. you want it to be able to grow to handle uncommon uses without being over-designed or clunky. in my opinion, the best way to go about this is to break everything up into sections, just like you would a filesystem full of irregular material.

the parent directories should be very general and become more specific further down, while always leaving room to group similar configuration into sub-directories. a depth of 4 or 5 is a good target. don't worry too much about making each level specific; the more general you are at each step, the easier it is to expand the configuration there in the future, and the easier it is for users to find and apply configuration where it makes sense.
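to illustrate the general-to-specific layout, here's a hedged python sketch: each directory in a hypothetical tree carries its own settings, and a host's effective configuration is built by walking from the most general directory down to the most specific one, with deeper levels overriding their parents. the directory names, settings and the depth check are assumptions for the example; this shows the idea, not how cfengine itself resolves configuration.

```python
from pathlib import PurePosixPath

MAX_DEPTH = 5  # the depth target suggested above

# hypothetical tree: directory path -> settings defined at that level
EXAMPLE_TREE = {
    "services": {"owner": "ops"},
    "services/web": {"port": 80},
    "services/web/apache": {"user": "apache", "max_clients": 150},
    "services/web/apache/legacy": {"max_clients": 50},  # edge case lives deeper
}


def layered_settings(tree, leaf):
    """merge settings from the most general directory down to `leaf`."""
    parts = PurePosixPath(leaf).parts
    if len(parts) > MAX_DEPTH:
        raise ValueError(f"{leaf} is nested deeper than {MAX_DEPTH} levels")
    merged = {}
    for depth in range(1, len(parts) + 1):
        prefix = "/".join(parts[:depth])
        # more specific levels override anything set by their parents
        merged.update(tree.get(prefix, {}))
    return merged


if __name__ == "__main__":
    print(layered_settings(EXAMPLE_TREE, "services/web/apache/legacy"))
```

the legacy hosts inherit everything from `services/web/apache` and only override what differs, which is the point of keeping the upper levels general: an edge case becomes a new sub-directory instead of a rewrite of the existing model.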

you also need to consider: "how practical is it to manage this configuration on this host?" working out how to make any given change on any given host (using your configuration management system) should take no longer than 5 minutes. that way anyone from the NOC to the system engineers or architects can modify the system in real time when it counts. documentation is no substitute for intuitive user-friendliness. documentation should explain why something is the way it is, not how to do it or where to find it.

note the reasoning puppet's style guide gives for its formatting: "These guidelines were developed at Stanford University, where complexity drove the need for a highly formalized and structured stylistic practice which is strictly adhered to by the dozen Unix admins"

*update* i didn't really touch on it above, but in cfengine, modules should probably be used heavily rather than sparingly. the more logic you move into your modules, the less you'll need in your inputs, and the easier it will be to manage the hulking beast of lengthy input scripts.
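for example, a module can take over detection logic that would otherwise be spread across your inputs. as i understand the cfengine 2 module protocol, a module is an executable whose output lines starting with `+` define classes and lines of the form `=name=value` define variables; check the documentation for your version before relying on this. the script below is a hypothetical module that works out which apache layout a host uses (the `/etc/httpd` and `/etc/apache2` paths are assumed), so the inputs only need to key off the resulting classes.

```python
#!/usr/bin/env python
# hypothetical cfengine module: detects the apache layout and reports it
# back to cfengine as classes and a variable on stdout.
import os


def detect_apache_flavor(conf_dirs=("/etc/httpd", "/etc/apache2")):
    """return the basename of the first apache config dir found, or None."""
    for conf_dir in conf_dirs:
        if os.path.isdir(conf_dir):
            return os.path.basename(conf_dir)
    return None


def module_output(flavor):
    """build the module protocol lines for a detected flavor."""
    if flavor is None:
        return ["+no_apache"]                     # define class no_apache
    return [
        "+apache_" + flavor,                      # e.g. class apache_httpd
        "=apache_conf_dir=/etc/" + flavor,        # variable apache_conf_dir
    ]


if __name__ == "__main__":
    for line in module_output(detect_apache_flavor()):
        print(line)
```

keeping the branching in the module means the inputs stay a flat list of actions guarded by `apache_httpd` or `apache_apache2`, instead of repeating the detection logic wherever apache is configured.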
