NOTE: We are migrating many of our monitoring services to Prometheus. For more information, see the Prometheus documentation page.
We use Munin to provide real-time monitoring of our hardware. The master is dementors, which runs a cron job every five minutes to collect data from the node server running on each machine. A custom script periodically generates the list of available nodes from LDAP.
We monitor servers, desktops, and staff VMs, but not the hozer boxes. Additionally, we don't receive email alerts for staff VMs.
Munin sends mail to root whenever certain stats run out of bounds for a machine, e.g. if disk usage goes above 92%. Some plugins have configurable warning and critical levels for each field, which are usually set in the node config like so:
[pluginname]
env.fieldname_warning min:max
env.fieldname_critical min:max
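As a rough illustration of how a min:max range like the ones above could be interpreted, the following Python sketch parses a range string and classifies a value against warning and critical bounds. It assumes that either side of the colon may be left empty (for example `:92` for an upper bound only) and that a bare number means an upper bound. The function names `parse_range`, `in_range`, and `classify` are hypothetical, and this is not Munin's own implementation.

```python
from typing import Optional, Tuple


def parse_range(spec: str) -> Tuple[Optional[float], Optional[float]]:
    """Split a 'min:max' string into (min, max); an empty side means no bound."""
    low, sep, high = spec.partition(":")
    if not sep:
        # Assumption: a bare number is treated as an upper bound only.
        return None, float(spec)
    return (float(low) if low else None, float(high) if high else None)


def in_range(value: float, spec: str) -> bool:
    """Return True if value lies within the bounds described by spec."""
    low, high = parse_range(spec)
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True


def classify(value: float, warning: str, critical: str) -> str:
    """Return 'critical', 'warning', or 'ok' for a value and two ranges."""
    if not in_range(value, critical):
        return "critical"
    if not in_range(value, warning):
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Example: disk usage of 95% with a warning bound of 92 and a critical bound of 98.
    print(classify(95, ":92", ":98"))
```

The critical range is checked first so that a value outside both ranges is reported at the more severe level.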
The warning bounds for each node are generated from a Puppet template in the
ocf module using machine specs from facter. While config files use
underscores, the display name for a variable's warning levels takes the form
fieldname.warning or fieldname.critical. When munin-limits finds a variable in
the warning or critical range, it pipes the
alert text to another script which filters out
uninteresting or noisy messages and emails the rest to root. Munin itself isn't
very flexible about disabling alerts from plugins, so, if there is a noisy
variable you want to ignore alerts for, you can add it to the filter script's
list of ignored warnings.
We provide a Puppet class,
ocf::munin::plugin, which installs a custom Munin
plugin to a machine, for example, to monitor the number of players on our CS:GO
server. Writing a plugin is very easy, should you need to do so. When called
without arguments, it should print to standard output a list of variable names
and values:

field1.value <value>
field2.value <value>
...

When given the lone argument
config, it should print display information for
Munin graphs and variable warning levels:
graph_title Title
graph_vlabel yaxis
graph_scale no
field1.label label
field1.warning min:max
...
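To make the two calling modes concrete, here is a minimal sketch of a plugin written in Python. Everything specific in it is hypothetical: the field name `players`, the placeholder `get_player_count` function, the graph labels, and the `0:32` warning range are illustrative assumptions, not taken from an existing plugin.

```python
#!/usr/bin/env python3
"""Hypothetical Munin plugin sketch that reports a made-up player count."""
import sys


def get_player_count() -> int:
    # Placeholder (assumption): a real plugin would query the game server here.
    return 0


def config_lines() -> list:
    """Display information printed when the plugin is called with 'config'."""
    return [
        "graph_title Players online",
        "graph_vlabel players",
        "graph_scale no",
        "players.label players",
        "players.warning 0:32",  # illustrative bounds only
    ]


def value_lines() -> list:
    """Current values printed when the plugin is called without arguments."""
    return ["players.value {}".format(get_player_count())]


def main(argv: list) -> int:
    # A lone 'config' argument selects graph metadata; otherwise print values.
    lines = config_lines() if argv[1:] == ["config"] else value_lines()
    print("\n".join(lines))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Running the script with no arguments prints the value lines, and running it with `config` prints the graph metadata, which mirrors the two output formats described above.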