Advanced Usage of the collectd exec Plugin

The previous article, "Service Monitoring with collectd" (使用 collectd 进行服务监控 | 看看俺 – KanKanAn.com), showed how to use collectd's exec plugin.

For the collected statistics to display correctly and be easy to use, you need a thorough understanding of the data being submitted.

Identifying the data

PUTVAL Identifier [OptionList] Valuelist

Submits one or more values (identified by Identifier, see below) to the daemon which will dispatch it to all its write-plugins.

An Identifier is of the form "host/plugin-instance/type-instance" with both instance-parts being optional. If they're omitted the hyphen must be omitted, too. plugin and each instance-part may be chosen freely as long as the tuple (plugin, plugin instance, type instance) uniquely identifies the plugin within collectd. type identifies the type and number of values (i. e. data-set) passed to collectd. A large list of predefined data-sets is available in the types.db file. See types.db(5) for a description of the format of this file.

The OptionList is an optional list of Options, where each option is a key-value-pair. A list of currently understood options can be found below, all other options will be ignored. Values that contain spaces must be quoted with double quotes.

Valuelist is a colon-separated list of the time and the values, each either an integer if the data-source is a counter, or a double if the data-source is of type "gauge". You can submit an undefined gauge-value by using U. When submitting U to a counter the behavior is undefined. The time is given as epoch (i. e. standard UNIX time).

You can mix options and values, but the order is important: Options only affect following values, so specifying an option as last field is allowed, but useless. Also, an option applies to all following values, so you don't need to re-set an option over and over again.

The currently defined Options are:

interval=seconds Gives the interval in which the data identified by Identifier is being collected.

Please note that this is the same format as used in the unixsock plugin, see collectd-unixsock(5). There's also a bit more information on identifiers in case you're confused.

Since examples usually let one understand a lot better, here are some:

PUTVAL leeloo/cpu-0/cpu-idle N:2299366
PUTVAL alice/interface/if_octets-eth0 interval=10 1180647081:421465:479194
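The syntax above can be sketched in a few lines of Python. This is only an illustration, not code from collectd: `format_putval` is a hypothetical helper name, and it assumes that `N` in the time field means "now" and that `None` should be sent as `U`.

```python
def format_putval(identifier, values, interval=None, timestamp=None):
    """Build one PUTVAL line (hypothetical helper, not part of collectd)."""
    # "N" asks the daemon to use the current time instead of an explicit epoch.
    time_field = "N" if timestamp is None else str(int(timestamp))
    # None is submitted as U (undefined); only meaningful for gauge values.
    fields = [time_field] + ["U" if v is None else str(v) for v in values]
    parts = ["PUTVAL", identifier]
    if interval is not None:
        # Options only affect the value lists that follow them.
        parts.append(f"interval={interval}")
    parts.append(":".join(fields))
    return " ".join(parts)


if __name__ == "__main__":
    # An exec script writes these lines to standard output; flush so that
    # they are not held back in a buffer.
    print(format_putval("leeloo/cpu-0/cpu-idle", [2299366]), flush=True)
    print(format_putval("alice/interface/if_octets-eth0", [421465, 479194],
                        interval=10, timestamp=1180647081), flush=True)
```

Run as a script, this reproduces the two example lines quoted from the manual.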

Identifier

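The format is host/plugin-instance/type-instance

The - is a separator. The instance parts are optional (when one is omitted, its - must be omitted too).
host

The host name, usually taken from the HOSTNAME environment variable.
plugin

The plugin name.
type

The name of a predefined value type. It defines the type and number of values, and how the collectd service processes them (for example, averaging over the time interval).

See man 5 types.db and "Data source - collectd Wiki".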
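To make the identifier layout concrete, here is a minimal parsing sketch. `Identifier` and `parse_identifier` are hypothetical names introduced for illustration, and the sketch assumes that only the first hyphen in each part separates the name from its instance.

```python
from typing import NamedTuple, Optional


class Identifier(NamedTuple):
    host: str
    plugin: str
    plugin_instance: Optional[str]
    type: str
    type_instance: Optional[str]


def parse_identifier(text):
    """Split host/plugin-instance/type-instance into its five parts."""
    # Raises ValueError unless there are exactly three slash-separated parts.
    host, plugin_part, type_part = text.split("/")
    # Assumption: the first hyphen separates the name from the instance.
    plugin, _, plugin_instance = plugin_part.partition("-")
    type_, _, type_instance = type_part.partition("-")
    return Identifier(host, plugin, plugin_instance or None,
                      type_, type_instance or None)


if __name__ == "__main__":
    print(parse_identifier("leeloo/cpu-0/cpu-idle"))
    print(parse_identifier("alice/interface/if_octets-eth0"))
```

In the second example the plugin instance is omitted, so `plugin_instance` comes back as `None` while `type_instance` is `eth0`.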

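If a submitted metric produces no chart on the web page, check types.db on both the collectd server and the client: the data-set must be defined, the definitions must be identical on both sides, and the submitted values must match the data-set definition. The collectd installed on the server or the client may be an older version whose bundled types.db lacks a data-set required by a third-party plugin. Changes made by operations staff to the memory type in types.db will also cause submissions to fail: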
```
# memory         value:GAUGE:0:281474976710656
memory          free:GAUGE:0:281474976710656, buffered:GAUGE:0:281474976710656, used:GAUGE:0:281474976710656, cached:GAUGE:0:281474976710656
```
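A quick way to spot this kind of mismatch is to compare a value list against the data-set definition before submitting it. The following is a rough sketch, not collectd's own validation logic: `parse_types_db_line` and `check_values` are hypothetical helpers, and the sketch assumes each data source is written as name:type:min:max with `U` meaning "no limit".

```python
def parse_types_db_line(line):
    """Return (type_name, [(ds_name, ds_type, min, max), ...]) or None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    name, spec = line.split(None, 1)
    sources = []
    for field in spec.replace(",", " ").split():
        ds_name, ds_type, ds_min, ds_max = field.split(":")
        sources.append((ds_name, ds_type,
                        None if ds_min == "U" else float(ds_min),
                        None if ds_max == "U" else float(ds_max)))
    return name, sources


def check_values(sources, values):
    """Return a list of problems; an empty list means the values fit."""
    if len(values) != len(sources):
        return [f"expected {len(sources)} values, got {len(values)}"]
    problems = []
    for (ds_name, _ds_type, low, high), value in zip(sources, values):
        if value is None:
            continue  # undefined value, nothing to range-check
        if (low is not None and value < low) or (high is not None and value > high):
            problems.append(f"{ds_name}={value} is out of range")
    return problems


if __name__ == "__main__":
    modified = parse_types_db_line(
        "memory free:GAUGE:0:281474976710656, buffered:GAUGE:0:281474976710656, "
        "used:GAUGE:0:281474976710656, cached:GAUGE:0:281474976710656")
    # A plugin that still submits a single value no longer fits the data-set.
    print(check_values(modified[1], [1024]))
```

With the modified definition above, a client that submits one value per PUTVAL line is reported as sending the wrong number of values.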

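Displaying the data

Data is displayed level by level, from top to bottom.

Host list

Select the host to view. This corresponds to host above.

Plugin list

Select the plugin to view. This corresponds to plugin above.

Chart list page

A plugin instance plus a type produces one chart, and each type instance corresponds to one curve on that chart.

Chart detail page

Click a chart on the chart list page to open its detail page, where you can select the time range of the statistics (for example, by hour, day, week, month, or year).

You can also aggregate the same chart across all hosts for cross-comparison.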
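The grouping rule described for the chart list page can be sketched as follows. This only models the rule as stated in this article; the actual behavior depends on the web front end in use. `group_into_charts` is a hypothetical helper, and the label "value" for a missing type instance is an assumption made for the example.

```python
from collections import defaultdict


def group_into_charts(identifiers):
    """Map (host, plugin[-instance], type) to the list of curves on that chart."""
    charts = defaultdict(list)
    for ident in identifiers:
        host, plugin_part, type_part = ident.split("/")
        type_, _, type_instance = type_part.partition("-")
        # One chart per plugin instance + type; one curve per type instance.
        charts[(host, plugin_part, type_)].append(type_instance or "value")
    return dict(charts)


if __name__ == "__main__":
    sample = [
        "leeloo/cpu-0/cpu-idle",
        "leeloo/cpu-0/cpu-user",
        "leeloo/cpu-1/cpu-idle",
        "leeloo/load/load",
    ]
    for chart, curves in group_into_charts(sample).items():
        print(chart, curves)
```

Here cpu-0 and cpu-1 each get their own chart, while idle and user share the cpu-0 chart as two curves.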

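Using the identifier

When submitting data we have a great deal of freedom, and collectd leniently accepts and displays whatever we send. To make the final result useful and easy to use, however, each part of the identifier must be specified correctly.

host

Fill in the host name. When statistics are needed for an entire service (spanning multiple hosts), use the aggregation feature of the collectd web interface.

plugin

The plugin name.
plugin instance

The plugin instance corresponds to the name of one metric collected by the plugin, for example memory.

For a simple plugin that collects only one metric, the plugin instance part can be omitted entirely and the plugin can be named after the metric.
type

Choose one of the types predefined in types.db.
type instance

A metric that is unique on a host (for example load) does not need a type instance. A metric that is not unique on a host (for example per-partition usage or per-process CPU usage) can use the type instance to tell the series apart (for example, set it to the partition path or the process name).

Multiple type instances are drawn in the same chart, one curve each. If showing them together is not meaningful, a plugin instance is probably a better way to identify them.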
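The naming advice above can be illustrated with a small sketch. All helper names here are hypothetical, as is the plugin name `disk_usage`. The sketch assumes the exec plugin exports `COLLECTD_HOSTNAME` to the script, that a `percent` type exists in your types.db, and that slashes in a partition path must be replaced because `/` separates the parts of an identifier.

```python
import os
import socket


def resolve_host():
    # Assumption: the exec plugin sets COLLECTD_HOSTNAME; otherwise fall
    # back to the system host name.
    return os.environ.get("COLLECTD_HOSTNAME") or socket.gethostname()


def sanitize_instance(path):
    """Turn a mount point into a name that contains no slash."""
    if path == "/":
        return "root"
    return path.strip("/").replace("/", "-")


def build_identifier(host, plugin, type_, plugin_instance=None, type_instance=None):
    # A hyphen is only written when the corresponding instance is present.
    plugin_part = f"{plugin}-{plugin_instance}" if plugin_instance else plugin
    type_part = f"{type_}-{type_instance}" if type_instance else type_
    return f"{host}/{plugin_part}/{type_part}"


if __name__ == "__main__":
    host = resolve_host()
    # Unique metric on the host: no instances needed.
    print(build_identifier(host, "load", "load"))
    # Non-unique metric: one type instance (one curve) per partition.
    for mount in ["/", "/var/log"]:
        print(build_identifier(host, "disk_usage", "percent",
                               type_instance=sanitize_instance(mount)))
```

If the per-partition curves turn out not to be meaningful on one chart, passing the sanitized path as `plugin_instance` instead gives each partition its own chart.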

Working around the root account restriction

Quoted from man 5 collectd-exec

CAVEATS · The user, the binary is executed as, may not have root privileges, i. e. must have an UID that is non-zero. This is for your own good.

The Exec plugin does not allow programs to be run with root privileges.

The gentle solution

Quoted from Plugin:Exec - collectd Wiki

The security concerns are addressed by forcing the plugin to check that custom programs are never executed with superuser privileges. If the daemon runs as root, you have to configure another user ID with which the new process is created. To circumvent missing access privileges to files, you need to lean on the unix group concept. I.e. your script requires access to /var/log/messages, which is owned by root, it's common practice to have this file being group readable by the admin-group. Given the used ID corresponds to MyWatcherUser, you need to add that user to the admin group via /etc/group (or what else manages users / groups on your system).

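For files that would otherwise require root to access, change the owning group to admin and make them group-readable, then add the plugin's account to the admin group as well.
The brute-force solution

Use setuid to allow the executable to run as root.

See "Allowing ordinary users to run commands that require root privileges on Linux" (linux下允许普通用户执行需要root权限的命令 | 看看俺 – KanKanAn.com).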
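Before pointing the exec script at such a file, it can help to verify that the group setup actually took effect. The check below is a minimal sketch: `group_can_read` is a hypothetical helper that only inspects the file's owning group and its group-read bit, and it does not consider ACLs or the permissions of parent directories.

```python
import os
import stat


def group_can_read(path, gid):
    """True if path belongs to group gid and has the group-read bit set."""
    st = os.stat(path)
    return st.st_gid == gid and bool(st.st_mode & stat.S_IRGRP)


if __name__ == "__main__":
    # Example: check this script itself against the caller's primary group.
    # For the setup described above you would pass the admin group's id,
    # e.g. grp.getgrnam("admin").gr_gid.
    print(group_can_read(__file__, os.getgid()))
```

Note that a change in group membership only applies to processes started afterwards, so the daemon may need a restart before the script gains access.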