New Ceph plugin for CheckMK

In my last post I showed that a bug affects the monitoring of Ceph in CheckMK (at least up to 2.3.0p18). I also reported that an alternate plugin exists and that I would test it. Here are my findings.

Installation

First you have to remove /etc/check_mk/ceph.cfg and /usr/lib/check_mk_agent/plugins/mk_ceph from all monitored Ceph hosts. Both files belong to the standard CheckMK Ceph monitoring. After that, return to CheckMK, reschedule the Check_MK Discovery service and accept the removal of the vanished Ceph services.
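
The removal step can be sketched as a small shell function. The function name remove_old_ceph_plugin and its optional root argument are my own additions for illustration; the argument only exists so the function can be rehearsed against a scratch directory before it touches a real host.

```shell
# Sketch: remove the old mk_ceph plugin and its configuration file.
# The optional first argument is a root prefix (hypothetical helper feature);
# without it the real paths below / are used.
remove_old_ceph_plugin() {
    ceph_root="${1:-}"
    # rm -f does not complain if a file is already gone
    rm -f "$ceph_root/etc/check_mk/ceph.cfg" \
          "$ceph_root/usr/lib/check_mk_agent/plugins/mk_ceph"
}
```

Run as root on each Ceph host, calling remove_old_ceph_plugin without an argument removes both files from their real locations.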

Then go to the CheckMK Exchange and open the download page for the new Ceph statistics plugin. Copy the download link for version 2.x, then log in to your CheckMK server. Switch to your site user (in this example: “MYSITE”):

# su - MYSITE

Then download the .mkp file:

$ wget https://exchange.checkmk.com/packages/ceph/.../ceph-...mkp

I deliberately shortened the download link; paste the link you copied from the plugin site instead. Then add and enable the new plugin and leave the site user's shell:

$ mkp add ceph-...mkp
$ mkp enable ceph
$ exit

You can now copy the plugin files to your Ceph hosts (replacing MYSITE with your site and PVE with the name of each Ceph server):

# cd /opt/omd/sites/MYSITE/local/share/check_mk/agents/plugins
# scp ceph.py ceph_2.py PVE:/usr/lib/check_mk_agent/plugins/.
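
With several Ceph hosts, the copy step above can be wrapped in a loop. The following is only a sketch: the function name deploy_ceph_plugins, the RUN variable for a dry run and the host names pve1 and pve2 are assumptions of mine, not part of the plugin or of CheckMK. It expects to be called from the plugins directory shown above.

```shell
# Sketch: copy both plugin files to every host given as an argument.
# Setting RUN=echo (hypothetical dry-run switch) prints the scp commands
# instead of executing them.
deploy_ceph_plugins() {
    for ceph_host in "$@"; do
        ${RUN:-} scp ceph.py ceph_2.py \
            "$ceph_host:/usr/lib/check_mk_agent/plugins/."
    done
}
```

For example, "RUN=echo deploy_ceph_plugins pve1 pve2" lists what would be copied; without RUN=echo the files are actually transferred.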

The new plugins don’t need any configuration in /etc/check_mk/ceph.cfg. They read /etc/ceph/ceph.conf, which on Proxmox servers is a symbolic link to /etc/pve/ceph.conf.
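
If you want to see which file the plugins will end up reading on a given host, resolving the symbolic link is a quick check. This is a minimal sketch; the function name ceph_conf_target is hypothetical, and it relies on readlink -f as provided by GNU coreutils.

```shell
# Sketch: print the real file behind the Ceph configuration path.
# Defaults to /etc/ceph/ceph.conf; another path can be passed for testing.
ceph_conf_target() {
    readlink -f "${1:-/etc/ceph/ceph.conf}"
}
```

On a Proxmox server, calling ceph_conf_target without an argument should print /etc/pve/ceph.conf, given the symbolic link described above.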

Return to your CheckMK GUI and run a service discovery for the Ceph hosts. It will offer you several new Ceph services. Accept them for monitoring and activate the changes.

Differences between old and new plugin

Let’s compare what we got with the original plugin to what we get with the alternate plugin:

  • Ceph OSDs was a single service for all OSDs, without graphs, where you could set upper levels for the number of OSDs that are out or down to trigger a warning or critical state.

    Now you get a service for each OSD and another for its database, with graphs for size and used space, growth, shrinkage, trends, latency and number of placement groups (PGs). The parameters are those for filesystems, which allow a large number of checks. Just to name a few:
    • used/free space as percentage, absolute or dynamic.
    • time range for trend computation
    • levels on trends per time range
    • levels for the percentage growth per time range
    • levels on time left until full
  • Ceph PGs just listed the number of PGs and their state. No parameters.

    This service has no direct equivalent in the new plugin but the PGs and their states are monitored in the new overall Ceph status service.
  • Ceph pools are mostly the same for the old and the new plugin. In both cases the graphs show the usual values for filesystems. The new plugin additionally shows the disk throughput in bytes per second and the disk I/O operations per second.

    The main point here is that the new plugin shows the correct value for “total size” out of the box, without editing files on the CheckMK server.

    The old plugin showed a Ceph pool summary for which I did not have much use because it showed the gross (raw) capacity of all OSDs instead of the net capacity.
  • The old Ceph status service had no graphs. The new plugin shows a lot of graphs, mostly for filesystem-like parameters, but additionally for the number and state of PGs, objects and degraded objects. Beware that the values are based on raw/gross capacity.
  • A completely new service is Ceph class (typically ssd or hdd). It shows the usual filesystem values, but per storage class.
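
The difference between gross (raw) and net capacity mentioned in the list above can be illustrated with a rough calculation. This sketch assumes replicated pools that all use the same number of replicas; it ignores erasure-coded pools, metadata overhead and the full ratios, so treat the result as an estimate only. The function name net_capacity is hypothetical.

```shell
# Sketch: estimate net capacity from raw capacity for replicated pools.
# $1 = raw capacity (any unit, integer), $2 = number of replicas.
# Assumption: every object is stored $2 times, so net = raw / replicas.
net_capacity() {
    echo $(( $1 / $2 ))
}
```

With 30 TB of raw capacity and three replicas, "net_capacity 30 3" prints 10, which is why a summary based on raw capacity looks far more generous than what the pools can actually hold.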

To check for warnings I shut down one of the Ceph hosts. With the old plugin I got CRIT for the OSDs and WARN for PGs and Status. The new plugin only gave me a WARN for the Status, but reported in detail that one MON and one OSD were down and that several PGs were degraded. That’s good enough for me.

Both the old and the new plugin failed to mention which components on which host exactly are down. But you should run the Ceph plugin on all your hosts anyway so CheckMK gives you a good impression of what is down and where to look first.

Conclusion

The new Ceph plugin from the CheckMK Exchange fixes the bug I mentioned in my last post. It also gives you some interesting additional insights through graphs.

The CheckMK people mentioned that the new plugin will be “mainlined”, whatever that means exactly. So far I have not seen a CheckMK version where it is included.

Although there is some manual installation work to do, I recommend using the new plugin. It will become standard sooner or later anyway.

Update: In the meantime Robert Sander stated that the new Ceph plugin will be included in the upcoming CheckMK release 2.4 and will replace the old mk_ceph.

