The State of Ceph Support in openATTIC (August 2016)

In May, I posted an update on the state of the Ceph support in openATTIC.

Since then, we released openATTIC 2.0.12 and 2.0.13 and are currently working on the next release, 2.0.14.

With each release, we have added more Ceph management and monitoring functionality or refined existing features.

In this post, I'd like to summarize these changes as well as give an update on what we're currently working on.

Developing new Ceph features in openATTIC usually requires building the backend functionality and the corresponding REST API first. Once this foundation is in place, we can begin adding UI elements to the web-based administration interface. So even if there were no visible changes in the UI, the backend might have gained a number of new features that will be consumed by the web frontend in an upcoming release. We're following a "bottom-up" approach here, in which we aim to provide basic functionality first and then refine and improve it over time, based on user and community feedback.

So what are some of the key Ceph-related changes?

Version 2.0.12 added a new Ceph management page to the WebUI that lists all existing RADOS block devices (RBDs) and their details for the selected cluster. This release also added initial Nagios/Icinga monitoring support for keeping track of a Ceph cluster's overall health status and performance, storing the data in RRD files using PNP4Nagios. On the backend side, a lot of functionality was added to the REST API - from the CHANGELOG (a short usage sketch follows the list):

  • Added possibility to create erasure-coded Ceph pools in the REST API (OP-546)
  • Added API call for creating new Ceph pools (OP-1024)
  • Added modifying requests to Ceph pools (OP-1170, OP-1172)
  • Added Ceph Pool snapshots to the REST API (OP-1242)
  • Added support for Ceph cache tiering (OP-1184)
  • Added API call to activate and deactivate Ceph OSDs (OP-1212)
  • Added Ceph RBD REST Collection (OP-1214)
  • Added a Nagios plugin to monitor basic performance data of a Ceph cluster (OP-1222) (thanks to Christian Eichelmann for giving us the permission to integrate a part of his check-ceph-dash implementation)
  • Added a basic infrastructure to create Nagios service definitions for known Ceph clusters (OP-1235)
  • Added CephFS REST Collection (OP-1245)
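
As a rough illustration of how these new API calls could be consumed before any UI exists, here is a minimal sketch using Python's requests library. The base URL, the endpoint path, the payload fields and the token-based authentication shown here are assumptions for illustration only and may differ from the actual openATTIC REST API.

    # Hedged sketch: creating pools through the REST API (cf. OP-1024 and OP-546).
    # Endpoint path, field names and auth scheme are illustrative assumptions.
    import requests

    BASE = "http://openattic.example.com/openattic/api"   # hypothetical base URL
    AUTH = {"Authorization": "Token 0123456789abcdef"}    # hypothetical API token

    # Create a replicated pool.
    resp = requests.post(
        BASE + "/ceph/pools",                              # hypothetical endpoint
        json={"name": "rbd-images", "pg_num": 128, "type": "replicated", "size": 3},
        headers=AUTH,
    )
    resp.raise_for_status()

    # Create an erasure-coded pool.
    resp = requests.post(
        BASE + "/ceph/pools",
        json={"name": "ec-archive", "pg_num": 64, "type": "erasure",
              "erasure_code_profile": "default"},
        headers=AUTH,
    )
    resp.raise_for_status()
    print(resp.json())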

In version 2.0.13, we made further improvements to the Ceph RBD management UI, in particular the ability to create and delete RBDs. Based on user feedback, we also cleaned up many Ceph-related detail-information tabs so that they display only useful data, especially on the Ceph RBD page.
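
For context, creating and deleting an RBD ultimately boils down to a handful of librbd calls. The snippet below is a minimal sketch using the upstream rados/rbd Python bindings directly, not openATTIC's own backend code; the pool name, image name and size are made-up examples.

    # Minimal sketch of RBD create/delete with the upstream Python bindings.
    import rados
    import rbd

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("rbd")          # pool name is just an example
    try:
        rbd_inst = rbd.RBD()
        rbd_inst.create(ioctx, "test-image", 4 * 1024 ** 3)   # 4 GiB image
        print(rbd_inst.list(ioctx))            # should now include "test-image"
        rbd_inst.remove(ioctx, "test-image")
    finally:
        ioctx.close()
        cluster.shutdown()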

The Ceph monitoring with Nagios/Icinga was also improved by adding support for monitoring the performance data of individual Ceph pools. We also now track the run time of these monitoring commands, to measure the responsiveness of the Ceph cluster over time. The performance data is stored in RRD files and can be obtained in the form of JSON objects via the REST API.
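
As a quick illustration, a script or an external dashboard could pull this JSON directly from the REST API. The endpoint path, query parameter and token in the sketch below are hypothetical placeholders rather than the documented API.

    # Hedged sketch: fetching the stored performance data as JSON via the REST API.
    import requests

    BASE = "http://openattic.example.com/openattic/api"   # hypothetical base URL
    AUTH = {"Authorization": "Token 0123456789abcdef"}    # hypothetical API token

    resp = requests.get(
        BASE + "/ceph/performancedata",                    # hypothetical endpoint
        params={"start": "-6h"},                           # hypothetical time-range parameter
        headers=AUTH,
    )
    resp.raise_for_status()
    for series in resp.json():                             # assume a list of data series
        print(series)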

We also made some performance improvements by running only the commands that are actually used by the REST API. Here's a more detailed list from the 2.0.13 CHANGELOG:

  • Added the performance data (in JSON) of a Ceph cluster to the REST API (OP-1279)
  • General Ceph RBD improvements (OP-1309, OP-1302, OP-1305 and OP-1133)
  • Improved compatibility with Ceph Hammer (OP-1303)
  • Fixed RBDs with same name in multiple pools (OP-1313)
  • Fixed ceph df missing a newly created pool (OP-1282)
  • Optimized calls to librados (OP-1321)
  • Added Ceph pool Nagios monitoring (OP-1292)
  • The check_cephcluster Nagios plugin now reports an additional value, exec_time, which represents the run time of the cluster check (OP-1307) - a conceptual sketch follows below
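
To illustrate the idea behind the exec_time value, here is a conceptual sketch of a Nagios-style check that times a cluster status query and reports its own run time as performance data. It is not the actual check_cephcluster plugin, and the health-field layout assumes a Hammer/Jewel-era "ceph status" JSON output.

    # Conceptual sketch of a Ceph check that also reports its own run time.
    import json
    import sys
    import time

    import rados

    start = time.time()
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        # "ceph status" via librados; returns (return code, output buffer, status string)
        ret, outbuf, errs = cluster.mon_command(
            json.dumps({"prefix": "status", "format": "json"}), b"")
    finally:
        cluster.shutdown()
    exec_time = time.time() - start

    status = json.loads(outbuf.decode("utf-8"))
    # "overall_status" matches the Hammer/Jewel-era JSON layout of "ceph status"
    health = status["health"]["overall_status"]
    print("CEPH {} | exec_time={:.3f}s".format(health, exec_time))
    sys.exit(0 if health == "HEALTH_OK" else 2)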

So what's cooking for version 2.0.14? One key change on the backend side that has already been merged is the implementation of a task queue module (OP-1360) for tracking long-running commands. This became a requirement because many operations in Ceph can take a long time to complete, and we did not want the UI to get stuck or run into a timeout error before a task has finished. We also improved the handling of optional Ceph pool attributes in the REST API (OP-1416).
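
To illustrate the general idea (this is not the actual OP-1360 implementation), a task queue lets the backend return a task ID immediately and lets the client poll for the result instead of blocking on a slow Ceph call:

    # Conceptual sketch of the task-queue idea: run a long operation in the
    # background, hand back a task ID, and let the client poll for the status.
    import time
    import uuid
    from concurrent.futures import ThreadPoolExecutor

    _executor = ThreadPoolExecutor(max_workers=4)
    _tasks = {}

    def submit(func, *args):
        """Run func in the background and hand back a task ID right away."""
        task_id = str(uuid.uuid4())
        _tasks[task_id] = _executor.submit(func, *args)
        return task_id

    def status(task_id):
        """Report the task state, including the result once it has finished."""
        future = _tasks[task_id]
        if not future.done():
            return {"state": "running"}
        return {"state": "finished", "result": future.result()}

    def slow_ceph_operation(pool_name):
        time.sleep(10)                 # stand-in for a slow pool or RBD operation
        return "pool '{}' created".format(pool_name)

    # A REST client or the WebUI would store the ID and poll periodically:
    tid = submit(slow_ceph_operation, "ec-archive")
    while status(tid)["state"] == "running":
        time.sleep(1)
    print(status(tid))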

On the WebUI side, we have a pending pull request that adds Ceph pool management capabilities for adding and removing pools (OP-1299 and OP-1417).

Also, now that the monitoring subsystem keeps track of the cluster's health and performance data, it would be nice to display it in the form of graphs on a "Ceph dashboard" in the WebUI. The implementation of this new dashboard is nearing completion - a pull request has already been posted, and we expect to merge it in time for the 2.0.14 release.

This dashboard will also replace the existing dashboard for the "traditional" storage management part. You will be able to customize it to your personal taste and requirements: widgets can be arranged and resized on a grid layout and you can create multiple dashboards.

Now that the Ceph pool performance monitoring is in place, we'll also look into displaying these performance graphs in a pool's detail view, similar to how they are displayed for regular storage volumes. But instead of displaying static PNG images, we'll use the NVD3 JavaScript library here. This will give us more flexibility for displaying and analyzing the performance data.

We're also working on adding remote node management and deployment capabilities in cooperation with SUSE. This functionality will be built on top of SUSE's Salt Pillar work. As a first step, it should be possible to assign roles to newly deployed hosts, so they can be added to an existing Ceph cluster that has been created using this Salt framework.

By the way, a good way to get a glimpse of what's happening in openATTIC development is to keep an eye on pending pull requests (and you're more than welcome to review and comment on these!), as well as on the CHANGELOG file in the openATTIC development branch.

As usual, we're looking forward to your feedback and suggestions!
