Monitoring Documentation Improvement

Hi,

I’m currently writing a custom monitoring solution with the help of the documentation on monitoring and metrics, in order to let Prometheus scrape transactor metrics. Unfortunately, the metrics mentioned there don’t seem to completely represent what the transactor is currently able to produce and hand over to a registered callback function.

To be a bit more specific on this:

  • PodUpdateMsec was part of the metrics handed over to my callback function but it’s not part of the documentation
  • StorageBackoff has been declared deprecated as of version 0.8.3826 and should have been replaced by StorageGetBackoffMsec and StoragePutBackoffMsec, but it is still produced by the transactor in the current version (0.9.5966), without the replacements
  • WriterMemcachedPutMusec, WriterMemcachedPutFailedMusec, ReaderMemcachedPutMusec and ReaderMemcachedPutFailedMusec, which were introduced in version 0.9.5078, appear only in the changelog and not in the documentation itself

Furthermore, I would appreciate it if the monitoring documentation linked above included a snippet of metrics as handed to a registered callback function. This would eliminate the need to fire up a transactor just to get a grasp of the structural layout of the information. Getting that grasp also seems more difficult than it should be, since it’s not always clear at first glance which statistics (:lo :hi :sum :count) a given metric is mapped to.

At this point I just wanted to ask whether there is a plan to improve the documentation by adding the missing information, bundling information that is spread over multiple pages (changelog included), and making the metrics more accessible by being more specific about what they describe (cross-links are cool, too) and how they are structured within a metrics ‘blob’.

P.S.
Since this issue affects several categories (Datomic Cloud, Datomic On-Prem and Datomic Applications), General seemed to be a good fit. Please feel free to move it if desired.

Hi Ninja, and welcome to the Datomic forum. I want to make sure you have all the information you need to write a correct and robust custom monitoring callback.

The most important thing to understand is that the set of metrics is dynamic, open, and subject to change over time. So a correct implementation needs to be tolerant of

  • metric names it has never seen, and does not understand
  • metric values in either documented shape (numbers or maps), regardless of what shape it has seen for a metric previously
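
To make the two points above concrete, here is a minimal sketch of a tolerant handler. It is written in Python purely for brevity and is not Datomic code; a real callback would be written for the transactor's own runtime. The function name flatten_metrics, the underscore naming of the flattened samples, and the example values are all hypothetical. The only assumption taken from this thread is that a metric value is either a number or a map of statistics.

```python
from numbers import Number


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Number) and not isinstance(value, bool)


def flatten_metrics(metrics):
    """Flatten a metrics map into {sample_name: number}.

    Hypothetical helper: accepts any metric name, and for each value
    accepts either a plain number or a map of statistics. Values of any
    other shape are skipped instead of raising.
    """
    samples = {}
    for name, value in metrics.items():
        if _is_number(value):
            # Number-shaped metric: emit it under its own name.
            samples[str(name)] = value
        elif isinstance(value, dict):
            # Map-shaped metric: emit one sample per numeric statistic,
            # without assuming which statistic keys are present.
            for stat, stat_value in value.items():
                if _is_number(stat_value):
                    samples[f"{name}_{stat}"] = stat_value
        # Anything else is an unknown shape and is ignored.
    return samples


# Hypothetical input: one map-shaped metric and one metric name the
# handler has never seen before.
first = flatten_metrics({
    "PodUpdateMsec": {"lo": 1, "hi": 4, "sum": 9, "count": 3},
    "SomeNewMetric": 7,
})

# The same metric may arrive later in the other shape; no state is kept
# between calls, so this is handled the same way.
second = flatten_metrics({"PodUpdateMsec": 12})
```

The sketch keeps no per-metric state and no list of known names, so an unfamiliar metric, or a familiar one that changes shape, passes through without special handling.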

With that in mind, we can turn to your specific questions:

  1. There are and will continue to be undocumented metrics such as PodUpdateMsec. We document them only if they become important in helping users troubleshoot systems.
  2. I see StorageGetBackoffMsec in my custom callback – can you please start a new forum post with a repro on this?
  3. The memcached metrics you mention were high volume and low value, so we removed them in 0.9.5783. Thanks for pointing this out! We will fix the changelog on our next release.

I will update the docs to include the advice from this thread.

Thanks for your suggestions, and please let me know if this gives you the information you need to proceed.

Stu