What is worth instrumenting?

Services and libraries have different needs. Further, not all services are alike in the types of work they perform, or in which of those types of work are important to measure.

Online-serving systems

These are services that have a person or client waiting for a response.

As such, the RED method captures the key metrics for these services: Requests, Errors and Duration.

There may be a tendency to exclude failed requests when capturing duration, but this temptation should be avoided.

If you only recorded successes, a long-running request that ultimately failed after 15 seconds would be excluded, for example, however reasonable the initial assumption that errors tend to have a lower duration.
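As a rough illustration of the point above, here is a minimal sketch of RED-style bookkeeping that records duration for failed requests as well as successful ones. The `RedMetrics` class and its `observe` method are hypothetical names invented for this example and use only the Python standard library; a real service would more likely use its metrics library's counter and histogram types.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RedMetrics:
    """Hypothetical in-memory holder for Requests, Errors and Duration."""

    requests: int = 0
    errors: int = 0
    durations: list = field(default_factory=list)

    def observe(self, handler, *args):
        """Run handler(*args), recording RED metrics around the call."""
        start = time.perf_counter()
        self.requests += 1
        try:
            return handler(*args)
        except Exception:
            self.errors += 1
            raise
        finally:
            # The finally block runs for successes and failures alike,
            # so slow requests that end in an error are still measured.
            self.durations.append(time.perf_counter() - start)
```

Recording the duration in `finally` rather than after a successful return is the design choice that keeps failed requests in the duration data.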

Offline-serving systems

These are services that operate continually in the background. Their work is generally done in batches and may span multiple steps, with a queuing system buffering between them.

The USE method captures key metrics which are Utilisation, Saturation and Errors.
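One possible reading of these three metrics for a queue-backed worker is sketched below. The `UseMetrics` class, its field names, and the choice of queue depth as the saturation signal are all assumptions made for illustration, not something prescribed by the USE method itself.

```python
from dataclasses import dataclass


@dataclass
class UseMetrics:
    """Hypothetical snapshot of one worker over a measurement window."""

    busy_seconds: float = 0.0    # time spent processing items
    window_seconds: float = 0.0  # length of the measurement window
    queue_depth: int = 0         # items waiting: a proxy for saturation
    errors: int = 0              # items that failed processing

    def utilisation(self) -> float:
        """Fraction of the window in which the worker was busy."""
        if self.window_seconds <= 0:
            return 0.0
        return self.busy_seconds / self.window_seconds
```

For instance, a worker that was busy for 45 of the last 60 seconds has a utilisation of 0.75, while a growing `queue_depth` suggests that work is arriving faster than it can be processed.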

Batch jobs

Similar to offline-serving systems, these may be kicked off upon request (e.g. sending an email in the background) or run on a schedule, akin to a cron job.

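Given that they don’t run for long enough to serve a persistent HTTP endpoint for scraping, it’s best to push their metrics upon work being completed. Prometheus itself pulls metrics rather than receiving them, so the push goes to an intermediary such as the Prometheus Pushgateway, which Prometheus then scrapes.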
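A minimal sketch of pushing on completion follows, assuming a Prometheus Pushgateway is the push target. The metric names, the `render_metrics` and `push_metrics` helpers, and the gateway address in the usage note are hypothetical; the payload uses the Prometheus text exposition format and the request targets the Pushgateway's `/metrics/job/<job>` path. In practice an official client library would usually handle this for you.

```python
import urllib.request


def render_metrics(duration_seconds: float, finished_at: float) -> str:
    """Render two gauges in the Prometheus text exposition format."""
    lines = [
        "# TYPE batch_job_duration_seconds gauge",
        f"batch_job_duration_seconds {duration_seconds}",
        "# TYPE batch_job_last_success_timestamp_seconds gauge",
        f"batch_job_last_success_timestamp_seconds {finished_at}",
    ]
    # The exposition format expects a trailing newline.
    return "\n".join(lines) + "\n"


def push_metrics(gateway_url: str, job: str, payload: str) -> int:
    """PUT the payload to the gateway, replacing this job's metric group."""
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job}"
    request = urllib.request.Request(
        url, data=payload.encode("utf-8"), method="PUT"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A job would call something like `push_metrics("http://localhost:9091", "send_email", render_metrics(elapsed, time.time()))` as its last step, where the address and job name are placeholders for your own setup.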