What is worth instrumenting?

· Around 1 minute

Services and libraries have different needs. Further, not all services are alike in the types of work they perform or what types of work are important to measure

Online-serving systems

These are services that have a person or client waiting for a response.

As such, the RED method captures key metrics which are Requests, Errors and Duration.

It's worth noting that there may be a tendency to exclude failed requsts when capturing duration but this temptation should be avoided.

In the event that you only had successes, a long running request that ultimate failed after 15 seconds would be excluded for example, despite any reasonable initial assumption that errors may tend towards having a lower duration.

Offline-serving systems

These are services that operate continually in the background. Their workloads are generally in batches and may utilise multiple steps, buffered with a queuing system.

The USE method captures key metrics which are Utilisation, Saturation and Errors.

Batch jobs

Similar to offline-serving systems, these may be kicked off upon request (ie sending an email in the background) or something akin to a cronjob.

Given that they aren't suitable for serving a persistent HTTP endpoint for scraping, it's best to push metrics to a monitoring solution such as Prometheus upon work being completed.