driohq.net

The power of observability with Prometheus and Grafana

Prometheus and Grafana are two pieces of technology that I find invaluable. When used properly, they give you great insights into how your systems are doing.

Prometheus pulls data from your services and stores it on local storage. This pull action is called scraping in Prometheus lingo. Your systems expose their data (metrics) through an HTTP endpoint that Prometheus scrapes. You can use one of the many Prometheus client libraries to instrument your applications. Alternatively, there are standalone tools (exporters) that collect the metric data and expose it to Prometheus via an HTTP endpoint.
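To make the scraping idea concrete, here is a minimal sketch of a metrics endpoint written with only the Python standard library. In practice you would probably reach for a client library instead. The metric names, the values, and port 8000 are all made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory metrics: name -> (type, help text, value).
# A real application would update these values as it runs.
METRICS = {
    "myapp_requests_total": ("counter", "Total requests handled.", 42),
    "myapp_temperature_celsius": ("gauge", "Current temperature.", 21.5),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (kind, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {kind}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the rendered metrics, 404 otherwise."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving /metrics; the port is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and pointing a scrape job at that host and port should be enough for Prometheus to start collecting these two metrics.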

Prometheus comes with service discovery functionality, which means it can discover endpoints and start scraping them on its own. It understands several service discovery mechanisms, and you can also write your own. Prometheus can even watch a file to "discover" services. I find that method very simple and powerful.
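Here is a rough sketch of how a script could feed file-based discovery. It writes a JSON file in the shape that file-based discovery reads: a list of groups, each with "targets" and optional "labels". The function name, the file name, the hosts and the label are assumptions for the example.

```python
import json
import os
import tempfile


def write_targets(path, targets, labels=None):
    """Write a target file atomically so a reader never sees a partial file."""
    groups = [{"targets": list(targets), "labels": dict(labels or {})}]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    # mkstemp creates the file readable by the owner only; relax that so a
    # Prometheus process running as another user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)  # atomic when source and destination share a filesystem
    return groups
```

For example, write_targets("targets.json", ["host1:9100", "host2:9100"], {"env": "home"}) produces a file you could reference from a file_sd_configs entry in the Prometheus configuration. Rewriting the file is all it takes to add or remove targets.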

Prometheus handles alerting through a separate single-binary tool called Alertmanager. Prometheus pushes alerts to Alertmanager, which in turn notifies whatever system you configure: email, PagerDuty, etc.

Metrics and series

Prometheus is a time series database for storing metrics. Grafana queries Prometheus (and other data sources) to visualize those metrics. The definition of a metric is tricky (at least for me). A metric is something you can measure over time. In the context of computer systems it could be things like the CPU load of a machine, memory usage, number of requests, etc. Another example could be the temperature in a house.

You can measure temperature in different locations of the house: attic, office, living room, kitchen... All those measurements belong to the same metric (temperature), but the metric resolves to many time series. In our temperature example, if you query by the metric name alone (temperature) you get one time series per room in the house. In PromQL (the Prometheus query language), we use labels to select the series within a metric that we want to focus on. Understanding the relationship between metrics and series, and the role labels play, is key to using Prometheus effectively.
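A toy model helped me get this straight. The sketch below is not how Prometheus stores data. It only illustrates that one metric name fans out into several series and that label matchers narrow the selection. The label name room, the readings, and the select helper are invented for the example.

```python
# One metric name, several series: each series is the name plus a label set.
SERIES = [
    {"name": "temperature", "labels": {"room": "attic"}, "value": 27.0},
    {"name": "temperature", "labels": {"room": "office"}, "value": 22.5},
    {"name": "temperature", "labels": {"room": "living_room"}, "value": 21.0},
    {"name": "temperature", "labels": {"room": "kitchen"}, "value": 23.0},
]


def select(series, name, **matchers):
    """Return the series of metric `name` whose labels equal every matcher."""
    return [
        s
        for s in series
        if s["name"] == name
        and all(s["labels"].get(key) == value for key, value in matchers.items())
    ]


# Roughly what the PromQL selector `temperature` does: every series of the metric.
all_rooms = select(SERIES, "temperature")

# Roughly what `temperature{room="attic"}` does: only the matching series.
attic_only = select(SERIES, "temperature", room="attic")
```

With the four rooms above, all_rooms holds four series and attic_only holds one.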

Metric types

Prometheus has four types of metrics, but the ones you want to start with are gauges and counters. Gauges capture the current value of something, and that value can go up or down. Examples are the temperature or the current CPU usage of a machine.

We use counters for values that only increase. Some examples are HTTP requests or network traffic. The current value of a counter is rarely useful on its own. Instead, you want to see its rate of change over a particular period of time. We use rate() and the related PromQL functions to accomplish that.
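To show why the rate matters more than the raw value, here is a simplified sketch of a per-second rate computed from counter samples. It is not Prometheus's actual algorithm: the real rate() also extrapolates to the edges of the time window, which this version skips. The function name and the sample data are made up.

```python
def simple_rate(samples):
    """Average per-second increase over (timestamp, value) samples, oldest first.

    A drop in value is treated as a counter reset (for example, the process
    restarted), so the counter is assumed to have started again from zero.
    Returns None when there are not enough samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current  # reset: count everything since the restart
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical samples taken every 15 seconds; the counter resets before t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
requests_per_second = simple_rate(samples)
```

Notice that the last raw value (40) tells you nothing by itself, while the computed rate stays meaningful across the reset.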

One interesting use of Prometheus is to create a metric that holds a timestamp. Let's say you have a piece of software that performs a task periodically. You can have Prometheus scrape the application, and the application can expose a metric (a gauge) that returns the timestamp of the last time it completed the task successfully. Then you can monitor that value and alert if current(ts) - last(ts) > constant. So much potential.
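The check itself is tiny. Here is a sketch of the same logic in Python, with the names is_stale and max_age_seconds chosen for the example. In PromQL the equivalent would look something like time() - myjob_last_success_timestamp_seconds > 3600, where the metric name and the one hour threshold are hypothetical.

```python
import time


def is_stale(last_success_ts, max_age_seconds, now=None):
    """Return True when the last successful run is older than the allowed age.

    `last_success_ts` is a Unix timestamp in seconds, the value the gauge
    would expose. `now` can be passed explicitly to make the check testable.
    """
    if now is None:
        now = time.time()
    return (now - last_success_ts) > max_age_seconds
```

For instance, is_stale(1000, 300, now=1301) is True because 301 seconds have passed, while is_stale(1000, 300, now=1200) is False.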

Prometheus and Grafana are invaluable tools for understanding how your computer systems are operating. Are you using them? Let me know how, and whether they are as impactful for you as they are for me.

drio out.