One of the important things in IT is keeping the infrastructure reliable, and companies are investing a good amount of money in this. In the modern world, the tools are good enough to collect as many metrics as we need, and we can create visualisations too. Modern systems can emit thousands or millions of metrics, and modern monitoring tools can collect them all.
But is it good to collect the maximum number of metrics from servers or clusters without knowing their actual value?
No! This won’t help you during actual outages or incidents. You need proper dashboards with relevant data. The visualisations and dashboards need to be created in such a way that they reduce the troubleshooting pain during outages.
In this category, we will discuss Prometheus as a metrics platform and its advantages. Please follow this page to get updates.
One of the important responsibilities of an SRE (Site Reliability Engineer) is building this kind of useful dashboard to reduce troubleshooting time. If we have enough metrics from our systems and apps, and we have relevant dashboards, we are ready to fight outages. We can find the issue as early as possible by using these dashboards.
What are the modern methodologies to build a perfect monitoring setup?
Yeah, the SRE signals. There are three main modern methodologies for building a perfect dashboard.
1, USE method by Brendan Gregg.
2, RED method by Tom Wilkie.
3, LTES method (the four golden signals) from the Google SRE book.
These methodologies vary in focus, but they are interrelated as well. If you create visualisations using the above methodologies, they should be useful from a troubleshooting perspective. Let’s discuss these methodologies in detail.
U : Utilization
S : Saturation
E : Errors
This methodology is useful for getting an internal view of your infrastructure. It mainly focuses on your systems’ Utilization, Saturation and Errors. These dashboards help to identify how busy the resources in your servers are and how they perform.
For each resource (CPU, disk, memory, buses, etc.), this methodology helps you understand Utilization (the average time that the resource was busy servicing work), Saturation (queue length) and Errors (the count of error events). I suggest reading this article to get a clearer idea of this methodology.
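To make the three USE numbers concrete, here is a minimal Python sketch. The sample format (busy seconds, queue length and error count per interval) and the name `use_summary` are assumptions made up for this example; real tools expose these values in their own ways.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One measurement interval for a single resource (hypothetical format)."""
    interval_s: float   # length of the interval in seconds
    busy_s: float       # time the resource spent servicing work
    queue_len: int      # work still waiting at the end of the interval
    errors: int         # error events seen in the interval


def use_summary(samples):
    """Return (utilization in %, average queue length, total errors)."""
    total_time = sum(s.interval_s for s in samples)
    busy_time = sum(s.busy_s for s in samples)
    utilization = 100.0 * busy_time / total_time
    saturation = mean(s.queue_len for s in samples)
    errors = sum(s.errors for s in samples)
    return utilization, saturation, errors


# Three 10-second intervals for one disk (invented numbers).
disk = [
    Sample(interval_s=10, busy_s=9.0, queue_len=4, errors=0),
    Sample(interval_s=10, busy_s=10.0, queue_len=8, errors=1),
    Sample(interval_s=10, busy_s=8.0, queue_len=3, errors=0),
]
util, sat, errs = use_summary(disk)
print(f"utilization={util:.1f}% saturation={sat:.1f} errors={errs}")
```

A disk that is 90% busy with work queueing behind it tells a different story from one that is 90% busy with an empty queue, which is why the method asks for all three numbers per resource.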
R : Request Rate
E : Request Error
D : Request Duration
This methodology is useful for getting an external overview of your infrastructure. It focuses mainly on request-based metrics, not resource-based metrics. Duration is explicitly taken to mean distributions, not averages. I will explain this part later, when covering design aspects.
In this method, we are mainly checking the capacity of our system in terms of the requests it can handle: how many requests arrive, how many of them return errors, and how long they take. This method is based on the principles described in Google’s SRE book.
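The three RED numbers can be sketched in a few lines of Python. The `Request` record and the `red_summary` helper are hypothetical names for this illustration, and the percentile uses the simple nearest-rank definition rather than whatever a given monitoring tool implements.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record format)."""
    status: int          # HTTP status code
    duration_ms: float   # time taken to serve the request


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def red_summary(requests, window_s):
    """Return rate (req/s), error ratio and p50/p99 duration for one window."""
    durations = [r.duration_ms for r in requests]
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "rate": len(requests) / window_s,
        "error_ratio": failed / len(requests),
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }


# Ten requests seen in a 10-second window (invented numbers).
window = [Request(200, 20)] * 8 + [Request(200, 30), Request(500, 900)]
summary = red_summary(window, window_s=10)
print(summary)
```

In this invented window the mean duration is 109 ms, a value that describes none of the requests. The p50 of 20 ms and the p99 of 900 ms show both the typical case and the slow tail, which is the reason Duration is treated as a distribution.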
L : Latency
T : Traffic
E : Errors
S : Saturation
As I mentioned, the RED method is based on this LTES method by Google. LTES has one extra signal, Saturation, over and above the RED method. In common cases, RED is enough.
For latency, we mainly consider the time it takes to serve a request. This includes both successful and failed responses, and we should include both sets of metrics in the dashboard to analyse them. Traffic is the total number of requests. Errors is the rate of requests that fail, either explicitly (e.g., HTTP 500s) or implicitly (for example, an HTTP 200 success response coupled with the wrong content).
This is explained in Google’s SRE book.
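The difference between explicit and implicit failures can be sketched like this. The `is_failure` function and the idea of checking for an expected marker in the body are assumptions for illustration only; a real check depends on what your service is supposed to return.

```python
def is_failure(status, body, expected_marker):
    """Classify one response (hypothetical check).

    Explicit failure: the server reports an error status (HTTP 5xx).
    Implicit failure: the status says success but the content is wrong.
    Returns None when the response looks healthy.
    """
    if status >= 500:
        return "explicit"
    if expected_marker not in body:
        return "implicit"
    return None


responses = [
    (200, '{"user": "alice"}'),
    (500, "internal error"),
    (200, ""),  # success status, but an empty payload
]
kinds = [is_failure(s, b, expected_marker='"user"') for s, b in responses]
print(kinds)
```

An error panel that only counts status codes would miss the third response, so it is worth counting both kinds.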
Okay, these are the monitoring methodologies. Now we can start the Prometheus section. In this article I will only explain the basic concepts of Prometheus and its components. We will discuss them in detail in upcoming posts. You can follow my LinkedIn page for updates.
What is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit. It was originally built at SoundCloud. Prometheus is one of the best monitoring toolkits for containers and microservices. The toolkit is highly customizable and designed to deliver rich metrics without creating a drag on system performance.
It is easy to set up the Prometheus server. You can install and start Prometheus on physical or VM servers as a standalone service, or run it in a containerised setup. We will discuss this in upcoming posts. This is just an introductory post, so here we only discuss the basic components.
Working principle of Prometheus
This is not an agent-based tool. You do not need to configure any agents on servers to send metrics. Prometheus is not a push-based monitoring tool; it’s pull-based. Prometheus scrapes metrics from its target endpoints.
Prometheus is a monitoring platform that collects metrics from monitored targets by scraping metrics HTTP endpoints on these targets.
We need to configure exporters on the target nodes. The most important exporter is Node Exporter, which collects system metrics. The Prometheus server has these targets in its configuration, so it starts scraping metrics from the target servers. The three main steps to start collecting metrics are:
1, Configure and start the exporter(s) on the target server(s).
2, Make sure the Prometheus server can reach the target servers on the ports used by the exporters (9100 is the default port for Node Exporter). Allow this connectivity in the firewall.
3, Configure the target in the Prometheus server.
That’s it! It will start scraping metrics from the targets’ exporters.
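To see what a scrape looks like from the Prometheus side, here is a rough Python sketch of one pull against an exporter. It assumes Node Exporter's usual `/metrics` path on port 9100. The host name `target-01` is hypothetical, and the parser is deliberately simplified (it assumes no trailing timestamps on the sample lines).

```python
from urllib.request import urlopen


def scrape(host, port=9100, timeout=5):
    """Fetch the raw metrics text from an exporter, the way a scrape does."""
    with urlopen(f"http://{host}:{port}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Parse 'name{labels} value' lines; comments and blank lines are skipped.

    Simplified: assumes sample lines carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples


# Canned response so the sketch runs without a live exporter.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
"""
parsed = parse_samples(SAMPLE)
print(parsed)
# Against a real target (hypothetical host name): parse_samples(scrape("target-01"))
```

If `scrape` times out or is refused, the usual suspect is step 2: the firewall between the Prometheus server and the exporter port.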
How long do node_exporter and other exporters hold data locally?
Worried about data loss? If there is a connectivity issue between the Prometheus server and the targets, you will lose metrics for that period. Exporters don’t save any data locally, and they cannot be configured to do so. An exporter is a binary that runs and fetches the data when a request comes in on the socket it is listening on. If nobody fetches (scrapes) the data, no data is gathered and the node_exporter instance idles, waiting for incoming requests. Refer to this discussion for more details.
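The behaviour described above can be shown with a toy exporter written with Python's standard library. This is only a sketch of the idea, not how node_exporter is implemented. The metric name `demo_collected_at_seconds` and port 8000 are made up for the example.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics():
    """Collect values at call time; nothing is buffered between scrapes."""
    now = time.time()
    return (
        "# TYPE demo_collected_at_seconds gauge\n"
        f"demo_collected_at_seconds {now}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with a freshly collected reading."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Run the toy exporter until interrupted (the port number is arbitrary)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


# Each call produces a fresh reading; a scrape that never happens leaves no trace.
print(render_metrics())
```

Because the value is computed inside the request handler, there is no history to replay after an outage. The long-term record lives only in the Prometheus server.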
Architecture of Prometheus
Prometheus server: It scrapes and stores time series data. We can use PromQL to query data from this TSDB.
Alertmanager: The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integrations such as email & other messaging systems.
Pushgateway: The Pushgateway is an intermediary service which allows you to push metrics from jobs which cannot be scraped. Read this document for more detail.
Web UI: By default, Prometheus has a Web UI where we can visualise the metrics scraped from targets. We can also integrate other tools like Grafana to build better dashboards. We use PromQL to query data from the TSDB.
Exporters: These run on the target nodes. We can configure Node Exporter for system metrics. There are also special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
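Both the Prometheus server and the Pushgateway speak plain HTTP, so a small Python sketch can show how the components above are addressed. It assumes the default ports (9090 for Prometheus, 9091 for the Pushgateway) on localhost. The job names `node` and `nightly_backup`, the metric `backup_duration_seconds` and the instance `target-01:9100` are hypothetical. The requests are only built, not sent, and the response is canned.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_push_request(gateway_url, job, metrics):
    """Build (but do not send) a PUT that replaces a job's metrics on the Pushgateway."""
    lines = [f"{name} {value}" for name, value in metrics.items()]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    return Request(url=f"{gateway_url}/metrics/job/{job}", data=body, method="PUT")


def instant_query_url(base_url, promql):
    """Build the URL for an instant PromQL query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(response_text):
    """Map each series' instance label to its value in an instant-vector response."""
    payload = json.loads(response_text)
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload}")
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }


# A batch job that cannot be scraped pushes its result instead.
req = build_push_request(
    "http://localhost:9091",
    job="nightly_backup",
    metrics={"backup_duration_seconds": 42.5},
)
print(req.get_method(), req.full_url)

# Ask the server which targets of the (hypothetical) job "node" are up.
url = instant_query_url("http://localhost:9090", 'up{job="node"}')
print(url)

# Canned response in the API's JSON shape, so no server is needed here.
canned = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "instance": "target-01:9100", "job": "node"},
         "value": [1700000000, "1"]},
    ]},
})
values = extract_values(canned)
print(values)
```

To actually send the push, pass `req` to `urllib.request.urlopen`. The query URL can be opened the same way or pasted into a browser.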
I hope you got a brief introduction to modern monitoring methodologies and the Prometheus monitoring toolkit. We will focus on Prometheus in upcoming posts.
Suggested post: Advantages of Prometheus monitoring tool