From 6a977bee052bb95e93bf973117e30230156d83ed Mon Sep 17 00:00:00 2001
From: Christopher Baines
Date: Sun, 20 Mar 2016 18:27:52 +0000
Subject: Add new post "Prometheus in early 2016"
---

[[!meta title="Prometheus in early 2016"]]

[[!taglink Prometheus]] ([[website|http://prometheus.io/]]) has been used on
and off by [[!taglink Thread]] since May 2015 (before I [[joined in
June|posts/first_week_at_thread]]). It's a [[!taglink free_software]] time
series database which is very useful for monitoring systems and services.

There was a brief gap in use, but I set it up again in October, driven by the
need to monitor machine resources and the lengths of the queues (Thread uses
[[django lightweight
queue|https://github.com/thread/django-lightweight-queue]]). This proved very
useful: when a problem arose, you could look at the queue and machine stats,
which helped greatly in determining the correct course of action.

When using Prometheus, the server scrapes metrics from services that expose
them (e.g. the prometheus-node-exporter). This is a common pattern, and I had
already thrown together an exporter for django lightweight queue (which simply
got the data out of redis), so as new and interesting problems occurred, I
began looking at how Prometheus could be used to provide better visibility.

## PGBouncer

The first issue that I addressed was a fun one: exhausting the PGBouncer
connection pool. The first time this happened, it was only noticed because
emails were being sent very slowly, and eventually it was determined that this
was due to workers waiting for a database connection. [[!taglink PGBouncer]]
does expose metrics, but having them in Prometheus makes them far more
accessible, so I wrote a [[Prometheus exporter for
PGBouncer|http://git.cbaines.net/prometheus-pgbouncer-exporter/about/]].

The data from this is displayed on relevant dashboards in Grafana, and has
helped more than once to quickly solve issues. Recently, the
prometheus-pgbouncer-exporter was [[accepted into
Debian|https://tracker.debian.org/news/753455]], which will hopefully make it
easy for others to install and use.

## HAProxy logs

With the success of the PGBouncer exporter, I recently started working on
another exporter, this time for [[!taglink HAProxy]]. Now, there is already an
HAProxy exporter, but it does not offer the metrics I wanted (per-HTTP-request-path
request rates, per-HTTP-request-path response duration histograms, ...), as it
just exposes the metrics from the status page. These can be derived from the
HAProxy logs, and there are [[existing libraries to parse
them|https://github.com/gforcada/haproxy_log_analysis]], which made it easier
to put together an exporter.
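To give a feel for the shape of such an exporter (this is an illustrative
sketch, not the actual exporter's code), the idea can be boiled down to a
counter and a histogram carrying the interesting labels, updated once per
parsed log entry. It assumes the Python prometheus_client library; the
buckets, the port and the trivial whitespace "parsing" are placeholders,
whereas the real exporter uses the haproxy_log_analysis library to parse the
full HAProxy log format and supports more labels.

    # Sketch only: the field parsing below is a stand-in for proper log
    # parsing, and the metric/label names follow the ones described in
    # this post.
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter(
        'haproxy_log_requests_total',
        'Requests seen in the HAProxy log.',
        ['status_code', 'server_name', 'http_request_path'],
    )
    PROCESSING_MS = Histogram(
        'haproxy_log_response_processing_milliseconds',
        'Response processing duration from the HAProxy log.',
        ['server_name', 'http_request_path'],
        buckets=(5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000),
    )

    def handle_entry(server_name, status_code, http_request_path, processing_ms):
        # Update both metrics for a single parsed log entry.
        REQUESTS.labels(
            status_code=status_code,
            server_name=server_name,
            http_request_path=http_request_path,
        ).inc()
        PROCESSING_MS.labels(
            server_name=server_name,
            http_request_path=http_request_path,
        ).observe(processing_ms)

    if __name__ == '__main__':
        start_http_server(9129)  # arbitrary port for this sketch
        with open('/var/log/haproxy.log') as log:
            for line in log:
                # Pretend each line is already reduced to four whitespace
                # separated fields; a real exporter would follow the log as
                # it grows and parse the actual HAProxy format.
                server, status, path, duration_ms = line.split()
                handle_entry(server, status, path, float(duration_ms))

Prometheus then just scrapes the port this listens on, like any other target.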
It was using the data from the HAProxy log exporter that I began to get a
better grasp of the power of aggregating metrics. The exporter exports a
metric, haproxy_log_requests_total, which can have a number of labels
(status_code, frontend_name, backend_name, server_name, http_request_path,
http_request_method, client_ip, client_port). Say you enable the status_code,
server_name and http_request_path labels; then, to get a rate of requests per
status code (e.g. to check the rate of HTTP 500 responses), you just run:

    sum(rate(haproxy_log_requests_total[1m])) by (status_code)

[[!img request_rate_by_status_code.png
  caption="Per second request rates, split by status code (key shown at the bottom)"
]]

Perhaps you want to compare two servers across the different request paths; you
would run:

    sum(
        rate(haproxy_log_requests_total[1m])
    ) by (
        http_request_path, server_name
    )

[[!img request_rate_by_request_path.png
  caption="Each metric represents the request rate for a single request path (e.g. /foo) for a single server"
]]

And everything you can do with a simple counter, you can also do with
histograms for response duration. So, say you want to know how a particular
request path is being handled across a set of servers, you can run:

    histogram_quantile(
        0.95,
        sum(
            rate(haproxy_log_response_processing_milliseconds_bucket[20m])
        ) by (
            http_request_path, le, server_name
        )
    )

[[!img response_processing_duration_by_server_and_request_path.png
  caption="Each metric represents the 95th percentile response processing duration for a single request path, for a single server"
]]

This last query aggregates the cumulative histograms exported for each set of
label values, allowing very flexible views of the response processing duration.

## Next steps

At the moment, both exporters are running fine. The PGBouncer exporter is in
Debian, and I am planning to do the same with the HAProxy log exporter
(however, this will take a little longer, as some of its dependencies are not
yet packaged).

The next thing I am interested in exploring in Prometheus is its capability to
make metrics available for automated alerting.
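As a sketch of where that could go (nothing like this is running yet), an
alerting rule built on the same HAProxy log metric might look like the
following, using the rule syntax Prometheus had at the time of writing; the
alert name, threshold and labels are made up for the example:

    ALERT HighHttp500Rate
      IF sum(rate(haproxy_log_requests_total{status_code="500"}[5m])) > 5
      FOR 5m
      LABELS { severity = "page" }
      ANNOTATIONS {
        summary = "Elevated rate of HTTP 500 responses",
        description = "More than 5 HTTP 500 responses per second over the last 5 minutes"
      }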