From 6a977bee052bb95e93bf973117e30230156d83ed Mon Sep 17 00:00:00 2001
From: Christopher Baines
Date: Sun, 20 Mar 2016 18:27:52 +0000
Subject: Add new post "Prometheus in early 2016"
---

[[!meta title="Prometheus in early 2016"]]

[[!taglink Prometheus]] ([[website|http://prometheus.io/]]) has been used on
and off by [[!taglink Thread]] since May 2015 (before I [[joined in
June|posts/first_week_at_thread]]). It's a [[!taglink free_software]] time
series database which is very useful for monitoring systems and services.

There was a brief gap in use, but I set it up again in October, driven by the
need to monitor machine resources and the lengths of the queues (Thread uses
[[django lightweight
queue|https://github.com/thread/django-lightweight-queue]]). This proved very
useful: when a problem arose, you could look at the queue and machine stats,
which helped greatly in determining the correct course of action.

When using Prometheus, the server scrapes metrics from services that expose
them (e.g. the prometheus-node-exporter). This is a common pattern, and I had
already thrown together an exporter for django lightweight queue (which simply
got the data out of redis), so as new and interesting problems occurred, I
began looking at how Prometheus could be used to provide better visibility.

## PGBouncer

The first issue that I addressed was a fun one: exhausting the PGBouncer
connection pool. The first time this happened, it was only noticed because
emails were being sent very slowly, and eventually it was determined that this
was due to workers waiting for a database connection. [[!taglink PGBouncer]]
does expose metrics, but having them in Prometheus makes them far more
accessible, so I wrote a [[Prometheus exporter for
PGBouncer|http://git.cbaines.net/prometheus-pgbouncer-exporter/about/]].

The data from this is displayed on relevant dashboards in Grafana, and has
helped more than once to quickly solve issues. Recently, the
prometheus-pgbouncer-exporter was [[accepted into
Debian|https://tracker.debian.org/news/753455]], which will hopefully make it
easy for others to install and use.

## HAProxy logs

With the success of the PGBouncer exporter, I recently started working on
another exporter, this time for [[!taglink HAProxy]]. Now, there is already an
HAProxy exporter, but it does not offer the metrics I wanted (per-HTTP-request-path
request rates, per-HTTP-request-path response duration histograms, ...), as it
just exposes the metrics from the status page. These can be derived from the
HAProxy logs, and there are [[existing libraries to parse
them|https://github.com/gforcada/haproxy_log_analysis]], which made it easier
to put together an exporter.
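To give a feel for the shape of such an exporter (this is an illustrative
sketch, not the actual exporter's code), the idea can be boiled down to a
counter and a histogram carrying the interesting labels, updated once per
parsed log entry. It assumes the Python prometheus_client library; the
buckets, the port and the trivial whitespace "parsing" are placeholders,
whereas the real exporter uses the haproxy_log_analysis library to parse the
full HAProxy log format and supports more labels.

    # Sketch only: the field parsing below is a stand-in for proper log
    # parsing, and the metric/label names follow the ones described in
    # this post.
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter(
        'haproxy_log_requests_total',
        'Requests seen in the HAProxy log.',
        ['status_code', 'server_name', 'http_request_path'],
    )
    PROCESSING_MS = Histogram(
        'haproxy_log_response_processing_milliseconds',
        'Response processing duration from the HAProxy log.',
        ['server_name', 'http_request_path'],
        buckets=(5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000),
    )

    def handle_entry(server_name, status_code, http_request_path, processing_ms):
        # Update both metrics for a single parsed log entry.
        REQUESTS.labels(
            status_code=status_code,
            server_name=server_name,
            http_request_path=http_request_path,
        ).inc()
        PROCESSING_MS.labels(
            server_name=server_name,
            http_request_path=http_request_path,
        ).observe(processing_ms)

    if __name__ == '__main__':
        start_http_server(9129)  # arbitrary port for this sketch
        with open('/var/log/haproxy.log') as log:
            for line in log:
                # Pretend each line is already reduced to four whitespace
                # separated fields; a real exporter would follow the log as
                # it grows and parse the actual HAProxy format.
                server, status, path, duration_ms = line.split()
                handle_entry(server, status, path, float(duration_ms))

Prometheus then just scrapes the port this listens on, like any other target.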
It was using the data from the HAProxy log exporter that I began to get a
better grasp of the power of aggregating metrics. The exporter exports a
metric, haproxy_log_requests_total, which can have a number of labels
(status_code, frontend_name, backend_name, server_name, http_request_path,
http_request_method, client_ip, client_port). Say you enable the status_code,
server_name and http_request_path labels; then, to get a rate of requests per
status code (e.g. to check the rate of HTTP 500 responses), you just run:

    sum(rate(haproxy_log_requests_total[1m])) by (status_code)

[[!img request_rate_by_status_code.png
  caption="Per second request rates, split by status code (key shown at the bottom)"
]]

Perhaps you want to compare two servers across the different request paths; you
would run:

    sum(
        rate(haproxy_log_requests_total[1m])
    ) by (
        http_request_path, server_name
    )

[[!img request_rate_by_request_path.png
  caption="Each metric represents the request rate for a single request path (e.g. /foo) for a single server"
]]

And everything you can do with a simple counter, you can also do with
histograms for response duration. So, say you want to know how a particular
request path is being handled across a set of servers, you can run:

    histogram_quantile(
        0.95,
        sum(
            rate(haproxy_log_response_processing_milliseconds_bucket[20m])
        ) by (
            http_request_path, le, server_name
        )
    )

[[!img response_processing_duration_by_server_and_request_path.png
  caption="Each metric represents the 95th percentile response processing duration for a single request path, for a single server"
]]

This last query aggregates the cumulative histograms exported for each set of
label values, allowing very flexible views of the response processing duration.

## Next steps

At the moment, both exporters are running fine. The PGBouncer exporter is in
Debian, and I am planning to do the same with the HAProxy log exporter
(however, this will take a little longer, as some of its dependencies are not
yet packaged).

The next thing I am interested in exploring in Prometheus is its capability to
make metrics available for automated alerting.
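As a sketch of where that could go (nothing like this is running yet), an
alerting rule built on the same HAProxy log metric might look like the
following, using the rule syntax Prometheus had at the time of writing; the
alert name, threshold and labels are made up for the example:

    ALERT HighHttp500Rate
      IF sum(rate(haproxy_log_requests_total{status_code="500"}[5m])) > 5
      FOR 5m
      LABELS { severity = "page" }
      ANNOTATIONS {
        summary = "Elevated rate of HTTP 500 responses",
        description = "More than 5 HTTP 500 responses per second over the last 5 minutes"
      }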