Thanks @bjk-gitlab!
Looking at the code, GC statistics should be straightforward to implement.
However, the 'transaction'-related metrics seem to be the most complex.
I also think that the cardinality of some (if not all) of the transaction metrics might be quite high, especially:
@yorickpeterse We will be able to drop the prefix for Prometheus. We will have the job label to separate them.
We already have the HTTP histogram from the Rack middleware. I would suggest we modify that to include separate controller and action labels instead of the single action label we get out of InfluxDB. While the cardinality is high for controller and action, there is a lot of value in having response times separated out.
Based on the survey of access to staging so far, there are about 250 Controller#Action combinations. If we had 10 histogram buckets + Inf + _sum + _count, we would get about 3200 series. For a typical GitLab install with one unicorn server, this is not too bad. Even for GitLab production, where we have 40 unicorns, that's 130k series. This is quite a lot, but the value for monitoring is important.
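For anyone who wants to play with the numbers, here is a rough back-of-the-envelope sketch of the estimate above. The function and parameter names are made up for illustration; the inputs are the figures quoted in this comment, and it assumes every unicorn exposes the full set of combinations.

```python
def estimated_series(combinations, buckets, processes=1):
    """Estimate how many time series one histogram metric produces.

    Each label combination yields one series per finite bucket, plus the
    +Inf bucket, plus the _sum and _count series.
    """
    series_per_combination = buckets + 1 + 2  # finite buckets, +Inf, _sum and _count
    return combinations * series_per_combination * processes


single_unicorn = estimated_series(combinations=250, buckets=10)
production = estimated_series(combinations=250, buckets=10, processes=40)
print(single_unicorn)  # 3250
print(production)      # 130000
```

Changing the bucket count or the number of combinations shows quickly how sensitive the total is to each factor.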
We already have the HTTP histogram from the Rack middleware. I would suggest we modify that to include separate controller and action labels instead of the single action label we get out of InfluxDB. While the cardinality is high for controller and action, there is a lot of value in having response times separated out.
This would require changing a ton of things in Grafana, so I'm not a fan of this. Further, this doesn't play nice with Grape which doesn't really have a concept of controllers vs actions. I'd rather just keep action with values such as UsersController#show instead of controller = UsersController, action = show.
We can split the controller and action labels for Prometheus, but leave the internal representation as-is for Grape.
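As a minimal sketch of what that split could look like at export time: the helper name below is hypothetical, and the fallback for values without a "#" (treated here as Grape-style endpoints and kept whole in the action label) is an assumption rather than a description of the existing code. The example endpoint value is also made up.

```python
def split_action_label(action):
    """Split a combined 'Controller#action' value into two labels.

    Values without a '#' are assumed to have no controller; they are
    kept whole in the action label with an empty controller label.
    """
    controller, separator, name = action.partition("#")
    if not separator:
        return {"controller": "", "action": action}
    return {"controller": controller, "action": name}


print(split_action_label("UsersController#show"))
# {'controller': 'UsersController', 'action': 'show'}
print(split_action_label("GET /api/projects"))  # hypothetical endpoint name
# {'controller': '', 'action': 'GET /api/projects'}
```

The internal name stays a single string; only the exported labels change.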
We already have to completely re-do Grafana as the InfluxDB query language (SQL-ish) is completely different from Prometheus PromQL.
There are a number of advantages to splitting up controller and action labels in Prometheus.
It's easier to do selections like metric{controller="Foo"} than to have to resort to regexps like metric{action=~"Foo#.*"}.
The label indexing in Prometheus allows for faster selection when using string literals like above than regexps that have to do full index scans.
We can still use Grafana formatting to produce the same "view" with things like {{controller}}#{{action}}.
Grafana will template each label individually, so you can select a Controller, and see all actions. Or select a Controller and a specific set of Action(s).
Grafana's templating is smart enough to give you only the Actions related to a specific Controller.
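To illustrate the indexing point, here is a toy inverted index. It is only a simplified model for intuition, not how Prometheus is actually implemented, and the class and method names are invented for this sketch. An exact match is a single lookup, while a regexp has to be tested against every known value of the label.

```python
import re
from collections import defaultdict


class LabelIndex:
    """Toy inverted index: (label name, label value) -> set of series ids."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, series_id, labels):
        for name, value in labels.items():
            self._postings[(name, value)].add(series_id)

    def select_equal(self, name, value):
        # Exact match: one dictionary lookup.
        return set(self._postings.get((name, value), set()))

    def select_regex(self, name, pattern):
        # Regexp match: walk every indexed entry and test each value of the label.
        compiled = re.compile(pattern)
        matched = set()
        for (label, value), series_ids in self._postings.items():
            if label == name and compiled.fullmatch(value):
                matched |= series_ids
        return matched


split = LabelIndex()
split.add(1, {"controller": "Foo", "action": "show"})
split.add(2, {"controller": "Foo", "action": "index"})
split.add(3, {"controller": "Bar", "action": "show"})

combined = LabelIndex()
combined.add(1, {"action": "Foo#show"})
combined.add(2, {"action": "Foo#index"})
combined.add(3, {"action": "Bar#show"})

print(sorted(split.select_equal("controller", "Foo")))   # [1, 2]
print(sorted(combined.select_regex("action", "Foo#.*")))  # [1, 2]
```

Both selections return the same series, but the second one has to visit every indexed value to get there.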
Current sample output with all the actions split into controller & action labels. I've also split call signatures for autoinstrumented methods.
https://gitlab.com/snippets/1674154
The response time of the serialization requests has approached 1s at the 95th percentile. I'm worried that a longer-running instance with more than 2 workers will take a couple of times longer to process all metrics.
Requests per second:    1.38 [#/sec] (mean)
Time per request:       726.380 [ms] (mean)
Time per request:       726.380 [ms] (mean, across all concurrent requests)
Transfer rate:          433.66 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     0    0.1      0      0
Processing:   480   726  669.0    590   4758
Waiting:      479   670  338.5    589   2567
Total:        480   726  669.1    590   4759

Percentage of the requests served within a certain time (ms)
  50%    590
  66%    608
  75%    681
  80%    744
  90%    942
  95%   1140
  98%   4759
  99%   4759
 100%   4759 (longest request)