With the introduction of support for metrics beyond just CPU and memory, we will also need to improve our environment dashboard, which currently shows only those two metrics.
We should prepare for a larger number of metrics, each grouped under its service name, and for metrics that may return more than one series in a query.
Support for additional queries
To keep the page simple and easy to render, we should:
Group all of a service's metric charts together under a simple title. Services should be ordered by the Priority value, lowest being most important. If a priority is duplicated, fall back to alphabetical order by service name (see the sketch after this list).
Add a “Waiting for deployment…” state that re-uses the illustration from the currently implemented “Waiting for performance data” state.
All charts are stacked (not in columns) if the total number of charts on the page is <= 3
If the total number of charts on the page is > 3, display them in columns
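To make those two rules concrete, here is a minimal sketch; the service names, priorities, and chart counts are purely illustrative, not a proposed data model.

```python
# Order services by Priority (lowest value first), breaking ties alphabetically by
# name; stack charts when the page holds three or fewer in total, otherwise columns.
services = [
    {"name": "load-balancer", "priority": 1, "charts": ["throughput", "errors"]},
    {"name": "database", "priority": 5, "charts": ["cpu", "memory"]},
]

ordered = sorted(services, key=lambda s: (s["priority"], s["name"]))
total_charts = sum(len(s["charts"]) for s in ordered)
layout = "stacked" if total_charts <= 3 else "columns"
print([s["name"] for s in ordered], layout)  # ['load-balancer', 'database'] columns
```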
cc @bjk-gitlab and @pedroms. We will need to do some FE work to support #28717 (closed) which I hope will arrive as part of 9.2. We can cut this down depending on FE resourcing, but would like your input.
At a minimum, we need to continue with 1 chart per line without the ability to re-arrange, but still display the additionally configured queries.
One other item we may need to think about is where the title of the graph and the units come from. We already have title as a configured parameter in #28717 (closed), so I think we can re-use that. However, we will probably need a way to communicate units, perhaps via an additional configuration option as part of that issue.
@bjk-gitlab Do you have perspective on how we should control the Legend name for each time series in a single chart? Does it make sense to simply ask for a label name to use for the time series name?
@bjk-gitlab Generally speaking, in this scenario only one label is actually changing in the set, right? If so, we should be able to autodetect that and simply select it for the legend names. Or is that not typically true, in which case we instead need to offer a method to select it?
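For illustration, a quick sketch of that autodetection idea, assuming we have the label sets of the series a single query returns; the helper name and sample labels are hypothetical, not existing code.

```python
# Given the label sets of the series returned by one query, the labels whose values
# vary across series are candidates for the legend names.
def varying_labels(series_labels):
    keys = set().union(*series_labels)
    return [key for key in keys
            if len({labels.get(key) for labels in series_labels}) > 1]

series = [
    {"code": "2xx", "job": "webapp"},
    {"code": "4xx", "job": "webapp"},
    {"code": "5xx", "job": "webapp"},
]
print(varying_labels(series))  # ['code'] -> legend names "2xx", "4xx", "5xx"
```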
@joshlambert I suggest we focus and continue with the 1 chart per line without the ability to re-arrange. It's definitely something to strive for, but I think we should first nail the more basic chart functionality that you describe.
For units, we can add a list of options to choose for each metric. For labels, we can add an input for each query. Are multiple series defined in the same query, or does each series require its own query?
@joshlambert I would argue that our first priority is to decide on a smaller set of metrics we deem to be most valuable, and optimize display around that, rather than the arbitrary "larger quantity of metrics" goal. Going down the extensible path is great, but might lead us to sub-optimal experiences. If we consider Response time, Throughput (and errors), CPU, and memory, we might come up with a better experience.
I also have another concern: if we have rearrangeable graphs, I think some UX would change, such as all of the graphs moving at the same time. Since this would be closer to a dashboard, having the flag go through all the graphs at the same time would be somewhat counterintuitive.
@pedroms @markpundsack Thanks for all the feedback, I absolutely agree we want this to be magic. Ensuring it "just works" will be a major differentiator compared to a third party solution. (Reinforced here: https://gitlab.com/gitlab-org/gitlab-ce/issues/28717#note_27222712) We should automate what we can, use convention for what we can't, and fall back to configuration if no other options are available.
After our call today based on the feedback, I spent some time thinking about how to reduce the configuration requested. Please keep in mind that we want this to be usable by a healthy range of customers, and we can't predict what exporters they will or won't have available.
Autodetect metrics
As presently designed in #28717 (closed), we currently allow some customization of Library metrics:
To support alternative tagging methods other than environment=CI_ENVIRONMENT_SLUG, since this is not automatic for anyone outside of Auto Deploy with the provided Prometheus config.
Provide an option to customize the query, for example setting a target response time for calculating Apdex.
With some extra development effort, however, I think we can improve #1 (closed):
Do not ask for any additional configuration in Prometheus service setup. Just Active? and URL.
Insert language saying we will detect metrics after the next deploy: "Waiting for next deploy to learn metrics..." or something.
On next deploy, pull all metrics from Prometheus server.
Compare metric names against Library.
For matching metric names, search all tags for one whose value matches CI_ENVIRONMENT_SLUG. Use this tag for filtering on environment.
Add all metrics that matched an entry in the Library and had a label matching CI_ENVIRONMENT_SLUG.
Repeat metric detection every deploy.
Unfortunately I don't know a good way to deal with #2 (closed); however, if our Library is a simple YML file, a server admin could always manually edit it to add something they wanted. Not great, but possible for now.
Library Enhancements
I think the core idea of the Library is a really good one, and we were already planning to be very opinionated on which metrics we include from each exporter. Where I think we can go further is adding a Type, Exporter, and Priority tag for each metric.
For example, if we identified a set of metrics:
For the Postgres exporter, these would be marked with a Type of Database, an Exporter of Postgres, and a Priority of, say, 5.
We may then also detect HA Proxy metrics, which we would mark with a Type of Load Balancer, an Exporter of HA Proxy, and a Priority of, say, 1.
Today this data would be used to drive the charts. Looking out further, however, it can also be used to prioritize metrics for display in the MR workflow, etc. (An error rate increasing by 10% is far more critical than memory usage going up 15%.)
Chart Improvements
Proceed with 2 per row, but no need to support re-arranging.
Order & Grouping
For charting, we would group all metrics from an exporter into a section with that title. The priority of each group would then determine its order of placement on the page. For example, since HA Proxy is closest to what the customer experiences, it gets top billing. (And include error rates!)
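As a hedged illustration of how such Library entries could drive grouping and placement (field names, metric names, and priorities are assumptions, not a final format):

```python
# Hypothetical Library entries carrying the proposed Type, Exporter, and Priority
# fields; charts are grouped under the exporter's title and groups are placed on
# the page by ascending priority (lowest value = most important, shown first).
LIBRARY = [
    {"metric": "haproxy_frontend_http_responses_total",
     "type": "Load Balancer", "exporter": "HA Proxy", "priority": 1},
    {"metric": "pg_stat_database_tup_fetched",
     "type": "Database", "exporter": "Postgres", "priority": 5},
]

groups = {}
for entry in LIBRARY:
    groups.setdefault((entry["priority"], entry["exporter"]), []).append(entry["metric"])

for (priority, exporter), metrics in sorted(groups.items()):
    print(exporter, metrics)  # HA Proxy section first, Postgres after it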
Multiple time series
We should still support multiple time series per chart. Again using HA Proxy's request data, we would want to show HTTP 200, 400, and 500 rates.
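For example, assuming the HAProxy exporter's haproxy_frontend_http_responses_total metric with its code label (and an environment label per the detection discussion above), a single query could return one series per response class:

```python
# One query, many series: grouping by the `code` label yields a separate series
# (and chart line) for each response class, e.g. 2xx, 4xx, and 5xx rates.
query = (
    'sum(rate(haproxy_frontend_http_responses_total{environment="production"}[2m])) '
    'by (code)'
)
```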
@joshlambert I think the approach you are suggesting is much more in line with our “convention over configuration” value that @markpundsack brought up. It’s really worth looking through this lens, especially as a means of differentiation. If people want, they can set up Grafana or something similar to customize it all the way. But the value we are delivering is in working through the choices and simplifying the whole process for the users, so they don’t have to configure anything, or only the bare minimum.
Autodetect metrics: Why do we have to wait for the next deploy to pull metrics from Prometheus? The type of metrics can vary from env. to env., correct?
Order & Grouping: Maybe I'm missing something, but can't the type and exporter order be inferred from the order on the library YML file? Why would we need the priority?
@pedroms I think we will need to eventually get to a more fully featured dashboard with custom query support. For example we have already gotten feedback that some customers who were interested (and on k8s) couldn't use it, because they wanted to customize the queries.
I do like this change though, because it makes getting started easier and with fewer clicks. Ideally, if you follow our conventions, it shouldn't require any configuration or extra steps. We will also still broaden the potential user base for this (by supporting …), which means we should hopefully start getting more feedback.
Autodetect metrics: This comes down to two items I think.
The most important: how much load overhead is in all of the autodetection routines? Is it practical to run this on every metric, or do we need to cache this data? For each metric we want to show, it could be 3-4 queries for each refresh of the page. Some of these would be very broad.
What should we show on the Prometheus Service screen? I would think we want to show the metrics we have identified and the environment tag used. This matters because, if we are going to cache items, we need a way to clear the cache in the event someone changes their configuration. For example, if someone alters the metrics their app exports, some of the tags could change, and if we are caching the data, our queries could return no data.
We essentially try to learn the metrics and tags they are using, so this would be a function to reset that learning. We could try to detect this state automatically, perhaps by seeing queries return no data for a period of time (~10 minutes?), but this is even more effort.
@bjk-gitlab and @pchojnacki what do you think of the load required to learn the metrics, on both the Prometheus server and GitLab?
I was thinking of the following steps:
```
Retrieve all metric names
For each metric that matches one in Library {
    Retrieve all unique records of that plain metric, to get the full set of tags.
    Search all tags for one that matches `$CI_ENVIRONMENT_SLUG`
    If a tag matches: {
        Save the tag name
        Run real query w/ environment tag specified
        Confirm data comes back, and then save metric/tag combination.
    }
    Else Discard metric
}
Else Discard metric
```
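For illustration only, a rough Python rendering of those steps against the standard Prometheus HTTP API; the Library contents and function name are placeholders, not actual GitLab code.

```python
import requests

LIBRARY = {"haproxy_frontend_http_responses_total", "pg_stat_database_tup_fetched"}

def detect_metrics(prometheus_url, environment_slug):
    detected = []
    # Retrieve all metric names known to the Prometheus server
    names = requests.get(f"{prometheus_url}/api/v1/label/__name__/values").json()["data"]
    for name in names:
        if name not in LIBRARY:
            continue  # discard metrics that don't match the Library
        # Retrieve the unique label sets of the plain metric to get the full set of tags
        series = requests.get(f"{prometheus_url}/api/v1/series",
                              params={"match[]": name}).json()["data"]
        # Search all tags for one whose value matches $CI_ENVIRONMENT_SLUG
        candidates = {label for labels in series
                      for label, value in labels.items() if value == environment_slug}
        for label in candidates:
            # Run a real query with the environment tag specified and confirm data comes back
            query = f'{name}{{{label}="{environment_slug}"}}'
            result = requests.get(f"{prometheus_url}/api/v1/query",
                                  params={"query": query}).json()["data"]["result"]
            if result:
                detected.append({"metric": name, "environment_label": label})
                break  # save the metric/tag combination
    return detected
```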
Thanks @pedroms. I thought about this some more; what do you think about the following?
On the Integration services page, we do the following:
When no deploy has occurred to any environment in the project yet, we simply state we are waiting for the next deploy.
After the first deployment, we run the autodetect logic.
We then display an expanded list of metric groups we recognize (e.g. HA Proxy or Postgres) and for which we were able to find a tag matching $CI_ENVIRONMENT_SLUG.
Last, we have a collapsed list for any metric groups we found (we see the metric names) but were unable to find a matching environment tag for. If no groups are in this state, the list should not appear.
This would allow users to easily see what we found, provide insight into what is happening behind the scenes, and allow folks to troubleshoot without diving into logs. It would also hopefully encourage people to add the proper environment tag to metric groups that do not have it.
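A rough sketch of the data that page could surface; the group names and keys are placeholders:

```python
# Groups whose metrics matched an environment label are listed expanded; groups we
# recognized but could not match to $CI_ENVIRONMENT_SLUG are listed collapsed, as a
# hint to add the proper environment tag.
integration_page = {
    "detected": ["HA Proxy", "Postgres"],    # metric names found and environment tag matched
    "missing_environment_tag": ["Redis"],    # metric names found, no matching tag (collapsed)
}
```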
Then on the dashboard:
We display each found metric under its group title, ordered by appearance in the Library YML file. So this would be something like Ingress, then App tier, then DB tier, as one example. We do not need to support re-arranging, but should support 2 charts per line to reduce space requirements when there are 6-10 metrics displayed.
We should definitely use background worker requests and cache the results of the metric/label queries. We do not want to hit the Prometheus server many times for each page load.
Loading the cached result set from the GitLab API via the Redis cache should be fine.
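A minimal sketch of that caching approach, assuming a Redis client; the key name and TTL (standing in for the "reset learning" window) are illustrative choices:

```python
import json
import redis

cache = redis.Redis()
CACHE_KEY = "prometheus:detected_metrics:%s"   # one entry per environment (assumed key name)
CACHE_TTL = 10 * 60                            # seconds; expiry doubles as the learning reset

def cached_detected_metrics(environment_slug, detect):
    """Return cached detection results, running the detect routine only on a cold cache."""
    key = CACHE_KEY % environment_slug
    cached = cache.get(key)
    if cached:
        return json.loads(cached)
    result = detect(environment_slug)          # e.g. the autodetect routine sketched earlier
    cache.setex(key, CACHE_TTL, json.dumps(result))
    return result
```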
@jivanvl Examples: if a group only has 1 chart, it would be full-width, because it's odd-numbered. If it has 2 or 4 charts, those will be put into two columns. If it has 3 charts, the first two will be put into columns, and the 3rd one will be full-width, because it's odd-numbered.
This image illustrates the behavior with 1, 2, or 3 charts. Does this answer your question?
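Expressed as a tiny sketch of that rule (purely illustrative):

```python
# Pair charts into two columns; an odd chart left at the end of a group spans full width.
def rows(charts):
    pairs = [charts[i:i + 2] for i in range(0, len(charts), 2)]
    return [{"charts": pair, "full_width": len(pair) == 1} for pair in pairs]

print(rows(["cpu"]))                        # 1 chart -> full width
print(rows(["cpu", "memory", "latency"]))   # 2 columns, then the 3rd chart full width
```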
@joshlambert I have a question about the new state that says Waiting for deployment… Should this be displayed instead of the Waiting for performance data state, or what is the condition that triggers it? Currently the 3 states that existed before work as intended.
@jivanvl The intent of that screen was to inform users if there have been no deployments to a given environment. When no deploy has occurred to any environment in the project yet, we simply state we are waiting for the next deploy.
I think we can remove the "or redeploy" text and the "Redeploy" button, since we changed the autodetect logic to always run on every cache fill.
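A hedged sketch of the empty-state decision implied here (names are illustrative, not actual implementation):

```python
def dashboard_state(has_deployment, has_metric_data):
    # No deployment yet -> "Waiting for deployment..."; deployed but no data yet ->
    # the existing "Waiting for performance data" state; otherwise render the charts.
    if not has_deployment:
        return "waiting_for_deployment"
    if not has_metric_data:
        return "waiting_for_performance_data"
    return "show_charts"
```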
@joshlambert to clarify, if a deployment has occurred before the Prometheus integration is configured, will the user need to re-deploy? Or is a deployment only necessary if no deployments have ever occurred?