With the introduction of support for metrics beyond just CPU and memory, we will also need to improve our environment dashboard, which currently shows only those two metrics.
We should prepare for a larger number of metrics, each grouped under its service name, and for metrics that may return more than one series in a query.
Support for additional queries
To keep the page simple and easy to render, we should:
Group all of a service's metric charts together under a simple title. Services should be ordered by the Priority value, lowest being most important. If a priority is duplicated, fall back to alphabetical order by service name (see the sketch after this list).
Add a “Waiting for deployment…” state that re-uses the illustration from the currently implemented “Waiting for performance data” state.
All charts are stacked (not in columns) if the total number of charts on the page is <= 3
If the total number of charts on the page is > 3, display them in columns
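To make those two rules concrete, here is a minimal sketch; the service names, priorities, and chart counts are purely illustrative, not a proposed data model.

```python
# Order services by Priority (lowest value first), breaking ties alphabetically by
# name; stack charts when the page holds three or fewer in total, otherwise columns.
services = [
    {"name": "load-balancer", "priority": 1, "charts": ["throughput", "errors"]},
    {"name": "database", "priority": 5, "charts": ["cpu", "memory"]},
]

ordered = sorted(services, key=lambda s: (s["priority"], s["name"]))
total_charts = sum(len(s["charts"]) for s in ordered)
layout = "stacked" if total_charts <= 3 else "columns"
print([s["name"] for s in ordered], layout)  # ['load-balancer', 'database'] columns
```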
cc @bjk-gitlab and @pedroms. We will need to do some FE work to support #28717 (closed) which I hope will arrive as part of 9.2. We can cut this down depending on FE resourcing, but would like your input.
At a minimum, we need to continue with 1 chart per line without the ability to re-arrange, but still display the additionally configured queries.
One other item we may need to think about is where the title of the graph and the units come from. We already have title as a configured parameter in #28717 (closed), so I think we can re-use that. However, we will probably need a way to communicate units, perhaps via an additional configuration option as part of that issue.
@bjk-gitlab Do you have perspective on how we should control the Legend name for each time series in a single chart? Does it make sense to simply ask for a label name to use for the time series name?
@bjk-gitlab Generally speaking, in this scenario only one label is actually changing in the set, right? If so, we should be able to autodetect that and simply select it for the legend names. Or is that not typically true, in which case we instead need to offer a method to select it?
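For illustration, a quick sketch of that autodetection idea, assuming we have the label sets of the series a single query returns; the helper name and sample labels are hypothetical, not existing code.

```python
# Given the label sets of the series returned by one query, the labels whose values
# vary across series are candidates for the legend names.
def varying_labels(series_labels):
    keys = set().union(*series_labels)
    return [key for key in keys
            if len({labels.get(key) for labels in series_labels}) > 1]

series = [
    {"code": "2xx", "job": "webapp"},
    {"code": "4xx", "job": "webapp"},
    {"code": "5xx", "job": "webapp"},
]
print(varying_labels(series))  # ['code'] -> legend names "2xx", "4xx", "5xx"
```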
@joshlambert I suggest we focus and continue with the 1 chart per line without the ability to re-arrange. It's definitely something to strive for, but I think we should first nail the more basic chart functionality that you describe.
For units, we can add a list of options to choose for each metric. For labels, we can add an input for each query. Are multiple series defined in the same query, or does each series require its own query?
@joshlambert I would argue that our first priority is to decide on a smaller set of metrics we deem to be most valuable, and optimize display around that, rather than the arbitrary "larger quantity of metrics" goal. Going down the extensible path is great, but might lead us to sub-optimal experiences. If we consider Response time, Throughput (and errors), CPU, and memory, we might come up with a better experience.
I also have another concern: if we have rearrangeable graphs, I think some UX would change, such as all of the graphs moving at the same time. Since this would be closer to a dashboard, having the flag go through all the graphs at the same time would be somewhat counterintuitive.
@pedroms @markpundsack Thanks for all the feedback, I absolutely agree we want this to be magic. Ensuring it "just works" will be a major differentiator compared to a third party solution. (Reinforced here: https://gitlab.com/gitlab-org/gitlab-ce/issues/28717#note_27222712) We should automate what we can, use convention for what we can't, and fall back to configuration if no other options are available.
After our call today based on the feedback, I spent some time thinking about how to reduce the configuration requested. Please keep in mind that we want this to be usable by a healthy range of customers, and we can't predict what exporters they will or won't have available.
Autodetect metrics
As presently designed in #28717 (closed), we currently allow some customization of Library metrics:
To support alternative tagging methods other than environment=CI_ENVIRONMENT_SLUG, since this is not automatic for anyone outside of Auto Deploy with the provided Prometheus config.
Provide an option to customize the query, for example setting a target response time for calculating Apdex.
With some extra development effort, however, I think we can improve #1 (closed):
Do not ask for any additional configuration in Prometheus service setup. Just Active? and URL.
Insert language saying we will detect metrics after the next deploy: "Waiting for next deploy to learn metrics..." or something.
On next deploy, pull all metrics from Prometheus server.
Compare metric names against Library.
For matching metric names, search all tags for one whose value matches CI_ENVIRONMENT_SLUG. Use this tag for filtering on environment.
Add all metrics that matched an entry in the Library and had a label matching CI_ENVIRONMENT_SLUG.
Repeat metric detection every deploy.
Unfortunately I don't know a good way to deal with #2 (closed); however, if our Library is a simple YML file, a server admin could always manually edit it to add something they wanted. Not great, but possible for now.
Library Enhancements
I think the core idea of the Library is a really good one, and we were already planning to be very opinionated on which metrics we include from each exporter. Where I think we can go further is adding a Type, Exporter, and Priority tag for each metric.
For example, if we identified a set of metrics:
For the Postgres exporter, these would be marked with a Type of Database, an Exporter of Postgres, and a Priority of, say, 5.
We may then also detect HA Proxy metrics, which we would mark with a Type of Load Balancer, an Exporter of HA Proxy, and a Priority of, say, 1.
Today this data would be used to drive the charts. Looking out further, however, it can also be used to prioritize metrics for display in the MR workflow, etc. (An error rate increasing by 10% is far more critical than memory usage going up 15%.)
Chart Improvements
Proceed with 2 per row, but no need to support re-arranging.
Order & Grouping
For charting, we would group all metrics from an exporter into a section with that title. The priority of each group would then determine its order of placement on the page. For example, since HA Proxy is closest to what the customer experiences, it gets top billing. (And include error rates!)
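As a hedged illustration of how such Library entries could drive grouping and placement (field names, metric names, and priorities are assumptions, not a final format):

```python
# Hypothetical Library entries carrying the proposed Type, Exporter, and Priority
# fields; charts are grouped under the exporter's title and groups are placed on
# the page by ascending priority (lowest value = most important, shown first).
LIBRARY = [
    {"metric": "haproxy_frontend_http_responses_total",
     "type": "Load Balancer", "exporter": "HA Proxy", "priority": 1},
    {"metric": "pg_stat_database_tup_fetched",
     "type": "Database", "exporter": "Postgres", "priority": 5},
]

groups = {}
for entry in LIBRARY:
    groups.setdefault((entry["priority"], entry["exporter"]), []).append(entry["metric"])

for (priority, exporter), metrics in sorted(groups.items()):
    print(exporter, metrics)  # HA Proxy section first, Postgres after it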
Multiple time series
We should still support multiple time series per chart. Again using HA Proxy's request data, we would want to show HTTP 200, 400, and 500 rates.
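For example, assuming the HAProxy exporter's haproxy_frontend_http_responses_total metric with its code label (and an environment label per the detection discussion above), a single query could return one series per response class:

```python
# One query, many series: grouping by the `code` label yields a separate series
# (and chart line) for each response class, e.g. 2xx, 4xx, and 5xx rates.
query = (
    'sum(rate(haproxy_frontend_http_responses_total{environment="production"}[2m])) '
    'by (code)'
)
```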
@joshlambert I think the approach you are suggesting is much more in line with our “convention over configuration” value that @markpundsack brought up. It’s really worth looking through this lens, especially as a means of differentiation. If people want, they can set up Grafana or something similar to customize it all the way. But the value we are delivering is in working through the choices and simplifying the whole process for the users, so they don’t have to configure anything, or only the bare minimum.
Autodetect metrics: Why do we have to wait for the next deploy to pull metrics from Prometheus? The type of metrics can vary from env. to env., correct?
Order & Grouping: Maybe I'm missing something, but can't the type and exporter order be inferred from the order on the library YML file? Why would we need the priority?
@pedroms I think we will need to eventually get to a more fully featured dashboard with custom query support. For example we have already gotten feedback that some customers who were interested (and on k8s) couldn't use it, because they wanted to customize the queries.
I do like this change though, because it makes getting started easier and with fewer clicks. Ideally, if you follow our conventions, it shouldn't require any configuration or extra steps. We will also still broaden the potential user base for this (by supporting …), which means we should hopefully start getting more feedback.
Autodetect metrics: This comes down to two items I think.
The most important: how much load overhead is in all of the autodetection routines? Is it practical to run this on every metric, or do we need to cache this data? For each metric we want to show, it could be 3-4 queries for each refresh of the page. Some of these would be very broad.
What should we show on the Prometheus Service screen? I would think we want to show the metrics we have identified and the environment tag used. This matters because, if we are going to cache items, we need a way to clear the cache in the event someone changes their configuration. For example, if someone alters the metrics their app exports, some of the tags could change, and if we are caching the data, our queries could return no data.
We essentially try to learn the metrics and tags they are using, so this would be a function to reset that learning. We could try to detect this state automatically, perhaps by seeing queries return no data for a period of time (~10 minutes?), but this is even more effort.
@bjk-gitlab and @pchojnacki what do you think of the load required to learn the metrics, on both the Prometheus server and GitLab?
I was thinking of the following steps:
```
Retrieve all metric names
For each metric that matches one in Library {
    Retrieve all unique records of that plain metric, to get the full set of tags.
    Search all tags for one that matches `$CI_ENVIRONMENT_SLUG`
    If a tag matches: {
        Save the tag name
        Run real query w/ environment tag specified
        Confirm data comes back, and then save metric/tag combination.
    }
    Else Discard metric
}
Else Discard metric
```
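For illustration only, a rough Python rendering of those steps against the standard Prometheus HTTP API; the Library contents and function name are placeholders, not actual GitLab code.

```python
import requests

LIBRARY = {"haproxy_frontend_http_responses_total", "pg_stat_database_tup_fetched"}

def detect_metrics(prometheus_url, environment_slug):
    detected = []
    # Retrieve all metric names known to the Prometheus server
    names = requests.get(f"{prometheus_url}/api/v1/label/__name__/values").json()["data"]
    for name in names:
        if name not in LIBRARY:
            continue  # discard metrics that don't match the Library
        # Retrieve the unique label sets of the plain metric to get the full set of tags
        series = requests.get(f"{prometheus_url}/api/v1/series",
                              params={"match[]": name}).json()["data"]
        # Search all tags for one whose value matches $CI_ENVIRONMENT_SLUG
        candidates = {label for labels in series
                      for label, value in labels.items() if value == environment_slug}
        for label in candidates:
            # Run a real query with the environment tag specified and confirm data comes back
            query = f'{name}{{{label}="{environment_slug}"}}'
            result = requests.get(f"{prometheus_url}/api/v1/query",
                                  params={"query": query}).json()["data"]["result"]
            if result:
                detected.append({"metric": name, "environment_label": label})
                break  # save the metric/tag combination
    return detected
```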
Thanks @pedroms. I thought about this some more; what do you think about the following?
On the Integration services page, we do the following:
When no deploy has occurred to any environment in the project yet, we simply state we are waiting for the next deploy.
After the first deployment, we run the autodetect logic.
We then display an expanded list of metric groups we recognize (e.g. HA Proxy or Postgres) and for which we were able to find a tag matching $CI_ENVIRONMENT_SLUG.
Last, we have a collapsed list for any metric groups we found (we see the metric names) but were unable to find a matching environment tag for. If no groups are in this state, the list should not appear.
This would allow users to easily see what we found, provide insight into what is happening behind the scenes, and allow folks to troubleshoot without diving into logs. It would also hopefully encourage people to add the proper environment tag to metric groups that do not have it.
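A rough sketch of the data that page could surface; the group names and keys are placeholders:

```python
# Groups whose metrics matched an environment label are listed expanded; groups we
# recognized but could not match to $CI_ENVIRONMENT_SLUG are listed collapsed, as a
# hint to add the proper environment tag.
integration_page = {
    "detected": ["HA Proxy", "Postgres"],    # metric names found and environment tag matched
    "missing_environment_tag": ["Redis"],    # metric names found, no matching tag (collapsed)
}
```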
Then on the dashboard:
We display each found metric under its group title, ordered by appearance in the Library YML file. So this would be something like Ingress, then App tier, then DB tier, as one example. We do not need to support re-arranging, but should support 2 charts per line to reduce space requirements when there are 6-10 metrics displayed.
We should definitely use background worker requests and cache the results of the metric/label queries. We do not want to hit the Prometheus server many times for each page load.
Loading the cached result set from the GitLab API via the Redis cache should be fine.
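A minimal sketch of that caching approach, assuming a Redis client; the key name and TTL (standing in for the "reset learning" window) are illustrative choices:

```python
import json
import redis

cache = redis.Redis()
CACHE_KEY = "prometheus:detected_metrics:%s"   # one entry per environment (assumed key name)
CACHE_TTL = 10 * 60                            # seconds; expiry doubles as the learning reset

def cached_detected_metrics(environment_slug, detect):
    """Return cached detection results, running the detect routine only on a cold cache."""
    key = CACHE_KEY % environment_slug
    cached = cache.get(key)
    if cached:
        return json.loads(cached)
    result = detect(environment_slug)          # e.g. the autodetect routine sketched earlier
    cache.setex(key, CACHE_TTL, json.dumps(result))
    return result
```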
@jivanvl Examples: if a group only has 1 chart, it would be full-width, because it's odd-numbered. If it has 2 or 4 charts, those will be put into two columns. If it has 3 charts, the first two will be put into columns, and the 3rd one will be full-width, because it's odd-numbered.
This image illustrates the behavior with 1, 2, or 3 charts. Does this answer your question?
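Expressed as a tiny sketch of that rule (purely illustrative):

```python
# Pair charts into two columns; an odd chart left at the end of a group spans full width.
def rows(charts):
    pairs = [charts[i:i + 2] for i in range(0, len(charts), 2)]
    return [{"charts": pair, "full_width": len(pair) == 1} for pair in pairs]

print(rows(["cpu"]))                        # 1 chart -> full width
print(rows(["cpu", "memory", "latency"]))   # 2 columns, then the 3rd chart full width
```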
@joshlambert I have a question about the new state that says Waiting for deployment… Should this be displayed instead of the Waiting for performance data state, or what is the condition that triggers it? Currently the 3 states that existed before work as intended.
@jivanvl The intent of that screen was to inform users if there have been no deployments to a given environment. When no deploy has occurred to any environment in the project yet, we simply state we are waiting for the next deploy.
I think we can remove the "or redeploy" text and the "Redeploy" button, since we changed the autodetect logic to always run on every cache fill.
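A hedged sketch of the empty-state decision implied here (names are illustrative, not actual implementation):

```python
def dashboard_state(has_deployment, has_metric_data):
    # No deployment yet -> "Waiting for deployment..."; deployed but no data yet ->
    # the existing "Waiting for performance data" state; otherwise render the charts.
    if not has_deployment:
        return "waiting_for_deployment"
    if not has_metric_data:
        return "waiting_for_performance_data"
    return "show_charts"
```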
@joshlambert to clarify, if a deployment has occurred before the Prometheus integration is configured, will the user need to re-deploy? Or is a deployment only necessary if no deployments have ever occurred?