As part of GitLab 9.0, we are shipping support for two metrics as our MVP: CPU and Memory utilization, pulled from Kubernetes. While these two metrics are critical pieces of information, there are many other metrics available that customers will want to keep an eye on. To support a broader set of metrics, like request and error rates, we need to expand beyond these two.
There are two main categories of metrics:
Common metrics from well-known exporters, as defined on the Prometheus Exporter page.
Customer-specific metrics which a customer may have added to their own app, much like we have done with gitlab-monitor.
For this issue, we will focus on the common metrics that are included in the list of well-known Prometheus Exporters.
Common Metrics
In most cases, customers will be using metrics from well-known exporters; these are by far the most common and most likely to be used. To make this as easy as possible, we should offer a "Metric Library" which contains a preset list of queries for well-defined metric names. When the Prometheus server is first configured, we can then attempt to auto-detect which of these metrics are being monitored, based on this library.
Library Format
The library of metrics should be specified in a YAML file. The YAML file should be a list of services (like "HA Proxy", "Apache", etc.), each of which has a collection of metrics.
Service Name: Name of the service, for example "HA Proxy" or "Apache"
Priority: Relative priority to show on the page. (Higher has more priority and should show first)
Array of Metrics
Each metric then has its own properties:
Metric: Base metric name, for auto detection purposes
Metric Name: English-language name of the metric. For example, "CPU Utilization" or "Error Rate".
Array of Queries
Query: Prometheus query to be used, with variables included.
Query Name: Name of the query, for example "Average CPU Utilization".
Query Units: Unit type returned by the query. For example "MB" or "Requests/sec".
Weight: Floating-point value from 0 to 1 which indicates the relative importance of the metric. Used when we have to choose a small set of metrics to show, like in the Merge Request flow.
For now, we will limit the number of queries per metric to one, but we should plan for supporting more within the same format in the future.
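A minimal sketch of what such a file could look like, assuming YAML key names derived from the fields above and a simple `$CI_ENVIRONMENT_SLUG` placeholder for the environment label (none of the key names or the substitution syntax shown here is final):

```yaml
# Illustrative only: key names are derived from the field descriptions above and may change.
- service_name: "HA Proxy"
  priority: 1
  metrics:
    - metric: haproxy_frontend_http_requests_total   # base metric name, used for auto-detection
      metric_name: "Throughput"
      queries:
        - query: 'sum(rate(haproxy_frontend_http_requests_total{environment="$CI_ENVIRONMENT_SLUG"}[2m]))'
          query_name: "Total requests"
          query_units: "Requests/sec"
          weight: 0.8   # 0-1, relative importance when only a small set can be shown
```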
Variables
In some cases, the Prometheus server may be picking up more than a single environment. A good example of this is the Kubernetes exporter, which will report data on the entire cluster. In cases like these, we need to provide a way to distinguish one environment from another. To do this, we need to support substitution of the CI variable CI_ENVIRONMENT_SLUG. This is the identifier used across GitLab to identify an environment, and we should continue to use it here.
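For example (the label name and substitution syntax are illustrative), a library query could embed the slug so that only series for the current environment are returned:

```yaml
# Illustrative only: the slug is substituted into a label matcher at query time.
query: 'avg(rate(container_cpu_usage_seconds_total{environment="$CI_ENVIRONMENT_SLUG"}[2m]))'
```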
Autodetection Process
We will run auto-detection on every cache fill (currently every 30s).
Detection method:
Retrieve the list of scraped metric names from the Prometheus server.
For each metric on the Prometheus server that matches one in the Library
Perform a simple query on that metric, to return all entries of that metric. Use the maximum supported time scale (currently 8h) to filter out old entries.
Search all entries for ones that have a label matching CI_ENVIRONMENT_SLUG
Save any matches to the cache.
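As a concrete illustration of these steps (the metric name is an example only), detection of a known metric boils down to a range query over the supported window, after which the returned series are filtered by their label values:

```yaml
# Hypothetical sketch of the detection step for one known metric name:
# run a range query over the maximum supported window, then keep only series
# whose label values include this environment's CI_ENVIRONMENT_SLUG.
detection_query: 'haproxy_frontend_http_requests_total[8h]'
```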
Once a metric has been successfully detected, it should be added to the monitoring list for this environment and scraped.
In the event a cached metric is not returning data, we should attempt another 9 times (for a total of 10, which is 5 minutes) before purging it from the cache.
API
As part of this, we should ensure we continue to allow full configuration via the API as well, including being able to add new queries, etc.
The “More information” link points to the documentation section about the metrics auto-detection logic.
The “Missing environment variable” panel is collapsed by default and is only shown if there are exporters with missing environment variables.
Documentation blurb
As part of GitLab 9.0 we launched application performance management integrated with CI/CD deployments, monitoring deployed applications on Kubernetes by tracking CPU and Memory utilization. This was a great first step, and with GitLab 9.3 we are excited to launch significantly expanded support for other metrics and deployment types.
Now with 9.3, GitLab will automatically detect common system services like web servers and databases, tracking key metrics like throughput and load. With support for a much wider set of metrics, performance monitoring is now available for all deployments.
@tauriedavis, would you be able to consider this from a UX perspective? Essentially we would like to provide the option to expand our metrics, and allow administrators to enter their own if they would like.
Ideally we would also offer a button of sorts to scan the configured Prometheus server for known metrics.
I am hoping we can do this with 9.1, so would be great if you could take a look soon!
For the metric library dropdown, rather than making the user refresh/scan what is available, would it be possible to do this in the background, update the list automatically, and show a tag that says "# new options available"? As they scroll the dropdown, the list would highlight the new options, and the tag would be removed for users who have viewed them.
Can you provide an example of what a user may enter for the customer-specific metrics? I'm a little confused about this part of the proposal. If a user adds custom info that we don't know how to represent, how will we know what to display from a UI perspective?
Thanks @tauriedavis for taking a look. We can certainly capture all the ones we recognize, but then maybe we offer an option on whether or not it should be shown on the Environments tab. My primary concern is there could be quite a few metrics we recognize for someone who has broadly adopted Prometheus (like GitLab, probably upwards of 50), and we don't yet have good methods to deal with a high volume of charts on that page.
An example of a customer-specific metric would be something like the gitlab_monitor exporter which we built. We built our own exporter to report on metrics that are relevant specifically to our app, like Git SSH timings. There wouldn't be a way for us to know these ahead of time, so we'd have to allow an admin to type them in.
They could provide the full PromQL query like this one for Memory, since they would know the best way to represent it. (Gauge, Rate, etc.)
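For illustration only (this is not the original example referenced above), such a memory query might look like:

```yaml
# Hypothetical full PromQL query an admin could enter for memory, reported in MB.
query: 'avg(container_memory_usage_bytes{environment="$CI_ENVIRONMENT_SLUG"}) / 1024 / 1024'
```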
I meant that we automatically check for new metrics but they aren't automatically added.
Here you can search for an identified metric or add your own. This design is similar to our new search. It also shows what new metrics would look like, and they would appear at the top. If you begin typing, it should filter the dropdown. You could also add your own query and add that. I imagine we will know when a query is incorrect so we can throw an error?
We will know if they are attempting to enter an invalid metric name, but the rest of the query could still contain errors which would be hard to detect. (A specific environment may not yet be up and running.)
Unfortunately, the metric names themselves are not always obvious or self-evident. You'd have to consult the Prometheus server or read the exporter documentation to know which ones are available to you. Because of that, I think we should instead detect the metric names and then offer a more "English" version of what the metric is. For example, instead of node_filesystem_free we could simply say "Free Disk Space (GB)".
Once selected, we would then show the complete query so they can see exactly how it is set up and change if necessary.
@tauriedavis thanks for looking into this and providing a working basis, it was very helpful
I've mostly refined @tauriedavis's proposal after a call with @joshlambert. The help text under the “Active” and “API URL” fields only applies if using Omnibus Prometheus. However, I think the help text for “API URL” is useful, so I tweaked it a bit and propose that it's shown conditionally if using Omnibus Prometheus. If not, it's hidden.
Here are the designs for the various states:
Empty
Filled
Add metric: open
Add metric: filter
Add metric: filled
Edit metric
This layout changes the common service page layout, so I'm not sure if it will be possible to implement this on the first iteration as presented.
This looks super awesome, @pedroms! Thanks for picking this up!
We currently use a solid green button for adding items to a list throughout the site (members, approvers, etc.), and the same for the save button. I see why you made them an outline though. I wonder if we should create a separate issue to look into these situations throughout GitLab to ensure our use of primary buttons is consistent. What are your thoughts?
This is the same for the delete button.
We also always place cancel to the right, although we have an issue to change that. https://gitlab.com/gitlab-org/gitlab-ce/issues/26248 I think they should all be changed at once, though, because switching them on different pages could be really confusing to muscle memory.
@tauriedavis I'm ok with having the “Save” and “Add metric” as solid buttons, instead of outline buttons, for this issue. I agree that there is a lack of rules on when and how to use different button styles. I've created https://gitlab.com/gitlab-org/gitlab-ce/issues/29641 to address that.
However, I have a strong opinion about the “Delete” button. IMO, destructive actions should always use an outlined button, unless they are the primary action on the screen (e.g. delete confirmation modal). Even so, I'm ok with postponing this change to do it all in one go. Do you think that's better?
For the time being, I'll swap the “Save” and “Cancel” buttons. Thank you for linking to that issue, I'll take a look.
@pedroms Thanks for working on this - I think your proposal looks great!
One item I added to the description, that we should also think about is if someone enters a bad query. If the query has syntax errors, it will actually generate an error and in that case we should report the issue and refuse to save it. (Since it is a syntax error, it will never work.)
@joshlambert yup, that makes sense. I hadn't designed the error state yet, will do so tomorrow or Monday.
@jivanvl do we have any idea if any of the lists (dropdown and monitored metrics) will require async loading in any situation (upon load, upon save, etc.)?
@pedroms The save action depends on the controller/service that takes care of the Prometheus integration; it could refresh the page and show the saved Prometheus queries. For the dropdown, I can definitely see it having an async loading mechanism.
As for verifying whether the queries are correct or not: I'd like to say we could test them asynchronously for syntax errors, but I can't say for sure. I'd say we need some more backend expertise here.
My concern right now is how we are going to get those Prometheus queries. Are we going to store them in our database?
@jivanvl ok, thanks for your feedback. I'll refrain from designing any of the loading and error states until we are certain of what we need so there isn't lost effort on this.
@joshlambert IIRC during one of the Prometheus weeklies we wanted to check the possibility of having a database table to store the queries; we also wanted to discuss whether, backend-wise, we were going to support things like having an endpoint to obtain the queries asynchronously.
@jivanvl You mean the "library" of queries to compare against, or the list of queries that have been configured?
I was thinking for the "library" a YML file or similar may be best, which could then allow the system administrator to customize as needed. A table could certainly work as well.
For the list of metrics that have already been configured, yes I think we will absolutely need a table for this.
I think a "library" could be realized as a good documentation with lots of examples. But once metrics are added to project they should be kept in a DB table.
I just thought it would be great to be able to set the metrics via functionality similar to .gitlab-ci.yml.
So you could deploy new metrics along with the code that will expose them.
So would this YML file be included within GitLab, or how should we add it? I'm thinking that if we have that library, we should load the examples via a dropdown or something on the frontend so people can choose from those examples.
@jivanvl @pchojnacki, I think a lot of the value with Prometheus and GitLab will be in how "magical" the experience feels. We need this to "just work" without much in the way of configuration, by autodetecting what we can and using conventions for what we can't. If someone needs to manually configure an edge case, that's fine, but for items we can "know" we should detect and automate.
So my thought would be that we package (either file or DB) this library, and then can use that to automatically detect and identify known metrics. This way when someone enables a Node Exporter which has an environment tag, it is a pretty delightful experience.
@joshlambert I like the magical part about autodetecting common metrics. However, I think where those defaults are stored is an implementation detail. For simplicity, they could even live in the code, depending on how many there will be, how autodetection will work, etc.
@jivanvl My idea for a YML file configuring metrics would be to be able to check a .gitlab-metrics.yml file into the repository that contains the code to be deployed. That YML file would configure which metric queries to run and where.
So for example, say I develop a web application and add a metric that exposes the count of fatal errors. When I deploy, I would have to go to the UI and modify the displayed metrics manually. What I think could be useful is being able to reconfigure GitLab metrics monitoring along with an MR by using this .gitlab-metrics.yml file.
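A minimal sketch of what such a `.gitlab-metrics.yml` could contain, reusing the library format from the proposal above (the metric name and keys are purely illustrative):

```yaml
# Hypothetical .gitlab-metrics.yml checked into the application repository.
- metric: myapp_fatal_errors_total        # custom metric exposed by the application
  metric_name: "Fatal errors"
  queries:
    - query: 'rate(myapp_fatal_errors_total{environment="$CI_ENVIRONMENT_SLUG"}[5m])'
      query_name: "Fatal error rate"
      query_units: "Errors/sec"
```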
A normal rails yaml config file should be enough to support various auto-detection schemes. I don't think it's necessary to have this be in the database.
The things we want to store are very common, like http_request_duration_seconds, or similar. They can be divided into categories like gauge, counter, histogram, etc.
Thanks @pedroms. Yes, we can delay the custom metric support to a future release and focus on automating the Prometheus configuration as much as we can.
I spent some additional time thinking about how to best convey what we detected and will be monitoring, as well as what isn't working.
If no deploys have occurred, we simply show text or an indicator that we are waiting for the first deploy.
After the first deploy, we run our detection logic.
In a table or list titled something like "Configured Metrics" we list each group of metrics we found. (E.g. "HA Proxy" or "Postgres"). This would be the list of things that we are now monitoring, and are working.
Below that, a list which defaults to collapsed. This would be a list of metrics we found but for which we could not find a label matching $CI_ENVIRONMENT_SLUG. Basically these are metrics that are in Prometheus, but aren't working.
I think this would provide a healthy amount of insight into what "magic" we are performing, and some helpful troubleshooting information in the event something isn't working. (For both our Support and the customer.)
@joshlambert yup, that's it! Feel free to update the description with the SSOT. Thanks for the sketch, that's exactly what I had in mind as well ("boring solutions").
Oops, I don't know what happened with the accidental close and unlabeling of this issue.
I've read through the issue, and the only thing I'd add besides what @bjk-gitlab suggested is support for more than one metric when doing the autodetection, specifically to support queries like:
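For example (metric names are hypothetical), a query built from two different metric names, which per-metric autodetection would not match directly:

```yaml
query: 'rate(myapp_http_errors_total[2m]) / rate(myapp_http_requests_total[2m])'
```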
@bjk-gitlab @pchojnacki what aspects of the detection are you concerned with? Is it the process of detecting, or how often it is triggered?
@bjk-gitlab your comments seem to stem more from how often it is triggered, but the reason I have it triggered often is that it can take time for a brand new environment to fully get up and running, and then for Prometheus to scrape the exporters. Think of a Review app, for example. That is the main reason for the increasing delays: to try not to overload the system, but also to better handle environments where it can take time to get all metrics scraped.
@joshlambert I think it would be great to simplify this by reusing the same mechanism we use for the current metrics, i.e. when you fetch the metrics, the code checks whether they are available in the cache; if not, a background job is triggered to fill the cache while a 'check later' response is returned to the client.
When filling the cache, the background job will do the autodetection.
Compared to active cache filling:
automatically adjusts to access patterns
arguably reduces complexity
The downside is that the first page load may take a little longer, as the page needs to poll periodically for the first batch.
@pchojnacki if we can handle the extra load of doing the metric detection on each cache fill that is even better. Right now it is invalidated every 30s, I believe.
I wasn't trying to suggest actively filling it, but rather throttling how often it can run to reduce impact. But if it's lightweight, doing the simple approach and running it every time is even better from both a customer UX and a simplicity standpoint.
@joshlambert No, the problem is this introduces a new query scheduling feature. I don't believe we have such a feature already, and I don't think we need the extra complexity.
By using the standard caching TTL semantics like @pchojnacki is saying, we avoid needing to create a new scheduler.
@bjk-gitlab I'm all for keeping it simple, but we need to ensure we meet two requirements:
The charts render in a reasonable time during page load. Right now it can take 2-3 seconds for them to appear; this is really the maximum that is reasonable before users start to think something is broken.
The cache lifetime of 30s should remain the maximum. Ideally we could get this down to the scrape interval of the Prometheus server, which is probably something like 15s.
If we can meet the above requirements, then I'm fine with keeping the caching strategy as is and doing the whole auto-detect process every cache load (30s).
As per our call, we will go with the current 30s TTL on the cache and do auto-detection on each fill. We are confident that this will not be a performance problem.
@joshlambert It sounds like this issue is about a few well-known metrics rather than custom metrics, but the doc blurb implies otherwise:
Developers can also instrument their own code for Prometheus and add them to GitLab, so it captures metrics specific to their application. With additional support for any metric, performance monitoring is now available for all deployments.
As mentioned before, we should consider custom metrics to be an EEP feature, not CE.
@pchojnacki Quick question: the API endpoint you're creating will return an object literal per group with its respective data and all of that good jazz, right?
@jivanvl Yes. Also remember this endpoint is for the FE only, so whatever suggestions you have I'll happily incorporate. Let me know if I miss some important data, or if something could make life easier on the FE side.
Oh, and this endpoint is also 'reactive' like the other metrics endpoints, i.e. you'll get a 204 on the first call and will have to call again to fetch the data, with exponential backoff if you keep getting 204 responses.
@pchojnacki @joshlambert Something came to my attention: when we support multiple Prometheus metrics, we have a queries array. Does that mean that in a single object literal we can get multiple graphs as well?
Also, there's something I'm not sure is in the current scope for the backend (apologies if it is and I just didn't see it): if there is no unit available, should I display N/A? Or should we make the user provide the units and not show the graph if they don't? (I honestly think this is rather extreme, but I wanted to cover all of the bases.)
Finally I noticed that in the snippet there are a couple of query results that look like this:
This only has one value. Should I create a graph with this, should we have an empty state, or could we just create a graph with a single plot point? I could use a hand determining the best way to display data that has only one element to plot.
Those are the things that are on the top of my head. Let me know if there's anything that needs clarification. Thanks!
It's good that the backend has support for multiple queries, but let's focus on shipping support for only one right now. If we get this merged and we have time left, we can start working on multiple queries per chart. =)
If no units are present, N/A seems reasonable. None of our default metrics will be this way, but in the event a user edits the file and fails to add a unit, this seems like the friendliest outcome.
I would imagine this is for a gauge type metric? We won't have any of these in the initial release, it should all be line charts. We can add this later.
@joshlambert It's more a matter of instant vs. range type queries. I followed the example of what we already have for other metrics, so I included support for instant queries to see if some code unification could be done on the backend side.
I think if the unit is not present then we shouldn't display anything.
As for the multiple queries, yeah, right now only one can be rendered, and in the future we can add more.
@pchojnacki thanks. It's great that we have backend support for gauges and the like, and we can add that to the FE once we have time. @jivanvl let's focus on delivering line charts with one line for now.
Just one last thing: is there a possibility of getting the type of metric validated somehow, either via the API prior to the file being sent or something of the sort? What do you think?
@jivanvl hmm, AFAIK it's not possible. @bjk-gitlab is there a way to run a query to determine the metric type (counter, gauge, histogram, summary, etc.)?
@joshlambert No, Prometheus doesn't currently store the metric type (as exposed by the TYPE annotation) in the TSDB. Supporting this metadata is a planned feature, but there is some debate about the semantics of doing so, so development has stalled.
@joshlambert As I can see in the attached screenshots, we have “Metrics are automatically configured and monitored based on a library of metrics from popular exporters.” as the description of Metrics, along with a “More information” link. Where should that link point to?
@joshlambert @pchojnacki I have a quick suggestion. I know we ask for the units, but would it be possible to also ask for the text we want to display on the y-axis? This way we could defer to the user on what they want displayed as the label for that axis, and display “Values” as a default in case no label is present.
That is an excellent point, we should include this. @pchojnacki can we add a "y-axis" item to the YML and JSON formats? It would be at the chart level, so along with title, priority, etc.
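A sketch of where such a field could sit, assuming a key like `y_label` at the chart level (the name and placement are not final):

```yaml
- title: "Throughput"
  priority: 1
  y_label: "Requests / sec"   # hypothetical y-axis label field, alongside title and priority
  queries:
    - query: 'sum(rate(haproxy_frontend_http_requests_total[2m]))'
```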
@joshlambert I'm waiting for review of the backend code. Also, I'm not sure what the status of @jivanvl's additional metrics dashboard is. The Prometheus preferences page is ready.
Thanks @pchojnacki. I believe @jivanvl is going back and forth on reviews. Do you need some help getting someone to review the MR? Just concerned we are approaching freeze.