Currently when a Merge Request is deployed, it is not easy to view the impact to performance of the changes in a simple way. We can improve this by having a sparkline on the Merge Request page illustrate the impact.
Proposal / Design
We can offer a simple and concise summary of the impact of a Merge Request on performance, by display a sparkline on the workflow itself. The best single metric for now would be Memory, which would be applicable for more deployment environments than CPU. We should display 30 minutes before and after the merged code was deployed.
We can use the mockup created for the Prometheus monitoring demo as a good starting point.
@tauriedavis What would you recommend for layout if we are unable to compare before/after to get actual delta in a metric? For this release, we may only be able to display the sparkline without the associated analysis.
What would you recommend for layout if we are unable to compare before/after to get actual delta in a metric? For this release, we may only be able to display the sparkline without the associated analysis.
I think we need to label the sparkline in some way. If we don't have the delta information, can we change the text from Memory increased from [#] to [#] to Memory usage over the last [time period] instead?
@bjk-gitlab what do you think about +/- 10 minutes from the deployment time? What are your thoughts around the data, especially depending on the retention length on the Prometheus server?
Ideally we'd store the time series data we are interested in, but I imagine that will be more effort. That could be a challenge given the shortened window we have in 8.17.
Let's start with +-30min. The default graphs many people look at are a 1 hour timeframe. Depending on the code, deployment size, and warm-up for the app, it could take 5-10min just to deploy.
When displaying the chart, when there has been less than 30 minutes since deployment... We should show the chart left aligned with the 30m prior to deployment. Then after that the sparkline should build to the right, until we hit the 30m time window after deployment.
We should make sure we gracefully fail if for some reason there isn't any Prometheus data available for this environment. This could happen if the Prometheus server went down, has no data, or it expired.
I have opened #27545 (moved) to locally persist this data after 30m, as people will load the MR up later and expect it to be there, perhaps a significant time later. We can do this later, though.
When displaying the chart, when there has been less than 30 minutes since deployment... We should show the chart left aligned with the 30m prior to deployment. Then after that the sparkline should build to the right, until we hit the 30m time window after deployment.
I'm confused by what you mean by left aligned with the 30m prior to deployment. Can you elaborate @joshlambert ?
@tauriedavis, sorry for not being clear. Typically with charts illustrating performance, "current time" is at the far (right) end of the X axis, and data then starts to fill in as the timescale on the X axis proceeds forward. And indeed, this is how the behavior of our Environment chart should work.
However with our sparkline here, we should anchor the timestamp of the MR in the center. And as time continues on (if it has been less than 30m since merging) the sparkline should then build out to the right until it hits 30 minutes.
Essentially it should start something like this "---| " and then proceed "---| ", "---|-- " etc. Hopefully this is more clear.
I think we should pull metrics from environment which is running the target branch.
If we pull them from the environment running the source branch, it really isn't the MR that is the event but each and every commit that makes up the MR.
cc: @dzaporozhets@fatihacet@selfup (as this is marked as a Deliverable and can perhaps be related to existing work already done, although i don't see big hurdles or anything)
@dimitrieh Updated designs would be fantastic. I will move this to Stretch, based on the fact that our backend resource is still working on 8.17 regressions/issue and has #26910 (closed) up first.
Would be great to get the UX nailed down though just in case we have time.
@joshlambert I've uploaded my current changes to !9950 (closed) MR. The only thing its missing right now are some rudimentary unit tests for new query functionality. This MR introduces endpoint that will return metrics only concerning 1hr timeframe around Deployment.
The only difference now is I've added additional field called cpu_previous that along with cpu_current can be used to display total change between deployments. i.e. 130MB -> 143MB
@fatihacet It's kind of cheap to use the canvas you add to the screen through the template but it works. I can't think of another way unless I create a directive which would work too. I like to component better.
@psimyn Cool. six of one, half a dozen of the other. They would prob both be just as fast. It's just 60 points being draw. Even 2000 points would be nothing. You can see the Vue component above. You could make yours a vue component too.
Mine is prob upside down but the thing you'd have to fix is the scale. I think my calcs are most likely off a little but I multiply by the difference percentage so if it is rendering numbers in the thousands or millions it should scale to though. You can't just render whatever the numbers are or it won't work everytime. That's what all the extra stuff is.
@jschatz1 the SVG uses viewBox to handle the scaling - Y values are between 0 and diff, and retina/non should Just Work. May need scaling for aspect ratio
For either of them, just need to do maxNum - Y for converting from Y0 being at the top
Not Vue for mine yet :)
Edit: yeah, canvas perf will demolish SVG (especially if you want any animation or draw lots of points).
Yours scaled very nicely. Mine did not. Maybe yours is better for this case. https://codepen.io/jschatz1/pen/XMEQGL Unless I fix the scaling issues for mine.
Thanks @jschatz1 and @psimyn, this is looking good! Is it possible to left align the sparkline in the event we don't have the full dataset (+/- 30 mins), rather than center align?
@psimyn as @jschatz1 mentioned we unfortunately aren't able to know the number of points in advance, as we do support customers bringing their own Prometheus server. This means they can control the sample interval, which would affect the number of points within the given hour period.
That said we do know the time bounds and we will always get 30 minutes prior to the merge which is the first half of the chart, and eventually 30 minutes after the merge. (We may not get all 30 minutes in the event this runs into the future) Could we use this timestamps to align instead of just the # of data points?
Because of challenges with displaying the deploy time (the ball in the chart) we will push that for now and just display the sparkline for 9.1. Issue for displaying the deploy time: #30576 (closed)