Highest Throughput and Most Time-Consuming Transaction Scripts
- lib/app_transactions/highest_throughput.rb 0 → 100644
```ruby
# This script calculates the actions with the highest throughput in the past 24
# hours.
require_relative '../config'

HOURS = (ENV.fetch('INTERVAL') { 24 }).to_i

rows = CLIENT.query(<<-SQL)
SELECT count("duration")
FROM "rails_transactions"
WHERE time > NOW() - #{HOURS}h
GROUP BY "action";
SQL
```

@pacoguzman If we can somehow downsample this and visualize it in Grafana, that may be easier than using a script. Having said that, I think InfluxDB still only supports sorting by timestamp (and not by any other value), which would make it hard to sort a list in descending order on, for example, the number of transactions per second.
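One workaround for the timestamp-only ordering could be to sort client-side, since the script above already fetches per-action counts. A rough sketch; the response shape (series hashes with "tags" and "values" keys) is an assumption based on the influxdb-ruby client:

```ruby
# Sort the per-action counts in Ruby, since InfluxDB only orders
# results by timestamp. Assumes `rows` from the script above and the
# influxdb-ruby response shape (an array of series hashes).
top = rows.map do |series|
  [series['tags']['action'], series['values'].first['count']]
end

top.sort_by! { |_action, count| -count }

top.first(10).each do |action, count|
  puts "#{count}\t#{action}"
end
```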
mentioned in issue gitlab-com/infrastructure#211 (closed)
Reassigned to @pacoguzman
@pacoguzman which one is the most expensive on average, disregarding the fact that builds have 5M calls? Which one is "slow" with a low number of calls but still representative?
mentioned in merge request influxdb-management!6 (merged)
@yorickpeterse I decided to use the existing downsampled data for these scripts; could you take a look?
Reassigned to @yorickpeterse
@pcarranza how do you want the info about endpoints that are slow but still representative in the total number of requests (which probably means users actually request those endpoints)?
Reassigned to @pacoguzman
Added 1 commit:
- 4847251d - Highest Throughput and Most Time consuming transaction scripts
Reassigned to @yorickpeterse
@pacoguzman it should be representative, but we should also worry about the P99 rather than the sum of all the requests. What I mean by that is that I want to know the slowest endpoints, not the endpoints that summed up the most time because there are millions of really fast calls.
Does that make sense?
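To make the distinction concrete, a small sketch with made-up numbers (the call counts and durations are hypothetical, durations in milliseconds):

```ruby
# Why ranking by summed time differs from ranking by P99: an endpoint
# with millions of fast calls dominates the sum, while the P99
# surfaces the genuinely slow one.
def percentile(durations, pct)
  sorted = durations.sort
  sorted[(pct / 100.0 * (sorted.size - 1)).round]
end

builds = Array.new(1_000_000, 50) # millions of really fast calls
diffs  = [4_000, 5_000, 6_000]    # a handful of slow calls

builds.sum             # => 50_000_000, dominates a sum-based ranking
diffs.sum              # => 15_000
percentile(builds, 99) # => 50
percentile(diffs, 99)  # => 6_000, the endpoint we actually care about
```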
@pcarranza what about this measure: https://en.wikipedia.org/wiki/Apdex? I guess we can count requests under 2s as satisfied and set the tolerating threshold to 4s. These numbers are something we need to align on as we improve the app, but as a starting point they should be good. The idea is to move the thresholds to get a worse Apdex, then work to bring it back towards 1.0, move the thresholds again, and keep iterating. Does that make sense?
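For reference, a minimal sketch of the standard Apdex formula with the 2s/4s thresholds proposed above (the thresholds are just the suggested starting point, not settled values):

```ruby
# Standard Apdex: (satisfied + tolerating / 2) / total samples.
# Thresholds follow the 2s/4s proposal above and are meant to be tuned.
SATISFIED_THRESHOLD  = 2.0 # seconds
TOLERATING_THRESHOLD = 4.0 # seconds

def apdex(durations)
  satisfied  = durations.count { |d| d <= SATISFIED_THRESHOLD }
  tolerating = durations.count { |d| d > SATISFIED_THRESHOLD && d <= TOLERATING_THRESHOLD }

  (satisfied + tolerating / 2.0) / durations.size
end

apdex([0.4, 1.2, 3.0, 5.5]) # => 0.625
```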
It's worth mentioning that over time I kind of want to drop this repository altogether. That is, all this information should be in Grafana so everybody can view it without having to set up credentials and whatnot.
Apdex is quite interesting and I have used it in the past somewhat, though not extensively. We could start with trying to calculate it in InfluxDB for some of the controllers/API endpoints.
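One way that calculation could look, as a rough sketch reusing this repository's CLIENT and the series from highest_throughput.rb (the millisecond thresholds and field names are assumptions):

```ruby
require_relative '../config'

# Count transactions per action that match a duration condition; the
# "duration" field is assumed to be in milliseconds.
def count_where(condition)
  CLIENT.query(<<-SQL)
SELECT count("duration")
FROM "rails_transactions"
WHERE time > NOW() - 24h AND #{condition}
GROUP BY "action";
  SQL
end

satisfied  = count_where('"duration" <= 2000')
tolerating = count_where('"duration" > 2000 AND "duration" <= 4000')
total      = count_where('"duration" >= 0')

# Apdex per action = (satisfied + tolerating / 2.0) / total, joining
# the three result sets on their "action" tag.
```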
@yorickpeterse how far are we from starting to get this data? Can we merge and improve later?
I'm good with dropping the repo later on.
@pacoguzman that's exactly what I want.
Eventually we need to have monitoring in place to trigger an alarm whenever a request goes over the threshold of the SLO.
But for now what we need is to bring order into the chaos.
@pcarranza In theory we should have the data, as the Apdex depends on the number of samples and a threshold (which we have to define ourselves). Using this we should be able to, for example, calculate the Apdex per controller over a certain time interval. For the Apdex I don't think we need to measure everything per minute, though it's of course still possible.
Edited by yorickpeterse-staging
@pacoguzman So can we close this MR in favour of getting Apdex calculations in InfluxDB?
I guess so, if we only want to order endpoints based on user-perceived performance like Apdex. I've been trying to figure out how to get the data into Grafana or InfluxDB, but I wasn't able to. I've made a script that gets the data from the last hour, where for the moment we fetch all the transactions -> https://gitlab.com/gitlab-org/gitlab-ce/issues/19273#note_14143191.
Do you have an idea on how to do this?
@yorickpeterse a general index will not tell me what page we need to work on. I need a list of pages sorted by the p99 request time.
@pcarranza Ah right. For that we just need to add a query that aggregates mean/p95/p99 timings per day (or another time frame larger than 1 minute), then we can sort the data in Grafana (InfluxDB can only sort by timestamp :/).
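For what it's worth, that aggregation could look roughly like this (a sketch assuming the same CLIENT and series names as highest_throughput.rb; exact series and retention policies may differ):

```ruby
require_relative '../config'

# Aggregate mean/p95/p99 timings per action per day; Grafana then
# handles the ordering, since InfluxDB itself only sorts by time.
rows = CLIENT.query(<<-SQL)
SELECT mean("duration"),
       percentile("duration", 95),
       percentile("duration", 99)
FROM "rails_transactions"
WHERE time > NOW() - 7d
GROUP BY time(1d), "action";
SQL
```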
@yorickpeterse that sounds great
@pcarranza These queries have been set up and group data per day. I'll check back tomorrow and set up a Grafana dashboard once the data is there.
I'm good with this as it is, thanks @yorickpeterse
@pcarranza Data is now here and is visualised at http://performance.gitlab.net/dashboard/db/daily-overview.