Capacity / infra planning for services used by app
The goal is to break this issue into smaller actionable issues
This is a bit of a meta issue; the goal is to break it up into smaller, bite-sized, actionable issues as quickly as possible, but for now it helps to gather my thoughts. This issue was inspired by https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/10280#note_27747364 and other similar issues and discussions in the recent past.
Not expecting any action from others on this right now; gathering my thoughts.
The issue
- This is about services used by the GitLab app, such as Redis, Sidekiq, Workhorse, (others?)
- These are services that have their respective experts, and are widely used by the GitLab application.
- These services and their code bases do not have owners, but they may have maintainers who review merge requests to the service code.
- In production, the services are run from nodes that are built and maintained by the Production team.
- Most of these services have some amount of monitoring hooked up to them, and alerts based on the monitoring.
- Various outages and performance problems on GitLab.com have resulted from regular but unforeseen behavior of the GitLab app in its use of the services. Regular in the sense that the app is behaving the way it was intended to, and the infrastructure is running the way it was intended to. Unforeseen in that perhaps an edge case is being executed, or adequate load testing of regular cases wasn't or couldn't be done, and this now leads to an outage or performance problem.
- There have been discussions about ownership of such services as a way to address these issues. To date, we have chosen not to assign individual owners, but to tackle issues as they arise on a case-by-case basis. This generally works, but it does mean that the issues arise unscheduled and compete for the scheduled time of team members.
- There have also been discussions about increasing the observability of the load that a feature or action in the app places on the underlying services and infrastructure (needs link), and about explicitly building monitoring and alerting around such things as part of feature development.
Some ideas on how to tackle
- It seems to me we need some sort of capacity planning for the services, which means that we need to understand how current feature sets and in-app actions place load on each service. We don't have that at the moment, and it seems to me that getting it requires effort from backend development.
- On the production environment side, the thinking is in the direction of adding more service nodes that service different kinds of requests (needs link), but that too will likely require involvement from backend development.
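As a rough illustration of what capacity planning per service could look like once per-action load is known, here is a minimal sketch. Everything in it is an assumption made for the example: the action names, the request rates, the calls-per-action figures, and the capacity numbers are invented, not measurements from GitLab.com.

```python
def service_load(action_rates, calls_per_action):
    """Expected calls per second on each service, summed over all actions."""
    load = {}
    for action, rate in action_rates.items():
        for service, calls in calls_per_action.get(action, {}).items():
            load[service] = load.get(service, 0.0) + rate * calls
    return load


def headroom(load, capacity):
    """Unused fraction of capacity per service (negative means overloaded)."""
    return {svc: 1.0 - load.get(svc, 0.0) / cap for svc, cap in capacity.items()}


# Hypothetical inputs: actions per second, and service calls made per action.
action_rates = {"push": 50.0, "mr_view": 200.0}
calls_per_action = {
    "push": {"redis": 4, "sidekiq": 2},
    "mr_view": {"redis": 10},
}
# Hypothetical sustainable calls per second for each service.
capacity = {"redis": 4000.0, "sidekiq": 200.0}

load = service_load(action_rates, calls_per_action)
remaining = headroom(load, capacity)
```

The hard part is the `calls_per_action` table, which is exactly the knowledge we don't have today and which would need to come from backend development.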