Add details about CI architecture, graphs and troubleshooting
This is an introduction to the CI architecture, with a description of the data that we have for monitoring.
I would like it to be the basis for the CI shared runners training that I want to organize in order to pass on as much of my knowledge as possible.
We could then work collaboratively on more accurate runbooks for the most common problems that we have faced recently.
Currently, the resolutions are part of the docs; they reflect what I did over the past few days while trying to figure out and solve these problems.
Activity
mentioned in issue infrastructure#1260 (closed)
We should also add resolution of these problems: https://gitlab.com/gitlab-com/infrastructure/issues/1421.
@ayufan thanks for writing all this documentation. Can you also link to the relevant runbooks from the associated alerts? Are those alerts already built?
@pcarranza what do you think of the idea of having @ahanselka review these docs and help the CI team set up relevant alerts? That can help form handbook guidance on how to write useful alerts, which in turn helps with https://gitlab.com/gitlab-com/organization/issues/62#note_25914642
@ernstvn I am still going to work on that, but I would love to hear some feedback and maybe get help with systematising the knowledge.
@ayufan I think that @ernstvn is right here, @ahanselka is your guy
I think this is a great start, @ayufan. We should merge as is and continue offline with discussions on how we can get better.
assigned to @ahanselka
mentioned in issue infrastructure#1451 (closed)