We have noticed that when the application is in a state where it cannot be started (e.g. a missing NFS mount), we can't actually bring the deployment page up.
We should use a different mechanism for this, probably serving the page at the HAProxy level instead of relying on the monolith, since when the monolith can't start we can't serve the page at all.
We should use the opportunity to change the page to a downtime page instead of a deployment one, and return a different error code so that the unplanned downtime is accounted for correctly.
Given that we need this in order to have a reliable measure of our availability, it is top of mind for me. Where does it sit in the stack of priorities, @pcarranza? \cc @sitschner
This is a really easy fix for us; the logic is already in the HAProxy configuration. I'll just add a custom HTTP page for our 503 message and modify the HTTP checks to fetch a specific string from the backend servers (more than just a 200 status check).
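A minimal sketch of what that could look like in the backend section, assuming a hypothetical health endpoint (`/-/health`), expected string, and error page path; the real values would come from our actual configuration:

```
backend gitlab
  # Ask the application for a real health page instead of just checking the port
  option httpchk GET /-/health
  # Only treat the node as healthy if the response body contains this string
  http-check expect string OK
  # Serve a custom page (a raw HTTP response file) when we have to answer with a 503
  errorfile 503 /etc/haproxy/errors/503-gitlab-down.http
  server web1 10.0.0.1:8080 check
```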
@pcarranza that's a little harder to do, because you're talking about dynamically altering the HAProxy configuration. I will look into this for a boring and elegant solution.
Ok - work towards this landed in a now-live refactor of the HAProxy configuration. We check the backend hosts more frequently, with dynamic check intervals, i.e.:
We poll every backend node every 2 seconds to see if it's reachable and the service is responding. We use a splay of 2% to make sure that not all checks happen at the exact same time.
If a node misses a health check, we move to more rapid polling, once every second. After three missed polls (or 3 seconds later) the node is ejected from the backend and marked as down.
Once a node is marked as down, we ease off and only poll every five seconds to see if it's back up.
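As a rough sketch, those timings map onto HAProxy check parameters roughly like this (the server names, addresses, and health endpoint are placeholders):

```
global
  # Spread health checks out by up to 2% of the interval so they don't all fire at once
  spread-checks 2

backend gitlab
  option httpchk GET /-/health
  # inter: normal check interval; fastinter: interval once a check has failed;
  # downinter: interval while the server is marked down; fall: failures before marking down
  server web1 10.0.0.1:8080 check inter 2s fastinter 1s downinter 5s fall 3 rise 2
  server web2 10.0.0.2:8080 check inter 2s fastinter 1s downinter 5s fall 3 rise 2
```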
Just seeing that this is scheduled for the milestone after next, wow... even though, as @pcarranza notes, it is really important to have proper downtime accounting. Any way this can be expedited, @pcarranza? \cc @sitschner
@sitschner I think I have a working model for this. I'm going to stub an MR for the model and then put work towards it. My goal is to make the checks more robust, reaching into the application for more than just a dumb "did the port return open" check, and then when we have no healthy backend servers left, we'll serve a "GitLab is currently unavailable" page. When we're deploying, that also counts as down.
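For the custom page itself, HAProxy's `errorfile` expects a complete raw HTTP response on disk. A minimal sketch of what such a file could contain (the filename, wording, and markup are placeholders, not the final page):

```
HTTP/1.0 503 Service Unavailable
Cache-Control: no-cache
Connection: close
Content-Type: text/html

<html>
  <body>
    <!-- Placeholder wording; the real page would match our branding -->
    <h1>GitLab is currently unavailable</h1>
    <p>We are aware of the issue and are working on it. Please try again shortly.</p>
  </body>
</html>
```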