Add retry to reconfigure method to avoid flaky failures
What does this MR do and why?
Works around for https://gitlab.com/gitlab-org/gitlab-qa/-/issues/677
To avoid these flaky failures, this MR introduces a retry mechanism to reconfigure
which restarts the container which will hopefully not encounter the same failure upon retrying.
How to set up and validate locally
To replicate the errors found in our pipelines is difficult, as they appear to be intermittent and flaky.
However the following steps can trigger the error described by deleting config files during startup.
- In terminal window 1 -
while true; do sleep 1; docker exec $(docker ps -q) bash -c "rm -rf /var/opt/gitlab/"; done
- In terminal window 2 -
./exe/gitlab-qa Test::Instance::Image EE
With this change in place,
- If the errors continue happening - it will fail 3 times, before exiting with a
(Gitlab::QA::Docker::Shellout::StatusError)
as happens today. - If the errors only happen once (cancel the script in Terminal Window 1 after it fails once) - you should see the container start as normal after restarting
Working example of change
On the pipelines for this MR see pipelines#539645510 / job ce:update-parallel 2/5
Note how we encountered a error [31mError executing action `enable` on resource 'runit_service[gitlab-exporter]'
which previously would have caused the job to fail - but we now restart the containers instead of allowing the raised error fail the job, and see that the job passes successfully.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.