Add runbooks for common on-call situations
Since our last NFS outage, it makes sense to collect what we have learned so that whoever is on call can quickly rule out the common problems (what should we be paying attention to?) and apply the known solutions (how do we fix it?).
Let's start by documenting the NFS case, since that is the main source of our problems right now, and keep adding from there.
- Document how to handle NFS going dark
- Reboot NFS Server
- Restart Redis
- Restart Unicorn workers
- Restart Sidekiq workers
- Restart PostgreSQL
- Recover PostgreSQL replication
- Upgrade to latest stable EE version
- Rollback to previous EE version
- Handle common infra tasks with HAProxy (deny path, disable service)
- Update packages fleet-wide
- Create a new host, bootstrap it and add it to the monitoring tools
- Destroy a host and remove it from the fleet and the monitoring tools
I've already gathered graphs and will set up the document.