GitLab Geo: SSH keys Sync
Rationale
The current Geo (#76) implementation allows both SSH and HTTPS clones on a secondary node, but only HTTPS access is actively synced (through database replication).
SSH keys are also stored in the database, but they require an extra step to be useful: gitlab-shell must add them to .ssh/authorized_keys.
SSH key changes don't happen very often, but because only one process at a time can change .ssh/authorized_keys (we have to acquire a file lock to prevent data loss), they are a good candidate to be enqueued/buffered and applied in batches.
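To illustrate why the file lock favors batching, here is a minimal sketch that appends several key lines under one exclusive lock. It is not gitlab-shell's implementation: the function name `append_keys_locked` is hypothetical, the key lines are treated as opaque strings, and `fcntl.flock` limits it to Unix-like systems.

```python
import fcntl
import os


def append_keys_locked(path, key_lines):
    """Append several key lines to `path` while holding one exclusive lock.

    Hypothetical helper for illustration only; the real file is managed
    by gitlab-shell.
    """
    key_lines = list(key_lines)
    # Append mode creates the file if it does not exist yet.
    with open(path, "a") as f:
        # Blocks until no other process holds the lock (single writer).
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            for line in key_lines:
                f.write(line.rstrip("\n") + "\n")
            # Push the batch to disk before releasing the lock.
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return len(key_lines)
```

Writing N keys this way acquires the lock once instead of N times, which is the benefit batching would bring if the lock ever became a bottleneck.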
gitlab-shell only provides batch support to add keys, not to remove old ones. The primary node also doesn't do any batched operation, so we can safely postpone any decision on that matter and change our minds in the future if that proves to be a bottleneck.
SSH Keys Sync
As changes to .ssh/authorized_keys involve file locking, buffering changes is not useful unless we can also make changes in a batched way on each secondary node.
I propose that we send change notifications to each secondary node immediately and asynchronously. After any addition, update, or removal of a key, we should create an async job to notify the nodes.
Job retries should be customized to happen more often. We expect at least the first execution to fail because of database replication latency. We can either schedule the job on the secondary node to run after X seconds, or let custom retry logic deal with that (documentation here).
I'm more inclined to use custom retry logic with a random delay of about 5 to 10 seconds.
A job should always query the database for the key when it runs. Otherwise, when multiple updates to the same key happen at the same time, a job could write stale key data over a newer version.
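A minimal sketch of that rule, assuming the notification carries only an action and a key ID, never the key text. All names here (`apply_key_change`, `KeyNotReplicatedYet`) are hypothetical, and plain dicts stand in for the replicated database and for the authorized_keys file.

```python
class KeyNotReplicatedYet(Exception):
    """Signals the job queue to retry later (hypothetical exception)."""


def apply_key_change(action, key_id, database, authorized_keys):
    """Apply one notification using the key as it exists in the database now.

    `database` and `authorized_keys` are dicts mapping key ID to key text;
    both are stand-ins for illustration.
    """
    if action == "create":
        current = database.get(key_id)
        if current is None:
            # Replication has not delivered the row yet: fail so we retry.
            raise KeyNotReplicatedYet(key_id)
        # Always write what the database holds at execution time.
        authorized_keys[key_id] = current
    elif action == "remove":
        # Removing an absent key is a no-op, so retries stay harmless.
        authorized_keys.pop(key_id, None)
    else:
        raise ValueError(f"unknown action: {action}")
```

Because the job reads the key at execution time, two jobs for the same key that run out of order both end up writing the same, latest value.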
Checklist
- When a key is created or removed we notify secondary nodes of that change.
- A secondary node must receive a notification to change .ssh/authorized_keys
When a notification to modify authorized_keys is received:
- It must generate an async job to add or remove keys
- The generated job should retry a reasonable number of times over a short period (30 times, waiting 5 to 10 seconds between attempts)
- The generated job should also fall back to exponential backoff once the short-period retries are exhausted.
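
The retry rules in the checklist could be sketched as a single delay function. The 30 short retries and the 5 to 10 second window come from the checklist; the backoff base (10 seconds, doubling) and the one-hour cap are assumptions made only for this illustration, as is the name `retry_delay`.

```python
import random

SHORT_RETRIES = 30          # from the checklist
SHORT_WINDOW = (5, 10)      # seconds, from the checklist
BACKOFF_BASE = 10           # seconds, assumed
BACKOFF_CAP = 3600          # seconds, assumed


def retry_delay(attempt, rng=random):
    """Return the wait in seconds before retry number `attempt` (1-based)."""
    if attempt <= SHORT_RETRIES:
        # Short phase: a random wait absorbs replication latency
        # without all jobs retrying at the same instant.
        return rng.uniform(*SHORT_WINDOW)
    # Long phase: double the wait on each further retry, up to the cap.
    exponent = attempt - SHORT_RETRIES
    return min(BACKOFF_BASE * 2 ** exponent, BACKOFF_CAP)
```

With these assumed values, retries 1 to 30 wait between 5 and 10 seconds each, retry 31 waits 20 seconds, retry 32 waits 40 seconds, and so on until the cap is reached.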