Can not replicate Geo database on Docker
I don't think this is related to #2176 (closed), but the discussion started there.
Summary
For some reason `gitlab-ctl replicate-geo-database` errors out, which means that Geo cannot be installed on Docker by following our documentation.
Steps to reproduce
Start installing Geo on Docker. This can get you started: https://gitlab.com/gl-support/testlab/tree/geo/geo
What is the current bug behavior?
`gitlab-ctl replicate-geo-database --host=primary` errors out with the following:

    [ERROR] Failed to execute: gitlab-ctl stop

When I run `gitlab-ctl status` after that, it shows that everything did shut down correctly.
What is the expected correct behavior?
The database should be replicated.
Relevant tips for working on this
The file that probably needs to be worked on is https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/files/gitlab-ctl-commands-ee/replicate_geo_database.rb.
On an Omnibus installation the file lives at /opt/gitlab/embedded/service/omnibus-ctl/replicate_geo_database.rb.
Removing the 30-second sleep on line 42 is helpful when debugging.
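To make the failure concrete, here is a rough sketch of the sequence that aborts, assuming the command shells out via Mixlib::ShellOut and raises when the exit status is non-zero. The helper name and structure are illustrative only, not the actual contents of replicate_geo_database.rb:

```ruby
require 'mixlib/shellout'

# Illustrative outline only, not the real replicate_geo_database.rb.
def run!(cmd)
  shell = Mixlib::ShellOut.new(cmd)
  shell.run_command
  raise "[ERROR] Failed to execute: #{cmd}" if shell.error?
  shell
end

run!('gitlab-ctl stop')  # in a Docker container this exits with 127, so we never get further
sleep 30                 # the 30-second sleep mentioned above; handy to remove while debugging
# ... the actual replication steps would follow here ...
run!('gitlab-ctl start')
```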
Activity
- De Wet changed title from Can not install GitLab Geo on Docker to Can not replicate database on Docker
- De Wet changed title from Can not replicate database on Docker to Can not replicate Geo database on Docker
- De Wet mentioned in issue #2176 (closed)
@dewetblomerus FYI, there is a `--no-wait` option to skip that 30-second timeout.

It looks like `gitlab-ctl stop` in a Docker container exits with status 127:

    root@stanhu-geo-secondary2:/var/log/gitlab# gitlab-ctl stop
    ok: down: geo-postgresql: 211s, normally up
    ok: down: gitaly: 210s, normally up
    ok: down: gitlab-monitor: 210s, normally up
    ok: down: gitlab-workhorse: 210s, normally up
    ok: down: logrotate: 209s, normally up
    ok: down: nginx: 209s, normally up
    ok: down: node-exporter: 209s, normally up
    ok: down: postgres-exporter: 209s, normally up
    ok: down: postgresql: 209s, normally up
    ok: down: prometheus: 208s, normally up
    ok: down: redis: 208s, normally up
    ok: down: redis-exporter: 207s, normally up
    ok: down: sidekiq: 206s, normally up
    ok: down: unicorn: 206s, normally up
    root@stanhu-geo-secondary2:/var/log/gitlab# echo $?
    127
This is failing the `error?` check here: https://github.com/chef/mixlib-shellout/blob/0455024df6b737e959fc7c2129aafc8cf4fddc88/lib/mixlib/shellout.rb#L266-L268. If you don't use a Docker container, the return value is 0.
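For context, a small sketch of how that check surfaces the failure. The Mixlib::ShellOut calls are real gem API; running it against `gitlab-ctl stop` and the expected exit codes are just the observations from this thread:

```ruby
require 'mixlib/shellout'

shell = Mixlib::ShellOut.new('gitlab-ctl stop')
shell.run_command

puts shell.exitstatus  # 0 outside Docker, 127 inside the container (per the output above)
puts shell.error?      # true whenever the exit status is not in valid_exit_codes ([0] by default)

# error! raises Mixlib::ShellOut::ShellCommandFailed when error? is true, which is
# presumably what surfaces as "[ERROR] Failed to execute: gitlab-ctl stop".
shell.error!
```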
Adding `--security-opt=seccomp:unconfined` to the `docker run` command allows `strace -f` to be run. This is what happens when `gitlab-ctl stop` is run in a Docker container:

    <snip>
    [pid 1118] stat("down", 0x7ffd4bcf4600) = -1 ENOENT (No such file or directory)
    [pid 1118] write(1, "ok: down: unicorn: 841s, normally up\n", 37ok: down: unicorn: 841s, normally up
    ) = 37
    [pid 1118] fchdir(3) = 0
    [pid 1118] exit_group(0) = ?
    [pid 1118] +++ exited with 0 +++
    [pid 1102] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 1118
    [pid 1102] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1118, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
    [pid 1102] rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7f0e498894b0}, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7f0e498894b0}, 8) = 0
    [pid 1102] rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0e498894b0}, {0x7f0e49d77da0, [], SA_RESTORER|SA_SIGINFO, 0x7f0e498894b0}, 8) = 0
    [pid 1102] rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f0e498894b0}, {SIG_IGN, [], SA_RESTORER, 0x7f0e498894b0}, 8) = 0
    [pid 1102] close(4 <unfinished ...>
    [pid 1103] <... poll resumed> ) = 1 ([{fd=3, revents=POLLHUP}])
    [pid 1103] read(3, "", 1024) = 0
    [pid 1103] read(5, 0x7f0e4a106720, 1024) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 1103] close(3) = 0
    [pid 1103] close(5) = 0
    [pid 1103] exit(0) = ?
    [pid 1103] +++ exited with 0 +++
    [pid 1102] <... close resumed> ) = 0
    [pid 1102] close(6) = 0
    [pid 1102] munmap(0x7f0e4a223000, 1052672) = 0
    [pid 1102] exit_group(127) = ?
    [pid 1102] +++ exited with 127 +++
    <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 127}], 0, NULL) = 1102
    rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f98f52044b0}, {0x444900, [], SA_RESTORER, 0x7f98f52044b0}, 8) = 0
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1102, si_uid=0, si_status=127, si_utime=11, si_stime=6} ---
    wait4(-1, 0x7ffc77f10c90, WNOHANG, NULL) = -1 ECHILD (No child processes)
    rt_sigreturn({mask=[]}) = 0
    read(255, "", 1262) = 0
    rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    exit_group(127) = ?
    +++ exited with 127 +++
The `WEXITSTATUS` above seems to be 127, coming from the process running `/opt/gitlab/embedded/bin/omnibus-ctl`. If I look for `execve` lines (as described in http://stackoverflow.com/questions/9673662/why-does-system-fail-with-error-code-127), I see:

    $ grep execve fail.txt
    execve("/opt/gitlab/bin/gitlab-ctl", ["gitlab-ctl", "stop"], [/* 8 vars */]) = 0
    [pid 1102] execve("/opt/gitlab/embedded/bin/omnibus-ctl", ["/opt/gitlab/embedded/bin/omnibus-ctl", "gitlab", "/opt/gitlab/embedded/service/omnibus-ctl", "stop"], [/* 9 vars */]) = 0
    [pid 1104] execve("/opt/gitlab/init/geo-postgresql", ["/opt/gitlab/init/geo-postgresql", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1104] <... execve resumed> ) = 0
    [pid 1105] execve("/opt/gitlab/init/gitaly", ["/opt/gitlab/init/gitaly", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1105] <... execve resumed> ) = 0
    [pid 1106] execve("/opt/gitlab/init/gitlab-monitor", ["/opt/gitlab/init/gitlab-monitor", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1106] <... execve resumed> ) = 0
    [pid 1107] execve("/opt/gitlab/init/gitlab-workhorse", ["/opt/gitlab/init/gitlab-workhorse", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1107] <... execve resumed> ) = 0
    [pid 1108] execve("/opt/gitlab/init/logrotate", ["/opt/gitlab/init/logrotate", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1108] <... execve resumed> ) = 0
    [pid 1109] execve("/opt/gitlab/init/nginx", ["/opt/gitlab/init/nginx", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1109] <... execve resumed> ) = 0
    [pid 1110] execve("/opt/gitlab/init/node-exporter", ["/opt/gitlab/init/node-exporter", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1110] <... execve resumed> ) = 0
    [pid 1111] execve("/opt/gitlab/init/postgres-exporter", ["/opt/gitlab/init/postgres-exporter", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1111] <... execve resumed> ) = 0
    [pid 1112] execve("/opt/gitlab/init/postgresql", ["/opt/gitlab/init/postgresql", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1112] <... execve resumed> ) = 0
    [pid 1113] execve("/opt/gitlab/init/prometheus", ["/opt/gitlab/init/prometheus", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1113] <... execve resumed> ) = 0
    [pid 1114] execve("/opt/gitlab/init/redis", ["/opt/gitlab/init/redis", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1114] <... execve resumed> ) = 0
    [pid 1115] execve("/opt/gitlab/init/redis-exporter", ["/opt/gitlab/init/redis-exporter", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1115] <... execve resumed> ) = 0
    [pid 1116] execve("/opt/gitlab/init/sidekiq", ["/opt/gitlab/init/sidekiq", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1116] <... execve resumed> ) = 0
    [pid 1117] execve("/opt/gitlab/init/sshd", ["/opt/gitlab/init/sshd", "stop"], [/* 9 vars */]) = -1 ENOENT (No such file or directory)
    [pid 1118] execve("/opt/gitlab/init/unicorn", ["/opt/gitlab/init/unicorn", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1118] <... execve resumed> ) = 0
What is up with `sshd stop`? Is this causing the problem?

    root@stanhu-geo-secondary2:/# echo "exit 0" >> /opt/gitlab/init/sshd
    root@stanhu-geo-secondary2:/# chmod +x /opt/gitlab/init/sshd
    root@stanhu-geo-secondary2:/# /opt/gitlab/init/sshd
    root@stanhu-geo-secondary2:/# echo $?
    0
    root@stanhu-geo-secondary2:/# gitlab-ctl stop
    ok: down: geo-postgresql: 1465s, normally up
    ok: down: gitaly: 1465s, normally up
    ok: down: gitlab-monitor: 1465s, normally up
    ok: down: gitlab-workhorse: 1464s, normally up
    ok: down: logrotate: 1464s, normally up
    ok: down: nginx: 1463s, normally up
    ok: down: node-exporter: 1463s, normally up
    ok: down: postgres-exporter: 1463s, normally up
    ok: down: postgresql: 1463s, normally up
    ok: down: prometheus: 1463s, normally up
    ok: down: redis: 1462s, normally up
    ok: down: redis-exporter: 1462s, normally up
    ok: down: sidekiq: 1461s, normally up
    ok: down: unicorn: 1461s, normally up
    root@stanhu-geo-secondary2:/# echo $?
    0
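One possible mechanism, sketched under the assumption that omnibus-ctl shells out to each service's `/opt/gitlab/init/*` script and folds the per-service exit codes into its own exit status; this is an illustration, not the actual omnibus-ctl source:

```ruby
# Illustration only: how a single missing init script can surface as exit code 127.
# The service list clearly includes sshd even though /opt/gitlab/init/sshd is missing,
# so it must come from somewhere other than the files present in /opt/gitlab/init.
services = %w[geo-postgresql gitaly gitlab-monitor gitlab-workhorse logrotate nginx
              node-exporter postgres-exporter postgresql prometheus redis
              redis-exporter sidekiq sshd unicorn]

exit_status = 0
services.each do |svc|
  system("/opt/gitlab/init/#{svc} stop")  # the shell reports sshd as not found...
  exit_status += $?.exitstatus            # ...which is exit code 127
end

exit exit_status  # 127 propagates even though every real service stopped cleanly
```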
Why are we trying to stop `sshd`? :)
- Stan Hu mentioned in merge request omnibus-gitlab!1556 (merged)
@dewetblomerus A quick fix would be to run this inside the Docker container:

    ln -sf /opt/gitlab/embedded/bin/sv /opt/gitlab/init/sshd

I submitted a fix via https://gitlab.com/gitlab-org/omnibus-gitlab/merge_requests/1556.
I'm still investigating why `gitlab-ctl status` returns an error code of 45 in a Docker container, but at least `gitlab-ctl stop` and `gitlab-ctl start` are doing the right thing.

Ok, I think `gitlab-ctl status` returns an error code of 4x normally if services are down, so I think we are good with https://gitlab.com/gitlab-org/omnibus-gitlab/merge_requests/1556.
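As a quick sanity check along those lines (sketch only; the 40–49 range is just the "4x" reading above, and Mixlib::ShellOut is used here instead of any GitLab helper):

```ruby
require 'mixlib/shellout'

status = Mixlib::ShellOut.new('gitlab-ctl status')
status.run_command

puts status.exitstatus
# After `gitlab-ctl stop`, a 4x code here just means services are down and is expected;
# something like the earlier 127 would still indicate a real failure.
puts 'services are down (expected after gitlab-ctl stop)' if (40..49).cover?(status.exitstatus)
```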
- Stan Hu mentioned in commit omnibus-gitlab@4a098168
- Maintainer
Huh, why are we doing anything with `sshd`? @stanhu Where is that coming from, any idea?

@marin It looks like we have been running sshd inside our Docker container for quite some time: https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/docker/assets/setup#L11
Checked with @ayufan on the need for sshd. Our sshd needs the git user in `/etc/passwd` and access to `authorized_keys`, and that is managed by omnibus internally.

- Maintainer
I know that we are running the `sshd` daemon, but I am now not at all sure why it needed to be within the omnibus management: https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/docker/assets/setup#L11-16. @twk3 Do you remember why this was?
- Contributor
@marin It was before my time. (A community contribution in 2014, I think: https://gitlab.com/gitlab-org/gitlab-ce/commit/0dcc1e88a4a9a1fe4745421474fcb3e93bfb87ef) But setting it up the way it is allows us to start everything with the one runit command here: https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/docker/assets/wrapper#L79
And it tails/persists the sshd log with the rest of our logs.
- Marin Jankovski closed via commit omnibus-gitlab@88d07fb4
- Stan Hu closed via commit omnibus-gitlab@4a098168
- Marin Jankovski mentioned in issue omnibus-gitlab#2352
- Marin Jankovski mentioned in commit omnibus-gitlab@88d07fb4
- Marin Jankovski mentioned in commit omnibus-gitlab@dea59efd
- Marin Jankovski mentioned in commit omnibus-gitlab@46781902