Can not replicate Geo database on Docker
I don't think this is related to #2176 (closed), but the discussion started there.
Summary
For some reason `gitlab-ctl replicate-geo-database` errors out, which means that Geo cannot be installed on Docker by following our documentation.
Steps to reproduce
Start installing Geo on Docker. This can get you started: https://gitlab.com/gl-support/testlab/tree/geo/geo
What is the current bug behavior?
`gitlab-ctl replicate-geo-database --host=primary` errors out with the following:

    [ERROR] Failed to execute: gitlab-ctl stop

When I run `gitlab-ctl status` after that, it shows that everything did shut down correctly.
What is the expected correct behavior?
The database should be replicated.
Relevant tips for working on this
The file that probably needs to be worked on is https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/files/gitlab-ctl-commands-ee/replicate_geo_database.rb.
On an Omnibus installation the file lives at /opt/gitlab/embedded/service/omnibus-ctl/replicate_geo_database.rb.
Removing the 30-second sleep on line 42 is helpful when debugging.
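To make the failure concrete, here is a rough sketch of the sequence that aborts, assuming the command shells out via Mixlib::ShellOut and raises when the exit status is non-zero. The helper name and structure are illustrative only, not the actual contents of replicate_geo_database.rb:

```ruby
require 'mixlib/shellout'

# Illustrative outline only, not the real replicate_geo_database.rb.
def run!(cmd)
  shell = Mixlib::ShellOut.new(cmd)
  shell.run_command
  raise "[ERROR] Failed to execute: #{cmd}" if shell.error?
  shell
end

run!('gitlab-ctl stop')  # in a Docker container this exits with 127, so we never get further
sleep 30                 # the 30-second sleep mentioned above; handy to remove while debugging
# ... the actual replication steps would follow here ...
run!('gitlab-ctl start')
```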
Activity
- De Wet changed title from Can not install GitLab Geo on Docker to Can not replicate database on Docker
- De Wet changed title from Can not replicate database on Docker to Can not replicate Geo database on Docker
- De Wet mentioned in issue #2176 (closed)
@dewetblomerus FYI, there is a `--no-wait` option to skip that 30-second timeout.

It looks like `gitlab-ctl stop` in a Docker container exits with status 127:

    root@stanhu-geo-secondary2:/var/log/gitlab# gitlab-ctl stop
    ok: down: geo-postgresql: 211s, normally up
    ok: down: gitaly: 210s, normally up
    ok: down: gitlab-monitor: 210s, normally up
    ok: down: gitlab-workhorse: 210s, normally up
    ok: down: logrotate: 209s, normally up
    ok: down: nginx: 209s, normally up
    ok: down: node-exporter: 209s, normally up
    ok: down: postgres-exporter: 209s, normally up
    ok: down: postgresql: 209s, normally up
    ok: down: prometheus: 208s, normally up
    ok: down: redis: 208s, normally up
    ok: down: redis-exporter: 207s, normally up
    ok: down: sidekiq: 206s, normally up
    ok: down: unicorn: 206s, normally up
    root@stanhu-geo-secondary2:/var/log/gitlab# echo $?
    127
This is failing the `error?` check here: https://github.com/chef/mixlib-shellout/blob/0455024df6b737e959fc7c2129aafc8cf4fddc88/lib/mixlib/shellout.rb#L266-L268. If you don't use a Docker container, the return value is 0.
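For context, a small sketch of how that check surfaces the failure. The Mixlib::ShellOut calls are real gem API; running it against `gitlab-ctl stop` and the expected exit codes are just the observations from this thread:

```ruby
require 'mixlib/shellout'

shell = Mixlib::ShellOut.new('gitlab-ctl stop')
shell.run_command

puts shell.exitstatus  # 0 outside Docker, 127 inside the container (per the output above)
puts shell.error?      # true whenever the exit status is not in valid_exit_codes ([0] by default)

# error! raises Mixlib::ShellOut::ShellCommandFailed when error? is true, which is
# presumably what surfaces as "[ERROR] Failed to execute: gitlab-ctl stop".
shell.error!
```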
Adding `--security-opt=seccomp:unconfined` to the `docker run` command allows `strace -f` to be run. This is what happens when `gitlab-ctl stop` is run in a Docker container:

    <snip>
    [pid 1118] stat("down", 0x7ffd4bcf4600) = -1 ENOENT (No such file or directory)
    [pid 1118] write(1, "ok: down: unicorn: 841s, normally up\n", 37ok: down: unicorn: 841s, normally up
    ) = 37
    [pid 1118] fchdir(3) = 0
    [pid 1118] exit_group(0) = ?
    [pid 1118] +++ exited with 0 +++
    [pid 1102] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 1118
    [pid 1102] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1118, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
    [pid 1102] rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7f0e498894b0}, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7f0e498894b0}, 8) = 0
    [pid 1102] rt_sigaction(SIGINT, {SIG_IGN, [], SA_RESTORER, 0x7f0e498894b0}, {0x7f0e49d77da0, [], SA_RESTORER|SA_SIGINFO, 0x7f0e498894b0}, 8) = 0
    [pid 1102] rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f0e498894b0}, {SIG_IGN, [], SA_RESTORER, 0x7f0e498894b0}, 8) = 0
    [pid 1102] close(4 <unfinished ...>
    [pid 1103] <... poll resumed> ) = 1 ([{fd=3, revents=POLLHUP}])
    [pid 1103] read(3, "", 1024) = 0
    [pid 1103] read(5, 0x7f0e4a106720, 1024) = -1 EAGAIN (Resource temporarily unavailable)
    [pid 1103] close(3) = 0
    [pid 1103] close(5) = 0
    [pid 1103] exit(0) = ?
    [pid 1103] +++ exited with 0 +++
    [pid 1102] <... close resumed> ) = 0
    [pid 1102] close(6) = 0
    [pid 1102] munmap(0x7f0e4a223000, 1052672) = 0
    [pid 1102] exit_group(127) = ?
    [pid 1102] +++ exited with 127 +++
    <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 127}], 0, NULL) = 1102
    rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f98f52044b0}, {0x444900, [], SA_RESTORER, 0x7f98f52044b0}, 8) = 0
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1102, si_uid=0, si_status=127, si_utime=11, si_stime=6} ---
    wait4(-1, 0x7ffc77f10c90, WNOHANG, NULL) = -1 ECHILD (No child processes)
    rt_sigreturn({mask=[]}) = 0
    read(255, "", 1262) = 0
    rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    exit_group(127) = ?
    +++ exited with 127 +++
The `WEXITSTATUS` above seems to be 127, coming from the process running `/opt/gitlab/embedded/bin/omnibus-ctl`. If I look for `execve` lines (as described in http://stackoverflow.com/questions/9673662/why-does-system-fail-with-error-code-127), I see:

    $ grep execve fail.txt
    execve("/opt/gitlab/bin/gitlab-ctl", ["gitlab-ctl", "stop"], [/* 8 vars */]) = 0
    [pid 1102] execve("/opt/gitlab/embedded/bin/omnibus-ctl", ["/opt/gitlab/embedded/bin/omnibus-ctl", "gitlab", "/opt/gitlab/embedded/service/omnibus-ctl", "stop"], [/* 9 vars */]) = 0
    [pid 1104] execve("/opt/gitlab/init/geo-postgresql", ["/opt/gitlab/init/geo-postgresql", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1104] <... execve resumed> ) = 0
    [pid 1105] execve("/opt/gitlab/init/gitaly", ["/opt/gitlab/init/gitaly", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1105] <... execve resumed> ) = 0
    [pid 1106] execve("/opt/gitlab/init/gitlab-monitor", ["/opt/gitlab/init/gitlab-monitor", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1106] <... execve resumed> ) = 0
    [pid 1107] execve("/opt/gitlab/init/gitlab-workhorse", ["/opt/gitlab/init/gitlab-workhorse", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1107] <... execve resumed> ) = 0
    [pid 1108] execve("/opt/gitlab/init/logrotate", ["/opt/gitlab/init/logrotate", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1108] <... execve resumed> ) = 0
    [pid 1109] execve("/opt/gitlab/init/nginx", ["/opt/gitlab/init/nginx", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1109] <... execve resumed> ) = 0
    [pid 1110] execve("/opt/gitlab/init/node-exporter", ["/opt/gitlab/init/node-exporter", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1110] <... execve resumed> ) = 0
    [pid 1111] execve("/opt/gitlab/init/postgres-exporter", ["/opt/gitlab/init/postgres-exporter", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1111] <... execve resumed> ) = 0
    [pid 1112] execve("/opt/gitlab/init/postgresql", ["/opt/gitlab/init/postgresql", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1112] <... execve resumed> ) = 0
    [pid 1113] execve("/opt/gitlab/init/prometheus", ["/opt/gitlab/init/prometheus", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1113] <... execve resumed> ) = 0
    [pid 1114] execve("/opt/gitlab/init/redis", ["/opt/gitlab/init/redis", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1114] <... execve resumed> ) = 0
    [pid 1115] execve("/opt/gitlab/init/redis-exporter", ["/opt/gitlab/init/redis-exporter", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1115] <... execve resumed> ) = 0
    [pid 1116] execve("/opt/gitlab/init/sidekiq", ["/opt/gitlab/init/sidekiq", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1116] <... execve resumed> ) = 0
    [pid 1117] execve("/opt/gitlab/init/sshd", ["/opt/gitlab/init/sshd", "stop"], [/* 9 vars */]) = -1 ENOENT (No such file or directory)
    [pid 1118] execve("/opt/gitlab/init/unicorn", ["/opt/gitlab/init/unicorn", "stop"], [/* 9 vars */] <unfinished ...>
    [pid 1118] <... execve resumed> ) = 0
What is up with `sshd stop`? Is this causing the problem?

    root@stanhu-geo-secondary2:/# echo "exit 0" >> /opt/gitlab/init/sshd
    root@stanhu-geo-secondary2:/# chmod +x /opt/gitlab/init/sshd
    root@stanhu-geo-secondary2:/# /opt/gitlab/init/sshd
    root@stanhu-geo-secondary2:/# echo $?
    0
    root@stanhu-geo-secondary2:/# gitlab-ctl stop
    ok: down: geo-postgresql: 1465s, normally up
    ok: down: gitaly: 1465s, normally up
    ok: down: gitlab-monitor: 1465s, normally up
    ok: down: gitlab-workhorse: 1464s, normally up
    ok: down: logrotate: 1464s, normally up
    ok: down: nginx: 1463s, normally up
    ok: down: node-exporter: 1463s, normally up
    ok: down: postgres-exporter: 1463s, normally up
    ok: down: postgresql: 1463s, normally up
    ok: down: prometheus: 1463s, normally up
    ok: down: redis: 1462s, normally up
    ok: down: redis-exporter: 1462s, normally up
    ok: down: sidekiq: 1461s, normally up
    ok: down: unicorn: 1461s, normally up
    root@stanhu-geo-secondary2:/# echo $?
    0
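One possible mechanism, sketched under the assumption that omnibus-ctl shells out to each service's `/opt/gitlab/init/*` script and folds the per-service exit codes into its own exit status; this is an illustration, not the actual omnibus-ctl source:

```ruby
# Illustration only: how a single missing init script can surface as exit code 127.
# The service list clearly includes sshd even though /opt/gitlab/init/sshd is missing,
# so it must come from somewhere other than the files present in /opt/gitlab/init.
services = %w[geo-postgresql gitaly gitlab-monitor gitlab-workhorse logrotate nginx
              node-exporter postgres-exporter postgresql prometheus redis
              redis-exporter sidekiq sshd unicorn]

exit_status = 0
services.each do |svc|
  system("/opt/gitlab/init/#{svc} stop")  # the shell reports sshd as not found...
  exit_status += $?.exitstatus            # ...which is exit code 127
end

exit exit_status  # 127 propagates even though every real service stopped cleanly
```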
Why are we trying to stop `sshd`? :)
- Stan Hu mentioned in merge request omnibus-gitlab!1556 (merged)
@dewetblomerus A quick fix would be to run this inside the Docker container:

    ln -sf /opt/gitlab/embedded/bin/sv /opt/gitlab/init/sshd

I submitted a fix via https://gitlab.com/gitlab-org/omnibus-gitlab/merge_requests/1556.
I'm still investigating why `gitlab-ctl status` returns an error code of 45 in a Docker container, but at least `gitlab-ctl stop` and `gitlab-ctl start` are doing the right thing.

Ok, I think `gitlab-ctl status` returns an error code of 4x normally if services are down, so I think we are good with https://gitlab.com/gitlab-org/omnibus-gitlab/merge_requests/1556.
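As a quick sanity check along those lines (sketch only; the 40–49 range is just the "4x" reading above, and Mixlib::ShellOut is used here instead of any GitLab helper):

```ruby
require 'mixlib/shellout'

status = Mixlib::ShellOut.new('gitlab-ctl status')
status.run_command

puts status.exitstatus
# After `gitlab-ctl stop`, a 4x code here just means services are down and is expected;
# something like the earlier 127 would still indicate a real failure.
puts 'services are down (expected after gitlab-ctl stop)' if (40..49).cover?(status.exitstatus)
```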
- Stan Hu mentioned in commit omnibus-gitlab@4a098168
- Maintainer
Huh, why are we doing anything with `sshd`? @stanhu Where is that coming from, any idea?

@marin It looks like we have been running sshd inside our Docker container for quite some time: https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/docker/assets/setup#L11
Checked with @ayufan on the need for sshd. Our sshd needs the git user in `/etc/passwd` and access to `authorized_keys`, and that is managed by omnibus internally.

- Maintainer
I know that we are running the `sshd` daemon, but I am now not at all sure why it needed to be within the omnibus management: https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/docker/assets/setup#L11-16. @twk3 Do you remember why this was?
- Contributor
@marin It was before my time. (A community contribution in 2014, I think: https://gitlab.com/gitlab-org/gitlab-ce/commit/0dcc1e88a4a9a1fe4745421474fcb3e93bfb87ef) But setting it up the way it is allows us to start everything with the one runit command here: https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/docker/assets/wrapper#L79
And it tails/persists the sshd log with the rest of our logs.
- Marin Jankovski closed via commit omnibus-gitlab@88d07fb4
- Stan Hu closed via commit omnibus-gitlab@4a098168
- Marin Jankovski mentioned in issue omnibus-gitlab#2352
- Marin Jankovski mentioned in commit omnibus-gitlab@88d07fb4
- Marin Jankovski mentioned in commit omnibus-gitlab@dea59efd
- Marin Jankovski mentioned in commit omnibus-gitlab@46781902