We heard from many customers that they want geographically distributed GitLab. We always said that this is impossible because you can't write to multiple databases.
But for customers with geographically distributed teams, cloning everything over the WAN is a problem, especially for local CI servers. The WAN is slow and the data transfer is costly. Some customers have 10,000 runners in operation that are cloning.
More and more customers are asking for this, as Job already mentioned. Just this week I've heard from two so far, and have a conversation scheduled for tomorrow about the same.
The routing piece @sytses mentions is very intriguing and would cover many use-cases. If users can simply point at their 'local' instance and have it seamlessly forward requests to the active cluster it would be great.
A challenge I see is that customers already think we have a magic HA configuration for local HA clusters. They don't want to configure an external HA database, HA Redis, or an HA filesystem. In many cases they don't know how, nor do they have the resources to maintain it. I only see this problem getting worse with global HA. How do they configure the database and filesystem replication? These don't sound trivial.
It would be nice if, when you push to a GitLab read-only (RO) instance, it forwards your request to the master server.
Agreed. But it might be hard to do from the start. I propose leaving it for the next iteration.
Regarding application changes, we can start with the following:
UI in application settings that allows you to configure the app as either master or read-only (RO)
A worker for the master that pushes to the RO server(s)
Middleware that rejects any POST/PUT/DELETE requests in the Rails app if RO is enabled (rough sketch below)
As for attachments: for now we can just point them to the master server so we don't need to sync them. It might require just a single change in the Markdown library.
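For the middleware part, something along these lines should be enough (just a rough sketch; `Gitlab::Geo.readonly?`, the class name and the response body are placeholders for whatever the application setting ends up looking like):

```ruby
require 'rack'

module Gitlab
  module Middleware
    class ReadonlyGeo
      WRITE_METHODS = %w(POST PUT PATCH DELETE).freeze

      def initialize(app)
        @app = app
      end

      def call(env)
        request = Rack::Request.new(env)

        # Reject any writing operation while this node is configured as read-only.
        if Gitlab::Geo.readonly? && WRITE_METHODS.include?(request.request_method)
          return [403, { 'Content-Type' => 'text/plain' },
                  ['This GitLab instance is read-only']]
        end

        @app.call(env)
      end
    end
  end
end
```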
Do we need a worker for master or can we just use a system hook?
We need a worker. System hooks have nothing to do with pushing. Even if we trigger a system hook on each push, we still have to implement the logic that pushes from the master or fetches from the slave. I think it's easier to just make a separate worker for the master. We might want to re-use this worker in the future for a two-way mirror feature.
The attachments can indeed point to the master if they are images; non-images require authentication, I think.
I think we don't require authentication for attachments.
I don't understand what this worker looks like. Does it periodically push every repo? What kind of period are we talking about?
http://doc.gitlab.com/ce/security/user_file_uploads.html says "Note that non-image attachments do require authentication to be viewed." (which I find strange, but if we don't require authentication for attachments we should update that)
@dzaporozhets answered 1. during a talk: the primary will push every repo to all the secondaries when it is updated. This is simpler because it is one action, with no notify-and-pull. We can't use existing system hooks since they don't fire on push. Dmitriy also said it enables a two-way mirror (can't that also happen with notify-and-pull?) and that the secondary's triggers and activity stream would be updated normally (I probably misunderstood, since the activity stream is in the read-only database). Regarding 2., he thinks both attachment types work the same (no auth), so we should update the documentation.
during a talk: the primary will push every repo to all the secondaries when it is updated. This is simpler because it is one action, with no notify-and-pull. We can't use existing system hooks since they don't fire on push. Dmitriy also said it enables a two-way mirror (can't that also happen with notify-and-pull?)
@sytses It can, but only when both instances are GitLab. If we want GitLab to mirror to kernel.org, GitHub, or Bitbucket, we want to push. Anyway, we will discuss it one more time with the developer at the implementation stage. The simplest solution will win.
and that the secondary's triggers and activity stream would be updated normally (I probably misunderstood, since the activity stream is in the read-only database)
My bad, you are right. With a read-only database that won't be the case.
@JobV this has milestone 8.5 here but 8.4 on https://about.gitlab.com/direction/. I really hoped for this to ship in 8.4 but probably missed setting the milestone.
@sytses @dzaporozhets Shame. Thanks for updating and organising. I'll make sure to double-check whether /direction items are properly set to the corresponding milestone.
Authentication should happen on the primary server because it changes a lot of state (mostly related to security measures like brute-force prevention, password recovery, etc.).
If this becomes an issue, an alternative is to store that state in Redis, but it will take some time to implement.
We have middleware to prevent potentially writing operations on secondary servers (to make sure they are read-only).
Next steps
How we are going to update repositories on secondary servers:
I had a call with @dzaporozhets today, and we discussed the following approaches:
The primary should push to all secondaries
Each secondary should pull from the primary
The conclusion is that if the primary can push to any secondary, then so can any other user.
To overcome this, we would have to break our permission system.
It's easier to have the secondary servers pull from the primary, and it also makes it easier to deal with
connection issues, as the primary doesn't have to keep track of them.
We are already replicating PostgreSQL from the primary to the secondaries, and we do need to replicate Redis
to share session data, so it's the perfect place to store data and coordinate secondary repository updates.
We don't want to do "cron-like" polling of every single repository, but update only the modified ones.
The idea we discussed is to create one queue per instance in Redis, where the primary will enqueue the repositories that secondaries need to fetch.
This queue should be namespaced with something like geo/[:secondary_host]. We can use Redis MULTI to bulk-update every namespace in a single transaction, to make sure we don't have inconsistencies between secondary instances.
As we don't want to require Procfile updates to run Geo, we will try to use Sidekiq-cron to poll the repository updates queue. As a starting point we will poll every 10 seconds, but will try to bring it down to 5.
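The enqueue side on the primary could look roughly like this (only a sketch; `secondary_hosts` and the example hosts are placeholders, and the key format follows the geo/[:secondary_host] idea above):

```ruby
require 'redis'

# Enqueue a project update for every secondary. MULTI keeps the
# per-secondary queues consistent with each other.
def enqueue_repository_update(redis, secondary_hosts, project_id)
  redis.multi do |multi|
    secondary_hosts.each do |host|
      multi.rpush("geo/#{host}", project_id)
    end
  end
end

redis = Redis.new
enqueue_repository_update(redis, %w(geo2.example.com geo3.example.com), 42)
```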
@dzaporozhets: I'm sorry I was a little off the radar this week. I had some unplanned personal life issues to solve; I will catch up on the weekend and be on schedule again.
The original idea of using Redis as a communication bus from primary to secondary will not work as expected for the queues, as we can't write on the secondaries (I initially had the impression we could have master-master replication), and the latency would make it complicated to have the secondaries connect to the master Redis directly.
So the idea needs a fix.
What I came up with is: we can still use Redis to buffer updates in a queue, and then submit a single HTTP request from the master to each secondary with all the project IDs whose repositories were updated recently (we consume the queue to get those IDs).
On the secondary node's side, when it receives this request, it will enqueue Sidekiq jobs to fetch the repositories/changes.
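To make the revised flow concrete, a rough sketch (class names, Redis key and endpoint path are placeholders, not the final API):

```ruby
require 'redis'
require 'sidekiq'
require 'httparty'

# Primary side: drain the Redis buffer and POST the affected project IDs to
# each secondary in a single request.
class GeoNotifyNodesWorker
  include Sidekiq::Worker

  QUEUE_KEY = 'geo:updated_projects'.freeze

  def perform
    project_ids = drain_queue
    return if project_ids.empty?

    secondary_urls.each do |url|
      # One HTTP request per secondary, carrying all recently updated project IDs.
      HTTParty.post("#{url}/api/geo/refresh_projects",
                    body: { project_ids: project_ids }.to_json,
                    headers: { 'Content-Type' => 'application/json' })
    end
  end

  private

  def drain_queue
    # Good enough for a sketch; a production version would drain atomically.
    redis = Redis.new
    ids = redis.lrange(QUEUE_KEY, 0, -1)
    redis.del(QUEUE_KEY)
    ids.uniq
  end

  def secondary_urls
    # Placeholder: in practice this would come from the Geo nodes configuration.
    ENV.fetch('GEO_SECONDARY_URLS', '').split(',')
  end
end

# On the secondary, the endpoint handling /api/geo/refresh_projects would just do:
#   project_ids.each { |id| GeoRepositoryFetchWorker.perform_async(id) }
```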
@DouweM I made the documentation on Friday (forgot to push, will do it now). After I set up the demo instances, it became pretty obvious that the current (hacky) authentication wasn't working as expected (we were sharing sessions using Redis).
During development everything worked fine because it was all on the same machine, but sharing a remote Redis is far from good and actually requires two Redis instances on the secondary nodes to work (one for the session and another for Sidekiq).
So I'm moving forward with OAuth authentication, which I was going to do anyway as a next step.
There will be an "OAuth application" registered for every node, and secondaries will authenticate based on that. I'm not using OmniAuth but a plain OAuth client, similar to how the old CI worked. I decided to go this route without OmniAuth because its integration with Devise does not provide an obvious way to dynamically define a provider.
I will deploy OAuth in a separate branch based on !179 (merged) to get the demo instance working and run the final tests. We can release after that.
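For reference, the plain OAuth flow between a secondary and the primary would look roughly like this (using the oauth2 gem; URLs, env vars and the callback path are placeholders, not the final setup):

```ruby
require 'oauth2'

# Placeholders; in the real setup these come from the Geo node records and the
# "OAuth application" registered on the primary for this node.
primary_node_url   = 'https://primary.example.com'
secondary_node_url = 'https://secondary.example.com'
redirect_uri       = "#{secondary_node_url}/oauth/geo/callback"

client = OAuth2::Client.new(ENV['GEO_OAUTH_APP_ID'],
                            ENV['GEO_OAUTH_APP_SECRET'],
                            site: primary_node_url)

# 1. Send the user to the primary to authenticate.
authorize_url = client.auth_code.authorize_url(redirect_uri: redirect_uri)

# 2. In the callback controller, exchange the returned code for a token and
#    fetch the user from the primary's API:
# token = client.auth_code.get_token(params[:code], redirect_uri: redirect_uri)
# user  = token.get('/api/v3/user').parsed
```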
@brodock Thanks for the update! Is it an option to use (something similar to) gitlab-shell's .gitlab_shell_secret (http://gitlab.com/gitlab-org/gitlab-ce/blob/master/config/gitlab.yml.example#L413), and skip OAuth? Every machine will already have that. It's currently used to authenticate between gitlab-shell, which terminates SSH connections, and the internal API, which verifies user access to repositories.
@DouweM Not really. For user authentication, the problem is reliably sharing the state "user X is authenticated" between instances in the context of a user session.
OAuth is perfect for that. If I instead use my "own protocol", I will end up implementing OAuth from scratch, which is not a good idea.
@brodock Do we now have issues for everything Geo related that still needs to happen? I see https://gitlab.com/gitlab-org/gitlab-ee/issues/371 for syncing SSH keys, but I think there is some other on-disk data that potentially needs to be synced, like uploaded files and avatars, build traces, build artifacts, and LFS objects. I think we should have an issue for everything Geo related that still needs to happen; if it's not in the issue tracker, but just in people's heads, it doesn't exist. :)
Can you also link to all Geo related issues from a task list in the description of this issue, so we can see at a glance what parts of Geo are done, and which aren't yet?
We should discuss what to do with the following items, and how:
Geo: Git LFS support / sync (gitlab-org/gitlab-ee#415)
Geo: Build traces Sync
Geo: Build artifacts Sync
Geo: Uploaded files and Avatar Sync
I believe syncing local files is not something we should do using REST requests and Sidekiq (which is what we used for repository synchronization).
There are two different approaches that could help us here. In gitlab-org/gitlab-ce#13825, we will use an S3-compatible object storage daemon to handle LFS support, and we are also experimenting with Ceph, which also supports geographical replication (http://docs.ceph.com/docs/master/radosgw/federated-config/).
@brodock read-only LFS support should not be too difficult to make possible. To replicate the files I would use periodic rsync jobs; I think we can get away with that.
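To sketch what I mean (paths, host and scheduling are placeholders; this could run from a Sidekiq worker scheduled periodically, e.g. via sidekiq-cron):

```ruby
require 'sidekiq'

# Periodically mirror the LFS objects directory to a secondary.
class GeoLfsSyncWorker
  include Sidekiq::Worker

  SOURCE = '/var/opt/gitlab/gitlab-rails/shared/lfs-objects/'.freeze
  TARGET = 'git@secondary.example.com:/var/opt/gitlab/gitlab-rails/shared/lfs-objects/'.freeze

  def perform
    # -a preserves ownership/timestamps, --delete mirrors removals on the secondary.
    succeeded = system('rsync', '-a', '--delete', SOURCE, TARGET)
    raise 'rsync to secondary failed' unless succeeded
  end
end
```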
@jacobvosmaer rsync might be good only for small repositories with infrequent updates, as it can take ages to even start the process if you have a bigger one.
What we currently replicate is only ~10 seconds behind the primary; if we go the rsync route it would be minutes or even hours, right?
@jacobvosmaer when I mentioned repositories I was thinking about LFS, sorry for the confusion.
At my university we used to store Linux .iso images and big files in Git repositories (which was a bad idea, but a perfect use case for LFS).
@brodock OK :) I agree there will come a point when just starting up rsync for the (global!) lfs-objects directory of a GitLab server becomes painful. I think the appropriate solution for that, when that time comes, is to store LFS objects on S3 (or something like it).
I do not see a testing plan. I propose that Gabriel works with Pablo to combine this testing with the Ceph testing, as proposed in https://gitlab.com/gitlab-com/operations/issues/1/#note_4527326 (please note that this isn't about LFS; Ceph is just used as storage for the repos themselves).
At the start of this issue I proposed using a system hook https://gitlab.com/gitlab-org/gitlab-ee/issues/76#note_3049586 This would be a new system hook for push. At the time we didn't do this since we were thinking about pushing the data completely, but later on we switched back to 'send notification to secondary and then have secondary pull'. Is it possible to use a new system hook after all?
Avatars, LFS objects and build artifacts should be synced with one technology. We can start with rsync and then consider some replicated FS.
SSH key sync logic for the primary server should use system hooks.
We are considering using system hooks for repository sync as well. This means we need to:
implement a push event for system hooks (CE)
use system hooks for the Geo feature in EE
Advantages of using system hooks to inform secondary servers about new SSH keys or Git pushes:
minimal amount of code needed (creating a system hook per node in the database; see the sketch after this list)
a single logic path for repos and keys
we re-use an existing feature (system hooks) that is well tested and used by people in CE and EE
Disadvantages:
a Sidekiq job on every push, multiplied by the number of secondary servers. If an instance has 10 pushes per second and 2 secondary nodes, that is 20 Sidekiq jobs per second.
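For the "system hook per node" part, the setup on the primary could be as small as this (a sketch only; the GeoNode scope, the endpoint path and firing system hooks on push are assumptions until the CE change lands):

```ruby
# Register one system hook per secondary node so pushes and key changes
# get delivered to that node's Geo endpoint.
GeoNode.secondaries.find_each do |node|
  SystemHook.create!(
    url: "#{node.url}/api/geo/receive_events",
    push_events: true,                 # would require the new push event in CE
    enable_ssl_verification: true
  )
end
```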
I've been thinking about this issue a lot lately, particularly for the main project I am working on.
It would be really good to be able to synchronise the whole DB between a local instance and a remote 'master' instance.
For me the solution would seem to be the following.
Each local instance of a GitLab install has a 'special' tag or prefix for all the issues, comments, files, etc. that are submitted.
for example:
On my personal laptop, I would commit all my stuff locally and tell it to prefix (or postfix) everything with my username and PL (to signify my personal laptop).
Then at my office, I can have the same setup, but with B (for bureau).
Now when I make a comment, the system will be able to distinguish all my comments etc, and there should be no clashes with others.
By tagging/linking my locally made comments into the principal (parent) issue/comment, it seems reasonable that any conflicts would be minimal.
If there are any conflicts, the postfix detail will let the server determine who is responsible for 'merging' stuff into the main master repository.
I don't feel this would require a huge modification to the system; the prefix could even look like a tag (~), or maybe it could be a tag, and it could be added automatically.
After all, all the comments, issues and commits are already 'tagged' with the user who submitted them, so having the PK in the DB to cross-reference all these fields shouldn't be an issue.
I have a feeling that just the username would probably work, provided people always work on their local system only, and it regularly syncs to the main parent.
It makes sense, but it would be great if signout worked :) Either by redirecting to primary to sign out, by directly deleting the session cookie for the secondary, or by showing a link to the primary.
Found what is causing the 500 error on Geo:
Gitlab::Application.secrets.db_key_base is different on the two machines. For some reason a few repositories work just fine, while others hit #external_import?, which triggers project.import_data.credentials and crashes with:
```
OpenSSL::Cipher::CipherError: bad decrypt
  from /home/git/gitlab/vendor/bundle/ruby/2.1.0/gems/encryptor-1.3.0/lib/encryptor.rb:73:in `final'
  from /home/git/gitlab/vendor/bundle/ruby/2.1.0/gems/encryptor-1.3.0/lib/encryptor.rb:73:in `crypt'
  from /home/git/gitlab/vendor/bundle/ruby/2.1.0/gems/encryptor-1.3.0/lib/encryptor.rb:44:in `decrypt'
  from /home/git/gitlab/vendor/bundle/ruby/2.1.0/gems/attr_encrypted-1.3.4/lib/attr_encrypted.rb:197:in `decrypt'
  from /home/git/gitlab/vendor/bundle/ruby/2.1.0/gems/attr_encrypted-1.3.4/lib/attr_encrypted.rb:280:in `decrypt'
  from /home/git/gitlab/vendor/bundle/ruby/2.1.0/gems/attr_encrypted-1.3.4/lib/attr_encrypted.rb:143:in `block (2 levels) in attr_encrypted'
```
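A quick way to confirm the mismatch is to run this in a `rails console` on each machine and compare the output (hashing the value avoids pasting the secret itself anywhere):

```ruby
require 'digest'
# The digests must match between primary and secondary.
puts Digest::SHA256.hexdigest(Gitlab::Application.secrets.db_key_base)
```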
I tried to edit settings to see what it would tell me. In my case it spun forever on 'Saving'. Only when I hit another button did it finally redirect and show the 'You cannot do writing operations on a secondary Gitlab Geo instance' error.
Can we show this banner preemptively when a user goes to any edit page? Since there's otherwise no indication you're on a secondary node it can be confusing until you see that error.
@dblessing I have a different idea to propose... I think we could add a "geo-secondary" class to the body and disable forms (with a few exceptions). What do you think? (Idea for after 8.7.)
Found a problem accessing merge requests with "unchecked" merge_status on a secondary Geo node (see: !366 (merged)). Logout isn't triggering flash errors for @DouweM anymore, but we are not logging the user out of their primary node (we don't do single sign-off yet; ticket to implement it with a simple proposal: https://gitlab.com/gitlab-org/gitlab-ee/issues/522).
Webhooks backport will be ready today: gitlab-org/gitlab-ce!3940
@brodock @dzaporozhets can you give an estimate for this? It doesn't matter how long it takes, but we (I) need to make sure this gets delivered to customers, the sales team and marketing properly.
@JobV our target is to get this to GA in 8.8... We had no major issues during the last test phase, so I'm confident that whatever we find in the next ones will be in the cosmetics/usability area, which we can fix quickly without delaying any further.
Created / added some missing issues to the summary.
Single Sign-On still has some things to fix in the code; I will handle that tomorrow. There is a proposal for the benchmarking (stress testing): #560 (closed). I will ambush @JobV and @vsizov (maybe @DouweM too) in Austin to help me with testing.
I'm fixing a few things in how the Wiki Page events webhook works so we can use it for page updates and remove custom code from Geo. We still need to make a system_hook for it, but most of the code will be based on what I was fixing.
This is the last thing that is currently using the buffered notification.
@brodock thank you for the update. Wiki via system_hook we can do for 8.9. Make sure that for 8.8 we have Single Sign-On fixed and the benchmarking done. Then collaborate with @JobV to announce Geo for 8.8 properly.
Geo Single Sign-Out is in 8.8, and the benchmark is done (see #560 (closed); jump to the last comment for a summary).
"Geo: Wiki sync using system_webhook" is nice to have for GA but doesn't block anything.
The license check is disabled right now, but re-enabling it for release is a few-line patch (the code is already there).
I was talking to @dzaporozhets and @DouweM about an idea that came up after a call with a prospect. Geo currently requires us to elect a primary node where writing operations happen; all the other nodes are read-only. This is a requirement right now, as a multi-master geographically distributed setup requires a lot of coordination we can't do yet:
The database must be able to handle multi-master updates in a clustered environment (preferably with no global lock, to prevent latency from killing us)
Git requires either a global lock mechanism or an intermediary step to merge, elect a winner and roll back others trying to concurrently update a repository, all of this 'transactionally'
So an idea is to have a hybrid type of node synchronization. Instead of electing a primary, let's say we have 3 regions: A, B, C. All of these regions can do writing operations locally, but they may or may not be interested in having access to repositories being worked on in the other regions.
Region A wants region B's repositories
Region B wants region A's and C's repositories
Region C is fine as it is
So in region A you will get all of its own repositories as expected, but also an automatically mirrored (read-only) copy of the repositories in B.
Some other things that may be good to consider:
How to do authentication?
We can set up the same OAuth mechanism so you can sign in to B from A with your current "A" account.
Should we place a link somewhere like: "Go to primary"?
Should we display menu items for issues/merge requests etc. that, when clicked, bring us back to the primary node for that repository?
@JobV is there a timeline for when Geo will be GA? It is still showing as alpha, and many large customers cannot use a product that is not GA. In addition, the documentation states that there is a good chance of data loss. Does this issue identify all the tasks needed to move Geo to GA? What is the timing for Geo being GA, so we can communicate to prospects and customers?
@ChadMalchow we hope to have Geo GA by the end of the year. We're starting to test Geo on GitLab.com in the coming weeks. Hopefully we'll be able to move it to Beta status by then, which paves the road to GA.