One of the use cases for GitLab Geo (#76) is Disaster Recovery. For us to officially support that, we first have to address a few limitations related to the way we store files on the primary machine.
This is what is replicable today:
Git repositories (without LFS / Annex support)
Wikis (it's just another git repository, we use the same mechanism to replicate)
Database
SSH keys
This is what is missing:
Any file stored on disk, including:
Issues attachments
Merge request attachments
User avatars
Group avatars
Project logos
CI build logs
CI artifacts
Git LFS
Git Annex
GitLab Pages assets (the CI artifacts that result from a Pages build, i.e. the actual .html, .css, .js and image files that become the webpage)
Proposal
We want to offer a Disaster Recovery solution that our customers will want to buy, but also one that we can use ourselves for GitLab.com. GitLab.com is the biggest GitLab installation that we know of, and it has its own constraints. However, we are confident that if we fix this issue for ourselves, it will benefit our customers, and we will hit potential bugs before they do, making it a more solid product.
The feature will be called Disaster Recovery, once marketed.
Current step:
We tried the MinIO approach but realized it won't work for us for a variety of reasons. We are now investigating building our own solution.
Every attachment is tracked in the primary node's DB.
Secondary nodes have a new tracking DB.
We periodically check the tracking DB and find the highest updated_at timestamp
Find the first X records in the primary node's DB with an updated_at later than that timestamp
Replicate those files and update the secondary node's table once the transfer is done
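A minimal sketch of what that loop could look like, assuming a hypothetical `Geo::FileRegistry` model in the secondary's tracking DB and an `Upload` model read from the primary's DB (all names are illustrative, not the actual schema):

```ruby
# Hypothetical sketch of the tracking-DB polling loop described above.
# Geo::FileRegistry lives in the secondary node's tracking database;
# Upload is read from the primary node's (read-only) database.
class Geo::FileReplicationWorker
  BATCH_SIZE = 100

  def perform
    # 1. Find the newest timestamp we have already replicated.
    cursor = Geo::FileRegistry.maximum(:file_updated_at) || Time.at(0)

    # 2. Ask the primary's DB for the next batch of files changed after that timestamp.
    Upload.where('updated_at > ?', cursor)
          .order(:updated_at)
          .limit(BATCH_SIZE)
          .each do |upload|
      # 3. Copy the file over (the transport itself is out of scope for this sketch).
      transfer_file(upload.path)

      # 4. Record success in the tracking DB so the next run resumes from here.
      Geo::FileRegistry.create!(file_id: upload.id, file_updated_at: upload.updated_at)
    end
  end
end
```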
@JobV I think it's the best way to deal with this. Most of what is required is not specific to Geo, but more like "foundation work", and we could speed it up if it joins the general roadmap, like adding external storage support for most of the listed things. The "wire-up" necessary for Geo will then be a lot easier.
The first step is to decide what is the storage solution we will use to store assets saved on disk. This will have a big product impact on all the other steps required to finish this, so this is a product decision - as both options can be achieved technically.
I've talked to @brodock about it and this is the summary of the discussion:
It comes down to two solutions.
Are we going to trust a filesystem (distributed/replicable/etc)?
Are we going to trust an external service with an exposed API (something that speaks the S3 protocol, something that relies on another protocol but exposes the files over an HTTP interface, etc.)?
There are costs to the two solutions.
Solution A: we store assets on a filesystem
Pros:
We don't have to change the current code on how we store assets
Cons:
Might be harder for clients because it might require specific knowledge
Very rigid solution: clients won't have flexibility over their storage strategy.
Solution B: we store assets in "the cloud" (something that speaks S3 protocol,...)
Pros:
Complete flexibility in the storage strategy: S3, Azure, OpenShift, their own S3-compatible implementation in the cloud, or a completely custom strategy
Cons:
We have to modify the code that stores data on disk
Existing customers would have to change their setup, unless we build an adapter inside GitLab that branches between saving to disk or saving to the external service depending on a setting.
@regisF the only thing I would change in "Solution B" is to not say it will be stored in "the cloud", as that can give the wrong idea that it doesn't work on premise.
I think a better way to name it is "use an object storage". This can be either something provided by the big cloud providers in a standardized way (speaking the "S3" protocol), or a custom solution that relies on a specific implementation (like GridFS from MongoDB, or Swift from OpenStack, as examples).
Some examples of implementations compatible with "S3" protocol:
Ceph (the thing behind the "CephFS" can emulate the S3 protocol)
MinIO (I think we already use it for the CI, it should be something we could use as "default" for anyone not using Geo)
OpenIO
Riak S2
Interesting detail about object storages is that nowadays even some hardware storage solutions can emulate a few widespread protocols (like Swift and S3).
@regisF if I understood correctly, Geo is being considered as a way to back up repos. But yes, whatever decision we make here must also consider gitlab.com, as it can also be used for the other things (assets / avatars / build logs etc).
Why are we not doing the simplest solution? I don't see that answered above.
Our customers want to have the ability to do disaster recovery. We can't force them on a particular filesystem, unless absolutely necessary. And even then, why not have assets be a single path, just like how you configure GitLab now? Then let the customer deal with scaling and performance.
We can also build more than that, but we should focus on solving the problem in the simplest way possible for everyone. Anything more, unless absolutely necessary, smells like overengineering.
@JobV the simplest solution is to delegate the complex part (replication) to a different daemon (be it a filesystem, or an object storage mounted as a filesystem or mapped inside the application).
If we had to pick a single straight path I would pick object storage and make the needed changes in the application, as mapping it to a filesystem is not the "fastest" way, can cost more (if using the cloud), and brings all the OS-level complexities.
So, following the object storage solution, we either add MinIO (which I believe is already required for CI) and forget about branching the code between saving to the filesystem or remotely, or we don't impose another requirement (thinking about Raspberry Pi) and create our own adapter.
Why don't we replicate it ourselves?
Sidekiq (the free version) is not a reliable queue. It can lose jobs, and in this case losing jobs means losing data, as asset replication isn't like git, which gets updated often and where every new update ensures complete state.
If we want to do it ourselves we will need a new reliable queue, plus re-implementing many things an object storage already does, probably in a less optimized, less tested way.
We can't require our customers to have a specific filesystem. And the simplest solution is to simply copy the assets to another server. I understand your concern about reliability, but at the moment, for the things that are replicable (database, git repositories, wikis, ssh keys), how are we making sure the replication is accurate and that we haven't lost anything? I wonder if we should build a complex system to replicate assets instead of a simple "copy and paste" between servers (and check whether the MD5s of these files are the same when the copy is finished, something like that).
@regisF speaking about what we replicate right now, everything is either in the database itself (which has its own replication system, is monitorable, etc.), can be re-generated from the database (ssh keys for example), or is a git repository (projects and wikis), which is distributed across at least one more machine (the developer's). So if Sidekiq fails and loses a job that is replicating a repository, you will still have it, just in an earlier state (lacking commits); as soon as another push arrives, what was missing gets updated again and they get in sync.
In a DR situation, you still have copies on developer machines, so the missing update can be easily fixed. This is also a risk you take because of the async nature of it (let's say you have a latency of 15s to replicate something and you lose your primary node before that: you can easily push code again to the newly restored server and recover the state from that small window).
With assets, this never happens. We don't do any kind of activity that makes sure every asset is present every time (as git does when you push code). This is a hard problem to brute-force. Even an rsync will take hours or days depending on the amount of data you have. So using an external system here is the only sane way. I can't think of any other thing that could handle gitlab.com or our big clients (and the really small setups are not really a product fit for a paid DR solution IMHO).
If anyone has a different idea and wants to discuss it, I think it's better we get on a call so it's faster to go over the pros and cons.
Thanks for the great explanation @brodock, I see the problems now.
Can you elaborate on
So, following the object storage solution, we either add MinIO (which I believe is already required for CI) and forget about branching the code between saving to the filesystem or remotely, or we don't impose another requirement (thinking about Raspberry Pi) and create our own adapter.
What would be required of a customer to set a solution like this up? Assuming they want to use their own servers.
If we decide to change every place in our codebase that saves assets to disk to simply upload them to an S3-compatible endpoint, users will need one of these:
A physical storage appliance that exposes an S3-compatible protocol (most of the shiny EMC offerings do)
An S3-compatible object storage service provided by their private cloud solution (almost all of them provide one)
Use an external cloud solution from the big players (AWS, Azure, Google cloud etc)
Run a solution we provide
For Enterprise Edition + Disaster Recovery / Geo, we will need to ship something that can do multi-datacenter replication (the fancy term for "we can replicate asynchronously and not degrade performance of the main location").
My go-to solution for this was Riak S2, but I discovered that multi-datacenter replication is paid only, so it's not a fit.
These are the others that can do that and we need to test/benchmark: OpenIO, Ceph (without the FS part), LeoFS (it also has an S3 compatible protocol and an NFS one)
For Community Edition or Enterprise Users who don't use Geo we could run MinIO (which is a very lightweight go implementation of a local object storage). It would ideally be running in a separate "storage" machine, but can also work well in a single setup.
If we come to the conclusion that this adds too much complexity, and we would like to remove that extra complexity for users who don't care about using an external storage, we need to code something like a "Storage API" where we either save to an S3-compatible endpoint or to the local disk, and make sure we are using it everywhere we want data replicated (assets, avatars, build logs etc.).
For users using an S3-compatible object storage, we will need to ask for a few pieces of credential information (API key, endpoints etc.) and it will be ready to go from the gitlab-rails perspective. If we ship something in Omnibus, we need to make the configurable parts configurable, etc.
If we build something like a "Storage API", users will be able to define where they want to store that data (filesystem or external object storage).
If we think about the steps to deliver this, the first one will be to code the Storage API and change the relevant pieces of code in gitlab-rails to point to it. This will already enable anyone using Geo to use their own storage solution to improve a DR scenario. We can then ship our own solution that can be configured to do geographical replication etc. But even at this initial step, we can point all Geo instances to a single storage endpoint and have it fully working (with the added latency for assets, etc.).
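As an illustration only, the kind of "Storage API" described above could be a thin adapter that the rest of gitlab-rails calls without knowing which backend is configured (class names and settings here are hypothetical):

```ruby
require 'fileutils'

# Hypothetical "Storage API" adapter: callers never touch File or S3 directly.
module Storage
  def self.backend
    case Gitlab.config.storage.provider # hypothetical setting
    when 's3' then S3Backend.new(Gitlab.config.storage.s3)
    else           LocalBackend.new(Gitlab.config.storage.path)
    end
  end

  class LocalBackend
    def initialize(root)
      @root = root
    end

    def store(key, io)
      path = File.join(@root, key)
      FileUtils.mkdir_p(File.dirname(path))
      File.binwrite(path, io.read)
    end
  end

  class S3Backend
    def initialize(config)
      require 'aws-sdk-s3'
      @bucket = config.bucket
      @client = Aws::S3::Client.new(
        endpoint:          config.endpoint,
        region:            config.region,
        access_key_id:     config.access_key,
        secret_access_key: config.secret_key,
        force_path_style:  true # needed for MinIO-style endpoints
      )
    end

    def store(key, io)
      @client.put_object(bucket: @bucket, key: key, body: io)
    end
  end
end

# Usage: Storage.backend.store('uploads/avatar/1/logo.png', file)
```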
Even an rsync will take hours or days depending on the amount of data you have. So using an external system here is the only sane way. I can't think of any other thing that could handle gitlab.com or our big clients
Ok. But in any case, when we set up DR, we will need to duplicate the assets the first time anyway. So it will take a huge amount of time to replicate the content no matter which solution we use. How would storing assets in an external system improve the time it takes to sync data? Why do we need to store the assets of the primary server anywhere other than on that server's disk?
@regisF Yes, the first data migration will take a lot of time.
It's not just "storing it in an external system" (like a remote folder). The way these systems work is different from just having remote space somewhere. Whenever you create or delete a file, they know about it (because you have to upload or issue a deletion through the API), and so they know what needs to be sent to the remote destination or removed. They can also do replication within the local "cluster", which is a way to prevent data loss anyway.
So the reason to have it in a remote solution is because it's how they work. You usually setup a small cluster and grow as your data grows. Just get Ceph as an example (which is what we are using to store git repositories). The reason we are using Ceph is both to be able to grow in storage and to handle load better.
The reason we want an external daemon for DR and Geo is to handle all the things related to keeping data secure (in terms of file corruption, hard disks failure etc), and be able to replicate geographically (so people in a different continent can have access to the data with local latency, and so they can have a different "site" storing their data, in case of a DR).
As a side-effect, using an external storage for this kind of data is a way to help GitLab scale horizontally.
So the major difference here is:
Doing rsyncs from time to time will always take a lot of time (you have to walk and check all the files to calculate the hashes and compare the state with the N remote machines you are replicating to, so this time will keep growing as the number and size of files grows).
Using any of those solutions, as a side effect, the load and the data will be distributed among different nodes, so replicating to a new location will take less time than a plain rsync, as the "cluster" can handle it with more machines.
@brodock rsync may take time but it is pretty efficient. As stated above, this is not a trivial problem we are trying to solve. I've yet to find a faster generalized solution to this sort of problem than rsync.
@xyzzy yes, the thing is that what we really need is something in between rsync and a distributed filesystem/object storage, which is something that doesn't exist right now.
rsync takes a lot of time because it has to calculate the hashes and deltas on every execution. Something that listens to "file notifications" from the kernel (inotify) could spare us the "brute force" search every time, but this is also not 100% reliable (whenever you restart your system or the watcher daemon, you have to do a full scan to avoid losing any change).
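For reference, a kernel-notification watcher of that kind could be sketched with the rb-inotify gem (the path shown is illustrative), and it inherits exactly the limitation described: it only sees events while it is running.

```ruby
require 'rb-inotify'

# Sketch of an inotify-based watcher; events are missed while the watcher
# or the machine is down, so a full rescan is still needed after every restart.
notifier = INotify::Notifier.new

notifier.watch('/var/opt/gitlab/gitlab-rails/uploads',
               :create, :delete, :moved_to, :recursive) do |event|
  # In a real daemon we would enqueue a replication job here instead of printing.
  puts "#{event.flags.inspect} #{event.absolute_name}"
end

notifier.run # blocks and dispatches events as they arrive
```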
So an intermediate daemon is a must: whatever change you need to make must go through this daemon, so if it is offline, you can't change files, etc.
The daemon will have to implement almost everything an object storage does, and use a reliable queue to synchronize with remote nodes etc (for replication) or keep a local database of changes that can be queried by the remote node to synchronize from time to time.
This is basically 95% of what an object storage does, with the other 5% being convenience plus a scalable architecture.
I don't like the increase in complexity that this puts on the table, but I can't stop thinking that anything less will only bring us headaches, and that we will end up here anyway.
What we can try to do here is look for the most frictionless solution (even if it's not the most performant one). Something that requires very little to no configuration would be gold.
I have high hopes for MinIO to fill this gap, but they are not quite there yet (you can't do geo replication in the way we want), although they have a manual mirroring feature (which I still need to test). If it can be used to mirror without requiring an "all files scan" as rsync does, I think we are in warmer waters.
@brodock Well, rsync has a couple of scan methods, one of which is to use the filesystem modified date (I think that's the default). Scanning everything can take time, but if one breaks a huge directory tree into chunks and runs the rsync checks in parallel it can be a whole lot more efficient. There are a few ways to do this, and which is most efficient depends on how the data is distributed within the file tree.
Yes, with more than two nodes this can be far far more complex! I'm glad we're looking at this from multiple angles.
Rsync is the quickest solution (use a sidekiq cron job?) and it will be the easiest to sell. It will be harder to maintain and scale. In the end we will have written something like S3's replication inside GitLab.
Object storage is the best engineering solution but we can only sell it to customers who are already in the cloud, or have an internal cloud.
If we go with Rsync we may be stuck with it. Some major customer who would not have bought Geo if S3 was a requirement will start using it, so if we ever want to drop Rsync in Geo it would be bad for that customer.
Making Rsync work requires exploiting metadata in SQL (look up the 'things' that changed/were created since the last sync, only rsync those). For Docker Registry, which we bundle, this is not an option because we do not have direct access to its database. There we might have to use object storage anyway.
I think I am leaning towards "S3" because it is boring and reliable. But it only fits the bill for Geo for customers who already have it (or something like it).
The way I envision the 'S3' scenario is that we bundle Minio and use 'S3' style file access everywhere, from GitLab CE on Raspberry Pi on up. That would make it more reliable than some custom rsync code that only runs for a handful of customers who bought the Geo product.
Allowing either local disk or 'S3' is something we should avoid. So an 'S3' scenario involves migrating all existing installations into that.
This need not be as hard as it sounds. Several years ago GitLab supported storing user-uploaded files in S3 via Fog. We used this on gitlab.com. Then one day we decided to stop using S3. All I had to do was copy the S3 bucket to disk on the GitLab server with s3cmd sync, and then Fog was able to find all its files locally. If we were to move from local disk to Minio (backed by local disk) then, if we are lucky, no files have to be copied / moved at all.
@xyzzy @regisF can you figure out whether using the S3 solution would be a problem for them? Maybe ask existing Geo customers? See Jacob's comment above.
Our typical product strategy here would be to go for Rsync, making us solve Docker Registry later (simply not supporting it at first), so I'm leaning towards that. BUT if we see that S3 is favorable from a technical standpoint AND it's not a problem for our customers, it might be worth the bump - avoiding a future where we have to migrate customers from an Rsync to an S3 solution.
I should add that S3 does not solve all problems: it would not deal with git-annex and GitLab Pages.
Doing rsyncs from time to time will always take a lot of time (you have to walk and check all the files to calculate the hashes and compare the state with the N remote machines you are replicating to, so this time will keep growing as the number and size of files grows).
So an intermediate daemon is a must: whatever change you need to make must go through this daemon, so if it is offline, you can't change files, etc.
In defence of Rsync, I don't see what is wrong with doing periodic rsyncs, or why we need a daemon. The SQL database keeps track of when 'things' like uploaded files or CI build logs were added / changed, so we can compile a list of things that need to be synced. That cuts down the start-up cost of rsync dramatically (in fact I don't think not building such a list is an option). You then end up doing bookkeeping in SQL of where the cutoff point of the last successful sync is, so you can skip things on the next sync. Getting these things right is where the bugs, failures and potential data loss will be.
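A rough sketch of that bookkeeping, with hypothetical table and column names, just to make the idea concrete:

```ruby
require 'tempfile'

# Hypothetical sketch: use SQL metadata to build a file list for rsync,
# so each run only transfers what changed since the last successful sync.
last_sync = GeoSyncState.maximum(:synced_up_to) || Time.at(0) # hypothetical bookkeeping table

changed = Upload.where('updated_at > ?', last_sync).order(:updated_at)
cutoff  = changed.maximum(:updated_at)

Tempfile.create('geo-file-list') do |list|
  changed.each { |upload| list.puts(upload.relative_path) } # paths relative to the uploads root
  list.flush

  # rsync only the listed files; --files-from avoids walking the whole tree.
  ok = system('rsync', '-a', "--files-from=#{list.path}",
              '/var/opt/gitlab/gitlab-rails/uploads/',
              'geo-secondary:/var/opt/gitlab/gitlab-rails/uploads/')

  # Record the cutoff only if the transfer succeeded, so failures are retried next run.
  GeoSyncState.create!(synced_up_to: cutoff) if ok && cutoff
end
```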
@jacobvosmaer-gitlab you are right, I have forgotten we DO have the metadata in SQL (so we don't need a full-scan every time). Thanks for bringing this up.
I don't know the code behind GitLab Pages, but I've coded something similar to that at $oldjob and we used S3 to store the files, so I'm sure this could be changed to use the new protocol.
Can we then use this to perform the replication, but instead of using a real cloud (and therefore asking our customers to purchase an S3-compatible cloud on top of the cost of GitLab DR), have customers use the local disks on both primary and secondary servers? @brodock @jacobvosmaer-gitlab
@regisF yeah, I think I didn't make it clear: any of the S3-compatible solutions I mentioned in the discussion are "free" to install locally, so there's no need to purchase anything extra; it's just the burden on our side to integrate them well with GitLab / Omnibus.
I think we can start with MinIO and see what we can use to make the replication work.
@brodock we also need to solve the cases of git-annex and GitLab Pages with an S3 approach now to make a decision. Can you tell us how we would back up those?
I'm with @jacobvosmaer-gitlab here: we need to have consistency across all the storage using something that looks like S3 (it doesn't have to be it), and then we can start making background copies to different places (with some form of event log, like Time Machine does?). That is the only thing that will actually scale.
One thing that doesn't make sense in my head is a Storage API abstraction on top of CephFS, because we picked CephFS for being POSIX compliant, so it's a normal filesystem from the process perspective.
As much as I love rsync, I just don't think that it will scale and keep things up to date (I'm thinking GitLab scale at this point)
Maybe a question for you @brodock is whether you are planning to use a pull or a push pattern; that is, will the secondary node be polling for updates or will we have a node/worker/job pushing changes into the secondary node? Did you consider some form of federation for having multiple backup nodes? How will this backup system behave? How does Geo behave now?
@pcarranza my wish is that whatever solution we pick, we can delegate the replication part to it (MinIO will soon release a "Distributed" feature, which sounds like what we need; otherwise any "multi-datacenter" replication feature will do).
Geo today works in a push pattern (all changes are notified from the primary to the secondary nodes, but git replication isn't a git push from the primary; it's a git fetch from the secondaries after they receive the notification).
The proposal so far is to use Geo nodes as a kind of "availability zone", where you have a completely functional copy of the application and its data, but in read-only mode, so that in case of a disaster you can either use it to restore data to the primary, or promote it to primary to reduce downtime.
any of the S3-compatible solutions I mentioned in the discussion are "free" to install locally, so there's no need to purchase anything extra; it's just the burden on our side to integrate them well with GitLab / Omnibus.
I disagree. It is not just a burden on our side. And this sort of solution is not 'free' at all for the customer. Deploying something like Ceph requires knowledge and serious effort. And then you need to have people on staff who keep it running. The customer would have to train/hire and retain specialists who can do this.
If we go the Rsync way then admins at the customer only need to interact with GitLab and they get support from us. If we go the S3 way then those admins need to interact with GitLab and Ceph/Openstack/etc., and unless they have experts already, they probably end up paying a Ceph/Openstack/etc. consultant. In reality that means that unless it is super easy (because the customer is already on AWS where S3 is a commodity), the total cost of ownership for GitLab Geo is higher with the S3 solution.
"Use Minio" is not a solution for this issue. If am hopeful that it is good enough for single-server use. But assuming it does reliable replication, and it does it in the way we need it, and that it is not a pain to configure, is not a good bet for such a young product that aims to be a 'mini' solution.
Regarding gitlab-pages and git-annex in the S3 scenario:
git-annex should be replicated as part of the same mechanism that handles Git itself (I don't know what mechanism that is currently)
gitlab-pages would have to be rewritten to act as a local cache: download the appropriate zip from S3 on the first request and serve local files from there on. I expect this can be done but this would be a bit of an architecture change.
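To make that local-cache idea concrete (conceptually only; gitlab-pages itself is a Go daemon, and every name below is made up):

```ruby
require 'fileutils'

# Conceptual sketch of "Pages as a local cache of object storage": serve from local
# disk, fetching and extracting the site archive from S3 on the first request only.
class PagesCache
  def initialize(s3_client, bucket:, cache_root:)
    @s3         = s3_client
    @bucket     = bucket
    @cache_root = cache_root
  end

  def site_path(project_key)
    dir = File.join(@cache_root, project_key)
    unless Dir.exist?(dir)
      zip = File.join(@cache_root, "#{project_key}.zip")
      FileUtils.mkdir_p(File.dirname(zip))
      @s3.get_object(bucket: @bucket, key: "pages/#{project_key}.zip", response_target: zip)
      system('unzip', '-q', zip, '-d', dir) # extract once, then serve static files from disk
    end
    dir
  end
end
```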
To summarize, I think the following would be needed if we go the S3 way:
change all uploaders in GitLab to use 'S3 storage' (not optional, make it the only behavior)
bundle Minio so you can have local 'S3 storage' on all GitLab servers
find a pain-free and automatic way to convert existing GitLab servers to Minio file access
fix the problem of git-annex replication in Geo
rework gitlab-pages so that it knows how to fetch missing user sites from 'S3 storage'
change CI artifact browsing to use local caching (because it relies on local access to zip files and metadata files now)
... we will probably discover a few other things along the way
it is easy to underestimate how much work we have to do before we can ship this
And then you have an implementation of GitLab Geo that you can only sell to customers/deployments (gitlab.com) who already have access to an S3-like service. Those customers / deployments will then have the best possible durability / replication for non-repository files in GitLab.
The system would have fewer bugs with file replication because all we have to do in GitLab is use the S3 APIs correctly, and if we do it wrong we find out quickly. The customers who use Geo are not likely to be the first to discover bugs in the S3 object storage code, because every installation uses it.
If we go the Rsync way
we have less work to do to have a working prototype
the solution can be deployed by any customer, in principle
in practice we will face scaling issues. These will become apparent quickly on gitlab.com.
in the long run we end up with replication code hidden somewhere in a corner of GitLab, with recurring incidents where a new feature adds e.g. a new type of file and we forget to replicate it because nobody updated the rsync code
we don't have to change how any of GitLab's existing features use local files, except for tracking carefully when files are created / updated / removed
it is easy to underestimate how complex our replication code ends up and what maintenance pressure this complexity puts on Geo
Because the replication code would be concentrated somewhere in GitLab as a paid add-on many of the inevitable bugs in it would be discovered by the customers who paid extra money to have Geo.
@regisF asked on Slack: what is the right technical solution if we forget for a moment about how to sell this.
For gitlab.com the S3 way would be best: it out-sources one of our scaling problems, without vendor lock-in.
For small and medium sized GitLab installations an S3-based solution improves nothing and only adds complexity. (We are hoping that we can minimize the pain caused by that extra complexity to a point where most people won't notice.) For larger non-Geo installations the option to store CI artifacts and LFS uploads in S3 is a valuable improvement.
I think S3 is the better architecture both for Geo and for deploying GitLab at large scale, but it is a big change. Even if we forget about how well we can sell it we still have to weigh the benefits against the cost and risks of the change.
I am calling in the higher authority of @dzaporozhets . :)
From the GitLab.com perspective, the rsync solution just does not scale, and we are already hitting a huge issue with it in our current state (4 days so far to copy the artifacts, and still waiting). Imagine how much worse it will be as we continue growing at this pace.
I understand that it is a huge complexity to deal with in a small installation, but given our scale we have few other options left.
Forgive me for not knowing how this all gets plumbed together but could we build it with Rsync (easier) and then plumb in S3 (harder) at a later date? If so that would give us more options later on and a solution of some kind sooner.
@jacobvosmaer-gitlab can you detail what so radically changed your mind? I am now confused about where we stand on this issue, because it reads like we've tossed any kind of rsync methodology right out and gone for a wildly more complex object-storage method... with gains and advantages that are not entirely clear to me.
Note - when I say rsync I'm not talking about scheduled jobs that scan the entirety of the git repo; that would be folly. I'm talking about something like a post-commit hook that submits a job to a queue containing the git repo information, and then we have Sidekiq processes that pick up those jobs and shuffle off repo rsyncs, essentially only copying data for repos when commits are made against them. This scales through queueing and abstraction: the post-commit hook is called after the data write is entirely finished and packed, rsync is an intelligent tool that can recover from partial transfers and disconnects, and it allows users to use whatever storage solution they please so long as it can be presented as a POSIX filesystem (99% of them). This also works for our wikis as they're git repos on the backend, and we can hook our CI upload and artifact process in the same way. You're also using tools that are native to EVERY Linux distro without having to add yet another library / divergent piece of strap-on code to GitLab. This is also in line with our use of rsync within the API for moving repositories from one storage shard to another (which is how we're currently migrating from NFS to CephFS).
@northrup we are not talking about the git repositories here. We have a sync solution for that already (@brodock correct me if I'm wrong) that uses git pull.
We are talking about uploaded files such as attachments and avatars, LFS objects, CI build logs, CI artifacts. These are all files (sometimes a small handful of files, e.g. CI artifacts have a metadata file, images can have a couple auto-generated resized versions) that correspond to SQL rows in our database.
If we want to replicate those files efficiently then (as we all agree in the discussion above) you cannot just launch a single rsync job on the directory of all LFS objects. What you need instead is to track the replication state for each type of file ('all LFS objects older than date X have been replicated to remote Geo server Y') or even for each file ('this LFS object has been replicated to Geo server Y'). This is a lot of bookkeeping. And then we need either hooks on all types of uploaded files ('after creation replicate this LFS object to all known Geo remotes') or batch jobs ('replicate all LFS objects newer than date X to all Geo remotes').
Hooks are unreliable because Sidekiq can lose jobs. This is tolerable for Git repositories because they get updated over time: if one Git replication hook fails, its work will be done by the next replication hook for that repository. An LFS object on the other hand is created only once and never updated. If the one Sidekiq job that had to replicate a new LFS object fails without getting re-enqueued (this happens!), that object never gets completely replicated. Data loss.
So reliable hooks for append-only storage is a problem for us. I would have more faith in batch jobs that periodically replicate whatever hasn't been replicated yet. Scheduling these batch jobs so that they do not overlap or put an uneven strain on your GitLab cluster is hard. The resulting code will be complicated.
In summary, I expect that to make rsync work well we would need
SQL bookkeeping everywhere
Complex replication code that can do retries (to achieve reliability) and that can even out the system pressure caused by replication
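To make that scope concrete, the per-file bookkeeping plus retries could look roughly like this (hypothetical models; a real implementation would also need scheduling, back-off and load smoothing):

```ruby
# Hypothetical per-file bookkeeping: one row per (file, Geo remote) pair.
# A periodic batch job retries anything not yet marked as replicated, so a
# lost Sidekiq job only delays replication instead of silently losing data.
class Geo::LfsReplicationJob
  def perform(remote)
    replicated = ReplicatedFile.where(remote_name: remote.name, success: true).select(:file_id)

    LfsObject.where.not(id: replicated).find_each do |object|
      begin
        remote.upload(object.file_path) # the transport itself is out of scope here
        ReplicatedFile.create!(remote_name: remote.name, file_id: object.id, success: true)
      rescue StandardError => e
        # No success row is written, so the next batch run picks this file up again.
        Rails.logger.warn("Geo replication of LFS object #{object.id} failed: #{e.message}")
      end
    end
  end
end
```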
Now to make things worse, this hairy mess of rsync support code would be subject to a high degree of software rot.
There will be a constant stream of changes in gitlab-ce that break the replication code
When this breakage happens we will often find out about it too late: namely when a customer reports a bug. In this context 'customer reports a bug' is likely to mean 'customer suffered data loss'. Big reputational damage to our expensive Geo product.
Compare this with 'S3' object storage.
in many places in the application we already use library code (gems) that supports S3 storage
we have a possibility to organize this so that every GitLab installation accesses its 'files' (attachments, LFS objects, etc.) through an S3 API: bundle Minio
If all non-repository file access goes through an S3-compatible API we will catch bugs much earlier, resulting in software that is easier to maintain for us and a more reliable product for the customer.
Having skimmed through the discussion: using a non-filesystem-based storage system for files will make contributing much more complex. Effectively every use of Ruby's File and IO classes/modules would have to be replaced with this file system API. This also affects gitlab_git, and basically every single piece of code we use (including any Gems) that touches the file system. For example, gitlab_git reads Git attribute files. This would then have to somehow be mapped to S3 or whatever it is that we're using.
@yorickpeterse the proposal is to use S3 for non-git data (attachments, avatars, build logs, etc). Any git data we use git to replicate, so it can still be stored in the local file system.
@brodock Ah OK. For those kind of files something like S3 (or a similar service) makes the most sense. I believe that the uploader we're using (Carrierwave) already supports multiple backends. I'd prefer making this agnostic over providing our own thing. If we have to provide our own thing we also have to deal with all the problems that it may bring. It would also be another piece of machinery that needs to be deployed, monitored, updated, etc.
I've updated the body of this issue with what we will do for the proof of concept. If the proof of concept works, we'll create other issues for the remaining work.
@DouweM and I talked about who should work on this and we both agreed that the bus-factor risk for Geo is too big. Additionally, Geo is too complex to justify having only one person working on it.
The suggestion is to have @patricio work on this and @brodock helping out.
The meeting we had earlier this week shows that we need to have a better overview of all the pieces.
To have an overview of the entire solution, here is what DR needs to cover:
How do we do the bookkeeping of what's changed
How do we transfer the data to the other server reliably
How do we ensure data integrity for the data that has been copied.
We have 4 possible solutions for DR, as shown in the ugly diagram below.
During the kickoff meeting, the issue was raised that we shouldn't dismiss rsync or even git for DR without at least trying them in a PoC and discussing what the pain points are. I agree that if they are easier solutions than this S3 approach, we should at least test them.
To move forward with this, I'd like to test the bookkeeping + transport with rsync/git/s3 approaches with a simple use case (one PoC for each), report back, and choose a final approach. Do we need to take into account data integrity at this point of the PoCs?
The proof of concept we wanted to do for 8.14 will only demonstrate whether we can add an S3 storage layer without many changes to the code. This will only validate one aspect of one of the many possible approaches. So let's extend this PoC for the S3 approach, then let's try it with rsync and the others and see how it goes. What do you think?
@patricio, @regisF and I discussed the drawings and the status. Here's what I took away:
As a proof of concept for handling attachments, @patricio was able to make attachments work over S3 by using the CarrierWave AWS-SDK gem. He first got it to work with AWS and then switched to using Minio. This validates that it's possible to easily use an S3-compatible solution for storing and retrieving attachments.
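For reference, the kind of configuration such a PoC needs looks roughly like this (assuming the carrierwave-aws gem; the bucket, endpoint and credentials are placeholders for a local Minio server, not the actual PoC code):

```ruby
# config/initializers/carrierwave.rb -- illustrative sketch only.
CarrierWave.configure do |config|
  config.storage    = :aws
  config.aws_bucket = 'gitlab-uploads'
  config.aws_acl    = 'private'
  config.aws_credentials = {
    access_key_id:     ENV['MINIO_ACCESS_KEY'],
    secret_access_key: ENV['MINIO_SECRET_KEY'],
    region:            'us-east-1',             # still required by the SDK
    endpoint:          'http://127.0.0.1:9000', # local Minio instead of AWS
    force_path_style:  true                     # required for non-AWS endpoints
  }
end
```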
In addition, using an S3 solution could work for:
LFS
CI artifacts/build logs (currently local filesystem writes)
However, an S3-based solution does NOT work with git-annex.
The big question: Is it sufficient to have customers provide us with an S3 bucket and information and let the S3-based solution handle replication for us? For GitLab.com, I think we would be okay with relying on AWS S3 buckets.
For people who need an on-premise solution, they would need a local S3-compatible solution. Minio is the simplest one, but it doesn't support replication (although they discuss a simple mirroring feature here). OpenStack Swift is another possibility. There are even hardware providers that have S3-compatible storage solutions.
@regisF is going to talk to customers and see whether it's acceptable to rely on S3 for holding the data.
In any case, we need a solution for GitLab.com to do the initial clone/copy for each repository as described in https://gitlab.com/gitlab-com/infrastructure/issues/415#note_17370229. Is that the first step here? Object store for uploads and everything else comes second. Our first priority should be to make sure the repos have been copied.
I've spoken to Harshavardhana (from Minio Inc.), asking some questions about how to achieve replication with it in a few different ways.
In a simple "similar to rsync" scenario, we can use the mc mirror command and point a local path to a remote endpoint. It will do a streaming scan, where both sides' outputs are already sorted, and it uses size and timestamp for comparison (so it's lightweight compared to hashing each file).
According to him, Minio supports the S3 notification API, which can be used to "watch" for changes and replicate them as they happen, in real time.
To achieve this we use mc mirror --watch, which will start with a full scan (the streaming sorted comparison described above), plus a subscription to the S3 notification API: ListenBucketNotification().
Here is an example of what the S3 notifications look like:
$ mc watch play/brodock [2016-10-31T23:39:37Z] 4.2KiB ObjectCreated https://play.minio.io:9000/brodock/README.md
This has the problem of the warmup (full scan), which probably doesn't scale to gitlab.com, and the notification API is not backed by a reliable queue for this command. It's a step forward in the direction we are pursuing, but not quite there yet.
The next thing he told me is that Minio can actually feed a reliable queue with these events, for example anything AMQP-compatible (RabbitMQ, for example) or NATS.io (NATS and PostgreSQL support are to be rolled out soon).
In theory we could build some "synchronization sauce" using Minio to feed the notification events into a reliable queue, with something in the middle consuming them and replicating to another geographical location running Minio on the other side.
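A sketch of that middle piece, assuming Minio is configured to publish bucket notifications to a RabbitMQ queue and that the payload follows the S3 event format (queue name, endpoints and credentials are placeholders):

```ruby
require 'bunny'
require 'json'
require 'cgi'
require 'aws-sdk-s3'

# Sketch of a consumer that replays Minio bucket notifications against a remote Minio.
def s3_client(endpoint)
  Aws::S3::Client.new(endpoint: endpoint, region: 'us-east-1',
                      access_key_id: ENV['S3_KEY'], secret_access_key: ENV['S3_SECRET'],
                      force_path_style: true)
end

primary   = s3_client('http://minio-primary:9000')
secondary = s3_client('http://minio-secondary:9000')

conn    = Bunny.new(ENV['AMQP_URL']).tap(&:start)
channel = conn.create_channel
queue   = channel.queue('minio-events', durable: true)

queue.subscribe(manual_ack: true, block: true) do |delivery, _properties, payload|
  JSON.parse(payload).fetch('Records', []).each do |record|
    bucket = record.dig('s3', 'bucket', 'name')
    key    = CGI.unescape(record.dig('s3', 'object', 'key'))

    if record['eventName'].to_s.include?('ObjectCreated')
      object = primary.get_object(bucket: bucket, key: key)
      secondary.put_object(bucket: bucket, key: key, body: object.body)
    elsif record['eventName'].to_s.include?('ObjectRemoved')
      secondary.delete_object(bucket: bucket, key: key)
    end
  end

  channel.ack(delivery.delivery_tag) # acknowledge only after the copy succeeded
end
```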
Even if we decide we don't want to keep Minio, this type of solution will use a "standard adapter" (the S3 notification API), which can be ported more easily between different implementations that support it.
I have recently discussed Geo for use as DR with several prospects. Several of them are interested in having an active/active setup in one data center, with Geo linking a separate DR data center that is also set up as active/active and ready for failover. I'm not sure how Geo plugs into the receiving end of an active/active setup now, but this configuration should be planned for in whatever architecture we select. Architectural constraints such as leaving the DR site's Geo-fed active/active nodes read-only (with respect to any local user activity) until one fails over to the DR site are acceptable.
I figure that this is best mentioned now, while we are still figuring out the architecture for all of this.
@xyzzy an active-active setup will only happen after we get a fully functional active-passive one. It's a lot more complex to set up a "multi-master" topology, and even more so to get it fast enough and reliable. I don't think this should be the goal for this feature in the near future.
@brodock yes, I can also imagine something where Minio notifies Postgres, which we could treat as a reliable log of object CRUD events, and then do our replication from there. That would still have the disadvantage of us having to create reliable replication code and having to maintain it, but at least it would be in one place (Minio replication) instead of 10 places (uploads replication, CI build log replication, LFS artifacts replication, etc.).
@brodock active-active on the Geo end may not be our short term goal but it is ABSOLUTELY what customers are telling us that they want. So we need to make sure that whatever we come up with will support that down the road. I believe that this does not have to be active-active for user access while it's slaved off of the real master via Geo - it has to be that way once the decision is made to make the DR site the master.
RE: S3, I am on-site at a large customer today where this came up. They are fine with using S3 but they plan on having a solution that (for security reasons) is not cloud based. When they heard "S3" they thought of cloud. They need something that is installed via Omnibus and just works. So, if we use S3 we need to be able to supply it in order for this to work at this sort of customer. I believe that is where we are headed but wanted to make sure that that was indeed specified here.
@xyzzy I know too that customers say they want active-active, I have been hearing that for three years now. :) But I am not sure if we should make it part of our discussion here.
we need to make sure that whatever we come up with will support that down the road
@marin - I absolutely communicated that S3 =/= Amazon ... that's why I added that if whatever is required (in this case S3 support) gets shipped with our Omnibus distro then they don't care what the transport is. They just want no external requirements. That is, I believe, the same direction as we are headed.
I think we understand the need for no external requirements, but it sounds like @regisF talked to customers (I think GitLab.com included) who would NOT mind using Amazon S3 and its cross-region replication. It seems like the first minimal product here would be to rely on an external provider and defer how we can provide an internal S3-based solution later.
@stanhu I'm concerned that if we go the route you mention, where we rely on external providers and not ship an internal S3 based solution, it would mean forcing a lot of people to move to S3 prematurely.
The first play test I did with MinIO forced me to change the way uploads are handled, so they would now use the S3 API instead of a regular filesystem. This means that if we don't ship an S3-based solution with the changes, users would be forced to move to an external S3 provider.
I know the purpose of the S3 solution you mention would be for large Geo customers, but the changes in the codebase to support S3 would need to also be added to CE, or we risk a highly divergent codebase.
The solution that I think @brodock was initially proposing (and that I support as well) is for the first MVP to be:
Replace the storage option for all Carrierwave based uploaders to use :aws instead of :file (allowing the use of an S3 based solution)
Add MinIO to the stack to handle those S3 operations
In the second iteration, we can move other uploaded files (LFS objects, build artifacts, etc.) to MinIO as well.
@stanhu actually, to date, most customers have said that they do not want to use a cloud provider, as they want to keep things on premise. I have another call scheduled in the next few days with another customer, I'll let you know.
Relying purely on Amazon S3 for this would only suit one customer (to date) and us (GitLab.com). The other current Geo customers would not buy it.
Moreover, I have to add that asking customers to buy specific hardware (like S3-compatible one) seems unrealistic.
@stanhu anything that will be calculating a delta between 2 storage endpoints in an O(n) process is a no-go for infrastructure - we will just not use the feature at all.
We had a call with @patricio and @stanhu today about this and here are the steps to move the project forward.
As a reminder: the goal of this project is to be able to replicate the content of the primary node to secondary nodes, so that we can recover the content in case the primary server crashes.
Out of scope: the notion of automatic failover (or any kind of HA). No active-active setup.
Feedback from customers: consensus on keeping things on premise, and no dedicated hardware for replication.
We are moving forward with the exploration of S3-protocol.
What will we do now, in order (@patricio is going to handle those steps, with the help of @brodock ):
Automatically backfill repositories from primary node to secondary nodes (https://gitlab.com/gitlab-org/gitlab-ee/issues/1190) : at the moment it's a manual process. We need this as a first step to supporting DR, so when we designate a secondary node, this process is triggered automatically.
Install MinIO on a secondary node, and try to replicate a fair amount of data (say, 5 GB, which is the amount of data one of our medium-sized customers has). The goal is to analyze how the experience goes: do we have trouble setting it up, does mc mirror work... Basically, how does it behave.
If the experience is positive, we'll move forward with what @brodock suggested (in this nice summary written by @patricio ).
I'll update the body of the issue with this new PoC.
@patricio has created a merge request for automatically backfilling repositories from one node to another, and is working on it. He will, however, go on holiday next week, so I've asked @stanhu if we can find someone else to work on the rest of the PoC while he's away - so we can move forward.
To complete the 2nd proof of concept, we will wait for @patricio to come back from vacation, unless @brodock can start it. We can't really put other people on this: by the time a new person would be up to speed, @patricio would be back from holiday.
Status:
@patricio is back from holiday. We have finished and shipped the automatic backfill of repositories feature. We can now proceed with the second PoC, which is about testing MinIO on a secondary node and replicating a fair amount of data.
Ok, second part of the PoC has started. Here are my first notes:
In order for mc mirror to work we also need to include the minio client binary, as the client is separate from the server. This adds another dependency to our stack.
mc allows for very easy configuration of different remotes; this can easily be managed by Omnibus for the packages. The configuration file can live anywhere on the server.
My next step is to prepare around 5GB of data to upload to a DO droplet that has my first PoC and sync that to another server running the same PoC. I'll get that working this week.
mc mirror --watch works perfectly. I was able to mirror 7GB of data without any issues, and every new file added was immediately mirrored on the remote. The upload took 2 hours, because my upload speed is kinda bad.
The caveat here is that we need an mc mirror process for each Geo node; the good thing is that the process can be started on the node itself: mc mirror --watch remote/<bucket_name> local/<bucket_name>
The Bad
It is only possible to mirror a single bucket at a time, so unless we put all uploaded artifacts in a single bucket, we will need to start multiple mirror processes on each node, one for each of the buckets where the uploaded artifacts might live.
LFS objects and build artifacts will be tricky to migrate, as we allow users to choose where to store them. This might require a manual migration step by the user, or a separate Minio server for each. (We might be able to get away with just symlinks, but that might raise security concerns.)
I had to use an older version of mc, as the current one has a bug that will not let you start mirroring with the --watch flag.
The Ugly
Orchestrating all of this via Omnibus will be a challenging task
@regisF I believe this marks the second PoC as done, right? Should we continue with a third one, or what are the next steps?
@stanhu, @patricio and I had a meeting two days ago to discuss next steps.
The second PoC is done and successful. It's possible to use MinIO, set it up on a secondary node, and actually replicate data reliably and without problems.
Patricio realized that we can symlink the different folders where we store data (assets, LFS data, ...) into one folder, listen to changes on that folder, and let MinIO watch and replicate it.
mc mirror --watch only works if the file added to the bucket is added via the S3 API, so we still need to modify our codebase.
The first time we run mc mirror to initiate the replication (first full walk), the process can take a lot of memory and file system access.
If the process crashes, we can restart it to avoid a full walk again.
However, we need some kind of journal to make sure we keep track of all the files that have been copied, and those which are not.
So far, we have:
automatic backfilling of repositories with buttons in the Geo UI (shipped in 8.14).
MinIO which we'll use to replicate.
Next steps:
Figure out what we will use as a journal and estimate the amount of work needed to have it.
Change our codebase to use the S3 API.
Create a script for the packaging team so we can bundle MinIO and this solution in Omnibus
mc mirror --watch only works if the file added to the bucket is added via the S3 API, so we still need to modify our codebase.
More specifically, we would need to:
Use CarrierWave AWS-SDK for all attachments
Switch LFS object backend to use S3
I am not sure about the journal approach. The problem we are trying to address is: how do we ensure that the data is consistent in both instances? Right now the only way to do that is to re-run mc mirror and have it do an entire filesystem walk.
This is expensive because of the way mc mirror works (the full streaming walk described above).
I think this might work ok to start, but I doubt it will scale when you have a million GitLab.com repos.
I propose we focus on trying to get the attachments working well with MinIO, since @patricio has done a lot of work there already. Then we can focus on LFS etc. later.
I propose we focus on trying to get the attachments working well with MinIO, since @patricio has done a lot of work there already.
I'm confused by this. I think we already know how to make attachments work well with MinIO. What would remain to be done here? The two steps listed above (CarrierWave AWS-SDK and switching LFS objects to use S3)?
I'm confused by this. I think we already know how to make attachments work well with MinIO. What would remain to be done here? The two steps listed above (CarrierWave AWS-SDK and switching LFS objects to use S3)?
I was framing this in the context of actually having a deliverable feature. Just adding an S3 layer in front of the attachments would have a big impact on our overall system, and there is enough work to do there just to make this work in production:
Ensure a smooth migration path for the existing attachments directory (is it as simple as just pointing the existing directory?)
Bundle MinIO with Omnibus and configure it properly with S3 keys
Enable the mirroring and exchange keys
How would an admin be able to monitor the status of the mirroring?
Figure out how to gracefully handle when MinIO is down (e.g. does this prevent issues from being created?)
How might we do a resync in case MinIO stops receiving S3 notifications?
Could we disable MinIO if we didn't want to add another moving piece here?
I'm sure there are many other issues that will come up as we do this.
@patricio by reading your post I couldn't help but think that we could in fact consider introducing some form of sharding at the Minio level. What I mean by this is that we could have many buckets for uploads, the same way we have many shards for git right now. This would simplify storage management, as it would not be a single massive storage anymore, and would also parallelize the watch/sync process.
@stanhu I don't think we should over optimize the replication feature. I know we need to take into account GitLab.com, but we are building an EE product. For at least 98% of the customers buying Geo DR, having MinIO walk the entire filesystem is not going to be an issue. There might be only a handful of customers that might have millions of files that need to be replicated, and these are most likely huge companies that already have their own DR setup in place.
For GitLab.com in particular, I'm not even sure we should be using MinIO. We should move LFS objects and build artifacts to Amazon S3 anyway, let Amazon handle the replication, and use that to send the files; otherwise we would still need to have some sort of filesystem where all the S3 objects are stored.
1) It really is as simple as pointing to an existing directory. All the setup would be done by Omnibus, there should be nothing that the customer needs to change. Installations from source would have to configure everything manually, but a thoroughly written guide will help them.
2) and 3) These are, to me, the hardest steps.
4) I'm not sure about this yet. The mirror process of Minio shows you a progress bar when you start it, we might be able to fetch information from there.
5) MinIO being down would prevent attachments from being uploaded; the same would go for LFS objects and build artifacts once we move them. Issues could still be created.
6) Simply restart the mirror --watch --force --remove command and MinIO would check which files are on the remote and sync the missing ones.
7) If we change the codebase to use just S3 storage, this would be out of the question. We could make this choice configurable, though: in the gitlab.rb file you choose which storage option you want, :file or :s3, and the code is set up accordingly.
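As an illustration of that configurable choice (the setting name and structure are made up, not a final gitlab.rb key):

```ruby
# Hypothetical sketch: pick the CarrierWave storage backend from a setting
# such as gitlab_rails['uploads_storage'] = 's3' in gitlab.rb.
class GitlabUploader < CarrierWave::Uploader::Base
  if Gitlab.config.uploads.storage == 's3' # hypothetical setting
    storage :aws   # provided by the carrierwave-aws gem
  else
    storage :file  # current behaviour: write to the local filesystem
  end
end
```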
@pcarranza we could introduce sharding and different buckets, but this would mean more moving parts and more processes to juggle with Omnibus.
4) Can we do this on another iteration? We don't need this to have an MVP.
5) Does that mean we need to have a special process that monitors MinIO and constantly makes sure it's available (supervisord or something)?
I do agree that the use case of GitLab.com is rather unique. Is it realistic to do this as a first iteration and move slowly towards making it possible to handle millions of files as we go?
So to me the next steps are:
Change the codebase so attachments are saved through CarrierWave AWS-SDK
Bundle MinIO with Omnibus and configure it properly with S3 keys
Enable the mirroring and exchange keys
Install all of this somewhere and make it work together.
Then in another iteration, we'd handle the case where MinIO crashes and needs to be monitored etc...
@patricio @regisF That makes sense. I do think it's important to keep monitoring in mind because we need to validate that this thing is actually working.
I do agree that the use case of GitLab.com is rather unique. Is it realistic to do this as a first iteration and move slowly towards making it possible to handle millions of files as we go?
I have to say that I am worried about this statement. GitLab.com has been great so far at keeping us ahead of our customers' problems. Now we are building something we expect our customers to pay extra for, and we are not going to use it at large scale? Same as Geo: we don't use it, and we have our customers report issues to us. Any customer that really needs this feature is large enough for us to call them near GitLab.com scale.
This is the moment to decide the path of the feature. If we wait for us to hit a major scaling issue we might need to redo the whole architecture.
Then in another iteration, we'd handle the case where MinIO crashes and needs to be monitored etc...
Monitoring should not be an afterthought! For a feature that is called Disaster Recovery, monitoring is as important as how you are syncing the files.
How do you know that your disaster recovery is working unless you have a way to look at what is happening in the systems?
I concur about thinking big. We absolutely do need to consider the case of customers with millions of files. Re: using a single directory, the more files there are in a directory the longer it takes to access it. In my experience, one sees this start to take place with more than a few thousand files. We certainly have a HUGE GitLab installation, but our customers will be getting there as time marches on. If we can solve this problem now we will not have to re-implement a solution in the future. The big items from my perspective are that Geo must do the following:
Install from Omnibus with little configuration
Work regardless of the amount of data a customer has.
Certainly, the speed of copying data depends on the amount of data and speed of the link.
As far as implementation details are concerned, I'm happy to weigh in but at this point I'll leave that to those who understand the inner workings of everything more than I do.
I know we need to take into account GitLab.com, but we are building an EE product. For at least 98% of the customers buying Geo DR, having MinIO walk the entire filesystem is not going to be an issue. There might be only a handful of customers with millions of files that need to be replicated, and these are most likely huge companies that already have their own DR setup in place.
With this line of thinking you are basically telling me not to run this in production, which goes against the whole idea of building the feature this way. This has to work at GitLab.com scale, or we are never going to test this feature and it will be a toy feature that does not deliver what we already agreed on.
I completely agree with what @marin is stating and please, as @xyzzy is saying, think big. We need to think of scale right up front. We can't deal with an O(n) process when n >= 1,600,000 repositories; it just doesn't make sense, and no tweak will make it work. You will need to implement the same feature twice.
Thank you for your input @marin and @xyzzy. I understand what you are saying. We need to rethink some parts of the implementation, then.
Since the changes needed for us to add MinIO are also required to be done in CE, I propose the following strategy:
Change the codebase to allow the use of MinIO as the only storage option for the attachments and submit proper MR.
Change Omnibus to install MinIO, properly configure it, and monitor it. Submit MR.
Merge both MRs once we have tested a built package.
The steps above would constitute the bare-minimum MVP. We can then do:
Change the codebase to allow LFS objects and Build artifacts to be uploaded via an S3 API, but make it configurable. Give the users the choice of using :file or :s3 for storage. Submit MR.
Change Omnibus to handle these changes in the gitlab.rb file, and submit MR.
Merge both instances once a package has been tested.
After this has been properly tested in CE, we can then move to EE and the changes needed for DR, which would need to be:
Implement the mirroring process for the S3 objects. (See if we need to do our own journaling, maybe work with the MinIO guys to re-implement their mirroring mechanism to make it scale better, monitor everything properly, etc.).
Make any changes needed in the Geo codebase for the mirroring parts.
Make the needed changes in Omnibus to properly handle the mirroring processes and monitor them.
Test everything together. Release.
@stanhu what do you think about the steps I described?
We shouldn't get stuck here. Of course, monitoring is essential, and scaling considerations are as important. But to test the concept with a relatively small amount of data, we don't need it per se.
Moreover, we also have customers with a lot less data than us who could take advantage of this right away (once monitoring is done). We can either wait to create the perfect setup, or move forward with this and take lessons along the way.
I would argue that it's better to ship something smaller, in alpha and try it for real, rather than get stuck.
Statements like "Work regardless of the amount of data a customer has" should be taken with caution. They can paralyze us completely, because so far I've not seen a proposal on how to address millions of files right from the start. I think this is what we always want to achieve at GitLab: we have a big vision, but we are going there one step at a time.
Despite how big or small the user base will be, betting on using an S3 API compatible service to store the kind of files we want to replicate (Geo) and protect (DR), looks to be the right path.
MinIO can be the initial solution for CE and to get this out so we can start testing. In the meantime for gitlab.com we can always use a different S3 service to solve our problems, until we can provide something for our clients.
I think it's fine to ship MinIO without monitoring if we ship it as experimental support, but that means we will need to revive that :filesystem / :s3 switch we discussed earlier, so we don't disrupt existing usage, nor introduce a big migration requirement.
No one is claiming that we need a perfect setup. Having a setup that has a possibility of hurting us badly down the road because we neglected to consider the scale is what we should be wary of. We have shot ourselves in the face a couple of times already and this is hurting us on GitLab.com for the same reasons. All I am asking is not to repeat those lessons and to take scale into consideration very early on, because later on it will be a hell of a lot harder to fix.
In the meantime for gitlab.com we can always use a different S3 service to solve our problems, until we can provide something for our clients.
This is the line of thinking that I am trying to warn about. If we end up using something different than our customers, we will end up supporting two things and will need more people to maintain two solutions.
I think it's fine to ship MinIO without monitoring if we ship it as experimental support.
How will you know what went wrong in the experimental support? Wait for our customers to lose data and report the loss so we can go and investigate?
Can we call it "alpha" or something like that? My idea is to ship something that users can try in a testing environment, not in production. To deliver DR to production it will take a few iterations. The early we have people experimenting it, the early we get feedback.
@brodock I'd agree -- but we need to warn people accordingly and we need to make sure that that implementation is at least headed in an acceptable direction. Some customers are likely to test what we are testing and then want to move forward with it. I've experienced customers putting alpha and beta product into production -- only to later pay an unfortunate price.
@brodock @xyzzy we can call it alpha, but at first we don't even have to publicize it. We can simply ask 1-2 customers who really wanted DR to help us test it, and iterate until it's ready for production. I think this is the best strategy to make it a good product.
need to add events to each action for both ("mc mirror --watch only works if the file added to the bucket is added via the S3 API, so we still need to modify our codebase.")
for LFS we can do a git pull with LFS
GitLab Geo will not support Git Annex
S3:
more work than rsync (change all uploaders and LFS to use s3)
bundle Minio in packages
convert storage to Minio in packages
sync doesn't scale (full scan, one process, one network connection)
need to add monitoring for Minio
force customers to use Minio (this is not a boring solution)
force LFS to use s3 (not sure if that is easy)
force GitLab.com to use Minio or Ceph to Ceph sync (which we don't want)
bifurcation of sync mechanisms (cloud, minio, ceph, massive rsync since it is still file based)
different from our git pull sync mechanism (if we use rsync we can add a sync field to repos)
Rsync:
need to add a last synced field to the db tables of 8 types of files (see the migration sketch below)
need to create a CI test that checks that every db filename field also has a sync field (or just make sure the total count in schema.rb is the same; this is so we don't forget to add sync when we add new file types)
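For illustration, the "last synced field" per table could be a migration as small as this (table and column names are made up):

```ruby
# A sketch of the "last synced field" idea; table and column names are illustrative.
class AddLastSyncedAtToUploads < ActiveRecord::Migration
  def change
    # Repeat for each of the ~8 file-backed tables.
    add_column :uploads, :last_synced_at, :datetime
    add_index  :uploads, :last_synced_at
  end
end
```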
more work than rsync (change all uploaders and LFS to use s3)
I don't think adding MinIO would be more work than rsync. Changes to the codebase are small, the bulk of the work needs to happen in Omnibus, so in that regard it's some extra work there.
convert storage to Minio in packages
There is no need to "convert" the storage. MinIO works transparently over the existing filesystem. It just needs the proper configuration, which Omnibus will take care of.
sync doesn't scale (full scan, one process, one network connection)
We can split it up into multiple processes, and we could even split it up to use more than one connection, if the server has more than one NIC. The full scan is where we would hit scalability problems. I think we could work with MinIO to solve this. (I haven't heard suggestions so far as to how we should solve this, just that we should solve it).
force customers to use Minio (this is not a boring solution)
No need to force the customers. We can make it configurable with very little effort. (Need to add tests for both configurations, though)
force LFS to use s3 (not sure if that is easy)
The changes to support LFS on S3 are quite minimal, actually. So it's very easy to do so.
force GitLab.com to use Minio or Ceph to Ceph sync (which we don't want)
@stanhu mentioned we could use Amazon S3 instead of MinIO for GitLab.com and leverage the changes being done to support S3. Although this would mean not using MinIO for replication, which would go against "drinking our own wine".
bifurcation of sync mechanisms (cloud, minio, ceph, massive rsync since it is still file based)
I don't quite understand this point. Could you please elaborate? Which sync mechanisms are we using right now? (This would be for .com only, our customers are not using ceph).
different from our git pull sync mechanism (if we use rsync we can add a sync field to repos)
Can you elaborate? By "if we use rsync we can add a sync field to repos" do you mean using rsync for everything, even repos, and no longer using git to sync them?
Regarding rsync.
I'm not that familiar with it, so I don't think it's wise for me to make assumptions as to how it would/should work.
@pcarranza @northrup can you please comment on how we could get DR with rsync?
The choice to move forward with MinIO was mainly motivated by the scaling issues we thought rsync would hit, as well as the creation of custom code to maintain it.
None of the solutions (either S3 or rsync) are boring solutions: they both have their significant pros and cons, and both require efforts.
Now, MinIO would also require additional components to make sure we address monitoring and scaling issues. Here are some ideas about it:
can we shard the attachments folders by month, making replication easier?
can we catch earlier if minio is down (regular pings)?
catch in the backend that MinIO is down before uploading (see the sketch after this list)
Showing a message in the UI to the user in case MinIO is down
Add Prometheus metrics endpoints for MinIO to measure activity
To make sure we don't kill production, at first, put a setting in place so we can disable going through MinIO entirely in case it doesn't handle our volume. Remove the setting after we are sure it's working.
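For the "catch in the backend that MinIO is down before uploading" idea, a first pass could be as simple as checking that the S3 endpoint answers over HTTP. A sketch only; the endpoint URL and helper name are illustrative:

```ruby
require 'net/http'
require 'uri'

# Returns true if the local object storage endpoint responds at all.
def object_storage_reachable?(endpoint = 'http://127.0.0.1:9000', timeout = 2)
  uri = URI(endpoint)
  Net::HTTP.start(uri.host, uri.port, open_timeout: timeout, read_timeout: timeout) do |http|
    http.head('/') # any response means the service is up
  end
  true
rescue StandardError
  false
end

# In the upload path, before handing the file to CarrierWave:
#   render_503 unless object_storage_reachable?
```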
rsync is expensive (delta calculation even if per file, is big enough to kill us at our scale)
we don't have a reliable job system (Sidekiq CAN lose data), so we can't rely on Sidekiq to initiate replication of files, as a lost job means a lost file
we can't guarantee protection against file rot with rsync on a file-by-file basis, unless we regularly rsync all our content (which does not work at our scale)
A few things in favor of MinIO:
it speaks the S3-compatible protocol, which means that if at some point we hit a dead end, we can always swap "minio" for "something else" and keep compatibility
because it's S3, and S3 is a "boring solution", it enables scaling vectors for our customers even if they are not using DR (disk growth can be tackled by not using a single filesystem for everything, so introducing S3 for attachments means attachments can live on a different machine just by changing credentials)
MinIO is written in Go and has some extension points (there is a good chance that if we require something custom, we could fork and implement it)
A few things against MinIO (but not in favor of rsync):
MinIO may not be the ultimate solution for the scale we are in
It is not as well-established a solution as rsync or older S3-compatible stores like Ceph
We have constraints that are hard to meet:
Easy to bundle
Easy to run
Easy to deliver to our clients
Require minimal to no user intervention
Handle from small to big clients
Run in any filesystem or not be specific file system dependent
Possibly multi platform or platform agnostic
This is like finding the holy grail of storage solutions; it's something super hard that we will probably never achieve, so we need to accept tradeoffs.
I will try to list a few business decisions we can make and their possible implications:
If we decide to use something "big enough for crazy scale"
Will not be small enough to run in minimal infra
Will be hard to support small clients
Will require techies in-house with specific knowledge
It's a huge challenge for the Build Team (probably not going to fit well with the Omnibus approach because it may have its own install/scaling solution)
If we want something that doesn't require installing additional software (use only standard linux tools: scp, rsync, git? etc).
We will need a reliable job system (which means not using sidekiq for this)
We will need to implement a lot of bookkeeping and self-healing stuff around that, which will be custom to GitLab only
We will not be providing something "battle tested" to our customers which means bugs / crazy usage patterns will probably hit them hard and we will be firefighting for a long time
Not as "boring solution" as it looks like, more like reinventing a squared wheel
If we go with MinIO
May not handle crazy growth
It's written in Go, and the code is not super complex, which means we can fork and implement features
Has extensibility points for either integrating with external solutions, or places where forking and adding code will be easy to maintain
It's not a huge lock-in, as we are using the S3 protocol, which means we can switch to another solution without a big rewrite
One idea that came to mind:
As we are betting a lot on OpenShift, can we move to: "Use MinIO until it doesn't scale anymore, then move to OpenShift"?
So MinIO can be a path for small to medium installs, and OpenShift's own S3 solution for huge clients.
It fits in the Idea to Production demo vision, it helps selling GitLab + OpenShift, and it gives us a sane direction. We can start working with MinIO as a prototype / simple solution but get part of the work done for the "big solution".
Now, MinIO would also require additional components to make sure we address monitoring and scaling issues. Here are some ideas about it:
can we shard the attachments folders by month, making replication easier?
can we catch earlier if minio is down (regular pings)?
catch in the backend that MinIO is down before uploading
Showing a message in the UI to the user in case MinIO is down
Add Prometheus for MinIO to measure activity, metric endpoints
To make sure we don't kill production, at first, put a setting in place so we can disable going through MinIO entirely in case it doesn't handle our volume. Remove the setting after we are sure it's working.
@regisF I would say that the listed steps are not exclusive to MinIO but apply to any additional service we use (even AWS S3 requires you to "shard" the top-level folders, otherwise you will have serious performance problems because of locality in the cluster).
The reason I considered S3/Minio at all is that we would be putting gitlab.com's files on S3. This has worked well for Docker Registry uploads on gitlab.com (@pcarranza correct me if I am wrong). It is also Boring.
My thinking was: if you want Geo, then bring your own S3 (e.g. Amazon S3), just like you need to bring your own NFS server. Everybody else (everybody who is not using Geo) can use Minio which does not have replication.
If we abandon this split (meaning: people can have Geo with Minio) then we end up doing the same hard work (reliable replication code that scales) we would have to do with Rsync and I don't think striving for S3 support makes as much sense.
My thinking was: if you want Geo, then bring your own S3 (e.g. Amazon S3), just like you need to bring your own NFS server. Everybody else (everybody who is not using Geo) can use Minio which does not have replication.
Right. It sounds to me that based on customer feedback most users can't use Amazon S3. Most people also do not have another S3 solution, let alone one that does replication.
For GitLab.com, we already use Amazon S3 for Docker images, Omnibus packages, and other things. As I said before, I think we would be okay with moving GitLab.com to use Amazon S3 for LFS/attachments/etc. and using Amazon S3 replication for DR. Although this runs the risk of Cloud Jail, this would be a straightforward approach since we wouldn't have to implement replication. However, many of our on-premise customers can't use this, so we're back to providing our own replication solution. I think the paradox here is that what might work for GitLab.com won't work for other customers.
If we abandon this split (meaning: people can have Geo with Minio) then we end up doing the same hard work (reliable replication code that scales) we would have to do with Rsync and I don't think striving for S3 support makes as much sense.
Right. What about your previous concerns about tracking all the different files without a central interface such as S3?
Thanks for the comments everyone. Using AWS S3 for GitLab.com changes things. With that we have the following situations:
GitLab.com points carrierwave to aws s3 (we can do that today).
Non-Geo customers keep pointing carrierwave to the filesystem since they don't need replication.
Geo customers will likely have problems scaling Minio and we don't have experience with it.
Why are we implementing Minio in Omnibus? In the first two situations you don't need it and in the last one we probably can't support it. I'm probably missing something but I wanted to write down my thinking.
Why are we implementing Minio in Omnibus? In the first two situations you don't need it and in the last one we probably can't support it.
@sytses the idea was to avoid having if/else code in all our carrierwave Rails uploaders. With Minio available the Rails uploaders have just one code path: "S3". My goal was: fewer bugs because of fewer if/else branches.
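To illustrate the kind of branching this avoids (a sketch; the setting name is made up):

```ruby
# What we want to avoid: every CarrierWave uploader branching on a storage setting.
# `Gitlab.config.uploads.storage` is an illustrative setting name, not an existing one.
class AttachmentUploader < CarrierWave::Uploader::Base
  if Gitlab.config.uploads.storage == 's3'
    storage :fog   # S3-compatible backend (MinIO, Amazon S3, ...)
  else
    storage :file  # current on-disk behaviour
  end
end

# With MinIO bundled and always available, the same uploader keeps a single code path:
#   storage :fog
```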
I think the paradox here is that what might work for GitLab.com won't work for other customers.
Yes. So who do we put first? I suspect the answer is 'customers'. And then 'S3 on gitlab.com' goes out the window.
What about your previous concerns about tracking all the different files without a central interface such as S3?
@stanhu This part is just hard. But to do file sync in a way that scales we need a database to track state. Minio uses the filesystem as its database. That is not feasible once you have a lot of files and a slow filesystem (NFS). So I think we would have to add code to Minio to make its sync work at scale. Compared to that doing it in gitlab-rails is probably more effective (higher pace of development, more engineers who write Ruby than Go on our team).
@jacobvosmaer-gitlab I assumed that it was easy to point carrierwave to s3 instead of files and that it could be done with minimal configuration. config.storage = :file or config.storage = :s3. Is that not the case?
Let me remind you all about @brodock's comment above. It gave some reasons why going with MinIO would be a good first step.
Moreover, it also gives us S3 support for GitLab.com.
My concern is that we get stuck in trying to provide the most scalable solution for v1. We know MinIO is not ideal, but isn't it good enough? Two of our big customers have 7 and 10 GB of data respectively. I don't know about the others, but I bet they are not close to having what we have. For those ones, this v1 would be enough, until we find something more scalable. And for us, can't we use S3?
Then, if we make this configurable (so that you can say in the gitlab.rb file: storage = :file or storage = :s3), we would need to add a lot of branching code depending on the configuration, which can lead to bugs, as @jacobvosmaer-gitlab points out. Hence why MinIO was suggested in the first place. We can just go with one configuration, :s3, and have MinIO handle the communication between the S3 API and the underlying filesystem.
If we use MinIO, then yes, there are minimal changes to the codebase required and no configuration (handled by Omnibus) or migration (none needed) for the user. Users, and by extension us, will be locked to the S3 storage API, but this will allow us (GitLab.com) and our customers to swap MinIO for another S3-compatible storage, like Amazon S3, or OpenShift (as @brodock points out).
So the question now is, which approach do we choose?
Do we take the jump and add MinIO?
Pros
Swapping MinIO for another S3 provider is trivial
mc mirror works great at small and medium scale.
Required changes to get uploaders to work with S3 are small
Cons
Does not work at large scale.
Adds another service that needs to be properly monitored.
Will force GitLab.com to use Amazon S3, in order to replicate reliably.
Or, do we implement our own replication solution based on rsync?
Pros
We can make it work at GitLab.com scale.
We will be drinking our own wine, which will make bugs easier to spot.
??
Cons
Reaching an MVP will be a lot more difficult
Initial development will be slow, as the learning curve to get this working is steeper.
Could mean re-inventing the wheel for our own use case.
??
Please, let's reach a consensus on what to do next. I really want to get my hands dirty here
Will force GitLab.com to use Amazon S3, in order to replicate reliably.
I don't see why it would ever force us to do that, we can easily swap MinIO with Ceph as it is S3 compatible.
More things to consider about "Or, do we implement our own replication solution based on rsync?": we will still need a shared POSIX filesystem among all the workers, which will make it harder to scale (add it to the cons).
I think that detaching ourselves from the need for a POSIX filesystem will open the door to a lot of interesting things, both for us and for our customers, simplifying the storage scaling issues quite a lot.
mc mirror works great at small and medium scale.
The only issue I see with the S3 protocol approach (MinIO) is that it's a dependency to consider.
Separately I think we are trying to rely on the provided mc mirror tool for the sync and it will just not work at large scale.
How do I picture this system
I think that this is a point that was discussed in the original meeting: we can't rely on such general-purpose mirror tools to do the sync for us. We need to drink our own wine here and keep the state of the sync, avoiding at all costs going down the path of calculating the diff, whether that happens with mc mirror, rsync, or even at the database level, simply because it would be a lot of processing that needs to happen every time we want to get back in sync.
We need to think of the sync as a stream of events, much like what MySQL does with the binary log: by recording these events we will always know what we need to sync just by checking which event the secondary node is at versus what the primary node has now.
This could simplify monitoring, because just by exposing how far we are from being in sync with the primary we roughly know what the delay is, and we can track whether this number is moving forward or not. We could even graph this as a rate to see how many files we are ingesting per second.
Then the initial import would be to generate the list of events for every file existing in the application; with only that we can trigger the sync process.
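A rough sketch of that event log, just to make the idea concrete (none of this exists today; model and column names are made up):

```ruby
# On the primary: an append-only log of file changes.
class FileSyncEvent < ActiveRecord::Base
  # columns: id, file_type, file_id, action ('created'/'removed'), created_at
end

# On the secondary: we only remember the last event id we processed.
class SyncCursor < ActiveRecord::Base
  # columns: id, last_event_id
end

def replicate(event)
  # placeholder: fetch the file from the primary (or delete it) based on event.action
end

# Catching up never requires a full diff or disk scan; lag = newest event id - last_event_id.
def catch_up(cursor)
  FileSyncEvent.where('id > ?', cursor.last_event_id).find_each do |event|
    replicate(event)
    cursor.update!(last_event_id: event.id)
  end
end
```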
All uploaded files already go through CarrierWave, but some of them are not saved to the DB after being uploaded. The most notorious one is the FileUploader. This uploader is used across all Issue and MergeRequest descriptions, and also Comments (note.rb; the funny thing is that Note has an attachment field where uploads were once recorded, but this field is not used anymore).
I think the best way to change this code going forward is to create a new model for the FileUploader where we keep track of the uploaded files. This will require a rewrite of this uploader that also needs to be backwards compatible. Based on preliminary investigations, the changes seem straightforward.
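Roughly what I have in mind for that model (a sketch; the model name, columns and callback wiring are all hypothetical):

```ruby
# Hypothetical tracking model for files stored by FileUploader.
class Upload < ActiveRecord::Base
  # columns: id, model_type, model_id, uploader, path, size, created_at
  belongs_to :model, polymorphic: true
end

# The rewritten FileUploader would record a row after each store:
class FileUploader < CarrierWave::Uploader::Base
  after :store, :track_upload

  def track_upload(_file)
    Upload.create!(model: model, uploader: self.class.name, path: path, size: file.size)
  end
end
```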
Tracking system
Most likely it will have to be a different PostgreSQL process
Rails can handle multiple DBs without issues. We can create a base class with the connection configuration and inherit/include from that, e.g.
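(Class names and the geo_* database.yml entry below are illustrative, not existing code.)

```ruby
module Geo
  class TrackingBase < ActiveRecord::Base
    self.abstract_class = true
    # A separate 'geo_<env>' entry in database.yml points at the tracking database.
    establish_connection Rails.application.config.database_configuration["geo_#{Rails.env}"]
  end

  # Everything that lives in the secondary's tracking DB inherits from TrackingBase
  # instead of ActiveRecord::Base.
  class FileRegistry < TrackingBase
  end
end
```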
I envision a different table for each type of attachment with the same fields as the origin table; this way we can easily fetch objects from the source DB, copy them to the secondary, and save them on the target DB without much hassle. Once the object has been saved to the target DB (after the file has been copied), it counts as successfully replicated.
Transfer of files
Since we already add SSH keys to both primary and secondary, respectively, we can leverage rsync to transfer the files. One rsync process per file.
Scheduling Transfers
I think it could work like this (a code sketch follows below):
Initial backfill:
We kick off 2 processes for each type of attachment, everything on the secondary.
One process goes from newest to oldest, the other from oldest to newest.
We check if the object we are accessing exists in the tracking DB, if not, schedule a copy.
Both processes should meet around the middle. We can let them continue towards the edges to make sure all objects are copied over.
If a process stops, it should resume from the last file that was copied over and continue in the same order.
Regular task
One task per attachment type, all on secondary.
Check every hour for new files. We have a timestamp of the last file that was copied over, so just check from there on.
If new files are found, schedule copy task.
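Putting the backfill and the hourly check together, this is roughly what I mean. Upload, Geo::FileRegistry and the transfer call are placeholders, not existing code; everything runs on the secondary:

```ruby
def transfer(upload)
  # placeholder: copy the file for this record from the primary to the local filesystem
end

def backfill(scope)
  scope.each do |upload|                 # in practice this would be batched
    next if Geo::FileRegistry.exists?(file_id: upload.id)

    transfer(upload)                     # or enqueue a copy job
    Geo::FileRegistry.create!(file_id: upload.id, synced_at: Time.now)
  end
end

# Initial backfill: two workers meeting in the middle.
#   backfill(Upload.order(id: :desc))    # newest -> oldest
#   backfill(Upload.order(id: :asc))     # oldest -> newest

# Hourly task: only look at records newer than the last file we copied.
def sync_new_files
  last_synced = Geo::FileRegistry.maximum(:synced_at) || Time.at(0)
  backfill(Upload.where('created_at > ?', last_synced))
end
```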
Migrate old attachments
This is the trickiest part, in my opinion. If we want to migrate old data to the new model, then we need to do this task on the primary, as the secondary has no write rights to the DB. This can be an expensive task, as we would need to parse markdown data, extrapolate all the needed information from there, and save a record to the database.
Depending on how critical this information is, I was thinking we could do it lazily. By this I mean, once an Issue/MergeRequest/Note is accessed, we kick off a Sidekiq job to parse the information contained in this particular record being accessed. This approach would mean that only actively used data will be migrated. Stale data will not be accessed that often/at all. If this is not ideal, we could start background processes to go through every single entry in DB and migrate accordingly.
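A sketch of that lazy migration (the worker, the Upload model and the markdown fields are hypothetical):

```ruby
class MigrateRecordUploadsWorker
  include Sidekiq::Worker

  def perform(model_class, model_id)
    record = model_class.constantize.find_by(id: model_id)
    return unless record

    markdown = record.try(:description) || record.try(:note) # field depends on the model
    # Record every /uploads/... link found in the markdown in the new tracking model.
    markdown.to_s.scan(%r{/uploads/\S+}) do |path|
      Upload.find_or_create_by!(model: record, path: path)
    end
  end
end

# Kicked off when an Issue/MergeRequest/Note is viewed:
#   MigrateRecordUploadsWorker.perform_async(record.class.name, record.id)
```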
Another approach could be not to migrate the old attachments, but leave them as they are. We can let the secondary walk the DB for markdown entries with files and copy the files over. We can keep track of the data copied in the tracking DB (I'm not sure how accurate this would be, as there is no initial record of all existing files, so we cannot compare DB entries to make sure all files are copied, but it's an idea.)
These are just ideas on how the implementation could look. I'm sure some will change as we start implementing them. I'd love to hear what you guys think, @jacobvosmaer-gitlab @marin @brodock @stanhu
@regisF et al. I have new information about one of our largest customers. Their database is about 10GB - their total data on disk is almost 8TB! Big difference. Sorry for the misleading information.
We kick off 2 processes for each type of attachment, everything on the secondary.
Why do we need to do this? For performance?
This is the trickiest part, in my opinion. If we want to migrate old data to the new model, then we need to do this task on the primary, as the secondary has no write rights to the DB. This can be an expensive task, as we would need to parse markdown data, extrapolate all the needed information from there, and save a record to the database.
It doesn’t make sense to parse all Markdown to find the files on disk. It makes sense to just walk the filesystem and insert these entries directly in bulk.
Since we already add SSH keys to both primary and secondary, respectively, we can leverage rsync to transfer the files. One rsync process per file.
Will the user have rsync privileges to copy the files in the right place? I think our concern was that this Geo user would have to be able to copy files directly into a folder owned by user git or some other user.
It makes mapping, handling, and saving of objects simpler, and more readable. Besides it's just a database, we can add as much data as we need without having to worry about performance too much.
Walking the file tree is very expensive, especially since that would have to be done on the primary. It can bring the whole server down.
If you see just below that comment, I suggest starting 2 processes for redundancy. If we walk the entries from both sides, it is less likely that the process might miss one.
It doesn’t make sense to parse all Markdown to find the files on disk. It makes sense to just walk the filesystem and insert these entries directly in bulk.
See my comment to Regis above. I was under the impression that walking the filesystem is a very expensive operation. If it's not, then sure, we can go that way.
Will the user have rsync privileges to copy the files in the right place? I think our concern was that this Geo user would have to be able to copy files directly into a folder owned by user git or some other user.
@patricio "If you see just below that comment, I suggest starting 2 processes for redundancy. If we walk the entries from both sides, it is less likely that the process might miss one." => I think one process is much more likely to work that 2 process. One thing will be easier to get right.
Re using Rsync: this requires shell access (SSH) for the user on the receiving end. This is similar to how when you do git push via SSH it starts git receive-pack via SSH on the receiving end.
There are two ways to do this. Full SSH access (not what we want here) or command checks. For Git SSH access to the GitLab server we use command checks, this what the bin/gitlab-shell executable is for.
So we would have to add a new type of allowed command to gitlab-shell, and configure the receiving end of an rsync 'push' to allow just the right rsync commands.
(For completeness, there exists another Rsync transport mechanism called rsyncd (daemon) but it is not something we want to use because it has very weak security, forcing you to implement network-level security on top.)
Do we need rsync?
For day-to-day sync, we would be working on the level of individual files. These can be transferred with HTTP POST/PUT requests. We know already if a file needs to be transferred or not because we track this in SQL. I think rsync adds little value in this scenario.
For backfill we probably want to work in batches. But here we could also have our own HTTP-based system. For instance, select up to 10MB of files to be transferred, spin up tar to pack them together into one blob, stream the blob as an HTTP request body, and unpack on the other end.
Rsync is useful if you want to combine the job of finding all files in a tree, and copying only the files you need to copy to the other end. But in our case we don't need to find the files (they are listed in sql) and we know (from sql) what files need to be copied or not.
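To make the batched idea concrete, a sketch (the endpoint, upload directory and helper name are made up):

```ruby
require 'net/http'
require 'open3'
require 'uri'

BATCH_LIMIT = 10 * 1024 * 1024 # the caller picks files until their total size reaches ~10MB

def push_batch(paths, uri)
  # Pack the selected files into a single tar blob on stdout...
  tarball, status = Open3.capture2('tar', '-C', '/var/opt/gitlab/gitlab-rails/uploads',
                                   '-cf', '-', *paths)
  raise 'tar failed' unless status.success?

  # ...and stream it as the body of one POST; the receiving end untars it into place.
  request = Net::HTTP::Post.new(uri)
  request.body = tarball
  request['Content-Type'] = 'application/x-tar'
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') { |http| http.request(request) }
end

# push_batch(%w[group/project/avatar.png], URI('https://secondary.example.com/geo/receive'))
```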
I think one process is much more likely to work than 2 processes. One thing will be easier to get right.
True, but, in essence, it will be the exact same process, just with a different start order. If we want to avoid complexity, and just run one process, I understand, but I disagree with "one process is much more likely to work, than 2". It's just division of labor, kind of like Divide and Conquer.
For backfill we probably want to work in batches. But here we could also have our own HTTP-based system. For instance, select up to 10MB of files to be transferred, spin up tar to pack them together into one blob, stream the blob as an HTTP request body, and unpack on the other end.
Good point about file limits. We'd have to make sure HTTP streaming works properly with files > 20 MB.
I was also thinking about LFS. I think you mentioned that git pull only pulls in the latest LFS objects for the current commits and so we might have to walk through the lfs_objects table. But I believe git lfs fetch --all does the trick?
Can't we use scp instead of rsync for this, and not pay the delta calculation cost? It can even have slightly faster performance if we use compression at the SSH level with minimal CPU cost (need to test to see if it improves binary transfers).
@brodock The issue here is that we do NOT have permissions to use SSH to copy files with the git user. No user should be able to access the filesystem via ssh git@gitlab.com, for example. gitlab-shell limits access. As @jacobvosmaer-gitlab mentioned:
So we would have to add a new type of allowed command to gitlab-shell, and configure the receiving end of an rsync 'push' to allow just the right rsync commands.
In any case, rsync has switches (e.g. --whole-file, --size-only) that can disable delta calculation and just transfer the entire file based on size/timestamp differences. Let's not get stuck on that.
I'm not sure what types of attachments need to be tracked or why this benefits us. Tracking files that need to be copied, sure. But the file transfer methods we are discussing don't really change depending on the file type. What do we need to track and why?
@brodock the whole reason to use rsync instead of scp is that copying data can be (and often is) more expensive in terms of resources than simply seeing if it needs to be copied. rsync can be pretty darned efficient at this. By default it checks the file size and mtime and compares the data with the receiving end. With tens or hundreds of thousands of files this can take resources. It is, however, far more efficient to do that than to simply transfer all of the files. I'm not sure which switches we are proposing to use but the delta calculations only come into play if a file is found to be different on the receiving end. At that point the tool looks at which portions need to be transferred. In my experience rsync is pretty darned efficient at all of this.
I have not heard the final word on this but I don't see how it could work other than walking the entire commit history looking for commits that reference LFS objects. This is not a feasible way to do sync (even if you'd have 3 new commits you would be walking the entire repo).
This is what I expect from my current understanding of how LFS works. I think someone needs to read the source of the git lfs fetch --all command to make sure.
We should not use rsync because of what I wrote earlier:
Rsync is useful if you want to combine the job of finding all files in a tree, and copying only the files you need to copy to the other end. But in our case we don't need to find the files (they are listed in sql) and we know (from sql) what files need to be copied or not.
I think we should use SSH less, not more than we do now. It is a great tool (I love it) but from a deployment perspective it is a hassle. It can require changes to PAM settings, sometimes even elevating the privilege of the application user 'git' to that of a human user. It limits your choices of load balancer. SSH makes GitLab a special beast instead of 'just another web application', which is a bad thing for the adoption of our product.
hey folks, I wondered if you'd looked at Apache CouchDB to support this? It supports binary attachments within JSON documents, & probably the best mesh sync protocol available today. It speaks native HTTP(S), supports etags and plays very well with HTTP based proxies and caches as a result. It's written in Erlang (same as Riak, LeoFS see a pattern here... ). BTW I'm one of the core devs so slightly biased, but if you're interested ping me on email or during weekdays on IRC (dch@freenode/#couchdb ) for a chat. I'm also reasonably familiar with the pitfalls of distributed sync systems which it looks like you're just getting started out with.
I could see couchdb handling most of the non-git stuff pretty well, e.g.:
Any on-disk stored file: issue attachments, merge request attachments, user avatars, group avatars, project logos, CI build logs, CI artifacts, and GitLab Pages assets (the CI artifacts that result from a pages build, which are the actual .html, .css, .js and images that will become the webpage).
For customers who don't want to host it themselves there's Cloudant ( who provide a very robust geo-scaled SaaS around it ) now owned / run by IBM.
@DouweM can you assign a new developer to work on this with Robert?
There is also another question: in case the primary server crashes, how will we activate the secondary node to become primary, and let other nodes be aware of that?
I assume this will be done through the CLI somehow?
@regisF we had a call with @stanhu @marin @rspeicher @brodock last Friday and we decided that even before we start worrying about attachments, we need to be able to ensure that repositories are properly synced. So, the immediate next steps are:
@dbalexandre works on backfilling repositories (not attachments);
@regisF @rspeicher I have to agree with @stanhu here: backfilling can happen later and we have bigger bones to chew first.
I would like to stress the need to simplify the filesystem crawling and make it dead simple (no rake task), because this will need to be executed by all our customers.
GitLab.com is at 19TB already, growing healthily and steadily, so the sooner we stop adding things without any form of control, the better. On top of this, do not make any customer handhold an import manually, because it will just not cut it.
I'm confused based on @dbalexandre's and @pcarranza's updates. Douglas says backfill is the most important, after the call, while Pablo says it's the least important. Can you please share the current priority and the progress? A customer+ is extremely interested in Geo for DR and is confused by this conversation also.
@dblessing I believe part of the misunderstanding is because we are using terms that are not 100% precise. The backfill feature has a lot of different concerns, which we should be more clear and specific about when we refer to them:
first run
heal from backup/restore of a secondary node
heal from corrupted / incomplete sync
heal a previously good/sync repository but now with corrupted data (hardware failures, etc)
We want to make sure we have all the repos on the secondary first. We are trying to tackle the first run and heal from failed notification (with the help of a tracking database).
With the tracking database, we will be able to do that without needing to crawl the disk in a "full-scan"
The goal is to ship something in alpha for 9.0, with most of the steps required to have a basic DR. This effort is currently led by @stanhu and involves @brodock @rspeicher @dbalexandre.
I'm working with sales and marketing to see how/what we'll say to our customers.
@regisF do you need Sales to cherry pick a customer or two to work with you on getting Geo DR installed and operational? Would be great to have a customer endorsement for the announcement.
We have made great progress over the last month on this topic.
On 9.0, we'll ship a first iteration. This includes some big changes:
First of all, starting with 9.0, all file uploads are now recorded in the database. Before, the file was just uploaded to the server, and the only way to know that a comment had an attachment was by looking at the <a href> link in the comment itself. Now, we keep track of all uploaded files in the database. That will let us replicate those files in another iteration.
We have added the number of uploads in the usage ping to monitor this data.
A tracking database is now setup automatically in each secondary node, to track progress of the replication.
We have added support for replicating LFS objects.
There is a new process to backfill repositories from the primary server to a secondary node. Moreover we don't need to click on a button to backfill repositories anymore.
Secondary nodes can now be disabled and enabled again through the UI. Please note that setting up a secondary node the first time still needs to be done manually.
We have a new command line tool to check the "Geo health" of a node.
The next iteration will cover the following:
Make it easier to install Geo for developers, because it should be simple to install GDK with Geo so developers can actually work on it.
Make it easier to install Geo for customers
Add support for replicating attachments
Import existing attachments (those which are not tracked through the DB)
The body of the issue has been updated to reflect this.
@regisF I'm a bit out of sync with GEO but have a question: are we tackling using object storage for file replication?
I'm asking because there is an ongoing effort from the CI team to provide object storage for artifacts, and I'm not sure if there is anyone getting everyone in sync on this since it could impact the work on GEO.
Finally, to give you a data point that you may find interesting: we are trying to copy the artifacts and pages files to a new host and so far it has taken 18 days with rsync (without performing a diff, since initially there was no data on the target drive). We have no idea how much more it will take, but I would not be surprised if it takes another couple of weeks (the joys of a lot of small files).
@pcarranza That's really interesting data. We had a long discussion about object storage while designing Geo. In fact, we had a proof of concept where we moved all CarrierWave objects into S3. I think that we will want to support that.
However, right now most of our customers are using filesystem storage, and they don't have on-premise object storage. Amazon S3 or Google Cloud Storage isn't an option for them, so we would have to provide one or recommend one for them. We don't feel that Minio is something we can ship and support.
For the first iteration of Geo, we have to focus on filesystem replication. I think object storage makes sense as a future option.
As per https://gitlab.com/gitlab-org/gitlab-ee/issues/1989, this feature once launched to everyone, will be called Disaster Recovery, not Geo Disaster Recovery as I originally thought. The issue description has been edited to reflect this.
We now have to type fewer commands to set up the primary and the secondary nodes. We want the customer to have to type the least amount of commands in order to set everything up, so this goes a long way toward supporting that goal.
Along with git data and LFS objects, we now replicate the following assets: issue, merge request, and comment attachments, as well as user, group, and project avatars. Artifact data are still not replicated at this point.
The documentation will be updated to reflect all this as well.
Now that these steps are done, in the coming days, we'll test everything, again and again, fixing bugs as we find them.
@dbalexandre what milestone is this feature looking to be released in? Asking as this is a top feature requested by large accounts. Not sure if this issue has stalled with the change in PM.
@stanhu @mydigitalself is there a milestone that we can communicate to customers and prospects here? Any other updates or education we can provide to the market to keep them excited and attract more prospects to this feature?
We're in the process of figuring out what we need to implement on our side to make sure we have DR covered. Any updated information on the state of the built-in DR capabilities would be much appreciated!
Interested customer https://na34.salesforce.com/00161000006g0ZQ
They really want both Geo for off-shore teams and an HA configuration for rapid (not instant) rollover in case of primary failure.