Using a virtual server in the cloud as an NFS server will only take you so far in terms of storage capacity. We want to see if CephFS can be used to store Git repository data (and all other file data managed by GitLab). If you are currently evaluating or running Ceph with GitLab please comment below. If you prefer to comment in private please email sytse@ our domain.
@jnijhof If you already read it, a summary would be great. This stood out to me: "We observed consistent results and linear scalability at the RADOS level. However, we did not observe the same at the file system level. We have experienced large performance swings during different runs, very low read performance when transfer size is small, and I/O errors tend to happen when system is under stress (more clients and large transfer sizes). These are not particularly reproducible results, but it suggests that there are opportunities for code improvement and maturation."
@thelinuxfr We heard that GlusterFS has bad latency, so it is not an option we're considering. We've also tested LizardFS, but it is not proven to scale to the number of files we will need. Right now we want to explore CephFS.
I'm by no means an expert on Ceph (yet?), but I'm currently looking at implementing it for my own business and I wanted to share one or two things that I found out during my research.
The publication linked above is from 2013 and since then Ceph has gained some very valuable additions that could work really well for GitLab's purposes. It also makes the performance findings from the paper somewhat obsolete, in my opinion.
The first big feature is cache tiering. With cache tiering it's possible to split the storage into two separate pools: a fast hot storage pool and a slower cold storage pool.
Cache tiers have two modes they can work in, "writeback mode" and "read-only mode", and I think the latter is very interesting for GitLab. Basically what you do is put a couple of (expensive) SSD-driven storage servers in front of your slower cold storage servers, which can run on traditional (cheaper) hard drives. Ceph still writes to the slower cold storage pool, but on reads it caches the data on the hot SSD systems, and subsequent reads are served from the SSDs directly. I don't have any details on GitLab's data, but I can imagine that data is read more often than it is written, which could be a big benefit in such a scenario. However, the Ceph documentation does note that read-only mode shouldn't be used for mutable data, and I'm not sure whether that applies to Git objects; hopefully somebody smarter than me can weigh in on that.
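For reference, a read-only cache tier boils down to a couple of commands once the two pools exist. A minimal sketch, with made-up pool names (`repo-data` on HDDs, `repo-cache` on SSDs), not something we have tested:

```shell
# Attach the SSD pool as a cache tier in front of the HDD pool.
ceph osd tier add repo-data repo-cache
# Read-only mode: writes go to repo-data, reads get cached in repo-cache.
ceph osd tier cache-mode repo-cache readonly
# For writeback mode you would instead route all client IO through the cache:
#   ceph osd tier cache-mode repo-cache writeback
#   ceph osd tier set-overlay repo-data repo-cache
```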
There is a second addition to the Ceph codebase since the paper was released which won't really help with performance, but does make it more economically viable to use Ceph for large storage requirements: erasure coding. Erasure coding is comparable to traditional RAID-5, with the added benefit that you can decide how much parity you want. The default erasure code profile is set up in such a way that you can lose a single OSD (think: disk) at the expense of only 50% of the raw space you would need with a replicated system (where each piece of data is stored exactly twice). So in a replicated system you would need 2x1TB to store 1TB, but with erasure coding you would only need 1.5TB. Erasure coding offers a lot of flexibility for the size/redundancy trade-off and I can recommend this blog post for some of the (easy) math behind it. Do note that if you want to use erasure coding you probably want a second cache tier in writeback mode, because RBD images are not supported on erasure-coded pools (they require partial writes, which erasure coding doesn't support).
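To make the maths concrete, here is a minimal sketch (profile and pool names are made up): with k=2 data chunks and m=1 parity chunk, 1TB of data occupies roughly 1.5TB and the pool survives the loss of one OSD.

```shell
# Define an erasure-code profile with 2 data chunks and 1 parity chunk.
ceph osd erasure-code-profile set ec-k2m1 k=2 m=1
# Create an erasure-coded pool using that profile (128 PGs as an example).
ceph osd pool create repo-cold 128 128 erasure ec-k2m1
# RBD would need a replicated writeback cache tier in front of this pool,
# since erasure-coded pools do not support the partial writes RBD requires.
```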
Thanks @maran. Our main concern at the moment is the amount of latency on the MDS (metadata server). Git does a lot of file lookups. But it will be great to have a 'hot cache'.
Although outdated, there are two blog posts by Intel that benchmark Ceph and compare its performance to that of EBS.
Ceph's EBS alternative is 'Ceph RBD'. We are considering 'Ceph FS', which is a higher-level abstraction (filesystem instead of block device). GitLab needs a single large filesystem to store its Git repositories.
CephFS snapshot improvements: Many many bugs have been fixed with CephFS snapshots. Although they are still disabled by default, stability has improved significantly.
CephFS Recovery tools: We have built some journal recovery and diagnostic tools. Stability and performance of single-MDS systems is vastly improved in Giant, and more improvements have been made now in Hammer. Although we still recommend caution when storing important data in CephFS, we do encourage testing for non-critical workloads so that we can better gauge the feature, usability, performance, and stability gaps.
@jacobvosmaer Could you not bypass CephFS by exporting one big Ceph RBD and creating a filesystem on top of that? I think I might have been reading slightly outdated docs, but a lot of warnings were given about using CephFS in production environments.
@maran I read the same warnings about CephFS too. One of the reasons we have this issue out here is to see if anybody out there is using it anyway and what their experience is.
The problem with making a huge RBD disk is: how do you get it to the application servers? We considered clustered file systems but those look limited in the number of clients (application servers) they support. #4 (closed)
If we stick a huge RBD disk behind an NFS server then all file reads/writes plus metadata lookups must pass through that one NFS server. The theoretical advantage of CephFS is that read/write actions go straight to the OSDs, with only the metadata lookups/modifications going to the MDS.
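For comparison, with CephFS every application server would mount the filesystem directly. A sketch of what that looks like (monitor address, client name and paths are made up):

```shell
# Kernel CephFS mount on an application server: file IO goes straight to the
# OSDs, only metadata operations hit the MDS.
sudo mount -t ceph mon1.example.com:6789:/ /var/opt/gitlab/git-data \
  -o name=gitlab,secretfile=/etc/ceph/gitlab.secret
```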
Right, what I think you are saying is that you need to use the same filesystem on multiple app servers, and with one huge RBD disk that wouldn't work. I haven't really looked into CephFS that deeply, so I will abstain from making any other comments until I read more about it.
Question is: with a networked and distributed FS like CephFS, wouldn't the same issues as with object storage come up? The main performance issues come from having to serially read objects as they reference the next one, etc. (simplified). So with a distributed FS the latency issues will be there similarly to using object storage. One thing to circumvent is to have a fast DB for reference storage and only store the raw git object in object storage. Which, as long as the object storage can be used with many parallel requests, shouldn't be that much of an issue.
Concerning the object storage usage. Here are some benchmarks from the research implementation of git using openstack swift. https://docs.google.com/document/d/1IRPYwmSWzsAt9X66Rw9nnAZeTKBi17TMTczsjkZkiTg/edit
Github uses sharding on filesystems and as far as I know it's a pain.
So with a distributed FS the latency issues will be there similarly to using object storage. One thing to circumvent is to have a fast DB for reference storage and only store the raw git object in object storage.
Don't you need the objects to find other objects?
I believe some work has been done / is being done to store references in a DB instead of files on disk / the 'packed-refs' file. But I have no idea how far along that project is. See e.g. the Git 2.6.0 release notes.
Thanks for the link to those Dulwich/Swift benchmarks, that is interesting and possibly useful for us. (For those who do not know: Dulwich is a re-implementation of Git in Python, just like JGit (Java) and libgit2 (C)).
Yeah, that's the main performance issue with networked storage systems and Git objects: as far as I know, you need the current object to get the reference to the next one, so latency adds up and slows down retrieval. As you are already using networked storage, I do think the performance hit via CephFS should not be that bad; for object storage it depends on the latency, I assume. So Google object storage would be something to consider, as their network seems to be superior.
@stp-ip we are not considering building a Git implementation backed by object storage (like Dulwich+Swift you mentioned) at this time. It would take a lot of engineering work and it is uncertain whether it would pay off with a usable (fast enough) Git storage backend.
"we use the Ceph RBD and I don’t think I'm going to be helpful but if I can add something to the gitlab thread I’ll post there. For now I will just be a clearer about what we’re actually doing.
We define the monitors and images mount points in our kubernetes RC. The kubelet in turn mounts the RBD image to the host (a VM in this case), which is then passed through as a volume mount to our docker container. We run a fairly low traffic gitlab for internal things, however we did do some NFT and performance testing that is undertaken by another team and obviously passed those checks. I suspect I can’t show you those numbers, but I’ll see if I can reproduce the latency issues, as I wasn’t aware of anything beyond the underlying limitations of the IaaS.
We do however have separate boxes for Data Store and the Monitor/OSDs, 3 of each at the moment. If I were to turn off a Monitor/OSD box, we'd see some pretty extreme intermittent latency as the client tries to refer to a box that isn't there. We also previously had these all on the same boxes (in addition to etcd), but cephs-store did a reasonable job of making everything else, most noticeably etcd, intermittently unavailable.
This almost certainly isn’t how you’re going to want to use ceph though as it just gives you a single point of failure, but that’s what we do."
@sytses sounds like they have a one-server (or rather one-container) GitLab installation with persistent storage backed by Ceph RBD. It is the equivalent of us running gitlab.com on a single AWS server with repository data on an EBS volume: something we outgrew some time ago.
@jacobvosmaer: Pretty much with the caveat that your EBS solution would be down if the AZ failed whilst the ceph solution would recover to a different AZ.
Extrapolating on that though, obviously you could look at hosting a HA nfs solution on top of a couple of Ceph RBDs but that's a layer of abstraction which one would assume would be better handled by CephFS itself.
I think the conclusion is that since our data will span multiple servers and will be queried by multiple servers we're better off waiting for CephFS to be production ready.
As an alternative idea, has anyone thought about using an object store backend instead of a disk backend, like the Redis backend to libgit2? https://github.com/libgit2/libgit2-backends
@timhughes Maybe I'm wrong but I don't think that would work for users using Git directly, unless we write something that handles Git payloads and stuffs them in Redis/whatever.
At this point I think we need to start treating setting up and testing Ceph as critical. The path forward is:
We are going to set up a cluster of Ceph servers.
We will kick the tires on this cluster with whatever workloads we can think of that are evil enough.
We are going to rebuild the host for dev.gitlab.org using the omnibus package and mount it on the Ceph cluster.
We are going to dogfood this cluster ourselves to see how it behaves, what surprises we get, and what mistakes we made.
We need good monitoring and tooling to see if there is an actual degradation at the FS level.
Anything that is not ensuring that GitLab.com can continue growing is a lower priority for operations and will be queued.
We need to provide good visibility of the situation to the whole company and be extremely vocal: if we are running out of time, everybody should know it, and should know why, what we are doing about it, and that everyone is invited to give a hand.
At this point we roughly know that we are growing fast, but we don’t know if that is on new repos, forks, imports or plain pushes.
We need support from the developers in building enough tooling to know where the storage usage is going; we also need to know what the trend is and what to expect tomorrow, and in a month.
If things go wrong, we need to have a plan B and enough time to execute it. This may include development involvement and resourcing.
How can we test the application in case of a degradation of the service?
How do we ensure the object placement is optimized for repo data?
What kind of performance (e.g. IOPS) can we expect?
What is the optimal hardware and network that should be used for best performance? Pablo: not sure we can answer this, we will be using virtual machines in Azure for now.
What happens when servers get into an inconsistent state? How is this detected/resolved properly?
Inconsistent objects are detected by scrubbing and comparing checksums. This rarely happens (<1 per year), but needs to be resolved manually (see the sketch after this list of questions).
What replication factor makes sense? - 3
How do we recover from a complete disaster? (quorum down)
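On the "resolved manually" point above, the usual flow is roughly the following (the PG id is just an example):

```shell
# Scrubbing flags inconsistent placement groups; health detail lists them.
ceph health detail
# Ask Ceph to repair the affected placement group.
ceph pg repair 2.1f
```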
I reached out to Daniel from CERN directly to ask a few questions about the talk; here are the answers:
In order to make the change in a controlled fashion I want to make the transition slowly, not just swap one server for the other. So I need to first measure what I have now, and then measure how Ceph is going to behave, as the word out there is that Ceph is bad at IOPS under high load.
Ceph won't give you the same IOPS performance as an expensive NFS appliance with an NVRAM write back cache, for example.
But on the contrary, I find CephFS to be more performant than the other network filesystems we use here at CERN -- OpenAFS, plain old Linux NFS, and a couple of others.
To make yourself comfortable you really should do extensive testing with Ceph to (a) understand how it works from an ops/reliability point of view and (b) work out a setup which performs to your requirements.
In the talk read and write latency for specific file sizes are mentioned continuously; how are you measuring this? is Ceph providing this form of data or are you measuring this from the client side?
Ceph has a built in benchmarking tool called 'rados bench'. That can be run interactively by you to see how it performs. We run a few rados bench tests regularly and feed the results into our monitoring.
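For the record, a typical rados bench run looks roughly like this (pool name and durations are examples):

```shell
# Write 4MB objects for 60 seconds and keep them around for the read tests.
rados bench -p bench-pool 60 write --no-cleanup
# Replay as sequential and random reads.
rados bench -p bench-pool 60 seq
rados bench -p bench-pool 60 rand
# Remove the benchmark objects afterwards.
rados -p bench-pool cleanup
```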
How are you measuring IOPS? just plain old iostat? anything fancier than that?
If you have a running Ceph cluster, you can run 'ceph -w' to watch the ongoing user activity, in read/write B/s and ops/s.
I really encourage you to get some test hardware and start playing with/testing Ceph. Some people find that it has a bit of a learning curve
We should have proper metrics asap. Then we can set up a test instance of Ceph on which we'll run GitLab Geo as a secondary with GitLab.com as the primary. To increase the load we can run multiple Geo secondaries on the same Ceph installation.
We currently replicate Git repositories by notifying secondaries and letting them fetch from the primary (this doesn't require a replicated filesystem, so anything will do).
Things in the /shared folder are a different beast. That covers LFS objects, attachments, avatars, build artifacts, Pages, etc.
They can all be handled by either a replicated Ceph filesystem, or we could change the code to use S3-compatible storage (this would give us the freedom to choose a proprietary managed solution or any open-source S3-compatible daemon like Ceph, Riak, etc.).
With that in mind, let's say we want to get gitlab.com replicated with Geo to ams.gitlab.com.
We don't have to move gitlab.com repositories to Ceph, we can test that on the Geo instances first:
Set up a Ceph cluster in the AMS region, with full read/write, and use it for the /repositories folder.
Set up database replication.
Set up Geo.
Do first repository import or just "release the kraken" and let it get slowly populated after git push to repositories.
This will give us functional read-only web navigation and git pull from ams.gitlab.com. Anything that we currently store on /shared will not be available.
To get that we would have to setup another cluster, this time in gitlab.com and move /shared folder there and replicate it to ams.gitlab.com instances.
I like the "release the kraken" approach, that will allow us to see how the whole thing behaves.
How does it work in this setup? every time a repo gets a push, the secondary gets notified and pulls the whole thing? do you allow throttling this behaviour so as not to hammer the primary? If so, can it be throttled at runtime?
every time a repo gets a push, the secondary gets notified and pulls the whole thing?
yes, every time there is a push on primary server, we queue that and consume the queue every 10 seconds. This will notify secondaries with a list of updated repositories. Secondaries will queue that again (in sidekiq this time) and start consuming (git fetching) immediately after.
do you allow throttling this behaviour so as not to hammer the primary? If so, can it be throttled at runtime?
The only way we can "throttle" is by changing the amount of sidekiq workers running on secondaries. Currently we use default queue, if we want fine-grained control, we can move it to a new specific queue.
@brodock Is there a way of changing the number of Sidekiq workers from the application? Can we have a specific queue for this and provide a way of short-circuiting it completely? I just want to have a quick way of stopping or lowering the load in case we are killing the master server.
Another question, how are repos that do not get pushes synced?
http://githubengineering.com/introducing-dgit/ "Git is very sensitive to latency. A simple git log or git blame might require thousands of Git objects to be loaded and traversed sequentially. If there’s any latency in these low-level disk accesses, performance suffers dramatically. Thus, storing the repository in a distributed file system is not viable."
I think we should figure out if the latency that CERN had (3ms reads?) is acceptable to us.
Ok, I'll try to dump what I got from the CERN video
General view
CERN is using Ceph to serve block devices via Cinder to provision virtual machines in their own OpenStack.
They are not using CephFS.
They use Puppet to deploy the cluster, mixed with the ceph-disk deployment tool that is in upstream.
Initially they started with 48 OSD servers and 5 monitors randomly distributed in the room. Each OSD server had 64G of RAM and 24 attached disks. Monitors were basically CPU-only hosts.
They have 1000 VMs with attached Ceph volumes.
They are starting to move their storage service into VMs with Ceph volumes.
IOPS
They provision block devices for VMs and use QoS to mimic the behaviour of physical disks, such that there are 2 kinds of volume types:
The high-IOPS type provides 120/120 MB/s read/write and 500/500 read/write IOPS, per VM.
They only upgrade a user to a high-IOPS volume when they complain about performance; there are not many users complaining.
Compared to a physical disk, they can't do thousands of sequential IOPS, and everything is limited to these numbers because they don't yet have burst IOPS in Cinder.
The write latency was originally 30-40ms with 24-disk hosts and no SSD journal drives (all magnetic drives).
So they replaced 4 out of the 24 drives per host with SSDs and placed all the journals on them, decreasing the latency from ~50ms to 5ms and adding 5-10x IOPS capacity per server with this change alone. The main reason is that the journal is accessed randomly when read. The result was a flat 5ms latency.
They don't use SSDs for all drives because of cost, and because throughput is more important than latency and they can get that with magnetic drives.
They monitor iowait to check if a user needs a better drive. A better drive reduces the load because the OS is not waiting for the block device to return data.
The drive setting max_sectors_kb defines an artificial limit on request size for a virtual block device. Given 100 IOPS and a maximum request size of 512KB (the default), the maximum throughput will be 50MB/s. Allowing bigger requests increases the throughput.
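The arithmetic, and where the knob lives (device name is hypothetical):

```shell
# 100 IOPS * 512KB per request ≈ 50MB/s; a larger maximum request size
# raises that ceiling.
cat /sys/block/vdb/queue/max_sectors_kb          # current limit in KB
echo 4096 > /sys/block/vdb/queue/max_sectors_kb  # allow requests up to 4MB
```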
Ceph and Linux tuning
LevelDB for the monitor hosts is really sensitive to disk latency. They use SSDs for them too.
Scrubbing (checking integrity of the blocks in the cluster) can be really IO intensive. It will compete with the user access to drives slowing down the whole thing.
Thundering herd - scrubbing is triggered automatically every week by default. If all the OSDs in the cluster are built at the same time, they will all trigger it at the same time. Be sure to smear the scrubbing manually (check scripts for that; a sketch of the relevant knobs follows at the end of this section).
Twice they lost a whole server - cluster continued working just fine.
Backfilling (re-replication of blocks) is configured at the host level, so it triggers when a host goes dark, not a whole rack (by default).
A power cut killed 3 out of 5 mons, so quorum was lost; as a result the OSDs froze and the cluster was down for 18 minutes. The mons came back, IOPS resumed, and no data corruption was reported.
A router failure started losing packets randomly. Some OSDs decided to commit suicide as a safety mechanism and needed a manual restart. OSDs kept talking through broken sockets, resulting in operations taking 600 seconds until they reached the timeout.
Deletions are slow because they use only 1 thread. This seems to be fixed in Hammer.
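On the scrub smearing mentioned above, these are the kinds of knobs involved (values are only examples, not a recommendation):

```shell
# Temporarily pause deep scrubs, e.g. while rebuilding OSDs.
ceph osd set nodeep-scrub
ceph osd unset nodeep-scrub
# Spread scrubbing out: longer deep-scrub interval, skip scrubs under load.
ceph tell osd.* injectargs '--osd_deep_scrub_interval 1209600'  # 2 weeks
ceph tell osd.* injectargs '--osd_scrub_load_threshold 0.3'
```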
Scaling test with 30PB
150 clients, write throughput with 4MB objects
Without any failure going on:
52Gb/s with no replica
~25Gb/s with 1 replica
~16Gb/s with 2 replicas
Throughput decreases roughly in proportion to the number of copies written.
Doing backfilling (host is down) didn't affect throughput significantly.
They seem to be limited by network throughput.
Having a huge cluster means that you will always have backfilling going on.
The main limitation is the OSD map. With 7200 OSDs it is a 4MB file, but each OSD process caches 500 of these maps, limiting the scale. They configured Ceph to cache fewer maps.
On top of that information, we will need to start comparing with the profiling information that rados bench provides; then we can see if this is a good solution or not.
We should ensure that Ceph reads 64k ahead like the Linux kernel does. If the readahead is only 4k you get heavy IO when you need to read many commits in the repo (needed for some operations). 64k is 5000 commits.
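Where the readahead is set, assuming a kernel client (device name and mount details are made up):

```shell
# For a kernel RBD block device:
cat /sys/block/rbd0/queue/read_ahead_kb
echo 64 > /sys/block/rbd0/queue/read_ahead_kb
# For a kernel CephFS mount, readahead can be given at mount time (in bytes):
#   mount -t ceph mon1:6789:/ /mnt/ceph -o rasize=65536
```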
"At the RADOS level replication is synchronous, so it's not going to work well with that kind of latency. At the RGW level it's async, but loses ordering guarantees that you'd need for something like this." https://news.ycombinator.com/item?id=11439448
Notice r/s and w/s - that's how many read and write requests we are getting per second (IOPS).
From what I see we are getting a maximum of ~2100 IOPS in the worst case, with an average of ~434 and a standard deviation of ~301, so it's a long tail. That tail comes from the writes, as we have specific spikes of up to ~1900 IOPS but a stdev of ~270.
Also notice r_wait, which is the read latency in milliseconds (the time it takes to fulfil a read request, from issuing to serving it). It never goes below ~3ms (2.92 to be precise), with spikes up to 12ms; ~6ms on average, ~3ms stdev. It's quite stable. This answers the "is this enough for us" question for the time being.
The write case (w_wait) is much worse, ranging from 3ms up to ~500ms. The average is ~32ms and the stdev ~100ms, so it looks quite bursty.
This is just one sample; I'm working on making this available in check_mk so we can monitor it all the time.
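For reference, a sample like the above can be collected with iostat's extended statistics (device name is an example; column names vary slightly between sysstat versions):

```shell
# r/s and w/s are read/write IOPS; r_await/w_await (a.k.a. r_wait/w_wait) are
# the per-request latencies in milliseconds. Sample every 10 seconds.
iostat -x -d 10 sdb
```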
Generally what you do is use the Lustre system and mount the same chunk of disk space on multiple OSS (Object Storage Server) devices. Lustre supports 'striping' across OSS/OST targets so that files are split between multiple storage targets and a parity checksum is calculated and stored in case of an OSS/OST loss. Things have changed since 2013 when that paper was written, and I've also been reading up on Ceph to make sure everything is getting a fair shake. @pcarranza and I talked about this and I'll be conducting an apples-to-apples comparison of CephFS and Lustre with load testing, and documenting the results. I'll see which one breaks first and what the performance characteristics of each are. After we have some actionable data about how things behave in our specific environment we can circle back and I'll present the findings.
@northrup please also document whatever is part of the setup and maintenance cost: if a given infrastructure is harder to maintain or monitor, that will be critical to know before embracing it completely.
Just had a call with @jnijhof: we should also push the Linux kernel repo to each new cluster we build, time it, and compare the timing with GitLab.com, to have a point of comparison that is simpler and more direct than talking about raw IOPS.
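Something along these lines, with a placeholder remote for the cluster under test:

```shell
# Mirror the Linux kernel repo once, then time a push to an empty repo on the
# test cluster and a fresh clone back from it.
git clone --mirror https://github.com/torvalds/linux.git
cd linux.git
time git push --mirror git@ceph-test.example.com:bench/linux.git
time git clone git@ceph-test.example.com:bench/linux.git /tmp/linux-clone
```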
In summary, the improved VMs and the separate SSD journals led to a huge improvement. What were the other VMs and how did they compare in terms of network latency and bandwidth?
For the follow-up test I will clone and push to my Linux repository on DEV to see how that goes. I will also remove an OSD server to see if that does anything.
Some tests that could be run (each on CephFS, NFS, and local):
Check git ls-tree -rt --name-only <branch>. This recursively lists every file/directory of the repo and results in a lot of serial disk reads, though some could be done in parallel (no idea how Git implements it internally). With some scripting, printing out the last commit that affected each file is also possible (see the sketch after this list), but that might not be a very good test. I do not believe that GitLab actually runs this anywhere (at most, just for a single directory in the tree view), but the timing might be more useful.
Test performing large merges and diffs (not sure where to get any; something like Google Chrome might have some good ones), both with merge conflicts and without.
Test large rebases. (might be similar to large merges though)
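The "last commit per file" scripting mentioned in the first item could look like this; it forces many small, serial object reads, which is exactly what we want to stress:

```shell
# For every path in the branch, print the abbreviated hash and date of the
# last commit that touched it.
git ls-tree -r -t --name-only master | while read -r path; do
  git log -1 --format="%h %ad $path" -- "$path"
done
```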
I particularly like the "Use a repo that has a large number of objects in a single tree" sample as it could show how ceph behaves with a lot of small files.
@jnijhof In terms of number of direct child files and directories, the largest directory in the Linux repo is arch/arm/boot/dts with 1374 children. On the CocoaPods/Specs repo, the largest is the Specs directory with 19383 direct children.
I used this command to get these values (ignores directories starting with .):
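(The command itself did not make it into the comment; something along these lines produces the same counts, as a sketch rather than the exact command used:)

```shell
# Count the direct children of every directory, skipping dot-directories,
# and show the largest ones.
find . -type d -not -path '*/.*' | while read -r dir; do
  printf '%d %s\n' "$(ls -1 "$dir" | wc -l)" "$dir"
done | sort -rn | head
```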
I tested with git push and git clone a couple of times, with and without flushing caches, and the timings stay the same, which is really good news.
Next step: test with different placement group (PG) numbers, but I'm not sure if we can even change that on a running cluster. I have followed the Ceph guide, so the numbers currently used are correct, but I want to find out if we can change them, i.e. when we extend CephFS we need to make sure the number of PGs is still right.
Changing the PG numbers is just a Ceph command... after increasing pg_num, increase pgp_num to have Ceph rebalance the existing data across the newly available placement groups.
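Concretely, something like this (pool name and counts are examples):

```shell
# Check the current PG count, then raise pg_num followed by pgp_num so the
# data actually rebalances onto the new placement groups.
ceph osd pool get cephfs_data pg_num
ceph osd pool set cephfs_data pg_num 2048
ceph osd pool set cephfs_data pgp_num 2048
```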
The last known state is that we are trying to find a way to connect two networks inside Azure: one from the classic portal, and the one the Ceph cluster lives on in the new portal. @northrup is on a call with Microsoft people right now trying to find a way to do it.
Worst case, we will need to migrate GitLab to the new portal as a whole.