GitLab CI is awesome, but not always fast. Let's make it fast, and fast by default. Everyone should be able to have CI run in less than 5 minutes.
This is a meta issue to track a variety of ideas for speed improvements.
Ideas
Sticky runners (to avoid startup time when a runner can be re-used for the same project, to avoid re-downloading the cache and artifacts, and to re-use common services, e.g. the docker daemon).
Pre-pull common docker images (to load docker images during boot rather than build start).
Build "gitlab/kitchensink" image that contains 90%+ of the tools/binaries that users need (e.g. docker, awscli, python, ruby, kubectl, etc.).
The documentation should include some information and tips on improving speed of tests and builds, IMO right now we're not really sharing that information with users as much as we should be.
For us specifically would it be possible to generate the DB and then fork the pipeline into the dozens of jobs? Right now we're doing a lot of duplicate work on every Rspec job so parallelization isn't as effective as it could be.
Cancel a pipeline automatically if the relevant commit no longer exists due to a rebase + force push.
Alternatively, make it easier to kill an older pipeline after rebasing a merge request. Right now it just kind of disappears from sight but continues running which makes the queue longer.
Actually, would it be possible and/or insane to output the container with the database set up and dependencies installed as an artifact to pass it forward to every Rspec job?
@zj Interesting question. I wasn't intentionally putting anything on that list for "time to first build". Do you mean that in the sense of onboarding and getting started, or in the sense of non-cached build, or in the sense of prep work to get a build going like writing .gitlab-ci.yml and making docker images?
I think onboarding is very important, but separate from the above.
Getting started, mainly: for someone who is new to CI, what is the time to get a successful build from their development environment? But that should be a separate issue indeed.
To improve our own build times, we might just want to have a DO box which only does static tests and one which only does the Knapsack builds, both using an SSH runner. I think most of these jobs should be done within 30 seconds; maybe rubocop can take a minute.
Pre-pull common docker images
We have default images which do exactly that, right?
Cache known system directories e.g. for ruby gems automatically.
This is going to require a lot of maintenance on our side: every framework and language will need this, and it can change between versions.
Analytics of pipeline runs to help people discover and diagnose speed problems.
Please! Travis shows the time for each command; that alone would already be awesome.
Improve .gitlab-ci.yml templates to use caching and other optimizations.
I'll ask contributors if they can do this from now on. :)
Enable failing fast (so a pipeline returns failure as soon as any test fails).
Douwe convinced me lately this is possibly a terrible idea. If my rubocop fails I still want to see what Rspec did, so next time I look at it I can fix everything in one pass.
@markpundsack: Improve .gitlab-ci.yml templates to use caching and other optimizations.
As @zj said, this could be maintained by the community, and while it might complicate configuration, the templates are also great examples that show people how to use .gitlab-ci.yml and what it can do.
@markpundsack: Start reporting times for each step in a build (i.e. git fetch, docker pull, etc.) so people can monitor and reduce them.
I really want this. As @zj pointed out, Travis does this, and it's very helpful for people optimizing their builds.
@connorshea: The documentation should include some information and tips on improving speed of tests and builds, IMO right now we're not really sharing that information with users as much as we should be.
This is probably also something the .gitlab-ci.yml templates could be doing.
@connorshea: For us specifically would it be possible to generate the DB and then fork the pipeline into the dozens of jobs? Right now we're doing a lot of duplicate work on every Rspec job so parallelization isn't as effective as it could be.
This sounds a bit crazy to me. Even if it works, I think we should use caching instead.
@connorshea: Cancel a pipeline automatically if the relevant commit no longer exists due to a rebase + force push.
This could be interesting.
@connorshea: Alternatively, make it easier to kill an older pipeline after rebasing a merge request. Right now it just kind of disappears from sight but continues running which makes the queue longer.
You can find them in the list of pipelines, and in the new pipelines tab on a merge request. I often cancel pipelines I no longer care about.
@markpundsack: Enable failing fast (so a pipeline returns failure as soon as any test fails).
@zj: Douwe convinced me lately this is possibly a terrible idea. If my rubocop fails I still want to see what Rspec did, so next time I look at it I can fix everything in one pass.
I thought it was just that the pipeline would be marked as failed while the rest of the jobs keep running? I'd want to know the pipeline failed as soon as a job fails, instead of waiting for all the jobs to finish. Or: the running jobs keep running, but maybe the pending ones could be cancelled.
Build "gitlab/kitchensink" image that contains 90%+ of the tools/binaries that users need (e.g. docker, awscli, python, ruby, kubectl, etc.).
I'm not fully sold on that idea. Maintaining this kind of big image is usually hard, and making sure our changes don't break people's builds is even trickier.
To improve our own build times, we might just want to have a DO box which only does static tests and one which only does the Knapsack builds, both using an SSH runner. I think most of these jobs should be done within 30 seconds; maybe rubocop can take a minute.
@zj can you elaborate? :)
Sure: no pulling an image, no downloading or creating a cache, hardly any overhead; just bundling and running rubocop should take about 30 seconds. That overhead affects all the static tests, and the Knapsack jobs too. These should take no more than a minute each IMO.
Edit: Faster Knapsack jobs will improve the wall clock time one-to-one, so maybe that should be the first focus. Maybe not SSH, but a ruby-alpine image, or something?
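Roughly what I have in mind for a static-check job, as a sketch only (the image tag, cache paths, and job name are examples, and this assumes the gems involved build fine on alpine):

```yaml
rubocop:
  image: ruby:2.3-alpine   # example tag; any slim ruby image would do
  stage: test
  cache:
    key: gems
    paths:
      - vendor/ruby
  before_script:
    - bundle install --path vendor/ruby --jobs $(nproc)
  script:
    - bundle exec rubocop
```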
Enable failing fast (so a pipeline returns failure as soon as any test fails).
Douwe convinced me lately this is possibly a terrible idea. If my rubocop fails I still want to see what Rspec did, so next time I look at it I can fix everything in one pass.
@zj I think what they're suggesting (or at least, how I think it should be implemented) is that the build status would be changed to failed immediately when a job fails, rather than waiting for everything to finish before changing the status. Right now Rubocop can fail in 2 minutes and the UI doesn't tell us (unless you go into the pipelines view) until the last Rspec tests finish 30 minutes later.
We have default images which do exactly that, right?
@zj I don't believe so. I think the default image isn't actually loaded until you use it. At least in my testing, that appeared to be true.
Cache known system directories e.g. for ruby gems automatically.
This is going to require a lot of maintenance on our side: every framework and language will need this, and it can change between versions.
True, but other vendors do this and it makes it much easier to get started. I'd hope that if we did this, it would be in some extensible way like plugins, especially so that third parties could maintain them.
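For illustration, this is the kind of per-language default such a plugin might encode (the directories below are the conventional locations and would need to vary per framework and version):

```yaml
cache:
  key: deps
  paths:
    - node_modules/    # Node, npm's default install location
    - vendor/ruby/     # Ruby, if bundler is pointed at a local path
    - .pip-cache/      # Python, if pip is configured to use a local cache dir
```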
Analytics of pipeline runs to help people discover and diagnose speed problems.
Please! Travis shows the time for each command; that alone would already be awesome.
Certainly we should start by doing what Travis does. Circle has build timing graphs so you can see how your parallel containers are doing, and build history graphs so you can see changes over time.
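In the meantime, one crude workaround (a sketch; the job name and commands are just examples) is to wrap the slow commands in `time`, assuming the shell in the build image provides it:

```yaml
test:
  script:
    - time bundle install --jobs $(nproc)
    - time bundle exec rake db:create db:schema:load
    - time bundle exec rspec
```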
Improve .gitlab-ci.yml templates to use caching and other optimizations.
I'll ask contributors if they can do this from now on. :)
Yeah, it's been on my list forever to do this myself. But the right thing would be to ask contributors to do it. Unfortunately it's really hard to understand what the right way to use caching and artifacts actually is! Can you create an issue and we'll follow up there?
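As a rough rule of thumb for the docs and templates (my framing, not official guidance): cache is best-effort and for dependencies you can re-fetch, artifacts are guaranteed and for output that later stages need. Something like:

```yaml
build:
  stage: build
  cache:
    key: gems
    paths:
      - vendor/ruby    # dependency cache; may miss on another runner
  artifacts:
    paths:
      - pkg/           # build output; reliably passed to later stages
    expire_in: 1 week
  script:
    - bundle install --path vendor/ruby --jobs $(nproc)
    - bundle exec rake build
```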
Enable failing fast (so a pipeline returns failure as soon as any test fails).
Douwe convinced me lately this is possibly a terrible idea. If my rubocop fails I still want to see what Rspec did, so next time I look at it I can fix everything in one pass.
Sure, it's a bad idea to force on everyone. But there's a great blog post about a team whose tests took hours and only ran overnight; they split them into a fast-fail set that ran in minutes and a slow-fail set that still ran overnight, and it dramatically improved the team's velocity.
I had a conversation with someone recently, possibly a PM candidate? Not sure. Anyway, one idea was to do something like what Slack does with notifications. e.g. it escalates after some time. So imagine you have the build page open, we send you a popup notification as soon as something went wrong. But we keep building the rest so you can see all of the errors. When it's done, we email you all of the errors. But, if X minutes elapses from the first failure, and the rest of the build still isn't done, then go ahead and notify anyway. Possibly even kill the pipeline run if there are other jobs queued. OK, that's a pretty elaborate process, and might be overkill, but might also be awesome. :)
Perhaps it's really "notify fast" instead of "fail fast". Or we just let users who want fail-fast behavior put their fast tests in a separate stage from their slow tests. I mean, they can already do that, but perhaps we document and recommend it. Plus maybe add a notification after the first stage succeeds; since the second stage hardly ever fails, you might want to know it's OK to go ahead with other work. But not allow the merge or deploy until the full suite passes.
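The separate-stages version is already possible today; as a minimal sketch (job names and commands are just examples):

```yaml
stages:
  - fast
  - slow

rubocop:
  stage: fast
  script:
    - bundle exec rubocop

rspec:
  stage: slow
  script:
    - bundle exec rspec
```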
@zj @godfat BTW, for caching gems, I think a more reasonable approach is to set the bundle path to a local directory and cache it, rather than caching the system directory. The syntax to do so is just ugly, so I'd also like to hide that behind a plugin or something. But the first step would be to have the Ruby template include the appropriate command. You can shave 8 minutes off a 10 minute build by properly caching gems. This has already been added to https://gitlab.com/gitlab-org/gitlab-ci-yml/blob/master/Ruby.gitlab-ci.yml, but we need to go further and set the cache:key to a constant so the cache is shared between branches and jobs. Otherwise each MR gets a virgin cache.
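Roughly, the template change could look like this (a sketch, not the final template; the constant key name is arbitrary):

```yaml
cache:
  key: ruby-gems        # constant key so all branches and jobs share one cache
  paths:
    - vendor/ruby

before_script:
  - bundle install --path vendor/ruby --jobs $(nproc)
```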
@connorshea: The documentation should include some information and tips on improving speed of tests and builds, IMO right now we're not really sharing that information with users as much as we should be.
This is probably also something the .gitlab-ci.yml templates could be doing.
True, but the templates don't really explain why they're doing those things. These should be documented.
Build "gitlab/kitchensink" image that contains 90%+ of the tools/binaries that users need (e.g. docker, awscli, python, ruby, kubectl, etc.).
I'm not fully sold on that idea. Maintaining this kind of big image is usually hard, and making sure our changes don't break people's builds is even trickier.
@ayufan Yes, this is hard and painful. But our competitors are doing this and providing MUCH faster builds than ours because of it. And some of them are winning business because of those build speeds. It is super painful to keep things up to date. People expect support for the new version the minute it ships. But perhaps this is just a cost of doing business.
I'm hard pressed to think of any other combination of changes we could do that would have as much impact as this. And it's great for onboarding since it really reduces the time to get started!
Advanced folks can make custom images that will load quickly, but even then, if they're not one of our pre-pulled images, they'll have to wait for the download or rely on our (poorly performing) proxy cache. For single-tenant installations, they'll be able to pre-load their own image, which will be great. But GitLab.com will never be able to load all the variety of images out there quickly. It's one of the downsides of Docker.
This is up to the clustering platform to do, not really us. So if we ever switch to Kubernetes or Docker Swarm, this will effectively be solved.
@ayufan Interesting. If that turns out to be true, then it's a good reason to push people in that direction. But for GitLab.com, would we really give runners access to the Kubernetes cluster? If we didn't want to bind mount the runner's own docker instance, what changes with Kubernetes?
Store the cache regardless of job success or failure. If before_script successfully pulls gems, node deps or whatever, they should be stored. Period. Getting the first build green isn't an easy thing, and seeing those damn gems being downloaded and installed all over again is annoying and time-consuming. My workaround for this was to comment out the actual tests, only let bundle install and npm install run so everything gets cached, and then play with the build.
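A sketch of what the requested behaviour could look like in the YAML, assuming a GitLab version that supports the `cache:when` setting (job name and paths are examples):

```yaml
test:
  cache:
    key: deps
    paths:
      - vendor/ruby
      - node_modules
    when: always        # push the cache even if the job fails
  script:
    - bundle install --path vendor/ruby --jobs $(nproc)
    - npm install
    - bundle exec rspec
```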
@ayufan Interesting. If that turns out to be true, then it's a good reason to push people in that direction. But for GitLab.com, would we really give runners access to the Kubernetes cluster? If we didn't want to bind mount the runner's own docker instance, what changes with Kubernetes?
Kubernetes and Docker are working on hypervisor isolation. This will make it possible to open our shared runners to an existing cluster, use caching of docker images, and improve performance.
@warren.postma Alpine is a base image that is much smaller than Ubuntu or other Linux distributions. A lot of folks are switching to using alpine-based images for Docker deployments because it makes your Docker images much smaller, faster to fling around, faster for CI, etc.
@markpundsack One evening when I had nothing to do, I tried to convert GitLab CE testing to alpine images. Sadly, alpine uses different libraries, most notably a different libc, and some gems with native extensions failed to compile against it. So I don't think we should encourage people to use it as the default per se.
@zj Yeah, Alpine is definitely harder to use sometimes, and some projects may not be able to make use of it at all, but I'd still say a "best practice" would be to at least try it for every project, because if you can get it to work, it's much faster. So it comes down to us defaulting to conservative practices that work for everyone, or best practices that work better for some people. Usually that's an easier decision because the best practice is often better for (nearly) everyone. This case is less obvious. One option, which isn't a great one, is to offer both alpine and regular images for templates, for example.
@markpundsack We might just use slim images for all languages/frameworks that are not statically compiled. For Go, for example, we should advise/force static compilation, as it makes no sense to pull in 300MB of Debian stuff when only the binary would suffice.
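For Go that could be as simple as a build job along these lines (the image tag and project layout are assumptions; the static binary can then run in a tiny image):

```yaml
compile:
  stage: build
  image: golang:alpine            # example tag
  script:
    - CGO_ENABLED=0 go build -o myapp .
  artifacts:
    paths:
      - myapp                     # later stages only need the binary
```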
@markpundsack I guess you can add this link to "Cancel a pipeline automatically if it's not HEAD".
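For reference, a minimal sketch of how that can be expressed, assuming a GitLab version that has the `interruptible` keyword plus the "Auto-cancel redundant pipelines" project setting:

```yaml
default:
  interruptible: true   # lets a newer pipeline on the same ref cancel these jobs
```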
"Analytics of pipeline runs to help people discover and diagnose speed problems." seems higher priority than others, to investigate the heaviest bottle-neck in the current architecture. If Gitlab has an advantage as all-in-one solution over Github+Travis or CircleCI, which means "You don't need to switch tools so make your development faster", then GitlabCI should be as fast as them otherwise it's killing the only advantage.