Caching between stages

Let's don't mix the caching with passing artifacts between stages.

Caching is not designed to pass artifacts between stages. You should never expect that you have the cache present. I will make the cache more configurable making it possible to cache between any jobs or any branch or anyway you want :) Cache is for runtime dependencies needed to compile the project: vendor/?

Artifacts are designed to upload some compiled/generated bits of the build. We will soon introduce the option (on by default) that artifacts will be restored in builds for next stages.

mentioned in issue #280 (closed)

Cool. Caching dependencies is exactly what I need. I think the confusion is that in other build systems artifacts can be used for that purpose (it's what we do in Jenkins currently).

My ideal implementation is to cache the result of the before_script and make that available to each job so before_script is only ever run once for each build. In fact this is how I expected it to work when I first started using Gitlab CI and I was surprised when it didn't.

My ideal implementation is to cache the result of the before_script and make that available to each job so before_script is only ever run once for each build. In fact this is how I expected it to work when I first started using Gitlab CI and I was surprised when it didn't.

This is impossible this way to do. With GitLab CI you can specify different runners, different images for each of the job. For example you want to test against different versions of ruby. Cached gems will be invalid for different versions if not done strict.

The current approach of caching is strict, but it's better to make less tight in next releases when we see in which direction they should be improved :)

It doesn't make any sense. I'm spending an hour for building and then I cannot use my build to deploy it. :3 Is there any chance to have "keep" function to preserve artifacts between jobs? It can basically use same mechanism as "cache" – it can be compressed and then recreated after next job start. Or maybe even simpler: define which paths shouldn't be cleared (by git clean).

@seweryn.zeman We already have artifcats. What we will add is the possibility to pass artifacts between stages. That will solve the problem.

Sounds fair. But artifacts are available for download in CI interface. Maybe you should also think about not uploading them?

Hoping this implemented quickly.

+1 for https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/336#note_2910837

Just to comment on this, I have a perfect use case case for some sort of carry-over between stages.

When we check in chef cookbooks, gitlab-ci runs testkitchen tests against them. In the event of a failure, we want to fail the build but run a cleanup step (kitchen destroy).

What ends up happening is that because .kitchen is removed between stages, testkitchen thinks there's nothing to clean up. Meanwhile we have dead docker containers sitting around from failed builds.

Adding the .kitchen dir to the repo makes no sense and it really doesn't fit as a build artifact.

+1 it would be really useful

@ayufan Are you on holiday?

+1 This is especially useful for golang apps. To do a go get and go build takes a while. Would be nice to use that compiled "result" in the deploy stage of a docker runner.

@francoishill11 Everything is ready on GitLab side, and 75% is done on runner side. It'll come after 1.0 release.

@xiemeilong-ok Nope, there's a lot that needs to be done to have it implemented.

Added ~147498 label

I'm really confused by the stages. From what I understand from the CI's yaml README, stages should be use for different steps in the CI process:

The specification of stages allows for having flexible multi stage pipelines.

It even advertises the default stages as build, test and deploy. From what I understand, these stages are meant to split the whole process into smaller steps, for example different steps/stages for cmake, make, make test and make install.

But that workflow (cmake, make, make test and make install in different stages) fails completely because the git working directory is cleaned before each stages, as users are reporting here.

What I want to do (and probably many others too) is a way to split a single CI "action" into smaller steps (configure, make, test, deploy for example) so one can identify right away which step failed. I also need to do these steps for different configurations (for example, for -DCMAKE_BUILD_TYPE=Debug and -DCMAKE_BUILD_TYPE=Release). I don't want to repeat myself by copy-pasting the whole CI action over and over again...

If stages are not meant for this workflow, what mechanism would allow it? Artifacts? Something else?
I fail to see the use for different stages if the (default) test stage needs to re-build, and the deploy stage needs both build and test. In that case, why bother with stages at all?

I've posted a kind of workaround on SO but I'm not that satisfied with it...

@nbigaouette1

I did manage to ship first part of proper artifacts passing: https://gitlab.com/gitlab-org/gitlab-ce/issues/3423.

However, we miss the implementation on runner side as for now. I'll release Technology Preview of that this month. The first implementation will pass all artifacts from the builds in previous stages to next stages.

So it will be possible to do that:

build:
  type: build
  script:
  - make binaries
  artifacts:
    paths:
    - my/binaries

test:
  type: test
  script:
  - my/binaries do-something

Thanks @ayufan! So I almost understood it correctly: stages are the way to go, but not cache. It should be artifacts. I'm glad it's coming soon!

In gitlab-org/gitlab-ce#3423 @plajjan mentions a scenario where two different jobs of a stage generates two different artifacts (say for python 2 and 3). These artifacts are to be used by one of two more jobs in the next stage. Will gitlab (ci/runner) know which one to use where? Or will the names have to be different?

Thanks again ;)

@ayufan This not yet shipped, is it? If it is not, how likely is it to be released with 8.5?

And another question about the artifacts, how long are artifacts stored. Is there a way to delete them after some time or on specific branches? In one scenario of us there are dependencies needed by the next stage. But those artifacts could be deleted just as the build finished or failed. Maybe also an option to mark artifacts to only persists over the build would be great.

Is there any thing like this already?

What I talk about are really dependencies like vendor/ folders. But actually cache do not work for passing them between stages. Would be optimal if those could be updated/installed/build once and then be able to use in the following stages. Without having tons of them stored as artifacts.

When I tried this with the cache, this worked, but only if you have no parallel runs. As soon as there are more than one parallel jobs in the second stages only one of them receives the cache and thus one of them fails to build.

@wzrdtales The vendor/ folder is the cache :) The artifacts are the intermediate output of the stage. You can check this to tweek the caching setting: http://doc.gitlab.com/ce/ci/yaml/README.html#cachekey

@ayufan Then the cache currently is not usable for mutliple jobs at the same time, which is a problem if there is a preparation job.

Wait a second, so with cache:key we can now:

make:
  stage: build
  script:
    - make
  cache:
    paths:
      - bin/
    key: "$CI_BUILD_NAME"
    untracked: true
test:
  stage: test
  script:
    - bin/generated

???

@seweryn.zeman Yes, you can, but I don't advise that, because you can end-up with some mixed scenario when you don't know what exactly happens :)

It's better to wait for artifacts. At some point artifacts could become local :)

Ok, I understand this as long as cache is proper term here. :) Also naming "local artifacts" local seems to be a good decision.

I think we already discussed all possible cases here:

Caching between jobs – for keeping vendors/ that can be only updated, don't have to be rebuilt.
Artifacts – build output that can be downloaded from UI.
Local – to keep make between stages like build and test. (This sound so obvious xD)

Artifacts is the intermediate output of stage that will be passed to next and shouldn't be saved in vendor/ (ie. cached). I can think of it as object files from compiled C++ project.

Making artifacts local would be only optimisation for passing them between stages.

The way the artifacts passing will work now is that you could use runners on separate servers.

@ayufan Then a quick question: If you have a vendor/such as node_modules and you install all dependencies first.

This would be still considered to be cache?

If yes, then what I actually tried to do is the following:

stages:
  - setup
  - test

before_script:
  - export CI=1

install::deps:
  cache:
    paths:
      - node_modules/
      - bower_components/
    key: "frontend"
    untracked: true
  script:
  - nvm install 4.1
  - node -v
  - npm update -g npm
  - npm -v
  - cp config/environment.default.js config/environment.js
  - cp config/.ember-cli-default ./.ember-cli
  - npm install bower
  - npm install
  - node_modules/.bin/bower update --force-latest --allow-root
  tags:
    - node
    - nvm
  stage: setup


test::phantom:
  cache:
    paths:
      - node_modules/
      - bower_components/
    key: "frontend"
    untracked: true
  script:
  - nvm install 4.1
  - cp testem.json.phantom testem.json
  - xvfb_test
  - grep -m 1 -e percentage coverage.json
  tags:
    - node
    - nvm
  stage: test

test::firefox:
  cache:
    paths:
      - node_modules/
      - bower_components/
    key: "frontend"
    untracked: true
  script:
  - nvm install 4.1
  - cp testem.json.firefox testem.json
  - xvfb_test
  - grep -m 1 -e percentage coverage.json
  tags:
    - firefox
    - node
    - nvm
  stage: test

This works partially, the first job installs the dependencies just fine. Now there is a problem if the next stage runs which would depend on what the first job has done. I think in this case this more than just a cache as the flow enforce that the dependencies need to be available.

Actually the problem is, that only one of the two runners in the next stage receives the cache and the other one not. If I add a chrome runner, then also only one of them gets the cache. I assume that only one job can access the same cache at the same time.

Yes, artifacts passing solves that problem. It will unfortunatelly put a little more overhead, but the artifacts could be fetched by any number of concurrent runners.

How big is the cache for you?

100MiB

mentioned in issue #1039 (closed)

I have the same problem as @tobru. I need to pass vendors between jobs, that can be runned concurrently. Artifacts could be solution, if they wouldn't be send to the gitlab server.

Any other idea @ayufan, how to handle it?

@dariss6666 Currently the artifacts will be send to server. I prefer to go with that solution since it is more generic. Later we can think about optimizing the artifacts passing mechanics.

@ayufan May be it would be still a consideration to clean up artifacts after some time. For me just 100 builds would result in 10GiB artifacts generated and stored but never used again after the build.

And one thing to consider might also be that passing artifacts in those scenarios only make sense if all jobs, which should get those artifacts passed, run on the same server. This obviously hardly depends on the size of the artifacts and the connection between those two possible servers.

Already on their list https://gitlab.com/gitlab-org/gitlab-ce/issues/5572

Yeah :) Building the product takes time :D

The thing is that everything that was mentioned about caching and artifacts is somewhere on our roadmap, but it takes time to build and make sure that it fits the proper quality standard. So I do to take an incremental approach, adding small features that will make our live (or builds) easier. Later reassesing what we can optimize to make it more robust, but on the other hand to not introduce unecessary complexity.

You maybe interested in checking out the artifacts passing between stages: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/merge_requests/80

mentioned in issue #1186 (closed)

This is resolved by passing artifacts between stages.

If I set something as an artifact, it is propagated back to gitlab for later download. Sometimes this is a good thing, sometimes bad (I don't want to persist this much data). Does gitlab persist more than the most recent CI run artifacts? If so, this could chew up my disk space fast. Consider a project with something like 6 branches each of which builds a Gig of product (artifact). I could get in a bind for space fast. Perhaps we could set a flag on "artifact" to indicate the artifact should not be persisted back to gitlab, just left on the runner? Could I delete the artifact at the end of the "deploy" stage to avoid having it end up on the gitlab server? While I could get around this by stashing things outside the runner's build tree this feels... dirty. What's the right thing to do here?

Lee, this issue is not really about artifact passing. Please see https://gitlab.com/gitlab-org/gitlab-ce/issues/3423#note_4714661, particularly the conversation with @ayufan about artifact passing being enabled by default, and the way you can explicitly control which stages you want artifacts from.

It would be great if we could define the cache as read-only for certain jobs. I currently install npm and bower dependencies in parallel during the dependencies stage. After that the codebase is linted in the lint stage.

In that stage I need the dependencies, thus the cache. However, I don't change anything in the cache. Or even if I was, this would be an unintended side-effect. The CI runner however still checks the directory for changes and submits these - if any.

It would be great if we could set a cache:readonly key, so that this unnecessary step is not performed.

.gitlab-ci.yml

Don't understand what I was doing wrong. Gitlab Version 8.6.7.

image: debian:wheezy

before_script:
  - uname -a
  - ls -la

stages:
  - build
  - test
  - deploy

sample_build:
  stage: build
  cache:
    paths:
    - result/
    key: "test1"
    untracked: true
  script:
    - df -h > result/df.txt
  only:
    - master
  tags: 
    - run1

sample_test:
  stage: test
  cache:
    paths:
    - result/
    key: "test1"
    untracked: true
  script:
    - cat result/df.txt
  when: on_success

sample_deploy:
  stage: deploy
  cache:
    paths:
    - result/
    key: "test1"
    untracked: true
  script:
    - cp result/df.txt target/df.txt
  artifacts:
    paths:
    - target/*.txt
  when: on_success

Build crash with error:

gitlab-ci-multi-runner 1.1.3 (a470667)
Using Docker executor with image debian:wheezy ...
Pulling docker image debian:wheezy ...

Running on runner-85d37076-project-2-concurrent-0 via d17...
Fetching changes...
Removing result/
HEAD is now at 1a8de2f cache fix
Checking out 1a8de2f6 as master...
HEAD is now at 1a8de2f... cache fix
Checking cache for test1...

$ uname -a
Linux runner-85d37076-project-2-concurrent-0 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u6 (2015-11-09) x86_64 GNU/Linux
$ ls -la
total 24
drwxrwxrwx 3 root root 4096 Apr 22 09:26 .
drwxrwxrwx 3 root root 4096 Apr 21 15:54 ..
drwxrwxrwx 8 root root 4096 Apr 22 09:26 .git
-rw-rw-rw- 1 root root  646 Apr 22 09:26 .gitlab-ci.yml
-rw-rw-rw- 1 root root    7 Apr 21 15:54 README
-rwxrwxrwx 1 root root   28 Apr 21 15:54 script.sh
$ cat result/df.txt
cat: result/df.txt: No such file or directory

ERROR: Build failed: exit code 1

If I comment "key:" - folder "result" exist and the output of the build is "success", but the next stage "sample_test" no longer works because there is no folder "result/df.txt: No such file or directory"

@Fein set the cache globally and use something like key: "$CI_BUILD_REF_NAME". Since default cache doesn't cache in between different stages. Also if you set a path on your cache & untracked: true it defeats the purpose since untracked in your case might be the same as having the result/.

Only thing what i wonder is how to disable the cache normally on another stage. I use

cache:
     key: "disable" # the word disable can be anything in this example.

Since i have my "dist" folder which i take from build stage to deploy stage. In my deploy stage i don't need say bower/node modules because i use an artifact. So why would i want my deploy stage to "save/restore" the cache again for these files. Waste of time, don't know if there is a nicer method (but this works for me)

I just started using GitLab CI after reading the documentation I settled on a bunch of stages in the following order or an Android app:

build (compile sources, assemble the .apk)
check (check style, lint, etc...)
test (unit tests, automation tests, etc...)
archive (upload the final .apk to some server for storage)
deploy (deploy the .apk to production - master branch only)

Most of these stages require a built .apk file, which is done in the build stage. Without caching, I'm required to assemble the .apk every time for each stage, making the whole process really slow. I honestly never thought I needed some sort of caching for this. I always thought of jobs more like tasks that will be executed in order of stages associated and the temporary output from assembling the .apk would remain available for the next job in the following stage. Guess not.

I assume this issue exists to solve this particular issue, correct?

Anyway, I currently solved my problem using the cache options using key: "$CI_BUILD_REF" (that's the commit hash) with some specific paths. Not sure if this is the right way or if introduces some other problems that I might not be aware. But it seems to help "keep" build data for the next job on the following state regarding the same commit.

What do you guys think of this workaround for now?

Is this issue being discussed for a future release?

mentioned in issue #1324

@rfgamaral, the reason why it works the way it does is because the different steps might be executed on different machines. That is the beauty of Gitlab CI. This is why caching will need to be enabled "explicitly" for the folders/files you want to cache.

@francoishill11 Yes, that makes sense. My questions was more about the $CI_BUILD_REF and if there was no problem in using that as cache:key. Cause that's about the only way I've found to do what I wanted.

Although I understand how cool it is to have the various stages being built by multiple runners across multiple machines, what would you guys think of having an option to force execution with the same runner the first build/stage started with? I mean, you can still have multiple runners across multiple machines, but you push to a branch and a new build on stage A will be started on runner X, the next stages will still use runner X. But you can do another push to the same branch and have stage A started on runner Y, but the next stages will use the runner Y until it finishes all stages.

Something like on the .gitlab-ci.yml file:

jobs:
  same_runner: true

(probably not the best of names to clearly identify what it does)

Just a suggestion.

@francoishill11 @rfgamaral This use-case is not what caching is for. Caching is to speed-up the setup of build dependencies. There is no guarantee that the cache will be there.

You're looking for artifact collection:

build:
  script:
    - do whatever
    - steps necessary
   artifacts:
    - build/something.apk

@jonathon-reinhart If that is so, than I got to be honest and say that I really dislike how GitLab handles all this. Let me clarify...

To me, an artifact is - as you used in your example for an Android project - the final .apk file. It could also be HTML reporting from Android Lint. An Android project based on the Gradle build system generates many temporary and intermediate files that will eventually result on that final .apk file (or not). These temporary/intermediate files are not artifacts. Well, not from my point of view at least.

For instance, let's assume a 3-step stages process, build, check and assemble. On the first one I'm going to build the application by simply running the Gradle tasks that will compile the .java files into .class files. On the next stage, I need these .class files to run static code analysis tools like Findbugs. On the next (final) stage, I'll also need these .class files to assemble the final .apk file. If these .class files are not passed in someway between stages, each build stage will take much longer. For each build stage I only need to compile the .class files only once and that's on the first stage. All others should just reuse those temporary .class files and simply do what they are supposed to do (static code analysis and assemble .apk).

Then there's the .gradle folder which contains the Gradle Wrapper. But I guess this one could be considered as a build dependency and I can use a global cache for this one. Correct?

Personally, I don't think it makes any sense to pass these temporary files between stages using artifacts. They will also be stored along the builds and accessible through the GitLab interface but these files are pointless to keep, they don't need to be kept as build artifacts, they are temporary and only need to live between stages. The alternative is the cache feature, but you say there is no guarantee that the cache will be there. But I do want to store the final build .apk (assembled on the final stage) as a build artifact to be downloaded from the GitLab interface so QAs can browse the builds and pick the ones they want for testing.

So what's the solution for all this?

IMO, I believe there should be a guaranteed way to pass some sort of cache for these temporary build files between stages, it would solve a few issues.

Unless, artifacts somehow are flexible enough to achieve what I just described. Example:

stages:
  - a
  - b
  - c
  - d

job1:
  stage: a
  script:
    - something
  artifacts:
    - app/build/ # DO NOT collect artifacts for download on GitLab, just pass it to next stage

job2:
  stage: b
  script:
    - something
  artifacts:
    - app/build/ # DO NOT collect artifacts for download on GitLab, just pass it to next stage
    - app/build/reports/ # DO COLLECT artifact for download on GitLab

job3:
  stage: c
  script:
    - something
    # We do not need to pass app/build/* to the next stage after this job

job4:
  stage: d
  script:
    - something
  artifacts:
    - app/build/outputs/apk/final.apk # DO COLLECT artifact for download on GitLab

Is this possible? I could work with something like this... But I'm not sure the artifacts feature is flexible enough to provide this.

tl;dr;

How can I have a build process with X stages, pass along temporary build files between them and selectively collect just some artifacts for storage on GitLab per stage?

@rfgamaral I think they're still artifacts, whether or not you want to keep them after the build completes. Maybe what you really want is a new feature, "temporary artifacts". @ayufan?

@jonathon-reinhart From my point of view they aren't, but let's not discuss semantics :D

But you said "whether or not you want to keep them". Do I really have a choice? I mean, if I use artifacts to pass those temporary files between stages, won't they be automatically collected for download on the project GitLab page? Maybe I'm missing something here.

It's not really about "wanting", it's more of a missing feature, something I'd really benefit from, maybe others too. A feature that allows me to keep temporary build data between stages. It's not really relevant if it's called artifacts, cache or whatever. As long as it works and it's reliable and doesn't mess with any of the current features. :)

@rfgamaral

That is what are the artifacts for - to pass data between builds. Of course they are exposed to be downloaded from UI, but since any job can run on different runners we need some central place to store them (you can later retry any build from web interface). The caching on the other hand allows you to speed up some operations on runner that you are using to run builds. The caching were never meant to be used to pass data between builds, since it doesn't make sense in this case.

The only problem that I can think of is that you need to create, upload, download and restore artifacts this may not be good for all workflows. Especially ones that have a big artifacts. However, we should think about this issue how we can optimize this workflow.

@ayufan I understand that you can later retry any build from the web interface (and that's nice). But I can't get my head around having those temporary build files available for download mixed with files that I actually want available for download.

I still think this needs a new/better/different mechanism to handle other kind of scenarios, such as the ones I have described. I can think of two ways:

Use the current artifacts feature already in place and working but allow one to selectively pick which files/folders should be available for download in the web interface. This way we could easily avoid making these temporary files available for download/browsing.
Implement a new feature with a new name - similar to the artifacts feature - to pass data between stages that might execute on different runners. Just don't make these "artifacts" available for download at all.

What do you think of both of them?

However, I just reread some past comments on this issue and I'm also in favor of finding a better mechanism for this. Some way to completely avoid passing that temporary data as build artifacts to the GitLab server. Space is a concern, in some scenarios this data could be huge and fill up the server fast.

Maybe a way to force that when a build is first started on the first job/stage, it will use the same runner until it reaches the final stage. You still have the problem where one of the stages fails and you retry that stage from the web interface. It should still attempt/force to run on the same runner when it becomes available. If not available (maybe it was deleted) start that build from the first stage.

Just a few more thoughts on this :)

@rfgamaral: you say temporary files, but isn't that what cache is for? You can keep the cache per build or even per branch (which i do, so i don't have to re-download npm packages for example) Just don't place your "artifact" in the same folder, place it in a seperate folder and then use that to download. Set the cache globally so its re-used for all jobs. If you don't want to use the cache use something i did in my older post on this thread. Works for me, i have caching for my downloaded files. They go from stage1/2/3 and i can use it to build and test my code, and the final result gets 'deployed' and only the coverage report is as a artifact to be downloaded. Also don't forget that you can name your items "stage1:build:" and "stage1:report:" and then cache stage1. Stage2 could be deploy for example with an artifact that comes out of stage1.

@riemers The problem with cache is that it's not guaranteed, and as explicitly said in the linked comment that is not what it's for.
Which is a problem if your next stage depends on files generated in a previous stage, like the .classfiles @rfgamaral describes in this example.

I'd like to use this great free service sparingly and not clutter it with a few 100 megs of artifacts which I won't ever use after.
So something like temporary artifacts would be great or/and having the option to specify which artifacts are permanent/downloadable.

@jeroenpelgrims : yes generated files should be in artifacts, cache should only be things you need when you want to build. I did see somewhere that they wanted to expire the artifact after xx days. I would also like to see something like that since i don't always need those files as downloads in gitlab, which indeed creates clutter. I think in the next version it would be default 30days or if you set it to "expire never" it would stay. This would at least not build up more clutter at some point.

@ayufan what do you think about this:

Add possibility to mark artifacts as not-downloadable.
Add possibility to store artifacts in external storage like S3?

TL;DR – go to my idea: artifacts:downloadable below.

I believe this would solve two biggest problems:

Ad 1) Not making intermediate build elements downloadable.
Ad 2) No need to care about max artifact size. No storage overloading with historic artifacts. Possibility of easy artifacts management with external tools (like "auto-Glaciering" on Amazon).

I believe cache and artifacts should also be described more-use-like in documentation. I mean add big description so people won't confuse artifacts as they are defined by biggest CI engines now (Jenkins, Travis) – where they ARE defined as final product that can be downloaded/deployed/sent to HockeyApp etc.

I think it should be described as this:

cache – temporary storage for project dependencies. [They are not useful for anything else, not for keeping intermediate build results for sure. I also think cache:key is useless here. All dependency managers can handle multiple versions of deps.]
artifacts – stage results that will be passed between stages. [I would introduce new key here, let's say artifacts:downloadable.]

`artifacts:downloadable`

This flag set to false will remove artifacts once after all triggered builds are done. This would be useful to ensure passing downloaded dependencies or intermediate build results to another stages, but clear them after whole process is done. Setting this to true would pass selected artifacts to be kept from deletion and passed to GitLab UI for downloading.

@riemers That was what I thought at first, mostly because I'm currently using a single runner on a single machine. As such I don't have the issue where stage B might execute on a runner different from stage A, not guaranteeing the cache between stages. That will only work if the build goes from stage A to Z in the same runner/machine, otherwise, you might not have the cache available.

@seweryn.zeman I agree with your purposed solution but I still think one tiny detail is missing (unless I misunderstood your suggestion). I don't think a boolean downloadable option is enough. You either store artifacts or you don't, not very flexible. IMO we should be able to pick what we want available for download and what we don't want. That way I can selectively pick that the final product files (.apk, .war, .jar, whatever) will be collected for downloads through the web interfaces, while all other generated/intermediate files will not.

This could be solved with multiple artifacts options, otherwise, a different option artifacts vs temp_artifacts (or any other name) is required to distinguish between what you want to be downloadable and what you don't.

Yes, I didn't extended it in comment for purpose – because I was not sure how to make YAML list of paths with flags attached to them. I think best option would be to enter a path for artifact and add downloadable flag to this path – as you said.

This will work:

artifacts:
    - path: node_modules/
      downloadable: false
    - path: dist/
      downloadable: true

EDIT: Or downloadable may be replaced by temp and inverted logic.

@seweryn.zeman From my point of view that would be the ideal solution.

To go a little bit further on that... How would the downloadable: false files be handled? Filtered out in the web interface or simply not available at all in the GitLab server? I believe in some use cases, these "temporary" not downloadable files will be huge and some people here already expressed their concern regarding space usage on the GitLab server. That shouldn't be forgotten.

I would simply remove them after all stages/builds are done from triggered task. Simply after all builds are finished – remove.

And I also would love to have caching/artifacts storage on S3. This way any problems with even gigantic artifacts would disappear.

Basically, downloadable: false remove after all stages/builds are finished. That would work too :)

Just to add, I think the name of the stage "build" is a little unclear; it would suggest that the output of the build is then useful in future jobs. Im not sure wha the point of a build step actually is if it's outputs are not present in future stages. It's also not completely clear from the CI docs that stages are completely separated and do not carry forward any changes made. This bit me when I was learning CI and was only made totally clear when I read this issue!

For anything with a compilation step (e.g. Java, javascript gulp processes, Android builds) having some sort of local_artifact is vital.

I agree having some sort of local artifacts is needed. We do have artifacts that are more then 2GB of data (after build) which we will run tests against and pack them to a setup in the next two step (next stages). We can't always transfer this files to gitlab server just to pass them to the next stage which should actually be run on the same server. Something like "local_artifact" to store artifacts locally as well as "same_runner" to group jobs that should be run on the same runner as suggested above is needed here in my opinion.

To be honest...even the documentation doesn't really know that jobs are independent: Take a look at the "when" documentation: http://docs.gitlab.com/ce/ci/yaml/README.html#when Why would you even need a "cleanup_build_job" then?

I strongly agree that having a same_runner boolean would be very helpful too. I would think that the majority of projects dont require huge, multi-server runner setups, and this would simplify yaml files considerably, as local_artifact would not need to be passed through all the stages.

@dabeeeenster @Wolfspirit Thanks for your comments. I think that we all agree that this is needed. Now we need to think how to make it work without introducing a lot of "configuration" issues :)

Someone might want to edit them and clearly explain somewhere that stages basically start from a clean slate, or point out to use artifacts for passing.

I don't see anyway that anyone would expect the behavior of starting from scratch on every stage would they? It's fine you choose to do this, but some indication in the docs that this happens seems important...couple hours down the tubes

+1 Just had the same surprise. Nobody expects the directory to get cleaned before each stage. Maybe just give us an option to skip the cleaning?

@XemsDoom Please read some of the previous comments in this issue, the reason isn't that stages are being "cleaned" but that they are (potentially) executed on different runners.

@Wolfspirit I managed to do this by my own.
I have introduced mounted volume /artifact_repository to my runners, which holds data between stages.
And when pipeline is: successful/failed I just clean specific directory inside artifact_repository. The directory name comes from project name and build tag.

But I have jobs, that using it (/artifact_repository) on one host. If there are more severs with runners that are using artifact_repository, the volume need to be mounted to every specific host. I guess this can be done using NFS.

Just catching up on this thread now. From what I can tell, we currently have enough features that we support the original intent, which is to create files in one stage, and pass those files to subsequent stages in a guaranteed manner.

What we don't have is:

clear documentation how to do this,
efficient ways to use the same runner for subsequent jobs,
elegantly declare that these files should be passed within the pipeline, but not stored beyond that, nor made downloadable.

We have plans to solve (2) with sticky runner, and there's a good idea for (3) using the new artifact expiry, with a new declaration similar to "expire upon successful pipeline run". I know the name artifact may still be unpalatable for this purpose, and we could consider some third definition of things to store from builds. But substantially, it seems like the current feature set "works", if not elegantly. Is that a fair assessment?

mentioned in issue #1621

Just wanted to add on that this is important to my use case as well.

@markpundsack To clarify, would any combination of 2 and 3 prevent the files from being sent to the gitlab master at all. As others have mentioned, if the "artifacts" to remain only within the pipeline (per number 3) are very large, then passing them to the server could be time consuming. I only ask because you mention artifact expiry, which would seem to imply that they are still sent to the main server, but then will eventually be deleted.

Thanks.

Gitlab CI runner gripes

The way artifacts work today is okay. What isn't okay is using them to work around the problem of not passing data between stages. Normally, artifacts would be used to, well, expose artifacts. But they're used to overcome the flaws of the CI, which also drains space, bandwidth and, most importantly, makes builds slow.

In standard cases, splitting jobs to build, test and deploy stages is an overkill. A one-liner deployment e.g. bundle exec cap production deploy takes several minutes because a new container needs to start (GitLab.com CI seems to download it every time...) and then bundle install has to run once again from scratch if there's no cache to run a simple command. It takes an extra 5 seconds to deploy an app with Codeship CI, and extra 1-3 minutes to do that with GitLab CI, depending on cache existence.

I could use one stage to build, test and deploy. However, then I can no longer use only and expect syntax and I have to mess with shell if statements to perform deployment at certain conditions (only master, only tagged).

Artifacts seems to be the only mechanism that kind of fixes the main flaw which is not retaining build data between stages. Ideas like downloadable/non-downloadable artifacts only show up because of the main flaw. Nobody would come up with a "non-downloadable artifact" thing if stages worked naturally and passed the data, because artifacts are considered a UI thing (being able to download and retain certain build artifacts).

Thesis

For most people, stages are simple steps. That's a model for our brains. We want to see which stage the build failed at in the UI.

It doesn't mean we want each stage to run in a separate container with no previous data, or use it to parallelize workload.

Solution

I think the real solution here is to "inline" all jobs from all stages into one job and run them in a single container execution if there's one job per stage. (Example 1)

If different images are used for each stage, but there's still one job per stage, run all jobs on the same runner and share data by default via a volume. (Example 2)

If there's more than one job per one stage, run these jobs (and only these) in parallel. Just like today. (Example 3)

Result: current behavior and robustness for complicated & long running jobs that need parallelism is preserved. Non-parallel jobs like a typical build+test+deploy scheme now run 3x faster in a single container run. Non-parallel jobs like a typical build+test+deploy scheme now run 2x faster because artifacts don't have to be uploaded and downloaded for each stage.

Example 1

Typical build+test+deploy done the "natural" way. All executed in one container run ruby. No playing with non-downloadable artifacts thingies. No need to wait for "deploy" container to start after "test" container passed.

image: ruby

build:
  stage: build
  script:
  - bundle install
  - bundle exec rake db:migrate
  - echo hi > hello-from-build-stage

test
  stage: test
  script:
  - bundle exec rake spec

deploy:
  stage: deploy
  only:
  - master
  script:
  - bundle exec cap production deploy
  - npm install -g aws-s3
  - aws-put hello-from-build-stage

Example 2

Typical build+test+deploy done the "natural" way. Some steps executed in different containers. No playing with non-downloadable artifacts thingies. Data is shared between runs by using the same runner, and mounting a volume. This is possible because there's only one job per stage, so all jobs are sequential.

stages:
- ruby
- python
- java
- deploy

ruby:
  stage: ruby
  image: ruby
  script:
  - bundle install
  - bundle exec rake db:migrate
  - echo hi > hello-from-ruby-stage

python:
  stage: python
  image: python
  script:
  - ls hello-from-ruby-stage
  - pypi install -r requirements.txt
  - python run something

java:
  stage: java
  image: java
  script:
  - mvn install
  - mvn test

# Note: deploy stage doesn't require any specific image, so it runs in the same container as java stage.
deploy:
  stage: deploy
  only:
  - master
  script:
  - bundle exec cap production deploy

Example 3

Parallel jobs when needed.

stages:
- build # one job
- render # three jobs belong to this stage
- test # one job
- deploy # one job

build:
  stage: build
  artifacts:
    paths:
    - build/
  script:
  - make

# Three jobs from render stage run in parallel. This behavior happens when there's more than one job per single stage.
render_goodneighbor:
  stage: render
  artifacts:
    paths:
    - render/goodneighbor
  script:
  - render goodneighbor  # Heavy computing happening that needs parallelism

render_diamond_city:
  stage: render
  artifacts:
    paths:
    - render/diamond_city
  script:
  - render diamond_city

render_vault_88:
  stage: render
  artifacts:
    paths:
    - render/vault_88
  script:
  - render vault_88
  
test:
  stage: test
  script:
  - bundle install
  - bundle exec rake spec
  - echo hi > hello-from-test

# This runs in the same container as test stage. Only/except rules still apply.
deploy: 
  stage:  deploy
  only:
  - master
  
  script:
  - scp hello-from-test remote@server
  - scp build1 remote@server
  - scp build2 remote@server

Disclaimer

I'm a newcomer so I might lack context, and I surely don't know what your EE customers do with GitLab CI, but hopefully this is eye-opening/interesting/useful for you. Thanks and keep up the good work.

@Nowaker I very much agree with almost everything you said. The only suggestion I would add as that the restriction to run things all in the same container is largely unnecessary. For example I have several projects where the different steps of the build need different tools. Node vs Java, etc. What the docker support in jenkins does (for example) is leverage data volumes. Here the source code is checked out on the agent (runner), so now you have the code repo cloned. Then the launch each step in the specified docker container and mount the workspace into the container. The container does its job for the step in the build and is then disposed. The next container in the process is launched and the workspace is again mounted into the container. In this way you can still use multiple containers, but the files just move along the pipeline.

Now of course for this to run, all of this needs to be done on the same runner.

I think what you said makes a lot of sense in that, I may be separating things in stages as a logical organization for my own interpretation. I make not be creating stages and jobs explicitly because I want parallelism. Coming up with a nice clean construct where I can define multiple stages, each with their own container that will run on the same running and pass artifacts between them is the key I think.

For the same reason that we mentally want stages, I also don't want to have to build uber build containers that can do every step of the process. So I want separate stages, and separate containers, but the ability to say these are all tightly integrated steps of a single build path and should run on a single runner and share files directly. I think this would solve 90% of the builds projects need to get going.

Then we can layer on the more complex cases where we need parallelism and artifacts published back to the server that other things can pick up and work with. So where I see gitlab CI now, is they have implemented the complex case, and we are now trying to get something simple and efficient that works for a lot of cases.

I completely agree with @mmacfadden and @Nowaker - the defaults that gitlab have chosen here (parallelised builds, multiple runners) is the wrong default. I think the bulk of users would prefer a single runner, sequential builds, and a shared data container passed between stages as @mmacfadden describes.

Don't get me wrong. What gitlab can do is very powerful and will work really well as your project scales. But yeah if we could start of with this simplicity and then choose when and where to implement more complex structures, I think gitlab CI would see increased adoption.

Yep sorry didn't mean to sound so negative! It's great that gitlab have built this overall architecture as it means it can scale really well. I just think that there should be a parallel: true option available (that defaults to false), with the appropriate changes to how data is or is not passed between stages.

Hey @mmacfadden, thanks for your feedback. In my scenario, specifying a different image per job would mean disposing of the data. In my Example 3, if test job was image: ruby and deploy job was image: java. So yeah, your problem wouldn't be solved and you'd need to resort to using artifacts hack to share data. But one use case, simple build+test+deploy in a single container is solved, and I believe this is the most frequent one (at @virtkick we have like 20 small projects like these, and only one serious one that requires more images in one build). This is Example 1.

Second problem is running jobs sequentially on the same runner in various containers, and having a uniform way of sharing data on the runner across these container runs. Data sharing should be the default - why not? Of course, the condition for this to happen is having only one job per stage, and not using runner tags. This is Example 2.

Step 3 is parallel jobs for complex use cases, something GitLab have already solved. For complex scenarios they beat Travis or Codeship. Good for them! So now only easy work too implement next. :) This is Example 3.

Thanks for your input @dabeeeenster and @mmacfadden.

@Nowaker In my opinion (which doesn't count for much)... At first I was thinking the default behavior would be to run an entire build on a single runner, each job would run in a separate container, and data would always be shared between the stages. This is not all that hard to do from a docker data volume standpoint (if that is what you were doing). There is no reason to throw data away between containers unless there is a need to do so.

The main thing that strikes me right now is that there is just no way to logically group jobs, or commands into groups simply for the sake of readability and debuggability of the process. I could put the whole darn thing in one stage / job if I just had a way to label certain parts of the job / stage as "checkout", "build", "test", "deploy". I want these groups because it makes me feel better about what I am seeing. I don't need them because I need parallelism, nor do I need to publish artifacts from the intermediate things. I really want the artifacts to just come out at the very end.

The problem is that I don't have a way to do this without using "stages", "jobs", "caches", "artifacts" that all have implications well beyond just logically grouping things so the UI, Logging, and Reporting is easier to digest.

I don't have all the answers by any means, just sharing my experience. I really want to move off jenkins to gitlab ci :-)

@dabeeeenster, @mmacfadden, I've just updated and extended my examples to cover your use case of sharing data between different containers on the same runner. Please review and let me know if that indeed solves the problem you outlined. :)

@mmacfadden Re: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/336#note_14755042, yeah, exactly my feelings. We want mental separation between stages, and being able to use except/only for deploy, and that's it.

@mmacfadden @Nowaker @dabeeeenster

Thanks a lot for comments and ideas to evaluate :)

I completely agree with @mmacfadden and @Nowaker - the defaults that gitlab have chosen here (parallelised builds, multiple runners) is the wrong default. I think the bulk of users would prefer a single runner, sequential builds, and a shared data container passed between stages as @mmacfadden describes.

The defaults are very, very, very conservative, because they are conservative they are not the best in terms of speed. I would say that this opens part of bigger problem that we should solve is to actually start preparing a guidelines to figuoring out and optimising builds.

So I want separate stages, and separate containers, but the ability to say these are all tightly integrated steps of a single build path and should run on a single runner and share files directly. I think this would solve 90% of the builds projects need to get going.

We are looking into a way to tie a builds to specific runner for sometime already and slowly improving a different parts to make it possible. We recently changed the way how pipelines are processed to make it more single-runner friendly.

The state of caching and artifacts

I agree with you that there should be much easier way to have achieve simple flow: build-test-deploy. On the other hand it's hard to do it right too, because we have to assume a lot about a build plan, and all assumptions are just assumptions and introduce a lot of magic in how system works. I basically try to avoid to introduce this magic, by introducing simple concepts (maybe not always as similar to other products). This is why we added artifacts. I would argue a little about artitfacts, because I basically tried to redefine it a little, but maybe introducing a little confusion and a lot of people is arguing with me about the purpose of artifacts. Always, my idea how to split the data passing were:

caching - to cache platform dependent files, which are tied to build container and do make sense in context of you docker image (ex. bundle install, it's tied to ruby version you use),
artifacts - to have a scalable mechanism for passing data between builds and stages,
releases - to create a first-class release in GitLab, tagged, with description and attached files.

A few comments. artifacts is basically a little different concept then artifacts used in other products that didn't yet have Pipelines/Stages/Builds. Jenkins to name one. Most current systems are designed around of concept of single Build or multiple parallel builds. So it is natural that they only output is artifacts. This is still the same for GitLab CI, the output of build is artifacts, but this output can currently be intermediate and final. Everyone assumes that artifacts are final. But when we think about scalable system we have to have an idea to make it easy to pass between next builds. In Jenkins world you also can pass data between Projects, of course using artifacts. But there artifacts also are being used as releases. This is where we didn't yet finish our work. We do have Releases, or I should say Tags with description, but this is not nicely integrated with GitLab CI and there's no first-class support to create one. Given that, if we have fully working artifacts and releases, the purpose of artifacts also do change. To be this intermediate output of builds, probably disposable, but also probably not always, because you may want to preserve some artifacts, because they have a valuable information, like debugging symbols.

Given above as for now the system is conservative, and is optimised for scalability, not yet performance. However, there's a number of improvements that we can do:

artifacts - make them implicit and local by default, try to reuse them as much as possible on single runner: This basically solve the upper comment about having to send artifacts to GitLab. I agree, from day one I did know that this will be too heavy task for some builds to be done. But by doing the implicit thing we introduce magic, and a lot of logic to make sure that we are always running on the same runner, that this runner doesn't get modified by another build executed in between, it's unfortunatelly not that easy thing to achieve. Current assumption is slow and conservative, we always assume that you are always starting from scratch.
releases - it would be nice to finally finish the integration adding a first-class support for releases, it would make it really handy and maybe make it less confiusing about possible facilities. The important part is that artifacts are now suitable only for passing files, but we start to have more artifacts, ex. container images. The releases would have an option to include any type of metadata: files, but also container images, and anything else that we ever add.
sticky builds - I have a great plan to do that, but this is also a little complicated, because there's a lot of edge cases.

Sticky builds

There's this long requested feature: run build on the same runner. I have been thinking quite long about this one. This is kind of tricky, because over time the GitLab Runner become much more robust, supporting way more execution environments: Docker, VirtualBox, recently Kubernetes and this makes it harder to figure out a solution that would work more or less the same on each executor. So we have to probably consider that in context of specific executor and assume that it's possible that it may behave a little differently, but I want to avoid that just by doing a minimal change that improves that (the processing of pipelines were 1 step), but also makes it as much as compatible it can be.

My current idea to improve that is to add an option (implicit or explicit, probably @markpundsack will try to convince that implicit is better :)) that will allow you to stick your next build to the same execution environment of previous one:

build:
  stage: build
  script:
  - bundle install

test:
  stage: test
  continue_on: build # the name up to decide: on, continue_on, use, or anything else that makes sense
  script:
  - bundle exec rspec

test2:
  stage: test
  continue_on: build
  script:
  - bundle exec rspec

The default and simplest implementation will make this build be run sequentially: build, test, test2 on the same runner, possibly in the same context that were present from build. Runner would explicitly ask for follow up jobs after running build, and then after running test and after running test2. These jobs will be executed immediately after. The first implementation will probably try to reuse existing executor environment (VirtualBox VM, or Docker, or Kubernetes Pod). So it will assume that sources are clonned by build, and if they are modified these modified sources will be used by test or test2.

The good thing about that approach is that it will scale well if you have multiple pipelines running concurrently, it will not scale well for the single pipeline, because then effectively a pipeline will be executed in sequence. It's hard to guess how we should make some builds of a stage to execute in parallel, because trying to replicate environment of build, to have a separate, but identical environment to execute test and test2 is a hard thing to do.

The outcome is that having the continue_on: will make the simple case work as you would expect: no need for caching, no need for artifacts passing, no redundant clonning, but on the other hand it will make it slower, because the builds in a stage will not be executed in parallel, but in sequence. The nice thing is that you could then use this:

services:
 - docker:dind

build:
  stage: build
  script: docker build -t my-image .

test:
  stage: test
  continue_on: build
  script: docker run --rm my-image /run/my/tests.sh

staging:
  stage: deploy
  continue_on: build
  script:
  - deploy_to_staging my_image

But this leads to the edge cases. We will have to do some extra work in some cases:

If you retry build that has continue_on: we will have to execute all dependent jobs and continue_on: to make sure that we use a valid execution environment,
The same applies for manual actions.

Example:

build:
  stage: build
  script: docker build -t my-image .

production:
  stage: deploy
  continue_on: build
  when: manual
  script:
  - deploy_to_staging my_image

We would have to execute also build if you click production.

Summary

I hope that I made it a little more clear what I had in my head when I started adding all this caching, artifacts and everything that makes this fully scalable, but not user friendly for simple cases :) The important thing is that we added that, and now we have a voices like yours where we can see what works and where we should improve. I would say that Sticky runners is a next thing to work on. Please post your comments, concerns, ideas how to make it better :)

Thanks

That looks very good :-)

artifacts is basically a little different concept than artifacts used in other products

Maybe it also should be named a little bit differently then? That would certainly help those coming from another platform. But I know, renaming things is even harder than naming things, and it might not be worth the effort.

To be this intermediate output of builds, probably disposable, but also probably not always, because you may want to preserve some artifacts, because they have a valuable information, like debugging symbols.

That's a tough one. We don't want to clog your harddrives, but seeing the results for investigations of failed builds et cetera would be nice. Maybe there should be a default to keep them for about a day, possibly longer if there's enough disk space, but also buttons on the job page to either retain or discard them explicitly.

add an option for sticky builds

I like explicit better. My idea would be to integrate this in the image field - given that it should use the same runner, you can't use different images anyway. So have something like

build:
  stage: build
  image: something # or global default
  script: …

test:
  stage: test
  image:
    keep-from: build
  script: …

staging:
  stage: deploy
  image:
    keep-from: test
  script: …

I like explicit better. My idea would be to integrate this in the image field - given that it should use the same runner, you can't use different images anyway. So have something like

This is naming part, but conceptually it's the same :)

Maybe it also should be named a little bit differently then? That would certainly help those coming from another platform. But I know, renaming things is even harder than naming things, and it might not be worth the effort.

Totally +1.

@Bergi unfortunately using image keyword is not good idea there as it describes Docker image already. :)

One way or another for all possible solutions here, what I believe we need (even sooner that any other thing) is more detailed documentation for cache/artifacts maybe with common use-cases (many of them introduced in this issue).

That's a tough one. We don't want to clog your harddrives, but seeing the results for investigations of failed builds et cetera would be nice. Maybe there should be a default to keep them for about a day, possibly longer if there's enough disk space, but also buttons on the job page to either retain or discard them explicitly.

Exactly. For now we have expire_in, but more improvements definitely will be done :)

Do you have any roadmap then?

BTW Why you decided to write this in Go? It's so hard to contribute... :o

@seweryn.zeman

Do you have any roadmap then?

This are plans for now, set of ideas put into evaluation.

BTW Why you decided to write this in Go? It's so hard to contribute... :o

Because of portability and efficiency.

unfortunately using image keyword is not good idea there as it describes Docker image already. :)

@seweryn.zeman Actually that was on purpose. You can either select a docker image by name with a string, or you specify an object that refers to a previous job so that its image is selected and the job is executed in the same build run. Specifying both doesn't make sense (it would even allow you to request two different images for a single build run), and using image prevents that in the YAML structure already.

Btw, in sticky builds, the tags to select the runner need to be the union of all tags of the dependent jobs, and the runner needs to make sure to link/unlink the respective services for each job respectively.

@Bergi
I'm not sure if I get this right but we don't use docker images for our builds (Windows) and would like to run a job at the same runner the previous job (Build -> Test) was done.
I like the idea of sticky builds allowing me to set "continue_on" which causes the job to be run at the same runner.
Hopefully not clearing it's contents or checking out again!
If this gets integrated into the "image" how would you do something like that then?

EDIT: Or would the build job just not have an image but the others would have the keep option? I think that would be kind of confusing.

Hopefully not clearing it's contents or checking out again!

Yes, it will not be executed. We will try to preserve as much as possible.

@ayufan Awesome write up. A lot to digest here. Some quick comments.

If you want to explicitly force the sequential processing, why not just use:

services:
 - docker:dind

build-test-staging:
  script:
  - docker build -t my-image .
  - docker run --rm my-image /run/my/tests.sh
  - deploy_to_staging my_image

And be done with it? If it's because you want to visualize the results of each step separately, then let's work on the visualization part.

What about docker images? If each job specified a different image, how would it be possible to continue_on?

For me, it seems strange to make continue_on an explicit thing, and implicitly be leveraging artifacts, rather than make artifacts the explicit thing and then potentially optimize around that. We can improve artifacts with short-lived artifacts that only span a pipeline, and that are stored faster, more efficiently than artifacts today. Perhaps using s3 the way caching does.

We would have to execute also build if you click production.

That seems to completely defeat the purpose. :) I would really want to encourage an artifact or release approach here.

One big advantage of the explicit declaration of continue_on is that if you've written your .gitlab-ci.yml assuming state is shared between jobs, then you won't be surprised when that turns out to not be true. e.g. with sticky runners on by default, most times your CI will share state, but on the rare case when an instance needs to be recycled in between jobs, you might get a fresh box and then be surprised it doesn't work. With an explicit declaration, we'd know to either not allow that to happen, or re-run the entire pipeline from the beginning. I think writing your configuration such that it can be restarted in the middle is the "right" way to do it, and something we can encourage, but it's a lot harder than writing it the "simple" way. But with that in mind, someone doing things the simple way would never think to explicitly add the continue_on key. :)

One way or another for all possible solutions here, what I believe we need (even sooner that any other thing) is more detailed documentation for cache/artifacts maybe with common use-cases (many of them introduced in this issue).

Yes, so true! Let's start by teaching people how to do it the way we expect, then see what problems are left.

@ayufan rightly suspected I would prefer an implicit solution here. :) The primary reason is that I believe that sticky runners is a speed improvement that most everyone can benefit from and forcing 100% of our users to make an explicit choice is bad UX.

But if it's implicit, then it's also "best effort" meaning you can't rely on the runner being shared between all jobs, especially parallel jobs. So perhaps we would also need an explicit declaration for forcing that situation, which would allow people to then write simpler scripting. e.g. you don't need to do a bundle install on every step if you are guaranteed that you're running on a consistent runner. Our current "right" way would be to still do a bundle install in each job, but have it just be fast when the cache is present, which would happen more frequently with sticky runners. I believe that is, and will continue to be, our best practice recommendation, because it allows for individual job retries and manual actions, as well as parallel builds which I strongly encourage.

So, while I think continue_on is useful, I think the bigger bang-for-buck is to enable sticky runners by default for everything. Then leave explicit control to a future iteration.

That seems to completely defeat the purpose. :) I would really want to encourage an artifact or release approach here.

If we don't assume that, it actually makes it more complicated, because your build environment can be changed by other job already (newer/older commit). So re-using runner is basically a verify specific usecase when we want to optimise some workflow, and it will not work always.

I dislike sticky runners being default, because for the same reason you can't run two concurrent jobs on the same runner and there will never be an easy, portable and fast way to replicate a build environment that you prepare in previous stage. The common case is a database, you can't make it easy for two processes (test suites) to access the same database. How I see that now, sticky runners simplifies the workflow, but at the cost of the performance: it's done by basically inlining jobs, but still making them distinctive from UI and management perspective.

To clarify, would any combination of 2 and 3 prevent the files from being sent to the gitlab master at all. As others have mentioned, if the "artifacts" to remain only within the pipeline (per number 3) are very large, then passing them to the server could be time consuming. I only ask because you mention artifact expiry, which would seem to imply that they are still sent to the main server, but then will eventually be deleted.

@mmacfadden, Good question. artifact expiring does still imply it's being sent to the GitLab master. But in order to share them between runners, they need to be sent somewhere. The only way to avoid time passing artifacts back and forth would be to have sticky runners and only allow one job at a time so that you can rely on each job running on the same single runner. As soon as you talk about any parallel jobs, there has to be some mechanism for sharing the files.

@Wolfspirit Oops, I somehow missed that you can run jobs without docker.
@markpundsack Putting everything in one job with a long doesn't really work. The separation into jobs is intentional, they do have different dependencies, artifacts, caches etc.

Another idea would be to create an explicit pipeline description of jobs that should be run together:

stick-together:
  - image: …
    tags: …
    jobs:
      - buildA
      - testA
      - deployA
  - image: …
    jobs:
      - buildB
      - testB

The pipelines in the list would run in parallel to each other. Each of them specifies its docker image and runner tags, instead of having those in the jobs.
Admitttedly, this overlaps a bit with stages, though I found those not all too useful anyway. Rather only specify dependencies between jobs explicitly and let Gitlab figure out a topological order by itself, than to assign each job to a stage that awaits all jobs of previous stages implicitly.

It seems reasonable to require that if you declare parallel jobs, then your script is designed to run on different runners; thus not relying on sticking to a single runner. But given some of the example above, I can see the other side, where you're not really defining "parallel" jobs as much as defining two jobs that fit the same stage semantics.

@Nowaker The need for only and except does seem like a good reason to use multiple jobs rather than a single job with multiple lines of script. But like in example 1, only is used for a deploy step. I wonder what the correlation of use of deploy and use of parallel tests is. Meaning, if you're doing CD, you're probably doing (or should be doing) parallel tests too, in which case sticky runners doesn't work anyway.

The biggest downside I see to using separate runners for separate jobs is that the simplest cases, like example 1, look wrong when compressed into a single job. We want to think of these as separate jobs, especially given our default build, test, deploy stages.

But I really like our architecture because it's simplicity leads to incredible power and flexibility. This is a deep benefit we shouldn't cripple.

@mmacfadden Great point about mounting a volume between docker runs.

@dabeeeenster I do wonder if having an option to select a single runner vs parallel runners is the right path here. Rather than introducing a lot of complexity in each job, there's one project-level declaration whether you want everything running on a single runner or not. I still believe implicit sticky runners are a valuable enhancement to the parallel runner case, but it isn't necessarily the best solution to the simple flows described in this thread. Certain parts of the UI might be limited in the single-runner case, like you won't be able to retry a single job; you'd have to retry the entire build. But that might be fine and no worse than other vendors anyway.

Another benefit for the single-runner flow is that if you define services at the top-level, they could be shared across all (sequential) job runs. e.g. the docker:dind service doesn't need to be flushed in between jobs, thus some amount of docker caching can be provided. Same for postgres, etc.

I hesitate to even suggest this, but perhaps these require distinct configuration formats. Like simple-ci.yml vs parallel-ci.yml or put a version code at the beginning of the file. Some features like when: manual actions many not work in the simple CI format.

One major worry for me is that we'd be increasing the barrier of going from single to parallel builds. Since I truly believe parallel builds should be the norm, I want to make sure that we support that case well, and even encourage it. That's one of the reasons we've pushed so hard on these other elements first before trying to solve the "simple" case better. Any test suite that takes more than 5 minutes should use parallelism or you're doing your dev team a disservice. In that respect, teaching people how to properly set up CI for parallelism seems like a good goal. More pain up-front, but pays dividends later. Of course, even if we stick with that path, we should reduce that pain as much as possible.

If we don't assume that, it actually makes it more complicated, because your build environment can be changed by other job already (newer/older commit). So re-using runner is basically a verify specific usecase when we want to optimise some workflow, and it will not work always.

@ayufan I'm not sure I buy that argument. Sure, there may be some edge case where a job messes with the repo, and we might need to address that, but that shouldn't drive the design. For example, we could document that if you manually checkout a different SHA, that you put it back when you're done. :) Other vendors don't even consider that you might mess with the repo at all.

I fundamentally disagree that reusing a runner is a specific use-case. It should be default because it optimizes almost all workflows; simple or complex.

For reference, once we tackle some of the items in #21624, the parallel case might get fast enough that it seriously reduces the need for the single-runner flow.

Hey @markpundsack, thanks for your reply.

Any test suite that takes more than 5 minutes should use parallelism or you're doing your dev team a disservice

True. Worth noting my tiny projects take almost 5 minutes to build because of parallelism features of GitLab, that is, running build, test and deploy in different containers.

After migrating to GitLab CI we speeded up our complicated build 2x compared to Codeship by utilizing 5 parallel jobs (GREAT!) but our simple builds (several Middleman-based static websites, several small apps) build 5x longer because test and deploy step spins up a separate containers.

I fundamentally disagree that reusing a runner is a specific use-case. It should be default because it optimizes almost all workflows; simple or complex.

This should be printed and posted on the wall.

For reference, once we tackle some of the items in #21624, the parallel case might get fast enough that it seriously reduces the need for the single-runner flow.

I like gitlab-org/gitlab-ce#21626. For instance, Codeship is one step further - they have a one-fits-all uber image with everything. We were able to develop Virtkick (virtualization software that utilizes libvirt) for more than a year because even libvirt was there. But they had to do it because their builds run in VMs without root access, so what's not provided by them cannot be achieved, period.

Yet, I don't see how gitlab-org/gitlab-ce#21626 reduces a need for single-runner flow. Consider this. Deploy step runs on tagged master only, and has one command - scp build/ someserver:/home/www/public_html. The command itself takes a couple seconds. Starting a new container, retrieving artifacts and caches and all that to run a single command takes considerably longer, no matter what image you provide. If GitLab.com "wants to become the most popular service for hosting projects", a simple 2 second deployment job cannot be neglected. Most projects are like that.

Worth noting my tiny projects take almost 5 minutes to build because of parallelism features of GitLab, that is, running build, test and deploy in different containers.

@Nowaker Touché! :)

Yet, I don't see how gitlab-org/gitlab-ce#21626 reduces a need for single-runner flow.

Sorry, I got a little loose with my words there. I believe that sticky-runners is part of our overall speed improvements for simple and complex flows. And if you have no parallel jobs, then that implies that your pipeline will run on a single runner. When I said "single-runner flow", I meant the forced single-runner flow mentioned in https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/336#note_14754584, with the parallel: false default. I don't think that would actually be the keyword we'd use, so I just called it "single-runner flow".

Having an enforced single-runner flow vs sticky-runners is mostly because the latter is "best efforts" and can't be guaranteed. That means that tasks that depend on cached information must be able to recreate it if necessary. In practice, that means things like each job running bundle install again, just in case the gems haven't been installed. Now when the cache is there, especially if it's there because you're running on the same runner, then the second and subsequent bundle installs will be much faster, but still non-zero. So a true single-runner-flow where you trust that the files are still present would allow you to shave off those last few seconds. And that's the ideal.

But my premise is that if we get it down to just a few seconds difference, people will be happy. Especially when you then add the benefits of the flexible flow where each build can be retried at will, manual actions can be run in isolation using only artifacts, enabling parallel jobs "just works", etc.

@Nowaker I am actually going to go comment on gitlab-org/gitlab-ce#21626 as well. But I will tell you that in general I think that is not a good approach. These uber containers sound great but they don't work quite often. You always run into something that is missing. Or you run into needing version x+1 of something in there and y-1 of something else that is in there. I personally like a system where I can use a lightweight container that has exactly one part of the build in it. So I can say build this step in the node 6.4.2 then take the output of that and run in java:1.8.2 then run the next in a docker-in-docker 1.12. This way, if I need to update quickly to a specific version of the docker container I can just do it. With the uber container, for me to update to a new version of docker I may ALSO have to get node 6.5, which I am not compatible with.

@mmacfadden Hey, I agree with you. I like https://gitlab.com/gitlab-org/gitlab-ce/issues/21626 but I didn't say it's a solution. But there're certainly use cases where having Ruby and Node.js in one container is legit. I have Middleman-based static websites and we deploy them to Firebase with their CLI-installed tool... npm install -g firebase-tools. This step ain't worth a separate container. Again, my focus is on simple projects, simple builds, etc.

Actually I think caching should be improved extremly.

Always create a cache no matter if a build failed (it makes no sense to not do it, since the content of the cache is irrelevant and mostly resolved before the build actually fails)
Have a way to have a "global" cache which will always gets passed to the latest stage/container/build.
If multiple stages trying to update the cache always the latest one will be passed to the new build (since the content is irrelevant it's ok to maybe have passed an old cache or maybe a "too new one", mostly you shouldn't cache snapshots there since that would be a use case for a artifact ;))

With this it would raise the usefulness. Especially for Java related builds or worse Scala and sbt. On our way it's nearly impossible to use the docker executor since the cache is unreliable and our builds would take 10 minutes longer just for *.jar file downloading or more. Basically if a cache is not present it's fine. But if the cache is only present in 1% of the builds it's quite ... bad.

P.S.: As of now we keep the shell executor and with that we stay less flexible but have way more constant build times. Edit: Maybe I try to create the thing I wrote by using a mounted volume, which will just copy the artifacts back and forth with before_script, after_script

@c-schmitt

Have a way to have a "global" cache which will always gets passed to the latest stage/container/build.

Actually that's already possible. You can use a static key for the cache that does not contain the branch or job name. However, making this the default might be a good idea indeed.

Mentioned in issue #1741 (closed)

Mentioned in issue #1796

@ayufan I spent a lot of time reading through the thread of idea but all I was able to find is propositions, given that the issue is still open I assume this isn't settle and/or implemented yet. Do we have some news regarding this ? it's been 10 month now and it looks like we still have no way of making files available between jobs/stages without having to store them and making them visible in the gitlab ui.

HI all, I've spend a lot of time with "unstable" caching across jobs. Sometimes they are performed by another machine and cache can be empty?

But we have artifacts with expire_in and also dependencies (https://docs.gitlab.com/ce/ci/yaml/README.html#dependencies).

I use it this way and it works fine. Ok, but it would be nice to have an option like "delete artifacts after complete" or so.

@hranicka see #1796

I agree as this makes a lot of sense :)

https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/1151#note_17765741

Hello, I'm also curious if there was a decision and when it might be ready. When all jobs run on the same runner, it's annoying to have a build phase and then have to send artifacts to gitlab, then download the artifact in the test phase

Multiply this times hundreds of commits per day and the gitlab server gets full of junk artifacts.

Hi, are this feature going to be ready some soon? Because I need it like now..

@KeNaCo What feature, exactly?

The very first comment on this issue says

Let's don't mix the caching with passing artifacts between stages.

Caching has been implemented for a long time. The cache is intended for things like software you install, and is not guaranteed to be present on subsequent builds.

Dependencies have also been implemented for a while. This is how you pass build artifacts from one job to the next.

I suggest you get very familiar with the GitLab CI documentation.

@ayufan @tmaczukin Is there any reason for this issue to be open any more?

@jonathon-reinhart sorry ma fault. I miss some comments and think you implement local storing of artefacts.. which is feature I need. We build relatively big artifacts and sending them to gitlab and then pulling back is time&money consuming..

@KeNaCo @jonathon-reinhart I'm starting to consider of introducing local artifacts, the ones that would behave as a cache and would not be transfered from/to GitLab.

Without any active attempt to provide caching or artifacts, allowing me to reuse and leave my working copy dirty would be a huge step forward. Is there a way to say "Look, don't clean up unversioned artifacts, don't clean before, don't clean after, just let the git clone be dirty"? Right now I feel that the desire to have a clean initial state or to cache or artifact, gets in the way of just letting the working copy be shared between 3 jobs in a row that can just operate on "in place junk" as is. If I rely on that, it would be even better to say that three jobs in a pipeline MUST be run on the same working copy and thus on the same exact runner, one after the other. Say "job build_lib, build_tests, and exec_tests" all have to be run on the same runner, one after the other and without any cleanup in between.

In the case of C/C++ projects there is an awful problem with getting all the incremental builds to really be incremental with Gitlab's CI runner fighting me every step of the way. For example, it seems to me that artifacts and caching do not recursively traverse subdirectories and this makes me Sad. I frequently need to cache ALL folders underneath a certain location, often it could be called artifacts\ and contain many many levels of subdirectories. it seems to me that the current product requires me to explicitly NAME every folder I want, right down to the leaf nodes. It then gets every file in that one directory but does not recursively get all files and folders underneath.

I'll just post an example of one of my attempts to cache/artifact libffmpeg, a video codec library, this is a 113 line monster in one of my .gitlab-ci.ymls:

https://gitlab.com/warren.postma/hai/snippets/33599

In the case above I'm trying to get to the point where I can turn OFF untracked:true, but at 113 lines and growing, I'm losing hope that day will ever arrive. Above is series of assembler and C/C++ binary Build-output folders containing a mix of .lib, .obj and .o files.

HM, getting 404 there. Recursive should be possible with - path/*.

Oops, gitlab.com snippet was private. Made public. Will try the - path/* thing.

Any idea when this will be fixed?

We also need to pass vendor/ and other local folders between stages.

We spent almost a week debugging. I never expected CI to work like this. :(

@rahul286 Please read this comment: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/336#note_19661420

wow. this ticket is definitely long. what is still broken? i read the description and year old first comment.

seems there's no bug, just misunderstanding cache vs artifacts. at least the scenario described works for me. i can pass cache between stages and artifacts between builds. for the described scenario i suggest to use artifacts, not cache to carry "build/" dir. but the reporter used both.

there's also likely some confusion of build log that says it removes "build/" dir -- that is general checkout cleanup to remove untracked files. the cache or artifacts are unpacked after that step. not sure is the output of the bug report truncated or it doesn't say in logs that unpacking cache or artifacts.

One of the issues is that not only the cache is not guaranteed to exist between builds but also between stages, if i set 4 concurrent runner of the same type, one job is executed on one instance and the other not always on the same causing a crash. Actually i need to do npm install in each job to be sure the cache is available. This is caused by the fact that the cache is not shared between instances of the same runner type.

ah, that can be solved with using shared cache, i.e s3 minio server:

you need runner v1.9.0:

Add path and share cache settings for S3 cache !423 (merged), #1897 (closed)

for docker based runners, as each instance cache is just docker volume mounted as /cache. you could if you use some docker volume driver (think nfs) to share /cache between runners. not tested, just theoretically possible.

I was also expecting content created in one-stage and/or job to pass on to subsequent stages/jobs in the same run.

I had vendor/ and few more folders which were part of .gitignore and hence getting cleared with git clean.

After trying a lot, all it needed following 3 lines at the top to get things working as expected.

cache:
  key: ${CI_BUILD_REF_NAME}
  untracked: true

Any feedback/drawbacks about this workaround?

The issue remain, if you use concurrent runners your next stage could be executed on another instance causing a cache missing issue.

@SharpEdgeMarshall should be solving if using shared cache solution: https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/336#note_20896010

@glensc As said multiple times before shared cache is NOT a solution for CI as well as artifacts!
You might use it for small or medium projects but we have an application with 4gb of final build files and using a shared cache makes transfering files between s3/minio slower then actually building the project!

The only solution I can see are the proposed sticky runners which will build on the previous runner without checking out again!
https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/issues/336#note_13264417

Right now we do:

Build

build

Test:

build
test

Deploy

build
deploy

Which is building the project new on each run.
That's not how it should work as we are actually testing another build then the one we deploy but it's faster then up/downloading the cache each time!

Sticky runs should execute test after build on the same runner (for example Test after Build without clean) and (when using manual for example) should reexecute the previous stage (for example if Deploy is manual and the last Build was not from the current commit Build needs to be reexecuted)

mentioned in issue #2037 (closed)

mentioned in issue gitlab-com/support-forum#950

mentioned in issue #2050 (closed)

hi @ayufan , any news about STICKY BUILDS? i'm finding this issue searching for a solution like what you have explained in your comments before. Thanks.

We plan to have something in Sticky runners space for GitLab 9.1: https://gitlab.com/gitlab-org/gitlab-ce/issues/29447

mentioned in merge request SinnerSchrader/dojo!5 (merged)

My vote is for prioritization of the stage-to-stage passing "artifacts" that aren't uploaded as a ZIP to the GitLab UI. (Can we call them intermediate build results, or just intermediates?)

Without it, to avoid the wasteful GitLab UI upload, I'm now placing my entire build into one job. The biggest problem with this is that I can't use separate images for each part of the build, since there can only be one image per job. I am using the Kubernetes runner, where I have separate images for A) compiling Go, B) creating Docker images, C) deploying via kubectl and/or SSH.

I am now forced to create one mega-image that contains all the different tools (A-C above) that are needed for my build. The entire .gitlab-ci.yml file is now unfortunately just one big sequential script. Please help! :D

You can cache artifacts between stages And you can pick which jobs need to use the cache

You can cache artifacts between stages And you can pick which jobs need to use the cache

@kanebryant, have a look at the very first reply from @ayufan at the very top: caching wasn't designed to be used in this way; it is meant to be used to speed up invocations of subsequent runs of a given job, by keeping things like dependencies (e.g. npm packages, Go vendor packages, etc.) so they don't have to be re-fetched from the public internet.

While I understand the cache can be abused to pass intermediate build results between stages, I'd rather use something that was designed for the purpose.

In my personal view, this bunch of things become such a pain, is mainly because gitlab-ci distributes stages of a single pipeline to multiple runners.

Why this? If I separate my docker build and docker push into 2 stages, my docker push may complain that the given docker image does not exist. (because this is a different runner)

The concurrency is cool, but I prefer the minimal concurrency unit is pipeline, not stages.

@cherrotluo

In any case, you can always do build and push in a single job, then you have full control over it. I understand that this may seem like a limitation, but this also forces a little to construct your pipeline to be more modular. To be fair, this often kills the speed, but we are working on improving the speed of jobs. One of the examples is to use the same machine for follow up jobs in order to speed-up runs of next stages.

The concurrency is cool, but I prefer the minimal concurrency unit is pipeline, not stages.

@cherrotluo No way. Some of us have massive pipelines, with dozens of jobs in each stage, and many jobs running on different machines. Concurrency must absolutely happen between jobs within a stage.

I agree with you @jpap for me it was a work around as I was forced to add more stages as I couldn't get multiple script lines working together in a single stage with the windows gttlab runner

Hi, I have a build with step build/test/publish/deploy.

In the build test I run ./gradlew build.

I am producing an artifact with - build/lib/*.jar.

In the same build job, I have a cache for build.

In the publish job, I want to build a docker file.

The build folder as been replaced by the artifact and I cannot access build/lib/docker to build my Dockerfile.

Is there a way to tell the artifacts not to erase the workspace build dir and use a different directory ?

This is my .gitlab-ci.yml

image: domain/node7-jdk8-sonarscanner2-docker17-debian:latest

variables:
  DOCKER_FILE: build/docker/Dockerfile
  DOCKER_IMAGE: registry.domain.com/domain/api
  DOCKER_DRIVER: overlay2
  DOCKER_HOST: tcp://docker:2375
  SPRING_PROFILES_ACTIVE: gitlab-ci
  POSTGRES_USER: gitlab-ci
  POSTGRES_PASSWORD: gitlab-ci
  POSTGRES_DB: DATA_V1

services:
  - docker:dind
  - postgres:9.6.3-alpine

before_script:
  - export VERSION=$(cat gradle.properties | grep 'version=' | awk -F 'version=' '{print $2}')
  - docker login -u ${REGISTRY_LOGIN} -p ${REGISTRY_PASSWORD} ${REGISTRY_URL}

stages:
  - build
  - test
  - publish
  - deploy

# Job: Build
build:
  stage: build
  script:
    - ./gradlew build
  cache:
    key: ${CI_BUILD_REF_NAME}
    paths:
      - build/
      - .gradle/wrapper
      - .gradle/caches
  artifacts:
    when: on_success
    name: "${CI_PROJECT_PATH}-${CI_PIPELINE_ID}-${CI_COMMIT_REF_NAME}"
    paths:
      - build/libs/*.jar
  tags:
    - docker

# Job: Test
test_dev:
  stage: test
  script:
    - ./gradlew test
    - ./gradlew sonarqube -Dsonar.login=${SONAR_LOGIN} -Dsonar.branch=dev
  only:
    - dev
  tags:
    - docker

test_master:
  stage: test
  script:
    - ./gradlew test
    - ./gradlew sonarqube -Dsonar.login=${SONAR_LOGIN}
  only:
    - master
  tags:
    - docker

# Job: Publish
publish_dev:
  stage: publish
  script:
    - pwd
    - ls -altr build/*
    - ls -altr .
    - docker build --file ${DOCKER_FILE} -t ${DOCKER_IMAGE}/staging:${VERSION} build
    - docker build --file ${DOCKER_FILE} -t ${DOCKER_IMAGE}/staging:latest build
    - docker push ${DOCKER_IMAGE}/staging:${VERSION}
    - docker push ${DOCKER_IMAGE}/staging:latest
  dependencies:
    - build
  only:
    - dev
  tags:
    - docker

publish_master:
  stage: publish
  script:
    - docker build --file ${DOCKER_FILE} -t ${DOCKER_IMAGE}/staging:${VERSION} build
    - docker build --file ${DOCKER_FILE} -t ${DOCKER_IMAGE}/staging:latest build
    - docker push ${DOCKER_IMAGE}:${VERSION}
    - docker push ${DOCKER_IMAGE}:latest
  dependencies:
    - build
  only:
    - master
  tags:
    - docker

# Job: Deploy
deploy_staging:
  stage: deploy
  script:
    - echo "Deploy to staging server"
    - sed "s/:latest/:${VERSION}/g" deploy_staging.json > deploy_staging.json.tmp
    - IP=${MARATHON_HOST} marathon deploy_staging.json.tmp update
  environment:
    name: staging
    url: https://staging-api.domain.com
  only:
    - dev
  tags:
    - docker

I apologize in advance for using this issue instead of creating my own. I think this is completely related and I encounter the same issues and same remarks as all the users here and I wish gitlab-ci will provide an official way to do this in the future.

added cache label

added Documentation and removed ~60197 labels

Caching between stages

Designs

Child items ...

Activity

`artifacts:downloadable`

Gitlab CI runner gripes

Thesis

Solution

Example 1

Example 2

Example 3

Disclaimer

The state of caching and artifacts

Sticky builds

Summary

Admin message

Admin message

Caching between stages

Activity

artifacts:downloadable

Gitlab CI runner gripes

Thesis

Solution

Example 1

Example 2

Example 3

Disclaimer

The state of caching and artifacts

Sticky builds

Summary

`artifacts:downloadable`