I've been following the changes to cache and artifacts since the early ci-multi-runner days, and even so I'm still confused about how to use these two features and what their default behaviors are. In the documentation we can find a dozen definitions, which differ depending on the paragraph. In most cases these are: build, stage, job, run. I'm lost...
Tell me if I'm right:
Build - the whole run of newly pushed changes through the entire .gitlab-ci.yml.
Stage - a part of a build that wraps particular automation tasks (e.g. build or test).
Job - one task defined within a particular stage.
Run - is... what? A single run of a job?
Now:
We have one build.
We can have multiple stages. Each stage runs one-by-one (in the order defined with stages).
We can have multiple jobs per stage. Can jobs within a stage run in parallel?
And:
What is the cache's default behavior, and when is it enabled?
What is the artifacts' default behavior, and when are they enabled?
Are artifacts always uploaded to GitLab?
What I suggest:
Create examples for common cache and artifact uses.
I would love to do it, but I simply can't understand how this whole thing works. I failed to set it up using only the documentation:
How to preserve cache between builds (like npm or bower packages, composer vendors etc.)?
How to keep generated files from one stage to another?
If someone could explain this to me I'd love to create examples for documentation.
Now pardon me, I'm going to figure it all out by trial and error. :/
Caches are disabled if not defined globally or per-job
Caches are only available for all jobs in your .gitlab-ci.yml if enabled globally
Caches defined per-job are only used either a) for the next run of that job, or b) if that same cache is also defined in a subsequent job of the same run
Artifacts need to be enabled per job
Artifacts are available for subsequent jobs of the same run
Artifacts are always uploaded to GitLab (coordinator)
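A minimal sketch of those rules (job names and paths here are hypothetical placeholders, not from any specific project):

```yaml
# Global cache: available to every job in this .gitlab-ci.yml
cache:
  paths:
    - vendor/

build:
  stage: build
  script:
    - composer install
  # Artifacts must be enabled per job; they are uploaded to GitLab
  # and available to jobs in later stages of the same run.
  artifacts:
    paths:
      - build/

test:
  stage: test
  # Receives build/ from the 'build' job's artifacts automatically.
  script:
    - ./run-tests.sh build/
```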
1.) Define a cache with the 'key: ${CI_BUILD_REF_NAME}' - so that builds of the e.g. master branch always use the same cache - during your 'build' step, e.g. for your '/node_modules' folder
2.) Define artifacts for the output of the 'build' step, e.g. the '/dist' folder
Your 'deploy' job during that run will then have the '/dist' folder to deploy somewhere
All 'build' jobs of later runs will then have the '/node_modules' folder and won't need to download, compile and install all the modules again.
Oh, and a "run" is the execution of your .gitlab-ci.yml because of a commit, I think.
Here's a simplified example of one of our .gitlab-ci.yml:
```yaml
stages:
  - build
  - test
  - build_dist
  - deploy

build:
  stage: build
  script:
    - npm install
  cache:
    key: ${CI_BUILD_REF_NAME}
    paths:
      - node_modules/
  artifacts:
    paths:
      - node_modules/
      - build/

test:
  stage: test
  script:
    - gulp test:unit

build_dist:
  stage: build_dist
  script:
    - NODE_ENV="production" gulp build --environment production
  artifacts:
    paths:
      - dist/

deploy:
  stage: deploy
  script:
    - <do something with the stuff in /dist>
```
So basically, I:
cache 'node_modules' during 'build', so that the next run doesn't need to start from scratch, and also create the artifacts 'node_modules' and 'build' (which will automatically be available to the other jobs in this run)
use the artifacts from 'build' in 'test'
use the artifacts from 'build' (I would only need 'node_modules') in 'build_dist' and create the additional artifact 'dist'
use the artifact 'dist' to deploy it
The solution to get usable caching was to set the key to '${CI_BUILD_REF_NAME}', otherwise every run would create its own cache (if I remember correctly).
I'm confused as well: does ${CI_BUILD_REF_NAME} have to be declared verbatim, or do I have to set some variables in the settings? Or do you have in mind some working project where everything is already set up?
Yes, if you're only doing tests, you could also do the "npm install" and "npm run-my-tests" in the same job, which would save you the disk space for creating an artifact of the node_modules folder to use in the second job. Then you'd only need to set the cache and no artifacts at all.
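A sketch of that single-job variant (assuming an npm project; 'npm run-my-tests' stands in for whatever your test command is):

```yaml
test:
  stage: test
  # The cache survives between runs of this job; no artifacts are
  # needed, since install and tests happen in the same job.
  cache:
    key: ${CI_BUILD_REF_NAME}
    paths:
      - node_modules/
  script:
    - npm install
    - npm run-my-tests
```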
By the way, I was also hoping to get a shared cache, but that's not the case (yet?). It's a distributed cache (although that term makes even less sense). My runners do upload the cache to S3 (or Minio, in my setup), but unfortunately each runner uploads it into its own path in the bucket - which means it can't be used by other runners. I'm hoping that this gets fixed, or at least that the path becomes configurable. See #1226 (closed)
It is made this way because I assumed that the token is basically a key for a specific runner configuration/architecture/system. A cache created by one runner will often not be valid when used by a different runner, which may run on a different architecture :) It's distributed because it makes sense to use it this way for auto-scaling, where we are sure the machines have the same configuration. So it's fine to use the same runner token on multiple machines as long as they have the same configuration; that way you will always have a build cache that is valid for that specific configuration.
This is also the reason why, by default, the cache is on a per-job and per-branch basis: we assume the worst case. That matters when, for example, you test against different golang/nodejs/ruby versions, where vendored dependencies will most likely not work across versions. In most cases it's too restrictive, which is why you can configure the cache key to relax that assumption.
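For example, relaxing the per-job, per-branch default could look like this (a sketch; the paths are placeholders, and whether a wider key is safe depends on your runners having identical configurations):

```yaml
cache:
  # One cache per branch, shared by all jobs on that branch,
  # instead of the per-job, per-branch default:
  key: ${CI_BUILD_REF_NAME}
  # Or a single cache shared across all jobs and branches:
  # key: one-key-for-everything
  paths:
    - node_modules/
```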
Alright, I get the point of playing it safe, but then I don't understand why anyone would want to use a "distributed" cache that can only be used by one runner anyway? Isn't that technically the same as just using the local disk?
But it's good to hear that I can share the runner token between multiple machines. As I'm currently having my dockerized runners all register as new runners (which creates quite a lot of stale records), I'll skip that step and just bake the token into the Docker image when building it =)
I think the documentation should be clearer about the cache: the relation between cache keys and cache paths (relative/absolute), how to share a cache between subsequent builds/jobs, and how caches are constructed and restored.
Sorry, still confused. I followed @jangrewe's instructions from above, and this does indeed create downloadable cache artifacts for node_modules. But when a new job is triggered, it installs all the dependencies from scratch again.
This is a small self-educational project, so I'll most likely stick with the shared runners; I'm really interested in seeing how things work.
Yes, you are obviously confused: there is no such thing as "cache artifacts"! ;-)
You have one or more artifacts from a job in a stage of a run, and (with my example) you have "a cache".
Like i said:
Artifacts are created during a run (your whole .gitlab-ci.yml, e.g. after a commit) and can be used by the following JOBS of that very same currently active run.
Caches can be used by following RUNS of that very same JOB (a script in a stage, like 'build' in my example) in which the cache was created (if not defined globally).
cache 'node_modules' for the NEXT run (triggered by a commit to the repo) so that they don't have to be installed again (unless the cache gets deleted)
create an artifact containing 'node_modules' and 'build' in 'job_build', so that 'job_test' in THIS run can use those files
cache 'node_modules' for the NEXT run (triggered by a commit to the repo) so that they don't have to be installed again (unless the cache gets deleted)
don't create any artifacts, as we're running the tests in the SAME job and don't have a later job (in THIS run) that would need them
Alright, I checked your builds (I saw the link too late), and it looks like everything's working exactly as it's supposed to!
You can never expect the cache to be actually present (though you can expect that from artifacts), and it looks like in your case that's what happens: the cache got cleared. Probably because you're using shared runners that have some restrictions...
I'm guessing if you were using your own runners, everything would work as expected.
Has anyone managed to use the distributed cache with the shared runners?
This article seems to state that it is indeed enabled for shared runners. I managed to use artifacts but not the cache...
@rahul286 please read the above again. Caches are not supposed to pass files to subsequent stages of the same run, only to subsequent runs of the same stage. Artifacts will be passed to subsequent stages of the same run, and not to subsequent runs of the same stage.
If you add those 3 lines, you're basically globally caching (and restoring) all untracked files for every stage of every run. That works, but it's fugly as hell (and takes quite a bit longer).
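Presumably those 3 lines are a global untracked-files cache along these lines (a guess at the snippet being discussed, since it isn't quoted here):

```yaml
# Globally cache (and restore) all untracked files
# for every job of every run - works, but is slow.
cache:
  untracked: true
```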
@florent-galland We've been using a shared cache (Minio) for quite some time now, and it's working perfectly fine!
Remember that artifacts are uploaded to the GitLab instance itself, while shared caches are uploaded to whatever you're using for them.