We had a call with @ayufan and @JobV today, and our proposal for handling this is roughly the following:
We should migrate artifact data stored in the database to a separate model/table, something like Ci::Artifacts.
This will make it possible to store more information about artifacts. Later it will also allow new features, like storing more than a single artifacts archive created during a build. It might even make release artifacts possible, along with other types of artifacts for a single build (release artifacts, pages artifacts, shared artifacts, etc.).
Then we should add a metadata attribute to artifacts that describes the contents of the artifacts archive.
Metadata will be stored along with the archive created during a build. It will describe what is inside the archive and when it was created, and other information can be stored there as well (what would be nice to add here?)
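Just to illustrate, a single metadata entry could look something like the sketch below; the field names are only a guess at this point, not a decided format.

```go
// Illustrative sketch only: one possible shape of a single metadata entry
// describing a file inside the artifacts archive. The field names are
// assumptions for discussion, not a final format.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type ArtifactEntry struct {
	Path     string    `json:"path"`     // path of the file inside the archive
	Size     int64     `json:"size"`     // uncompressed size in bytes
	Mode     uint32    `json:"mode"`     // UNIX permissions recorded in the archive
	Modified time.Time `json:"modified"` // modification time recorded in the archive
}

func main() {
	entry := ArtifactEntry{Path: "coverage/index.html", Size: 2048, Mode: 0644, Modified: time.Now()}
	out, _ := json.Marshal(entry)
	fmt.Println(string(out))
}
```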
We will generate the tree view of artifacts contents in the GitLab UI using the metadata, instead of extracting the archive for this purpose.
This makes the implementation more efficient and puts less stress on GitLab when the user only wants to view the contents rather than download anything.
Once we have artifacts metadata, we can also serve it through the API.
It may be particularly interesting to give access to artifacts metadata via the API. It could also be used in hooks, etc. Users would be able to consume this API endpoint / these hooks to check, for example, whether artifacts are complete or how large they are (there are many use cases here).
When a user wants to download a single file or multiple files, it will be necessary to extract the entire archive using a Sidekiq job.
But we will do this only when it is strictly necessary. Before that, the user will see some kind of modal/splash screen saying that they need to wait for the artifacts to be decompressed before they get access to the files. It will also point out that downloading the entire archive is the preferred way, but if the user decides to extract the artifacts, that is fine too.
It would also be possible to extract a single file from the archive, but this is not really advantageous performance-wise with tar.gz. It would be more efficient if we were using the ZIP format instead, but tar.gz has other benefits.
Clean up extracted artifacts data (invalidate the cache)
Extracted artifacts will be cached for further requests, and we will invalidate the cache after some time (let's say about an hour), at which point everything gets cleaned up. In the worst case, when users need to access artifacts continuously, the archive will be extracted once every hour.
Workhorse can extract a single file from a tar.gz to standard output with `tar -Ozxf archive.tar.gz -- my/file.txt`. Golang's io.Copy can stream this into the HTTP response body.
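A minimal sketch of what that could look like (not actual Workhorse code; the handler and query parameters are made up for illustration):

```go
// Minimal sketch: stream a single file out of a tar.gz archive into the HTTP
// response without extracting the whole archive. Not actual Workhorse code.
package main

import (
	"io"
	"log"
	"net/http"
	"os/exec"
)

func serveFileFromArchive(w http.ResponseWriter, archivePath, filePath string) {
	// tar -O writes the extracted file to stdout instead of the file system.
	cmd := exec.Command("tar", "-Ozxf", archivePath, "--", filePath)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		http.Error(w, "cannot open archive", http.StatusInternalServerError)
		return
	}
	if err := cmd.Start(); err != nil {
		http.Error(w, "cannot start tar", http.StatusInternalServerError)
		return
	}
	// Stream tar's stdout straight into the response body.
	if _, err := io.Copy(w, stdout); err != nil {
		log.Println("copy error:", err) // the client may have gone away
	}
	_ = cmd.Wait()
}

func main() {
	http.HandleFunc("/artifacts/file", func(w http.ResponseWriter, r *http.Request) {
		// In practice the archive location would come from the Rails application.
		serveFileFromArchive(w, r.URL.Query().Get("archive"), r.URL.Query().Get("path"))
	})
	log.Fatal(http.ListenAndServe(":8181", nil))
}
```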
@jacobvosmaer: @ayufan said that it may be particularly useful to generate metadata while receiving the archive file, instead of using a Sidekiq job. I think that is a good idea (but I don't know Workhorse well enough at this point to say more). What do you think?
I disagree. Why make the HTTP client wait while we parse a potentially large tar.gz file? @ayufan we have to think about NGINX / HAProxy timeouts. I think it is better to just store the file in workhorse, and inspect it later in Sidekiq.
Yesterday and today I talked a lot with @ayufan, @JobV and @jacobvosmaer about this feature, did some TAR-related prototyping, looked at how things are done in GitLab CI, and I think we have managed to decide how this feature should be implemented.
We would do that online, while receiving the data (not after the full file is received). Gunzip will be faster than the client sending the data to us, and we save on the IOPS of reading the full file later.
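Roughly, the idea is something like the sketch below, with standard input standing in for the request body (the file names and output are illustrative only):

```go
// Sketch: save the incoming artifacts stream and inspect its entries in a
// single pass, so the archive never has to be re-read from disk afterwards.
package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
)

// scanWhileSaving copies src to dst while listing every entry in the
// tar.gz stream it carries.
func scanWhileSaving(src io.Reader, dst io.Writer) error {
	tee := io.TeeReader(src, dst) // everything we read is also written to dst
	gz, err := gzip.NewReader(tee)
	if err != nil {
		return err
	}
	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		fmt.Println(hdr.Name, hdr.Size)
	}
	// Drain any trailing bytes so dst receives the complete upload.
	_, err = io.Copy(io.Discard, tee)
	return err
}

func main() {
	out, err := os.Create("saved.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()
	// Stand-in for the request body: cat build.tar.gz | go run scan.go
	if err := scanWhileSaving(os.Stdin, out); err != nil {
		log.Fatal(err)
	}
}
```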
@ayufan that sounds nice in theory, but looking at something like Packagecloud, where pushing a package takes forever because the web server processes it on the fly, I am a little skeptical about how it will work out in practice.
On the other hand, the end result when we decorate the tar.gz in gitlab-workhorse might be simpler than having a Sidekiq worker and a "your artifacts are being processed" page.
I think doing it in Sidekiq is more natural, but I am open to your "we have enough time to pipe into tar -tz" argument.
@ayufan Hmm, full metadata would certainly be useful. I'm currently implementing a StringPath class that will ease iteration over such metadata paths, and it would be useful to know what metadata format we expect to have.
It looks like we need to improve the performance of the mechanism meant for traversing metadata paths. I've used the Linux kernel archive metadata and it is far too slow. I have some ideas for optimizing it, so I will give them a try.
@grzesiek What specifically is slow? Do you have an implementation you can point to? I'd be happy to take a look and offer any suggestions for optimization.
Grzegorz Bizon marked the task "Tweak performance of metadata's paths traversal" as completed
@jonathon-reinhart Thanks Jonathon, but I think I can handle this :) (I just pushed refactored code). If you want to help, take a look at the new class this MR introduces - StringPath - whose purpose is to simplify traversal of metadata paths. It currently works well with the linux-2.6.0.tar.gz metadata (~15k files), but I believe it can be even faster.
Grzegorz Bizon marked the task "Check for security vulnerabilities in ZIP (browse CVEs)" as completed
Grzegorz Bizon marked the task "Check how ZIP handles SUID/SGID" as completed
Grzegorz Bizon marked the task "Check how ZIP handles absolute paths" as completed
The ZIP CLI handles large files properly if compiled with ZIP64 support:
> Large Archives and Zip64. zip automatically uses the Zip64 extensions when files larger than 4 GB are added to an archive, an archive containing Zip64 entries is updated (if the resulting archive still needs Zip64), the size of the archive will exceed 4 GB, or when the number of entries in the archive will exceed about 64K.
>
> Zip64 is also used for archives streamed from standard input as the size of such archives is not known in advance, but the option -fz- can be used to force zip to create PKZIP 2 compatible archives (as long as Zip64 extensions are not needed). You must use a PKZIP 4.5 compatible unzip, such as unzip 6.0 or later, to extract files using the Zip64 extensions.
Most of the available ZIP binaries (including the Debian package in stable) have ZIP64 support included:
```
grzesiek@debian: ~ $ zip -v
[...]
Zip special compilation options:
        LARGE_FILE_SUPPORT (can read and write large files on file system)
        ZIP64_SUPPORT (use Zip64 to store large files in archives)
[...]
```
- LARGE_FILE_SUPPORT - depends on the OS; OK since Linux 2.4
- ZIP64_SUPPORT - implementation that makes it possible to read and store files > 4 GB
This is something we need to check if we want to use a Go implementation.
Grzegorz Bizon marked the task "Check how ZIP handles UIDs/GIDs, UNIX permissions" as completed
Grzegorz Bizon marked the task "Check how ZIP handles case sensitivity" as completed
A simple Go implementation that iterates over the archive and prints a list of files (metadata) consumes around 270% more memory than the CLI unzip implementation (using linux-2.6.0.tar.gz: 8080 kB compared to 2968 kB, checked using GNU Time since valgrind doesn't work well with Go). However, the Go implementation has a very low impact in terms of signals/calls, including IO operations (only about 14% of total calls/signals).
Memory consumption is similar to the file-lister (~8 MB for linux-2.6.0.zip), versus ~3 MB when using the CLI implementation.
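For reference, a file-lister along these lines is only a few lines of Go; this is a simplified sketch, not necessarily the exact code used for the measurements:

```go
// Minimal sketch of a Go file-lister: print every entry in a ZIP archive
// using only the central directory, without decompressing any file content.
package main

import (
	"archive/zip"
	"fmt"
	"log"
	"os"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: file-lister <archive.zip>")
	}
	r, err := zip.OpenReader(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	for _, f := range r.File {
		fmt.Printf("%s\t%d\n", f.Name, f.UncompressedSize64)
	}
}
```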
Conclusion
We will use Workhorse to extract a single file (as we benefit from streaming the file content to the user), and we will use the Ruby backend to generate metadata for a given path in the ZIP archive using the CLI implementation of Info-ZIP.
Listing all files located inside the archive consumes about 64964 kB of memory. Compared to the CLI implementation (3 MB) and the Go implementation (8 MB), this is highly inefficient.
@dzaporozhets What do you think about using the CLI implementation of ZIP (the unzip command) here? It looks like the most reasonable option for generating artifacts metadata for each subsequent request in the artifacts browser, but it is an additional external dependency.
@grzesiek I think using 8 MB for the Linux archive in the Go implementation is acceptable. I like that the Go implementation has lower IOPS: "the Go implementation has a very low impact in terms of signals/calls, including IO operations (only about 14% of total calls/signals)." And not adding a dependency would be great.
@sytses, @dzaporozhets The main advantage of the CLI variant is lower implementation effort and decoupling at least half of the logic from Workhorse (Workhorse will be used to download a single artifact; it doesn't have to be used for browsing artifacts).
Info-ZIP also makes it possible to generate only partial metadata (@ayufan gave an example above, but for me it seems to use as much memory as generating the entire metadata).
Decoupling the artifacts browser from Workhorse also makes it easier to maintain and lowers the contribution barrier for other people.
Nonetheless I don't have a strong opinion on using Info-ZIP, as it may have some security flaws (quite a few CVEs from 2015, link somewhere above), it is an external dependency, and using Workhorse feels somewhat more idiomatic to me here (but I haven't had an opportunity to look at Workhorse yet, so this may be a false impression).
Tough decision.
Note that this 8 MB is valid only for the linux-2.6.0.zip archive I used for prototyping. The size of this file is 50 MB and it contains about 15000 files. The amount of memory consumed may differ depending on the size of the central directory (see ZIP Structure), but I believe this is quite a good sample.
After investigating gitlab-workhorse I think that using it to extract archive metadata adds quite a lot of complexity to this solution, compared to using Info-ZIP. I will try to talk to @marin today to ask him a few questions about Workhorse.
I had a call with @marin today to discuss the unzip Go implementation in Workhorse.
It looks like implementing this feature in Workhorse would introduce a lot of complexity, and we would need to extend Workhorse with a connection to Redis (which will probably happen eventually, in the near future).
We decided to go with the simplest solution (as we value boring solutions): generate artifacts metadata using the CI runner and send it along with the artifacts archive.
This adds some redundancy, as we already have metadata inside the archive, but we will have full control over the information and format, and won't need to depend on additional external utilities. It will also be simpler to serve the metadata via the API.
Then, in the future, it will be possible to change the format of the metadata file to, for example, a binary file using something like a hash map to speed up lookups for entries relative to a specific path only.
Another benefit is that we will be able to write better specs for the artifacts browser, as at this moment we are not able to exercise the entire GitLab stack from specs located in the Rails application, because we do not preload Workhorse in specs now.
The purpose of this was to prepare a gemified library that would make it possible to generate ZIP archive metadata faster, with a smaller memory footprint, and to expose each node as a Ruby object. I used a Go native extension compiled as a C shared library, then bound the methods using FFI (a new feature in Go 1.5). This failed because I ran into intermittent SIGSEGV issues and wasn't able to debug them with gdb (no debug symbols present that would hopefully give a hint where the problem is), and I didn't want to spend too much time on it.
Preparing benchmarks on iterating big files in Ruby (plain-text vs. gzipped): it looks like Zlib does quite a good job, and the impact on resource usage is acceptable, given that it saves a lot of storage.
Summary: We will store metadata generated by a Go application (preferably Workhorse, or the CI runner) in .gz format, then iterate over this file in the Rails backend on each subsequent request for artifacts. It is possible to add a Redis cache (using redis-objects, for example) later if this is not efficient enough. With this approach it is possible to leverage full metadata, improve performance, and still minimize complexity.
We should implement a DoS mitigation mechanism in Workhorse (or the Runner). A DoS is possible when a malicious artifacts archive is received (with more than 1 000 000 files inside). We should not generate metadata when the archive contains more than about 20 000 entries (should we make this configurable in Workhorse?).
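To tie the two points above together, here is a sketch of what the Go side could do, assuming a ZIP archive and a plain "path size" line format; both the format and the hard-coded limit are placeholders from this discussion, not the actual implementation:

```go
// Sketch: write a gzip-compressed metadata file for a ZIP artifacts archive,
// refusing archives that hold more entries than the cap. The line format and
// the limit are assumptions from this discussion, not GitLab's format.
package main

import (
	"archive/zip"
	"compress/gzip"
	"errors"
	"fmt"
	"log"
	"os"
)

const maxEntries = 20000 // DoS mitigation: refuse to index absurdly large archives

func writeMetadata(archivePath, metadataPath string) error {
	r, err := zip.OpenReader(archivePath)
	if err != nil {
		return err
	}
	defer r.Close()

	if len(r.File) > maxEntries {
		return errors.New("too many entries, refusing to generate metadata")
	}

	out, err := os.Create(metadataPath)
	if err != nil {
		return err
	}
	defer out.Close()
	gz := gzip.NewWriter(out)
	defer gz.Close()

	// One line per entry; the Rails backend iterates over these lines on
	// each request for a given path.
	for _, f := range r.File {
		if _, err := fmt.Fprintf(gz, "%s %d\n", f.Name, f.UncompressedSize64); err != nil {
			return err
		}
	}
	return gz.Close()
}

func main() {
	if err := writeMetadata(os.Args[1], os.Args[2]); err != nil {
		log.Fatal(err)
	}
}
```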
Grzegorz Bizon marked the task "Check for possible vulnerabilities in handling arbitrary paths in metadata" as completed
We just had a call with @marin, @jacobvosmaer and @ayufan, talking about unzip and generating metadata for artifacts and, importantly, for GitLab Pages (as that issue is connected to this one).
The solution for now is:
- Add an unzip dependency to GitLab EE, to make it easier to extract pages artifacts.
- Generate artifacts metadata in Workhorse, as previously decided.
- This should be placed in a separate file in Workhorse, but in the same package main, and executed as a subcommand.
- With this approach we create a coupling from GitLab Rails to Workhorse, so we now have coupling in both directions.
- But it is easier to add a little application into Workhorse and expose a command than to introduce a new mechanism for managing little Go applications from Rails (@jacobvosmaer suggested a rake task for go get'ting Go apps, but then stated it is too complex for now, and we should go with something less complex).
Part of this feature is being implemented in gitlab-workhorse. It is not clear whether that can be finished in time for 8.4. Considering it is not finished at the moment, maybe we already know that it will not make it in time. cc @rspeicher