After months of recruiting, we finally hired a second database specialist
Query timings of project issues pages were reduced by 10x, and total response timings of these pages by (only) 2x
Loading times of the "Explore Projects" page have been significantly reduced (personal note: still waiting on the deploy to get the actual numbers, but a 2 sec reduction is expected)
The "events" table is in the process of being migrated, allowing us to save roughly 140 GB of disk space and potentially a large amount of space used for PostgreSQL buffers
With these changes querying event feeds can be, in the best cases, 66 times faster than before
Sidekiq and Unicorn database connections are separated in pgbouncer. This means that a spike in Sidekiq activity won't result in Unicorn not having any available connections, improving availability
pgbouncer is still running in our temporary setup instead of being based on omnibus pgbouncer
InfluxDB is incapable of handling 25 hours of data, which means we can't have accurate p95s/etc on a per day basis; this is a bit annoying when planning what to work on
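A quick back-of-the-envelope check on the issues page numbers above (10x faster queries, 2x faster responses). This is only a sketch using Amdahl's law, and it assumes the query speedup was the only thing that changed on those pages. The function name is made up for illustration; nothing here comes from our monitoring.

```python
def query_share_of_response(query_speedup: float, overall_speedup: float) -> float:
    """Estimate the fraction of the ORIGINAL response time spent in queries.

    Assumption (not measured): the query speedup is the only change, so
    Amdahl's law applies: 1/overall = (1 - f) + f/query.
    """
    # Solve Amdahl's law for f, the query fraction of the original response time.
    return (1 - 1 / overall_speedup) / (1 - 1 / query_speedup)


share = query_share_of_response(query_speedup=10, overall_speedup=2)
print(f"Queries were about {share:.0%} of the original response time")
```

If that assumption holds, queries made up a bit over half of the old response time and are now only about a ninth of the new one, so further gains on these pages would mostly have to come from outside the database.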
@ernstvn I added @briann because security has always been done with infra. If you want to have a separate security FGU that would be good. There is a lot of info covering many areas in the infra FGU, so splitting out security would be helpful in reducing that and getting more focus on security.
All git clones/fetches on GitLab.com have been using Gitaly for the past week
Will update the issue with some graphs showing performance and NFS traffic changes
Gitaly-Ruby: we've been focusing on running Ruby code directly on the File Servers. This code is running experimentally in GL 9.5. Gitaly-Ruby saves us the effort of porting from Ruby/Rugged to Go and helps increase our velocity so that GitLab.com can be running without NFS sooner.
We're well on track for exceeding our OKR goal of 25 migrations in the quarter
Will update the issue with graphs
Concerns
Resourcing: GitLab needs Gitaly to be delivered faster. We're considering bringing more people onto the project.
File Server Vertical Scaling. Gitaly is handling more operations with every release, but until we're off NFS, we're unable to scale horizontally. This means that git operations are being concentrated on our 12 file servers. Our concern is whether they have enough capacity to handle all git traffic until we can turn off NFS. Once NFS is off, we can start scaling horizontally, so this issue will resolve itself.
Plans
Migrations, migrations, migrations, migrations! Deliver migrations faster, deliver more migrations
@briann thanks for adding; @sitschner please do include Security for this iteration. But feel free to ask Brian to present that part :-) I think it makes sense to have a separate FGU for Security when it is a larger team. For now, Infra + Security makes sense.
Thanks for putting the FGU together @sitschner, and thanks everyone for input! To clean up the slides prior to publication (and as a note for future updates), can each team lead please make sure to include links to the relevant issues in the slides?