(execute in DigitalOcean droplets with 2 GB RAM / 2 CPUs / 40 GB SSD)
Benchmark 1: notification channel throughput
Fire 10K hook notifications from Primary to Secondary in different geographical locations
(NYC3 -> AMS3) (just receive and generate the jobs; workers must be turned off to isolate)
Open rails console on primary
Select a random project and emulate any notification trigger with something like: 10000.times { }
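A minimal, self-contained sketch of that console loop (here `fire_notification` is a hypothetical stand-in for the real project hook trigger, which in the actual console would schedule a system-hook job):

```ruby
require "benchmark"

# Hypothetical stand-in for the real trigger -- in the Rails console
# you'd call the project's hook/notification method instead.
scheduled = 0
fire_notification = -> { scheduled += 1 }

elapsed = Benchmark.realtime do
  10_000.times { fire_notification.call }
end

puts "scheduled #{scheduled} notifications in #{elapsed.round(3)}s"
```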
Before commenting further, let me see if I understand what happens with Geo:
User pushes a commit.
This triggers a Webhook to call POST /geo/receive_events with an update request.
GeoRepositoryUpdateWorker runs and clones or pulls the latest code.
I think the main questions to be answered:
How frequently can this update be handled?
What happens if multiple workers for the same repo get fired at the same time? It looks like gitlab-shell is run on the same directory. Are there any issues there?
Do other Sidekiq workers get fired after the update (e.g. PostReceive?)
I think it makes sense to have a separate Sidekiq queue but that queue can still be operated on the same worker so that in the future another worker could take the load.
Another issue here is that memory and CPU graphs are highly dependent on a lot of things, so I'm not quite sure you'll be able to separate Geo itself from the overhead of GitLab.
@stanhu you are right about how git push / update in Geo works.
Other jobs may be fired after the update in many different situations (on the primary node only), e.g.:
PostReceive, in the context of a project, triggers execute_hooks and execute_services, and also execute_hooks for system hooks. They all schedule async jobs.
This is something that will change from project to project, so the idea is to see how much load we are adding just to get Geo working.
I know this is more of a micro-benchmark, as I believe optimizing the other part of the load is out of scope for this issue. The real-world load will differ from company to company, but that is something they already optimize for; we are just adding an extra layer.
@brodock I think this basic testing you listed is helpful for testing, and I think you should try these things and see what might break. I also noticed that gitlab_projects.rb in gitlab_shell times out after 120 seconds. I'd imagine this could prevent your large 1 GB files from finishing.
I also think it might be a more realistic test to mirror the gitlab-ce repo and see how things perform on a day-to-day basis.
I couldn't tell whether GeoRepositoryUpdateWorker also caused the PostReceive worker to run, but I assume it has to because otherwise the Geo site wouldn't see the updates. I couldn't tell based on a quick scan of the code.
Also, if two update events happen within a short amount of time, two git processes may be running in the same directory. One may abort due to some error. Will this be an issue and cause some exception to bubble up?
@stanhu I believe the git process uses some file locking mechanism (I must check that); it may either fail or wait/time out depending on how gitlab-shell handles it (I also have to read and try to understand the code). As this is executed inside a Sidekiq job, that's not a problem: the job will handle the error or time out, and retry later.
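For reference, git serializes its own ref updates through lock files, and the same wait-or-fail-fast choice can be sketched in plain Ruby with `File#flock` (the helper name is made up for illustration):

```ruby
require "tmpdir"

# Hypothetical helper: serialize "repository updates" through a lock file,
# the same basic mechanism git uses (.git/*.lock). With wait: false the
# caller fails fast instead of blocking behind another update.
def with_repo_lock(path, wait: true)
  File.open(path, File::RDWR | File::CREAT) do |f|
    flags = File::LOCK_EX
    flags |= File::LOCK_NB unless wait
    return :busy unless f.flock(flags) # flock returns false when a non-blocking attempt loses
    yield
  end
end

lock_path = File.join(Dir.tmpdir, "repo-update.lock")
result = with_repo_lock(lock_path) { :updated }
```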
Having multiple updates queued is not too problematic from the user's point of view: say you have 100 updates queued for the same project; the first update that runs will handle all of them. We still waste resources on the other 99, but the user is not affected.
This is something we may try to solve in the future if it becomes a problem. @dzaporozhets wants a simpler implementation for now.
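The claim that the first run absorbs all 100 queued updates can be sketched like this (a simplification, assuming each update job fetches to the latest remote state before later jobs run):

```ruby
# 100 queued update jobs for the same project: each job compares local
# state to the remote head, so only the first pays for the real work.
remote_head = "abc123"
local_head  = nil
fetches     = 0

update_job = lambda do
  if local_head != remote_head
    fetches += 1 # the expensive `git fetch` would happen here
    local_head = remote_head
  end
end

100.times { update_job.call }
puts "ran 100 jobs, performed #{fetches} fetch(es)"
```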
Again, I think it would be helpful to test Geo with your outlined stress plan, but also to just try it out over the next few weeks between two servers geographically distant from each other, with a GitLab CE repo (or some repo getting a lot of activity). Maybe with the RC1 you can really test this?
Great. The reason I ask about whether PostReceive runs after a Geo node receives an update is that all the system/project hooks could run, which may take more time/CPU/memory than the actual fetch of the repository data. For example, try to configure the emails-on-push hook for the Geo node. :)
Oh, I may have misread what you said earlier about PostReceive. On a secondary Geo node they don't run at all, as we handle everything from the API endpoint only. On the primary, it's PostReceive that indirectly fires the webhooks, see:
Update to the benchmark effort.
I've installed InfluxDB and Grafana on 2 separate machines for both instances in their geographical locations (NYC3 and AMS2).
I've tried to get CPU usage integrated with Gitlab::Metrics, but getting accurate CPU usage per process turns out to be hard, so I'm rolling back that idea.
@yorickpeterse said: ah, you specifically want to measure the load it has on the host?
The tricky thing with CPU usage is that you can only get a snapshot of the current usage, which may not reflect the usage of the workload
That is, if you measure the usage at the end you might not see the actual CPU usage (as it may have been lowered by then)
I don't know if there are any external tools that can measure CPU usage of a process over time
I've made a patch to track system load, which will help give an idea of how much we are demanding from the machine, and ultimately help us answer how much is too heavy for the minimal required machine.
Gitlab::Metrics.measure will be used to measure the pieces of the codebase we care about for this benchmark.
```
Rehearsal ------------------------------------------------
System hooks  44.610000   7.010000  51.620000 ( 74.400804)
-------------------------------------- total: 51.620000sec

                   user     system      total        real
System hooks  44.310000   5.860000  50.170000 ( 80.495860)
=> [#<Benchmark::Tms:0x00000007d3b560 @label="System hooks", @real=80.495859522, @cstime=0.0, @cutime=0.0, @stime=5.859999999999999, @utime=44.31, @total=50.17>]
```
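For reference, output in that shape comes from Ruby's standard Benchmark.bmbm (a rehearsal pass, then a measured pass); here is a self-contained sketch of the same harness, with a trivial workload standing in for the system-hook scheduling:

```ruby
require "benchmark"

# bmbm runs a rehearsal first (to warm up caches/GC), then the measured
# pass -- which is why the pasted output has a "Rehearsal" section.
results = Benchmark.bmbm do |x|
  x.report("System hooks") do
    100_000.times { |i| i.to_s } # trivial stand-in workload
  end
end

tms = results.first
puts format("total CPU: %.2fs, wall clock: %.2fs", tms.total, tms.real)
```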
This first measurement covers the "load" we add to the PostReceive step, which is minimal: the simulated 10,000 scheduled changes finished the "scheduling" step in under 2 minutes on the minimal-spec machine.
This doesn't tell the whole story.
Delivering the hooks took about 10 minutes and the bottleneck is the concurrency level of the receiving machine (which was set to 3).
Updating the repository requires launching gitlab-shell, and it is CPU bound:
This is the throughput for the repository update on that machine:
I will try another batch with a bigger machine (more CPU) to get a better idea of the bottleneck.
If we limit the concurrency level on Sidekiq, this will probably give us better numbers, as the load on the machine will be smaller.
This first batch used all 25 workers a standard install enables, which is too much for this type of operation on this type of machine.
sidekiq concurrency not defined (used default: 25)
unicorn concurrency: 3
Secondary Machine:
4 CPU / 8 GB ($80 DigitalOcean droplet)
sidekiq concurrency: 12
unicorn concurrency: 3
Benchmark was re-executed using the same primary specs, but with the secondary at 4 CPU / 8 GB ($80 DigitalOcean droplet).
This execution gave some additional insights.
Primary node
Notification via the SystemHook API is one of the bottlenecks here: the secondary machine was running Unicorn with only 3 processes, so it took a very long time to send all the notifications.
Increasing the number of Unicorn workers will decrease the time here.
Memory consumption and load remained stable.
The only Rails transactions are API calls to authorize git operations, like this:
Started POST "/api/v3/internal/allowed" for 127.0.0.1 at 2016-05-26 16:49:11 -0400
So the jump after 16:40 came after Sidekiq stopped sending SystemHook notifications to the secondary machine (as we are using constrained resources on this machine, it made a visible difference).
Secondary node
Same increase in throughput at 16:40, as the machine load decreased after the hook notifications ended.
The decrease in load at 16:40 is because it stopped receiving hook notifications from primary.
There is a balance to strike here: increasing the Unicorn concurrency level would speed things up, but we would have to decrease the Sidekiq concurrency level, as both are CPU bound.
With GitLab.com's infrastructure, scaling this would be easy, as workers run on separate machines.
The bottleneck again is the number of Unicorn processes. This would not happen with the old buffered technique, given this machine/resource setup.
Maybe something we should consider porting to SystemHooks.
Some theories
There is still no conclusion on why it takes so long to update the repositories. It could be the network, the load, or Unicorn concurrency on the primary machine not being high enough to handle all the API calls that authorize access.
I will try a new batch with fewer updates (1K instead of 10K), using the same $80 machines for both, Unicorn concurrency at 4, and Sidekiq at 10.
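On an Omnibus install, those two knobs live in /etc/gitlab/gitlab.rb (values below mirror this batch, not a recommendation; run `gitlab-ctl reconfigure` afterwards):

```ruby
# /etc/gitlab/gitlab.rb -- concurrency settings used for this batch
unicorn['worker_processes'] = 4
sidekiq['concurrency']      = 10
```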
If that were true, then we should see a considerable impact on the throughput / time taken:
It is slightly better, but not by much.
Benchmark Summary
The communication channel is limited by the Unicorn concurrency level on the node (more Unicorns, faster delivery).
A git repository update takes ~4 to 10 seconds to finish. The time has little influence on the load of the primary machine, but more on the load of the secondary. It is also influenced by network latency/distance, etc. It's a CPU-heavy operation.
It's better to have workers on a different machine from the Unicorns (web app), as both are CPU bound.
If you aren't updating fast enough, you probably need either more workers, or more Unicorns on the secondaries (check the Sidekiq queue size; if it's huge, there is your best shot).