Benchmarks
Ran benchmarks today. I don't remember it being this slow the first time around; perhaps it only seemed fast because it was so much faster than the previous approaches.
2.5m issues in 10m45s
```
Benchmark.bm do |b|
  b.report('bulk_insert:') do
    Issue.bulk_insert do |w|
      ([{ project_id: p.id, author_id: author.id, title: FFaker::Lorem.sentence }] * 2500000).each { |i| w.add(i) }
    end
    nil
  end
end

              user     system      total        real
bulk_insert: 312.370000 176.510000 488.880000 (645.096816)
```
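One quirk worth noting: `[...] * 2500000` repeats a single hash, so `FFaker::Lorem.sentence` is evaluated once and every issue gets the same title. If distinct titles matter, a variant like the following should work (unbenchmarked; presumably slower because of the 2.5m FFaker calls):

```ruby
Issue.bulk_insert do |w|
  # Evaluate FFaker per row so each issue gets its own title
  2500000.times { w.add(project_id: p.id, author_id: author.id, title: FFaker::Lorem.sentence) }
end
```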
1.5m projects in 41m21s
```
Benchmark.bm do |b|
  b.report('bulk_insert:') do
    Project.bulk_insert do |w|
      1500000.times.map { |i| { name: n = FFaker::Product.product, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id } }.each { |p| w.add(p) }
    end
  end
end

              user     system      total        real
bulk_insert: 930.670000 861.590000 1792.260000 (2481.286568)
```
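As an aside, the project run above builds the full 1.5m-element array with `map` before writing anything. A minimal variation that streams rows straight into the worker instead (assuming the intermediate array and its GC pressure cost anything measurable; I haven't benchmarked this) would be:

```ruby
Project.bulk_insert do |w|
  1500000.times do
    n = FFaker::Product.product
    # Add each row as it's generated instead of materialising 1.5m hashes first
    w.add(name: n, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id)
  end
end
```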
---
1.5m projects in 30m22s with columns specified
This approach doesn't use the default values, so it might mean less data being sent to Postgres. The gem just generates INSERT statements in sets of 500 rows, so we could probably do better by increasing the set size.
```
Benchmark.bm do |b|
  b.report('bulk_insert:') do
    Project.bulk_insert(:name, :path, :creator_id, :namespace_id) do |w|
      1500000.times.map { |i| { name: n = FFaker::Product.product, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id } }.each { |p| w.add(p) }
    end
  end
end

              user     system      total        real
bulk_insert: 502.490000 749.220000 1251.710000 (1822.020671)
```
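The 500-row batching looks configurable, too: the jamis/bulk_insert gem takes a `set_size` option (going by its README; I haven't measured whether larger batches actually help here). A sketch:

```ruby
# Sketch: same insert, but in batches of 5000 rows instead of the default 500
# (set_size per the jamis/bulk_insert README; untested here).
Project.bulk_insert(:name, :path, :creator_id, :namespace_id, set_size: 5000) do |w|
  1500000.times do
    n = FFaker::Product.product
    w.add(name: n, path: n.parameterize("_"), creator_id: author.id, namespace_id: author.namespace_id)
  end
end
```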
1.5m projects in 4m23s using generate_series
Much better performance this way, and it could be extended to fill in default values, similar to https://github.com/jamis/bulk_insert/blob/master/lib/bulk_insert/worker.rb#L38-L44
```
Benchmark.bm do |b|
  b.report('raw_insert:') do
    query = <<-ENDSQL
      INSERT INTO projects (name, path, creator_id, namespace_id)
      SELECT md5(random()::text), md5(random()::text), #{author.id}, #{author.namespace_id}
      FROM generate_series(1, 1500000) s(i);
    ENDSQL
    puts ActiveRecord::Base.connection.execute(query).values.join("\n")
    nil
  end
end

             user     system      total        real
raw_insert:  0.000000   0.010000   0.010000 (263.568029)
```
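To address the default-values point, the raw query could also populate the timestamp columns inline. A sketch only: the `NOW()` values are my stand-in for the real column defaults, and I haven't checked which other columns the UI needs.

```ruby
# Sketch: include created_at/updated_at so seeded rows don't break the UI.
# NOW() is an assumption, not what the linked bulk_insert worker does.
query = <<-ENDSQL
  INSERT INTO projects (name, path, creator_id, namespace_id, created_at, updated_at)
  SELECT md5(random()::text), md5(random()::text), #{author.id}, #{author.namespace_id}, NOW(), NOW()
  FROM generate_series(1, 1500000) s(i);
ENDSQL
ActiveRecord::Base.connection.execute(query)
```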
---
1.5m projects in 5m33s using generate_series and defaults
This uses the active_record-pg_generate_series gem to include default values for each column. Without the defaults the UI will be broken, as values like `created_at` are used throughout.
```
Benchmark.bm do |b|
  b.report('generate_series:') do
    Project.insert_using_generate_series(1, 1500000) do |sql|
      sql.name = raw("md5(random()::text)")
      sql.path = raw("md5(random()::text)")
      sql.creator_id = author.id
      sql.namespace_id = author.namespace_id
    end
  end
end

                  user     system      total        real
generate_series: 0.000000 0.000000 0.000000 (332.865800)
```
1.5m projects in 2m7s by replacing md5 with number sequence
This version saves time by naming projects sequentially (`seed_project_1` through `seed_project_1500000`) instead of calculating multiple `md5` hashes. It still doesn't create associated tables like `project_features`, so the projects aren't very useful, but it's a good starting point for slowing things down (to mimic staging-scale performance).

```
Benchmark.bm do |b|
  b.report('generate_series:') do
    Project.insert_using_generate_series(1, 1500000) do |sql|
      project_name = raw("'seed_project_' || seq")
      sql.name = project_name
      sql.path = project_name
      sql.creator_id = author.id
      sql.namespace_id = author.namespace_id
    end
  end
end

                  user     system      total        real
generate_series: 0.000000 0.000000 0.000000 (127.377769)
```
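If the missing `project_features` rows turn out to matter, a follow-up INSERT ... SELECT could backfill one row per seeded project. This is a sketch only: the column list is a guess at the minimum and isn't verified against the schema.

```ruby
# Sketch: one project_features row per seeded project. Column list is an
# assumption; check the actual project_features schema before running.
# Note: '_' is a single-character wildcard in LIKE, which is close enough here.
query = <<-ENDSQL
  INSERT INTO project_features (project_id, created_at, updated_at)
  SELECT id, NOW(), NOW()
  FROM projects
  WHERE name LIKE 'seed_project_%';
ENDSQL
ActiveRecord::Base.connection.execute(query)
```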
---
Rake task used for testing Related Issues feature:
desc "Bulk insert" task bulk_insert: :environment do # require 'bulk_insert' or manually paste code from Gem first # After doing the following my queries ran at 1/10th of the speed of staging, instead of 1/5000th of the speed. # Disable database insertion logs so speed isn't limited by ability to print to console old_logger = ActiveRecord::Base.logger ActiveRecord::Base.logger = nil # Create lots of issues user = FactoryGirl.create(:user, username: "user-#{rand}", email: "foo-#{rand}@example.com") namespace = FactoryGirl.create(:namespace, owner: user, route: nil, name: "hi-#{rand}") p = FactoryGirl.create(:empty_project, creator: user, namespace: namespace) Issue.bulk_insert{|w|([project_id: p.id, author_id: user.id, title: FFaker::Lorem.sentence]* 2500000).each{|i| w.add(i)}};nil # Create lots of projects Project.bulk_insert{|w| 1500000.times.map{|i|{name: n=FFaker::Product.product, path: n.parameterize("_"), creator_id: user.id, namespace_id: user.namespace_id}}.each{|p| w.add(p)}} # Force a different/slower query plan by updating project visibility Project.where(visibility_level: Gitlab::VisibilityLevel::PRIVATE).limit(200000).update_all(visibility_level: Gitlab::VisibilityLevel::PUBLIC) Project.where(visibility_level: Gitlab::VisibilityLevel::PRIVATE).limit(20000).update_all(visibility_level: Gitlab::VisibilityLevel::INTERNAL) IssueLink.bulk_insert do |w| count = Issue.count first_issues = Issue.all.take(100_000) first_issues.each_with_index.each do |issue, i| relations = ((i+1)..(i+6)).to_a.map do |x| target_issue = first_issues[x] if target_issue { source_id: issue.id, target_id: target_issue.id } end end.compact relations.each { |i| w.add(i) } end end # Reset logging ActiveRecord::Base.logger = old_logger end
---
@oswaldo You might be interested in https://gitlab.com/gitlab-org/gitlab-ce/issues/28149 (I've started working on that: https://gitlab.com/gitlab-org/gitlab-ce/compare/master...28149-improve-seed)...
---
Great to know @rymai, thanks!