[Fix #51242] Rework in_batches(use_ranges: true) to be more efficient

Created by: maximerety

Motivation / Background

Taken from: https://github.com/rails/rails/issues/51242

In ActiveRecord 7.1, a new option was added to ActiveRecord::Batches#in_batches, use_ranges: true, that enables a more efficient way to generate queries from the provided scope for each batch.

The resulting queries use ranges to select record ids, e.g. WHERE id > 10000 AND id <= 20000 instead of passing a possibly very long list of ids in an IN clause, e.g. IN (10000, 10001, [...,] 19999, 20000).

This option was implemented in https://github.com/rails/rails/pull/45414.

The implementation of use_ranges: true reuses the strategy already in place for use_ranges: false: when generating the scope for each batch, it selects all the ids in the batch.

The typical SQL query to generate a scope for a batch is (example with batches of 10k records):

SELECT "users"."id" FROM "users" WHERE "users"."id" > ... ORDER BY "users"."id" ASC LIMIT 10000;

But subsequently, all these ids but the last one are discarded:

https://github.com/rails/rails/blob/v7.1.3.2/activerecord/lib/active_record/relation/batches.rb#L379-L380

This method wastes resources in the database (CPU, I/O), in transit (bandwidth) and in the Ruby process (RAM).
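To make the waste concrete, here is a pure-Ruby sketch of the current scope-generation step (a simplification for illustration, not the actual Rails code; `ids` stands in for the table's primary key column, and the select/first combination mirrors the WHERE/LIMIT query above):

```ruby
# Simplified model of the current use_ranges: true scope generation.
# In the real implementation this step is a SQL query:
#   SELECT id FROM users WHERE id > last_id ORDER BY id ASC LIMIT batch_size
def next_batch_end(ids, last_id, batch_size)
  batch_ids = ids.select { |id| id > last_id }.first(batch_size)
  return nil if batch_ids.empty? # no more batches
  batch_ids.last                 # every other id was fetched for nothing
end

# For a batch of 10 starting after id 0, ten ids are materialized
# but only the last one is kept.
next_batch_end((1..100).to_a, 0, 10) # => 10
```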

This Pull Request is a proposal to save these resources by implementing an additional optimisation.

Detail

The description below applies only to the use of in_batches with the option use_ranges: true.

Instead of returning all the ids of each batch, we use OFFSET to fetch only the last id of the next batch, e.g.

  -- queries for all batches but the last (before/after)
- SELECT "users"."id" FROM "users" WHERE "users"."id" > ... ORDER BY "users"."id" ASC LIMIT 10000;
+ SELECT "users"."id" FROM "users" WHERE "users"."id" > ... ORDER BY "users"."id" ASC LIMIT 1 OFFSET 9999;

There is one downside, however.

Our OFFSET-based query cannot return the last id when the next batch is smaller than the limit, or when there is no next batch. We therefore always need one additional query to get the size and last id of the very last batch. For that purpose, we use the same query as before this optimisation:

  -- queries for the last batch (before/after)
- SELECT "users"."id" FROM "users" WHERE "users"."id" > ... ORDER BY "users"."id" ASC LIMIT 10000;
+ SELECT "users"."id" FROM "users" WHERE "users"."id" > ... ORDER BY "users"."id" ASC LIMIT 1 OFFSET 9999;
+ SELECT "users"."id" FROM "users" WHERE "users"."id" > ... ORDER BY "users"."id" ASC LIMIT 10000;

Unless we only have a handful of small batches, this strategy is a winner overall, as it reduces the time and network resources spent generating batches, as shown by the benchmark below.
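The whole strategy, including the last-batch fallback, can be simulated in plain Ruby (a sketch assuming an integer primary key, not the actual implementation; each step corresponds to one of the SQL queries shown above):

```ruby
# Simulation of the OFFSET-based batching strategy.
# ids stands in for the sorted primary key column; batch boundaries are
# returned as [low, high] pairs meaning "WHERE id > low AND id <= high".
def batch_bounds(ids, batch_size)
  bounds = []
  last_id = 0
  loop do
    remaining = ids.select { |id| id > last_id }
    break if remaining.empty?
    # "... WHERE id > last_id ORDER BY id ASC LIMIT 1 OFFSET batch_size - 1"
    end_id = remaining[batch_size - 1]
    # OFFSET returned no row: this is the final, possibly smaller batch, so
    # fall back to the pre-optimisation query (LIMIT batch_size) to find
    # its last id.
    end_id ||= remaining.last
    bounds << [last_id, end_id]
    last_id = end_id
  end
  bounds
end

batch_bounds((1..25).to_a, 10) # => [[0, 10], [10, 20], [20, 25]]
```

Note that only the final iteration pays for the fallback scan; every earlier batch costs a single one-row lookup.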

Additional information

Benchmark

Let's reuse the benchmark from the description of https://github.com/rails/rails/pull/45414, i.e. the same users table with 10M records, but with two modifications:

  • Batches of 10k records instead of 1k;
  • No call to batch.count on each batch, since we are measuring only the time and resources needed to generate the scopes, not to use them.

Here is the code used:

-- SQL setup
CREATE TABLE users (id bigserial PRIMARY KEY, val integer DEFAULT 0);
INSERT INTO users SELECT i FROM generate_series(1, 10000000) AS i;

# Ruby benchmark
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)

count = 0
User.in_batches(of: 10_000) do |batch|
  count += 1  # previously: batch.count, but we don't want to trigger any other queries in this benchmark
end

puts "Count = #{count}"
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
puts "Elapsed: #{elapsed}s"

The benchmark is executed on a single machine (a recent MacBook), with a round-trip time < 1 ms.

The network (I/O) stats are obtained by comparing the result of this command before/after a benchmark run:

docker stats --no-stream postgres --format 'table {{.NetIO}}'

Here are the results:

| Batching method | Duration | Network (I/O) |
| --- | --- | --- |
| in_batches(use_ranges: true) | 5.6 s | 1.2 MB / 180 MB |
| in_batches(use_ranges: true) + optim from this PR | 2.3 s | < 0.5 MB / < 0.2 MB |

In this benchmark, we generated the batch scopes 2.4× faster and with roughly 900× less bandwidth. These results will of course vary with the batch size used and the network speed.

The absolute values may not look like huge gains, but I'm considering using in_batches(use_ranges: true) on tables with over 3 billion records, where this benchmark extrapolates to a saving of over 16 minutes and about 52.7 GB of bandwidth (plus the database resources needed to produce and send this traffic).

Other strategies considered

Using OFFSET is not the only way to achieve this kind of optimisation.

For example, we could consider a strategy that computes the maximum id and the size of each batch in a single query (details folded below).

Using a CTE
We could use a CTE like this one:
WITH "batch_ids" AS (
    -- same query as today
    SELECT "users"."id" FROM "users" WHERE (id > 10000) ORDER BY "users"."id" ASC LIMIT 10000
)
SELECT COUNT(id) AS last_batch_size, MAX(id) AS last_batch_id FROM batch_ids;

This would be advantageous, as we would not have to use that "extra" query for the last batch as we did above.

Nevertheless, using a CTE would have the following downsides:

  1. CTEs are not supported yet by all RDBMS supported by Active Record (MySQL < 8.0)
  2. To my knowledge, ActiveRecord does not provide any way to easily generate a query with two distinct calculations such as a count and a maximum. The "manual" workarounds for this could be complicated.

For the record, the query above can be generated with:

User.connection.select_one(
  User
    .with(batch_ids: User.select('id').where('id > 10000').order(:id).limit(10000))
    .from('batch_ids')
    .select('COUNT(id) AS last_batch_size', 'MAX(id) AS last_batch_id')
)
Using a subquery
We could use a subquery like this one:
SELECT
    COUNT(id) AS last_batch_size, MAX(id) AS last_batch_id
FROM (
    -- same query as today
    SELECT "users"."id" FROM "users" WHERE (id > 10000) ORDER BY "users"."id" ASC LIMIT 10000
) subquery;

This is even better than the CTE version, because it would be compatible with all RDBMS supported by ActiveRecord.

For the record, the query above can be generated with:

User.connection.select_one(
  User.select('COUNT(id) AS last_batch_size', 'MAX(id) AS last_batch_id').from(
    User.select('id').where('id > 10000').order(:id).limit(10000)
  )
)

But to support all sorts of primary keys (including composite) and multiple calculations in a single query, I suspect some complicated work is required in Active Record. If you think this is feasible at a reasonable cost, please provide some guidance 🙏

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.
