slow queries identified, representative data sample and/or system access
would allow testing alternative query structures and assist in identifying
possible changes in indexing strategy, see
https://gitlab.com/gitlab-com/infrastructure/issues/1786)
Staging with realistic load by running tests
Review possibility to enable full logging, to allow collection of all queries
to enable replaying on staging
Capacity planning
Implementation of pg_bouncer and other improvements will impact capacity planning and scalability
Review shared buffer utilization through pg_buffercache (requires installation) to determine memory utilization and working set size
Provide recommendations regarding changes to shared_buffers and system memory amounts
Recommendations regarding metrics for scalability and capacity planning
@pcarranza This was discussed via Email. Crunchy will be our consulting partner for PostgreSQL. Quoting some of the Emails:
Stephen Frost <stephen@crunchydata.com> wrote:

I understand we are all set up to get started whenever you are ready to begin. Based on the work-plan, we will be starting off with a health-check of the existing system. Typically, we would do that during an on-site visit where we could put hands-on-keyboard to review how the existing system is performing, review log files, the existing configuration, and more, but my understanding is that everything will be done remotely.

If that understanding is correct, we will need to work out how to perform the same activities remotely. For starters, at least, we could simply use email or similar for sharing configuration files and other information, though I'd also like to review the PostgreSQL log files and those might be quite large. Another option would be to set me up with access to a system where you could place the log files and other information, or I could set up a location where you could place the files. I am guessing they could have sensitive information and naturally we would want to make sure that anything we do is done in a secure and safe manner regarding your data and following your policies.

I'm also happy to simply pull down gitlab and set up a system here and start looking at it, but that really isn't the same as looking at a running system with load on it.

Please let me know your thoughts on how you'd like to proceed.

Thanks!
Ernst van Nierop wrote:

Hi Stephen,

I'll let Yorick respond to set up an actual time for a video call for the kick-off, hopefully early next week.

Regarding how to get set up and how we can share information, those are great questions ☺ For starters, can you please let us know what your handle is on GitLab.com? (If you don't have an account there yet, please make one.) We'd like you to be able to be pinged and to comment on the public issues in the infrastructure issue tracker (https://gitlab.com/gitlab-com/infrastructure/issues, try filtering by the database label). That won't let you see confidential issues just yet though. Will it just be you Stephen, or should we add others as well?

If you _make_ a confidential issue yourself, for example calling it "Health Check", then you will continue to be able to see it, so that may be a good way to keep this open to the company + Crunchy, but closed to the rest of the world. We can then post logs there. We should avoid using email as our primary means of communication, but rather work through the issue tracker for written communication, and augment that with (recorded) video calls when / where necessary.

Best
@sfrost Per the email discussion, I'm fine with doing the kick-off on April 3rd since you're out next week. In the meantime, is there anything I can prepare for you to make the kick-off smoother?
In some instances it might make sense to change the log settings ahead of the kick-off to collect additional information, but in this environment (as with other large-scale, extremely busy systems) we need to take a particularly careful approach to any changes to ensure that we don't destabilize the environment.
As noted above, one of the first things to look at is the configuration (in particular, postgresql.conf, but we can also review pg_hba/pg_ident if you are interested in reviewing connectivity/security considerations) and the PostgreSQL log files. Those are not required for the kick-off, but will definitely be of interest early on.
We'll have a meeting on Tuesday and Thursday at 17:00 UTC
Between these meetings we'll work async. Due to timezone differences, any data/output/logs/etc. is best requested in bulk
If possible we should get @sfrost some form of access to the database server so he can run commands himself (ideally without having to give him sudo access), instead of me being his proxy. cc @pcarranza
@sfrost will review the above configuration options, ask for the output of commands, etc to get a better understanding of how things are running
After that we'll decide what to work on next
Takeaways from today's meeting (the ones I can remember):
Definitely keep moving forward with pgbouncer
@sfrost suggested running pgbouncer on the application servers (meaning GitLab connects to localhost), given there aren't too many application servers. This could be interesting, but may not work because the web/background processes are quite resource heavy (CPU in particular)
Foreign keys are a good thing, and @sfrost was glad we're using these (at least more than we have in the past)
@sfrost mentioned we have a lot of indexes on some tables, and could probably benefit from partial indexes in various places. @yorickpeterse remarked that we've been trying to reduce the number of indexes, and prevent developers from adding new ones unless absolutely necessary. We'll need to start enforcing this better at some point, though @yorickpeterse isn't quite sure yet how
PostgreSQL's full-text search might be better than trigram indexes, or even ElasticSearch. @yorickpeterse remarked that unfortunately we cannot drop ElasticSearch at this point
Database load balancing setup makes sense, definitely better than synchronous replication (if I understood @sfrost correctly)
Monitoring SQL transaction timings (not just slow ones) would be a really good addition to the monitoring system
For PostgreSQL @sfrost mentioned that if you have 32 cores ideally you try not to have more than 50 concurrent active connections; the closer to the core count, the better
@yorickpeterse mentioned the goal of reducing PostgreSQL's max_connections from 3000 to 200-300; @sfrost mentioned we should probably go even lower
The hot_standby_feedback issue we ran into (https://gitlab.com/gitlab-com/infrastructure/issues/1501) is probably the result of a long running transaction, and not hot_standby_feedback itself. We need to test this to be certain.
@sfrost mentioned that as available RAM goes down on a DB server, IO load may increase due to the DB having to load data from disk (if there's not enough RAM). This means we have to carefully scale down hosts in small steps, instead of immediately picking the smallest instance we can cram things into
Another addition would be to perhaps give @sfrost access to Slack (e.g. a single-channel guest for the #database channel)? This could be useful to ask questions and get more immediate feedback.
Regarding load balancing vs. sync replication- those really serve different use-cases and, due to that, one is not 'better' or 'worse' in a general sense. In particular, if you absolutely must not lose any transactions, then you must use sync replication. To avoid having a single point of failure, multiple sync replicas must then be used. The load balancing approach described is good to avoid having the delay on commits which will naturally happen when sync replication is used but that is only a valid approach if sync replication is not required for multi-system commit durability.
PostgreSQL 9.6 provides an alternative mechanism for ensuring that any queries against replicas are current with the 'remote_apply' option. This can be used in cases where multi-system commit durability is required and sync replication is used to ensure that replicas are always current by moving the "waiting" from the read/select side to the write/commit side. The 'remote_apply' options makes PostgreSQL wait until the sync replicas have confirmed that they have replayed the COMMIT before allowing the COMMIT to complete on the primary, ensuring that the very next transaction is able to use a replica. This removes the need to complicate the application logic with tracking the current WAL position, as gitlab is currently doing, and ensures that replicas are always able to be used for all read-only transactions. Naturally, this also means that COMMITs associated with transactions that write data will take longer, potentially substantially longer if there is a burst above capacity or other activity which slows down ongoing replication.
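For illustration only (these are not recommended values for this environment, and the standby names are placeholders), the postgresql.conf settings involved on a 9.6 primary would look roughly like:
synchronous_standby_names = '2 (replica1, replica2)'   # wait for two synchronous standbys; '*' matches any standby name
synchronous_commit = remote_apply                      # COMMIT waits until the standbys have replayed the WAL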
First- has there been any use of ALTER SYSTEM? If so, it would be
useful to see the postgresql.auto.conf file (from the PG data
directory) as well, to see what has been overridden there.
listen_addresses = '0.0.0.0'
While not a large deal, I would typically recommend using '*' to
indicate all interfaces. This requires a restart and, as with all
of the recommended changes, should be tested in a similar
environment as production before changing as there can be edge cases
where '*' might fail.
max_connections = 3000
Far too high, really. Would be good to reduce, as discussed/planned.
shared_buffers = 112896MB
This is extremely interesting, given the numbers you mentioned on our
kick-off call that the system was only using ~40-60GB. We
need to review the postgresql.conf on the other systems and look at
the actual memory numbers (which provide more details than what is
visible through the monitoring system).
Shared buffers should always be fully allocated and every page read
from the database will pull a page into memory. If these numbers are
accurate (and I suspect they may not be) then the system has only
ever pulled something less than 60GB of pages from the disk since the
last reboot.
temp_buffers = 8MB (default setting)
If any of your processes are using temporary tables, it may make sense
to consider increasing this. This can be done on a per-session basis
if done before any temporary tables are created. In particular, I am
wondering if any of your background processes use temp tables.
work_mem = 16MB
Your queries are relatively simple in the general case and so I'm not
sure that it's worth changing this, but if you have particular queries
which run slowly, depending on what those queries are doing, it may be
helpful to raise this value.
shared_preload_libraries = 'pg_stat_statements'
Very good module to have. Are you currently monitoring the view
provided?
max_wal_size = 2GB
This isn't very much. We can review why checkpoints are happening to
see if it would make sense to adjust it, but just in general this is
a relatively small amount. Consider how much data your system is able
to write across a 5 minute period (probably upwards of 10GB).
checkpoint_completion_target = 0.7
Generally speaking, I would recommend making this 0.9. There's typically
little advantage to having it be less than that.
wal_keep_segments = 512
Generally speaking, I discourage using the server's WAL space for the
archive. If replicas fall behind enough that they have to get data which
the primary is no longer holding then having them pull the data back from
the archive is unlikely to be an issue, unless the archive is particularly
slow for some reason (in which case, I might recommend reconsidering your
archiving approach).
Logging
Note that the following changes may impact performance as they will increase
the amount of logging done on the system. I would recommend testing the
impact in a staging environment with load, if possible.
log_min_duration_statement = 2000
This is a rather high setting for this variable. Lowering this can provide
very useful timing detail about queries that are running for longer than we
would like them to. Of course, this has to be offset against the increase
in log traffic, but it looks like this could at least be lowered to 1000
based on the stats and perhaps even lower.
pg_stat_statements is also helpful, of course, but it doesn't provide timing
information about when a given query took the 'max time' which is recorded,
leaving one to wonder if there was a correlation with another bit
of activity on the system (a spike in queries, a checkpoint happening, etc).
log_checkpoints = off (default)
This should be enabled. Generally speaking, it shouldn't create much log
traffic as it will, ideally, only be 2 log messages every 5 minutes or so.
log_connections/disconnections = off (default)
I would typically recommend enabling this. If the applications are
maintaining an ongoing connection (and not reconnecting for every query)
then this shouldn't create much additional logging traffic, but can be
useful to know when trying to make sense of other log messages.
log_line_prefix = '%t [%p]: [%l-1] '
I generally recommend using:
'%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
or something equivalent which will include %a, %u, %d, and %h, all of
which are missing from your log_line_prefix. This also, conveniently,
matches what pgbadger expects (a very useful log analyzing tool) while
including additional, typically useful, information. If there is
concern about the size increase in the log files, this can be reworked
to eliminate some of the text.
log_lock_waits = off (default)
This is an extremely useful option to enable as it will tell you whenever
a process has been blocked behind another process due to locking, and the
lock has taken longer than (deadlock_timeout) amount of time (defaults to
1 second). This shouldn't increase your logs much, if at all.
log_temp_files = -1 (default)
If temp files are being created by queries, it's useful to know about it.
I generally recommend setting this to '0' unless there is good cause not
to. If there is a lot of temp file usage (which I tend to doubt in this
environment, but it's possible), then this could increase log volume.
track_io_timing (missing)
This postgresql.conf file is missing (at least) this parameter, indicating
that it was, perhaps, copied up from an older version of PG. This
parameter was added in PostgreSQL 9.2. In general, I would recommend
using the postgresql.conf for the most recent version of PostgreSQL and
copying through the settings which are appropriate for that version
instead of just copying up the postgresql.conf as the parameters do change
from release to release.
This parameter is particularly useful in identifying when I/O takes a long
time, but it is necessary to make sure that the time can be gathered with
little overhead. The 'pg_test_timing' tool is included with PostgreSQL
and should be used to ensure that most timing queries are less than one
usec before enabling track_io_timing.
Autovacuum
The settings configured for autovacuum appear to be rather aggressive. Viewing
the log files (thanks to having log_autovacuum_min_duration set to zero) will
help to determine if this is the case. VACUUM, in general, is cheaper on 9.6
than it was previously, but that doesn't necessarily make it a good idea to
spend an excessive amount of resources on it. The delay and cost limits look
like they likely keep autovacuum from sucking up all resources but it'll be
useful to review the logs to see how effective they are.
max_locks_per_transaction = 64 (default)
While I doubt you've had any issues with locks thus far, it's important to
understand how the number of overall locks are calculated. The total
number of locks are calculated by multiplying max_locks_per_transaction and
max_connections. This means that your current max number of locks is quite
large, but as max_connections is reduced, so will the number of locks. Note
that these are relation-level locks, not row-level, so the number of
relations in the system also has to be considered. We will review this by
looking at pg_dump output and can adjust this number if necessary.
pg_stat_statements
There doesn't appear to be any configuration of pg_stat_statements, unless that
has been done through other means (eg: ALTER SYSTEM). In particular, it may be
useful to consider how many distinct statements are used and adjust the value
of pg_stat_statements.max accordingly. I wouldn't generally recommend lowering
it, but if there are more distinct statements than the default max of 5000 then it
may make sense to increase that value. We may be able to determine if we need to
adjust this setting based on the data collected.
The following commands will help us understand how the system is operating
today and what a typical workload looks like. None of these commands is
anticipated to be very expensive to run; they are just collecting
information which should be readily available and reasonably low volume.
All commands are run with 'psql' or 'pg_dump' and assume that the user is
a superuser. The psql output format used is CSV to allow for re-loading of
the data into PG for analysis. Comments about each command are included
in-line. No data is expected to be particularly sensitive, though it is
possible that there are values in queries returned which contain user data.
If this is a concern then the data should be reviewed prior to being shared.
The commands should all work, naturally with the '##DATABASE##' replaced
with the appropriate database name (or removed if your configuration defaults
to the correct database).
Pull information about all databases, as those are not included by pg_dump
psql -d ##DATABASE## -c "COPY (table pg_database) TO STDOUT WITH CSV HEADER;" > pg_database.csv
Pull information about all tablespaces, as those are not included by pg_dump
psql -d ##DATABASE## -c "COPY (table pg_tablespace) TO STDOUT WITH CSV HEADER;" > pg_tablespace.csv
Current configuration information
Get the current settings
psql -d ##DATABASE## -c "COPY (table pg_settings) TO STDOUT WITH CSV HEADER;" > pg_settings.csv
Pull information about how PostgreSQL was built and installed
psql -d ##DATABASE## -c "COPY (table pg_config) TO STDOUT WITH CSV HEADER;" > pg_config.csv
Pull information about what is currently configured in postgresql.conf
psql -d ##DATABASE## -c "COPY (table pg_file_settings) TO STDOUT WITH CSV HEADER;" > pg_file_settings.csv
Statistical information
Pull table statistics
psql -d ##DATABASE## -c "COPY (table pg_stat_all_tables) TO STDOUT WITH CSV HEADER;" > pg_stat_all_tables.csv
Pull block-level table statistics
psql -d ##DATABASE## -c "COPY (table pg_statio_all_tables) TO STDOUT WITH CSV HEADER;" > pg_statio_all_tables.csv
Pull index statistics
psql -d ##DATABASE## -c "COPY (table pg_stat_all_indexes) TO STDOUT WITH CSV HEADER;" > pg_stat_all_indexes.csv
Pull block-level index statistics
psql -d ##DATABASE## -c "COPY (table pg_statio_all_indexes) TO STDOUT WITH CSV HEADER;" > pg_statio_all_indexes.csv
Pull background writer statistics
psql -d ##DATABASE## -c "COPY (table pg_stat_bgwriter) TO STDOUT WITH CSV HEADER;" > pg_stat_bgwriter.csv
Pull archiver statistics
psql -d ##DATABASE## -c "COPY (table pg_stat_archiver) TO STDOUT WITH CSV HEADER;" > pg_stat_archiver.csv
Pull in-progress VACUUM information
psql -d ##DATABASE## -c "COPY (table pg_stat_progress_vacuum) TO STDOUT WITH CSV HEADER;" > pg_stat_progress_vacuum.csv
Pull status of replicas
psql -d ##DATABASE## -c "COPY (table pg_stat_replication) TO STDOUT WITH CSV HEADER;" > pg_stat_replication.csv
Pull the size of all tables, indexes, materialized views, and toast tables
psql -d ##DATABASE## -c "COPY (select relname, nspname, pg_relation_size(pg_class.oid) from pg_class join pg_namespace on relnamespace = pg_namespace.oid where relkind IN ('r', 'i', 'm', 't')) TO STDOUT WITH CSV HEADER;" > pg_class_sizes.csv
Query data
Pull query information from pg_stat_statements (will fail if pg_stat_statements has not been installed)
psql -d ##DATABASE## -c "COPY (table pg_stat_statements) TO STDOUT WITH CSV HEADER;" > pg_stat_statements.csv
Current activity information- ongoing queries and transaction lengths
psql -d ##DATABASE## -c "COPY (table pg_stat_activity) TO STDOUT WITH CSV HEADER;" > pg_stat_activity.csv
Current locks being held
psql -d ##DATABASE## -c "COPY (table pg_locks) TO STDOUT WITH CSV HEADER;" > pg_locks.csv
With all of the above .sql/.csv files created, please review them if you are concerned about
any sensitive data, and once you are comfortable, provide them via this issue or however is
most convenient for you.
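As a hypothetical sketch of re-loading one of these CSVs into PostgreSQL for analysis (the snapshot table name is made up; the same pattern applies to any of the files above):
-- In a scratch/analysis database, create an empty table with the same columns as the view
CREATE TABLE pg_stat_all_tables_snapshot AS SELECT * FROM pg_stat_all_tables WITH NO DATA;
-- Then, from psql, load the collected CSV into it
\copy pg_stat_all_tables_snapshot FROM 'pg_stat_all_tables.csv' WITH CSV HEADER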
PostgreSQL 9.6 provides an alternative mechanism for ensuring that any queries
against replicas are current with the 'remote_apply' option.
I probably should have mentioned this during the kick-off. We looked into
remote_apply as an alternative to full synchronous replication, combined with
setting synchronous_standby_names to 2 (*) (since we have 2 replicas).
Ultimately we decided not to go with this solution. As far as I understand, if
one or more replicas are not responding, every COMMIT using remote_apply will
block until the number of available replicas is equal to or greater than the
number configured in synchronous_standby_names. For our use case (= just
balancing load) this didn't feel like the right approach.
First- has there been any use of ALTER SYSTEM?
Not that I am aware of. Our configuration is usually done in postgresql.conf,
and sometimes using ALTER TABLE (e.g. for table-specific vacuuming settings).
shared_buffers = 112896MB
This setting was based on our old memory consumption (prior to load balancing
and pgbouncer). I anticipate being able to lower this value, but we'll have to
do so carefully.
If any of your processes are using temporary tables, it may make sense to
consider increasing this. This can be done on a per-session basis if done
before any temporary tables are created. In particular, I am wondering if any
of your background processes use temp tables.
Our application does not make explicit use of temporary tables. I do think we
have a few queries that end up materializing data (as in, the query planner ends
up doing this), but I'm not 100% certain.
shared_preload_libraries = 'pg_stat_statements'
Very good module to have. Are you currently monitoring the view provided?
Yes, we use it for our PostgreSQL dashboards.
This isn't very much. We can review why checkpoints are happening to see if
it would make sense to adjust it, but just in general this is a relatively
small amount. Consider how much data your system is able to write across a 5
minute period (probably upwards of 10GB).
Does increasing this value have an impact on the time it takes to replicate data
to secondaries?
Generally speaking, I discourage using the server's WAL space for the archive.
If replicas fall behind enough that they have to get data which the primary is
no longer holding then having them pull the data back from the archive is
unlikely to be an issue, unless the archive is particularly slow for some
reason (in which case, I might recommend reconsidering your archiving
approach).
We archive to S3 and Azure. While these services aren't that slow, I think they
might be slower than just replicating from a primary directly. Having said that,
I haven't tested this so I'm not certain.
The settings configured for autovacuum appear to be rather aggressive. Viewing
the log files (thanks to having log_autovacuum_min_duration set to zero) will
help to determine if this is the case. VACUUM, in general, is cheaper on 9.6
than it was previously, but that doesn't necessarily make it a good idea to
spend an excessive amount of resources on it. The delay and cost limits look
like they likely keep autovacuum from sucking up all resources but it'll be
useful to review the logs to see how effective they are.
We used to have more relaxed settings, but this could easily lead to many (>=
10k) dead tuples in some of the more frequently updated tables. Having said
that, we could probably adjust a few to be more relaxed.
Per our discussion regarding deliverables, please find below where we are currently.
Naturally, we can discuss and adjust the items as needed.
Health Check
Analyze existing configuration and architecture, provide recommendations (ongoing)
Review system statistics and identify any areas of concern (ongoing)
Review schema and provide recommendations (ongoing; data/system access would help identify opportunities for improvement)
Review PostgreSQL log files to identify any issues and recommendations (pending, waiting for log files)
Backup and Restore
Review existing backup methodologies and provide recommendations (pending)
PGBouncer
Review pgbouncer.ini and provide recommendations (done)
Discuss and provide recommendations regarding PGBouncer locations/architecture (done)
Read replicas
Review replica stats and log files (pending)
Determine cause of hot_standby_feedback causing table bloat (done; likely related to issue addressed in 9.6.2)
HA / Failover
Review corosync/pacemaker configuration and provide recommendations (pending)
Assist with understanding of corosync/pacemaker, current cluster status, managing failover (pending)
Monitoring
Review existing monitoring, provide recommendations for additional monitoring (ongoing)
Tuning / Settings
Review postgresql.conf and provide recommendations (done)
Review autovacuum runs (pending log files with autovacuum information)
Review autovacuum settings and provide recommendations (pending)
Application binding to replicas
We discussed this during our calls, but it is unclear if further discussion or review is necessary. (done, unless further analysis requested)
Application improvements
Review slow queries for possible optimization (ongoing; slow queries identified, representative data sample and/or system access would allow testing alternative query structures and assist in identifying possible changes in indexing strategy)
Staging with realistic load by running tests
Review possibility to enable full logging, to allow collection of all queries to enable replaying on staging (pending logging changes and impact analysis)
Capacity planning
Implementation of pg_bouncer and other improvements will impact capacity planning and scalability.
Review shared buffer utilization through pg_buffercache (requires installation) to determine memory utilization and working set size
Provide recommendations regarding changes to shared_buffers and system memory amounts
Review of existing monitoring (based on what is visible to me through the public site) and recommendations for additional monitoring:
Monitor WAL/xlog files which are ready to be archived (count of files in pg_xlog/archive_status which end in ".ready")
Point-in-time recovery requires WAL files. If many WAL files are not yet archived and a failure occurs then the backup will be missing, potentially, a lot of WAL.
WAL files not being archived may be indicative of a problem with the archive_command
Unarchived WAL files may continue to accumulate and result in disk space issues
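A sketch of how that count could be pulled from SQL on 9.6 (pg_ls_dir requires superuser, and the directory is renamed to pg_wal in later releases):
-- Count WAL segments waiting to be archived
SELECT count(*) AS wal_segments_ready_to_archive
FROM pg_ls_dir('pg_xlog/archive_status') AS f(name)
WHERE f.name LIKE '%.ready';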
Monitor database transaction age
In PostgreSQL, we record the oldest transaction id (xid) of all tuples in pg_class (on a relation basis) and the minimum of all relations in a database in pg_database
Transaction IDs will eventually "wrap" around and be re-used
If PostgreSQL runs out of available transaction IDs, it is no longer able to accept new transactions
SELECT datname, age(datfrozenxid) FROM pg_database;
If done on a per-table basis, be sure to consider both the main table and any TOAST table
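As a sketch, a per-table check which also accounts for the TOAST table could look like:
-- Oldest xid age per table, taking the greater of the main relation and its TOAST table
SELECT c.oid::regclass AS table_name,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) AS xid_age
FROM pg_class c
LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relkind = 'r'
ORDER BY xid_age DESC
LIMIT 20;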
Total number of locks acquired
There is a finite number of heavy-weight locks available in PostgreSQL, which is max_connections * max_locks_per_transaction
Running out of locks will prevent processes from being able to take new locks, resulting in errors
The pg_locks table contains the current PostgreSQL locks
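For example (a sketch), current lock usage can be compared against that ceiling with:
-- Locks currently held vs. the approximate lock-table capacity
SELECT (SELECT count(*) FROM pg_locks) AS locks_held,
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections')
       * (SELECT setting::int FROM pg_settings WHERE name = 'max_locks_per_transaction') AS approximate_lock_capacity;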
Transaction length
Long-running transactions can cause bloat, as VACUUM is unable to mark deleted rows as available space while those transactions remain open
The view pg_stat_activity includes information about all transactions, when they started, and their current status.
Replicas must also be checked for long-running transactions when hot_standby_feedback is being used
Connections which are 'idle in transaction' may be worth monitoring for explicitly as these can be particularly problematic
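A sketch of the kind of query that could be monitored (column names as in 9.6's pg_stat_activity):
-- Longest-running transactions, including how long each session has been in its current state
SELECT pid, usename, state,
       now() - xact_start AS transaction_age,
       now() - state_change AS time_in_current_state
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 20;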
PGBouncer monitoring
pgbouncer provides a number of different stats which should be monitored to avoid reaching limits set in pgbouncer.ini
Monitoring of "waiting" clients can indicate if the pgbouncer configuration should be changed to allow more connections to PG
The number of clients may approach the maximum configured, resulting in subsequent connection attempts being delayed or failing
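These statistics come from pgbouncer's admin console; a sketch of the relevant commands (run against the special 'pgbouncer' database, e.g. psql -p 6432 pgbouncer):
SHOW POOLS;    -- per-pool counts, including cl_active, cl_waiting, sv_active, sv_idle
SHOW STATS;    -- cumulative query and traffic statistics per database
SHOW CLIENTS;  -- one row per client connection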
Review existing backup methodologies and provide recommendations (pending)
For this part I would like you or @northrup to join (when he's back) to explain a bit about this to @sfrost , since I know very little about our WAL-E setup. We have our meetings on Tuesday/Thursday at 17:00 UTC. What day would be convenient for you?
Last updated: Tue Apr 11 18:10:26 2017
Last change: Sat Mar 11 08:13:22 2017 by hacluster via crmd on db1.cluster.gitlab.com
Stack: corosync
Current DC: db1.cluster.gitlab.com (version 1.1.14-70404b0) - partition WITHOUT quorum
2 nodes and 1 resource configured

Online: [ db1.cluster.gitlab.com ]
OFFLINE: [ db2.cluster.gitlab.com ]

Full list of resources:
 gitlab_pgsql (ocf::pacemaker:gitlab_pgsql): Started db1.cluster.gitlab.com

Failed Actions:
* gitlab_pgsql_monitor_30000 on db1.cluster.gitlab.com 'unknown error' (1): call=119, status=Timed Out, exitreason='none', last-rc-change='Wed Mar 29 03:06:16 2017', queued=0ms, exec=0ms
db2 is an old secondary. Output on a secondary that's active:
Last updated: Tue Apr 11 18:11:17 2017
Last change: Fri Mar 17 14:05:53 2017 by hacluster via crmd on db3
Stack: corosync
Current DC: db3 (version 1.1.14-70404b0) - partition with quorum
1 node and 1 resource configured

Online: [ db3 ]

Full list of resources:
 gitlab_pgsql (ocf::pacemaker:gitlab_pgsql): Started db3
@sfrost Log files of the secondary: removed This URL also expires 48 hours from now.
The log directories are from runit, so their structure is a bit goofy. There are a bunch of *.s.gz files; these are archived log files in gzip format. The file named current contains the most recent logs in plain text format. Uncompressed, the total log size is probably a few GBs.
Following up from our discussion regarding recursive queries, the query I had wasn't quite right, but here's one which I believe to be correct, and includes the project_group_links piece (hopefully I understood that relationship correctly):
WITH RECURSIVE namespace_access AS (
    SELECT namespaces.id, access_level
    FROM members
    JOIN namespaces ON (members.source_id = namespaces.id)
    WHERE members.user_id = 1 AND members.source_type = 'Namespace'
  UNION ALL
    SELECT namespaces.id, access_level
    FROM namespaces, namespace_access
    WHERE namespaces.parent_id = namespace_access.id
),
project_namespace_access AS (
    SELECT projects.id, access_level
    FROM projects
    JOIN namespace_access ON (projects.namespace_id = namespace_access.id)
),
project_group_access AS (
    SELECT project_id, group_access AS access_level
    FROM project_group_links
    JOIN namespace_access ON (project_group_links.group_id = namespace_access.id)
),
project_direct_access AS (
    SELECT source_id AS id, access_level
    FROM members
    WHERE members.user_id = 1 AND members.source_type = 'Project'
)
SELECT id AS project_id, min(access_level) AS access_level
FROM (
    SELECT * FROM project_namespace_access
    UNION ALL SELECT * FROM project_direct_access
    UNION ALL SELECT * FROM project_group_access
) AS all_authorizations
GROUP BY id
ORDER BY id;
This query ends up being a bit expensive, largely because there's no index on project_group_links (group_id), I believe, but it's still only 18ms. Without including the project_group_links CTE the query time was around 2ms. There may also be opportunities to optimize this query, I haven't looked very hard to see if that would be possible.
(Note that the 'order by' at the end isn't strictly necessary, of course)
While analyzing the results and comparing them to this:
select project_id, access_level from project_authorizations where user_id = 1 order by project_id;
I discovered that there was one more authorization in project_authorizations that didn't exist through the above query and it was for this project: 695082
I see that the "creator_id" of this project is '1', but there exist other projects which have a create_id of '1' that don't have an entry in project_authorizations for user_id = 1 (eg: project 18533), so I assume that creator doesn't actually imply access.
Is there something the above query is missing regarding access, or..?
There are also cases where the authorization_level doesn't seem to match what's in project_authorizations. I expect those issues are solvable, but thought it would make sense to first make sure we are at least matching up on the set of projects for which a given user should have access and then we can work on how to make sure that the access level is correct. In particular, I'm wondering if the "group_level" should override access through the "namespace"? I would have thought that the minimum of all accesses is the "correct" access level for a given user/project, but that doesn't appear to match to what is currently in project_authorizations in all cases.
ERROR: null value in column "approvals_before_merge" violates not-null constraint
ERROR: statement timeout
while updating tuple (191,93) in relation "tags"
ERROR: canceling autovacuum task
gitlabhq_production.public.notes
gitlabhq_production.public.issues
Modifying the postgresql.conf as discussed earlier to include more information (in particular, log_lock_waits and the log_line_prefix, though also log_checkpoints and the other items) would help get a better picture regarding what's happening to cause some of these longer running queries and why things like the autovacuum runs against certain tables are sometimes being canceled.
@sfrost Based on your query I did some tinkering, and I ended up with the following:
WITH RECURSIVE children(id, parent_id, name) AS (
    SELECT id, parent_id, name
    FROM namespaces
    WHERE path = 'path-of-the-top-level-group'
  UNION ALL
    SELECT ns.id, ns.parent_id, ns.name
    FROM children AS ch, namespaces AS ns
    WHERE ns.parent_id = ch.id
)
SELECT *
FROM children;
This will get a top level namespace along with all of its child namespaces recursively. It doesn't perform too badly either:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on children  (cost=3603.79..3619.21 rows=771 width=524) (actual time=0.034..0.085 rows=4 loops=1)
   CTE children
     ->  Recursive Union  (cost=0.43..3603.79 rows=771 width=18) (actual time=0.032..0.081 rows=4 loops=1)
           ->  Index Scan using index_namespaces_on_path on namespaces  (cost=0.43..8.45 rows=1 width=18) (actual time=0.029..0.029 rows=1 loops=1)
                 Index Cond: ((path)::text = 'group-with-many-sub-groups'::text)
           ->  Nested Loop  (cost=0.43..357.99 rows=77 width=18) (actual time=0.012..0.014 rows=1 loops=3)
                 ->  WorkTable Scan on children ch  (cost=0.00..0.20 rows=10 width=4) (actual time=0.000..0.000 rows=1 loops=3)
                 ->  Index Scan using index_namespaces_on_parent_id_and_id on namespaces ns  (cost=0.43..35.70 rows=8 width=18) (actual time=0.008..0.009 rows=1 loops=4)
                       Index Cond: (parent_id = ch.id)
 Planning time: 0.227 ms
 Execution time: 0.132 ms
I'll see if I can adjust the current sub groups queries to use a similar pattern.
After some hacking I ended up with the following (not too pretty looking) query:
WITH RECURSIVE namespace_ids(id, access_level) AS (
    -- First we get all the namespaces that a user has direct access to.
    SELECT namespaces.id, members.access_level
    FROM namespaces
    INNER JOIN members ON members.source_id = namespaces.id
    WHERE members.source_type = 'Namespace'
      AND namespaces.type = 'Group'
      AND members.user_id = 1
      AND members.requested_at IS NULL
      AND share_with_group_lock = 'f'
      AND deleted_at IS NULL
  UNION
    -- Then we recursively get all their child namespaces.
    SELECT ns.id, members.access_level
    FROM namespace_ids AS ids, namespaces AS ns
    LEFT JOIN members ON members.source_id = ns.id
      AND members.source_type = 'Namespace'
      AND members.source_id = 1
    WHERE ns.parent_id = ids.id
      AND ns.share_with_group_lock = 'f'
      AND ns.deleted_at IS NULL
)
SELECT 1 AS user_id,
       projects.id AS project_id,
       MAX(GREATEST(project_group_links.group_access, namespace_ids.access_level)) AS access_level
FROM namespace_ids
INNER JOIN projects ON projects.namespace_id = namespace_ids.id
LEFT JOIN project_group_links ON project_group_links.group_id = namespace_ids.id
GROUP BY user_id, projects.id;
This spits out the access levels of the projects you have access to either by being a member of a group, by having access to a nested group, or because the project is shared with one of your groups. Having said that, this query is:
Rather ugly because of the MAX + GREATEST combination
Probably wrong as I have done very little testing
Nevertheless, all of the above shows we could in theory use the raw queries directly given that:
WITH RECURSIVE namespaces_cte(id, access_level) AS (
    -- First we get all the namespaces that a user has direct access to.
    SELECT namespaces.id, members.access_level
    FROM namespaces
    INNER JOIN members ON members.source_id = namespaces.id
    WHERE members.source_type = 'Namespace'
      AND namespaces.type = 'Group'
      AND members.user_id = 1
      AND members.requested_at IS NULL
      AND namespaces.share_with_group_lock = 'f'
  UNION
    -- Then we recursively get all the projects of child namespaces.
    SELECT ns.id, members.access_level
    FROM namespaces_cte AS ids, namespaces AS ns
    LEFT JOIN members ON members.source_id = ns.id
      AND members.source_type = 'Namespace'
      AND members.source_id = 1
      AND members.requested_at IS NULL
    WHERE ns.parent_id = ids.id
      AND ns.type = 'Group'
      AND ns.share_with_group_lock = 'f'
)
SELECT 1 AS user_id,
       projects.id AS project_id,
       LEAST(project_group_links.group_access, namespaces_cte.access_level) AS access_level
FROM namespaces_cte
INNER JOIN project_group_links ON project_group_links.group_id = namespaces_cte.id
INNER JOIN projects ON projects.id = project_group_links.project_id
ORDER BY projects.id ASC;
This gets all the project authorizations for the projects shared with any groups you have direct access to or projects for any of their sub-groups. On production this appears to produce the right results, and in a much shorter amount of time. The currently released approach will take around 1.6 seconds to get the data, whereas the above query can do it in about 10 ms.
WITH RECURSIVE namespace_access AS (
    SELECT namespaces.id, access_level
    FROM members
    JOIN namespaces ON (members.source_id = namespaces.id)
    WHERE members.user_id = 1 AND members.source_type = 'Namespace'
  UNION ALL
    SELECT namespaces.id, access_level
    FROM namespaces, namespace_access
    WHERE namespaces.parent_id = namespace_access.id
)
SELECT id AS project_id, min(access_level) AS access_level
FROM (
    SELECT projects.id, access_level
    FROM projects
    JOIN namespace_access ON (projects.namespace_id = namespace_access.id)
  UNION ALL
    SELECT source_id AS id, access_level
    FROM members
    WHERE members.user_id = 1 AND members.source_type = 'Project'
  UNION ALL
    SELECT project_id, group_access AS access_level
    FROM project_group_links
    JOIN namespace_access ON (project_group_links.group_id = namespace_access.id)
) AS all_authorizations
GROUP BY id
ORDER BY id;
@sfrost So with some further tinkering during the call I came up with:
WITH RECURSIVE namespaces_cte(id, access_level) AS (
    -- First we get all the namespaces the user has direct access to.
    SELECT namespaces.id, members.access_level
    FROM namespaces
    INNER JOIN members ON members.source_id = namespaces.id
    WHERE members.source_type = 'Namespace'
      AND namespaces.type = 'Group'
      AND members.user_id = 1
      AND members.requested_at IS NULL
      AND namespaces.share_with_group_lock = 'f'
  UNION
    -- Then we'll get all child namespaces recursively.
    SELECT ns.id, COALESCE(members.access_level, ids.access_level)
    FROM namespaces_cte AS ids, namespaces AS ns
    LEFT JOIN members ON members.source_id = ns.id
      AND members.source_type = 'Namespace'
      AND members.source_id = 1
      AND members.requested_at IS NULL
    WHERE ns.parent_id = ids.id
      AND ns.type = 'Group'
      AND ns.share_with_group_lock = 'f'
)
SELECT project_id, MAX(access_level) AS access_level
FROM (
    SELECT projects.id AS project_id, namespaces_cte.access_level AS access_level
    FROM namespaces_cte
    INNER JOIN projects ON projects.namespace_id = namespaces_cte.id
  UNION
    SELECT project_group_links.project_id AS project_id, project_group_links.group_access AS access_level
    FROM namespaces_cte
    INNER JOIN project_group_links ON project_group_links.group_id = namespaces_cte.id
) authorizations
GROUP BY project_id;
This requires a GROUP BY on the outer query to ensure we get the MAX of the unioned set.
Regarding the current backup and restore approach, my understanding is:
WAL-E is used for backups
Nightly FULL backups are done, ~3.5 hours required to perform the backup
Full backups are kept for 8 days
WAL is also kept for 8 days
Restore testing is performed manually, approximately monthly, and includes restoring a full backup and replaying WAL to a specific point in time. Restore time is quite variable.
The plan is to automate the restore process for rebuilding and testing.
Recommendations moving forward:
Regular, ideally automated, testing of backups is extremely important. Everyone who might be involved should understand the basics of performing a restore, a runbook should exist to facilitate a restore, and utilization of automated restores is very helpful to ensure that the backups are valid and useful.
Restoring backups into staging on a daily or weekly basis is a good approach for testing that a backup is valid.
Additional testing would ideally be done, eg: running "pg_dump" across the restored backup, to identify any obvious table corruption. Note that this does not check for all kinds of corruption and does not test indexes at all.
Using the results of the "pg_dump" and restoring that data into a new, clean, cluster will provide testing of foreign key relationships, when they exist, as well as constraints. With proper foreign keys and constraints, this can provide an excellent test to ensure that the data in the database is consistent and relatively free of corruption.
PostgreSQL does have support for page-level checksums, though to enable them a logical pg_dump/restore would have to be done, which is reasonably expensive. With page-level checksums, some backup technologies (eg: pgBackrest) are able to validate those checksums during a backup and identify any checksum failures. WAL-E does not currently support this.
Query analysis based on the pg_stat_statements_history contents:
Items you already created issues for (I believe):
Query of the form: SELECT "users".* FROM "users" WHERE "users"."id" = 1245 AND ("users"."id" NOT IN (SELECT "approvals"."user_id" FROM "approvals" WHERE "approvals"."merge_request_id" = 1234)) ORDER BY "users"."id" DESC ; is performing a sequential scan across the approvals table.
Adding an index to approvals on merge_request_id should improve the performance of this query.
Users table is very wide and frequently updated, should consider moving frequently updated columns into another, narrower, table.
Missing indexes for foreign key referring columns (general issue)
DELETE FROM "ci_pipelines" WHERE "ci_pipelines"."id" = ? AND "ci_pipelines"."lock_version" = ?
The ci_pipelines table is referred to from the ci_pipelines table itself and from two other tables (ci_builds and merge_request_metrics). The ci_pipelines table and ci_builds table do not have an index on the referring column, leading to this query taking >5s to perform sequential scans. Creating an index in each table on the "auto_canceled_by_id" column should greatly improve performance of this DELETE.
SELECT COUNT(*) FROM ci_runners WHERE active=true AND contacted_at >= NOW() - '60 minutes'::INTERVAL;
The ci_runners table does not have an index on contacted_at, even though that column is regularly used in queries. Adding a partial index on (active = true) over contacted_at should improve performance of this query.
Possible index on web_hooks.type would be useful for query: SELECT "web_hooks".* FROM "web_hooks" WHERE "web_hooks"."type" IN (?) ORDER BY "web_hooks"."id" DESC. Removing the ORDER BY may also improve performance, if it's not actually needed.
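For reference, the index additions suggested above might look roughly like the following (a sketch; the index names are made up, and CREATE INDEX CONCURRENTLY is used to avoid blocking writes, which means the statements must be run outside a transaction block):
-- Sketch only; index names are hypothetical.
CREATE INDEX CONCURRENTLY index_approvals_on_merge_request_id
    ON approvals (merge_request_id);
CREATE INDEX CONCURRENTLY index_ci_pipelines_on_auto_canceled_by_id
    ON ci_pipelines (auto_canceled_by_id);
CREATE INDEX CONCURRENTLY index_ci_builds_on_auto_canceled_by_id
    ON ci_builds (auto_canceled_by_id);
CREATE INDEX CONCURRENTLY index_ci_runners_on_contacted_at
    ON ci_runners (contacted_at) WHERE active = true;
CREATE INDEX CONCURRENTLY index_web_hooks_on_type
    ON web_hooks (type);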
Additional queries which likely should have issues also created for them:
SELECT "ci_builds".* FROM "ci_builds" INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id" AND "projects"."pending_delete" = 'f' WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."status" = 'running' AND (ci_builds.updated_at < '2017-04-20 16:00:01.780025') ORDER BY "ci_builds"."id" ASC LIMIT 50
;
There is no index on "updated_at". The index on ci_builds (id) is used, but it has to filter out essentially the entire table, making the index scan not performant. Having an index on updated_at should greatly improve the performance of this query. If the values for "type" and "status" are always "Ci::Build" and "running", then a partial index over updated_at for those exact values would be even more efficient as those values appear to be a small portion of the table.
An alternative approach would be to use a CTE to pull out just the records which are Ci::Build and running, then filter and join afterwards, along these lines:
with build_running as (select * from ci_builds WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."status" = 'running') select * from build_running where updated_at < '2017-04-20 16:00:01.780025';
The above query takes ~65ms, instead of 30+s.
This query takes on the order of 10s: UPDATE "ci_builds" SET "user_id" = ? WHERE "ci_builds"."type" IN (?) AND "ci_builds"."user_id" = ?
The above query would almost certainly benefit from an index on "user_id" in ci_builds. It is also not entirely clear to me why the user_id field is being changed, but if that's the correct business logic then there should be an index to support the query and improve its performance.
This query takes on the order of 3.5s: SELECT "lfs_objects".* FROM "lfs_objects" LEFT JOIN lfs_objects_projects ON lfs_objects_projects.lfs_object_id = lfs_objects.id WHERE "lfs_objects_projects"."id" IS NULL
An index on lfs_objects_projects (lfs_object_id) would allow an in-order index traversal of the lfs_objects (id) index and the proposed new index to feed into a Merge Left Join, likely improving the performance of this query. What is not entirely clear is the intent of this query as it appears to generally return zero rows- should there be an FK relationship here instead, removing the need to validate that no rows are missing?
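Similarly, sketches of the indexes discussed for these queries (names made up; the partial index assumes the type/status values really are fixed as described above):
-- Sketch only; index names are hypothetical.
CREATE INDEX CONCURRENTLY index_ci_builds_on_updated_at_running
    ON ci_builds (updated_at)
    WHERE type = 'Ci::Build' AND status = 'running';
CREATE INDEX CONCURRENTLY index_ci_builds_on_user_id
    ON ci_builds (user_id);
CREATE INDEX CONCURRENTLY index_lfs_objects_projects_on_lfs_object_id
    ON lfs_objects_projects (lfs_object_id);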
I came across another interesting query which looks something like this:
SELECT "events"."id" AS t0_r0, "events"."target_type" AS t0_r1, "events"."target_id" AS t0_r2, "events"."title" AS t0_r3, "events"."data" AS t0_r4, "events"."project_id" AS t0_r5, "events"."created_at" AS t0_r6, "events"."updated_at" AS t0_r7, "events"."action" AS t0_r8, "events"."author_id" AS t0_r9, "projects"."id" AS t1_r0, "projects"."name" AS t1_r1, "projects"."path" AS t1_r2, "projects"."description" AS t1_r3,...FROM"events"LEFT OUTER JOIN "projects" ON "projects"."id" = "events"."project_id" AND "projects"."pending_delete" = 'f'LEFT OUTER JOIN "routes" ON "routes"."source_id" = "projects"."id" AND "routes"."source_type" = 'Project'LEFT OUTER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" AND "namespaces"."deleted_at" IS NULLLEFT OUTER JOIN "users" ON "users"."id" = "events"."author_id"WHERE("events"."author_id" IS NOT NULL)AND "events"."author_id" = 423915AND "projects"."pending_delete" = 'f'AND (projects.id IN (SELECT "projects"."id" FROM "projects" INNER JOIN "project_authorizations" ON "projects"."id" = "project_authorizations"."project_id" WHERE "projects"."pending_delete" = 'f' AND "project_authorizations"."user_id" = 758045 UNION SELECT "projects"."id" FROM "projects" WHERE "projects"."visibility_level" IN (20, 10)))ORDER BY "events"."id" DESCLIMIT 20;
Over the period of time I was looking at, this had an average run time of 1.2s.
It seems to be a rather... convoluted query, however. In particular, the IN clause and sub-select are curious as a list is built up of all projects which have a visibility level of 20 or 10, plus a list of all projects the user has access to via project_authorizations, as I understand it, but with this filter (and the filter on projects.pending_delete), the "left join" against projects is effectively an inner join (since any records in events not matching in projects will get filtered because projects.pending_delete would be NULL).
Here is an alternative query which I believe returns the same results in only about 100ms:
SELECT "events"."id" AS t0_r0, "events"."target_type" AS t0_r1, "events"."target_id" AS t0_r2, "events"."title" AS t0_r3, "events"."data" AS t0_r4, "events"."project_id" AS t0_r5, "events"."created_at" AS t0_r6, "events"."updated_at" AS t0_r7, "events"."action" AS t0_r8, "events"."author_id" AS t0_r9, "projects"."id" AS t1_r0, "projects"."name" AS t1_r1, "projects"."path" AS t1_r2, "projects"."description" AS t1_r3,...FROM "events"JOIN "projects" ON "projects"."id" = "events"."project_id" AND "projects"."pending_delete" = 'f'LEFT OUTER JOIN "routes" ON "routes"."source_id" = "projects"."id" AND "routes"."source_type" = 'Project'LEFT OUTER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" AND "namespaces"."deleted_at" IS NULLLEFT OUTER JOIN "users" ON "users"."id" = "events"."author_id"WHERE "events"."author_id" = 423915AND (projects.visibility_level = 10 or projects.visibility_level = 20 or exists (select 1 from project_authorizations where projects.id = project_authorizations.project_id AND project_authorizations.user_id = 758045))ORDER BY "events"."id" DESCLIMIT 20;
Of course, we should discuss and double-check that this is correct.
As this seems to be a reasonably popular query (run over 1000 times in the sample I was reviewing), and the current query results in a sequential scan across the projects table, I'm suspicious that this may be one of the queries contributing to the high number of sequential scans across the projects table which is seen in the monitoring.
Restoring backups into staging on a daily or weekly basis is a good approach
for testing that a backup is valid.
If I recall this was something we wanted to do, but I'm not sure what the status
of this is. Perhaps @northrup or @pcarranza can clarify.
Regarding the queries, for most I did create issues. For some of the ones we
didn't discuss during the call I created the following issues:
SELECT "ci_builds".* FROM "ci_builds" INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id" AND "projects"."pending_delete" = 'f' WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."status" = 'running' AND (ci_builds.updated_at < '2017-04-20 16:00:01.780025') ORDER BY "ci_builds"."id" ASC LIMIT 50
There is no index on "updated_at". The index on ci_builds (id) is used, but it has to filter out essentially the entire table, making the index scan not performant. Having an index on updated_at should greatly improve the performance of this query. If the values for "type" and "status" are always "Ci::Build" and "running", then a partial index over updated_at for those exact values would be even more efficient as those values appear to be a small portion of the table.
This query takes on the order of 3.5s: SELECT "lfs_objects".* FROM "lfs_objects" LEFT JOIN lfs_objects_projects ON lfs_objects_projects.lfs_object_id = lfs_objects.id WHERE "lfs_objects_projects"."id" IS NULL
I'm not sure what this query is used for either, but I did get some data
returned when I ran it just now:
Over the period of time I was looking at, this had an average run time of 1.2s.
It seems to be a rather... convoluted query, however. In particular, the IN
clause and sub-select are curious as a list is built up of all projects which
have a visibility level of 20 or 10, plus a list of all projects the user has
access to via project_authorizations, as I understand it, but with this filter
(and the filter on projects.pending_delete), the "left join" against projects is
effectively an inner join (since any records in events not matching in projects
will get filtered because projects.pending_delete would be NULL).
Projects with a visibility of 20 are public projects, with 10 being internal
ones (= only visible to logged in users) if I'm not mistaken. This query in
particular is probably used for some dashboard showing recent events of projects
one has access to. This is probably from the
https://gitlab.com/dashboard/activity page, or something similar.
The number of JOINs certainly seems like massive overkill, but they may be
needed by the application to work around N+1 query problems.
The IN clause is a result of how our ORM works. When you pass a query/relation
object to the ORM it will use an IN. For example:
some_table.where(some_column: some_other_query)
Would produce something along the lines of:
SELECT *
FROM some_table
WHERE some_column IN ( some_other_query )
@yorickpeterse Regarding the queries generated with some_table.where(some_column: some_other_query) I suspect that many of those may turn into slow queries, as the system grows, and it might make sense to start trying to discourage that practice to get ahead of it.
Requested during the status call today, for @sfrost and @yorickpeterse: please add links to the body of the issue when updating their status. Link to the comment (for discussions) or to issues. This makes it easier to track and to see which issues are closed.
yorickpeterse-staging marked the checklist item Implementation of pg_bouncer and other improvements will impact capacity planning and scalability as completed
Review system statistics and identify any areas of concern
This is something of a catch-all, but regarding the configuration changes made, here's where things stand:
The following changes were made, as I understand it, and reviewing the log files as a result would help to identify if there are any areas of concern from these pieces:
log_min_duration_statement- Lowered to 1000 from 2000. This is useful, but less interesting since we've been able to work off of the information provided through pg_stat_statements.
log_checkpoints- Seeing the information in the log files with regard to checkpoints would be very helpful to see how much time is spent in the various parts of checkpointing as well as seeing how large each checkpoint is. There have been cases where the number of checkpoints has gone up quite a bit, based on grafana data, and being able to see what happened in those cases would be very helpful.
log_temp_files- Set to zero, meaning that any queries which create temp files will be logged. This could prove interesting as it may indicate cases where work_mem is not set as high as it should be and it can also show queries which are being slow due to working with temp files instead of other reasons.
Changes still pending include:
log_lock_waits- One of the ongoing areas of concern is the number of blocked queries being seen in grafana. Once log_lock_waits is available and enabled (issue https://gitlab.com/gitlab-com/infrastructure/issues/1556) then we'll be able to see in the PostgreSQL log files any time a process is blocked for more than 1s.
track_io_timing- Another area of concern from the grafana data is the IO timing on the primary. We understand that to be an issue with how it was set up in the existing environment, but having track_io_timing (issue https://gitlab.com/gitlab-com/infrastructure/issues/1558) would give us insight into how that is impacting query performance through pg_stat_statements.
tracking 'idle in transaction'- Lastly, again from grafana, the number of "idle in transaction" backend processes at any given time is concerning. Tracking how long processes are idle in transaction for (issue https://gitlab.com/gitlab-com/infrastructure/issues/1588) and reporting on ones which are idle in transaction for a long time would help with investigating where those are coming from and hopefully they can be reduced. We saw one particular case where a long 'find' (as I recall) was being run across the filesystem while a transaction was held open in the database.
Review replica stats and log files
For this item, ideally, we would enable the same options listed above on the replicas as well to get more information into their log files for analysis too. However, the much more interesting thing to do would be to collect the pg_stat_statements information on the replicas and periodically reset those stats. We currently have the pg_stat_statements history from the primary which has proven very useful but there may be queries which are slow that are only run on the replicas and collecting this information would allow us to see those. This is a bit complicated since replicas can't be written to directly. One approach would be to simply connect to the replica, COPY the pg_stat_statements data to a file, reset the stats, and then connect to the primary to load that hour's data. Another approach would be to use a foreign data wrapper and have a process which runs on the primary periodically to query the data through the FDW and store it into a table on the primary, though something would still need to connect to the replica to reset the stats.
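A minimal sketch of the first approach, following the same psql/COPY convention as the earlier collection commands (the host placeholders, file name, and the exact layout of the existing pg_stat_statements_history table are assumptions):
psql -h ##REPLICA## -d ##DATABASE## -c "COPY (SELECT now() AS captured_at, * FROM pg_stat_statements) TO STDOUT WITH CSV" > pss_replica.csv
psql -h ##REPLICA## -d ##DATABASE## -c "SELECT pg_stat_statements_reset();"
psql -h ##PRIMARY## -d ##DATABASE## -c "\copy pg_stat_statements_history FROM 'pss_replica.csv' WITH CSV"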
Review autovacuum runs
Yes, if you're able to provide a log file (a complete one would be fine and would allow me to review the other items mentioned above- no need to filter it, I can do that) that would be great.
Review shared buffer utilization
With pg_buffercache enabled we're able to see how the cache is being used, which is extremely helpful, but we've identified a number of cases where poor query plans are resulting in large portions of the cache being used for certain tables. Once those queries have been changed to minimize the amount of data they look at, this will become much more interesting, as we would be able to determine how much of the cache is active and whether it might be possible to reduce the size of shared_buffers. Right now, a great deal of the cache is used, very frequently, due to the poor query plans (though, to be fair, table design plays a part here as well). Being able to periodically extract the shared buffer cache information in a manner similar to pg_stat_statements, and to collect that information on the replicas as well, would also be informative, though the 9.6 version of pg_buffercache has the downside that it may cause a small delay to ongoing operations, so it might be best to collect this less often and perhaps during slower periods.
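For reference, the periodic extract could be built around the standard pg_buffercache breakdown of which relations occupy the cache (this assumes the default 8kB block size):

```sql
-- Top relations by number of shared buffers they currently occupy.
SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS buffered
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 20;
```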
Yes, if you're able to provide a log file (a complete one would be fine and would allow me to review the other items mentioned above- no need to filter it, I can do that) that would be great.
This is the current log file of the primary: removed
The log file starts at 9:55 and ends at 11:17, or 82 minutes.
What this shows is that a VACUUM is executed almost every minute against the project_mirror_data table. Reviewing the information from pg_stat_statements for this table, we see that it's updated ~10x every second. This table is reasonably small and narrow, however, and therefore the VACUUMs are quick- taking less than a second to run. The entire table is also in shared buffers all the time, which helps make the vacuum fast too.
The same is generally true for the ci_runners table- relatively small table, entirely in shared buffers, vacuum finishes in 5-10 seconds.
The next is ci_builds, which is a much larger table. The frequent vacuums are able to complete in only about 80 seconds though, likely mostly due to much of the table being in shared buffers and the rest being in the filesystem cache. The autovacuum costs on this table were adjusted to allow higher rates, meaning this table is processed at ~100MB/s.
merge_request_diffs is a very large table but almost all of the space from it is in the TOAST table, the main table being only ~5GB. Further, most of it isn't changed, and so a VACUUM only has to touch a fraction of the table (around 600MB), thus the vacuum only takes ~15s.
project_statistics is another relatively small table, a VACUUM of which finishes in ~10s.
merge_requests is where things start to get particularly interesting- it's large and enough of it is being changed that a lot is VACUUM'd- about 2G with each run. Combined with the low autovacuum cost settings, the table ends up taking around 10 minutes to be VACUUM'd. Looking at how it was VACUUM'd 7 times over 82 minutes, that means that an autovacuum worker (out of the 6 configured) was essentially always running against this table.
ci_pipelines is also pretty large and changing, but VACUUM is able to process through it in about 3 minutes on average.
namespace_statistics is small and VACUUM'd quickly.
projects is another large table which is changing a lot. About 1G of the table is being looked at with each VACUUM run and the runs take 10-15 minutes each. Given that there were 5 runs over 82 minutes, we can conclude that this table also ties up an autovacuum worker essentially full-time.
remote_mirrors is small and VACUUM'd quickly.
todos is reasonably small and only takes a few minutes to vacuum each time through.
The next big table is events, where a VACUUM appears to go over ~10G of the table each time through, requiring close to half an hour for each run. There are only 2 runs included in the report, but it seems reasonable to conclude that this table also chews up an autovacuum worker near to full-time.
issues is another large table, requiring ~40 minutes to VACUUM, and likely chews up an entire autovacuum worker. Only about 1.7G of this table has to be checked over, but that appears to result in a very large number of cache misses and quite a few pages being dirtied.
notes is also very large; it takes 30-35 minutes to VACUUM and ties up another autovacuum worker.
The rest of the tables look to be reasonably small or not updated enough to require a lot of VACUUM work to occur on them.
I'll follow-up to this comment with another explaining some concerns here.
There are two main points to make from the above insight into the VACUUM'ing details provided here:
There are a number of wide tables being UPDATE'd quite a bit with only a fraction of the row changing. The "projects" table is a great example of this- it's updated something like 10 times a second for a certain subset of columns, perhaps 10 at most (and they are not updated at the same time, but independently!), while the table has 56 columns. An UPDATE will cause the entire row to be rewritten as a new row version, resulting in a large amount of churn to the table with each of the individual UPDATE statements being run, and dead tuples which then have to be VACUUM'd. As mentioned elsewhere, but perhaps illustrated best here, moving those frequently updated columns to narrower tables would greatly reduce the amount of work required for executing the UPDATE and for performing the cleanup VACUUM afterwards.
There are only 6 autovacuum workers enabled on this cluster and nearly all of them appear to be in use all the time. While bloat appears to be managed reasonably well and the global database frozen xid isn't far enough behind to be a worry currently, it strikes me that we aren't too far from the point where the autovacuum workers won't be able to keep up with the current configuration. That configuration could be changed in a couple of different ways to address this (a configuration sketch follows the list below):
Autovacuum could be allowed to proceed faster by relaxing the cost-based throttling. Today, this is unlikely to induce much additional load on the overall system since, basically, the entire database is in memory somewhere, but that may not remain the case down the road.
Additional autovacuum workers could be configured, allowing more to be run concurrently to hopefully keep up with the change rates. Again, given that the database fits in cache today, this isn't likely to cause issues for the overall system, particularly given that the overall CPU load is much lower thanks to the other optimizations which have been done.
In an environment where random access is more expensive than sequential access, which is typically the case in databases that are not fully cached in memory, having fewer workers which are doing primarily sequential access (which is typically what a VACUUM is doing) would be preferable.
The overall write load could be reduced by moving the heavily updated columns to different tables and updating all the columns at once, as much as possible, while also eliminating any unnecessary or useless work being done, thus reducing the amount of work VACUUM has to do.
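A configuration sketch for the first two options; the values here are purely illustrative, not recommendations from this analysis:

```sql
-- Relax autovacuum throttling (takes effect on reload):
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;   -- defaults to vacuum_cost_limit (200)
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '10ms'; -- default is 20ms
-- Add workers (this one requires a restart to take effect):
ALTER SYSTEM SET autovacuum_max_workers = 10;           -- currently 6
SELECT pg_reload_conf();

-- Costing can also be relaxed per table, as was already done for ci_builds
-- (the value here is again illustrative):
ALTER TABLE ci_builds SET (autovacuum_vacuum_cost_limit = 2000);
```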
Lastly, an issue which came out of this analysis is that it's possible the projects table and the project_mirror_data table are being updated extensively in cases where no useful progress is being made. Specifically, projects which have not had a successful mirror in over 6 months appear to still be tried over and over again, even in cases which look like they really should be a "hard" failure.
The system, overall, looks like it could do with some kind of cut-off where, if mirroring has failed for a certain amount of time (perhaps a week?), a notice is sent out saying that the mirror has hard-failed and will not be retried until some user action is taken. Perhaps certain failure cases could be considered hard-failed immediately (such as an authorization failure, the remote repository not being found, or the default branch having diverged from its upstream counterpart).
Eliminating all of these jobs which look like they may be fruitless could reduce the overall system load significantly, though I've not done a formal analysis of the potential impact.
yorickpeterse-staging marked the checklist item Review autovacuum settings and provide recommendations as completed
yorickpeterse-staging marked the checklist item Review shared buffer utilization through pg_buffercache (requires installation) to determine memory utilization and working set size as completed
Review possibility to enable full logging, to allow collection of all queries to enable replaying on staging
I wasn't able to find an issue for this yet, but it's also notoriously difficult to predict what will happen when full logging is enabled in an environment like this.
One approach to consider here would be to enable statement logging at the beginning of a deployment for a few minutes (prior to actually doing anything) as users may be more accepting if it turns out there are issues due to it.
Note that enabling statement-level logging only requires a "reload" of PostgreSQL; it does not need a restart, and it can be switched off again just as quickly.
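A minimal sketch of toggling full statement logging for such a window, assuming it is done via ALTER SYSTEM rather than by editing postgresql.conf directly:

```sql
-- Turn full statement logging on for a few minutes...
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();

-- ...then revert to whatever was configured before.
ALTER SYSTEM RESET log_statement;
SELECT pg_reload_conf();
```

The main unknown is log volume, so it would be worth watching disk space on the log partition while this is enabled.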
Provide recommendations regarding changes to shared_buffers and system memory amounts
The upshot of this is that the events table really needs to be addressed; once that is done, a further analysis of what's in shared_buffers should follow. The effort to change the events table is being tracked in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/12463, I believe.
Recommendations regarding metrics for scalability and capacity planning
Unfortunately, the grafana graphs no longer appear to go back more than a month, so it's not easy to gauge the long-term growth rates in transactions, CPU load, and IO load. During the time I've been involved, though, my general feeling has been that we've reduced all of the above through the query optimization efforts and will continue to do so with the events rework. There looks to be a very long path forward in continuing to improve queries and optimize data structures, which strikes me as likely to out-pace load increases based on recent history. Of course, it's always possible there will be a "windfall" event which drives up activity significantly, though the current systems appear reasonably well positioned to handle that, and increased read load can be handled by spinning up additional read replicas. Addressing a serious increase in write load could be a concern, but we'll know much better how much of one when the changes to db1's I/O system are made (to match db3/db4) and the events table has been cleaned up to no longer include very large writes of data that's mostly/completely unused.
Addressing a serious increase in write load could be a concern, but we'll know much better how much of one when the changes to db1's I/O system are made (to match db3/db4) and the events table has been cleaned up to no longer include very large writes of data that's mostly/completely unused.
Imagine the write load is 10x what it is today. What would you foresee / propose?
@ernstvn By and large, I think the existing systems could handle a 10x increase in write load, but there wouldn't be as much room for things like the current events migration, possibly leading to having to throttle such changes down. That said, it looks like max_wal_size still needs to be adjusted further upward to avoid forced checkpoints, and checkpoint_timeout could be raised from the current default of 5m to reduce the number of FPI (full-page image) writes happening in the WAL stream.
To address that concern, I continue to recommend improving the data design to reduce the amount of writing which is currently happening at the disk level. Quite a few very wide tables (users, projects) are being constantly updated just to change a subset of fields, requiring a great deal more writing than is necessary. Moving the frequently updated fields to narrower tables would be a great step towards reducing the write load and increasing the scalability of the system when it comes to handling more write traffic.
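As a purely hypothetical sketch of that pattern (the table and column names below are invented for illustration and are not the actual schema):

```sql
-- Hypothetical narrow "hot" table so that frequent UPDATEs rewrite a small row
-- version instead of the full, wide projects row.
CREATE TABLE project_hot_columns (
    project_id            integer PRIMARY KEY REFERENCES projects (id),
    last_activity_at      timestamptz,
    last_repository_check timestamptz
);

-- The frequent UPDATEs then touch only this table:
UPDATE project_hot_columns
   SET last_activity_at = now()
 WHERE project_id = 42;
```

Each such UPDATE then produces a dead tuple of a few dozen bytes rather than a copy of a 50+ column row, which reduces both the WAL volume and the amount of cleanup work autovacuum has to do afterwards.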
At 100x, I doubt the current system and data design would be able to handle the load. The minor tweaks discussed above regarding max_wal_size and checkpoint_timeout would also help, but I doubt they would be enough to address a 100x increase in write throughput. Improving the data design and system efficiency by reworking the tables would help a great deal and could be sufficient to handle a 100x increase, but the question is whether the amount of WAL being generated would be more than the system can handle. Moving off of a virtualized environment and onto physical hardware with SSDs would certainly help a great deal. While db3 appears to be better about managing the write load, the grafana graphs still have concerning spikes of latency that do not appear to correlate with an increased number of checkpoints or other indicators of actual increased write load.
If a 10x or 100x increase in write load is likely in the near future, I would strongly recommend expending more effort to migrate frequently updated columns out of wide tables and into their own tables where most of the columns are updated together in one statement whenever updates are done. The tweaks to max_wal_size and checkpoint_timeout should also be considered, but increasing checkpoint_timeout is likely to lead to longer recovery times in the event of a system crash or unclean reboot, so that trade-off needs to be weighed.
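A sketch of those two tweaks with illustrative values only, since no specific numbers have been settled on in this discussion:

```sql
-- Both settings take effect on a reload.
ALTER SYSTEM SET max_wal_size = '16GB';        -- raise to avoid forced checkpoints
ALTER SYSTEM SET checkpoint_timeout = '15min'; -- fewer checkpoints and FPI writes,
                                               -- at the cost of longer crash recovery
SELECT pg_reload_conf();
```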
There is some remaining work such as enabling certain settings (mostly for monitoring purposes), but there are separate issues for those. As a result I'll close this one and we'll open it up later today once we have confirmed all sensitive data has been removed.