slow queries identified, representative data sample and/or system access
would allow testing alternative query structures and assist in identifying
possible changes in indexing strategy, see
https://gitlab.com/gitlab-com/infrastructure/issues/1786)
Staging with realistic load by running tests
Review possibility to enable full logging, to allow collection of all queries
to enable replaying on staging
Capacity planning
Implementation of pg_bouncer and other improvements will impact capacity planning and scalability
Review shared buffer utilization through pg_buffercache (requires installation) to determine memory utilization and working set size
Provide recommendations regarding changes to shared_buffers and system memory amounts
Recommendations regarding metrics for scalability and capacity planning
@pcarranza This was discussed via Email. Crunchy will be our consulting partner for PostgreSQL. Quoting some of the Emails:
Stephen Frost <stephen@crunchydata.com> wrote:

I understand we are all set up to get started whenever you are ready to begin. Based on the work-plan, we will be starting off with a health-check of the existing system. Typically, we would do that during an on-site visit where we could put hands-on-keyboard to review how the existing system is performing, review log files, the existing configuration, and more, but my understanding is that everything will be done remotely.

If that understanding is correct, we will need to work out how to perform the same activities remotely. For starters, at least, we could simply use email or similar for sharing configuration files and other information, though I'd also like to review the PostgreSQL log files and those might be quite large. Another option would be to set me up with access to a system where you could place the log files and other information, or I could set up a location where you could place the files. I am guessing they could have sensitive information and naturally we would want to make sure that anything we do is done in a secure and safe manner regarding your data and following your policies.

I'm also happy to simply pull down gitlab and set up a system here and start looking at it, but that really isn't the same as looking at a running system with load on it.

Please let me know your thoughts on how you'd like to proceed.

Thanks!
Ernst van Nierop wrote:

Hi Stephen,

I'll let Yorick respond to set up an actual time for a video call for the kick-off, hopefully early next week.

Regarding how to get set up and how we can share information, those are great questions ☺ For starters, can you please let us know what your handle is on GitLab.com? (If you don't have an account there yet, please make one.) We'd like you to be able to be pinged and to comment on the public issues in the infrastructure issue tracker (https://gitlab.com/gitlab-com/infrastructure/issues, try filtering by the database label). That won't let you see confidential issues just yet though. Will it just be you Stephen, or should we add others as well?

If you _make_ a confidential issue yourself, for example calling it "Health Check", then you will continue to be able to see it, so that may be a good way to keep this open to the company + Crunchy, but closed to the rest of the world. We can then post logs there. We should avoid using email as our primary means of communication, but rather work through the issue tracker for written communication, and augment that with (recorded) video calls when / where necessary.

Best
@sfrost Per the email discussion, I'm fine with doing the kick-off on April 3rd since you're out next week. In the meantime, is there anything I can prepare for you to make the kick-off smoother?
In some instances it might make sense to change the log settings ahead of the kick-off to collect additional information, but in this environment (as with other large-scale, extremely busy systems) we need to take a particularly careful approach to any changes to ensure that we don't destabilize the environment.
As noted above, one of the first things to look at is the configuration (in particular, postgresql.conf, but we can also review pg_hba/pg_ident if you are interested in reviewing connectivity/security considerations) and the PostgreSQL log files. Those are not required for the kick-off, but will definitely be of interest early on.
We'll have a meeting on Tuesday and Thursday at 17:00 UTC
Between these meetings we'll work async. Due to timezone differences, any data/output/logs/etc. is best requested in bulk
If possible we should get @sfrost some form of access to the database server so he can run commands himself (ideally without having to give him sudo access), instead of me being his proxy. cc @pcarranza
@sfrost will review the above configuration options, ask for the output of commands, etc to get a better understanding of how things are running
After that we'll decide what to work on next
Takeaways from today's meeting (the ones I can remember):
Definitely keep moving forward with pgbouncer
@sfrost suggested running pgbouncer on the application servers (meaning GitLab connects to localhost), given there aren't too many application servers. This could be interesting, but may not work because the web/background processes are quite resource heavy (CPU in particular)
Foreign keys are a good thing, and @sfrost was glad we're using these (at least more than we have in the past)
@sfrost mentioned we have a lot of indexes on some tables, and could probably benefit from partial indexes in various places. @yorickpeterse remarked that we've been trying to reduce the number of indexes, and prevent developers from adding new ones unless absolutely necessary. We'll need to start enforcing this better at some point, though @yorickpeterse isn't quite sure yet how
PostgreSQL's full-text search might be better than trigram indexes, or even ElasticSearch. @yorickpeterse remarked that unfortunately we cannot drop ElasticSearch at this point
Database load balancing setup makes sense, definitely better than synchronous replication (if I understood @sfrost correctly)
Monitoring SQL transaction timings (not just slow ones) would be a really good addition to the monitoring system
For PostgreSQL @sfrost mentioned that if you have 32 cores ideally you try not to have more than 50 concurrent active connections; the closer to the core count, the better
@yorickpeterse mentioned the goal of reducing PostgreSQL's max_connections from 3000 to 200-300; @sfrost mentioned we should probably go even lower
The hot_standby_feedback issue we ran into (https://gitlab.com/gitlab-com/infrastructure/issues/1501) is probably the result of a long running transaction, and not hot_standby_feedback itself. We need to test this to be certain.
@sfrost mentioned that as available RAM goes down on a DB server, IO load may increase due to the DB having to load data from disk (if there's not enough RAM). This means we have to carefully scale down hosts in small steps, instead of immediately picking the smallest instance we can cram things into
Another addition would be to perhaps give @sfrost access to Slack (e.g. a single-channel guest for the #database channel)? This could be useful to ask questions and get more immediate feedback.
Regarding load balancing vs. sync replication- those really serve different use-cases and, due to that, one is not 'better' or 'worse' in a general sense. In particular, if you absolutely must not lose any transactions, then you must use sync replication. To avoid having a single point of failure, multiple sync replicas must then be used. The load balancing approach described is good to avoid having the delay on commits which will naturally happen when sync replication is used but that is only a valid approach if sync replication is not required for multi-system commit durability.
PostgreSQL 9.6 provides an alternative mechanism for ensuring that any queries against replicas are current with the 'remote_apply' option. This can be used in cases where multi-system commit durability is required and sync replication is used to ensure that replicas are always current by moving the "waiting" from the read/select side to the write/commit side. The 'remote_apply' options makes PostgreSQL wait until the sync replicas have confirmed that they have replayed the COMMIT before allowing the COMMIT to complete on the primary, ensuring that the very next transaction is able to use a replica. This removes the need to complicate the application logic with tracking the current WAL position, as gitlab is currently doing, and ensures that replicas are always able to be used for all read-only transactions. Naturally, this also means that COMMITs associated with transactions that write data will take longer, potentially substantially longer if there is a burst above capacity or other activity which slows down ongoing replication.
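For illustration only (these are not recommended values for this environment, and the standby names are placeholders), the postgresql.conf settings involved on a 9.6 primary would look roughly like:
synchronous_standby_names = '2 (replica1, replica2)'   # wait for two synchronous standbys; '*' matches any standby name
synchronous_commit = remote_apply                      # COMMIT waits until the standbys have replayed the WAL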
First- has there been any use of ALTER SYSTEM? If so, it would be
useful to see the postgresql.auto.conf file (from the PG data
directory) as well, to see what has been overridden there.
listen_addresses = '0.0.0.0'
While not a large deal, I would typically recommend using '*' to
indicate all interfaces. This requires a restart and, as with all
of the recommended changes, should be tested in a similar
environment as production before changing as there can be edge cases
where '*' might fail.
max_connections = 3000
Far too high, really. Would be good to reduce, as discussed/planned.
shared_buffers = 112896MB
This is extremely interesting, given the numbers you mentioned on our
kick-off call that the system was only using ~40-60GB. We
need to review the postgresql.conf on the other systems and look at
the actual memory numbers (which provide more details than what is
visible through the monitoring system).
Shared buffers should always be fully allocated and every page read
from the database will pull a page into memory. If these numbers are
accurate (and I suspect they may not be) then the system has only
ever pulled something less than 60GB of pages from the disk since the
last reboot.
temp_buffers = 8MB (default setting)
If any of your processes are using temporary tables, it may make sense
to consider increasing this. This can be done on a per-session basis
if done before any temporary tables are created. In particular, I am
wondering if any of your background processes use temp tables.
work_mem = 16MB
Your queries are relatively simple in the general case and so I'm not
sure that it's worth changing this, but if you have particular queries
which run slowly, depending on what those queries are doing, it may be
helpful to raise this value.
shared_preload_libraries = 'pg_stat_statements'
Very good module to have. Are you currently monitoring the view
provided?
max_wal_size = 2GB
This isn't very much. We can review why checkpoints are happening to
see if it would make sense to adjust it, but just in general this is
a relatively small amount. Consider how much data your system is able
to write across a 5 minute period (probably upwards of 10GB).
checkpoint_completion_target = 0.7
Generally speaking, I would recommend making this 0.9. There's typically
little advantage to having it be less than that.
wal_keep_segments = 512
Generally speaking, I discourage using the server's WAL space for the
archive. If replicas fall behind enough that they have to get data which
the primary is no longer holding then having them pull the data back from
the archive is unlikely to be an issue, unless the archive is particularly
slow for some reason (in which case, I might recommend reconsidering your
archiving approach).
Logging
Note that the following changes may impact performance as they will increase
the amount of logging done on the system. I would recommend testing the
impact in a staging environment with load, if possible.
log_min_duration_statement = 2000
This is a rather high setting for this variable. Lowering this can provide
very useful timing detail about queries that are running for longer than we
would like them to. Of course, this has to be offset against the increase
in log traffic, but it looks like this could at least be lowered to 1000
based on the stats and perhaps even lower.
pg_stat_statements is also helpful, of course, but it doesn't provide timing
information about when a given query took the 'max time' which is recorded,
leaving one to wonder if there was a correlation with another bit
of activity on the system (a spike in queries, a checkpoint happening, etc).
log_checkpoints = off (default)
This should be enabled. Generally speaking, it shouldn't create much log
traffic as it will, ideally, only be 2 log messages every 5 minutes or so.
log_connections/disconnections = off (default)
I would typically recommend enabling this. If the applications are
maintaining an ongoing connection (and not reconnecting for every query)
then this shouldn't create much additional logging traffic, but can be
useful to know when trying to make sense of other log messages.
log_line_prefix = '%t [%p]: [%l-1] '
I generally recommend using:
'%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
or something equivalent which will include %a, %u, %d, and %h, all of
which are missing from your log_line_prefix. This also, conveniently,
matches what pgbadger expects (a very useful log analyzing tool) while
including additional, typically useful, information. If there is
concern about the size increase in the log files, this can be reworked
to eliminate some of the text.
log_lock_waits = off (default)
This is an extremely useful option to enable as it will tell you whenever
a process has been blocked behind another process due to locking, and the
lock has taken longer than (deadlock_timeout) amount of time (defaults to
1 second). This shouldn't increase your logs much, if at all.
log_temp_files = -1 (default)
If temp files are being created by queries, it's useful to know about it.
I generally recommend setting this to '0' unless there is good cause not
to. If there is a lot of temp file usage (which I tend to doubt in this
environment, but it's possible), then this could increase log volume.
track_io_timing (missing)
This postgresql.conf file is missing (at least) this parameter, indicating
that it was, perhaps, copied up from an older version of PG. This
parameter was added in PostgreSQL 9.2. In general, I would recommend
using the postgresql.conf for the most recent version of PostgreSQL and
copying through the settings which are appropriate for that version
instead of just copying up the postgresql.conf as the parameters do change
from release to release.
This parameter is particularly useful in identifying when I/O takes a long
time, but it is necessary to make sure that the time can be gathered with
little overhead. The 'pg_test_timing' tool is included with PostgreSQL
and should be used to ensure that most timing queries are less than one
usec before enabling track_io_timing.
Autovacuum
The settings configured for autovacuum appear to be rather aggressive. Viewing
the log files (thanks to having log_autovacuum_min_duration set to zero) will
help to determine if this is the case. VACUUM, in general, is cheaper on 9.6
than it was previously, but that doesn't necessarily make it a good idea to
spend an excessive amount of resources on it. The delay and cost limits look
like they likely keep autovacuum from sucking up all resources but it'll be
useful to review the logs to see how effective they are.
max_locks_per_transaction = 64 (default)
While I doubt you've had any issues with locks thus far, it's important to
understand how the number of overall locks are calculated. The total
number of locks are calculated by multiplying max_locks_per_transaction and
max_connections. This means that your current max number of locks is quite
large, but as max_connections is reduced, so will the number of locks. Note
that these are relation-level locks, not row-level, so the number of
relations in the system also has to be considered. We will review this by
looking at pg_dump output and can adjust this number if necessary.
pg_stat_statements
There doesn't appear to be any configuration of pg_stat_statements, unless that
has been done through other means (eg: ALTER SYSTEM). In particular, it may be
useful to consider how many distinct statements are used and adjust the value
of pg_stat_statements.max accordingly. I wouldn't generally recommend lowering
it, but if there are more distinct statements than the default max of 5000 then it
may make sense to increase that value. We may be able to determine if we need to
adjust this setting based on the data collected.
The following commands will help us understand how the system is operating
today and what a typical workload looks like. None of these commands is
anticipated to be very expensive to run; they are just collecting
information which should be readily available and reasonably low volume.
All commands are run with 'psql' or 'pg_dump' and assume that the user is
a superuser. The psql output format used is CSV to allow for re-loading of
the data into PG for analysis. Comments about each command are included
in-line. No data is expected to be particularly sensitive, though it is
possible that there are values in queries returned which contain user data.
If this is a concern then the data should be reviewed prior to being shared.
The commands should all work, naturally with the '##DATABASE##' replaced
with the appropriate database name (or removed if your configuration defaults
to the correct database).
Pull information about all databases, as those are not included by pg_dump
psql -d ##DATABASE## -c "COPY (table pg_database) TO STDOUT WITH CSV HEADER;" > pg_database.csv
Pull information about all tablespaces, as those are not included by pg_dump
psql -d ##DATABASE## -c "COPY (table pg_tablespace) TO STDOUT WITH CSV HEADER;" > pg_tablespace.csv
Current configuration information
Get the current settings
psql -d ##DATABASE## -c "COPY (table pg_settings) TO STDOUT WITH CSV HEADER;" > pg_settings.csv
Pull information about how PostgreSQL was built and installed
psql -d ##DATABASE## -c "COPY (table pg_config) TO STDOUT WITH CSV HEADER;" > pg_config.csv
Pull information about what is currently configured in postgresql.conf
psql -d ##DATABASE## -c "COPY (table pg_file_settings) TO STDOUT WITH CSV HEADER;" > pg_file_settings.csv
Statistical information
Pull table statistics
psql -d ##DATABASE## -c "COPY (table pg_stat_all_tables) TO STDOUT WITH CSV HEADER;" > pg_stat_all_tables.csv
Pull block-level table statistics
psql -d ##DATABASE## -c "COPY (table pg_statio_all_tables) TO STDOUT WITH CSV HEADER;" > pg_statio_all_tables.csv
Pull index statistics
psql -d ##DATABASE## -c "COPY (table pg_stat_all_indexes) TO STDOUT WITH CSV HEADER;" > pg_stat_all_indexes.csv
Pull block-level index statistics
psql -d ##DATABASE## -c "COPY (table pg_statio_all_indexes) TO STDOUT WITH CSV HEADER;" > pg_statio_all_indexes.csv
Pull background writer statistics
psql -d ##DATABASE## -c "COPY (table pg_stat_bgwriter) TO STDOUT WITH CSV HEADER;" > pg_stat_bgwriter.csv
Pull archiver statistics
psql -d ##DATABASE## -c "COPY (table pg_stat_archiver) TO STDOUT WITH CSV HEADER;" > pg_stat_archiver.csv
Pull in-progress VACUUM information
psql -d ##DATABASE## -c "COPY (table pg_stat_progress_vacuum) TO STDOUT WITH CSV HEADER;" > pg_stat_progress_vacuum.csv
Pull status of replicas
psql -d ##DATABASE## -c "COPY (table pg_stat_replication) TO STDOUT WITH CSV HEADER;" > pg_stat_replication.csv
Pull the size of all tables, indexes, materialized views, and toast tables
psql -d ##DATABASE## -c "COPY (select relname, nspname, pg_relation_size(pg_class.oid) from pg_class join pg_namespace on relnamespace = pg_namespace.oid where relkind IN ('r', 'i', 'm', 't')) TO STDOUT WITH CSV HEADER;" > pg_class_sizes.csv
Query data
Pull query information from pg_stat_statements (will fail if pg_stat_statements has not been installed)
psql -d ##DATABASE## -c "COPY (table pg_stat_statements) TO STDOUT WITH CSV HEADER;" > pg_stat_statements.csv
Current activity information- ongoing queries and transaction lengths
psql -d ##DATABASE## -c "COPY (table pg_stat_activity) TO STDOUT WITH CSV HEADER;" > pg_stat_activity.csv
Current locks being held
psql -d ##DATABASE## -c "COPY (table pg_locks) TO STDOUT WITH CSV HEADER;" > pg_locks.csv
With all of the above .sql/.csv files created, please review them if you are concerned about
any sensitive data, and once you are comfortable, provide them via this issue or however is
most convenient for you.
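As a hypothetical sketch of re-loading one of these CSVs into PostgreSQL for analysis (the snapshot table name is made up; the same pattern applies to any of the files above):
-- In a scratch/analysis database, create an empty table with the same columns as the view
CREATE TABLE pg_stat_all_tables_snapshot AS SELECT * FROM pg_stat_all_tables WITH NO DATA;
-- Then, from psql, load the collected CSV into it
\copy pg_stat_all_tables_snapshot FROM 'pg_stat_all_tables.csv' WITH CSV HEADER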
PostgreSQL 9.6 provides an alternative mechanism for ensuring that any queries
against replicas are current with the 'remote_apply' option.
I probably should have mentioned this during the kick-off. We looked into
remote_apply as an alternative to full synchronous replication, combined with
setting synchronous_standby_names to 2 (*) (since we have 2 replicas).
Ultimately we decided not to go with this solution. As far as I understand, if
one or more replicas are not responding, every COMMIT using remote_apply will
block until the number of available replicas is equal to or greater than the
number configured in synchronous_standby_names. For our use case (= just
balancing load) this didn't feel like the right approach.
First- has there been any use of ALTER SYSTEM?
Not that I am aware of. Our configuration is usually done in postgresql.conf,
and sometimes using ALTER TABLE (e.g. for table-specific vacuuming settings).
shared_buffers = 112896MB
This setting was based on our old memory consumption (prior to load balancing
and pgbouncer). I anticipate being able to lower this value, but we'll have to
do so carefully.
If any of your processes are using temporary tables, it may make sense to
consider increasing this. This can be done on a per-session basis if done
before any temporary tables are created. In particular, I am wondering if any
of your background processes use temp tables.
Our application does not make explicit use of temporary tables. I do think we
have a few queries that end up materializing data (as in, the query planner ends
up doing this), but I'm not 100% certain.
shared_preload_libraries = 'pg_stat_statements'
Very good module to have. Are you currently monitoring the view provided?
Yes, we use it for our PostgreSQL dashboards.
This isn't very much. We can review why checkpoints are happening to see if
it would make sense to adjust it, but just in general this is a relatively
small amount. Consider how much data your system is able to write across a 5
minute period (probably upwards of 10GB).
Does increasing this value have an impact on the time it takes to replicate data
to secondaries?
Generally speaking, I discourage using the server's WAL space for the archive.
If replicas fall behind enough that they have to get data which the primary is
no longer holding then having them pull the data back from the archive is
unlikely to be an issue, unless the archive is particularly slow for some
reason (in which case, I might recommend reconsidering your archiving
approach).
We archive to S3 and Azure. While these services aren't that slow, I think they
might be slower than just replicating from a primary directly. Having said that,
I haven't tested this so I'm not certain.
The settings configured for autovacuum appear to be rather aggressive. Viewing
the log files (thanks to having log_autovacuum_min_duration set to zero) will
help to determine if this is the case. VACUUM, in general, is cheaper on 9.6
than it was previously, but that doesn't necessarily make it a good idea to
spend an excessive amount of resources on it. The delay and cost limits look
like they likely keep autovacuum from sucking up all resources but it'll be
useful to review the logs to see how effective they are.
We used to have more relaxed settings, but this could easily lead to many (>=
10k) dead tuples in some of the more frequently updated tables. Having said
that, we could probably adjust a few to be more relaxed.
Per our discussion regarding deliverables, please find below where we are currently.
Naturally, we can discuss and adjust the items as needed.
Health Check
Analyze existing configuration and architecture, provide recommendations (ongoing)
Review system statistics and identify any areas of concern (ongoing)
Review schema and provide recommendations (ongoing; data/system access would help identify opportunities for improvement)
Review PostgreSQL log files to identify any issues and recommendations (pending, waiting for log files)
Backup and Restore
Review existing backup methodologies and provide recommendations (pending)
PGBouncer
Review pgbouncer.ini and provide recommendations (done)
Discuss and provide recommendations regarding PGBouncer locations/architecture (done)
Read replicas
Review replica stats and log files (pending)
Determine cause of hot_standby_feedback causing table bloat (done; likely related to issue addressed in 9.6.2)
HA / Failover
Review corosync/pacemaker configuration and provide recommendations (pending)
Assist with understanding of corosync/pacemaker, current cluster status, managing failover (pending)
Monitoring
Review existing monitoring, provide recommendations for additional monitoring (ongoing)
Tuning / Settings
Review postgresql.conf and provide recommendations (done)
Review autovacuum runs (pending log files with autovacuum information)
Review autovacuum settings and provide recommendations (pending)
Application binding to replicas
We discussed this during our calls, but it is unclear if further discussion or review is necessary. (done, unless further analysis requested)
Application improvements
Review slow queries for possible optimization (ongoing; slow queries identified, representative data sample and/or system access would allow testing alternative query structures and assist in identifying possible changes in indexing strategy)
Staging with realistic load by running tests
Review possibility to enable full logging, to allow collection of all queries to enable replaying on staging (pending logging changes and impact analysis)
Capacity planning
Implementation of pg_bouncer and other improvements will impact capacity planning and scalability.
Review shared buffer utilization through pg_buffercache (requires installation) to determine memory utilization and working set size
Provide recommendations regarding changes to shared_buffers and system memory amounts
Review of existing monitoring (based on what is visible to me through the public site) and recommendations for additional monitoring:
Monitor WAL/xlog files which are ready to be archived (count of files in pg_xlog/archive_status which end in ".ready")
Point-in-time recovery requires WAL files. If many WAL files are not yet archived and a failure occurs then the backup will be missing, potentially, a lot of WAL.
WAL files not being archived may be indicative of a problem with the archive_command
Unarchived WAL files may continue to accumulate and result in disk space issues
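A sketch of how that count could be pulled from SQL on 9.6 (pg_ls_dir requires superuser, and the directory is renamed to pg_wal in later releases):
-- Count WAL segments waiting to be archived
SELECT count(*) AS wal_segments_ready_to_archive
FROM pg_ls_dir('pg_xlog/archive_status') AS f(name)
WHERE f.name LIKE '%.ready';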
Monitor database transaction age
In PostgreSQL, we record the oldest transaction id (xid) of all tuples in pg_class (on a relation basis) and the minimum of all relations in a database in pg_database
Transaction IDs will eventually "wrap" around and be re-used
If PostgreSQL runs out of available transaction IDs, it is no longer able to accept new transactions
SELECT datname, age(datfrozenxid) FROM pg_database;
If done on a per-table basis, be sure to consider both the main table and any TOAST table
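As a sketch, a per-table check which also accounts for the TOAST table could look like:
-- Oldest xid age per table, taking the greater of the main relation and its TOAST table
SELECT c.oid::regclass AS table_name,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) AS xid_age
FROM pg_class c
LEFT JOIN pg_class t ON t.oid = c.reltoastrelid
WHERE c.relkind = 'r'
ORDER BY xid_age DESC
LIMIT 20;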
Total number of locks acquired
There is a finite number of heavy-weight locks available in PostgreSQL, which is max_connections * max_locks_per_transaction
Running out of locks will prevent processes from being able to take new locks, resulting in errors
The pg_locks table contains the current PostgreSQL locks
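For example (a sketch), current lock usage can be compared against that ceiling with:
-- Locks currently held vs. the approximate lock-table capacity
SELECT (SELECT count(*) FROM pg_locks) AS locks_held,
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections')
       * (SELECT setting::int FROM pg_settings WHERE name = 'max_locks_per_transaction') AS approximate_lock_capacity;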
Transaction length
Long-running transactions can cause bloat, as VACUUM is unable to mark deleted rows as available space while those transactions remain open
The view pg_stat_activity includes information about all transactions, when they started, and their current status.
Replicas must also be checked for long-running transactions when hot_standby_feedback is being used
Connections which are 'idle in transaction' may be worth monitoring for explicitly as these can be particularly problematic
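A sketch of the kind of query that could be monitored (column names as in 9.6's pg_stat_activity):
-- Longest-running transactions, including how long each session has been in its current state
SELECT pid, usename, state,
       now() - xact_start AS transaction_age,
       now() - state_change AS time_in_current_state
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 20;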
PGBouncer monitoring
pgbouncer provides a number of different stats which should be monitored to avoid reaching limits set in pgbouncer.ini
Monitoring of "waiting" clients can indicate if the pgbouncer configuration should be changed to allow more connections to PG
The number of clients may approach the maximum configured, resulting in subsequent connection attempts being delayed or failing
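These statistics come from pgbouncer's admin console; a sketch of the relevant commands (run against the special 'pgbouncer' database, e.g. psql -p 6432 pgbouncer):
SHOW POOLS;    -- per-pool counts, including cl_active, cl_waiting, sv_active, sv_idle
SHOW STATS;    -- cumulative query and traffic statistics per database
SHOW CLIENTS;  -- one row per client connection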
Review existing backup methodologies and provide recommendations (pending)
For this part I would like you or @northrup to join (when he's back) to explain a bit about this to @sfrost , since I know very little about our WAL-E setup. We have our meetings on Tuesday/Thursday at 17:00 UTC. What day would be convenient for you?
Last updated: Tue Apr 11 18:10:26 2017
Last change: Sat Mar 11 08:13:22 2017 by hacluster via crmd on db1.cluster.gitlab.com
Stack: corosync
Current DC: db1.cluster.gitlab.com (version 1.1.14-70404b0) - partition WITHOUT quorum
2 nodes and 1 resource configured

Online: [ db1.cluster.gitlab.com ]
OFFLINE: [ db2.cluster.gitlab.com ]

Full list of resources:
 gitlab_pgsql (ocf::pacemaker:gitlab_pgsql): Started db1.cluster.gitlab.com

Failed Actions:
* gitlab_pgsql_monitor_30000 on db1.cluster.gitlab.com 'unknown error' (1): call=119, status=Timed Out, exitreason='none', last-rc-change='Wed Mar 29 03:06:16 2017', queued=0ms, exec=0ms
db2 is an old secondary. Output on a secondary that's active:
Last updated: Tue Apr 11 18:11:17 2017
Last change: Fri Mar 17 14:05:53 2017 by hacluster via crmd on db3
Stack: corosync
Current DC: db3 (version 1.1.14-70404b0) - partition with quorum
1 node and 1 resource configured

Online: [ db3 ]

Full list of resources:
 gitlab_pgsql (ocf::pacemaker:gitlab_pgsql): Started db3
@sfrost Log files of the secondary: removed This URL also expires 48 hours from now.
The log directories are from runit, so their structure is a bit goofy. There are a bunch of *.s.gz files; these are archived log files in gzip format. The file named current contains the most recent logs in plain text format. Uncompressed, the total log size is probably a few GBs.
Following up from our discussion regarding recursive queries, the query I had wasn't quite right, but here's one which I believe to be correct, and includes the project_group_links piece (hopefully I understood that relationship correctly):
WITH RECURSIVE namespace_access AS (
    SELECT namespaces.id, access_level
    FROM members
    JOIN namespaces ON (members.source_id = namespaces.id)
    WHERE members.user_id = 1 AND members.source_type = 'Namespace'
  UNION ALL
    SELECT namespaces.id, access_level
    FROM namespaces, namespace_access
    WHERE namespaces.parent_id = namespace_access.id
),
project_namespace_access AS (
    SELECT projects.id, access_level
    FROM projects
    JOIN namespace_access ON (projects.namespace_id = namespace_access.id)
),
project_group_access AS (
    SELECT project_id, group_access AS access_level
    FROM project_group_links
    JOIN namespace_access ON (project_group_links.group_id = namespace_access.id)
),
project_direct_access AS (
    SELECT source_id AS id, access_level
    FROM members
    WHERE members.user_id = 1 AND members.source_type = 'Project'
)
SELECT id AS project_id, min(access_level) AS access_level
FROM (
    SELECT * FROM project_namespace_access
    UNION ALL SELECT * FROM project_direct_access
    UNION ALL SELECT * FROM project_group_access
) AS all_authorizations
GROUP BY id
ORDER BY id;
This query ends up being a bit expensive, largely because there's no index on project_group_links (group_id), I believe, but it's still only 18ms. Without including the project_group_links CTE the query time was around 2ms. There may also be opportunities to optimize this query, I haven't looked very hard to see if that would be possible.
(Note that the 'order by' at the end isn't strictly necessary, of course)
While analyzing the results and comparing them to this:
select project_id, access_level from project_authorizations where user_id = 1 order by project_id;
I discovered that there was one more authorization in project_authorizations that didn't exist through the above query and it was for this project: 695082
I see that the "creator_id" of this project is '1', but there exist other projects which have a create_id of '1' that don't have an entry in project_authorizations for user_id = 1 (eg: project 18533), so I assume that creator doesn't actually imply access.
Is there something the above query is missing regarding access, or..?
There are also cases where the authorization_level doesn't seem to match what's in project_authorizations. I expect those issues are solvable, but thought it would make sense to first make sure we are at least matching up on the set of projects for which a given user should have access and then we can work on how to make sure that the access level is correct. In particular, I'm wondering if the "group_level" should override access through the "namespace"? I would have thought that the minimum of all accesses is the "correct" access level for a given user/project, but that doesn't appear to match to what is currently in project_authorizations in all cases.
ERROR: null value in column "approvals_before_merge" violates not-null constraint
ERROR: statement timeout
while updating tuple (191,93) in relation "tags"
ERROR: canceling autovacuum task
gitlabhq_production.public.notes
gitlabhq_production.public.issues
Modifying the postgresql.conf as discussed earlier to include more information (in particular, log_lock_waits and the log_line_prefix, though also log_checkpoints and the other items) would help get a better picture regarding what's happening to cause some of these longer running queries and why things like the autovacuum runs against certain tables are sometimes being canceled.
@sfrost Based on your query I did some tinkering, and I ended up with the following:
WITH RECURSIVE children(id, parent_id, name) AS (
    SELECT id, parent_id, name
    FROM namespaces
    WHERE path = 'path-of-the-top-level-group'
  UNION ALL
    SELECT ns.id, ns.parent_id, ns.name
    FROM children AS ch, namespaces AS ns
    WHERE ns.parent_id = ch.id
)
SELECT *
FROM children;
This will get a top level namespace along with all of its child namespaces recursively. It doesn't perform too badly either:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on children  (cost=3603.79..3619.21 rows=771 width=524) (actual time=0.034..0.085 rows=4 loops=1)
   CTE children
     ->  Recursive Union  (cost=0.43..3603.79 rows=771 width=18) (actual time=0.032..0.081 rows=4 loops=1)
           ->  Index Scan using index_namespaces_on_path on namespaces  (cost=0.43..8.45 rows=1 width=18) (actual time=0.029..0.029 rows=1 loops=1)
                 Index Cond: ((path)::text = 'group-with-many-sub-groups'::text)
           ->  Nested Loop  (cost=0.43..357.99 rows=77 width=18) (actual time=0.012..0.014 rows=1 loops=3)
                 ->  WorkTable Scan on children ch  (cost=0.00..0.20 rows=10 width=4) (actual time=0.000..0.000 rows=1 loops=3)
                 ->  Index Scan using index_namespaces_on_parent_id_and_id on namespaces ns  (cost=0.43..35.70 rows=8 width=18) (actual time=0.008..0.009 rows=1 loops=4)
                       Index Cond: (parent_id = ch.id)
 Planning time: 0.227 ms
 Execution time: 0.132 ms
I'll see if I can adjust the current sub groups queries to use a similar pattern.
After some hacking I ended up with the following (not too pretty looking) query:
WITH RECURSIVE namespace_ids(id, access_level) AS (
    -- First we get all the namespaces that a user has direct access to.
    SELECT namespaces.id, members.access_level
    FROM namespaces
    INNER JOIN members ON members.source_id = namespaces.id
    WHERE members.source_type = 'Namespace'
      AND namespaces.type = 'Group'
      AND members.user_id = 1
      AND members.requested_at IS NULL
      AND share_with_group_lock = 'f'
      AND deleted_at IS NULL
  UNION
    -- Then we recursively get all their child namespaces.
    SELECT ns.id, members.access_level
    FROM namespace_ids AS ids, namespaces AS ns
    LEFT JOIN members ON members.source_id = ns.id
      AND members.source_type = 'Namespace'
      AND members.source_id = 1
    WHERE ns.parent_id = ids.id
      AND ns.share_with_group_lock = 'f'
      AND ns.deleted_at IS NULL
)
SELECT 1 AS user_id,
       projects.id AS project_id,
       MAX(GREATEST(project_group_links.group_access, namespace_ids.access_level)) AS access_level
FROM namespace_ids
INNER JOIN projects ON projects.namespace_id = namespace_ids.id
LEFT JOIN project_group_links ON project_group_links.group_id = namespace_ids.id
GROUP BY user_id, projects.id;
This spits out the access levels of the projects you have access to either by being a member of a group, by having access to a nested group, or because the project is shared with one of your groups. Having said that, this query is:
Rather ugly because of the MAX + GREATEST combination
Probably wrong as I have done very little testing
Nevertheless, all of the above shows we could in theory use the raw queries directly given that:
WITH RECURSIVE namespaces_cte(id, access_level) AS (
    -- First we get all the namespaces that a user has direct access to.
    SELECT namespaces.id, members.access_level
    FROM namespaces
    INNER JOIN members ON members.source_id = namespaces.id
    WHERE members.source_type = 'Namespace'
      AND namespaces.type = 'Group'
      AND members.user_id = 1
      AND members.requested_at IS NULL
      AND namespaces.share_with_group_lock = 'f'
  UNION
    -- Then we recursively get all the projects of child namespaces.
    SELECT ns.id, members.access_level
    FROM namespaces_cte AS ids, namespaces AS ns
    LEFT JOIN members ON members.source_id = ns.id
      AND members.source_type = 'Namespace'
      AND members.source_id = 1
      AND members.requested_at IS NULL
    WHERE ns.parent_id = ids.id
      AND ns.type = 'Group'
      AND ns.share_with_group_lock = 'f'
)
SELECT 1 AS user_id,
       projects.id AS project_id,
       LEAST(project_group_links.group_access, namespaces_cte.access_level) AS access_level
FROM namespaces_cte
INNER JOIN project_group_links ON project_group_links.group_id = namespaces_cte.id
INNER JOIN projects ON projects.id = project_group_links.project_id
ORDER BY projects.id ASC;
This gets all the project authorizations for the projects shared with any groups you have direct access to or projects for any of their sub-groups. On production this appears to produce the right results, and in a much shorter amount of time. The currently released approach will take around 1.6 seconds to get the data, whereas the above query can do it in about 10 ms.
WITH RECURSIVE namespace_access AS (
    SELECT namespaces.id, access_level
    FROM members
    JOIN namespaces ON (members.source_id = namespaces.id)
    WHERE members.user_id = 1 AND members.source_type = 'Namespace'
  UNION ALL
    SELECT namespaces.id, access_level
    FROM namespaces, namespace_access
    WHERE namespaces.parent_id = namespace_access.id
)
SELECT id AS project_id, min(access_level) AS access_level
FROM (
    SELECT projects.id, access_level
    FROM projects
    JOIN namespace_access ON (projects.namespace_id = namespace_access.id)
  UNION ALL
    SELECT source_id AS id, access_level
    FROM members
    WHERE members.user_id = 1 AND members.source_type = 'Project'
  UNION ALL
    SELECT project_id, group_access AS access_level
    FROM project_group_links
    JOIN namespace_access ON (project_group_links.group_id = namespace_access.id)
) AS all_authorizations
GROUP BY id
ORDER BY id;
@sfrost So with some further tinkering during the call I came up with:
WITH RECURSIVE namespaces_cte(id, access_level) AS (
    -- First we get all the namespaces the user has direct access to.
    SELECT namespaces.id, members.access_level
    FROM namespaces
    INNER JOIN members ON members.source_id = namespaces.id
    WHERE members.source_type = 'Namespace'
      AND namespaces.type = 'Group'
      AND members.user_id = 1
      AND members.requested_at IS NULL
      AND namespaces.share_with_group_lock = 'f'
  UNION
    -- Then we'll get all child namespaces recursively.
    SELECT ns.id, COALESCE(members.access_level, ids.access_level)
    FROM namespaces_cte AS ids, namespaces AS ns
    LEFT JOIN members ON members.source_id = ns.id
      AND members.source_type = 'Namespace'
      AND members.source_id = 1
      AND members.requested_at IS NULL
    WHERE ns.parent_id = ids.id
      AND ns.type = 'Group'
      AND ns.share_with_group_lock = 'f'
)
SELECT project_id, MAX(access_level) AS access_level
FROM (
    SELECT projects.id AS project_id, namespaces_cte.access_level AS access_level
    FROM namespaces_cte
    INNER JOIN projects ON projects.namespace_id = namespaces_cte.id
  UNION
    SELECT project_group_links.project_id AS project_id, project_group_links.group_access AS access_level
    FROM namespaces_cte
    INNER JOIN project_group_links ON project_group_links.group_id = namespaces_cte.id
) authorizations
GROUP BY project_id;
This requires a GROUP BY on the outer query to ensure we get the MAX of the unioned set.
Regarding the current backup and restore approach, my understanding is:
WAL-E is used for backups
Nightly FULL backups are done, ~3.5 hours required to perform the backup
Full backups are kept for 8 days
WAL is also kept for 8 days
Restore testing is performed manually, approximately monthly, and includes restoring a full backup and replaying WAL to a specific point in time. Restore time is quite variable.
The plan is to automate the restore process for rebuilding and testing.
Recommendations moving forward:
Regular, ideally automated, testing of backups is extremely important. Everyone who might be involved should understand the basics of performing a restore, a runbook should exist to facilitate a restore, and utilization of automated restores is very helpful to ensure that the backups are valid and useful.
Restoring backups into staging on a daily or weekly basis is a good approach for testing that a backup is valid.
Additional testing would ideally be done, eg: running "pg_dump" across the restored backup, to identify any obvious table corruption. Note that this does not check for all kinds of corruption and does not test indexes at all.
Using the results of the "pg_dump" and restoring that data into a new, clean, cluster will provide testing of foreign key relationships, when they exist, as well as constraints. With proper foreign keys and constraints, this can provide an excellent test to ensure that the data in the database is consistent and relatively free of corruption.
PostgreSQL does have support for page-level checksums, though to enable them a logical pg_dump/restore would have to be done, which is reasonably expensive. With page-level checksums, some backup technologies (eg: pgBackrest) are able to validate those checksums during a backup and identify any checksum failures. WAL-E does not currently support this.
Query analysis based on the pg_stat_statements_history contents:
Items you already created issues for (I believe):
Query of the form: SELECT "users".* FROM "users" WHERE "users"."id" = 1245 AND ("users"."id" NOT IN (SELECT "approvals"."user_id" FROM "approvals" WHERE "approvals"."merge_request_id" = 1234)) ORDER BY "users"."id" DESC ; is performing a sequential scan across the approvals table.
Adding an index to approvals on merge_request_id should improve the performance of this query.
Users table is very wide and frequently updated, should consider moving frequently updated columns into another, narrower, table.
Missing indexes for foreign key referring columns (general issue)
DELETE FROM "ci_pipelines" WHERE "ci_pipelines"."id" = ? AND "ci_pipelines"."lock_version" = ?
The ci_pipelines table is referred to from the ci_pipelines table itself and from two other tables (ci_builds and merge_request_metrics). The ci_pipelines table and ci_builds table do not have an index on the referring column, leading to this query taking >5s to perform sequential scans. Creating an index in each table on the "auto_canceled_by_id" column should greatly improve performance of this DELETE.
SELECT COUNT(*) FROM ci_runners WHERE active=true AND contacted_at >= NOW() - '60 minutes'::INTERVAL;
The ci_runners table does not have an index on contacted_at, even though that column is regularly used in queries. Adding a partial index on (active = true) over contacted_at should improve performance of this query.
Possible index on web_hooks.type would be useful for query: SELECT "web_hooks".* FROM "web_hooks" WHERE "web_hooks"."type" IN (?) ORDER BY "web_hooks"."id" DESC. Removing the ORDER BY may also improve performance, if it's not actually needed.
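For reference, the index additions suggested above might look roughly like the following (a sketch; the index names are made up, and CREATE INDEX CONCURRENTLY is used to avoid blocking writes, which means the statements must be run outside a transaction block):
-- Sketch only; index names are hypothetical.
CREATE INDEX CONCURRENTLY index_approvals_on_merge_request_id
    ON approvals (merge_request_id);
CREATE INDEX CONCURRENTLY index_ci_pipelines_on_auto_canceled_by_id
    ON ci_pipelines (auto_canceled_by_id);
CREATE INDEX CONCURRENTLY index_ci_builds_on_auto_canceled_by_id
    ON ci_builds (auto_canceled_by_id);
CREATE INDEX CONCURRENTLY index_ci_runners_on_contacted_at
    ON ci_runners (contacted_at) WHERE active = true;
CREATE INDEX CONCURRENTLY index_web_hooks_on_type
    ON web_hooks (type);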
Additional queries which likely should have issues also created for them:
SELECT "ci_builds".* FROM "ci_builds" INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id" AND "projects"."pending_delete" = 'f' WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."status" = 'running' AND (ci_builds.updated_at < '2017-04-20 16:00:01.780025') ORDER BY "ci_builds"."id" ASC LIMIT 50
;
There is no index on "updated_at". The index on ci_builds (id) is used, but it has to filter out essentially the entire table, making the index scan not performant. Having an index on updated_at should greatly improve the performance of this query. If the values for "type" and "status" are always "Ci::Build" and "running", then a partial index over updated_at for those exact values would be even more efficient as those values appear to be a small portion of the table.
An alternative approach would be to use a CTE to pull out just the records which are Ci::Build and running, then filter and join afterwards, along these lines:
with build_running as (select * from ci_builds WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."status" = 'running') select * from build_running where updated_at < '2017-04-20 16:00:01.780025';
The above query takes ~65ms, instead of 30+s.
This query takes on the order of 10s: UPDATE "ci_builds" SET "user_id" = ? WHERE "ci_builds"."type" IN (?) AND "ci_builds"."user_id" = ?
The above query would almost certainly benefit from an index on "user_id" in ci_builds. It is also not entirely clear to me why the user_id field is being changed, but if that's the correct business logic then there should be an index to support the query and improve its performance.
This query takes on the order of 3.5s: SELECT "lfs_objects".* FROM "lfs_objects" LEFT JOIN lfs_objects_projects ON lfs_objects_projects.lfs_object_id = lfs_objects.id WHERE "lfs_objects_projects"."id" IS NULL
An index on lfs_objects_projects (lfs_object_id) would allow an in-order index traversal of the lfs_objects (id) index and the proposed new index to feed into a Merge Left Join, likely improving the performance of this query. What is not entirely clear is the intent of this query as it appears to generally return zero rows- should there be an FK relationship here instead, removing the need to validate that no rows are missing?
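Similarly, sketches of the indexes discussed for these queries (names made up; the partial index assumes the type/status values really are fixed as described above):
-- Sketch only; index names are hypothetical.
CREATE INDEX CONCURRENTLY index_ci_builds_on_updated_at_running
    ON ci_builds (updated_at)
    WHERE type = 'Ci::Build' AND status = 'running';
CREATE INDEX CONCURRENTLY index_ci_builds_on_user_id
    ON ci_builds (user_id);
CREATE INDEX CONCURRENTLY index_lfs_objects_projects_on_lfs_object_id
    ON lfs_objects_projects (lfs_object_id);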
I came across another interesting query which looks something like this:
SELECT "events"."id" AS t0_r0, "events"."target_type" AS t0_r1, "events"."target_id" AS t0_r2, "events"."title" AS t0_r3, "events"."data" AS t0_r4, "events"."project_id" AS t0_r5, "events"."created_at" AS t0_r6, "events"."updated_at" AS t0_r7, "events"."action" AS t0_r8, "events"."author_id" AS t0_r9, "projects"."id" AS t1_r0, "projects"."name" AS t1_r1, "projects"."path" AS t1_r2, "projects"."description" AS t1_r3,...FROM"events"LEFT OUTER JOIN "projects" ON "projects"."id" = "events"."project_id" AND "projects"."pending_delete" = 'f'LEFT OUTER JOIN "routes" ON "routes"."source_id" = "projects"."id" AND "routes"."source_type" = 'Project'LEFT OUTER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" AND "namespaces"."deleted_at" IS NULLLEFT OUTER JOIN "users" ON "users"."id" = "events"."author_id"WHERE("events"."author_id" IS NOT NULL)AND "events"."author_id" = 423915AND "projects"."pending_delete" = 'f'AND (projects.id IN (SELECT "projects"."id" FROM "projects" INNER JOIN "project_authorizations" ON "projects"."id" = "project_authorizations"."project_id" WHERE "projects"."pending_delete" = 'f' AND "project_authorizations"."user_id" = 758045 UNION SELECT "projects"."id" FROM "projects" WHERE "projects"."visibility_level" IN (20, 10)))ORDER BY "events"."id" DESCLIMIT 20;
Over the period of time I was looking at, this had an average run time of 1.2s.
It seems to be a rather... convoluted query, however. In particular, the IN clause and sub-select are curious as a list is built up of all projects which have a visibility level of 20 or 10, plus a list of all projects the user has access to via project_authorizations, as I understand it, but with this filter (and the filter on projects.pending_delete), the "left join" against projects is effectively an inner join (since any records in events not matching in projects will get filtered because projects.pending_delete would be NULL).
Here is an alternative query which I believe returns the same results in only about 100ms:
SELECT "events"."id" AS t0_r0, "events"."target_type" AS t0_r1, "events"."target_id" AS t0_r2, "events"."title" AS t0_r3, "events"."data" AS t0_r4, "events"."project_id" AS t0_r5, "events"."created_at" AS t0_r6, "events"."updated_at" AS t0_r7, "events"."action" AS t0_r8, "events"."author_id" AS t0_r9, "projects"."id" AS t1_r0, "projects"."name" AS t1_r1, "projects"."path" AS t1_r2, "projects"."description" AS t1_r3,...FROM "events"JOIN "projects" ON "projects"."id" = "events"."project_id" AND "projects"."pending_delete" = 'f'LEFT OUTER JOIN "routes" ON "routes"."source_id" = "projects"."id" AND "routes"."source_type" = 'Project'LEFT OUTER JOIN "namespaces" ON "namespaces"."id" = "projects"."namespace_id" AND "namespaces"."deleted_at" IS NULLLEFT OUTER JOIN "users" ON "users"."id" = "events"."author_id"WHERE "events"."author_id" = 423915AND (projects.visibility_level = 10 or projects.visibility_level = 20 or exists (select 1 from project_authorizations where projects.id = project_authorizations.project_id AND project_authorizations.user_id = 758045))ORDER BY "events"."id" DESCLIMIT 20;
Of course, we should discuss and double-check that this is correct.
As this seems to be a reasonably popular query (run over 1000 times in the sample I was reviewing), and the current query results in a sequential scan across the projects table, I'm suspicious that this may be one of the queries contributing to the high number of sequential scans across the projects table which is seen in the monitoring.
Restoring backups into staging on a daily or weekly basis is a good approach
for testing that a backup is valid.
If I recall this was something we wanted to do, but I'm not sure what the status
of this is. Perhaps @northrup or @pcarranza can clarify.
Regarding the queries, for most I did create issues. For some of the ones we
didn't discuss during the call I created the following issues:
SELECT "ci_builds".* FROM "ci_builds" INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id" AND "projects"."pending_delete" = 'f' WHERE "ci_builds"."type" IN ('Ci::Build') AND "ci_builds"."status" = 'running' AND (ci_builds.updated_at < '2017-04-20 16:00:01.780025') ORDER BY "ci_builds"."id" ASC LIMIT 50
There is no index on "updated_at". The index on ci_builds (id) is used, but it has to filter out essentially the entire table, making the index scan not performant. Having an index on updated_at should greatly improve the performance of this query. If the values for "type" and "status" are always "Ci::Build" and "running", then a partial index over updated_at for those exact values would be even more efficient as those values appear to be a small portion of the table.
This query takes on the order of 3.5s: SELECT "lfs_objects".* FROM "lfs_objects" LEFT JOIN lfs_objects_projects ON lfs_objects_projects.lfs_object_id = lfs_objects.id WHERE "lfs_objects_projects"."id" IS NULL
I'm not sure what this query is used for either, but I did get some data
returned when I ran it just now:
Over the period of time I was looking at, this had an average run time of 1.2s.
It seems to be a rather... convoluted query, however. In particular, the IN
clause and sub-select are curious as a list is built up of all projects which
have a visibility level of 20 or 10, plus a list of all projects the user has
access to via project_authorizations, as I understand it, but with this filter
(and the filter on projects.pending_delete), the "left join" against projects is
effectively an inner join (since any records in events not matching in projects
will get filtered because projects.pending_delete would be NULL).
Projects with a visibility of 20 are public projects, with 10 being internal
ones (= only visible to logged in users) if I'm not mistaken. This query in
particular is probably used for some dashboard showing recent events of projects
one has access to. This is probably from the
https://gitlab.com/dashboard/activity page, or something similar.
The number of JOINs certainly seems like massive overkill, but they may be
needed by the application to work around N+1 query problems.
The IN clause is a result of how our ORM works. When you pass a query/relation
object to the ORM it will use an IN. For example:
some_table.where(some_column: some_other_query)
Would produce something along the lines of:
SELECT *
FROM some_table
WHERE some_column IN ( some_other_query )
@yorickpeterse Regarding the queries generated with some_table.where(some_column: some_other_query) I suspect that many of those may turn into slow queries, as the system grows, and it might make sense to start trying to discourage that practice to get ahead of it.
Requested during the status call today, for @sfrost and @yorickpeterse: please add links to the body of the issue when updating their status. Link to the comment (for discussions) or to issues. This makes it easier to track and to see which issues are closed.
yorickpeterse-staging marked the checklist item Implementation of pg_bouncer and other improvements will impact capacity planning and scalability as completed
Review system statistics and identify any areas of concern
This is something of a catch-all, but regarding the configuration changes made, here's where things stand:
The following changes were made, as I understand it, and reviewing the log files as a result would help to identify if there are any areas of concern from these pieces:
log_min_duration_statement- Lowered to 1000 from 2000. This is useful, but less interesting since we've been able to work off of the information provided through pg_stat_statements.
log_checkpoints- Seeing the information in the log files with regard to checkpoints would be very helpful to see how much time is spent in the various parts of checkpointing as well as seeing how large each checkpoint is. There have been cases where the number of checkpoints has gone up quite a bit, based on grafana data, and being able to see what happened in those cases would be very helpful.
log_temp_files- Set to zero, meaning that any queries which create temp files will be logged. This could prove interesting as it may indicate cases where work_mem is not set as high as it should be and it can also show queries which are being slow due to working with temp files instead of other reasons.
Changes still pending include:
log_lock_waits- One of the ongoing areas of concern is the number of blocked queries being seen in grafana. Once log_lock_waits is available and enabled (issue https://gitlab.com/gitlab-com/infrastructure/issues/1556) then we'll be able to see in the PostgreSQL log files any time a process is blocked for more than 1s.
track_io_timing- Another area of concern from the grafana data is the IO timing on the primary. We understand that to be an issue with how it was set up in the existing environment, but having track_io_timing (issue https://gitlab.com/gitlab-com/infrastructure/issues/1558) would give us insight into how that is impacting query performance through pg_stat_statements.
tracking 'idle in transaction'- Lastly, again from grafana, the number of "idle in transaction" backend processes at any given time is concerning. Tracking how long processes are idle in transaction for (issue https://gitlab.com/gitlab-com/infrastructure/issues/1588) and reporting on ones which are idle in transaction for a long time would help with investigating where those are coming from and hopefully they can be reduced. We saw one particular case where a long 'find' (as I recall) was being run across the filesystem while a transaction was held open in the database.
Review replica stats and log files
For this item, ideally, we would enable the same options listed above on the replicas as well to get more information into their log files for analysis too. However, the much more interesting thing to do would be to collect the pg_stat_statements information on the replicas and periodically reset those stats. We currently have the pg_stat_statements history from the primary which has proven very useful but there may be queries which are slow that are only run on the replicas and collecting this information would allow us to see those. This is a bit complicated since replicas can't be written to directly. One approach would be to simply connect to the replica, COPY the pg_stat_statements data to a file, reset the stats, and then connect to the primary to load that hour's data. Another approach would be to use a foreign data wrapper and have a process which runs on the primary periodically to query the data through the FDW and store it into a table on the primary, though something would still need to connect to the replica to reset the stats.
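A minimal sketch of the first approach, following the same psql/COPY convention as the earlier collection commands (the host placeholders, file name, and the exact layout of the existing pg_stat_statements_history table are assumptions):
psql -h ##REPLICA## -d ##DATABASE## -c "COPY (SELECT now() AS captured_at, * FROM pg_stat_statements) TO STDOUT WITH CSV" > pss_replica.csv
psql -h ##REPLICA## -d ##DATABASE## -c "SELECT pg_stat_statements_reset();"
psql -h ##PRIMARY## -d ##DATABASE## -c "\copy pg_stat_statements_history FROM 'pss_replica.csv' WITH CSV"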
Review autovacuum runs
Yes, if you're able to provide a log file (a complete one would be fine and would allow me to review the other items mentioned above- no need to filter it, I can do that) that would be great.
Review shared buffer utilization
With pg_buffercache enabled we're able to see how the cache is being used, which is extremely helpful, but we've identified a number of cases where poor query plans are resulting in large portions of the cache being used for certain tables. Once those queries have been changed to minimize the amount of data they look at, this will become much more interesting, as we would be able to determine how much of the cache is active and whether it might be possible to reduce the size of shared_buffers. Right now, a great deal of the cache is used, very frequently, due to the poor query plans (though, to be fair, table design plays a part here as well). Being able to periodically extract the shared buffer cache information in a manner similar to pg_stat_statements, and to collect that information on the replicas as well, would also be informative, though the 9.6 version of pg_buffercache has the downside that it may cause a small delay to ongoing operations, so it might be best to collect this less often and perhaps during slower periods.
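For reference, the periodic extract could be built around the standard pg_buffercache breakdown of which relations occupy the cache (this assumes the default 8kB block size):

```sql
-- Top relations by number of shared buffers they currently occupy.
SELECT c.relname,
       count(*) AS buffers,
       pg_size_pretty(count(*) * 8192) AS buffered
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 20;
```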
Yes, if you're able to provide a log file (a complete one would be fine and would allow me to review the other items mentioned above- no need to filter it, I can do that) that would be great.
This is the current log file of the primary: removed
The log file starts at 9:55 and ends at 11:17, or 82 minutes.
What this shows is that a VACUUM is executed almost every minute against the project_mirror_data table. Reviewing the information from pg_stat_statements for this table, we see that it's updated ~10x every second. This table is reasonably small and narrow, however, and therefore the VACUUMs are quick- taking less than a second to run. The entire table is also in shared buffers all the time, which helps make the vacuum fast too.
The same is generally true for the ci_runners table- relatively small table, entirely in shared buffers, vacuum finishes in 5-10 seconds.
The next is ci_builds, which is a much larger table. The frequent vacuums are able to complete in only about 80 seconds though, likely mostly due to much of the table being in shared buffers and the rest being in the filesystem cache. The autovacuum costs on this table were adjusted to allow higher rates, meaning this table is processed at ~100MB/s.
merge_request_diffs is a very large table but almost all of the space from it is in the TOAST table, the main table being only ~5GB. Further, most of it isn't changed, and so a VACUUM only has to touch a fraction of the table (around 600MB), thus the vacuum only takes ~15s.
project_statistics is another relatively small table, a VACUUM of which finishes in ~10s.
merge_requests is where things start to get particularly interesting- it's large and enough of it is being changed that a lot is VACUUM'd- about 2G with each run. Combined with the low autovacuum cost settings, the table ends up taking around 10 minutes to be VACUUM'd. Looking at how it was VACUUM'd 7 times over 82 minutes, that means that an autovacuum worker (out of the 6 configured) was essentially always running against this table.
ci_pipelines is also pretty large and changing, but VACUUM is able to process through it in about 3 minutes on average.
namespace_statistics is small and VACUUM'd quickly.
projects is another large table which is changing a lot. About 1G of the table is being looked at with each VACUUM run and the runs take 10-15 minutes each. Given that there were 5 runs over 82 minutes, we can conclude that this table also ties up an autovacuum worker essentially full-time.
remote_mirrors is small and VACUUM'd quickly.
todos is reasonably small and only takes a few minutes to vacuum each time through.
The next big table is events, where a VACUUM appears to go over ~10G of the table each time through, requiring close to half an hour for each run. There are only 2 runs included in the report, but it seems reasonable to conclude that this table also chews up an autovacuum worker near to full-time.
issues is another large table, requiring ~40 minutes to VACUUM, and likely chews up an entire autovacuum worker. Only about 1.7G of this table has to be checked over, but that appears to result in a very large number of cache misses and quite a few pages being dirtied.
notes is also very large; it takes 30-35 minutes to VACUUM and ties up another autovacuum worker.
The rest of the tables look to be reasonably small or not updated enough to require a lot of VACUUM work to occur on them.
I'll follow-up to this comment with another explaining some concerns here.
There are two main points to make from the above insight into the VACUUM'ing details provided here:
There are a number of wide tables being UPDATE'd quite a bit with only a fraction of the row changing. The "projects" table is a great example of this- it's updated something like 10 times a second for a certain subset of columns, perhaps 10 at most (and they are not updated at the same time, but independently!), while the table has 56 columns. An UPDATE will cause the entire row to be rewritten as a new row version, resulting in a large amount of churn to the table with each of the individual UPDATE statements being run, and dead tuples which then have to be VACUUM'd. As mentioned elsewhere, but perhaps illustrated best here, moving those frequently updated columns to narrower tables would greatly reduce the amount of work required for executing the UPDATE and for performing the cleanup VACUUM afterwards.
There are only 6 autovacuum workers enabled on this cluster and nearly all of them appear to be in use all the time. While bloat appears to be managed reasonably well and the global database frozen xid isn't far enough behind to be a worry currently, it strikes me that we aren't too far from the point where the autovacuum workers won't be able to keep up with the current configuration. That configuration could be changed in a couple of different ways to address this (a configuration sketch follows the list below):
Autovacuum could be allowed to proceed faster by relaxing the cost-based throttling. Today, this is unlikely to induce much additional load on the overall system since, basically, the entire database is in memory somewhere, but that may not remain the case down the road.
Additional autovacuum workers could be configured, allowing more to be run concurrently to hopefully keep up with the change rates. Again, given that the database fits in cache today, this isn't likely to cause issues for the overall system, particularly given that the overall CPU load is much lower thanks to the other optimizations which have been done.
In an environment where random access is more expensive than sequential access, which is typically the case in databases that are not fully cached in memory, having fewer workers which are doing primarily sequential access (which is typically what a VACUUM is doing) would be preferable.
The overall write load could be reduced by moving the heavily updated columns to different tables and updating all the columns at once, as much as possible, while also eliminating any unnecessary or useless work being done, thus reducing the amount of work VACUUM has to do.
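A configuration sketch for the first two options; the values here are purely illustrative, not recommendations from this analysis:

```sql
-- Relax autovacuum throttling (takes effect on reload):
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;   -- defaults to vacuum_cost_limit (200)
ALTER SYSTEM SET autovacuum_vacuum_cost_delay = '10ms'; -- default is 20ms
-- Add workers (this one requires a restart to take effect):
ALTER SYSTEM SET autovacuum_max_workers = 10;           -- currently 6
SELECT pg_reload_conf();

-- Costing can also be relaxed per table, as was already done for ci_builds
-- (the value here is again illustrative):
ALTER TABLE ci_builds SET (autovacuum_vacuum_cost_limit = 2000);
```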
Lastly, an issue which came out of this analysis is that it's possible the projects table and the project_mirror_data table are being updated extensively in cases where no useful progress is being made. Specifically, projects which have not had a successful mirror in over 6 months appear to still be tried over and over again, even in cases which look like they really should be a "hard" failure.
The system, overall, looks like it could do with some kind of cut-off where, if mirroring has failed for a certain amount of time (perhaps a week?), a notice is sent out saying that the mirror has hard-failed and will not be retried until some user action is taken. Perhaps certain failure cases could be considered hard-failed immediately (such as an authorization failure, the remote repository not being found, or the default branch having diverged from its upstream counterpart).
Eliminating all of these jobs which look like they may be fruitless could reduce the overall system load significantly, though I've not done a formal analysis of the potential impact.
yorickpeterse-staging marked the checklist item Review autovacuum settings and provide recommendations as completed
yorickpeterse-staging marked the checklist item Review shared buffer utilization through pg_buffercache (requires installation) to determine memory utilization and working set size as completed
Review possibility to enable full logging, to allow collection of all queries to enable replaying on staging
I wasn't able to find an issue for this yet, but it's also notoriously difficult to predict what will happen when full logging is enabled in an environment like this.
One approach to consider here would be to enable statement logging at the beginning of a deployment for a few minutes (prior to actually doing anything) as users may be more accepting if it turns out there are issues due to it.
Note that enabling statement-level logging only requires a "reload" of PostgreSQL; it does not need a restart, and it can be switched off again just as quickly.
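A minimal sketch of toggling full statement logging for such a window, assuming it is done via ALTER SYSTEM rather than by editing postgresql.conf directly:

```sql
-- Turn full statement logging on for a few minutes...
ALTER SYSTEM SET log_statement = 'all';
SELECT pg_reload_conf();

-- ...then revert to whatever was configured before.
ALTER SYSTEM RESET log_statement;
SELECT pg_reload_conf();
```

The main unknown is log volume, so it would be worth watching disk space on the log partition while this is enabled.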
Provide recommendations regarding changes to shared_buffers and system memory amounts
The upshot of this is that the events table really needs to be addressed; once that is done, a further analysis of what's in shared_buffers should follow. The effort to change the events table is being tracked in https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/12463, I believe.
Recommendations regarding metrics for scalability and capacity planning
Unfortunately, the grafana graphs no longer appear to go back more than a month, so it's not easy to gauge the long-term growth rates in transactions, CPU load, and IO load. During the time I've been involved, though, my general feeling has been that we've reduced all of the above through the query optimization efforts and will continue to do so with the events rework. There looks to be a very long path forward in continuing to improve queries and optimize data structures, which strikes me as likely to out-pace load increases based on recent history. Of course, it's always possible there will be a "windfall" event which drives up activity significantly, though the current systems appear reasonably well positioned to handle that, and increased read load can be handled by spinning up additional read replicas. Addressing a serious increase in write load could be a concern, but we'll know much better how much of one when the changes to db1's I/O system are made (to match db3/db4) and the events table has been cleaned up to no longer include very large writes of data that's mostly/completely unused.
Addressing a serious increase in write load could be a concern, but we'll know much better how much of one when the changes to db1's I/O system are made (to match db3/db4) and the events table has been cleaned up to no longer include very large writes of data that's mostly/completely unused.
Imagine the write load is 10x what it is today. What would you foresee / propose?
@ernstvn By and large, I think the existing systems could handle a 10x increase in write load, but there wouldn't be as much room for things like the current events migration, possibly leading to having to throttle such changes down. That said, it looks like max_wal_size still needs to be adjusted further upward to avoid forced checkpoints, and checkpoint_timeout could be raised from the current default of 5m to reduce the number of FPI (full-page image) writes happening in the WAL stream.
To address that concern, I continue to recommend improving the data design to reduce the amount of writing which is currently happening at the disk level. Quite a few very wide tables (users, projects) are being constantly updated just to change a subset of fields, requiring a great deal more writing than is necessary. Moving the frequently updated fields to narrower tables would be a great step towards reducing the write load and increasing the scalability of the system when it comes to handling more write traffic.
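As a purely hypothetical sketch of that pattern (the table and column names below are invented for illustration and are not the actual schema):

```sql
-- Hypothetical narrow "hot" table so that frequent UPDATEs rewrite a small row
-- version instead of the full, wide projects row.
CREATE TABLE project_hot_columns (
    project_id            integer PRIMARY KEY REFERENCES projects (id),
    last_activity_at      timestamptz,
    last_repository_check timestamptz
);

-- The frequent UPDATEs then touch only this table:
UPDATE project_hot_columns
   SET last_activity_at = now()
 WHERE project_id = 42;
```

Each such UPDATE then produces a dead tuple of a few dozen bytes rather than a copy of a 50+ column row, which reduces both the WAL volume and the amount of cleanup work autovacuum has to do afterwards.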
At 100x, I doubt the current system and data design would be able to handle the load. The minor tweaks discussed above regarding max_wal_size and checkpoint_timeout would also help, but I doubt they would be enough to address a 100x increase in write throughput. Improving the data design and system efficiency by reworking the tables would help a great deal and could be sufficient to handle a 100x increase, but the question is whether the amount of WAL being generated would be more than the system can handle. Moving off of a virtualized environment and onto physical hardware with SSDs would certainly help a great deal. While db3 appears to be better about managing the write load, the grafana graphs still have concerning spikes of latency that do not appear to correlate with an increased number of checkpoints or other indicators of actual increased write load.
If a 10x or 100x increase in write load is likely in the near future, I would strongly recommend expending more effort to migrate frequently updated columns out of wide tables and into their own tables where most of the columns are updated together in one statement whenever updates are done. The tweaks to max_wal_size and checkpoint_timeout should also be considered, but increasing checkpoint_timeout is likely to lead to longer recovery times in the event of a system crash or unclean reboot, so that trade-off needs to be weighed.
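A sketch of those two tweaks with illustrative values only, since no specific numbers have been settled on in this discussion:

```sql
-- Both settings take effect on a reload.
ALTER SYSTEM SET max_wal_size = '16GB';        -- raise to avoid forced checkpoints
ALTER SYSTEM SET checkpoint_timeout = '15min'; -- fewer checkpoints and FPI writes,
                                               -- at the cost of longer crash recovery
SELECT pg_reload_conf();
```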
There is some remaining work such as enabling certain settings (mostly for monitoring purposes), but there are separate issues for those. As a result I'll close this one and we'll open it up later today once we have confirmed all sensitive data has been removed.