Testing and tuning PGD clusters v5.6

You can test PGD applications using the following approaches:

Trusted Postgres Architect

Trusted Postgres Architect is the system used by EDB to deploy reference architectures, including those based on EDB Postgres Distributed.

Trusted Postgres Architect includes test suites for each reference architecture. It also simplifies creating and managing a local collection of tests to run against a TPA cluster, using a syntax like the following:

tpaexec test mycluster mytest

We strongly recommend that developers write their own multi-node suite of Trusted Postgres Architect tests that verify the main expected properties of the application.

pgd_bench

The Postgres benchmarking application pgbench was extended in PGD 5.0 in the form of a new application: pgd_bench.

pgd_bench is a regular command-line utility that's added to the PostgreSQL bin directory. The utility is based on the PostgreSQL pgbench tool but supports benchmarking CAMO transactions and PGD-specific workloads.

Functionality of pgd_bench is a superset of pgbench functionality but requires the BDR extension to be installed to work properly.

Key differences include:

  • Adjustments to the initialization (-i flag) with the standard pgbench scenario to prevent global lock timeouts in certain cases.
  • VACUUM command in the standard scenario is executed on all nodes.
  • pgd_bench releases are tied to the releases of the BDR extension and are built against the corresponding Postgres distribution. This information is reflected in the output of the --version flag.

The current version allows you to run failover tests while using CAMO or regular PGD deployments.

The following options were added:

-m, --mode=regular|camo|failover
mode in which pgbench should run (default: regular)
  • Use -m camo or -m failover to specify the mode for pgd_bench. You can use the -m failover specification to test failover in regular PGD deployments.
--retry
retry transactions on failover
  • Use --retry to specify whether to retry transactions when failover happens with -m failover mode. This option is enabled by default for -m camo mode.

In addition to these options, you must specify the connection information about the peer node for failover in DSN form.

Here's an example in a CAMO environment:

    pgd_bench -m camo -p $node1_port -h $node1_host bdrdemo \
        "host=$node2_host user=postgres port=$node2_port dbname=bdrdemo"

This command runs in CAMO mode. It connects to node1 and runs the tests. If the connection to node1 is lost, then pgd_bench connects to node2. It queries node2 to get the status of in-flight transactions. Aborted and in-flight transactions are retried in CAMO mode.

In failover mode, if you specify --retry, then in-flight transactions are retried. In this scenario, there's no way to find the status of in-flight transactions.

Notes on pgd_bench usage

  • When using custom init-scripts, it's important to understand implications behind the DDL commands. We generally recommend waiting for the secondary nodes to catch up on the data-load steps before proceeding with DDL operations such as CREATE INDEX. The latter acquire global locks that can't be acquired until the data load is complete and thus might time out.

  • No extra steps are taken to suppress client messages, such as NOTICE and WARNING messages emitted by PostgreSQL and or any possible extensions, including the BDR extension. It's your responsibility to suppress them by setting appropriate variables, such as client_min_messages, bdr.camo_enable_client_warnings, and so on.

Performance testing and tuning

PGD allows you to issue write transactions onto multiple nodes. Bringing those writes back together onto each node has a performance cost.

First, replaying changes from another node has a CPU cost and an I/O cost, and it generates WAL records. The resource use is usually less than in the original transaction since CPU overhead is lower as a result of not needing to reexecute SQL. In the case of UPDATE and DELETE transactions, there might be I/O costs on replay if data isn't cached.

Second, replaying changes holds table-level and row-level locks that can produce contention against local workloads. The conflict-free replicated data types (CRDT) and column-level conflict detection (CLCD) features ensure you get the correct answers even for concurrent updates, but they don't remove the normal locking overheads. If you get locking contention, try to avoid conflicting updates, or keep transactions as short as possible. A heavily updated row in a larger transaction causes a bottleneck on performance for that transaction. Complex applications require some thought to maintain scalability.

If you think you're having performance problems, develop performance tests using the benchmarking tools. pgd_bench allows you to write custom test scripts specific to your use case so you can understand the overhead of your SQL and measure the impact of concurrent execution.

If PGD is running slow, then we suggest the following:

  1. Write a custom test script for pgd_bench, as close as you can make it to the production system's problem case.
  2. Run the script on one node to give you a baseline figure.
  3. Run the script on as many nodes as occur in production, using the same number of sessions in total as you did on one node. This technique shows you the effect of moving to multiple nodes.
  4. Increase the number of sessions for these two tests so you can plot the effect of increased contention on your application.
  5. Make sure your tests are long enough to account for replication delays.
  6. Ensure that replication delay isn't growing during your tests.

Use all of the normal Postgres tuning features to improve the speed of critical parts of your application.