Skip to main content

CDC on RDS

PolyScale can employ Change Data Capture (CDC) to perform cache invalidation. General documentation to setup CDC with PolyScale exists here. This section addresses configuration and setup issues specifically for AWS RDS databases.

Configuring the PolyScale CDC-engine

The PolyScale CDC engine is a stand alone application that runs alongside the database. The CDC engine is a thin wrapper around Debezium a highly regarded open source project for doing CDC. The CDC engine will work with Postgres, Mysql, and MS Sql Server databases.

In order to interact with an AWS RDS database instance the CDC engine should be running in the same region, and where possible, in the same availability zone. It can be run as a task from AWS Elastic Container Service (ECS). The ECS cluster should be configured to use EC2 for its Infrastructure. The EC2 instance should run a plain version of Linux (eg. Amazon Linux 2). An instance size of medium or large should be sufficient. Capacity should be 1. An example configuration of the ECS cluster would be:

During the cluster creation process, the network settings are also specified. These should be set so as to allow the cluster to access the RDS instance, as well as to access the public internet.

Once the cluster is created, associate with it a new task. The task infrastructure requirements should be set to EC2 Instances, operating system Linux, and bridge network. All of the cpu and memory of the cluster can be assigned to the task since only this one task will be running. An example of the infrastructure configuration for the task is:

The container associated with the task should point to the public github container registry address:

ghcr.io/polyscale/cdc-engine:latest

and mark the container as essential. The container configuration should look like:

CDC-Engine environment variables

Two environment variables need to be specified for the CDC-engine task. The first is CDC_CONNECTION_URI which tells the CDC engine how to connect to the database. The second is CDC_EVENT_QUEUE_URI which tells the CDC engine how to connect to the PolyScale.

CDC_CONNECTION_URI

The format will depend on the specific database type, but generally looks like standard connection string.

For a postgres database where the username is bob, password is 123, database url is xyz.amazonaws.com, the port is the default 5432, your PolyScale cache id is 444-555-666, and the database name is dbname, then the value of the URI would be:

postgres://bob:123@xyz.amazonaws.con:5432/444-555-666?database=dbname

For a Mysql or a MariaDb database the format is similar except that no database is specified (CDC events for all databases are captured). For example (using the same specifics as above):

mysql://bob:123@xyz.amazonaws.com:3306/444-555-666

For a MS Sql Server database the database name must be specified. An example would be:

sqlserver://bob:123@xyz.amazonaws.com:1433/444-555-666?database=db

CDC_EVENT_QUEUE_URI

The second environment variable that needs to be specified is the CDC_EVENT_QUEUE_URI. This tells the CDC engine how to connect to the PolyScale DDN. The format is:

amqps://<cache-id>:<cdc-key>@invalidations.polyscale.global:5671

The cache id and the cdc key can be obtained from PolyScale.

Configuring the database

Before starting the task, the database needs to be configured to support CDC. This configuration is database specific. These changes will require database administration privileges.

Postgres

Configuring Postgres involves setting the wal_level to logical. In RDS this is achieved by going to the database configuration, finding its parameter groups and setting the value of rds.logical_replication to the value 1, and restarting the database.

Each table that is to have CDC coverage needs individual configuration. From a psql client, for each table run:

ALTER TABLE <table-name> replica identity full;

The database user specified in the CDC_CONNECTION_URI environment variable discussed above does not need admin privileges. This user only requires LOGIN and REPLICATION permissions. These can be granted to an existing user or a new user can be created for the purpose:

CREATE ROLE bob LOGIN REPLICATION PASSWORD 'pwd123';

MySql

Configuring Mysql also involves some changes to the database configuration. First, make sure backups are enabled with some non-zero retention period. Then, in the parameter groups set binlog_format = ROW and set binlog_row_image = FULL, and restart the database.

The database user specified in the CDC_CONNECTION_URI environment variable discussed above needs a specific set of privileges. These can be granted to an existing user or a new user can be created for the purpose:

CREATE USER 'bob'@'%' IDENTIFIED BY '123';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'bob'@'%';
FLUSH PRIVILEGES;

MariaDb

Configuration for MariaDB is the same as that for Mysql. However, two additional settings must be specified in the parameter groups. In addition to binlog_format = ROW and binlog_row_image = FULL set log_bin = mysql-bin and expire_logs_days = 1 (or greater).

MS SqlServer

Configuring MS Sql Server is best done from the command line. This requires logging in as an admin. First enable CDC for the database of interest:

USE db-of-interest
GO
EXEC sys.sp_cdc_enable_db
GO

Then each table for which CDC data is required must be enabled:

EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'table-name-here', @role_name = NULL, @supports_net_changes = 0
GO

The default settings for CDC in MS Sql Server are not ideal for cache invalidation. Set these additional settings to improve them:

EXEC sys.sp_cdc_change_job @pollinginterval = 1, @maxscans = 500, @maxtrans = 100
GO
EXEC sys.sp_cdc_stop_job
GO
EXEC sys.sp_cdc_start_job
GO

The database user specified in the CDC engine configuration must be able to read the tables covered by CDC, as well as the tables Sql Server uses to store CDC information. Login as the desired user and run:

USE db-of-interest
GO
EXEC sys.sp_cdc_help_change_data_capture
GO

This procedure returns configuration information for each table in the database that is enabled for CDC and CDC data that the caller is authorized to access. If the result is empty, verify that the user has privileges to access both the capture instance and the CDC tables.

Once that is done, and the ECS task is started, observe its log output. Any error should provide descriptions of what has gone wrong. Two things can go wrong. The CDC engine fails to connect to the database, or the CDC engine fails to connect to the PolyScale DDN. The logging output is designed to help diagnose any problems.