CDC on RDS
PolyScale can employ Change Data Capture (CDC) to perform cache invalidation. General documentation to setup CDC with PolyScale exists here. This section addresses configuration and setup issues specifically for AWS RDS databases.
Configuring the PolyScale CDC-engine
The PolyScale CDC engine is a stand alone application that runs alongside the database. The CDC engine is a thin wrapper around Debezium a highly regarded open source project for doing CDC. The CDC engine will work with Postgres, Mysql, and MS Sql Server databases.
In order to interact with an AWS RDS database instance the CDC engine should be running in the same region, and where possible, in the same availability zone. It can be run as a task from AWS Elastic Container Service (ECS). The ECS cluster should be configured to use EC2 for its Infrastructure. The EC2 instance should run a plain version of Linux (eg. Amazon Linux 2). An instance size of medium or large should be sufficient. Capacity should be 1. An example configuration of the ECS cluster would be:

During the cluster creation process, the network settings are also specified. These should be set so as to allow the cluster to access the RDS instance, as well as to access the public internet.
Once the cluster is created, associate with it a new task. The task infrastructure requirements should be set to EC2 Instances, operating system Linux, and bridge network. All of the cpu and memory of the cluster can be assigned to the task since only this one task will be running. An example of the infrastructure configuration for the task is:

The container associated with the task should point to the public github container registry address:
ghcr.io/polyscale/cdc-engine:latest
and mark the container as essential. The container configuration should look like:

CDC-Engine environment variables
Two environment variables need to be specified for the CDC-engine task. The first is CDC_CONNECTION_URI which tells the CDC engine how to connect to the database. The second is CDC_EVENT_QUEUE_URI which tells the CDC engine how to connect to the PolyScale.
CDC_CONNECTION_URI
The format will depend on the specific database type, but generally looks like standard connection string.
For a postgres database where the username is bob, password is 123, database url is xyz.amazonaws.com, the port is the default 5432, your PolyScale cache id is 444-555-666, and the database name is dbname, then the value of the URI would be:
postgres://bob:123@xyz.amazonaws.con:5432/444-555-666?database=dbname
For a Mysql or a MariaDb database the format is similar except that no database is specified (CDC events for all databases are captured). For example (using the same specifics as above):
mysql://bob:123@xyz.amazonaws.com:3306/444-555-666
For a MS Sql Server database the database name must be specified. An example would be:
sqlserver://bob:123@xyz.amazonaws.com:1433/444-555-666?database=db
CDC_EVENT_QUEUE_URI
The second environment variable that needs to be specified is the CDC_EVENT_QUEUE_URI. This tells the CDC engine how to connect to the PolyScale DDN. The format is:
amqps://<cache-id>:<cdc-key>@invalidations.polyscale.global:5671
The cache id and the cdc key can be obtained from PolyScale.
Configuring the database
Before starting the task, the database needs to be configured to support CDC. This configuration is database specific. These changes will require database administration privileges.
Postgres
Configuring Postgres involves setting the wal_level
to logical
. In RDS this is achieved by going to the database configuration, finding its parameter groups and setting the value of rds.logical_replication
to the value 1
, and restarting the database.
Each table that is to have CDC coverage needs individual configuration. From a psql
client, for each table run:
ALTER TABLE <table-name> replica identity full;
The database user specified in the CDC_CONNECTION_URI environment variable discussed above does not need admin privileges. This user only requires LOGIN and REPLICATION permissions. These can be granted to an existing user or a new user can be created for the purpose:
CREATE ROLE bob LOGIN REPLICATION PASSWORD 'pwd123';
MySql
Configuring Mysql also involves some changes to the database configuration. First, make sure backups are enabled with some non-zero retention period. Then, in the parameter groups set binlog_format = ROW
and set binlog_row_image = FULL
, and restart the database.
The database user specified in the CDC_CONNECTION_URI environment variable discussed above needs a specific set of privileges. These can be granted to an existing user or a new user can be created for the purpose:
CREATE USER 'bob'@'%' IDENTIFIED BY '123';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'bob'@'%';
FLUSH PRIVILEGES;
MariaDb
Configuration for MariaDB is the same as that for Mysql. However, two additional settings must be specified in the parameter groups. In addition to binlog_format = ROW
and binlog_row_image = FULL
set log_bin = mysql-bin
and expire_logs_days = 1
(or greater).
MS SqlServer
Configuring MS Sql Server is best done from the command line. This requires logging in as an admin. First enable CDC for the database of interest:
USE db-of-interest
GO
EXEC sys.sp_cdc_enable_db
GO
Then each table for which CDC data is required must be enabled:
EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'table-name-here', @role_name = NULL, @supports_net_changes = 0
GO
The default settings for CDC in MS Sql Server are not ideal for cache invalidation. Set these additional settings to improve them:
EXEC sys.sp_cdc_change_job @pollinginterval = 1, @maxscans = 500, @maxtrans = 100
GO
EXEC sys.sp_cdc_stop_job
GO
EXEC sys.sp_cdc_start_job
GO
The database user specified in the CDC engine configuration must be able to read the tables covered by CDC, as well as the tables Sql Server uses to store CDC information. Login as the desired user and run:
USE db-of-interest
GO
EXEC sys.sp_cdc_help_change_data_capture
GO
This procedure returns configuration information for each table in the database that is enabled for CDC and CDC data that the caller is authorized to access. If the result is empty, verify that the user has privileges to access both the capture instance and the CDC tables.
Once that is done, and the ECS task is started, observe its log output. Any error should provide descriptions of what has gone wrong. Two things can go wrong. The CDC engine fails to connect to the database, or the CDC engine fails to connect to the PolyScale DDN. The logging output is designed to help diagnose any problems.