PolyScale can employ Change Data Capture (CDC) to perform cache invalidation. General documentation to setup CDC with PolyScale exists here. This section addresses configuration and setup issues specifically for AWS RDS databases.
Configuring the PolyScale CDC-engine
The PolyScale CDC engine is a stand alone application that runs alongside the database. The CDC engine is a thin wrapper around Debezium a highly regarded open source project for doing CDC. The CDC engine will work with Postgres, Mysql, and MS Sql Server databases.
In order to interact with an AWS RDS database instance the CDC engine should be running in the same region, and where possible, in the same availability zone. It can be run as a task from AWS Elastic Container Service (ECS). The ECS cluster should be configured to use EC2 for its Infrastructure. The EC2 instance should run a plain version of Linux (eg. Amazon Linux 2). An instance size of medium or large should be sufficient. Capacity should be 1. An example configuration of the ECS cluster would be:
During the cluster creation process, the network settings are also specified. These should be set so as to allow the cluster to access the RDS instance, as well as to access the public internet.
Once the cluster is created, associate with it a new task. The task infrastructure requirements should be set to EC2 Instances, operating system Linux, and bridge network. All of the cpu and memory of the cluster can be assigned to the task since only this one task will be running. An example of the infrastructure configuration for the task is:
The container associated with the task should point to the public github container registry address:
and mark the container as essential. The container configuration should look like:
CDC-Engine environment variables
Two environment variables need to be specified for the CDC-engine task. The first is CDC_CONNECTION_URI which tells the CDC engine how to connect to the database. The second is CDC_EVENT_QUEUE_URI which tells the CDC engine how to connect to the PolyScale.
The format will depend on the specific database type, but generally looks like standard connection string.
For a postgres database where the username is bob, password is 123, database url is xyz.amazonaws.com, the port is the default 5432, your PolyScale cache id is 444-555-666, and the database name is dbname, then the value of the URI would be:
For a Mysql or a MariaDb database the format is similar except that no database is specified (CDC events for all databases are captured). For example (using the same specifics as above):
For a MS Sql Server database the database name must be specified. An example would be:
The second environment variable that needs to be specified is the CDC_EVENT_QUEUE_URI. This tells the CDC engine how to connect to the PolyScale DDN. The format is:
The cache id and the cdc key can be obtained from PolyScale.
Configuring the database
Before starting the task, the database needs to be configured to support CDC. This configuration is database specific. These changes will require database administration privileges.
Configuring Postgres involves setting the
logical. In RDS this is achieved by going to the database configuration, finding its parameter groups and setting the value of
rds.logical_replication to the value
1, and restarting the database.
Each table that is to have CDC coverage needs individual configuration. From a
psql client, for each table run:
ALTER TABLE <table-name> replica identity full;
The database user specified in the CDC_CONNECTION_URI environment variable discussed above does not need admin privileges. This user only requires LOGIN and REPLICATION permissions. These can be granted to an existing user or a new user can be created for the purpose:
CREATE ROLE bob LOGIN REPLICATION PASSWORD 'pwd123';
Configuring Mysql also involves some changes to the database configuration. First, make sure backups are enabled with some non-zero retention period. Then, in the parameter groups set
binlog_format = ROW and set
binlog_row_image = FULL, and restart the database.
The database user specified in the CDC_CONNECTION_URI environment variable discussed above needs a specific set of privileges. These can be granted to an existing user or a new user can be created for the purpose:
CREATE USER 'bob'@'%' IDENTIFIED BY '123';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'bob'@'%';
Configuration for MariaDB is the same as that for Mysql. However, two additional settings must be specified in the parameter groups. In addition to
binlog_format = ROW and
binlog_row_image = FULL set
log_bin = mysql-bin and
expire_logs_days = 1 (or greater).
Configuring MS Sql Server is best done from the command line. This requires logging in as an admin. First enable CDC for the database of interest:
Then each table for which CDC data is required must be enabled:
EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'table-name-here', @role_name = NULL, @supports_net_changes = 0
The default settings for CDC in MS Sql Server are not ideal for cache invalidation. Set these additional settings to improve them:
EXEC sys.sp_cdc_change_job @pollinginterval = 1, @maxscans = 500, @maxtrans = 100
The database user specified in the CDC engine configuration must be able to read the tables covered by CDC, as well as the tables Sql Server uses to store CDC information. Login as the desired user and run:
This procedure returns configuration information for each table in the database that is enabled for CDC and CDC data that the caller is authorized to access. If the result is empty, verify that the user has privileges to access both the capture instance and the CDC tables.
Once that is done, and the ECS task is started, observe its log output. Any error should provide descriptions of what has gone wrong. Two things can go wrong. The CDC engine fails to connect to the database, or the CDC engine fails to connect to the PolyScale DDN. The logging output is designed to help diagnose any problems.