This feature is being actively developed as a technology preview under
beta and is subject to change.
Change Data Capture (CDC) is a built-in feature of modern database systems that tracks the changes that occur in the database and streams a representation of those changes to consumer processes. CDC forms the basis for building read replicas and provides exact and targeted information for doing cache invalidation.
CDC Based Invalidation
CDC streams are supported as a primary source for global invalidations, that is, to notify the cache of highly targeted cache invalidations, as that data has changed.
This method of invalidation is ideal for use cases where updates to the database are out of band to PolyScale i.e. Smart Invalidation cannot be used as the data updates do not pass through PolyScale.
The PolyScale CDC-engine is based on the well regarded open source software Debezium. It is packaged as a docker container which is intended to run as close as possible to your database. The currently supported databases are: MySql, MariaDb, Postgres, and Microsoft Sql Server. The documentation on the Debezium site is excellent and a good understanding of the low level technical details of CDC can be had looking there.
The CDC-engine is packaged as a container and as such can be run in most environments. It works by reading from database logs and so should be placed in close proximity to the database. Either on the database server itself, or at least within the same network.
For the specific case of an AWS RDS database, additional documentation can be found here.
The CDC-engine requires two configuration parameters. The first of these is a connection URI to the database. This string specifies the database type, the database username and password, the database host and port, and the PolyScale cache id to associate the invalidation information with. An example would be:
The second configuration parameter is the connection URI to PolyScale. This needs to be provided by PolyScale, but will have a form similar to this example:
Configuring the CDC-engine is simple, however, the database must also be configured to support CDC logging. The method for configuring the database is, not surprisingly, database dependent.
MySql CDC configuration
In order to support CDC operations a MySql database has to have a
binlog_format set equal to
ROW, and also
binlog_row_image set equal to
FULL. These settings can be made by an admin user using the following
SET GLOBAL binlog_format='ROW';
SET GLOBAL binlog_row_image='FULL';
A server restart will be required for these settings to take effect. If your database is an AWS RDS instance, it is necessary to have backups enabled with a non-zero retention period in order to set these via the AWS interface.
The database user specified in the CDC-engine configuration requires specific permissions. These can be granted to an existing user, or a new user with the appropriate permissions can be created. An example of creating a new user would be:
CREATE USER 'cdcuser'@'%' IDENTIFIED BY 'pwd123';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdcuser'@'%';
Configuring MariaDb is very similar to configuring MySql. However, the
must be set in a configuration file. Adding a file to the configuration file folder
cdc.cnf with the contents:
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
expire_logs_days = 1
accomplishes the necessary configuration. The database user configuration is the same as that for MySql.
Configuring postgres involves activating the appropriate logging and also specifying which tables should be monitored for changes. The logging is activated by a database admin running:
ALTER SYSTEM SET wal_level = logical;
For each table that is to have CDC coverage, and admin must run:
alter table <table-name> replica identity full;
for example to have CDC coverage for a table named
flowers, the command:
alter table flowers replica identity full;
must be run.
The database user specified in the CDC-engine configuration must
REPLICATION permissions. These can be granted to
an existing user or a new user can be created for the purpose:
CREATE ROLE joe LOGIN REPLICATION PASSWORD 'pwd123';
Configuring Sql Server for CDC is best accomplished using the Sql Server Management Studio. The instructions for doing so are described here.
Similarly to postgres, CDC must be enabled for each table that is to be covered. This also can be done using the Sql Server Management Studio as described here.
The default settings for CDC in Sql Server are not optimal for performing cache invalidations. In order to improve them, the following settings are recommended (which :
EXEC sys.sp_cdc_change_job @pollinginterval = 1, @maxscans = 500, @maxtrans = 100
These settings will speed up the rate at which the CDC information is sent to PolyScale.
The database user specified in the CDC-engine configuration must be able to read the tables that are covered by CDC as well as the Sql Server CDC tables themselves. Sql Server provides a stored procedure to test whether a user has sufficient permissions for CDC. Login as the user and run:
MyDB is the database of interest. This procedure returns configuration information for each table in the database that is enabled for CDC and that contains change data that the caller is authorized to access. If the result is empty, verify that the user has privileges to access both the capture instance and the CDC tables.