Change Data Capture (CDC)
This feature is being actively developed as a technology preview under beta
and is subject to change.
This documentation page is being actively developed and as such is work in progress. If you are interested in trying the PolyScale CDC integration, please contact us
Overview
Change Data Capture (CDC) is a built-in feature of modern database systems that tracks the changes that occur in the database and streams a representation of those changes to consumer processes. CDC forms the basis for building read replicas and provides exact and targeted information for doing cache invalidation.
CDC Based Invalidation
CDC streams are supported as a primary source for global invalidations, that is, to notify the PolyScale Data Delivery Network of highly targeted cache invalidations, as that data has changed.
This method of invalidation is ideal for use cases where updates to the database are out of band to PolyScale i.e. Smart Invalidation cannot be used as the data updates do not pass through PolyScale.
Debezium
The PolyScale CDC-engine is based on the well regarded open source software Debezium. It is packaged as a docker container which is intended to run as close as possible to your database. The currently supported databases are: MySql, MariaDb, Postgres, and Microsoft Sql Server.
Installation
The CDC-engine is packaged as a container and as such can be run in most environments. It works by reading from database logs and so should be placed in close proximity to the database. Either on the database server itself, or at least within the same network.
The CDC-engine requires two configuration parameters. The first of these is a connection URI to the database. This string specifies the database type, the database username and password, the database host and port, and the PolyScale cache id to associate the invalidation information with. An example would be:
CDC_CONNECTION_URI=postgres://ben:pwd123@192.168.0.12:5432/ab3d96fa-9823-0915-c654-fcc92a38c13f
The second configuration parameter is the connection URI to the PolyScale Data Delivery Network. This needs to be provided by PolyScale, but will have a form similar to this example:
CDC_EVENT_QUEUE_URI=amqps://pce_2342a33c342:fc92413ba@cdc.polyscale.ai:5643
Configuring the CDC-engine is simple, however, the database must also be configured to support CDC logging. The method for configuring the database is, not surprisingly, database dependent.
MySql CDC configuration
In order to support CDC operations a MySql database has to have a binlog_format
set equal to ROW
, and also
the binlog_row_image
set equal to FULL
. These settings can be made by an admin user using the following
commands:
SET GLOBAL binlog_format='ROW';
SET GLOBAL binlog_row_image='FULL';
A server restart will be required for these settings to take effect. If your database is an AWS RDS instance, it is necessary to have backups enabled with a non-zero retention period in order to set these via the AWS interface.
The database user specified in the CDC-engine configuration requires specific permissions. These can be granted to an existing user, or a new user with the appropriate permissions can be created. An example of creating a new user would be:
CREATE USER 'cdcuser'@'%' IDENTIFIED BY 'pwd123';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'cdcuser'@'%';
FLUSH PRIVILEGES;
MariaDb
Configuring MariaDb is very similar to configuring MySql. However, the binlog_format
and binlog_row_image
must be set in a configuration file. Adding a file to the configuration file folder
(typically /etc/mysql/conf.d
), say cdc.cnf
with the contents:
[mariadb]
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
expire_logs_days = 1
accomplishes the necessary configuration. The database user configuration is the same as that for MySql.
Postgres
Configuring postgres involves activating the appropriate logging and also specifying which tables should be monitored for changes. The logging is activated by a database admin running:
ALTER SYSTEM SET wal_level = logical;
For each table that is to have CDC coverage, and admin must run:
alter table <table-name> replica identity full;
for example to have CDC coverage for a table named flowers
, the command:
alter table flowers replica identity full;
must be run.
The database user specified in the CDC-engine configuration must
have LOGIN
and REPLICATION
permissions. These can be granted to
an existing user or a new user can be created for the purpose:
CREATE ROLE joe LOGIN REPLICATION PASSWORD 'pwd123';
Sql Server
Configuring Sql Server for CDC is best accomplished using the Sql Server Management Studio. The instructions for doing so are described here.
Similarly to postgres, CDC must be enabled for each table that is to be covered. This also can be done using the Sql Server Management Studio as described here.
The default settings for CDC in Sql Server are not optimal for performing cache invalidations. In order to improve them, the following settings are recommended (which :
EXEC sys.sp_cdc_change_job @pollinginterval = 1, @maxscans = 500, @maxtrans = 100
GO
EXEC sys.sp_cdc_stop_job
GO
EXEC sys.sp_cdc_start_job
GO
These settings will speed up the rate at which the CDC information is sent to PolyScale.
The database user specified in the CDC-engine configuration must be able to read the tables that are covered by CDC as well as the Sql Server CDC tables themselves. Sql Server provides a stored procedure to test whether a user has sufficient permissions for CDC. Login as the user and run:
USE MyDB;
GO
EXEC sys.sp_cdc_help_change_data_capture
GO
where MyDB
is the database of interest. This procedure returns configuration information for each table in the database that is enabled for CDC and that contains change data that the caller is authorized to access. If the result is empty, verify that the user has privileges to access both the capture instance and the CDC tables.