In this article, you will be reading about capturing change data from DBMS.
In any database, the database is being changed to serve the needs of its user. For an example, to update a row, the data table’s respective row data get changed. In some complex cases, some transaction logs, redo logs also get updated. We can recognize inserts, updates deletes of the table data, table creating, altering, dropping, database creating, dropping, some database parameter changes as change data with respect to a database. Most of the database management systems(DBMS) such as MySQL, Oracle, MS SQL Server are having some logs to keep track of these changes (not necessarily all the changes mentioned above). These logs are kept for several purposes such as disaster recovering, change data capturing .
Change Data Capturing(CDC)
Let’s say the user wants to capture the changes of a database. There comes the change data capturing. There are several methods to achieve CDC.
- Polling based CDC.
- Database triggers based CDC.
- Use database features for CDC.
- Log reading based CDC.
Now that you can see the logs aren’t the only way to capture change data.
The simplest way to capture change data is to poll tables for the changes. The CDC program has to periodically read all the data(not necessarily) from the tables and keep track of the current state and produce change data. Using this method, only inserts and updates can be captured. Deletions cannot be captured. Also this is strictly depending on each table’s polling column which is an incremental number column or a timestamp column.
Most of the DBMS support triggers. Triggers can be set-up on tables to append data on another table on changes. The second table can be polled and produce change data including deletions as well. Find this question asked in stack-overflow to get an idea about trigger based data replication.
Although all the DBMS not support change data capturing as a inbuilt feature, some DBMS comes with change data capturing feature. For examples, MS SQL Server, Oracle are supporting CDC as their feature. Mostly transaction change data (inserts, updates and deletes) will be produced by these features.
In log based CDC, the logs which managed by the database are read by the user change data capturing mechanism(program). Logging should be enabled to achieve this method. The database automatically appends logs for each and every transaction, change done on the database. Therefore, enabling logs could reduce the performance of the database. The CDC program can read the append data on the logs and produce change data events. This method is very good approach; in fact this is the best method in my opinion for CDC. Almost every changes which are logged by the DBMS can be captured along with the details. Eg: A update performed on a table can be captured with the row-data before and after the update. Such change data event will look like follows.
Event[op = ‘u’, before = [‘S001’, ‘first name’, ‘last name’], after = [‘S001’, ‘changed first name’, ‘last name’]]
There are libraries to read the change data from different DBMS. For MySQL shyiko binlog reader can be used. Using such library to read logs can simplify the change data capturing program. For an example, Debezium is a platform which can produce change data events from several DBMS; in which it is using shyiko binlog reader to capture MySQL change data.
Change Data Capturing Concepts
In change data capturing, one change data should not be produced more than once. Any produced change data event should represent an actual database change. In order to achieve this non-redundancy, a “last read value pointer” can be maintained. In Debezium, this is called “last offset”. Find more details from the debezium documentation. To keep tract of the last produced change data, file based or database based storage can be used. A log based CDC system can keep track of the pointer to the last read log, a polling based CDC system can keep track of the last read polling columns value. Being a giant in CDC, Debezium supports non-redundant change data with configurable offset storage.
Although directly polling a table for changes is itself a way to capture changes, CDC should not be relying from the existing data and it should not consume the database connection much. It should be a log based approach. Trigger based approach can also be marked as a “bad” approach since the program has to ultimately poll for the changes. However if we don’t want the change data capturing to be perfect, get rid of the complex configuration for the databases and still want the data data to be captured, a polling approach can be used.
If the user is using transactions in the database operations, those changes should not be captured until the transaction is actually committed. In fact the changes are not made permanently on the database unless committed.