Overview
Change-Data-Capture (CDC) is a method for capturing committed changes from a database in real time and propagating those changes to downstream consumers. It is increasingly used to synchronize multiple heterogeneous datastores, such as MySQL and Elasticsearch, addressing challenges associated with traditional techniques like dual-writes and distributed transactions.

In databases like MySQL and PostgreSQL, transaction logs serve as the source of CDC events. However, these logs typically have limited retention, which means they may not contain the full history of changes. To address this limitation, full dumps of the database state are often necessary. Existing open-source CDC solutions frequently share underlying libraries and protocols but have limitations, such as stalling log processing until a dump is complete or blocking write traffic with table locks. This led to the development of DBLog, a framework designed to process both log and dump events efficiently without impacting database performance.
Key Features of DBLog
- In-Order Processing: Captured log events are processed in the order they occur.
- Flexible Dumping: Dumps can be taken at any time across all tables, specific tables, or primary keys.
- Interleaved Processing: Log and dump events are processed in an interleaved fashion, with dumps taken in chunks, so log processing continues even while a dump is in progress.
- No Table Locks: DBLog avoids acquiring locks on tables, preventing interference with write traffic.
- High Availability: Designed to ensure that downstream consumers receive change events as they occur.
Requirements
DBLog is built on the premise that it must fulfill several requirements for effective data synchronization and event processing:
- Capturing Full State: Derived stores must eventually reflect the complete state of the source.
- On-Demand Repairs: Dumps should be triggerable at any time for repairs or new consumer setups.
- Real-Time Event Availability: The system must maintain high availability for real-time changes even during repairs.
- Minimal Database Impact: The connection to the database should minimally affect its performance.
- Flexible Output Options: Events can be directed to various outputs, including streams or APIs.
- Support for Relational Databases: DBLog is designed to work with common RDBMS like MySQL and PostgreSQL.
Existing Solutions
The team evaluated several open-source CDC solutions, including Maxwell and Debezium. Key limitations identified include:
- Stopping log event processing during dump operations.
- Lack of on-demand dump triggering.
- Blocking write traffic due to table locks.
- Dependency on database-specific features that limit cross-database compatibility.
DBLog addresses these issues by interleaving log and dump processing, allowing for continuous operation without locking tables.
DBLog Framework Architecture
DBLog is implemented in Java and captures changes in real-time while also facilitating dump operations. Dumps are taken in chunks that interleave with log events, ensuring that both processes can progress without stalling.
Log Processing
DBLog requires databases to emit an event for each changed row in real time, in commit order. Each event includes a log sequence number, the column state at the time of the operation, and the applicable schema information. Events are serialized into a common format and sent to the output writer in a non-blocking operation.
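A minimal sketch of what such an event and a non-blocking output writer might look like; the field names and types here are illustrative assumptions, not DBLog's actual API:

```java
import java.util.Map;

// Illustrative shapes only: a sequence number for ordering, the
// operation type, the row state, and schema metadata.
enum Operation { CREATE, UPDATE, DELETE }

record ChangeEvent(
        long sequenceNumber,             // position in the transaction log, used for ordering
        Operation operation,
        String table,
        int schemaVersion,
        Map<String, Object> rowState) {} // column values at the time of the operation

// The log reader hands events to the writer without blocking,
// e.g. the writer buffers them for a stream or an API downstream.
interface OutputWriter {
    void write(ChangeEvent event);
}
```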
Dump Processing
Dumps are necessary because transaction logs cannot fully reconstitute datasets due to their limited retention. DBLog takes dumps in chunks that interleave with log events:
- Briefly pause log event processing.
- Generate a low watermark by updating a dedicated watermark table.
- Execute a SELECT statement for the next chunk of data and hold the result set in memory.
- Generate a high watermark with another update of the watermark table.
- Resume log event processing; chunk rows whose primary keys appear in log events between the low and high watermarks are discarded before the remaining rows are sent to the output.
This approach ensures that a chunk selection never overwrites newer state already captured from log events; a condensed sketch of one chunk iteration follows.
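The sketch below condenses the steps above into one method. The collaborating interfaces and helper names are hypothetical stand-ins for DBLog's log reader, watermark table access, and chunk selection, not its actual code:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

final class ChunkProcessor {
    // Hypothetical collaborators; all names are assumptions.
    interface LogReader {
        void pause();
        void resume();
        Set<Long> primaryKeysBetween(UUID low, UUID high); // keys changed between the watermarks
    }
    interface Source {
        UUID writeWatermark();                    // UPDATE on the dedicated watermark table
        List<Row> selectNextChunk(long lastKey);  // next chunk, ordered by primary key
    }
    record Row(long primaryKey, Map<String, Object> columns) {}

    List<Row> nextChunk(LogReader log, Source db, long lastKey) {
        log.pause();                                    // 1. briefly pause log processing
        UUID low = db.writeWatermark();                 // 2. low watermark
        List<Row> chunk = db.selectNextChunk(lastKey);  // 3. chunk SELECT, held in memory
        UUID high = db.writeWatermark();                // 4. high watermark
        log.resume();                                   // 5. resume log processing

        // Rows changed between the watermarks are newer in the log than
        // in the chunk, so drop them: the chunk must never overwrite
        // state that log events have already captured.
        Set<Long> changed = new HashSet<>(log.primaryKeysBetween(low, high));
        chunk.removeIf(row -> changed.contains(row.primaryKey()));
        return chunk;                                   // remaining rows are safe to send
    }
}
```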
Database Support
DBLog currently supports MySQL and PostgreSQL by utilizing their respective replication protocols (binlog for MySQL and replication slots for PostgreSQL). The framework’s design allows it to integrate easily with other databases meeting its requirements.
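As an illustration, a PostgreSQL logical replication slot can be created through JDBC. The connection details, slot name, and choice of wal2json as the logical decoding output plugin are assumptions for this example:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public final class CreateSlot {
    public static void main(String[] args) throws Exception {
        // Creates a logical replication slot that a CDC consumer can
        // read change events from; wal2json is one common output plugin.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'wal2json')");
        }
    }
}
```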
High Availability
DBLog employs an active-passive architecture using Zookeeper for leader election among instances. This setup ensures minimal downtime if one instance fails by allowing another instance to take over seamlessly.
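A minimal sketch of how such leader election can look with Apache Curator's LeaderLatch recipe on top of ZooKeeper; the ensemble address and latch path are assumptions, and this is not DBLog's actual code:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public final class ActivePassive {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",           // ZooKeeper ensemble (assumed)
                new ExponentialBackoffRetry(1000, 3));  // retry policy
        client.start();

        try (LeaderLatch latch = new LeaderLatch(client, "/dblog/leader")) {
            latch.start();
            latch.await();  // blocks until this instance is elected leader
            // Only the leader processes log and dump events; if it dies,
            // its latch is released and a passive instance takes over.
            runConnector();
        }
    }

    private static void runConnector() { /* process log and dump events */ }
}
```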
Production Usage
Since its introduction in 2018, DBLog has been foundational for Netflix's MySQL and PostgreSQL connectors, which are used in Delta for datastore synchronization and event processing.
Future Developments
Future enhancements planned for DBLog include:
- Capturing table schemas without locks.
- Schema store integration for event schema tracking.
- Monotonic writes mode to ensure state transitions only move forward.
DBLog is expected to be open-sourced with additional documentation in the future.
Acknowledgments
The development of DBLog involved contributions from several individuals whose efforts were crucial in bringing this framework to fruition.