Transforming Data Management: The Evolution of Change Data Capture at Airbnb


The need for dynamic data management has become increasingly critical in modern applications. This is particularly true for platforms like Airbnb, where real-time updates are essential for features such as dynamic pricing and availability management.

The Challenge of Evolution

As Airbnb’s services expanded, so did the complexity of managing data across multiple microservices. The challenge was to efficiently propagate meaningful data changes while keeping services decoupled.

Introducing SpinalTap

To address these challenges, Airbnb developed SpinalTap, a robust Change Data Capture (CDC) system designed to detect mutations across various sources with minimal latency. This system has become integral to Airbnb’s infrastructure.

Key Requirements

SpinalTap was built with several key requirements in mind:

  • Lossless: Ensuring zero data loss.
  • Scalable: Capable of handling increasing loads.
  • Performant: Near real-time change propagation.
  • Consistent: Maintaining order and timeline consistency.
  • Fault Tolerant: High availability with redundancy.
  • Extensible: Adaptable to various sources.

Architecture Overview

SpinalTap’s architecture consists of three main components:

  1. Source: Captures change events from data stores.
  2. Destination: Acts as a sink for processed mutations.
  3. Pipe: Manages the workflow between source and destination.
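
To make these roles concrete, here is a minimal, illustrative sketch of how a Source, Destination, and Pipe could fit together. The interface names and method signatures below are assumptions for explanation only, not SpinalTap’s actual API.

```java
// Illustrative sketch of the Source -> Pipe -> Destination workflow.
// Interface names and signatures are assumptions, not SpinalTap's API.
import java.util.List;

interface Mutation {
    String table();            // table the change belongs to
    long sequenceNumber();     // position in the changelog, used for ordering
}

interface Source {
    void open();               // start tailing the data store's changelog
    List<Mutation> poll();     // fetch the next ordered batch of mutations
}

interface Destination {
    void publish(List<Mutation> mutations);   // sink, e.g. a Kafka topic
}

// The Pipe manages the workflow: read from the Source, write to the Destination.
final class Pipe implements Runnable {
    private final Source source;
    private final Destination destination;

    Pipe(Source source, Destination destination) {
        this.source = source;
        this.destination = destination;
    }

    @Override
    public void run() {
        source.open();
        while (!Thread.currentThread().isInterrupted()) {
            List<Mutation> batch = source.poll();
            if (!batch.isEmpty()) {
                // Publish before polling again so downstream order matches the changelog.
                destination.publish(batch);
            }
        }
    }
}
```

In this shape, the Pipe owns the loop and the ordering guarantee: it publishes each batch before polling the next one, which is one way the order and timeline consistency requirement can be preserved end to end.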

Implementation Strategies

Several strategies were considered for building the CDC system:

  • Polling
  • Triggers
  • Dual Writes
  • Audit Trail

SpinalTap ultimately leverages database changelogs for asynchronous change detection.
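
As a concrete illustration of the changelog approach, a MySQL source can tail the binlog with the open-source mysql-binlog-connector-java client. This is only a sketch of the general technique; the host, credentials, and the println handler are placeholders, and SpinalTap’s own source implementations are considerably more involved.

```java
// Sketch: tailing the MySQL binlog with the open-source mysql-binlog-connector-java client.
// Host, port, credentials, and the println handler are placeholders.
import com.github.shyiko.mysql.binlog.BinaryLogClient;
import com.github.shyiko.mysql.binlog.event.EventType;
import com.github.shyiko.mysql.binlog.event.WriteRowsEventData;

import java.util.Arrays;

public class BinlogTailer {
    public static void main(String[] args) throws Exception {
        BinaryLogClient client =
                new BinaryLogClient("localhost", 3306, "replication_user", "secret");

        client.registerEventListener(event -> {
            EventType type = event.getHeader().getEventType();
            // Row-insert events; updates and deletes have analogous event types.
            if (type == EventType.EXT_WRITE_ROWS || type == EventType.WRITE_ROWS) {
                WriteRowsEventData data = event.getData();
                for (Object[] row : data.getRows()) {
                    // In a real pipeline each row would become a mutation published downstream.
                    System.out.println("INSERT row: " + Arrays.toString(row));
                }
            }
        });

        client.connect();   // blocks and streams binlog events as they are committed
    }
}
```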

Use Cases

SpinalTap supports various use cases, including:

  • Cache invalidation (see the sketch after this list)
  • Real-time search indexing
  • Offline processing
  • Signaling between services
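
For instance, the cache invalidation use case boils down to mapping each mutation back to a cache key and evicting it. The mutation and cache types below are illustrative assumptions rather than SpinalTap classes.

```java
// Sketch of a cache-invalidation consumer for CDC mutations.
// RowMutation and the in-memory cache are illustrative, not SpinalTap classes.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

record RowMutation(String table, String primaryKey) {}

final class CacheInvalidator {
    private final Map<String, Object> cache = new ConcurrentHashMap<>();

    // Called for every mutation delivered by the CDC pipeline.
    void onMutation(RowMutation mutation) {
        // Evict the stale entry; the next read repopulates it from the source of truth.
        cache.remove(mutation.table() + ":" + mutation.primaryKey());
    }
}
```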

Conclusion

SpinalTap has significantly enhanced Airbnb’s ability to manage data changes efficiently. Its design allows for seamless integration with existing infrastructure while ensuring reliability and performance.



Introducing DBLog: An Efficient Change-Data-Capture Framework for Real-Time Data Synchronization


Overview

Change-Data-Capture (CDC) is a method for capturing committed changes from a database in real-time and propagating those changes to downstream consumers. It is increasingly used to synchronize multiple heterogeneous datastores, such as MySQL and ElasticSearch, and it addresses challenges associated with traditional techniques like dual-writes and distributed transactions.

In databases like MySQL and PostgreSQL, transaction logs serve as the source of CDC events. However, these logs typically have limited retention, which means they may not contain the full history of changes. To address this limitation, full dumps of the database state are often necessary. Existing open-source CDC solutions frequently share underlying libraries and protocols but have limitations, such as stalling log processing until a dump is complete or blocking write traffic with table locks.

This led to the development of DBLog, a framework designed to process both log and dump events efficiently without impacting database performance.

Key Features of DBLog

  • In-Order Processing: Captured log events are processed in the order they occur.
  • Flexible Dumping: Dumps can be taken at any time across all tables, specific tables, or primary keys.
  • Interleaved Processing: Log and dump events are processed simultaneously in chunks, allowing for continuous log processing even during dumps.
  • No Table Locks: DBLog avoids acquiring locks on tables, preventing interference with write traffic.
  • High Availability: Designed to ensure that downstream consumers receive change events as they occur.

Requirements

DBLog is built on the premise that it must fulfill several requirements for effective data synchronization and event processing:

  1. Capturing Full State: Derived stores must eventually reflect the complete state of the source.
  2. On-Demand Repairs: Dumps should be triggerable at any time for repairs or new consumer setups.
  3. Real-Time Event Availability: The system must maintain high availability for real-time changes even during repairs.
  4. Minimal Database Impact: The connection to the database should minimally affect its performance.
  5. Flexible Output Options: Events can be directed to various outputs, including streams or APIs.
  6. Support for Relational Databases: DBLog is designed to work with common RDBMS like MySQL and PostgreSQL.

Existing Solutions

The team evaluated several open-source CDC solutions, including Maxwell and Debezium. Key limitations identified include:

  • Stopping log event processing during dump operations.
  • Lack of on-demand dump triggering.
  • Blocking write traffic due to table locks.
  • Dependency on database-specific features that limit cross-database compatibility.

DBLog addresses these issues by interleaving log and dump processing, allowing for continuous operation without locking tables.

DBLog Framework Architecture

DBLog is implemented in Java and captures changes in real-time while also facilitating dump operations. Dumps are taken in chunks that interleave with log events, ensuring that both processes can progress without stalling.

Log Processing

DBLog requires databases to emit events for each changed row in real-time. Each event includes a log sequence number, column state at the time of the operation, and applicable schema information. Events are serialized into a specific format and sent to an output writer in non-blocking operations.
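
Conceptually, each captured log event can be modeled as a small value object carrying exactly those fields. The record below is an illustrative sketch, not DBLog’s actual event format.

```java
// Illustrative model of a single CDC log event with the fields described above.
// Field names are assumptions; DBLog's actual event format is not shown here.
import java.util.Map;

enum Operation { CREATE, UPDATE, DELETE }

record LogEvent(
        long logSequenceNumber,          // position in the transaction log, e.g. a binlog offset
        String table,
        Operation operation,
        Map<String, Object> columnState, // column values at the time of the operation
        int schemaVersion                // applicable schema information for deserialization
) {}
```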

Dump Processing

Dumps are necessary because transaction logs cannot fully reconstitute datasets due to their limited retention. DBLog takes dumps in chunks that interleave with log events:

  1. Briefly pause log event processing.
  2. Generate low watermark by updating a dedicated watermark table.
  3. Execute a SELECT statement for the next chunk of data and keep the result in memory.
  4. Generate high watermark by updating the watermark table again after the SELECT completes.
  5. Resume sending log events; once the high watermark appears in the log, drop any chunk rows whose keys were changed by log events between the watermarks, then emit the remaining chunk rows.

This approach ensures that chunk selections do not override recent changes from log events.
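
The following is a compressed sketch of that watermark-based chunk selection, reusing the LogEvent shape from the log-processing section. The Db, LogReader, and Output interfaces, the watermark matching, and the chunk keying are all illustrative assumptions rather than DBLog code.

```java
// Sketch of watermark-based chunk selection, following the numbered steps above.
// Db, LogReader, and Output are illustrative interfaces; this is not DBLog code.
import java.util.LinkedHashMap;
import java.util.Map;

abstract class ChunkDumper {

    interface Db {
        void updateWatermark(String id);                               // insert/update a row in the watermark table
        Map<String, Object[]> selectNextChunk(String table, int size); // chunk rows keyed by primary key
    }

    interface LogReader {
        LogEvent next();                                               // blocking read of the next log event
    }

    interface Output {
        void emitLogEvent(LogEvent event);
        void emitRow(String primaryKey, Object[] row);
    }

    void dumpChunk(Db db, LogReader log, Output out, String table, int chunkSize) {
        // Steps 1-2: log processing is paused at this point; write the low watermark.
        String low = "lw-" + System.nanoTime();
        db.updateWatermark(low);

        // Step 3: select the next chunk and keep it in memory, keyed by primary key.
        Map<String, Object[]> chunk = new LinkedHashMap<>(db.selectNextChunk(table, chunkSize));

        // Step 4: write the high watermark.
        String high = "hw-" + System.nanoTime();
        db.updateWatermark(high);

        // Step 5: resume log processing; log events between the watermarks win over chunk rows.
        boolean insideWindow = false;
        while (true) {
            LogEvent event = log.next();
            if (isWatermark(event, low)) { insideWindow = true; continue; }
            if (isWatermark(event, high)) break;
            if (insideWindow) {
                chunk.remove(primaryKey(event));   // a fresher value for this key was already emitted
            }
            out.emitLogEvent(event);
        }

        // Emit what is left of the chunk before processing events past the high watermark.
        chunk.forEach(out::emitRow);
    }

    abstract boolean isWatermark(LogEvent event, String id);  // does this event come from the watermark table?
    abstract String primaryKey(LogEvent event);               // derive the primary key from the event's column state
}
```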

Database Support

DBLog currently supports MySQL and PostgreSQL by utilizing their respective replication protocols (binlog for MySQL and replication slots for PostgreSQL). The framework’s design allows it to integrate easily with other databases meeting its requirements.

High Availability

DBLog employs an active-passive architecture using Zookeeper for leader election among instances. This setup ensures minimal downtime if one instance fails by allowing another instance to take over seamlessly.
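
On the JVM, one common way to implement this kind of ZooKeeper-based leader election is Apache Curator’s LeaderLatch recipe. The snippet below is a generic sketch of the active-passive pattern with placeholder connection details; it is not DBLog’s actual implementation.

```java
// Sketch of active-passive failover via ZooKeeper leader election (Apache Curator).
// The connection string and ZK path are placeholders; DBLog's internals may differ.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ActivePassiveRunner {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zookeeper:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        try (LeaderLatch latch = new LeaderLatch(zk, "/cdc/leader")) {
            latch.start();
            latch.await();        // block until this instance is elected leader
            runCdcPipeline();     // only the active (leader) instance processes events
        } finally {
            zk.close();
        }
    }

    private static void runCdcPipeline() {
        // Placeholder: start log tailing and dump processing here.
    }
}
```

The passive instance simply blocks on the latch; if the leader dies, ZooKeeper releases its ephemeral node and a standby takes over, which is what keeps downtime minimal.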

Production Usage

Since its introduction in 2018, DBLog has been foundational for Netflix’s MySQL and PostgreSQL connectors, which are used in Delta for datastore synchronization and event processing.


Future Developments

Future enhancements planned for DBLog include:

  • Capturing table schemas without locks.
  • Schema store integration for event schema tracking.
  • Monotonic writes mode to ensure state transitions only move forward.

DBLog is expected to be open-sourced with additional documentation in the future.

Acknowledgments

The development of DBLog involved contributions from several individuals whose efforts were crucial in bringing this framework to fruition.

Cyber Whale is a Moldovan agency specializing in building custom Business Intelligence (BI) systems that empower businesses with data-driven insights and strategic growth.

Let us help you with our BI systems, let us know at [email protected]