Introduction
The Hadoop ecosystem, born in 2006, fueled the big data boom for more than a decade. But times have changed—so have the scenarios and the technologies. The industry’s understanding of data has moved beyond T+1 batch processing and high-throughput, high-latency systems. In today’s real-world applications, real-time, accurate, and dynamic data is more important than ever.
To meet these emerging needs, new frameworks and middleware have proliferated like mushrooms after rain. Hive brought SQL-like accessibility to the otherwise rigid Hadoop ecosystem. HBase and Impala tried to make it faster. Spark and Flink emerged as real-time processing frameworks, enabling data to flow closer to business in real time. Presto and Dremio virtualized real-time access to multiple sources. New OLAP databases like ClickHouse began providing near real-time analysis for massive datasets. Specialized solutions also popped up in areas like time-series and feature data processing.

Unlike traditional commercial software, the real-time data ecosystem has embraced open source. In this world, talk is cheap—show me the code.
At TapData, our own journey implementing real-time solutions made us feel that existing tools often fell short in subtle but critical ways. After delivering many real-world projects and speaking with countless customers, we gradually formed the idea of building our own streaming engine—one that actually works across diverse, real-time business scenarios.

Beyond solving immediate customer problems, we thought it’d be more exciting to turn these insights into a product that could benefit the entire community.
That’s why I started this blog series: to share our practical insights on building real-time data engines. We welcome feedback and discussion as we continue this journey.
The Fresher, the Better
Every real-time computing process begins with data acquisition. While it’s easy to fetch batch data via JDBC or native database drivers, acquiring fresh, real-time data is far less standardized and intuitive.
This is where CDC (Change Data Capture) comes into play. The presence of a dedicated acronym usually implies that the task is anything but simple.
Common CDC Implementation Methods
- Polling

The most straightforward method is to query the database at regular intervals. Its advantages are simplicity and broad applicability: it works with virtually any database you can query, with no special features required. But the drawbacks are just as clear:

- Requires an incremental field or timestamp, making it invasive to business logic
- Minimum latency equals the polling interval
- Adds query load to the database
- Can’t detect deletions or precisely track field-level updates
While easy to implement, polling is often a fallback rather than the first choice in modern real-time frameworks.
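The polling loop described above can be sketched as a small in-memory simulation. This is an illustrative model only (class and field names are invented); a real poller would issue `SELECT * FROM t WHERE updated_at > ? ORDER BY updated_at` over JDBC. Note how the final assertion shows polling’s blind spot: a deleted row simply disappears and never surfaces as a change.

```java
import java.util.*;
import java.util.stream.*;

// Minimal sketch of timestamp-based polling CDC. Rows carry an updatedAt
// version; each poll fetches rows changed since the last high-water mark.
public class PollingCdc {
    public record Row(int id, String value, long updatedAt) {}

    private final List<Row> table = new ArrayList<>();
    private long highWaterMark = 0L;

    public void upsert(Row r) {
        table.removeIf(x -> x.id() == r.id());
        table.add(r);
    }

    public void delete(int id) {
        table.removeIf(x -> x.id() == id);
    }

    // One polling cycle: return rows changed since the last poll,
    // then advance the high-water mark.
    public List<Row> poll() {
        List<Row> changed = table.stream()
                .filter(r -> r.updatedAt() > highWaterMark)
                .sorted(Comparator.comparingLong(Row::updatedAt))
                .collect(Collectors.toList());
        if (!changed.isEmpty()) {
            highWaterMark = changed.get(changed.size() - 1).updatedAt();
        }
        return changed;
    }
}
```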

- Triggers

Many databases support triggers—procedures that run when data is inserted, updated, or deleted. With custom triggers, data changes can be captured and:

- Stored in a separate changelog table
- Pushed to a message queue
- Sent directly to a downstream system via API

Pros:

- Low latency: changes are captured on the write path itself
- Complete capture, including deletions

Cons:

- No standard implementation across databases
- Some databases don’t support triggers at all
- Performance overhead due to added logic during data writes, even if asynchronous
Triggers improve latency and completeness, but introduce new complexity and performance concerns.
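The trigger pattern can be illustrated with a small in-memory model, where a hook fires on every write and appends to a changelog (standing in for a changelog table or message queue). All names here are invented for illustration; real triggers live inside the database, written in its procedural dialect.

```java
import java.util.*;
import java.util.function.*;

// Sketch of trigger-based change capture: every write path also invokes
// a "trigger" callback that records the change in a separate changelog.
public class TriggeredTable {
    public record Change(String op, int id, String value) {}

    private final Map<Integer, String> rows = new HashMap<>();
    public final List<Change> changelog = new ArrayList<>();
    private final Consumer<Change> trigger = changelog::add;

    public void insert(int id, String value) {
        rows.put(id, value);
        trigger.accept(new Change("INSERT", id, value)); // fires on the write path
    }

    public void delete(int id) {
        String old = rows.remove(id);
        trigger.accept(new Change("DELETE", id, old)); // deletions ARE captured
    }
}
```

Unlike polling, the deletion above lands in the changelog together with the old value, which is exactly the completeness triggers buy at the cost of extra work on every write.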
- Database Logs
Most databases maintain internal logs to track data changes, often used for replication or recovery. External tools can tap into these logs to obtain real-time changes, with sub-second latency and minimal performance overhead. Log-based CDC supports more databases than triggers and works as long as replication is enabled.
Log-based CDC has become the go-to method for real-time data frameworks due to its superior performance and completeness. However, its complexity and implementation cost remain high.
- Message Queues
Some systems emit events directly through message queues (Kafka, various MQs), especially for custom business logic. But use cases vary so widely that standardization is difficult.
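Application-level change emission can be sketched as follows, with a `BlockingQueue` standing in for Kafka or another broker. The event schema and names are purely illustrative; in practice every team defines its own, which is precisely why this approach resists standardization.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of application-emitted CDC: the service that performs a write
// also publishes a change event to a queue, in the same code path.
public class QueueCdc {
    public record ChangeEvent(String table, String op, Map<String, Object> payload) {}

    private final BlockingQueue<ChangeEvent> queue = new LinkedBlockingQueue<>();

    // Write path: persist the order (omitted) and publish the event together.
    public void saveOrder(Map<String, Object> order) {
        queue.offer(new ChangeEvent("orders", "INSERT", order));
    }

    // Consumer side: drain events; returns null when the queue is empty.
    public ChangeEvent nextEvent() {
        return queue.poll();
    }
}
```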
The Challenges of Log-Based CDC
Despite its advantages, log-based CDC faces some tough challenges:
- Database Diversity
Each database implements logging differently. APIs and log formats vary widely. Supporting a broad range of databases means custom development for each one—there are no shortcuts. Covering the most common ~50 databases is hard enough. Covering all ~200+? Practically impossible, especially for proprietary systems like DB2, GaussDB, or HANA.
- Version Incompatibility
Even within the same database product, different versions often have incompatible log formats. For example, Oracle versions 8 to 20 or MongoDB 2 to 5 have significant internal differences.
- Deployment Variability
The same database and version can be deployed in different cluster architectures (e.g., MySQL with PXC, Myshard, Mycat; PostgreSQL with GP, XL, XC, Citus; Oracle with DG, RAC; MongoDB with replica sets or sharding), each requiring unique handling.
These dimensions—database type, version, deployment—multiply the effort needed to implement robust log-based CDC.
- Non-Standard Log Formats
Many databases were never designed for real-time streaming; their logs were built for replication and recovery. As a result, key information may be missing.
```js
rs0:PRIMARY> use mock
switched to db mock
rs0:PRIMARY> db.t.insert({a: 1, b: 1})
WriteResult({ "nInserted" : 1 })
rs0:PRIMARY> db.t.remove({})
WriteResult({ "nRemoved" : 1 })
rs0:PRIMARY> use local
switched to db local
rs0:PRIMARY> db.oplog.rs.find({ns: "mock.t"}).pretty()
{
    "op" : "i",
    "ns" : "mock.t",
    "ui" : UUID("9bf0197e-0e59-45d6-b5a1-21726c281afd"),
    "o" : { "_id" : ObjectId("610eba317d24f05b0e9fdb3b"), "a" : 1, "b" : 1 },
    "ts" : Timestamp(1628355121, 2),
    "t" : NumberLong(1),
    "wall" : ISODate("2021-08-07T16:52:01.890Z"),
    "v" : NumberLong(2)
}
{
    "op" : "d",
    "ns" : "mock.t",
    "ui" : UUID("9bf0197e-0e59-45d6-b5a1-21726c281afd"),
    "o" : { "_id" : ObjectId("610eba317d24f05b0e9fdb3b") },
    "ts" : Timestamp(1628355126, 1),
    "t" : NumberLong(1),
    "wall" : ISODate("2021-08-07T16:52:06.191Z"),
    "v" : NumberLong(2)
}
```
Example: In MongoDB, a deletion log only records the document ID. If you’re doing a join based on another field (e.g., a in the oplog output above), the value is lost once the document is deleted. Real-time join operations break because the stream lacks the necessary context.
Logs built for replication ensure consistency—but not completeness. Real-time processing needs full change records, not just minimal diffs.
Existing Solutions
Despite the challenges, the advantages of log-based CDC are too great to ignore. Different approaches have emerged:
To address incomplete logs, many systems use a data cache layer to reconstruct full records. While effective, it increases resource usage. So far, no unified product has emerged—most remain ad hoc solutions for specific use cases.
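A minimal sketch of such a cache layer, assuming MongoDB-style delete events that carry only the document ID (class and method names are illustrative, not any product’s actual API): each insert caches the full document, and when a delete arrives, the cached copy is used to rebuild a complete before-image.

```java
import java.util.*;

// Sketch of log normalization via a cache: delete entries carry only an
// ID, so we cache full documents on insert and use the cache to restore
// the complete record when a delete event arrives.
public class ChangeEnricher {
    private final Map<String, Map<String, Object>> cache = new HashMap<>();

    // op is "i" (insert, doc is the full document) or "d" (delete, doc unused).
    public Map<String, Object> enrich(String op, String id, Map<String, Object> doc) {
        if (op.equals("i")) {
            cache.put(id, doc);
            return doc;
        }
        // Delete: recover the full record from the cache if we have it.
        Map<String, Object> before = cache.remove(id);
        return before != null ? before : Map.of("_id", id);
    }
}
```

With the cached before-image, a downstream join on a non-ID field still sees the values it needs; the trade-off is the memory (or disk) spent keeping recent records around, which is why real systems bound the cache by size and TTL.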
The TapData Approach
TapData combines broad compatibility with essential caching to provide a practical solution.
Compared to Debezium, we’ve heavily optimized performance, achieving parsing speeds several times faster. TapData now supports over 30 database types.
To deal with non-standard logs, we also introduced a flexible storage abstraction. Here’s an example:
```java
CacheConfig cacheConfig = TapCache.config("source-cache")
    .setSize("1g")
    .setTtl("3d");

DataSource<Record> source = TapSources.mongodb("mongodb-source")
    .setHost("127.0.0.1")
    .setPort(27017)
    .setUser("root")
    .setPassword("xxx")
    .withCdc()
    .formatCdc(cacheConfig)
    .build();
```
This builds a full real-time data stream that includes both initial full load and incremental changes, using in-memory caching to normalize change logs.
To downstream systems, this appears as a fresh, clean stream of real-time data.
One Final Question
You might have noticed: although we emit both full and incremental data, our data model doesn’t follow the common split into BatchSource, Record, and ChangeRecord like Flink or Hazelcast Jet. Why?
Stay tuned. Follow our blog (https://tapdata.io) for future posts, where I’ll explore more of the behind-the-scenes details of TapData’s real-time engine architecture.
—By Berry, Co-Founder of TapData