A 3-Stage Synchronization Pipeline Ensuring Real-time MS-SQL↔MongoDB Consistency Without Commercial CDC
Due to budget constraints, we couldn't use a commercial CDC solution. Sharing the process of custom designing a 3-stage architecture: API-level sync → Kafka failure recovery → Batch final consistency verification.
Note: The code in this article has been conceptually rewritten based on actual work experience. It is not associated with the actual company code.
3-Stage Synchronization Pipeline Architecture
Background
While transitioning the ad ranking system from MS-SQL to MongoDB, maintaining data consistency during the transition period—when both systems coexisted—was a major challenge.
We had to ensure both databases always held identical data, but processing transactions atomically across two databases in a distributed environment is much harder than it sounds.
The Problem: Distributed Transactions are Difficult
The Initial Approach and its Limits
The simplest approach was to write to both simultaneously.
// Initial approach (Flawed)
public void updateRanking(RankingData data) {
sqlRepository.update(data); // Update MS-SQL
mongoRepository.update(data); // Update MongoDB
}
The flaw in this method is obvious. What if sqlRepository.update() succeeds but mongoRepository.update() fails? The data in the two DBs diverges. The reverse is also true.
A standard solution for distributed transactions is 2PC (Two-Phase Commit), but it's complex to implement and incurs high performance overhead. There's also the Outbox Pattern utilizing Kafka's transactional producer, but implementing it correctly requires substantial effort.
We Considered CDC, but...
The industry-standard solution is a CDC (Change Data Capture) solution. Tools like Debezium can capture MS-SQL's change logs in real-time and apply them to MongoDB.
However, we faced realistic constraints:
- License costs for commercial CDC solutions.
- Infrastructure constraints requiring the installation of additional CDC agents in the legacy MS-SQL environment.
- The time required for adoption and stabilization.
Ultimately, we decided to design a custom synchronization mechanism within our budget and schedule constraints.
The Solution: 3-Stage Tiered Synchronization Architecture
The core idea was to set up "multiple layers of defense" rather than a "perfect single mechanism."
[Stage 1] API-Level Real-time Sync ← Handles normal cases
↓ On Failure
[Stage 2] Kafka-based Failure Recovery ← Handles temporary outages
↓ If still unreflected
[Stage 3] Batch Final Consistency Check ← Ultimate safety net
Stage 1: API-Level Real-time Synchronization
When a data change API is called, it attempts to reflect the change immediately in MongoDB using the asynchronous processing of Spring Webflux.
@Service
public class RankingService {
public Mono<Void> updateRanking(RankingUpdateRequest request) {
return mongoRepository.update(request.toMongoEntity()) // Attempt MongoDB first
.onErrorResume(e -> {
// Fallback to Kafka on failure
return kafkaPublisher.publish(SYNC_TOPIC, request);
});
}
}
We designated MongoDB as the primary store rather than MS-SQL and attempted to update it first. Since ad ranking serving occurs from MongoDB, updating it first made sense from a business perspective as well.
Stage 2: Kafka-based Failure Recovery
If the MongoDB update fails, the data to be synchronized is published to a Kafka topic.
@KafkaListener(topics = "ranking-sync-retry")
public class SyncConsumer {
public void consume(RankingUpdateRequest request) {
mongoRepository.update(request.toMongoEntity())
.doOnSuccess(v -> log.info("Sync recovered: {}", request.getId()))
.doOnError(e -> log.error("Sync failed again: {}", request.getId(), e))
.subscribe();
}
}
We configured the message key based on a UUID so that processing for the same data routes to the same partition. This guarantees the order of change events for the same entity.
kafkaTemplate.send(ProducerRecord.<String, byte[]>(
SYNC_TOPIC,
request.getEntityId(), // Partition Key = Entity ID
serialize(request)
));
Stage 3: Batch Final Consistency Verification
Stages 1 and 2 handle the vast majority of cases, but in extremely rare scenarios (e.g., Kafka experiencing an outage, consumer down), data inconsistencies might remain.
For this, we developed a Batch Job that periodically compares the two DBs based on MS-SQL's change history table and corrects any discrepancies.
@Scheduled(cron = "0 */30 * * * *") // Runs every 30 minutes
public void reconcile() {
// 1. Query MS-SQL change history (Last 30 minutes)
List<ChangeHistory> recentChanges = changeHistoryRepository.findRecent(Duration.ofMinutes(30));
// 2. Check MongoDB status for each change
recentChanges.forEach(change -> {
mongoRepository.findById(change.getEntityId())
.subscribe(mongoEntity -> {
if (!isConsistent(change, mongoEntity)) {
// 3. Correct if inconsistency is found
log.warn("Inconsistency detected: {}", change.getEntityId());
mongoRepository.update(change.toMongoEntity()).subscribe();
}
});
});
}
Lessons Learned in Operations
Logs Are Crucial
Operational visibility is key in this architecture. By tracking the number of cases that failed in Stage 1 and moved to Kafka, and the number of cases corrected in Stage 3, we could grasp the health of the system.
We tracked the following metrics using Datadog:
ranking.sync.kafka_fallback.count: Number of cases shifted to Kafka due to Stage 1 failure.ranking.sync.reconcile.corrected.count: Number of cases corrected by the Stage 3 batch.
If the number of Stage 3 corrections consistently remains near zero, it means Stages 1 and 2 are functioning well.
The Importance of Idempotent Design
Both the Kafka Consumer and the Batch Job must guarantee idempotency. Processing the same event twice should yield the same result.
// Guaranteeing Idempotency: Designed with overwrite strategy
mongoRepository.update(entity) // upsert method
Results
After introducing this 3-stage architecture:
- A drastic reduction in advertiser CS inquiries caused by data inconsistency.
- Stage 3 correction counts remained almost at zero (Stages 1 and 2 handled most cases).
- Guaranteed near real-time consistency without a commercial CDC solution.
It is not a perfect distributed transaction. The two DBs can differ for an extremely brief moment. However, for a domain like ad rankings where "Eventual Consistency" is acceptable, it was a highly practical solution.