Migrating Ad Settlement Data to Databricks — Building a Fail-proof Pipeline
A journey of migrating the Hue/Hadoop-based ad settlement pipeline to Databricks to resolve recurring nightly batch failures, ensuring both data accuracy and operational stability.
Note: The code in this article has been conceptually rewritten based on actual work experience. It is not associated with the actual company code.
Pipeline Architecture
Introduction
Advertising settlement data represents actual money changing hands. Even a single cent of error can lead to a loss of trust with advertisers. However, the critical batch process handling this data was failing every single night.
This post shares the journey of migrating the CPC (Cost Per Click) ad Bill/Pay settlement data and ranking snapshot data from a Hue/Hadoop based environment to Databricks at the Gmarket AdTech team.
The Problem: Daily Manual Re-executions at Dawn
Legacy Environment
- Bill/Pay Settlement Data: Processed via
Hue/Hadoopbased batches - Ranking Snapshot Data: Processed separately via
MongoDBaggregations - Combining the two: Manually coordinated in a separate batch
The core issue was a lack of resources in the Hadoop infrastructure. Sharing cluster resources with other teams' batches meant that during certain time windows, the ad settlement batch couldn't secure enough resources and frequently failed due to timeouts.
When the batch failed, someone had to wake up at dawn to manually re-execute it.
Company-wide Transition
Around the same time, the company decided to adopt Databricks and gradually phase out Hadoop-based processing. The ad team's pipeline needed to align with this transition.
The Core Challenge: Verifying Settlement Data Accuracy
Simply migrating the same logic to Databricks is not inherently difficult. The real challenge was verifying that the migrated results exactly matched the legacy results.
Given the nature of settlement data, the following three aspects had to match perfectly:
- Record Count Match: Are the processed ad click/impression counts identical?
- Total Amount Match: Is the total sum of billing/deduction amounts the same?
- Key-level Match: Are the results identical per advertiser, per product, and per period?
Verification Strategy
[MS-SQL Raw Settlement Data]
↓
[Databricks Pipeline]
↓ ↓
[Intermediate] [Final Result]
↓ ↓
[Verify against Legacy Hadoop Results]
We designed step-by-step verification checkpoints so that if an issue occurred, we could immediately trace which stage caused the discrepancy.
Rebuilding the Pipeline
Step 1: Fully Understanding the Legacy Flow
We spent the most time on this initial phase before the actual migration.
- Analyzed existing Hadoop job SQL queries in the
Hueconsole - Identified relationships and data characteristics of the
MS-SQLsource tables - Traced how the
MongoDBranking snapshot data was used in settlement calculations - Documented the processing order and dependencies of billing and deduction data
Since the legacy code lacked documentation, we reverse-engineered the flow by executing actual queries.
Step 2: Constructing the Databricks Pipeline
# Load source data
raw_billing = spark.read \
.format("jdbc") \
.option("url", mssql_url) \
.option("dbtable", "CPC_BILLING_RAW") \
.load()
# Create intermediate aggregation table
agg_by_ad = raw_billing \
.groupBy("ad_id", "advertiser_id", "date") \
.agg(
F.sum("charge_amount").alias("total_charge"),
F.count("*").alias("click_count")
)
# Verification: Intermediate aggregation stage
validate_checkpoint(agg_by_ad, expected_total_charge)
# Generate final settlement result
final_bill = agg_by_ad \
.join(advertiser_budget, on="advertiser_id") \
.withColumn("net_charge", F.least("total_charge", "remaining_budget"))
Step 3: Step-by-step Verification Structure
def validate_checkpoint(df, checkpoint_name):
result = df.agg(
F.sum("total_charge").alias("sum_charge"),
F.count("*").alias("row_count")
).collect()[0]
expected = get_expected_from_legacy(checkpoint_name)
if abs(result.sum_charge - expected.sum_charge) > TOLERANCE:
raise DataValidationError(
f"[{checkpoint_name}] Amount mismatch: "
f"actual={result.sum_charge}, expected={expected.sum_charge}"
)
log.info(f"[{checkpoint_name}] Verification passed: {result.row_count} rows, {result.sum_charge} KRW")
Thanks to this structure, we could quickly catch and precisely locate various aggregation discrepancies discovered early in the migration (e.g., differences in timezone conversion logic, decimal point handling).
Migration Results
After migrating to Databricks:
- 0 Batch Failures: No resource contention thanks to dedicated cluster resources.
- Eliminated Manual Dawn Re-executions: Removed the burden of night-time standby for the person in charge.
- Improved Settlement Data Reliability via Checkpoints: Automated detection of aggregation errors.
For advertising settlement data, accuracy and stability are far more critical than speed. Through this migration, we successfully secured both.
Conclusion: The Most Important Aspect of Legacy Migration
The biggest lesson learned from migrating data pipelines is that a migration without a verification structure is not a migration; it's a gamble.
Especially for data directly tied to money like settlement data, you must be able to mathematically prove that the new system's results are identical to the legacy system. We spent more than half of our total migration effort writing that verification code, but it was the right investment.