Resolving Message Distribution Imbalance Caused by Kafka StickyPartitioner Bug
A bug in the StickyPartitioner in Kafka Client versions earlier than 3.3 caused messages to concentrate in a few partitions. This post walks through the root cause analysis and the fix.
The Problem
While improving the CPC Ad Ranking System on the Gmarket AdTech team, we discovered that ranking update messages published to Kafka were heavily concentrated in specific partitions.
We expected messages to be spread evenly across partitions. Instead, they piled up in partitions 0 and 2, while the remaining partitions were almost empty.
Partition 0: ████████████ 45%
Partition 1: ██ 5%
Partition 2: ████████████ 48%
Partition 3: ▏ 2%
Root Cause Analysis
The StickyPartitioner Bug
Starting from Kafka Client 2.4, the default partitioner's strategy for records without a key changed from round-robin to sticky partitioning (referred to here as the StickyPartitioner). The StickyPartitioner improves batching efficiency by sending messages to the same partition until a batch is filled, and only then moving to another partition.
However, in Kafka Client versions earlier than 3.3, the StickyPartitioner has a bug that skews messages toward specific partitions under certain conditions, notably a low linger.ms setting combined with slight broker latency.
// Simplified internal logic of the buggy StickyPartitioner
// The partition() method is called twice under certain conditions,
// causing messages to be distributed only to even/odd partitions.
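To make the trigger conditions above more concrete, here is a toy model of how slight broker latency could translate into skew. It is a sketch, not the client's actual code. It assumes that one record arrives per tick, that a partition's batch stays open for `drainTicks[p]` ticks before the broker accepts it, and that the partitioner moves to a random other partition only when a new batch is needed. The names `StickySkewModel`, `drainTicks`, and `simulate` are invented for illustration.

```java
import java.util.Random;

class StickySkewModel {
    /**
     * Toy model of sticky partitioning with linger.ms = 0.
     * drainTicks[p] is a hypothetical number of ticks a batch for partition p
     * stays open before it is sent (larger value = slower broker). Must be >= 1.
     */
    public static long[] simulate(int[] drainTicks, int records, long seed) {
        Random random = new Random(seed);
        int n = drainTicks.length;
        long[] counts = new long[n];
        int current = random.nextInt(n);
        int remaining = drainTicks[current];
        for (int i = 0; i < records; i++) {
            if (remaining == 0) {
                // The open batch was sent, so a new batch is needed:
                // stick to a different partition chosen uniformly at random.
                int next = random.nextInt(n - 1);
                if (next >= current) {
                    next++;
                }
                current = next;
                remaining = drainTicks[current];
            }
            counts[current]++; // one record per tick lands on the sticky partition
            remaining--;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumption: partitions 0 and 2 sit on slightly slower brokers.
        int[] drainTicks = {8, 1, 8, 1};
        int records = 100_000;
        long[] counts = simulate(drainTicks, records, 42L);
        for (int p = 0; p < counts.length; p++) {
            System.out.printf("Partition %d: %.1f%%%n", p, 100.0 * counts[p] / records);
        }
    }
}
```

In this model each partition is visited about equally often, but a slower partition collects more records per visit, so its share grows roughly in proportion to its delay. The real producer is more involved, so treat this only as intuition for why low linger.ms plus uneven latency can concentrate traffic.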
Additionally, the RoundRobinPartitioner in Kafka Client 2.4 and above also has a bug where the partition() method is called twice, resulting in distribution only to even or odd partitions.
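The even/odd effect of a doubled partition() call can be reproduced with a minimal sketch. It assumes a counter-based round-robin choice and the worst case where every record needs a new batch, so the send path asks for a partition twice and keeps the second answer. The class `RoundRobinDoubleCall` and its methods are hypothetical stand-ins, not the Kafka classes themselves.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

class RoundRobinDoubleCall {
    private final AtomicInteger counter = new AtomicInteger(0);
    private final int numPartitions;

    RoundRobinDoubleCall(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Counter-based round-robin: each call advances the counter by one. */
    int partition() {
        int next = counter.getAndIncrement();
        return (next & 0x7fffffff) % numPartitions; // keep the value non-negative
    }

    /**
     * Counts records per partition. With doubleCall = true, partition() is
     * invoked twice per record and only the second result is used
     * (assumption: every record triggers a new batch).
     */
    public static int[] simulate(int numPartitions, int records, boolean doubleCall) {
        RoundRobinDoubleCall partitioner = new RoundRobinDoubleCall(numPartitions);
        int[] counts = new int[numPartitions];
        for (int i = 0; i < records; i++) {
            int chosen = partitioner.partition();
            if (doubleCall) {
                chosen = partitioner.partition(); // first result is discarded
            }
            counts[chosen]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(simulate(4, 1000, false))); // [250, 250, 250, 250]
        System.out.println(Arrays.toString(simulate(4, 1000, true)));  // [0, 500, 0, 500]
    }
}
```

Because the counter advances by two per record, a topic with an even number of partitions only ever sees one parity. In practice the second call happens only when a new batch is created, so real skew would be less absolute than in this sketch.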
Checking the Configuration
# application.yml
spring:
  kafka:
    producer:
      properties:
        linger.ms: 0  # Low linger.ms setting → bug trigger condition
        partitioner.class: org.apache.kafka.clients.producer.internals.DefaultPartitioner
Our environment was configured with a low linger.ms, which acted as the trigger condition for the bug.
Solution
Upgrading the Kafka Client Version
The most fundamental solution is to upgrade the Kafka Client version to 3.3 or higher.
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.kafka</groupId>
    <artifactId>spring-kafka</artifactId>
    <version>3.1.0</version> <!-- Includes Kafka Client 3.3+ -->
</dependency>
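Starting from Kafka Client 3.3, the producer applies an improved built-in partitioning logic by default, and the legacy DefaultPartitioner is deprecated. As far as we understand, this new logic is used only when partitioner.class is not set, so an explicit override such as the DefaultPartitioner setting shown in the configuration above should be removed as part of the upgrade.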
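If producer settings are assembled in code, one way to make sure the upgraded client can fall back to its own default is to drop a legacy partitioner override before creating the producer. The following is a minimal sketch under the assumption that leaving `partitioner.class` unset lets the 3.3+ client pick its built-in logic. `ProducerConfigCleanup` and `withoutLegacyPartitioner` are hypothetical names, and the config keys are written as plain strings so the example needs only the JDK.

```java
import java.util.Properties;
import java.util.Set;

class ProducerConfigCleanup {
    static final String PARTITIONER_CLASS = "partitioner.class";

    // Partitioner classes deprecated in Kafka Client 3.3 (assumed to be the ones to drop).
    static final Set<String> LEGACY_PARTITIONERS = Set.of(
            "org.apache.kafka.clients.producer.internals.DefaultPartitioner",
            "org.apache.kafka.clients.producer.UniformStickyPartitioner");

    /**
     * Returns a copy of the given producer properties without a legacy
     * partitioner override. Custom partitioners are left untouched.
     */
    public static Properties withoutLegacyPartitioner(Properties original) {
        Properties cleaned = new Properties();
        cleaned.putAll(original);
        String configured = cleaned.getProperty(PARTITIONER_CLASS);
        if (configured != null && LEGACY_PARTITIONERS.contains(configured)) {
            cleaned.remove(PARTITIONER_CLASS);
        }
        return cleaned;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("linger.ms", "0");
        props.setProperty(PARTITIONER_CLASS,
                "org.apache.kafka.clients.producer.internals.DefaultPartitioner");
        Properties cleaned = withoutLegacyPartitioner(props);
        System.out.println(cleaned.containsKey(PARTITIONER_CLASS)); // false
    }
}
```

In a Spring Boot setup the equivalent step is simply deleting the partitioner.class entry from application.yml.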
Verification
Partition distribution status after the upgrade:
Partition 0: ██████ 25%
Partition 1: ██████ 25%
Partition 2: ██████ 25%
Partition 3: ██████ 25%
An even distribution was confirmed.
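A check like this can also be automated. The sketch below turns per-partition message counts into shares and flags a topic as skewed when any share deviates from the uniform share by more than a tolerance. `PartitionSkewCheck`, `shares`, `isSkewed`, and the 5-point tolerance are assumptions chosen for illustration, and the inputs reuse the percentages reported in this post.

```java
class PartitionSkewCheck {
    /** Converts per-partition counts into fractions of the total (all zeros if empty). */
    public static double[] shares(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        double[] result = new double[counts.length];
        if (total == 0) {
            return result;
        }
        for (int i = 0; i < counts.length; i++) {
            result[i] = (double) counts[i] / total;
        }
        return result;
    }

    /** True if any partition's share differs from 1/n by more than the tolerance. */
    public static boolean isSkewed(long[] counts, double tolerance) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        if (counts.length == 0 || total == 0) {
            return false; // nothing published yet, nothing to judge
        }
        double uniform = 1.0 / counts.length;
        for (double share : shares(counts)) {
            if (Math.abs(share - uniform) > tolerance) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] before = {45, 5, 48, 2};   // distribution before the upgrade
        long[] after = {25, 25, 25, 25};  // distribution after the upgrade
        System.out.println(isSkewed(before, 0.05)); // true
        System.out.println(isSkewed(after, 0.05));  // false
    }
}
```

The counts themselves would come from whatever source is already available, for example per-partition offsets or the metrics behind a monitoring dashboard.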
Results
- Fully resolved the Kafka message distribution imbalance.
- Eliminated the overload on specific consumers caused by partition skew.
- Secured uniform and predictable processing performance for ranking data updates.
Lessons Learned
When upgrading library versions, it is crucial to always check the release notes and bug fix history. As seen in this case, when there is a bug in the default behavior, finding the root cause can be very difficult.
Furthermore, because we were monitoring the message distribution in real time on a Datadog dashboard, we were able to detect the issue early. This reaffirmed the importance of a proper monitoring system.