Resolving Message Distribution Imbalance Caused by Kafka StickyPartitioner Bug

A bug in the StickyPartitioner in Kafka Client versions below 3.3 caused messages to concentrate in specific partitions. This post summarizes the journey from root cause analysis to resolution.

Kafka · Spring · Backend · Gmarket

The Problem

While improving the CPC Ad Ranking System at the Gmarket AdTech team, we discovered that ranking update messages published to Kafka were heavily concentrated in specific partitions.

We expected messages to be distributed evenly across partitions. Instead, about 93% of them landed in partitions 0 and 2, while the remaining partitions were almost empty.

Partition 0: ████████████ 45%
Partition 1: ██ 5%
Partition 2: ████████████ 48%
Partition 3: ▏ 2%

Root Cause Analysis

The StickyPartitioner Bug

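Starting from Kafka Client 2.4, the default partitioner's strategy for records without a key changed from round-robin to sticky partitioning (referred to in this post as the StickyPartitioner). The StickyPartitioner improves batching efficiency by sending messages to the same partition until the current batch is filled, and only then switching to another partition.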

However, in Kafka Client versions earlier than 3.3, the StickyPartitioner has a bug that skews messages toward specific partitions under certain conditions, notably a low linger.ms setting combined with slightly higher latency on some brokers.

// Simplified internal logic of the buggy StickyPartitioner
// The partition() method is called twice under certain conditions,
// causing messages to be distributed only to even/odd partitions.

The RoundRobinPartitioner in Kafka Client 2.4 and above has a related bug: when a new batch is created, the partition() method is called twice for the same record, so one partition is skipped each time and messages end up only in even or only in odd partitions.
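To make the even/odd effect concrete, here is a minimal toy model in Java. It is not Kafka's source code: the class name RoundRobinDoubleCallDemo and its simulate method are hypothetical, and the model assumes the worst case in which every record triggers the second partition() call.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model (not Kafka's actual code): a round-robin counter that advances
// on every partition() call, so a second call per record skips a partition.
class RoundRobinDoubleCallDemo {

    // Returns how many records each partition received.
    // callTwice = true models partition() being invoked twice per record.
    static int[] simulate(int numPartitions, int records, boolean callTwice) {
        AtomicInteger counter = new AtomicInteger(0);
        int[] counts = new int[numPartitions];
        for (int i = 0; i < records; i++) {
            int partition = counter.getAndIncrement() % numPartitions;
            if (callTwice) {
                // The first result is discarded; the second call decides.
                partition = counter.getAndIncrement() % numPartitions;
            }
            counts[partition]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // single call: [250, 250, 250, 250]
        System.out.println("single call: " + Arrays.toString(simulate(4, 1000, false)));
        // double call: [0, 500, 0, 500]
        System.out.println("double call: " + Arrays.toString(simulate(4, 1000, true)));
    }
}
```

With an even number of partitions, the doubled increment means the counter only ever lands on every other partition, which is the pattern described above.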

Checking the Configuration

# application.yml
spring:
  kafka:
    producer:
      properties:
        linger.ms: 0  # Low linger.ms setting → bug trigger condition (not a first-class Spring Boot key, so it goes under properties)
        partitioner.class: org.apache.kafka.clients.producer.internals.DefaultPartitioner

Our environment was configured with a low linger.ms, which acted as the trigger condition for the bug.
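Why a low linger.ms combined with a slightly slower broker can produce skew is easier to see in a small simulation. The sketch below is a toy model under stated assumptions, not Kafka's implementation: it assumes one record arrives per tick, that the sticky partition changes only when a new batch has to be created, that an open batch keeps accepting records until it is drained, and that batch.size is never reached. The class name StickySkewDemo and the drainDelayTicks parameter are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Toy model (assumptions in the lead-in): partitions whose batches take
// longer to drain keep the producer "stuck" longer and collect more records.
class StickySkewDemo {

    // drainDelayTicks[p] = ticks a batch on partition p stays open before it is sent.
    // Requires at least two partitions.
    static int[] simulate(int[] drainDelayTicks, int records, long seed) {
        int n = drainDelayTicks.length;
        Random random = new Random(seed);
        int[] counts = new int[n];
        int sticky = 0;
        int batchOpenUntil = 0; // tick at which the current batch is drained
        for (int tick = 0; tick < records; tick++) {
            if (tick >= batchOpenUntil) {
                // No open batch: pick a different partition uniformly at random
                // and open a new batch there.
                int next = random.nextInt(n - 1);
                if (next >= sticky) {
                    next++;
                }
                sticky = next;
                batchOpenUntil = tick + drainDelayTicks[sticky];
            }
            counts[sticky]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Partition 3 sits on a hypothetical broker that drains 5x slower.
        System.out.println(Arrays.toString(simulate(new int[]{1, 1, 1, 5}, 100_000, 42L)));
        // Identical delays for comparison.
        System.out.println(Arrays.toString(simulate(new int[]{1, 1, 1, 1}, 100_000, 42L)));
    }
}
```

In this model each partition is visited about equally often, but a visit to the slow partition collects five records instead of one, so it ends up with well over half of the traffic. The real producer is more involved, but the direction of the effect matches the trigger conditions described above.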

Solution

Upgrading the Kafka Client Version

The most fundamental solution is to upgrade the Kafka Client version to 3.3 or higher.

<!-- pom.xml -->
<dependency>
  <groupId>org.springframework.kafka</groupId>
  <artifactId>spring-kafka</artifactId>
  <version>3.1.0</version> <!-- Includes Kafka Client 3.3+ -->
</dependency>

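Starting from Kafka Client 3.3, the producer ships improved built-in partitioning logic that is used when partitioner.class is not set, and the legacy DefaultPartitioner with its sticky behavior is deprecated. Note that an explicit partitioner.class entry pointing at DefaultPartitioner, as in the configuration shown earlier, keeps the legacy behavior, so it should be removed as part of the upgrade. The RoundRobinPartitioner is still available as an opt-in class.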
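In Kafka Client 3.3 and later, the built-in partitioning logic applies only when partitioner.class is left unset, so a leftover explicit setting can quietly keep the old behavior after an upgrade. The following is a minimal sketch of a startup check; the class PartitionerConfigCheck and its methods are hypothetical helpers, not part of Kafka or Spring, and the check only inspects a plain Properties object.

```java
import java.util.Properties;

// Hypothetical helper: inspects producer properties for a legacy partitioner setting.
class PartitionerConfigCheck {

    static final String PARTITIONER_CLASS = "partitioner.class";

    // Class names of the partitioners deprecated in Kafka Client 3.3.
    static final String DEFAULT_PARTITIONER =
            "org.apache.kafka.clients.producer.internals.DefaultPartitioner";
    static final String UNIFORM_STICKY_PARTITIONER =
            "org.apache.kafka.clients.producer.UniformStickyPartitioner";

    // True when partitioner.class is absent, i.e. the built-in logic would be used.
    static boolean usesBuiltInPartitioning(Properties producerProps) {
        return producerProps.getProperty(PARTITIONER_CLASS) == null;
    }

    // True when partitioner.class explicitly names one of the deprecated classes.
    static boolean isLegacyPartitioner(Properties producerProps) {
        String value = producerProps.getProperty(PARTITIONER_CLASS);
        return DEFAULT_PARTITIONER.equals(value) || UNIFORM_STICKY_PARTITIONER.equals(value);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(PARTITIONER_CLASS, DEFAULT_PARTITIONER);
        if (isLegacyPartitioner(props)) {
            System.out.println("Remove partitioner.class to use the built-in partitioning logic.");
        }
    }
}
```

A check like this could run once at application startup, or in a unit test against the resolved producer configuration.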

Verification

Partition distribution status after the upgrade:

Partition 0: ██████ 25%
Partition 1: ██████ 25%
Partition 2: ██████ 25%
Partition 3: ██████ 25%

An even distribution was confirmed.
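A check like this can also be automated. Below is a minimal sketch that turns per-partition message counts into percentage shares and tests whether each share is within a tolerance of the ideal 100/n. The class PartitionSkewCheck, the tolerance value, and the sample counts are illustrative assumptions chosen to mirror the percentages shown in this post, not measured data.

```java
import java.util.Arrays;

// Hypothetical helper for judging how evenly messages are spread over partitions.
class PartitionSkewCheck {

    // Percentage share of each partition, given per-partition message counts.
    static double[] shares(long[] counts) {
        long total = Arrays.stream(counts).sum();
        double[] result = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            result[i] = total == 0 ? 0.0 : 100.0 * counts[i] / total;
        }
        return result;
    }

    // True if every share is within tolerancePoints (percentage points) of 100/n.
    static boolean isEven(long[] counts, double tolerancePoints) {
        double ideal = 100.0 / counts.length;
        for (double share : shares(counts)) {
            if (Math.abs(share - ideal) > tolerancePoints) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] before = {450, 50, 480, 20};   // illustrative, mirrors the skewed chart
        long[] after = {250, 250, 250, 250};  // illustrative, mirrors the even chart
        System.out.println("before even? " + isEven(before, 5.0)); // before even? false
        System.out.println("after even?  " + isEven(after, 5.0));  // after even?  true
    }
}
```

The counts themselves could come from whatever source is already monitored, for example per-partition offsets or dashboard metrics.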

Results

  • Resolved the Kafka message distribution imbalance.
  • Eliminated the overload on specific consumers caused by partition skew.
  • Secured uniform and predictable processing performance for ranking data updates.

Lessons Learned

When upgrading library versions, it is crucial to always check the release notes and bug fix history. As seen in this case, when there is a bug in the default behavior, finding the root cause can be very difficult.

Furthermore, because we were monitoring the message distribution in real time on a Datadog dashboard, we were able to detect the issue early. It was another reminder of how important a proper monitoring system is.