RedisTemplate vs Lettuce, Kafka StickyPartitioner Bug — Two Issues Encountered While Improving the Ranking System

Documenting how we discovered and resolved two issues using Datadog: Kafka messages skewing to specific partitions, and Redis taking 4 minutes to process 1 million records.

Tags: Redis, Lettuce, Kafka, Spring Webflux, Gmarket, Troubleshooting

Note: The code in this article has been conceptually rewritten based on actual work experience. It is not associated with the actual company code.

Overview: The Resolution Flow of Two Issues

Kafka: StickyPartitioner bug (client versions below 3.3) → messages concentrated in specific partitions → upgrade to Kafka 3.4 → even distribution restored
Redis: RedisTemplate default config (new connection per call) → 1 million records = 240 sec → Lettuce pipelining → 1 million records = 60 sec

Introduction

While rebuilding the legacy ad ranking system on Spring Webflux, I ran into two unexpected issues. One was Kafka messages concentrating heavily in specific partitions; the other was Redis bulk processing running at only a quarter of the speed we expected.

On the surface, both issues looked like "performance degradation," but digging into the root causes revealed highly specific bugs and internal library behavior issues.


Issue 1: Kafka Messages Skewing to Specific Partitions

Discovering the Symptom

After switching the ranking update trigger to be handled by Kafka, we spotted a strange pattern through monitoring.

Partition 0: ████████████████████ 48%
Partition 1: ██ 4%
Partition 2: ████████████████████ 46%
Partition 3: ▏ 2%

Out of 4 partitions, messages were concentrated almost entirely in partitions 0 and 2, while the others were nearly empty. If partitions are used unevenly, the load becomes concentrated on specific consumers, leading to processing delays.
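A skew like this can also be flagged automatically from per-partition message counts instead of being spotted by eye. The following is a minimal sketch under stated assumptions: the class name PartitionSkew, the 10-point tolerance, and the sample counts (chosen to mirror the percentages above) are all hypothetical and not part of Kafka or Datadog.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Kafka or Datadog.
final class PartitionSkew {

    /** Returns each partition's share of the total message count, in percent. */
    static double[] sharesPercent(long[] counts) {
        long total = Arrays.stream(counts).sum();
        double[] shares = new double[counts.length];
        if (total == 0) {
            return shares; // no traffic yet: every share stays 0
        }
        for (int i = 0; i < counts.length; i++) {
            shares[i] = 100.0 * counts[i] / total;
        }
        return shares;
    }

    /** True if any partition deviates from the ideal even share by more than tolerancePoints. */
    static boolean isSkewed(long[] counts, double tolerancePoints) {
        if (Arrays.stream(counts).sum() == 0) {
            return false; // nothing to judge without traffic
        }
        double ideal = 100.0 / counts.length;
        for (double share : sharesPercent(counts)) {
            if (Math.abs(share - ideal) > tolerancePoints) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long[] observed = {480, 40, 460, 20};  // mirrors 48% / 4% / 46% / 2%
        long[] even = {250, 250, 250, 250};    // hypothetical healthy topic
        System.out.println(isSkewed(observed, 10.0)); // true
        System.out.println(isSkewed(even, 10.0));     // false
    }
}
```

A check like this could feed an alert, so that a skew is noticed before consumers start lagging.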

Root Cause Analysis: The StickyPartitioner Bug

Initially, we suspected a problem with the partition key configuration. However, the same phenomenon occurred even when the partition key was not explicitly specified.

Digging into the Kafka client code, we found the cause.

Change in Default Partitioner Since Kafka 2.4

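Starting from Kafka 2.4, the default partitioner's strategy for records without a key changed from round-robin to sticky partitioning (referred to in this article as the StickyPartitioner). Sticky partitioning keeps sending records to the same partition until the current batch is completed, either because it is full or because linger.ms has elapsed, and only then switches to another partition. This increases batching efficiency.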
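The difference between the two strategies can be sketched with a simplified model. This is not Kafka's partitioner code: the class PartitionChoiceModel is hypothetical, a batch is modeled as a fixed number of records, and the sticky variant steps to the next partition index (the real client picks the next partition randomly) so that the output is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model for illustration only; this is not Kafka's partitioner code.
final class PartitionChoiceModel {

    /** Round-robin: every record moves on to the next partition. */
    static List<Integer> roundRobin(int records, int partitions) {
        List<Integer> chosen = new ArrayList<>();
        for (int i = 0; i < records; i++) {
            chosen.add(i % partitions);
        }
        return chosen;
    }

    /** Sticky: stay on one partition until recordsPerBatch records are assigned, then switch. */
    static List<Integer> sticky(int records, int partitions, int recordsPerBatch) {
        List<Integer> chosen = new ArrayList<>();
        int current = 0;
        int inBatch = 0;
        for (int i = 0; i < records; i++) {
            if (inBatch == recordsPerBatch) {
                current = (current + 1) % partitions; // assumption: sequential instead of random
                inBatch = 0;
            }
            chosen.add(current);
            inBatch++;
        }
        return chosen;
    }

    public static void main(String[] args) {
        System.out.println(roundRobin(8, 4)); // [0, 1, 2, 3, 0, 1, 2, 3]
        System.out.println(sticky(8, 4, 4));  // [0, 0, 0, 0, 1, 1, 1, 1]
    }
}
```

With round-robin, eight records touch four partitions and therefore four separate batches; with sticky partitioning, the same eight records fill two batches.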

However, there was a bug in the StickyPartitioner in Kafka Client versions below 3.3. When combined with low linger.ms settings and specific broker latency conditions, the partition selection logic would erroneously skew towards even-numbered partitions.

Furthermore, the RoundRobinPartitioner in Kafka 2.4 and above also had a bug where the partition() method was called twice, resulting in messages alternating exclusively between even or odd partitions.

The version we were using was below 3.3, and linger.ms was set low. It was the exact condition needed to trigger the bug.
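The even/odd alternation described for the RoundRobinPartitioner can be reproduced with a toy counter. This sketch only mimics a counter-based round-robin and assumes that every record triggers the extra partition() call; the class DoubleCallRoundRobin and its methods are hypothetical, not Kafka's actual implementation.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Toy reproduction of the symptom; it mimics a counter-based round-robin, not Kafka's class.
final class DoubleCallRoundRobin {
    private final AtomicInteger counter = new AtomicInteger(0);
    private final int partitions;

    DoubleCallRoundRobin(int partitions) {
        this.partitions = partitions;
    }

    /** Each call advances the shared counter, as a plain round-robin would. */
    int partition() {
        return counter.getAndIncrement() % partitions;
    }

    /**
     * Distributes `records` records, calling partition() callsPerRecord times per record
     * (callsPerRecord must be at least 1) and using only the last result.
     */
    static int[] distribute(int records, int partitions, int callsPerRecord) {
        DoubleCallRoundRobin rr = new DoubleCallRoundRobin(partitions);
        int[] counts = new int[partitions];
        for (int i = 0; i < records; i++) {
            int chosen = rr.partition();
            for (int c = 1; c < callsPerRecord; c++) {
                chosen = rr.partition(); // the earlier result is discarded
            }
            counts[chosen]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(distribute(1000, 4, 1))); // [250, 250, 250, 250]
        System.out.println(Arrays.toString(distribute(1000, 4, 2))); // [0, 500, 0, 500]
    }
}
```

With one call per record the counts are even; with two calls per record the counter skips every other partition, so half of the partitions receive nothing.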

The Solution

The solution was relatively simple. We upgraded the Kafka client to a version that contains the fix (3.3 or higher); in our case, 3.4.0.

<!-- Legacy -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>3.2.x</version>  <!-- Buggy version -->
</dependency>

<!-- Upgrade -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>3.4.0</version>  <!-- Bug fixed version -->
</dependency>

Starting from 3.3, the sticky partitioning logic was reworked and built into the producer itself (KIP-794), the old DefaultPartitioner and UniformStickyPartitioner classes were deprecated, and the bug was fixed.
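After the upgrade, the simplest way to benefit from the fixed behavior is to leave partitioner.class unset so that the client's default partitioning applies. The sketch below builds such a configuration with plain java.util.Properties. The property keys are standard Kafka producer settings, but the class name ProducerSettings, the broker address, and the linger.ms value are placeholders rather than our actual configuration.

```java
import java.util.Properties;

// Illustrative producer settings; the broker address and tuning values are placeholders.
final class ProducerSettings {

    static Properties build(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("linger.ms", "5");       // placeholder value
        props.setProperty("batch.size", "16384");  // Kafka's documented default, in bytes
        // "partitioner.class" is deliberately not set, so the client's default partitioning applies.
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092");
        System.out.println(props.containsKey("partitioner.class")); // false
    }
}
```

These properties could then be passed to the KafkaProducer constructor, which accepts a Properties object.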

Partition distribution after the upgrade:

Partition 0: █████████████ 26%
Partition 1: ████████████ 24%
Partition 2: █████████████ 25%
Partition 3: ████████████ 25%

Messages were now evenly distributed across all four partitions.


Issue 2: Why Did Redis Take 4 Minutes for 1 Million Records?

Discovering the Symptom

There was a task in the ranking update batch to update approximately 1 million ranking records in Redis. The expected processing time was under 1 minute, but actual measurements showed it taking 240 seconds (4 minutes).

While monitoring the number of Redis connections via Datadog, we spotted something strange.

Before batch starts: ~5 connections
During batch execution: Connection count spikes into the thousands
After batch completes: Returns to ~5 connections

The number of Redis connections was abnormally skyrocketing while the batch was running.

Root Cause Analysis: Internal Behavior of RedisTemplate

We found the problem by looking into the implementation of RedisTemplate.

The Trap of RedisTemplate's Default Configuration

// What happens if this code runs 1 million times?
redisTemplate.opsForValue().set(key, value);

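RedisTemplate obtains a connection from its RedisConnectionFactory for every command invocation and releases it afterward. If the factory neither shares a native connection nor pools connections, each call establishes and tears down its own TCP connection, which matches what we observed. (LettuceConnectionFactory shares a single native connection by default, so whether this happens depends on how the factory is configured.) This means: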

  1. Call set(key, value)
  2. Establish TCP connection (3-way handshake)
  3. Send Redis command
  4. Receive response
  5. Tear down TCP connection

If this process is repeated 1 million times, an enormous amount of time is spent purely on TCP connection setup and teardown.

The Solution: Lettuce Native API

The default client for Spring Data Redis is Lettuce. RedisTemplate places an abstraction layer on top of it for convenience, but this abstraction was causing the performance issue.

By using the Lettuce native API directly, we could reuse a single connection for the whole batch and pipeline the commands over it.

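import io.lettuce.core.LettuceFutures;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisFuture;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.async.RedisAsyncCommands;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// RedisTemplate approach (Inefficient)
redisTemplate.opsForValue().set(key, value); // New connection every time

// Lettuce Native approach (Efficient)
@Autowired
private RedisConnectionFactory connectionFactory;

public void batchUpdate(Map<String, String> data) {
    // getNativeClient() returns AbstractRedisClient; a standalone (non-cluster) setup is assumed here
    RedisClient client = (RedisClient) ((LettuceConnectionFactory) connectionFactory).getNativeClient();

    try (StatefulRedisConnection<String, String> connection = client.connect()) {
        connection.setAutoFlushCommands(false); // Start Pipelining: buffer commands instead of sending each one
        RedisAsyncCommands<String, String> commands = connection.async();

        List<RedisFuture<?>> futures = new ArrayList<>(data.size());
        for (Map.Entry<String, String> entry : data.entrySet()) {
            futures.add(commands.set(entry.getKey(), entry.getValue()));
        }

        connection.flushCommands(); // Send all buffered commands at once
        // awaitAll returns false on timeout; 5 seconds is illustrative and should be sized to the batch
        boolean done = LettuceFutures.awaitAll(5, TimeUnit.SECONDS, futures.toArray(new RedisFuture[0]));
        if (!done) {
            throw new IllegalStateException("Pipelined SET commands did not complete within the timeout");
        }
    }
}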

The key is the setAutoFlushCommands(false) configuration. This buffers the commands and sends them all at once when flushCommands() is called. This is Pipelining.
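One caveat of buffering everything before a single flush is that all 1 million commands and their futures are held in memory at once. A common variation is to flush in fixed-size chunks. The sketch below shows only the chunking part using the standard library; ChunkedFlush, forEachChunk, and the chunk size are hypothetical names and values, and the callback stands in for "queue the SET commands, call flushCommands(), await the futures".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: splits work into bounded chunks; the Redis calls are left to the callback.
final class ChunkedFlush {

    /** Hands the items to flushChunk in consecutive chunks of at most chunkSize elements. */
    static <T> int forEachChunk(List<T> items, int chunkSize, Consumer<List<T>> flushChunk) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunks = 0;
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            flushChunk.accept(items.subList(start, end)); // view over [start, end)
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            items.add(i);
        }
        // In a real batch, the callback would queue commands, flush, and await the futures.
        int chunks = forEachChunk(items, 4, chunk -> System.out.println(chunk));
        System.out.println(chunks); // 3
    }
}
```

Smaller chunks bound memory use at the cost of more round trips, so the chunk size is a tuning parameter rather than a fixed rule.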

Performance Comparison

Method                   | Time to Process 1 Million Records
RedisTemplate (Default)  | 240 seconds
Lettuce Native API       | 60 seconds

It became 4 times faster. This was achieved on the same data and same Redis server, just by changing the code.
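To put the totals into per-record terms, the arithmetic can be written out as a small calculation. The class and method names here are hypothetical; the inputs are the measured totals reported above.

```java
// Back-of-the-envelope numbers derived from the measured totals.
final class BatchTiming {

    /** Average time per record, in microseconds. */
    static double microsPerRecord(long totalSeconds, long records) {
        return totalSeconds * 1_000_000.0 / records;
    }

    /** Throughput, in records per second. */
    static double recordsPerSecond(long totalSeconds, long records) {
        return (double) records / totalSeconds;
    }

    public static void main(String[] args) {
        long records = 1_000_000L;
        System.out.println(microsPerRecord(240, records)); // 240.0
        System.out.println(microsPerRecord(60, records));  // 60.0
        System.out.println(240.0 / 60.0);                  // 4.0
    }
}
```

In other words, the average cost per record dropped from about 240 microseconds to about 60 microseconds, and throughput rose from roughly 4,167 to roughly 16,667 records per second.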


Lessons Learned

There were common lessons learned from both issues.

1. Recognize the Cost of Abstraction Layers

RedisTemplate is convenient to use, but using it without understanding its internal mechanics leads to unexpected performance issues. When bulk processing is required, consider bypassing the abstraction and using the native API directly.

2. Track Library Versions and Bugs

The Kafka StickyPartitioner bug was a known bug registered in the official issue tracker. It's important to periodically check the changelogs and known issues of your dependent libraries.

3. You Can't Find Problems Without Monitoring

For the Redis issue in particular, if we had not had Datadog's connection count graph, finding the cause would have taken much longer. It is crucial to have monitoring in place from the very beginning.