Resilient Java client with Lettuce in Container for Azure Cache for Redis

Jay Lee
Sep 20, 2022

I recently saw an incident where a team suffered about 20 minutes of downtime after an unexpected network failure on their Redis cluster. The Redis cluster itself recovered reasonably quickly, within tens of seconds, but the service outage dragged on for around 20 minutes, which indicates that something was going on on the client side. Eventually, they restarted the entire app, and things returned to normal. Troubleshooting that incident is not the purpose of this article. Instead, I want to focus on how to make Lettuce clients more resilient to unexpected failures.

Troubleshooting this type of incident usually requires a tcpdump to see what's going on at the network level, and TCP timeouts on Linux are always among the first things to look at after an unexpected network outage. Unsurprisingly, the Microsoft documentation has a section on this in its best practices guide.

https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices-connection#tcp-settings-for-linux-hosted-client-applications

If you're keen to understand how changing the kernel parameter discussed there, net.ipv4.tcp_retries2, helps reduce downtime, there is an excellent article from the Cloudflare team titled "When TCP sockets refuse to die." I highly recommend reading it before you proceed.

Since this is a known problem, could the incident have been avoided? Unfortunately, it's not that simple, as this app runs on Azure Kubernetes Service. On Kubernetes, "net.ipv4.tcp_retries2" is a namespaced but unsafe sysctl, so a pod that requests it is rejected unless the node has been explicitly configured to allow it. You can quickly test this with the test.yml file below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sysctl-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sysctl-test
  template:
    metadata:
      labels:
        app: sysctl-test
    spec:
      securityContext:
        sysctls:
        - name: net.ipv4.tcp_retries2
          value: "5"
      containers:
      - name: sysctl-test
        image: ubuntu
        ports:
        - containerPort: 80

$ kubectl apply -f test.yml
deployment.apps/sysctl-test created
$ kubectl get pod
NAME                           READY   STATUS            RESTARTS   AGE
postgres-cc7b6d88c-rcdng       1/1     Running           0          2d17h
sysctl-test-5987cd5cc8-2fdfp   0/1     SysctlForbidden   0          1s
sysctl-test-5987cd5cc8-54h9k   0/1     SysctlForbidden   0          1s
sysctl-test-5987cd5cc8-54xfq   0/1     Pending           0          0s
sysctl-test-5987cd5cc8-58cp6   0/1     SysctlForbidden   0          1s
sysctl-test-5987cd5cc8-5mbq6   0/1     SysctlForbidden   0          1s
sysctl-test-5987cd5cc8-7n7zc   0/1     SysctlForbidden   0          2s
sysctl-test-5987cd5cc8-8q79n   0/1     SysctlForbidden   0          3s
sysctl-test-5987cd5cc8-97n6w   0/1     SysctlForbidden   0          1s
sysctl-test-5987cd5cc8-f4g79   0/1     SysctlForbidden   0          3s
sysctl-test-5987cd5cc8-fw74p   0/1     SysctlForbidden   0          3s
sysctl-test-5987cd5cc8-jgq8z   0/1     SysctlForbidden   0          3s
sysctl-test-5987cd5cc8-k6xsw   0/1     SysctlForbidden   0          3s
sysctl-test-5987cd5cc8-kcrbz   0/1     SysctlForbidden   0          1s
sysctl-test-5987cd5cc8-lkwhh   0/1     SysctlForbidden   0          3s
sysctl-test-5987cd5cc8-mr5l6   0/1     SysctlForbidden   0          0s
sysctl-test-5987cd5cc8-n87d7   0/1     SysctlForbidden   0          3s
sysctl-test-5987cd5cc8-nm6ws   0/1     SysctlForbidden   0          3s
sysctl-test-5987cd5cc8-q67sn   0/1     SysctlForbidden   0          1s
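
For a rough sense of why this one parameter matters so much, the sketch below adds up the retransmission timeouts that a given tcp_retries2 value allows before the kernel gives up on a connection. It assumes the usual Linux bounds of a 200 ms minimum and a 120 s maximum retransmission timeout, doubling on every retry; the real figure depends on the measured round-trip time, so treat the result as an estimate only. The class and method names are made up for illustration.

```java
import java.time.Duration;

final class TcpRetries2Estimate {

    // Assumed Linux bounds for the retransmission timeout (RTO)
    static final long RTO_MIN_MS = 200;
    static final long RTO_MAX_MS = 120_000;

    // Sums the timeout of the original transmission plus `retries2`
    // retransmissions, doubling the RTO each time up to the maximum.
    static Duration estimate(int retries2) {
        long totalMs = 0;
        long rtoMs = RTO_MIN_MS;
        for (int i = 0; i <= retries2; i++) {
            totalMs += rtoMs;
            rtoMs = Math.min(rtoMs * 2, RTO_MAX_MS);
        }
        return Duration.ofMillis(totalMs);
    }

    public static void main(String[] args) {
        System.out.println("tcp_retries2=15: " + estimate(15).toMillis() / 1000.0 + " s");
        System.out.println("tcp_retries2=5:  " + estimate(5).toMillis() / 1000.0 + " s");
    }
}
```

Under these assumptions the default of 15 works out to 924.6 seconds, roughly 15 minutes, which is the same order of magnitude as the outage described above. The value of 5 attempted in test.yml would bring that down to 12.6 seconds.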

Lettuce on Kubernetes

Changing the sysctl for the container would be the most straightforward approach, since it requires no code changes, but it cannot be done on AKS at the moment because AKS custom node configuration doesn't allow it. The next option is to enhance the client code, which is written in Java. More specifically, we should look at Lettuce, the Redis client used by Spring Data Redis. Lettuce can be tuned by following the two recommendations from the Cloudflare team's analysis: 1. enable TCP keepalive, and 2. set TCP_USER_TIMEOUT.

The first one, enabling TCP keepalive, is easy in Java, but the second is tricky because the JDK doesn't support TCP_USER_TIMEOUT natively. So what can we do? The answer lies with Lettuce itself: Lettuce is built on Netty, and Netty provides a native transport for Linux that supports these socket options. There is a good amount of information on this from Trustin Lee himself, the creator of the Netty project, in a GitHub issue.
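
To illustrate the gap, here is a minimal sketch of what the plain JDK can do on its own. It assumes Java 11 or later, where jdk.net.ExtendedSocketOptions exposes the keepalive timings on platforms that support them; there is no corresponding option for TCP_USER_TIMEOUT. The class and helper names are hypothetical, and the timing values simply mirror the ones used later in this article.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;
import java.net.StandardSocketOptions;
import jdk.net.ExtendedSocketOptions;

final class JdkKeepAliveSketch {

    // Creates an unconnected socket with keepalive enabled.
    static Socket newKeepAliveSocket() {
        Socket socket = new Socket();
        try {
            socket.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
            // The timing options are platform dependent, so check before use
            if (socket.supportedOptions().contains(ExtendedSocketOptions.TCP_KEEPIDLE)) {
                socket.setOption(ExtendedSocketOptions.TCP_KEEPIDLE, 15);     // seconds
                socket.setOption(ExtendedSocketOptions.TCP_KEEPINTERVAL, 5);  // seconds
                socket.setOption(ExtendedSocketOptions.TCP_KEEPCOUNT, 3);     // probes
            }
            // No JDK socket option exists for TCP_USER_TIMEOUT
            return socket;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper so callers do not have to handle SocketException.
    static boolean keepAliveEnabled(Socket socket) {
        try {
            return socket.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```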

Let's start with the dependency. We must add the epoll-based native transport to pom.xml:

<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-transport-native-epoll</artifactId>
    <version>${netty.version}</version>
    <classifier>linux-x86_64</classifier>
</dependency>
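
Note that the classifier ties the native library to one CPU architecture. As a small sketch, assuming the node pool might also contain ARM64 machines (for which Netty publishes a linux-aarch_64 classifier), the helper below maps the JVM's os.name and os.arch properties to the classifier that would be needed. The class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.Optional;

final class EpollClassifier {

    // Returns the Maven classifier of the epoll transport matching the
    // given OS and architecture, or empty when epoll does not apply.
    static Optional<String> classifierFor(String osName, String osArch) {
        if (!osName.toLowerCase(Locale.ROOT).contains("linux")) {
            return Optional.empty(); // epoll is Linux-only
        }
        switch (osArch.toLowerCase(Locale.ROOT)) {
            case "amd64":
            case "x86_64":
                return Optional.of("linux-x86_64");
            case "aarch64":
                return Optional.of("linux-aarch_64");
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(classifierFor(
                System.getProperty("os.name"), System.getProperty("os.arch")));
    }
}
```

At runtime, Netty's Epoll.isAvailable() can be used to confirm that the native library was actually loaded inside the container.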

Then add NettyCustomizer to ClientResources.

ClientResources.builder().nettyCustomizer(new NettyCustomizer() {
    @Override
    public void afterBootstrapInitialized(Bootstrap bootstrap) {
        // Keepalive timings are in seconds, TCP_USER_TIMEOUT is in milliseconds
        bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);
        bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);
        bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);
        bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, 30000);
        bootstrap.option(ChannelOption.SO_KEEPALIVE, true);
    }
// resolver is defined in the complete source code below
}).socketAddressResolver(resolver).build();

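The TCP_USER_TIMEOUT value follows the recommendation from the Cloudflare article:

Set TCP_USER_TIMEOUT to TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT

With the keepalive values above, that is 15 + 5 * 3 = 30 seconds; TCP_USER_TIMEOUT takes milliseconds, hence 30000.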

NOTE: These values are not meant as general-purpose defaults and should be tested and validated for your workload before you use them.
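
To keep the four numbers from drifting apart when one of them is tuned, the timeout can be derived rather than hard-coded. The helper below is a small sketch of the formula with hypothetical names; it only does the arithmetic and leaves the Netty configuration untouched.

```java
import java.time.Duration;

final class KeepAliveTimeouts {

    // TCP_USER_TIMEOUT = TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT,
    // converted from seconds to the milliseconds the option expects.
    static int userTimeoutMillis(int keepIdleSeconds, int keepIntervalSeconds, int keepCount) {
        Duration total = Duration.ofSeconds(keepIdleSeconds)
                .plus(Duration.ofSeconds(keepIntervalSeconds).multipliedBy(keepCount));
        return Math.toIntExact(total.toMillis());
    }

    public static void main(String[] args) {
        System.out.println(userTimeoutMillis(15, 5, 3));
    }
}
```

The result could then be passed to bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, ...) in place of the literal.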

Complete Source code of Lettuce Configuration

package io.jaylee.redis.clientdemo.config;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.internal.HostAndPort;
import io.lettuce.core.resource.*;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.epoll.EpollChannelOption;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;
import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.connection.RedisNode;
import org.springframework.data.redis.connection.lettuce.LettuceClientConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.core.StringRedisTemplate;

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Duration;
import java.util.function.UnaryOperator;

@Configuration
@Profile("new")
public class LettuceNewConfig {

    @Value("${redisHost}")
    String redisHost;

    @Value("${redisPort}")
    String redisPort;

    @Value("${redisPassword}")
    String redisPassword;

    @Bean(destroyMethod = "shutdown")
    ClientResources clientResources() {
        // Map the cache's resolved IP address back to its host name
        UnaryOperator<HostAndPort> mappingFunction = hostAndPort -> {
            InetAddress[] addresses;
            try {
                addresses = DnsResolvers.JVM_DEFAULT.resolve(redisHost);
            }
            catch (UnknownHostException e) {
                // Resolution failed: keep the address unchanged instead of
                // indexing into an empty array below
                e.printStackTrace();
                return hostAndPort;
            }
            if (addresses.length == 0)
                return hostAndPort;
            String cacheIP = addresses[0].getHostAddress();
            HostAndPort finalAddress = hostAndPort;

            if (hostAndPort.hostText.equals(cacheIP))
                finalAddress = HostAndPort.of(redisHost, hostAndPort.getPort());
            return finalAddress;
        };

        MappingSocketAddressResolver resolver = MappingSocketAddressResolver.create(DnsResolvers.JVM_DEFAULT,
                mappingFunction);

        return ClientResources.builder().nettyCustomizer(new NettyCustomizer() {
            @Override
            public void afterBootstrapInitialized(Bootstrap bootstrap) {
                // Keepalive timings are in seconds, TCP_USER_TIMEOUT is in milliseconds
                bootstrap.option(EpollChannelOption.TCP_KEEPIDLE, 15);
                bootstrap.option(EpollChannelOption.TCP_KEEPINTVL, 5);
                bootstrap.option(EpollChannelOption.TCP_KEEPCNT, 3);
                bootstrap.option(EpollChannelOption.TCP_USER_TIMEOUT, 30000);
                bootstrap.option(ChannelOption.SO_KEEPALIVE, true);
            }
        }).socketAddressResolver(resolver).build();
    }

    @Bean
    public RedisConnectionFactory redisConnectionFactory() {
        RedisNode redisNode = RedisNode.newRedisNode()
                .listeningAt(redisHost, Integer.parseInt(redisPort))
                .build();

        RedisClusterConfiguration config = new RedisClusterConfiguration();
        config.addClusterNode(redisNode);
        config.setPassword(redisPassword);

        LettuceClientConfiguration clientConfig = LettuceClientConfiguration.builder().clientOptions(clientOptions())
                .clientResources(clientResources()).useSsl().build();
        return new LettuceConnectionFactory(config, clientConfig);
    }

    @Bean
    public ClusterClientOptions clientOptions() {

        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(5)).dynamicRefreshSources(false)
                .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(5)).enableAllAdaptiveRefreshTriggers().build();

        return ClusterClientOptions.builder().disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
                .autoReconnect(true).pingBeforeActivateConnection(false).topologyRefreshOptions(refreshOptions).build();
    }

    @Bean
    public StringRedisTemplate stringRedisTemplate(
            @Qualifier("redisConnectionFactory") RedisConnectionFactory redisConnectionFactory) {

        StringRedisTemplate template = new StringRedisTemplate();
        template.setConnectionFactory(redisConnectionFactory);

        return template;
    }

}

Wrapping Up

I'm planning to write a few more articles on using Azure Cache for Redis with Java, covering topics like Spring Session Data Redis, RedisJSON, etc. Stay tuned!

If you like my article, please leave some claps here or maybe even start following me. You can hit me up on LinkedIn. Thanks!