Hazelcast Kubernetes Migration - Implementation Plan
Document Version: 1.0
Date: 2024-11-23
Priority: HIGH (Critical for Multi-Instance Deployment)
Estimated Effort: 3-5 days
Table of Contents
- Executive Summary
- Current State Analysis
- Migration Requirements
- Implementation Steps
- Testing Strategy
- Rollback Plan
- Monitoring & Validation
- Timeline & Milestones
- Success Criteria
- Risks & Mitigation
- Documentation Requirements
- Post-Deployment Tasks
1. Executive Summary
Problem Statement
The current Hazelcast implementation uses hardcoded IP addresses and multicast discovery, which are incompatible with Kubernetes dynamic pod networking. This prevents proper distributed caching in multi-instance deployments.
Current Issues:
- ❌ Hardcoded IP: 172.16.55.1 (line 43 in TlinqClusterCache.java)
- ❌ Multicast discovery (doesn't work in K8s)
- ❌ No eviction policies (memory leak risk)
- ❌ No TTL configuration (unlimited cache growth)
Impact of NOT Fixing:
- User sessions lost when requests hit different pods
- Anonymous shopping carts lost
- Users forced to re-login frequently
- Cannot scale horizontally
Solution Overview
Migrate Hazelcast from multicast discovery to Kubernetes service-based discovery while maintaining backward compatibility with bare-metal deployments.
Benefits:
- ✅ Distributed cache works across all pods
- ✅ User sessions shared between instances
- ✅ Shopping carts persistent across pod switches
- ✅ Proper memory management (TTL + eviction)
- ✅ Horizontal scaling enabled
2. Current State Analysis
2.1 Hazelcast Usage Inventory
Caches in Use (3 total):
| Cache Name | Purpose | Data Type | Critical | DB Fallback |
|---|---|---|---|---|
| cartsCache | Shopping carts | Map | YES | Partial* |
| userSessions | Odoo user sessions | Map | YES | NO |
| apiRolesCache | API authorization | Map | MEDIUM | File reload |
*Logged-in user carts have DB fallback; anonymous carts do not
2.2 Current Configuration
File: /home/nino/src/tqpro/tqcommon/src/main/java/com/perun/tlinq/entity/cache/TlinqClusterCache.java
// Lines 40-50: Problematic configuration
private TlinqClusterCache() {
Config cfg = new Config();
ArrayList<String> nifs = new ArrayList<>(){{
add("172.16.55.1"); // ❌ HARDCODED IP
}};
InterfacesConfig ifc = new InterfacesConfig();
ifc.setInterfaces(nifs);
JoinConfig jcfg = cfg.getAdvancedNetworkConfig().getJoin();
jcfg.getMulticastConfig().setMulticastPort(55478); // ❌ MULTICAST
_hc = Hazelcast.newHazelcastInstance(cfg);
}
Problems:
1. Hardcoded IP won't work with dynamic pod IPs
2. Multicast is disabled in most Kubernetes CNI plugins
3. No map-specific configurations
4. No eviction or TTL settings
2.3 Dependencies
Current Dependency (build.gradle.kts:57):
implementation("com.hazelcast:hazelcast-all:4.2.4")
Issues:
- Version 4.2.4 is outdated (this plan targets the 5.3.x line)
- hazelcast-all is deprecated (use modular dependencies)
- Missing Kubernetes plugin
3. Migration Requirements
3.1 Functional Requirements
FR-1: Kubernetes Discovery
- Must discover cluster members via Kubernetes service
- Must support headless service for member discovery
- Must work with DNS-based service discovery

FR-2: Backward Compatibility
- Must support bare-metal deployment (multicast)
- Must auto-detect deployment mode via environment variable
- Must not break existing bare-metal installations

FR-3: Cache Configuration
- Must configure TTL for session caches (30-60 min)
- Must configure eviction policies (LRU, max size)
- Must prevent memory leaks from unlimited growth

FR-4: Cluster Health
- Must expose cluster member count
- Must support health checks for readiness probe
- Must log cluster formation events
3.2 Non-Functional Requirements
NFR-1: Performance
- Cache operations must complete in < 10ms (p95)
- Cluster formation must complete in < 30 seconds
- Network overhead must be minimal

NFR-2: Reliability
- Must handle pod restarts gracefully
- Must maintain cache consistency during scaling
- Must support rolling updates without cache loss

NFR-3: Security
- Must support encrypted inter-member communication (optional)
- Must validate cluster membership
- Must not expose cache data externally
3.3 Kubernetes Requirements
Required Resources:
1. ServiceAccount with RBAC permissions
2. Headless Service for member discovery
3. ConfigMap for Hazelcast configuration (optional)
4. NetworkPolicy for pod-to-pod communication (optional)
Required Environment Variables:
- DEPLOYMENT_MODE: kubernetes or baremetal
- K8S_NAMESPACE: Kubernetes namespace name
- HAZELCAST_SERVICE: Service name for discovery
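The implementation in Step 2 falls back to defaults (with a warning) when these variables are absent. As a complementary idea, a startup check could surface missing variables explicitly. The following is a minimal sketch under that assumption; the class `DiscoveryEnvCheck` and its method are hypothetical and not part of the existing code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical startup check; not part of the existing code base. */
final class DiscoveryEnvCheck {

    /** Returns the names of required variables that are missing or blank. */
    static List<String> missingVariables(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        String mode = env.getOrDefault("DEPLOYMENT_MODE", "");
        if (mode.isBlank()) {
            missing.add("DEPLOYMENT_MODE");
        }
        // The Kubernetes-specific variables only matter in kubernetes mode.
        if ("kubernetes".equalsIgnoreCase(mode.trim())) {
            for (String name : List.of("K8S_NAMESPACE", "HAZELCAST_SERVICE")) {
                String value = env.get(name);
                if (value == null || value.isBlank()) {
                    missing.add(name);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingVariables(System.getenv());
        System.out.println(missing.isEmpty()
                ? "Discovery environment looks complete"
                : "Missing variables: " + missing);
    }
}
```

Taking the environment as a `Map` keeps the check testable without touching the real process environment.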
4. Implementation Steps
Step 1: Update Dependencies (30 minutes)
File: build.gradle.kts
Changes:
dependencies {
// REMOVE old dependency
// implementation("com.hazelcast:hazelcast-all:4.2.4")
// ADD the core artifact. Since Hazelcast 5.0 it bundles the Kubernetes
// discovery plugin, so the separate hazelcast-kubernetes artifact (built
// for Hazelcast 4.x and earlier) must not be added alongside it.
implementation("com.hazelcast:hazelcast:5.3.6")
// Existing dependencies...
}
Verification:
./gradlew dependencies | grep hazelcast
# Should show the core artifact only:
# +--- com.hazelcast:hazelcast:5.3.6
Step 2: Modify TlinqClusterCache.java (4 hours)
File: /home/nino/src/tqpro/tqcommon/src/main/java/com/perun/tlinq/entity/cache/TlinqClusterCache.java
Complete New Implementation:
package com.perun.tlinq.entity.cache;
import com.hazelcast.config.*;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import java.util.Map;
import java.util.logging.Logger;
/**
* Singleton cache manager using Hazelcast distributed cache.
* Supports both Kubernetes (service discovery) and bare-metal (multicast) deployments.
*/
public class TlinqClusterCache {
private static final Logger logger = Logger.getLogger(TlinqClusterCache.class.getName());
private static TlinqClusterCache _instance;
private HazelcastInstance _hc;
private TlinqClusterCache() {
logger.info("Initializing Hazelcast cluster cache...");
Config cfg = createHazelcastConfig();
_hc = Hazelcast.newHazelcastInstance(cfg);
logger.info("Hazelcast instance created successfully. Cluster size: " +
_hc.getCluster().getMembers().size());
}
/**
* Creates Hazelcast configuration based on deployment mode.
*/
private Config createHazelcastConfig() {
Config cfg = new Config();
cfg.setClusterName(getClusterName());
// Configure network based on deployment mode
String deploymentMode = getDeploymentMode();
logger.info("Deployment mode: " + deploymentMode);
if ("kubernetes".equalsIgnoreCase(deploymentMode)) {
configureKubernetesDiscovery(cfg);
} else {
configureBaremetalDiscovery(cfg);
}
// Configure cache maps with TTL and eviction
configureCacheMaps(cfg);
// Optional: Configure management center
configureManagementCenter(cfg);
return cfg;
}
/**
* Configures Kubernetes service-based discovery.
*/
private void configureKubernetesDiscovery(Config cfg) {
logger.info("Configuring Kubernetes discovery...");
NetworkConfig networkConfig = cfg.getNetworkConfig();
// Set port (default 5701)
networkConfig.setPort(5701);
networkConfig.setPortAutoIncrement(true);
networkConfig.setPortCount(100);
// Disable multicast and TCP/IP
JoinConfig joinConfig = networkConfig.getJoin();
joinConfig.getMulticastConfig().setEnabled(false);
joinConfig.getTcpIpConfig().setEnabled(false);
// Enable Kubernetes discovery
KubernetesConfig k8sConfig = joinConfig.getKubernetesConfig();
k8sConfig.setEnabled(true);
// Get configuration from environment
String namespace = System.getenv("K8S_NAMESPACE");
String serviceName = System.getenv("HAZELCAST_SERVICE");
if (namespace == null || namespace.isEmpty()) {
namespace = "default";
logger.warning("K8S_NAMESPACE not set, using default: " + namespace);
}
if (serviceName == null || serviceName.isEmpty()) {
serviceName = "tqpro-hazelcast";
logger.warning("HAZELCAST_SERVICE not set, using default: " + serviceName);
}
// Configure Kubernetes discovery properties
k8sConfig.setProperty("namespace", namespace);
k8sConfig.setProperty("service-name", serviceName);
k8sConfig.setProperty("service-port", "5701");
k8sConfig.setProperty("resolve-not-ready-addresses", "true");
logger.info(String.format("Kubernetes discovery configured: namespace=%s, service=%s",
namespace, serviceName));
}
/**
* Configures bare-metal multicast discovery (legacy mode).
*/
private void configureBaremetalDiscovery(Config cfg) {
logger.info("Configuring bare-metal multicast discovery...");
NetworkConfig networkConfig = cfg.getNetworkConfig();
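// Get network interface from system property or use default
String hazelcastInterface = System.getProperty("hazelcast.interface", "172.16.55.1");
// Note: hazelcast.port is used below for both the member (TCP) port and the
// multicast (UDP) port; the previous code only set the multicast port
int hazelcastPort = Integer.parseInt(System.getProperty("hazelcast.port", "55478"));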
networkConfig.setPort(hazelcastPort);
networkConfig.setPortAutoIncrement(true);
// Configure interfaces
InterfacesConfig interfacesConfig = networkConfig.getInterfaces();
interfacesConfig.setEnabled(true);
interfacesConfig.addInterface(hazelcastInterface);
// Enable multicast
JoinConfig joinConfig = networkConfig.getJoin();
MulticastConfig multicastConfig = joinConfig.getMulticastConfig();
multicastConfig.setEnabled(true);
multicastConfig.setMulticastPort(hazelcastPort);
multicastConfig.setMulticastTimeoutSeconds(5);
multicastConfig.setMulticastTimeToLive(32);
logger.info(String.format("Multicast discovery configured: interface=%s, port=%d",
hazelcastInterface, hazelcastPort));
}
/**
* Configures individual cache maps with TTL and eviction policies.
*/
private void configureCacheMaps(Config cfg) {
logger.info("Configuring cache maps...");
// Shopping carts cache - 30 minute TTL
MapConfig cartsConfig = new MapConfig("cartsCache");
cartsConfig.setTimeToLiveSeconds(1800); // 30 minutes
cartsConfig.setMaxIdleSeconds(1800);
cartsConfig.setEvictionConfig(new EvictionConfig()
.setSize(10000)
.setMaxSizePolicy(MaxSizePolicy.PER_NODE)
.setEvictionPolicy(EvictionPolicy.LRU));
cartsConfig.setBackupCount(1); // 1 backup copy
cartsConfig.setAsyncBackupCount(0);
cfg.addMapConfig(cartsConfig);
logger.info("Configured cartsCache: TTL=30min, maxSize=10000, backups=1");
// User sessions cache - 60 minute TTL
MapConfig sessionsConfig = new MapConfig("userSessions");
sessionsConfig.setTimeToLiveSeconds(3600); // 60 minutes
sessionsConfig.setMaxIdleSeconds(3600);
sessionsConfig.setEvictionConfig(new EvictionConfig()
.setSize(5000)
.setMaxSizePolicy(MaxSizePolicy.PER_NODE)
.setEvictionPolicy(EvictionPolicy.LRU));
sessionsConfig.setBackupCount(1);
sessionsConfig.setAsyncBackupCount(0);
cfg.addMapConfig(sessionsConfig);
logger.info("Configured userSessions: TTL=60min, maxSize=5000, backups=1");
// API roles cache - no TTL, no eviction (static data)
MapConfig rolesConfig = new MapConfig("apiRolesCache");
// With EvictionPolicy.NONE the configured size is ignored, so no size limit is set
rolesConfig.setEvictionConfig(new EvictionConfig()
.setEvictionPolicy(EvictionPolicy.NONE));
rolesConfig.setBackupCount(1);
cfg.addMapConfig(rolesConfig);
logger.info("Configured apiRolesCache: no TTL, no eviction, backups=1");
}
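/**
* Optional: Hazelcast Management Center integration.
* Since Hazelcast 4.0, Management Center connects to the cluster as a client,
* so members are not given its URL. Scripting stays disabled because enabling
* it lets Management Center run arbitrary scripts on members.
*/
private void configureManagementCenter(Config cfg) {
String mcUrl = System.getenv("HAZELCAST_MANAGEMENT_CENTER_URL");
if (mcUrl != null && !mcUrl.isEmpty()) {
ManagementCenterConfig mcConfig = cfg.getManagementCenterConfig();
mcConfig.setScriptingEnabled(false);
logger.info("Management Center expected at: " + mcUrl);
}
}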
/**
* Gets deployment mode from environment variable.
* @return "kubernetes" or "baremetal"
*/
private String getDeploymentMode() {
String mode = System.getenv("DEPLOYMENT_MODE");
if (mode == null || mode.isEmpty()) {
mode = System.getProperty("deployment.mode", "baremetal");
}
return mode.trim().toLowerCase(java.util.Locale.ROOT);
}
/**
* Gets cluster name from environment or uses default.
*/
private String getClusterName() {
String clusterName = System.getenv("HAZELCAST_CLUSTER_NAME");
if (clusterName == null || clusterName.isEmpty()) {
clusterName = "tqpro-cluster";
}
return clusterName;
}
/**
* Singleton instance accessor.
*/
public static synchronized TlinqClusterCache instance() {
if (_instance == null) {
_instance = new TlinqClusterCache();
}
return _instance;
}
/**
* Gets a distributed map by name.
*/
@SuppressWarnings("unchecked")
public <K, V> Map<K, V> getCache(String cacheName) {
IMap<K, V> map = _hc.getMap(cacheName);
// size() is a cluster-wide call, so it is only evaluated when FINE logging is enabled
logger.fine(() -> "Retrieved cache: " + cacheName + ", size: " + map.size());
return map;
}
/**
* Adds a cache with initial data (used by ApiRoleManager).
*/
public void addCache(String cacheName, Map<?, ?> data) {
IMap<Object, Object> map = _hc.getMap(cacheName);
map.putAll(data);
logger.info("Added cache: " + cacheName + ", entries: " + data.size());
}
/**
* Gets a single cache entry.
*/
public Object getCacheEntry(String cacheName, Object key) {
IMap<Object, Object> map = _hc.getMap(cacheName);
return map.get(key);
}
/**
* Clears a cache.
*/
public void clearCache(String cacheName) {
IMap<Object, Object> map = _hc.getMap(cacheName);
map.clear();
logger.info("Cleared cache: " + cacheName);
}
/**
* Gets cluster member count (for health checks).
*/
public int getClusterSize() {
return _hc.getCluster().getMembers().size();
}
/**
* Gets cache statistics.
*/
public String getCacheStats(String cacheName) {
IMap<Object, Object> map = _hc.getMap(cacheName);
return String.format("Cache[%s]: size=%d, localSize=%d",
cacheName, map.size(), map.getLocalMapStats().getOwnedEntryCount());
}
/**
* Graceful shutdown.
*/
public void shutdown() {
logger.info("Shutting down Hazelcast cluster cache...");
if (_hc != null) {
_hc.shutdown();
}
}
/**
* Gets Hazelcast instance (for advanced operations).
*/
public HazelcastInstance getHazelcastInstance() {
return _hc;
}
}
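The map configuration above sets both a TTL and a max-idle time to the same value. In Hazelcast, TTL is counted from the last write to an entry, while max-idle is counted from the last access, so an entry expires at whichever limit is reached first. The sketch below models that rule with plain JDK types to show the effect on `userSessions`; it does not call Hazelcast, `ExpiryModel` is a hypothetical name, and it assumes the session entry is written only at login.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative model only; it does not call Hazelcast. */
final class ExpiryModel {

    /**
     * Earliest instant at which an entry becomes eligible for expiry, assuming
     * TTL counts from the last write and max-idle from the last access.
     * A zero duration disables that limit.
     */
    static Instant expiresAt(Instant lastWrite, Instant lastAccess,
                             Duration ttl, Duration maxIdle) {
        Instant byTtl = ttl.isZero() ? Instant.MAX : lastWrite.plus(ttl);
        Instant byIdle = maxIdle.isZero() ? Instant.MAX : lastAccess.plus(maxIdle);
        return byTtl.isBefore(byIdle) ? byTtl : byIdle;
    }

    public static void main(String[] args) {
        Instant login = Instant.parse("2024-11-23T10:00:00Z");
        Instant lastRequest = Instant.parse("2024-11-23T10:50:00Z");
        Duration hour = Duration.ofMinutes(60);

        // TTL and max-idle both 60 min: the session ends 60 min after login.
        System.out.println(expiresAt(login, lastRequest, hour, hour));
        // Max-idle only: the session ends 60 min after the last request.
        System.out.println(expiresAt(login, lastRequest, Duration.ZERO, hour));
    }
}
```

Under these assumptions an active user would still be logged out one hour after login. If sliding expiration is the intent, one option is to rely on max-idle alone for the session map, or to keep a longer TTL as an absolute upper bound.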
Step 3: Update Health Check API (1 hour)
File: /home/nino/src/tqpro/tqapi/src/main/java/com/perun/tlinq/api/HealthApi.java
Add cache cluster health check:
@GET
@Path("/cache")
@Produces(MediaType.APPLICATION_JSON)
public Response cacheHealth() {
Map<String, Object> status = new HashMap<>();
try {
TlinqClusterCache cache = TlinqClusterCache.instance();
int clusterSize = cache.getClusterSize();
status.put("clusterSize", clusterSize);
status.put("status", clusterSize >= 1 ? "UP" : "DOWN");
// Get stats for each cache
Map<String, String> cacheStats = new HashMap<>();
cacheStats.put("cartsCache", cache.getCacheStats("cartsCache"));
cacheStats.put("userSessions", cache.getCacheStats("userSessions"));
cacheStats.put("apiRolesCache", cache.getCacheStats("apiRolesCache"));
status.put("caches", cacheStats);
if (clusterSize < 1) {
return Response.status(503).entity(status).build();
}
return Response.ok(status).build();
} catch (Exception e) {
status.put("status", "ERROR");
status.put("error", e.getMessage());
return Response.status(503).entity(status).build();
}
}
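A running Hazelcast member always counts itself, so `getClusterSize()` is at least 1 and the `clusterSize >= 1` test above cannot detect a pod that failed to join its peers. One possible refinement is to compare against a configurable minimum. The sketch below assumes a hypothetical environment variable `HAZELCAST_MIN_MEMBERS` and a hypothetical helper class `CacheReadiness`; neither exists in the current code.

```java
/** Hypothetical helper for the health endpoint; not part of the existing code base. */
final class CacheReadiness {

    /** Parses the (hypothetical) HAZELCAST_MIN_MEMBERS value; falls back to 1. */
    static int parseMinMembers(String raw) {
        if (raw == null || raw.isBlank()) {
            return 1;
        }
        try {
            return Math.max(1, Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return 1; // Malformed values are treated like an unset variable.
        }
    }

    /** Maps the observed cluster size to a status string. */
    static String status(int clusterSize, int minMembers) {
        return clusterSize >= minMembers ? "UP" : "DEGRADED";
    }

    public static void main(String[] args) {
        int min = parseMinMembers(System.getenv("HAZELCAST_MIN_MEMBERS"));
        // A single, isolated member reports size 1.
        System.out.println(status(1, min));
    }
}
```

Whether `DEGRADED` should fail the readiness probe is a design choice: with a minimum above 1, the first pod of a fresh rollout can never become ready, so it may be safer to report the status in the response body while keeping the probe threshold at 1.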
Step 4: Create Kubernetes RBAC Resources (30 minutes)
File: /home/nino/src/tqpro/k8s/hazelcast-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tqpro-api
  namespace: tqpro
  labels:
    app: tqpro
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hazelcast-discovery
  namespace: tqpro
  labels:
    app: tqpro
rules:
  - apiGroups: [""]
    resources: ["endpoints", "pods", "services"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hazelcast-discovery-binding
  namespace: tqpro
  labels:
    app: tqpro
subjects:
  - kind: ServiceAccount
    name: tqpro-api
    namespace: tqpro
roleRef:
  kind: Role
  name: hazelcast-discovery
  apiGroup: rbac.authorization.k8s.io
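API-based discovery calls the Kubernetes API with the pod's service account credentials, which Kubernetes mounts by default under `/var/run/secrets/kubernetes.io/serviceaccount`. A small preflight check can confirm the mount is present before Hazelcast starts. This is only a sketch: `ServiceAccountPreflight` is a hypothetical class, and it verifies that the files exist, not that the Role grants the required permissions.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical preflight check; not part of the existing code base. */
final class ServiceAccountPreflight {

    /** Default mount point of the service account credentials in a pod. */
    static final Path DEFAULT_DIR =
            Path.of("/var/run/secrets/kubernetes.io/serviceaccount");

    /** True when the token, CA certificate, and namespace files are readable. */
    static boolean hasCredentials(Path dir) {
        return Files.isReadable(dir.resolve("token"))
                && Files.isReadable(dir.resolve("ca.crt"))
                && Files.isReadable(dir.resolve("namespace"));
    }

    public static void main(String[] args) {
        System.out.println(hasCredentials(DEFAULT_DIR)
                ? "Service account credentials found"
                : "No service account credentials mounted");
    }
}
```

Permissions themselves can be checked from outside the pod with `kubectl auth can-i list endpoints --as=system:serviceaccount:tqpro:tqpro-api -n tqpro`.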
Step 5: Create Hazelcast Service (15 minutes)
File: /home/nino/src/tqpro/k8s/hazelcast-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: tqpro-hazelcast
  namespace: tqpro
  labels:
    app: tqpro-api
    component: hazelcast
spec:
  clusterIP: None # Headless service for discovery
  publishNotReadyAddresses: true # Important for Hazelcast discovery
  selector:
    app: tqpro-api
  ports:
    - name: hazelcast
      port: 5701
      targetPort: 5701
      protocol: TCP
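Because the service is headless, its DNS name resolves to one address per selected pod rather than to a single cluster IP. The plan's configuration uses API-based discovery (`service-name`), so the lookup below is only a diagnostic sketch for checking from inside a pod that the service selects the expected members. `HeadlessServiceLookup` is a hypothetical class, and the `svc.cluster.local` suffix assumes the default cluster domain.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic; not part of the existing code base. */
final class HeadlessServiceLookup {

    /** Builds the in-cluster DNS name, assuming the default cluster domain. */
    static String serviceDnsName(String service, String namespace) {
        return service + "." + namespace + ".svc.cluster.local";
    }

    /** Returns every address the name resolves to, or an empty list. */
    static List<String> resolve(String host) {
        List<String> addresses = new ArrayList<>();
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                addresses.add(address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // Name does not resolve (for example, when run outside the cluster).
        }
        return addresses;
    }

    public static void main(String[] args) {
        String name = serviceDnsName("tqpro-hazelcast", "tqpro");
        System.out.println("Service DNS name: " + name);
        // Pass --resolve to perform the lookup; intended to run inside a pod.
        if (args.length > 0 && "--resolve".equals(args[0])) {
            System.out.println("Member addresses: " + resolve(name));
        }
    }
}
```

With three ready pods behind the selector, the lookup is expected to return three addresses.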
Step 6: Update Deployment Manifest (30 minutes)
File: /home/nino/src/tqpro/k8s/api-deployment.yaml
Add to deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tqpro-api
  namespace: tqpro
spec:
  replicas: 3
  template:
    spec:
      serviceAccountName: tqpro-api # Required for RBAC
      containers:
        - name: api
          image: registry.company.com/tqpro/api:1.0.0
          ports:
            - name: http
              containerPort: 11080
              protocol: TCP
            - name: hazelcast
              containerPort: 5701
              protocol: TCP
          env:
            # Deployment mode
            - name: DEPLOYMENT_MODE
              value: "kubernetes"
            # Kubernetes namespace (auto-injected)
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            # Hazelcast service name
            - name: HAZELCAST_SERVICE
              value: "tqpro-hazelcast"
            # Cluster name (optional)
            - name: HAZELCAST_CLUSTER_NAME
              value: "tqpro-cluster"
            # Pod name and IP (for logging)
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            # Existing environment variables...
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: tqpro-db-credentials
                  key: host
            # ... etc
          # Readiness probe - check Hazelcast cluster
          readinessProbe:
            httpGet:
              path: /tlinq-api/health/cache
              port: 11080
            initialDelaySeconds: 60
            periodSeconds: 10
            timeoutSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          # Liveness probe
          livenessProbe:
            httpGet:
              path: /tlinq-api/health/live
              port: 11080
            initialDelaySeconds: 90
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
Step 7: Update Graceful Shutdown (1 hour)
File: /home/nino/src/tqpro/tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java
Add shutdown hook (after line 235):
// Graceful shutdown hook
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
log.info("Shutdown signal received, stopping server gracefully...");
try {
// Stop accepting new requests
log.info("Stopping Jetty server...");
server.stop();
// Close Hazelcast cluster
log.info("Shutting down Hazelcast cluster...");
TlinqClusterCache.instance().shutdown();
// Close database connections
log.info("Closing database connections...");
TlinqDBSession.close();
log.info("Server stopped gracefully");
} catch (Exception e) {
log.error("Error during shutdown", e);
}
}, "shutdown-thread"));
log.info("Shutdown hook registered");
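In the hook above, all three steps share one `try` block, so an exception from `server.stop()` would skip the Hazelcast and database shutdown. One way to make each step independent is to run them through a small sequencer that catches failures per step. The following is a sketch of that idea; `ShutdownSequence` is a hypothetical class, and the step names are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of the existing code base. */
final class ShutdownSequence {

    /** A shutdown action that may throw. */
    @FunctionalInterface
    interface Step {
        void run() throws Exception;
    }

    // LinkedHashMap preserves registration order.
    private final Map<String, Step> steps = new LinkedHashMap<>();

    ShutdownSequence add(String name, Step step) {
        steps.put(name, step);
        return this;
    }

    /** Runs every step in registration order; returns the names of failed steps. */
    List<String> runAll() {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Step> entry : steps.entrySet()) {
            try {
                entry.getValue().run();
            } catch (Exception e) {
                failed.add(entry.getKey());
                System.err.println("Shutdown step failed: " + entry.getKey() + ": " + e);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> failed = new ShutdownSequence()
                .add("jetty", () -> { throw new IllegalStateException("simulated failure"); })
                .add("hazelcast", () -> System.out.println("hazelcast stopped"))
                .add("database", () -> System.out.println("database closed"))
                .runAll();
        System.out.println("Failed steps: " + failed);
    }
}
```

In the real hook the steps would wrap `server.stop()`, `TlinqClusterCache.instance().shutdown()`, and `TlinqDBSession.close()`. Note also that Hazelcast registers its own JVM shutdown hook by default (system property `hazelcast.shutdownhook.enabled`), which may run concurrently with this one; disabling it is worth considering when the application manages shutdown itself.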
5. Testing Strategy
5.1 Local Testing (Docker Compose)
Create test environment:
File: docker-compose-hazelcast-test.yml
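version: '3.8'
# TlinqClusterCache reads hazelcast.interface and hazelcast.port as JVM system
# properties, not environment variables, so they are passed through JAVA_OPTS
# (this assumes the image entrypoint forwards JAVA_OPTS to the JVM).
# Hazelcast interface patterns use wildcards or ranges, not CIDR notation.
services:
  api-1:
    build: ../../..
    environment:
      - DEPLOYMENT_MODE=baremetal
      - JAVA_OPTS=-Dhazelcast.interface=172.20.*.* -Dhazelcast.port=5701
    networks:
      - tqpro-net
    ports:
      - "11080:11080"
  api-2:
    build: ../../..
    environment:
      - DEPLOYMENT_MODE=baremetal
      - JAVA_OPTS=-Dhazelcast.interface=172.20.*.* -Dhazelcast.port=5701
    networks:
      - tqpro-net
    ports:
      - "11081:11080"
  api-3:
    build: ../../..
    environment:
      - DEPLOYMENT_MODE=baremetal
      - JAVA_OPTS=-Dhazelcast.interface=172.20.*.* -Dhazelcast.port=5701
    networks:
      - tqpro-net
    ports:
      - "11082:11080"
networks:
  tqpro-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16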
Test cluster formation:
# Start 3 instances
docker-compose -f docker-compose-hazelcast-test.yml up -d
# Check cluster size on each instance
curl http://localhost:11080/tlinq-api/health/cache
curl http://localhost:11081/tlinq-api/health/cache
curl http://localhost:11082/tlinq-api/health/cache
# All should report clusterSize: 3
5.2 Kubernetes Testing (Development Cluster)
Step-by-step testing:
# 1. Apply RBAC
kubectl apply -f k8s/hazelcast-rbac.yaml
# 2. Apply headless service
kubectl apply -f k8s/hazelcast-service.yaml
# 3. Deploy with 1 replica first
kubectl apply -f k8s/api-deployment.yaml
kubectl scale deployment tqpro-api --replicas=1 -n tqpro
# 4. Wait for pod to be ready
kubectl wait --for=condition=ready pod -l app=tqpro-api --timeout=120s -n tqpro
# 5. Check logs for Hazelcast initialization
kubectl logs deployment/tqpro-api -n tqpro | grep -i hazelcast
# Expected output:
# Initializing Hazelcast cluster cache...
# Deployment mode: kubernetes
# Kubernetes discovery configured: namespace=tqpro, service=tqpro-hazelcast
# Hazelcast instance created successfully. Cluster size: 1
# 6. Check health endpoint
# port-forward blocks, so run it in the background (or in a second terminal)
kubectl port-forward deployment/tqpro-api 11080:11080 -n tqpro &
curl http://localhost:11080/tlinq-api/health/cache
# Expected response:
# {
# "clusterSize": 1,
# "status": "UP",
# "caches": {
# "cartsCache": "Cache[cartsCache]: size=0, localSize=0",
# "userSessions": "Cache[userSessions]: size=0, localSize=0",
# "apiRolesCache": "Cache[apiRolesCache]: size=50, localSize=50"
# }
# }
# 7. Scale to 3 replicas
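kubectl scale deployment tqpro-api --replicas=3 -n tqpro
# 8. Wait for all pods ready
kubectl wait --for=condition=ready pod -l app=tqpro-api --timeout=120s -n tqpro
# 9. Check cluster formation
# The "Cluster size" line is logged once at startup, so it reflects the size at
# that moment; Hazelcast itself logs the member list on every membership change
kubectl logs deployment/tqpro-api -n tqpro | grep "Members {size"
# The most recent entry should show: Members {size:3, ...
# 10. Verify on all pods
for pod in $(kubectl get pods -n tqpro -l app=tqpro-api -o name); do
echo "Checking $pod"
kubectl exec -n tqpro "$pod" -- curl -s http://localhost:11080/tlinq-api/health/cache | jq '.clusterSize'
done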
# All should return: 3
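Cluster formation is not instantaneous after scaling, so a scripted check may need to poll the health endpoint until the expected size appears. The sketch below shows one way to do that with the JDK HTTP client (Java 11+). `ClusterSizeProbe` is a hypothetical helper; it assumes the response contains a numeric `clusterSize` field as in the expected response above, and it uses a regular expression instead of a JSON library to stay dependency-free.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical test helper; not part of the existing code base. */
final class ClusterSizeProbe {

    private static final Pattern CLUSTER_SIZE =
            Pattern.compile("\"clusterSize\"\\s*:\\s*(\\d+)");

    /** Extracts clusterSize from the health JSON without a JSON library. */
    static OptionalInt parseClusterSize(String json) {
        Matcher m = CLUSTER_SIZE.matcher(json);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1)))
                        : OptionalInt.empty();
    }

    /** Polls the endpoint until it reports the expected size or attempts run out. */
    static boolean waitForSize(String url, int expected, int attempts)
            throws InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        for (int i = 0; i < attempts; i++) {
            try {
                String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
                OptionalInt size = parseClusterSize(body);
                if (size.isPresent() && size.getAsInt() == expected) {
                    return true;
                }
            } catch (java.io.IOException e) {
                // Endpoint not reachable yet; retry after the pause below.
            }
            Thread.sleep(2000);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        if (args.length < 2) {
            System.out.println("usage: ClusterSizeProbe <health-url> <expected-size>");
            return;
        }
        boolean formed = waitForSize(args[0], Integer.parseInt(args[1]), 30);
        System.out.println(formed ? "cluster formed" : "timed out");
    }
}
```

With the port-forward from step 6 active, it could be run as `java ClusterSizeProbe.java http://localhost:11080/tlinq-api/health/cache 3`.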
5.3 Cache Consistency Testing
Test shopping cart across pods:
# Pod names below are placeholders; list the real ones with:
#   kubectl get pods -n tqpro -l app=tqpro-api
# Create a cart on pod-1
SESSION_ID="test-session-$(date +%s)"
# Add item via pod-1
kubectl exec -n tqpro tqpro-api-pod-1 -- curl -s -X POST \
http://localhost:11080/tlinq-api/cart/addItem \
-H "Content-Type: application/json" \
-d "{\"session\":\"$SESSION_ID\",\"itemId\":\"ITEM123\"}"
# Retrieve cart via pod-2 (different pod!)
kubectl exec -n tqpro tqpro-api-pod-2 -- curl -s -X POST \
http://localhost:11080/tlinq-api/cart/load \
-H "Content-Type: application/json" \
-d "{\"session\":\"$SESSION_ID\"}"
# Should return the same cart with ITEM123
Test session across pods:
# Login via pod-1 (pod names are placeholders; substitute real pod names)
LOGIN_RESPONSE=$(kubectl exec -n tqpro tqpro-api-pod-1 -- curl -s -X POST \
http://localhost:11080/tlinq-api/user/login \
-H "Content-Type: application/json" \
-d '{"username":"testuser","password":"testpass"}')
SESSION_TOKEN=$(echo "$LOGIN_RESPONSE" | jq -r '.sessionToken')
# Use session on pod-3
kubectl exec -n tqpro tqpro-api-pod-3 -- curl -s -X GET \
http://localhost:11080/tlinq-api/user/profile \
-H "Content-Type: application/json" \
-d "{\"session\":\"$SESSION_TOKEN\"}"
# Should return user profile without re-authentication
5.4 Load Testing
Use k6 or Apache JMeter:
// k6-hazelcast-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
export let options = {
stages: [
{ duration: '2m', target: 50 }, // Ramp up to 50 users
{ duration: '5m', target: 50 }, // Stay at 50 users
{ duration: '2m', target: 100 }, // Ramp to 100 users
{ duration: '5m', target: 100 }, // Stay at 100 users
{ duration: '2m', target: 0 }, // Ramp down
],
};
export default function () {
const sessionId = `session-${__VU}-${__ITER}`;
// Add item to cart
let res = http.post('http://tqpro-api/tlinq-api/cart/addItem',
JSON.stringify({
session: sessionId,
itemId: `item-${Math.random()}`
}),
{ headers: { 'Content-Type': 'application/json' } }
);
check(res, {
'cart add successful': (r) => r.status === 200,
'response time < 200ms': (r) => r.timings.duration < 200,
});
sleep(1);
// Load cart
res = http.post('http://tqpro-api/tlinq-api/cart/load',
JSON.stringify({ session: sessionId }),
{ headers: { 'Content-Type': 'application/json' } }
);
check(res, {
'cart load successful': (r) => r.status === 200,
'cart not empty': (r) => JSON.parse(r.body).items.length > 0,
});
sleep(1);
}
Run test:
k6 run k6-hazelcast-test.js
Expected results:
- Cache operations < 100ms (p95)
- No cache misses (100% hit rate after warmup)
- Cluster remains stable under load
- No pod restarts or OOM errors
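The p95 target means that 95% of observed operations finish within the stated time. When latencies are collected outside k6 (for example from application logs), the percentile can be computed directly. The sketch below uses the nearest-rank method; `Percentile` is a hypothetical helper, and load-testing tools may interpolate between samples, so their reported values can differ slightly.

```java
import java.util.Arrays;

/** Hypothetical helper; not part of the existing code base. */
final class Percentile {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * the given percentage of samples are less than or equal to it.
     */
    static double nearestRank(double[] samples, double percentile) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percentile <= 0 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        double[] latenciesMs = {12, 8, 95, 15, 11, 9, 140, 10, 13, 14};
        double p95 = nearestRank(latenciesMs, 95);
        System.out.println("p95 = " + p95 + " ms, target met: " + (p95 < 100));
    }
}
```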
6. Rollback Plan
6.1 Immediate Rollback (< 5 minutes)
If Hazelcast clustering fails in production:
# Option 1: Rollback deployment
kubectl rollout undo deployment/tqpro-api -n tqpro
# Option 2: Scale to 1 replica (bypass clustering)
kubectl scale deployment tqpro-api --replicas=1 -n tqpro
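# Option 3: Set deployment mode to baremetal
# Caution: multicast is generally unavailable in Kubernetes and the default
# hazelcast.interface (172.16.55.1) will not match pod IPs, so prefer Option 1 or 2
kubectl set env deployment/tqpro-api DEPLOYMENT_MODE=baremetal -n tqpro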
6.2 Code Rollback
If code changes cause issues:
# Revert Git commit
git revert <commit-hash>
git push origin codev
# Rebuild and redeploy
./gradlew clean build
docker build -t registry.company.com/tqpro/api:rollback .
docker push registry.company.com/tqpro/api:rollback
kubectl set image deployment/tqpro-api api=registry.company.com/tqpro/api:rollback -n tqpro
6.3 Fallback to Redis
If Hazelcast proves problematic:
See separate Redis migration plan (2-week timeline).
7. Monitoring & Validation
7.1 Metrics to Monitor
Application Metrics:
- Hazelcast cluster size (should equal pod count)
- Cache hit/miss ratio per cache
- Cache operation latency (get, put)
- Eviction rate (should be low with proper TTL)
- Cache size (memory usage)

Kubernetes Metrics:
- Pod restart count (should be 0)
- Pod ready status
- Network traffic between pods
- Memory usage per pod

Business Metrics:
- User session errors (SESSION_ERROR exceptions)
- Cart loss incidents
- Login frequency (high = session loss issue)
7.2 Prometheus Metrics (Optional Enhancement)
Add Hazelcast Prometheus exporter:
# In Deployment
env:
  - name: JAVA_OPTS
    value: "-Dhazelcast.jmx=true -javaagent:/app/jmx_prometheus_javaagent.jar=8080:/app/jmx-config.yaml"
JMX Config (jmx-config.yaml):
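# Patterns are regular expressions matched against MBean names; verify them
# against the MBeans the running member actually exposes (for example with jconsole)
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: 'com.hazelcast<instance=*, type=Metrics, name=cluster.clock>Timestamp'
    name: hazelcast_cluster_time
    type: GAUGE
  - pattern: 'com.hazelcast<instance=*, type=Metrics, name=cluster.size>Value'
    name: hazelcast_cluster_size
    type: GAUGE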
7.3 Grafana Dashboard
Key panels:
1. Hazelcast Cluster Size (line chart)
2. Cache Operations/sec (rate chart)
3. Cache Hit Ratio (gauge)
4. Cache Memory Usage (stacked area)
5. Top Evicted Caches (table)
7.4 Alerts
Critical Alerts:
# Prometheus alerting rules (firing alerts are routed by Alertmanager)
groups:
  - name: hazelcast
    interval: 30s
    rules:
      - alert: HazelcastClusterDegraded
        expr: hazelcast_cluster_size < 2
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Hazelcast cluster degraded"
          description: "Cluster size is {{ $value }}, expected >= 2"
      # The two rules below assume map-level metrics with these names are exported;
      # the JMX config in 7.2 defines cluster-level metrics only
      - alert: HazelcastCacheEvictionHigh
        expr: rate(hazelcast_map_evictions_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High cache eviction rate"
          description: "Eviction rate: {{ $value }}/sec"
      - alert: HazelcastHighLatency
        # histogram_quantile operates on the rate of the _bucket series
        expr: histogram_quantile(0.95, sum by (le) (rate(hazelcast_map_get_latency_seconds_bucket[5m]))) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Hazelcast cache latency high"
          description: "P95 latency: {{ $value }}s"
8. Timeline & Milestones
Week 1: Development & Unit Testing
Day 1-2:
- ✅ Update dependencies
- ✅ Modify TlinqClusterCache.java
- ✅ Add unit tests
- ✅ Local testing (bare-metal mode)

Day 3-4:
- ✅ Create Kubernetes manifests (RBAC, Service)
- ✅ Update Deployment
- ✅ Add health check endpoint
- ✅ Update shutdown logic

Day 5:
- ✅ Code review
- ✅ Documentation updates
- ✅ Merge to development branch
Week 2: Testing & Deployment
Day 1-2:
- ✅ Deploy to dev K8s cluster
- ✅ Verify cluster formation (1 pod)
- ✅ Scale to 3 pods
- ✅ Cache consistency testing

Day 3-4:
- ✅ Load testing
- ✅ Chaos testing (pod kills)
- ✅ Performance validation
- ✅ Fix any issues found

Day 5:
- ✅ Deploy to staging
- ✅ User acceptance testing
- ✅ Production deployment plan
9. Success Criteria
Functional Success
- ✅ Hazelcast cluster forms with 3+ members in Kubernetes
- ✅ Shopping carts accessible from any pod
- ✅ User sessions persist across pod switches
- ✅ API roles cache shared across cluster
- ✅ Bare-metal deployment still works (backward compatibility)
Performance Success
- ✅ Cache operations < 100ms (p95)
- ✅ Cluster formation < 30 seconds
- ✅ No cache-related errors under load
- ✅ Memory usage within limits (< 2GB per pod)
Operational Success
- ✅ Health checks working (readiness/liveness)
- ✅ Graceful shutdown completes in < 30 seconds
- ✅ Monitoring dashboards showing metrics
- ✅ Alerts configured and tested
- ✅ Runbook documented
10. Risks & Mitigation
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Cluster formation fails | Medium | High | Test thoroughly in dev; fallback to single replica |
| Network latency too high | Low | Medium | Use same availability zone; monitor metrics |
| Memory leak from cache growth | Low | High | Configure TTL and eviction; monitor memory |
| Split-brain scenario | Low | Critical | Configure quorum; test network partitions |
| Backward compatibility broken | Low | High | Maintain environment variable switch; test bare-metal |
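For the split-brain risk, "configure quorum" usually means refusing operations on any partition that cannot see a majority of the expected members. Hazelcast offers a split-brain protection feature that takes a minimum cluster size; its exact configuration should be taken from the Hazelcast documentation. The sketch below only illustrates the majority arithmetic behind choosing that minimum, and `QuorumMath` is a hypothetical helper.

```java
/** Hypothetical helper illustrating majority arithmetic; it does not call Hazelcast. */
final class QuorumMath {

    /** Smallest member count that forms a strict majority of the cluster. */
    static int majority(int expectedClusterSize) {
        if (expectedClusterSize < 1) {
            throw new IllegalArgumentException("cluster size must be positive");
        }
        return expectedClusterSize / 2 + 1;
    }

    /** True when a partition that sees this many members holds the majority. */
    static boolean hasQuorum(int visibleMembers, int expectedClusterSize) {
        return visibleMembers >= majority(expectedClusterSize);
    }

    public static void main(String[] args) {
        // With 3 replicas, a 2/1 network split leaves only the 2-member side writable.
        System.out.println("majority of 3 = " + majority(3));
        System.out.println("1 of 3 has quorum: " + hasQuorum(1, 3));
        System.out.println("2 of 3 has quorum: " + hasQuorum(2, 3));
    }
}
```

With the planned 3 replicas the minimum would be 2. An even replica count adds no extra tolerance: 4 members need 3 for a majority, so a 2/2 split leaves neither side with quorum.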
11. Documentation Requirements
Code Documentation
- ✅ Javadoc for all new methods in TlinqClusterCache
- ✅ Configuration comments for Kubernetes discovery
- ✅ Examples in comments
Operational Documentation
- ✅ Deployment guide (this document)
- ✅ Troubleshooting guide
- ✅ Monitoring runbook
- ✅ Rollback procedures
User Documentation
- ✅ Update KUBERNETES_DEPLOYMENT_PLAN.md
- ✅ Add Hazelcast section to README
- ✅ Document environment variables
12. Post-Deployment Tasks
Week 1 After Production:
- Monitor cluster health daily
- Review cache eviction rates
- Analyze performance metrics
- Gather user feedback

Week 2-4 After Production:
- Optimize TTL settings based on usage
- Fine-tune eviction policies
- Consider Hazelcast Management Center for deep insights
- Document lessons learned
Document Prepared By: DevOps Team
Review Required By: Tech Lead, Platform Team
Approval Required By: CTO
End of Implementation Plan