Hazelcast Kubernetes Migration - Implementation Plan

Document Version: 1.0
Date: 2024-11-23
Priority: HIGH (Critical for Multi-Instance Deployment)
Estimated Effort: 3-5 days


Table of Contents

  1. Executive Summary
  2. Current State Analysis
  3. Migration Requirements
  4. Implementation Steps
  5. Testing Strategy
  6. Rollback Plan
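  7. Monitoring & Validation
  8. Timeline & Milestones
  9. Success Criteria
  10. Risks & Mitigation
  11. Documentation Requirements
  12. Post-Deployment Tasks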

1. Executive Summary

Problem Statement

The current Hazelcast implementation uses hardcoded IP addresses and multicast discovery, which are incompatible with Kubernetes dynamic pod networking. This prevents proper distributed caching in multi-instance deployments.

Current Issues:

  • ❌ Hardcoded IP: 172.16.55.1 (line 43 in TlinqClusterCache.java)
  • ❌ Multicast discovery (doesn't work in K8s)
  • ❌ No eviction policies (memory leak risk)
  • ❌ No TTL configuration (unlimited cache growth)

Impact of NOT Fixing:

  • User sessions lost when requests hit different pods
  • Anonymous shopping carts lost
  • Users forced to re-login frequently
  • Cannot scale horizontally

Solution Overview

Migrate Hazelcast from multicast discovery to Kubernetes service-based discovery while maintaining backward compatibility with bare-metal deployments.

Benefits:

  • ✅ Distributed cache works across all pods
  • ✅ User sessions shared between instances
  • ✅ Shopping carts persistent across pod switches
  • ✅ Proper memory management (TTL + eviction)
  • ✅ Horizontal scaling enabled


2. Current State Analysis

2.1 Hazelcast Usage Inventory

Caches in Use (3 total):

Cache Name      Purpose               Data Type   Critical   DB Fallback
cartsCache      Shopping carts        Map         YES        Partial*
userSessions    Odoo user sessions    Map         YES        NO
apiRolesCache   API authorization     Map         MEDIUM     File reload

*Logged-in user carts have DB fallback; anonymous carts do not

2.2 Current Configuration

File: /home/nino/src/tqpro/tqcommon/src/main/java/com/perun/tlinq/entity/cache/TlinqClusterCache.java

// Lines 40-50: Problematic configuration
private TlinqClusterCache() {
    Config cfg = new Config();
    ArrayList<String> nifs = new ArrayList<>(){{
        add("172.16.55.1");  // ❌ HARDCODED IP
    }};
    InterfacesConfig ifc = new InterfacesConfig();
    ifc.setInterfaces(nifs);
    JoinConfig jcfg = cfg.getAdvancedNetworkConfig().getJoin();
    jcfg.getMulticastConfig().setMulticastPort(55478);  // ❌ MULTICAST

    _hc = Hazelcast.newHazelcastInstance(cfg);
}

Problems:

  1. Hardcoded IP won't work with dynamic pod IPs
  2. Multicast is disabled in most Kubernetes CNI plugins
  3. No map-specific configurations
  4. No eviction or TTL settings

2.3 Dependencies

Current Dependency (build.gradle.kts:57):

implementation("com.hazelcast:hazelcast-all:4.2.4")

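Issues:

  • Version 4.2.4 is outdated (current stable: 5.3.x)
  • hazelcast-all is deprecated and is not published for 5.x (use the core hazelcast artifact)
  • Kubernetes discovery must be on the classpath (from Hazelcast 5.0 onward it ships inside the core hazelcast artifact)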


3. Migration Requirements

3.1 Functional Requirements

FR-1: Kubernetes Discovery

  • Must discover cluster members via Kubernetes service
  • Must support headless service for member discovery
  • Must work with DNS-based service discovery
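The DNS-based requirement in FR-1 can be illustrated with a small sketch. It assumes the Hazelcast Kubernetes discovery property `service-dns` (DNS Lookup mode, which resolves the headless service instead of calling the Kubernetes API) and the default cluster domain `cluster.local`; the class `DnsDiscoverySketch` and its methods are hypothetical helpers, not part of the codebase.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds discovery properties for DNS Lookup mode. */
final class DnsDiscoverySketch {

    private DnsDiscoverySketch() { }

    /** Fully qualified DNS name of a (headless) service inside the cluster. */
    static String serviceDns(String service, String namespace, String clusterDomain) {
        return service + "." + namespace + ".svc." + clusterDomain;
    }

    /**
     * Properties that would be passed one by one to KubernetesConfig.setProperty(...)
     * in place of "namespace" and "service-name".
     * Assumption: the cluster uses the default domain "cluster.local".
     */
    static Map<String, String> dnsLookupProperties(String service, String namespace) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("service-dns", serviceDns(service, namespace, "cluster.local"));
        props.put("service-port", "5701");
        return props;
    }
}
```

For the service and namespace used in this plan, `dnsLookupProperties("tqpro-hazelcast", "tqpro")` yields `service-dns=tqpro-hazelcast.tqpro.svc.cluster.local`. Whether DNS Lookup mode or the API mode used in Step 2 is the better fit should be decided during testing.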

FR-2: Backward Compatibility

  • Must support bare-metal deployment (multicast)
  • Must auto-detect deployment mode via environment variable
  • Must not break existing bare-metal installations

FR-3: Cache Configuration

  • Must configure TTL for session caches (30-60 min)
  • Must configure eviction policies (LRU, max size)
  • Must prevent memory leaks from unlimited growth

FR-4: Cluster Health

  • Must expose cluster member count
  • Must support health checks for readiness probe
  • Must log cluster formation events

3.2 Non-Functional Requirements

NFR-1: Performance

  • Cache operations must complete in < 10ms (p95)
  • Cluster formation must complete in < 30 seconds
  • Network overhead must be minimal

NFR-2: Reliability

  • Must handle pod restarts gracefully
  • Must maintain cache consistency during scaling
  • Must support rolling updates without cache loss

NFR-3: Security

  • Must support encrypted inter-member communication (optional)
  • Must validate cluster membership
  • Must not expose cache data externally

3.3 Kubernetes Requirements

Required Resources:

  1. ServiceAccount with RBAC permissions
  2. Headless Service for member discovery
  3. ConfigMap for Hazelcast configuration (optional)
  4. NetworkPolicy for pod-to-pod communication (optional)

Required Environment Variables:

  • DEPLOYMENT_MODE: kubernetes or baremetal
  • K8S_NAMESPACE: Kubernetes namespace name
  • HAZELCAST_SERVICE: Service name for discovery
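The DEPLOYMENT_MODE switch can be kept unit-testable by resolving it in a pure function. A minimal sketch, assuming the precedence used in Step 2 (environment variable first, then the `deployment.mode` system property, then `baremetal`); the class `DeploymentModeResolver` and its map-based signature are illustrative only. Unlike the Step 2 code, this sketch also maps unrecognised values to `baremetal`.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: resolves the deployment mode from explicit inputs. */
final class DeploymentModeResolver {

    static final String KUBERNETES = "kubernetes";
    static final String BAREMETAL = "baremetal";

    private DeploymentModeResolver() { }

    /**
     * @param env   environment variables, e.g. System.getenv()
     * @param props system properties copied into a map
     * @return "kubernetes" or "baremetal"; unset or unrecognised values give "baremetal"
     */
    static String resolve(Map<String, String> env, Map<String, String> props) {
        String mode = env.get("DEPLOYMENT_MODE");
        if (mode == null || mode.trim().isEmpty()) {
            mode = props.get("deployment.mode");
        }
        if (mode == null) {
            return BAREMETAL;
        }
        String normalized = mode.trim().toLowerCase(Locale.ROOT);
        return KUBERNETES.equals(normalized) ? KUBERNETES : BAREMETAL;
    }
}
```

Defaulting to `baremetal` keeps existing installations working when the variable is absent, which is the intent of FR-2.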


4. Implementation Steps

Step 1: Update Dependencies (30 minutes)

File: build.gradle.kts

Changes:

dependencies {
    // REMOVE old dependency
    // implementation("com.hazelcast:hazelcast-all:4.2.4")

    // ADD new dependency
    implementation("com.hazelcast:hazelcast:5.3.6")
    // Kubernetes discovery is bundled in the core artifact from Hazelcast 5.0 onward.
    // The standalone hazelcast-kubernetes 2.x plugin targets Hazelcast 4.x and should not be added alongside 5.x.

    // Existing dependencies...
}

Verification:

./gradlew dependencies | grep hazelcast
# Should show:
# +--- com.hazelcast:hazelcast:5.3.6

Step 2: Modify TlinqClusterCache.java (4 hours)

File: /home/nino/src/tqpro/tqcommon/src/main/java/com/perun/tlinq/entity/cache/TlinqClusterCache.java

Complete New Implementation:

package com.perun.tlinq.entity.cache;

import com.hazelcast.config.*;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

import java.util.Map;
import java.util.logging.Logger;

/**
 * Singleton cache manager using Hazelcast distributed cache.
 * Supports both Kubernetes (service discovery) and bare-metal (multicast) deployments.
 */
public class TlinqClusterCache {

    private static final Logger logger = Logger.getLogger(TlinqClusterCache.class.getName());
    private static TlinqClusterCache _instance;
    private HazelcastInstance _hc;

    private TlinqClusterCache() {
        logger.info("Initializing Hazelcast cluster cache...");
        Config cfg = createHazelcastConfig();
        _hc = Hazelcast.newHazelcastInstance(cfg);
        logger.info("Hazelcast instance created successfully. Cluster size: " +
                    _hc.getCluster().getMembers().size());
    }

    /**
     * Creates Hazelcast configuration based on deployment mode.
     */
    private Config createHazelcastConfig() {
        Config cfg = new Config();
        cfg.setClusterName(getClusterName());

        // Configure network based on deployment mode
        String deploymentMode = getDeploymentMode();
        logger.info("Deployment mode: " + deploymentMode);

        if ("kubernetes".equalsIgnoreCase(deploymentMode)) {
            configureKubernetesDiscovery(cfg);
        } else {
            configureBaremetalDiscovery(cfg);
        }

        // Configure cache maps with TTL and eviction
        configureCacheMaps(cfg);

        // Optional: Configure management center
        configureManagementCenter(cfg);

        return cfg;
    }

    /**
     * Configures Kubernetes service-based discovery.
     */
    private void configureKubernetesDiscovery(Config cfg) {
        logger.info("Configuring Kubernetes discovery...");

        NetworkConfig networkConfig = cfg.getNetworkConfig();

        // Set port (default 5701)
        networkConfig.setPort(5701);
        networkConfig.setPortAutoIncrement(true);
        networkConfig.setPortCount(100);

        // Disable multicast and TCP/IP
        JoinConfig joinConfig = networkConfig.getJoin();
        joinConfig.getMulticastConfig().setEnabled(false);
        joinConfig.getTcpIpConfig().setEnabled(false);

        // Enable Kubernetes discovery
        KubernetesConfig k8sConfig = joinConfig.getKubernetesConfig();
        k8sConfig.setEnabled(true);

        // Get configuration from environment
        String namespace = System.getenv("K8S_NAMESPACE");
        String serviceName = System.getenv("HAZELCAST_SERVICE");

        if (namespace == null || namespace.isEmpty()) {
            namespace = "default";
            logger.warning("K8S_NAMESPACE not set, using default: " + namespace);
        }

        if (serviceName == null || serviceName.isEmpty()) {
            serviceName = "tqpro-hazelcast";
            logger.warning("HAZELCAST_SERVICE not set, using default: " + serviceName);
        }

        // Configure Kubernetes discovery properties
        k8sConfig.setProperty("namespace", namespace);
        k8sConfig.setProperty("service-name", serviceName);
        k8sConfig.setProperty("service-port", "5701");
        k8sConfig.setProperty("resolve-not-ready-addresses", "true");

        logger.info(String.format("Kubernetes discovery configured: namespace=%s, service=%s",
                                  namespace, serviceName));
    }

    /**
     * Configures bare-metal multicast discovery (legacy mode).
     */
    private void configureBaremetalDiscovery(Config cfg) {
        logger.info("Configuring bare-metal multicast discovery...");

        NetworkConfig networkConfig = cfg.getNetworkConfig();

        // Get network interface from system property or use default
        String hazelcastInterface = System.getProperty("hazelcast.interface", "172.16.55.1");
        int hazelcastPort = Integer.parseInt(System.getProperty("hazelcast.port", "55478"));

        networkConfig.setPort(hazelcastPort);
        networkConfig.setPortAutoIncrement(true);

        // Configure interfaces
        InterfacesConfig interfacesConfig = networkConfig.getInterfaces();
        interfacesConfig.setEnabled(true);
        interfacesConfig.addInterface(hazelcastInterface);

        // Enable multicast
        JoinConfig joinConfig = networkConfig.getJoin();
        MulticastConfig multicastConfig = joinConfig.getMulticastConfig();
        multicastConfig.setEnabled(true);
        multicastConfig.setMulticastPort(hazelcastPort);
        multicastConfig.setMulticastTimeoutSeconds(5);
        multicastConfig.setMulticastTimeToLive(32);

        logger.info(String.format("Multicast discovery configured: interface=%s, port=%d",
                                  hazelcastInterface, hazelcastPort));
    }

    /**
     * Configures individual cache maps with TTL and eviction policies.
     */
    private void configureCacheMaps(Config cfg) {
        logger.info("Configuring cache maps...");

        // Shopping carts cache - 30 minute TTL
        MapConfig cartsConfig = new MapConfig("cartsCache");
        cartsConfig.setTimeToLiveSeconds(1800);  // 30 minutes
        cartsConfig.setMaxIdleSeconds(1800);
        cartsConfig.setEvictionConfig(new EvictionConfig()
            .setSize(10000)
            .setMaxSizePolicy(MaxSizePolicy.PER_NODE)
            .setEvictionPolicy(EvictionPolicy.LRU));
        cartsConfig.setBackupCount(1);  // 1 backup copy
        cartsConfig.setAsyncBackupCount(0);
        cfg.addMapConfig(cartsConfig);
        logger.info("Configured cartsCache: TTL=30min, maxSize=10000, backups=1");

        // User sessions cache - 60 minute TTL
        MapConfig sessionsConfig = new MapConfig("userSessions");
        sessionsConfig.setTimeToLiveSeconds(3600);  // 60 minutes
        sessionsConfig.setMaxIdleSeconds(3600);
        sessionsConfig.setEvictionConfig(new EvictionConfig()
            .setSize(5000)
            .setMaxSizePolicy(MaxSizePolicy.PER_NODE)
            .setEvictionPolicy(EvictionPolicy.LRU));
        sessionsConfig.setBackupCount(1);
        sessionsConfig.setAsyncBackupCount(0);
        cfg.addMapConfig(sessionsConfig);
        logger.info("Configured userSessions: TTL=60min, maxSize=5000, backups=1");

        // API roles cache - no TTL (static data)
        MapConfig rolesConfig = new MapConfig("apiRolesCache");
        rolesConfig.setEvictionConfig(new EvictionConfig()
            .setSize(1000)
            .setMaxSizePolicy(MaxSizePolicy.PER_NODE)
            .setEvictionPolicy(EvictionPolicy.NONE));  // No eviction for static data
        rolesConfig.setBackupCount(1);
        cfg.addMapConfig(rolesConfig);
        logger.info("Configured apiRolesCache: no TTL, maxSize=1000, backups=1");
    }

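    /**
     * Optional: Configure Hazelcast Management Center.
     * Since Hazelcast 4.0, Management Center connects to the members as a client,
     * so the URL is not part of the member configuration; it is only logged here.
     */
    private void configureManagementCenter(Config cfg) {
        String mcUrl = System.getenv("HAZELCAST_MANAGEMENT_CENTER_URL");
        if (mcUrl != null && !mcUrl.isEmpty()) {
            ManagementCenterConfig mcConfig = cfg.getManagementCenterConfig();
            // Keep scripting disabled: enabling it lets Management Center run scripts on members
            mcConfig.setScriptingEnabled(false);
            logger.info("Management Center expected at: " + mcUrl);
        }
    }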

    /**
     * Gets deployment mode from environment variable.
     * @return "kubernetes" or "baremetal"
     */
    private String getDeploymentMode() {
        String mode = System.getenv("DEPLOYMENT_MODE");
        if (mode == null || mode.isEmpty()) {
            mode = System.getProperty("deployment.mode", "baremetal");
        }
        return mode.toLowerCase();
    }

    /**
     * Gets cluster name from environment or uses default.
     */
    private String getClusterName() {
        String clusterName = System.getenv("HAZELCAST_CLUSTER_NAME");
        if (clusterName == null || clusterName.isEmpty()) {
            clusterName = "tqpro-cluster";
        }
        return clusterName;
    }

    /**
     * Singleton instance accessor.
     */
    public static synchronized TlinqClusterCache instance() {
        if (_instance == null) {
            _instance = new TlinqClusterCache();
        }
        return _instance;
    }

    /**
     * Gets a distributed map by name.
     */
    public <K, V> Map<K, V> getCache(String cacheName) {
        IMap<K, V> map = _hc.getMap(cacheName);
        // map.size() is a cluster-wide call, so only pay for it when FINE logging is enabled
        if (logger.isLoggable(java.util.logging.Level.FINE)) {
            logger.fine("Retrieved cache: " + cacheName + ", size: " + map.size());
        }
        return map;
    }

    /**
     * Adds a cache with initial data (used by ApiRoleManager).
     */
    public void addCache(String cacheName, Map<?, ?> data) {
        IMap<Object, Object> map = _hc.getMap(cacheName);
        map.putAll(data);
        logger.info("Added cache: " + cacheName + ", entries: " + data.size());
    }

    /**
     * Gets a single cache entry.
     */
    public Object getCacheEntry(String cacheName, Object key) {
        IMap<Object, Object> map = _hc.getMap(cacheName);
        return map.get(key);
    }

    /**
     * Clears a cache.
     */
    public void clearCache(String cacheName) {
        IMap<Object, Object> map = _hc.getMap(cacheName);
        map.clear();
        logger.info("Cleared cache: " + cacheName);
    }

    /**
     * Gets cluster member count (for health checks).
     */
    public int getClusterSize() {
        return _hc.getCluster().getMembers().size();
    }

    /**
     * Gets cache statistics.
     */
    public String getCacheStats(String cacheName) {
        IMap<Object, Object> map = _hc.getMap(cacheName);
        return String.format("Cache[%s]: size=%d, localSize=%d",
                           cacheName, map.size(), map.getLocalMapStats().getOwnedEntryCount());
    }

    /**
     * Graceful shutdown.
     */
    public void shutdown() {
        logger.info("Shutting down Hazelcast cluster cache...");
        if (_hc != null) {
            _hc.shutdown();
        }
    }

    /**
     * Gets Hazelcast instance (for advanced operations).
     */
    public HazelcastInstance getHazelcastInstance() {
        return _hc;
    }
}
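The map configuration above sets TTL and max-idle to the same value. The sketch below shows how the two limits combine, assuming Hazelcast's documented semantics (TTL counts from the last write, max-idle from the last read or write, and 0 disables a limit); `ExpirySketch` is a hypothetical illustration and not Hazelcast code.

```java
/** Hypothetical illustration of how TTL and max-idle combine for one entry. */
final class ExpirySketch {

    private ExpirySketch() { }

    /**
     * @param lastUpdateMillis time of the last write to the entry
     * @param lastAccessMillis time of the last read or write
     * @param ttlSeconds       time-to-live in seconds, 0 = no TTL
     * @param maxIdleSeconds   max idle time in seconds, 0 = no idle limit
     * @return the time (millis) at which the entry becomes eligible for expiry,
     *         or Long.MAX_VALUE if neither limit is set
     */
    static long expiresAt(long lastUpdateMillis, long lastAccessMillis,
                          int ttlSeconds, int maxIdleSeconds) {
        long byTtl = ttlSeconds > 0
                ? lastUpdateMillis + ttlSeconds * 1000L
                : Long.MAX_VALUE;
        long byIdle = maxIdleSeconds > 0
                ? lastAccessMillis + maxIdleSeconds * 1000L
                : Long.MAX_VALUE;
        // Whichever limit is reached first wins
        return Math.min(byTtl, byIdle);
    }
}
```

With TTL and max-idle both at 3600 seconds, a session written at minute 0 and read at minute 50 still expires at minute 60, because the TTL limit is always reached first. If sessions should stay alive while in use, one option is to rely on max-idle alone or to rewrite the entry on access; which behaviour is wanted is a product decision outside this sketch.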

Step 3: Update Health Check API (1 hour)

File: /home/nino/src/tqpro/tqapi/src/main/java/com/perun/tlinq/api/HealthApi.java

Add cache cluster health check:

@GET
@Path("/cache")
@Produces(MediaType.APPLICATION_JSON)
public Response cacheHealth() {
    Map<String, Object> status = new HashMap<>();

    try {
        TlinqClusterCache cache = TlinqClusterCache.instance();
        int clusterSize = cache.getClusterSize();

        status.put("clusterSize", clusterSize);
        status.put("status", clusterSize >= 1 ? "UP" : "DOWN");

        // Get stats for each cache
        Map<String, String> cacheStats = new HashMap<>();
        cacheStats.put("cartsCache", cache.getCacheStats("cartsCache"));
        cacheStats.put("userSessions", cache.getCacheStats("userSessions"));
        cacheStats.put("apiRolesCache", cache.getCacheStats("apiRolesCache"));
        status.put("caches", cacheStats);

        if (clusterSize < 1) {
            return Response.status(503).entity(status).build();
        }

        return Response.ok(status).build();

    } catch (Exception e) {
        status.put("status", "ERROR");
        status.put("error", e.getMessage());
        return Response.status(503).entity(status).build();
    }
}
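Because a running member always sees at least itself, the `clusterSize >= 1` check above cannot fail once the instance exists. If a stricter readiness rule is wanted, the decision can be isolated as below. This is a sketch only; the `HAZELCAST_MIN_MEMBERS` variable and the `CacheReadiness` class are hypothetical names.

```java
/** Hypothetical helper: maps cluster size to an HTTP status for the readiness probe. */
final class CacheReadiness {

    private CacheReadiness() { }

    /** Parses the (hypothetical) HAZELCAST_MIN_MEMBERS value; invalid or missing input means 1. */
    static int parseMinMembers(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return 1;
        }
        try {
            return Math.max(1, Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return 1;
        }
    }

    /** Returns 200 when enough members are visible, otherwise 503. */
    static int httpStatus(int clusterSize, int minMembers) {
        int required = Math.max(1, minMembers);
        return clusterSize >= required ? 200 : 503;
    }
}
```

In the endpoint this would be used as `CacheReadiness.httpStatus(cache.getClusterSize(), CacheReadiness.parseMinMembers(System.getenv("HAZELCAST_MIN_MEMBERS")))`. A minimum above 1 must be lowered before scaling to a single replica, otherwise the remaining pod never becomes ready.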

Step 4: Create Kubernetes RBAC Resources (30 minutes)

File: /home/nino/src/tqpro/k8s/hazelcast-rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: tqpro-api
  namespace: tqpro
  labels:
    app: tqpro

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hazelcast-discovery
  namespace: tqpro
  labels:
    app: tqpro
rules:
- apiGroups: [""]
  resources: ["endpoints", "pods", "services"]
  verbs: ["get", "list"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hazelcast-discovery-binding
  namespace: tqpro
  labels:
    app: tqpro
subjects:
- kind: ServiceAccount
  name: tqpro-api
  namespace: tqpro
roleRef:
  kind: Role
  name: hazelcast-discovery
  apiGroup: rbac.authorization.k8s.io

Step 5: Create Hazelcast Service (15 minutes)

File: /home/nino/src/tqpro/k8s/hazelcast-service.yaml

apiVersion: v1
kind: Service
metadata:
  name: tqpro-hazelcast
  namespace: tqpro
  labels:
    app: tqpro-api
    component: hazelcast
spec:
  clusterIP: None  # Headless service for discovery
  publishNotReadyAddresses: true  # Important for Hazelcast discovery
  selector:
    app: tqpro-api
  ports:
  - name: hazelcast
    port: 5701
    targetPort: 5701
    protocol: TCP

Step 6: Update Deployment Manifest (30 minutes)

File: /home/nino/src/tqpro/k8s/api-deployment.yaml

Add to deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tqpro-api
  namespace: tqpro
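spec:
  replicas: 3
  selector:
    matchLabels:
      app: tqpro-api
  template:
    metadata:
      labels:
        app: tqpro-api  # Must match the selector of the tqpro-hazelcast Service
    spec:
      serviceAccountName: tqpro-api  # Required for RBAC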

      containers:
      - name: api
        image: registry.company.com/tqpro/api:1.0.0

        ports:
        - name: http
          containerPort: 11080
          protocol: TCP
        - name: hazelcast
          containerPort: 5701
          protocol: TCP

        env:
        # Deployment mode
        - name: DEPLOYMENT_MODE
          value: "kubernetes"

        # Kubernetes namespace (auto-injected)
        - name: K8S_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace

        # Hazelcast service name
        - name: HAZELCAST_SERVICE
          value: "tqpro-hazelcast"

        # Cluster name (optional)
        - name: HAZELCAST_CLUSTER_NAME
          value: "tqpro-cluster"

        # Pod name and IP (for logging)
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP

        # Existing environment variables...
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: host
        # ... etc

        # Readiness probe - check Hazelcast cluster
        readinessProbe:
          httpGet:
            path: /tlinq-api/health/cache
            port: 11080
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3

        # Liveness probe
        livenessProbe:
          httpGet:
            path: /tlinq-api/health/live
            port: 11080
          initialDelaySeconds: 90
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3

Step 7: Update Graceful Shutdown (1 hour)

File: /home/nino/src/tqpro/tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java

Add shutdown hook (after line 235):

// Graceful shutdown hook
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    log.info("Shutdown signal received, stopping server gracefully...");
    try {
        // Stop accepting new requests
        log.info("Stopping Jetty server...");
        server.stop();

        // Close Hazelcast cluster
        log.info("Shutting down Hazelcast cluster...");
        TlinqClusterCache.instance().shutdown();

        // Close database connections
        log.info("Closing database connections...");
        TlinqDBSession.close();

        log.info("Server stopped gracefully");
    } catch (Exception e) {
        log.error("Error during shutdown", e);
    }
}, "shutdown-thread"));

log.info("Shutdown hook registered");
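In the hook above, all three steps share one try block, so an exception from `server.stop()` would skip the Hazelcast and database shutdown. One way to isolate the steps is sketched below; `ShutdownSequence` is a hypothetical helper, not an existing class in the project.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: runs shutdown steps in order and keeps going after a failure. */
final class ShutdownSequence {

    /** A shutdown step that may throw a checked exception. */
    @FunctionalInterface
    interface Step {
        void run() throws Exception;
    }

    // LinkedHashMap preserves the order in which steps were added
    private final Map<String, Step> steps = new LinkedHashMap<>();

    ShutdownSequence add(String name, Step step) {
        steps.put(name, step);
        return this;
    }

    /** Runs every step in insertion order and returns the names of the steps that failed. */
    List<String> runAll() {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Step> entry : steps.entrySet()) {
            try {
                entry.getValue().run();
            } catch (Exception e) {
                // Record the failure and continue with the remaining steps
                failed.add(entry.getKey());
            }
        }
        return failed;
    }
}
```

Inside the hook this could read `new ShutdownSequence().add("jetty", server::stop).add("hazelcast", () -> TlinqClusterCache.instance().shutdown()).add("db", TlinqDBSession::close).runAll()`, with the returned names logged as errors. The method references assume the signatures shown in the hook above.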

5. Testing Strategy

5.1 Local Testing (Docker Compose)

Create test environment:

File: docker-compose-hazelcast-test.yml

version: '3.8'

services:
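  # TlinqClusterCache reads hazelcast.interface and hazelcast.port as JVM system
  # properties, not environment variables, so they are passed via JAVA_OPTS.
  # Assumes the image entrypoint passes JAVA_OPTS to the JVM.
  # Hazelcast interface patterns use wildcards or ranges (172.20.*.*), not CIDR notation.
  api-1:
    build: ../../..
    environment:
      - DEPLOYMENT_MODE=baremetal
      - JAVA_OPTS=-Dhazelcast.interface=172.20.*.* -Dhazelcast.port=5701
    networks:
      - tqpro-net
    ports:
      - "11080:11080"

  api-2:
    build: ../../..
    environment:
      - DEPLOYMENT_MODE=baremetal
      - JAVA_OPTS=-Dhazelcast.interface=172.20.*.* -Dhazelcast.port=5701
    networks:
      - tqpro-net
    ports:
      - "11081:11080"

  api-3:
    build: ../../..
    environment:
      - DEPLOYMENT_MODE=baremetal
      - JAVA_OPTS=-Dhazelcast.interface=172.20.*.* -Dhazelcast.port=5701
    networks:
      - tqpro-net
    ports:
      - "11082:11080"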

networks:
  tqpro-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

Test cluster formation:

# Start 3 instances
docker-compose -f docker-compose-hazelcast-test.yml up -d

# Check cluster size on each instance
curl http://localhost:11080/tlinq-api/health/cache
curl http://localhost:11081/tlinq-api/health/cache
curl http://localhost:11082/tlinq-api/health/cache

# All should report clusterSize: 3

5.2 Kubernetes Testing (Development Cluster)

Step-by-step testing:

# 1. Apply RBAC
kubectl apply -f k8s/hazelcast-rbac.yaml

# 2. Apply headless service
kubectl apply -f k8s/hazelcast-service.yaml

# 3. Deploy with 1 replica first
kubectl apply -f k8s/api-deployment.yaml
kubectl scale deployment tqpro-api --replicas=1 -n tqpro

# 4. Wait for pod to be ready
kubectl wait --for=condition=ready pod -l app=tqpro-api -n tqpro --timeout=120s

# 5. Check logs for Hazelcast initialization (Ctrl+C to stop following)
kubectl logs -f deployment/tqpro-api -n tqpro | grep -i hazelcast

# Expected output:
# Initializing Hazelcast cluster cache...
# Deployment mode: kubernetes
# Kubernetes discovery configured: namespace=tqpro, service=tqpro-hazelcast
# Hazelcast instance created successfully. Cluster size: 1

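# 6. Check health endpoint (port-forward runs in the background)
kubectl port-forward deployment/tqpro-api 11080:11080 -n tqpro &
sleep 2  # give the port-forward a moment to establish
curl http://localhost:11080/tlinq-api/health/cache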

# Expected response:
# {
#   "clusterSize": 1,
#   "status": "UP",
#   "caches": {
#     "cartsCache": "Cache[cartsCache]: size=0, localSize=0",
#     "userSessions": "Cache[userSessions]: size=0, localSize=0",
#     "apiRolesCache": "Cache[apiRolesCache]: size=50, localSize=50"
#   }
# }

# 7. Scale to 3 replicas
kubectl scale deployment tqpro-api --replicas=3 -n tqpro

# 8. Wait for all pods ready
kubectl wait --for=condition=ready pod -l app=tqpro-api -n tqpro --timeout=120s

# 9. Check cluster formation
# The startup line "Cluster size: N" is logged once per pod, so check Hazelcast's membership log instead
kubectl logs deployment/tqpro-api -n tqpro | grep "Members {size"
# Latest entry should show: Members {size:3, ...

# 10. Verify on all pods
for pod in $(kubectl get pods -l app=tqpro-api -n tqpro -o name); do
  echo "Checking $pod"
  kubectl exec "$pod" -n tqpro -- curl -s http://localhost:11080/tlinq-api/health/cache | jq '.clusterSize'
done
# All should return: 3
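The same check can be automated in an integration test by polling until the expected member count is reached. A minimal sketch, assuming the 30-second formation limit from NFR-1; `ClusterWait` is a hypothetical helper and the size probe is supplied by the caller.

```java
import java.util.function.IntSupplier;

/** Hypothetical test helper: polls a cluster-size probe until it reaches the expected value. */
final class ClusterWait {

    private ClusterWait() { }

    /**
     * @param sizeProbe     returns the currently visible member count
     * @param expected      member count to wait for
     * @param timeoutMillis maximum time to wait
     * @param pollMillis    pause between probes
     * @return true if the expected size was seen before the timeout, false otherwise
     */
    static boolean awaitSize(IntSupplier sizeProbe, int expected,
                             long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (sizeProbe.getAsInt() >= expected) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test running inside a pod could call `ClusterWait.awaitSize(() -> TlinqClusterCache.instance().getClusterSize(), 3, 30_000, 500)` and fail if it returns false.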

5.3 Cache Consistency Testing

Test shopping cart across pods:

# Create a cart on pod-1
# (tqpro-api-pod-1/2/3 are placeholders; use real names from: kubectl get pods -n tqpro)
SESSION_ID="test-session-$(date +%s)"

# Add item via pod-1
kubectl exec tqpro-api-pod-1 -n tqpro -- curl -s -X POST \
  http://localhost:11080/tlinq-api/cart/addItem \
  -H "Content-Type: application/json" \
  -d "{\"session\":\"$SESSION_ID\",\"itemId\":\"ITEM123\"}"

# Retrieve cart via pod-2 (different pod!)
kubectl exec tqpro-api-pod-2 -n tqpro -- curl -s -X POST \
  http://localhost:11080/tlinq-api/cart/load \
  -H "Content-Type: application/json" \
  -d "{\"session\":\"$SESSION_ID\"}"

# Should return the same cart with ITEM123

Test session across pods:

# Login via pod-1
LOGIN_RESPONSE=$(kubectl exec tqpro-api-pod-1 -n tqpro -- curl -s -X POST \
  http://localhost:11080/tlinq-api/user/login \
  -H "Content-Type: application/json" \
  -d '{"username":"testuser","password":"testpass"}')

SESSION_TOKEN=$(echo "$LOGIN_RESPONSE" | jq -r '.sessionToken')

# Use session on pod-3
kubectl exec tqpro-api-pod-3 -n tqpro -- curl -s -X GET \
  http://localhost:11080/tlinq-api/user/profile \
  -H "Content-Type: application/json" \
  -d "{\"session\":\"$SESSION_TOKEN\"}"

# Should return user profile without re-authentication

5.4 Load Testing

Use k6 or Apache JMeter:

// k6-hazelcast-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  stages: [
    { duration: '2m', target: 50 },   // Ramp up to 50 users
    { duration: '5m', target: 50 },   // Stay at 50 users
    { duration: '2m', target: 100 },  // Ramp to 100 users
    { duration: '5m', target: 100 },  // Stay at 100 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
};

export default function () {
  const sessionId = `session-${__VU}-${__ITER}`;

  // Add item to cart
  let res = http.post('http://tqpro-api/tlinq-api/cart/addItem',
    JSON.stringify({
      session: sessionId,
      itemId: `item-${Math.random()}`
    }),
    { headers: { 'Content-Type': 'application/json' } }
  );

  check(res, {
    'cart add successful': (r) => r.status === 200,
    'response time < 200ms': (r) => r.timings.duration < 200,
  });

  sleep(1);

  // Load cart
  res = http.post('http://tqpro-api/tlinq-api/cart/load',
    JSON.stringify({ session: sessionId }),
    { headers: { 'Content-Type': 'application/json' } }
  );

  check(res, {
    'cart load successful': (r) => r.status === 200,
    'cart not empty': (r) => JSON.parse(r.body).items.length > 0,
  });

  sleep(1);
}

Run test:

k6 run k6-hazelcast-test.js

Expected results:

  • Cache operations < 100ms (p95)
  • No cache misses (100% hit rate after warmup)
  • Cluster remains stable under load
  • No pod restarts or OOM errors
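The p95 figure can be computed from raw latency samples when the load tool's summary is not available. A minimal sketch using the nearest-rank definition of a percentile; `Percentiles` is a hypothetical helper, and other tools may interpolate between samples and report slightly different values.

```java
import java.util.Arrays;

/** Hypothetical helper: nearest-rank percentile over latency samples. */
final class Percentiles {

    private Percentiles() { }

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     *
     * @param samples latency samples in any order (e.g. milliseconds)
     * @param p       percentile in the range (0, 100]
     */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // 1-based rank, rounded up
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

For example, with 100 samples the p95 is the 95th smallest value, so up to five slow outliers do not affect it.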


6. Rollback Plan

6.1 Immediate Rollback (< 5 minutes)

If Hazelcast clustering fails in production:

# Option 1: Rollback deployment
kubectl rollout undo deployment/tqpro-api -n tqpro

# Option 2: Scale to 1 replica (bypass clustering)
kubectl scale deployment tqpro-api --replicas=1 -n tqpro

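# Option 3: Set deployment mode to baremetal
# Caution: bare-metal mode binds to hazelcast.interface (default 172.16.55.1), which pods
# will not have, so that system property must be overridden as well. Prefer Option 1 or 2.
kubectl set env deployment/tqpro-api DEPLOYMENT_MODE=baremetal -n tqpro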

6.2 Code Rollback

If code changes cause issues:

# Revert Git commit
git revert <commit-hash>
git push origin codev

# Rebuild and redeploy
./gradlew clean build
docker build -t registry.company.com/tqpro/api:rollback .
docker push registry.company.com/tqpro/api:rollback
kubectl set image deployment/tqpro-api api=registry.company.com/tqpro/api:rollback -n tqpro

6.3 Fallback to Redis

If Hazelcast proves problematic:

See separate Redis migration plan (2-week timeline).


7. Monitoring & Validation

7.1 Metrics to Monitor

Application Metrics:

  • Hazelcast cluster size (should equal pod count)
  • Cache hit/miss ratio per cache
  • Cache operation latency (get, put)
  • Eviction rate (should be low with proper TTL)
  • Cache size (memory usage)

Kubernetes Metrics:

  • Pod restart count (should be 0)
  • Pod ready status
  • Network traffic between pods
  • Memory usage per pod

Business Metrics:

  • User session errors (SESSION_ERROR exceptions)
  • Cart loss incidents
  • Login frequency (high = session loss issue)

7.2 Prometheus Metrics (Optional Enhancement)

Add Hazelcast Prometheus exporter:

# In Deployment
env:
- name: JAVA_OPTS
  value: "-Dhazelcast.jmx=true -javaagent:/app/jmx_prometheus_javaagent.jar=8080:/app/jmx-config.yaml"

JMX Config (jmx-config.yaml):

lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
- pattern: 'com.hazelcast<instance=*, type=Metrics, name=cluster.clock>Timestamp'
  name: hazelcast_cluster_time
  type: GAUGE
- pattern: 'com.hazelcast<instance=*, type=Metrics, name=cluster.size>Value'
  name: hazelcast_cluster_size
  type: GAUGE

7.3 Grafana Dashboard

Key panels:

  1. Hazelcast Cluster Size (line chart)
  2. Cache Operations/sec (rate chart)
  3. Cache Hit Ratio (gauge)
  4. Cache Memory Usage (stacked area)
  5. Top Evicted Caches (table)

7.4 Alerts

Critical Alerts:

# Prometheus alerting rules (notifications are routed by Alertmanager)
groups:
- name: hazelcast
  interval: 30s
  rules:
  - alert: HazelcastClusterDegraded
    expr: hazelcast_cluster_size < 2
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Hazelcast cluster degraded"
      description: "Cluster size is {{ $value }}, expected >= 2"

  - alert: HazelcastCacheEvictionHigh
    expr: rate(hazelcast_map_evictions_total[5m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High cache eviction rate"
      description: "Eviction rate: {{ $value }}/sec"

  - alert: HazelcastHighLatency
    expr: histogram_quantile(0.95, sum by (le) (rate(hazelcast_map_get_latency_seconds_bucket[5m]))) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Hazelcast cache latency high"
      description: "P95 latency: {{ $value }}s"

8. Timeline & Milestones

Week 1: Development & Unit Testing

Day 1-2:

  • ✅ Update dependencies
  • ✅ Modify TlinqClusterCache.java
  • ✅ Add unit tests
  • ✅ Local testing (bare-metal mode)

Day 3-4:

  • ✅ Create Kubernetes manifests (RBAC, Service)
  • ✅ Update Deployment
  • ✅ Add health check endpoint
  • ✅ Update shutdown logic

Day 5:

  • ✅ Code review
  • ✅ Documentation updates
  • ✅ Merge to development branch

Week 2: Testing & Deployment

Day 1-2:

  • ✅ Deploy to dev K8s cluster
  • ✅ Verify cluster formation (1 pod)
  • ✅ Scale to 3 pods
  • ✅ Cache consistency testing

Day 3-4:

  • ✅ Load testing
  • ✅ Chaos testing (pod kills)
  • ✅ Performance validation
  • ✅ Fix any issues found

Day 5:

  • ✅ Deploy to staging
  • ✅ User acceptance testing
  • ✅ Production deployment plan


9. Success Criteria

Functional Success

  • ✅ Hazelcast cluster forms with 3+ members in Kubernetes
  • ✅ Shopping carts accessible from any pod
  • ✅ User sessions persist across pod switches
  • ✅ API roles cache shared across cluster
  • ✅ Bare-metal deployment still works (backward compatibility)

Performance Success

  • ✅ Cache operations < 100ms (p95)
  • ✅ Cluster formation < 30 seconds
  • ✅ No cache-related errors under load
  • ✅ Memory usage within limits (< 2GB per pod)

Operational Success

  • ✅ Health checks working (readiness/liveness)
  • ✅ Graceful shutdown completes in < 30 seconds
  • ✅ Monitoring dashboards showing metrics
  • ✅ Alerts configured and tested
  • ✅ Runbook documented

10. Risks & Mitigation

Risk                             Probability   Impact     Mitigation
Cluster formation fails          Medium        High       Test thoroughly in dev; fallback to single replica
Network latency too high         Low           Medium     Use same availability zone; monitor metrics
Memory leak from cache growth    Low           High       Configure TTL and eviction; monitor memory
Split-brain scenario             Low           Critical   Configure quorum; test network partitions
Backward compatibility broken    Low           High       Maintain environment variable switch; test bare-metal
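For the split-brain mitigation, the quorum size is normally a strict majority of the expected members. The arithmetic is sketched below; `QuorumSketch` is a hypothetical helper. In Hazelcast 4.x and later this feature is called split-brain protection, and the computed value would correspond to its minimum cluster size setting.

```java
/** Hypothetical helper: majority-based quorum arithmetic. */
final class QuorumSketch {

    private QuorumSketch() { }

    /** Smallest member count that is a strict majority of the expected cluster size. */
    static int majority(int expectedMembers) {
        if (expectedMembers < 1) {
            throw new IllegalArgumentException("expectedMembers must be >= 1");
        }
        return expectedMembers / 2 + 1;
    }

    /** True if the visible members form a majority of the expected cluster. */
    static boolean hasQuorum(int visibleMembers, int expectedMembers) {
        return visibleMembers >= majority(expectedMembers);
    }
}
```

With the 3 replicas used in this plan the majority is 2, so after a partition only the side with two or more members would keep serving cache operations. A 2-member cluster cannot tolerate any partition under this rule, which is one reason to keep the replica count odd.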

11. Documentation Requirements

Code Documentation

  • ✅ Javadoc for all new methods in TlinqClusterCache
  • ✅ Configuration comments for Kubernetes discovery
  • ✅ Examples in comments

Operational Documentation

  • ✅ Deployment guide (this document)
  • ✅ Troubleshooting guide
  • ✅ Monitoring runbook
  • ✅ Rollback procedures

User Documentation

  • ✅ Update KUBERNETES_DEPLOYMENT_PLAN.md
  • ✅ Add Hazelcast section to README
  • ✅ Document environment variables

12. Post-Deployment Tasks

Week 1 After Production:

  • Monitor cluster health daily
  • Review cache eviction rates
  • Analyze performance metrics
  • Gather user feedback

Week 2-4 After Production:

  • Optimize TTL settings based on usage
  • Fine-tune eviction policies
  • Consider Hazelcast Management Center for deep insights
  • Document lessons learned


Document Prepared By: DevOps Team
Review Required By: Tech Lead, Platform Team
Approval Required By: CTO


End of Implementation Plan