
TQPro Kubernetes Deployment Plan

Document Version: 1.0
Date: 2024-11-23
Status: Analysis Complete - Development Plan
Target Environment: Kubernetes (AWS EKS / Azure AKS / GKE)


Executive Summary

The TQPro application is a good candidate for containerization, with moderate configuration changes required. It is a REST API built on embedded Jetty that keeps no server-side HTTP sessions, which makes it well-suited for Kubernetes deployment.

Overall Assessment: ✅ FEASIBLE - Estimated 4-6 weeks development effort

Key Findings:

  - ✅ Stateless design (no server-side sessions)
  - ✅ Embedded server (Jetty 12) - no external app server needed
  - ✅ RESTful API architecture
  - ⚠️ Configuration requires externalization
  - ⚠️ Hazelcast needs Kubernetes discovery
  - ⚠️ Hardcoded secrets must be moved to Secrets
  - ❌ No health check endpoints (must be added)


Table of Contents

  1. Current Architecture Analysis
  2. Containerization Strategy
  3. Required Code Changes
  4. Kubernetes Resources
  5. Development Roadmap
  6. Configuration Migration
  7. Risk Assessment
  8. Testing Strategy
  9. Rollout Plan
  10. Appendix A: Hazelcast vs Alternatives Analysis

1. Current Architecture Analysis

1.1 Application Components

Backend API Server:

  - Entry Point: tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java:258
  - Server: Embedded Jetty 12.0.10
  - Runtime: Java 17 (Amazon Corretto)
  - Framework: JAX-RS with Jersey 3.1.6
  - Context Path: /tlinq-api
  - Ports: HTTP 11080, HTTPS 11079 (configurable)

Module Structure:

tqapi (REST endpoints - 8,749 LOC)
  ├── tqapp (Business logic - 296 Java files)
  │   ├── tqcommon (Utilities, config, DB)
  │   ├── tqamds (Amadeus API integration)
  │   ├── tqodoo (Odoo ERP integration)
  │   └── tqryb2b (Rayna B2B integration)
  └── tqcommon

Frontend Applications:

  - tqweb-adm (5.6MB) - Admin dashboard (HTML/JS/Foundation)
  - tqweb-b2b (8.4MB) - B2B portal
  - tqweb-pub (87MB) - Public website

1.2 External Dependencies

Database:

  - PostgreSQL (currently localhost:5432)
  - Hibernate ORM with SessionFactory
  - Connection configured in config/tourlinq-config.xml
  - No connection pooling (uses Hibernate defaults)

Distributed Cache:

  - Hazelcast 4.2.4 in-memory grid (will be upgraded to 5.3.6)
  - CRITICAL ISSUE: Hardcoded IP 172.16.55.1 in TlinqClusterCache.java:43
  - Multicast port: 55478 (incompatible with K8s)
  - Usage: 3 caches - shopping carts, user sessions, API roles
  - Status: REQUIRED for multi-instance deployment
  - Action Required: Configure Kubernetes discovery (see detailed migration plan)

External APIs:

  - Amadeus (flights/hotels) - HTTPS
  - Odoo ERP - XML-RPC
  - Rayna B2B - HTTP REST
  - Mail server - SMTP:587
  - Twilio - HTTPS

File System:

  - Document storage: /var/www/docimages (configurable in tlinqapi.properties:36)
  - Logs: /var/log/tqpro/ (hardcoded in log.properties:19)
  - Config: TLINQ_HOME environment variable

1.3 Configuration Management

Environment Variable Dependencies:

  - TLINQ_HOME - CRITICAL: All config loaded from this path
    - Set in TQProApiServer.java:259
    - Used by ClientConfig.java:37 and AppConfig.java:21

Configuration Files (in config/):

  1. tourlinq-config.xml (7.6KB) - Main config with DB credentials ⚠️
  2. tourlinq.properties (2KB) - App properties with secrets ⚠️
  3. tlinqapi.properties (1.7KB) - Server config with paths ⚠️
  4. log.properties (622 bytes) - Logging with hardcoded path ⚠️
  5. api-roles.properties (6.1KB) - RBAC configuration
  6. amadeus-client.xml (3.5KB) - Amadeus service mappings
  7. Entity files (15 XML files via XInclude)

Security Issues 🔴:

  - Database passwords in plaintext (tourlinq-config.xml:50-55)
  - Mail password: <REDACTED> (tourlinq.properties:9)
  - Payment gateway key: <REDACTED> (tourlinq.properties:19)
  - RaynaB2B JWT token (tourlinq.properties:32-33)
  - Twilio credentials (tourlinq.properties:35-36)
  - Odoo credentials in odoo-client.properties

1.4 State Management

Session Design: ✅ Stateless

  - No server-side sessions (ServletContextHandler.NO_SESSIONS at TQProApiServer.java:218)
  - Authentication via OAuth2-Proxy + Keycloak
  - User identity in HTTP headers (X-User, X-Roles, X-Email, X-Name)
  - Security context created per-request (AuthenticationFilter.java:100-105)

Implications:

  - ✅ Well suited to horizontal scaling
  - ✅ No session affinity required
  - ✅ Can use any load balancing strategy

1.5 Critical Gaps

Missing for Kubernetes:

  1. ❌ No health check endpoints (/health, /ready, /live)
  2. ❌ No graceful shutdown handling
  3. ❌ Hardcoded secrets in config files
  4. ❌ File-based logging (should use stdout)
  5. ❌ Hazelcast not Kubernetes-aware
  6. ❌ No connection pool configuration
  7. ❌ No metrics endpoint (Prometheus)


2. Containerization Strategy

2.1 Docker Image Strategy

Multi-Stage Build Approach:

# ============================================
# Stage 1: Build
# ============================================
FROM gradle:8.10-jdk17-alpine AS builder

WORKDIR /build

# Copy gradle wrapper and build files
# (paths are relative to the build context, assumed here to be the repository root;
#  COPY cannot reach outside the build context with ../)
COPY gradlew gradlew.bat ./
COPY gradle gradle/
COPY settings.gradle.kts build.gradle.kts ./

# Copy all module build files first (for layer caching)
COPY tqcommon/build.gradle.kts tqcommon/
COPY tqapp/build.gradle.kts tqapp/
COPY tqapi/build.gradle.kts tqapi/
COPY tqamds/build.gradle.kts tqamds/
COPY tqodoo/build.gradle.kts tqodoo/
COPY tqryb2b/build.gradle.kts tqryb2b/

# Download dependencies (cached layer if no build file changes)
RUN gradle dependencies --no-daemon || true

# Copy source code
COPY tqcommon/src tqcommon/src/
COPY tqapp/src tqapp/src/
COPY tqapi/src tqapi/src/
COPY tqamds/src tqamds/src/
COPY tqodoo/src tqodoo/src/
COPY tqryb2b/src tqryb2b/src/

# Build application
RUN gradle clean build -x test --no-daemon

# Copy dependencies to known location
RUN gradle :tqapi:copyDependencies --no-daemon

# ============================================
# Stage 2: Runtime
# ============================================
FROM amazoncorretto:17-alpine

# Install curl for health checks
RUN apk add --no-cache curl

# Create app user (non-root)
RUN addgroup -g 1000 tqpro && \
    adduser -D -u 1000 -G tqpro tqpro

WORKDIR /app

# Copy JARs from builder
COPY --from=builder /build/tqapi/build/libs/tqapi.jar ./
COPY --from=builder /build/tqapi/build/libs/lib/*.jar ./lib/
COPY --from=builder /build/tqapp/build/libs/tqapp.jar ./lib/
COPY --from=builder /build/tqcommon/build/libs/tqcommon.jar ./lib/
COPY --from=builder /build/tqamds/build/libs/tqamds.jar ./lib/
COPY --from=builder /build/tqodoo/build/libs/tqodoo.jar ./lib/
COPY --from=builder /build/tqryb2b/build/libs/tqryb2b.jar ./lib/

# Create directories
RUN mkdir -p /app/config /app/documents /app/logs && \
    chown -R tqpro:tqpro /app

# Copy default config templates (will be overridden by ConfigMaps)
# Path is relative to the build context (assumed to be the repository root)
COPY --chown=tqpro:tqpro config /app/config/

# Switch to non-root user
USER tqpro

# Environment variables
ENV TLINQ_HOME=/app/config \
    JAVA_OPTS="-Xmx2g -Xms512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200" \
    TZ=UTC

# Expose ports
EXPOSE 11080 11079

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:11080/tlinq-api/health || exit 1

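# Start command (exec replaces the shell so the JVM receives SIGTERM directly,
# which the graceful shutdown hook depends on)
ENTRYPOINT ["sh", "-c", "exec java $JAVA_OPTS -DTLINQ_HOME=$TLINQ_HOME -jar tqapi.jar"]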

Image Size Optimization:

  - Base: Alpine Linux (~40MB)
  - Java 17 Runtime: ~150MB
  - Application JARs: ~80MB
  - Total Estimated Size: ~270MB

2.2 Web Frontend Images

Nginx Image for Static Content:

FROM nginx:1.25-alpine

# Copy static content
COPY tqweb-adm /usr/share/nginx/html/adm
COPY tqweb-b2b /usr/share/nginx/html/b2b
COPY tqweb-pub /usr/share/nginx/html/pub

# Copy nginx config (from ConfigMap in K8s)
COPY nginx.conf /etc/nginx/nginx.conf

EXPOSE 80

CMD ["nginx", "-g", "daemon off;"]

Size: ~101MB of static content (5.6MB + 8.4MB + 87MB, per Section 1.1) plus the nginx Alpine base image

2.3 Image Naming Convention

registry.company.com/tqpro/api:1.0.0
registry.company.com/tqpro/api:1.0.0-sha-abc123
registry.company.com/tqpro/web:1.0.0

Tags:

  - Semantic version: 1.0.0
  - Git commit SHA: 1.0.0-sha-abc123 (for traceability)
  - Environment: 1.0.0-dev, 1.0.0-staging, 1.0.0-prod


3. Required Code Changes

3.1 HIGH PRIORITY: Health Check Endpoints

New File: tqapi/src/main/java/com/perun/tlinq/api/HealthApi.java

package com.perun.tlinq.api;

import jakarta.ws.rs.*;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;
import com.perun.tlinq.util.TlinqDBSession;
import org.hibernate.Session;
import java.util.HashMap;
import java.util.Map;

@Path("/health")
public class HealthApi {

    // Liveness probe - is the process running?
    @GET
    @Path("/live")
    @Produces(MediaType.APPLICATION_JSON)
    public Response liveness() {
        Map<String, Object> status = new HashMap<>();
        status.put("status", "UP");
        status.put("timestamp", System.currentTimeMillis());
        return Response.ok(status).build();
    }

    // Readiness probe - can we serve traffic?
    @GET
    @Path("/ready")
    @Produces(MediaType.APPLICATION_JSON)
    public Response readiness() {
        Map<String, Object> status = new HashMap<>();
        boolean ready = true;

        // Check database connectivity
        // (try-with-resources closes the session even when the query throws)
        try (Session session = TlinqDBSession.getSession()) {
            session.createNativeQuery("SELECT 1").getSingleResult();
            status.put("database", "UP");
        } catch (Exception e) {
            status.put("database", "DOWN");
            status.put("error", e.getMessage());
            ready = false;
        }

        // Check Hazelcast (if enabled)
        try {
            if (System.getProperty("hazelcast.enabled", "true").equals("true")) {
                // Placeholder: add a real Hazelcast check here; as written this always reports UP
                status.put("cache", "UP");
            }
        } catch (Exception e) {
            status.put("cache", "DOWN");
            ready = false;
        }

        status.put("status", ready ? "UP" : "DOWN");
        status.put("timestamp", System.currentTimeMillis());

        return ready ?
            Response.ok(status).build() :
            Response.status(503).entity(status).build();
    }

    // Health endpoint - detailed health info
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Response health() {
        return readiness();
    }
}

Integration: Register in TQProApiServer.java alongside other API classes.
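
The readiness probe in Section 4.5 uses a 5 second timeout, so a hung database or cache call should not be allowed to block the endpoint for longer than that. The following is a minimal sketch, not part of the existing codebase, of running each check with its own time limit. The class name ReadinessChecks and its methods are hypothetical, and the lambdas stand in for the real SELECT 1 and Hazelcast checks.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: runs named readiness checks, each bounded by a timeout. */
final class ReadinessChecks {

    /** Returns "UP" only if the check completes with true inside the timeout. */
    static String runCheck(Supplier<Boolean> check, long timeoutMillis) {
        try {
            boolean ok = CompletableFuture.supplyAsync(check)
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
            return ok ? "UP" : "DOWN";
        } catch (Exception e) {
            // Timeout, interruption, or an exception thrown by the check itself
            return "DOWN";
        }
    }

    /** Evaluates all checks in order and adds an overall "status" entry. */
    static Map<String, String> evaluate(Map<String, Supplier<Boolean>> checks,
                                        long timeoutMillis) {
        Map<String, String> result = new LinkedHashMap<>();
        boolean ready = true;
        for (Map.Entry<String, Supplier<Boolean>> entry : checks.entrySet()) {
            String state = runCheck(entry.getValue(), timeoutMillis);
            result.put(entry.getKey(), state);
            ready &= "UP".equals(state);
        }
        result.put("status", ready ? "UP" : "DOWN");
        return result;
    }

    public static void main(String[] args) {
        Map<String, Supplier<Boolean>> checks = new LinkedHashMap<>();
        checks.put("database", () -> true); // stand-in for the SELECT 1 query
        checks.put("cache", () -> {
            throw new IllegalStateException("cluster unreachable"); // simulated failure
        });
        System.out.println(evaluate(checks, 2000));
    }
}
```

Note that a timed-out check keeps running on the common pool in this sketch; a production version would likely use a dedicated executor and cancel the task.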

3.2 HIGH PRIORITY: Hazelcast Kubernetes Discovery

📋 DETAILED IMPLEMENTATION PLAN: See HAZELCAST_KUBERNETES_MIGRATION.md for the complete step-by-step implementation guide, including:

  - Full code implementation with TTL and eviction policies
  - Kubernetes RBAC configuration
  - Testing procedures
  - Monitoring and health checks
  - Rollback procedures

Critical Finding - Multi-Instance Requirement:

After deep code analysis, Hazelcast is REQUIRED for multi-pod deployment because:

  1. User sessions (Odoo integration) - cache-only, NO database fallback
  2. Shopping carts (anonymous users) - memory-only, lost on pod switch
  3. API roles - shared authorization cache

Why not Caffeine: Local caching would break session/cart sharing across pods.

Modify: tqcommon/src/main/java/com/perun/tlinq/entity/cache/TlinqClusterCache.java

Summary of Required Changes:

// 1. Upgrade dependency (build.gradle.kts)
// Hazelcast 5.x ships Kubernetes discovery in the core artifact,
// so the separate hazelcast-kubernetes plugin (built for 4.x) is not needed.
implementation("com.hazelcast:hazelcast:5.3.6")

// 2. Add deployment mode detection
String deploymentMode = System.getenv("DEPLOYMENT_MODE");

if ("kubernetes".equals(deploymentMode)) {
    // Kubernetes deployment
    network.getJoin().getMulticastConfig().setEnabled(false);
    network.getJoin().getTcpIpConfig().setEnabled(false);

    // Enable Kubernetes discovery
    network.getJoin().getKubernetesConfig()
        .setEnabled(true)
        .setProperty("namespace", System.getenv("K8S_NAMESPACE"))
        .setProperty("service-name", "tqpro-hazelcast")
        .setProperty("service-port", "5701");

    log.info("Hazelcast configured for Kubernetes discovery");
} else {
    // Bare-metal deployment (existing logic)
    network.setPort(55478);
    network.getJoin().getMulticastConfig()
        .setEnabled(true)
        .addTrustedInterface(
            System.getProperty("hazelcast.interface", "172.16.55.1")
        );

    log.info("Hazelcast configured for multicast discovery");
}

// 3. Configure TTL and eviction policies (prevents memory leaks)
MapConfig cartsConfig = new MapConfig("cartsCache");
cartsConfig.setTimeToLiveSeconds(1800);  // 30 min TTL
cartsConfig.setEvictionConfig(new EvictionConfig()
    .setSize(10000)
    .setMaxSizePolicy(MaxSizePolicy.PER_NODE)
    .setEvictionPolicy(EvictionPolicy.LRU));
cfg.addMapConfig(cartsConfig);

// Similar for userSessions (60 min TTL) and apiRolesCache

Required Kubernetes Resources:

  - ServiceAccount with RBAC permissions (see k8s/hazelcast-rbac.yaml)
  - Headless Service for discovery (see k8s/hazelcast-service.yaml)
  - Environment variables: DEPLOYMENT_MODE=kubernetes, K8S_NAMESPACE, HAZELCAST_SERVICE

Effort: 3-5 days (includes testing and validation)
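
The mode detection above passes System.getenv("K8S_NAMESPACE") straight into the Hazelcast configuration, so a missing variable would only surface later as a discovery failure. As a sketch under that assumption, the decision can be isolated into a small function that fails fast. DiscoveryModeResolver and its enum are hypothetical names, not existing TQPro classes; only the DEPLOYMENT_MODE and K8S_NAMESPACE variable names come from this plan.

```java
import java.util.Map;

/** Hypothetical helper: decides which Hazelcast join mechanism to configure. */
final class DiscoveryModeResolver {

    enum DiscoveryMode { KUBERNETES, MULTICAST }

    /** Takes the environment as a map so the decision can be unit tested. */
    static DiscoveryMode resolve(Map<String, String> env) {
        if (!"kubernetes".equals(env.get("DEPLOYMENT_MODE"))) {
            // Anything else keeps the existing bare-metal multicast behaviour
            return DiscoveryMode.MULTICAST;
        }
        String namespace = env.get("K8S_NAMESPACE");
        if (namespace == null || namespace.isBlank()) {
            throw new IllegalStateException(
                    "DEPLOYMENT_MODE=kubernetes requires K8S_NAMESPACE to be set");
        }
        return DiscoveryMode.KUBERNETES;
    }

    public static void main(String[] args) {
        // In the application this would be called once before building the Hazelcast Config
        System.out.println(resolve(System.getenv()));
    }
}
```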

3.3 HIGH PRIORITY: Externalize Database Configuration

Modify: tqcommon/src/main/java/com/perun/tlinq/util/TlinqDBSession.java

Current (line 24):

String dbName = AppConfig.instance().getProperty("tlinq.dbname");

New:

// Priority: Environment variable > Config file > Default
String dbHost = System.getenv("DB_HOST");
String dbPort = System.getenv("DB_PORT");
String dbName = System.getenv("DB_NAME");
String dbUser = System.getenv("DB_USER");
String dbPassword = System.getenv("DB_PASSWORD");

// Fallback to config file if env vars not set (bare-metal mode)
if (dbHost == null) {
    String configDbName = AppConfig.instance().getProperty("tlinq.dbname");
    // Use existing logic with configured database
} else {
    // Build connection from environment variables
    String jdbcUrl = String.format(
        "jdbc:postgresql://%s:%s/%s",
        dbHost,
        dbPort != null ? dbPort : "5432",
        dbName
    );

    Configuration configuration = new Configuration();
    configuration.setProperty("hibernate.connection.url", jdbcUrl);
    configuration.setProperty("hibernate.connection.username", dbUser);
    configuration.setProperty("hibernate.connection.password", dbPassword);

    // Add connection pooling (HikariCP)
    configuration.setProperty("hibernate.connection.provider_class",
        "org.hibernate.hikaricp.internal.HikariCPConnectionProvider");
    configuration.setProperty("hibernate.hikari.minimumIdle", "5");
    configuration.setProperty("hibernate.hikari.maximumPoolSize", "20");
    configuration.setProperty("hibernate.hikari.idleTimeout", "300000");
}

Dependencies: Add HikariCP:

implementation("org.hibernate.orm:hibernate-hikaricp:6.5.1.Final")
implementation("com.zaxxer:HikariCP:5.0.1")
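
The "Environment variable > Config file > Default" priority stated above can be made explicit per setting, so that a partially configured environment (for example DB_HOST set but DB_NAME missing) does not produce a URL containing "null". The following is a sketch only. DbSettings is a hypothetical class, and the property keys tlinq.dbhost, tlinq.dbport and tlinq.dbuser are assumptions; only tlinq.dbname and tlinq.dbpass appear in this plan.

```java
import java.util.Map;
import java.util.Properties;

/** Hypothetical holder for resolved JDBC settings (not an existing TQPro class). */
final class DbSettings {
    final String jdbcUrl;
    final String user;
    final String password;

    private DbSettings(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
    }

    /** First non-blank value wins: environment variable, then config property, then default. */
    static String pick(Map<String, String> env, String envKey,
                       Properties config, String propKey, String defaultValue) {
        String fromEnv = env.get(envKey);
        if (fromEnv != null && !fromEnv.isBlank()) {
            return fromEnv;
        }
        String fromConfig = config.getProperty(propKey);
        if (fromConfig != null && !fromConfig.isBlank()) {
            return fromConfig;
        }
        return defaultValue;
    }

    /** Resolves all settings and fails fast when a required one is missing. */
    static DbSettings resolve(Map<String, String> env, Properties config) {
        // Property keys other than tlinq.dbname and tlinq.dbpass are assumed names
        String host = pick(env, "DB_HOST", config, "tlinq.dbhost", "localhost");
        String port = pick(env, "DB_PORT", config, "tlinq.dbport", "5432");
        String name = pick(env, "DB_NAME", config, "tlinq.dbname", null);
        String user = pick(env, "DB_USER", config, "tlinq.dbuser", null);
        String pass = pick(env, "DB_PASSWORD", config, "tlinq.dbpass", null);
        if (name == null || user == null || pass == null) {
            throw new IllegalStateException(
                    "Database name, user and password must be configured");
        }
        String url = String.format("jdbc:postgresql://%s:%s/%s", host, port, name);
        return new DbSettings(url, user, pass);
    }

    public static void main(String[] args) {
        Properties fileConfig = new Properties();
        fileConfig.setProperty("tlinq.dbname", "tlinq");
        Map<String, String> env = Map.of(
                "DB_HOST", "postgres-service",
                "DB_USER", "tlinq_user",
                "DB_PASSWORD", "example-only");
        System.out.println(resolve(env, fileConfig).jdbcUrl);
    }
}
```

The resolved values would then feed the hibernate.connection.* properties shown above.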

3.4 MEDIUM PRIORITY: Graceful Shutdown

Modify: tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java

Add shutdown hook (after line 235 where server starts):

// Add graceful shutdown hook
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    log.info("Shutdown signal received, stopping server gracefully...");
    try {
        // Stop accepting new requests
        server.stop();

        // Close Hazelcast
        if (TlinqClusterCache.getInstance() != null) {
            TlinqClusterCache.getInstance().shutdown();
        }

        // Close database connections
        TlinqDBSession.close();

        log.info("Server stopped gracefully");
    } catch (Exception e) {
        log.error("Error during shutdown", e);
    }
}));
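
One property of the hook above is that a failure in server.stop() skips the Hazelcast and database cleanup, because all three calls share one try block. A possible alternative, sketched here with hypothetical names (ShutdownSequence, Step), runs each step independently and reports which ones failed. The lambdas in main are placeholders for the real Jetty, Hazelcast and Hibernate calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: runs shutdown steps in order, isolating failures. */
final class ShutdownSequence {

    @FunctionalInterface
    interface Step {
        void run() throws Exception;
    }

    private final Map<String, Step> steps = new LinkedHashMap<>();

    ShutdownSequence add(String name, Step step) {
        steps.put(name, step);
        return this;
    }

    /** Runs every step even if an earlier one throws; returns the names that failed. */
    List<String> runAll() {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Step> entry : steps.entrySet()) {
            try {
                entry.getValue().run();
            } catch (Exception e) {
                failed.add(entry.getKey());
                System.err.println("Shutdown step failed: " + entry.getKey() + ": " + e);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        ShutdownSequence sequence = new ShutdownSequence()
                .add("jetty", () -> System.out.println("stopping server"))      // placeholder
                .add("hazelcast", () -> System.out.println("leaving cluster"))  // placeholder
                .add("database", () -> System.out.println("closing sessions")); // placeholder
        Runtime.getRuntime().addShutdownHook(new Thread(() -> sequence.runAll()));
    }
}
```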

3.5 MEDIUM PRIORITY: Logging to Stdout

Modify: config/log.properties

Current:

com.perun.tlinq.handlers=java.util.logging.ConsoleHandler, java.util.logging.FileHandler
java.util.logging.FileHandler.pattern=/var/log/tqpro/tlinqserver-%g-%u.log

For Kubernetes:

# Kubernetes mode - stdout only
com.perun.tlinq.handlers=java.util.logging.ConsoleHandler
com.perun.tlinq.level=INFO
java.util.logging.ConsoleHandler.level=INFO
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
java.util.logging.SimpleFormatter.format=%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n

Make it configurable: Use environment variable LOG_MODE=kubernetes or LOG_MODE=baremetal to switch.
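
To preview what the SimpleFormatter pattern above produces, the same format string can be applied with String.format, using the argument order java.util.logging.SimpleFormatter documents (1 = timestamp, 2 = source, 3 = logger, 4 = level, 5 = message, 6 = thrown). The sketch below also shows one possible reading of the LOG_MODE switch; LogFormatPreview and handlersFor are hypothetical names, and how the switch is wired into log.properties is left open.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Hypothetical preview of the console log line format; not an existing TQPro class. */
final class LogFormatPreview {

    // Same pattern as java.util.logging.SimpleFormatter.format in log.properties
    static final String FORMAT =
            "%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n";

    /** Renders one line with the argument positions SimpleFormatter uses. */
    static String render(ZonedDateTime time, String source, String logger,
                         String level, String message, String thrown) {
        return String.format(FORMAT, time, source, logger, level, message, thrown);
    }

    /** Assumed LOG_MODE semantics: only "kubernetes" drops the file handler. */
    static String handlersFor(String logMode) {
        return "kubernetes".equalsIgnoreCase(logMode)
                ? "java.util.logging.ConsoleHandler"
                : "java.util.logging.ConsoleHandler, java.util.logging.FileHandler";
    }

    public static void main(String[] args) {
        ZonedDateTime now = ZonedDateTime.now(ZoneOffset.UTC);
        System.out.print(render(now, "com.perun.tlinq.TQProApiServer main",
                "com.perun.tlinq", "INFO", "Server started", ""));
        System.out.println(handlersFor(System.getenv("LOG_MODE")));
    }
}
```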

3.6 LOW PRIORITY: Metrics Endpoint

New File: tqapi/src/main/java/com/perun/tlinq/api/MetricsApi.java

package com.perun.tlinq.api;

import jakarta.ws.rs.*;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/metrics")
public class MetricsApi {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public Response metrics() {
        StringBuilder sb = new StringBuilder();

        // JVM metrics
        Runtime runtime = Runtime.getRuntime();
        sb.append("# HELP jvm_memory_used_bytes Used memory in bytes\n");
        sb.append("# TYPE jvm_memory_used_bytes gauge\n");
        sb.append("jvm_memory_used_bytes ")
          .append(runtime.totalMemory() - runtime.freeMemory())
          .append("\n");

        sb.append("# HELP jvm_memory_max_bytes Max memory in bytes\n");
        sb.append("# TYPE jvm_memory_max_bytes gauge\n");
        sb.append("jvm_memory_max_bytes ")
          .append(runtime.maxMemory())
          .append("\n");

        // Add more metrics as needed

        return Response.ok(sb.toString()).build();
    }
}

3.7 Summary of Code Changes

Priority | Component              | File                   | Change Type | Effort
HIGH     | Health Checks          | api/HealthApi.java     | New file    | 4 hours
HIGH     | Hazelcast              | TlinqClusterCache.java | Modify      | 6 hours
HIGH     | Database Config        | TlinqDBSession.java    | Modify      | 8 hours
MEDIUM   | Graceful Shutdown      | TQProApiServer.java    | Add code    | 4 hours
MEDIUM   | Logging                | log.properties         | Modify      | 2 hours
LOW      | Metrics                | api/MetricsApi.java    | New file    | 4 hours
LOW      | Config Externalization | Multiple files         | Modify      | 8 hours

Total Development Effort: ~36 hours (1 week)


4. Kubernetes Resources

4.1 Namespace

apiVersion: v1
kind: Namespace
metadata:
  name: tqpro
  labels:
    name: tqpro
    environment: production

4.2 ConfigMaps

4.2.1 Application Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: tqpro-config
  namespace: tqpro
data:
  # Main application properties (secrets removed)
  tourlinq.properties: |
    # Mail configuration
    mail.host=smtp.gmail.com
    mail.port=587
    mail.from=noreply@peruntours.com
    mail.username=noreply@peruntours.com
    # mail.password - injected from Secret

    # Content location (mounted PVC)
    content.location=/app/documents

    # Database (injected from environment)
    # tlinq.dbname, tlinq.dbpass - from env vars

    # Feature flags
    tripmaker.enabled=true

  # API Server configuration
  tlinqapi.properties: |
    http-port=11080
    https-port=11079

    # SSL (if terminating in pod)
    keystore-path=/app/config/perun.jks

    # Document location (PVC mount)
    content.location=/app/documents

    # Auth server (internal K8s service)
    auth.server=http://oauth2-proxy:4180
    auth.validate-url=http://oauth2-proxy:4180/oauth2/auth

    # Development mode
    dev-mode=false
    dev-mode.bypass-auth=false

  # Logging configuration (stdout for K8s)
  log.properties: |
    com.perun.tlinq.handlers=java.util.logging.ConsoleHandler
    com.perun.tlinq.level=INFO
    java.util.logging.ConsoleHandler.level=INFO
    java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
    java.util.logging.SimpleFormatter.format=%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n

  # API roles (RBAC)
  api-roles.properties: |
    # Copy entire content from config/api-roles.properties
    # ... (6KB of role mappings)

4.2.2 XML Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: tqpro-xml-config
  namespace: tqpro
data:
  # tourlinq-config.xml (database credentials removed)
  tourlinq-config.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <PluginConfig xmlns:xi="http://www.w3.org/2001/XInclude">
      <!-- Database configs with env var placeholders -->
      <Database name="tlinq">
        <url>jdbc:postgresql://${DB_HOST}:${DB_PORT}/${DB_NAME}</url>
        <username>${DB_USER}</username>
        <password>${DB_PASSWORD}</password>
      </Database>

      <!-- Plugins, Services, Entities -->
      <Entities>
        <xi:include href="entities/flight-entities.xml" xpointer="xpointer(/Entities/*)"/>
        <!-- ... other includes -->
      </Entities>
    </PluginConfig>

  # amadeus-client.xml
  amadeus-client.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Amadeus service configuration -->

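  # Entity XML files (15 files). ConfigMap keys cannot contain "/", so these must be
  # mounted under entities/ (e.g. via volume items[].path) to match the xi:include hrefs.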
  flight-entities.xml: |
    # ... content
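
Kubernetes does not substitute ${DB_HOST}-style placeholders inside ConfigMap data, so the application (or an entrypoint step) has to resolve them when it loads tourlinq-config.xml. The following is a minimal sketch of such a resolver, assuming the placeholder syntax shown above. PlaceholderResolver is a hypothetical class, and values are inserted verbatim, so a password containing XML special characters would still need escaping.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: expands ${NAME} placeholders from an environment map. */
final class PlaceholderResolver {

    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    /** Replaces every ${NAME} with env.get(NAME); throws if a variable is missing. */
    static String resolve(String text, Map<String, String> env) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = env.get(name);
            if (value == null) {
                throw new IllegalArgumentException("Missing environment variable: " + name);
            }
            // quoteReplacement keeps "$" and "\" in the value from being treated as group references
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "<url>jdbc:postgresql://${DB_HOST}:${DB_PORT}/${DB_NAME}</url>";
        Map<String, String> env = Map.of(
                "DB_HOST", "postgres-service", "DB_PORT", "5432", "DB_NAME", "tlinq");
        System.out.println(resolve(template, env));
    }
}
```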

4.2.3 Nginx Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: tqpro
data:
  nginx.conf: |
    events {
        worker_connections 1024;
    }

    http {
        include /etc/nginx/mime.types;
        default_type application/octet-stream;

        # Logging
        access_log /dev/stdout;
        error_log /dev/stderr;

        # Compression
        gzip on;
        gzip_types text/plain text/css application/json application/javascript;

        upstream api_backend {
            server tqpro-api-service:11080;
        }

        server {
            listen 80;
            server_name _;

            # Admin app
            location /adm {
                alias /usr/share/nginx/html/adm;
                index index.html;
                try_files $uri $uri/ /adm/index.html;
            }

            # B2B app
            location /b2b {
                alias /usr/share/nginx/html/b2b;
                index index.html;
            }

            # Public site
            location / {
                root /usr/share/nginx/html/pub;
                index index.html;
            }

            # API proxy
            location /tlinq-api {
                proxy_pass http://api_backend;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto $scheme;
            }
        }
    }

4.3 Secrets

apiVersion: v1
kind: Secret
metadata:
  name: tqpro-db-credentials
  namespace: tqpro
type: Opaque
stringData:
  username: tlinq_user
  password: <REDACTED>
  host: postgres-service.database.svc.cluster.local
  port: "5432"
  database: tlinq

---
apiVersion: v1
kind: Secret
metadata:
  name: tqpro-api-keys
  namespace: tqpro
type: Opaque
stringData:
  # Amadeus
  amadeus.client.key: <REDACTED>
  amadeus.client.secret: <REDACTED>

  # Mail
  mail.password: <REDACTED>

  # Payment gateway
  telr.auth.key: <REDACTED>
  telr.merchant.id: <REDACTED>

  # RaynaB2B
  rayna.jwt.token: <REDACTED>

  # Twilio
  twilio.sid: <REDACTED>
  twilio.token: <REDACTED>

  # Odoo
  odoo.password: <REDACTED>

---
apiVersion: v1
kind: Secret
metadata:
  name: tqpro-ssl-keystore
  namespace: tqpro
type: Opaque
data:
  perun.jks: <BASE64_ENCODED_KEYSTORE>
  keystore.password: <BASE64_ENCODED_PASSWORD>

4.4 PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tqpro-documents
  namespace: tqpro
spec:
  accessModes:
    - ReadWriteMany  # Multiple pods can write
  storageClassName: efs-sc  # AWS EFS / Azure Files / NFS
  resources:
    requests:
      storage: 50Gi

4.5 Deployment - API Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tqpro-api
  namespace: tqpro
  labels:
    app: tqpro-api
    version: v1
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tqpro-api
  template:
    metadata:
      labels:
        app: tqpro-api
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "11080"
        prometheus.io/path: "/tlinq-api/metrics"
    spec:
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000

      # Init container to validate config
      initContainers:
      - name: config-validator
        image: registry.company.com/tqpro/api:1.0.0
        command: ['sh', '-c', 'ls -la /app/config && echo Config mounted successfully']
        volumeMounts:
        - name: config
          mountPath: /app/config
          readOnly: true

      containers:
      - name: api
        image: registry.company.com/tqpro/api:1.0.0
        imagePullPolicy: IfNotPresent

        ports:
        - name: http
          containerPort: 11080
          protocol: TCP
        - name: https
          containerPort: 11079
          protocol: TCP

        # Environment variables
        env:
        - name: DEPLOYMENT_MODE
          value: "kubernetes"
        - name: K8S_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP

        # Database configuration
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: host
        - name: DB_PORT
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: port
        - name: DB_NAME
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: database
        - name: DB_USER
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: username
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: password

        # API Keys
        - name: AMADEUS_CLIENT_KEY
          valueFrom:
            secretKeyRef:
              name: tqpro-api-keys
              key: amadeus.client.key
        - name: AMADEUS_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              name: tqpro-api-keys
              key: amadeus.client.secret
        - name: MAIL_PASSWORD
          valueFrom:
            secretKeyRef:
              name: tqpro-api-keys
              key: mail.password

        # JVM options
        - name: JAVA_OPTS
          value: "-Xmx2g -Xms512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/logs"

        # Resource limits
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi
          limits:
            cpu: 2000m
            memory: 4Gi

        # Health checks
        livenessProbe:
          httpGet:
            path: /tlinq-api/health/live
            port: 11080
          initialDelaySeconds: 90
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3

        readinessProbe:
          httpGet:
            path: /tlinq-api/health/ready
            port: 11080
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        # Startup probe (for slow starts)
        startupProbe:
          httpGet:
            path: /tlinq-api/health/live
            port: 11080
          initialDelaySeconds: 0
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30  # 5 minutes max

        # Volume mounts
        volumeMounts:
        - name: config
          mountPath: /app/config
          readOnly: true
        - name: documents
          mountPath: /app/documents
        - name: logs
          mountPath: /app/logs

      # Volumes
      volumes:
      - name: config
        projected:
          sources:
          - configMap:
              name: tqpro-config
          - configMap:
              name: tqpro-xml-config
      - name: documents
        persistentVolumeClaim:
          claimName: tqpro-documents
      - name: logs
        emptyDir: {}

      # Pod affinity - spread across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - tqpro-api
              topologyKey: kubernetes.io/hostname
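
The "5 minutes max" comment on the startup probe follows from the probe settings: roughly initialDelaySeconds plus periodSeconds times failureThreshold. The sketch below reproduces that arithmetic for the values in the manifest above; it is an approximation that ignores per-request timeouts, and ProbeBudget is a hypothetical name.

```java
/** Hypothetical helper: approximate time before Kubernetes acts on a failing probe. */
final class ProbeBudget {

    /** Approximate worst case in seconds: initial delay plus one period per allowed failure. */
    static int maxSeconds(int initialDelaySeconds, int periodSeconds, int failureThreshold) {
        return initialDelaySeconds + periodSeconds * failureThreshold;
    }

    public static void main(String[] args) {
        // Startup probe: initialDelaySeconds 0, periodSeconds 10, failureThreshold 30
        System.out.println("startup budget (s): " + maxSeconds(0, 10, 30));
        // Liveness probe once running: periodSeconds 30, failureThreshold 3
        System.out.println("liveness reaction (s): " + maxSeconds(0, 30, 3));
    }
}
```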

4.6 Deployment - Web Frontend

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tqpro-web
  namespace: tqpro
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tqpro-web
  template:
    metadata:
      labels:
        app: tqpro-web
    spec:
      containers:
      - name: nginx
        image: registry.company.com/tqpro/web:1.0.0
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        volumeMounts:
        - name: nginx-config
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
        livenessProbe:
          httpGet:
            path: /adm/index.html
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /adm/index.html
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
      volumes:
      - name: nginx-config
        configMap:
          name: nginx-config

4.7 Services

# API Service
apiVersion: v1
kind: Service
metadata:
  name: tqpro-api-service
  namespace: tqpro
  labels:
    app: tqpro-api
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 11080
    targetPort: 11080
    protocol: TCP
  selector:
    app: tqpro-api
  sessionAffinity: None  # Stateless, no affinity needed

---
# Web Service
apiVersion: v1
kind: Service
metadata:
  name: tqpro-web-service
  namespace: tqpro
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: tqpro-web

---
# Hazelcast headless service (for discovery)
apiVersion: v1
kind: Service
metadata:
  name: tqpro-hazelcast
  namespace: tqpro
spec:
  type: ClusterIP
  clusterIP: None  # Headless service
  ports:
  - name: hazelcast
    port: 5701
  selector:
    app: tqpro-api

4.8 Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tqpro-ingress
  namespace: tqpro
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - tqpro.company.com
    secretName: tqpro-tls
  rules:
  - host: tqpro.company.com
    http:
      paths:
      # API endpoints
      - path: /tlinq-api
        pathType: Prefix
        backend:
          service:
            name: tqpro-api-service
            port:
              number: 11080
      # Web frontend
      - path: /
        pathType: Prefix
        backend:
          service:
            name: tqpro-web-service
            port:
              number: 80

4.9 HorizontalPodAutoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tqpro-api-hpa
  namespace: tqpro
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tqpro-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
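
The replica count this manifest produces can be estimated with the standard HPA formula, `desired = ceil(current * currentUtilization / targetUtilization)`, evaluated per metric with the largest result winning. The sketch below is only an illustration: the class and method names are hypothetical, the 70/80 targets and the 3-10 bounds are taken from the manifest above, and the `behavior` rate limits and stabilization windows are deliberately ignored.

```java
final class HpaMath {
    // Default controller tolerance: ratios within 10% of 1.0 trigger no scaling.
    static final double TOLERANCE = 0.1;

    /** Replica count suggested by a single metric, clamped to [minReplicas, maxReplicas]. */
    static int desiredReplicas(int current, double currentUtil, double targetUtil,
                               int minReplicas, int maxReplicas) {
        double ratio = currentUtil / targetUtil;
        int desired = current;
        if (Math.abs(ratio - 1.0) > TOLERANCE) {
            desired = (int) Math.ceil(current * ratio);
        }
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    /** With several metrics, the largest proposal is used (CPU target 70%, memory target 80%). */
    static int desiredReplicas(int current, double cpuUtil, double memUtil) {
        int byCpu = desiredReplicas(current, cpuUtil, 70.0, 3, 10);
        int byMem = desiredReplicas(current, memUtil, 80.0, 3, 10);
        return Math.max(byCpu, byMem);
    }
}
```

For example, 3 replicas averaging 90% CPU against the 70% target give `ceil(3 * 90 / 70) = 4` replicas, while 72% CPU stays within the tolerance and leaves the count unchanged.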

4.10 PodDisruptionBudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: tqpro-api-pdb
  namespace: tqpro
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: tqpro-api

5. Development Roadmap

Phase 1: Foundation (Week 1-2)

Objectives:

- Set up development environment
- Create initial Docker images
- Test containerization locally

Tasks:

1. Day 1-2: Docker Setup
   - Create Dockerfile for API server
   - Create Dockerfile for web frontend
   - Test local builds
   - Optimize image sizes
2. Day 3-4: Code Changes - Health Checks
   - Implement HealthApi.java
   - Add liveness, readiness, startup endpoints
   - Test health endpoints locally
   - Update API registration
3. Day 5-6: Code Changes - Database
   - Modify TlinqDBSession.java for env vars
   - Add HikariCP connection pooling
   - Test with local PostgreSQL
   - Validate connection fallback logic
4. Day 7-8: Code Changes - Hazelcast
   - Modify TlinqClusterCache.java
   - Add Kubernetes discovery support
   - Add bare-metal fallback
   - Test multicast still works locally
5. Day 9-10: Testing & Documentation
   - End-to-end local testing
   - Document all changes
   - Create migration guide
   - Code review

Deliverables:

- ✅ Working Docker images
- ✅ Health check endpoints functional
- ✅ Database configuration externalized
- ✅ Hazelcast dual-mode support
- ✅ Local testing passed

Phase 2: Configuration & Secrets (Week 3)

Objectives:

- Externalize all configuration
- Migrate secrets to Kubernetes Secrets
- Create ConfigMaps

Tasks:

1. Day 1-2: ConfigMap Creation
   - Create tqpro-config ConfigMap
   - Create tqpro-xml-config ConfigMap
   - Create nginx-config ConfigMap
   - Validate XML parsing
2. Day 3-4: Secrets Migration
   - Identify all secrets (20+ credentials)
   - Create Secret manifests
   - Create secret rotation plan
   - Document secret access patterns
3. Day 5: Environment Variable Injection
   - Update code to read secrets from env vars
   - Test secret injection
   - Validate fallback to config files
4. Day 6-7: Testing
   - Test with Secrets mounted
   - Test with ConfigMaps mounted
   - Validate environment variable precedence
   - Test bare-metal compatibility

Deliverables:

- ✅ All ConfigMaps created
- ✅ All Secrets identified and documented
- ✅ Code reads from environment variables
- ✅ Bare-metal mode still works

Phase 3: Kubernetes Deployment (Week 4)

Objectives:

- Deploy to development K8s cluster
- Configure storage, networking
- Validate functionality

Tasks:

1. Day 1: Cluster Setup
   - Create namespace
   - Apply RBAC policies
   - Set up StorageClass (EFS/Azure Files)
   - Create PersistentVolumeClaim
2. Day 2: Deploy Dependencies
   - Deploy PostgreSQL (or configure external)
   - Create database schema
   - Deploy Hazelcast StatefulSet (optional)
   - Test database connectivity
3. Day 3: Deploy Application
   - Apply ConfigMaps
   - Apply Secrets
   - Deploy API server (1 replica initially)
   - Check logs for errors
4. Day 4: Networking
   - Create Services
   - Configure Ingress
   - Set up TLS certificates
   - Test external access
5. Day 5: Scale & Test
   - Scale to 3 replicas
   - Test Hazelcast clustering
   - Test file uploads (shared storage)
   - Load testing

Deliverables:

- ✅ Application running in K8s
- ✅ 3 replicas healthy
- ✅ Ingress accessible
- ✅ All health checks passing
- ✅ Hazelcast cluster formed

Phase 4: Observability & Monitoring (Week 5)

Objectives:

- Set up logging aggregation
- Configure monitoring
- Add alerts

Tasks:

1. Day 1-2: Logging
   - Deploy EFK stack (Elasticsearch, Fluentd, Kibana)
   - Configure log shipping
   - Create log dashboards
   - Test log queries
2. Day 3-4: Monitoring
   - Deploy Prometheus
   - Configure ServiceMonitor
   - Deploy Grafana
   - Create dashboards (CPU, memory, requests)
3. Day 5: Alerting
   - Configure AlertManager
   - Create alert rules (pod down, high CPU, DB errors)
   - Test alert routing
   - Document runbooks

Deliverables:

- ✅ Centralized logging
- ✅ Prometheus metrics collection
- ✅ Grafana dashboards
- ✅ Alerts configured

Phase 5: Production Hardening (Week 6)

Objectives:

- Security hardening
- Performance optimization
- Disaster recovery

Tasks:

1. Day 1-2: Security
   - Network policies (restrict pod-to-pod)
   - Security context (non-root user)
   - Image scanning (Trivy/Aqua)
   - Secret rotation automation
2. Day 3: Performance
   - JVM tuning
   - Connection pool optimization
   - Cache configuration tuning
   - Load testing (JMeter)
3. Day 4: High Availability
   - PodDisruptionBudget
   - HorizontalPodAutoscaler
   - Multi-zone deployment
   - Database failover testing
4. Day 5: Disaster Recovery
   - Backup automation (Velero)
   - Database backup/restore
   - Disaster recovery plan
   - Runbook documentation

Deliverables:

- ✅ Security scan passed
- ✅ Performance benchmarks met
- ✅ HA configuration tested
- ✅ Backup/restore validated

Phase 6: Production Deployment (Week 7-8)

Objectives:

- Deploy to staging
- User acceptance testing
- Deploy to production

Tasks:

1. Week 7: Staging Deployment
   - Deploy to staging cluster
   - Run smoke tests
   - User acceptance testing
   - Performance testing
   - Security audit
2. Week 8: Production Deployment
   - Blue-green deployment setup
   - Deploy to production (canary rollout)
   - Monitor metrics closely
   - Gradual traffic shift (10% → 50% → 100%)
   - Rollback plan ready

Deliverables:

- ✅ Staging validated
- ✅ Production deployed
- ✅ Zero downtime migration
- ✅ Rollback tested


6. Configuration Migration

6.1 Configuration Files Mapping

| File | Bare-Metal Location | Kubernetes Resource | Notes |
|------|---------------------|---------------------|-------|
| tourlinq-config.xml | $TLINQ_HOME/tourlinq-config.xml | ConfigMap tqpro-xml-config | Remove DB credentials |
| tourlinq.properties | $TLINQ_HOME/tourlinq.properties | ConfigMap tqpro-config | Remove secrets |
| tlinqapi.properties | $TLINQ_HOME/tlinqapi.properties | ConfigMap tqpro-config | Update paths |
| log.properties | $TLINQ_HOME/log.properties | ConfigMap tqpro-config | Stdout only |
| api-roles.properties | $TLINQ_HOME/api-roles.properties | ConfigMap tqpro-config | No changes |
| amadeus-client.xml | $TLINQ_HOME/amadeus-client.xml | ConfigMap tqpro-xml-config | No changes |
| Entity XMLs (15 files) | $TLINQ_HOME/entities/*.xml | ConfigMap tqpro-xml-config | No changes |
| amadeus.idfile | Hardcoded path | Secret tqpro-api-keys | Two-line CSV |
| perun.jks | Hardcoded path | Secret tqpro-ssl-keystore | Binary file |

6.2 Environment Variables Strategy

Deployment Mode:

# Kubernetes deployment
DEPLOYMENT_MODE=kubernetes

# Bare-metal deployment
DEPLOYMENT_MODE=baremetal
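
One way the application could interpret this switch is sketched below. The enum and its method names are hypothetical, and treating an unset variable as bare-metal is an assumption made here so that existing installations keep working without changes.

```java
import java.util.Locale;

enum DeploymentMode {
    KUBERNETES, BAREMETAL;

    /** Parses DEPLOYMENT_MODE; unset or blank falls back to bare-metal (assumed default). */
    static DeploymentMode parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return BAREMETAL;
        }
        switch (raw.trim().toLowerCase(Locale.ROOT)) {
            case "kubernetes":
                return KUBERNETES;
            case "baremetal":
                return BAREMETAL;
            default:
                // Fail fast on typos instead of silently picking a mode
                throw new IllegalArgumentException("Unknown DEPLOYMENT_MODE: " + raw);
        }
    }

    /** Convenience wrapper that reads the process environment. */
    static DeploymentMode fromEnvironment() {
        return parse(System.getenv("DEPLOYMENT_MODE"));
    }
}
```

Rejecting unknown values at startup keeps a misspelled mode from quietly selecting multicast discovery inside a cluster.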

Database Configuration:

DB_HOST=postgres-service.database.svc.cluster.local
DB_PORT=5432
DB_NAME=tlinq
DB_USER=tlinq_user
DB_PASSWORD=<from-secret>
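
A minimal sketch of how these variables might be resolved, with the environment taking precedence and file-based values as the fallback. The `DbSettings` class and the `db.*` fallback keys are hypothetical (the real fallback lives in tourlinq-config.xml); the `localhost:5432` default mirrors the current bare-metal setup.

```java
import java.util.Map;
import java.util.Properties;

final class DbSettings {
    final String jdbcUrl;
    final String user;
    final String password;

    private DbSettings(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
    }

    /** Environment values win; `fallback` stands in for values parsed from the config file. */
    static DbSettings resolve(Map<String, String> env, Properties fallback) {
        String host = pick(env, "DB_HOST", fallback, "db.host", "localhost");
        String port = pick(env, "DB_PORT", fallback, "db.port", "5432");
        String name = pick(env, "DB_NAME", fallback, "db.name", "tlinq");
        String user = pick(env, "DB_USER", fallback, "db.user", null);
        String password = pick(env, "DB_PASSWORD", fallback, "db.password", null);
        String url = "jdbc:postgresql://" + host + ":" + port + "/" + name;
        return new DbSettings(url, user, password);
    }

    private static String pick(Map<String, String> env, String envKey,
                               Properties fallback, String propKey, String defaultValue) {
        String value = env.get(envKey);
        if (value != null && !value.isEmpty()) {
            return value;
        }
        return fallback.getProperty(propKey, defaultValue);
    }
}
```

In the application this would be called as `DbSettings.resolve(System.getenv(), fileProperties)`; passing the environment in as a map keeps the precedence rule easy to unit test.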

Hazelcast Configuration:

K8S_NAMESPACE=tqpro
HAZELCAST_SERVICE=tqpro-hazelcast
HAZELCAST_PORT=5701
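
The three variables above can be read and validated in one place before they are handed to the Hazelcast configuration. This is a sketch only: the class name and the defaults are assumptions, and the DNS name assumes the cluster uses the default `cluster.local` domain.

```java
import java.util.Map;

final class HazelcastDiscoverySettings {
    final String namespace;
    final String serviceName;
    final int port;

    private HazelcastDiscoverySettings(String namespace, String serviceName, int port) {
        this.namespace = namespace;
        this.serviceName = serviceName;
        this.port = port;
    }

    /** Reads K8S_NAMESPACE, HAZELCAST_SERVICE and HAZELCAST_PORT with assumed defaults. */
    static HazelcastDiscoverySettings fromEnv(Map<String, String> env) {
        String namespace = env.getOrDefault("K8S_NAMESPACE", "default");
        String service = env.getOrDefault("HAZELCAST_SERVICE", "tqpro-hazelcast");
        int port = Integer.parseInt(env.getOrDefault("HAZELCAST_PORT", "5701"));
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("HAZELCAST_PORT out of range: " + port);
        }
        return new HazelcastDiscoverySettings(namespace, service, port);
    }

    /** In-cluster DNS name of the headless Service (default cluster domain assumed). */
    String serviceDns() {
        return serviceName + "." + namespace + ".svc.cluster.local";
    }
}
```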

Application Configuration:

TLINQ_HOME=/app/config
JAVA_OPTS=-Xmx2g -Xms512m -XX:+UseG1GC
LOG_MODE=kubernetes

6.3 Secrets Extraction

From tourlinq.properties:

# Extract to Secret
mail.password=<redacted>
telr.auth-key=<redacted>
telr.merchant-id=21401
rayna.jwt.token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9...
twilio.sid=<redacted>
twilio.token=<redacted>

From tourlinq-config.xml:

<!-- Extract to Secret -->
<Database name="tlinq">
  <username>TlinqUser</username>
  <password>REDACTED</password>
</Database>

From odoo-client.properties:

# Extract to Secret
odoo.user=odoo@peruntours.com
odoo.password=<redacted>
odoo.session.id=<redacted>

Amadeus Credentials:

# amadeus.idfile content -> Secret
AMADEUS_CLIENT_KEY,AMADEUS_CLIENT_SECRET
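
Once these values live in a Secret, the application needs a rule for letting injected environment variables override what remains in the properties file. The sketch below shows one possible rule for the tourlinq.properties keys listed above; the class name and the `mail.password` to `MAIL_PASSWORD` naming convention are assumptions, not something the current code defines.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

final class SecretOverrides {
    /** Keys from tourlinq.properties that this section marks for extraction. */
    static final Set<String> SECRET_KEYS = Set.of(
        "mail.password", "telr.auth-key", "telr.merchant-id",
        "rayna.jwt.token", "twilio.sid", "twilio.token");

    /** mail.password -> MAIL_PASSWORD (hypothetical naming convention). */
    static String envName(String propertyKey) {
        return propertyKey.toUpperCase(Locale.ROOT).replace('.', '_').replace('-', '_');
    }

    /** Returns a copy of the file properties with secret keys overridden from env when set. */
    static Properties apply(Properties fromFile, Map<String, String> env) {
        Properties merged = new Properties();
        merged.putAll(fromFile);
        for (String key : SECRET_KEYS) {
            String value = env.get(envName(key));
            if (value != null) {
                merged.setProperty(key, value);
            }
        }
        return merged;
    }
}
```

Because unset variables leave the file value untouched, the same code path serves bare-metal installations that still keep these keys in the properties file.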


7. Risk Assessment

7.1 High Risk Items

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Hazelcast clustering fails in K8s | Cache not shared across pods | Medium | Test thoroughly; fallback to external Hazelcast |
| Shared storage performance | Slow document upload/download | Medium | Use high-performance storage (EFS Provisioned Throughput) |
| Database connection pool exhaustion | API failures under load | Medium | Implement HikariCP with proper sizing |
| Secret rotation breaks production | Application crashes | Low | Implement graceful secret reload |
| Configuration file parsing errors | App fails to start | Medium | Add init container to validate config |
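
For the connection pool risk, "proper sizing" has to account for autoscaling: every replica opens its own pool, so the per-pod maximum multiplied by the HPA ceiling must stay below the database connection limit. A rough sizing helper is sketched below; the class name is hypothetical, and the figures in the example (a limit of 100 connections, 10 held back for admin and migration tools) are assumptions to be replaced with the real database settings.

```java
final class PoolSizing {
    /**
     * Largest per-pod pool size such that maxReplicas pods can never exceed
     * the database connection limit minus the reserved connections.
     */
    static int maxPoolSizePerPod(int dbMaxConnections, int reservedConnections, int maxReplicas) {
        if (maxReplicas < 1) {
            throw new IllegalArgumentException("maxReplicas must be at least 1");
        }
        int available = dbMaxConnections - reservedConnections;
        if (available < maxReplicas) {
            throw new IllegalArgumentException("Not enough connections for one per replica");
        }
        return available / maxReplicas;  // integer division rounds down, which is the safe side
    }
}
```

With the HPA ceiling of 10 replicas, `maxPoolSizePerPod(100, 10, 10)` yields 9 connections per pod, noticeably smaller than a pool sized for the 3-replica baseline.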

7.2 Medium Risk Items

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Memory leaks in long-running pods | OOM kills | Medium | Monitor heap usage; set memory limits |
| External API rate limits | Service degradation | High | Implement circuit breakers and retries |
| SSL certificate expiry | HTTPS access lost | Low | Use cert-manager for auto-renewal |
| Log volume too high | Storage costs increase | Medium | Implement log level filtering |
| Migration downtime longer than expected | User impact | Medium | Extensive testing in staging |
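
The retry half of the rate-limit mitigation can be sketched with capped exponential backoff, as below. This is an illustration using only the JDK: the class and parameter names are hypothetical, and a production version would also need the circuit breaker, jitter, and a filter so that only retryable errors are repeated.

```java
import java.util.concurrent.Callable;

final class Retry {
    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 20);  // shift is capped to avoid overflow
        return Math.min(delay, maxMillis);
    }

    /** Runs the task up to maxAttempts times, sleeping between failed attempts. */
    static <T> T call(Callable<T> task, int maxAttempts, long baseMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;  // all attempts failed: surface the most recent error
    }
}
```

With a 200 ms base and a 5 s cap, the waits grow as 200, 400, 800, 1600, 3200 ms and then stay at 5000 ms.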

7.3 Low Risk Items

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Container image size too large | Slow deployments | Low | Multi-stage builds optimize size |
| Pod startup time too slow | Slow scaling | Low | Tune JVM startup; use CDS |
| Ingress configuration errors | 404 errors | Low | Test routing before production |

8. Testing Strategy

8.1 Unit Testing

Scope: Code changes for Kubernetes compatibility

Tests to Add:

1. HealthApi Tests
   - Liveness returns 200
   - Readiness returns 200 when DB up
   - Readiness returns 503 when DB down
2. TlinqDBSession Tests
   - Reads from environment variables
   - Falls back to config file
   - Connection pool works
3. TlinqClusterCache Tests
   - Kubernetes mode uses K8s discovery
   - Bare-metal mode uses multicast
   - Environment variable parsing

Tools: JUnit 5, Mockito
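
The three HealthApi cases listed above are easiest to test if the status decision is separated from the JAX-RS resource. The sketch below shows that decision logic in plain Java; `HealthChecks` is a hypothetical helper rather than the planned HealthApi.java, and the database probe is injected so that a test can simulate "DB up" and "DB down" without Mockito.

```java
import java.util.function.BooleanSupplier;

final class HealthChecks {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    private final BooleanSupplier databaseUp;

    HealthChecks(BooleanSupplier databaseUp) {
        this.databaseUp = databaseUp;
    }

    /** Liveness only reports that the process is running; it never touches the database. */
    int liveness() {
        return OK;
    }

    /** Readiness reflects the database state; a probe that throws counts as "down". */
    int readiness() {
        try {
            return databaseUp.getAsBoolean() ? OK : SERVICE_UNAVAILABLE;
        } catch (RuntimeException e) {
            return SERVICE_UNAVAILABLE;
        }
    }
}
```

Keeping liveness independent of the database matters in Kubernetes: a database outage should take pods out of rotation via readiness, not restart them via liveness.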

8.2 Integration Testing

Scope: Container and Kubernetes integration

Test Scenarios:

1. Container Build
   - Docker build succeeds
   - Image size within limits (<300MB)
   - Security scan passes (no critical CVEs)
2. Container Runtime
   - Container starts successfully
   - Health endpoints respond
   - Logs appear on stdout
3. Kubernetes Deployment
   - Pods start and become ready
   - ConfigMaps mounted correctly
   - Secrets injected as env vars
   - PVC mounted and writable
4. Multi-Pod
   - 3 pods run simultaneously
   - Hazelcast cluster forms
   - Cache shared across pods
   - Load balanced correctly

8.3 Functional Testing

Scope: End-to-end business functionality

Test Cases:

1. Flight Search
   - Search flights (JFK → LHR)
   - Verify results returned
   - Confirm pricing
   - Create booking
2. Hotel Search
   - Search hotels in Paris
   - View offers
   - Verify pricing
3. Document Management
   - Upload PDF
   - Verify file saved to PVC
   - Download PDF from another pod
4. User Management
   - Login via OAuth2
   - Access admin endpoints
   - Role-based access control

8.4 Performance Testing

Tools: Apache JMeter, k6

Test Scenarios:

1. Load Test
   - 100 concurrent users
   - Mixed API calls (search, booking)
   - Duration: 30 minutes
   - Success rate: >99%
   - Avg response time: <500ms
2. Stress Test
   - Gradually increase load to 500 users
   - Identify breaking point
   - Monitor auto-scaling behavior
   - Verify graceful degradation
3. Soak Test
   - 50 concurrent users
   - Duration: 12 hours
   - Check for memory leaks
   - Verify no degradation over time

Acceptance Criteria:

- Response time p95 < 1 second
- Error rate < 0.1%
- No memory leaks
- Auto-scaling triggers appropriately
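
The two numeric acceptance criteria can be checked mechanically from raw load-test samples. The sketch below uses the nearest-rank percentile definition; the class name and input shape (an array of latencies in milliseconds plus a failure count) are assumptions, and JMeter or k6 report these figures directly, so this mainly documents what "p95 < 1 second" and "error rate < 0.1%" mean.

```java
import java.util.Arrays;

final class LoadTestCriteria {
    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when p95 latency is under 1 second and the error rate is under 0.1%. */
    static boolean passes(long[] latenciesMillis, long failedRequests) {
        double errorRate = (double) failedRequests / latenciesMillis.length;
        return percentile(latenciesMillis, 95) < 1000 && errorRate < 0.001;
    }
}
```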

8.5 Disaster Recovery Testing

Test Scenarios:

1. Pod Failure
   - Delete random pod
   - Verify auto-restart
   - Verify no service interruption
2. Node Failure
   - Drain node
   - Verify pods rescheduled
   - Verify service continuity
3. Database Failure
   - Stop database
   - Verify readiness probe fails
   - Verify graceful error handling
   - Restore database
   - Verify automatic recovery
4. Complete Cluster Failure
   - Backup application state
   - Destroy cluster
   - Restore from backup
   - Verify data integrity

9. Rollout Plan

9.1 Pre-Deployment Checklist

Infrastructure:

- [ ] Kubernetes cluster provisioned (3+ nodes)
- [ ] StorageClass configured (EFS/Azure Files/NFS)
- [ ] PostgreSQL database deployed or accessible
- [ ] Ingress controller installed (nginx-ingress)
- [ ] Cert-manager installed (for TLS)
- [ ] Monitoring stack deployed (Prometheus/Grafana)
- [ ] Logging stack deployed (EFK/Loki)

Application:

- [ ] Docker images built and pushed to registry
- [ ] ConfigMaps created
- [ ] Secrets created (encrypted in repo via sealed-secrets)
- [ ] Database schema migrated
- [ ] Health endpoints tested
- [ ] Code changes merged to main branch

Documentation:

- [ ] Deployment runbook completed
- [ ] Rollback procedure documented
- [ ] Monitoring dashboard created
- [ ] Alert rules configured
- [ ] On-call rotation established

9.2 Deployment Steps

Development Environment (Week 4):

  1. Create namespace

    kubectl create namespace tqpro-dev
    

  2. Apply secrets (from sealed-secrets or vault)

    kubectl apply -f k8s/dev/secrets/
    

  3. Apply ConfigMaps

    kubectl apply -f k8s/dev/configmaps/
    

  4. Create PVC

    kubectl apply -f k8s/dev/storage/
    

  5. Deploy database (if not external)

    kubectl apply -f k8s/dev/database/
    

  6. Deploy application (1 replica)

    kubectl apply -f k8s/dev/deployment.yaml
    

  7. Verify health

    kubectl get pods -n tqpro-dev
    kubectl logs -f deployment/tqpro-api -n tqpro-dev
    

  8. Create service

    kubectl apply -f k8s/dev/service.yaml
    

  9. Create ingress

    kubectl apply -f k8s/dev/ingress.yaml
    

  10. Test access

    curl https://tqpro-dev.company.com/tlinq-api/health
    

Staging Environment (Week 7):

  1. Repeat steps above with tqpro-staging namespace
  2. Run full test suite
  3. User acceptance testing
  4. Performance testing
  5. Security audit

Production Environment (Week 8):

  1. Blue-Green Strategy:
     - Deploy "green" environment alongside existing "blue"
     - Route 10% traffic to green
     - Monitor metrics for 1 hour
     - Gradually increase to 50%, then 100%
     - Keep blue running for 24 hours as rollback option

  2. Deployment Command:

    # Deploy green
    kubectl apply -f k8s/prod/deployment-green.yaml
    
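    # Shift a share of traffic to green. A plain Service cannot weight traffic
    # between Deployments, so this manifest is assumed to set the weights at
    # the ingress or service-mesh layer.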
    kubectl apply -f k8s/prod/service-weighted.yaml
    
    # Monitor
    kubectl get pods -n tqpro -l version=green -w
    
    # Full cutover
    kubectl apply -f k8s/prod/service.yaml
    
    # Cleanup blue after 24h
    kubectl delete deployment tqpro-api-blue -n tqpro
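
The gradual shift in the strategy above (10%, then 50%, then 100%, with blue kept as the fallback) amounts to a small decision rule that a rollout script could apply after each monitoring window. The sketch below is hypothetical: the class name, the use of 0 to mean "send everything back to blue", and the single `healthy` flag are simplifications chosen for illustration.

```java
final class TrafficShift {
    /** Green traffic percentages from the rollout plan, in order. */
    static final int[] STEPS = {10, 50, 100};

    /** Next green weight after a monitoring window, or 0 to roll back to blue. */
    static int nextWeight(int currentWeight, boolean healthy) {
        if (!healthy) {
            return 0;
        }
        for (int step : STEPS) {
            if (step > currentWeight) {
                return step;
            }
        }
        return 100;  // already fully cut over
    }
}
```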
    

9.3 Rollback Procedure

If issues detected during rollout:

  1. Immediate Rollback (< 5 minutes):

    # Revert service to blue
    kubectl apply -f k8s/prod/service-blue.yaml
    
    # Or use rollback
    kubectl rollout undo deployment/tqpro-api -n tqpro
    

  2. Post-Rollback:
     - Investigate root cause
     - Fix issues in dev/staging
     - Re-test thoroughly
     - Schedule new deployment

9.4 Post-Deployment Validation

Automated Checks:

- [ ] All pods healthy (3/3 running)
- [ ] Health endpoints returning 200
- [ ] Ingress routing correctly
- [ ] SSL certificate valid
- [ ] Metrics being collected
- [ ] Logs being aggregated

Manual Checks:

- [ ] Login functionality works
- [ ] Flight search works
- [ ] Hotel search works
- [ ] Document upload works
- [ ] API responses correct
- [ ] No errors in logs

Performance Validation:

- [ ] Response times within SLA
- [ ] No increase in error rate
- [ ] Database connections stable
- [ ] Memory usage normal
- [ ] CPU usage normal

9.5 Monitoring & Alerts

Key Metrics to Monitor:

1. Application Health:
   - Pod restart count
   - Health check failures
   - Application errors (500s)
2. Performance:
   - Request rate
   - Response time (p50, p95, p99)
   - Error rate
   - Database query time
3. Resource Usage:
   - CPU utilization
   - Memory usage
   - Disk I/O (PVC)
   - Network traffic
4. External Dependencies:
   - Database connection pool
   - Amadeus API latency
   - Odoo API availability
   - Cache hit ratio

Alert Thresholds:

- Pod crash loop: Immediate alert
- Error rate > 1%: Warning
- Error rate > 5%: Critical
- Response time p95 > 2s: Warning
- Response time p95 > 5s: Critical
- Memory > 90%: Warning
- CPU > 80% for 5min: Warning
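
The error-rate and latency thresholds above form two-level rules that are easy to get wrong at the boundaries, so it can help to write them down as code before translating them into Prometheus alert rules. The sketch below is illustrative only: the type names are hypothetical, and the comparisons are strict (`>`) to match the list, so exactly 1% is not yet a warning.

```java
enum AlertSeverity { NONE, WARNING, CRITICAL }

final class AlertRules {
    /** Error rate in percent: above 1% warns, above 5% is critical. */
    static AlertSeverity forErrorRate(double percent) {
        if (percent > 5.0) {
            return AlertSeverity.CRITICAL;
        }
        return percent > 1.0 ? AlertSeverity.WARNING : AlertSeverity.NONE;
    }

    /** p95 response time in seconds: above 2s warns, above 5s is critical. */
    static AlertSeverity forP95Seconds(double seconds) {
        if (seconds > 5.0) {
            return AlertSeverity.CRITICAL;
        }
        return seconds > 2.0 ? AlertSeverity.WARNING : AlertSeverity.NONE;
    }

    /** The more severe of two results (enum order is NONE < WARNING < CRITICAL). */
    static AlertSeverity worst(AlertSeverity a, AlertSeverity b) {
        return a.compareTo(b) >= 0 ? a : b;
    }
}
```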


10. Cost Analysis

10.1 Infrastructure Costs (Monthly)

AWS EKS Example:

| Component | Specification | Cost |
|-----------|---------------|------|
| EKS Cluster | Control plane | $73 |
| EC2 Nodes | 3x t3.xlarge (4 CPU, 16GB) | $223 |
| EBS Volumes | 300GB gp3 | $24 |
| EFS Storage | 50GB + 5 MB/s provisioned | $65 |
| Application Load Balancer | 1 ALB | $23 |
| Data Transfer | 500GB/month | $45 |
| RDS PostgreSQL | db.t3.medium (2 CPU, 4GB) | $123 |
| CloudWatch | Logs + Metrics | $30 |
| Subtotal | | $606/month |

Azure AKS Example:

| Component | Specification | Cost |
|-----------|---------------|------|
| AKS Cluster | Control plane (free) | $0 |
| VM Nodes | 3x Standard_D4s_v3 (4 CPU, 16GB) | $347 |
| Managed Disks | 300GB Premium SSD | $51 |
| Azure Files | 50GB Premium | $110 |
| Application Gateway | Standard_v2 | $267 |
| Azure Database for PostgreSQL | General Purpose, 2 vCores | $182 |
| Log Analytics | 10GB/day | $25 |
| Subtotal | | $982/month |

GCP GKE Example:

| Component | Specification | Cost |
|-----------|---------------|------|
| GKE Cluster | Control plane | $73 |
| Compute Nodes | 3x n1-standard-4 (4 CPU, 15GB) | $292 |
| Persistent Disks | 300GB SSD | $51 |
| Filestore | 1TB Basic | $204 |
| Cloud Load Balancer | External HTTPS | $18 |
| Cloud SQL PostgreSQL | db-n1-standard-2 | $158 |
| Cloud Logging | 50GB/month | $25 |
| Subtotal | | $821/month |

10.2 Scaling Costs

Auto-Scaling Impact:

With HPA configured (3-10 replicas):

- Minimum (3 replicas): Base cost
- Average (5 replicas): +40% compute cost
- Peak (10 replicas): +100% compute cost

Recommendation: Set HPA max based on budget constraints and actual traffic patterns.

10.3 Cost Optimization Strategies

1. Right-Sizing:
   - Start with smaller instances
   - Use metrics to adjust
   - Potentially save 30-40%
2. Reserved Instances:
   - 1-year reserved instances: ~30% savings
   - 3-year reserved instances: ~50% savings
3. Spot Instances:
   - Use for non-critical workloads
   - 70-90% savings
   - Not recommended for production API
4. Storage Optimization:
   - Use object storage (S3) instead of EFS for documents
   - Potential savings: $50/month
5. Multi-Tenancy:
   - Share cluster with other applications
   - Reduce per-app overhead

Estimated Optimized Cost: $400-500/month (vs roughly $600-980 unoptimized, per the provider subtotals in 10.1)


11. Conclusion

11.1 Summary

The TQPro application is well-suited for Kubernetes deployment with the following assessment:

Strengths ✅:

- Stateless REST API design (perfect for K8s)
- Embedded Jetty server (no external dependencies)
- Multi-module architecture (clear separation)
- Standard Java/Gradle stack (well-supported)

Challenges ⚠️:

- Configuration externalization required
- Hazelcast needs K8s discovery implementation
- Secrets hardcoded in config files
- No health check endpoints (must add)
- Shared file storage needed

Overall Effort: 6-8 weeks (including testing)

Cost: roughly $400-980/month depending on cloud provider and optimization

Risk Level: Medium (manageable with proper planning)

11.2 Recommendations

Immediate Actions:

1. ✅ Approve development plan and budget
2. ✅ Provision development Kubernetes cluster
3. ✅ Assign development team (2-3 engineers)
4. ✅ Start Phase 1 (code changes)

Critical Success Factors:

1. Thorough testing in staging before production
2. Gradual rollout with rollback capability
3. Comprehensive monitoring from day one
4. Clear runbooks for operations team

Future Enhancements (Post-Deployment):

1. Service mesh (Istio) for advanced traffic management
2. GitOps (ArgoCD/Flux) for declarative deployments
3. Chaos engineering for resilience testing
4. Multi-region deployment for disaster recovery

11.3 Decision Point

Proceed with Kubernetes deployment?

- ☐ Yes - Begin Phase 1 immediately
- ☐ No - Document reasons and revisit in 6 months
- ☐ Partial - Start with dev/staging only


Appendix A: Hazelcast vs Alternatives Analysis

A.1 Question: Should Hazelcast be replaced with another distributed cache?

Initial Consideration: Replace Hazelcast with Caffeine (local cache) or Redis (managed service)

Analysis Performed: Deep code analysis of all cache usage patterns across the application

A.2 Cache Usage Discovery

Finding: Hazelcast is minimally used but CRITICAL for multi-instance deployment

Only 4 files use Hazelcast:

1. CartHolder.java - Shopping cart storage (session-based)
2. OdooServiceFactory.java - User session management
3. ApiRoleManager.java - API authorization cache
4. TlinqFrameworkInitializer.java - Test initialization

3 Active Caches:

| Cache Name | Purpose | Critical | DB Fallback | Multi-Pod Requirement |
|------------|---------|----------|-------------|----------------------|
| cartsCache | Shopping carts | YES | Partial* | REQUIRED |
| userSessions | Odoo authentication sessions | YES | NO | CRITICAL |
| apiRolesCache | API RBAC | MEDIUM | File reload | Optional |

*Logged-in user carts have database fallback; anonymous carts do not

A.3 Multi-Instance Impact Analysis

User Sessions (CRITICAL):

// OdooServiceFactory.java:146-158
private UserLogin fetchSession(String sessionToken) {
    UserLogin session = sessions.get(sessionToken);  // Hazelcast lookup
    if (null == session) {
        // NO DATABASE FALLBACK - User gets logged out!
        throw new TlinqClientException(TlinqErr.SESSION_ERROR, "Session expired");
    }
    return session;  // found in the shared map
}

Impact if cache not distributed:

- ❌ User sessions lost when request hits different pod
- ❌ Users forced to re-authenticate frequently
- ❌ Poor user experience
- ❌ Horizontal scaling impossible

Shopping Carts (HIGH PRIORITY):

- Logged-in users: ✅ Database fallback works
- Anonymous users: ❌ Cart lost on pod switch (requires sticky sessions)

A.4 Alternative Evaluation

Option 1: Caffeine (Local Cache)

Pros:

- ✅ Lightweight (~1MB vs 4MB)
- ✅ Better memory management (TTL + eviction built-in)
- ✅ No network configuration needed
- ✅ 1-2 day migration effort

Cons:

- ❌ BREAKS multi-instance deployment
- ❌ User sessions not shared between pods
- ❌ Anonymous carts lost on pod switch
- ❌ Requires sticky sessions (defeats K8s benefits)

Verdict: ❌ NOT SUITABLE for Kubernetes multi-pod deployment

Option 2: Redis (External Cache)

Pros:

- ✅ True distributed caching
- ✅ Managed service available (AWS ElastiCache, Azure Cache for Redis)
- ✅ Persistence across restarts
- ✅ Simpler K8s deployment (no in-app clustering)
- ✅ Can share with other services

Cons:

- ⚠️ Additional infrastructure ($50-100/month)
- ⚠️ 2-week migration effort
- ⚠️ New dependency to manage
- ⚠️ Network latency for cache operations

Verdict: ✅ VIABLE ALTERNATIVE but higher effort/cost

Option 3: Hazelcast (Fixed for Kubernetes)

Pros:

- ✅ Already integrated (4 files, 3 caches)
- ✅ Designed for distributed caching
- ✅ Zero additional infrastructure cost
- ✅ Embedded in application (no external service)
- ✅ 3-5 day migration effort
- ✅ Backward compatible with bare-metal

Cons:

- ⚠️ Requires Kubernetes discovery configuration
- ⚠️ Current config has hardcoded IP (fixable)
- ⚠️ No TTL/eviction configured (fixable)
- ⚠️ Cluster management in-app

Verdict: ⭐ RECOMMENDED - Best fit for current architecture

A.5 Decision Matrix

| Criteria | Caffeine | Hazelcast (Fixed) | Redis |
|----------|----------|-------------------|-------|
| Multi-Instance Support | ❌ NO | ✅ YES | ✅ YES |
| Effort | 2 days | 3-5 days | 2 weeks |
| Cost | $0 | $0 | $50-100/mo |
| Complexity | Low | Medium | Medium-High |
| Code Changes | 4 files | 4 files + config | 4 files + client |
| Infrastructure | None | None | Redis cluster |
| Backward Compat | ✅ Easy | ✅ Easy | ⚠️ Complex |
| Performance | Best (local) | Good (in-cluster) | Good (network hop) |
| Persistence | ❌ None | ⚠️ In-memory | ✅ Disk-backed |
| Overall Rating | Not rated | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |

A.6 Final Recommendation

Keep Hazelcast and Fix Kubernetes Discovery ⭐⭐⭐⭐⭐

Rationale:

1. Minimal disruption - Already integrated, just needs K8s config
2. Cost effective - No additional infrastructure
3. Quick to implement - 3-5 days vs 2 weeks for Redis
4. Proven technology - Hazelcast designed for this exact use case
5. Backward compatible - Bare-metal deployment still works

Implementation:

- Upgrade to Hazelcast 5.3.6 (from 4.2.4)
- Add Kubernetes service discovery plugin
- Configure TTL and eviction policies
- Add health check endpoints
- See HAZELCAST_KUBERNETES_MIGRATION.md for complete implementation plan

When to Consider Redis:

- If Hazelcast clustering proves problematic in production
- If you need persistence across cluster restarts
- If you want managed service with support
- If sharing cache with other applications
- Budget allows for $50-100/month additional cost

A.7 Current Hazelcast Issues Fixed

Before:

// ❌ Hardcoded IP
network.getJoin().getMulticastConfig()
    .addTrustedInterface("172.16.55.1");

// ❌ No TTL - memory leak risk
// ❌ No eviction policies
// ❌ No max size limits

After:

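// ✅ Kubernetes discovery
if ("kubernetes".equals(deploymentMode)) {
    // Only one join mechanism should be active, so switch multicast off first
    network.getJoin().getMulticastConfig().setEnabled(false);
    network.getJoin().getKubernetesConfig()
        .setEnabled(true)
        .setProperty("namespace", System.getenv("K8S_NAMESPACE"))
        .setProperty("service-name", "tqpro-hazelcast");
}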

// ✅ TTL configured
MapConfig cartsConfig = new MapConfig("cartsCache");
cartsConfig.setTimeToLiveSeconds(1800);  // 30 min

// ✅ Eviction policy
cartsConfig.setEvictionConfig(new EvictionConfig()
    .setSize(10000)
    .setMaxSizePolicy(MaxSizePolicy.PER_NODE)
    .setEvictionPolicy(EvictionPolicy.LRU));
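
To make the combined effect of the TTL and LRU settings concrete, the sketch below reproduces the same two rules on a single JVM using only `LinkedHashMap`. It is not Hazelcast code and says nothing about distribution; the class name is hypothetical, and the clock is passed in explicitly so the behavior is easy to test. Entries expire a fixed time after being written, and once the size limit is exceeded the least recently used entry is dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BoundedTtlCache<K, V> {
    private static final class Slot<V> {
        final V value;
        final long expiresAtMillis;

        Slot(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final LinkedHashMap<K, Slot<V>> map;

    BoundedTtlCache(final int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder=true keeps the least recently used entry at the head
        this.map = new LinkedHashMap<K, Slot<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Slot<V>> eldest) {
                return size() > maxEntries;  // LRU eviction once over the limit
            }
        };
    }

    synchronized void put(K key, V value, long nowMillis) {
        map.put(key, new Slot<>(value, nowMillis + ttlMillis));
    }

    /** Returns the value, or null if it is absent or its time-to-live has passed. */
    synchronized V get(K key, long nowMillis) {
        Slot<V> slot = map.get(key);
        if (slot == null) {
            return null;
        }
        if (nowMillis >= slot.expiresAtMillis) {
            map.remove(key);
            return null;
        }
        return slot.value;
    }

    synchronized int size() {
        return map.size();
    }
}
```

With the values configured above, the equivalent would be `new BoundedTtlCache<>(10000, 1_800_000)`: at most 10,000 carts per node, each dropped 30 minutes after it was last written.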

A.8 Testing Validation

Multi-Pod Cache Consistency Test:

# Create session on pod-1
SESSION_ID="test-session-123"
curl -X POST http://pod-1:11080/api/cart/addItem \
  -H 'Content-Type: application/json' \
  -d "{\"session\":\"$SESSION_ID\",\"item\":\"ITEM123\"}"

# Retrieve cart from pod-2 (different pod!)
curl -X POST http://pod-2:11080/api/cart/load \
  -H 'Content-Type: application/json' \
  -d "{\"session\":\"$SESSION_ID\"}"

# ✅ Should return cart with ITEM123 (cache shared)
# ❌ With Caffeine: Would return empty cart (cache not shared)

Expected Results with Hazelcast:

- ✅ Sessions accessible from all pods
- ✅ Carts consistent across pod switches
- ✅ Cluster size matches pod count
- ✅ No session loss during scaling events


Document Owner: DevOps Team Last Updated: 2024-11-23 Next Review: After Phase 1 completion


End of Document