TQPro Kubernetes Deployment Plan

Document Version: 1.0
Date: 2024-11-23
Status: Analysis Complete - Development Plan
Target Environment: Kubernetes (AWS EKS / Azure AKS / GKE)

Executive Summary

The TQPro application is a good candidate for containerization, with moderate configuration changes required. The backend is a REST API on embedded Jetty that keeps no servlet sessions; the state it does share between instances (shopping carts, Odoo user sessions, API roles) lives in Hazelcast, so it suits Kubernetes once cache discovery is made cluster-aware.

Overall Assessment: ✅ FEASIBLE - Estimated 4-6 weeks development effort

Key Findings:
- ✅ Stateless design (no server-side sessions)
- ✅ Embedded server (Jetty 12), no external app server needed
- ✅ RESTful API architecture
- ⚠️ Configuration requires externalization
- ⚠️ Hazelcast needs Kubernetes discovery
- ⚠️ Hardcoded secrets must be moved to Secrets
- ❌ No health check endpoints (must be added)

Table of Contents

Current Architecture Analysis
Containerization Strategy
Required Code Changes
Kubernetes Resources
Development Roadmap
Configuration Migration
Risk Assessment
Testing Strategy
Rollout Plan
Appendix A: Hazelcast vs Alternatives Analysis

1. Current Architecture Analysis

1.1 Application Components

Backend API Server:
- Entry Point: tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java:258
- Server: Embedded Jetty 12.0.10
- Runtime: Java 17 (Amazon Corretto)
- Framework: JAX-RS with Jersey 3.1.6
- Context Path: /tlinq-api
- Ports: HTTP 11080, HTTPS 11079 (configurable)

Module Structure:

tqapi (REST endpoints - 8,749 LOC)
  ├── tqapp (Business logic - 296 Java files)
  │   ├── tqcommon (Utilities, config, DB)
  │   ├── tqamds (Amadeus API integration)
  │   ├── tqodoo (Odoo ERP integration)
  │   └── tqryb2b (Rayna B2B integration)
  └── tqcommon

Frontend Applications:
- tqweb-adm (5.6MB): Admin dashboard (HTML/JS/Foundation)
- tqweb-b2b (8.4MB): B2B portal
- tqweb-pub (87MB): Public website

1.2 External Dependencies

Database:
- PostgreSQL (currently localhost:5432)
- Hibernate ORM with SessionFactory
- Connection configured in config/tourlinq-config.xml
- No connection pool configured (falls back to Hibernate's built-in pool, which is not intended for production use)

Distributed Cache:
- Hazelcast 4.2.4 in-memory grid (will be upgraded to 5.3.6)
- CRITICAL ISSUE: Hardcoded IP 172.16.55.1 in TlinqClusterCache.java:43
- Multicast port: 55478 (multicast discovery is incompatible with K8s)
- Usage: 3 caches (shopping carts, user sessions, API roles)
- Status: REQUIRED for multi-instance deployment
- Action Required: Configure Kubernetes discovery (see detailed migration plan)

External APIs:
- Amadeus (flights/hotels): HTTPS
- Odoo ERP: XML-RPC
- Rayna B2B: HTTP REST
- Mail server: SMTP:587
- Twilio: HTTPS

File System:
- Document storage: /var/www/docimages (configurable in tlinqapi.properties:36)
- Logs: /var/log/tqpro/ (hardcoded in log.properties:19)
- Config: TLINQ_HOME environment variable

1.3 Configuration Management

Environment Variable Dependencies:
- TLINQ_HOME (CRITICAL: all config is loaded from this path)
  - Set in TQProApiServer.java:259
  - Used by ClientConfig.java:37 and AppConfig.java:21

Configuration Files (in config/):
1. tourlinq-config.xml (7.6KB) - Main config with DB credentials ⚠️
2. tourlinq.properties (2KB) - App properties with secrets ⚠️
3. tlinqapi.properties (1.7KB) - Server config with paths ⚠️
4. log.properties (622 bytes) - Logging with hardcoded path ⚠️
5. api-roles.properties (6.1KB) - RBAC configuration
6. amadeus-client.xml (3.5KB) - Amadeus service mappings
7. Entity files (15 XML files via XInclude)

Security Issues 🔴:
- Database passwords in plaintext (tourlinq-config.xml:50-55)
- Mail password in plaintext: <REDACTED> (tourlinq.properties:9)
- Payment gateway key in plaintext: <REDACTED> (tourlinq.properties:19)
- RaynaB2B JWT token (tourlinq.properties:32-33)
- Twilio credentials (tourlinq.properties:35-36)
- Odoo credentials in odoo-client.properties

1.4 State Management

Session Design: ✅ Stateless
- No server-side sessions (ServletContextHandler.NO_SESSIONS at TQProApiServer.java:218)
- Authentication via OAuth2-Proxy + Keycloak
- User identity in HTTP headers (X-User, X-Roles, X-Email, X-Name)
- Security context created per-request (AuthenticationFilter.java:100-105)

Implications:
- ✅ Perfect for horizontal scaling
- ✅ No session affinity required
- ✅ Can use any load balancing strategy
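As an illustration of the per-request security context described above, the following is a minimal sketch of turning the proxy-injected headers into an identity object. The class `RequestIdentity` and its methods are hypothetical (the real logic lives in AuthenticationFilter.java), and the comma-separated format of X-Roles is an assumption. Header lookup is simplified to an exact-case map.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical value object; not the actual AuthenticationFilter implementation. */
final class RequestIdentity {
    final String user;
    final List<String> roles;

    private RequestIdentity(String user, List<String> roles) {
        this.user = user;
        this.roles = roles;
    }

    /** Builds the identity for a single request from proxy-injected headers. */
    static RequestIdentity fromHeaders(Map<String, String> headers) {
        String user = headers.get("X-User");
        if (user == null || user.isBlank()) {
            throw new IllegalArgumentException("Missing X-User header");
        }
        // Assumption: X-Roles carries a comma-separated list of role names.
        List<String> roles = Arrays.stream(headers.getOrDefault("X-Roles", "").split(","))
                .map(String::trim)
                .filter(role -> !role.isEmpty())
                .collect(Collectors.toList());
        return new RequestIdentity(user, roles);
    }

    boolean hasRole(String role) {
        return roles.contains(role);
    }
}
```

Because nothing is stored between requests, any pod can serve any request. The headers are only trustworthy if the API is reachable exclusively through the OAuth2-Proxy.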

1.5 Critical Gaps

Missing for Kubernetes:
1. ❌ No health check endpoints (/health, /ready, /live)
2. ❌ No graceful shutdown handling
3. ❌ Hardcoded secrets in config files
4. ❌ File-based logging (should use stdout)
5. ❌ Hazelcast not Kubernetes-aware
6. ❌ No connection pool configuration
7. ❌ No metrics endpoint (Prometheus)

2. Containerization Strategy

2.1 Docker Image Strategy

Multi-Stage Build Approach:

# ============================================
# Stage 1: Build
# ============================================
FROM gradle:8.10-jdk17-alpine AS builder

WORKDIR /build

# Paths below are relative to the build context, which must be the repository root
# (for example: docker build -f <path to this Dockerfile> <repository root>).
# COPY cannot read files outside the build context, so no ../ prefixes are used.

# Copy gradle wrapper and build files
COPY gradlew gradlew.bat ./
COPY gradle gradle/
COPY settings.gradle.kts build.gradle.kts ./

# Copy all module build files first (for layer caching)
COPY tqcommon/build.gradle.kts tqcommon/
COPY tqapp/build.gradle.kts tqapp/
COPY tqapi/build.gradle.kts tqapi/
COPY tqamds/build.gradle.kts tqamds/
COPY tqodoo/build.gradle.kts tqodoo/
COPY tqryb2b/build.gradle.kts tqryb2b/

# Download dependencies (cached layer if no build file changes)
RUN gradle dependencies --no-daemon || true

# Copy source code
COPY tqcommon/src tqcommon/src/
COPY tqapp/src tqapp/src/
COPY tqapi/src tqapi/src/
COPY tqamds/src tqamds/src/
COPY tqodoo/src tqodoo/src/
COPY tqryb2b/src tqryb2b/src/

# Build application
RUN gradle clean build -x test --no-daemon

# Copy dependencies to known location
RUN gradle :tqapi:copyDependencies --no-daemon

# ============================================
# Stage 2: Runtime
# ============================================
FROM amazoncorretto:17-alpine

# Install curl for health checks
RUN apk add --no-cache curl

# Create app user (non-root)
RUN addgroup -g 1000 tqpro && \
    adduser -D -u 1000 -G tqpro tqpro

WORKDIR /app

# Copy JARs from builder
COPY --from=builder /build/tqapi/build/libs/tqapi.jar ./
COPY --from=builder /build/tqapi/build/libs/lib/*.jar ./lib/
COPY --from=builder /build/tqapp/build/libs/tqapp.jar ./lib/
COPY --from=builder /build/tqcommon/build/libs/tqcommon.jar ./lib/
COPY --from=builder /build/tqamds/build/libs/tqamds.jar ./lib/
COPY --from=builder /build/tqodoo/build/libs/tqodoo.jar ./lib/
COPY --from=builder /build/tqryb2b/build/libs/tqryb2b.jar ./lib/

# Create directories
RUN mkdir -p /app/config /app/documents /app/logs && \
    chown -R tqpro:tqpro /app

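# Copy default config templates (will be overridden by ConfigMaps).
# The path is relative to the build context (repository root). The templates must
# not contain real credentials: image layers are readable by anyone who can pull the image.
COPY --chown=tqpro:tqpro config /app/config/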

# Switch to non-root user
USER tqpro

# Environment variables
ENV TLINQ_HOME=/app/config \
    JAVA_OPTS="-Xmx2g -Xms512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200" \
    TZ=UTC

# Expose ports
EXPOSE 11080 11079

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:11080/tlinq-api/health || exit 1

# Start command: exec replaces the shell so the JVM runs as PID 1 and receives SIGTERM directly
ENTRYPOINT ["sh", "-c", "exec java $JAVA_OPTS -DTLINQ_HOME=$TLINQ_HOME -jar tqapi.jar"]

Image Size Optimization:
- Base: Alpine Linux (~40MB)
- Java 17 Runtime: ~150MB
- Application JARs: ~80MB
- Total Estimated Size: ~270MB

2.2 Web Frontend Images

Nginx Image for Static Content:

FROM nginx:1.25-alpine

# Copy static content
COPY tqweb-adm /usr/share/nginx/html/adm
COPY tqweb-b2b /usr/share/nginx/html/b2b
COPY tqweb-pub /usr/share/nginx/html/pub

# Copy nginx config (from ConfigMap in K8s)
COPY nginx.conf /etc/nginx/nginx.conf

EXPOSE 80

CMD ["nginx", "-g", "daemon off;"]

Size: roughly 101MB of static content (5.6MB + 8.4MB + 87MB, per section 1.1) plus the nginx base image layers.

2.3 Image Naming Convention

registry.company.com/tqpro/api:1.0.0
registry.company.com/tqpro/api:1.0.0-sha-abc123
registry.company.com/tqpro/web:1.0.0

Tags:
- Semantic version: 1.0.0
- Git commit SHA: 1.0.0-sha-abc123 (for traceability)
- Environment: 1.0.0-dev, 1.0.0-staging, 1.0.0-prod

3. Required Code Changes

3.1 HIGH PRIORITY: Health Check Endpoints

New File: tqapi/src/main/java/com/perun/tlinq/api/HealthApi.java

package com.perun.tlinq.api;

import jakarta.ws.rs.*;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;
import com.perun.tlinq.util.TlinqDBSession;
import org.hibernate.Session;
import java.util.HashMap;
import java.util.Map;

@Path("/health")
public class HealthApi {

    // Liveness probe - is the process running?
    @GET
    @Path("/live")
    @Produces(MediaType.APPLICATION_JSON)
    public Response liveness() {
        Map<String, Object> status = new HashMap<>();
        status.put("status", "UP");
        status.put("timestamp", System.currentTimeMillis());
        return Response.ok(status).build();
    }

    // Readiness probe - can we serve traffic?
    @GET
    @Path("/ready")
    @Produces(MediaType.APPLICATION_JSON)
    public Response readiness() {
        Map<String, Object> status = new HashMap<>();
        boolean ready = true;

        // Check database connectivity (try-with-resources closes the session even if the query fails)
        try (Session session = TlinqDBSession.getSession()) {
            session.createNativeQuery("SELECT 1").getSingleResult();
            status.put("database", "UP");
        } catch (Exception e) {
            status.put("database", "DOWN");
            status.put("error", e.getMessage());
            ready = false;
        }

        // Check Hazelcast (if enabled)
        try {
            if (System.getProperty("hazelcast.enabled", "true").equals("true")) {
                // Add Hazelcast health check
                status.put("cache", "UP");
            }
        } catch (Exception e) {
            status.put("cache", "DOWN");
            ready = false;
        }

        status.put("status", ready ? "UP" : "DOWN");
        status.put("timestamp", System.currentTimeMillis());

        return ready ?
            Response.ok(status).build() :
            Response.status(503).entity(status).build();
    }

    // Health endpoint - detailed health info
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Response health() {
        return readiness();
    }
}

Integration: Register in TQProApiServer.java alongside other API classes.
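The Hazelcast branch of the readiness check above is still a placeholder that always reports UP. One way to keep the endpoint uniform as more dependencies are added is to register each check by name and derive the overall status from them. The sketch below shows that idea with plain Java only; `ReadinessReport` is a hypothetical helper, and the actual database and cache probes would be supplied by the application.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: runs named dependency checks and derives a readiness result. */
final class ReadinessReport {
    final Map<String, String> components = new LinkedHashMap<>();
    boolean ready = true;

    /** HTTP status a readiness endpoint would return for this report. */
    int httpStatus() {
        return ready ? 200 : 503;
    }

    static ReadinessReport evaluate(Map<String, BooleanSupplier> checks) {
        ReadinessReport report = new ReadinessReport();
        for (Map.Entry<String, BooleanSupplier> check : checks.entrySet()) {
            boolean up;
            try {
                up = check.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                up = false; // a check that throws counts as DOWN
            }
            report.components.put(check.getKey(), up ? "UP" : "DOWN");
            report.ready &= up;
        }
        return report;
    }
}
```

A single failing or throwing check marks the pod as not ready while the remaining checks still run, so the response body shows the state of every dependency.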

3.2 HIGH PRIORITY: Hazelcast Kubernetes Discovery

📋 DETAILED IMPLEMENTATION PLAN: See HAZELCAST_KUBERNETES_MIGRATION.md for the complete step-by-step implementation guide, including:
- Full code implementation with TTL and eviction policies
- Kubernetes RBAC configuration
- Testing procedures
- Monitoring and health checks
- Rollback procedures

Critical Finding - Multi-Instance Requirement:

After deep code analysis, Hazelcast is REQUIRED for multi-pod deployment because:
1. User sessions (Odoo integration) - cache-only, NO database fallback
2. Shopping carts (anonymous users) - memory-only, lost on pod switch
3. API roles - shared authorization cache

Why not Caffeine: Local caching would break session/cart sharing across pods.

Modify: tqcommon/src/main/java/com/perun/tlinq/entity/cache/TlinqClusterCache.java

Summary of Required Changes:

// 1. Upgrade dependency (build.gradle.kts)
// Hazelcast 5.x ships Kubernetes discovery inside the core artifact, so the
// separate hazelcast-kubernetes plugin (needed for 4.x) is not added.
implementation("com.hazelcast:hazelcast:5.3.6")

// 2. Add deployment mode detection
String deploymentMode = System.getenv("DEPLOYMENT_MODE");

if ("kubernetes".equals(deploymentMode)) {
    // Kubernetes deployment
    network.getJoin().getMulticastConfig().setEnabled(false);
    network.getJoin().getTcpIpConfig().setEnabled(false);

    // Enable Kubernetes discovery
    network.getJoin().getKubernetesConfig()
        .setEnabled(true)
        .setProperty("namespace", System.getenv("K8S_NAMESPACE"))
        .setProperty("service-name", "tqpro-hazelcast")
        .setProperty("service-port", "5701");

    log.info("Hazelcast configured for Kubernetes discovery");
} else {
    // Bare-metal deployment (existing logic)
    network.setPort(55478);
    network.getJoin().getMulticastConfig()
        .setEnabled(true)
        .addTrustedInterface(
            System.getProperty("hazelcast.interface", "172.16.55.1")
        );

    log.info("Hazelcast configured for multicast discovery");
}
// 3. Configure TTL and eviction policies (prevents memory leaks)
MapConfig cartsConfig = new MapConfig("cartsCache");
cartsConfig.setTimeToLiveSeconds(1800);  // 30 min TTL
cartsConfig.setEvictionConfig(new EvictionConfig()
    .setSize(10000)
    .setMaxSizePolicy(MaxSizePolicy.PER_NODE)
    .setEvictionPolicy(EvictionPolicy.LRU));
cfg.addMapConfig(cartsConfig);

// Similar for userSessions (60 min TTL) and apiRolesCache

Required Kubernetes Resources:
- ServiceAccount with RBAC permissions (see k8s/hazelcast-rbac.yaml)
- Headless Service for discovery (see k8s/hazelcast-service.yaml)
- Environment variables: DEPLOYMENT_MODE=kubernetes, K8S_NAMESPACE, HAZELCAST_SERVICE (the snippet above currently hardcodes the service name tqpro-hazelcast instead of reading HAZELCAST_SERVICE)

Effort: 3-5 days (includes testing and validation)
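The mode detection above reads K8S_NAMESPACE without checking it, so a missing variable would only surface when the cluster fails to form. A minimal sketch of resolving the discovery settings up front is shown below. `DiscoverySettings` is a hypothetical class, and using HAZELCAST_SERVICE with a default of tqpro-hazelcast is an assumption based on the variable list above; no Hazelcast API is called here.

```java
import java.util.Map;

/** Hypothetical holder for the discovery decision; not part of the Hazelcast API. */
final class DiscoverySettings {
    enum Mode { KUBERNETES, MULTICAST }

    final Mode mode;
    final String namespace;   // null in multicast mode
    final String serviceName; // null in multicast mode

    private DiscoverySettings(Mode mode, String namespace, String serviceName) {
        this.mode = mode;
        this.namespace = namespace;
        this.serviceName = serviceName;
    }

    /** Resolves the discovery mode from environment variables, failing fast on missing values. */
    static DiscoverySettings fromEnv(Map<String, String> env) {
        if (!"kubernetes".equals(env.get("DEPLOYMENT_MODE"))) {
            return new DiscoverySettings(Mode.MULTICAST, null, null);
        }
        String namespace = env.get("K8S_NAMESPACE");
        if (namespace == null || namespace.isBlank()) {
            throw new IllegalStateException(
                "K8S_NAMESPACE must be set when DEPLOYMENT_MODE=kubernetes");
        }
        // Assumption: HAZELCAST_SERVICE overrides the default headless service name.
        String service = env.getOrDefault("HAZELCAST_SERVICE", "tqpro-hazelcast");
        return new DiscoverySettings(Mode.KUBERNETES, namespace, service);
    }
}
```

In the application this would be called as `DiscoverySettings.fromEnv(System.getenv())` before building the Hazelcast network configuration.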

3.3 HIGH PRIORITY: Externalize Database Configuration

Modify: tqcommon/src/main/java/com/perun/tlinq/util/TlinqDBSession.java

Current (line 24):

String dbName = AppConfig.instance().getProperty("tlinq.dbname");

New:

// Priority: Environment variable > Config file > Default
String dbHost = System.getenv("DB_HOST");
String dbPort = System.getenv("DB_PORT");
String dbName = System.getenv("DB_NAME");
String dbUser = System.getenv("DB_USER");
String dbPassword = System.getenv("DB_PASSWORD");

// Fallback to config file if env vars not set (bare-metal mode)
if (dbHost == null) {
    String configDbName = AppConfig.instance().getProperty("tlinq.dbname");
    // Use existing logic with configured database
} else {
    // Build connection from environment variables
    String jdbcUrl = String.format(
        "jdbc:postgresql://%s:%s/%s",
        dbHost,
        dbPort != null ? dbPort : "5432",
        dbName
    );

    Configuration configuration = new Configuration();
    configuration.setProperty("hibernate.connection.url", jdbcUrl);
    configuration.setProperty("hibernate.connection.username", dbUser);
    configuration.setProperty("hibernate.connection.password", dbPassword);

    // Add connection pooling (HikariCP)
    configuration.setProperty("hibernate.connection.provider_class",
        "org.hibernate.hikaricp.internal.HikariCPConnectionProvider");
    configuration.setProperty("hibernate.hikari.minimumIdle", "5");
    configuration.setProperty("hibernate.hikari.maximumPoolSize", "20");
    configuration.setProperty("hibernate.hikari.idleTimeout", "300000");
}

Dependencies: Add HikariCP:

implementation("org.hibernate.orm:hibernate-hikaricp:6.5.1.Final")
implementation("com.zaxxer:HikariCP:5.0.1")
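The environment-variable branch above uses dbName, dbUser and dbPassword without checking that they are set. The following sketch, using a hypothetical `DbSettings` class, shows one way to resolve the connection settings with the stated precedence: environment variables when DB_HOST is present, otherwise an empty result so the caller falls back to the config file.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical holder for JDBC settings resolved from environment variables. */
final class DbSettings {
    final String jdbcUrl;
    final String user;
    final String password;

    private DbSettings(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;
        this.user = user;
        this.password = password;
    }

    /** Returns settings built from env vars, or empty when DB_HOST is unset (bare-metal mode). */
    static Optional<DbSettings> fromEnv(Map<String, String> env) {
        String host = env.get("DB_HOST");
        if (host == null || host.isBlank()) {
            return Optional.empty(); // caller falls back to the config file
        }
        String port = env.getOrDefault("DB_PORT", "5432");
        String url = String.format("jdbc:postgresql://%s:%s/%s", host, port, require(env, "DB_NAME"));
        return Optional.of(new DbSettings(url, require(env, "DB_USER"), require(env, "DB_PASSWORD")));
    }

    private static String require(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isEmpty()) {
            throw new IllegalStateException(name + " must be set when DB_HOST is set");
        }
        return value;
    }
}
```

Calling `DbSettings.fromEnv(System.getenv())` at startup makes a half-configured pod fail immediately with a clear message instead of at the first query.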

3.4 MEDIUM PRIORITY: Graceful Shutdown

Modify: tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java

Add shutdown hook (after line 235 where server starts):

// Add graceful shutdown hook
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    log.info("Shutdown signal received, stopping server gracefully...");
    try {
        // Stop accepting new requests
        server.stop();

        // Close Hazelcast
        if (TlinqClusterCache.getInstance() != null) {
            TlinqClusterCache.getInstance().shutdown();
        }

        // Close database connections
        TlinqDBSession.close();

        log.info("Server stopped gracefully");
    } catch (Exception e) {
        log.error("Error during shutdown", e);
    }
}));
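In the hook above, an exception from server.stop() skips the Hazelcast and database cleanup because all three steps share one try block. A minimal sketch of running each step independently is shown below; `ShutdownSequence` is a hypothetical helper and the step names are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: runs shutdown steps in order, isolating failures. */
final class ShutdownSequence {
    @FunctionalInterface
    interface Step {
        void run() throws Exception;
    }

    private final Map<String, Step> steps = new LinkedHashMap<>();

    ShutdownSequence add(String name, Step step) {
        steps.put(name, step);
        return this;
    }

    /** Runs every step in registration order and returns the names of the steps that failed. */
    List<String> runAll() {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Step> entry : steps.entrySet()) {
            try {
                entry.getValue().run();
            } catch (Exception e) {
                failed.add(entry.getKey()); // keep going so later resources are still released
            }
        }
        return failed;
    }
}
```

The hook body could then register the Jetty stop, the cache shutdown and the database close as three steps and log whatever `runAll()` returns.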

3.5 MEDIUM PRIORITY: Logging to Stdout

Modify: config/log.properties

Current:

com.perun.tlinq.handlers=java.util.logging.ConsoleHandler, java.util.logging.FileHandler
java.util.logging.FileHandler.pattern=/var/log/tqpro/tlinqserver-%g-%u.log

For Kubernetes:

# Kubernetes mode - console only (ConsoleHandler writes to stderr, which Kubernetes captures together with stdout)
com.perun.tlinq.handlers=java.util.logging.ConsoleHandler
com.perun.tlinq.level=INFO
java.util.logging.ConsoleHandler.level=INFO
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
java.util.logging.SimpleFormatter.format=%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n

Make it configurable: Use environment variable LOG_MODE=kubernetes or LOG_MODE=baremetal to switch.
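A minimal sketch of the LOG_MODE switch follows, assuming the logging configuration is generated in code rather than read from two separate files. `LogModeConfig` is a hypothetical class; the property names and file pattern are the ones shown above, and an unset LOG_MODE is treated as bare-metal so existing installations keep their file logs.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.LogManager;

/** Hypothetical helper that selects java.util.logging handlers from LOG_MODE. */
final class LogModeConfig {

    /** Builds logging properties: console only for "kubernetes", console plus file otherwise. */
    static String propertiesFor(String logMode) {
        boolean kubernetes = "kubernetes".equalsIgnoreCase(logMode);
        List<String> lines = new ArrayList<>();
        if (kubernetes) {
            lines.add("com.perun.tlinq.handlers=java.util.logging.ConsoleHandler");
        } else {
            lines.add("com.perun.tlinq.handlers=java.util.logging.ConsoleHandler, java.util.logging.FileHandler");
            lines.add("java.util.logging.FileHandler.pattern=/var/log/tqpro/tlinqserver-%g-%u.log");
        }
        lines.add("com.perun.tlinq.level=INFO");
        lines.add("java.util.logging.ConsoleHandler.level=INFO");
        return String.join("\n", lines) + "\n";
    }

    /** Applies the generated properties to the JVM-wide LogManager. */
    static void apply(String logMode) throws IOException {
        byte[] bytes = propertiesFor(logMode).getBytes(StandardCharsets.ISO_8859_1);
        LogManager.getLogManager().readConfiguration(new ByteArrayInputStream(bytes));
    }
}
```

At startup this would be invoked as `LogModeConfig.apply(System.getenv("LOG_MODE"))` before the first logger is created.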

3.6 LOW PRIORITY: Metrics Endpoint

New File: tqapi/src/main/java/com/perun/tlinq/api/MetricsApi.java

package com.perun.tlinq.api;

import jakarta.ws.rs.*;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/metrics")
public class MetricsApi {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public Response metrics() {
        StringBuilder sb = new StringBuilder();

        // JVM metrics
        Runtime runtime = Runtime.getRuntime();
        sb.append("# HELP jvm_memory_used_bytes Used memory in bytes\n");
        sb.append("# TYPE jvm_memory_used_bytes gauge\n");
        sb.append("jvm_memory_used_bytes ")
          .append(runtime.totalMemory() - runtime.freeMemory())
          .append("\n");

        sb.append("# HELP jvm_memory_max_bytes Max memory in bytes\n");
        sb.append("# TYPE jvm_memory_max_bytes gauge\n");
        sb.append("jvm_memory_max_bytes ")
          .append(runtime.maxMemory())
          .append("\n");

        // Add more metrics as needed

        return Response.ok(sb.toString()).build();
    }
}
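Each metric in the endpoint above repeats the same three-line pattern of the Prometheus text format (HELP, TYPE, then the sample). A small formatting helper keeps that consistent as metrics are added. The sketch below is one possible shape; `PrometheusText` and the metric name jvm_memory_heap_used_bytes are illustrative and not taken from the existing code.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical helper that renders gauges in the Prometheus text exposition format. */
final class PrometheusText {

    /** Renders one gauge as HELP line, TYPE line and sample line. */
    static String gauge(String name, String help, long value) {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " gauge\n"
             + name + " " + value + "\n";
    }

    /** Example metric: heap usage as reported by the JVM's MemoryMXBean. */
    static String heapMetrics() {
        long used = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        return gauge("jvm_memory_heap_used_bytes", "Used heap memory in bytes", used);
    }
}
```

The resource method would then append `PrometheusText.gauge(...)` results to its StringBuilder instead of hand-writing each line.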

3.7 Summary of Code Changes

Priority	Component	File	Change Type	Effort
HIGH	Health Checks	`api/HealthApi.java`	New file	4 hours
HIGH	Hazelcast	`TlinqClusterCache.java`	Modify	6 hours
HIGH	Database Config	`TlinqDBSession.java`	Modify	8 hours
MEDIUM	Graceful Shutdown	`TQProApiServer.java`	Add code	4 hours
MEDIUM	Logging	`log.properties`	Modify	2 hours
LOW	Metrics	`api/MetricsApi.java`	New file	4 hours
LOW	Config Externalization	Multiple files	Modify	8 hours

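Total Development Effort: ~36 hours (about 1 week). Note that section 3.2 estimates 3-5 days for the Hazelcast work once testing and validation are included, against the 6 hours listed here.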

4. Kubernetes Resources

4.1 Namespace

apiVersion: v1
kind: Namespace
metadata:
  name: tqpro
  labels:
    name: tqpro
    environment: production

4.2 ConfigMaps

4.2.1 Application Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: tqpro-config
  namespace: tqpro
data:
  # Main application properties (secrets removed)
  tourlinq.properties: |
    # Mail configuration
    mail.host=smtp.gmail.com
    mail.port=587
    mail.from=noreply@peruntours.com
    mail.username=noreply@peruntours.com
    # mail.password - injected from Secret

    # Content location (mounted PVC)
    content.location=/app/documents

    # Database (injected from environment)
    # tlinq.dbname, tlinq.dbpass - from env vars

    # Feature flags
    tripmaker.enabled=true

  # API Server configuration
  tlinqapi.properties: |
    http-port=11080
    https-port=11079

    # SSL (if terminating in pod)
    keystore-path=/app/config/perun.jks

    # Document location (PVC mount)
    content.location=/app/documents

    # Auth server (internal K8s service)
    auth.server=http://oauth2-proxy:4180
    auth.validate-url=http://oauth2-proxy:4180/oauth2/auth

    # Development mode
    dev-mode=false
    dev-mode.bypass-auth=false

  # Logging configuration (stdout for K8s)
  log.properties: |
    com.perun.tlinq.handlers=java.util.logging.ConsoleHandler
    com.perun.tlinq.level=INFO
    java.util.logging.ConsoleHandler.level=INFO
    java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
    java.util.logging.SimpleFormatter.format=%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n

  # API roles (RBAC)
  api-roles.properties: |
    # Copy entire content from config/api-roles.properties
    # ... (6KB of role mappings)

4.2.2 XML Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: tqpro-xml-config
  namespace: tqpro
data:
  # tourlinq-config.xml (database credentials removed)
  tourlinq-config.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <PluginConfig xmlns:xi="http://www.w3.org/2001/XInclude">
      <!-- Database configs with env var placeholders -->
      <Database name="tlinq">
        <url>jdbc:postgresql://${DB_HOST}:${DB_PORT}/${DB_NAME}</url>
        <username>${DB_USER}</username>
        <password>${DB_PASSWORD}</password>
      </Database>

      <!-- Plugins, Services, Entities -->
      <Entities>
        <xi:include href="entities/flight-entities.xml" xpointer="xpointer(/Entities/*)"/>
        <!-- ... other includes -->
      </Entities>
    </PluginConfig>

  # amadeus-client.xml
  amadeus-client.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Amadeus service configuration -->

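  # Entity XML files (15 files). ConfigMap keys cannot contain '/', so each key needs an
  # items[].path mapping (e.g. entities/flight-entities.xml) in the volume definition
  # to match the xi:include href used in tourlinq-config.xml.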
  flight-entities.xml: |
    # ... content
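The ${DB_HOST}-style placeholders in tourlinq-config.xml are not expanded by Kubernetes inside ConfigMap files, so the application has to resolve them when it loads the file. A minimal sketch of such a resolver is shown below; `EnvPlaceholders` is a hypothetical class, and whether the existing config loader already supports this syntax is not stated in this plan.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical resolver for ${NAME} placeholders in configuration text. */
final class EnvPlaceholders {
    private static final Pattern PLACEHOLDER =
        Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    /** Replaces every ${NAME} with env.get(NAME); fails if a value is missing. */
    static String resolve(String text, Map<String, String> env) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = env.get(name);
            if (value == null) {
                throw new IllegalStateException("No value for placeholder " + name);
            }
            // quoteReplacement keeps '$' and '\' in values (e.g. passwords) literal
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

When the result is parsed as XML, values containing characters such as & or < would additionally need XML escaping, which this sketch does not do.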

4.2.3 Nginx Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: tqpro
data:
  nginx.conf: |
    events {
        worker_connections 1024;
    }

    http {
        include /etc/nginx/mime.types;
        default_type application/octet-stream;

        # Logging
        access_log /dev/stdout;
        error_log /dev/stderr;

        # Compression
        gzip on;
        gzip_types text/plain text/css application/json application/javascript;

        upstream api_backend {
            server tqpro-api-service:11080;
        }

        server {
            listen 80;
            server_name _;

            # Admin app
            location /adm {
                alias /usr/share/nginx/html/adm;
                index index.html;
                try_files $uri $uri/ /adm/index.html;
            }

            # B2B app
            location /b2b {
                alias /usr/share/nginx/html/b2b;
                index index.html;
            }

            # Public site
            location / {
                root /usr/share/nginx/html/pub;
                index index.html;
            }

            # API proxy
            location /tlinq-api {
                proxy_pass http://api_backend;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto $scheme;
            }
        }
    }

4.3 Secrets

apiVersion: v1
kind: Secret
metadata:
  name: tqpro-db-credentials
  namespace: tqpro
type: Opaque
stringData:
  username: tlinq_user
  password: <REDACTED>
  host: postgres-service.database.svc.cluster.local
  port: "5432"
  database: tlinq

---
apiVersion: v1
kind: Secret
metadata:
  name: tqpro-api-keys
  namespace: tqpro
type: Opaque
stringData:
  # Amadeus
  amadeus.client.key: <REDACTED>
  amadeus.client.secret: <REDACTED>

  # Mail
  mail.password: <REDACTED>

  # Payment gateway
  telr.auth.key: <REDACTED>
  telr.merchant.id: <REDACTED>

  # RaynaB2B
  rayna.jwt.token: <REDACTED>

  # Twilio
  twilio.sid: <REDACTED>
  twilio.token: <REDACTED>

  # Odoo
  odoo.password: <REDACTED>

---
apiVersion: v1
kind: Secret
metadata:
  name: tqpro-ssl-keystore
  namespace: tqpro
type: Opaque
data:
  perun.jks: <BASE64_ENCODED_KEYSTORE>
  keystore.password: <BASE64_ENCODED_PASSWORD>
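Unlike the stringData fields used in the first two Secrets, values under data must be base64-encoded. The sketch below shows how the two keystore values could be produced with the JDK's Base64 encoder. `SecretDataEncoder` is a hypothetical helper, and the password "changeit" is only a sample value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper for producing values for a Secret's data section. */
final class SecretDataEncoder {

    /** Encodes raw bytes (for example the contents of perun.jks) for the data section. */
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    /** Encodes a text value such as the keystore password. */
    static String encodeText(String value) {
        return encode(value.getBytes(StandardCharsets.UTF_8));
    }
}
```

For the keystore file, the bytes could be read with `java.nio.file.Files.readAllBytes` and passed to `encode`.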

4.4 PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tqpro-documents
  namespace: tqpro
spec:
  accessModes:
    - ReadWriteMany  # Multiple pods can write
  storageClassName: efs-sc  # AWS EFS / Azure Files / NFS
  resources:
    requests:
      storage: 50Gi

4.5 Deployment - API Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tqpro-api
  namespace: tqpro
  labels:
    app: tqpro-api
    version: v1
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tqpro-api
  template:
    metadata:
      labels:
        app: tqpro-api
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "11080"
        prometheus.io/path: "/tlinq-api/metrics"
    spec:
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000

      # Init container to validate config
      initContainers:
      - name: config-validator
        image: registry.company.com/tqpro/api:1.0.0
        command: ['sh', '-c', 'ls -la /app/config && echo Config mounted successfully']
        volumeMounts:
        - name: config
          mountPath: /app/config
          readOnly: true

      containers:
      - name: api
        image: registry.company.com/tqpro/api:1.0.0
        imagePullPolicy: IfNotPresent

        ports:
        - name: http
          containerPort: 11080
          protocol: TCP
        - name: https
          containerPort: 11079
          protocol: TCP

        # Environment variables
        env:
        - name: DEPLOYMENT_MODE
          value: "kubernetes"
        - name: K8S_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP

        # Database configuration
        - name: DB_HOST
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: host
        - name: DB_PORT
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: port
        - name: DB_NAME
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: database
        - name: DB_USER
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: username
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: tqpro-db-credentials
              key: password

        # API Keys
        - name: AMADEUS_CLIENT_KEY
          valueFrom:
            secretKeyRef:
              name: tqpro-api-keys
              key: amadeus.client.key
        - name: AMADEUS_CLIENT_SECRET
          valueFrom:
            secretKeyRef:
              name: tqpro-api-keys
              key: amadeus.client.secret
        - name: MAIL_PASSWORD
          valueFrom:
            secretKeyRef:
              name: tqpro-api-keys
              key: mail.password

        # JVM options
        - name: JAVA_OPTS
          value: "-Xmx2g -Xms512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/logs"

        # Resource limits
        resources:
          requests:
            cpu: 1000m
            memory: 2Gi
          limits:
            cpu: 2000m
            memory: 4Gi

        # Health checks
        livenessProbe:
          httpGet:
            path: /tlinq-api/health/live
            port: 11080
          initialDelaySeconds: 90
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3

        readinessProbe:
          httpGet:
            path: /tlinq-api/health/ready
            port: 11080
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3

        # Startup probe (for slow starts)
        startupProbe:
          httpGet:
            path: /tlinq-api/health/live
            port: 11080
          initialDelaySeconds: 0
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 30  # 5 minutes max

        # Volume mounts
        volumeMounts:
        - name: config
          mountPath: /app/config
          readOnly: true
        - name: documents
          mountPath: /app/documents
        - name: logs
          mountPath: /app/logs

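      # Volumes
      volumes:
      - name: config
        projected:
          sources:
          - configMap:
              name: tqpro-config
          - configMap:
              name: tqpro-xml-config
          # Needed only when TLS terminates in the pod:
          # provides the file referenced by keystore-path=/app/config/perun.jks
          - secret:
              name: tqpro-ssl-keystore
              items:
              - key: perun.jks
                path: perun.jks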
      - name: documents
        persistentVolumeClaim:
          claimName: tqpro-documents
      - name: logs
        emptyDir: {}

      # Pod affinity - spread across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - tqpro-api
              topologyKey: kubernetes.io/hostname

4.6 Deployment - Web Frontend

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tqpro-web
  namespace: tqpro
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tqpro-web
  template:
    metadata:
      labels:
        app: tqpro-web
    spec:
      containers:
      - name: nginx
        image: registry.company.com/tqpro/web:1.0.0
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        volumeMounts:
        - name: nginx-config
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf
        livenessProbe:
          httpGet:
            path: /adm/index.html
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /adm/index.html
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
      volumes:
      - name: nginx-config
        configMap:
          name: nginx-config

4.7 Services

# API Service
apiVersion: v1
kind: Service
metadata:
  name: tqpro-api-service
  namespace: tqpro
  labels:
    app: tqpro-api
spec:
  type: ClusterIP
  ports:
  - name: http
    port: 11080
    targetPort: 11080
    protocol: TCP
  selector:
    app: tqpro-api
  sessionAffinity: None  # Stateless, no affinity needed

---
# Web Service
apiVersion: v1
kind: Service
metadata:
  name: tqpro-web-service
  namespace: tqpro
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: tqpro-web

---
# Hazelcast headless service (for discovery)
apiVersion: v1
kind: Service
metadata:
  name: tqpro-hazelcast
  namespace: tqpro
spec:
  type: ClusterIP
  clusterIP: None  # Headless service
  ports:
  - name: hazelcast
    port: 5701
  selector:
    app: tqpro-api

4.8 Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tqpro-ingress
  namespace: tqpro
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - tqpro.company.com
    secretName: tqpro-tls
  rules:
  - host: tqpro.company.com
    http:
      paths:
      # API endpoints
      - path: /tlinq-api
        pathType: Prefix
        backend:
          service:
            name: tqpro-api-service
            port:
              number: 11080
      # Web frontend
      - path: /
        pathType: Prefix
        backend:
          service:
            name: tqpro-web-service
            port:
              number: 80

4.9 HorizontalPodAutoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tqpro-api-hpa
  namespace: tqpro
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tqpro-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
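The utilization targets above feed the autoscaler's basic replica calculation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to minReplicas and maxReplicas. The sketch below reproduces only that arithmetic for sizing discussions. `HpaMath` is a hypothetical helper, and it deliberately ignores the controller's tolerance band, the scaling policies and the stabilization windows configured above.

```java
/** Hypothetical helper reproducing the basic HPA replica formula. */
final class HpaMath {

    /**
     * desired = ceil(current * currentUtilization / targetUtilization),
     * clamped to the [minReplicas, maxReplicas] range.
     */
    static int desiredReplicas(int currentReplicas, double currentUtilization,
                               double targetUtilization, int minReplicas, int maxReplicas) {
        int desired = (int) Math.ceil(currentReplicas * (currentUtilization / targetUtilization));
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }
}
```

For example, 3 pods averaging 90% CPU against the 70% target give ceil(3 × 90 / 70) = 4 replicas. When several metrics are configured, as with CPU and memory here, the autoscaler uses the largest of the per-metric results.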

4.10 PodDisruptionBudget¶

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: tqpro-api-pdb
  namespace: tqpro
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: tqpro-api
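
As a rough illustration of how `minAvailable: 2` interacts with the HPA's `minReplicas: 3`, the sketch below computes how many voluntary evictions (for example during a node drain) the budget would permit at a given moment. The class and method names are hypothetical, and the number of currently healthy pods is supplied by the caller.

```java
final class PdbBudget {
    /** Voluntary evictions the budget permits right now (never negative). */
    static int allowedDisruptions(int healthyPods, int minAvailable) {
        return Math.max(0, healthyPods - minAvailable);
    }

    public static void main(String[] args) {
        // At the HPA minimum of 3 healthy replicas, one pod may be evicted at a time.
        System.out.println(allowedDisruptions(3, 2)); // prints 1
    }
}
```

If one of the three replicas is already unhealthy, the result drops to 0 and a drain waits until a replacement pod is ready.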

5. Development Roadmap¶

Phase 1: Foundation (Week 1-2)¶

Objectives:
- Set up development environment
- Create initial Docker images
- Test containerization locally

Tasks:
1. Day 1-2: Docker Setup
   - Create Dockerfile for API server
   - Create Dockerfile for web frontend
   - Test local builds
   - Optimize image sizes
2. Day 3-4: Code Changes - Health Checks
   - Implement HealthApi.java
   - Add liveness, readiness, startup endpoints
   - Test health endpoints locally
   - Update API registration
3. Day 5-6: Code Changes - Database
   - Modify TlinqDBSession.java for env vars
   - Add HikariCP connection pooling
   - Test with local PostgreSQL
   - Validate connection fallback logic
4. Day 7-8: Code Changes - Hazelcast
   - Modify TlinqClusterCache.java
   - Add Kubernetes discovery support
   - Add bare-metal fallback
   - Test multicast still works locally
5. Day 9-10: Testing & Documentation
   - End-to-end local testing
   - Document all changes
   - Create migration guide
   - Code review

Deliverables:
- ✅ Working Docker images
- ✅ Health check endpoints functional
- ✅ Database configuration externalized
- ✅ Hazelcast dual-mode support
- ✅ Local testing passed

Phase 2: Configuration & Secrets (Week 3)¶

Objectives:
- Externalize all configuration
- Migrate secrets to Kubernetes Secrets
- Create ConfigMaps

Tasks:
1. Day 1-2: ConfigMap Creation
   - Create tqpro-config ConfigMap
   - Create tqpro-xml-config ConfigMap
   - Create nginx-config ConfigMap
   - Validate XML parsing
2. Day 3-4: Secrets Migration
   - Identify all secrets (20+ credentials)
   - Create Secret manifests
   - Create secret rotation plan
   - Document secret access patterns
3. Day 5: Environment Variable Injection
   - Update code to read secrets from env vars
   - Test secret injection
   - Validate fallback to config files
4. Day 6-7: Testing
   - Test with Secrets mounted
   - Test with ConfigMaps mounted
   - Validate environment variable precedence
   - Test bare-metal compatibility

Deliverables:
- ✅ All ConfigMaps created
- ✅ All Secrets identified and documented
- ✅ Code reads from environment variables
- ✅ Bare-metal mode still works

Phase 3: Kubernetes Deployment (Week 4)¶

Objectives:
- Deploy to development K8s cluster
- Configure storage and networking
- Validate functionality

Tasks:
1. Day 1: Cluster Setup
   - Create namespace
   - Apply RBAC policies
   - Set up StorageClass (EFS/Azure Files)
   - Create PersistentVolumeClaim
2. Day 2: Deploy Dependencies
   - Deploy PostgreSQL (or configure external)
   - Create database schema
   - Deploy Hazelcast StatefulSet (optional)
   - Test database connectivity
3. Day 3: Deploy Application
   - Apply ConfigMaps
   - Apply Secrets
   - Deploy API server (1 replica initially)
   - Check logs for errors
4. Day 4: Networking
   - Create Services
   - Configure Ingress
   - Set up TLS certificates
   - Test external access
5. Day 5: Scale & Test
   - Scale to 3 replicas
   - Test Hazelcast clustering
   - Test file uploads (shared storage)
   - Load testing

Deliverables:
- ✅ Application running in K8s
- ✅ 3 replicas healthy
- ✅ Ingress accessible
- ✅ All health checks passing
- ✅ Hazelcast cluster formed

Phase 4: Observability & Monitoring (Week 5)¶

Objectives:
- Set up log aggregation
- Configure monitoring
- Add alerts

Tasks:
1. Day 1-2: Logging
   - Deploy EFK stack (Elasticsearch, Fluentd, Kibana)
   - Configure log shipping
   - Create log dashboards
   - Test log queries
2. Day 3-4: Monitoring
   - Deploy Prometheus
   - Configure ServiceMonitor
   - Deploy Grafana
   - Create dashboards (CPU, memory, requests)
3. Day 5: Alerting
   - Configure AlertManager
   - Create alert rules (pod down, high CPU, DB errors)
   - Test alert routing
   - Document runbooks

Deliverables:
- ✅ Centralized logging
- ✅ Prometheus metrics collection
- ✅ Grafana dashboards
- ✅ Alerts configured

Phase 5: Production Hardening (Week 6)¶

Objectives:
- Security hardening
- Performance optimization
- Disaster recovery

Tasks:
1. Day 1-2: Security
   - Network policies (restrict pod-to-pod)
   - Security context (non-root user)
   - Image scanning (Trivy/Aqua)
   - Secret rotation automation
2. Day 3: Performance
   - JVM tuning
   - Connection pool optimization
   - Cache configuration tuning
   - Load testing (JMeter)
3. Day 4: High Availability
   - PodDisruptionBudget
   - HorizontalPodAutoscaler
   - Multi-zone deployment
   - Database failover testing
4. Day 5: Disaster Recovery
   - Backup automation (Velero)
   - Database backup/restore
   - Disaster recovery plan
   - Runbook documentation

Deliverables:
- ✅ Security scan passed
- ✅ Performance benchmarks met
- ✅ HA configuration tested
- ✅ Backup/restore validated

Phase 6: Production Deployment (Week 7-8)¶

Objectives:
- Deploy to staging
- User acceptance testing
- Deploy to production

Tasks:
1. Week 7: Staging Deployment
   - Deploy to staging cluster
   - Run smoke tests
   - User acceptance testing
   - Performance testing
   - Security audit
2. Week 8: Production Deployment
   - Blue-green deployment setup
   - Deploy to production (canary rollout)
   - Monitor metrics closely
   - Gradual traffic shift (10% → 50% → 100%)
   - Rollback plan ready

Deliverables:
- ✅ Staging validated
- ✅ Production deployed
- ✅ Zero downtime migration
- ✅ Rollback tested

6. Configuration Migration¶

6.1 Configuration Files Mapping¶

File	Bare-Metal Location	Kubernetes Resource	Notes
tourlinq-config.xml	`$TLINQ_HOME/tourlinq-config.xml`	ConfigMap `tqpro-xml-config`	Remove DB credentials
tourlinq.properties	`$TLINQ_HOME/tourlinq.properties`	ConfigMap `tqpro-config`	Remove secrets
tlinqapi.properties	`$TLINQ_HOME/tlinqapi.properties`	ConfigMap `tqpro-config`	Update paths
log.properties	`$TLINQ_HOME/log.properties`	ConfigMap `tqpro-config`	Stdout only
api-roles.properties	`$TLINQ_HOME/api-roles.properties`	ConfigMap `tqpro-config`	No changes
amadeus-client.xml	`$TLINQ_HOME/amadeus-client.xml`	ConfigMap `tqpro-xml-config`	No changes
Entity XMLs (15 files)	`$TLINQ_HOME/entities/*.xml`	ConfigMap `tqpro-xml-config`	No changes
amadeus.idfile	Hardcoded path	Secret `tqpro-api-keys`	Two-line CSV
perun.jks	Hardcoded path	Secret `tqpro-ssl-keystore`	Binary file

6.2 Environment Variables Strategy¶

Deployment Mode:

# Kubernetes deployment
DEPLOYMENT_MODE=kubernetes

# Bare-metal deployment
DEPLOYMENT_MODE=baremetal

Database Configuration:

DB_HOST=postgres-service.database.svc.cluster.local
DB_PORT=5432
DB_NAME=tlinq
DB_USER=tlinq_user
DB_PASSWORD=<from-secret>

Hazelcast Configuration:

K8S_NAMESPACE=tqpro
HAZELCAST_SERVICE=tqpro-hazelcast
HAZELCAST_PORT=5701

Application Configuration:

TLINQ_HOME=/app/config
JAVA_OPTS=-Xmx2g -Xms512m -XX:+UseG1GC
LOG_MODE=kubernetes
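
The intended precedence (environment variable first, config file as the bare-metal fallback) could be sketched as below. This is an illustrative assumption, not the actual `TlinqDBSession` code: the class name, method names, and the `db.*` property keys are hypothetical, and the real fallback values live in `tourlinq-config.xml` rather than a `Properties` object.

```java
import java.util.Map;
import java.util.Properties;

final class DbSettings {
    /** Returns the environment value if set and non-blank, otherwise the file-based fallback. */
    static String resolve(Map<String, String> env, String envKey,
                          Properties fileConfig, String fileKey) {
        String value = env.get(envKey);
        if (value != null && !value.isBlank()) {
            return value;
        }
        return fileConfig.getProperty(fileKey);
    }

    /** Builds a PostgreSQL JDBC URL from DB_HOST, DB_PORT and DB_NAME with file fallbacks. */
    static String jdbcUrl(Map<String, String> env, Properties fileConfig) {
        String host = resolve(env, "DB_HOST", fileConfig, "db.host"); // hypothetical file keys
        String port = resolve(env, "DB_PORT", fileConfig, "db.port");
        String name = resolve(env, "DB_NAME", fileConfig, "db.name");
        return "jdbc:postgresql://" + host + ":" + port + "/" + name;
    }

    public static void main(String[] args) {
        Properties fileConfig = new Properties();
        fileConfig.setProperty("db.host", "localhost");
        fileConfig.setProperty("db.port", "5432");
        fileConfig.setProperty("db.name", "tlinq");
        // System.getenv() supplies the real environment; passing a Map keeps the logic testable.
        System.out.println(jdbcUrl(System.getenv(), fileConfig));
    }
}
```

Passing the environment as a `Map` rather than calling `System.getenv()` inside the method makes the precedence rule straightforward to unit test in both deployment modes.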

6.3 Secrets Extraction¶

From tourlinq.properties:

# Extract to Secret (credential values redacted in this document)
mail.password=<redacted>
telr.auth-key=<redacted>
telr.merchant-id=21401
rayna.jwt.token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9...
twilio.sid=ACb7e13c05a35d56650e4a8c528226fa13
twilio.token=<redacted>

From tourlinq-config.xml:

<!-- Extract to Secret -->
<Database name="tlinq">
  <username>TlinqUser</username>
  <password>REDACTED</password>
</Database>

From odoo-client.properties:

# Extract to Secret
odoo.user=odoo@peruntours.com
odoo.password=<redacted>
odoo.session.id=<redacted>

Amadeus Credentials:

# amadeus.idfile content -> Secret
AMADEUS_CLIENT_KEY,AMADEUS_CLIENT_SECRET
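
One way to carry out the split for a properties file is to move the listed keys into a separate object that becomes the Secret, leaving the remainder for the ConfigMap. The sketch below is a hedged illustration only: the class and method names are hypothetical, and the key set covers just the `tourlinq.properties` entries listed above, so it would need extending as more credentials are identified.

```java
import java.util.Properties;
import java.util.Set;

final class SecretSplitter {
    // Keys taken from the tourlinq.properties excerpt above; extend as more are identified.
    static final Set<String> SECRET_KEYS = Set.of(
        "mail.password", "telr.auth-key", "telr.merchant-id",
        "rayna.jwt.token", "twilio.sid", "twilio.token");

    /** Moves secret entries out of source into a new Properties object and returns it. */
    static Properties extractSecrets(Properties source) {
        Properties secrets = new Properties();
        // stringPropertyNames() returns a snapshot, so removing entries while looping is safe.
        for (String key : source.stringPropertyNames()) {
            if (SECRET_KEYS.contains(key)) {
                secrets.setProperty(key, source.getProperty(key));
                source.remove(key);
            }
        }
        return secrets;
    }

    public static void main(String[] args) {
        Properties all = new Properties();
        all.setProperty("twilio.token", "placeholder");
        Properties secrets = extractSecrets(all);
        System.out.println(secrets.size() + " secret(s), " + all.size() + " config entries left");
    }
}
```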

7. Risk Assessment¶

7.1 High Risk Items¶

Risk	Impact	Probability	Mitigation
Hazelcast clustering fails in K8s	Cache not shared across pods	Medium	Test thoroughly; fallback to external Hazelcast
Shared storage performance	Slow document upload/download	Medium	Use high-performance storage (EFS Provisioned Throughput)
Database connection pool exhaustion	API failures under load	Medium	Implement HikariCP with proper sizing
Secret rotation breaks production	Application crashes	Low	Implement graceful secret reload
Configuration file parsing errors	App fails to start	Medium	Add init container to validate config

7.2 Medium Risk Items¶

Risk	Impact	Probability	Mitigation
Memory leaks in long-running pods	OOM kills	Medium	Monitor heap usage; set memory limits
External API rate limits	Service degradation	High	Implement circuit breakers and retries
SSL certificate expiry	HTTPS access lost	Low	Use cert-manager for auto-renewal
Log volume too high	Storage costs increase	Medium	Implement log level filtering
Migration downtime longer than expected	User impact	Medium	Extensive testing in staging

7.3 Low Risk Items¶

Risk	Impact	Probability	Mitigation
Container image size too large	Slow deployments	Low	Multi-stage builds optimize size
Pod startup time too slow	Slow scaling	Low	Tune JVM startup; use CDS
Ingress configuration errors	404 errors	Low	Test routing before production

8. Testing Strategy¶

8.1 Unit Testing¶

Scope: Code changes for Kubernetes compatibility

Tests to Add:
1. HealthApi Tests
   - Liveness returns 200
   - Readiness returns 200 when DB up
   - Readiness returns 503 when DB down
2. TlinqDBSession Tests
   - Reads from environment variables
   - Falls back to config file
   - Connection pool works
3. TlinqClusterCache Tests
   - Kubernetes mode uses K8s discovery
   - Bare-metal mode uses multicast
   - Environment variable parsing

Tools: JUnit 5, Mockito
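
The HealthApi expectations listed above come down to a small piece of decision logic that can be tested without a running server or database. A minimal sketch follows, assuming the database check is injected as a `BooleanSupplier`. `ReadinessCheck` and its methods are hypothetical names, and the JAX-RS wiring of the planned `HealthApi.java` is deliberately left out.

```java
import java.util.function.BooleanSupplier;

final class ReadinessCheck {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    /** Liveness only proves the process is up, so it never touches the database. */
    static int liveness() {
        return OK;
    }

    /** Readiness reports 503 when the database probe returns false or throws. */
    static int readiness(BooleanSupplier databaseReachable) {
        try {
            return databaseReachable.getAsBoolean() ? OK : SERVICE_UNAVAILABLE;
        } catch (RuntimeException e) {
            return SERVICE_UNAVAILABLE;
        }
    }

    public static void main(String[] args) {
        System.out.println(readiness(() -> true));  // prints 200
        System.out.println(readiness(() -> false)); // prints 503
    }
}
```

In a JUnit 5 test the supplier would be a lambda or a Mockito stub, which covers the "DB up" and "DB down" cases without a real connection.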

8.2 Integration Testing¶

Scope: Container and Kubernetes integration

Test Scenarios:
1. Container Build
   - Docker build succeeds
   - Image size within limits (<300MB)
   - Security scan passes (no critical CVEs)
2. Container Runtime
   - Container starts successfully
   - Health endpoints respond
   - Logs appear on stdout
3. Kubernetes Deployment
   - Pods start and become ready
   - ConfigMaps mounted correctly
   - Secrets injected as env vars
   - PVC mounted and writable
4. Multi-Pod
   - 3 pods run simultaneously
   - Hazelcast cluster forms
   - Cache shared across pods
   - Load balanced correctly

8.3 Functional Testing¶

Scope: End-to-end business functionality

Test Cases:
1. Flight Search
   - Search flights (JFK → LHR)
   - Verify results returned
   - Confirm pricing
   - Create booking
2. Hotel Search
   - Search hotels in Paris
   - View offers
   - Verify pricing
3. Document Management
   - Upload PDF
   - Verify file saved to PVC
   - Download PDF from another pod
4. User Management
   - Login via OAuth2
   - Access admin endpoints
   - Role-based access control

8.4 Performance Testing¶

Tools: Apache JMeter, k6

Test Scenarios:

1. Load Test
   - 100 concurrent users
   - Mixed API calls (search, booking)
   - Duration: 30 minutes
   - Success rate: >99%
   - Avg response time: <500ms
2. Stress Test
   - Gradually increase load to 500 users
   - Identify breaking point
   - Monitor auto-scaling behavior
   - Verify graceful degradation
3. Soak Test
   - 50 concurrent users
   - Duration: 12 hours
   - Check for memory leaks
   - Verify no degradation over time

Acceptance Criteria:
- Response time p95 < 1 second
- Error rate < 0.1%
- No memory leaks
- Auto-scaling triggers appropriately
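
The two numeric criteria can be checked mechanically from raw test results. The sketch below shows one way to do so using the nearest-rank percentile definition. JMeter and k6 report percentiles themselves and may use a slightly different interpolation, so treat this as an illustration with hypothetical names rather than the tools' exact calculation. It assumes one latency sample per request.

```java
import java.util.Arrays;

final class AcceptanceCheck {
    /** Nearest-rank percentile; expects at least one sample and pct in 1..100. */
    static long percentile(long[] latenciesMs, int pct) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (pct * sorted.length + 99) / 100; // integer ceiling of pct% of n
        return sorted[rank - 1];
    }

    /** Applies the p95 < 1 second and error rate < 0.1% criteria. */
    static boolean passes(long[] latenciesMs, long failedRequests) {
        double errorRate = (double) failedRequests / latenciesMs.length;
        return percentile(latenciesMs, 95) < 1000 && errorRate < 0.001;
    }

    public static void main(String[] args) {
        long[] samples = new long[100];
        Arrays.fill(samples, 200);  // 100 requests at 200 ms...
        samples[0] = 3000;          // ...except one slow outlier
        System.out.println(passes(samples, 0)); // prints true
    }
}
```

A single slow outlier in 100 requests does not move p95, which is why the criteria use a percentile rather than the maximum.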

8.5 Disaster Recovery Testing¶

Test Scenarios:

1. Pod Failure
   - Delete random pod
   - Verify auto-restart
   - Verify no service interruption
2. Node Failure
   - Drain node
   - Verify pods rescheduled
   - Verify service continuity
3. Database Failure
   - Stop database
   - Verify readiness probe fails
   - Verify graceful error handling
   - Restore database
   - Verify automatic recovery
4. Complete Cluster Failure
   - Backup application state
   - Destroy cluster
   - Restore from backup
   - Verify data integrity

9. Rollout Plan¶

9.1 Pre-Deployment Checklist¶

Infrastructure:
- [ ] Kubernetes cluster provisioned (3+ nodes)
- [ ] StorageClass configured (EFS/Azure Files/NFS)
- [ ] PostgreSQL database deployed or accessible
- [ ] Ingress controller installed (nginx-ingress)
- [ ] Cert-manager installed (for TLS)
- [ ] Monitoring stack deployed (Prometheus/Grafana)
- [ ] Logging stack deployed (EFK/Loki)

Application:
- [ ] Docker images built and pushed to registry
- [ ] ConfigMaps created
- [ ] Secrets created (encrypted in repo via sealed-secrets)
- [ ] Database schema migrated
- [ ] Health endpoints tested
- [ ] Code changes merged to main branch

Documentation:
- [ ] Deployment runbook completed
- [ ] Rollback procedure documented
- [ ] Monitoring dashboard created
- [ ] Alert rules configured
- [ ] On-call rotation established

9.2 Deployment Steps¶

Development Environment (Week 4):

1. Create namespace
   ```
   kubectl create namespace tqpro-dev
   ```
2. Apply secrets (from sealed-secrets or vault)
   ```
   kubectl apply -f k8s/dev/secrets/
   ```
3. Apply ConfigMaps
   ```
   kubectl apply -f k8s/dev/configmaps/
   ```
4. Create PVC
   ```
   kubectl apply -f k8s/dev/storage/
   ```
5. Deploy database (if not external)
   ```
   kubectl apply -f k8s/dev/database/
   ```
6. Deploy application (1 replica)
   ```
   kubectl apply -f k8s/dev/deployment.yaml
   ```
7. Verify health
   ```
   kubectl get pods -n tqpro-dev
   kubectl logs -f deployment/tqpro-api -n tqpro-dev
   ```
8. Create service
   ```
   kubectl apply -f k8s/dev/service.yaml
   ```
9. Create ingress
   ```
   kubectl apply -f k8s/dev/ingress.yaml
   ```
10. Test access
    ```
    curl https://tqpro-dev.company.com/tlinq-api/health
    ```

Staging Environment (Week 7):

1. Repeat steps above with tqpro-staging namespace
2. Run full test suite
3. User acceptance testing
4. Performance testing
5. Security audit

Production Environment (Week 8):

Blue-Green Strategy:
- Deploy "green" environment alongside existing "blue"
- Route 10% traffic to green
- Monitor metrics for 1 hour
- Gradually increase to 50%, then 100%
- Keep blue running for 24 hours as rollback option

Deployment Command:

# Deploy green
kubectl apply -f k8s/prod/deployment-green.yaml

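# Shift a share of traffic to green. A plain Service cannot weight traffic between
# backends, so this manifest is assumed to configure the split at the Ingress or mesh layer
kubectl apply -f k8s/prod/service-weighted.yaml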

# Monitor
kubectl get pods -n tqpro -l version=green -w

# Full cutover
kubectl apply -f k8s/prod/service.yaml

# Cleanup blue after 24h
kubectl delete deployment tqpro-api-blue -n tqpro

9.3 Rollback Procedure¶

If issues detected during rollout:

Immediate Rollback (< 5 minutes):

# Revert service to blue
kubectl apply -f k8s/prod/service-blue.yaml

# Or, if the release was an in-place rolling update, revert to the previous revision
kubectl rollout undo deployment/tqpro-api -n tqpro

Post-Rollback:
- Investigate root cause
- Fix issues in dev/staging
- Re-test thoroughly
- Schedule new deployment

9.4 Post-Deployment Validation¶

Automated Checks:
- [ ] All pods healthy (3/3 running)
- [ ] Health endpoints returning 200
- [ ] Ingress routing correctly
- [ ] SSL certificate valid
- [ ] Metrics being collected
- [ ] Logs being aggregated

Manual Checks:
- [ ] Login functionality works
- [ ] Flight search works
- [ ] Hotel search works
- [ ] Document upload works
- [ ] API responses correct
- [ ] No errors in logs

Performance Validation:
- [ ] Response times within SLA
- [ ] No increase in error rate
- [ ] Database connections stable
- [ ] Memory usage normal
- [ ] CPU usage normal

9.5 Monitoring & Alerts¶

Key Metrics to Monitor:

1. Application Health:
   - Pod restart count
   - Health check failures
   - Application errors (500s)
2. Performance:
   - Request rate
   - Response time (p50, p95, p99)
   - Error rate
   - Database query time
3. Resource Usage:
   - CPU utilization
   - Memory usage
   - Disk I/O (PVC)
   - Network traffic
4. External Dependencies:
   - Database connection pool
   - Amadeus API latency
   - Odoo API availability
   - Cache hit ratio

Alert Thresholds:
- Pod crash loop: Immediate alert
- Error rate > 1%: Warning
- Error rate > 5%: Critical
- Response time p95 > 2s: Warning
- Response time p95 > 5s: Critical
- Memory > 90%: Warning
- CPU > 80% for 5min: Warning
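
The error-rate and latency thresholds form two tiers, with the critical bound checked before the warning bound. In the planned stack these would be expressed as Prometheus alert rules; the Java sketch below only illustrates the tiering logic, with hypothetical class and method names.

```java
final class AlertRules {
    enum Severity { NONE, WARNING, CRITICAL }

    /** Error rate as a fraction of requests, e.g. 0.02 for 2%. */
    static Severity forErrorRate(double errorRate) {
        if (errorRate > 0.05) return Severity.CRITICAL; // above 5%
        if (errorRate > 0.01) return Severity.WARNING;  // above 1%
        return Severity.NONE;
    }

    /** 95th percentile response time in seconds. */
    static Severity forP95Seconds(double p95Seconds) {
        if (p95Seconds > 5.0) return Severity.CRITICAL;
        if (p95Seconds > 2.0) return Severity.WARNING;
        return Severity.NONE;
    }

    public static void main(String[] args) {
        System.out.println(forErrorRate(0.02));  // prints WARNING
        System.out.println(forP95Seconds(6.0));  // prints CRITICAL
    }
}
```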

10. Cost Analysis¶

10.1 Infrastructure Costs (Monthly)¶

AWS EKS Example:

Component	Specification	Cost
EKS Cluster	Control plane	$73
EC2 Nodes	3x t3.xlarge (4 CPU, 16GB)	$223
EBS Volumes	300GB gp3	$24
EFS Storage	50GB + 5 MB/s provisioned	$65
Application Load Balancer	1 ALB	$23
Data Transfer	500GB/month	$45
RDS PostgreSQL	db.t3.medium (2 CPU, 4GB)	$123
CloudWatch	Logs + Metrics	$30
Subtotal		$606/month

Azure AKS Example:

Component	Specification	Cost
AKS Cluster	Control plane (free)	$0
VM Nodes	3x Standard_D4s_v3 (4 CPU, 16GB)	$347
Managed Disks	300GB Premium SSD	$51
Azure Files	50GB Premium	$110
Application Gateway	Standard_v2	$267
Azure Database for PostgreSQL	General Purpose, 2 vCores	$182
Log Analytics	10GB/day	$25
Subtotal		$982/month

GCP GKE Example:

Component	Specification	Cost
GKE Cluster	Control plane	$73
Compute Nodes	3x n1-standard-4 (4 CPU, 15GB)	$292
Persistent Disks	300GB SSD	$51
Filestore	1TB Basic	$204
Cloud Load Balancer	External HTTPS	$18
Cloud SQL PostgreSQL	db-n1-standard-2	$158
Cloud Logging	50GB/month	$25
Subtotal		$821/month

10.2 Scaling Costs¶

Auto-Scaling Impact:

Estimated impact with HPA configured (3-10 replicas); the actual increase depends on how many pods fit on each node:
- Minimum (3 replicas): Base cost
- Average (5 replicas): +40% compute cost
- Peak (10 replicas): +100% compute cost

Recommendation: Set HPA max based on budget constraints and actual traffic patterns.

10.3 Cost Optimization Strategies¶

1. Right-Sizing:
   - Start with smaller instances
   - Use metrics to adjust
   - Potentially save 30-40%
2. Reserved Instances:
   - 1-year reserved instances: ~30% savings
   - 3-year reserved instances: ~50% savings
3. Spot Instances:
   - Use for non-critical workloads
   - 70-90% savings
   - Not recommended for production API
4. Storage Optimization:
   - Use object storage (S3) instead of EFS for documents
   - Potential savings: $50/month
5. Multi-Tenancy:
   - Share cluster with other applications
   - Reduce per-app overhead

Estimated Optimized Cost: $400-500/month (vs roughly $600-980 unoptimized, per the subtotals in 10.1)

11. Conclusion¶

11.1 Summary¶

The TQPro application is well-suited for Kubernetes deployment with the following assessment:

Strengths ✅:
- Stateless REST API design (well suited to K8s)
- Embedded Jetty server (no external app server needed)
- Multi-module architecture (clear separation)
- Standard Java/Gradle stack (well-supported)

Challenges ⚠️:
- Configuration externalization required
- Hazelcast needs K8s discovery implementation
- Secrets hardcoded in config files
- No health check endpoints (must add)
- Shared file storage needed

Overall Effort: 6-8 weeks (including testing)

Cost: roughly $400-980/month depending on cloud provider and optimization

Risk Level: Medium (manageable with proper planning)

11.2 Recommendations¶

Immediate Actions:
1. ✅ Approve development plan and budget
2. ✅ Provision development Kubernetes cluster
3. ✅ Assign development team (2-3 engineers)
4. ✅ Start Phase 1 (code changes)

Critical Success Factors:
1. Thorough testing in staging before production
2. Gradual rollout with rollback capability
3. Comprehensive monitoring from day one
4. Clear runbooks for operations team

Future Enhancements (Post-Deployment):
1. Service mesh (Istio) for advanced traffic management
2. GitOps (ArgoCD/Flux) for declarative deployments
3. Chaos engineering for resilience testing
4. Multi-region deployment for disaster recovery

11.3 Decision Point¶

Proceed with Kubernetes deployment?
- ☐ Yes - Begin Phase 1 immediately
- ☐ No - Document reasons and revisit in 6 months
- ☐ Partial - Start with dev/staging only

Appendix A: Hazelcast vs Alternatives Analysis¶

A.1 Question: Should Hazelcast be replaced with another distributed cache?¶

Initial Consideration: Replace Hazelcast with Caffeine (local cache) or Redis (managed service)

Analysis Performed: Deep code analysis of all cache usage patterns across the application

A.2 Cache Usage Discovery¶

Finding: Hazelcast is minimally used but CRITICAL for multi-instance deployment

Only 4 files use Hazelcast:
1. CartHolder.java - Shopping cart storage (session-based)
2. OdooServiceFactory.java - User session management
3. ApiRoleManager.java - API authorization cache
4. TlinqFrameworkInitializer.java - Test initialization

3 Active Caches:

Cache Name	Purpose	Critical	DB Fallback	Multi-Pod Requirement
cartsCache	Shopping carts	YES	Partial*	REQUIRED
userSessions	Odoo authentication sessions	YES	NO	CRITICAL
apiRolesCache	API RBAC	MEDIUM	File reload	Optional

*Logged-in user carts have database fallback; anonymous carts do not

A.3 Multi-Instance Impact Analysis¶

User Sessions (CRITICAL):

// OdooServiceFactory.java:146-158
private UserLogin fetchSession(String sessionToken) {
    UserLogin session = sessions.get(sessionToken);  // Hazelcast lookup
    if (null == session) {
        // NO DATABASE FALLBACK - User gets logged out!
        throw new TlinqClientException(TlinqErr.SESSION_ERROR, "Session expired");
    }
    return session;
}

Impact if cache not distributed:
- ❌ User sessions lost when request hits different pod
- ❌ Users forced to re-authenticate frequently
- ❌ Poor user experience
- ❌ Horizontal scaling impossible

Shopping Carts (HIGH PRIORITY):
- Logged-in users: ✅ Database fallback works
- Anonymous users: ❌ Cart lost on pod switch (requires sticky sessions)

A.4 Alternative Evaluation¶

Option 1: Caffeine (Local Cache)¶

Pros:
- ✅ Lightweight (~1MB vs 4MB)
- ✅ Better memory management (TTL + eviction built-in)
- ✅ No network configuration needed
- ✅ 1-2 day migration effort

Cons:
- ❌ BREAKS multi-instance deployment
- ❌ User sessions not shared between pods
- ❌ Anonymous carts lost on pod switch
- ❌ Requires sticky sessions (defeats K8s benefits)

Verdict: ❌ NOT SUITABLE for Kubernetes multi-pod deployment

Option 2: Redis (External Cache)¶

Pros:
- ✅ True distributed caching
- ✅ Managed service available (AWS ElastiCache, Azure Cache for Redis)
- ✅ Persistence across restarts
- ✅ Simpler K8s deployment (no in-app clustering)
- ✅ Can share with other services

Cons:
- ⚠️ Additional infrastructure ($50-100/month)
- ⚠️ 2-week migration effort
- ⚠️ New dependency to manage
- ⚠️ Network latency for cache operations

Verdict: ✅ VIABLE ALTERNATIVE but higher effort/cost

Option 3: Hazelcast (Fixed for Kubernetes)¶

Pros:
- ✅ Already integrated (4 files, 3 caches)
- ✅ Designed for distributed caching
- ✅ Zero additional infrastructure cost
- ✅ Embedded in application (no external service)
- ✅ 3-5 day migration effort
- ✅ Backward compatible with bare-metal

Cons:
- ⚠️ Requires Kubernetes discovery configuration
- ⚠️ Current config has hardcoded IP (fixable)
- ⚠️ No TTL/eviction configured (fixable)
- ⚠️ Cluster management in-app

Verdict: ⭐ RECOMMENDED - Best fit for current architecture

A.5 Decision Matrix¶

Criteria	Caffeine	Hazelcast (Fixed)	Redis
Multi-Instance Support	❌ NO	✅ YES	✅ YES
Effort	1-2 days	3-5 days	2 weeks
Cost	$0	$0	$50-100/mo
Complexity	Low	Medium	Medium-High
Code Changes	4 files	4 files + config	4 files + client
Infrastructure	None	None	Redis cluster
Backward Compat	✅ Easy	✅ Easy	⚠️ Complex
Performance	Best (local)	Good (in-cluster)	Good (network hop)
Persistence	❌ None	⚠️ In-memory	✅ Disk-backed
Overall Rating	⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐

A.6 Final Recommendation¶

Keep Hazelcast and Fix Kubernetes Discovery ⭐⭐⭐⭐⭐

Rationale:
1. Minimal disruption - Already integrated, just needs K8s config
2. Cost effective - No additional infrastructure
3. Quick to implement - 3-5 days vs 2 weeks for Redis
4. Proven technology - Hazelcast designed for this exact use case
5. Backward compatible - Bare-metal deployment still works

Implementation:
- Upgrade to Hazelcast 5.3.6 (from 4.2.4)
- Enable Kubernetes service discovery (bundled with the core Hazelcast JAR since 5.0, so no separate plugin dependency is needed after the upgrade)
- Configure TTL and eviction policies
- Add health check endpoints
- See HAZELCAST_KUBERNETES_MIGRATION.md for complete implementation plan

When to Consider Redis:
- If Hazelcast clustering proves problematic in production
- If you need persistence across cluster restarts
- If you want a managed service with support
- If sharing the cache with other applications
- If the budget allows for $50-100/month additional cost

A.7 Current Hazelcast Issues Fixed¶

Before:

// ❌ Hardcoded IP
network.getJoin().getMulticastConfig()
    .addTrustedInterface("172.16.55.1");

// ❌ No TTL - memory leak risk
// ❌ No eviction policies
// ❌ No max size limits

After:

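// ✅ Kubernetes discovery
if ("kubernetes".equals(deploymentMode)) {
    // Hazelcast rejects configs with more than one join mechanism enabled,
    // and multicast is on by default, so disable it first
    network.getJoin().getMulticastConfig().setEnabled(false);
    network.getJoin().getKubernetesConfig()
        .setEnabled(true)
        .setProperty("namespace", System.getenv("K8S_NAMESPACE"))
        .setProperty("service-name", "tqpro-hazelcast");
}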

// ✅ TTL configured
MapConfig cartsConfig = new MapConfig("cartsCache");
cartsConfig.setTimeToLiveSeconds(1800);  // 30 min

// ✅ Eviction policy
cartsConfig.setEvictionConfig(new EvictionConfig()
    .setSize(10000)
    .setMaxSizePolicy(MaxSizePolicy.PER_NODE)
    .setEvictionPolicy(EvictionPolicy.LRU));

A.8 Testing Validation¶

Multi-Pod Cache Consistency Test:

# Create session on pod-1
SESSION_ID="test-session-123"
curl -X POST http://pod-1:11080/api/cart/addItem \
  -d '{"session":"'$SESSION_ID'","item":"ITEM123"}'

# Retrieve cart from pod-2 (different pod!)
curl -X POST http://pod-2:11080/api/cart/load \
  -d '{"session":"'$SESSION_ID'"}'

# ✅ Should return cart with ITEM123 (cache shared)
# ❌ With Caffeine: Would return empty cart (cache not shared)

Expected Results with Hazelcast:
- ✅ Sessions accessible from all pods
- ✅ Carts consistent across pod switches
- ✅ Cluster size matches pod count
- ✅ No session loss during scaling events

Document Owner: DevOps Team
Last Updated: 2024-11-23
Next Review: After Phase 1 completion

End of Document