TQPro Kubernetes Deployment Plan
Document Version: 1.0
Date: 2024-11-23
Status: Analysis Complete - Development Plan
Target Environment: Kubernetes (AWS EKS / Azure AKS / GKE)
Executive Summary
The TQPro application is a good candidate for containerization and needs only moderate configuration changes. It is a REST API built on embedded Jetty that keeps no server-side HTTP sessions, which makes it well suited to Kubernetes deployment.
Overall Assessment: ✅ FEASIBLE - Estimated 4-6 weeks development effort
Key Findings:
- ✅ Stateless design (no server-side sessions)
- ✅ Embedded server (Jetty 12) - no external app server needed
- ✅ RESTful API architecture
- ⚠️ Configuration requires externalization
- ⚠️ Hazelcast needs Kubernetes discovery
- ⚠️ Hardcoded secrets must be moved to Secrets
- ❌ No health check endpoints (must be added)
Table of Contents
- Current Architecture Analysis
- Containerization Strategy
- Required Code Changes
- Kubernetes Resources
- Development Roadmap
- Configuration Migration
- Risk Assessment
- Testing Strategy
- Rollout Plan
- Appendix A: Hazelcast vs Alternatives Analysis
1. Current Architecture Analysis
1.1 Application Components
Backend API Server:
- Entry Point: tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java:258
- Server: Embedded Jetty 12.0.10
- Runtime: Java 17 (Amazon Corretto)
- Framework: JAX-RS with Jersey 3.1.6
- Context Path: /tlinq-api
- Ports: HTTP 11080, HTTPS 11079 (configurable)
Module Structure:
tqapi (REST endpoints - 8,749 LOC)
├── tqapp (Business logic - 296 Java files)
│ ├── tqcommon (Utilities, config, DB)
│ ├── tqamds (Amadeus API integration)
│ ├── tqodoo (Odoo ERP integration)
│ └── tqryb2b (Rayna B2B integration)
└── tqcommon
Frontend Applications:
- tqweb-adm (5.6MB) - Admin dashboard (HTML/JS/Foundation)
- tqweb-b2b (8.4MB) - B2B portal
- tqweb-pub (87MB) - Public website
1.2 External Dependencies
Database:
- PostgreSQL (currently localhost:5432)
- Hibernate ORM with SessionFactory
- Connection configured in config/tourlinq-config.xml
- No connection pooling (uses Hibernate defaults)
Distributed Cache:
- Hazelcast 4.2.4 in-memory grid (will be upgraded to 5.3.6)
- CRITICAL ISSUE: Hardcoded IP 172.16.55.1 in TlinqClusterCache.java:43
- Multicast port: 55478 (incompatible with K8s)
- Usage: 3 caches - shopping carts, user sessions, API roles
- Status: REQUIRED for multi-instance deployment
- Action Required: Configure Kubernetes discovery (see detailed migration plan)
External APIs:
- Amadeus (flights/hotels) - HTTPS
- Odoo ERP - XML-RPC
- Rayna B2B - HTTP REST
- Mail server - SMTP:587
- Twilio - HTTPS
File System:
- Document storage: /var/www/docimages (configurable in tlinqapi.properties:36)
- Logs: /var/log/tqpro/ (hardcoded in log.properties:19)
- Config: TLINQ_HOME environment variable
1.3 Configuration Management
Environment Variable Dependencies:
- TLINQ_HOME - CRITICAL: All config loaded from this path
  - Set in TQProApiServer.java:259
  - Used by ClientConfig.java:37 and AppConfig.java:21
Configuration Files (in config/):
1. tourlinq-config.xml (7.6KB) - Main config with DB credentials ⚠️
2. tourlinq.properties (2KB) - App properties with secrets ⚠️
3. tlinqapi.properties (1.7KB) - Server config with paths ⚠️
4. log.properties (622 bytes) - Logging with hardcoded path ⚠️
5. api-roles.properties (6.1KB) - RBAC configuration
6. amadeus-client.xml (3.5KB) - Amadeus service mappings
7. Entity files (15 XML files via XInclude)
Security Issues 🔴:
- Database passwords in plaintext (tourlinq-config.xml:50-55)
- Mail password: <REDACTED> (tourlinq.properties:9)
- Payment gateway key: <REDACTED> (tourlinq.properties:19)
- RaynaB2B JWT token (tourlinq.properties:32-33)
- Twilio credentials (tourlinq.properties:35-36)
- Odoo credentials in odoo-client.properties
1.4 State Management
Session Design: ✅ Stateless
- No server-side sessions (ServletContextHandler.NO_SESSIONS at TQProApiServer.java:218)
- Authentication via OAuth2-Proxy + Keycloak
- User identity in HTTP headers (X-User, X-Roles, X-Email, X-Name)
- Security context created per-request (AuthenticationFilter.java:100-105)
Implications:
- ✅ Perfect for horizontal scaling
- ✅ No session affinity required
- ✅ Can use any load balancing strategy
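The per-request identity model described above can be sketched in plain Java. This is only an illustration: the class name `HeaderIdentity` is hypothetical, and the comma-separated format of `X-Roles` is an assumption, since the source does not show how AuthenticationFilter.java parses the headers.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical illustration of a per-request identity built from proxy headers.
final class HeaderIdentity {
    final String user;
    final Set<String> roles;

    HeaderIdentity(String user, Set<String> roles) {
        this.user = user;
        this.roles = roles;
    }

    // Builds the identity from request headers alone. Nothing is read from or
    // written to server-side state, so any pod can serve the request.
    static HeaderIdentity fromHeaders(Map<String, String> headers) {
        String user = headers.get("X-User");
        if (user == null || user.isBlank()) {
            throw new IllegalArgumentException("X-User header is missing");
        }
        // Assumption: roles arrive as a comma-separated list.
        Set<String> roles = Arrays.stream(headers.getOrDefault("X-Roles", "").split(","))
                .map(String::trim)
                .filter(role -> !role.isEmpty())
                .collect(Collectors.toCollection(TreeSet::new));
        return new HeaderIdentity(user.trim(), roles);
    }

    public static void main(String[] args) {
        HeaderIdentity id = fromHeaders(Map.of("X-User", "alice", "X-Roles", "admin, agent"));
        System.out.println(id.user + " " + id.roles);
    }
}
```

Because the identity depends only on the headers of the current request, two consecutive requests from the same user can land on different pods without any coordination.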
1.5 Critical Gaps
Missing for Kubernetes:
1. ❌ No health check endpoints (/health, /ready, /live)
2. ❌ No graceful shutdown handling
3. ❌ Hardcoded secrets in config files
4. ❌ File-based logging (should use stdout)
5. ❌ Hazelcast not Kubernetes-aware
6. ❌ No connection pool configuration
7. ❌ No metrics endpoint (Prometheus)
2. Containerization Strategy
2.1 Docker Image Strategy
Multi-Stage Build Approach:
# ============================================
# Stage 1: Build
# ============================================
# Build with the repository root as the build context so that every COPY
# source is inside the context; Docker cannot copy from parent paths (../).
FROM gradle:8.10-jdk17-alpine AS builder
WORKDIR /build
# Copy gradle wrapper and build files
COPY gradlew gradlew.bat ./
COPY gradle gradle/
COPY settings.gradle.kts build.gradle.kts ./
# Copy all module build files first (for layer caching)
COPY tqcommon/build.gradle.kts tqcommon/
COPY tqapp/build.gradle.kts tqapp/
COPY tqapi/build.gradle.kts tqapi/
COPY tqamds/build.gradle.kts tqamds/
COPY tqodoo/build.gradle.kts tqodoo/
COPY tqryb2b/build.gradle.kts tqryb2b/
# Download dependencies (cached layer if no build file changes)
RUN gradle dependencies --no-daemon || true
# Copy source code
COPY tqcommon/src tqcommon/src/
COPY tqapp/src tqapp/src/
COPY tqapi/src tqapi/src/
COPY tqamds/src tqamds/src/
COPY tqodoo/src tqodoo/src/
COPY tqryb2b/src tqryb2b/src/
# Build application
RUN gradle clean build -x test --no-daemon
# Copy dependencies to known location
RUN gradle :tqapi:copyDependencies --no-daemon
# ============================================
# Stage 2: Runtime
# ============================================
FROM amazoncorretto:17-alpine
# Install curl for health checks
RUN apk add --no-cache curl
# Create app user (non-root)
RUN addgroup -g 1000 tqpro && \
    adduser -D -u 1000 -G tqpro tqpro
WORKDIR /app
# Copy JARs from builder
COPY --from=builder /build/tqapi/build/libs/tqapi.jar ./
COPY --from=builder /build/tqapi/build/libs/lib/*.jar ./lib/
COPY --from=builder /build/tqapp/build/libs/tqapp.jar ./lib/
COPY --from=builder /build/tqcommon/build/libs/tqcommon.jar ./lib/
COPY --from=builder /build/tqamds/build/libs/tqamds.jar ./lib/
COPY --from=builder /build/tqodoo/build/libs/tqodoo.jar ./lib/
COPY --from=builder /build/tqryb2b/build/libs/tqryb2b.jar ./lib/
# Create directories
RUN mkdir -p /app/config /app/documents /app/logs && \
    chown -R tqpro:tqpro /app
# Copy default config templates (will be overridden by ConfigMaps)
COPY --chown=tqpro:tqpro config /app/config/
# Switch to non-root user
USER tqpro
# Environment variables
ENV TLINQ_HOME=/app/config \
    JAVA_OPTS="-Xmx2g -Xms512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200" \
    TZ=UTC
# Expose ports
EXPOSE 11080 11079
# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
    CMD curl -f http://localhost:11080/tlinq-api/health || exit 1
# Start command ("exec" replaces the shell so the JVM runs as PID 1 and
# receives SIGTERM directly, which the graceful shutdown hook relies on)
ENTRYPOINT ["sh", "-c", "exec java $JAVA_OPTS -DTLINQ_HOME=$TLINQ_HOME -jar tqapi.jar"]
Image Size Optimization:
- Base: Alpine Linux (~40MB)
- Java 17 Runtime: ~150MB
- Application JARs: ~80MB
- Total Estimated Size: ~270MB
2.2 Web Frontend Images
Nginx Image for Static Content:
FROM nginx:1.25-alpine
# Copy static content
COPY tqweb-adm /usr/share/nginx/html/adm
COPY tqweb-b2b /usr/share/nginx/html/b2b
COPY tqweb-pub /usr/share/nginx/html/pub
# Copy nginx config (from ConfigMap in K8s)
COPY nginx.conf /etc/nginx/nginx.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
Size: ~109MB (about 101MB of static content across the three frontends: 5.6MB + 8.4MB + 87MB, plus 8MB nginx)
2.3 Image Naming Convention
registry.company.com/tqpro/api:1.0.0
registry.company.com/tqpro/api:1.0.0-sha-abc123
registry.company.com/tqpro/web:1.0.0
Tags:
- Semantic version: 1.0.0
- Git commit SHA: 1.0.0-sha-abc123 (for traceability)
- Environment: 1.0.0-dev, 1.0.0-staging, 1.0.0-prod
3. Required Code Changes
3.1 HIGH PRIORITY: Health Check Endpoints
New File: tqapi/src/main/java/com/perun/tlinq/api/HealthApi.java
package com.perun.tlinq.api;

import jakarta.ws.rs.*;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;
import com.perun.tlinq.util.TlinqDBSession;
import org.hibernate.Session;
import java.util.HashMap;
import java.util.Map;

@Path("/health")
public class HealthApi {

    // Liveness probe - is the process running?
    @GET
    @Path("/live")
    @Produces(MediaType.APPLICATION_JSON)
    public Response liveness() {
        Map<String, Object> status = new HashMap<>();
        status.put("status", "UP");
        status.put("timestamp", System.currentTimeMillis());
        return Response.ok(status).build();
    }

    // Readiness probe - can we serve traffic?
    @GET
    @Path("/ready")
    @Produces(MediaType.APPLICATION_JSON)
    public Response readiness() {
        Map<String, Object> status = new HashMap<>();
        boolean ready = true;

        // Check database connectivity. try-with-resources closes the session
        // even when the query throws, so a failing probe does not leak sessions.
        try (Session session = TlinqDBSession.getSession()) {
            session.createNativeQuery("SELECT 1").getSingleResult();
            status.put("database", "UP");
        } catch (Exception e) {
            status.put("database", "DOWN");
            status.put("error", e.getMessage());
            ready = false;
        }

        // Check Hazelcast (if enabled)
        try {
            if (System.getProperty("hazelcast.enabled", "true").equals("true")) {
                // Add Hazelcast health check
                status.put("cache", "UP");
            }
        } catch (Exception e) {
            status.put("cache", "DOWN");
            ready = false;
        }

        status.put("status", ready ? "UP" : "DOWN");
        status.put("timestamp", System.currentTimeMillis());
        return ready
            ? Response.ok(status).build()
            : Response.status(503).entity(status).build();
    }

    // Health endpoint - detailed health info
    @GET
    @Produces(MediaType.APPLICATION_JSON)
    public Response health() {
        return readiness();
    }
}
Integration: Register in TQProApiServer.java alongside other API classes.
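The readiness logic combines several dependency checks into one status and an HTTP code. A minimal, framework-free sketch of that aggregation follows; `ReadinessChecks` is a hypothetical helper, not part of the TQPro code base, and the failing cache check is simulated.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical helper that aggregates named dependency checks.
final class ReadinessChecks {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    // Registers a named dependency check, e.g. "database" or "cache".
    ReadinessChecks register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    // Runs every check; a check that returns false or throws counts as DOWN.
    Map<String, String> run() {
        Map<String, String> report = new LinkedHashMap<>();
        boolean allUp = true;
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            boolean up;
            try {
                up = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                up = false;
            }
            report.put(entry.getKey(), up ? "UP" : "DOWN");
            allUp = allUp && up;
        }
        report.put("status", allUp ? "UP" : "DOWN");
        return report;
    }

    // Maps the aggregate status to the HTTP code a readiness probe expects.
    static int httpStatus(Map<String, String> report) {
        return "UP".equals(report.get("status")) ? 200 : 503;
    }

    public static void main(String[] args) {
        Map<String, String> report = new ReadinessChecks()
                .register("database", () -> true)
                // Simulated failure standing in for a real cache check.
                .register("cache", () -> { throw new IllegalStateException("cluster not joined"); })
                .run();
        System.out.println(report + " -> HTTP " + httpStatus(report));
    }
}
```

Keeping each check behind a small functional interface makes it straightforward to add the Hazelcast check later without touching the endpoint itself.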
3.2 HIGH PRIORITY: Hazelcast Kubernetes Discovery
📋 DETAILED IMPLEMENTATION PLAN: See HAZELCAST_KUBERNETES_MIGRATION.md for the complete step-by-step implementation guide, including:
- Full code implementation with TTL and eviction policies
- Kubernetes RBAC configuration
- Testing procedures
- Monitoring and health checks
- Rollback procedures
Critical Finding - Multi-Instance Requirement:
After deep code analysis, Hazelcast is REQUIRED for multi-pod deployment because:
1. User sessions (Odoo integration) - cache-only, NO database fallback
2. Shopping carts (anonymous users) - memory-only, lost on pod switch
3. API roles - shared authorization cache
Why not Caffeine: Caffeine is a local, per-JVM cache, so it would break session and cart sharing across pods.
Modify: tqcommon/src/main/java/com/perun/tlinq/entity/cache/TlinqClusterCache.java
Summary of Required Changes:
// 1. Upgrade dependency (build.gradle.kts)
// Hazelcast 5.x ships the Kubernetes discovery plugin inside the core
// artifact, so the separate hazelcast-kubernetes dependency is not needed.
implementation("com.hazelcast:hazelcast:5.3.6")

// 2. Add deployment mode detection
String deploymentMode = System.getenv("DEPLOYMENT_MODE");
if ("kubernetes".equals(deploymentMode)) {
    // Kubernetes deployment: turn off multicast and static TCP/IP joins
    network.getJoin().getMulticastConfig().setEnabled(false);
    network.getJoin().getTcpIpConfig().setEnabled(false);
    // Enable Kubernetes discovery
    network.getJoin().getKubernetesConfig()
        .setEnabled(true)
        .setProperty("namespace", System.getenv("K8S_NAMESPACE"))
        .setProperty("service-name", "tqpro-hazelcast")
        .setProperty("service-port", "5701");
    log.info("Hazelcast configured for Kubernetes discovery");
} else {
    // Bare-metal deployment (existing logic)
    network.setPort(55478);
    network.getJoin().getMulticastConfig()
        .setEnabled(true)
        .addTrustedInterface(
            System.getProperty("hazelcast.interface", "172.16.55.1")
        );
    log.info("Hazelcast configured for multicast discovery");
}

// 3. Configure TTL and eviction policies (prevents memory leaks)
MapConfig cartsConfig = new MapConfig("cartsCache");
cartsConfig.setTimeToLiveSeconds(1800); // 30 min TTL
cartsConfig.setEvictionConfig(new EvictionConfig()
    .setSize(10000)
    .setMaxSizePolicy(MaxSizePolicy.PER_NODE)
    .setEvictionPolicy(EvictionPolicy.LRU));
cfg.addMapConfig(cartsConfig);
// Similar for userSessions (60 min TTL) and apiRolesCache
Required Kubernetes Resources:
- ServiceAccount with RBAC permissions (see k8s/hazelcast-rbac.yaml)
- Headless Service for discovery (see k8s/hazelcast-service.yaml)
- Environment variables: DEPLOYMENT_MODE=kubernetes, K8S_NAMESPACE, HAZELCAST_SERVICE
Effort: 3-5 days (includes testing and validation)
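The mode detection can be isolated from the Hazelcast API so that it is easy to unit test. The sketch below is an assumption about how that might look: `DiscoverySettings` is a hypothetical class, and reading the service name from `HAZELCAST_SERVICE` with `tqpro-hazelcast` as the default is one possible interpretation of the environment variables listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: derives Kubernetes discovery properties from the environment.
final class DiscoverySettings {

    // Returns the properties for Kubernetes discovery, or an empty map when
    // the process is not running in Kubernetes mode (bare-metal fallback).
    static Map<String, String> kubernetesProperties(Map<String, String> env) {
        Map<String, String> props = new LinkedHashMap<>();
        if (!"kubernetes".equals(env.get("DEPLOYMENT_MODE"))) {
            return props;
        }
        String namespace = env.get("K8S_NAMESPACE");
        if (namespace == null || namespace.isBlank()) {
            // Fail fast instead of letting discovery start with a null namespace.
            throw new IllegalStateException("K8S_NAMESPACE must be set in kubernetes mode");
        }
        props.put("namespace", namespace);
        // Assumption: HAZELCAST_SERVICE overrides the default service name.
        props.put("service-name", env.getOrDefault("HAZELCAST_SERVICE", "tqpro-hazelcast"));
        props.put("service-port", "5701");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(kubernetesProperties(System.getenv()));
    }
}
```

Each returned entry could then be passed to the join configuration with `setProperty`, keeping the environment handling separate from cluster setup.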
3.3 HIGH PRIORITY: Externalize Database Configuration
Modify: tqcommon/src/main/java/com/perun/tlinq/util/TlinqDBSession.java
Current (line 24):
New:
// Priority: Environment variable > Config file > Default
String dbHost = System.getenv("DB_HOST");
String dbPort = System.getenv("DB_PORT");
String dbName = System.getenv("DB_NAME");
String dbUser = System.getenv("DB_USER");
String dbPassword = System.getenv("DB_PASSWORD");

// Fallback to config file if env vars not set (bare-metal mode)
if (dbHost == null) {
    String configDbName = AppConfig.instance().getProperty("tlinq.dbname");
    // Use existing logic with configured database
} else {
    // Build connection from environment variables
    String jdbcUrl = String.format(
        "jdbc:postgresql://%s:%s/%s",
        dbHost,
        dbPort != null ? dbPort : "5432",
        dbName
    );
    Configuration configuration = new Configuration();
    configuration.setProperty("hibernate.connection.url", jdbcUrl);
    configuration.setProperty("hibernate.connection.username", dbUser);
    configuration.setProperty("hibernate.connection.password", dbPassword);
    // Add connection pooling (HikariCP)
    configuration.setProperty("hibernate.connection.provider_class",
        "org.hibernate.hikaricp.internal.HikariCPConnectionProvider");
    configuration.setProperty("hibernate.hikari.minimumIdle", "5");
    configuration.setProperty("hibernate.hikari.maximumPoolSize", "20");
    configuration.setProperty("hibernate.hikari.idleTimeout", "300000");
}
Dependencies: Add HikariCP:
implementation("org.hibernate.orm:hibernate-hikaricp:6.5.1.Final")
implementation("com.zaxxer:HikariCP:5.0.1")
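The precedence rule (environment variable, then config file, then default) can be written as a small pure function. The sketch below is illustrative only: `DbSettings` is a hypothetical class, and the file keys `db.host` and `db.port` are invented for the example; only `tlinq.dbname` appears in the source.

```java
import java.util.Map;
import java.util.Properties;

// Hypothetical helper illustrating "env var > config file > default".
final class DbSettings {

    // Returns the first non-blank value among env var, file property, default.
    static String resolve(Map<String, String> env, Properties file,
                          String envKey, String fileKey, String defaultValue) {
        String value = env.get(envKey);
        if (value != null && !value.isBlank()) {
            return value;
        }
        value = file.getProperty(fileKey);
        if (value != null && !value.isBlank()) {
            return value;
        }
        return defaultValue;
    }

    // Builds the JDBC URL; fails fast when no database name is configured.
    static String jdbcUrl(Map<String, String> env, Properties file) {
        // "db.host" and "db.port" are assumed file keys, not taken from the source.
        String host = resolve(env, file, "DB_HOST", "db.host", "localhost");
        String port = resolve(env, file, "DB_PORT", "db.port", "5432");
        String name = resolve(env, file, "DB_NAME", "tlinq.dbname", null);
        if (name == null) {
            throw new IllegalStateException("No database name configured");
        }
        return "jdbc:postgresql://" + host + ":" + port + "/" + name;
    }

    public static void main(String[] args) {
        Properties file = new Properties();
        file.setProperty("tlinq.dbname", "tlinq_local");
        System.out.println(jdbcUrl(System.getenv(), file));
    }
}
```

Passing the environment in as a map, rather than calling `System.getenv` inside the function, keeps the fallback logic testable on a developer machine.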
3.4 MEDIUM PRIORITY: Graceful Shutdown
Modify: tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java
Add shutdown hook (after line 235 where server starts):
// Add graceful shutdown hook
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    log.info("Shutdown signal received, stopping server gracefully...");
    try {
        // Stop accepting new requests
        server.stop();
        // Close Hazelcast
        if (TlinqClusterCache.getInstance() != null) {
            TlinqClusterCache.getInstance().shutdown();
        }
        // Close database connections
        TlinqDBSession.close();
        log.info("Server stopped gracefully");
    } catch (Exception e) {
        log.error("Error during shutdown", e);
    }
}));
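In the hook above, all three steps share a single try block, so an exception while stopping the server would skip the Hazelcast and database cleanup. One way to avoid that, sketched here with a hypothetical `ShutdownSequence` helper, is to run each step independently and collect the failures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: runs shutdown steps in order, isolating failures.
final class ShutdownSequence {

    // A shutdown step that is allowed to throw.
    interface Step {
        void run() throws Exception;
    }

    private final List<String> names = new ArrayList<>();
    private final List<Step> steps = new ArrayList<>();

    ShutdownSequence add(String name, Step step) {
        names.add(name);
        steps.add(step);
        return this;
    }

    // Runs all steps in registration order. A failing step is recorded and
    // the remaining steps still run. Returns the names of the failed steps.
    List<String> runAll() {
        List<String> failed = new ArrayList<>();
        for (int i = 0; i < steps.size(); i++) {
            try {
                steps.get(i).run();
            } catch (Exception e) {
                failed.add(names.get(i));
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<String> failed = new ShutdownSequence()
                .add("jetty", () -> System.out.println("server stopped"))
                .add("hazelcast", () -> { throw new IllegalStateException("already shut down"); })
                .add("database", () -> System.out.println("connections closed"))
                .runAll();
        System.out.println("failed steps: " + failed);
    }
}
```

Inside the real hook, the three lambdas would wrap `server.stop()`, the cache shutdown, and the database close call.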
3.5 MEDIUM PRIORITY: Logging to Stdout
Modify: config/log.properties
Current:
com.perun.tlinq.handlers=java.util.logging.ConsoleHandler, java.util.logging.FileHandler
java.util.logging.FileHandler.pattern=/var/log/tqpro/tlinqserver-%g-%u.log
For Kubernetes:
# Kubernetes mode - stdout only
com.perun.tlinq.handlers=java.util.logging.ConsoleHandler
com.perun.tlinq.level=INFO
java.util.logging.ConsoleHandler.level=INFO
java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
java.util.logging.SimpleFormatter.format=%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n
Make it configurable:
Use environment variable LOG_MODE=kubernetes or LOG_MODE=baremetal to switch.
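A minimal sketch of the LOG_MODE switch, using only java.util.logging, might look like the following. `LogModeConfig` is a hypothetical class, and treating every value other than `kubernetes` as bare-metal is an assumption. Note that `ConsoleHandler` writes to standard error rather than standard output; container runtimes collect both streams.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.LogManager;

// Hypothetical helper: selects logging handlers from LOG_MODE.
final class LogModeConfig {

    // Console only in Kubernetes; console plus file handler otherwise.
    static String handlersFor(String logMode) {
        String console = "java.util.logging.ConsoleHandler";
        if ("kubernetes".equalsIgnoreCase(logMode)) {
            return console;
        }
        return console + ", java.util.logging.FileHandler";
    }

    // Builds a small properties document and hands it to the LogManager.
    static void apply(String logMode) throws IOException {
        String props = "com.perun.tlinq.handlers=" + handlersFor(logMode) + "\n"
                + "com.perun.tlinq.level=INFO\n";
        LogManager.getLogManager().readConfiguration(
                new ByteArrayInputStream(props.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws IOException {
        apply("kubernetes");
        System.out.println("handlers: " + handlersFor("kubernetes"));
    }
}
```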
3.6 LOW PRIORITY: Metrics Endpoint
New File: tqapi/src/main/java/com/perun/tlinq/api/MetricsApi.java
package com.perun.tlinq.api;

import jakarta.ws.rs.*;
import jakarta.ws.rs.core.MediaType;
import jakarta.ws.rs.core.Response;

@Path("/metrics")
public class MetricsApi {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public Response metrics() {
        StringBuilder sb = new StringBuilder();
        // JVM metrics
        Runtime runtime = Runtime.getRuntime();
        sb.append("# HELP jvm_memory_used_bytes Used memory in bytes\n");
        sb.append("# TYPE jvm_memory_used_bytes gauge\n");
        sb.append("jvm_memory_used_bytes ")
          .append(runtime.totalMemory() - runtime.freeMemory())
          .append("\n");
        sb.append("# HELP jvm_memory_max_bytes Max memory in bytes\n");
        sb.append("# TYPE jvm_memory_max_bytes gauge\n");
        sb.append("jvm_memory_max_bytes ")
          .append(runtime.maxMemory())
          .append("\n");
        // Add more metrics as needed
        return Response.ok(sb.toString()).build();
    }
}
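The "add more metrics as needed" step could draw on the standard `java.lang.management` beans. The sketch below is one possible extension, not the planned implementation; the class `JvmMetricsText` and the metric names are chosen for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: renders a few JVM gauges in the Prometheus text format.
final class JvmMetricsText {

    // Formats one gauge as HELP, TYPE and sample lines.
    static String gauge(String name, String help, long value) {
        return "# HELP " + name + " " + help + "\n"
                + "# TYPE " + name + " gauge\n"
                + name + " " + value + "\n";
    }

    // Renders heap usage and the live thread count from the platform MXBeans.
    static String render() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        return gauge("jvm_heap_used_bytes", "Used heap memory in bytes", heap.getUsed())
                + gauge("jvm_heap_committed_bytes", "Committed heap memory in bytes", heap.getCommitted())
                + gauge("jvm_threads_current", "Current live thread count", threads);
    }

    public static void main(String[] args) {
        System.out.print(render());
    }
}
```

The string returned by `render()` could be appended to the response body built in `MetricsApi.metrics()`.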
3.7 Summary of Code Changes
| Priority | Component | File | Change Type | Effort |
|---|---|---|---|---|
| HIGH | Health Checks | api/HealthApi.java | New file | 4 hours |
| HIGH | Hazelcast | TlinqClusterCache.java | Modify | 6 hours |
| HIGH | Database Config | TlinqDBSession.java | Modify | 8 hours |
| MEDIUM | Graceful Shutdown | TQProApiServer.java | Add code | 4 hours |
| MEDIUM | Logging | log.properties | Modify | 2 hours |
| LOW | Metrics | api/MetricsApi.java | New file | 4 hours |
| LOW | Config Externalization | Multiple files | Modify | 8 hours |
Total Development Effort: ~36 hours (1 week)
4. Kubernetes Resources
4.1 Namespace
4.2 ConfigMaps
4.2.1 Application Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: tqpro-config
  namespace: tqpro
data:
  # Main application properties (secrets removed)
  tourlinq.properties: |
    # Mail configuration
    mail.host=smtp.gmail.com
    mail.port=587
    mail.from=noreply@peruntours.com
    mail.username=noreply@peruntours.com
    # mail.password - injected from Secret
    # Content location (mounted PVC)
    content.location=/app/documents
    # Database (injected from environment)
    # tlinq.dbname, tlinq.dbpass - from env vars
    # Feature flags
    tripmaker.enabled=true
  # API Server configuration
  tlinqapi.properties: |
    http-port=11080
    https-port=11079
    # SSL (if terminating in pod)
    keystore-path=/app/config/perun.jks
    # Document location (PVC mount)
    content.location=/app/documents
    # Auth server (internal K8s service)
    auth.server=http://oauth2-proxy:4180
    auth.validate-url=http://oauth2-proxy:4180/oauth2/auth
    # Development mode
    dev-mode=false
    dev-mode.bypass-auth=false
  # Logging configuration (stdout for K8s)
  log.properties: |
    com.perun.tlinq.handlers=java.util.logging.ConsoleHandler
    com.perun.tlinq.level=INFO
    java.util.logging.ConsoleHandler.level=INFO
    java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
    java.util.logging.SimpleFormatter.format=%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS %4$-6s %2$s %5$s%6$s%n
  # API roles (RBAC)
  api-roles.properties: |
    # Copy entire content from config/api-roles.properties
    # ... (6KB of role mappings)
4.2.2 XML Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: tqpro-xml-config
  namespace: tqpro
data:
  # tourlinq-config.xml (database credentials removed)
  tourlinq-config.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <PluginConfig xmlns:xi="http://www.w3.org/2001/XInclude">
      <!-- Database configs with env var placeholders -->
      <Database name="tlinq">
        <url>jdbc:postgresql://${DB_HOST}:${DB_PORT}/${DB_NAME}</url>
        <username>${DB_USER}</username>
        <password>${DB_PASSWORD}</password>
      </Database>
      <!-- Plugins, Services, Entities -->
      <Entities>
        <xi:include href="entities/flight-entities.xml" xpointer="xpointer(/Entities/*)"/>
        <!-- ... other includes -->
      </Entities>
    </PluginConfig>
  # amadeus-client.xml
  amadeus-client.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <!-- Amadeus service configuration -->
  # Entity XML files (15 files)
  flight-entities.xml: |
    # ... content
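Kubernetes mounts ConfigMap data verbatim and does not expand `${...}` placeholders inside file content, so the application would have to substitute them when it loads tourlinq-config.xml. A minimal sketch of such a substitution step follows; `EnvPlaceholders` is a hypothetical helper, and failing on a missing variable is a design assumption rather than documented behavior.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: expands ${NAME} placeholders from an environment map.
final class EnvPlaceholders {
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    // Replaces every ${NAME} with env.get(NAME); throws if NAME is not set.
    static String expand(String text, Map<String, String> env) {
        Matcher matcher = PLACEHOLDER.matcher(text);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = env.get(matcher.group(1));
            if (value == null) {
                throw new IllegalStateException(
                        "Missing environment variable: " + matcher.group(1));
            }
            // quoteReplacement keeps '$' and '\' in values (e.g. passwords) literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String url = "jdbc:postgresql://${DB_HOST}:${DB_PORT}/${DB_NAME}";
        System.out.println(expand(url,
                Map.of("DB_HOST", "db", "DB_PORT", "5432", "DB_NAME", "tlinq")));
    }
}
```

When the expanded text is XML, values containing `&` or `<` would also need XML escaping before the document is parsed; this sketch does not do that.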
4.2.3 Nginx Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: tqpro
data:
  nginx.conf: |
    events {
      worker_connections 1024;
    }
    http {
      include /etc/nginx/mime.types;
      default_type application/octet-stream;
      # Logging
      access_log /dev/stdout;
      error_log /dev/stderr;
      # Compression
      gzip on;
      gzip_types text/plain text/css application/json application/javascript;
      upstream api_backend {
        server tqpro-api-service:11080;
      }
      server {
        listen 80;
        server_name _;
        # Admin app
        location /adm {
          alias /usr/share/nginx/html/adm;
          index index.html;
          try_files $uri $uri/ /adm/index.html;
        }
        # B2B app
        location /b2b {
          alias /usr/share/nginx/html/b2b;
          index index.html;
        }
        # Public site
        location / {
          root /usr/share/nginx/html/pub;
          index index.html;
        }
        # API proxy
        location /tlinq-api {
          proxy_pass http://api_backend;
          proxy_set_header Host $host;
          proxy_set_header X-Real-IP $remote_addr;
          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
          proxy_set_header X-Forwarded-Proto $scheme;
        }
      }
    }
4.3 Secrets
apiVersion: v1
kind: Secret
metadata:
  name: tqpro-db-credentials
  namespace: tqpro
type: Opaque
stringData:
  username: tlinq_user
  password: <REDACTED>
  host: postgres-service.database.svc.cluster.local
  port: "5432"
  database: tlinq
---
apiVersion: v1
kind: Secret
metadata:
  name: tqpro-api-keys
  namespace: tqpro
type: Opaque
stringData:
  # Amadeus
  amadeus.client.key: <REDACTED>
  amadeus.client.secret: <REDACTED>
  # Mail
  mail.password: <REDACTED>
  # Payment gateway
  telr.auth.key: <REDACTED>
  telr.merchant.id: <REDACTED>
  # RaynaB2B
  rayna.jwt.token: <REDACTED>
  # Twilio
  twilio.sid: <REDACTED>
  twilio.token: <REDACTED>
  # Odoo
  odoo.password: <REDACTED>
---
apiVersion: v1
kind: Secret
metadata:
  name: tqpro-ssl-keystore
  namespace: tqpro
type: Opaque
data:
  perun.jks: <BASE64_ENCODED_KEYSTORE>
  keystore.password: <BASE64_ENCODED_PASSWORD>
4.4 PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tqpro-documents
  namespace: tqpro
spec:
  accessModes:
    - ReadWriteMany # Multiple pods can write
  storageClassName: efs-sc # AWS EFS / Azure Files / NFS
  resources:
    requests:
      storage: 50Gi
4.5 Deployment - API Server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tqpro-api
  namespace: tqpro
  labels:
    app: tqpro-api
    version: v1
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: tqpro-api
  template:
    metadata:
      labels:
        app: tqpro-api
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "11080"
        prometheus.io/path: "/tlinq-api/metrics"
    spec:
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      # Init container to validate config
      initContainers:
        - name: config-validator
          image: registry.company.com/tqpro/api:1.0.0
          command: ['sh', '-c', 'ls -la /app/config && echo Config mounted successfully']
          volumeMounts:
            - name: config
              mountPath: /app/config
              readOnly: true
      containers:
        - name: api
          image: registry.company.com/tqpro/api:1.0.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 11080
              protocol: TCP
            - name: https
              containerPort: 11079
              protocol: TCP
          # Environment variables
          env:
            - name: DEPLOYMENT_MODE
              value: "kubernetes"
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            # Database configuration
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: tqpro-db-credentials
                  key: host
            - name: DB_PORT
              valueFrom:
                secretKeyRef:
                  name: tqpro-db-credentials
                  key: port
            - name: DB_NAME
              valueFrom:
                secretKeyRef:
                  name: tqpro-db-credentials
                  key: database
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: tqpro-db-credentials
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: tqpro-db-credentials
                  key: password
            # API Keys
            - name: AMADEUS_CLIENT_KEY
              valueFrom:
                secretKeyRef:
                  name: tqpro-api-keys
                  key: amadeus.client.key
            - name: AMADEUS_CLIENT_SECRET
              valueFrom:
                secretKeyRef:
                  name: tqpro-api-keys
                  key: amadeus.client.secret
            - name: MAIL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: tqpro-api-keys
                  key: mail.password
            # JVM options
            - name: JAVA_OPTS
              value: "-Xmx2g -Xms512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/logs"
          # Resource limits
          resources:
            requests:
              cpu: 1000m
              memory: 2Gi
            limits:
              cpu: 2000m
              memory: 4Gi
          # Health checks
          livenessProbe:
            httpGet:
              path: /tlinq-api/health/live
              port: 11080
            initialDelaySeconds: 90
            periodSeconds: 30
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /tlinq-api/health/ready
              port: 11080
            initialDelaySeconds: 60
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          # Startup probe (for slow starts)
          startupProbe:
            httpGet:
              path: /tlinq-api/health/live
              port: 11080
            initialDelaySeconds: 0
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 30 # 5 minutes max
          # Volume mounts
          volumeMounts:
            - name: config
              mountPath: /app/config
              readOnly: true
            - name: documents
              mountPath: /app/documents
            - name: logs
              mountPath: /app/logs
      # Volumes
      volumes:
        - name: config
          projected:
            sources:
              - configMap:
                  name: tqpro-config
              - configMap:
                  name: tqpro-xml-config
        - name: documents
          persistentVolumeClaim:
            claimName: tqpro-documents
        - name: logs
          emptyDir: {}
      # Pod affinity - spread across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - tqpro-api
                topologyKey: kubernetes.io/hostname
4.6 Deployment - Web Frontend
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tqpro-web
  namespace: tqpro
spec:
  replicas: 2
  selector:
    matchLabels:
      app: tqpro-web
  template:
    metadata:
      labels:
        app: tqpro-web
    spec:
      containers:
        - name: nginx
          image: registry.company.com/tqpro/web:1.0.0
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          volumeMounts:
            - name: nginx-config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
          livenessProbe:
            httpGet:
              path: /adm/index.html
              port: 80
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /adm/index.html
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
      volumes:
        - name: nginx-config
          configMap:
            name: nginx-config
4.7 Services
# API Service
apiVersion: v1
kind: Service
metadata:
  name: tqpro-api-service
  namespace: tqpro
  labels:
    app: tqpro-api
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 11080
      targetPort: 11080
      protocol: TCP
  selector:
    app: tqpro-api
  sessionAffinity: None # Stateless, no affinity needed
---
# Web Service
apiVersion: v1
kind: Service
metadata:
  name: tqpro-web-service
  namespace: tqpro
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 80
  selector:
    app: tqpro-web
---
# Hazelcast headless service (for discovery)
apiVersion: v1
kind: Service
metadata:
  name: tqpro-hazelcast
  namespace: tqpro
spec:
  type: ClusterIP
  clusterIP: None # Headless service
  ports:
    - name: hazelcast
      port: 5701
  selector:
    app: tqpro-api
4.8 Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tqpro-ingress
  namespace: tqpro
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - tqpro.company.com
      secretName: tqpro-tls
  rules:
    - host: tqpro.company.com
      http:
        paths:
          # API endpoints
          - path: /tlinq-api
            pathType: Prefix
            backend:
              service:
                name: tqpro-api-service
                port:
                  number: 11080
          # Web frontend
          - path: /
            pathType: Prefix
            backend:
              service:
                name: tqpro-web-service
                port:
                  number: 80
4.9 HorizontalPodAutoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tqpro-api-hpa
  namespace: tqpro
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tqpro-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
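To get a feel for how these targets translate into replica counts, the autoscaler's core rule can be sketched as desired = ceil(current × currentUtilization / targetUtilization), taking the larger result when several metrics are configured. The Java below is an estimate for planning purposes only: `HpaEstimate` is a hypothetical class, the 0.1 tolerance is the Kubernetes default, and the `behavior` rate limits and stabilization windows from the manifest are deliberately ignored.

```java
// Hypothetical planning aid: estimates the replica count the HPA would target.
final class HpaEstimate {
    // Ratio band around 1.0 in which no scaling happens (Kubernetes default).
    static final double TOLERANCE = 0.1;

    // Replica count suggested by a single utilization metric.
    static int forMetric(int currentReplicas, double currentUtilization, double targetUtilization) {
        double ratio = currentUtilization / targetUtilization;
        if (Math.abs(ratio - 1.0) <= TOLERANCE) {
            return currentReplicas;
        }
        return (int) Math.ceil(currentReplicas * ratio);
    }

    // Combines CPU (target 70%) and memory (target 80%) as in the manifest,
    // takes the larger proposal, then clamps to minReplicas=3 / maxReplicas=10.
    static int desiredReplicas(int currentReplicas, double cpuUtilization, double memoryUtilization) {
        int byCpu = forMetric(currentReplicas, cpuUtilization, 70.0);
        int byMemory = forMetric(currentReplicas, memoryUtilization, 80.0);
        int desired = Math.max(byCpu, byMemory);
        return Math.max(3, Math.min(10, desired));
    }

    public static void main(String[] args) {
        // 3 replicas at 105% average CPU and 40% memory utilization.
        System.out.println(desiredReplicas(3, 105.0, 40.0));
    }
}
```

In a real cluster the scale-up policy above would additionally cap growth at 50% of the current replicas per 60 seconds, so a large jump is reached over several evaluation periods.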
4.10 PodDisruptionBudget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: tqpro-api-pdb
  namespace: tqpro
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: tqpro-api
5. Development Roadmap
Phase 1: Foundation (Week 1-2)
Objectives:
- Setup development environment
- Create initial Docker images
- Test containerization locally
Tasks:
1. Day 1-2: Docker Setup
   - Create Dockerfile for API server
   - Create Dockerfile for web frontend
   - Test local builds
   - Optimize image sizes
2. Day 3-4: Code Changes - Health Checks
   - Implement HealthApi.java
   - Add liveness, readiness, startup endpoints
   - Test health endpoints locally
   - Update API registration
3. Day 5-6: Code Changes - Database
   - Modify TlinqDBSession.java for env vars
   - Add HikariCP connection pooling
   - Test with local PostgreSQL
   - Validate connection fallback logic
4. Day 7-8: Code Changes - Hazelcast
   - Modify TlinqClusterCache.java
   - Add Kubernetes discovery support
   - Add bare-metal fallback
   - Test multicast still works locally
5. Day 9-10: Testing & Documentation
   - End-to-end local testing
   - Document all changes
   - Create migration guide
   - Code review
Deliverables:
- ✅ Working Docker images
- ✅ Health check endpoints functional
- ✅ Database configuration externalized
- ✅ Hazelcast dual-mode support
- ✅ Local testing passed
Phase 2: Configuration & Secrets (Week 3)
Objectives:
- Externalize all configuration
- Migrate secrets to Kubernetes Secrets
- Create ConfigMaps
Tasks:
1. Day 1-2: ConfigMap Creation
   - Create tqpro-config ConfigMap
   - Create tqpro-xml-config ConfigMap
   - Create nginx-config ConfigMap
   - Validate XML parsing
2. Day 3-4: Secrets Migration
   - Identify all secrets (20+ credentials)
   - Create Secret manifests
   - Create secret rotation plan
   - Document secret access patterns
3. Day 5: Environment Variable Injection
   - Update code to read secrets from env vars
   - Test secret injection
   - Validate fallback to config files
4. Day 6-7: Testing
   - Test with Secrets mounted
   - Test with ConfigMaps mounted
   - Validate environment variable precedence
   - Test bare-metal compatibility
Deliverables:
- ✅ All ConfigMaps created
- ✅ All Secrets identified and documented
- ✅ Code reads from environment variables
- ✅ Bare-metal mode still works
Phase 3: Kubernetes Deployment (Week 4)
Objectives:
- Deploy to development K8s cluster
- Configure storage, networking
- Validate functionality
Tasks:
1. Day 1: Cluster Setup
   - Create namespace
   - Apply RBAC policies
   - Setup StorageClass (EFS/Azure Files)
   - Create PersistentVolumeClaim
2. Day 2: Deploy Dependencies
   - Deploy PostgreSQL (or configure external)
   - Create database schema
   - Deploy Hazelcast StatefulSet (optional)
   - Test database connectivity
3. Day 3: Deploy Application
   - Apply ConfigMaps
   - Apply Secrets
   - Deploy API server (1 replica initially)
   - Check logs for errors
4. Day 4: Networking
   - Create Services
   - Configure Ingress
   - Setup TLS certificates
   - Test external access
5. Day 5: Scale & Test
   - Scale to 3 replicas
   - Test Hazelcast clustering
   - Test file uploads (shared storage)
   - Load testing
Deliverables:
- ✅ Application running in K8s
- ✅ 3 replicas healthy
- ✅ Ingress accessible
- ✅ All health checks passing
- ✅ Hazelcast cluster formed
Phase 4: Observability & Monitoring (Week 5)
Objectives:
- Setup logging aggregation
- Configure monitoring
- Add alerts
Tasks:
1. Day 1-2: Logging
   - Deploy EFK stack (Elasticsearch, Fluentd, Kibana)
   - Configure log shipping
   - Create log dashboards
   - Test log queries
2. Day 3-4: Monitoring
   - Deploy Prometheus
   - Configure ServiceMonitor
   - Deploy Grafana
   - Create dashboards (CPU, memory, requests)
3. Day 5: Alerting
   - Configure AlertManager
   - Create alert rules (pod down, high CPU, DB errors)
   - Test alert routing
   - Document runbooks
Deliverables:
- ✅ Centralized logging
- ✅ Prometheus metrics collection
- ✅ Grafana dashboards
- ✅ Alerts configured
Phase 5: Production Hardening (Week 6)
Objectives:
- Security hardening
- Performance optimization
- Disaster recovery
Tasks:
1. Day 1-2: Security
   - Network policies (restrict pod-to-pod)
   - Security context (non-root user)
   - Image scanning (Trivy/Aqua)
   - Secret rotation automation
2. Day 3: Performance
   - JVM tuning
   - Connection pool optimization
   - Cache configuration tuning
   - Load testing (JMeter)
3. Day 4: High Availability
   - PodDisruptionBudget
   - HorizontalPodAutoscaler
   - Multi-zone deployment
   - Database failover testing
4. Day 5: Disaster Recovery
   - Backup automation (Velero)
   - Database backup/restore
   - Disaster recovery plan
   - Runbook documentation
Deliverables:
- ✅ Security scan passed
- ✅ Performance benchmarks met
- ✅ HA configuration tested
- ✅ Backup/restore validated
Phase 6: Production Deployment (Week 7-8)¶
Objectives:
- Deploy to staging
- User acceptance testing
- Deploy to production
Tasks:
1. Week 7: Staging Deployment
   - Deploy to staging cluster
   - Run smoke tests
   - User acceptance testing
   - Performance testing
   - Security audit
2. Week 8: Production Deployment
   - Blue-green deployment setup
   - Deploy to production (canary rollout)
   - Monitor metrics closely
   - Gradual traffic shift (10% → 50% → 100%)
   - Rollback plan ready
Deliverables:
- ✅ Staging validated
- ✅ Production deployed
- ✅ Zero downtime migration
- ✅ Rollback tested
6. Configuration Migration¶
6.1 Configuration Files Mapping¶
| File | Bare-Metal Location | Kubernetes Resource | Notes |
|---|---|---|---|
| tourlinq-config.xml | $TLINQ_HOME/tourlinq-config.xml | ConfigMap tqpro-xml-config | Remove DB credentials |
| tourlinq.properties | $TLINQ_HOME/tourlinq.properties | ConfigMap tqpro-config | Remove secrets |
| tlinqapi.properties | $TLINQ_HOME/tlinqapi.properties | ConfigMap tqpro-config | Update paths |
| log.properties | $TLINQ_HOME/log.properties | ConfigMap tqpro-config | Stdout only |
| api-roles.properties | $TLINQ_HOME/api-roles.properties | ConfigMap tqpro-config | No changes |
| amadeus-client.xml | $TLINQ_HOME/amadeus-client.xml | ConfigMap tqpro-xml-config | No changes |
| Entity XMLs (15 files) | $TLINQ_HOME/entities/*.xml | ConfigMap tqpro-xml-config | No changes |
| amadeus.idfile | Hardcoded path | Secret tqpro-api-keys | Two-line CSV |
| perun.jks | Hardcoded path | Secret tqpro-ssl-keystore | Binary file |
6.2 Environment Variables Strategy¶
Deployment Mode:
# Kubernetes deployment
DEPLOYMENT_MODE=kubernetes
# Bare-metal deployment
DEPLOYMENT_MODE=baremetal
Database Configuration:
DB_HOST=postgres-service.database.svc.cluster.local
DB_PORT=5432
DB_NAME=tlinq
DB_USER=tlinq_user
DB_PASSWORD=<from-secret>
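The environment-first lookup implied by these variables could be sketched as follows. This is a minimal illustration, not the existing TlinqDBSession code: the class `DbSettings`, its `resolve` method, and the `fileConf` map (a stand-in for values already parsed from the bare-metal config file) are hypothetical names introduced here. `DB_PASSWORD` would be resolved the same way and is left out so the sketch never handles a credential.

```java
import java.util.Map;

/** Hypothetical helper (not part of TQPro): resolves DB settings, environment first. */
final class DbSettings {
    final String host;
    final int port;
    final String name;

    private DbSettings(String host, int port, String name) {
        this.host = host;
        this.port = port;
        this.name = name;
    }

    /**
     * @param env      environment variables (pass System.getenv() in the real server)
     * @param fileConf stand-in for values already parsed from the config file
     */
    static DbSettings resolve(Map<String, String> env, Map<String, String> fileConf) {
        return new DbSettings(
                lookup(env, fileConf, "DB_HOST"),
                Integer.parseInt(lookup(env, fileConf, "DB_PORT")),
                lookup(env, fileConf, "DB_NAME"));
    }

    private static String lookup(Map<String, String> env, Map<String, String> fileConf, String key) {
        String value = env.get(key);
        if (value == null || value.isBlank()) {
            value = fileConf.get(key); // fall back to the bare-metal config value
        }
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Missing database setting: " + key);
        }
        return value;
    }

    /** Builds a PostgreSQL JDBC URL from the resolved settings. */
    String jdbcUrl() {
        return "jdbc:postgresql://" + host + ":" + port + "/" + name;
    }
}
```

Failing fast on a missing setting means a misconfigured pod stops at startup instead of failing on its first query.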
Hazelcast Configuration:
Application Configuration:
6.3 Secrets Extraction¶
From tourlinq.properties:
# Extract to Secret
mail.password=KP8ZH8zwKeQ0
telr.auth-key=JLM6^MpHfH@K8MpP
telr.merchant-id=21401
rayna.jwt.token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9...
twilio.sid=ACb7e13c05a35d56650e4a8c528226fa13
twilio.token=d7b7cf33e9eb90177e5c4f8c58d0c065
From tourlinq-config.xml:
<!-- Extract to Secret -->
<Database name="tlinq">
<username>TlinqUser</username>
<password>TlinqAdmin</password>
</Database>
From odoo-client.properties:
# Extract to Secret
odoo.user=odoo@peruntours.com
odoo.password=<redacted>
odoo.session.id=<redacted>
Amadeus Credentials:
7. Risk Assessment¶
7.1 High Risk Items¶
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| Hazelcast clustering fails in K8s | Cache not shared across pods | Medium | Test thoroughly; fallback to external Hazelcast |
| Shared storage performance | Slow document upload/download | Medium | Use high-performance storage (EFS Provisioned Throughput) |
| Database connection pool exhaustion | API failures under load | Medium | Implement HikariCP with proper sizing |
| Secret rotation breaks production | Application crashes | Low | Implement graceful secret reload |
| Configuration file parsing errors | App fails to start | Medium | Add init container to validate config |
7.2 Medium Risk Items¶
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| Memory leaks in long-running pods | OOM kills | Medium | Monitor heap usage; set memory limits |
| External API rate limits | Service degradation | High | Implement circuit breakers and retries |
| SSL certificate expiry | HTTPS access lost | Low | Use cert-manager for auto-renewal |
| Log volume too high | Storage costs increase | Medium | Implement log level filtering |
| Migration downtime longer than expected | User impact | Medium | Extensive testing in staging |
7.3 Low Risk Items¶
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| Container image size too large | Slow deployments | Low | Multi-stage builds optimize size |
| Pod startup time too slow | Slow scaling | Low | Tune JVM startup; use CDS |
| Ingress configuration errors | 404 errors | Low | Test routing before production |
8. Testing Strategy¶
8.1 Unit Testing¶
Scope: Code changes for Kubernetes compatibility
Tests to Add:
1. HealthApi Tests
   - Liveness returns 200
   - Readiness returns 200 when DB up
   - Readiness returns 503 when DB down
2. TlinqDBSession Tests
   - Reads from environment variables
   - Falls back to config file
   - Connection pool works
3. TlinqClusterCache Tests
   - Kubernetes mode uses K8s discovery
   - Bare-metal mode uses multicast
   - Environment variable parsing
Tools: JUnit 5, Mockito
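The status-code contract the HealthApi tests are meant to pin down could look roughly like this. It is a sketch under assumptions: `HealthStatus` and its methods are hypothetical names, and the database check is passed in as a `BooleanSupplier` so a test can simulate "DB up" and "DB down" without a real connection.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the probe logic; the real HealthApi may be structured differently. */
final class HealthStatus {
    static final int OK = 200;
    static final int SERVICE_UNAVAILABLE = 503;

    /** Liveness only says the process is running, so it does not touch dependencies. */
    static int liveness() {
        return OK;
    }

    /** Readiness reports 503 when the database check fails or throws. */
    static int readiness(BooleanSupplier databaseCheck) {
        try {
            return databaseCheck.getAsBoolean() ? OK : SERVICE_UNAVAILABLE;
        } catch (RuntimeException e) {
            // A failing check must not crash the probe endpoint itself
            return SERVICE_UNAVAILABLE;
        }
    }
}
```

Keeping liveness independent of the database avoids restarting healthy pods during a database outage; only readiness removes them from the Service endpoints.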
8.2 Integration Testing¶
Scope: Container and Kubernetes integration
Test Scenarios:
1. Container Build
   - Docker build succeeds
   - Image size within limits (<300MB)
   - Security scan passes (no critical CVEs)
2. Container Runtime
   - Container starts successfully
   - Health endpoints respond
   - Logs appear on stdout
3. Kubernetes Deployment
   - Pods start and become ready
   - ConfigMaps mounted correctly
   - Secrets injected as env vars
   - PVC mounted and writable
4. Multi-Pod
   - 3 pods run simultaneously
   - Hazelcast cluster forms
   - Cache shared across pods
   - Load balanced correctly
8.3 Functional Testing¶
Scope: End-to-end business functionality
Test Cases:
1. Flight Search
   - Search flights (JFK → LHR)
   - Verify results returned
   - Confirm pricing
   - Create booking
2. Hotel Search
   - Search hotels in Paris
   - View offers
   - Verify pricing
3. Document Management
   - Upload PDF
   - Verify file saved to PVC
   - Download PDF from another pod
4. User Management
   - Login via OAuth2
   - Access admin endpoints
   - Role-based access control
8.4 Performance Testing¶
Tools: Apache JMeter, k6
Test Scenarios:
1. Load Test
   - 100 concurrent users
   - Mixed API calls (search, booking)
   - Duration: 30 minutes
   - Success rate: >99%
   - Avg response time: <500ms
2. Stress Test
   - Gradually increase load to 500 users
   - Identify breaking point
   - Monitor auto-scaling behavior
   - Verify graceful degradation
3. Soak Test
   - 50 concurrent users
   - Duration: 12 hours
   - Check for memory leaks
   - Verify no degradation over time
Acceptance Criteria:
- Response time p95 < 1 second
- Error rate < 0.1%
- No memory leaks
- Auto-scaling triggers appropriately
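The two numeric acceptance criteria can be checked mechanically against exported load-test samples. The sketch below assumes one latency sample per request (failed requests included) and uses the nearest-rank definition of a percentile; JMeter and k6 may compute percentiles slightly differently, and `LoadTestGate` is a hypothetical name.

```java
import java.util.Arrays;

/** Hypothetical pass/fail gate for the p95 and error-rate acceptance criteria. */
final class LoadTestGate {

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Passes when p95 is under 1 second and fewer than 0.1% of requests failed. */
    static boolean passes(long[] latenciesMs, long failedRequests) {
        double errorRate = (double) failedRequests / latenciesMs.length;
        return percentile(latenciesMs, 95.0) < 1000 && errorRate < 0.001;
    }
}
```

With 100 samples, a single failed request already gives an error rate of 1% and fails the gate, so the error-rate criterion needs at least a few thousand requests to be measured meaningfully.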
8.5 Disaster Recovery Testing¶
Test Scenarios:
1. Pod Failure
   - Delete random pod
   - Verify auto-restart
   - Verify no service interruption
2. Node Failure
   - Drain node
   - Verify pods rescheduled
   - Verify service continuity
3. Database Failure
   - Stop database
   - Verify readiness probe fails
   - Verify graceful error handling
   - Restore database
   - Verify automatic recovery
4. Complete Cluster Failure
   - Backup application state
   - Destroy cluster
   - Restore from backup
   - Verify data integrity
9. Rollout Plan¶
9.1 Pre-Deployment Checklist¶
Infrastructure:
- [ ] Kubernetes cluster provisioned (3+ nodes)
- [ ] StorageClass configured (EFS/Azure Files/NFS)
- [ ] PostgreSQL database deployed or accessible
- [ ] Ingress controller installed (nginx-ingress)
- [ ] Cert-manager installed (for TLS)
- [ ] Monitoring stack deployed (Prometheus/Grafana)
- [ ] Logging stack deployed (EFK/Loki)
Application:
- [ ] Docker images built and pushed to registry
- [ ] ConfigMaps created
- [ ] Secrets created (encrypted in repo via sealed-secrets)
- [ ] Database schema migrated
- [ ] Health endpoints tested
- [ ] Code changes merged to main branch
Documentation:
- [ ] Deployment runbook completed
- [ ] Rollback procedure documented
- [ ] Monitoring dashboard created
- [ ] Alert rules configured
- [ ] On-call rotation established
9.2 Deployment Steps¶
Development Environment (Week 4):
1. Create namespace
2. Apply secrets (from sealed-secrets or vault)
3. Apply ConfigMaps
4. Create PVC
5. Deploy database (if not external)
6. Deploy application (1 replica)
7. Verify health
8. Create service
9. Create ingress
10. Test access
Staging Environment (Week 7):
- Repeat steps above with the tqpro-staging namespace
- Run full test suite
- User acceptance testing
- Performance testing
- Security audit
Production Environment (Week 8):
Blue-Green Strategy:
- Deploy "green" environment alongside existing "blue"
- Route 10% traffic to green
- Monitor metrics for 1 hour
- Gradually increase to 50%, then 100%
- Keep blue running for 24 hours as rollback option
Deployment Command:
# Deploy green
kubectl apply -f k8s/prod/deployment-green.yaml
# Update service to route to green (weighted)
kubectl apply -f k8s/prod/service-weighted.yaml
# Monitor
kubectl get pods -n tqpro -l version=green -w
# Full cutover
kubectl apply -f k8s/prod/service.yaml
# Cleanup blue after 24h
kubectl delete deployment tqpro-api-blue -n tqpro
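The decision taken at each monitoring checkpoint of the 10% → 50% → 100% shift can be written down as a small function. This is an illustrative sketch only: `TrafficShift` is a hypothetical name, and the rollback trigger (a maximum tolerated error rate supplied by the caller) is an assumption, since the plan does not state a promotion threshold.

```java
/** Hypothetical promotion gate for the gradual traffic shift to the green deployment. */
final class TrafficShift {
    // Traffic percentages routed to green, in rollout order
    private static final int[] STEPS = {10, 50, 100};

    /**
     * Returns the next share of traffic for green, or 0 to signal a rollback to blue.
     *
     * @param currentWeight       percentage currently routed to green
     * @param errorRatePercent    error rate observed on green during the monitoring window
     * @param maxErrorRatePercent assumed tolerance; pick it from the alert thresholds in use
     */
    static int next(int currentWeight, double errorRatePercent, double maxErrorRatePercent) {
        if (errorRatePercent > maxErrorRatePercent) {
            return 0; // green is unhealthy: send all traffic back to blue
        }
        for (int step : STEPS) {
            if (step > currentWeight) {
                return step;
            }
        }
        return 100; // already fully cut over
    }
}
```

For example, `TrafficShift.next(10, 0.2, 1.0)` promotes green to 50%, while `TrafficShift.next(50, 2.0, 1.0)` returns 0 and calls for a rollback.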
9.3 Rollback Procedure¶
If issues are detected during rollout:
1. Immediate Rollback (< 5 minutes):
2. Post-Rollback:
   - Investigate root cause
   - Fix issues in dev/staging
   - Re-test thoroughly
   - Schedule new deployment
9.4 Post-Deployment Validation¶
Automated Checks:
- [ ] All pods healthy (3/3 running)
- [ ] Health endpoints returning 200
- [ ] Ingress routing correctly
- [ ] SSL certificate valid
- [ ] Metrics being collected
- [ ] Logs being aggregated
Manual Checks:
- [ ] Login functionality works
- [ ] Flight search works
- [ ] Hotel search works
- [ ] Document upload works
- [ ] API responses correct
- [ ] No errors in logs
Performance Validation:
- [ ] Response times within SLA
- [ ] No increase in error rate
- [ ] Database connections stable
- [ ] Memory usage normal
- [ ] CPU usage normal
9.5 Monitoring & Alerts¶
Key Metrics to Monitor:
1. Application Health:
   - Pod restart count
   - Health check failures
   - Application errors (500s)
2. Performance:
   - Request rate
   - Response time (p50, p95, p99)
   - Error rate
   - Database query time
3. Resource Usage:
   - CPU utilization
   - Memory usage
   - Disk I/O (PVC)
   - Network traffic
4. External Dependencies:
   - Database connection pool
   - Amadeus API latency
   - Odoo API availability
   - Cache hit ratio
Alert Thresholds:
- Pod crash loop: Immediate alert
- Error rate > 1%: Warning
- Error rate > 5%: Critical
- Response time p95 > 2s: Warning
- Response time p95 > 5s: Critical
- Memory > 90%: Warning
- CPU > 80% for 5min: Warning
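To make the boundary behavior of the error-rate and latency thresholds explicit, they can be expressed as code before being translated into alerting rules. The sketch below is illustrative: `AlertRules` and `Severity` are hypothetical names, and the comparisons are strict (`>`), so a value exactly at a threshold does not fire.

```java
/** Hypothetical encoding of the error-rate and p95 latency thresholds listed above. */
final class AlertRules {
    enum Severity { NONE, WARNING, CRITICAL }

    /** Error rate in percent: above 1% warns, above 5% is critical. */
    static Severity forErrorRate(double errorRatePercent) {
        if (errorRatePercent > 5.0) return Severity.CRITICAL;
        if (errorRatePercent > 1.0) return Severity.WARNING;
        return Severity.NONE;
    }

    /** p95 response time in seconds: above 2s warns, above 5s is critical. */
    static Severity forP95Seconds(double p95Seconds) {
        if (p95Seconds > 5.0) return Severity.CRITICAL;
        if (p95Seconds > 2.0) return Severity.WARNING;
        return Severity.NONE;
    }

    /** Returns the more severe of two results (enum order is NONE < WARNING < CRITICAL). */
    static Severity worst(Severity a, Severity b) {
        return a.compareTo(b) >= 0 ? a : b;
    }
}
```

Combining signals with `worst` gives one overall severity per evaluation window, which maps naturally onto a single routing decision in the alert manager.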
10. Cost Analysis¶
10.1 Infrastructure Costs (Monthly)¶
AWS EKS Example:
| Component | Specification | Cost |
|---|---|---|
| EKS Cluster | Control plane | $73 |
| EC2 Nodes | 3x t3.xlarge (4 CPU, 16GB) | $223 |
| EBS Volumes | 300GB gp3 | $24 |
| EFS Storage | 50GB + 5 MB/s provisioned | $65 |
| Application Load Balancer | 1 ALB | $23 |
| Data Transfer | 500GB/month | $45 |
| RDS PostgreSQL | db.t3.medium (2 CPU, 4GB) | $123 |
| CloudWatch | Logs + Metrics | $30 |
| Subtotal | | $606/month |
Azure AKS Example:
| Component | Specification | Cost |
|---|---|---|
| AKS Cluster | Control plane (free) | $0 |
| VM Nodes | 3x Standard_D4s_v3 (4 CPU, 16GB) | $347 |
| Managed Disks | 300GB Premium SSD | $51 |
| Azure Files | 50GB Premium | $110 |
| Application Gateway | Standard_v2 | $267 |
| Azure Database for PostgreSQL | General Purpose, 2 vCores | $182 |
| Log Analytics | 10GB/day | $25 |
| Subtotal | | $982/month |
GCP GKE Example:
| Component | Specification | Cost |
|---|---|---|
| GKE Cluster | Control plane | $73 |
| Compute Nodes | 3x n1-standard-4 (4 CPU, 15GB) | $292 |
| Persistent Disks | 300GB SSD | $51 |
| Filestore | 1TB Basic | $204 |
| Cloud Load Balancer | External HTTPS | $18 |
| Cloud SQL PostgreSQL | db-n1-standard-2 | $158 |
| Cloud Logging | 50GB/month | $25 |
| Subtotal | | $821/month |
10.2 Scaling Costs¶
Auto-Scaling Impact:
With HPA configured (3-10 replicas):
- Minimum (3 replicas): Base cost
- Average (5 replicas): +40% compute cost
- Peak (10 replicas): +100% compute cost
Recommendation: Set HPA max based on budget constraints and actual traffic patterns.
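As a worked example of what these uplifts mean for the monthly bill, the sketch below applies them to the AWS figures from section 10.1. It assumes that only the EC2 node line ($223) scales with replica count and that every other line item stays fixed; `MonthlyCost` is a hypothetical helper name.

```java
import java.util.stream.IntStream;

/** Hypothetical helper for the monthly cost arithmetic (whole US dollars). */
final class MonthlyCost {

    /** Sums the line items of one provider table. */
    static int subtotal(int... lineItemsUsd) {
        return IntStream.of(lineItemsUsd).sum();
    }

    /** Adds a percentage uplift on the compute line to an otherwise fixed base total. */
    static long withComputeUplift(int baseTotalUsd, int computeUsd, int upliftPercent) {
        return Math.round(baseTotalUsd + computeUsd * upliftPercent / 100.0);
    }
}
```

With the AWS line items, `subtotal(73, 223, 24, 65, 23, 45, 123, 30)` gives the $606 base. Under the stated assumption, the average case (+40% on $223) comes to about $695/month and the peak case (+100%) to $829/month.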
10.3 Cost Optimization Strategies¶
1. Right-Sizing:
   - Start with smaller instances
   - Use metrics to adjust
   - Potentially save 30-40%
2. Reserved Instances:
   - 1-year reserved instances: ~30% savings
   - 3-year reserved instances: ~50% savings
3. Spot Instances:
   - Use for non-critical workloads
   - 70-90% savings
   - Not recommended for production API
4. Storage Optimization:
   - Use object storage (S3) instead of EFS for documents
   - Potential savings: $50/month
5. Multi-Tenancy:
   - Share cluster with other applications
   - Reduce per-app overhead
Estimated Optimized Cost: $400-500/month (vs roughly $600-980 unoptimized, per the subtotals in 10.1)
11. Conclusion¶
11.1 Summary¶
The TQPro application is well-suited for Kubernetes deployment with the following assessment:
Strengths ✅:
- Stateless REST API design (perfect for K8s)
- Embedded Jetty server (no external dependencies)
- Multi-module architecture (clear separation)
- Standard Java/Gradle stack (well-supported)
Challenges ⚠️:
- Configuration externalization required
- Hazelcast needs K8s discovery implementation
- Secrets hardcoded in config files
- No health check endpoints (must add)
- Shared file storage needed
Overall Effort: 6-8 weeks (including testing)
Cost: $400-900/month depending on cloud provider and optimization
Risk Level: Medium (manageable with proper planning)
11.2 Recommendations¶
Immediate Actions:
1. ✅ Approve development plan and budget
2. ✅ Provision development Kubernetes cluster
3. ✅ Assign development team (2-3 engineers)
4. ✅ Start Phase 1 (code changes)
Critical Success Factors:
1. Thorough testing in staging before production
2. Gradual rollout with rollback capability
3. Comprehensive monitoring from day one
4. Clear runbooks for operations team
Future Enhancements (Post-Deployment):
1. Service mesh (Istio) for advanced traffic management
2. GitOps (ArgoCD/Flux) for declarative deployments
3. Chaos engineering for resilience testing
4. Multi-region deployment for disaster recovery
11.3 Decision Point¶
Proceed with Kubernetes deployment?
- ☐ Yes - Begin Phase 1 immediately
- ☐ No - Document reasons and revisit in 6 months
- ☐ Partial - Start with dev/staging only
Appendix A: Hazelcast vs Alternatives Analysis¶
A.1 Question: Should Hazelcast be replaced with another distributed cache?¶
Initial Consideration: Replace Hazelcast with Caffeine (local cache) or Redis (managed service)
Analysis Performed: Deep code analysis of all cache usage patterns across the application
A.2 Cache Usage Discovery¶
Finding: Hazelcast is minimally used but CRITICAL for multi-instance deployment
Only 4 files use Hazelcast:
1. CartHolder.java - Shopping cart storage (session-based)
2. OdooServiceFactory.java - User session management
3. ApiRoleManager.java - API authorization cache
4. TlinqFrameworkInitializer.java - Test initialization
3 Active Caches:
| Cache Name | Purpose | Critical | DB Fallback | Multi-Pod Requirement |
|------------|---------|----------|-------------|----------------------|
| cartsCache | Shopping carts | YES | Partial* | REQUIRED |
| userSessions | Odoo authentication sessions | YES | NO | CRITICAL |
| apiRolesCache | API RBAC | MEDIUM | File reload | Optional |
*Logged-in user carts have database fallback; anonymous carts do not
A.3 Multi-Instance Impact Analysis¶
User Sessions (CRITICAL):
// OdooServiceFactory.java:146-158
private UserLogin fetchSession(String sessionToken) {
    UserLogin session = sessions.get(sessionToken); // Hazelcast lookup
    if (null == session) {
        // NO DATABASE FALLBACK - User gets logged out!
        throw new TlinqClientException(TlinqErr.SESSION_ERROR, "Session expired");
    }
    return session;
}
Impact if cache not distributed:
- ❌ User sessions lost when request hits different pod
- ❌ Users forced to re-authenticate frequently
- ❌ Poor user experience
- ❌ Horizontal scaling impossible
Shopping Carts (HIGH PRIORITY):
- Logged-in users: ✅ Database fallback works
- Anonymous users: ❌ Cart lost on pod switch (requires sticky sessions)
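The failure mode can be reproduced in a few lines without Hazelcast. In this toy model each `PodSim` (a hypothetical class, not TQPro code) stands for one API pod, and a plain map stands in for the session cache: giving each pod its own map models a local cache, while passing the same map to both models a distributed one.

```java
import java.util.Map;
import java.util.Optional;

/** Toy model of an API pod whose session store is either private or shared. */
final class PodSim {
    private final Map<String, String> sessions;

    /** Pass a fresh map per pod for a local cache, or one shared map for a distributed cache. */
    PodSim(Map<String, String> sessions) {
        this.sessions = sessions;
    }

    /** Records a session for the given token, as a login would. */
    void login(String sessionToken, String user) {
        sessions.put(sessionToken, user);
    }

    /** Empty result corresponds to the "Session expired" error path in fetchSession. */
    Optional<String> fetchSession(String sessionToken) {
        return Optional.ofNullable(sessions.get(sessionToken));
    }
}
```

With two pods built on separate `ConcurrentHashMap` instances, a login on the first pod is invisible to the second and `fetchSession` comes back empty. With both pods built on the same map instance, the second pod finds the session. That second behavior is the property a distributed cache has to provide across real pods.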
A.4 Alternative Evaluation¶
Option 1: Caffeine (Local Cache)¶
Pros:
- ✅ Lightweight (~1MB vs 4MB)
- ✅ Better memory management (TTL + eviction built-in)
- ✅ No network configuration needed
- ✅ 1-2 day migration effort
Cons:
- ❌ BREAKS multi-instance deployment
- ❌ User sessions not shared between pods
- ❌ Anonymous carts lost on pod switch
- ❌ Requires sticky sessions (defeats K8s benefits)
Verdict: ❌ NOT SUITABLE for Kubernetes multi-pod deployment
Option 2: Redis (External Cache)¶
Pros:
- ✅ True distributed caching
- ✅ Managed service available (AWS ElastiCache, Azure Cache for Redis)
- ✅ Persistence across restarts
- ✅ Simpler K8s deployment (no in-app clustering)
- ✅ Can share with other services
Cons:
- ⚠️ Additional infrastructure ($50-100/month)
- ⚠️ 2-week migration effort
- ⚠️ New dependency to manage
- ⚠️ Network latency for cache operations
Verdict: ✅ VIABLE ALTERNATIVE but higher effort/cost
Option 3: Hazelcast (Fixed for Kubernetes)¶
Pros:
- ✅ Already integrated (4 files, 3 caches)
- ✅ Designed for distributed caching
- ✅ Zero additional infrastructure cost
- ✅ Embedded in application (no external service)
- ✅ 3-5 day migration effort
- ✅ Backward compatible with bare-metal
Cons:
- ⚠️ Requires Kubernetes discovery configuration
- ⚠️ Current config has hardcoded IP (fixable)
- ⚠️ No TTL/eviction configured (fixable)
- ⚠️ Cluster management in-app
Verdict: ⭐ RECOMMENDED - Best fit for current architecture
A.5 Decision Matrix¶
| Criteria | Caffeine | Hazelcast (Fixed) | Redis |
|---|---|---|---|
| Multi-Instance Support | ❌ NO | ✅ YES | ✅ YES |
| Effort | 1-2 days | 3-5 days | 2 weeks |
| Cost | $0 | $0 | $50-100/mo |
| Complexity | Low | Medium | Medium-High |
| Code Changes | 4 files | 4 files + config | 4 files + client |
| Infrastructure | None | None | Redis cluster |
| Backward Compat | ✅ Easy | ✅ Easy | ⚠️ Complex |
| Performance | Best (local) | Good (in-cluster) | Good (network hop) |
| Persistence | ❌ None | ⚠️ In-memory | ✅ Disk-backed |
| Overall Rating | ⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
A.6 Final Recommendation¶
Keep Hazelcast and Fix Kubernetes Discovery ⭐⭐⭐⭐⭐
Rationale:
1. Minimal disruption - Already integrated, just needs K8s config
2. Cost effective - No additional infrastructure
3. Quick to implement - 3-5 days vs 2 weeks for Redis
4. Proven technology - Hazelcast designed for this exact use case
5. Backward compatible - Bare-metal deployment still works
Implementation:
- Upgrade to Hazelcast 5.3.6 (from 4.2.4)
- Add Kubernetes service discovery plugin
- Configure TTL and eviction policies
- Add health check endpoints
- See HAZELCAST_KUBERNETES_MIGRATION.md for complete implementation plan
When to Consider Redis:
- If Hazelcast clustering proves problematic in production
- If you need persistence across cluster restarts
- If you want a managed service with support
- If you are sharing the cache with other applications
- If the budget allows for $50-100/month additional cost
A.7 Current Hazelcast Issues Fixed¶
Before:
// ❌ Hardcoded IP
network.getJoin().getMulticastConfig()
    .addTrustedInterface("172.16.55.1");
// ❌ No TTL - memory leak risk
// ❌ No eviction policies
// ❌ No max size limits
After:
// ✅ Kubernetes discovery
if ("kubernetes".equals(deploymentMode)) {
    // Only one join mechanism should be enabled, so switch multicast off first
    network.getJoin().getMulticastConfig().setEnabled(false);
    network.getJoin().getKubernetesConfig()
        .setEnabled(true)
        .setProperty("namespace", System.getenv("K8S_NAMESPACE"))
        .setProperty("service-name", "tqpro-hazelcast");
}
// ✅ TTL configured
MapConfig cartsConfig = new MapConfig("cartsCache");
cartsConfig.setTimeToLiveSeconds(1800); // 30 min
// ✅ Eviction policy: at most 10,000 entries per node, least recently used evicted first
cartsConfig.setEvictionConfig(new EvictionConfig()
    .setSize(10000)
    .setMaxSizePolicy(MaxSizePolicy.PER_NODE)
    .setEvictionPolicy(EvictionPolicy.LRU));
A.8 Testing Validation¶
Multi-Pod Cache Consistency Test:
# Create session on pod-1
SESSION_ID="test-session-123"
curl -X POST http://pod-1:11080/api/cart/addItem \
-d '{"session":"'$SESSION_ID'","item":"ITEM123"}'
# Retrieve cart from pod-2 (different pod!)
curl -X POST http://pod-2:11080/api/cart/load \
-d '{"session":"'$SESSION_ID'"}'
# ✅ Should return cart with ITEM123 (cache shared)
# ❌ With Caffeine: Would return empty cart (cache not shared)
Expected Results with Hazelcast:
- ✅ Sessions accessible from all pods
- ✅ Carts consistent across pod switches
- ✅ Cluster size matches pod count
- ✅ No session loss during scaling events
Document Owner: DevOps Team
Last Updated: 2024-11-23
Next Review: After Phase 1 completion
End of Document