TQPro Kubernetes Containerization - Executive Summary¶
Date: 2024-11-23
Status: Analysis Complete - Ready for Implementation
Quick Reference¶
| Document | Purpose | Audience |
|---|---|---|
| KUBERNETES_DEPLOYMENT_PLAN.md | Complete deployment strategy and architecture | DevOps, Tech Leads, Architects |
| HAZELCAST_KUBERNETES_MIGRATION.md | Step-by-step Hazelcast migration guide | Developers, DevOps |
| REQUIREMENTS_SPECIFICATION.md (tqamds) | Functional requirements for Amadeus module | Product, QA, Business |
Executive Decision Summary¶
✅ Recommendation: Proceed with Kubernetes Deployment¶
Feasibility: GOOD - 4/5 stars
Effort: 6-8 weeks
Risk: Medium (manageable)
Cost: $400-900/month (cloud provider dependent)
Key Findings¶
Application Strengths ✅¶
- Stateless Design - No server-side sessions, perfect for K8s
- Embedded Server - Jetty 12, no external app server needed
- RESTful API - Standard JAX-RS, easy to containerize
- OAuth2 Auth - Identity in HTTP headers, pod-independent
Critical Requirements ⚠️¶
- Hazelcast Must Be Fixed - Hardcoded IP incompatible with K8s
- User Sessions Require Distributed Cache - No database fallback
- Health Checks Must Be Added - Required for K8s probes
- Secrets Must Be Externalized - 20+ hardcoded credentials
Hazelcast Decision: Keep and Fix¶
Question: Should we replace Hazelcast with Caffeine or Redis?
Answer: NO - Keep Hazelcast and configure for Kubernetes
Why Hazelcast is Required¶
Deep code analysis revealed:
- User sessions stored in cache ONLY (no DB fallback)
- Anonymous shopping carts memory-only
- Multi-pod deployment requires distributed cache
Impact if Not Fixed:
- ❌ Users logged out when hitting different pods
- ❌ Shopping carts lost
- ❌ Cannot scale horizontally
Alternatives Considered¶
| Solution | Multi-Pod | Effort | Cost | Verdict |
|---|---|---|---|---|
| Caffeine | ❌ NO | 2 days | $0 | ❌ Breaks clustering |
| Hazelcast (fixed) | ✅ YES | 3-5 days | $0 | ⭐ RECOMMENDED |
| Redis | ✅ YES | 2 weeks | $50-100/mo | ✅ Viable alternative |
Decision: Fix Hazelcast (lowest effort, no cost, backward compatible)
Implementation Plan Overview¶
Phase 1: Code Changes (Week 1-2)¶
- ✅ Add health check endpoints
- ✅ Fix Hazelcast Kubernetes discovery
- ✅ Externalize database configuration
- ✅ Add graceful shutdown
- ✅ Configure TTL/eviction policies
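
As a rough illustration of the Hazelcast items in Phase 1, the sketch below shows what a Kubernetes-aware `hazelcast.yaml` could look like. The cluster name, namespace, service name, map names, and every TTL and size value are assumptions chosen for illustration, not values taken from the TQPro codebase; the real map names must match whatever the application uses for sessions and carts.

```yaml
# Sketch only: all names and numbers below are placeholders.
hazelcast:
  cluster-name: tqpro              # assumed cluster name
  network:
    port:
      port: 5701                   # member port, matching the headless Service
      auto-increment: false
    join:
      multicast:
        enabled: false             # multicast is generally unavailable in K8s
      kubernetes:
        enabled: true
        namespace: tqpro           # assumed namespace
        service-name: tqpro-hazelcast   # assumed headless Service name
  map:
    user-sessions:                 # hypothetical map name
      backup-count: 1              # keep one copy on another pod
      time-to-live-seconds: 28800  # illustrative: expire after 8 hours
      max-idle-seconds: 1800       # illustrative: expire after 30 idle minutes
      eviction:
        eviction-policy: LRU
        max-size-policy: PER_NODE
        size: 10000                # illustrative entry cap per member
    anonymous-carts:               # hypothetical map name
      backup-count: 1
      max-idle-seconds: 3600
      eviction:
        eviction-policy: LRU
        max-size-policy: PER_NODE
        size: 5000
```

This replaces a hardcoded member IP with lookup through the Kubernetes API, which is why the pods also need read access to endpoints, pods, and services.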
Phase 2: Configuration (Week 3)¶
- ✅ Create ConfigMaps
- ✅ Migrate secrets to K8s Secrets
- ✅ Update environment variable injection
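
The Phase 2 steps might be sketched as follows: non-sensitive settings go in a ConfigMap, credentials in a Secret, and both are injected into the container as environment variables. The object names, the `DB_*` variable names, the host value, and the image tag are hypothetical; only the `tqpro-api` deployment name, the `api` container name, and the `registry/tqpro/api` image path come from this document.

```yaml
# Sketch only: variable names and values are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tqpro-api-config
data:
  DB_HOST: "postgres.example.internal"   # hypothetical managed PostgreSQL host
  DB_PORT: "5432"
  DB_NAME: "tqpro"
---
apiVersion: v1
kind: Secret
metadata:
  name: tqpro-api-secrets
type: Opaque
stringData:                  # stringData avoids manual base64 encoding
  DB_USER: "replace-me"      # never commit real credentials
  DB_PASSWORD: "replace-me"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tqpro-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tqpro-api
  template:
    metadata:
      labels:
        app: tqpro-api
    spec:
      containers:
        - name: api
          image: registry/tqpro/api:1.0.0   # tag is an assumption
          envFrom:
            - configMapRef:
                name: tqpro-api-config      # every key becomes an env var
            - secretRef:
                name: tqpro-api-secrets
```

The application then reads these variables at startup instead of the values currently hardcoded in its configuration.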
Phase 3: K8s Deployment (Week 4)¶
- ✅ Deploy to dev cluster
- ✅ Test multi-pod clustering
- ✅ Validate cache consistency
Phase 4: Observability (Week 5)¶
- ✅ Setup logging (EFK/Loki)
- ✅ Setup monitoring (Prometheus/Grafana)
- ✅ Configure alerts
Phase 5: Hardening (Week 6)¶
- ✅ Security scanning
- ✅ Performance tuning
- ✅ Load testing
- ✅ DR planning
Phase 6: Production (Week 7-8)¶
- ✅ Staging deployment
- ✅ UAT
- ✅ Production rollout (blue-green)
Resource Requirements¶
Compute (3 API pods + 2 web pods)¶
- API Pods: 3 × (2 CPU, 4GB RAM) = 6 CPU, 12GB RAM
- Web Pods: 2 × (0.5 CPU, 512MB) = 1 CPU, 1GB RAM
- Total: ~7-8 CPU cores, ~13-15GB RAM (the pods above sum to 7 CPU and 13GB; the upper bound leaves headroom)
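
To show how the per-pod figures above might translate into a manifest, here is a hedged fragment of the API Deployment. The document does not say whether "2 CPU, 4GB RAM" refers to requests or limits, so setting both to the same value is an assumption made here for illustration.

```yaml
# Fragment of the tqpro-api Deployment; not a complete manifest.
spec:
  replicas: 3                  # 3 x (2 CPU, 4Gi) = 6 CPU, 12Gi
  template:
    spec:
      containers:
        - name: api
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "2"         # assumption: limits equal to requests
              memory: 4Gi
```

Equal requests and limits give the pods the Guaranteed QoS class, which makes them the last candidates for eviction under node memory pressure. The JVM heap must still be sized below the 4Gi container limit.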
Storage¶
- Documents PVC: 50GB (ReadWriteMany - EFS/Azure Files)
- Database: External PostgreSQL (managed service)
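
A minimal sketch of the documents volume claim, assuming a storage class that supports `ReadWriteMany`. The claim name and `storageClassName` are hypothetical and depend on how EFS or Azure Files is provisioned in the target cluster.

```yaml
# Sketch only: claim name and storage class are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tqpro-documents
spec:
  accessModes:
    - ReadWriteMany            # all API pods mount the same volume
  storageClassName: efs-sc     # assumption: an EFS or Azure Files class
  resources:
    requests:
      storage: 50Gi            # the 50GB figure above, in binary units
```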
Monthly Cost Estimate¶
- AWS EKS: ~$606/month
- Azure AKS: ~$982/month
- GCP GKE: ~$821/month
- Optimized: $400-500/month
Critical Path Items¶
Must Complete Before Multi-Pod¶
- Hazelcast Kubernetes Discovery (3-5 days)
  - Upgrade to 5.3.6
  - Add K8s plugin
  - Configure service discovery
  - Add TTL/eviction policies
- Health Check Endpoints (4 hours)
  - /health/live - liveness probe
  - /health/ready - readiness probe (DB + cache)
  - /health/cache - Hazelcast cluster health
- RBAC for Hazelcast (30 minutes)
  - ServiceAccount
  - Role (endpoints, pods, services read)
  - RoleBinding
- Headless Service (15 minutes)
  - ClusterIP: None
  - Port: 5701
  - publishNotReadyAddresses: true
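
The RBAC and headless Service items above could look roughly like this. The namespace, object names, and the `app: tqpro-api` selector label are assumptions; the resource list, port, `clusterIP: None`, and `publishNotReadyAddresses` follow the items listed in this section.

```yaml
# Sketch only: names, namespace, and labels are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tqpro-hazelcast
  namespace: tqpro
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: hazelcast-discovery
  namespace: tqpro
rules:
  - apiGroups: [""]
    resources: ["endpoints", "pods", "services"]
    verbs: ["get", "list"]     # read-only access for member discovery
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: hazelcast-discovery
  namespace: tqpro
subjects:
  - kind: ServiceAccount
    name: tqpro-hazelcast
    namespace: tqpro
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: hazelcast-discovery
---
apiVersion: v1
kind: Service
metadata:
  name: tqpro-hazelcast
  namespace: tqpro
spec:
  clusterIP: None                  # headless: DNS resolves to pod IPs
  publishNotReadyAddresses: true   # expose members before they pass readiness
  selector:
    app: tqpro-api                 # assumed pod label
  ports:
    - name: hazelcast
      port: 5701
      targetPort: 5701
```

The API pods would also need `serviceAccountName: tqpro-hazelcast` in their pod spec for the Role to take effect. `publishNotReadyAddresses: true` matters because the readiness probe checks the cache: without it, a starting pod could not be discovered by its peers until it was ready, and it might not become ready until it had joined the cluster.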
Testing Strategy¶
Unit Tests¶
- Health API endpoints
- Database env var configuration
- Hazelcast cluster formation
Integration Tests¶
- Container builds successfully
- Multi-pod Hazelcast clustering
- Cache consistency across pods
Load Tests¶
- 100 concurrent users
- Cache hit/miss ratios
- Response times < 500ms (p95)
Chaos Tests¶
- Pod kills (random)
- Node drains
- Network partitions
Success Criteria¶
Functional¶
- ✅ 3-pod cluster forms Hazelcast cluster
- ✅ User sessions shared across pods
- ✅ Shopping carts accessible from any pod
- ✅ Health checks respond correctly
- ✅ Bare-metal deployment still works
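
For the health-check criterion, the probes might be wired to the `/health/live` and `/health/ready` endpoints as in the fragment below. The container port `8080` and all timing values are assumptions to be tuned against real startup and cluster-formation times.

```yaml
# Fragment of the tqpro-api Deployment; port and timings are placeholders.
spec:
  template:
    spec:
      containers:
        - name: api
          ports:
            - containerPort: 8080    # assumed HTTP port
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30  # allow the JVM and Jetty to start
            periodSeconds: 10
            failureThreshold: 3      # restart after about 30s of failures
          readinessProbe:
            httpGet:
              path: /health/ready    # checks DB + cache
              port: 8080
            periodSeconds: 5
            failureThreshold: 3      # remove from load balancing after about 15s
```

Keeping the liveness endpoint free of dependency checks avoids restart loops when the database or cache is briefly unavailable; only readiness should depend on them.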
Performance¶
- ✅ Response time p95 < 1 second
- ✅ Cache operations < 100ms
- ✅ Cluster formation < 30 seconds
- ✅ Error rate < 0.1%
Operational¶
- ✅ Graceful shutdown < 30 seconds
- ✅ No cache-related errors under load
- ✅ Monitoring dashboards functional
- ✅ Alerts working
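
One way to support the graceful-shutdown target is sketched below, assuming the application handles SIGTERM and the image contains a shell with `sleep`. The 5-second pre-stop delay and the 45-second grace period are illustrative values, not measured ones.

```yaml
# Fragment of the tqpro-api Deployment; durations are placeholders.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 45   # 5s preStop + 30s shutdown + margin
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                # Brief pause so the pod is removed from Service endpoints
                # before SIGTERM reaches the application.
                command: ["sh", "-c", "sleep 5"]
```

The grace period covers the pre-stop hook as well as the application's own shutdown, so it should exceed the sum of both; otherwise the kubelet sends SIGKILL before in-flight requests and cache partition migration complete.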
Risks & Mitigation¶
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Hazelcast clustering fails | Medium | High | Thorough testing; Redis fallback plan |
| User session loss | Medium | Critical | Add session persistence to DB (future) |
| Memory leaks | Low | High | TTL + eviction policies configured |
| Network latency | Low | Medium | Keep pods in same AZ |
Rollback Plan¶
Immediate (< 5 minutes)¶
Code Rollback¶
git revert <commit>
./gradlew clean build
docker build -t registry/tqpro/api:rollback .
docker push registry/tqpro/api:rollback
kubectl set image deployment/tqpro-api api=registry/tqpro/api:rollback
Next Steps¶
Immediate Actions Required¶
- ☐ Review and approve deployment plan
- ☐ Provision development K8s cluster
- ☐ Assign development team (2-3 engineers)
- ☐ Allocate budget ($400-900/month cloud costs)
Week 1 Deliverables¶
- ☐ Hazelcast code changes completed
- ☐ Health check endpoints added
- ☐ Local testing passed
- ☐ Code review completed
Decision Point¶
- ☐ Approve - Begin Phase 1 immediately
- ☐ Defer - Revisit in 6 months
- ☐ Pilot - Dev/staging only
Contact & Support¶
Technical Lead: [Name]
DevOps Lead: [Name]
Project Manager: [Name]
Documentation:
- Full deployment plan: KUBERNETES_DEPLOYMENT_PLAN.md
- Hazelcast migration: HAZELCAST_KUBERNETES_MIGRATION.md
- tqamds requirements: tqamds/REQUIREMENTS_SPECIFICATION.md
Prepared By: DevOps & Platform Team
Date: 2024-11-23
Version: 1.0