Horizontal Scalability Implementation
Status: COMPLETE
Completed: 2026-02-19
All scalability components have been implemented:
- HikariCP connection pooling via Hibernate integration
- Hazelcast cluster configuration with TCP-IP discovery
- Distributed scheduling locks with auto-release on crash
- Distributed cache invalidation with pub/sub
- Plugin lifecycle management with ordered shutdown
- Health check and readiness endpoints
- Correlation ID propagation for request tracing
- Database migration script for optimistic locking
Executive Summary
This document describes the implementation of horizontal scalability infrastructure for TQPro, enabling the platform to run as a multi-instance cluster behind a load balancer. The changes address 28 scalability findings across caching, scheduling, database connections, Hazelcast configuration, concurrency, and operational infrastructure.
1. Database Connection Pooling (HikariCP)
1.1 Overview
Replaced Hibernate's default connection handling with HikariCP connection pooling across all six database sessions. Each module gets its own named pool for monitoring and diagnostics.
1.2 Configuration
Dependency (build.gradle.kts):
Hibernate config (tqapp/src/main/resources/hibernate.cfg.xml):
<property name="hibernate.connection.provider_class">
org.hibernate.hikaricp.internal.HikariCPConnectionProvider
</property>
<property name="hibernate.hikari.minimumIdle">2</property>
<property name="hibernate.hikari.maximumPoolSize">10</property>
<property name="hibernate.hikari.idleTimeout">300000</property>
<property name="hibernate.hikari.connectionTimeout">20000</property>
<property name="hibernate.hikari.maxLifetime">1200000</property>
1.3 Pool Names
Each *DBSession.java class sets a unique pool name before building the session factory:
| Class | Pool Name | Module |
|---|---|---|
| NTSDBSession | NTSPool | tqapp |
| RaynaDBSession | RaynaPool | tqryb2b |
| TiqetsDBSession | TiqetsPool | tqtiqets |
| GoGlobalDBSession | GoGlobalPool | tqgglbl |
| AmdDBSession | AmadeusPool | tqamds |
| TlinqDBSession | TlinqCommonPool | tqcommon |
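One way a session class might set its pool name is to add a `hibernate.hikari.poolName` entry to the properties handed to Hibernate before the session factory is built. The sketch below only assembles that override with the standard library. The class and method names are hypothetical, and it assumes the Hibernate HikariCP provider forwards `hibernate.hikari.*` keys to HikariCP in the same way as the pool settings in section 1.2.

```java
import java.util.Properties;

// Hypothetical helper, not taken from the TQPro sources: builds the per-module
// override a *DBSession class could apply on top of hibernate.cfg.xml.
class PoolNameSketch {

    // Assumption: "hibernate.hikari.poolName" is passed through to HikariCP's poolName setting.
    static Properties poolOverrides(String poolName) {
        Properties props = new Properties();
        props.setProperty("hibernate.hikari.poolName", poolName);
        return props;
    }

    public static void main(String[] args) {
        Properties nts = poolOverrides("NTSPool");
        System.out.println(nts.getProperty("hibernate.hikari.poolName"));
    }
}
```

Keeping the name in one small override leaves the shared hibernate.cfg.xml identical across modules.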
1.4 Tuning
Default pool sizes (min 2, max 10) are suitable for typical workloads. For production, adjust per module based on observed query load:
<!-- Example: increase NTS pool for high-traffic deployments -->
<property name="hibernate.hikari.maximumPoolSize">20</property>
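Before raising a pool size in a cluster, it can help to check the worst-case connection count, because every instance opens every pool. The arithmetic below is a sketch: the instance count and the database limit of 200 are made-up numbers, and it assumes all six pools connect to the same database server, which this document does not state.

```java
// Hypothetical sizing check; none of these numbers come from a TQPro deployment.
class PoolBudgetSketch {

    // Worst case: every pool on every instance grows to maximumPoolSize.
    static int worstCaseConnections(int instances, int poolsPerInstance, int maxPoolSize) {
        return instances * poolsPerInstance * maxPoolSize;
    }

    public static void main(String[] args) {
        int total = worstCaseConnections(3, 6, 10); // 3 instances, 6 pools, default max of 10
        int dbLimit = 200;                          // assumed database connection limit
        System.out.println(total + " of " + dbLimit + " connections, fits=" + (total <= dbLimit));
    }
}
```

With these example numbers the cluster could open up to 180 connections, so doubling one pool to 20 on all three instances (30 more connections) would exceed the assumed limit.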
2. Hazelcast Cluster Configuration
2.1 Overview
Replaced Hazelcast's default multicast discovery with explicit TCP-IP discovery. Multicast is unreliable in cloud/container environments and was the root cause of instances running as isolated single-node clusters.
2.2 Configuration File
Location: config/hazelcast.xml (loaded from TLINQ_HOME at startup)
<hazelcast>
<cluster-name>tqpro-cluster</cluster-name>
<network>
<join>
<multicast enabled="false"/>
<tcp-ip enabled="true">
<member>127.0.0.1</member>
</tcp-ip>
</join>
</network>
<!-- Cart cache: session-scoped, TTL 1 hour, idle 30 min -->
<map name="cartsCache">
<time-to-live-seconds>3600</time-to-live-seconds>
<max-idle-seconds>1800</max-idle-seconds>
<eviction eviction-policy="LRU" max-size-policy="PER_NODE" size="10000"/>
</map>
<!-- Scheduler locks: for distributed scheduling -->
<map name="schedulerLocks">
<time-to-live-seconds>3600</time-to-live-seconds>
<backup-count>1</backup-count>
</map>
<!-- Distributed locks: for cross-instance synchronization -->
<map name="distributedLocks">
<time-to-live-seconds>120</time-to-live-seconds>
<backup-count>1</backup-count>
</map>
<!-- Cache invalidation topic -->
<topic name="cacheInvalidation">
<global-ordering-enabled>false</global-ordering-enabled>
</topic>
</hazelcast>
2.3 Multi-Instance Deployment
To deploy multiple instances, add all member IPs/hostnames to the <tcp-ip> section:
<tcp-ip enabled="true">
<member>10.0.1.10</member>
<member>10.0.1.11</member>
<member>10.0.1.12</member>
</tcp-ip>
2.4 TlinqClusterCache Changes
The TlinqClusterCache singleton was rewritten to:
- Load hazelcast.xml from TLINQ_HOME (falls back to default config with warning)
- Expose getHazelcastInstance() for direct Hazelcast operations
- Provide invalidateCache(String) for publishing cache invalidation events
- Provide onCacheInvalidation(Consumer<String>) for subscribing to invalidation events
- Provide shutdown() for graceful cluster leave
3. Distributed Scheduling Locks
3.1 Problem
Three scheduled tasks (GoGlobal refresh, Rayna B2B refresh, Tiqets catalog refresh) run on every instance. Without coordination, all instances simultaneously call external APIs and rebuild caches, causing duplicate load and potential data races.
3.2 Solution
Each scheduler uses IMap.tryLock() with a lease timeout to ensure only one instance executes the task. If the instance crashes, the lock is automatically released when the member leaves the cluster.
3.3 Implementation Pattern
IMap<String, Long> lockMap = TlinqClusterCache.instance()
.getHazelcastInstance().getMap("schedulerLocks");
// No wait (0s), lease for 55 min (< 1-hour refresh interval)
boolean acquired = lockMap.tryLock(LOCK_KEY, 0, TimeUnit.SECONDS, 55, TimeUnit.MINUTES);
if (!acquired) {
logger.info("Another instance holds the lock — skipping.");
return;
}
try {
runRefresh();
} finally {
lockMap.unlock(LOCK_KEY);
}
3.4 Affected Files
| File | Lock Key | Lease Time |
|---|---|---|
| GGRefreshRunner.java | GoGlobalRefresh | 55 min |
| SDRefreshRunner.java | RaynaB2BRefresh | 55 min |
| TiqetsPlugin.java | TiqetsRefresh | configuredInterval - 5 min |
All three fall back to local execution if Hazelcast is unavailable.
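The Tiqets lease is derived from the configured interval rather than fixed. A sketch of that calculation follows. It assumes the interval is available as a `Duration` and adds a one-minute floor for very short intervals; both are assumptions for illustration and are not taken from TiqetsPlugin.java.

```java
import java.time.Duration;

// Hypothetical helper showing "configuredInterval - 5 min" as a lease time.
class LeaseSketch {
    static final Duration MARGIN = Duration.ofMinutes(5);
    static final Duration MIN_LEASE = Duration.ofMinutes(1); // assumed floor, not from the source

    static Duration leaseFor(Duration configuredInterval) {
        Duration lease = configuredInterval.minus(MARGIN);
        // Guard against a zero or negative lease when the interval is under the margin.
        return lease.compareTo(MIN_LEASE) < 0 ? MIN_LEASE : lease;
    }

    public static void main(String[] args) {
        System.out.println(leaseFor(Duration.ofHours(1)).toMinutes() + " min");
    }
}
```

Keeping the lease shorter than the interval means a lock left behind by a stuck run has expired before the next scheduled run starts.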
4. Distributed Cache Invalidation
4.1 Architecture
Cache invalidation uses Hazelcast's ITopic pub/sub mechanism via a cacheInvalidation topic. When one instance refreshes a cache, it publishes an invalidation event. Other instances receive the event and rebuild their local caches from the shared database.
Instance A: refresh data → rebuild local cache → publish("cacheName")
↓
Instance B: receives event → rebuild local cache from DB
Instance C: receives event → rebuild local cache from DB
4.2 StaticMapCache Integration
StaticMapCache provides two methods for distributed invalidation:
| Method | Behavior |
|---|---|
| invalidateDistributed(name) | Removes local cache AND publishes invalidation |
| notifyRemoteInvalidation(name) | Publishes invalidation WITHOUT removing locally |
Use notifyRemoteInvalidation() when the local cache was just rebuilt and only remote instances need to be notified. Use invalidateDistributed() when the local cache is also stale.
4.3 Rayna Tour Cache
RaynaCacheManager rebuilds its tour cache locally, then calls notifyRemoteInvalidation(TOURCACHE). Remote instances receive the event and call initTourCache() to rebuild from the shared database.
4.4 Tiqets Catalog Cache
TiqetsCacheManager uses a split refresh pattern to prevent infinite invalidation loops:
- refresh() (public): refreshes from DB, then publishes invalidation
- refreshLocal() (private): refreshes from DB without publishing
The distributed invalidation listener calls refreshLocal(), not refresh(), to avoid re-triggering the event on all instances.
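The split can be illustrated with an in-memory stand-in for the topic. This is a sketch, not the TiqetsCacheManager code: `Topic`, `CacheManager` and the cache name `tiqetsCatalog` are hypothetical, and the stand-in delivers each message to every subscriber, including the publisher.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class InvalidationLoopSketch {

    // In-memory stand-in for the cacheInvalidation topic (not Hazelcast).
    static class Topic {
        private final List<Consumer<String>> listeners = new ArrayList<>();
        int published = 0;

        void subscribe(Consumer<String> listener) { listeners.add(listener); }

        void publish(String cacheName) {
            published++;
            for (Consumer<String> l : new ArrayList<>(listeners)) l.accept(cacheName);
        }
    }

    // Stand-in for one instance's cache manager.
    static class CacheManager {
        private final Topic topic;
        int reloads = 0;

        CacheManager(Topic topic) {
            this.topic = topic;
            // The listener calls refreshLocal(); calling refresh() here would publish again.
            topic.subscribe(name -> refreshLocal());
        }

        void refresh() {                 // public entry point: reload, then notify others
            refreshLocal();
            topic.publish("tiqetsCatalog");
        }

        private void refreshLocal() {    // reload only, never publishes
            reloads++;
        }
    }

    public static void main(String[] args) {
        Topic topic = new Topic();
        CacheManager a = new CacheManager(topic);
        CacheManager b = new CacheManager(topic);
        a.refresh();
        System.out.println("published=" + topic.published + ", b.reloads=" + b.reloads);
    }
}
```

One call to refresh() produces exactly one published event. If the listener called refresh() instead, every delivery would publish again and the instances would keep triggering each other.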
5. Concurrency Fixes
5.1 Volatile Singleton Fields
Added volatile to singleton instance fields that use double-checked locking, and to other shared fields listed below, ensuring that writes are visible across threads:
| Class | Field |
|---|---|
| OdooCacheManager | instance |
| RaynaClientConfig | _instance |
| GoGlobalClientConfig | _instance |
| OdooServiceFactory | userName, userPwd |
| GroupManagerFacade | sysSession |
| TripOfferFacade | sysSession |
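For reference, a minimal sketch of double-checked locking with a volatile field. The class name is hypothetical and it does not reproduce any of the classes listed above.

```java
// Hypothetical singleton illustrating why the instance field must be volatile.
class ConfigHolderSketch {

    // Without volatile, another thread could observe a non-null reference
    // to an object whose constructor writes are not yet visible.
    private static volatile ConfigHolderSketch instance;

    private ConfigHolderSketch() { }

    static ConfigHolderSketch instance() {
        ConfigHolderSketch local = instance;          // one volatile read on the fast path
        if (local == null) {
            synchronized (ConfigHolderSketch.class) {
                local = instance;                     // re-check under the lock
                if (local == null) {
                    local = new ConfigHolderSketch();
                    instance = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        System.out.println(instance() == instance());
    }
}
```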
5.2 Distributed Locks for Business Operations
Replaced synchronized (key.intern()) with IMap.tryLock() for two critical business operations:
CartFacade: Payment Creation Lock
IMap<String, Boolean> lockMap = TlinqClusterCache.instance()
.getHazelcastInstance().getMap("distributedLocks");
boolean acquired = lockMap.tryLock("cart-payment-" + sessionId, 10, TimeUnit.SECONDS);
Prevents duplicate payment gateway requests for the same cart across instances.
BookingRequestFacade: Booking Confirmation Lock
Prevents duplicate supplier booking calls. Also reloads the booking status after acquiring the lock to detect double-confirmation.
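The reload-after-lock step can be sketched as follows. This is an illustration only: a local `ReentrantLock` stands in for the per-booking `IMap` lock, an in-memory map stands in for the database, and the 10-second wait and the `CONFIRMED` status value are assumptions rather than details from BookingRequestFacade.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class ConfirmOnceSketch {
    // Local stand-in: the real code locks per booking across instances.
    private final ReentrantLock lock = new ReentrantLock();
    // Stand-in for the booking status stored in the shared database.
    private final Map<String, String> statusByBooking = new ConcurrentHashMap<>();
    int supplierCalls = 0;

    /** Returns true only if this call performed the confirmation. */
    boolean confirm(String bookingId) {
        try {
            if (!lock.tryLock(10, TimeUnit.SECONDS)) {
                return false; // lock not acquired in time
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        try {
            // Reload the status after acquiring the lock: another caller may
            // have confirmed the booking while this one was waiting.
            if ("CONFIRMED".equals(statusByBooking.get(bookingId))) {
                return false;
            }
            supplierCalls++; // placeholder for the supplier booking call
            statusByBooking.put(bookingId, "CONFIRMED");
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

The lock alone only serializes callers. The status check inside the lock is what turns the second request into a no-op.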
6. Plugin Lifecycle Management
6.1 Ordered Shutdown
TlinqFrameworkInitializer tracks all initialized plugins in insertion order. On shutdown, plugins are stopped in reverse order (LIFO) to respect dependency ordering.
// TlinqFrameworkInitializer.shutdownAll()
for (int i = initializedPlugins.size() - 1; i >= 0; i--) {
    initializedPlugins.get(i).shutdown(); // reverse of initialization order
}
TlinqClusterCache.instance().shutdown(); // Hazelcast last
6.2 Plugin Shutdown Implementations
| Plugin | Shutdown Behavior |
|---|---|
| GoGlobalPlugin | Stops ScheduledExecutorService (30s grace period, then shutdownNow) |
| RaynaB2BActPlugin | Same pattern as GoGlobalPlugin |
| AbstractPlugin | Default no-op (other plugins inherit this) |
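The grace-period pattern in the table follows the standard `ExecutorService` shutdown idiom. A minimal sketch, with a hypothetical helper name and the grace period passed in as a parameter:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ExecutorShutdownSketch {

    /** Returns true if the executor terminated within the grace period. */
    static boolean stop(ScheduledExecutorService executor, long graceSeconds) {
        executor.shutdown(); // stop accepting new tasks, let running ones finish
        try {
            if (executor.awaitTermination(graceSeconds, TimeUnit.SECONDS)) {
                return true;
            }
            executor.shutdownNow(); // grace period elapsed: interrupt running tasks
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    public static void main(String[] args) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        System.out.println("clean stop: " + stop(executor, 30));
    }
}
```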
6.3 Server Integration
TQProApiServer shutdown hook calls TlinqFrameworkInitializer.shutdownAll() to ensure graceful cluster leave and resource cleanup.
7. Operational Infrastructure
7.1 Health Check Endpoints
Two new endpoints for load balancer integration:
| Endpoint | Purpose | Returns |
|---|---|---|
| POST /system/health | Deep health check | Database, Hazelcast cluster, plugin status |
| POST /system/ready | Readiness probe | Whether all plugins are initialized |
Both return HTTP 200 when healthy/ready, HTTP 503 when degraded/not ready. See System API Specification for details.
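The status rule can be sketched as a pure function over component checks. The class name and the component keys are hypothetical, and the real HealthApi may aggregate its checks differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HealthStatusSketch {

    // 200 when every component check passed, 503 otherwise.
    static int httpStatus(Map<String, Boolean> checks) {
        boolean allUp = checks.values().stream().allMatch(Boolean.TRUE::equals);
        return allUp ? 200 : 503;
    }

    public static void main(String[] args) {
        Map<String, Boolean> checks = new LinkedHashMap<>();
        checks.put("database", true);   // hypothetical component names
        checks.put("hazelcast", true);
        checks.put("plugins", false);
        System.out.println(httpStatus(checks));
    }
}
```

Because both endpoints use POST, the load balancer's probe has to be configured to send POST requests.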
7.2 Correlation ID Propagation
Every API request gets a correlation ID for distributed tracing:
- AuthenticationFilter reads X-Correlation-ID from request headers, or generates an 8-character UUID
- Stored as a request property and included in log messages
- CORSResponseFilter adds the correlation ID to the response headers
This enables end-to-end request tracing across load-balanced instances.
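A sketch of the read-or-generate step. It assumes the generated ID is the first eight characters of a random UUID, which is one plausible reading of "8-character UUID"; the class and method names are hypothetical.

```java
import java.util.UUID;

class CorrelationIdSketch {

    // headerValue is the raw X-Correlation-ID request header, or null if absent.
    static String resolve(String headerValue) {
        if (headerValue != null && !headerValue.trim().isEmpty()) {
            return headerValue.trim(); // reuse the caller's ID so the trace stays connected
        }
        // Assumption: take the first 8 hex characters of a random UUID.
        return UUID.randomUUID().toString().substring(0, 8);
    }

    public static void main(String[] args) {
        System.out.println(resolve("abc-123"));
        System.out.println(resolve(null).length());
    }
}
```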
7.3 Database Migration for Optimistic Locking
A SQL migration script (config/db-changes/add_version_columns.sql) adds version BIGINT DEFAULT 0 columns to 35+ entity tables. This prepares for JPA @Version optimistic locking.
Usage:
The migration is safe to run multiple times (ADD COLUMN IF NOT EXISTS). The @Version annotation on entity classes should be enabled after the migration is applied.
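Once @Version is enabled, an update succeeds only if the row still carries the version the caller read. JPA providers enforce this by including the version in the UPDATE statement's WHERE clause. The in-memory sketch below imitates that check without JPA, and the class is hypothetical.

```java
// In-memory imitation of a versioned row; not JPA code.
class VersionCheckSketch {
    private long version = 0;   // mirrors "version BIGINT DEFAULT 0"
    private String value = "";

    /** Applies the update only if expectedVersion matches the current version. */
    synchronized boolean update(long expectedVersion, String newValue) {
        if (expectedVersion != version) {
            return false; // stale read: a JPA provider would raise an optimistic lock failure here
        }
        value = newValue;
        version++;
        return true;
    }

    synchronized long version() { return version; }

    synchronized String value() { return value; }
}
```

Two instances that both read version 0 cannot both write: the first update moves the row to version 1 and the second is rejected, so it has to reload and retry.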
8. Configuration Checklist
8.1 Prerequisites for Multi-Instance Deployment
| Step | File | Action |
|---|---|---|
| 1 | config/hazelcast.xml | Add all instance IPs to <tcp-ip> section |
| 2 | config/db-changes/add_version_columns.sql | Run migration on the shared database |
| 3 | Load balancer | Configure health check pointing to POST /system/health |
| 4 | Load balancer | Configure readiness check pointing to POST /system/ready |
| 5 | config/tlinqapi.properties | Ensure dev-mode=false on all instances |
8.2 Monitoring
- HikariCP pools: Each pool logs metrics to JUL under com.zaxxer.hikari.pool.<PoolName>
- Hazelcast cluster: Logged at startup (Hazelcast cluster joined: N member(s))
- Scheduler locks: Logged when acquired/skipped (Another instance is running ... — skipping.)
- Cache invalidation: Logged on publish and receive
- Correlation IDs: All API log entries include [correlationId] prefix
9. Files Modified
New Files
| File | Purpose |
|---|---|
| config/hazelcast.xml | Hazelcast cluster configuration |
| config/db-changes/add_version_columns.sql | Database migration for optimistic locking |
| tqapi/.../HealthApi.java | Health and readiness endpoints |
Modified Files
| File | Changes |
|---|---|
| build.gradle.kts | Added hibernate-hikaricp dependency |
| hibernate.cfg.xml | HikariCP connection pool properties |
| NTSDBSession.java | Pool name: NTSPool |
| RaynaDBSession.java | Pool name: RaynaPool |
| TiqetsDBSession.java | Pool name: TiqetsPool |
| GoGlobalDBSession.java | Pool name: GoGlobalPool |
| AmdDBSession.java | Pool name: AmadeusPool |
| TlinqDBSession.java | Pool name: TlinqCommonPool |
| TlinqClusterCache.java | Rewritten: external config, pub/sub, shutdown |
| StaticMapCache.java | Distributed invalidation, notifyRemoteInvalidation |
| TlinqFrameworkInitializer.java | Plugin registry, shutdownAll, isInitialized |
| AbstractPlugin.java | Added shutdown() method |
| GoGlobalPlugin.java | Graceful executor shutdown |
| RaynaB2BActPlugin.java | Graceful executor shutdown |
| TQProApiServer.java | Shutdown hook calls shutdownAll() |
| GGRefreshRunner.java | IMap.tryLock distributed lock |
| SDRefreshRunner.java | IMap.tryLock distributed lock |
| TiqetsPlugin.java | IMap.tryLock distributed lock |
| TiqetsCacheManager.java | Distributed invalidation with loop prevention |
| RaynaCacheManager.java | Distributed invalidation with rebuild listener |
| CartFacade.java | IMap.tryLock for payment creation |
| BookingRequestFacade.java | IMap.tryLock for booking confirmation |
| AuthenticationFilter.java | Correlation ID generation |
| CORSResponseFilter.java | Correlation ID propagation |
| OdooCacheManager.java | volatile singleton |
| RaynaClientConfig.java | volatile singleton |
| GoGlobalClientConfig.java | volatile singleton |
| OdooServiceFactory.java | volatile fields |
| GroupManagerFacade.java | volatile field |
| TripOfferFacade.java | volatile field |
| api-roles.properties | Added system/health and system/ready |