
Horizontal Scalability Implementation

Status: COMPLETE

Completed: 2026-02-19

All scalability components have been implemented:

  • HikariCP connection pooling via Hibernate integration
  • Hazelcast cluster configuration with TCP-IP discovery
  • Distributed scheduling locks with auto-release on crash
  • Distributed cache invalidation with pub/sub
  • Plugin lifecycle management with ordered shutdown
  • Health check and readiness endpoints
  • Correlation ID propagation for request tracing
  • Database migration script for optimistic locking


Executive Summary

This document describes the implementation of horizontal scalability infrastructure for TQPro, enabling the platform to run as a multi-instance cluster behind a load balancer. The changes address 28 scalability findings across caching, scheduling, database connections, Hazelcast configuration, concurrency, and operational infrastructure.

1. Database Connection Pooling (HikariCP)

1.1 Overview

Replaced Hibernate's default connection handling with HikariCP connection pooling across all six database sessions. Each module gets its own named pool for monitoring and diagnostics.

1.2 Configuration

Dependency (build.gradle.kts):

"implementation"("org.hibernate.orm:hibernate-hikaricp:6.5.1.Final")

Hibernate config (tqapp/src/main/resources/hibernate.cfg.xml):

<property name="hibernate.connection.provider_class">
    org.hibernate.hikaricp.internal.HikariCPConnectionProvider
</property>
<property name="hibernate.hikari.minimumIdle">2</property>
<property name="hibernate.hikari.maximumPoolSize">10</property>
<property name="hibernate.hikari.idleTimeout">300000</property>
<property name="hibernate.hikari.connectionTimeout">20000</property>
<property name="hibernate.hikari.maxLifetime">1200000</property>

1.3 Pool Names

Each *DBSession.java class sets a unique pool name before building the session factory:

| Class | Pool Name | Module |
|---|---|---|
| NTSDBSession | NTSPool | tqapp |
| RaynaDBSession | RaynaPool | tqryb2b |
| TiqetsDBSession | TiqetsPool | tqtiqets |
| GoGlobalDBSession | GoGlobalPool | tqgglbl |
| AmdDBSession | AmadeusPool | tqamds |
| TlinqDBSession | TlinqCommonPool | tqcommon |
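
As a minimal sketch of how a *DBSession class might set its pool name, the helper below builds the per-module override as a plain Properties object. It assumes the name is passed through Hibernate's hibernate.hikari. property prefix, which forwards settings to HikariCP; PoolNameSketch and poolSettings are hypothetical names, and the real classes may set the name differently.

```java
import java.util.Properties;

// Hypothetical helper: builds the per-module override that would be applied
// on top of hibernate.cfg.xml before the session factory is built.
final class PoolNameSketch {
    static Properties poolSettings(String poolName) {
        Properties props = new Properties();
        // Assumption: keys under the "hibernate.hikari." prefix are handed to HikariCP,
        // so this becomes HikariCP's poolName setting.
        props.setProperty("hibernate.hikari.poolName", poolName);
        return props;
    }
}
```

With Hibernate's Configuration class, such overrides could be applied through configuration.addProperties(PoolNameSketch.poolSettings("NTSPool")) before buildSessionFactory() is called.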

1.4 Tuning

Default pool sizes (min 2, max 10) are suitable for typical workloads. For production, adjust per module based on observed query load:

<!-- Example: increase NTS pool for high-traffic deployments -->
<property name="hibernate.hikari.maximumPoolSize">20</property>
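
When raising pool sizes, keep in mind that every instance opens its own set of pools, so the worst-case connection count grows with both the number of pools and the number of instances. The rough calculation below assumes that all six pools point at the same database server (this document does not say so) and a three-instance cluster like the one in section 2.3. ConnectionBudgetSketch is a hypothetical name.

```java
// Worst-case connection count for a cluster, as simple arithmetic.
final class ConnectionBudgetSketch {
    // Worst case: every pool on every instance has grown to its maximum size.
    static int worstCaseConnections(int instances, int poolsPerInstance, int maxPoolSize) {
        return instances * poolsPerInstance * maxPoolSize;
    }
}
```

With the defaults this gives 3 × 6 × 10 = 180 connections. Raising a single pool to 20 on each instance adds another 30, so the database's connection limit should be checked before tuning upward.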

2. Hazelcast Cluster Configuration

2.1 Overview

Replaced Hazelcast's default multicast discovery with explicit TCP-IP discovery. Multicast is unreliable in cloud/container environments and was the root cause of instances running as isolated single-node clusters.

2.2 Configuration File

Location: config/hazelcast.xml (loaded from TLINQ_HOME at startup)

<hazelcast>
    <cluster-name>tqpro-cluster</cluster-name>
    <network>
        <join>
            <multicast enabled="false"/>
            <tcp-ip enabled="true">
                <member>127.0.0.1</member>
            </tcp-ip>
        </join>
    </network>

    <!-- Cart cache: session-scoped, TTL 1 hour, idle 30 min -->
    <map name="cartsCache">
        <time-to-live-seconds>3600</time-to-live-seconds>
        <max-idle-seconds>1800</max-idle-seconds>
        <eviction eviction-policy="LRU" max-size-policy="PER_NODE" size="10000"/>
    </map>

    <!-- Scheduler locks: for distributed scheduling -->
    <map name="schedulerLocks">
        <time-to-live-seconds>3600</time-to-live-seconds>
        <backup-count>1</backup-count>
    </map>

    <!-- Distributed locks: for cross-instance synchronization -->
    <map name="distributedLocks">
        <time-to-live-seconds>120</time-to-live-seconds>
        <backup-count>1</backup-count>
    </map>

    <!-- Cache invalidation topic -->
    <topic name="cacheInvalidation">
        <global-ordering-enabled>false</global-ordering-enabled>
    </topic>
</hazelcast>

2.3 Multi-Instance Deployment

To deploy multiple instances, add all member IPs/hostnames to the <tcp-ip> section:

<tcp-ip enabled="true">
    <member>10.0.1.10</member>
    <member>10.0.1.11</member>
    <member>10.0.1.12</member>
</tcp-ip>

2.4 TlinqClusterCache Changes

The TlinqClusterCache singleton was rewritten to:

  • Load hazelcast.xml from TLINQ_HOME (falls back to default config with warning)
  • Expose getHazelcastInstance() for direct Hazelcast operations
  • Provide invalidateCache(String) for publishing cache invalidation events
  • Provide onCacheInvalidation(Consumer<String>) for subscribing to invalidation events
  • Provide shutdown() for graceful cluster leave


3. Distributed Scheduling Locks

3.1 Problem

Three scheduled tasks (GoGlobal refresh, Rayna B2B refresh, Tiqets catalog refresh) run on every instance. Without coordination, all instances simultaneously call external APIs and rebuild caches, causing duplicate load and potential data races.

3.2 Solution

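Each scheduler uses IMap.tryLock() with a lease timeout so that only one instance executes the task. If the holding instance crashes, Hazelcast releases the lock when the member leaves the cluster. The lease timeout additionally bounds how long an instance that hangs but stays connected can keep the lock.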

3.3 Implementation Pattern

IMap<String, Long> lockMap = TlinqClusterCache.instance()
        .getHazelcastInstance().getMap("schedulerLocks");

// No wait (0s), lease for 55 min (< 1-hour refresh interval)
boolean acquired = lockMap.tryLock(LOCK_KEY, 0, TimeUnit.SECONDS, 55, TimeUnit.MINUTES);
if (!acquired) {
    logger.info("Another instance holds the lock — skipping.");
    return;
}
try {
    runRefresh();
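} finally {
    // Note: if runRefresh() outlives the 55-minute lease, the lock has already
    // expired and unlock() throws IllegalMonitorStateException.
    lockMap.unlock(LOCK_KEY);
}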

3.4 Affected Files

| File | Lock Key | Lease Time |
|---|---|---|
| GGRefreshRunner.java | GoGlobalRefresh | 55 min |
| SDRefreshRunner.java | RaynaB2BRefresh | 55 min |
| TiqetsPlugin.java | TiqetsRefresh | configuredInterval - 5 min |

All three fall back to local execution if Hazelcast is unavailable.
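
The lease times above follow one rule: the lease must be shorter than the refresh interval, so that a leftover lock never blocks the next scheduled run. The sketch below shows that calculation with the 5-minute margin used for TiqetsPlugin. LeaseTimeSketch and the decision to reject very short intervals are illustrative choices, not taken from the code base.

```java
import java.time.Duration;

// Derives a lock lease from a refresh interval (hypothetical helper).
final class LeaseTimeSketch {
    private static final Duration MARGIN = Duration.ofMinutes(5);

    static Duration leaseFor(Duration refreshInterval) {
        // An interval at or below the margin would yield a zero or negative lease.
        if (refreshInterval.compareTo(MARGIN) <= 0) {
            throw new IllegalArgumentException("refresh interval must exceed " + MARGIN);
        }
        return refreshInterval.minus(MARGIN);
    }
}
```

For a one-hour interval this yields the 55-minute lease listed for the GoGlobal and Rayna runners.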


4. Distributed Cache Invalidation

4.1 Architecture

Cache invalidation uses Hazelcast's ITopic pub/sub mechanism via a cacheInvalidation topic. When one instance refreshes a cache, it publishes an invalidation event. Other instances receive the event and rebuild their local caches from the shared database.

Instance A: refresh data → rebuild local cache → publish("cacheName")
Instance B: receives event → rebuild local cache from DB
Instance C: receives event → rebuild local cache from DB

4.2 StaticMapCache Integration

StaticMapCache provides two methods for distributed invalidation:

| Method | Behavior |
|---|---|
| invalidateDistributed(name) | Removes local cache AND publishes invalidation |
| notifyRemoteInvalidation(name) | Publishes invalidation WITHOUT removing locally |

Use notifyRemoteInvalidation() when the local cache was just rebuilt and only remote instances need to be notified. Use invalidateDistributed() when the local cache is also stale.
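
The difference between the two methods can be sketched with a plain in-memory class. This is not the StaticMapCache source: InvalidationSketch is a hypothetical stand-in, and the Consumer<String> passed to its constructor plays the role of the cluster publish call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical stand-in illustrating the two invalidation methods.
final class InvalidationSketch {
    private final Map<String, Object> localCaches = new ConcurrentHashMap<>();
    private final Consumer<String> publisher; // stand-in for publishing to the cluster

    InvalidationSketch(Consumer<String> publisher) {
        this.publisher = publisher;
    }

    void put(String name, Object cache) {
        localCaches.put(name, cache);
    }

    boolean hasLocal(String name) {
        return localCaches.containsKey(name);
    }

    // Local copy is stale too: drop it, then tell the other instances.
    void invalidateDistributed(String name) {
        localCaches.remove(name);
        publisher.accept(name);
    }

    // Local copy was just rebuilt: keep it and only tell the other instances.
    void notifyRemoteInvalidation(String name) {
        publisher.accept(name);
    }
}
```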

4.3 Rayna Tour Cache

RaynaCacheManager rebuilds its tour cache locally, then calls notifyRemoteInvalidation(TOURCACHE). Remote instances receive the event and call initTourCache() to rebuild from the shared database.

4.4 Tiqets Catalog Cache

TiqetsCacheManager uses a split refresh pattern to prevent infinite invalidation loops:

  • refresh() (public) — refreshes from DB, then publishes invalidation
  • refreshLocal() (private) — refreshes from DB without publishing

The distributed invalidation listener calls refreshLocal(), not refresh(), to avoid re-triggering the event on all instances.
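
The loop-prevention idea can be sketched without Hazelcast. In the example below an in-memory Bus stands in for the cacheInvalidation topic and, as a simplification, delivers each event to every member except the sender. All class names and the cache name "tiqetsCatalog" are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LoopFreeRefreshSketch {
    // In-memory stand-in for the invalidation topic.
    static final class Bus {
        private final List<CacheManager> members = new ArrayList<>();

        void join(CacheManager member) {
            members.add(member);
        }

        void publish(CacheManager sender, String cacheName) {
            for (CacheManager member : members) {
                if (member != sender) {
                    member.onInvalidation(cacheName);
                }
            }
        }
    }

    static final class CacheManager {
        private final Bus bus;
        int reloads = 0;

        CacheManager(Bus bus) {
            this.bus = bus;
            bus.join(this);
        }

        // Public entry point: reload, then notify the other instances.
        public void refresh() {
            refreshLocal();
            bus.publish(this, "tiqetsCatalog");
        }

        // Listener path: reload only. Calling refresh() here would publish again
        // from every receiver and the instances would invalidate each other forever.
        void onInvalidation(String cacheName) {
            refreshLocal();
        }

        private void refreshLocal() {
            reloads++; // the reload from the shared database would happen here
        }
    }
}
```

After one call to refresh() on a single instance, every instance has reloaded exactly once and no further events are in flight.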


5. Concurrency Fixes

5.1 Volatile Singleton Fields

Added volatile to singleton instance fields that use double-checked locking, ensuring visibility across threads:

| Class | Field |
|---|---|
| OdooCacheManager | instance |
| RaynaClientConfig | _instance |
| GoGlobalClientConfig | _instance |
| OdooServiceFactory | userName, userPwd |
| GroupManagerFacade | sysSession |
| TripOfferFacade | sysSession |
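
For reference, the double-checked locking idiom that these fields rely on looks roughly like this. ConfigHolderSketch is a hypothetical class used only to illustrate the pattern; it is not one of the classes listed above.

```java
final class ConfigHolderSketch {
    // volatile ensures that a thread which sees a non-null reference also sees
    // a fully constructed object (safe publication under the Java Memory Model).
    private static volatile ConfigHolderSketch instance;

    private ConfigHolderSketch() { }

    static ConfigHolderSketch getInstance() {
        ConfigHolderSketch local = instance;          // one volatile read on the fast path
        if (local == null) {
            synchronized (ConfigHolderSketch.class) {
                local = instance;                     // re-check under the lock
                if (local == null) {
                    local = new ConfigHolderSketch();
                    instance = local;
                }
            }
        }
        return local;
    }
}
```

Without volatile, the unsynchronized first check may observe a reference to an object whose fields are not yet visible to the reading thread.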

5.2 Distributed Locks for Business Operations

Replaced synchronized (key.intern()) with IMap.tryLock() for two critical business operations:

CartFacade — Payment Creation Lock

IMap<String, Boolean> lockMap = TlinqClusterCache.instance()
        .getHazelcastInstance().getMap("distributedLocks");
boolean acquired = lockMap.tryLock("cart-payment-" + sessionId, 10, TimeUnit.SECONDS);

Prevents duplicate payment gateway requests for the same cart across instances.

BookingRequestFacade — Booking Confirmation Lock

boolean acquired = lockMap.tryLock("booking-confirm-" + bkReqId, 15, TimeUnit.SECONDS);

Prevents duplicate supplier booking calls. Also reloads the booking status after acquiring the lock to detect double-confirmation.
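
The check-after-lock step can be sketched as follows. To keep the example self-contained, a per-key ReentrantLock replaces the Hazelcast map lock and an in-memory map replaces the booking table, so this shows the control flow only, not the cross-instance behavior. ConfirmOnceSketch and the "CONFIRMED" status value are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

final class ConfirmOnceSketch {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>(); // local stand-in for distributedLocks
    private final Map<String, String> statusStore = new ConcurrentHashMap<>();  // stand-in for the booking table
    final AtomicInteger supplierCalls = new AtomicInteger();

    boolean confirm(String bkReqId) {
        ReentrantLock lock = locks.computeIfAbsent("booking-confirm-" + bkReqId, k -> new ReentrantLock());
        boolean acquired;
        try {
            acquired = lock.tryLock(15, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (!acquired) {
            return false; // another caller is still confirming this booking
        }
        try {
            // Reload the status AFTER acquiring the lock: a competing caller may
            // have confirmed the booking while this one was waiting.
            if ("CONFIRMED".equals(statusStore.get(bkReqId))) {
                return false;
            }
            supplierCalls.incrementAndGet(); // the supplier booking call would happen here
            statusStore.put(bkReqId, "CONFIRMED");
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```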


6. Plugin Lifecycle Management

6.1 Ordered Shutdown

TlinqFrameworkInitializer tracks all initialized plugins in insertion order. On shutdown, plugins are stopped in reverse order (LIFO) to respect dependency ordering.

// TlinqFrameworkInitializer.shutdownAll()
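// Walk the registry backwards so the most recently initialized plugin stops first
for (int i = initializedPlugins.size() - 1; i >= 0; i--) {
    initializedPlugins.get(i).shutdown();
}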
TlinqClusterCache.instance().shutdown(); // Hazelcast last
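
A self-contained sketch of the same reverse-order idea is shown below. It also isolates failures so that one plugin throwing during shutdown does not stop the rest. Whether the real shutdownAll() does this is not stated here, and ShutdownOrderSketch and Stoppable are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ShutdownOrderSketch {
    interface Stoppable {
        void shutdown();
    }

    private final List<Stoppable> initialized = new ArrayList<>();

    void register(Stoppable plugin) {
        initialized.add(plugin); // insertion order = initialization order
    }

    void shutdownAll() {
        for (int i = initialized.size() - 1; i >= 0; i--) {
            try {
                initialized.get(i).shutdown();
            } catch (RuntimeException e) {
                // One failing plugin should not prevent the remaining ones from stopping.
                System.err.println("Plugin shutdown failed: " + e);
            }
        }
        initialized.clear();
    }
}
```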

6.2 Plugin Shutdown Implementations

| Plugin | Shutdown Behavior |
|---|---|
| GoGlobalPlugin | Stops ScheduledExecutorService (30s grace period, then shutdownNow) |
| RaynaB2BActPlugin | Same pattern as GoGlobalPlugin |
| AbstractPlugin | Default no-op (other plugins inherit this) |
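
The "grace period, then shutdownNow" behavior corresponds to the standard two-phase executor shutdown. The sketch below is an illustration under that assumption, not the plugin source; ExecutorShutdownSketch is a hypothetical name and the grace period is a parameter so that the 30-second value can be passed in.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class ExecutorShutdownSketch {
    // Returns true if all tasks finished within the grace period.
    static boolean stop(ExecutorService executor, long grace, TimeUnit unit) {
        executor.shutdown();                        // stop accepting new tasks
        try {
            if (executor.awaitTermination(grace, unit)) {
                return true;
            }
            executor.shutdownNow();                 // interrupt tasks that are still running
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();     // preserve the interrupt for the caller
            return false;
        }
    }
}
```

A plugin's shutdown() could then call ExecutorShutdownSketch.stop(scheduler, 30, TimeUnit.SECONDS).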

6.3 Server Integration

TQProApiServer shutdown hook calls TlinqFrameworkInitializer.shutdownAll() to ensure graceful cluster leave and resource cleanup.


7. Operational Infrastructure

7.1 Health Check Endpoints

Two new endpoints for load balancer integration:

| Endpoint | Purpose | Returns |
|---|---|---|
| POST /system/health | Deep health check | Database, Hazelcast cluster, plugin status |
| POST /system/ready | Readiness probe | Whether all plugins are initialized |

Both return HTTP 200 when healthy/ready, HTTP 503 when degraded/not ready. See System API Specification for details.
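
A monitoring script or smoke test could probe these endpoints with the JDK's built-in HTTP client (Java 11+). The sketch below only builds the request and interprets the status code. The base URL is a placeholder, and sending an empty body with no authentication headers is an assumption that may not match the actual API.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class HealthProbeSketch {
    // Builds a POST request with an empty body for the given endpoint path.
    static HttpRequest probe(String baseUrl, String path) {
        return HttpRequest.newBuilder(URI.create(baseUrl + path))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // 200 means healthy/ready; 503 (or anything else) is treated as not available.
    static boolean isUp(int statusCode) {
        return statusCode == 200;
    }
}
```

The request can be sent with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding()), passing the resulting statusCode() to isUp().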

7.2 Correlation ID Propagation

Every API request gets a correlation ID for distributed tracing:

  1. AuthenticationFilter reads X-Correlation-ID from the request headers; if the header is absent, it generates an 8-character ID derived from a UUID
  2. The ID is stored as a request property and included in log messages
  3. CORSResponseFilter adds the correlation ID to the response headers

This enables end-to-end request tracing across load-balanced instances.
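
The header-or-generate step can be sketched as a small pure function. How the filter actually shortens the UUID is not specified here, so taking the first eight hexadecimal characters is an assumption, and CorrelationIdSketch is a hypothetical name.

```java
import java.util.UUID;

final class CorrelationIdSketch {
    static String resolve(String headerValue) {
        if (headerValue != null && !headerValue.isBlank()) {
            return headerValue.trim();   // reuse the caller's ID so traces span services
        }
        // Assumption: the 8-character ID is the first 8 hex characters of a random UUID.
        return UUID.randomUUID().toString().substring(0, 8);
    }
}
```

An 8-character hexadecimal ID has 32 bits of randomness, which is enough to tell concurrent requests apart in logs but not enough to serve as a globally unique identifier.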

7.3 Database Migration for Optimistic Locking

A SQL migration script (config/db-changes/add_version_columns.sql) adds version BIGINT DEFAULT 0 columns to 35+ entity tables. This prepares for JPA @Version optimistic locking.

Usage:

psql -U tlinq -d tlinq -f config/db-changes/add_version_columns.sql

The migration is safe to run multiple times (ADD COLUMN IF NOT EXISTS). The @Version annotation on entity classes should be enabled after the migration is applied.
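
To illustrate what the version column is for, the in-memory sketch below mimics the compare-and-increment check that optimistic locking performs on update. It is a plain-Java simulation rather than JPA code; VersionedRowSketch is a hypothetical name.

```java
final class VersionedRowSketch {
    private String value;
    private long version = 0;   // mirrors "version BIGINT DEFAULT 0"

    VersionedRowSketch(String value) {
        this.value = value;
    }

    synchronized long version() {
        return version;
    }

    synchronized String value() {
        return value;
    }

    // Mirrors an UPDATE that sets version = version + 1 only WHERE version = expectedVersion.
    // Returns false when another writer got there first; a JPA provider would
    // report that situation as an optimistic locking failure.
    synchronized boolean update(String newValue, long expectedVersion) {
        if (version != expectedVersion) {
            return false;
        }
        value = newValue;
        version++;
        return true;
    }
}
```

Two instances that both read version 0 can both attempt an update, but only the first succeeds; the second must reload the row and retry.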


8. Configuration Checklist

8.1 Prerequisites for Multi-Instance Deployment

| Step | File | Action |
|---|---|---|
| 1 | config/hazelcast.xml | Add all instance IPs to <tcp-ip> section |
| 2 | config/db-changes/add_version_columns.sql | Run migration on the shared database |
| 3 | Load balancer | Configure health check pointing to POST /system/health |
| 4 | Load balancer | Configure readiness check pointing to POST /system/ready |
| 5 | config/tlinqapi.properties | Ensure dev-mode=false on all instances |

8.2 Monitoring

  • HikariCP pools: Each pool logs metrics to JUL under com.zaxxer.hikari.pool.<PoolName>
  • Hazelcast cluster: Logged at startup (Hazelcast cluster joined: N member(s))
  • Scheduler locks: Logged when acquired/skipped (Another instance is running ... — skipping.)
  • Cache invalidation: Logged on publish and receive
  • Correlation IDs: All API log entries include [correlationId] prefix

9. Files Modified

New Files

| File | Purpose |
|---|---|
| config/hazelcast.xml | Hazelcast cluster configuration |
| config/db-changes/add_version_columns.sql | Database migration for optimistic locking |
| tqapi/.../HealthApi.java | Health and readiness endpoints |

Modified Files

| File | Changes |
|---|---|
| build.gradle.kts | Added hibernate-hikaricp dependency |
| hibernate.cfg.xml | HikariCP connection pool properties |
| NTSDBSession.java | Pool name: NTSPool |
| RaynaDBSession.java | Pool name: RaynaPool |
| TiqetsDBSession.java | Pool name: TiqetsPool |
| GoGlobalDBSession.java | Pool name: GoGlobalPool |
| AmdDBSession.java | Pool name: AmadeusPool |
| TlinqDBSession.java | Pool name: TlinqCommonPool |
| TlinqClusterCache.java | Rewritten: external config, pub/sub, shutdown |
| StaticMapCache.java | Distributed invalidation, notifyRemoteInvalidation |
| TlinqFrameworkInitializer.java | Plugin registry, shutdownAll, isInitialized |
| AbstractPlugin.java | Added shutdown() method |
| GoGlobalPlugin.java | Graceful executor shutdown |
| RaynaB2BActPlugin.java | Graceful executor shutdown |
| TQProApiServer.java | Shutdown hook calls shutdownAll() |
| GGRefreshRunner.java | IMap.tryLock distributed lock |
| SDRefreshRunner.java | IMap.tryLock distributed lock |
| TiqetsPlugin.java | IMap.tryLock distributed lock |
| TiqetsCacheManager.java | Distributed invalidation with loop prevention |
| RaynaCacheManager.java | Distributed invalidation with rebuild listener |
| CartFacade.java | IMap.tryLock for payment creation |
| BookingRequestFacade.java | IMap.tryLock for booking confirmation |
| AuthenticationFilter.java | Correlation ID generation |
| CORSResponseFilter.java | Correlation ID propagation |
| OdooCacheManager.java | volatile singleton |
| RaynaClientConfig.java | volatile singleton |
| GoGlobalClientConfig.java | volatile singleton |
| OdooServiceFactory.java | volatile fields |
| GroupManagerFacade.java | volatile field |
| TripOfferFacade.java | volatile field |
| api-roles.properties | Added system/health and system/ready |