
Multi-Tenancy Execution Plan

Date: 2026-04-14
Audience: Claude Code (implementation)
Scope: Step-by-step execution of the multi-tenancy architecture for TQPro. This plan is executable: each task names concrete files, concrete changes, and a verification command.


1. Reference Plans

This plan implements and sharpens the design captured in prior planning documents. Read these in full before starting:

| Source | Role |
| --- | --- |
| doc/plans/multitenancy.md | Master design (41 KB): architecture, strategy rationale, phase outline. This execution plan extends it with concrete tasks and fixes the shortcomings identified in review. |
| doc/plans/management/staff-management-plan.md | Staff management + Keycloak admin; sequenced AFTER Phase 0 of this plan. Must be built tenant-aware. |
| doc/plans/whatsapp/gaps.md | WhatsApp tech debt + future features; Phase 5 here supersedes the WhatsApp tenant-routing question. |

All other files under doc/plans/whatsapp/ and doc/plans/multitenancy.md will be archived once implementation is complete; this plan remains alive until its own tasks are closed.


2. Architectural Decisions (Committed)

Decisions taken before execution begins. These shape every downstream task.

ID Decision Rationale
D-1 Multi-tenancy Phase 0 ships before any staff-management-plan work. Avoids retrofit. Staff-management is built tenant-aware on day one.
D-2 Database-per-tenant on a single PostgreSQL instance. Preserves @Table(schema=...) annotations, PG-level isolation, trivial provisioning via pg_dump/pg_restore.
D-3 Realm-per-tenant within one Keycloak installation. Two clients per realm: tqweb-adm (public) and tqpro-admin-api (confidential, service account). Complete user isolation, independent themes/IdPs, admin-REST calls naturally scoped.
D-4 Platform service account tqpro-platform-admin in the master realm gets create-realm only — NOT master-realm admin. Narrow blast radius. Subsequent operations on a realm use that realm's own tqpro-admin-api service account. Fixes shortcoming #1 from plan review.
D-5 Raise PostgreSQL max_connections only — no pgbouncer in this plan. Simple, single concern. Revisit when tenant count exceeds ~30. Fixes shortcoming #2.
D-6 TenantRegistry is an in-memory cache populated at startup. Refresh is push-only (explicit refresh() after onboarding / deprovisioning). Stale reads are preferred over 503s if tqplatform is briefly unreachable. Addresses shortcoming #3 — tqplatform DB is not a per-request dependency.
D-7 TQPRO_ENCRYPTION_KEY lives in a manually-managed env var / property file (mode 0600), matching the existing TQPro secrets pattern. Rotation procedure is documented in the WhatsApp security runbook (extended). No AWS SSM, no Vault. Fixes shortcoming #4.
D-8 Subdomain routing is the committed frontend tenant-resolution design (<tenant>.tourlinq.com). A pre-login tenant-selector fallback is explicitly out of scope. Fixes shortcoming #6.
D-9 WhatsApp webhook → tenant routing keyed by phone_number_id via a new tqplatform.wa_phone_routing table. Fixes shortcoming #7.
D-10 Deprovisioned tenants: keep DB forever, disable (not delete) Keycloak realm. No data ever lost, audit history preserved. Addresses shortcoming #8.
D-11 Manual admin-triggered onboarding only in Phase 1. CLI + internal admin API. No public signup page. Addresses shortcoming #9.
D-12 Realm roles (guest, agent, manager, finance, admin) are identical across every tenant — this is a design constraint, not a feature. Per-tenant custom roles are not supported. Addresses shortcoming #10.
D-13 Tenant-owned domain data (including CStaffMember and the CTeamMember → CStaffMember migration) lives in each tenant's DB independently. Platform DB tqplatform holds only cross-tenant registry/routing. Addresses shortcoming #11.
D-14 Gateway nginx is the sole TLS terminator and forks routing into two upstream pools: /tlinq-api/* → 1..N stateless Jetty hosts; /* → 1..N web nginx hosts (each serving the static tqweb-adm SPA from /opt/tqpro/tqweb-adm). Upstream pools are defined ONCE in /etc/nginx/conf.d/tqpro-upstreams.conf (rendered from templates/upstreams.conf.template by scripts/render-upstreams.sh using upstream.api/upstream.web CSV lists in tourlinq.properties) — every per-tenant gateway server block references them by name (tqpro_api, tqpro_web). Scaling either tier = edit the CSV + re-run the renderer + reload nginx; existing tenants pick up the change with no per-tenant rework. Provisioning runs from an internal orchestration host (typically the API server) in the protected subnet — never from the gateway itself. The gateway is in the DMZ and holds no DB credentials, no platform-admin token, and no TQPro process: its only role is TLS termination + nginx proxy. All gateway changes (vhost install, nginx reload, certbot) and all web-host changes go via SSH to a narrow-NOPASSWD-sudo tqpro-deploy user on each remote host; API hosts need NO per-tenant changes — Jetty learns about new tenants from the platform DB via TenantRegistry.refresh(). Supersedes the earlier "script runs on the gateway" model — that would have given any DMZ intrusion direct access to the platform DB and provisioning.
D-15 Per-tenant DB schema migrations are tracked via a public.schema_migrations ledger table inside each tenant DB. Each row: filename, applied_at, checksum. The platform DB does NOT hold a schema_version column. Flyway is the planned successor for production; this in-repo ledger is transitional but sufficient for Phase 1.
D-16 TenantRegistry.refresh() propagates across cluster nodes via a Hazelcast ITopic<String> (message body = tenant_id or "*" for full reload). Every JVM subscribes at startup; publishing happens after every provisioning / deprovisioning / config change. TlinqClusterCache is already the Hazelcast host — reuse its HazelcastInstance. No manual per-node refresh endpoint.
D-17 The Python WhatsApp service splits into two deployables: tqwhatsapp (webhook handler, FastAPI — stays) and tqwhatsapp-worker (new — retry + cleanup + broadcast-dispatch consumer). Shared modules (config, db, tenant, clients) live in tqwhatsapp/app/ and both services import them. The handler scales on inbound-webhook latency; the worker scales on tenant count. Both services read the platform registry and maintain per-tenant connection pools.
D-18 Per-tenant property overrides live in nts.system_settings in each tenant's DB (NOT in tqplatform.tenant.config JSONB for business props). TenantConfig reads the table at first access, caches per-JVM. Settings-page audit log (nts.settings_audit_log) sits alongside. Tenant-DB cloning naturally carries settings. tqplatform.tenant.config JSONB is reserved for platform-admin-only metadata (plan tier, feature flags set by platform admin, region, compliance flags). Supersedes the settings-page-plan's AppConfig.setProp() design.
D-19 The settings-page-plan (doc/plans/gaps/settings-page-plan.md) is folded into multi-tenancy Phase 3 as a single delivery. Phase 3 now delivers: TenantConfig + SystemSettingsFacade + SystemSettingsApi + admin UI (tqweb-adm/settings.html). Plugin config singletons' ## substitution caching is removed; every plugin reads TenantConfig.get(key) at use time. Settings API is tenant-admin scoped (reads/writes only the caller's tenant). Supersedes settings-page-plan Phases 1-4.

3. Open Decisions (Defer, Do Not Block)

These are flagged for later. They do NOT block execution of this plan.

| ID | Decision pending | When to decide |
| --- | --- | --- |
| OD-1 | Whether to introduce pgbouncer and at what tenant count | When tenant count approaches 30 or pool exhaustion is observed |
| OD-2 | Self-service signup (public /platform/tenant/signup endpoint) | When business plans public SaaS signup |
| OD-3 | Per-tenant deprovisioning policy (archive / delete after N days) | If/when D-10 "keep forever" creates real storage pressure |
| OD-4 | Per-tenant custom realm roles | If/when a tenant contract requires it |

4. Roles and Responsibilities

Every task in this plan belongs to one of two actors. The plan never mixes them — if a task appears to, the split is implicit and must be read as two handoffs.

| Actor | Where they work | What they do | Delivery |
| --- | --- | --- | --- |
| Claude Code | Developer workstation | Writes Java/Python source, SQL migration files, shell/Python scripts, nginx templates, config examples, test code, and documentation under doc/. Runs ./gradlew build and pytest locally. Produces pull requests. | The repo is the handoff artifact. Claude Code never touches a production, staging, or shared host. |
| Ops | Gateway host, DB host, Keycloak host, TeamCity | Applies SQL migrations, edits /etc/nginx/** / /etc/postgresql/** / /etc/letsencrypt/**, runs certbot, restarts systemd units, creates DNS records, executes tenant-provision.sh, runs ./gradlew deploy. | Changes on real hosts, verified against the runbooks in doc/operations/. |

Every phase task below is labeled:

  • Artifact (repo): file(s) Claude Code creates or modifies. Verifiable on a laptop.
  • Runbook: documentation Claude Code adds under doc/operations/ so ops knows what to do with the artifact.
  • Ops execution: host-level work performed by ops after the PR merges. NOT done by Claude Code. Verified against the runbook.

Every phase ends with a two-part acceptance gate:

  • Repo gate — everything Claude Code can verify on a laptop: build passes, unit tests pass, nginx -t against a sample-rendered template, shellcheck on scripts, psql syntax-checking SQL with \i in a dry-run mode.
  • Deployment gate — ops-owned, performed after merge in staging first, then production. Includes smoke tests, certbot dry-runs, end-to-end provisioning against a second tenant, etc.

The repo gate must pass before the PR merges. The deployment gate is a separate ops checklist; merging the PR does not automatically mean the phase is live.


5. Prerequisites

Status: COMPLETED on 2026-04-14 (TQ-115). All four prerequisites have been satisfied. The detailed Keycloak walkthrough that ops actually executed is captured under P0.8's runbook content (see "Keycloak setup" section spec). Re-running the prerequisites is idempotent — the steps below remain authoritative for any new TQPro Keycloak deployment (staging clone, future on-prem install, etc.).

Before Task P0.1, verify:

  • [x] master realm of the Keycloak instance is accessible with a platform-admin-capable credential. Created tqpro-platform-admin confidential client (service account enabled, realm role create-realm assigned, client secret recorded). Do NOT grant master-realm admin. Note: create-realm is a master-realm REALM role, not a client role on master-realm — filter "by realm roles" when assigning. The Service accounts roles tab will also show the auto-added default-roles-master composite (no admin scope; harmless).
  • [x] Seed realm tqpro-adm brought into line with the new pattern: missing realm roles manager and finance added (joining existing guest, agent, admin); new confidential client tqpro-admin-api created with service account enabled and manage-users + view-users on realm-management (NOT manage-realm); its client secret recorded for later encrypted insertion into tqplatform.tenant.kc_admin_client_secret for the seed row.
  • [x] PostgreSQL instance: current max_connections documented. P0.2 raises to 500.
  • [x] 32-byte TQPRO_ENCRYPTION_KEY generated (python3 -c "import secrets; print(secrets.token_hex(32))") and stored in /etc/tqpro/tqpro.env (mode 0600, owned by the service user). NOT committed to git.
  • [x] Verification: POST /admin/realms succeeds (HTTP 201) with the platform service account; GET /admin/realms/master/users?max=1 returns 403 (proves the account is NOT a master-realm admin).
  • [x] Clean working branch identified (dev); every phase lands as a separate PR under TQ-115 child tickets.

Important note for KeycloakRealmProvisioner (P1.3): Keycloak auto-grants the realm-management roles for a newly created realm to the creator's principal — but the existing access token does not reflect those new claims. Every multi-step admin sequence must re-fetch the platform service-account token immediately after POST /admin/realms succeeds, before any subsequent POST /admin/realms/${realm}/... calls. Same applies to Phase 8 deprovisioning batches that call PUT /admin/realms/${realm} — always start with a fresh token.


6. Phase 0 — Foundation

Goal: Tenant identity flows from JWT through the entire request lifecycle to the DB session, with a single existing tenant (the current tlinq DB + tqpro-adm realm) proving the plumbing. Zero behavioral change for end users.

Effort: 3–4 weeks (revised — original 2-3 week estimate undersized P0.6 fan-out across 6 DB-session classes and the Hazelcast cluster-refresh wiring). Risk: Low-Medium. Depends on: Prerequisites done.

P0.1 — Platform database and tenant registry

New migration location: Platform-DB migrations are kept separate from tenant-DB migrations.

  • config/db-changes/*.sql: per-tenant schema changes (current location, unchanged semantics).
  • config/db-changes/platform/*.sql: platform-DB only (NEW subdirectory).

Rationale: the two run against different databases with different lifecycles. Mixing them would require every migration file to carry a target marker. A directory split is self-documenting.

Create: config/db-changes/platform/0001-tqplatform-schema.sql — the platform DB has its own numbering starting at 0001 (independent of the tenant-DB numbering that is now at 0071).

-- Separate database 'tqplatform' — NOT a schema in 'tlinq'.
-- Applied by scripts/apply-platform-migrations.sh against the platform DB only.
CREATE TABLE IF NOT EXISTS tenant (
    tenant_id                 VARCHAR(36) PRIMARY KEY,
    tenant_code               VARCHAR(50) UNIQUE NOT NULL,
    tenant_name               VARCHAR(200) NOT NULL,
    db_name                   VARCHAR(63) NOT NULL,
    db_user                   VARCHAR(63),
    db_pass                   VARCHAR(200),
    kc_realm                  VARCHAR(63) NOT NULL UNIQUE,
    kc_admin_client_secret    VARCHAR(200),
    status                    VARCHAR(20) DEFAULT 'ACTIVE'
                              CHECK (status IN ('ACTIVE', 'SUSPENDED', 'DEPROVISIONED')),
    config                    JSONB DEFAULT '{}',
    created_at                TIMESTAMPTZ DEFAULT NOW(),
    deprovisioned_at          TIMESTAMPTZ
);

-- kc_realm needs no separate index: its UNIQUE constraint already creates one.
CREATE INDEX IF NOT EXISTS idx_tenant_status   ON tenant(status);

-- Platform DB's own migration ledger (same shape as tenant DB ledger per D-15).
CREATE TABLE IF NOT EXISTS schema_migrations (
    filename     VARCHAR(255) PRIMARY KEY,
    applied_at   TIMESTAMPTZ DEFAULT NOW(),
    checksum     VARCHAR(64)
);

-- Greenfield installs leave tqplatform.tenant empty. The first tenant is
-- onboarded via scripts/tenant-provision.sh just like every subsequent one.
-- Environments migrating from a single-tenant install run a one-line INSERT
-- documented in doc/operations/multitenancy-setup.md "Appendix A".

-- WhatsApp webhook routing: phone_number_id -> tenant_id. Consulted by tqwhatsapp on every webhook.
CREATE TABLE IF NOT EXISTS wa_phone_routing (
    phone_number_id  VARCHAR(64) PRIMARY KEY,
    tenant_id        VARCHAR(36) NOT NULL REFERENCES tenant(tenant_id),
    created_at       TIMESTAMPTZ DEFAULT NOW()
);

-- Record this migration as applied.
INSERT INTO schema_migrations (filename) VALUES ('0001-tqplatform-schema.sql')
ON CONFLICT (filename) DO NOTHING;

Note on per-tenant schema_migrations ledger (D-15): The existing tlinq DB (seed tenant) does not yet have a public.schema_migrations table. P1.2 handles bootstrap of that ledger with all 71 current migrations recorded as applied.
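The ledger's checksum column is what lets a migration runner notice that an already-applied file was edited afterwards. The plan does not name the algorithm; the sketch below assumes SHA-256 rendered as lower-case hex, which happens to fit VARCHAR(64) exactly. The class and method names (MigrationChecksum, sha256Hex, hasDrifted) are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch only: SHA-256 is an assumption, the plan does not specify the checksum algorithm. */
final class MigrationChecksum {

    /** Lower-case hex SHA-256 of the migration file's bytes (always 64 characters). */
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant Java platform ships SHA-256, so this is not expected.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    /** True when a recorded checksum no longer matches the file content on disk. */
    static boolean hasDrifted(String recordedChecksum, byte[] currentContent) {
        return recordedChecksum != null && !recordedChecksum.equals(sha256Hex(currentContent));
    }
}
```

A runner could refuse to continue when hasDrifted returns true for any row in schema_migrations, since that indicates an applied migration was modified in the repo.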

Operational step (manual): createdb tqplatform && psql tqplatform -f config/db-changes/platform/0001-tqplatform-schema.sql. Document in doc/operations/multitenancy-setup.md (create in P0.8).

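Verify: psql tqplatform -c "SELECT * FROM tenant;" succeeds and returns zero rows on a greenfield install (or the seed row once the Appendix A INSERT has been run on an environment migrating from single-tenant); SELECT * FROM schema_migrations; shows the 0001-tqplatform-schema.sql entry.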

P0.2 — PostgreSQL connection tuning

File: doc/operations/multitenancy-setup.md — document the change; actual edits happen on the DB host by ops.

# postgresql.conf
max_connections = 500           # was 100
shared_buffers = 2GB            # rule of thumb: 25% of RAM
work_mem = 8MB

Restart PostgreSQL required. Include this in the same maintenance window as the tqplatform DB creation.

Verify: psql -c "SHOW max_connections;" returns 500.

P0.3 — Tenant registry Java code (tqcommon)

Create:

  • tqcommon/src/main/java/com/perun/tlinq/tenant/TenantInfo.java — immutable DTO: tenantId, tenantCode, tenantName, dbName, dbUser, dbPass, kcRealm, kcAdminClientSecret, status, config (as JsonNode).
  • tqcommon/src/main/java/com/perun/tlinq/tenant/TenantRegistry.java — singleton. Loads all status='ACTIVE' tenants from tqplatform.tenant at startup into ConcurrentHashMap<String, TenantInfo> byId and a parallel byRealm and byCode map. Exposes:
      • getById(String), getByRealm(String), getByCode(String): all return TenantInfo or null
      • listActive(): returns Collection<TenantInfo>
      • refresh(): re-queries tqplatform.tenant, replaces maps atomically; logs and retains stale map if query fails. Also publishes a Hazelcast topic event (tenant-registry-refresh, payload = "*") so peer nodes reload.
      • refreshOne(String tenantId): targeted refresh for a single tenant; publishes the same topic with the tenant_id as payload.
      • requireById(String): throws UnknownTenantException (new runtime exception) if not found
  • tqcommon/src/main/java/com/perun/tlinq/tenant/UnknownTenantException.java
  • tqcommon/src/main/java/com/perun/tlinq/tenant/PlatformDbConfig.java — loads platform DB connection info via AppConfig.getInstance().getProp(...) for keys platform.db.url, platform.db.user, platform.db.pass in tourlinq.properties (the application-level config, not the Jetty-specific tlinqapi.properties). Dedicated small HikariCP pool (max 2 connections) separate from tenant pools.
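The "replace maps atomically; retain stale map if query fails" behaviour of the registry can be sketched as follows. This is a minimal illustration, not the planned implementation: it swaps one snapshot object through a volatile field rather than using the three ConcurrentHashMaps named above, and TenantRow, RegistrySketch, and the Supplier-based loader are hypothetical stand-ins for TenantInfo, TenantRegistry, and the platform-DB query.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical, trimmed-down stand-in for TenantInfo (lookup keys only). */
final class TenantRow {
    final String tenantId, tenantCode, kcRealm;
    TenantRow(String tenantId, String tenantCode, String kcRealm) {
        this.tenantId = tenantId; this.tenantCode = tenantCode; this.kcRealm = kcRealm;
    }
}

/** Sketch of swap-on-success, keep-stale-on-failure refresh. */
final class RegistrySketch {
    /** All three lookup maps travel together so readers never see a half-updated state. */
    private static final class Snapshot {
        final Map<String, TenantRow> byId = new HashMap<>();
        final Map<String, TenantRow> byRealm = new HashMap<>();
        final Map<String, TenantRow> byCode = new HashMap<>();
    }

    private final Supplier<List<TenantRow>> loader;   // stands in for the tqplatform.tenant query
    private volatile Snapshot current = new Snapshot();

    RegistrySketch(Supplier<List<TenantRow>> loader) { this.loader = loader; }

    /** Returns true if the maps were swapped, false if the stale maps were kept. */
    boolean refreshLocal() {
        try {
            Snapshot next = new Snapshot();
            for (TenantRow t : loader.get()) {
                next.byId.put(t.tenantId, t);
                next.byRealm.put(t.kcRealm, t);
                next.byCode.put(t.tenantCode, t);
            }
            current = next;   // one volatile write publishes the fully built snapshot
            return true;
        } catch (RuntimeException e) {
            System.err.println("WARN registry refresh failed, serving stale data: " + e.getMessage());
            return false;
        }
    }

    TenantRow getById(String id)       { return current.byId.get(id); }
    TenantRow getByRealm(String realm) { return current.byRealm.get(realm); }
    TenantRow getByCode(String code)   { return current.byCode.get(code); }
    Collection<TenantRow> listActive() { return Collections.unmodifiableCollection(current.byId.values()); }
}
```

Because a failed load never reaches the assignment to current, requests keep resolving tenants from the previous snapshot while the platform DB is unreachable, which is the trade-off D-6 commits to.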

Cluster refresh via Hazelcast (D-16):

Two distinct topics are used across the plan. Both are subscribed in TenantRegistry at startup. The HazelcastInstance is acquired via TlinqClusterCache.instance().getHazelcastInstance() (TlinqClusterCache uses instance() — not getInstance() — and exposes the handle via getHazelcastInstance()).

| Topic | Payload | Publishers | Subscriber action |
| --- | --- | --- | --- |
| tenant-registry-refresh | tenant_id or "*" | Provisioning (P1.3), deprovisioning (P8.1), reactivation (P8.2), suspend/activate (P1.4) | TenantRegistry.refreshLocal() re-queries platform DB; if payload is a specific tenant_id AND the tenant's post-refresh status is non-ACTIVE (or missing entirely), also call TenantSessionRegistry.evictAll(tenantId) to drop all Hibernate pools on this node |
| tenant-config-refresh | tenant_id | SystemSettingsFacade after every settings write (P3.5) | TenantConfig.reload(tenantId) invalidates that tenant's property cache on this node |

Single-JVM dev mode (Hazelcast not running) degrades gracefully: refresh() calls the local handlers directly and logs a WARN.

Why subscribe in TenantRegistry (not in each subsystem): Centralising all multi-tenant state transitions at one subscriber keeps evict-order deterministic (registry first, then pools, then config) and avoids split-brain where one topic delivers but another doesn't.

Subscriber handlers (private methods — they do NOT re-publish, avoiding loops):

private void onTenantRegistryRefresh(String payload) {
    String tenantIdBefore = "*".equals(payload) ? null : payload;   // null-safe comparison
    TenantInfo before = tenantIdBefore != null ? byId.get(tenantIdBefore) : null;
    refreshLocal();   // re-query tqplatform.tenant, swap maps
    if (tenantIdBefore != null) {
        TenantInfo after = byId.get(tenantIdBefore);
        boolean goneOrInactive = after == null || !"ACTIVE".equals(after.getStatus());
        if (goneOrInactive && before != null) {
            TenantSessionRegistry.evictAll(tenantIdBefore);
        }
    }
}

private void onTenantConfigRefresh(String tenantId) {
    TenantConfig.reload(tenantId);     // no platform-DB hit; just invalidate cache
}

refresh() and refreshOne() do: (1) call the local handler directly, (2) publish to peers. The publish-after-local ordering ensures the node that triggered the change is always consistent, even if its publish fails.
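The publish-after-local ordering can be sketched without Hazelcast by treating both steps as callbacks. RefreshFanOut and its two Consumer fields are hypothetical names; in the real class the second callback would wrap the topic publish, and the first would be the private subscriber handler.

```java
import java.util.function.Consumer;

/** Sketch: apply locally first, then notify peers; a failed publish never undoes the local update. */
final class RefreshFanOut {
    private final Consumer<String> localHandler;    // e.g. the registry-refresh subscriber handler
    private final Consumer<String> peerPublisher;   // hypothetical wrapper around the topic publish

    RefreshFanOut(Consumer<String> localHandler, Consumer<String> peerPublisher) {
        this.localHandler = localHandler;
        this.peerPublisher = peerPublisher;
    }

    /** Returns true if peers were notified, false if only this node was refreshed. */
    boolean refresh(String payload) {
        localHandler.accept(payload);            // (1) this node is consistent from here on
        try {
            peerPublisher.accept(payload);       // (2) best-effort fan-out to other nodes
            return true;
        } catch (RuntimeException e) {
            System.err.println("WARN peer publish failed, local state already refreshed: " + e.getMessage());
            return false;
        }
    }
}
```

Reversing the two steps would allow a window in which peers have reloaded but the node that made the change still serves the old state, so the order matters more than it first appears.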

Modify: config/tourlinq.properties — append (these are application-level settings loaded by AppConfig, not Jetty / API-server settings, so they belong in tourlinq.properties rather than tlinqapi.properties):

# Platform registry DB — holds tenant registry and WhatsApp phone routing
platform.db.url=jdbc:postgresql://localhost:5432/tqplatform
platform.db.user=tqpro_platform
platform.db.pass=

Verify: New unit test tqcommon/src/test/java/com/perun/tlinq/tenant/TenantRegistryTest.java with an in-memory H2 platform DB seeded with one tenant — assert getById and getByRealm return it; refresh() picks up a new INSERT.

P0.4 — Tenant context in RequestContext

Modify: tqcommon/src/main/java/com/perun/tlinq/util/RequestContext.java

  • Add a private final String tenantId; field.
  • Update the constructor: RequestContext(userId, userName, userEmail, correlationId, tenantId).
  • Add the getter getTenantId().
  • Update the current()/set() thread-local pattern to carry tenantId.

Modify all call sites that construct RequestContext. These are limited to:

  • tqapi/src/main/java/com/perun/tlinq/AuthenticationFilter.java
  • Any test helpers under tqapi/src/test/ and tqapp/src/test/

Backward compatibility: Keep a deprecated 4-arg constructor that defaults tenantId to the seed-tenant ID — used by existing tests until they are updated. Javadoc the deprecated overload with @deprecated Will be removed in Phase 1 once all call sites migrate; track remaining call sites in the PR description and remove the overload as the final Phase 1 cleanup task. (Grep shows only a handful of call sites today — mostly AuthenticationFilter and test helpers.)
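A minimal sketch of the deprecated overload, assuming the seed-tenant ID is the UUID used for dev-tenant-id elsewhere in this plan. RequestContextSketch and the SEED_TENANT_ID constant are illustrative names only; the real class keeps its existing thread-local plumbing.

```java
/** Sketch only: shows the 5-arg constructor plus the transitional 4-arg overload. */
final class RequestContextSketch {
    /** Assumed constant; the plan uses this UUID as the seed tenant's dev default. */
    static final String SEED_TENANT_ID = "00000000-0000-0000-0000-000000000001";

    private final String userId;
    private final String userName;
    private final String userEmail;
    private final String correlationId;
    private final String tenantId;

    RequestContextSketch(String userId, String userName, String userEmail,
                         String correlationId, String tenantId) {
        this.userId = userId;
        this.userName = userName;
        this.userEmail = userEmail;
        this.correlationId = correlationId;
        this.tenantId = tenantId;
    }

    /** @deprecated Will be removed in Phase 1 once all call sites migrate. */
    @Deprecated
    RequestContextSketch(String userId, String userName, String userEmail, String correlationId) {
        this(userId, userName, userEmail, correlationId, SEED_TENANT_ID);
    }

    String getTenantId() { return tenantId; }
}
```

Compiling with deprecation warnings enabled then lists the remaining 4-arg call sites, which gives the PR description its tracking list for free.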

Verify: ./gradlew build compiles; existing tests pass.

P0.5 — Multi-realm JWT validation

Current state (context, before edits): OIDCConfig has fields issuer, clientId, jwksUri, endSessionEndpoint, rolesClaim, authMode. tlinqapi.properties has oidc-issuer=https://dev-auth.vanevski.net/realms/tqpro-adm, oidc-client-id=tqweb-adm, etc. AuthenticationFilter hardcodes the realm tqpro-adm at one location (logout URL composition).

Modify: tqapi/src/main/java/com/perun/tlinq/oidc/OIDCConfig.java

  • Introduce (net-new field) String keycloakBaseUrl, loaded from a new property oidc-keycloak-base-url in tlinqapi.properties (e.g. https://dev-auth.vanevski.net).
  • Deprecate the single-realm issuer field; it is removed once the compatibility fallback is dropped. Migration: on startup, if oidc-keycloak-base-url is unset, derive it from oidc-issuer by stripping /realms/<realm> and log a WARN urging config migration.
  • clientId stays (same across all realms: tqweb-adm by design).
  • jwksUri and endSessionEndpoint become per-realm values derived at use time: ${keycloakBaseUrl}/realms/${realm}/protocol/openid-connect/certs and .../logout.
  • The per-realm issuer is derived: ${keycloakBaseUrl}/realms/${realm}

Modify: config/tlinqapi.properties

# New — base Keycloak URL (no /realms/... suffix). Supersedes oidc-issuer for multi-realm.
oidc-keycloak-base-url=https://dev-auth.vanevski.net
# Dev mode tenant default (seed tenant UUID). Only consulted when dev-mode=true.
dev-tenant-id=00000000-0000-0000-0000-000000000001
Leave oidc-issuer in place as a compatibility fallback for one release; document removal in a follow-up.

Modify: tqapi/src/main/java/com/perun/tlinq/oidc/JWKSManager.java

  • Internal map ConcurrentHashMap<String, JWKSource<SecurityContext>> keyed by realm name
  • getJWKSource(realmName): lazy, thread-safe creation

Modify: tqapi/src/main/java/com/perun/tlinq/oidc/JWTValidator.java

validateToken(token) performs these steps:

  1. Parse the JWT (unverified) to extract the iss claim.
  2. Extract the realm from the issuer by stripping the ${keycloakBaseUrl}/realms/ prefix; reject the token if the issuer does not match the base URL.
  3. Master-realm special case: if realm.equals("master"), validate against the master realm's JWKS and skip TenantRegistry.getByRealm. Return a ValidatedToken with tenantId=null and roles from realm_access.roles (the caller is a platform admin). Downstream, AuthenticationFilter only allows null-tenant requests on /platform/* endpoints (P1.4 authorization layer).
  4. Otherwise, call TenantRegistry.getByRealm(realm) and throw TokenValidationException("Unknown realm") if it returns null.
  5. Get or create the ConfigurableJWTProcessor for that realm (internal map similar to JWKSManager).
  6. Verify signature and claims via the processor.
  7. Extract userId (sub), email, name, and roles (realm_access.roles) as today.
  8. Return a ValidatedToken with tenantId = tenant.getTenantId() (or null for the master realm per step 3).
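Step 2 of validateToken (issuer to realm) is the point where an attacker-controlled iss claim meets configuration, so a small sketch may help. IssuerRealm is a hypothetical helper, and the allowed realm-name character set is an assumption made here to reject path tricks; the real validator may apply different rules.

```java
import java.util.Optional;
import java.util.regex.Pattern;

/** Sketch: derive the realm name from an unverified iss claim, or reject it. */
final class IssuerRealm {
    // Assumption: realm names contain only letters, digits, '_' and '-'.
    private static final Pattern REALM_NAME = Pattern.compile("[A-Za-z0-9_-]+");

    static Optional<String> extract(String keycloakBaseUrl, String issuer) {
        if (keycloakBaseUrl == null || issuer == null) {
            return Optional.empty();
        }
        // Tolerate a trailing slash in configuration.
        String base = keycloakBaseUrl.endsWith("/")
                ? keycloakBaseUrl.substring(0, keycloakBaseUrl.length() - 1)
                : keycloakBaseUrl;
        String prefix = base + "/realms/";
        if (!issuer.startsWith(prefix)) {
            return Optional.empty();          // issuer is not under our Keycloak base URL
        }
        String realm = issuer.substring(prefix.length());
        // Rejects empty names and anything with extra path segments or query strings.
        return REALM_NAME.matcher(realm).matches() ? Optional.of(realm) : Optional.empty();
    }
}
```

An empty result maps to the "reject" branch of step 2; a result of "master" feeds the special case in step 3, and anything else goes to the registry lookup in step 4.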

Modify: tqapi/src/main/java/com/perun/tlinq/oidc/ValidatedToken.java — add tenantId field and getter.

Modify: tqapi/src/main/java/com/perun/tlinq/AuthenticationFilter.java

  • Extract tenantId from:
      1. OIDC path: ValidatedToken.getTenantId()
      2. oauth2-proxy header path: read the X-Tenant-ID header; reject if missing on authenticated (non-guest) requests
      3. Internal API key path (the WhatsApp shared-secret): read the X-Tenant-ID header; reject if missing (internal callers MUST identify the tenant)
      4. Dev mode: read dev-tenant-id from tlinqapi.properties (defaults to the seed tenant ID for local dev)
  • Construct RequestContext with tenantId.
  • On any path, if tenantId cannot be resolved for an authenticated request, respond with HTTP 403 and body {"error":"tenant-unresolved"}.
  • Fix hardcoded logout-URL realm (line ~255): the current code composes the Keycloak logout URL using a hardcoded /realms/tqpro-adm/... path. Replace with:

String realm = TenantRegistry.requireById(ctx.getTenantId()).getKcRealm();
String logoutUrl = oidcConfig.getKeycloakBaseUrl() + "/realms/" + realm
                 + "/protocol/openid-connect/logout";
Every non-seed tenant currently gets a 404 on logout until this is fixed.
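The four tenant-resolution paths in AuthenticationFilter reduce to a small decision function. The sketch below is illustrative only: AuthPath and TenantResolutionSketch are hypothetical names, the header map is assumed to be case-insensitive (as HTTP header names are), and the master-realm exception for /platform/* endpoints is deliberately left out.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical label for which authentication path admitted the request. */
enum AuthPath { OIDC, OAUTH2_PROXY, INTERNAL_API_KEY, DEV_MODE }

/** Sketch of per-path tenant resolution; not the real filter. */
final class TenantResolutionSketch {

    static Optional<String> resolve(AuthPath path, String tokenTenantId,
                                    Map<String, String> headers, String devTenantId) {
        switch (path) {
            case OIDC:
                return nonBlank(tokenTenantId);               // from ValidatedToken
            case OAUTH2_PROXY:
            case INTERNAL_API_KEY:
                return nonBlank(headers.get("X-Tenant-ID"));  // caller must identify the tenant
            case DEV_MODE:
                return nonBlank(devTenantId);                 // dev-tenant-id property
            default:
                return Optional.empty();
        }
    }

    /** 403 for an authenticated request whose tenant could not be resolved. */
    static int statusFor(Optional<String> tenantId) {
        return tenantId.isPresent() ? 200 : 403;
    }

    private static Optional<String> nonBlank(String s) {
        return (s == null || s.trim().isEmpty()) ? Optional.empty() : Optional.of(s.trim());
    }
}
```

Keeping the resolution in one pure function makes the 403 rule easy to unit-test separately from servlet plumbing.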

Verify:

  • Unit test: valid JWT with known realm → ValidatedToken.tenantId is set correctly
  • Unit test: valid JWT with unknown realm → TokenValidationException
  • Integration test: GET /tlinq-api/health with valid JWT returns 200; the RequestContext in a test-only echo endpoint reports the right tenantId

P0.6 — Tenant-aware DB sessions (the critical change)

Scope — all SIX DB-session classes must be converted (single static SessionFactory today, one per class — verified via find . -name '*DBSession.java'):

| Class | Module | Path | Responsibility |
| --- | --- | --- | --- |
| TlinqDBSession.java | tqcommon | tqcommon/src/main/java/com/perun/tlinq/util/ | Shared base-schema session |
| NTSDBSession.java | tqapp | tqapp/src/main/java/com/perun/tlinq/client/nts/db/ | Primary nts.* entities |
| AmdDBSession.java | tqamds | tqamds/src/main/java/com/perun/tlinq/client/amadeus/db/ | Amadeus cache/lookup |
| RaynaDBSession.java | tqryb2b | tqryb2b/src/main/java/com/perun/tlinq/client/ryb2b/util/ | Rayna B2B cache/lookup |
| GoGlobalDBSession.java | tqgglbl | tqgglbl/src/main/java/com/perun/tlinq/client/goglobal/util/ | GoGlobal cache/lookup |
| TiqetsDBSession.java | tqtiqets | tqtiqets/src/main/java/com/perun/tlinq/client/tiqets/db/ | Tiqets cache/lookup |

Every one of them holds a private static SessionFactory _factory; static initializer today and exposes a public static getSession(). Each must be converted to the same per-tenant pattern described below. Omitting any one of them creates a silent cross-tenant leak on the paths that touch that plugin.

Extract common logic: Create tqcommon/src/main/java/com/perun/tlinq/tenant/TenantAwareDBSession.java — an abstract base that encapsulates the factory map, lazy build, evict, and null-tenant guard. Each of the 6 classes becomes a thin subclass that supplies its annotated-class list and pool name, AND self-registers with TenantSessionRegistry (see below).

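import com.perun.tlinq.util.RequestContext;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public abstract class TenantAwareDBSession {
    // One SessionFactory (and therefore one Hikari pool) per tenant, built lazily.
    private final ConcurrentHashMap<String, SessionFactory> factories = new ConcurrentHashMap<>();

    /** Every subclass singleton registers itself for central eviction (see TenantSessionRegistry). */
    protected TenantAwareDBSession() {
        TenantSessionRegistry.register(this);
    }

    protected abstract List<Class<?>> annotatedClasses();
    // One prefix per subclass: "TLINQ", "NTS", "AMD", "RYB2B", "GGLBL"
    // (plus one for TiqetsDBSession; "TIQETS" is an assumed name).
    protected abstract String poolPrefix();

    public final Session openSession() {
        // Guard against a missing context as well as a missing tenant: background
        // threads that never passed through AuthenticationFilter may have neither.
        RequestContext ctx = RequestContext.current();
        String tenantId = (ctx != null) ? ctx.getTenantId() : null;
        if (tenantId == null) {
            throw new IllegalStateException(
                "No tenant in RequestContext; wrap background tasks in TenantScope.run");
        }
        SessionFactory sf = factories.computeIfAbsent(tenantId, this::buildFactory);
        return sf.openSession();
    }

    private SessionFactory buildFactory(String tenantId) {
        TenantInfo t = TenantRegistry.requireById(tenantId);
        Configuration cfg = new Configuration();     // fresh each time; never Configuration.copy()
        annotatedClasses().forEach(cfg::addAnnotatedClass);
        cfg.setProperty("hibernate.connection.url", jdbcUrlFor(t));
        cfg.setProperty("hibernate.connection.username", t.getDbUser());
        cfg.setProperty("hibernate.connection.password", TenantConfig.decrypt(t.getDbPass()));
        cfg.setProperty("hibernate.hikari.minimumIdle", "2");
        cfg.setProperty("hibernate.hikari.maximumPoolSize", "5");
        cfg.setProperty("hibernate.hikari.poolName", poolPrefix() + "-" + t.getDbName());
        // + other non-URL base properties from the old static initializer
        return cfg.buildSessionFactory();
    }

    /** Drops and closes one tenant's factory; a later openSession() rebuilds it on demand. */
    public final void evictFactory(String tenantId) {
        SessionFactory sf = factories.remove(tenantId);
        if (sf != null) sf.close();
    }

    /** Remove-then-close per tenant, so no caller can be handed an already-closed factory. */
    public final void evictAll() {
        factories.keySet().forEach(this::evictFactory);
    }
}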

Each subclass (e.g. NTSDBSession extends TenantAwareDBSession) defines annotatedClasses() — keeping the long class list that exists today — and poolPrefix(). The public static getSession() method stays the same signature for call-site compatibility; it now delegates to an instance.

Do not use Configuration.copy() — that method does not exist. The buildFactory above constructs a new Configuration each time.

P0.6.a — Central eviction registry

Create: tqcommon/src/main/java/com/perun/tlinq/tenant/TenantSessionRegistry.java — a process-wide registry of TenantAwareDBSession instances. Each subclass self-registers in its constructor (or static initializer).

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public final class TenantSessionRegistry {
    // Written once per subclass at load time, read on every eviction: copy-on-write fits.
    private static final List<TenantAwareDBSession> sessions = new CopyOnWriteArrayList<>();

    private TenantSessionRegistry() {}   // static-only utility; not instantiable

    public static void register(TenantAwareDBSession s) { sessions.add(s); }

    /** Evict a single tenant from every registered session factory. Used on deprovision. */
    public static void evictAll(String tenantId) {
        sessions.forEach(s -> s.evictFactory(tenantId));
    }

    /** Optional health check support: iterate all (tenantId, session) combos with a "SELECT 1". */
    public static List<TenantAwareDBSession> listSessions() { return List.copyOf(sessions); }
}

The abstract TenantAwareDBSession constructor calls TenantSessionRegistry.register(this). Every subclass instance (singleton per class) is registered exactly once at load time. Phase 8's deprovisioning calls TenantSessionRegistry.evictAll(tenantId) to fan out across all 6 DB sessions without knowing them by name. This also means plugins added in the future automatically participate.

P0.6.b — Minimal TenantConfig.decrypt() stub (forward-compat for P3)

The buildFactory example above calls TenantConfig.decrypt(t.getDbPass()). TenantConfig is formally introduced in Phase 3, but a minimal stub ships in Phase 0 to avoid a forward reference:

Create (Phase 0): tqcommon/src/main/java/com/perun/tlinq/tenant/TenantConfig.java — initial version contains only:

public final class TenantConfig {
    /** AES-256-GCM decrypt for credentials stored in tqplatform.tenant.
     *  Phase 0 scope: just decrypt(). Phase 3 (P3.1) expands with per-tenant property overrides. */
    public static String decrypt(String cipher) {
        if (cipher == null || !cipher.startsWith("encrypted:")) return cipher;
        byte[] key = loadKeyFromEnv();   // TQPRO_ENCRYPTION_KEY
        // base64-decode, split nonce|ct, AES/GCM/NoPadding
        ...
    }
    private static byte[] loadKeyFromEnv() { ... }
}

Phase 3 (P3.1) will expand this class with static Properties platformDefaults, static ConcurrentHashMap<String, Properties> tenantOverrides, and the get(key) / reload(tenantId) methods. Do not delete or replace the decrypt() method — Phase 3 only adds to it.
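The elided body of decrypt() could look roughly like the sketch below. It is not the project's implementation: the class name CredentialCipherSketch, the 12-byte nonce, the 128-bit tag, and the choice to pass the key as a parameter (instead of reading TQPRO_ENCRYPTION_KEY) are all assumptions made so the example is self-contained. An encrypt() counterpart is included only so the round trip can be exercised.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative only. Assumed wire format: "encrypted:" + base64(nonce || ciphertext+tag). */
final class CredentialCipherSketch {
    private static final String PREFIX = "encrypted:";
    private static final int NONCE_LEN = 12;  // assumption: 96-bit GCM nonce
    private static final int TAG_BITS = 128;  // assumption: full-length GCM tag

    static String encrypt(String plain, byte[] key) {
        try {
            byte[] nonce = new byte[NONCE_LEN];
            new SecureRandom().nextBytes(nonce);          // fresh nonce per message
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                   new GCMParameterSpec(TAG_BITS, nonce));
            byte[] ct = c.doFinal(plain.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[nonce.length + ct.length];
            System.arraycopy(nonce, 0, out, 0, nonce.length);
            System.arraycopy(ct, 0, out, nonce.length, ct.length);
            return PREFIX + Base64.getEncoder().encodeToString(out);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    static String decrypt(String cipher, byte[] key) {
        // Same passthrough rule as the stub: values without the prefix are returned unchanged.
        if (cipher == null || !cipher.startsWith(PREFIX)) return cipher;
        try {
            byte[] raw = Base64.getDecoder().decode(cipher.substring(PREFIX.length()));
            byte[] nonce = Arrays.copyOfRange(raw, 0, NONCE_LEN);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                   new GCMParameterSpec(TAG_BITS, nonce));
            // Everything after the nonce is ciphertext followed by the authentication tag.
            byte[] plain = c.doFinal(raw, NONCE_LEN, raw.length - NONCE_LEN);
            return new String(plain, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }
}
```

Because GCM authenticates the ciphertext, a wrong key or a tampered value surfaces as an exception rather than as garbage plaintext, which is the behavior one would want for stored credentials.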

Create: tqcommon/src/main/java/com/perun/tlinq/tenant/TenantScope.java

// Wraps a Runnable with a RequestContext for a given tenant — for background jobs,
// Hazelcast listeners, scheduled maintenance tasks that don't run under a JAX-RS filter.
public class TenantScope {
    public static void run(String tenantId, Runnable r) { ... }
    public static <T> T call(String tenantId, Callable<T> c) throws Exception { ... }
}
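One possible shape for the elided TenantScope bodies is sketched below. The real RequestContext is not shown in this plan, so RequestContextSketch is a hypothetical thread-local stand-in, and the class names are illustrative. The point of the sketch is the save/restore discipline: the previous tenant is restored in a finally block so nested scopes and pooled threads do not leak a tenant.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for the real RequestContext: one tenant id per thread. */
final class RequestContextSketch {
    private static final ThreadLocal<String> TENANT = new ThreadLocal<>();

    /** Fails loudly when no tenant is bound, mirroring the getSession() requirement. */
    static String currentTenant() {
        String id = TENANT.get();
        if (id == null) throw new IllegalStateException("no tenant in RequestContext");
        return id;
    }

    /** Binds tenantId (or clears the binding when null) and returns the previous value. */
    static String swap(String tenantId) {
        String previous = TENANT.get();
        if (tenantId == null) TENANT.remove(); else TENANT.set(tenantId);
        return previous;
    }
}

final class TenantScopeSketch {
    static void run(String tenantId, Runnable r) {
        String previous = RequestContextSketch.swap(tenantId);
        try {
            r.run();
        } finally {
            RequestContextSketch.swap(previous);   // always restore, even on exception
        }
    }

    static <T> T call(String tenantId, Callable<T> c) throws Exception {
        String previous = RequestContextSketch.swap(tenantId);
        try {
            return c.call();
        } finally {
            RequestContextSketch.swap(previous);
        }
    }
}
```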

Verify:

  • Compile: ./gradlew build
  • Unit test: getSession() with no RequestContext throws IllegalStateException
  • Unit test: TenantScope.run("00000000-...-001", () -> { ... NTSDBSession.getSession() ... }) works
  • Integration test: start TQPro with seed tenant, hit an entity endpoint; existing single-tenant behavior is preserved

P0.7 — Frontend tenant resolution via subdomain (D-8)

Current state: AuthApi already has POST /auth/config. Today it returns {enabled, authority, clientId} (or {enabled:false} if OIDC is disabled). tqweb-adm/js/modules/oidc-config.js already fetches from this endpoint — no hardcoded issuer in the frontend. This task therefore only modifies existing code paths; it does not introduce new ones.

Modify: tqapi/src/main/java/com/perun/tlinq/api/AuthApi.java

  • POST /auth/config now reads the Host header to derive the subdomain (acme-travel.tourlinq.com → tenantCode acme-travel).
  • Resolve via TenantRegistry.getByCode(tenantCode) → TenantInfo.
  • New response shape (additive; existing fields stay):

{
  "enabled": true,
  "authority": "${keycloakBaseUrl}/realms/${tenant.kcRealm}",
  "clientId": "tqweb-adm",
  "tenantName": "Acme Travel LLC",
  "tenantCode": "acme-travel"
}

  • If tenant not found: return 404 with {"error":"unknown-tenant","tenantCode":"acme"}.
  • Fallback: if Host lacks a subdomain (e.g. bare tourlinq.com or localhost), serve the seed tenant for backward compatibility. This fallback is gated on dev-mode=true; in production mode, return 404.
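The Host-to-tenant-code derivation could be sketched as below. The class and method names are hypothetical, and the rules it applies (strip an optional port, require exactly one label in front of the platform domain, lower-case everything) are assumptions about what AuthApi should accept, not confirmed behavior. An empty result corresponds to the "no subdomain" fallback case.

```java
import java.util.Locale;
import java.util.Optional;

final class TenantHostSketch {
    /**
     * Returns the single-label subdomain of platformDomain found in the Host header,
     * or Optional.empty() when the host is the bare domain, localhost, or unrelated.
     */
    static Optional<String> tenantCode(String hostHeader, String platformDomain) {
        if (hostHeader == null || hostHeader.isBlank()) return Optional.empty();
        String host = hostHeader.trim().toLowerCase(Locale.ROOT);
        int colon = host.lastIndexOf(':');
        if (colon >= 0) host = host.substring(0, colon);   // drop ":port" (no IPv6 literals assumed)
        String suffix = "." + platformDomain.toLowerCase(Locale.ROOT);
        if (!host.endsWith(suffix)) return Optional.empty();
        String label = host.substring(0, host.length() - suffix.length());
        // Reject empty labels and nested subdomains such as a.b.tourlinq.com.
        if (label.isEmpty() || label.contains(".")) return Optional.empty();
        return Optional.of(label);
    }
}
```

Passing the platform domain as a parameter keeps the same logic usable for tourlinq.com in production and tourlinq.local in local development.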

Modify frontend:

  • tqweb-adm/js/modules/oidc-config.js already calls /tlinq-api/auth/config; just start using the new tenantName and tenantCode fields (stash in window.TQPro.tenant for Phase 6 branding).
  • tqweb-adm/callback.html: no change required; it reuses oidc-config.js.

Nginx: See dedicated section P0.9 below. The gateway terminates TLS and proxies directly to Jetty (D-14); certbot issues one certificate per tenant subdomain.

Local dev: Add tqpro-adm.tourlinq.local and at least one test tenant such as acme-travel.tourlinq.local to /etc/hosts for local testing. Documented in doc/operations/multitenancy-setup.md.

Verify:

  • curl -H "Host: tqpro-adm.tourlinq.com" http://localhost:11080/tlinq-api/auth/config returns the right issuer URL
  • curl -H "Host: unknown.tourlinq.com" ... returns 404 with unknown-tenant
  • Browser navigation to https://acme-travel.tourlinq.com (once a second tenant exists in P1) routes to that tenant's realm

P0.8 — Operations documentation

Create: doc/operations/multitenancy-setup.md — the consolidated install runbook for ops. Sections must include:

Keycloak setup (one-time, ops-performed)

This runbook captures the manual Keycloak configuration that the Prerequisites section (§5) tracks as completed. Future installs (staging clones, on-prem deployments) re-execute it. Document each part with the exact admin-console click path and verification curls. Sections to cover:

  • Part A: Master realm, platform service account tqpro-platform-admin
      • Create confidential client; capability config (service accounts only, no Standard/Direct flow)
      • Assign create-realm REALM role (NOT a client role on master-realm); explicitly call out that the UI filter must be set to "Filter by realm roles"
      • Capture client secret to /etc/tqpro/tqpro.env as TQPRO_PLATFORM_ADMIN_SECRET (mode 0600)
      • Document that default-roles-master will appear auto-assigned alongside create-realm (harmless, no admin scope)
      • Explicitly forbid: assigning admin realm role, or any role on master-realm client
  • Part B: Seed realm tqpro-adm, align to the two-client pattern
      • Add missing realm roles manager, finance (joining existing guest, agent, admin)
      • Create tqpro-admin-api confidential client (service accounts on)
      • Assign manage-users + view-users on realm-management (NOT manage-realm)
      • Capture client secret for later encrypted insertion into tqplatform.tenant.kc_admin_client_secret for the seed row
      • Confirm existing tqweb-adm client unchanged
  • Part C: Verification curls (POST /admin/realms succeeds, GET /admin/realms/master/users returns 403, DELETE on the test realm needs a re-fetched token; that re-fetch behavior IS the auto-grant signal that proves the model works for Phase 8 deprovisioning)
  • Part D: Secret recording matrix (which secret goes where, when, with what encryption)
  • Part E: Pre-Phase-0 checklist (the one in §5)

Other ops setup sections

  • Platform DB bootstrap (P0.1 script execution, initial seed tenant row); platform-DB migration directory layout (config/db-changes/platform/); seed-tenant schema_migrations ledger bootstrap (P1.2.b)
  • PostgreSQL max_connections tuning (P0.2)
  • TQPRO_ENCRYPTION_KEY generation and /etc/tqpro/tqpro.env placement (mode 0600)
  • nginx bootstrap (P0.9): install certbot, create /var/www/certbot, convert seed tenant from the old single-server config to a rendered vhost, issue its cert (dry-run → real)
  • Per-tenant DNS prerequisite checklist (A record + propagation verification before provisioning)
  • Local-dev /etc/hosts entries for subdomain routing
  • Smoke test: start TQPro with seed tenant, verify existing behavior

Cross-reference: Phase 9 (P9.3) finalizes this same file; treat P0.8's version as the working draft that ops uses during Phase 0 deployment, and P9.3's version as the canonical post-rollout reference.

P0.9 — Reverse Proxy Configuration (Gateway-Only, D-14)

Goal: The gateway nginx is the sole TLS terminator and proxies traffic directly to Jetty (which serves both the /tlinq-api/ endpoints and the static tqweb-adm content via its content-location property). The internal nginx-websrv hop is dropped for new tenant vhosts. Every tenant provisioning event writes a new nginx vhost, obtains a Let's Encrypt certificate via HTTP-01, and reloads nginx.

Current nginx layout (baseline)

| File | Role | Today | Phase 0 change |
|---|---|---|---|
| config/Nginx Config/nginx-gw.conf | Public-facing gateway: SSL termination, reverse proxy. server_name webdev.vanevski.net, single tenant. | Single tenant only. | Retired; replaced by per-tenant rendered vhosts from the template below. |
| config/Nginx Config/nginx-websrv.conf | Backend nginx that serves tqweb-adm static files on 443. | Single server_name; internal hop. | Not modified for Phase 0. Existing single-tenant installs that still rely on it keep working. All NEW tenant vhosts bypass it and target Jetty directly. A follow-up retirement PR may remove it entirely once no tenant references it. |
| config/Nginx Config/wa-service.conf | Reverse proxy template for the Python WhatsApp service (terminates TLS for wa.perunapps.com). Uses <SERVER_NAME> / <WA_SERVICE_PRIVATE_IP> placeholders rendered by ops. | Template already, single service. | Unchanged for Phase 0. Revisit in Phase 5 if per-tenant WhatsApp subdomains are introduced. |

Changes to nginx-gw.conf — template-driven per-tenant vhost

Convert the current single-server block into a template rendered per tenant. The template lives at config/Nginx Config/templates/tenant-gw.conf.template; the provisioner writes the rendered file to /etc/nginx/sites-available/<tenant-code>.conf and symlinks into sites-enabled/.

Template: config/Nginx Config/templates/tenant-gw.conf.template

# Tenant: {{TENANT_CODE}} — managed by TQPro tenant provisioner. Do not edit by hand.

upstream tqpro_jetty_{{TENANT_CODE_UNDERSCORED}} {
    server 127.0.0.1:{{JETTY_PORT}};    # Jetty serves both /tlinq-api/ and static /
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name {{TENANT_HOST}};        # e.g. acme-travel.tourlinq.com

    access_log /var/log/nginx/{{TENANT_CODE}}-gw.access.log;
    error_log  /var/log/nginx/{{TENANT_CODE}}-gw.error.log;

    ssl_certificate     /etc/letsencrypt/live/{{TENANT_HOST}}/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/{{TENANT_HOST}}/privkey.pem;
    ssl_protocols       TLSv1.2 TLSv1.3;
    ssl_ciphers         HIGH:!aNULL:!MD5;
    ssl_session_timeout 30m;
    ssl_prefer_server_ciphers on;

    # Header forwarding — CRITICAL: AuthApi reads Host to resolve tenant from subdomain
    proxy_set_header Host              $host;
    proxy_set_header X-Forwarded-Host  $http_host;
    proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Real-IP         $remote_addr;

    proxy_read_timeout 720s;
    proxy_connect_timeout 10s;
    proxy_send_timeout 720s;

    # Everything → Jetty (both /tlinq-api/ and static tqweb-adm are served by the same process)
    location / {
        proxy_pass http://tqpro_jetty_{{TENANT_CODE_UNDERSCORED}};
        proxy_http_version 1.1;
    }
}

# HTTP → HTTPS redirect + certbot HTTP-01 challenge path
server {
    listen 80;
    server_name {{TENANT_HOST}};

    location /.well-known/acme-challenge/ {
        root /var/www/certbot;
    }

    location / {
        return 301 https://$host$request_uri;
    }
}

Placeholders resolved at render time:

  • {{TENANT_CODE}} → tenant code from tqplatform.tenant.tenant_code (e.g. acme-travel)
  • {{TENANT_CODE_UNDERSCORED}} → same with hyphens replaced by underscores (nginx upstream name constraint, e.g. acme_travel)
  • {{TENANT_HOST}} → full hostname (e.g. acme-travel.tourlinq.com)
  • {{JETTY_PORT}} → Jetty HTTP port from tlinqapi.properties http-port (currently 11080). A single platform-wide Jetty instance serves every tenant; tenant resolution happens inside the JVM via Host header + TenantRegistry.

Platform host (tourlinq.com itself) is configured via platform.domain in tourlinq.properties (application-level config, not Jetty); the provisioner computes {{TENANT_HOST}} = {{TENANT_CODE}}.{platform.domain}.
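The derived names can be captured in one small value type, sketched below under stated assumptions. TenantNamesSketch is a hypothetical helper, not an existing class. The validation pattern is the one P1.1.b applies to tenant codes, and the tlinq_ database prefix follows the tlinq_acme_travel example used in P1.5.

```java
import java.util.regex.Pattern;

/** Illustrative helper: validates a tenant code and derives the names built from it. */
final class TenantNamesSketch {
    // Same pattern the provisioning script uses for tenant codes (3 to 50 characters).
    private static final Pattern CODE = Pattern.compile("^[a-z0-9][a-z0-9-]{1,48}[a-z0-9]$");

    final String code;          // {{TENANT_CODE}}
    final String underscored;   // {{TENANT_CODE_UNDERSCORED}}
    final String host;          // {{TENANT_HOST}}
    final String dbName;        // assumption: "tlinq_" + underscored code

    private TenantNamesSketch(String code, String platformDomain) {
        this.code = code;
        this.underscored = code.replace('-', '_');
        this.host = code + "." + platformDomain;
        this.dbName = "tlinq_" + this.underscored;
    }

    static TenantNamesSketch of(String code, String platformDomain) {
        if (code == null || !CODE.matcher(code).matches()) {
            throw new IllegalArgumentException("invalid tenant code: " + code);
        }
        return new TenantNamesSketch(code, platformDomain);
    }
}
```

Validating before deriving matters because the code ends up inside file paths, an nginx upstream name, and a database name; rejecting anything outside the pattern keeps those substitutions safe.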

New tourlinq.properties keys to add in P0.9:

platform.domain=tourlinq.com
# Tenant DB topology (all tenant DBs share the same PG host per D-2).
tenant.db.host=localhost
tenant.db.port=5432

nginx-websrv.conf — not modified for Phase 0

The legacy nginx-websrv path remains operational for any environment still using it. New per-tenant vhosts do not target it; they point straight at Jetty per D-14. Retirement is out of scope for this plan — ops can decommission it after migrating all existing tenants to the new template.

wa-service.conf — unchanged for Phase 0

The WhatsApp webhook URL stays at a single host (e.g. wa.perunapps.com). Tenant routing happens at the application layer via phone_number_id lookup in tqplatform.wa_phone_routing (D-9, Phase 5). No nginx changes needed here. The existing template placeholder pattern (<SERVER_NAME>, <WA_SERVICE_PRIVATE_IP>) stays — ops continues to render it manually per deployment.

DNS

Assumption: DNS is managed outside of TQPro (ops creates the A record manually at tenant onboarding time, OR an external automation hooked into the registrar API handles it). The provisioner does not write DNS records.

Operational ordering:

  1. Ops creates A record <tenant-code>.tourlinq.com → gateway public IP
  2. Ops waits for DNS propagation (~5 min, verify with dig +short <tenant-host>)
  3. Tenant provisioning script (P1.1) runs; it relies on DNS being live for certbot HTTP-01 to succeed

Document this in doc/operations/multitenancy-setup.md as a prerequisite checklist item per tenant.

Certbot workflow

Prerequisite: certbot is installed on the gateway host. The http-01 challenge directory /var/www/certbot exists (referenced in the template).

Per-tenant certificate issuance (called from the provisioning flow — see P1.3 update below):

# Run by the tenant provisioner after the vhost file is written AND nginx has been reloaded,
# so that the HTTP-01 challenge location is already being served when certbot runs.
certbot certonly \
  --webroot \
  --webroot-path /var/www/certbot \
  --non-interactive --agree-tos \
  --email ops@tourlinq.com \
  --domain <tenant-host> \
  --deploy-hook "systemctl reload nginx"

Renewal: certbot's default systemd timer (certbot.timer) handles renewal for all issued certs. Verify with certbot renew --dry-run after the first real issuance.

Revocation on deprovisioning: when a tenant is deprovisioned (Phase 8), the vhost file is removed and certbot revoke --cert-name <tenant-host> is called. Removing the vhost stops the subdomain from being served, and the revocation invalidates the certificate; the Keycloak realm, by contrast, is only disabled (per D-10).

Provisioning orchestration — what runs where

Extend the tenant-provision.sh script (P1.1) with three new steps (render vhost, reload nginx, obtain certificate) between "DB created" and the call to the Java provisioning API that inserts the tenant row:

# After DB clone succeeds, before calling the Java provisioning API:

# 1. Render nginx vhost from template
TENANT_HOST="${CODE}.$(grep '^platform.domain=' /etc/tqpro/tourlinq.properties | cut -d= -f2)"
TENANT_UNDERSCORED="${CODE//-/_}"
JETTY_PORT=$(grep '^http-port=' /etc/tqpro/tlinqapi.properties | cut -d= -f2)

sed -e "s/{{TENANT_CODE}}/${CODE}/g" \
    -e "s/{{TENANT_CODE_UNDERSCORED}}/${TENANT_UNDERSCORED}/g" \
    -e "s/{{TENANT_HOST}}/${TENANT_HOST}/g" \
    -e "s/{{JETTY_PORT}}/${JETTY_PORT}/g" \
    /etc/tqpro/nginx-templates/tenant-gw.conf.template \
  > /etc/nginx/sites-available/${CODE}.conf

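ln -sf "/etc/nginx/sites-available/${CODE}.conf" "/etc/nginx/sites-enabled/${CODE}.conf"
# Caveat: on first issuance the files under /etc/letsencrypt/live/${TENANT_HOST}/ do not exist yet,
# and nginx -t rejects a 443 server block whose ssl_certificate file is missing. The first render
# must therefore omit the 443 block (or use a placeholder certificate) until step 3 has succeeded.
nginx -t                     # verify config before any reload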

# 2. Reload nginx BEFORE certbot — certbot needs the HTTP-01 location block active
systemctl reload nginx

# 3. Obtain the cert (webroot method as above)
certbot certonly --webroot --webroot-path /var/www/certbot \
    --non-interactive --agree-tos --email ops@tourlinq.com \
    --domain "${TENANT_HOST}" \
    --deploy-hook "systemctl reload nginx"

# 4. THEN call the Java provisioning API (Keycloak realm + tenant row)
#    If the Java step fails, run the rollback script which deletes vhost + revokes cert

Rollback on any failure: scripts/tenant-rollback.sh <tenant-code>

  1. certbot revoke --cert-name <tenant-host> --non-interactive
  2. certbot delete --cert-name <tenant-host> --non-interactive
  3. rm /etc/nginx/sites-enabled/<tenant-code>.conf /etc/nginx/sites-available/<tenant-code>.conf
  4. systemctl reload nginx
  5. dropdb tlinq_<tenant_code_underscored>, e.g. tlinq_acme_travel (if DB was created)
  6. DELETE FROM tqplatform.tenant WHERE tenant_code = ... (if row was inserted)
  7. KeycloakRealmProvisioner.deleteRealm(tenant_code) (if realm was created)

The Java TenantProvisioningFacade (P1.3) is responsible for invoking this rollback script on any failure via its existing failure handler.

CORS origins

CORS origins: config/tlinqapi.properties already has cors-allowed-origins=*.tourlinq.com,...,localhost. No change needed — the platform wildcard is already present. If any environment's copy is missing it, align to the repo version.

Local development

Developers need <tenant-code>.tourlinq.local to resolve to 127.0.0.1 for browser-side OIDC flows. Two options — support both in the docs:

  1. /etc/hosts entries — simplest; add one line per tenant being tested:
    127.0.0.1 tqpro-adm.tourlinq.local acme-travel.tourlinq.local
    
  2. dnsmasq wildcard — configure address=/tourlinq.local/127.0.0.1 in dnsmasq for developers who regularly work with multi-tenant.

For local dev TLS is optional — use plain HTTP on the gateway port. Document in doc/operations/multitenancy-setup.md.

P0.9 acceptance

  • [ ] config/Nginx Config/templates/tenant-gw.conf.template exists and renders correctly for the seed tenant via sed substitution
  • [ ] Seed tenant vhost (/etc/nginx/sites-available/tqpro-adm.conf) is generated from the template — replaces the legacy single-server nginx-gw.conf without any behavior change end-users can see. The vhost proxies directly to Jetty on {{JETTY_PORT}} (D-14)
  • [ ] nginx -t passes after each vhost write
  • [ ] certbot certonly succeeds for the seed tenant host (dry-run first via --dry-run)
  • [ ] Certbot systemd timer is active: systemctl is-active certbot.timer → active
  • [ ] Jetty serves static tqweb-adm content via content-location — hitting https://<seed-tenant-host>/ returns the admin UI index (no separate nginx-websrv hop)
  • [ ] CORS origin list already contains the platform wildcard (verified, no change)
  • [ ] Local dev works via /etc/hosts on at least one developer machine

P0.10 — Phase 0 acceptance gate

Repo gate: must pass before the PR merges. Claude Code verifies on a laptop.

  • [ ] ./gradlew build passes on a clean checkout
  • [ ] All existing tests pass unchanged
  • [ ] New TenantRegistryTest passes (incl. Hazelcast topic-listener fan-out, D-16)
  • [ ] Unit test: for every one of the 6 DB-session classes (NTSDBSession, AmdDBSession, RaynaDBSession, GoGlobalDBSession, TlinqDBSession, TiqetsDBSession), getSession() fails loudly when no tenant is in RequestContext and works correctly when wrapped in TenantScope.run(...). Each must self-register with TenantSessionRegistry.
  • [ ] Unit test: AuthenticationFilter logout URL composes the realm dynamically from TenantInfo.getKcRealm() (no tqpro-adm literal in the test output).
  • [ ] Unit test: /auth/config returns the right issuer + tenantName for a given Host header; returns 404 for unknown subdomain.
  • [ ] Unit test: master-realm token (issuer = ${keycloakBaseUrl}/realms/master) validates and grants platform-admin only for /platform/tenant/* paths.
  • [ ] nginx -t passes against the template rendered with the seed tenant values (use a local nginx install or nginx -t -c <rendered-file>).
  • [ ] shellcheck scripts/tenant-provision.sh scripts/tenant-rollback.sh scripts/apply-tenant-migrations.sh scripts/apply-platform-migrations.sh scripts/bootstrap-schema-migrations.sh passes.
  • [ ] doc/operations/multitenancy-setup.md exists and documents every manual ops step including DNS + certbot prerequisites for new tenants, the platform-DB migration directory layout, and the seed-tenant schema_migrations bootstrap.

Deployment gate: ops-owned, performed in staging first, then production. The PR has merged; ops executes per the runbook.

  • [ ] Platform DB tqplatform created via scripts/apply-platform-migrations.sh (applies config/db-changes/platform/0001-tqplatform-schema.sql); seed tenant row inserted; schema_migrations ledger in the platform DB shows 0001-tqplatform-schema.sql applied
  • [ ] Seed tenant DB (tlinq) has public.schema_migrations ledger populated by scripts/bootstrap-schema-migrations.sh (all 72 existing migration filenames recorded as applied)
  • [ ] max_connections=500 applied; PostgreSQL restarted cleanly
  • [ ] tqpro-platform-admin service account configured in master realm with create-realm scope only
  • [ ] TQPRO_ENCRYPTION_KEY stored in /etc/tqpro/tqpro.env (mode 0600, correct owner); offline backup made
  • [ ] Seed tenant's vhost file rendered and symlinked; legacy single-server nginx-gw.conf retired
  • [ ] Seed tenant's TLS cert issued via certbot (--dry-run first, then real); certbot.timer is active
  • [ ] Smoke test (ops/QA): log in via existing tqweb-adm flow, create a booking; behavior indistinguishable from pre-change
  • [ ] journalctl -u nginx and journalctl -u postgresql show clean startup, no errors


7. Phase 1 — Onboarding Automation (Manual-Triggered Only — D-11)

Goal: Provision a second tenant via a platform admin CLI. Prove end-to-end data isolation. No public signup.

Effort: 1–2 weeks. Risk: Medium. Depends on: Phase 0 acceptance gate.

P1.1 — Database cloning + nginx/certbot orchestration script

Prerequisite for each run: DNS A record <tenant-code>.<platform.domain> → gateway public IP already exists and has propagated (verify with dig +short). See doc/operations/multitenancy-setup.md.

P1.1.a — Template DB (supersedes clone-and-truncate)

Architecture change (2026-04-14, mid-Phase-1): rather than clone the seed tenant's tlinq DB and truncate ~86 customer-data tables, every new tenant is cloned from a dedicated template DB (tqpro_template by default; set by template.db.name in tourlinq.properties). The template holds the tenant schema with NO customer data and is kept in schema sync with every migration automatically.

Reasons:

  • No 86-table YAML to maintain: a new migration that adds a tenant-specific table doesn't require updating any truncation list.
  • Faster: pg_dump -Fc of the template is kilobytes, not gigabytes.
  • Zero risk of customer-data leakage: there is no customer data to accidentally skip.
  • Curated starter kit: terms templates, action templates, voucher sections, trip snippets can be pre-populated in the template and every new tenant inherits them.
  • apply-tenant-migrations.sh processes the template automatically alongside every ACTIVE tenant, so it never falls behind.

Create: scripts/bootstrap-template-db.sh, a one-time setup script that:

  1. Runs createdb tqpro_template (no-op if it already exists).
  2. Drops all non-system schemas (idempotent) and loads a pg_dump --schema-only file passed via the required --schema-file <path> argument. Ops produce the dump themselves: pg_dump --schema-only --no-owner --no-privileges -d <source-db> > /tmp/schema.sql. The source DB must have all current migrations applied; the template inherits whatever the source had. Re-running migrations from scratch against an empty DB is NOT a safe alternative: early migrations (roughly 0001..0030) were reverse-engineered from production and may not faithfully reproduce the real schema. A verified dump is the only trusted source of truth.
  3. Runs TRUNCATE ... RESTART IDENTITY CASCADE on every user table in the template, as defence in depth against any incidental rows in the dump.
  4. Creates public.schema_migrations if missing and seeds it with every committed migration filename so the runner sees the template as up-to-date.

Run once per environment. Re-running is safe — the script refreshes schema and re-seeds the ledger from whatever the supplied dump contains.

P1.1.b — Provisioning orchestration

Create: scripts/tenant-provision.sh

#!/usr/bin/env bash
# Usage: tenant-provision.sh <tenant_code> <tenant_name> <admin_email> [<wa_phone_id>]
# (Full file under scripts/tenant-provision.sh — summarised here.)
#
# 1. DNS pre-flight: dig +short "${TENANT_HOST}"
# 2. Clone template DB → new tenant DB (pg_dump -Fc ${TEMPLATE_DB} | pg_restore -d ${DB_NAME}).
#    Reads template.db.name and platform.domain from ${TLINQ_HOME}/tourlinq.properties.
#    Reads http-port from tlinqapi.properties.
# 3. Render nginx vhost via scripts/render-tenant-vhost.sh; symlink into sites-enabled.
# 4. nginx -t; systemctl reload nginx (BEFORE certbot so HTTP-01 location is active).
# 5. certbot certonly --webroot --webroot-path /var/www/certbot --domain <host>.
# 6. POST /tlinq-api/platform/tenant/provision — Java creates the Keycloak realm and
#    inserts tqplatform.tenant (+ wa_phone_routing if a phone-id was supplied).
# 7. trap ERR → scripts/tenant-rollback.sh <code> if any step after step 2 fails.
#
# Tenant code validated against ^[a-z0-9][a-z0-9-]{1,48}[a-z0-9]$. Hyphens are replaced by
# underscores wherever the code is embedded in a PostgreSQL DB name or nginx upstream name.

Create: scripts/tenant-rollback.sh, a step-wise undo. Safe to re-run; every step tolerates missing prior state.

  1. certbot revoke + certbot delete for the tenant host.
  2. Remove nginx vhost symlink and file; nginx -t + reload.
  3. dropdb the tenant DB if it exists.
  4. DELETE FROM tqplatform.wa_phone_routing + DELETE FROM tqplatform.tenant by code.
  5. DELETE /admin/realms/<code> via the platform service account (tolerates 401; the caller needs a fresh token after any recent realm creation).
  6. Does NOT touch DNS records.

Note: clean_tenant_clone.py and scripts/tenant-clone-reset.yaml mentioned in earlier drafts are superseded by the template-DB approach and are NOT shipped.

P1.2 — Schema migration management (D-15, per-DB ledger)

Design: Each tenant DB owns its own migration state via a public.schema_migrations ledger table. The platform DB does NOT track per-tenant migration versions — that coupling would break on DB restore/clone. Flyway is the planned long-term successor (out of scope for this plan).

P1.2.a — Ledger schema (applied to every tenant DB)

Create: config/db-changes/0072-schema-migrations-ledger.sql

-- Migration ledger — tracks which files have been applied to this DB.
CREATE TABLE IF NOT EXISTS public.schema_migrations (
    filename     VARCHAR(255) PRIMARY KEY,
    applied_at   TIMESTAMPTZ DEFAULT NOW(),
    checksum     VARCHAR(64)
);

This is the last migration applied by the old (file-order) deployment process and the first the new ledger-aware runner recognizes.

P1.2.b — Seed-tenant bootstrap

Create: scripts/bootstrap-schema-migrations.sh — runs against the seed tenant DB (tlinq) after 0072-schema-migrations-ledger.sql is applied, marking every pre-existing migration as already-applied so the ledger-aware runner skips them.

#!/usr/bin/env bash
# Usage: bootstrap-schema-migrations.sh [DB_NAME]
# Runs ONCE against the seed tenant DB. Idempotent — safe to re-run.
set -euo pipefail
DB="${1:-tlinq}"
MIGRATIONS_DIR="$(dirname "$0")/../config/db-changes"

# Walk all existing migration files in committed order; insert each filename.
while IFS= read -r SQL; do
  FNAME=$(basename "${SQL}")
  CHECKSUM=$(sha256sum "${SQL}" | cut -d' ' -f1)
  psql "${DB}" -c "INSERT INTO public.schema_migrations (filename, applied_at, checksum) \
                   VALUES ('${FNAME}', NOW(), '${CHECKSUM}') \
                   ON CONFLICT (filename) DO NOTHING"
done < <(ls "${MIGRATIONS_DIR}"/*.sql | sort)

echo "Bootstrap complete. Applied rows: $(psql -At "${DB}" -c 'SELECT count(*) FROM public.schema_migrations')"

The script enumerates filenames at runtime from config/db-changes/*.sql (committed in git), so the list stays in sync automatically. After the first bootstrap against the seed DB, scripts/apply-tenant-migrations.sh will see zero pending migrations.

P1.2.c — Runner

Create: scripts/apply-tenant-migrations.sh

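#!/usr/bin/env bash
# Iterates every ACTIVE tenant DB plus the template DB (P1.1.a) and applies any
# config/db-changes/*.sql not yet in that DB's ledger.
set -euo pipefail

# Default template name from P1.1.a; override via env if template.db.name differs.
TEMPLATE_DB="${TEMPLATE_DB:-tqpro_template}"
TENANT_ROWS=$(psql -At tqplatform -c \
  "SELECT db_name FROM tenant WHERE status='ACTIVE' ORDER BY created_at")

for DB in $TENANT_ROWS "${TEMPLATE_DB}"; do
  echo "=== tenant DB: ${DB} ==="
  APPLIED=$(psql -At "${DB}" -c "SELECT filename FROM public.schema_migrations" || echo "")
  for SQL in config/db-changes/*.sql; do
    FNAME=$(basename "${SQL}")
    # -F: match the filename literally (the dot must not act as a regex wildcard)
    if grep -qxF "${FNAME}" <<<"${APPLIED}"; then
      continue
    fi
    echo "  applying ${FNAME}"
    CHECKSUM=$(sha256sum "${SQL}" | cut -d' ' -f1)
    # --single-transaction: a failing file is rolled back as a whole, no partial changes remain
    psql "${DB}" -v ON_ERROR_STOP=1 --single-transaction -f "${SQL}"
    psql "${DB}" -v ON_ERROR_STOP=1 -c "INSERT INTO public.schema_migrations (filename, checksum) \
                     VALUES ('${FNAME}', '${CHECKSUM}')"
  done
done

The runner is idempotent (safe to re-run). The transactional boundary is per file: each file runs with --single-transaction, so if it fails mid-apply its changes are rolled back, the script stops (set -e), the ledger row is NOT inserted, and the next run retries. Migration files that contain statements which cannot run inside a transaction block (for example CREATE INDEX CONCURRENTLY) need separate handling.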
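The ledger rule the runner applies (pending = committed files minus ledger rows, in filename order, each recorded with a SHA-256 checksum) can be stated compactly. The sketch below is only an illustration of that rule in Java; the shipped runner is the bash script, and MigrationLedgerSketch and its method names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class MigrationLedgerSketch {
    /** Files present in the repo but absent from the ledger, in filename order. */
    static List<String> pending(List<String> committedFiles, Set<String> appliedInLedger) {
        List<String> sorted = new ArrayList<>(committedFiles);
        Collections.sort(sorted);                       // 0001-..., 0002-..., lexicographic
        List<String> out = new ArrayList<>();
        for (String f : sorted) {
            if (!appliedInLedger.contains(f)) out.add(f);
        }
        return out;
    }

    /** Lower-case hex SHA-256, 64 characters, matching the checksum VARCHAR(64) column. */
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```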

P1.2.d — Platform DB runner

Create: scripts/apply-platform-migrations.sh — same logic but runs config/db-changes/platform/*.sql against the tqplatform DB using its own schema_migrations table (bootstrapped inside 0001-tqplatform-schema.sql, see P0.1).

P1.2.e — Tenant clone flow

Every new tenant DB is cloned from the template DB (P1.1.a) via pg_dump | pg_restore; the clone inherits the template's schema_migrations contents, so the runner sees it as up-to-date immediately. No special handling.

Verify:

  • bash scripts/bootstrap-schema-migrations.sh tlinq: exit 0; SELECT COUNT(*) FROM schema_migrations = 72.
  • bash scripts/apply-tenant-migrations.sh with no new migration files: exit 0, zero INSERTs.
  • Create a fake config/db-changes/9999-test.sql, re-run: exits 0, inserts exactly one ledger row, file contents applied.

P1.3 — Keycloak realm provisioner

Create: tqapp/src/main/java/com/perun/tlinq/entity/tenant/KeycloakRealmProvisioner.java

Uses java.net.http.HttpClient. Authenticates via the platform service account (tqpro-platform-admin in master realm — D-4). Exposes:

public class KeycloakRealmProvisioner {
    // Full provisioning flow — atomic with rollback
    ProvisioningResult provisionTenant(String tenantCode, String tenantName,
                                       String subdomainRootUrl, String adminEmail);
    // Rollback individual steps (called on failure)
    void deleteRealm(String realmName);
}

Provisioning steps (in order, with rollback on failure):

  1. POST ${master}/auth/admin/realms: create realm from template JSON (name = tenantCode, displayName = tenantName, sslRequired=external, registrationAllowed=false)
  1a. Re-fetch the platform service-account token before continuing. The realm-management roles for the new realm are auto-granted to the platform service account on step 1, but the in-memory access token from before the POST does not yet reflect them. All subsequent calls in this provisioning sequence (and any Phase 8 deprovisioning sequence) must run against a freshly-issued token. Without this refresh, step 2 onward will fail with HTTP 401/403 even though the role exists.
  2. POST /admin/realms/${realm}/roles × 5: create guest, agent, manager, finance, admin
  3. POST /admin/realms/${realm}/clients: create tqweb-adm (public, rootUrl=${subdomainRootUrl}, redirectUris=[${subdomainRootUrl}/*])
  4. POST /admin/realms/${realm}/clients: create tqpro-admin-api (confidential, serviceAccountsEnabled=true)
  5. Fetch the secret of tqpro-admin-api via GET /admin/realms/${realm}/clients/${id}/client-secret and capture it for the returned ProvisioningResult (adminClientSecret)
  6. Get the realm-management client ID in this new realm, then assign manage-users + view-users roles to the tqpro-admin-api service account (NOT manage-realm; narrowed per shortcoming review)
  7. POST /admin/realms/${realm}/users: create initial tenant admin user (email = adminEmail, enabled = true, requiredActions: ["UPDATE_PASSWORD"])
  8. Assign admin realm role to the new user
  9. PUT /admin/realms/${realm}/users/${id}/execute-actions-email with UPDATE_PASSWORD: sends welcome email

Returns ProvisioningResult { realmName, adminClientSecret, initialAdminUserId }.
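The token re-fetch in step 1a is an ordinary client-credentials request against the master realm. The sketch below only builds the request with java.net.http so it can be inspected without a running Keycloak. The class name is hypothetical, and the token path assumes the same ${keycloakBaseUrl}/realms/<realm> base that the issuer URLs in this plan use; installations served under a legacy /auth prefix would need that prefix in the base URL.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class PlatformTokenRequestSketch {
    /** Form body for the OAuth2 client_credentials grant, values URL-encoded. */
    static String formBody(String clientId, String clientSecret) {
        return "grant_type=client_credentials"
             + "&client_id=" + URLEncoder.encode(clientId, StandardCharsets.UTF_8)
             + "&client_secret=" + URLEncoder.encode(clientSecret, StandardCharsets.UTF_8);
    }

    /** Builds (does not send) the token request for the platform service account. */
    static HttpRequest tokenRequest(String keycloakBaseUrl, String clientId, String clientSecret) {
        return HttpRequest.newBuilder()
            .uri(URI.create(keycloakBaseUrl + "/realms/master/protocol/openid-connect/token"))
            .header("Content-Type", "application/x-www-form-urlencoded")
            .POST(HttpRequest.BodyPublishers.ofString(formBody(clientId, clientSecret)))
            .build();
    }
}
```

A provisioner following step 1a would send this request once before step 1 and again immediately after it, discarding the first token, so that every later call carries the newly granted realm-management roles.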

Create: tqapp/src/main/java/com/perun/tlinq/entity/tenant/TenantProvisioningFacade.java

  • Orchestrates: call KeycloakRealmProvisioner → insert into tqplatform.tenant (with kc_admin_client_secret taken from the ProvisioningResult and encrypted using TQPRO_ENCRYPTION_KEY) → call TenantRegistry.refresh(). The realm is created first because the client secret stored in the tenant row only exists once the tqpro-admin-api client has been created.
  • On any failure, delete the realm, remove the tenant row, drop the DB (the DB was created by the shell script; the facade records what to clean up)
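One way for the facade to "record what to clean up" is a compensation stack: every completed step pushes its undo action, and a failure unwinds the stack in reverse order. The sketch below is a generic illustration of that pattern, not the facade's actual code; the class name and the step labels in demo() are placeholders, and the step order shown there is illustrative only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ProvisioningRollbackSketch {
    private final Deque<Runnable> undo = new ArrayDeque<>();

    /** Runs the action; only if it succeeds is its compensation recorded. */
    void step(Runnable action, Runnable compensation) {
        action.run();
        undo.push(compensation);
    }

    /** Unwinds completed steps newest-first; one failing undo does not stop the rest. */
    void rollback() {
        while (!undo.isEmpty()) {
            try {
                undo.pop().run();
            } catch (RuntimeException e) {
                // A real implementation would log this and continue unwinding.
            }
        }
    }

    /** Demonstration: the third step fails, so the first two are undone in reverse. */
    static List<String> demo() {
        List<String> log = new ArrayList<>();
        ProvisioningRollbackSketch p = new ProvisioningRollbackSketch();
        try {
            p.step(() -> log.add("create realm"), () -> log.add("delete realm"));
            p.step(() -> log.add("insert tenant row"), () -> log.add("delete tenant row"));
            p.step(() -> { throw new IllegalStateException("registry refresh failed"); },
                   () -> log.add("never recorded"));
        } catch (RuntimeException e) {
            p.rollback();
        }
        return log;
    }
}
```

Recording the compensation only after the action succeeds means a step that never completed is never undone, which matches the "if DB was created / if row was inserted" conditions in the rollback script.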

P1.4 — Platform admin API

Create: tqapi/src/main/java/com/perun/tlinq/api/PlatformAdminApi.java

POST /platform/tenant/provision    — called by scripts/tenant-provision.sh
POST /platform/tenant/list         — list all tenants (optional status filter)
POST /platform/tenant/read         — read one tenant (by id or code)
POST /platform/tenant/suspend      — flip status to SUSPENDED
POST /platform/tenant/activate     — flip status back to ACTIVE
POST /platform/tenant/deprovision  — see Phase 8
POST /platform/tenant/reactivate   — see Phase 8
POST /platform/tenant/refresh      — force TenantRegistry.refreshLocal() on THIS node
                                     and publish tenant-registry-refresh for peers.
                                     Escape hatch if Hazelcast propagation fails.
                                     Body: { tenantId?: string }  (omit for full reload)

Access control: Dedicated platform-admin role, separate from tenant-scoped admin. Only a user in the master realm (via a separate bootstrap flow) OR a platform-level service token may hold this role.

Modify config/api-roles.properties — append a new section:

# --- Platform admin endpoints (multi-tenancy) ---
# Only the platform-admin role (held in the master realm) can hit these.
platform/tenant/provision=platform-admin
platform/tenant/list=platform-admin
platform/tenant/read=platform-admin
platform/tenant/suspend=platform-admin
platform/tenant/activate=platform-admin
platform/tenant/deprovision=platform-admin
platform/tenant/reactivate=platform-admin
platform/tenant/refresh=platform-admin
The role platform-admin is not mappable from any tenant realm — a deliberate isolation choice (a tenant admin must never gain platform-admin scope by accident).

Token resolution for platform-admin paths: master-realm JWT validation is already implemented in JWTValidator in Phase 0 (P0.5 step 3) — tokens from ${keycloakBaseUrl}/realms/master produce a ValidatedToken with tenantId=null and the roles from realm_access.roles. P1.4's added responsibility is only the authorization layer: AuthenticationFilter allows null tenantId specifically for /platform/* paths and rejects it everywhere else.
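
The rule "null tenantId is allowed only under /platform/*" can be sketched as a single predicate. This is an illustration in Python, not the AuthenticationFilter code; tenant_scope_allowed and PLATFORM_PREFIX are hypothetical names, and the prefix is inferred from the endpoint list above.

```python
from typing import Optional

PLATFORM_PREFIX = "/platform/"  # assumption: all platform-admin endpoints share this prefix


def tenant_scope_allowed(path: str, tenant_id: Optional[str]) -> bool:
    """Master-realm tokens (tenant_id is None) may only reach /platform/* paths.

    Tenant tokens pass this check everywhere; keeping them out of /platform/*
    is the job of the role check (platform-admin), not of this predicate.
    """
    if tenant_id is None:
        return path.startswith(PLATFORM_PREFIX)
    return True
```

Matching on "/platform/" with the trailing slash keeps a path such as /platformx/... from slipping through a looser prefix test.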

P1.5 — Provision second tenant (proof of concept)

Run scripts/tenant-provision.sh acme-travel "Acme Travel LLC" admin@acmetravel.example.

Verify:
- Keycloak: realm acme-travel exists with both clients, 5 roles, and one admin user with requiredActions=[UPDATE_PASSWORD]
- DB: tlinq_acme_travel exists with empty nts.booking but intact schema
- tqplatform.tenant has a row for acme-travel with encrypted kc_admin_client_secret
- TenantRegistry.refresh() loads it
- Log in as the tenant admin at https://acme-travel.tourlinq.com (local: http://acme-travel.tourlinq.local:11080) → forced password reset → land in TQPro admin UI
- Create a booking in acme-travel and verify it lands in tlinq_acme_travel, NOT in tlinq
- Log in as the seed-tenant user → verify acme-travel's booking is invisible

P1.6 — Phase 1 acceptance gate

Repo gate: Claude Code verifies.
- [ ] scripts/tenant-provision.sh, scripts/tenant-rollback.sh, scripts/clean_tenant_clone.py, scripts/apply-tenant-migrations.sh all pass shellcheck / pyflakes
- [ ] KeycloakRealmProvisioner and TenantProvisioningFacade have unit tests against a mock Keycloak HTTP server (realm create, rollback on failure mid-sequence)
- [ ] PlatformAdminApi integration tests with mocked TenantProvisioningFacade
- [ ] nginx -t passes against a freshly rendered vhost for a synthetic tenant code
- [ ] Documentation: doc/operations/tenant-provisioning.md runbook exists describing the end-to-end onboarding procedure

Deployment gate: ops-owned, performed in staging first.
- [ ] Ops creates DNS A record for the staging second tenant; propagation verified with dig +short
- [ ] Ops runs scripts/tenant-provision.sh acme-travel "Acme Travel LLC" admin@acmetravel.example end-to-end in staging
- [ ] Keycloak staging: realm exists with both clients, 5 roles, one admin user with requiredActions=[UPDATE_PASSWORD]
- [ ] Staging DB: tlinq_acme_travel exists with empty customer tables but intact schema; tqplatform.tenant row present with encrypted secret
- [ ] Ops simulates a mid-run failure (kill the script during step 3 or 4) → runs scripts/tenant-rollback.sh acme-travel → verifies vhost removed, cert revoked, realm deleted, DB dropped, tenant row gone
- [ ] Ops completes the happy-path provisioning; the tenant admin receives the welcome email, sets password, lands in TQPro admin UI, creates a booking; the booking lands only in tlinq_acme_travel
- [ ] Ops logs in as seed-tenant user and confirms acme-travel's booking is invisible
- [ ] Platform service account only has create-realm; does not appear in tenant realms' user lists (D-4)
- [ ] Browsing https://acme-travel.tourlinq.com (staging domain) serves a valid cert (no browser warning)
- [ ] certbot renew --dry-run includes the new tenant's cert


8. Phase 2 — Cache Partitioning

Effort: 1 week. Risk: Low. Depends on: Phase 0.

Tasks

Create: tqcommon/src/main/java/com/perun/tlinq/tenant/TenantCacheKey.java

public static String of(String name) {
    String tid = RequestContext.current().getTenantId();
    if (tid == null) throw new IllegalStateException("TenantCacheKey.of called outside tenant scope");
    return tid + "::" + name;
}

Modify (prefix all cache keys with TenantCacheKey.of(...)); paths verified in repo:
- tqapp/src/main/java/com/perun/tlinq/entity/cache/StaticMapCache.java
- tqapp/src/main/java/com/perun/tlinq/entity/cache/PricelistCache.java
- tqapp/src/main/java/com/perun/tlinq/entity/cache/ProductCache.java
- tqapp/src/main/java/com/perun/tlinq/entity/cache/SupplierCache.java
- tqapp/src/main/java/com/perun/tlinq/entity/cart/CartHolder.java
- tqcommon/src/main/java/com/perun/tlinq/entity/cache/TlinqClusterCache.java (Hazelcast map names)

Verify: Unit tests that populate a cache under Tenant A's context, switch to Tenant B, verify lookup misses.


9. Phase 3 — Configuration & Credential Isolation (includes Settings Page, D-18, D-19)

Effort: 4–5 weeks (revised — original 2-3 weeks did not include the settings UI + plugin ## removal folded in from the settings-page-plan). Risk: Medium. Depends on: Phase 0, Phase 1.

Scope note: Phase 3 now subsumes the work previously scoped in doc/plans/gaps/settings-page-plan.md. That file is retained as reference but its Phases 1-4 are superseded by the subsections below. See §9 entry point and the supersession note at the top of the settings-page-plan file.

P3.1 — TenantConfig (the central abstraction) — D-18

Per-tenant property overrides live in each tenant's DB in nts.system_settings. TenantConfig is the single read path; AppConfig is reduced to the platform-default fallback layer. tqplatform.tenant.config JSONB is reserved for platform-admin-only metadata (plan tier, feature flags, region) — not for business properties.

Extend tqcommon/src/main/java/com/perun/tlinq/tenant/TenantConfig.java — the Phase 0 stub only contained decrypt() (see P0.6.b). In this phase:

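import java.util.List;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import com.fasterxml.jackson.databind.JsonNode;  // assumed: Jackson's JsonNode
import org.hibernate.Session;
// Project-local imports (AppConfig, RequestContext, TenantScope, NTSDBSession, SystemSettingEntity) omitted.

public final class TenantConfig {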
    /** Platform-wide defaults: AppConfig (tourlinq.properties + properties.d/*). Read-only. */
    private static final AppConfig defaults = AppConfig.getInstance();

    /** Per-tenant overrides: loaded lazily from tenant DB's nts.system_settings. Refreshed via Hazelcast. */
    private static final ConcurrentHashMap<String, Properties> tenantOverrides = new ConcurrentHashMap<>();

    /** Per-tenant platform metadata from tqplatform.tenant.config JSONB — platform-admin managed only. */
    private static final ConcurrentHashMap<String, JsonNode> platformMetadata = new ConcurrentHashMap<>();

    /** Reads current tenant from RequestContext. Layer order: tenant override → platform default. */
    public static String get(String key) {
        String tid = RequestContext.current().getTenantId();
        return tid == null ? defaults.getProp(key) : get(tid, key);
    }

    public static String get(String tenantId, String key) {
        Properties p = tenantOverrides.computeIfAbsent(tenantId, TenantConfig::loadFromTenantDb);
        String v = p.getProperty(key);
        return v != null ? decrypt(v) : defaults.getProp(key);
    }

    /** Invalidate + reload one tenant. Called by SystemSettingsFacade after write. */
    public static void reload(String tenantId) {
        tenantOverrides.remove(tenantId);
        // lazy re-populate on next get()
    }

    /** Encryption stub introduced in P0.6.b; full implementation here. */
    public static String decrypt(String cipher) { ... }
    public static String encrypt(String plaintext) { ... }

    private static Properties loadFromTenantDb(String tenantId) {
        return TenantScope.call(tenantId, () -> {
            try (Session s = NTSDBSession.getSession()) {
                List<SystemSettingEntity> rows = s.createQuery(
                    "FROM SystemSettingEntity", SystemSettingEntity.class).list();
                Properties p = new Properties();
                rows.forEach(r -> p.setProperty(r.getKey(), r.getValue()));
                return p;
            }
        });
    }
}

Format: Values stored with prefix encrypted:<base64(nonce||ct)> are decrypted transparently via TenantConfig.decrypt() using TQPRO_ENCRYPTION_KEY (already placed on disk in Prerequisites). Settings UI writes use encrypt() before INSERT when the property is flagged sensitive=true.
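
Since the Python services also hold TQPRO_ENCRYPTION_KEY, it may help to see how the stored format splits apart before any decryption happens. The sketch below only parses the envelope and performs no cryptography; split_envelope is a hypothetical helper, and the 12-byte nonce length and the standard (non URL-safe) base64 alphabet are assumptions that the plan does not state.

```python
import base64
from typing import Optional, Tuple

PREFIX = "encrypted:"
NONCE_LEN = 12  # assumption: the conventional 96-bit AES-GCM nonce


def split_envelope(stored: str) -> Optional[Tuple[bytes, bytes]]:
    """Return (nonce, ciphertext) for an encrypted value, or None for plaintext."""
    if not stored.startswith(PREFIX):
        return None  # plaintext values pass through untouched
    raw = base64.b64decode(stored[len(PREFIX):], validate=True)
    if len(raw) <= NONCE_LEN:
        raise ValueError("encrypted value too short to hold nonce and ciphertext")
    return raw[:NONCE_LEN], raw[NONCE_LEN:]
```

Returning None for unprefixed values is what makes the decryption "transparent": callers can hand every stored value to the same function without checking the sensitive flag first.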

Hazelcast refresh: Every write (via SystemSettingsFacade) publishes the tenant-config-refresh topic with the tenantId. The subscriber was wired in P0.3 (see the "two topics" table there) — the handler simply calls TenantConfig.reload(tenantId) locally. No platform-DB roundtrip on refresh; the next TenantConfig.get(tenantId, key) call does a lazy reload.

P3.2 — Settings schema (per-tenant DB)

Create: config/db-changes/0073-system-settings.sql (renumbered from the settings-page-plan's 0038 to align with the post-0072 sequence introduced in P0.1/P1.2):

-- Per-tenant config overrides. Lives in each tenant DB under schema `nts`.
CREATE TABLE IF NOT EXISTS nts.system_settings (
    setting_id    BIGSERIAL PRIMARY KEY,
    category      VARCHAR(30) NOT NULL,    -- general|email|sms|ai|payment|integration
    setting_key   VARCHAR(100) NOT NULL UNIQUE,
    setting_value TEXT,                    -- plaintext or `encrypted:<base64>`
    sensitive     BOOLEAN DEFAULT FALSE,
    description   VARCHAR(255),
    updated_by    VARCHAR(100),
    updated_at    TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_system_settings_category ON nts.system_settings(category);

CREATE TABLE IF NOT EXISTS nts.settings_audit_log (
    log_id        BIGSERIAL PRIMARY KEY,
    setting_key   VARCHAR(100) NOT NULL,
    old_value     TEXT,
    new_value     TEXT,
    changed_by    VARCHAR(100),
    changed_at    TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_settings_audit_key  ON nts.settings_audit_log(setting_key);
CREATE INDEX IF NOT EXISTS idx_settings_audit_date ON nts.settings_audit_log(changed_at);

Applied to every tenant DB via the Phase 1 ledger runner (scripts/apply-tenant-migrations.sh). Sensitive values are stored with the encrypted: prefix and never returned to the UI in full (see P3.5 masking rules).

P3.3 — Plugin config migration (removes ## substitution caching)

The current plugin configs (RaynaClientConfig, TiqetsClientConfig, GoGlobalClientConfig, GoogleFlightsClientConfig, AmadeusClientConfig, OdooClientConfig) resolve ##key references once at plugin init and cache the resolved values in singletons. This is fundamentally incompatible with per-tenant credentials: Tenant A's token would stay in the singleton for the lifetime of the process.

Remove the ## resolution for per-tenant properties. Every plugin's code path that currently reads config.getProp("rayna.client.token") (or similar) is changed to TenantConfig.get("ryb2b.token") at call time. The TenantConfig lookup takes the property from (a) the current tenant's nts.system_settings, falling back to (b) AppConfig / tourlinq.properties.

Migration targets (verified paths from §C of the review):

| Plugin | File | Keys moving to TenantConfig |
| --- | --- | --- |
| Odoo | tqodoo/.../OdooClientConfig.java | odoo.server, odoo.db, odoo.user, odoo.password |
| Amadeus | tqamds/.../AmadeusClientConfig.java | amadeus.api.key, amadeus.api.secret |
| Rayna | tqryb2b/.../RaynaClientConfig.java | ryb2b.server, ryb2b.token, ryb2b.refresh-interval |
| Tiqets | tqtiqets/.../TiqetsClientConfig.java | tiqets.api.key, tiqets.jwt.keyId |
| GoGlobal | tqgglbl/.../GoGlobalClientConfig.java | goglobal.api.password, others |
| Google Flights | tqgflights/.../GoogleFlightsClientConfig.java | gf.rapidapi.key |

Structural config stays shared: service class mappings in nts-client.xml, odoo-client.properties, Hibernate dialect, JPA entity annotations — these are code-like, not per-tenant data.

Static-data URLs (e.g. rayna.client.sdcityurl, rayna.client.sdtoururl, service endpoint paths) stay in plugin XML files — they are not per-tenant.

Deprecation: the ## substitution is retained for platform-wide non-sensitive keys that no one overrides per tenant (e.g. rayna.client.apiModule). Over time, remove remaining ## uses for simplicity. Not urgent once the per-tenant properties are moved.

P3.4 — Application-level property migration

Convert every AppConfig.getInstance().getProp(key) call site where the property is per-tenant to TenantConfig.get(key):

  • Mail/SMTP in MessageUtil.java and anywhere else reading mail.*
  • Payment gateway (TelrJPgw.java reads telr.*, pgw.*)
  • Twilio in MessageUtil.java reads twilio.*
  • Company branding (headers, email templates, PDF generators that read tqpro.company.*, tqpro.agency.*, company.support-email, company.code)
  • hotel.margin, service.charge.*, erp.default.product.*
  • content.cdn-prefix, s3.path.prefix, content.urlprefix
  • ai.* (all 9 AI properties)
  • rapidapi.visa.key, pexels.api.key, amadeus.idfile
  • offline-tickets.storage-path

Task: grep first. Run grep -rnF 'AppConfig.getInstance().getProp' tqapp tqapi tqodoo tqamds tqryb2b tqtiqets tqgglbl tqgflights and produce a complete call-site inventory at doc/plans/multitenancy-config-migration-checklist.md, then migrate one plugin/module at a time. Each migration is its own PR.

Singleton-reset gotchas from the settings-page-plan (still relevant):
- MessageUtil: Twilio.init(...) is called once at construction. When twilio.sid/twilio.token change per tenant, MessageUtil must be refactored to (a) resolve Twilio credentials via TenantConfig.get() inside every send call rather than caching them, and (b) either call Twilio.init(newSid, newToken) as a per-call preamble or use a TwilioRestClient built from the tenant's credentials for that request, which avoids the global state. Test carefully: concurrent requests from different tenants must not race on the global Twilio client state.
- TelrJPgw caches telr.* in instance fields at construction. Convert to per-call TenantConfig.get("telr.auth-key") etc.
- MailUtil is already NOT a singleton (new instance per use). Once it reads via TenantConfig, no further rework is needed.

P3.5 — SystemSettingsFacade + SystemSettingsApi (tenant-admin scoped)

Absorbs the settings-page-plan's Phase 1.4/1.5. Tenant-admin-only; operates on the current tenant's DB (resolved from RequestContext).

Create:
- tqapp/.../client/nts/db/common/SystemSettingEntity.java: NTS JPA entity mapping nts.system_settings
- tqapp/.../client/nts/db/common/SettingsAuditLogEntity.java: NTS JPA entity mapping nts.settings_audit_log
- tqapp/.../entity/common/CSystemSetting.java: canonical entity
- tqapp/.../entity/common/CSettingsAuditLog.java: canonical entity
- tqapp/.../entity/common/SystemSettingsFacade.java: CRUD + audit + TenantConfig.reload(tenantId) trigger + Hazelcast publish
- tqapi/.../api/SystemSettingsApi.java: tenant-admin REST endpoints

Entity-config wiring: add SystemSetting and SettingsAuditLog to config/entities/common-entities.xml; add NTS services to config/nts-client.xml.

API endpoints (all POST, role=admin, tenant-scoped via RequestContext):

system/settings/list          — list, optionally by category (masked)
system/settings/read          — read one by key (masked if sensitive)
system/settings/write         — upsert; encrypts if sensitive; audit-logs; refreshes TenantConfig
system/settings/delete        — remove override (reverts to AppConfig default)
system/settings/audit         — view audit log for a key
system/settings/reload        — force full TenantConfig reload (this tenant only)
system/settings/test-email    — sends test email using current tenant's SMTP
system/settings/test-sms      — sends test SMS using current tenant's Twilio config

Authorization (extend config/api-roles.properties):

# Tenant-admin manages only the caller's own tenant settings.
system/settings/list=admin
system/settings/read=admin
system/settings/write=admin
system/settings/delete=admin
system/settings/audit=admin
system/settings/reload=admin
system/settings/test-email=admin
system/settings/test-sms=admin
No platform-admin scope on these — platform admins must impersonate or use the platform-level tenant-provisioning flow if they need to edit settings for a tenant.

Masking rules (in the facade, applied to every API response that leaves the server, UI or otherwise):
- For sensitive=true settings: API responses show only the last 4 characters preceded by **** (e.g. ****a11f). The full encrypted value is never returned to any UI.
- list and read endpoints: decrypt the stored value in-memory only far enough to produce the mask; never echo the plaintext.
- write endpoint: the UI submits a new plaintext value. The facade encrypts it (TenantConfig.encrypt()) before INSERT/UPDATE and audit-logs the old masked value + new masked value (never the plaintext of either).
- Internal runtime reads (e.g. MessageUtil calling TenantConfig.get("mail.password")) receive the decrypted plaintext; those calls don't pass through the masking layer.
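
The masking rule itself is small enough to sketch. The Python below is an illustration of the rule, not the facade's Java code; mask_sensitive is a hypothetical name, and hiding values of four characters or fewer entirely is an added assumption (the rule as written would otherwise echo a short secret whole).

```python
def mask_sensitive(plaintext: str, visible: int = 4) -> str:
    """Return '****' followed by the last `visible` characters of the value."""
    if len(plaintext) <= visible:
        # Assumption: very short secrets are fully hidden rather than echoed whole.
        return "****"
    return "****" + plaintext[-visible:]
```

The same function can serve both the list/read responses and the audit-log entries, so the old and new values are masked identically.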

Write flow:
1. Resolve tenantId from RequestContext
2. If the setting is sensitive=true: encode the value as encrypted:<base64(AES-GCM nonce||ct)> via TenantConfig.encrypt()
3. INSERT INTO nts.system_settings (...) ON CONFLICT (setting_key) DO UPDATE SET ... (read the existing value first so the audit row can record it)
4. INSERT INTO nts.settings_audit_log (setting_key, old_value, new_value, changed_by, changed_at), with sensitive values masked per the masking rules
5. TenantConfig.reload(tenantId): local invalidation
6. Publish tenant-config-refresh Hazelcast topic with tenantId; peers invalidate their caches
7. No plugin-singleton coordination needed (D-19 removed ## caching, so all reads are live)

P3.6 — Settings UI (Phase 3 deliverable)

Absorbs the settings-page-plan's Phase 2.

Create:
- tqweb-adm/settings.html: tabbed layout (General / Email / SMS & WhatsApp / AI Assistant / Payment / Integrations)
- tqweb-adm/js/modules/settings.js: page logic (ES6 module, design-system compliant)
- tqweb-adm/js/modules/settings-api.js: tlinq('system/settings/...', ...) wrappers
- tqweb-adm/css/settings.css: minimal; leverage tqadmin.css

Modify: - tqweb-adm/index.html — activate the "Settings" sidebar link (currently marked "Soon")

Design system: follow doc/design-system/tq-components.md + tq-templates.md (T3 spoke-form template). Use TQ.loading / TQ.confirm / notify.*, escapeHtml(), password-type inputs with eye toggle for sensitive=true fields, test-email / test-sms buttons per relevant tab.

P3.7 — Platform vs tenant property classification

Create: doc/developer/multitenancy-config-reference.md

Tabular form: every property currently in tourlinq.properties + properties.d/ classified as:
- PLATFORM-ONLY (stays in files; read via AppConfig directly), e.g. pgw.class, tlinq.dbname, tlinq.dbpass, content locations
- TENANT (moves to nts.system_settings overlay; read via TenantConfig.get()): the 50 properties from the settings-page-plan's inventory
- PLATFORM-METADATA (stored in tqplatform.tenant.config JSONB, platform-admin-only): plan tier, feature flags, region

Pair this with P3.4's migration checklist. Numbers from the settings-page-plan Section 10 inventory give a ready-made starting list: 50 properties become TENANT, 7 stay PLATFORM-ONLY (the ones marked "Stays in file only" in its Section 10.1).

P3.8 — Phase 3 acceptance gate

Repo gate: Claude Code verifies.
- [ ] ./gradlew build passes; unit tests for TenantConfig.get/reload/encrypt/decrypt pass
- [ ] Integration test: set mail.password via SystemSettingsApi as Tenant A → send test-email as Tenant A → verify new SMTP used → send test-email as Tenant B → uses Tenant B's distinct SMTP
- [ ] Integration test: ## substitution no longer referenced for migrated per-tenant properties (grep comes up empty for those keys)
- [ ] Unit test: TenantConfig fallback: remove the override row → get() returns the AppConfig value
- [ ] Unit test: Hazelcast topic tenant-config-refresh triggers remote node's TenantConfig invalidation
- [ ] Unit test: sensitive setting round-trip encrypt → store → decrypt on read → masked in list response
- [ ] E2E test: Settings page renders all 6 tabs; sensitive fields masked; save + reload preserves; audit log shows the change
- [ ] doc/developer/multitenancy-config-reference.md exists and classifies every property in tourlinq.properties + properties.d/
- [ ] config/db-changes/0073-system-settings.sql exists; scripts/apply-tenant-migrations.sh dry-run applies it to a fresh tenant clone

Deployment gate: ops-owned, performed in staging.
- [ ] Provision a test tenant (Acme); set distinctive SMTP/Twilio/AI credentials via the Settings page
- [ ] Issue a booking confirmation email from Acme → verify the From: and SMTP-relay match Acme's settings; repeat as seed tenant → verify seed's SMTP used
- [ ] Send an SMS from Acme → verify Twilio sender ID is Acme's; seed tenant → seed's sender ID
- [ ] Create an AI-outline request from Acme → verify uses Acme's ai.api.key (confirm in logs)
- [ ] Restart one TQPro node and verify settings persist (DB-backed) and behavior is unchanged after restart


10. Phase 4 — S3 Document Storage Isolation

Effort: 1 week. Risk: Low. Depends on: Phase 0.

Scope (verified in repo):
- tqapp/src/main/java/com/perun/tlinq/client/nts/service/media/BookingDocumentStorageService.java (exists)
- tqapp/src/main/java/com/perun/tlinq/client/nts/service/media/VisaDocumentStorageService.java (exists)
- Note: the original plan referenced MarketingMediaStorageService.java, but this class does not exist in the repo today. If marketing-media S3 storage is later added, it must follow the same S3TenantPath discipline; for now there is nothing to migrate.

Modify the two existing services above:

  • Every S3 key construction goes through a new helper, S3TenantPath.prefix(tenantId), which returns "tenants/<tenantCode>/" (tenantCode comes from TenantRegistry.requireById(tenantId).getTenantCode()). Keys are then built as S3TenantPath.prefix(tenantId) + "bookings/...".
  • Existing files: one-time migration script scripts/migrate-s3-to-tenant-prefix.sh that moves every existing object under tenants/<seed-tenant-code>/. Runs once against the seed tenant.

Verify: Upload a document as Tenant A → S3 path is tenants/tqpro-adm/bookings/.... Upload as Tenant B → tenants/acme-travel/bookings/.... No collisions possible.
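
The prefixing rule can be illustrated in a few lines. This Python sketch is not the S3TenantPath Java helper; tenant_prefix is a hypothetical name, a plain dict stands in for TenantRegistry, and the check against "/" in tenant codes is an added assumption meant to keep one tenant's prefix from nesting inside another's.

```python
from typing import Dict


def tenant_prefix(tenant_id: str, codes_by_id: Dict[str, str]) -> str:
    """Return 'tenants/<tenantCode>/' for a known tenant; fail closed otherwise."""
    code = codes_by_id[tenant_id]  # KeyError for unknown tenants
    if not code or "/" in code or code in (".", ".."):
        raise ValueError(f"unsafe tenant code: {code!r}")
    return f"tenants/{code}/"


# Example: the registry maps an opaque tenant id to its code.
codes = {"t-0002": "acme-travel"}
key = tenant_prefix("t-0002", codes) + "bookings/42/voucher.pdf"
```

Raising on an unknown tenant id, rather than falling back to an unprefixed key, keeps a lookup bug from silently writing into the shared root of the bucket.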


11. Phase 5 — WhatsApp Multi-Tenancy (Supersedes WhatsApp Future Work)

Effort: 4 weeks (revised — original 2-week estimate undersized the scope). Risk: Medium-High. Depends on: Phase 0, Phase 1.

This phase replaces the FUT-3 "Multi-agent / multi-number" item in doc/plans/whatsapp/gaps.md — once delivered, remove FUT-3 from that file.

Phase 5 architecture (D-17)

The Python WhatsApp codebase splits into two deployable services sharing the same tqwhatsapp/app/ package:

| Service | Process | Workload | Tenant model |
| --- | --- | --- | --- |
| tqwhatsapp (existing) | FastAPI (uvicorn) | Inbound webhooks from Meta, on-demand dispatch endpoints called by Java | Resolves tenant per-request via phone_number_id (webhook) or X-Tenant-ID header (Java calls) |
| tqwhatsapp-worker (NEW) | Long-running Python process | Retry loop (60s), cleanup loop (3600s), broadcast-dispatch consumer | Iterates active tenants from the platform registry; opens per-tenant pools on demand |

Both services:
- Read the platform DB at startup (via the same tqwhatsapp/app/tenant.py module)
- Maintain per-tenant asyncpg pools in tqwhatsapp/app/db.py (lazy creation, max 2 connections per tenant)
- Run as separate systemd units; webhook handler scales on inbound rate, worker scales on tenant count

P5.1 — Webhook → tenant routing table

Already created in P0.1 as tqplatform.wa_phone_routing. Every onboarding writes a row: (META_PHONE_NUMBER_ID from tenant config, tenant_id).

P5.2 — Shared tenant-aware modules (used by both services)

Create: tqwhatsapp/app/tenant.py

from dataclasses import dataclass
from typing import Optional
import asyncpg

@dataclass(frozen=True)
class TenantInfo:
    tenant_id: str
    tenant_code: str
    db_name: str
    db_user: Optional[str]
    db_pass: Optional[str]   # decrypted at load time
    kc_realm: str
    status: str

class TenantRegistry:
    """Process-local cache of platform DB tenant rows.
    Indexed by tenant_id and phone_number_id."""

    def __init__(self, platform_dsn: str):
        self._dsn = platform_dsn
        self._by_id: dict[str, TenantInfo] = {}
        self._by_phone: dict[str, TenantInfo] = {}

    async def load(self) -> None:
        """Initial load and full refresh. Called at startup and on every periodic refresh."""
        conn = await asyncpg.connect(self._dsn)
        try:
            tenants = await conn.fetch(
                "SELECT tenant_id, tenant_code, db_name, db_user, db_pass, "
                "kc_realm, status FROM tenant WHERE status='ACTIVE'")
            phones = await conn.fetch(
                "SELECT phone_number_id, tenant_id FROM wa_phone_routing")
        finally:
            await conn.close()
        # NOTE: db_pass arrives exactly as stored in the platform DB. It still has to be
        # decrypted with TQPRO_ENCRYPTION_KEY here, before TenantInfo is built.
        new_by_id = {t['tenant_id']: TenantInfo(**dict(t)) for t in tenants}
        # Routing rows that point at inactive tenants are dropped.
        new_by_phone = {p['phone_number_id']: new_by_id[p['tenant_id']]
                        for p in phones if p['tenant_id'] in new_by_id}
        # Swap both indexes together so readers never see a half-built registry.
        self._by_id, self._by_phone = new_by_id, new_by_phone

    def by_id(self, tid: str) -> Optional[TenantInfo]:
        return self._by_id.get(tid)

    def by_phone(self, pid: str) -> Optional[TenantInfo]:
        return self._by_phone.get(pid)

    def list_active(self) -> list[TenantInfo]:
        return list(self._by_id.values())

Modify: tqwhatsapp/app/config.py
- Remove the hardcoded db_name = "tlinq".
- Add: platform_db_dsn (env var TQPRO_PLATFORM_DB_DSN), tqpro_internal_api_key (existing), decryption_key (env var TQPRO_ENCRYPTION_KEY).
- The single-DB DSN field is removed; all DSN construction is now per-tenant via TenantInfo.

Modify: tqwhatsapp/app/db.py

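import asyncio
import asyncpg
from typing import Dict
from .tenant import TenantInfo

# DB_HOST and DB_PORT are not defined in this excerpt; they are expected to
# come from tqwhatsapp/app/config.py.

_pools: Dict[str, asyncpg.Pool] = {}
# Serialises lazy pool creation so two concurrent requests for the same tenant
# cannot both create (and leak) a pool. Creating the lock at module level assumes
# Python 3.10+, where asyncio.Lock no longer binds to an event loop at construction.
_pools_lock = asyncio.Lock()

async def get_pool(tenant: TenantInfo) -> asyncpg.Pool:
    pool = _pools.get(tenant.tenant_id)
    if pool is None:
        async with _pools_lock:
            # Re-check after acquiring the lock: another coroutine may have won the race.
            pool = _pools.get(tenant.tenant_id)
            if pool is None:
                pool = await asyncpg.create_pool(
                    user=tenant.db_user, password=tenant.db_pass,
                    database=tenant.db_name, host=DB_HOST, port=DB_PORT,
                    min_size=1, max_size=2,
                    server_settings={"search_path": "tqwa"},
                )
                _pools[tenant.tenant_id] = pool
    return pool

async def evict_pool(tenant_id: str) -> None:
    pool = _pools.pop(tenant_id, None)
    if pool:
        await pool.close()

async def close_all_pools() -> None:
    # Detach the pools first so the dict is not mutated while we await close().
    pools = list(_pools.values())
    _pools.clear()
    for pool in pools:
        await pool.close()

The tqwa schema is unchanged (every tenant DB has the same tqwa.* tables; the tables themselves get the standard tenant migration via Phase 1's runner).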

Modify: tqwhatsapp/app/tqpro_client.py
- Every HTTP call to the Java API now includes X-Tenant-ID: <tenant_id> alongside the existing Authorization: Bearer ....
- The tenant_id argument becomes mandatory on every public method.

P5.3 — Webhook handler tenant resolution

Modify: tqwhatsapp/app/webhooks/handler.py
- For every incoming Meta webhook payload, extract phone_number_id from:

pid = entry["changes"][0]["value"]["metadata"]["phone_number_id"]

  (This field is not read by the current code; it's new extraction logic, not a one-line patch.)
- Look up tenant = registry.by_phone(pid); if missing → respond 200 (Meta requires 200 to avoid retries) but log a WARN with the unknown phone_number_id so ops sees it.
- Pass the resolved TenantInfo down through every handler function (no global tenant). Use await get_pool(tenant) for any DB call.
- All calls into tqpro_client pass tenant.tenant_id.
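
A slightly more defensive version of that extraction, which walks every entry and change instead of only changes[0], might look like the sketch below. iter_phone_number_ids and route are hypothetical helpers, and a plain dict stands in for registry.by_phone; the payload shape follows the access path shown above.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_phone_number_ids(payload: Dict[str, Any]) -> Iterator[str]:
    """Yield metadata.phone_number_id for every change in a webhook payload."""
    for entry in payload.get("entry", []):
        for change in entry.get("changes", []):
            metadata = change.get("value", {}).get("metadata", {})
            pid = metadata.get("phone_number_id")
            if pid:
                yield pid


def route(payload: Dict[str, Any], routing: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split a payload's phone_number_ids into (tenant_ids, unknown_ids)."""
    tenants: List[str] = []
    unknown: List[str] = []
    for pid in iter_phone_number_ids(payload):
        if pid in routing:
            tenants.append(routing[pid])
        else:
            unknown.append(pid)  # caller logs a WARN and still answers 200
    return tenants, unknown
```

Missing keys produce an empty result rather than a KeyError, so a malformed payload can still be acknowledged with 200 as the handler requires.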

P5.4 — tqwhatsapp-worker service (NEW)

Create: tqwhatsapp/worker/__main__.py, the entry point for the new long-running process. Requires tqwhatsapp/worker/__init__.py (empty file) so python3 -m tqwhatsapp.worker resolves. Bootstraps:
1. Loads TenantRegistry from the platform DB
2. Registry refresh: Python can't subscribe to Hazelcast directly. The worker polls the platform DB every 60s (TenantRegistry.load()) as the Phase 5 interim. This allows up to 60s of staleness between onboarding a tenant and the worker noticing, which is documented as acceptable for retry/cleanup cadences. A Hazelcast-to-HTTP bridge (Java side pushes to a worker webhook) can be added as a follow-up if 60s is too slow.
3. Starts two async background tasks:
   - retry_loop(): every 60s, iterate registry.list_active(), for each tenant open the pool and process pending retries from tqwa.campaign_messages WHERE status='RETRY' using FOR UPDATE SKIP LOCKED for multi-instance safety
   - cleanup_loop(): every 3600s, iterate active tenants, delete from tqwa.processed_webhook_messages WHERE created_at < NOW() - INTERVAL '7 days' wrapped in pg_try_advisory_lock(<tenant-specific key>) so multiple worker replicas don't double-delete

Broadcast dispatch flow (Java → Python): stays on the existing path. MetaBroadcastDispatcher (Java) POSTs to the tqwhatsapp FastAPI webhook service's /broadcast endpoint (already exists). The webhook service enqueues by inserting a row into tqwa.campaign_messages with status='QUEUED'. The worker's retry_loop (same loop, broader status filter: status IN ('QUEUED','RETRY')) picks up the queued row in the tenant's DB and dispatches to Meta. There is no separate broadcast_consumer; queueing and retry are unified into one loop.

Move the existing logic from tqwhatsapp/app/tasks.py:
- _retry_loop, _process_retries → into tqwhatsapp/worker/retry.py, refactored to take a TenantInfo parameter.
- _cleanup_loop → into tqwhatsapp/worker/cleanup.py, refactored similarly.
- The webhook FastAPI service no longer starts these loops; they only run inside the worker process. The tqwhatsapp/app/tasks.py module is reduced to whatever (if anything) the webhook still needs.

Multi-instance safety: The retry loop already uses FOR UPDATE SKIP LOCKED for safe concurrent processing across worker instances. The cleanup loop is wrapped in a pg_try_advisory_lock(<tenant-specific key>) so multiple worker instances don't double-delete.
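
The plan leaves open how the <tenant-specific key> is derived. One possibility, sketched below, is to hash the tenant id into the signed 64-bit range that pg_try_advisory_lock(bigint) accepts. advisory_lock_key and the "wa-cleanup" purpose label are hypothetical; the point of using hashlib rather than Python's built-in hash() is that the built-in is randomised per process for strings, so two worker replicas would otherwise compute different keys for the same tenant.

```python
import hashlib


def advisory_lock_key(tenant_id: str, purpose: str = "wa-cleanup") -> int:
    """Derive a stable signed 64-bit key suitable for pg_try_advisory_lock(bigint)."""
    digest = hashlib.sha256(f"{purpose}:{tenant_id}".encode("utf-8")).digest()
    # First 8 bytes, interpreted as a signed big-endian integer.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)
```

Including a purpose label in the hashed string leaves room for other per-tenant locks later without key collisions between unrelated jobs.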

Deployment:
- One systemd unit per service: tqwhatsapp.service (uvicorn) and tqwhatsapp-worker.service (ExecStart invokes the venv-activated interpreter; if no venv is configured for the unit, use python3 -m tqwhatsapp.worker so the system's Python 3 is used unambiguously; never bare python, which on modern distros may be missing).
- Both read the same env file (/etc/tqwhatsapp/env) including TQPRO_PLATFORM_DB_DSN and TQPRO_ENCRYPTION_KEY.
- Worker can run as multiple replicas; webhook handler typically scales horizontally too.

P5.5 — Java → Python tenant propagation

Modify: tqapp/src/main/java/com/perun/tlinq/entity/marketing/MetaBroadcastDispatcher.java
- Add X-Tenant-ID: <RequestContext.current().getTenantId()> header on the HTTP call to the Python service.
- Add "tenant_id": "<tid>" field to the dispatch request body.
- Mirror change in BroadcastDispatchRequest Pydantic model on the Python side.

Verify (no new code — check that P0.5's change is still in effect): AuthenticationFilter.java rejects the internal API key path if X-Tenant-ID header is missing. If Phase 5 testing shows the check was removed or diluted between P0.5 and P5, add a regression test against this branch.

P5.6 — Tenant onboarding adds WhatsApp routing

Modify: KeycloakRealmProvisioner / TenantProvisioningFacade (Phase 1) — when a tenant is provisioned with a Meta phone_number_id (optional in the provision request body), insert into tqplatform.wa_phone_routing.

If a tenant adds WhatsApp later (separate admin operation), expose POST /platform/tenant/wa-routing/add and delete on PlatformAdminApi.

P5.7 — Acceptance gate

Repo gate: Claude Code verifies.
- [ ] tqwhatsapp/app/tenant.py exists with TenantRegistry loading from platform DB
- [ ] tqwhatsapp/app/db.py is per-tenant pool dict; get_pool(tenant) lazy-creates
- [ ] Webhook handler extracts metadata.phone_number_id and routes correctly (unit test with two synthetic payloads)
- [ ] tqwhatsapp-worker is a separate __main__ entry point; importable; starts retry+cleanup loops; works against a fixture with two tenants
- [ ] tqwhatsapp/app/tqpro_client.py sets X-Tenant-ID on every outbound call
- [ ] MetaBroadcastDispatcher Java unit test confirms header + body field
- [ ] pytest tqwhatsapp/ passes; both services import cleanly; no global single-tenant state remains
- [ ] Two systemd unit files documented in doc/operations/whatsapp-multitenancy.md

Deployment gate — ops-owned, performed in staging. - [ ] Provision a second tenant with its own Meta phone number ID; insert into wa_phone_routing - [ ] Send a test webhook to that phone number ID → verify it reads/writes only tlinq_acme_travel.tqwa.*, never tlinq.tqwa.* - [ ] Broadcast from tenant A → WAMIDs land in tenant A's campaign_messages table only - [ ] Worker service running with two replicas — retry queue processed without double-execution; cleanup loop runs cleanly - [ ] Stop the worker service entirely → confirm webhook keeps serving (no shared state); restart → retries resume


12. Phase 6 — Frontend Awareness

Effort: 1 week. Risk: Low. Depends on: Phase 0.

Tasks

Create: POST /auth/tenant-info — returns tenant display name, logo URL (from tqpro.company.logo.url in TenantConfig), and any other per-tenant UI data. Public endpoint (same as /auth/config) — the browser hits it before login to paint the login page with tenant branding.

Modify: config/api-roles.properties — add:

auth/tenant-info=guest,agent,admin

Modify: tqweb-adm/js/modules/globals.js (verified path — the file is 33 KB, currently holds getAuthHeaders(), callApi(), and session-token handling). Call /auth/tenant-info during page bootstrap (not only post-login — the login page needs it too). Cache in window.TQPro.tenant (introduce the namespace in this module if it doesn't exist yet).

Modify: tqweb-adm/index.html and partials — swap hardcoded logo/company name for the cached tenant values.

Verify: Log in to two different tenant subdomains → each shows the right logo and company name; the login page (pre-auth) also shows the correct tenant branding, proving /auth/tenant-info is reachable without a token.


13. Phase 7 — Defense-in-Depth

Effort: 1 week. Risk: Low. Depends on: All prior phases.

P7.1 — TenantAssert

Create: tqcommon/src/main/java/com/perun/tlinq/tenant/TenantAssert.java

- requireTenant() — throws if no tenant in RequestContext (redundant with NTSDBSession.getSession() but callable at API entry for earlier failure)
- requireDbMatch(Session s, String tenantId) — verifies the Hibernate session's JDBC URL matches the expected tenant DB name. Logs a severe error and throws CrossTenantAccessException on mismatch.

Insert at the top of every method in:

- tqapp/.../nts/service/NTSEntityWriteService.java (create, write, delete)
- tqapp/.../nts/service/NTSEntityReadService.java (read, search)
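The core of requireDbMatch is a comparison between the database name in the session's JDBC URL and the name expected for the tenant. A minimal sketch of that comparison, under these assumptions: it takes the JDBC URL as a string instead of a Hibernate Session, it takes the expected DB name directly (how a tenantId maps to a DB name is not specified here), and the class name `TenantAssertSketch` is hypothetical.

```java
import java.net.URI;
import java.util.logging.Logger;

// Hypothetical stand-in for the URL check inside TenantAssert.requireDbMatch.
final class TenantAssertSketch {

    private static final Logger LOG = Logger.getLogger(TenantAssertSketch.class.getName());

    static final class CrossTenantAccessException extends RuntimeException {
        CrossTenantAccessException(String message) {
            super(message);
        }
    }

    // Extracts the database name from a PostgreSQL JDBC URL such as
    // jdbc:postgresql://localhost:5432/tlinq_acme_travel?sslmode=require
    static String databaseName(String jdbcUrl) {
        if (jdbcUrl == null || !jdbcUrl.startsWith("jdbc:")) {
            throw new IllegalArgumentException("Not a JDBC URL: " + jdbcUrl);
        }
        // Dropping the "jdbc:" prefix leaves a regular hierarchical URI.
        String path = URI.create(jdbcUrl.substring("jdbc:".length())).getPath();
        if (path == null || path.length() < 2) {
            throw new IllegalArgumentException("JDBC URL has no database name: " + jdbcUrl);
        }
        return path.substring(1); // strip the leading '/'
    }

    // Logs at SEVERE and throws when the session points at another database.
    static void requireDbMatch(String sessionJdbcUrl, String expectedDbName) {
        String actual = databaseName(sessionJdbcUrl);
        if (!actual.equals(expectedDbName)) {
            String message = "Cross-tenant access: session uses '" + actual
                    + "' but '" + expectedDbName + "' was expected";
            LOG.severe(message);
            throw new CrossTenantAccessException(message);
        }
    }
}
```

Comparing the parsed database name, not the whole URL string, keeps the check stable when connection parameters such as `sslmode` differ between environments.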

P7.2 — Background job safety

Note: the original draft of this section listed BookingMaintenanceApi as a suspect — that class is in fact a REST API (@Path("/blm/maintenance")) with on-demand POST endpoints, not a scheduled job. It does not need TenantScope wrapping (RequestContext is already established by AuthenticationFilter for every API call).

Audit step (do this first): Run these greps from the repo root and triage every result:

grep -rn --include='*.java' \
  -e 'ScheduledExecutorService' \
  -e 'Executors\.newScheduled' \
  -e 'ExecutorService\.submit' \
  -e 'Executors\.newFixedThreadPool' \
  -e 'Executors\.newSingleThreadExecutor' \
  -e 'new Thread(' \
  -e 'CompletableFuture\.runAsync' \
  -e '@Scheduled' \
  -e 'addEntryListener' \
  -e 'IExecutorService' \
  tqcommon tqapp tqapi tqamds tqodoo tqryb2b tqtiqets tqgglbl tqgflights

Produce the full hit list as doc/plans/multitenancy-background-job-audit.md (a P7.2 deliverable). For each hit, classify:

- OK — runs strictly inside a JAX-RS request thread (RequestContext already set)
- NEEDS-WRAP — runs outside a request thread; wrap the body in TenantScope.run(tenantId, ...) and ensure the dispatching code captures the current tenantId at submission time
- NEEDS-DESIGN — has no natural tenant context (e.g. a global cron); decide whether to iterate TenantRegistry.listActive() or to make it tenant-triggered

Known suspects (start here):

- tqcommon/.../entity/cache/TlinqClusterCache.java — Hazelcast addEntryListener and any IExecutorService use. Listener callbacks run on Hazelcast threads, not request threads.
- Any code that calls Hazelcast's IScheduledExecutorService.schedule*.
- The tqwhatsapp Python side is now governed by Phase 5 / D-17 (separate worker service, per-tenant pools); the per-loop iteration over TenantRegistry.list_active() IS the tenant boundary. Verify both worker loops scope correctly.

The audit's output drives a list of concrete TenantScope.run insertions; each insertion is its own commit on this branch.
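A NEEDS-WRAP fix has two halves: capture the tenant on the submitting thread, then re-establish it on the worker thread. The sketch below illustrates that pattern with a plain ThreadLocal. It is an assumption about how TenantScope could work, not its actual implementation; `TenantScopeSketch`, `capture`, and `currentTenantId` are hypothetical names.

```java
// Hypothetical illustration of the capture-at-submission pattern.
final class TenantScopeSketch {

    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static String currentTenantId() {
        return CURRENT.get();
    }

    // Binds tenantId to the current thread for the duration of body,
    // then restores whatever was bound before (or clears the slot).
    static void run(String tenantId, Runnable body) {
        String previous = CURRENT.get();
        CURRENT.set(tenantId);
        try {
            body.run();
        } finally {
            if (previous == null) {
                CURRENT.remove();
            } else {
                CURRENT.set(previous);
            }
        }
    }

    // Call on the submitting thread: reads the tenant NOW and returns a task
    // that re-establishes it on whichever thread eventually runs it.
    static Runnable capture(Runnable task) {
        String tenantId = CURRENT.get();
        if (tenantId == null) {
            throw new IllegalStateException("No tenant bound on the submitting thread");
        }
        return () -> run(tenantId, task);
    }
}
```

With this shape, a call such as `executor.submit(TenantScopeSketch.capture(job))` made inside a request thread carries the tenant across the thread boundary, and the `finally` block prevents a pooled worker thread from leaking one tenant's context into the next task.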

P7.3 — Startup health check

Modify: tqapi/.../TQProApiServer.java startup sequence — after Hibernate and TenantRegistry init, iterate active tenants and SELECT 1 against each tenant DB. Log any unreachable tenant and flag it SUSPENDED with a startup warning, but do not fail the boot.
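The loop can be sketched as below. This is a minimal illustration: the `probe` predicate stands in for opening a connection to the tenant DB and running SELECT 1, and the class name `StartupHealthCheckSketch` and its `Status` enum are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.logging.Logger;

// Hypothetical sketch of the per-tenant startup probe.
final class StartupHealthCheckSketch {

    enum Status { ACTIVE, SUSPENDED }

    private static final Logger LOG = Logger.getLogger(StartupHealthCheckSketch.class.getName());

    // Probes every active tenant and returns the resulting status per tenant.
    // A failing probe never propagates: the boot continues with that tenant suspended.
    static Map<String, Status> check(List<String> activeTenantIds, Predicate<String> probe) {
        Map<String, Status> result = new LinkedHashMap<>();
        for (String tenantId : activeTenantIds) {
            boolean reachable;
            try {
                reachable = probe.test(tenantId);
            } catch (RuntimeException e) {
                // Connection errors count as "unreachable", not as a startup failure.
                reachable = false;
            }
            if (!reachable) {
                LOG.warning("Tenant DB unreachable at startup, flagging SUSPENDED: " + tenantId);
            }
            result.put(tenantId, reachable ? Status.ACTIVE : Status.SUSPENDED);
        }
        return result;
    }
}
```

Catching the exception per tenant is the point of the design: one broken tenant DB degrades only that tenant and leaves the others serving.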


14. Phase 8 — Deprovisioning (D-10)

Effort: 1 week. Risk: Low. Depends on: Phase 1.

P8.1 — Deprovisioning (Java API + ops shell script)

Deprovisioning is a two-part flow: a Java API endpoint handles DB/registry/Keycloak state changes (Claude Code territory), and an ops shell script handles nginx/certbot/host-level cleanup (ops territory). Per §4 Roles, Java code must NOT shell out to nginx/certbot.

P8.1.a — Java API endpoint

Add to: tqapi/.../api/PlatformAdminApi.java

POST /platform/tenant/deprovision
Body: { tenantId: "...", reason: "contract expired" }

Steps performed by the Java handler:

  1. Flip tqplatform.tenant.status to DEPROVISIONED, set deprovisioned_at = NOW()
  2. Disable the Keycloak realm: PUT /admin/realms/${realm} with { enabled: false } (D-10 — do NOT delete)
  3. TenantRegistry.refreshOne(tenantId) — publishes tenant-registry-refresh. Every node's subscriber (P0.3 onTenantRegistryRefresh) observes the post-refresh status as non-ACTIVE and automatically calls TenantSessionRegistry.evictAll(tenantId), closing pools for all 6 registered DB sessions on every node. No manual fan-out needed. Subsequent JWTs for this tenant fail validation because TenantRegistry.getByRealm now returns null (per P0.5 logic).
  4. Log the action with platform-admin user identity and reason to tqplatform (new table tqplatform.deprovisioning_log — add via a platform-DB migration config/db-changes/platform/0002-deprovisioning-log.sql)
  5. Return { status: "deprovisioned", nextStep: "ops must run scripts/tenant-deprovision.sh <tenant-code>" }
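Step 3 depends on each node's subscriber reacting to the refresh without any manual fan-out. A minimal sketch of that reaction, assuming a callback that receives the tenant ID and its post-refresh status (the real onTenantRegistryRefresh signature is not given here); `RegistryRefreshSketch` is a hypothetical name and the `evictAll` consumer stands in for TenantSessionRegistry.evictAll.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch of the per-node refresh subscriber.
final class RegistryRefreshSketch {

    enum Status { ACTIVE, SUSPENDED, DEPROVISIONED }

    private final Map<String, Status> activeView = new ConcurrentHashMap<>();
    private final Consumer<String> evictAll; // stand-in for TenantSessionRegistry.evictAll

    RegistryRefreshSketch(Consumer<String> evictAll) {
        this.evictAll = evictAll;
    }

    // Invoked on every node when a tenant-registry-refresh message arrives.
    void onTenantRegistryRefresh(String tenantId, Status postRefreshStatus) {
        if (postRefreshStatus == Status.ACTIVE) {
            // Reactivation path: the tenant becomes resolvable again;
            // connection pools are re-created lazily on the next request.
            activeView.put(tenantId, postRefreshStatus);
        } else {
            // Deprovision or suspension path: drop the tenant from the active
            // view first, then close its pools on this node.
            activeView.remove(tenantId);
            evictAll.accept(tenantId);
        }
    }

    boolean isActive(String tenantId) {
        return activeView.containsKey(tenantId);
    }
}
```

Removing the tenant from the active view before evicting means a request that races with the eviction fails tenant resolution and cannot open a fresh pool.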

P8.1.b — Ops deprovisioning script (host-level cleanup)

Create: scripts/tenant-deprovision.sh

#!/usr/bin/env bash
# Usage: tenant-deprovision.sh <tenant-code>
# Run BY OPS, AFTER /platform/tenant/deprovision has been called successfully.
set -euo pipefail
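CODE="${1:?Usage: tenant-deprovision.sh <tenant-code>}"
PLATFORM_DOMAIN=$(grep '^platform\.domain=' /etc/tqpro/tourlinq.properties | cut -d= -f2-)
TENANT_HOST="${CODE}.${PLATFORM_DOMAIN}"

# 1. Remove nginx vhost from enabled (keep sites-available for audit)
rm -f "/etc/nginx/sites-enabled/${CODE}.conf"
# Run the config test as its own command: inside an `&&` list, a failing
# `nginx -t` does not trigger `set -e`, and the script would carry on.
nginx -t
systemctl reload nginx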

# 2. Revoke and delete the TLS certificate
certbot revoke --cert-name "${TENANT_HOST}" --non-interactive || true
certbot delete --cert-name "${TENANT_HOST}" --non-interactive || true

# 3. DNS is NOT touched — ops decides whether to remove the A record externally.
# Incoming HTTPS requests now return "certificate error" + 502; HTTP requests redirect to HTTPS and fail the same way.
echo "Deprovisioning cleanup complete for ${CODE}. DB preserved per D-10."

P8.2 — Reactivation (Java API + ops shell script)

P8.2.a — Java API endpoint

POST /platform/tenant/reactivate
Body: { tenantId: "..." }
- Flip status back to ACTIVE
- Re-enable the Keycloak realm: PUT /admin/realms/${realm} with { enabled: true }
- TenantRegistry.refreshOne(tenantId) — publishes tenant-registry-refresh; subscribers re-add the tenant to the active map. Pools are lazily re-created on next request.
- Return { status: "reactivated", nextStep: "ops must run scripts/tenant-reactivate.sh <tenant-code>" }

P8.2.b — Ops reactivation script

Create: scripts/tenant-reactivate.sh

#!/usr/bin/env bash
# Usage: tenant-reactivate.sh <tenant-code>
# Run BY OPS, AFTER /platform/tenant/reactivate has been called successfully.
set -euo pipefail
CODE="${1:?Usage: tenant-reactivate.sh <tenant-code>}"
PLATFORM_DOMAIN=$(grep '^platform\.domain=' /etc/tqpro/tourlinq.properties | cut -d= -f2-)
TENANT_HOST="${CODE}.${PLATFORM_DOMAIN}"

# 1. Re-symlink the vhost
ln -sf "/etc/nginx/sites-available/${CODE}.conf" "/etc/nginx/sites-enabled/${CODE}.conf"
# Separate commands so a failing config test aborts the script under `set -e`.
nginx -t
systemctl reload nginx

# 2. Re-issue the TLS certificate (HTTP-01 webroot — assumes DNS still points at this host)
certbot certonly --webroot --webroot-path /var/www/certbot \
    --non-interactive --agree-tos --email "ops@${PLATFORM_DOMAIN}" \
    --domain "${TENANT_HOST}" \
    --deploy-hook "systemctl reload nginx"

P8.3 — DB preservation

D-10 commits to keeping DBs forever. No DB drop, no archive, no automatic deletion. If storage pressure emerges (OD-3), revisit with a new plan — not in this execution.

P8.4 — Acceptance gate

Repo gate — Claude Code verifies.

- [ ] PlatformAdminApi deprovision/reactivate endpoints unit-tested with mocked Keycloak client
- [ ] TenantSessionRegistry.evictAll(tenantId) unit-tested: every registered DB session class has its evictFactory(tenantId) called exactly once; subsequent getSession() for that tenant fails loud
- [ ] Unit test: TenantRegistry.refreshOne() subscriber fan-out invokes TenantSessionRegistry.evictAll automatically when a tenant flips to DEPROVISIONED
- [ ] scripts/tenant-deprovision.sh and scripts/tenant-reactivate.sh pass shellcheck; both are idempotent (re-running on already-done state exits 0)
- [ ] Platform-DB migration config/db-changes/platform/0002-deprovisioning-log.sql exists and creates tqplatform.deprovisioning_log
- [ ] Runbook section added to doc/operations/tenant-provisioning.md for deprovision and reactivate flows, with explicit two-step guidance: (1) Java API call, (2) ops script

Deployment gate — ops-owned, performed in staging.

- [ ] Ops provisions a throwaway test tenant, then deprovisions it → verifies:
  - DB still exists on disk, readable by a platform admin
  - Realm disabled (existing JWTs stop validating — realm_disabled from Keycloak, caught by JWTValidator)
  - nginx vhost symlink removed; sites-available/ copy retained for audit
  - TLS cert revoked via certbot revoke
- [ ] Ops reactivates the same tenant → vhost re-symlinked, cert re-issued, realm re-enabled, existing data intact, login works again


15. Phase 9 — Documentation (Final)

Goal: Once every functional phase (0–8) is in production, the documentation across doc/ must match the new multi-tenant reality. Without this phase, the plan is not complete — future contributors and ops staff cannot operate the system without it.

Effort: 1 week. Risk: Low. Depends on: Phases 0–8 all deployed.

Every task here is a Claude Code artifact — pure documentation. Ops does nothing in this phase beyond review.

P9.1 — Deployment documentation

Create: doc/deployment/multitenancy-architecture.md

- Block diagram: gateway nginx → tenant vhosts → TQPro API + static web → per-tenant Postgres DB + shared Keycloak + Python WhatsApp service
- Component inventory: every process and host involved, with port numbers, file locations, and which tenant context each process honors
- Request-flow walkthrough: browser → DNS → gateway nginx → /auth/config → Keycloak realm → authenticated call → AuthenticationFilter → TenantRegistry → NTSDBSession factory → tenant DB
- Multi-tenancy requirements list: wildcard DNS capability, certbot + ACME-capable outbound network, max_connections ceiling, Keycloak realm count ceiling, disk space per tenant DB

Modify: doc/deployment/bare-metal-deployment.md

- Add a multi-tenancy section: what changes in a bare-metal install when multi-tenancy is active (platform DB, nginx template location, certbot)
- Link to multitenancy-architecture.md

Modify: doc/deployment/integration-inventory.md - Note which integrations are per-tenant (Odoo, Amadeus, Telr, Twilio, Meta WhatsApp) vs platform-wide (Keycloak instance, Let's Encrypt)

P9.2 — Feature documentation

Create: doc/features/multitenancy/ directory with:

  • requirements.md — business + technical requirements that drove the design. Functional (tenant isolation, own credentials, own branding), non-functional (scale: 30 tenants initially; isolation: DB-level; authentication: realm-level), constraints (same 5 realm roles across tenants — D-12).
  • implementation.md — technical implementation summary for developers. Cover: TenantRegistry, RequestContext.tenantId, multi-realm JWTValidator, tenant-keyed NTSDBSession, TenantConfig, TenantCacheKey, TenantScope, TenantAssert. Include code examples and point to the actual source files.
  • use-cases.md — UC-MT-001 tenant onboarding, UC-MT-002 login (tenant resolved from subdomain), UC-MT-003 cross-tenant data isolation, UC-MT-004 tenant suspension / reactivation, UC-MT-005 tenant deprovisioning, UC-MT-006 platform admin managing all tenants.
  • test-cases.md — the five-test regression suite from §17 plus unit-test inventory. Each test names the fixture, the steps, the expected result, and the source file.
  • api-reference.md — the /platform/tenant/* endpoints. Request/response schemas, error codes, role requirements (platform-admin).

Modify: doc/features/README.md — index entry for multitenancy.

P9.3 — Operations documentation

Create: doc/operations/multitenancy-setup.md — finalized version of the initial-install runbook that P0.8 started:

- Platform DB bootstrap
- PostgreSQL max_connections tuning
- Platform service account setup in master realm
- TQPRO_ENCRYPTION_KEY generation and storage
- nginx template installation
- Certbot bootstrap
- Local-dev /etc/hosts recipes
- Smoke test checklist

IMPORTANT — consolidated initial-configuration reference. The P9.3 version of multitenancy-setup.md must contain (or link to) a single authoritative table that enumerates every piece of configuration a fresh TQPro install needs to become multi-tenant-operational. Ops should be able to read one document and produce a working deployment without cross-referencing the code. Include:

  1. Property file ownership — the rule that Jetty / API-server keys live in tlinqapi.properties while application / platform / tenant-DB keys live in tourlinq.properties. Show every multi-tenancy-related key with its correct file:

| Key | File | Introduced by | Example |
|---|---|---|---|
| platform.db.url | tourlinq.properties | P0.3 | jdbc:postgresql://localhost:5432/tqplatform |
| platform.db.user | tourlinq.properties | P0.3 | tqpro_platform |
| platform.db.pass | tourlinq.properties | P0.3 | (secret) |
| platform.domain | tourlinq.properties | P0.9 | tourlinq.com |
| tenant.db.host | tourlinq.properties | P0.6 | localhost |
| tenant.db.port | tourlinq.properties | P0.6 | 5432 |
| oidc-keycloak-base-url | tlinqapi.properties | P0.5 | https://dev-auth.vanevski.net |
| dev-tenant-id | tlinqapi.properties | P0.5 | seed-tenant UUID |
| http-port (consumed by the vhost template) | tlinqapi.properties | existing | 11080 |

  2. Environment variables / secrets on disk — every EnvironmentFile= entry the systemd unit for TQPro reads:

| Variable | Location | Mode | Consumed by |
|---|---|---|---|
| TQPRO_ENCRYPTION_KEY | /etc/tqpro/tqpro.env | 0600 | TenantConfig.encrypt/decrypt |
| TQPRO_PLATFORM_ADMIN_SECRET | /etc/tqpro/tqpro.env | 0600 | KeycloakRealmProvisioner (Phase 1) |
| TLINQ_HOME | systemd unit Environment= | | AppConfig + ClientConfig to locate the config directory |

  3. Database state — the rows and columns that must exist at bootstrap: tqplatform.tenant (seed row), tqplatform.wa_phone_routing, tqplatform.schema_migrations, plus the per-tenant public.schema_migrations ledger bootstrapped by scripts/bootstrap-schema-migrations.sh. Include the exact UPDATE tqplatform.tenant SET db_user=..., db_pass=encrypted:... command ops runs to move the seed row off plaintext credentials for prod.

  4. Keycloak state — the Parts A/B of §2 (platform service account in master, tqpro-admin-api client in every tenant realm, the 5 realm roles, and the tqweb-adm redirect URI pattern that must match platform.domain).

  5. nginx / certbot state — the per-tenant vhost symlinks expected under sites-enabled/, the cert paths under /etc/letsencrypt/live/, and certbot.timer being active.

  6. The post-install verification matrix — copy-pasteable curl and psql commands for each item above, with the expected response.

  7. Known correction — the P0.3 commit originally placed platform.db.* in tlinqapi.properties; the a3396038 correction moved them to tourlinq.properties. Environments bootstrapped between those commits must migrate their property values. Document the one-line sed command ops can use to move the keys.

This table is the canonical install reference for the project. Any change to where a key lives, or any new multi-tenancy property introduced in a later phase, updates this table as part of the same PR.

Create: doc/operations/tenant-provisioning.md — the ops runbook for onboarding a new tenant:

- Prerequisites checklist (DNS A record, email for tenant admin, tenant code convention)
- Step-by-step: DNS verification → scripts/tenant-provision.sh <code> "<name>" <email> → ops-side expectations at each script stage → what a successful run looks like → where to check logs
- Rollback procedure (scripts/tenant-rollback.sh) and when to use it
- Reactivation / suspension procedures (Phase 8)
- Troubleshooting: DNS not propagated, certbot rate limit, Keycloak unreachable, DB clone disk-space issues

Create: doc/operations/multitenancy-monitoring.md

- Which metrics to watch per tenant: DB connection-pool saturation, Hibernate factory count, Keycloak realm count, per-tenant request volume
- Log locations: journalctl -u nginx, journalctl -u postgresql, per-tenant gateway access logs at /var/log/nginx/<tenant-code>-gw.*.log, TQPro application logs (with tenantId in every line)
- Alerts: tenant DB unreachable at startup (health check from P7.3); max_connections utilization above 80%; certbot renewal failure

Modify: doc/operations/README.md — index the three new documents.

Modify: doc/operations/server-setup/ — if any existing file assumes a single-tenant deploy, add a multi-tenancy note pointing to the new docs. Do not rewrite; reference.

P9.4 — API documentation

Modify: doc/api/README.md — add the new /platform/* group to the index.

Create: doc/api/platform.md — complete reference for /platform/tenant/* endpoints (list, read, provision, suspend, activate, deprovision, reactivate). Request/response schemas, role requirements, error codes.

Update every existing API doc that might give a single-tenant impression: add a short preamble to each doc/api/<module>.md stating "all endpoints are tenant-scoped — the tenant is resolved from the caller's JWT". One-sentence insertion, no content rewrite.

P9.5 — Developer documentation

Create: doc/developer/multitenancy.md — developer-oriented architecture guide:

- How to read tenant context from code (RequestContext.current().getTenantId())
- How to add a new per-tenant config value (put in TenantConfig, not AppConfig)
- How to write a background job correctly (TenantScope.run(...))
- How NTSDBSession picks the right DB
- Anti-patterns: direct AppConfig.getInstance().getProp() for tenant-sensitive values, storing global singletons that accumulate per-tenant data, forgetting TenantCacheKey on a new cache

Modify: doc/developer/README.md — index entry.

Modify: doc/developer/entity-framework/ — any doc that describes NTSDBSession, EntityFacade, or the service factory chain needs a note about tenant routing. Add a short section, do not rewrite.

Modify: doc/developer/getting-started/ — onboarding docs: new developers must understand tenant context from day one. Add a "multi-tenancy essentials" subsection.

P9.6 — User documentation

Modify: doc/user/README.md — mention that end-users log in via their tenant subdomain (e.g. https://<tenant>.tourlinq.com). The rest of the user experience is unchanged per tenant.

No module-specific user docs change in Phase 9 — the end-user workflows for bookings, cruises, hotels, etc. are tenant-agnostic. Staff management (which is user-facing) updates its docs in its own execution plan's Phase 10.

P9.7 — Gap tracker

Modify: doc/plans/gaps/gaps-summary.md — add a line under a new "Multi-Tenancy" section documenting that the capability has landed and linking to doc/features/multitenancy/. Update the statistics block if relevant.

P9.8 — Phase 9 acceptance gate

Repo gate — Claude Code verifies.

- [ ] Every file listed in P9.1–P9.7 exists with substantive content (not just a stub)
- [ ] doc/features/multitenancy/requirements.md, implementation.md, use-cases.md, test-cases.md, api-reference.md — all present
- [ ] doc/deployment/multitenancy-architecture.md has the block diagram and component inventory
- [ ] doc/operations/multitenancy-setup.md, tenant-provisioning.md, multitenancy-monitoring.md — all present
- [ ] doc/operations/multitenancy-setup.md carries the consolidated initial-configuration reference from P9.3 (property file ownership table, env-var/secret table, DB-state table, Keycloak-state table, nginx/certbot table, verification matrix, known-correction notice). A fresh ops engineer must be able to bring a zero-state install to green using only this document.
- [ ] Every existing doc/api/<module>.md has the tenant-scoping preamble
- [ ] doc/developer/multitenancy.md exists
- [ ] README indexes under doc/features/, doc/operations/, doc/developer/, doc/api/ reference the new files
- [ ] Internal links between docs resolve (no broken [foo](../../bar.md) references)

Deployment gate — ops reads the tenant-provisioning runbook and confirms it matches what they actually did for the second-tenant dry-run in Phase 1. If not, ops files a correction PR back into the doc.


16. Coordination with Staff-Management Plan (D-1)

Order of delivery:

  1. This plan's Phase 0 ships first.
  2. Staff-management-plan is then executed against the tenant-aware foundation, with these specific amendments to doc/plans/management/staff-management-plan.md:
     - KeycloakAdminClient takes a realmName constructor parameter; reads clientSecret from TenantRegistry.requireById(RequestContext.current().getTenantId()).getKcAdminClientSecret() (decrypted via TenantConfig.decrypt()).
     - nts.staff_member table lives in each tenant DB (not tqplatform). D-13.
     - CTeamMember → CStaffMember migration (recommended Option B from the earlier review — subsume CTeamMember into CStaffMember with a staff_marketing_role join table) runs once per tenant DB as part of the normal scripts/apply-tenant-migrations.sh flow.
     - New AdminApi endpoints use the tenant-scoped KeycloakAdminClient automatically via RequestContext.

Update doc/plans/management/staff-management-plan.md with a "Tenant-aware addendum" section citing this execution plan before staff-management work begins.


17. Verification Plan (Overall)

Ship these five tests as a regression suite:

  1. Smoke (Phase 0 seed tenant): full existing regression passes with zero behavioral change.
  2. Isolation (Phase 1 second tenant): create data in Tenant A via its subdomain, log in as Tenant B, confirm invisibility via every list endpoint (bookings, customers, broadcasts, visa applications, tickets, groups, hotels).
  3. Cross-tenant attack: using Tenant A's JWT, POST /tlinq-api/blm/booking/read with body {"session":"", "bookingId":<tenant-B-booking-id>} — expect 404 (not 403 — avoid confirming the row exists). All TQPro APIs are POST with body per CLAUDE.md.
  4. Cache isolation (Phase 2): populate PricelistCache for Tenant A, switch context to Tenant B, verify cache miss.
  5. Deprovisioning (Phase 8): deprovision Tenant A, verify Tenant A's JWTs fail validation, Tenant A's DB is preserved.

Automate these as integration tests under tqapi/src/test/java/.../multitenancy/ with a test fixture that provisions two tenants against a disposable Keycloak + Postgres.
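Test 4 passes only if the tenant is part of the cache key. A minimal sketch of that property, assuming TenantCacheKey is a composite of tenant ID and logical key (its real fields are not spelled out here); `TenantCacheSketch` and the sample key are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical tenant-keyed cache illustrating the isolation that test 4 checks.
final class TenantCacheSketch {

    // Composite key: the same logical key under two tenants yields two
    // distinct entries, because record equality covers both components.
    record TenantCacheKey(String tenantId, String key) { }

    private final Map<TenantCacheKey, Object> entries = new ConcurrentHashMap<>();

    void put(String tenantId, String key, Object value) {
        entries.put(new TenantCacheKey(tenantId, key), value);
    }

    // Returns null on a miss, including when only another tenant holds the key.
    Object get(String tenantId, String key) {
        return entries.get(new TenantCacheKey(tenantId, key));
    }
}
```

In the integration test the same idea applies to PricelistCache: populate an entry while Tenant A's context is active, switch to Tenant B, and assert that the lookup misses.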

Ownership: The fixture is a Phase 1 deliverable (P1.5 — "Provision second tenant (proof of concept)"). Add the class tqapi/src/test/java/com/perun/tlinq/multitenancy/TwoTenantFixture.java as part of that task; it starts Keycloak + Postgres via Testcontainers and runs KeycloakRealmProvisioner.provisionTenant twice. Tests 2–5 above are added as the later phases land (Phases 2, 3, and 8 extend the suite) and plug into the shared fixture.


18. Rollout Risk Matrix

| Risk | Phase | Mitigation |
|---|---|---|
| Missed RequestContext in a code path → cross-tenant leak | 0 | NTSDBSession.getSession() fail-loud; TenantAssert in Phase 7 |
| Background job without tenant scope | 7 | TenantScope.run(tenantId, ...) audit; grep all executors |
| Platform DB unreachable → auth outage | 0 | In-memory cache + stale-read tolerance (D-6) |
| Connection pool exhaustion at scale | 0 | max_connections=500 + revisit at ~30 tenants (D-5) |
| Keycloak realm sprawl in admin UI | 1 | Deprovisioned realms disabled (not deleted); periodic audit |
| CTeamMember → CStaffMember migration inconsistency across tenants | Staff-Mgmt (§16) | Migration script runs per tenant DB via the Phase 1 ledger runner with per-file checksum verification |
| TQPRO_ENCRYPTION_KEY loss | prereq | Written rotation procedure (D-7); offline backup copy in sealed record |
| Webhook routing to wrong tenant | 5 | Explicit wa_phone_routing table; all cross-references verified before go-live |
| Subdomain DNS / nginx mis-config | 0, 6 | Staging environment with full subdomain setup before production |
| nginx template renders a broken vhost; production nginx refuses to reload | 0, 1 | nginx -t gate in tenant-provision.sh before any systemctl reload; rollback script restores previous state on failure |
| Certbot HTTP-01 challenge fails (DNS not propagated, firewall blocks port 80) | 1 | DNS pre-flight check in tenant-provision.sh; run certbot ... --dry-run first in staging; HTTP-01 location block already active because nginx was reloaded before certbot runs |
| Let's Encrypt rate limit (50 certs/registered domain/week) | 1 | Monitor cert issuance rate; if approaching limit, pause provisioning or request a rate-limit increase from Let's Encrypt |
| Orphan cert / vhost after failed provisioning | 1 | tenant-rollback.sh is idempotent; ops can re-run it safely; scripts/audit-nginx-vs-tenants.sh (future hygiene tool) lists vhosts and certs without matching ACTIVE tenant row |

19. Closing and Archival

When every phase gate above is green — including the Phase 9 documentation gate:

  • Archive doc/plans/multitenancy.md (move to doc/plans/archive/)
  • Archive doc/plans/management/staff-management-plan.md if its work is also complete
  • Archive all remaining files under doc/plans/whatsapp/ per the decision captured in doc/plans/whatsapp/gaps.md
  • Delete this execution plan (doc/plans/multitenancy-execution.md) — its work is now the product, fully documented under doc/features/multitenancy/, doc/deployment/, doc/operations/, doc/developer/multitenancy.md

What remains in doc/plans/ after archival: only the gap trackers (doc/plans/gaps/*, doc/plans/whatsapp/gaps.md) and any freshly-opened plans for work not yet started.