TQPro deployment guidelines

This guide is the entry point for anyone who needs to understand how TQPro is deployed across its various environments. It is a map, not a runbook: each section describes what a deployment of a given shape looks like and why, and points at the detailed procedural documents that ops engineers follow when actually standing one up. For the static architectural picture (port assignments, directory layouts, network topology), the companion document is architecture/server-arch.md. For the AWS-specific automation proposal that replaces today's console clicking, see deployment/aws-cdk-automation.md.

TQPro currently runs in five distinct deployment shapes:

Local development — single workstation, optional Keycloak, dev-mode auth bypass.
Lab development & testing — multi-host VM landscape, internal-only DNS, single-tenant.
Lab multi-tenancy — same lab, dedicated platform DB and template DB, tenant-per-realm/per-DB onboarding.
AWS dev/test — public, internet-exposed pre-production with all integrations live, instances provisioned today by hand.
AWS production — live customer traffic, separate Odoo instance, network-isolated build agent.

The five shapes share a small set of building blocks; they diverge only in topology, security posture, and degree of automation. The next section walks through those shared blocks once, so the per-environment sections can stay short.

Common building blocks

Build artefacts

TQPro is a multi-module Gradle 8.10 project (Kotlin DSL). The API server is produced by the tqapi module as tqapi.jar; the runtime classpath also picks up every JAR in lib/ and libext/, so ./gradlew copyDependencies is what makes a deployable bundle. Tests are disabled by default and are only run when explicitly requested. The three frontends — tqweb-adm, tqweb-pub, and tqweb-b2b — are static HTML + ES6 + Bootstrap; they have no build step. A deploy is a copy of files plus a symlink swap. There is no Docker image for the API server today; this is a known gap (see Open items below).

Configuration root and the `TLINQ_HOME` symlink pattern

Every TQPro process reads its configuration from the directory TLINQ_HOME points at. On non-dev hosts that root is /var/tqpro/, organised so that environment-specific files (credentials, SSL keystores, prompt files, output directories) live under conf/ while each numbered build sits next to it under api-NNN/. The api/ symlink targets the latest build, and within each build directory a long list of symlinks pulls configuration files back from conf/. The same pattern applies to web releases under /opt/web/release-NNN and /var/www/rel-NNN. The full directory layout is documented in architecture/server-arch.md. The point of the pattern is that a deploy never overwrites configuration, and a rollback is a single ln -sfn rather than a copy (the -n matters: without it, ln follows the existing api/ symlink and creates the new link inside the old build directory).
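The symlink swap described above can be sketched in a few lines. This is a minimal illustration, not the project's deploy script: the function name and the temporary-link suffix are hypothetical, and it assumes a POSIX filesystem where renaming one symlink over another replaces the link itself.

```python
import os
from pathlib import Path


def point_symlink(link: Path, target: str) -> None:
    """Re-point `link` at `target` (a path relative to the link's directory)."""
    tmp = link.with_name(link.name + ".tmp")  # hypothetical temp name
    # Clear a stale temp link left behind by an interrupted run.
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(target, tmp)
    # rename(2) swaps the symlink itself in one step and never follows it,
    # so readers see either the old build or the new one, never a gap.
    os.replace(tmp, link)
```

With this shape, a deploy is `point_symlink(Path("/var/tqpro/api"), "api-102")` and a rollback is the same call with the older build directory.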

What lives where matters more once multi-tenancy enters the picture. tourlinq-config.xml, nts-client.xml, hazelcast.xml, and most plugin definitions are shared across all tenants. tourlinq.properties, properties.d/*.properties, the per-plugin credentials (Amadeus, Tiqets, GoGlobal, Telr, Twilio, Anthropic), and the multi-tenancy keys (platform.db.*, tenant.db.*, template.db.name, upstream.api, upstream.web) are per-deployment. Today most of these still ship in plaintext property files; externalising them into a real secret store is the single largest unfinished piece of the deployment story.

Database layer

PostgreSQL 16 is the backbone. Every environment uses the same engine, the same pgaudit extension, and the same migration scripts.

The multi-tenant model is database-per-tenant on a shared instance: one platform database (tqplatform) holds the tenant registry, WhatsApp routing, and the platform-DB migration ledger; one database per tenant (tlinq_<tenant_code>) holds the actual customer data. New tenants are cloned from a tqpro_template database, which is itself maintained by the same migration runner so it never falls out of schema-sync.
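As a rough illustration of the clone step, the sketch below builds the PostgreSQL statement that copies the template into a new tenant database. The validation pattern for tenant codes is an assumption made for the example (the real rules are not stated here), and the helper names are hypothetical.

```python
import re

TEMPLATE_DB = "tqpro_template"
# Assumption: tenant codes are short lowercase identifiers.
_CODE_RE = re.compile(r"[a-z][a-z0-9_]{1,30}")


def tenant_db_name(code: str) -> str:
    """Map a tenant code to its database name, rejecting unsafe input."""
    if not _CODE_RE.fullmatch(code):
        raise ValueError(f"invalid tenant code: {code!r}")
    return f"tlinq_{code}"


def clone_statement(code: str) -> str:
    """Build the CREATE DATABASE statement that clones the template."""
    return f'CREATE DATABASE "{tenant_db_name(code)}" TEMPLATE "{TEMPLATE_DB}"'
```

One operational detail worth knowing: PostgreSQL refuses to copy a template database while other sessions are connected to it, so the clone has to run when nothing else holds a connection to tqpro_template.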

Migrations are sequential numbered SQL files in config/db-changes/ (currently 0001–0072). A ledger table tracks which have been applied per database. Two scripts handle the work:

scripts/apply-platform-migrations.sh — applies platform-scoped migrations to tqplatform.
scripts/apply-tenant-migrations.sh — applies tenant-scoped migrations to tqpro_template and to every tlinq_<code> database in turn.

Both runners support --dry-run and are idempotent; reapplying after a failure picks up where the previous run stopped. The seed data needed to bootstrap an empty template (config/tenant-template/tqpro_schema.sql, tlinq-dimension-data.sql, tlinq-reference-data.sql) is committed to git so a fresh environment can be rebuilt deterministically. There is no Flyway, no Liquibase, no Spring migration plugin — deliberately, to keep the runner small and observable.
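The resume-after-failure behaviour follows directly from the ledger. A minimal sketch of that logic, assuming the ledger can be modelled as a set of applied file names and that `apply_one` is a hypothetical callback that runs one SQL file (the real runners are shell scripts and may differ in detail):

```python
from typing import Callable, Iterable, List, Set


def pending(available: Iterable[str], applied: Set[str]) -> List[str]:
    """Migrations not yet in the ledger, in numeric-prefix order."""
    # Zero-padded prefixes (0001, 0002, ...) sort correctly as plain strings.
    return sorted(name for name in available if name not in applied)


def run_migrations(
    available: Iterable[str],
    applied: Set[str],
    apply_one: Callable[[str], None],
    dry_run: bool = False,
) -> List[str]:
    """Apply pending migrations; return the names handled (or planned)."""
    handled = []
    for name in pending(available, applied):
        if not dry_run:
            apply_one(name)    # may raise; the ledger is untouched for this file
            applied.add(name)  # record only after the file succeeded
        handled.append(name)
    return handled
```

Because a file is recorded only after it succeeds, a second run after a failure starts at the file that failed and skips everything before it.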

Identity layer

Authentication is Keycloak end-to-end. Every non-dev environment runs at least one Keycloak instance and one realm; multi-tenant environments add one realm per tenant. Local development is the exception, because it can bypass authentication entirely.

Three realms are part of the long-term design: tqpro-adm for the admin console, tqpro-pub for the public website, and tqpro-b2b for partner access. Today only tqpro-adm is live in any environment; the other two are placeholders that are not yet wired. The master realm hosts a single confidential client called tqpro-platform-admin whose service account holds only the create-realm role and is used by the Java platform API to provision tenant realms.

The authentication flow is in transition. The legacy mode used oauth2-proxy as a reverse-proxy gate that injected X-User, X-Roles, and X-Email headers into requests reaching the API. The new mode is native OIDC, where the SPA holds the access token directly and the API validates JWTs against Keycloak's JWKS. The codebase currently runs in auth-mode=hybrid (both paths work), with a planned move to native-oidc once the legacy code is removed. Step-by-step migration notes are in operations/server-setup/oidc-migration.md. For local-dev convenience there is also dev-mode=true, which bypasses authentication entirely and is documented in developer/getting-started/dev-mode-setup.md. It must never be enabled on a host that is reachable from anything other than the operator's own loopback interface.
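To make the hybrid mode concrete, here is a hedged sketch of how a request's identity could be resolved under each mode. It is not the API server's code: `verify_jwt` is a hypothetical callback standing in for JWKS validation, the comma-separated X-Roles format is an assumption, and header names are assumed to arrive in the exact case shown.

```python
from typing import Callable, Mapping, Optional


def resolve_identity(
    headers: Mapping[str, str],
    auth_mode: str,
    verify_jwt: Callable[[str], Optional[dict]],
) -> Optional[dict]:
    """Return the caller's identity, or None if the request is unauthenticated."""
    authz = headers.get("Authorization", "")
    # Native path: the SPA sends a bearer token that the API validates itself.
    if auth_mode in ("hybrid", "native-oidc") and authz.startswith("Bearer "):
        return verify_jwt(authz[len("Bearer "):])
    # Legacy path: trust the headers injected by oauth2-proxy (hybrid only).
    if auth_mode == "hybrid" and headers.get("X-User"):
        raw_roles = headers.get("X-Roles", "")  # assumed comma-separated
        roles = [r.strip() for r in raw_roles.split(",") if r.strip()]
        return {"user": headers["X-User"], "roles": roles,
                "email": headers.get("X-Email")}
    return None
```

The legacy branch is only safe while the API is unreachable except through the proxy, since anything that can talk to the API directly could set those headers itself. That is one practical reason to finish the move to native-oidc.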

Edge layer (nginx, three tiers)

Every non-dev environment uses the same three-tier nginx pattern:

Internet → Gateway nginx (TLS) → Web nginx (static + /tlinq-api proxy) → API nginx → tqapi.jar

The gateway terminates TLS, owns one server block per public hostname, and round-robins to the upstream pools defined in /etc/nginx/conf.d/tqpro-upstreams.conf. The web tier serves static files for tqweb-adm, tqweb-pub, and tqweb-b2b, and proxies /tlinq-api/* calls to the API tier. The API tier is just the Jetty 12 server inside tqapi.jar, listening on 11080 (lab) or 11180 (AWS) with no TLS of its own — it relies on being reachable only from the web tier within the same security zone.

Certificates come from Let's Encrypt via certbot in internet-facing environments, and from an internal step-ca instance in the closed lab. Both flows go through the same certbot HTTP-01 challenge — only the ACME directory URL changes (certbot.acme.server in tourlinq.properties). The lab step-ca walk-through is in operations/multitenancy-setup.md, Appendix B. Multi-tenant gateway and web vhosts are rendered from templates in config/Nginx Config/templates/ by scripts/render-vhost.py.
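Template-based vhost rendering can be illustrated with the standard library alone. The sketch below is not scripts/render-vhost.py: the template text, the placeholder names, and the upstream name are all hypothetical. The certificate paths follow certbot's usual /etc/letsencrypt/live/<name>/ layout.

```python
from string import Template

# Hypothetical gateway template. nginx's own variables ($host) are escaped
# as $$ so that Template leaves them alone.
GATEWAY_VHOST = Template("""\
server {
    listen 443 ssl;
    server_name $hostname;
    ssl_certificate     /etc/letsencrypt/live/$hostname/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/$hostname/privkey.pem;

    location / {
        proxy_set_header Host $$host;
        proxy_pass http://$upstream;
    }
}
""")


def render_gateway_vhost(hostname: str, upstream: str) -> str:
    """Fill the template; substitute() raises KeyError on a missing value."""
    return GATEWAY_VHOST.substitute(hostname=hostname, upstream=upstream)
```

Using `substitute` rather than `safe_substitute` is deliberate in this sketch: a missing placeholder fails loudly at render time instead of producing a vhost that nginx later rejects.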

Service runtime

The API server runs as a JVM daemon under jsvc (Apache Commons Daemon). The wrapper is config/tlinq-service.sh, which exposes start, stop, restart, status, and rotate — the last is what logrotate invokes so logs roll without bouncing the process. jsvc runs as a non-root user, writes its PID file alongside the deploy, and pulls the classpath together from tqapi.jar plus everything under lib/ and libext/. A debug.ind marker file in conf/ switches on JDWP on port 5005 — a deliberate choice over a config-file flag because turning the debugger on or off then needs no restart, just a touch or rm. Setup details and the systemd integration are in operations/server-setup/jsvc-setup.md.

Two HTTP endpoints make the daemon visible to load balancers and orchestrators: /system/health returns 200 once Hazelcast and the database pools have come up, and /system/readiness returns 200 once plugin initialisation is also complete. Both are documented in api/system.md and are the right hooks for ALB/NLB target group health checks in any AWS deployment.
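A deploy script can use the same two endpoints to wait for a freshly started instance. The following is a minimal sketch under stated assumptions: the function names, retry counts, and delays are illustrative, and only the two endpoint paths come from this guide.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status for `url`, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, just not with a 2xx
    except OSError:
        return 0  # connection refused, timeout, DNS failure


def wait_until_ready(
    base_url: str,
    fetch: Callable[[str], int] = http_status,
    attempts: int = 30,
    delay: float = 2.0,
) -> bool:
    """Poll health first, then readiness; True once both return 200."""
    for path in ("/system/health", "/system/readiness"):
        for _ in range(attempts):
            if fetch(base_url + path) == 200:
                break
            time.sleep(delay)
        else:
            return False  # this endpoint never came up
    return True
```

Passing the fetch function in as a parameter keeps the polling logic testable without a running server.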

Distributed runtime (Hazelcast)

Once more than one API instance runs, Hazelcast handles user sessions, the shopping cart, the API roles cache, and the scheduled-task locks. The current cluster discovery is TCP-IP with hardcoded IPs in hazelcast.xml, which works fine on bare-metal lab and AWS but blocks K8s adoption. Migration to service-DNS discovery is documented in operations/scalability/hazelcast-kubernetes.md. For non-K8s environments, scaling the API tier is a matter of editing hazelcast.xml and rolling instances.

Local development

The local-dev shape is the simplest of the five and the only one not pinned to Linux. The required infrastructure is a JDK 17, a PostgreSQL 16 instance, and a local nginx (or, on Windows, a Linux VM running nginx). Authentication is normally bypassed via dev-mode=true because Keycloak adds setup that only matters once you start touching auth-related code. With dev-mode on, the API server treats every request as the synthetic user defined by dev-user-roles, so role-gated endpoints behave realistically without Keycloak being in the loop.

The fastest workflow is to point nginx straight at the source tree of tqweb-adm (so changes hit the browser without a build) and run the API from the IDE or ./gradlew :tqapi:run. The only environment variable that has to be right is TLINQ_HOME, pointing at the config/ directory in the working copy — the API is otherwise self-contained and uses the local database for everything. For the rare case where you do need Keycloak locally — for example, to debug an OIDC integration — the optional Docker setup is in operations/local-dev/linux-workstation.md, and a dedicated Keycloak realm/client/user walkthrough is in operations/local-dev/keycloak-configuration.md. Windows developers using a Linux VM follow operations/local-dev/windows-workstation.md.

There is no deployment to speak of in this shape: every artefact lives in the working copy, every restart is a Gradle command, every config change is a property file edit. The next section is where deployment actually starts.

Lab development & testing

The lab is the first environment where TQPro looks like a real distributed system. It runs on a closed network — disconnected from the public internet — across a small fleet of VMs that mirror the AWS production topology so that the bare-metal deploy procedure is the same procedure that runs in the cloud, just with internal hostnames and an internal CA.

The current lab landscape is documented in detail in architecture/server-arch.md, under "Lab development & testing deployment". The short version is: one PostgreSQL host (pgdb2.vanevski.net), one Keycloak host (dev-auth01), one API host (dev-api01), two web hosts (dev-web0[12]), two gateway hosts (dev-gw0[12]) running keepalived for an HA floating IP, and one TeamCity build agent (lab-cid) which doubles as the orchestration host for multi-tenant operations.

Three things distinguish the lab from local-dev. First, real Keycloak: the tqpro-adm realm with the tqweb-adm and tqpro-admin-api clients is configured the same way it is in production; oauth2-proxy is also present so the hybrid auth path can be exercised before AWS sees it. Second, real nginx three-tier: the gateway terminates TLS using a step-ca-issued certificate, the backend nginx serves static files and reverse-proxies the API, and the API tier listens on 11080. The full nginx setup including keepalived is in operations/server-setup/web-server.md, and the closed-network DNS infrastructure (two Raspberry Pis with dnsmasq + keepalived) is in operations/server-setup/dns-setup.md. Third, real jsvc: the API server runs as a daemon, with logs in /var/log/tqpro and the symlink-based deployment layout under /var/tqpro/.

The procedural reference for everything in this section is deployment/bare-metal-deployment.md. It is a 1300-line step-by-step that covers OS prerequisites, package installs, PostgreSQL provisioning, Keycloak install, nginx layout, jsvc + systemd integration, and S3 wiring. Read it when standing up a new lab host. Do not read it when trying to understand how the system is shaped — for that, this guide is the right entry point.

lab-cid runs a TeamCity build agent that connects to the AWS-hosted TeamCity server, pulls source, and deploys outwards over SSH using the user-key pattern documented in operations/build-and-ci/teamcity-setup.md. The same host is used as the orchestration server for multi-tenant operations: it is the only place from which scripts/tenant-provision.sh is meant to run, because it is the only host that has both the platform-admin Keycloak token and SSH access to the gateway and web hosts.

Lab multi-tenancy

Lab multi-tenancy is a parallel deployment in the same lab landscape, set up specifically to exercise the tenant onboarding path before it is rolled out to AWS. The architecture is the same as single-tenant lab, with three additions: a platform database (tqplatform), a template database (tqpro_template), and a Keycloak master-realm service account (tqpro-platform-admin) that the Java platform API uses to create per-tenant realms. The full picture is in architecture/server-arch.md § Lab multi-tenancy.

Multi-tenant deployment is split into two distinct operations.

Foundation install is run once per environment. It bootstraps the platform DB, the template DB, the Keycloak service account, the encryption key for stored secrets, and the orchestration host's tooling. The runbook is operations/multitenancy-setup.md, organised into eight phases plus two appendices (Appendix A for migrating an existing single-tenant install, Appendix B for the closed-lab step-ca setup). It is the densest deployment document in the repo and repays a careful read.

Tenant onboarding is run once per tenant. The operator runs scripts/tenant-provision.sh <code> "<name>" <admin_email> [<wa_phone_id>] from the orchestration host, and the script does six things in order: DNS pre-flight, clone of tqpro_template to tlinq_<code>, render and SSH-install of the gateway vhost, certbot HTTP-01 issuance, render and SSH-install of the web-tier vhost, and a POST to /platform/tenant/provision on the local API which creates the Keycloak realm and the tqplatform.tenant row. Any failure past the second step triggers scripts/tenant-rollback.sh automatically. The runbook is operations/tenant-provisioning.md, which also covers verification, manual rollback, suspension/reactivation via the platform admin API, and registry refresh as an escape hatch.
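The ordering and rollback rule can be sketched as a small step runner. This is an illustration of the control flow only, not a port of scripts/tenant-provision.sh: the `Step` type, the function name, and the callbacks are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]  # (label, action)


def provision(
    steps: Sequence[Step],
    rollback: Callable[[], None],
    rollback_after: int = 2,
) -> None:
    """Run steps in order; roll back if a step past `rollback_after` fails."""
    for index, (_label, action) in enumerate(steps, start=1):
        try:
            action()
        except Exception:
            # Steps 1 and 2 (DNS pre-flight, clone) fail without a rollback;
            # any later failure triggers it before the error propagates.
            if index > rollback_after:
                rollback()
            raise
```

The six entries in `steps` would be the six operations listed above, in that order, with `rollback` wrapping scripts/tenant-rollback.sh.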

The orchestration model is worth highlighting because it informs the AWS automation proposal. The script lives on a host inside the protected subnet; the gateway host is in the DMZ and intentionally has neither database credentials nor TQPro processes. Everything that has to happen on the gateway happens over SSH, with a dedicated tqpro-deploy user that has NOPASSWD sudo only for the small set of commands the script invokes. Web hosts follow the same pattern, except the deploy user there gets to run nginx -t, systemctl reload nginx, and a couple of file copies. There is no agent, no central config server, no message bus — just SSH, certbot, and psql. This minimalism is what makes the script easy to port to AWS via SSM RunCommand without rewriting the orchestration logic.

AWS dev/test

AWS dev/test is a fully internet-exposed pre-production environment where every integration is wired to its real counterpart and every deployment touches real DNS, real certificates, and real third-party APIs. The current shape is collapsed: due to cost pressures, multiple roles share app01.dev.perunapps.com, which runs Odoo, the TQPro API, and the admin website concurrently. PostgreSQL lives on its own instance (pgdb01.dev.perunapps.com), Keycloak on its own (auth01.dev.perunapps.com), and the gateway/NAT lives on gw01.dev.perunapps.com with a separate management hostname mgw01.perunapps.com. The detailed inventory is in architecture/server-arch.md, under "AWS development + test environment".

What carries over from lab is most of the application: the /var/tqpro/ directory layout, the symlinked api/ deploy slot, the jsvc daemon, the three-tier nginx pattern, the migration runners. What changes is mostly above the application: a real VPC with security groups (perun-sg-dev, perun-sg-dev-db, perun-sg-public, perun-sg-auth), the gateway doubling as a NAT instance and bastion host, a separate management DNS for the gateway, and Tailscale on the database host so developers can hit it for migrations and debugging without SSH-tunnelling through the gateway. The API server listens on port 11180 here (not 11080), which is a vestige of the earlier tlinq server that has since been retired.

There is one structural change worth calling out. In the lab, the /tlinq-api/ proxy lives on the web nginx, because the web tier and API tier are on separate hosts and the proxy makes the web host the single point of contact for the SPA. On AWS, the /tlinq-api/ proxy lives on the gateway nginx, because the web and API tiers share an instance and the gateway is already the single point of contact. The web nginx server block on AWS therefore omits the /tlinq-api/ location entirely — see the "Application server configuration" sample in architecture/server-arch.md.

Everything in this environment was provisioned by hand in the AWS console and manually configured with shell scripts. There is no infrastructure-as-code today, no automated DNS provisioning, no automated security group changes, no automated AMI baking. Standing up a second AWS dev environment from scratch — say, to test a major OS upgrade — would mean repeating the entire console + SSH walkthrough. The proposed remediation is documented in deployment/aws-cdk-automation.md: a TypeScript CDK app that codifies the VPC, subnets, security groups, EC2 instances, EBS volumes, S3 buckets, Route53 records, Secrets Manager entries, and the orchestration host that runs the existing tenant-provisioning scripts.

AWS production

AWS production runs the live, single-tenant TQPro deployment that powers peruntours.com, bookmyholiday.ae, and the admin.peruntours.com admin console. The shape is similar to AWS dev/test, with three notable differences. First, Odoo runs on its own instance (odoo.prod.perunapps.com) rather than collocated, because the production CRM load demands it. Second, the API and the web tier are still collocated (on web01.prod.perunapps.com) and Keycloak is still on its own instance (auth01.prod.perunapps.com), but production has its own build agent, collocated with Keycloak. This is deliberate, because the production network must have no path to the testbed network. Third, websites deploy under /var/www/rel-NNN/ rather than /var/tqpro/web-NNN/; the bookmyholiday.ae site lives outside the rel-NNN tree because it is not versioned alongside the TQPro releases. The full inventory and config samples are in architecture/server-arch.md, under "Production environment".

Two known gaps deserve attention before any new operator inherits this environment. Automated PostgreSQL backups are not configured — backups are run manually on a periodic basis, which is fine for the current data volume but not safe long-term. And Keycloak does not have MFA enabled for any of the admin clients, even though the realm definition supports it. Both items are tracked in plans/gaps/gaps-operations.md.

The same automation gap that affects AWS dev/test affects production: every host was provisioned by hand. The CDK proposal in deployment/aws-cdk-automation.md is intended to apply equally to both AWS environments, with the production stack instance behind a manual approval gate so a faulty CDK change cannot reach prod accidentally.

Promoting a change across environments

A single change typically traverses four environments before it is live: developer workstation → lab → AWS dev/test → AWS production. The mechanism is build-numbered deployment slots: a successful TeamCity build produces an api-NNN directory on the API host and a matching release-NNN (or rel-NNN) directory on the web host, then re-points the api/ and release/ symlinks. Rollback is a symlink swap. There is no canary mechanism today — a deploy is atomic from the operating system's point of view, even though some clients may briefly hold a connection to the old api-NNN.
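Choosing the rollback target is the only non-trivial part of a symlink rollback. A minimal sketch, assuming that "roll back" means "the highest build number below the current one" (an assumption, since the guide does not define the selection rule) and using hypothetical helper names:

```python
import re
from typing import Iterable, Optional

_SLOT_RE = re.compile(r"api-(\d+)")


def build_number(slot: str) -> Optional[int]:
    """Extract NNN from an api-NNN directory name, or None for other names."""
    match = _SLOT_RE.fullmatch(slot)
    return int(match.group(1)) if match else None


def previous_slot(slots: Iterable[str], current: str) -> Optional[str]:
    """Return the newest slot older than `current`, or None if there is none."""
    current_no = build_number(current)
    if current_no is None:
        raise ValueError(f"not a build slot: {current!r}")
    older = []
    for slot in slots:
        number = build_number(slot)
        if number is not None and number < current_no:
            older.append((number, slot))
    # Compare numerically so that api-99 sorts below api-100.
    return max(older)[1] if older else None
```

Comparing build numbers as integers matters here: a plain string sort would rank api-99 above api-100.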

Database migrations are decoupled from code deploys. The runner scripts apply pending migrations idempotently, and the application is expected to be backward-compatible with the migration immediately preceding it. In practice this means migrations are applied first, and the code that depends on them is deployed afterwards. That order is safe as long as the migration is additive, because the build that is still running keeps working against the new schema; the inverse order works only when the new code does not yet depend on the schema change. There is no automated coordinator that enforces the order; the pattern relies on the migration files being reviewed alongside the code that consumes them.

Configuration drift between environments is a real risk and the main reason this guide exists. The same tlinqapi.properties and tourlinq.properties exist in five places, and each copy is a snapshot of someone's vi session at the time the host was set up. The migration to AWS Secrets Manager described in the CDK proposal is the planned path out of this; until it is in place, the practical mitigation is to keep the platform-domain-specific values (URLs, ports, hostnames) in property files but to externalise the credentials into a sealed /etc/tqpro/tqpro.env file that systemd loads as EnvironmentFile=. The encryption key for at-rest credential encryption (TQPRO_ENCRYPTION_KEY) already follows this pattern; the rest should.
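The interim mitigation amounts to an overlay: property files supply the non-secret values, and variables loaded by systemd through EnvironmentFile= override the credential keys. A hedged sketch of that overlay follows; the parser is deliberately simplistic, and the mapping from property keys to variable names is hypothetical.

```python
from typing import Dict, Mapping


def parse_kv(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    result = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def effective_config(
    properties_text: str,
    env: Mapping[str, str],
    secret_keys: Mapping[str, str],
) -> Dict[str, str]:
    """Overlay environment-supplied secrets onto property-file values.

    `secret_keys` maps a property key to the environment variable that
    should override it (a hypothetical mapping, defined per deployment).
    """
    config = parse_kv(properties_text)
    for prop_key, env_name in secret_keys.items():
        if env_name in env:
            config[prop_key] = env[env_name]
    return config
```

In a real process `env` would be `os.environ`, which systemd populates from the environment file before the service starts.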

Open items and known gaps

The deployment story has rough edges, all of them tracked. The single most disruptive item is the absence of any infrastructure-as-code: every AWS environment is reproduced by clicking, with predictable consequences for drift, audit, and disaster recovery. The CDK proposal in deployment/aws-cdk-automation.md is the planned remediation. Closely related is the absence of a CI/CD pipeline that builds and deploys without operator intervention: the .teamcity/ DSL exists but is not running, and there are no GitHub Actions or GitLab workflows. Both gaps are documented in plans/gaps/gaps-operations.md §1, §2, §4.

Observability is the other large gap. Prometheus, Grafana, and Micrometer are fully designed in operations/observability/requirements.md and operations/observability/prometheus-grafana-setup.md, but no instrumentation is wired into the build and no scrape targets exist in any environment. The plugin-level metric work for tqamds, described in operations/observability/tqamds-observability.md, is blocked behind the same missing foundations. Until that lands, deployment health is only visible through /system/health and through reading log files.

Containerisation is on the roadmap but unstarted. The Hazelcast K8s discovery work in operations/scalability/hazelcast-kubernetes.md is the unblocking task; once that is done the steps in operations/scalability/kubernetes-deployment.md become tractable. The CDK proposal is deliberately container-free so progress is not blocked on this work.

The OIDC migration is in progress. The codebase runs in auth-mode=hybrid, the platform realm has the right clients, and oauth2-proxy is still in the loop on most environments. Steps 3–5 (switching to native-oidc, removing oauth2-proxy, deleting the legacy header path) are tracked in operations/server-setup/oidc-migration.md.

Finally, secrets management. Most credentials currently live in plaintext property files that are committed to git. AWS Secrets Manager (or any equivalent) plus a small instance-boot loader that materialises secrets into ${TLINQ_HOME} would close this off without any application change. The CDK proposal includes this as a first-class section.

Index of detailed runbooks

When you know what you need to do, this is the table that takes you to the right document.

| Task | Runbook |
| --- | --- |
| Set up a Linux dev workstation (Docker Compose Keycloak + nginx) | `operations/local-dev/linux-workstation.md` |
| Set up a Windows dev workstation (Linux VM, systemd Keycloak) | `operations/local-dev/windows-workstation.md` |
| Configure Keycloak realms / clients / users for local dev | `operations/local-dev/keycloak-configuration.md` |
| Run the API with auth bypassed | `developer/getting-started/dev-mode-setup.md` |
| Stand up a bare-metal lab or AWS host end-to-end | `deployment/bare-metal-deployment.md` |
| Configure the three-tier nginx layout with HA keepalived | `operations/server-setup/web-server.md` |
| Install Keycloak on Ubuntu 22.04 (oauth2-proxy mode) | `operations/server-setup/authentication-keycloak.md` |
| Run Keycloak + nginx with Redis-backed oauth2-proxy sessions | `operations/server-setup/keycloak-nginx-setup.md`, `operations/server-setup/redis-oauth2-proxy.md` |
| Set up internal HA DNS for a closed lab | `operations/server-setup/dns-setup.md` |
| Wrap the API server as a jsvc daemon with logrotate | `operations/server-setup/jsvc-setup.md` |
| Migrate from oauth2-proxy to native OIDC | `operations/server-setup/oidc-migration.md` |
| Bootstrap multi-tenancy on a new environment | `operations/multitenancy-setup.md` |
| Provision a new tenant | `operations/tenant-provisioning.md` |
| Configure the AI Outline feature (Anthropic API) | `operations/ai-outline-admin-guide.md` |
| Import GoGlobal hotel static data | `operations/goglobal-hotel-import.md` |
| Configure WhatsApp / Twilio for a tenant | `operations/whatsapp-configuration-guide.md`, `operations/wa-service-first-deployment.md` |
| Inventory of third-party integrations and their credentials | `deployment/integration-inventory.md` |
| Set up TeamCity to build the multi-module Gradle project | `operations/build-and-ci/teamcity-setup.md` |
| Path-based VCS triggers and selective component builds | `operations/build-and-ci/build-strategy.md` |
| Replace AWS console clicking with CDK | `deployment/aws-cdk-automation.md` |