AWS CDK automation proposal

This document proposes how to replace the current "click in the AWS console plus SSH-fan-out shell scripts" provisioning model with a non-clicking, code-defined AWS deployment based on AWS CDK in TypeScript. It is a design document, not a working repository. The architectural premises and the existing five-environment landscape are described in deployment/deployment-guide.md; the static topology of the AWS environments is in architecture/server-arch.md.

Goals and non-goals

The goal is non-clicking provisioning. Every long-lived AWS resource — VPCs, subnets, security groups, EC2 instances, EBS volumes, S3 buckets, Route53 records, ACM certificates, IAM roles, Secrets Manager entries — is declared in code, reviewed in pull requests, and applied through cdk deploy from CI. The AWS console is reserved for break-glass operations and read-only investigation.

Equally important is what is out of scope for this proposal. We are not containerising the API server. We are not migrating off self-hosted PostgreSQL to RDS. We are not introducing Kubernetes. We are not building a multi-region or active-active topology. Each of those is a separate, larger initiative tracked elsewhere; the relevant existing documents are linked from the Open questions section at the bottom. The point of staying narrow is that this work is not blocked by any of them: the lift-and-shift can ship in three to four weeks, after which the heavier modernisation conversations can happen with infrastructure-as-code already in place to stand on.

The premises that shape every decision below come directly from the operational reality of TQPro:

PostgreSQL stays on EC2. A dedicated instance with gp3 EBS, daily snapshots through Data Lifecycle Manager, the pgaudit extension, and the same multi-tenant database-per-tenant pattern that runs in the lab. RDS would change the backup story, the connection-pool tuning, and the Tailscale developer-access model — none of those changes are wanted in this iteration.
The API server stays as a jsvc daemon on EC2. No containers. UserData calls the existing config/tlinq-service.sh once secrets and configuration are in place, and from that point on the runtime behaves identically to lab.
The existing tenant-provisioning scripts keep working unchanged. scripts/tenant-provision.sh, scripts/apply-tenant-migrations.sh, and the Python vhost renderer all run on a small orchestration EC2 instance reachable only via SSM Session Manager. The CDK app builds the platform on which they run; it does not replace them.
No SSH from the public internet. All operator access is SSM Session Manager. The bastion role of the gateway instance is preserved only for in-VPC SSH between hosts, with the keys under instance profile control rather than personal accounts.

Why CDK and not CloudFormation or Terraform

CDK is the right tool for this codebase for three reasons. It is AWS-native: there is no third-party state file to manage and no parallel cloud target to support, so Terraform's portability is wasted weight. It compiles to CloudFormation under the hood, which keeps the deployment model auditable through the AWS-side tools the team already uses. And it gives us real TypeScript — loops, conditionals, classes — which matters when the same network and IAM scaffolding has to be instantiated for two distinct environments (tqpro-dev, tqpro-prod) with surgical differences. The same shape in raw CloudFormation YAML would either be two near-duplicate templates or a thicket of conditions and parameter overrides. Terraform's count/for_each would also work, but the operational overhead of running an S3+DynamoDB state backend is not worth taking on for an AWS-only deployment with a small number of stacks.
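
The "same scaffolding, surgical differences" idea can be sketched as a small typed settings table that the app entry point would loop over. This is a minimal sketch in plain TypeScript with no CDK imports; the field names, the `tqpro-<env>-<stack>` naming scheme, and the prod CIDR are illustrative assumptions, while the retention periods and the dev CIDR example come from elsewhere in this document.

```typescript
// Hypothetical per-environment settings. Field names, the stack naming scheme,
// and the prod CIDR are assumptions made for illustration only.
type EnvName = 'dev' | 'prod';

interface EnvConfig {
    envName: EnvName;
    cidr: string;                   // VPC CIDR block
    snapshotRetentionDays: number;  // DLM retention for the PostgreSQL data volume
    manualApproval: boolean;        // whether CI gates the deploy behind an approval
}

const ENV_NAMES: EnvName[] = ['dev', 'prod'];

const ENVIRONMENTS: Record<EnvName, EnvConfig> = {
    dev:  { envName: 'dev',  cidr: '10.20.0.0/16', snapshotRetentionDays: 14, manualApproval: false },
    prod: { envName: 'prod', cidr: '10.30.0.0/16', snapshotRetentionDays: 35, manualApproval: true },
};

const STACKS = ['network', 'secrets', 'data', 'storage', 'app'];

// Build a stack identifier such as 'tqpro-dev-network'.
function stackId(env: EnvName, stack: string): string {
    return `tqpro-${env}-${stack}`;
}

// One loop produces every stack identifier for every environment; in a real
// entry point the loop body would instantiate the stack classes instead.
function plannedStackIds(): string[] {
    const ids: string[] = [];
    for (const env of ENV_NAMES) {
        for (const stack of STACKS) {
            ids.push(stackId(env, stack));
        }
    }
    return ids;
}
```

Keeping the differences in one table means a reviewer can see at a glance what separates dev from prod, which is harder to do with two near-duplicate templates.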

The flip side: the CI runner needs Node.js and a working cdk CLI (the aws-cdk npm package), and developers have to learn enough TypeScript to read the stack code. Both are cheap.

Target topology

Each environment gets one VPC across two availability zones. Within the VPC there are three subnet groups: public (gateway and NAT egress), private app (API, web, Keycloak, orchestration), and private data (PostgreSQL). Security groups mirror the existing perun-sg-* naming so on-call operators recognise them; the CDK properties that hold them are named to match (gwSg, appSg, authSg, dbSg, orchSg), with the collocated API and web tiers sharing appSg.

The instance roles are:

Gateway tier: one instance per AZ, fronted by a Network Load Balancer with an Elastic IP per AZ, running nginx and certbot. The NLB exists so the public-facing IPs are stable across instance replacements; the gateway nginx still terminates TLS and reverse-proxies to the app tier. The gateway also remains the in-VPC bastion for emergency SSH between hosts. Operator access is SSM Session Manager; the bastion role survives only as a same-VPC convenience for the API host to reach the gateway when running certbot challenges.
App tier: one Auto Scaling Group with min=desired=max=1 initially, sized so the same code path scales horizontally later. UserData installs Java 17, jsvc, and nginx, fetches the configuration bundle from Secrets Manager, lays out /var/tqpro/conf/, and starts the API and web nginx. The web tier is collocated on the same instance, matching the production model where /var/www/rel-NNN/ and /var/tqpro/api-NNN/ live side by side.
Data tier: one dedicated EC2 with gp3 EBS, pgaudit installed via UserData, and a Data Lifecycle Manager policy that snapshots the data volume daily and retains snapshots for 14 days (dev) or 35 days (prod). PostgreSQL listens on 5432 and is reachable only from appSg and authSg.
Identity tier: one dedicated EC2 running Keycloak as a systemd service with a database backend on the data tier instance. Keycloak listens on 8080 internal-only; the gateway publishes auth.<env>.perunapps.com and reverse-proxies to it.
Orchestration tier: one small (t3.small) EC2 in the private app subnet with no inbound network access whatsoever. It holds the working copy of the TQPro repo, runs scripts/tenant-provision.sh on demand via SSM RunCommand, and pulls the platform-admin Keycloak token from Secrets Manager when it needs one. It is the AWS equivalent of the lab's lab-cid host.

Networking is minimal but deliberate. A single NAT gateway gives private subnets outbound internet. VPC endpoints for SSM, SSM Messages, EC2 Messages, S3, and Secrets Manager let the orchestration host and the app tier talk to AWS APIs without traversing the NAT, which saves cost and removes the NAT path as a dependency for the Secrets Manager read at boot. Route53 hosted zones own the public DNS, and ACM certificates cover anything where a load balancer terminates TLS; certbot continues to issue per-tenant certificates because the NLB listeners here are plain TCP pass-through and TLS terminates on the gateway nginx.

Storage is split by purpose. One S3 bucket holds build artefacts (tqapi.jar, the lib/ bundle, the static frontend zips); another holds visa documents and inherits the SSE-KMS configuration documented in features/visa/secure-document-storage.md; a third holds CloudWatch log exports and EBS snapshot exports for off-site retention. The CDN is CloudFront, fronting the build-artefact bucket for the static frontends, matching the existing content.cdn-prefix=https://d117ew1ll41gas.cloudfront.net/ pattern.

Secrets live in AWS Secrets Manager, encrypted with a customer-managed KMS key per environment. Every credential that today lives in tourlinq.properties, properties.d/messaging.properties, properties.d/erp-booking.properties, odoo-server.properties, and the per-plugin XML files is bundled into a Secrets Manager entry. UserData for the app and identity instances pulls the bundle at boot, writes it into /var/tqpro/conf/ with mode 0640 and owner tqpro-svc:tqpro-svc, and starts tlinq-service.sh. No application code change is required on day one: the loader produces the same files the API already reads.
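
The loader step (JSON bundle in, properties file out) can be sketched as a pure function. This is a minimal sketch under stated assumptions: the bundle is a flat JSON object of string values, the key names in the usage note are hypothetical, and the function name `renderProperties` is not part of any existing code. It also does not escape backslashes or other characters that Java's properties format treats specially.

```typescript
// Render a flat JSON secret bundle as key=value lines.
// Assumption: every value is a single-line string; anything else is rejected
// rather than silently written into the properties file.
function renderProperties(secretJson: string): string {
    const bundle = JSON.parse(secretJson) as Record<string, unknown>;
    const lines: string[] = [];
    // Sort keys so the generated file is stable from one boot to the next.
    for (const key of Object.keys(bundle).sort()) {
        const value = bundle[key];
        if (typeof value !== 'string') {
            throw new Error(`secret key "${key}" does not hold a string value`);
        }
        if (value.indexOf('\n') !== -1) {
            throw new Error(`secret key "${key}" holds a multi-line value`);
        }
        lines.push(`${key}=${value}`);
    }
    return lines.join('\n') + '\n';
}
```

For example, `renderProperties('{"db.user":"tqpro","db.host":"10.20.2.10"}')` yields two lines, `db.host=10.20.2.10` followed by `db.user=tqpro`. Rejecting non-string values up front makes a malformed bundle fail loudly at boot instead of producing a file the API misreads.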

Mapping current components to AWS resources

| Current component | AWS resource | CDK construct |
| --- | --- | --- |
| `pgdb01.dev`, `db01.prod` (PostgreSQL on bare metal) | EC2 `t3.large` + `gp3` EBS | `ec2.Instance`, `ec2.Volume`, `dlm.CfnLifecyclePolicy` |
| `auth01.dev`, `auth01.prod` (Keycloak on bare metal) | EC2 `t3.medium` | `ec2.Instance` |
| `app01.dev`, `web01.prod` (API + web collocated) | EC2 ASG (initially 1 instance) | `autoscaling.AutoScalingGroup`, `ec2.LaunchTemplate` |
| `gw01.dev`, `gw02.prod` (gateway, NAT, bastion) | EC2 ASG behind NLB, Elastic IP per AZ | `elbv2.NetworkLoadBalancer`, `autoscaling.AutoScalingGroup`, `ec2.CfnEIP` |
| `lab-cid` orchestration host | Small EC2, no inbound | `ec2.Instance` |
| `tourlinq.properties` credentials | Secrets Manager entry | `secretsmanager.Secret` |
| `tqweb-adm/pub/b2b` static assets | CloudFront → S3 | `s3.Bucket`, `cloudfront.Distribution`, `s3deploy.BucketDeployment` |
| `tq-visa-documents` S3 bucket | S3 + SSE-KMS | `s3.Bucket`, `kms.Key` |
| Per-tenant DNS + cert (currently certbot-on-gateway) | Route53 records + certbot still on gateway | `route53.ARecord` (only for environment-level hostnames; tenant subdomains stay certbot-managed for now) |
| TeamCity build agent | (Out of scope; runs outside the CDK environment) | n/a |
| Public IPs (gateway) | One Elastic IP per AZ | `ec2.CfnEIP` |
| Outbound internet for private subnets | Single NAT gateway | `ec2.Vpc` with `natGateways: 1` |
| AWS API access from private subnets | VPC endpoints for SSM, SSM Messages, EC2 Messages, S3, Secrets Manager | `ec2.InterfaceVpcEndpoint`, `ec2.GatewayVpcEndpoint` |
| Operator shell access | SSM Session Manager | IAM policies on instance roles + SSM agent in the AMI |

Project layout

The CDK app lives under infra/cdk/ at the repo root, kept separate from the Java code so build and deploy concerns never tangle:

infra/cdk/
├── bin/
│   └── tqpro.ts              # App entry point. One stack instance per environment.
├── lib/
│   ├── network-stack.ts      # VPC, subnets, security groups, NAT, VPC endpoints, Route53.
│   ├── data-stack.ts         # PostgreSQL EC2 + EBS + DLM, Keycloak EC2.
│   ├── app-stack.ts          # Gateway NLB + ASG, app ASG, orchestration EC2, IAM roles.
│   ├── secrets-stack.ts      # KMS key, Secrets Manager entries with per-instance access.
│   └── storage-stack.ts      # S3 buckets, CloudFront distribution.
├── userdata/
│   ├── postgres.sh           # Install Postgres 16 + pgaudit, restore from snapshot if any.
│   ├── keycloak.sh           # Install Keycloak 26.x as systemd service.
│   ├── app.sh                # Install Java 17 + jsvc + nginx; pull config from Secrets Manager.
│   ├── gateway.sh            # Install nginx + certbot; pull tenant configs from S3.
│   └── orchestration.sh      # Clone repo, install psql/jq/python, set up tqpro-deploy keys.
├── cdk.json
├── package.json
└── tsconfig.json

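Cross-stack references go through stack outputs and Fn::ImportValue rather than hardcoded ARNs, so a stack can be redeployed without breaking its consumers; CDK generates the exports and imports automatically when a construct from one stack is passed to another as a prop. The one constraint is that CloudFormation will not change or remove an export while another stack still imports it, so producer stacks should keep their exported values stable. The split into five stacks rather than one monolithic stack matters because CloudFormation deploys a stack as a unit: keeping the network and data stacks small and stable means most code changes only redeploy the app and storage stacks.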

Sample CDK snippets

The TypeScript below is illustrative: it shows the shape of each construct and the wiring between them, but it omits parameter validation, error handling, log retention configuration, and a number of production niceties. Treat it as a sketch, not a copy-paste-ready stack.

Network stack: VPC, subnets, security groups

// infra/cdk/lib/network-stack.ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Vpc, IpAddresses, SubnetType, SecurityGroup, Peer, Port,
         InterfaceVpcEndpointAwsService, GatewayVpcEndpointAwsService } from 'aws-cdk-lib/aws-ec2';

export interface NetworkStackProps extends StackProps {
    envName: 'dev' | 'prod';
    cidr: string;            // e.g. '10.20.0.0/16'
}

export class NetworkStack extends Stack {
    public readonly vpc: Vpc;
    public readonly gwSg: SecurityGroup;
    public readonly appSg: SecurityGroup;
    public readonly authSg: SecurityGroup;
    public readonly dbSg: SecurityGroup;
    public readonly orchSg: SecurityGroup;

    constructor(scope: Construct, id: string, props: NetworkStackProps) {
        super(scope, id, props);

        // Two AZs, three subnet groups. Single NAT gateway is fine for dev/prod
        // single-tenant; revisit once tenant count justifies HA NAT.
        // Note: PRIVATE_ISOLATED subnets get no route to the NAT gateway, so the
        // data tier can reach only what the VPC endpoints below expose.
        this.vpc = new Vpc(this, 'Vpc', {
            ipAddresses: IpAddresses.cidr(props.cidr),
            maxAzs: 2,
            natGateways: 1,
            subnetConfiguration: [
                { name: 'public',  subnetType: SubnetType.PUBLIC,           cidrMask: 24 },
                { name: 'app',     subnetType: SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
                { name: 'data',    subnetType: SubnetType.PRIVATE_ISOLATED, cidrMask: 24 },
            ],
        });

        // Security groups mirror perun-sg-* naming.
        this.gwSg   = new SecurityGroup(this, 'GwSg',   { vpc: this.vpc, description: 'gateway / nginx + certbot' });
        this.appSg  = new SecurityGroup(this, 'AppSg',  { vpc: this.vpc, description: 'tqpro api + web' });
        this.authSg = new SecurityGroup(this, 'AuthSg', { vpc: this.vpc, description: 'keycloak' });
        this.dbSg   = new SecurityGroup(this, 'DbSg',   { vpc: this.vpc, description: 'postgres' });
        this.orchSg = new SecurityGroup(this, 'OrchSg', { vpc: this.vpc, description: 'tenant provisioning host' });

        // Public ingress only on the gateway, only on 443 (and 80 for ACME challenges).
        this.gwSg.addIngressRule(Peer.anyIpv4(), Port.tcp(443));
        this.gwSg.addIngressRule(Peer.anyIpv4(), Port.tcp(80));

        // Gateway → app:11180 (api) and 80 (web).
        this.appSg.addIngressRule(this.gwSg, Port.tcp(11180));
        this.appSg.addIngressRule(this.gwSg, Port.tcp(80));

        // Gateway → keycloak:8080 (Keycloak is reverse-proxied through the gateway).
        this.authSg.addIngressRule(this.gwSg, Port.tcp(8080));

        // App + Keycloak → postgres:5432.
        this.dbSg.addIngressRule(this.appSg,  Port.tcp(5432));
        this.dbSg.addIngressRule(this.authSg, Port.tcp(5432));

        // Orchestration → gateway:22 + every web/app:22 for tenant-provision.sh fan-out.
        // No inbound rules on orchSg itself: SSM Session Manager is the only access path.
        this.gwSg.addIngressRule(this.orchSg, Port.tcp(22));
        this.appSg.addIngressRule(this.orchSg, Port.tcp(22));

        // VPC endpoints so private subnets reach AWS APIs without going through NAT.
        this.vpc.addGatewayEndpoint('S3',  { service: GatewayVpcEndpointAwsService.S3 });
        for (const svc of [
            InterfaceVpcEndpointAwsService.SSM,
            InterfaceVpcEndpointAwsService.SSM_MESSAGES,
            InterfaceVpcEndpointAwsService.EC2_MESSAGES,
            InterfaceVpcEndpointAwsService.SECRETS_MANAGER,
        ]) {
            this.vpc.addInterfaceEndpoint(svc.shortName, { service: svc });
        }
    }
}
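
The ingress rules in the stack above can also be read as a flow table. The sketch below restates them as plain data with a lookup helper; it is a reading aid written for this document, not something generated from the CDK app, and the tier names are shorthand for the security groups (with `internet` standing for `Peer.anyIpv4()`).

```typescript
// Tiers map onto the security groups in the network stack sketch.
type Tier = 'internet' | 'gw' | 'app' | 'auth' | 'db' | 'orch';

interface Flow {
    from: Tier;
    to: Tier;
    port: number;
}

// One entry per addIngressRule call in the network stack sketch.
const ALLOWED_FLOWS: Flow[] = [
    { from: 'internet', to: 'gw',   port: 443 },
    { from: 'internet', to: 'gw',   port: 80 },
    { from: 'gw',       to: 'app',  port: 11180 },
    { from: 'gw',       to: 'app',  port: 80 },
    { from: 'gw',       to: 'auth', port: 8080 },
    { from: 'app',      to: 'db',   port: 5432 },
    { from: 'auth',     to: 'db',   port: 5432 },
    { from: 'orch',     to: 'gw',   port: 22 },
    { from: 'orch',     to: 'app',  port: 22 },
];

// True when the table contains an exact (from, to, port) match.
function isAllowed(from: Tier, to: Tier, port: number): boolean {
    for (const flow of ALLOWED_FLOWS) {
        if (flow.from === from && flow.to === to && flow.port === port) {
            return true;
        }
    }
    return false;
}

// Distinct tiers that may open connections to the given tier.
function inboundSources(to: Tier): Tier[] {
    const sources: Tier[] = [];
    for (const flow of ALLOWED_FLOWS) {
        if (flow.to === to && sources.indexOf(flow.from) === -1) {
            sources.push(flow.from);
        }
    }
    return sources;
}
```

A table like this could back a unit test that fails when a rule is added without review. It also makes one gap visible: as sketched, no rule lets the orchestration host reach PostgreSQL on 5432, which is worth checking against what the provisioning scripts actually need.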

Data stack: PostgreSQL EC2 with daily snapshots

// infra/cdk/lib/data-stack.ts (PostgreSQL only; Keycloak follows the same shape)
import { Stack, StackProps, Duration, Tags } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Instance, InstanceClass, InstanceSize, InstanceType,
         MachineImage, BlockDeviceVolume, EbsDeviceVolumeType,
         SecurityGroup, SubnetType } from 'aws-cdk-lib/aws-ec2';
import { Role, ServicePrincipal, ManagedPolicy } from 'aws-cdk-lib/aws-iam';
import { CfnLifecyclePolicy } from 'aws-cdk-lib/aws-dlm';
import { readFileSync } from 'fs';
import { NetworkStack } from './network-stack';

export interface DataStackProps extends StackProps {
    envName: 'dev' | 'prod';
    network: NetworkStack;
    instanceSize: InstanceSize;
    dataVolumeGiB: number;
    snapshotRetentionDays: number;   // 14 for dev, 35 for prod
}

export class DataStack extends Stack {
    public readonly postgres: Instance;

    constructor(scope: Construct, id: string, props: DataStackProps) {
        super(scope, id, props);

        const role = new Role(this, 'PgRole', {
            assumedBy: new ServicePrincipal('ec2.amazonaws.com'),
            managedPolicies: [ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore')],
        });

        this.postgres = new Instance(this, 'PostgresInstance', {
            vpc: props.network.vpc,
            vpcSubnets: { subnetType: SubnetType.PRIVATE_ISOLATED },
            instanceType: InstanceType.of(InstanceClass.T3, props.instanceSize),
            machineImage: MachineImage.latestAmazonLinux2023(),
            securityGroup: props.network.dbSg,
            role,
            blockDevices: [{
                deviceName: '/dev/xvdf',
                volume: BlockDeviceVolume.ebs(props.dataVolumeGiB, {
                    volumeType: EbsDeviceVolumeType.GP3,
                    encrypted: true,
                    deleteOnTermination: false,
                }),
            }],
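            propagateTagsToVolumeOnCreation: true,   // copy the instance tags onto its EBS volumes
        });
        // Resolve the script relative to this file, not the process working directory.
        this.postgres.addUserData(readFileSync(`${__dirname}/../userdata/postgres.sh`, 'utf8'));

        // Tag the instance. With tag propagation enabled above, the EBS volumes
        // (root included) carry the same tags, so the DLM policy below picks them up.
        Tags.of(this.postgres).add('tqpro:role', 'postgres');
        Tags.of(this.postgres).add('tqpro:env', props.envName);

        // Daily snapshots. The execution role referenced here is the account's default
        // 'AWSDataLifecycleManagerDefaultRole'; this stack does not create it, so
        // create or import it as appropriate.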
        new CfnLifecyclePolicy(this, 'DailySnapshots', {
            description: `tqpro-${props.envName} postgres data daily snapshots`,
            state: 'ENABLED',
            executionRoleArn: `arn:aws:iam::${this.account}:role/AWSDataLifecycleManagerDefaultRole`,
            policyDetails: {
                resourceTypes: ['VOLUME'],
                targetTags: [{ key: 'tqpro:role', value: 'postgres' }],
                schedules: [{
                    name: 'daily',
                    createRule: { interval: 24, intervalUnit: 'HOURS', times: ['03:00'] },
                    retainRule: { count: props.snapshotRetentionDays },
                }],
            },
        });
    }
}

The matching userdata/postgres.sh installs PostgreSQL 16 from the official PGDG repo, formats and mounts the data volume on /var/lib/postgresql, sets pgaudit in shared_preload_libraries, applies the connection-pool tuning from operations/multitenancy-setup.md §4, and exits. It does not create the platform DB or any tenant DB — that work is run from the orchestration host via scripts/apply-platform-migrations.sh and scripts/bootstrap-template-db.sh, which is what the lab does today and what we want unchanged.

App stack: API/web instance with Secrets Manager wiring

// infra/cdk/lib/app-stack.ts (the App ASG; the Gateway and Orchestration follow the same pattern)
import { Stack, StackProps, Duration } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { AutoScalingGroup, HealthCheck } from 'aws-cdk-lib/aws-autoscaling';
import { Instance, InstanceClass, InstanceSize, InstanceType,
         MachineImage, LaunchTemplate, UserData, SubnetType } from 'aws-cdk-lib/aws-ec2';
import { Role, ServicePrincipal, ManagedPolicy, PolicyStatement } from 'aws-cdk-lib/aws-iam';
import { Secret } from 'aws-cdk-lib/aws-secretsmanager';
import { NetworkStack } from './network-stack';

export interface AppStackProps extends StackProps {
    envName: 'dev' | 'prod';
    network: NetworkStack;
    appConfigSecret: Secret;     // the Secrets Manager bundle from secrets-stack
    artifactBucketName: string;  // S3 bucket the deploy step will write to
}

export class AppStack extends Stack {
    public readonly appAsg: AutoScalingGroup;

    constructor(scope: Construct, id: string, props: AppStackProps) {
        super(scope, id, props);

        const role = new Role(this, 'AppRole', {
            assumedBy: new ServicePrincipal('ec2.amazonaws.com'),
            managedPolicies: [ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore')],
        });

        // Read the configuration bundle at boot.
        props.appConfigSecret.grantRead(role);

        // Pull the latest API + web artefacts from S3.
        role.addToPolicy(new PolicyStatement({
            actions: ['s3:GetObject', 's3:ListBucket'],
            resources: [
                `arn:aws:s3:::${props.artifactBucketName}`,
                `arn:aws:s3:::${props.artifactBucketName}/*`,
            ],
        }));

        const userData = UserData.forLinux();
        userData.addCommands(
            `set -euo pipefail`,
            `# Install Java 17, jsvc, nginx, awscli.`,
            `dnf install -y java-17-amazon-corretto jsvc nginx awscli jq`,
            ``,
            `# Materialise tourlinq.properties and friends from Secrets Manager into /var/tqpro/conf/.`,
            `mkdir -p /var/tqpro/conf/properties.d /var/log/tqpro`,
            `aws secretsmanager get-secret-value \\`,
            `  --secret-id "${props.appConfigSecret.secretArn}" \\`,
            `  --query SecretString --output text \\`,
            `  | jq -r 'to_entries[] | "\\(.key)=\\(.value)"' \\`,
            `  > /var/tqpro/conf/tourlinq.properties`,
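            `# Create the service account if the AMI does not already provide it;`,
            `# without it the chown below fails and set -e aborts the boot script.`,
            `id -u tqpro-svc >/dev/null 2>&1 || useradd --system --user-group tqpro-svc`,
            `chmod 0640 /var/tqpro/conf/tourlinq.properties`,
            `chown tqpro-svc:tqpro-svc /var/tqpro/conf/tourlinq.properties`,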
            ``,
            `# Pull the latest API artefacts. The artefact uploader (CI) writes to`,
            `# s3://${props.artifactBucketName}/api/current/ as a fixed path.`,
            `mkdir -p /var/tqpro/api-current /var/tqpro/lib`,
            `aws s3 sync s3://${props.artifactBucketName}/api/current/ /var/tqpro/api-current/`,
            `aws s3 sync s3://${props.artifactBucketName}/lib/  /var/tqpro/lib/`,
            `ln -snf /var/tqpro/api-current /var/tqpro/api`,
            ``,
            `# Start the daemon. tlinq-service.sh reads TLINQ_HOME from its own dir.`,
            `cd /var/tqpro/api && bash ./tlinq-service.sh start`,
            ``,
            `# Web tier: nginx with the same vhost templates the gateway uses.`,
            `aws s3 sync s3://${props.artifactBucketName}/web/current/ /var/www/rel-current/`,
            `ln -snf /var/www/rel-current /var/www/html-adm`,
            `systemctl enable --now nginx`,
        );

        const launchTemplate = new LaunchTemplate(this, 'AppLaunchTemplate', {
            instanceType: InstanceType.of(InstanceClass.T3, InstanceSize.MEDIUM),
            machineImage: MachineImage.latestAmazonLinux2023(),
            securityGroup: props.network.appSg,
            role,
            userData,
        });

        this.appAsg = new AutoScalingGroup(this, 'AppAsg', {
            vpc: props.network.vpc,
            vpcSubnets: { subnetType: SubnetType.PRIVATE_WITH_EGRESS },
            launchTemplate,
            minCapacity: 1,
            desiredCapacity: 1,
            maxCapacity: 1,                    // raise once Hazelcast K8s discovery lands
            healthCheck: HealthCheck.ec2({ grace: Duration.minutes(5) }),   // no load balancer is attached to this ASG
        });
    }
}

Secrets stack: one bundle, KMS-encrypted

// infra/cdk/lib/secrets-stack.ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Key } from 'aws-cdk-lib/aws-kms';
import { Secret } from 'aws-cdk-lib/aws-secretsmanager';

export class SecretsStack extends Stack {
    public readonly key: Key;
    public readonly appConfig: Secret;

    constructor(scope: Construct, id: string, props: StackProps & { envName: 'dev' | 'prod' }) {
        super(scope, id, props);

        this.key = new Key(this, 'TqproKey', {
            description: `tqpro-${props.envName} envelope key for Secrets Manager`,
            enableKeyRotation: true,
        });

        // Bundle every credential that lives in tourlinq.properties + properties.d/*
        // into a single JSON secret. The app-stack UserData expands it back into
        // /var/tqpro/conf/tourlinq.properties at boot.
        this.appConfig = new Secret(this, 'AppConfig', {
            secretName: `tqpro/${props.envName}/app-config`,
            description: 'tourlinq.properties + properties.d bundled as JSON',
            encryptionKey: this.key,
            // No initial value: populated by the migration script described
            // in the "Secrets migration plan" section below. CDK only owns the shape.
        });
    }
}

Orchestration EC2 with SSM-only access

// excerpt from app-stack.ts — the orchestration host
const orchRole = new Role(this, 'OrchRole', {
    assumedBy: new ServicePrincipal('ec2.amazonaws.com'),
    managedPolicies: [ManagedPolicy.fromAwsManagedPolicyName('AmazonSSMManagedInstanceCore')],
});

// Read the platform-admin token from Secrets Manager.
props.platformAdminTokenSecret.grantRead(orchRole);

// Allow the SSH key for tqpro-deploy to be pulled from Secrets Manager.
props.tqproDeployKeySecret.grantRead(orchRole);

// Allow tenant-provision.sh to write nginx vhosts on the gateway via SSM RunCommand.
orchRole.addToPolicy(new PolicyStatement({
    actions: ['ssm:SendCommand', 'ssm:GetCommandInvocation'],
    resources: ['*'],          // tighten to the gateway/web instance IDs in production
}));

const orchUserData = UserData.forLinux();
orchUserData.addCommands(
    `dnf install -y git python3 postgresql16 jq`,
    `git clone --depth=1 https://github.com/perunapps/tqpro.git /opt/tqpro`,
    `mkdir -p /etc/tqpro/ssh && chmod 0750 /etc/tqpro && chmod 0700 /etc/tqpro/ssh`,
    `# Pull the deploy key from Secrets Manager into /etc/tqpro/ssh/tqpro-deploy.`,
    `aws secretsmanager get-secret-value --secret-id ${props.tqproDeployKeySecret.secretArn} \\`,
    `  --query SecretString --output text > /etc/tqpro/ssh/tqpro-deploy`,
    `chmod 0600 /etc/tqpro/ssh/tqpro-deploy`,
);

new Instance(this, 'OrchHost', {
    vpc: props.network.vpc,
    vpcSubnets: { subnetType: SubnetType.PRIVATE_WITH_EGRESS },
    instanceType: InstanceType.of(InstanceClass.T3, InstanceSize.SMALL),
    machineImage: MachineImage.latestAmazonLinux2023(),
    securityGroup: props.network.orchSg,
    role: orchRole,
    userData: orchUserData,
});

Gateway with NLB + per-AZ Elastic IP

// excerpt from app-stack.ts — the gateway tier
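import { CfnEIP } from 'aws-cdk-lib/aws-ec2';
import { CfnLoadBalancer, NetworkLoadBalancer, NetworkTargetGroup, Protocol } from 'aws-cdk-lib/aws-elasticloadbalancingv2';

const gwAsg = new AutoScalingGroup(this, 'GwAsg', { /* ...same pattern as AppAsg... */ });

// One Elastic IP per public subnet (one per AZ), so DNS A records are stable
// even when a gateway instance is replaced.
const publicSubnets = props.network.vpc.publicSubnets;
const eips = publicSubnets.map((_, i) => new CfnEIP(this, `GwEip${i}`, { domain: 'vpc' }));

const nlb = new NetworkLoadBalancer(this, 'GwNlb', {
    vpc: props.network.vpc,
    internetFacing: true,
    crossZoneEnabled: true,
});

// The NetworkLoadBalancer construct renders a plain Subnets list by default.
// Swap it for SubnetMappings through the L1 escape hatch so each AZ's NLB
// node is pinned to its Elastic IP.
const cfnNlb = nlb.node.defaultChild as CfnLoadBalancer;
cfnNlb.addPropertyDeletionOverride('Subnets');
cfnNlb.subnetMappings = publicSubnets.map((subnet, i) => ({
    subnetId: subnet.subnetId,
    allocationId: eips[i].attrAllocationId,
}));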

const tg443 = new NetworkTargetGroup(this, 'GwTg443', {
    vpc: props.network.vpc,
    port: 443,
    protocol: Protocol.TCP,
    healthCheck: { protocol: Protocol.TCP },
});
gwAsg.attachToNetworkTargetGroup(tg443);

nlb.addListener('Listener443', {
    port: 443,
    protocol: Protocol.TCP,
    defaultTargetGroups: [tg443],
});
// (Listener for 80/TCP follows the same shape — used for ACME HTTP-01 challenges.)

The gateway is exposed via the NLB, rather than through Auto Scaling Group instances that carry their own public IPs, because the public IPs need to remain stable across instance replacements: per-tenant DNS records in Route53 (or in the customer's own DNS) are pinned to the NLB's Elastic IPs, and certbot HTTP-01 challenges depend on inbound 80/TCP being routable to whichever gateway is currently active.

Bootstrap and day-2 operations

The first deploy of an environment is a one-time sequence: run cdk bootstrap aws://<account>/<region> once per account-region; create the deploy IAM role used by the CI runner; deploy the network stack, then the secrets stack (seeding it as described in the Secrets migration plan section), then the data stack, then the storage stack, then the app stack. Tenant-side data (the platform DB, the template DB, every tenant DB) is then created from the orchestration host using the existing scripts: scripts/apply-platform-migrations.sh, scripts/bootstrap-template-db.sh, and scripts/tenant-provision.sh. The CDK app deliberately does not own this layer: schema changes ship more often than infrastructure changes, and decoupling them means a migration deploy never needs cdk deploy.
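
The stack ordering in the sequence above can be checked mechanically. The sketch below is plain TypeScript, independent of CDK; the dependency map is an assumption inferred from the props each sample stack takes (the data stack receives the network stack, the app stack receives the network stack, the config secret, and the artefact bucket name), so treat it as illustrative rather than authoritative.

```typescript
type StackName = 'network' | 'secrets' | 'data' | 'storage' | 'app';

// Assumed dependencies, inferred from the sample stack props in this document.
const DEPENDS_ON: Record<StackName, StackName[]> = {
    network: [],
    secrets: [],
    data:    ['network'],
    storage: [],
    app:     ['network', 'secrets', 'storage'],
};

// The order given in the bootstrap sequence.
const DOCUMENTED_ORDER: StackName[] = ['network', 'secrets', 'data', 'storage', 'app'];

// Return a description of the first stack that would be deployed before one
// of its dependencies, or null when the order is consistent.
function firstOrderViolation(order: StackName[]): string | null {
    const deployed: StackName[] = [];
    for (const stack of order) {
        for (const dep of DEPENDS_ON[stack]) {
            if (deployed.indexOf(dep) === -1) {
                return `${stack} is deployed before ${dep}`;
            }
        }
        deployed.push(stack);
    }
    return null;
}
```

With this map, `firstOrderViolation(DOCUMENTED_ORDER)` returns `null`. If the real stacks turn out to have more dependencies, for example the identity instance reading a secret, the map needs to be extended accordingly.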

Day-to-day, three workflows run:

Infrastructure changes flow through a CI job triggered on commits to infra/cdk/**. The job runs cdk diff on every push and posts the diff to the pull request; merging triggers cdk deploy <stack> for the dev environment automatically. The prod environment is gated behind a manual approval step in CI.
Application deploys push the latest tqapi.jar, the dependency JARs, and the static frontends to the artefact S3 bucket. A trailing SSM RunCommand on the app instance runs aws s3 sync into a new /var/tqpro/api-NNN/ slot and atomically swaps the api/ symlink, then bash ./tlinq-service.sh restart. This is the same atomic-symlink pattern the lab uses, just with S3 as the artefact transport.
Tenant operations (provision, rollback, suspend, reactivate) are SSM RunCommand invocations against the orchestration host. The script invocations are exactly what the runbooks call out today — scripts/tenant-provision.sh acme-travel "Acme Travel" admin@acmetravel.example. A small wrapper (a shell function or, optionally, a CLI tool) hides the SSM call so an operator types one line and gets a streaming log.
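
The wrapper around tenant operations could build its Run Command request along these lines. This is a minimal sketch: the function names are hypothetical, the instance ID in the usage note is a placeholder, and the working directory assumes the repo is cloned to /opt/tqpro as in the orchestration UserData. The object follows the shape of the SSM SendCommand request for the AWS-RunShellScript document, but nothing here calls AWS.

```typescript
// Subset of the SSM SendCommand request used by this sketch.
interface SendCommandInput {
    DocumentName: string;
    InstanceIds: string[];
    Comment: string;
    Parameters: { commands: string[]; workingDirectory: string[] };
}

// Quote one argument for a POSIX shell so spaces and quotes survive intact.
function shellQuote(arg: string): string {
    return `'` + arg.replace(/'/g, `'\\''`) + `'`;
}

// Build the request that runs tenant-provision.sh on the orchestration host.
function buildTenantProvisionCommand(
    orchInstanceId: string,
    tenantSlug: string,
    displayName: string,
    adminEmail: string,
): SendCommandInput {
    const args = [tenantSlug, displayName, adminEmail].map(shellQuote).join(' ');
    return {
        DocumentName: 'AWS-RunShellScript',
        InstanceIds: [orchInstanceId],
        Comment: `tenant-provision ${tenantSlug}`,
        Parameters: {
            commands: [`scripts/tenant-provision.sh ${args}`],
            workingDirectory: ['/opt/tqpro'],
        },
    };
}
```

Calling `buildTenantProvisionCommand('i-0123456789abcdef0', 'acme-travel', 'Acme Travel', 'admin@acmetravel.example')` produces a single command line with each argument single-quoted. Quoting every argument matters because the display name is free text and reaches a shell on the remote host.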

Break-glass is SSM Session Manager. There are no public SSH keys on any instance, no bastion that operators log into, no shared admin password. The IAM role to run a session is gated by SSO group membership; every command is logged to CloudWatch and to the SSM session log bucket. If something has to be done on the AWS console — for example, attaching a snapshot from prod to dev for a forensic investigation — that is the legitimate use of the console, and the action should be reflected back into CDK code afterwards so subsequent deploys do not undo it.

Secrets migration plan

This is the one piece of the proposal that requires a one-time custodial step: every credential that lives in config/tourlinq.properties, config/properties.d/*.properties, and the per-plugin XML configs has to land in Secrets Manager before the app stack is first deployed, because the app instance UserData refuses to start the API without the bundle.

The path is mechanical. A small one-shot script (infra/cdk/scripts/seed-app-config.ts) reads the current property files from a trusted ops host, scrubs the values that have been replaced by ##placeholder markers in git, prompts the operator for any genuinely missing values, and writes the resulting JSON object into the Secrets Manager entry the CDK app provisioned. The script runs once per environment when the secrets stack is first deployed, and again any time a credential rotates. After the first run, the loader UserData on every app/auth boot reads from Secrets Manager and the property file on disk is regenerated each time.
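
The core of such a seed script can be sketched without any AWS calls: parse the property files, find the entries still holding placeholder markers, and merge in operator-supplied values. The sketch below is illustrative only. It assumes a placeholder is any value beginning with `##`, handles only simple `key=value` lines (no line continuations or `:` separators), and uses hypothetical function and key names; the real seed-app-config.ts may differ.

```typescript
type Props = Record<string, string>;

// Parse simple key=value lines, skipping blanks and comment lines.
function parseProperties(text: string): Props {
    const props: Props = {};
    for (const rawLine of text.split(/\r?\n/)) {
        const line = rawLine.trim();
        if (line === '' || line.charAt(0) === '#' || line.charAt(0) === '!') continue;
        const eq = line.indexOf('=');
        if (eq === -1) continue;
        props[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
    }
    return props;
}

// Assumption: a value that starts with '##' is a git-side placeholder.
function isPlaceholder(value: string): boolean {
    return value.indexOf('##') === 0;
}

// Keys the operator still has to supply.
function placeholderKeys(props: Props): string[] {
    return Object.keys(props).filter(key => isPlaceholder(props[key])).sort();
}

// Merge operator-supplied values over placeholders and serialise the bundle.
// Throws if any placeholder is left unresolved.
function buildSeedBundle(props: Props, supplied: Props): string {
    const bundle: Props = {};
    const missing: string[] = [];
    for (const key of Object.keys(props).sort()) {
        if (!isPlaceholder(props[key])) {
            bundle[key] = props[key];
        } else if (Object.prototype.hasOwnProperty.call(supplied, key)) {
            bundle[key] = supplied[key];
        } else {
            missing.push(key);
        }
    }
    if (missing.length > 0) {
        throw new Error(`missing values for: ${missing.join(', ')}`);
    }
    return JSON.stringify(bundle);
}
```

The resulting JSON string is what would be written into the Secrets Manager entry. Failing on unresolved placeholders keeps a half-seeded bundle from ever reaching an instance.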

What this does not require is any application code change. The API still reads tourlinq.properties from ${TLINQ_HOME}/. The fact that the file was generated by aws secretsmanager get-secret-value | jq two seconds before the JVM started is invisible to the application. Once the migration is in place and stable, a future iteration can replace the file-on-disk indirection with a JVM-side Secrets Manager fetch, but that is not blocking and it is not part of this proposal.

A subtlety worth flagging: the encryption key TQPRO_ENCRYPTION_KEY is already an environment variable read from /etc/tqpro/tqpro.env per operations/multitenancy-setup.md §5. The CDK app moves that variable into Secrets Manager and arranges for systemd's EnvironmentFile= to source it at boot. The application sees no difference.

CI/CD outline

The current .teamcity/ Kotlin DSL exists but is not running. Two pragmatic paths forward:

GitHub Actions is the lighter-weight path if the team is already on GitHub. One workflow, triggered on push to master, runs ./gradlew build :tqapi:copyDependencies, packages the artefacts, uploads them to the dev S3 bucket, and triggers an SSM RunCommand to swap the symlink. A second workflow, triggered by changes under infra/cdk/**, runs cdk diff on pull requests and cdk deploy on merge. A third, manually triggered workflow promotes a known-good artefact set from the dev bucket to the prod bucket and runs the same SSM swap on prod, gated by a required-reviewer approval on a GitHub environment.
Resurrecting TeamCity is the path if the existing DSL has value. The shape is the same: build job → S3 upload → SSM RunCommand → restart. The difference is that the orchestration host stays the deploy origin for AWS, and the build agent runs externally; this matches today's AWS-prod arrangement where the build agent on auth01.prod connects out to the externally-hosted TeamCity server.
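On either path, the "SSM RunCommand to swap the symlink" step comes down to one SendCommand request against the AWS-RunShellScript document. The sketch below only builds that request as a plain object. The directory layout (releases/<version> plus a current symlink), the option names, and the systemd unit name are illustrative assumptions, not the repository's actual layout.

```typescript
// The subset of SSM SendCommand request fields this sketch uses.
interface SendCommandRequest {
  DocumentName: string;
  InstanceIds: string[];
  Comment: string;
  Parameters: { commands: string[] };
}

// Hypothetical layout: artefacts are synced to <root>/releases/<version>,
// and <root>/current is the symlink the service runs from.
function buildSwapCommand(opts: {
  instanceIds: string[];
  bucket: string;
  version: string;
  root: string;
  service: string;
}): SendCommandRequest {
  // The version is interpolated into shell commands, so restrict its alphabet.
  if (!/^[A-Za-z0-9._-]+$/.test(opts.version)) {
    throw new Error("version must be a plain identifier");
  }
  const release = opts.root + "/releases/" + opts.version;
  return {
    DocumentName: "AWS-RunShellScript",
    InstanceIds: opts.instanceIds,
    Comment: "deploy " + opts.version,
    Parameters: {
      commands: [
        "set -eu", // stop at the first failing step
        "aws s3 sync s3://" + opts.bucket + "/" + opts.version + "/ " + release + "/",
        "ln -sfn " + release + " " + opts.root + "/current",
        "systemctl restart " + opts.service,
      ],
    },
  };
}
```

The returned object can be handed to whichever SSM client the pipeline uses. Building it separately from the call keeps the command list reviewable and testable without AWS credentials.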

Either way, the CI runner needs three pieces: an OIDC trust relationship into the AWS account so it can assume a deploy role without long-lived keys, write access to the artefact S3 bucket, and ssm:SendCommand permission scoped to the relevant instance IDs. The current TeamCity scripts that cp files into /opt/tqpro/deploy/<module>/ are replaced by the S3 upload; the SSH-and-symlink step survives, just as an SSM RunCommand instead of a direct SSH.
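The three pieces map onto two IAM policy documents: a trust policy for the OIDC federation and a permissions policy for the bucket and SSM. The sketch below builds both as plain JSON for the GitHub Actions case. It assumes an IAM OIDC provider for token.actions.githubusercontent.com already exists in the account; the function names and parameters are hypothetical, and a TeamCity setup would use a different issuer and claim format.

```typescript
interface PolicyStatement {
  Effect: "Allow";
  Action: string | string[];
  Resource?: string | string[];
  Principal?: { Federated: string };
  Condition?: Record<string, Record<string, string>>;
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Trust policy: only workflow runs from one repository and branch may assume the role.
function githubOidcTrustPolicy(accountId: string, repo: string, branch: string): PolicyDocument {
  const issuer = "token.actions.githubusercontent.com";
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: { Federated: "arn:aws:iam::" + accountId + ":oidc-provider/" + issuer },
        Action: "sts:AssumeRoleWithWebIdentity",
        Condition: {
          StringEquals: {
            [issuer + ":aud"]: "sts.amazonaws.com",
            [issuer + ":sub"]: "repo:" + repo + ":ref:refs/heads/" + branch,
          },
        },
      },
    ],
  };
}

// Permissions policy: artefact upload plus SendCommand on named instances only.
function deployPermissions(
  region: string,
  accountId: string,
  bucket: string,
  instanceIds: string[]
): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      { Effect: "Allow", Action: "s3:PutObject", Resource: "arn:aws:s3:::" + bucket + "/*" },
      {
        Effect: "Allow",
        Action: "ssm:SendCommand",
        Resource: [
          "arn:aws:ssm:" + region + "::document/AWS-RunShellScript",
          ...instanceIds.map((id) => "arn:aws:ec2:" + region + ":" + accountId + ":instance/" + id),
        ],
      },
    ],
  };
}
```

One detail to watch: a job that references a GitHub environment presents a sub claim of the form repo:<owner>/<repo>:environment:<name> instead of the branch ref, so an environment-gated promote workflow needs its own condition or its own role.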

Phased rollout and rough effort¶

A safe rollout has three phases.

Phase 1 (1–2 weeks) — network and data foundation. Stand up a brand-new tqpro-dev AWS account (or a separate region of the existing one) with the network, secrets, and data stacks. Bootstrap the platform DB and the template DB on the new PostgreSQL EC2 using the existing scripts. Verify the orchestration host can reach Keycloak and the database. No production-ish workload runs yet. The output of this phase is a CDK app that produces an empty-but-correct AWS environment, plus a seed-app-config.ts script that has populated Secrets Manager.

Phase 2 (1 week) — application cutover. Deploy the app stack and the storage stack. Sync the current dev artefacts to the new S3 bucket. Run the existing tenant-provision script against the new orchestration host to onboard a single test tenant. Validate end-to-end against the new gateway IPs (using a synthetic tenant subdomain that points at the new NLB). Cut DNS over for the existing dev tenant once smoke tests pass; keep the old app01.dev host running cold for a week as fallback.

Phase 3 (1 week) — CI/CD and prod. Wire the GitHub Actions (or TeamCity) workflows that automate the artefact upload and the SSM-driven restart. Repeat the Phase 1 + Phase 2 sequence against a tqpro-prod stack instance, behind manual approval gates on every cdk deploy and every artefact promote. Cut prod DNS over during a low-traffic window. Decommission the manually-provisioned hosts only after the environment has been stable for two weeks.

Total: three to four calendar weeks (1–2 + 1 + 1) for a single experienced AWS engineer, plus one ops engineer reviewing each cutover. The two-week stability window before the old hosts are decommissioned falls outside that total. The effort estimate assumes the existing tenant-provisioning scripts are not modified; every change to that surface area is a separate item in the project plan.

Open questions¶

This proposal deliberately does not decide several things. Each one is tractable on its own once the lift-and-shift is in place, and each has its own existing planning document.

Containerisation timeline. The Hazelcast hardcoded-IP fix in operations/scalability/hazelcast-kubernetes.md is the unblocking task; the EKS path proper is in operations/scalability/kubernetes-deployment.md. The CDK app would gain an EKS stack and the API ASG would shrink to zero — neither is in scope here.

RDS migration. Self-hosted Postgres on EC2 is the right starting point because it preserves the existing operational model and the multi-tenant database-per-tenant pattern without surprises. RDS, RDS Proxy, or Aurora become attractive once tenant counts grow past the single-instance comfort zone, but the trade-offs (parameter groups, backup retention, IAM auth, version upgrade timing) belong in their own evaluation.

Multi-region / DR. Out of scope. The current backup story is daily EBS snapshots; cross-region snapshot replication and an active-passive replica are the natural next steps, but neither blocks the move off console clicking.

Observability. The full Prometheus + Grafana stack is designed in operations/observability/requirements.md and operations/observability/prometheus-grafana-setup.md, and the per-plugin work for tqamds is in operations/observability/tqamds-observability.md. None of it is wired into the build today; the CDK app would later gain a monitoring-stack.ts for the Prometheus and Grafana EC2s (or a Managed Prometheus / Managed Grafana stack) once the application-side instrumentation lands.

GitHub Actions vs TeamCity. Both work; the choice is a team preference and a billing question. The proposal stays neutral and assumes whichever pipeline runs gets the same OIDC role and the same SSM permissions.