TQPro API Observability Implementation Guide

Prometheus + Grafana Monitoring Setup

Status: Implementation Planned
Target Module: tqapi
Created: 2025-11-23
Objective: Implement comprehensive observability for TQPro API with Prometheus metrics and Grafana dashboards


Table of Contents

  1. Overview
  2. Prerequisites
  3. Phase 1: Infrastructure Setup
  4. Phase 2: Code Implementation
  5. Phase 3: Prometheus Configuration
  6. Phase 4: Grafana Dashboard Setup
  7. Phase 5: Testing & Validation
  8. Metrics Reference
  9. Troubleshooting

Overview

This guide implements observability for the TQPro API module (tqapi) to track:

  • API Execution Times - Request duration, throughput, latency percentiles
  • Authorization Metrics - Successful/rejected authorization attempts, by API and role
  • Error Tracking - Exception rates, error types, error codes by endpoint

Architecture

┌─────────────┐      ┌──────────────┐      ┌─────────────┐
│  TQPro API  │─────>│  Prometheus  │─────>│   Grafana   │
│  (Metrics)  │ HTTP │  (Scraper)   │ HTTP │ (Dashboard) │
└─────────────┘      └──────────────┘      └─────────────┘

Technology Stack

  • Metrics Library: Micrometer (vendor-neutral facade)
  • Metrics Format: Prometheus
  • Storage: Prometheus TSDB
  • Visualization: Grafana
  • Deployment: Docker Compose (dev/test), Kubernetes (production)

Prerequisites

Required Software

  • JDK 17+ (already installed; the Jetty 12 ee10 classes imported in Step 2.4 require Java 17)
  • Docker & Docker Compose (for Prometheus/Grafana)
  • Gradle (already configured in project)

Required Knowledge

  • Basic understanding of Prometheus query language (PromQL)
  • Familiarity with Grafana dashboard creation
  • Understanding of JAX-RS filters and Jakarta servlets

Estimated Time

  • Phase 1 (Infrastructure): 1-2 hours
  • Phase 2 (Code Implementation): 4-6 hours
  • Phase 3 (Prometheus Config): 1 hour
  • Phase 4 (Grafana Dashboards): 2-3 hours
  • Phase 5 (Testing): 1-2 hours

Total: 9-14 hours


Phase 1: Infrastructure Setup

Step 1.1: Add Micrometer Dependencies

File: tqapi/build.gradle.kts

Location: Lines 12-20 (dependencies block)

dependencies {
    testImplementation(platform("org.junit:junit-bom:5.10.0"))
    testImplementation("org.junit.jupiter:junit-jupiter")
    testImplementation("org.mockito:mockito-junit-jupiter:5.12.0")
    testImplementation("org.mockito:mockito-inline:5.2.0")

    implementation(project(":tqapp"))
    implementation(project(":tqcommon"))

    // ADD: Micrometer metrics dependencies
    implementation("io.micrometer:micrometer-core:1.12.0")
    implementation("io.micrometer:micrometer-registry-prometheus:1.12.0")

    // ADD: Prometheus exposition format
    implementation("io.prometheus:simpleclient:0.16.0")
    implementation("io.prometheus:simpleclient_common:0.16.0")
}

Action:

cd tqapi
../gradlew build --refresh-dependencies

Step 1.2: Create Docker Compose for Prometheus & Grafana

File: tqapi/docker-compose-monitoring.yml (NEW FILE)

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: tqpro-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    networks:
      - monitoring
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: tqpro-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    networks:
      - monitoring
    restart: unless-stopped
    depends_on:
      - prometheus

volumes:
  prometheus-data:
  grafana-data:

networks:
  monitoring:
    driver: bridge

Action:

mkdir -p tqapi/monitoring/grafana/{provisioning/datasources,provisioning/dashboards,dashboards}

Step 1.3: Create Prometheus Configuration

File: tqapi/monitoring/prometheus.yml (NEW FILE)

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'tqpro-dev'
    environment: 'development'

scrape_configs:
  - job_name: 'tqpro-api'
    metrics_path: '/tlinq-api/metrics'
    static_configs:
      - targets: ['host.docker.internal:11080']  # For Docker on Mac/Windows
        # - targets: ['172.17.0.1:11080']        # For Docker on Linux
        labels:
          service: 'tqpro-api'
          module: 'tqapi'

Note: Adjust targets based on your Docker host configuration.

Step 1.4: Create Grafana Datasource Provisioning

File: tqapi/monitoring/grafana/provisioning/datasources/prometheus.yml (NEW FILE)

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"

Step 1.5: Create Grafana Dashboard Provisioning

File: tqapi/monitoring/grafana/provisioning/dashboards/default.yml (NEW FILE)

apiVersion: 1

providers:
  - name: 'TQPro Dashboards'
    orgId: 1
    folder: 'TQPro'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards

Phase 2: Code Implementation

Step 2.1: Create Metrics Registry Manager

File: tqapi/src/main/java/com/perun/tlinq/metrics/MetricsManager.java (NEW FILE)

package com.perun.tlinq.metrics;

import io.micrometer.core.instrument.*;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

/**
 * Centralized metrics management for TQPro API.
 * Provides singleton access to Micrometer registry and helper methods for metrics.
 */
public class MetricsManager {

    private static final Logger logger = Logger.getLogger(MetricsManager.class.getName());
    private static MetricsManager instance;
    private final PrometheusMeterRegistry registry;

    // Cache for frequently used metrics
    private final ConcurrentHashMap<String, Counter> counterCache = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, Timer> timerCache = new ConcurrentHashMap<>();

    private MetricsManager() {
        this.registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        configureCommonTags();
        logger.info("MetricsManager initialized with Prometheus registry");
    }

    public static synchronized MetricsManager getInstance() {
        if (instance == null) {
            instance = new MetricsManager();
        }
        return instance;
    }

    private void configureCommonTags() {
        registry.config().commonTags(
            "application", "tqpro",
            "module", "tqapi"
        );
    }

    public PrometheusMeterRegistry getRegistry() {
        return registry;
    }

    /**
     * Get Prometheus-formatted metrics for exposition endpoint.
     */
    public String scrape() {
        return registry.scrape();
    }

    // ============= Counter Helpers =============

    public void incrementCounter(String name, String... tags) {
        incrementCounterBy(name, 1.0, tags);
    }

    public void incrementCounterBy(String name, double amount, String... tags) {
        Counter counter = counterCache.computeIfAbsent(cacheKey(name, tags),
            k -> Counter.builder(name).tags(tags).register(registry));
        counter.increment(amount);
    }

    // Separate every part with '|' so that different name/tag combinations
    // cannot collapse into the same cache key.
    private static String cacheKey(String name, String... tags) {
        return name + "|" + String.join("|", tags);
    }

    // ============= Timer Helpers =============

    public Timer.Sample startTimer() {
        return Timer.start(registry);
    }

    public void recordTimer(Timer.Sample sample, String name, String... tags) {
        Timer timer = timerCache.computeIfAbsent(cacheKey(name, tags),
            k -> Timer.builder(name)
                .tags(tags)
                // Publish histogram buckets; the histogram_quantile() queries in
                // Phases 3 and 4 need the *_bucket series.
                .publishPercentileHistogram()
                .register(registry));
        // If /metrics shows the timer as ..._seconds_seconds (Micrometer's Prometheus
        // naming convention adds a unit suffix to timers), drop "_seconds" from the
        // name passed in here so the queries in this guide match.
        sample.stop(timer);
    }

    // ============= Gauge Helpers =============

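    public <T> T registerGauge(String name, T obj, java.util.function.ToDoubleFunction<T> valueFunction, String... tags) {
        // Gauge.builder(...).register(...) returns the Gauge, not the observed object,
        // so register first and then hand the object back to the caller.
        Gauge.builder(name, obj, valueFunction)
            .tags(tags)
            .register(registry);
        return obj;
    }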
}
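
The counter and timer helpers above cache meters under a string key built from the metric name and its tags. The following is a minimal, dependency-free sketch of that caching idea, with a LongAdder standing in for a Micrometer Counter. The class name CounterCacheSketch and the "|" separator are assumptions made for illustration; they are not part of the TQPro code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative stand-in for the counter cache; this is not the Micrometer API. */
public class CounterCacheSketch {

    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Builds a key from the name and tags with an explicit separator between all parts. */
    public static String cacheKey(String name, String... tags) {
        return name + "|" + String.join("|", tags);
    }

    /** Increments the counter for this name/tag combination, creating it on first use. */
    public void increment(String name, String... tags) {
        counters.computeIfAbsent(cacheKey(name, tags), k -> new LongAdder()).increment();
    }

    /** Returns the current count, or 0 if the combination was never incremented. */
    public long count(String name, String... tags) {
        LongAdder adder = counters.get(cacheKey(name, tags));
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        CounterCacheSketch cache = new CounterCacheSketch();
        cache.increment("api_requests_total", "endpoint", "user/authenticate", "method", "POST");
        cache.increment("api_requests_total", "endpoint", "user/authenticate", "method", "POST");
        cache.increment("api_requests_total", "endpoint", "booking/create", "method", "POST");
        // Two increments were recorded for this exact name/tag combination.
        System.out.println(cache.count("api_requests_total",
            "endpoint", "user/authenticate", "method", "POST"));
    }
}
```

computeIfAbsent makes the create-or-reuse step atomic per key, which is why the helpers can be called from concurrent request threads without extra locking.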

Step 2.2: Create Metrics Filter for API Timing

File: tqapi/src/main/java/com/perun/tlinq/metrics/MetricsFilter.java (NEW FILE)

package com.perun.tlinq.metrics;

import io.micrometer.core.instrument.Timer;
import jakarta.ws.rs.container.ContainerRequestContext;
import jakarta.ws.rs.container.ContainerRequestFilter;
import jakarta.ws.rs.container.ContainerResponseContext;
import jakarta.ws.rs.container.ContainerResponseFilter;
import jakarta.ws.rs.ext.Provider;

import java.io.IOException;
import java.util.logging.Logger;

/**
 * JAX-RS filter to capture API request metrics:
 * - Request duration (timing)
 * - Request count
 * - Response status codes
 */
@Provider
public class MetricsFilter implements ContainerRequestFilter, ContainerResponseFilter {

    private static final Logger logger = Logger.getLogger(MetricsFilter.class.getName());
    private static final String TIMER_SAMPLE_PROPERTY = "metrics.timer.sample";
    private static final String REQUEST_PATH_PROPERTY = "metrics.request.path";

    private final MetricsManager metricsManager = MetricsManager.getInstance();

    @Override
    public void filter(ContainerRequestContext requestContext) throws IOException {
        // Start timing the request
        Timer.Sample sample = metricsManager.startTimer();
        requestContext.setProperty(TIMER_SAMPLE_PROPERTY, sample);

        // Store request path for use in response filter
        String path = requestContext.getUriInfo().getPath();
        requestContext.setProperty(REQUEST_PATH_PROPERTY, path);

        // Count incoming requests
        String method = requestContext.getMethod();
        metricsManager.incrementCounter("api_requests_total",
            "endpoint", path,
            "method", method);
    }

    @Override
    public void filter(ContainerRequestContext requestContext,
                      ContainerResponseContext responseContext) throws IOException {

        // Retrieve timer sample
        Timer.Sample sample = (Timer.Sample) requestContext.getProperty(TIMER_SAMPLE_PROPERTY);
        String path = (String) requestContext.getProperty(REQUEST_PATH_PROPERTY);

        if (sample != null && path != null) {
            String method = requestContext.getMethod();
            int status = responseContext.getStatus();
            String statusCategory = getStatusCategory(status);

            // Record request duration
            metricsManager.recordTimer(sample, "api_request_duration_seconds",
                "endpoint", path,
                "method", method,
                "status", String.valueOf(status),
                "status_category", statusCategory);

            // Count responses by status
            metricsManager.incrementCounter("api_responses_total",
                "endpoint", path,
                "method", method,
                "status", String.valueOf(status),
                "status_category", statusCategory);
        }
    }

    private String getStatusCategory(int status) {
        if (status >= 200 && status < 300) return "2xx";
        if (status >= 300 && status < 400) return "3xx";
        if (status >= 400 && status < 500) return "4xx";
        if (status >= 500 && status < 600) return "5xx";
        return "unknown";
    }
}
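
The filter above uses the raw request path as the endpoint label. If any TQPro route embeds an identifier in the path (an assumption; this guide does not list the routes), each distinct identifier would create a new time series in Prometheus. One possible mitigation, sketched here with a hypothetical EndpointLabelSketch helper, is to replace numeric and UUID-like segments with a placeholder before tagging.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that keeps the "endpoint" label to a bounded set of values. */
public class EndpointLabelSketch {

    private static final Pattern UUID_SEGMENT = Pattern.compile(
        "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("\\d+");

    /** Replaces numeric and UUID-like path segments with "{id}" and trims slashes. */
    public static String normalize(String path) {
        if (path == null || path.isEmpty()) {
            return "unknown";
        }
        StringBuilder out = new StringBuilder();
        for (String segment : path.split("/")) {
            if (segment.isEmpty()) {
                continue; // skip leading, trailing, and doubled slashes
            }
            if (out.length() > 0) {
                out.append('/');
            }
            boolean looksLikeId = NUMERIC_SEGMENT.matcher(segment).matches()
                || UUID_SEGMENT.matcher(segment).matches();
            out.append(looksLikeId ? "{id}" : segment);
        }
        return out.length() == 0 ? "root" : out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("booking/12345"));        // booking/{id}
        System.out.println(normalize("/user/authenticate/"));  // user/authenticate
    }
}
```

In the filter, the normalized value would be passed as the "endpoint" tag instead of the raw path. Whether this is needed depends on how the TQPro routes are actually defined.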

Step 2.3: Create Metrics Servlet for Prometheus Scraping

File: tqapi/src/main/java/com/perun/tlinq/metrics/MetricsServlet.java (NEW FILE)

package com.perun.tlinq.metrics;

import jakarta.servlet.http.HttpServlet;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;

import java.io.IOException;
import java.io.Writer;
import java.util.logging.Logger;

/**
 * Servlet to expose Prometheus metrics at /tlinq-api/metrics endpoint.
 */
public class MetricsServlet extends HttpServlet {

    private static final Logger logger = Logger.getLogger(MetricsServlet.class.getName());
    private final MetricsManager metricsManager = MetricsManager.getInstance();

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        resp.setStatus(HttpServletResponse.SC_OK);
        resp.setContentType("text/plain; version=0.0.4; charset=utf-8");

        try (Writer writer = resp.getWriter()) {
            String metrics = metricsManager.scrape();
            writer.write(metrics);
            writer.flush();
        } catch (Exception e) {
            logger.severe("Error generating metrics: " + e.getMessage());
            resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR,
                "Error generating metrics");
        }
    }
}
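
The servlet returns whatever registry.scrape() produces, which is the Prometheus text exposition format. To make that format concrete, here is a small sketch that renders a single sample line by hand: labels sorted by name, label values escaped for backslash, double quote, and newline. ExpositionLineSketch is a hypothetical class for illustration only; the real output comes from Micrometer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Hand-rolled rendering of one sample line in the Prometheus text format. */
public class ExpositionLineSketch {

    /** Escapes backslash, double quote, and newline inside a label value. */
    public static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    /** Renders name{label="value",...} value, with labels sorted by name. */
    public static String render(String name, Map<String, String> labels, double value) {
        StringJoiner joined = new StringJoiner(",", "{", "}");
        joined.setEmptyValue(""); // no braces at all when there are no labels
        for (Map.Entry<String, String> e : new TreeMap<>(labels).entrySet()) {
            joined.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
        }
        return name + joined + " " + value;
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("method", "POST");
        labels.put("endpoint", "user/authenticate");
        // prints: api_requests_total{endpoint="user/authenticate",method="POST"} 5.0
        System.out.println(render("api_requests_total", labels, 5.0));
    }
}
```

The text format also permits a trailing comma after the last label, which is what the sample output elsewhere in this guide shows.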

Step 2.4: Update TQProApiServer to Register Metrics Components

File: tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java

Location 1: Add imports (after line 22)

import com.perun.tlinq.metrics.MetricsFilter;
import com.perun.tlinq.metrics.MetricsServlet;
import org.eclipse.jetty.ee10.servlet.ServletHolder;

Location 2: Register MetricsFilter in ResourceConfig (lines 207-211)

// BEFORE:
ResourceConfig rc =  new ResourceConfig()
        .packages("com.perun.tlinq.api")
        .register(MultiPartFeature.class)
        .register(authFilter)
        .register(CORSResponseFilter.class);

// AFTER:
ResourceConfig rc =  new ResourceConfig()
        .packages("com.perun.tlinq.api")
        .register(MultiPartFeature.class)
        .register(authFilter)
        .register(CORSResponseFilter.class)
        .register(MetricsFilter.class);  // ADD THIS LINE

Location 3: Add metrics endpoint servlet (after line 220, before server.setHandler)

// ADD: Metrics endpoint for Prometheus scraping
ServletHolder metricsServlet = new ServletHolder(new MetricsServlet());

ServletContextHandler ctx = new ServletContextHandler(ServletContextHandler.NO_SESSIONS);
ctx.setContextPath("/tlinq-api");
ctx.addServlet(jersey, "/*");
ctx.addServlet(metricsServlet, "/metrics");  // ADD THIS LINE
server.setHandler(ctx);
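
Both the Jersey servlet ("/*") and the metrics servlet ("/metrics") are mapped in the same context. Under the Servlet specification's mapping rules an exact match takes precedence over a path-prefix pattern, so /tlinq-api/metrics reaches MetricsServlet rather than Jersey. The sketch below models only those two rules in plain Java; ServletMappingSketch is a hypothetical class and is not how Jetty implements mapping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified model of exact-match and path-prefix servlet mapping. */
public class ServletMappingSketch {

    private final Map<String, String> mappings = new LinkedHashMap<>();

    public void add(String pattern, String servletName) {
        mappings.put(pattern, servletName);
    }

    /** Resolves a path (relative to the context path) to a servlet name, or null. */
    public String resolve(String pathInContext) {
        // Rule 1: an exact pattern wins outright.
        String exact = mappings.get(pathInContext);
        if (exact != null && !pathInContext.endsWith("/*")) {
            return exact;
        }
        // Rule 2: otherwise the longest matching "/prefix/*" pattern wins.
        String best = null;
        int bestLength = -1;
        for (Map.Entry<String, String> e : mappings.entrySet()) {
            String pattern = e.getKey();
            if (!pattern.endsWith("/*")) {
                continue;
            }
            String prefix = pattern.substring(0, pattern.length() - 2);
            boolean matches = pathInContext.equals(prefix)
                || pathInContext.startsWith(prefix + "/");
            if (matches && prefix.length() > bestLength) {
                best = e.getValue();
                bestLength = prefix.length();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ServletMappingSketch ctx = new ServletMappingSketch();
        ctx.add("/*", "jersey");
        ctx.add("/metrics", "metrics");
        System.out.println(ctx.resolve("/metrics"));           // metrics
        System.out.println(ctx.resolve("/user/authenticate")); // jersey
    }
}
```

One consequence is that a JAX-RS resource with the path "metrics" would be shadowed by the servlet, so that path is best left unused in the API packages.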

Step 2.5: Enhance AuthenticationFilter with Authorization Metrics

File: tqapi/src/main/java/com/perun/tlinq/AuthenticationFilter.java

Location 1: Add import (after line 23)

import com.perun.tlinq.metrics.MetricsManager;

Location 2: Add field (after line 31)

private DevModeConfig devModeConfig = DevModeConfig.disabled();
private final MetricsManager metricsManager = MetricsManager.getInstance();  // ADD THIS

Location 3: Track successful authentication (after line 81)

String apiPath = requestContext.getUriInfo().getPath();
logger.info("API call: " + apiPath + " from user "+userName+"/"+userEmail+" [" + finalUserId+"]");

// ADD: Track authentication metrics
// Use the same "user_roles" label name as auth_rejected_total and auth_authorized_total
metricsManager.incrementCounter("auth_requests_total",
    "api_path", apiPath,
    "user_id", finalUserId,
    "user_roles", userRoles);

Location 4: Track dev-mode bypass (after line 64)

userName = userName != null ? userName : devModeConfig.getDefaultUserName();
logger.info("DEV-MODE: Using default dev user - " + userId);

// ADD: Track dev mode usage
metricsManager.incrementCounter("auth_dev_mode_bypass_total",
    "user_id", userId);

Location 5: Track authorization failures (lines 107-114)

if(!apiRoleManager.isUserAuthorized(apiPath, userRoles)) {
    logger.warning("User "+finalUserId+" is not authorized to access API "+apiPath);

    // ADD: Track authorization rejection
    metricsManager.incrementCounter("auth_rejected_total",
        "api_path", apiPath,
        "user_id", finalUserId,
        "user_roles", userRoles);

    TlinqApiResponse errorResponse = new TlinqApiResponse(TlinqErr.SESSION_ERROR,
        "User is not authorized to access this API.");
    requestContext.abortWith(
            Response.status(Response.Status.FORBIDDEN)
                    .entity(errorResponse)
                    .build());
}

Location 6: Track successful authorization (after authorization check, line 114)

if(!apiRoleManager.isUserAuthorized(apiPath, userRoles)) {
    // ... rejection code above ...
} else {
    // ADD: Track successful authorization
    metricsManager.incrementCounter("auth_authorized_total",
        "api_path", apiPath,
        "user_roles", userRoles);
}
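
The authorization counters above use the roles value as a label. Prometheus treats every distinct label value as a separate series, so "agent,admin" and "admin, agent" would be counted separately. A possible normalization step is sketched below, assuming userRoles is a comma-separated string (this guide does not show its type). RoleLabelSketch is a hypothetical helper.

```java
import java.util.Locale;
import java.util.TreeSet;

/** Hypothetical helper that produces one canonical label value per set of roles. */
public class RoleLabelSketch {

    /** Trims, lower-cases, de-duplicates, and sorts a comma-separated role list. */
    public static String normalize(String roles) {
        if (roles == null || roles.isBlank()) {
            return "none";
        }
        TreeSet<String> sorted = new TreeSet<>();
        for (String role : roles.split(",")) {
            String trimmed = role.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                sorted.add(trimmed);
            }
        }
        return sorted.isEmpty() ? "none" : String.join(",", sorted);
    }

    public static void main(String[] args) {
        System.out.println(normalize("agent, Admin")); // admin,agent
        System.out.println(normalize("admin,agent"));  // admin,agent
    }
}
```

The same reasoning applies to the user_id label: it grows with the number of users, so it may be worth keeping only on low-volume counters such as rejections.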

Step 2.6: Create Error Tracking Utility

File: tqapi/src/main/java/com/perun/tlinq/metrics/ErrorMetrics.java (NEW FILE)

package com.perun.tlinq.metrics;

import com.perun.tlinq.util.TlinqClientException;

import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Utility for tracking error metrics consistently across all API endpoints.
 */
public class ErrorMetrics {

    private static final Logger logger = Logger.getLogger(ErrorMetrics.class.getName());
    private static final MetricsManager metricsManager = MetricsManager.getInstance();

    /**
     * Track a TlinqClientException with metrics.
     */
    public static void trackClientException(String endpoint, TlinqClientException ex) {
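        // String.valueOf keeps the tag non-null and compiles whether
        // getErrorCode() returns a String or an enum constant.
        metricsManager.incrementCounter("api_errors_total",
            "endpoint", endpoint,
            "error_type", "TlinqClientException",
            "error_code", String.valueOf(ex.getErrorCode()));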

        logger.log(Level.WARNING,
            String.format("Client error at %s: [%s] %s", endpoint, ex.getErrorCode(), ex.getMessage()));
    }

    /**
     * Track a generic exception with metrics.
     */
    public static void trackException(String endpoint, Exception ex) {
        String exceptionType = ex.getClass().getSimpleName();

        metricsManager.incrementCounter("api_errors_total",
            "endpoint", endpoint,
            "error_type", exceptionType,
            "error_code", "GENERAL");

        logger.log(Level.SEVERE,
            String.format("Error at %s: [%s] %s", endpoint, exceptionType, ex.getMessage()), ex);
    }

    /**
     * Track error with custom error code.
     */
    public static void trackError(String endpoint, String errorCode, String message) {
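        // Keep the same tag keys as the other api_errors_total counters; the Prometheus
        // registry expects one consistent set of label names per metric name.
        metricsManager.incrementCounter("api_errors_total",
            "endpoint", endpoint,
            "error_type", "custom",
            "error_code", errorCode);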

        logger.log(Level.WARNING,
            String.format("Error at %s: [%s] %s", endpoint, errorCode, message));
    }
}

Step 2.7: Update API Classes with Error Metrics

This step involves updating all API endpoint classes to track errors. Here's a template pattern to apply:

Template Pattern for Error Handling:

// BEFORE:
catch (TlinqClientException ex) {
    logger.warning(ex.getMessage());
    ar = new TlinqApiResponse(ex.getErrorCode(), ex.getMessage());
}

// AFTER:
catch (TlinqClientException ex) {
    ErrorMetrics.trackClientException("endpoint-name", ex);
    ar = new TlinqApiResponse(ex.getErrorCode(), ex.getMessage());
}

// BEFORE:
catch (Exception ex) {
    logger.warning(ex.getMessage());
    ar = new TlinqApiResponse(TlinqErr.GENERAL, ex.getMessage());
}

// AFTER:
catch (Exception ex) {
    ErrorMetrics.trackException("endpoint-name", ex);
    ar = new TlinqApiResponse(TlinqErr.GENERAL, ex.getMessage());
}

Files to Update (17 files):

  1. tqapi/src/main/java/com/perun/tlinq/api/UserApi.java - 13 exception handlers
  2. tqapi/src/main/java/com/perun/tlinq/api/BookingApi.java - 11 exception handlers
  3. tqapi/src/main/java/com/perun/tlinq/api/HotelApi.java - ~54 exception handlers
  4. tqapi/src/main/java/com/perun/tlinq/api/FlightApi.java - ~15 exception handlers
  5. tqapi/src/main/java/com/perun/tlinq/api/CartApi.java - ~17 exception handlers
  6. tqapi/src/main/java/com/perun/tlinq/api/TripMakerApi.java - ~91 exception handlers
  7. tqapi/src/main/java/com/perun/tlinq/api/ProductApi.java - ~33 exception handlers
  8. tqapi/src/main/java/com/perun/tlinq/api/CustomerApi.java - ~12 exception handlers
  9. tqapi/src/main/java/com/perun/tlinq/api/DocumentApi.java - ~27 exception handlers
  10. tqapi/src/main/java/com/perun/tlinq/api/GroupApi.java - ~33 exception handlers
  11. tqapi/src/main/java/com/perun/tlinq/api/CommonApi.java - ~7 exception handlers
  12. tqapi/src/main/java/com/perun/tlinq/api/CruiseApi.java - ~9 exception handlers
  13. tqapi/src/main/java/com/perun/tlinq/api/VisaApi.java - ~36 exception handlers
  14. tqapi/src/main/java/com/perun/tlinq/api/TripOfferApi.java - ~22 exception handlers
  15. tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java - 5 exception handlers
  16. tqapi/src/main/java/com/perun/tlinq/ApiRoleManager.java - 1 exception handler
  17. tqapi/src/main/java/com/perun/tlinq/TlinqApiServer.java - if still in use

Action Items:

For each file:

  1. Add import: import com.perun.tlinq.metrics.ErrorMetrics;
  2. Find all catch (TlinqClientException ex) blocks
  3. Replace logger call with ErrorMetrics.trackClientException("endpoint-name", ex);
  4. Find all catch (Exception ex) blocks
  5. Replace logger call with ErrorMetrics.trackException("endpoint-name", ex);

Endpoint Name Convention:

  • Use the API path for endpoint name: e.g., "user/authenticate", "booking/create", "hotel/search"
  • Extract from @Path annotation on class and method
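
The endpoint name convention above combines the class-level and method-level @Path values. A small sketch of that join, tolerant of leading and trailing slashes, is shown here. EndpointNameSketch is a hypothetical helper and the example paths are the ones quoted in the convention.

```java
/** Hypothetical helper that joins class-level and method-level @Path values. */
public class EndpointNameSketch {

    /** Joins the two path fragments into a name such as "user/authenticate". */
    public static String join(String classPath, String methodPath) {
        String left = stripSlashes(classPath);
        String right = stripSlashes(methodPath);
        if (left.isEmpty()) {
            return right;
        }
        if (right.isEmpty()) {
            return left;
        }
        return left + "/" + right;
    }

    /** Removes surrounding whitespace and any leading or trailing slashes. */
    private static String stripSlashes(String path) {
        if (path == null) {
            return "";
        }
        String s = path.trim();
        int start = 0;
        int end = s.length();
        while (start < end && s.charAt(start) == '/') {
            start++;
        }
        while (end > start && s.charAt(end - 1) == '/') {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(join("/user", "/authenticate")); // user/authenticate
        System.out.println(join("booking/", "create"));     // booking/create
    }
}
```

Defining each name once as a constant per handler, rather than retyping the string in every catch block, keeps the endpoint label consistent with the one MetricsFilter records.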


Phase 3: Prometheus Configuration

Step 3.1: Start Prometheus and Grafana

cd tqapi
docker-compose -f docker-compose-monitoring.yml up -d

Verify:

  • Prometheus UI: http://localhost:9090
  • Grafana UI: http://localhost:3000 (admin/admin)

Step 3.2: Verify Metrics Endpoint

Start TQPro API server, then:

curl http://localhost:11080/tlinq-api/metrics

Expected Output:

# HELP api_requests_total Total API requests
# TYPE api_requests_total counter
api_requests_total{application="tqpro",endpoint="user/authenticate",method="POST",module="tqapi",} 5.0
...

Step 3.3: Verify Prometheus Scraping

  1. Open Prometheus UI: http://localhost:9090
  2. Go to Status > Targets
  3. Verify tqpro-api target is UP
  4. Query metrics: api_requests_total

Troubleshooting:

  • If target is DOWN, check network connectivity
  • For Docker Desktop (Mac/Windows): use host.docker.internal:11080
  • For Linux: use 172.17.0.1:11080 or host IP
  • Check TQPro API is running: curl http://localhost:11080/tlinq-api/metrics

Step 3.4: Configure Alerting Rules (Optional)

File: tqapi/monitoring/prometheus-rules.yml (NEW FILE)

groups:
  - name: tqpro_api_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighAPIErrorRate
        expr: rate(api_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High API error rate detected"
          description: "API error rate is {{ $value }} errors/sec for endpoint {{ $labels.endpoint }}"

      # Authorization failures
      - alert: HighAuthorizationFailureRate
        expr: rate(auth_rejected_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authorization rejection rate"
          description: "Auth rejection rate is {{ $value }} for API {{ $labels.api_path }}"

      # Slow API responses
      - alert: SlowAPIResponses
        expr: histogram_quantile(0.95, rate(api_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API response time degradation"
          description: "95th percentile latency is {{ $value }}s for {{ $labels.endpoint }}"
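
The SlowAPIResponses rule relies on histogram_quantile, which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below reproduces that estimate in plain Java to show where a p95 value comes from. It is a simplification (no label handling, no validation of bucket order), and the bucket bounds in the example are made up for illustration.

```java
/** Simplified re-implementation of the histogram_quantile estimate. */
public class HistogramQuantileSketch {

    /**
     * @param q                quantile, 0 < q <= 1
     * @param upperBounds      ascending bucket upper bounds ("le"), last one +Inf
     * @param cumulativeCounts cumulative observation count for each bucket
     */
    public static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2) {
            return Double.NaN; // need at least one finite bucket plus +Inf
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations in the window
        }
        double rank = q * total;
        int i = 0;
        while (i < n - 1 && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == n - 1) {
            // Rank falls in the +Inf bucket: report the highest finite bound.
            return upperBounds[n - 2];
        }
        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double previous = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - previous;
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[i] - lower) * (rank - previous) / inBucket;
    }

    public static void main(String[] args) {
        double[] bounds = {0.5, 1.0, 2.0, 5.0, Double.POSITIVE_INFINITY};
        double[] counts = {50, 80, 95, 99, 100};
        System.out.println(quantile(0.95, bounds, counts));
    }
}
```

Because the estimate can never be finer than the bucket boundaries, the "> 2" threshold in the rule is only as precise as the buckets around 2 seconds that the timer actually publishes.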

Update tqapi/monitoring/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

# ADD:
rule_files:
  - "/etc/prometheus/prometheus-rules.yml"

scrape_configs:
  # ... existing config ...

Update tqapi/docker-compose-monitoring.yml:

prometheus:
  # ... existing config ...
  volumes:
    - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
    - ./monitoring/prometheus-rules.yml:/etc/prometheus/prometheus-rules.yml  # ADD
    - prometheus-data:/prometheus

Phase 4: Grafana Dashboard Setup

Step 4.1: Access Grafana

  1. Open: http://localhost:3000
  2. Login: admin / admin
  3. Change password when prompted (or skip)

Step 4.2: Verify Prometheus Datasource

  1. Go to Connections > Data sources (Configuration > Data Sources in older Grafana versions)
  2. Verify "Prometheus" is listed and working
  3. Click "Test" - should see "Data source is working"

Step 4.3: Create Main API Dashboard

File: tqapi/monitoring/grafana/dashboards/tqpro-api-overview.json (NEW FILE)

{
  "dashboard": {
    "title": "TQPro API Overview",
    "tags": ["tqpro", "api", "overview"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate (req/sec)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "rate(api_requests_total{application=\"tqpro\"}[5m])",
            "legendFormat": "{{endpoint}} - {{method}}"
          }
        ]
      },
      {
        "id": 2,
        "title": "Error Rate (errors/sec)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "rate(api_errors_total{application=\"tqpro\"}[5m])",
            "legendFormat": "{{endpoint}} - {{error_code}}"
          }
        ]
      },
      {
        "id": 3,
        "title": "Request Duration (p50, p95, p99)",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p99"
          }
        ]
      },
      {
        "id": 4,
        "title": "Response Status Distribution",
        "type": "piechart",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(api_responses_total{application=\"tqpro\"}[5m])) by (status_category)",
            "legendFormat": "{{status_category}}"
          }
        ]
      }
    ],
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "30s"
  }
}

Step 4.4: Create Authorization Dashboard

File: tqapi/monitoring/grafana/dashboards/tqpro-authorization.json (NEW FILE)

{
  "dashboard": {
    "title": "TQPro Authorization Metrics",
    "tags": ["tqpro", "api", "authorization", "security"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Authorization Success vs Rejection",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "rate(auth_authorized_total{application=\"tqpro\"}[5m])",
            "legendFormat": "Authorized - {{api_path}}"
          },
          {
            "expr": "rate(auth_rejected_total{application=\"tqpro\"}[5m])",
            "legendFormat": "Rejected - {{api_path}}"
          }
        ]
      },
      {
        "id": 2,
        "title": "Authorization Rejection Rate by API",
        "type": "graph",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "rate(auth_rejected_total{application=\"tqpro\"}[5m])",
            "legendFormat": "{{api_path}} - {{user_roles}}"
          }
        ]
      },
      {
        "id": 3,
        "title": "Top Rejected Users",
        "type": "table",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "topk(10, sum(increase(auth_rejected_total{application=\"tqpro\"}[1h])) by (user_id, api_path))",
            "format": "table",
            "instant": true
          }
        ]
      },
      {
        "id": 4,
        "title": "Dev Mode Bypass Counter",
        "type": "stat",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "sum(auth_dev_mode_bypass_total{application=\"tqpro\"})"
          }
        ],
        "options": {
          "colorMode": "background",
          "graphMode": "area"
        }
      }
    ],
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "30s"
  }
}

Step 4.5: Create Performance Dashboard

File: tqapi/monitoring/grafana/dashboards/tqpro-performance.json (NEW FILE)

{
  "dashboard": {
    "title": "TQPro API Performance",
    "tags": ["tqpro", "api", "performance"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Request Duration Heatmap",
        "type": "heatmap",
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "rate(api_request_duration_seconds_bucket{application=\"tqpro\"}[5m])",
            "format": "heatmap",
            "legendFormat": "{{le}}"
          }
        ]
      },
      {
        "id": 2,
        "title": "Slowest Endpoints (p95)",
        "type": "bargauge",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "topk(10, histogram_quantile(0.95, sum by (le, endpoint) (rate(api_request_duration_seconds_bucket{application=\"tqpro\"}[5m]))))"
          }
        ]
      },
      {
        "id": 3,
        "title": "Request Rate by Endpoint",
        "type": "bargauge",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "topk(10, sum by (endpoint) (rate(api_requests_total{application=\"tqpro\"}[5m])))"
          }
        ]
      }
    ],
    "time": {"from": "now-1h", "to": "now"},
    "refresh": "30s"
  }
}

Step 4.6: Import Dashboards

Method 1: Auto-provisioning (Recommended)

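Dashboards are auto-loaded from the /var/lib/grafana/dashboards directory by the provider defined in Step 1.5. File provisioning expects the dashboard model (title, panels, and so on) at the top level of each JSON file, so remove the outer "dashboard" wrapper from the files in Steps 4.3-4.5 if Grafana logs an empty-title error for them.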

Restart Grafana:

docker-compose -f docker-compose-monitoring.yml restart grafana

Method 2: Manual Import

  1. Go to Grafana UI > Dashboards > Import
  2. Upload each JSON file
  3. Select "Prometheus" datasource
  4. Click "Import"

Step 4.7: Configure Alerts in Grafana (Optional)

  1. Edit dashboard panel
  2. Click "Alert" tab
  3. Configure alert rule
  4. Set notification channel, called a contact point in Grafana 9+ (email, Slack, PagerDuty, etc.)

Phase 5: Testing & Validation

Step 5.1: Build and Deploy Updated Code

cd tqapi
../gradlew clean build

Verify build succeeds with no errors.

Step 5.2: Start TQPro API Server

# Set TLINQ_HOME if not already set
export TLINQ_HOME=/path/to/tqpro/config

# Run the API server
java -jar build/libs/tqapi.jar

Verify logs show:

  • "MetricsManager initialized with Prometheus registry"
  • No errors during startup

Step 5.3: Generate Test Traffic

Script: tqapi/test-metrics.sh (NEW FILE)

#!/bin/bash

# Test script to generate API traffic for metrics validation

API_BASE="http://localhost:11080/tlinq-api"

echo "Generating test API traffic..."

# Test 1: Successful requests
for i in {1..50}; do
  curl -X POST "$API_BASE/user/authenticate" \
    -H "Content-Type: application/json" \
    -H "X-User: test-user-$i" \
    -H "X-Roles: agent" \
    -d '{"username":"test","password":"test"}' \
    -s -o /dev/null
  sleep 0.1
done

# Test 2: Unauthorized requests (should be rejected)
for i in {1..20}; do
  curl -X POST "$API_BASE/booking/create" \
    -H "Content-Type: application/json" \
    -H "X-User: guest-$i" \
    -H "X-Roles: guest" \
    -d '{"session":"test"}' \
    -s -o /dev/null
  sleep 0.1
done

# Test 3: Error scenarios
for i in {1..10}; do
  curl -X POST "$API_BASE/user/authenticate" \
    -H "Content-Type: application/json" \
    -d '{"invalid":"data"}' \
    -s -o /dev/null
  sleep 0.1
done

echo "Test traffic generated. Check metrics at $API_BASE/metrics"

Run:

chmod +x tqapi/test-metrics.sh
./tqapi/test-metrics.sh

Step 5.4: Verify Metrics Endpoint

curl http://localhost:11080/tlinq-api/metrics | grep -E "(api_requests_total|auth_rejected_total|api_errors_total)"

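Expected Output (values are illustrative and depend on prior traffic; label order may differ):

api_requests_total{application="tqpro",endpoint="user/authenticate",method="POST",module="tqapi",} 50.0
auth_rejected_total{api_path="booking/create",application="tqpro",module="tqapi",user_id="guest-1",user_roles="guest",} 1.0
api_errors_total{endpoint="user/authenticate",error_code="MISSING_PARAMETER",error_type="TlinqClientException",application="tqpro",module="tqapi",} 10.0

Because the test script sends a distinct X-User value per request (guest-1 through guest-20), auth_rejected_total appears as 20 separate series, each with a value of 1.0, rather than as a single series with a value of 20.0. The api_requests_total value may read 60.0 instead if the 10 malformed requests from Test 3, which also target user/authenticate, are counted as received.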
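Since a counter such as auth_rejected_total can be spread across many label sets, it is often easier to check the total across all of them than to read individual lines. The helper below is a minimal sketch, not part of the TQPro codebase: sum_metric is a hypothetical name, and it assumes the endpoint emits plain Prometheus text format without trailing timestamps.

```shell
#!/bin/bash
# sum_metric NAME
# Reads Prometheus text exposition on stdin and prints the sum of all
# samples whose metric name is exactly NAME, across every label set.
# Assumption: each sample line ends with the value (no timestamp field).
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    {
      metric = $1
      brace = index(metric, "{")           # position of the label block, if any
      if (brace > 0) metric = substr(metric, 1, brace - 1)
      if (metric == name) total += $NF     # last field is the sample value
    }
    END { printf "%d\n", total }           # prints 0 when nothing matched
  '
}
```

Usage, assuming the API is running locally: curl -s http://localhost:11080/tlinq-api/metrics | sum_metric auth_rejected_total. After one run of the test script on a freshly started server, the result should be close to 20.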

Step 5.5: Verify Prometheus Scraping

  1. Open: http://localhost:9090
  2. Query: api_requests_total
  3. Verify data is present and updating
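The same check can be scripted against the Prometheus HTTP API, which serves instant queries at /api/v1/query. The function below is a rough sketch under stated assumptions: prom_has_data is a hypothetical helper, and it relies on simple substring matching against the compact JSON that Prometheus returns rather than on a real JSON parser.

```shell
#!/bin/bash
# prom_has_data
# Reads a Prometheus /api/v1/query JSON response on stdin.
# Succeeds (exit 0) only if the query succeeded and returned at least one series.
prom_has_data() {
  body=$(cat)
  case "$body" in
    *'"status":"success"'*) ;;      # query was accepted and evaluated
    *) return 1 ;;                  # error response or unexpected body
  esac
  case "$body" in
    *'"result":[]'*) return 1 ;;    # evaluated, but no series matched
  esac
  return 0
}
```

Usage, assuming Prometheus listens on its default port: curl -s 'http://localhost:9090/api/v1/query?query=api_requests_total' | prom_has_data && echo "Prometheus has data". For anything beyond a smoke test, a JSON-aware tool would be a more robust choice than substring matching.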

Step 5.6: Verify Grafana Dashboards

  1. Open: http://localhost:3000
  2. Navigate to "TQPro API Overview" dashboard
  3. Verify panels show data:
     • Request Rate graph shows activity
     • Error Rate shows errors from test
     • Request Duration shows latency
     • Status Distribution shows 2xx, 4xx responses
  4. Navigate to "TQPro Authorization Metrics"
  5. Verify authorization rejection data

Step 5.7: Test Alert Rules (if configured)

  1. Generate high error rate:

    for i in {1..100}; do
      curl -X POST http://localhost:11080/tlinq-api/user/authenticate \
        -H "Content-Type: application/json" \
        -d '{"invalid":"data"}' -s -o /dev/null
    done
    

  2. Check Prometheus > Alerts

  3. Verify "HighAPIErrorRate" alert fires

Metrics Reference

Core Metrics

| Metric Name | Type | Labels | Description |
|---|---|---|---|
| api_requests_total | Counter | endpoint, method | Total API requests received |
| api_responses_total | Counter | endpoint, method, status, status_category | Total API responses sent |
| api_request_duration_seconds | Histogram | endpoint, method, status, status_category | Request processing duration |
| auth_requests_total | Counter | api_path, user_id, roles | Total authentication attempts |
| auth_authorized_total | Counter | api_path, user_roles | Successful authorizations |
| auth_rejected_total | Counter | api_path, user_id, user_roles | Rejected authorization attempts |
| auth_dev_mode_bypass_total | Counter | user_id | Dev mode authentication bypasses |
| api_errors_total | Counter | endpoint, error_type, error_code | Total API errors |

Useful PromQL Queries

Request Rate:

rate(api_requests_total{application="tqpro"}[5m])

Error Rate:

rate(api_errors_total{application="tqpro"}[5m])

95th Percentile Latency:

histogram_quantile(0.95, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))

Authorization Rejection Ratio:

sum(rate(auth_rejected_total[5m])) / sum(rate(auth_requests_total[5m]))

Top 10 Slowest Endpoints:

topk(10, histogram_quantile(0.95, sum by (le, endpoint) (rate(api_request_duration_seconds_bucket[5m]))))

Error Rate by Endpoint:

sum(rate(api_errors_total[5m])) by (endpoint)


Troubleshooting

Issue: Metrics endpoint returns 404

Symptoms: curl http://localhost:11080/tlinq-api/metrics returns 404

Solutions:

  1. Verify MetricsServlet is registered in TQProApiServer.java
  2. Check server logs for startup errors
  3. Verify servlet path: ctx.addServlet(metricsServlet, "/metrics");
  4. Test: curl http://localhost:11080/tlinq-api/user/authenticate (should work)

Issue: Prometheus target is DOWN

Symptoms: Prometheus UI shows tqpro-api target as DOWN

Solutions:

  1. Check TQPro API is running: curl http://localhost:11080/tlinq-api/metrics
  2. Check Docker network connectivity:
     • Mac/Windows: Use host.docker.internal:11080
     • Linux: Use 172.17.0.1:11080 or host IP
  3. Check prometheus.yml target configuration
  4. View Prometheus logs: docker logs tqpro-prometheus

Issue: No data in Grafana dashboards

Symptoms: Dashboards load but panels show "No data"

Solutions:

  1. Verify Prometheus datasource is working (Test button)
  2. Check Prometheus is scraping: http://localhost:9090/targets
  3. Verify metrics exist in Prometheus: query api_requests_total
  4. Generate test traffic: ./test-metrics.sh
  5. Check time range in Grafana (last 1 hour)

Issue: Metrics not incrementing

Symptoms: Metrics endpoint shows metrics but values don't change

Solutions:

  1. Verify filters are registered in ResourceConfig
  2. Check server logs for filter errors
  3. Make actual API calls (not just /metrics)
  4. Verify MetricsManager.getInstance() is called
  5. Check no exceptions during metric recording

Issue: High memory usage

Symptoms: API server memory grows over time

Solutions:

  1. Check metric cardinality (too many label combinations)
  2. Avoid high-cardinality labels (user IDs, timestamps, etc.)
  3. Use caching in MetricsManager (already implemented)
  4. Consider aggregating metrics before recording
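One quick way to check cardinality is to count how many series each metric name exposes on the metrics endpoint. The function below is a minimal sketch; series_per_metric is a hypothetical helper name, and it counts histogram series under their suffixed names (for example api_request_duration_seconds_bucket) rather than grouping them by base metric.

```shell
#!/bin/bash
# series_per_metric
# Reads Prometheus text exposition on stdin and prints one line per metric
# name in the form "<series_count> <metric_name>", highest count first.
series_per_metric() {
  awk '
    /^#/ || NF == 0 { next }               # skip comments and blank lines
    {
      metric = $1
      brace = index(metric, "{")           # strip the label block, if any
      if (brace > 0) metric = substr(metric, 1, brace - 1)
      count[metric]++                      # one sample line == one series
    }
    END { for (m in count) print count[m], m }
  ' | sort -k1,1nr -k2,2
}
```

Usage, assuming the API is running locally: curl -s http://localhost:11080/tlinq-api/metrics | series_per_metric | head. A metric whose count keeps growing between runs, such as one labeled with user_id, is a likely source of memory growth.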

Issue: Build errors after adding Micrometer

Symptoms: Gradle build fails with dependency errors

Solutions:

  1. Refresh dependencies: ../gradlew build --refresh-dependencies
  2. Verify Micrometer version compatibility with Java 11+
  3. Check for conflicting dependencies
  4. Clean build: ../gradlew clean build


Production Deployment Considerations

Security

  1. Secure Metrics Endpoint
     • Add authentication to /tlinq-api/metrics
     • Use firewall rules to restrict access to Prometheus server
     • Consider TLS for scraping
  2. Sensitive Data in Labels
     • Do NOT include passwords, tokens, or PII in metric labels
     • Sanitize user_id labels (hash or use internal IDs)
     • Avoid exposing internal system details
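To illustrate what hashing a user_id label might look like, the sketch below derives a short, stable token from an ID. It is an assumption-laden example rather than the project's approach: hash_user_id is a hypothetical name, the 16-character truncation is arbitrary, and in practice the equivalent logic would live in the Java code that records the metric. An unsalted hash of a guessable ID can still be reversed by brute force, and hashing does not reduce label cardinality.

```shell
#!/bin/bash
# hash_user_id ID
# Prints the first 16 hex characters of SHA-256(ID), so the raw ID never
# appears as a label value. Falls back to shasum where sha256sum is absent
# (for example on macOS).
hash_user_id() {
  if command -v sha256sum >/dev/null 2>&1; then
    printf '%s' "$1" | sha256sum | cut -c1-16
  else
    printf '%s' "$1" | shasum -a 256 | cut -c1-16
  fi
}
```

Usage: hash_user_id "guest-1" prints the same 16-character token on every call, so series for one user stay grouped without exposing the original identifier.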

Performance

  1. Metrics Cardinality
     • Limit unique label combinations to < 10,000 per metric
     • Avoid user-specific or request-specific labels
     • Use aggregation where possible
  2. Scrape Interval
     • Production: 30-60s recommended
     • High-traffic: Consider sampling
  3. Retention
     • Prometheus default: 15 days
     • For long-term: Use Thanos, Cortex, or VictoriaMetrics

High Availability

  1. Prometheus HA
     • Deploy multiple Prometheus instances
     • Use federation for aggregation
  2. Grafana HA
     • Use external database (PostgreSQL/MySQL)
     • Deploy multiple Grafana instances behind load balancer

Kubernetes Deployment

Example:

apiVersion: v1
kind: Service
metadata:
  name: tqpro-api-metrics
  labels:
    app: tqpro-api   # matched by the ServiceMonitor selector below
spec:
  ports:
  - name: http       # referenced by the ServiceMonitor endpoint port
    port: 11080
    targetPort: 11080
  selector:
    app: tqpro-api
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tqpro-api
spec:
  selector:
    matchLabels:
      app: tqpro-api
  endpoints:
  - port: http
    path: /tlinq-api/metrics
    interval: 30s

Next Steps After Implementation

  1. Establish Baselines
     • Run for 1-2 weeks to establish normal behavior
     • Document baseline metrics (request rate, error rate, latency)
  2. Set Alert Thresholds
     • Based on baselines, configure meaningful alerts
     • Avoid alert fatigue (tune thresholds)
  3. Create Runbooks
     • Document response procedures for each alert
     • Include investigation steps and remediation
  4. Integrate with Incident Management
     • Connect alerts to PagerDuty, OpsGenie, etc.
     • Set up on-call rotation
  5. Continuous Improvement
     • Review metrics weekly
     • Identify optimization opportunities
     • Add business-specific metrics
  6. Capacity Planning
     • Use historical data for capacity planning
     • Predict scaling needs


Appendix: Quick Start Commands

# Build API with metrics support
cd tqapi
../gradlew clean build

# Start monitoring stack
docker-compose -f docker-compose-monitoring.yml up -d

# Start TQPro API
export TLINQ_HOME=/path/to/config
java -jar build/libs/tqapi.jar

# Generate test traffic
./test-metrics.sh

# View metrics
curl http://localhost:11080/tlinq-api/metrics

# Access UIs
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/admin)

End of Document