TQPro API Observability Implementation Guide¶
Prometheus + Grafana Monitoring Setup¶
Status: Implementation Planned
Target Module: tqapi
Created: 2025-11-23
Objective: Implement comprehensive observability for TQPro API with Prometheus metrics and Grafana dashboards
Table of Contents¶
- Overview
- Prerequisites
- Phase 1: Infrastructure Setup
- Phase 2: Code Implementation
- Phase 3: Prometheus Configuration
- Phase 4: Grafana Dashboard Setup
- Phase 5: Testing & Validation
- Metrics Reference
- Troubleshooting
Overview¶
This guide implements observability for the TQPro API module (tqapi) to track:
- API Execution Times - Request duration, throughput, latency percentiles
- Authorization Metrics - Successful/rejected authorization attempts, by API and role
- Error Tracking - Exception rates, error types, error codes by endpoint
Architecture¶
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ TQPro API │─────>│ Prometheus │─────>│ Grafana │
│ (Metrics) │ HTTP │ (Scraper) │ HTTP │ (Dashboard) │
└─────────────┘ └──────────────┘ └─────────────┘
Technology Stack¶
- Metrics Library: Micrometer (vendor-neutral facade)
- Metrics Format: Prometheus
- Storage: Prometheus TSDB
- Visualization: Grafana
- Deployment: Docker Compose (dev/test), Kubernetes (production)
Prerequisites¶
Required Software¶
- JDK 11+ (already installed)
- Docker & Docker Compose (for Prometheus/Grafana)
- Gradle (already configured in project)
Required Knowledge¶
- Basic understanding of Prometheus query language (PromQL)
- Familiarity with Grafana dashboard creation
- Understanding of JAX-RS filters and Jakarta servlets
Estimated Time¶
- Phase 1 (Infrastructure): 1-2 hours
- Phase 2 (Code Implementation): 4-6 hours
- Phase 3 (Prometheus Config): 1 hour
- Phase 4 (Grafana Dashboards): 2-3 hours
- Phase 5 (Testing): 1-2 hours
Total: 9-14 hours
Phase 1: Infrastructure Setup¶
Step 1.1: Add Micrometer Dependencies¶
File: tqapi/build.gradle.kts
Location: Lines 12-20 (dependencies block)
dependencies {
testImplementation(platform("org.junit:junit-bom:5.10.0"))
testImplementation("org.junit.jupiter:junit-jupiter")
testImplementation("org.mockito:mockito-junit-jupiter:5.12.0")
testImplementation("org.mockito:mockito-inline:5.2.0")
implementation(project(":tqapp"))
implementation(project(":tqcommon"))
// ADD: Micrometer metrics dependencies
implementation("io.micrometer:micrometer-core:1.12.0")
implementation("io.micrometer:micrometer-registry-prometheus:1.12.0")
// ADD: Prometheus exposition format
implementation("io.prometheus:simpleclient:0.16.0")
implementation("io.prometheus:simpleclient_common:0.16.0")
}
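Action: After editing the file, refresh the Gradle dependencies from the tqapi directory, for example with ../gradlew build --refresh-dependencies.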
Step 1.2: Create Docker Compose for Prometheus & Grafana¶
File: tqapi/docker-compose-monitoring.yml (NEW FILE)
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: tqpro-prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    networks:
      - monitoring
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: tqpro-grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    networks:
      - monitoring
    restart: unless-stopped
    depends_on:
      - prometheus
volumes:
  prometheus-data:
  grafana-data:
networks:
  monitoring:
    driver: bridge
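Action: Create the directories that the volume mounts above refer to (monitoring/, monitoring/grafana/provisioning/ and monitoring/grafana/dashboards/ under tqapi). The files that go in them are created in the following steps.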
Step 1.3: Create Prometheus Configuration¶
File: tqapi/monitoring/prometheus.yml (NEW FILE)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'tqpro-dev'
    environment: 'development'
scrape_configs:
  - job_name: 'tqpro-api'
    metrics_path: '/tlinq-api/metrics'
    static_configs:
      - targets: ['host.docker.internal:11080'] # For Docker on Mac/Windows
      # - targets: ['172.17.0.1:11080'] # For Docker on Linux
        labels:
          service: 'tqpro-api'
          module: 'tqapi'
Note: Adjust targets based on your Docker host configuration.
Step 1.4: Create Grafana Datasource Provisioning¶
File: tqapi/monitoring/grafana/provisioning/datasources/prometheus.yml (NEW FILE)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
Step 1.5: Create Grafana Dashboard Provisioning¶
File: tqapi/monitoring/grafana/provisioning/dashboards/default.yml (NEW FILE)
apiVersion: 1
providers:
  - name: 'TQPro Dashboards'
    orgId: 1
    folder: 'TQPro'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
Phase 2: Code Implementation¶
Step 2.1: Create Metrics Registry Manager¶
File: tqapi/src/main/java/com/perun/tlinq/metrics/MetricsManager.java (NEW FILE)
package com.perun.tlinq.metrics;
import io.micrometer.core.instrument.*;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;
/**
* Centralized metrics management for TQPro API.
* Provides singleton access to Micrometer registry and helper methods for metrics.
*/
public class MetricsManager {
private static final Logger logger = Logger.getLogger(MetricsManager.class.getName());
private static MetricsManager instance;
private final PrometheusMeterRegistry registry;
// Cache for frequently used metrics
private final ConcurrentHashMap<String, Counter> counterCache = new ConcurrentHashMap<>();
private final ConcurrentHashMap<String, Timer> timerCache = new ConcurrentHashMap<>();
private MetricsManager() {
this.registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
configureCommonTags();
logger.info("MetricsManager initialized with Prometheus registry");
}
public static synchronized MetricsManager getInstance() {
if (instance == null) {
instance = new MetricsManager();
}
return instance;
}
private void configureCommonTags() {
registry.config().commonTags(
"application", "tqpro",
"module", "tqapi"
);
}
public PrometheusMeterRegistry getRegistry() {
return registry;
}
/**
* Get Prometheus-formatted metrics for exposition endpoint.
*/
public String scrape() {
return registry.scrape();
}
// ============= Counter Helpers =============
public void incrementCounter(String name, String... tags) {
    // Separate the name and each tag with '|' so that different
    // name/tag combinations are unlikely to produce the same cache key
    String cacheKey = name + "|" + String.join("|", tags);
    Counter counter = counterCache.computeIfAbsent(cacheKey,
        k -> Counter.builder(name).tags(tags).register(registry));
    counter.increment();
}
public void incrementCounterBy(String name, double amount, String... tags) {
    String cacheKey = name + "|" + String.join("|", tags);
    Counter counter = counterCache.computeIfAbsent(cacheKey,
        k -> Counter.builder(name).tags(tags).register(registry));
    counter.increment(amount);
}
// ============= Timer Helpers =============
public Timer.Sample startTimer() {
    return Timer.start(registry);
}
public void recordTimer(Timer.Sample sample, String name, String... tags) {
    String cacheKey = name + "|" + String.join("|", tags);
    Timer timer = timerCache.computeIfAbsent(cacheKey,
        k -> Timer.builder(name)
            .tags(tags)
            // Publish histogram buckets. Without this, no *_bucket series are
            // exported and the histogram_quantile() queries in this guide return no data.
            .publishPercentileHistogram()
            .register(registry));
    sample.stop(timer);
}
// ============= Gauge Helpers =============
public <T> T registerGauge(String name, T obj, java.util.function.ToDoubleFunction<T> valueFunction, String... tags) {
return Gauge.builder(name, obj, valueFunction)
.tags(tags)
.register(registry);
}
}
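getInstance() above is synchronized, so every caller takes a lock even after the instance exists. A common alternative is the initialization-on-demand holder idiom. The sketch below illustrates it with a stand-in class: LazyRegistry is hypothetical, only counts how often it is constructed, and is not the MetricsManager from this guide.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Stand-in for a lazily created singleton; not the real MetricsManager. */
class LazyRegistry {
    // Counts constructor calls so the single-instance property can be observed
    static final AtomicInteger constructions = new AtomicInteger();

    private LazyRegistry() {
        constructions.incrementAndGet();
    }

    // The JVM initializes Holder on first access to Holder.INSTANCE,
    // exactly once and in a thread-safe way, so no explicit lock is needed.
    private static final class Holder {
        static final LazyRegistry INSTANCE = new LazyRegistry();
    }

    static LazyRegistry getInstance() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        LazyRegistry a = getInstance();
        LazyRegistry b = getInstance();
        System.out.println("same instance: " + (a == b));
        System.out.println("constructions: " + constructions.get());
    }
}
```

Whether the change is worthwhile depends on how hot the call path is; the synchronized version in the guide is also correct.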
Step 2.2: Create Metrics Filter for API Timing¶
File: tqapi/src/main/java/com/perun/tlinq/metrics/MetricsFilter.java (NEW FILE)
package com.perun.tlinq.metrics;
import io.micrometer.core.instrument.Timer;
import jakarta.ws.rs.container.ContainerRequestContext;
import jakarta.ws.rs.container.ContainerRequestFilter;
import jakarta.ws.rs.container.ContainerResponseContext;
import jakarta.ws.rs.container.ContainerResponseFilter;
import jakarta.ws.rs.ext.Provider;
import java.io.IOException;
import java.util.logging.Logger;
/**
* JAX-RS filter to capture API request metrics:
* - Request duration (timing)
* - Request count
* - Response status codes
*/
@Provider
public class MetricsFilter implements ContainerRequestFilter, ContainerResponseFilter {
private static final Logger logger = Logger.getLogger(MetricsFilter.class.getName());
private static final String TIMER_SAMPLE_PROPERTY = "metrics.timer.sample";
private static final String REQUEST_PATH_PROPERTY = "metrics.request.path";
private final MetricsManager metricsManager = MetricsManager.getInstance();
@Override
public void filter(ContainerRequestContext requestContext) throws IOException {
// Start timing the request
Timer.Sample sample = metricsManager.startTimer();
requestContext.setProperty(TIMER_SAMPLE_PROPERTY, sample);
// Store request path for use in response filter
String path = requestContext.getUriInfo().getPath();
requestContext.setProperty(REQUEST_PATH_PROPERTY, path);
// Count incoming requests
String method = requestContext.getMethod();
metricsManager.incrementCounter("api_requests_total",
"endpoint", path,
"method", method);
}
@Override
public void filter(ContainerRequestContext requestContext,
ContainerResponseContext responseContext) throws IOException {
// Retrieve timer sample
Timer.Sample sample = (Timer.Sample) requestContext.getProperty(TIMER_SAMPLE_PROPERTY);
String path = (String) requestContext.getProperty(REQUEST_PATH_PROPERTY);
if (sample != null && path != null) {
String method = requestContext.getMethod();
int status = responseContext.getStatus();
String statusCategory = getStatusCategory(status);
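// Record request duration. Micrometer's Prometheus registry appends the _seconds
// unit suffix to timers, so the meter is registered without it and is exported
// as api_request_duration_seconds.
metricsManager.recordTimer(sample, "api_request_duration",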
"endpoint", path,
"method", method,
"status", String.valueOf(status),
"status_category", statusCategory);
// Count responses by status
metricsManager.incrementCounter("api_responses_total",
"endpoint", path,
"method", method,
"status", String.valueOf(status),
"status_category", statusCategory);
}
}
private String getStatusCategory(int status) {
if (status >= 200 && status < 300) return "2xx";
if (status >= 300 && status < 400) return "3xx";
if (status >= 400 && status < 500) return "4xx";
if (status >= 500 && status < 600) return "5xx";
return "unknown";
}
}
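The filter above uses the raw request path as the endpoint label, so any path that embeds an identifier would create a new time series per value. A minimal sketch of one way to keep that bounded, assuming identifiers appear as purely numeric or UUID-shaped path segments. The EndpointLabels class and the {id} placeholder are hypothetical and not part of tqapi.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: collapses variable path segments into a fixed placeholder. */
class EndpointLabels {
    // Assumption: identifiers are all-digit segments or canonical 36-character UUIDs
    private static final Pattern NUMERIC = Pattern.compile("\\d+");
    private static final Pattern UUID = Pattern.compile(
        "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");

    /** Returns the path with identifier-like segments replaced by "{id}". */
    static String normalize(String path) {
        String[] segments = path.split("/");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                out.append('/');
            }
            String s = segments[i];
            boolean isId = NUMERIC.matcher(s).matches() || UUID.matcher(s).matches();
            out.append(isId ? "{id}" : s);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("booking/12345/items"));
        System.out.println(normalize("user/authenticate"));
    }
}
```

In the filter, the normalized value would be stored in REQUEST_PATH_PROPERTY instead of the raw path, so both the request counter and the timer share the same bounded label.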
Step 2.3: Create Metrics Servlet for Prometheus Scraping¶
File: tqapi/src/main/java/com/perun/tlinq/metrics/MetricsServlet.java (NEW FILE)
package com.perun.tlinq.metrics;
import jakarta.servlet.http.HttpServlet;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;
import java.io.Writer;
import java.util.logging.Logger;
/**
* Servlet to expose Prometheus metrics at /tlinq-api/metrics endpoint.
*/
public class MetricsServlet extends HttpServlet {
private static final Logger logger = Logger.getLogger(MetricsServlet.class.getName());
private final MetricsManager metricsManager = MetricsManager.getInstance();
@Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    String metrics;
    try {
        // Scrape before writing anything, so a failure can still be reported as a 500
        metrics = metricsManager.scrape();
    } catch (RuntimeException e) {
        logger.severe("Error generating metrics: " + e.getMessage());
        resp.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR,
            "Error generating metrics");
        return;
    }
    resp.setStatus(HttpServletResponse.SC_OK);
    resp.setContentType("text/plain; version=0.0.4; charset=utf-8");
    try (Writer writer = resp.getWriter()) {
        writer.write(metrics);
    }
}
}
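For reference, each sample line the servlet returns follows the Prometheus text exposition format: metric name, optional labels in braces, then the value. A minimal sketch of rendering one sample by hand, assuming labels are emitted in sorted key order. The ExpositionLine class is hypothetical; in this guide the actual output comes from registry.scrape().

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

/** Hypothetical illustration of one line of the text exposition format. */
class ExpositionLine {
    /** Escapes backslash, double quote and newline in a label value. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    static String format(String name, Map<String, String> labels, double value) {
        if (labels.isEmpty()) {
            return name + " " + value;
        }
        StringJoiner joined = new StringJoiner(",", "{", "}");
        // TreeMap gives a stable, sorted label order
        for (Map.Entry<String, String> e : new TreeMap<>(labels).entrySet()) {
            joined.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
        }
        return name + joined + " " + value;
    }

    public static void main(String[] args) {
        System.out.println(format("api_requests_total",
            Map.of("method", "POST", "endpoint", "user/authenticate"), 5.0));
    }
}
```

Unlike the expected output shown in this guide, the sketch emits no trailing comma after the last label.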
Step 2.4: Update TQProApiServer to Register Metrics Components¶
File: tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java
Location 1: Add imports (after line 22)
import com.perun.tlinq.metrics.MetricsFilter;
import com.perun.tlinq.metrics.MetricsServlet;
import org.eclipse.jetty.ee10.servlet.ServletHolder;
Location 2: Register MetricsFilter in ResourceConfig (line 207-211)
// BEFORE:
ResourceConfig rc = new ResourceConfig()
.packages("com.perun.tlinq.api")
.register(MultiPartFeature.class)
.register(authFilter)
.register(CORSResponseFilter.class);
// AFTER:
ResourceConfig rc = new ResourceConfig()
.packages("com.perun.tlinq.api")
.register(MultiPartFeature.class)
.register(authFilter)
.register(CORSResponseFilter.class)
.register(MetricsFilter.class); // ADD THIS LINE
Location 3: Add metrics endpoint servlet (after line 220, before server.setHandler)
// ADD: Metrics endpoint for Prometheus scraping
ServletHolder metricsServlet = new ServletHolder(new MetricsServlet());
ServletContextHandler ctx = new ServletContextHandler(ServletContextHandler.NO_SESSIONS);
ctx.setContextPath("/tlinq-api");
ctx.addServlet(jersey, "/*");
ctx.addServlet(metricsServlet, "/metrics"); // ADD THIS LINE
server.setHandler(ctx);
Step 2.5: Enhance AuthenticationFilter with Authorization Metrics¶
File: tqapi/src/main/java/com/perun/tlinq/AuthenticationFilter.java
Location 1: Add import (after line 23)
import com.perun.tlinq.metrics.MetricsManager;
Location 2: Add field (after line 31)
private DevModeConfig devModeConfig = DevModeConfig.disabled();
private final MetricsManager metricsManager = MetricsManager.getInstance(); // ADD THIS
Location 3: Track successful authentication (after line 81)
String apiPath = requestContext.getUriInfo().getPath();
logger.info("API call: " + apiPath + " from user "+userName+"/"+userEmail+" [" + finalUserId+"]");
// ADD: Track authentication metrics
metricsManager.incrementCounter("auth_requests_total",
"api_path", apiPath,
"user_id", finalUserId,
"roles", userRoles);
Location 4: Track dev-mode bypass (after line 64)
userName = userName != null ? userName : devModeConfig.getDefaultUserName();
logger.info("DEV-MODE: Using default dev user - " + userId);
// ADD: Track dev mode usage
metricsManager.incrementCounter("auth_dev_mode_bypass_total",
"user_id", userId);
Location 5: Track authorization failures (line 107-114)
if(!apiRoleManager.isUserAuthorized(apiPath, userRoles)) {
logger.warning("User "+finalUserId+" is not authorized to access API "+apiPath);
// ADD: Track authorization rejection
metricsManager.incrementCounter("auth_rejected_total",
"api_path", apiPath,
"user_id", finalUserId,
"user_roles", userRoles);
TlinqApiResponse errorResponse = new TlinqApiResponse(TlinqErr.SESSION_ERROR,
"User is not authorized to access this API.");
requestContext.abortWith(
Response.status(Response.Status.FORBIDDEN)
.entity(errorResponse)
.build());
}
Location 6: Track successful authorization (after authorization check, line 114)
if(!apiRoleManager.isUserAuthorized(apiPath, userRoles)) {
// ... rejection code above ...
} else {
// ADD: Track successful authorization
metricsManager.incrementCounter("auth_authorized_total",
"api_path", apiPath,
"user_roles", userRoles);
}
Step 2.6: Create Error Tracking Utility¶
File: tqapi/src/main/java/com/perun/tlinq/metrics/ErrorMetrics.java (NEW FILE)
package com.perun.tlinq.metrics;
import com.perun.tlinq.util.TlinqClientException;
import java.util.logging.Level;
import java.util.logging.Logger;
/**
* Utility for tracking error metrics consistently across all API endpoints.
*/
public class ErrorMetrics {
private static final Logger logger = Logger.getLogger(ErrorMetrics.class.getName());
private static final MetricsManager metricsManager = MetricsManager.getInstance();
/**
* Track a TlinqClientException with metrics.
*/
public static void trackClientException(String endpoint, TlinqClientException ex) {
metricsManager.incrementCounter("api_errors_total",
"endpoint", endpoint,
"error_type", "TlinqClientException",
"error_code", ex.getErrorCode());
logger.log(Level.WARNING,
String.format("Client error at %s: [%s] %s", endpoint, ex.getErrorCode(), ex.getMessage()));
}
/**
* Track a generic exception with metrics.
*/
public static void trackException(String endpoint, Exception ex) {
String exceptionType = ex.getClass().getSimpleName();
metricsManager.incrementCounter("api_errors_total",
"endpoint", endpoint,
"error_type", exceptionType,
"error_code", "GENERAL");
logger.log(Level.SEVERE,
String.format("Error at %s: [%s] %s", endpoint, exceptionType, ex.getMessage()), ex);
}
/**
* Track error with custom error code.
*/
public static void trackError(String endpoint, String errorCode, String message) {
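// Prometheus requires every series of a metric to carry the same label keys,
// so error_type is set here as well ("Custom" is a placeholder value)
metricsManager.incrementCounter("api_errors_total",
"endpoint", endpoint,
"error_type", "Custom",
"error_code", errorCode);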
logger.log(Level.WARNING,
String.format("Error at %s: [%s] %s", endpoint, errorCode, message));
}
}
Step 2.7: Update API Classes with Error Metrics¶
This step involves updating all API endpoint classes to track errors. Here's a template pattern to apply:
Template Pattern for Error Handling:
// BEFORE:
catch (TlinqClientException ex) {
logger.warning(ex.getMessage());
ar = new TlinqApiResponse(ex.getErrorCode(), ex.getMessage());
}
// AFTER:
catch (TlinqClientException ex) {
ErrorMetrics.trackClientException("endpoint-name", ex);
ar = new TlinqApiResponse(ex.getErrorCode(), ex.getMessage());
}
// BEFORE:
catch (Exception ex) {
logger.warning(ex.getMessage());
ar = new TlinqApiResponse(TlinqErr.GENERAL, ex.getMessage());
}
// AFTER:
catch (Exception ex) {
ErrorMetrics.trackException("endpoint-name", ex);
ar = new TlinqApiResponse(TlinqErr.GENERAL, ex.getMessage());
}
Files to Update (17 files):
- tqapi/src/main/java/com/perun/tlinq/api/UserApi.java - 13 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/BookingApi.java - 11 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/HotelApi.java - ~54 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/FlightApi.java - ~15 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/CartApi.java - ~17 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/TripMakerApi.java - ~91 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/ProductApi.java - ~33 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/CustomerApi.java - ~12 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/DocumentApi.java - ~27 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/GroupApi.java - ~33 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/CommonApi.java - ~7 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/CruiseApi.java - ~9 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/VisaApi.java - ~36 exception handlers
- tqapi/src/main/java/com/perun/tlinq/api/TripOfferApi.java - ~22 exception handlers
- tqapi/src/main/java/com/perun/tlinq/TQProApiServer.java - 5 exception handlers
- tqapi/src/main/java/com/perun/tlinq/ApiRoleManager.java - 1 exception handler
- tqapi/src/main/java/com/perun/tlinq/TlinqApiServer.java - if still in use
Action Items:
For each file:
1. Add import: import com.perun.tlinq.metrics.ErrorMetrics;
2. Find all catch (TlinqClientException ex) blocks
3. Replace logger call with ErrorMetrics.trackClientException("endpoint-name", ex);
4. Find all catch (Exception ex) blocks
5. Replace logger call with ErrorMetrics.trackException("endpoint-name", ex);
Endpoint Name Convention:
- Use the API path for endpoint name: e.g., "user/authenticate", "booking/create", "hotel/search"
- Extract from @Path annotation on class and method
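The convention above can be sketched as a small helper that joins the class-level and method-level @Path values into one endpoint name. EndpointNames is a hypothetical class written for illustration; it takes the two path strings as plain arguments rather than reading the annotations.

```java
/** Hypothetical helper for building endpoint names such as "user/authenticate". */
class EndpointNames {
    /** Joins two path fragments and strips leading and trailing slashes. */
    static String join(String classPath, String methodPath) {
        String joined = trim(classPath) + "/" + trim(methodPath);
        // Trim again in case one of the fragments was empty
        return trim(joined);
    }

    private static String trim(String path) {
        return path.replaceAll("^/+", "").replaceAll("/+$", "");
    }

    public static void main(String[] args) {
        System.out.println(join("/user", "/authenticate"));
        System.out.println(join("booking/", "create"));
    }
}
```

Using the same helper everywhere keeps the endpoint label consistent with the path recorded by MetricsFilter, which has no leading slash in the examples in this guide.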
Phase 3: Prometheus Configuration¶
Step 3.1: Start Prometheus and Grafana¶
cd tqapi
docker-compose -f docker-compose-monitoring.yml up -d
Verify:
- Prometheus UI: http://localhost:9090
- Grafana UI: http://localhost:3000 (admin/admin)
Step 3.2: Verify Metrics Endpoint¶
Start the TQPro API server, then request the metrics endpoint:
curl http://localhost:11080/tlinq-api/metrics
Expected Output:
# HELP api_requests_total Total API requests
# TYPE api_requests_total counter
api_requests_total{application="tqpro",endpoint="user/authenticate",method="POST",module="tqapi",} 5.0
...
Step 3.3: Verify Prometheus Scraping¶
- Open Prometheus UI: http://localhost:9090
- Go to Status > Targets
- Verify the tqpro-api target is UP
- Query metrics: api_requests_total
Troubleshooting:
- If target is DOWN, check network connectivity
- For Docker Desktop (Mac/Windows): use host.docker.internal:11080
- For Linux: use 172.17.0.1:11080 or host IP
- Check TQPro API is running: curl http://localhost:11080/tlinq-api/metrics
Step 3.4: Configure Alerting Rules (Optional)¶
File: tqapi/monitoring/prometheus-rules.yml (NEW FILE)
groups:
  - name: tqpro_api_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighAPIErrorRate
        expr: rate(api_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High API error rate detected"
          description: "API error rate is {{ $value }} errors/sec for endpoint {{ $labels.endpoint }}"
      # Authorization failures
      - alert: HighAuthorizationFailureRate
        expr: rate(auth_rejected_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High authorization rejection rate"
          description: "Auth rejection rate is {{ $value }} for API {{ $labels.api_path }}"
      # Slow API responses
      - alert: SlowAPIResponses
        expr: histogram_quantile(0.95, rate(api_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API response time degradation"
          description: "95th percentile latency is {{ $value }}s for {{ $labels.endpoint }}"
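To make thresholds such as rate(api_errors_total[5m]) > 0.1 concrete: the rate is the counter's increase divided by the elapsed seconds. The sketch below is a simplification that uses only two samples and a basic reset rule; Prometheus itself uses all samples in the window and extrapolates to the window boundaries. CounterRate is a hypothetical class and the sample values are invented for illustration.

```java
/** Hypothetical, simplified illustration of a per-second counter rate. */
class CounterRate {
    /** Per-second increase between two counter samples; timestamps are in seconds. */
    static double perSecond(double t1, double v1, double t2, double v2) {
        if (t2 <= t1) {
            throw new IllegalArgumentException("t2 must be after t1");
        }
        // A decrease means the counter was reset (for example a process restart);
        // in that case the second value is treated as the increase since the reset.
        double increase = v2 >= v1 ? v2 - v1 : v2;
        return increase / (t2 - t1);
    }

    public static void main(String[] args) {
        // 45 new errors over a 300-second window
        double rate = perSecond(0, 100, 300, 145);
        System.out.println(rate + " errors/sec, above threshold: " + (rate > 0.1));
    }
}
```

With these numbers the rate is about 0.15 errors per second, which is above the 0.1 threshold; the alert would fire only if the condition holds for the full 2m "for" duration.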
Update tqapi/monitoring/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
# ADD:
rule_files:
  - "/etc/prometheus/prometheus-rules.yml"
scrape_configs:
  # ... existing config ...
Update tqapi/docker-compose-monitoring.yml:
prometheus:
  # ... existing config ...
  volumes:
    - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
    - ./monitoring/prometheus-rules.yml:/etc/prometheus/prometheus-rules.yml # ADD
    - prometheus-data:/prometheus
Phase 4: Grafana Dashboard Setup¶
Step 4.1: Access Grafana¶
- Open: http://localhost:3000
- Login: admin / admin
- Change password when prompted (or skip)
Step 4.2: Verify Prometheus Datasource¶
- Go to Configuration > Data Sources
- Verify "Prometheus" is listed and working
- Click "Test" - should see "Data source is working"
Step 4.3: Create Main API Dashboard¶
File: tqapi/monitoring/grafana/dashboards/tqpro-api-overview.json (NEW FILE)
{
"dashboard": {
"title": "TQPro API Overview",
"tags": ["tqpro", "api", "overview"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Request Rate (req/sec)",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "rate(api_requests_total{application=\"tqpro\"}[5m])",
"legendFormat": "{{endpoint}} - {{method}}"
}
]
},
{
"id": 2,
"title": "Error Rate (errors/sec)",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "rate(api_errors_total{application=\"tqpro\"}[5m])",
"legendFormat": "{{endpoint}} - {{error_code}}"
}
]
},
{
"id": 3,
"title": "Request Duration (p50, p95, p99)",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
"targets": [
{
"expr": "histogram_quantile(0.50, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))",
"legendFormat": "p50"
},
{
"expr": "histogram_quantile(0.95, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.99, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))",
"legendFormat": "p99"
}
]
},
{
"id": 4,
"title": "Response Status Distribution",
"type": "piechart",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
"targets": [
{
"expr": "sum(rate(api_responses_total{application=\"tqpro\"}[5m])) by (status_category)",
"legendFormat": "{{status_category}}"
}
]
}
],
"time": {"from": "now-1h", "to": "now"},
"refresh": "30s"
}
}
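The percentile panels rely on histogram_quantile(), which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. A minimal sketch of that idea, assuming the lowest bucket starts at 0 and q is in (0, 1]. BucketQuantile is a hypothetical class and the bucket values are invented; the real function handles more edge cases.

```java
/** Hypothetical illustration of quantile estimation from cumulative histogram buckets. */
class BucketQuantile {
    /**
     * upperBounds must be ascending and end with Double.POSITIVE_INFINITY;
     * cumulativeCounts[i] is the number of observations <= upperBounds[i].
     */
    static double estimate(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN;
        }
        double rank = q * total;
        // Find the first bucket whose cumulative count reaches the rank
        int i = 0;
        while (i < n - 1 && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == n - 1) {
            // The rank falls in the +Inf bucket: report the highest finite bound
            return n > 1 ? upperBounds[n - 2] : Double.NaN;
        }
        double lower = i == 0 ? 0.0 : upperBounds[i - 1];
        double countBelow = i == 0 ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - countBelow;
        // Linear interpolation inside the bucket
        return lower + (upperBounds[i] - lower) * (rank - countBelow) / inBucket;
    }

    public static void main(String[] args) {
        double[] bounds = {0.1, 0.5, 1.0, Double.POSITIVE_INFINITY};
        double[] counts = {10, 60, 90, 100};
        System.out.println("p50 ~ " + estimate(0.50, bounds, counts));
        System.out.println("p95 ~ " + estimate(0.95, bounds, counts));
    }
}
```

The example shows why bucket boundaries matter: the p95 lands in the +Inf bucket, so the estimate is capped at the highest finite bound (1.0 s) regardless of how slow the slowest requests were.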
Step 4.4: Create Authorization Dashboard¶
File: tqapi/monitoring/grafana/dashboards/tqpro-authorization.json (NEW FILE)
{
"dashboard": {
"title": "TQPro Authorization Metrics",
"tags": ["tqpro", "api", "authorization", "security"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Authorization Success vs Rejection",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "rate(auth_authorized_total{application=\"tqpro\"}[5m])",
"legendFormat": "Authorized - {{api_path}}"
},
{
"expr": "rate(auth_rejected_total{application=\"tqpro\"}[5m])",
"legendFormat": "Rejected - {{api_path}}"
}
]
},
{
"id": 2,
"title": "Authorization Rejection Rate by API",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "rate(auth_rejected_total{application=\"tqpro\"}[5m])",
"legendFormat": "{{api_path}} - {{user_roles}}"
}
]
},
{
"id": 3,
"title": "Top Rejected Users",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
"targets": [
{
"expr": "topk(10, sum(increase(auth_rejected_total{application=\"tqpro\"}[1h])) by (user_id, api_path))",
"format": "table",
"instant": true
}
]
},
{
"id": 4,
"title": "Dev Mode Bypass Counter",
"type": "stat",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
"targets": [
{
"expr": "sum(auth_dev_mode_bypass_total{application=\"tqpro\"})"
}
],
"options": {
"colorMode": "background",
"graphMode": "area"
}
}
],
"time": {"from": "now-1h", "to": "now"},
"refresh": "30s"
}
}
Step 4.5: Create Performance Dashboard¶
File: tqapi/monitoring/grafana/dashboards/tqpro-performance.json (NEW FILE)
{
"dashboard": {
"title": "TQPro API Performance",
"tags": ["tqpro", "api", "performance"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Request Duration Heatmap",
"type": "heatmap",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
"targets": [
{
"expr": "rate(api_request_duration_seconds_bucket{application=\"tqpro\"}[5m])",
"format": "heatmap",
"legendFormat": "{{le}}"
}
]
},
{
"id": 2,
"title": "Slowest Endpoints (p95)",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
"targets": [
{
"expr": "topk(10, histogram_quantile(0.95, sum by (le, endpoint) (rate(api_request_duration_seconds_bucket{application=\"tqpro\"}[5m]))))"
}
]
},
{
"id": 3,
"title": "Request Rate by Endpoint",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
"targets": [
{
"expr": "topk(10, sum by (endpoint) (rate(api_requests_total{application=\"tqpro\"}[5m])))"
}
]
}
],
"time": {"from": "now-1h", "to": "now"},
"refresh": "30s"
}
}
Step 4.6: Import Dashboards¶
Method 1: Auto-provisioning (Recommended)
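Dashboards are loaded automatically from the /var/lib/grafana/dashboards directory, which is mounted from tqapi/monitoring/grafana/dashboards. File provisioning reads the bare dashboard model, so if a dashboard does not appear, check the Grafana logs and try moving the contents of the outer "dashboard" object to the top level of the JSON file.
Restart Grafana:
docker-compose -f docker-compose-monitoring.yml restart grafana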
Method 2: Manual Import
- Go to Grafana UI > Dashboards > Import
- Upload each JSON file
- Select "Prometheus" datasource
- Click "Import"
Step 4.7: Configure Alerts in Grafana (Optional)¶
- Edit dashboard panel
- Click "Alert" tab
- Configure alert rule
- Set notification channel (email, Slack, PagerDuty, etc.)
Phase 5: Testing & Validation¶
Step 5.1: Build and Deploy Updated Code¶
cd tqapi
../gradlew clean build
Verify that the build succeeds with no errors.
Step 5.2: Start TQPro API Server¶
# Set TLINQ_HOME if not already set
export TLINQ_HOME=/path/to/tqpro/config
# Run the API server
java -jar build/libs/tqapi.jar
Verify logs show:
- "MetricsManager initialized with Prometheus registry"
- No errors during startup
Step 5.3: Generate Test Traffic¶
Script: tqapi/test-metrics.sh (NEW FILE)
#!/bin/bash
# Test script to generate API traffic for metrics validation
API_BASE="http://localhost:11080/tlinq-api"
echo "Generating test API traffic..."
# Test 1: Successful requests
for i in {1..50}; do
curl -X POST "$API_BASE/user/authenticate" \
-H "Content-Type: application/json" \
-H "X-User: test-user-$i" \
-H "X-Roles: agent" \
-d '{"username":"test","password":"test"}' \
-s -o /dev/null
sleep 0.1
done
# Test 2: Unauthorized requests (should be rejected)
for i in {1..20}; do
curl -X POST "$API_BASE/booking/create" \
-H "Content-Type: application/json" \
-H "X-User: guest-$i" \
-H "X-Roles: guest" \
-d '{"session":"test"}' \
-s -o /dev/null
sleep 0.1
done
# Test 3: Error scenarios
for i in {1..10}; do
curl -X POST "$API_BASE/user/authenticate" \
-H "Content-Type: application/json" \
-d '{"invalid":"data"}' \
-s -o /dev/null
sleep 0.1
done
echo "Test traffic generated. Check metrics at $API_BASE/metrics"
Run:
chmod +x test-metrics.sh
./test-metrics.sh
Step 5.4: Verify Metrics Endpoint¶
curl http://localhost:11080/tlinq-api/metrics | grep -E "(api_requests_total|auth_rejected_total|api_errors_total)"
Expected Output (values are illustrative; auth_rejected_total has one series per user_id, so each guest-N user appears on its own line):
api_requests_total{application="tqpro",endpoint="user/authenticate",method="POST",module="tqapi",} 50.0
auth_rejected_total{api_path="booking/create",application="tqpro",module="tqapi",user_id="guest-1",user_roles="guest",} 1.0
api_errors_total{application="tqpro",endpoint="user/authenticate",error_code="MISSING_PARAMETER",error_type="TlinqClientException",module="tqapi",} 10.0
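Because a counter can be split across many label combinations, checking a total by eye is error-prone. A minimal sketch of summing all samples of one metric from the endpoint's text output, assuming no timestamp follows the value. MetricSum is a hypothetical helper and the sample text is invented for illustration.

```java
/** Hypothetical helper: sums every sample of one metric in exposition text. */
class MetricSum {
    static double sum(String expositionText, String name) {
        double total = 0.0;
        for (String raw : expositionText.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and HELP/TYPE comments
            }
            // The metric name ends at the first '{' (labels) or space (no labels)
            int brace = line.indexOf('{');
            int space = line.indexOf(' ');
            int end = brace >= 0 && (space < 0 || brace < space) ? brace : space;
            if (end < 0 || !line.substring(0, end).equals(name)) {
                continue;
            }
            // Assumption: the value is the last whitespace-separated token
            total += Double.parseDouble(line.substring(line.lastIndexOf(' ') + 1));
        }
        return total;
    }

    public static void main(String[] args) {
        String text = "# TYPE auth_rejected_total counter\n"
            + "auth_rejected_total{api_path=\"booking/create\",user_id=\"guest-1\",} 1.0\n"
            + "auth_rejected_total{api_path=\"booking/create\",user_id=\"guest-2\",} 1.0\n"
            + "api_requests_total{endpoint=\"user/authenticate\",} 50.0\n";
        System.out.println(sum(text, "auth_rejected_total"));
    }
}
```

In PromQL the equivalent check is sum(auth_rejected_total), which can be run in the Prometheus UI once scraping works.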
Step 5.5: Verify Prometheus Scraping¶
- Open: http://localhost:9090
- Query: api_requests_total
- Verify data is present and updating
Step 5.6: Verify Grafana Dashboards¶
- Open: http://localhost:3000
- Navigate to "TQPro API Overview" dashboard
- Verify panels show data:
  - Request Rate graph shows activity
  - Error Rate shows errors from test
  - Request Duration shows latency
  - Status Distribution shows 2xx, 4xx responses
- Navigate to "TQPro Authorization Metrics"
- Verify authorization rejection data
Step 5.7: Test Alert Rules (if configured)¶
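- Generate a high error rate, for example by sending invalid requests (as in Test 3 of test-metrics.sh) continuously for several minutes
- Check Prometheus > Alerts
- Verify "HighAPIErrorRate" alert fires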
Metrics Reference¶
Core Metrics¶
| Metric Name | Type | Labels | Description |
|---|---|---|---|
| api_requests_total | Counter | endpoint, method | Total API requests received |
| api_responses_total | Counter | endpoint, method, status, status_category | Total API responses sent |
| api_request_duration_seconds | Histogram | endpoint, method, status, status_category | Request processing duration |
| auth_requests_total | Counter | api_path, user_id, roles | Total authentication attempts |
| auth_authorized_total | Counter | api_path, user_roles | Successful authorizations |
| auth_rejected_total | Counter | api_path, user_id, user_roles | Rejected authorization attempts |
| auth_dev_mode_bypass_total | Counter | user_id | Dev mode authentication bypasses |
| api_errors_total | Counter | endpoint, error_type, error_code | Total API errors |
Useful PromQL Queries¶
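Request Rate:
sum(rate(api_requests_total[5m]))
Error Rate:
sum(rate(api_errors_total[5m]))
95th Percentile Latency:
histogram_quantile(0.95, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))
Authorization Rejection Ratio:
sum(rate(auth_rejected_total[5m])) / (sum(rate(auth_rejected_total[5m])) + sum(rate(auth_authorized_total[5m])))
Top 10 Slowest Endpoints:
topk(10, histogram_quantile(0.95, sum by (le, endpoint) (rate(api_request_duration_seconds_bucket[5m]))))
Error Rate by Endpoint:
sum by (endpoint) (rate(api_errors_total[5m]))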
Troubleshooting¶
Issue: Metrics endpoint returns 404¶
Symptoms: curl http://localhost:11080/tlinq-api/metrics returns 404
Solutions:
1. Verify MetricsServlet is registered in TQProApiServer.java
2. Check server logs for startup errors
3. Verify servlet path: ctx.addServlet(metricsServlet, "/metrics");
4. Test: curl http://localhost:11080/tlinq-api/user/authenticate (should work)
Issue: Prometheus target is DOWN¶
Symptoms: Prometheus UI shows tqpro-api target as DOWN
Solutions:
1. Check TQPro API is running: curl http://localhost:11080/tlinq-api/metrics
2. Check Docker network connectivity:
- Mac/Windows: Use host.docker.internal:11080
- Linux: Use 172.17.0.1:11080 or host IP
3. Check prometheus.yml target configuration
4. View Prometheus logs: docker logs tqpro-prometheus
Issue: No data in Grafana dashboards¶
Symptoms: Dashboards load but panels show "No data"
Solutions:
1. Verify Prometheus datasource is working (Test button)
2. Check Prometheus is scraping: http://localhost:9090/targets
3. Verify metrics exist in Prometheus: query api_requests_total
4. Generate test traffic: ./test-metrics.sh
5. Check time range in Grafana (last 1 hour)
Issue: Metrics not incrementing¶
Symptoms: Metrics endpoint shows metrics but values don't change
Solutions:
1. Verify filters are registered in ResourceConfig
2. Check server logs for filter errors
3. Make actual API calls (not just /metrics)
4. Verify MetricsManager.getInstance() is called
5. Check no exceptions during metric recording
Issue: High memory usage¶
Symptoms: API server memory grows over time
Solutions:
1. Check metric cardinality (too many label combinations)
2. Avoid high-cardinality labels (user IDs, timestamps, etc.)
3. Use caching in MetricsManager (already implemented)
4. Consider aggregating metrics before recording
Issue: Build errors after adding Micrometer¶
Symptoms: Gradle build fails with dependency errors
Solutions:
1. Refresh dependencies: ../gradlew build --refresh-dependencies
2. Verify Micrometer version compatibility with Java 11+
3. Check for conflicting dependencies
4. Clean build: ../gradlew clean build
Production Deployment Considerations¶
Security¶
- Secure Metrics Endpoint
  - Add authentication to /tlinq-api/metrics
  - Use firewall rules to restrict access to Prometheus server
  - Consider TLS for scraping
- Sensitive Data in Labels
  - Do NOT include passwords, tokens, or PII in metric labels
  - Sanitize user_id labels (hash or use internal IDs)
  - Avoid exposing internal system details
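The advice to hash user_id labels can be sketched as follows: a salted SHA-256 digest, truncated to a short hex string, replaces the raw identifier before it is passed as a label value. UserIdLabel and the salt value are hypothetical; a real deployment would load the salt from configuration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: turns a user ID into an opaque, stable label value. */
class UserIdLabel {
    /** Returns the first 12 hex characters of SHA-256(salt + userId). */
    static String hash(String salt, String userId) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest((salt + userId).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 6; i++) {
                hex.append(String.format("%02x", bytes[i]));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "example-salt" is a placeholder, not a recommended value
        System.out.println(hash("example-salt", "guest-1"));
    }
}
```

Hashing hides the identifier but does not reduce cardinality: there is still one series per user, so the guidance on limiting label combinations applies unchanged.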
Performance¶
- Metrics Cardinality
  - Limit unique label combinations to < 10,000 per metric
  - Avoid user-specific or request-specific labels
  - Use aggregation where possible
- Scrape Interval
  - Production: 30-60s recommended
  - High-traffic: Consider sampling
- Retention
  - Prometheus default: 15 days
  - For long-term: Use Thanos, Cortex, or Victoria Metrics
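The cardinality limit can be checked with simple arithmetic: the worst-case number of series for a metric is the product of the distinct values of each label. The sketch below uses invented counts (40 API paths, 500 users, 5 role sets) purely for illustration; SeriesEstimate is a hypothetical class.

```java
import java.util.Map;

/** Hypothetical helper: worst-case series count for one metric. */
class SeriesEstimate {
    /** Multiplies the number of distinct values of every label. */
    static long upperBound(Map<String, Integer> distinctValuesPerLabel) {
        long product = 1;
        for (int count : distinctValuesPerLabel.values()) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            product = Math.multiplyExact(product, (long) count);
        }
        return product;
    }

    public static void main(String[] args) {
        long withUser = upperBound(Map.of("api_path", 40, "user_id", 500, "user_roles", 5));
        long withoutUser = upperBound(Map.of("api_path", 40, "user_roles", 5));
        System.out.println("with user_id: " + withUser);
        System.out.println("without user_id: " + withoutUser);
    }
}
```

Under these assumed counts, a metric labelled with user_id could reach 100,000 series, ten times the suggested limit, while the same metric without user_id stays at 200.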
High Availability¶
- Prometheus HA
  - Deploy multiple Prometheus instances
  - Use federation for aggregation
- Grafana HA
  - Use external database (PostgreSQL/MySQL)
  - Deploy multiple Grafana instances behind load balancer
Kubernetes Deployment¶
Example:
apiVersion: v1
kind: Service
metadata:
  name: tqpro-api-metrics
  labels:
    app: tqpro-api # matched by the ServiceMonitor selector below
spec:
  ports:
    - name: http # referenced by the ServiceMonitor endpoint below
      port: 11080
      targetPort: 11080
  selector:
    app: tqpro-api
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tqpro-api
spec:
  selector:
    matchLabels:
      app: tqpro-api
  endpoints:
    - port: http
      path: /tlinq-api/metrics
      interval: 30s
Next Steps After Implementation¶
- Establish Baselines
  - Run for 1-2 weeks to establish normal behavior
  - Document baseline metrics (request rate, error rate, latency)
- Set Alert Thresholds
  - Based on baselines, configure meaningful alerts
  - Avoid alert fatigue (tune thresholds)
- Create Runbooks
  - Document response procedures for each alert
  - Include investigation steps and remediation
- Integrate with Incident Management
  - Connect alerts to PagerDuty, OpsGenie, etc.
  - Set up on-call rotation
- Continuous Improvement
  - Review metrics weekly
  - Identify optimization opportunities
  - Add business-specific metrics
- Capacity Planning
  - Use historical data for capacity planning
  - Predict scaling needs
Appendix: Quick Start Commands¶
# Build API with metrics support
cd tqapi
../gradlew clean build
# Start monitoring stack
docker-compose -f docker-compose-monitoring.yml up -d
# Start TQPro API
export TLINQ_HOME=/path/to/config
java -jar build/libs/tqapi.jar
# Generate test traffic
./test-metrics.sh
# View metrics
curl http://localhost:11080/tlinq-api/metrics
# Access UIs
# Prometheus: http://localhost:9090
# Grafana: http://localhost:3000 (admin/admin)