# Monitoring Dashboard Setup

This document provides guidance for setting up monitoring infrastructure for the Customer Portal.

---

## Health Endpoints

The BFF exposes several health check endpoints for monitoring:

| Endpoint                        | Purpose                                    | Authentication |
| ------------------------------- | ------------------------------------------ | -------------- |
| `GET /health`                   | Core system health (database, cache)       | Public         |
| `GET /health/queues`            | Request queue metrics (WHMCS, Salesforce)  | Public         |
| `GET /health/queues/whmcs`      | WHMCS queue details                        | Public         |
| `GET /health/queues/salesforce` | Salesforce queue details                   | Public         |
| `GET /health/catalog/cache`     | Catalog cache metrics                      | Public         |
| `GET /auth/health-check`        | Integration health (DB, WHMCS, Salesforce) | Public         |

### Core Health Response

```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "cache": "ok"
  }
}
```

**Status Values:**

- `ok` - All systems healthy
- `degraded` - One or more systems failing

### Queue Health Response

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "totalRequests": 1500,
      "completedRequests": 1495,
      "failedRequests": 5,
      "queueSize": 0,
      "pendingRequests": 2,
      "averageWaitTime": 50,
      "averageExecutionTime": 250
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ...
}, "dailyUsage": { "used": 5000, "limit": 15000 } } } ``` --- ## Key Metrics to Monitor ### Application Metrics | Metric | Source | Warning | Critical | Description | | ------------------- | --------------- | ------------- | ---------------- | --------------------- | | Health status | `/health` | `degraded` | Any check `fail` | Core system health | | Response time (p95) | Logs/APM | >2s | >5s | API response latency | | Error rate | Logs/APM | >1% | >5% | HTTP 5xx responses | | Active connections | Node.js metrics | >80% capacity | >95% capacity | Connection pool usage | ### Database Metrics | Metric | Source | Warning | Critical | Description | | --------------------- | --------------------- | --------- | --------- | --------------------------- | | Connection pool usage | PostgreSQL | >80% | >95% | Active connections vs limit | | Query duration | PostgreSQL logs | >500ms | >2s | Slow query detection | | Database size | PostgreSQL | >80% disk | >90% disk | Storage capacity | | Dead tuples | `pg_stat_user_tables` | >10% | >25% | Vacuum needed | ### Cache Metrics | Metric | Source | Warning | Critical | Description | | -------------- | ---------------- | -------------- | -------------- | ------------------------- | | Redis memory | Redis INFO | >80% maxmemory | >95% maxmemory | Memory pressure | | Cache hit rate | Application logs | <80% | <60% | Cache effectiveness | | Redis latency | Redis CLI | >10ms | >50ms | Command latency | | Evictions | Redis INFO | Any | High rate | Memory pressure indicator | ### Queue Metrics | Metric | Source | Warning | Critical | Description | | --------------------- | ---------------- | ---------- | ---------- | ---------------------- | | WHMCS queue size | `/health/queues` | >10 | >50 | Pending WHMCS requests | | WHMCS failed requests | `/health/queues` | >5 | >20 | Failed API calls | | SF daily API usage | `/health/queues` | >80% limit | >95% limit | Salesforce API quota | | BullMQ wait queue | Redis | >10 | >50 | Job backlog | | 
BullMQ failed jobs | Redis | >5 | >20 | Processing failures | ### External Dependency Metrics | Metric | Source | Warning | Critical | Description | | ------------------------ | ------ | ------- | -------- | -------------------- | | Salesforce response time | Logs | >2s | >5s | SF API latency | | WHMCS response time | Logs | >2s | >5s | WHMCS API latency | | Freebit response time | Logs | >3s | >10s | Freebit API latency | | External error rate | Logs | >1% | >5% | Integration failures | --- ## Structured Logging for Metrics The BFF uses Pino for structured JSON logging. Key fields for metrics extraction: ```json { "timestamp": "2025-01-15T10:30:00.000Z", "level": "info", "service": "customer-portal-bff", "correlationId": "req-123", "message": "API call completed", "duration": 250, "path": "/api/invoices", "method": "GET", "statusCode": 200 } ``` ### Log Queries for Metrics **Error Rate (last hour):** ```bash grep '"level":50' /var/log/bff/combined.log | wc -l ``` **Slow Requests (>2s):** ```bash grep '"duration":[0-9]\{4,\}' /var/log/bff/combined.log | tail -20 ``` **External API Errors:** ```bash grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20 ``` --- ## Grafana Dashboard Setup ### Data Sources 1. **Prometheus** - For application metrics 2. **Loki** - For log aggregation 3. **PostgreSQL** - For database metrics ### Recommended Panels #### Overview Dashboard 1. **System Health** (Stat panel) - Query: `/health` endpoint status - Show: ok/degraded indicator 2. **Request Rate** (Graph panel) - Source: Prometheus/Loki - Show: Requests per second 3. **Error Rate** (Graph panel) - Source: Loki log count - Filter: `level >= 50` 4. **Response Time (p95)** (Graph panel) - Source: Prometheus histogram - Show: 95th percentile latency #### Queue Dashboard 1. **Queue Depths** (Graph panel) - Source: `/health/queues` endpoint - Show: WHMCS and SF queue sizes 2. **Failed Jobs** (Stat panel) - Source: Redis BullMQ metrics - Show: Failed job count 3. 
3. **Salesforce API Usage** (Gauge panel)
   - Source: `/health/queues/salesforce`
   - Show: Daily usage vs limit

#### Database Dashboard

1. **Connection Pool** (Gauge panel)
   - Source: PostgreSQL `pg_stat_activity`
   - Show: Active connections
2. **Query Performance** (Table panel)
   - Source: PostgreSQL `pg_stat_statements`
   - Show: Slowest queries

### Sample Prometheus Scrape Config

```yaml
scrape_configs:
  - job_name: "portal-bff"
    static_configs:
      - targets: ["bff:4000"]
    metrics_path: "/health"
    scrape_interval: 30s
```

---

## CloudWatch Setup (AWS)

### Custom Metrics

Push metrics from health endpoints to CloudWatch:

```bash
# Example: Push queue depth metric
aws cloudwatch put-metric-data \
  --namespace "CustomerPortal" \
  --metric-name "WhmcsQueueDepth" \
  --value $(curl -s http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize') \
  --dimensions Environment=production
```

### Recommended CloudWatch Alarms

| Alarm         | Metric           | Threshold | Period | Action           |
| ------------- | ---------------- | --------- | ------ | ---------------- |
| HighErrorRate | ErrorCount       | >10       | 5 min  | SNS notification |
| HighLatency   | p95 ResponseTime | >2000ms   | 5 min  | SNS notification |
| QueueBacklog  | WhmcsQueueDepth  | >50       | 5 min  | SNS notification |
| DatabaseDown  | HealthStatus     | !=ok      | 1 min  | PagerDuty        |
| CacheDown     | HealthStatus     | !=ok      | 1 min  | PagerDuty        |

### Log Insights Queries

**Error Summary:**

```sql
fields @timestamp, @message
| filter level >= 50
| stats count() by bin(5m)
```

**Slow Requests:**

```sql
fields @timestamp, path, duration
| filter duration > 2000
| sort duration desc
| limit 20
```

---

## DataDog Setup

### Agent Configuration

```yaml
# datadog.yaml
logs_enabled: true

logs:
  - type: file
    path: /var/log/bff/combined.log
    service: customer-portal-bff
    source: nodejs
```

### Custom Metrics

```typescript
// Example: Report queue metrics to DataDog
import { StatsD } from "hot-shots";

const dogstatsd = new StatsD({ host: "localhost", port: 8125 });

// Report queue depth
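// NOTE (illustrative assumption, not shown elsewhere in this document):
// `metrics` below is the `whmcs.metrics` object from the BFF's
// /health/queues endpoint, e.g. obtained with:
//   const res = await fetch("http://localhost:4000/health/queues");
//   const metrics = (await res.json()).whmcs.metrics;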
dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize);
dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests);
```

### Recommended Monitors

1. **Health Check Monitor**
   - Check: HTTP check on `/health`
   - Alert: When status != ok for 2 minutes
2. **Error Rate Monitor**
   - Metric: `portal.errors.count`
   - Alert: When >5% for 5 minutes
3. **Queue Depth Monitor**
   - Metric: `portal.whmcs.queue_depth`
   - Alert: When >50 for 5 minutes

---

## Alerting Best Practices

### Alert Priority Levels

| Priority    | Response Time | Examples                                      |
| ----------- | ------------- | --------------------------------------------- |
| P1 Critical | 15 minutes    | Portal down, database unreachable             |
| P2 High     | 1 hour        | Provisioning failing, payment processing down |
| P3 Medium   | 4 hours       | Degraded performance, high error rate         |
| P4 Low      | 24 hours      | Minor issues, informational alerts            |

### Alert Routing

```yaml
# Example PagerDuty routing
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-ops
  - match:
      severity: info
    receiver: email-team
```

### Runbook Links

Include runbook links in all alerts:

- Health check failures → [Incident Response](./incident-response.md)
- Database issues → [Database Operations](./database-operations.md)
- Queue problems → [Queue Management](./queue-management.md)
- External API failures → [External Dependencies](./external-dependencies.md)

---

## Monitoring Checklist

### Initial Setup

- [ ] Configure health endpoint scraping (every 30s)
- [ ] Set up log aggregation (Loki, CloudWatch, or DataDog)
- [ ] Create overview dashboard with key metrics
- [ ] Configure P1/P2 alerts for critical failures
- [ ] Test alert routing to on-call

### Ongoing Maintenance

- [ ] Review alert thresholds quarterly
- [ ] Check for alert fatigue (too many false positives)
- [ ] Update dashboards when new features are deployed
- [ ] Validate runbook links are current

---

## Related Documents

- [Incident Response](./incident-response.md)
- [Logging Guide](./logging.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)

---

**Last Updated:** December 2025
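
---

## Appendix: Example Threshold Check

The warning/critical thresholds in the Queue Metrics table can be wired into an automated check against `/health/queues`. A minimal TypeScript sketch follows; the function and type names are illustrative (not part of the BFF codebase), while the threshold values come directly from the tables above:

```typescript
// Illustrative helper: map a /health/queues payload to an alert severity.
// Thresholds mirror the Queue Metrics table: queue size >10 warn / >50 crit,
// failed requests >5 warn / >20 crit, SF daily usage >80% warn / >95% crit.

type Severity = "ok" | "warning" | "critical";

interface QueueHealthPayload {
  whmcs: { metrics: { queueSize: number; failedRequests: number } };
  salesforce: { dailyUsage: { used: number; limit: number } };
}

// Pick the more severe of two severities.
function worst(a: Severity, b: Severity): Severity {
  const rank: Record<Severity, number> = { ok: 0, warning: 1, critical: 2 };
  return rank[a] >= rank[b] ? a : b;
}

// Grade one metric against its warning/critical thresholds.
function grade(value: number, warn: number, crit: number): Severity {
  if (value > crit) return "critical";
  if (value > warn) return "warning";
  return "ok";
}

export function classifyQueueHealth(payload: QueueHealthPayload): Severity {
  const { queueSize, failedRequests } = payload.whmcs.metrics;
  const { used, limit } = payload.salesforce.dailyUsage;
  return [
    grade(queueSize, 10, 50),
    grade(failedRequests, 5, 20),
    grade(used / limit, 0.8, 0.95),
  ].reduce(worst, "ok");
}
```

A scheduler could poll the endpoint every 30 seconds and route a `warning` result to Slack and a `critical` result to PagerDuty, matching the alert-routing example above.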