
# Monitoring Dashboard Setup

This document provides guidance for setting up monitoring infrastructure for the Customer Portal.


## Health Endpoints

The BFF exposes several health check endpoints for monitoring:

| Endpoint | Purpose | Authentication |
|---|---|---|
| `GET /health` | Core system health (database, cache) | Public |
| `GET /health/queues` | Request queue metrics (WHMCS, Salesforce) | Public |
| `GET /health/queues/whmcs` | WHMCS queue details | Public |
| `GET /health/queues/salesforce` | Salesforce queue details | Public |
| `GET /health/catalog/cache` | Catalog cache metrics | Public |
| `GET /auth/health-check` | Integration health (DB, WHMCS, Salesforce) | Public |

### Core Health Response

```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "cache": "ok"
  }
}
```

**Status Values:**

- `ok` - All systems healthy
- `degraded` - One or more systems failing
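A monitor consuming this endpoint can derive the overall status from the individual checks. A minimal sketch in Python, using the field names from the response above (the aggregation rule is an assumption based on the documented status values):

```python
def overall_status(checks: dict) -> str:
    """Aggregate per-check results into the top-level status.

    "ok" when every check passes; "degraded" when one or more fail.
    """
    return "ok" if all(v == "ok" for v in checks.values()) else "degraded"
```

For example, `overall_status({"database": "ok", "cache": "error"})` returns `"degraded"`.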

### Queue Health Response

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "totalRequests": 1500,
      "completedRequests": 1495,
      "failedRequests": 5,
      "queueSize": 0,
      "pendingRequests": 2,
      "averageWaitTime": 50,
      "averageExecutionTime": 250
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ... },
    "dailyUsage": { "used": 5000, "limit": 15000 }
  }
}
```
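When polling this endpoint, the raw counters are usually converted to rates before alerting. A small sketch using the field names from the sample payload (the functions themselves are illustrative, not part of the BFF):

```python
def failure_rate(metrics: dict) -> float:
    """Fraction of requests that failed, from a queue metrics block."""
    total = metrics.get("totalRequests", 0)
    return metrics["failedRequests"] / total if total else 0.0


def sf_quota_used(daily_usage: dict) -> float:
    """Salesforce daily API usage as a fraction of the limit."""
    return daily_usage["used"] / daily_usage["limit"]
```

With the sample payload, the WHMCS failure rate is 5/1500 (about 0.33%) and the Salesforce quota is 5000/15000 (about 33%) used.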

## Key Metrics to Monitor

### Application Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Health status | `/health` | `degraded` | Any check fail | Core system health |
| Response time (p95) | Logs/APM | >2s | >5s | API response latency |
| Error rate | Logs/APM | >1% | >5% | HTTP 5xx responses |
| Active connections | Node.js metrics | >80% capacity | >95% capacity | Connection pool usage |
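The warning/critical pairs in these tables map naturally onto a small severity classifier. A sketch, assuming higher values are worse (metrics like cache hit rate, where lower is worse, would invert the comparisons):

```python
def classify(value: float, warning: float, critical: float) -> str:
    """Map a metric sample to a severity, given its thresholds."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

For example, a 3.1s p95 against the 2s/5s latency thresholds classifies as `"warning"`.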

### Database Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Connection pool usage | PostgreSQL | >80% | >95% | Active connections vs limit |
| Query duration | PostgreSQL logs | >500ms | >2s | Slow query detection |
| Database size | PostgreSQL | >80% disk | >90% disk | Storage capacity |
| Dead tuples | `pg_stat_user_tables` | >10% | >25% | Vacuum needed |

### Cache Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Redis memory | Redis INFO | >80% maxmemory | >95% maxmemory | Memory pressure |
| Cache hit rate | Application logs | <80% | <60% | Cache effectiveness |
| Redis latency | Redis CLI | >10ms | >50ms | Command latency |
| Evictions | Redis INFO | Any | High rate | Memory pressure indicator |
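Cache hit rate is a derived metric; if the application logs hit and miss counters (the counter names here are assumptions), the computation is simply:

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as a fraction; 0.0 when the cache is unused."""
    total = hits + misses
    return hits / total if total else 0.0
```

For example, `hit_rate(900, 100)` is 0.9, comfortably above the 80% warning threshold.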

### Queue Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| WHMCS queue size | `/health/queues` | >10 | >50 | Pending WHMCS requests |
| WHMCS failed requests | `/health/queues` | >5 | >20 | Failed API calls |
| SF daily API usage | `/health/queues` | >80% limit | >95% limit | Salesforce API quota |
| BullMQ wait queue | Redis | >10 | >50 | Job backlog |
| BullMQ failed jobs | Redis | >5 | >20 | Processing failures |

### External Dependency Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Salesforce response time | Logs | >2s | >5s | SF API latency |
| WHMCS response time | Logs | >2s | >5s | WHMCS API latency |
| Freebit response time | Logs | >3s | >10s | Freebit API latency |
| External error rate | Logs | >1% | >5% | Integration failures |

## Structured Logging for Metrics

The BFF uses Pino for structured JSON logging. Key fields for metrics extraction:

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "level": "info",
  "service": "customer-portal-bff",
  "correlationId": "req-123",
  "message": "API call completed",
  "duration": 250,
  "path": "/api/invoices",
  "method": "GET",
  "statusCode": 200
}
```
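Because each entry is one JSON object per line, latency percentiles can be computed directly from the log stream. A sketch using the field names above (the nearest-rank percentile method is one common choice, not necessarily what an APM tool uses):

```python
import json


def p95_duration(log_lines):
    """Nearest-rank 95th percentile of "duration" across log entries."""
    durations = sorted(
        entry["duration"]
        for entry in map(json.loads, log_lines)
        if "duration" in entry
    )
    if not durations:
        return None
    rank = -(-95 * len(durations) // 100)  # ceil(0.95 * n)
    return durations[rank - 1]
```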

### Log Queries for Metrics

**Error Rate:**

```bash
# Pino logs errors at level 50; this counts every error entry in the file,
# so pair it with hourly log rotation or a timestamp filter for an hourly rate
grep '"level":50' /var/log/bff/combined.log | wc -l
```

**Slow Requests (≥1s):**

```bash
# matches any 4+ digit duration value, i.e. 1000 ms or more
grep '"duration":[0-9]\{4,\}' /var/log/bff/combined.log | tail -20
```

**External API Errors:**

```bash
grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20
```

## Grafana Dashboard Setup

### Data Sources

1. **Prometheus** - For application metrics
2. **Loki** - For log aggregation
3. **PostgreSQL** - For database metrics

### Overview Dashboard

1. **System Health** (Stat panel)
   - Query: `/health` endpoint status
   - Show: ok/degraded indicator
2. **Request Rate** (Graph panel)
   - Source: Prometheus/Loki
   - Show: Requests per second
3. **Error Rate** (Graph panel)
   - Source: Loki log count
   - Filter: `level >= 50`
4. **Response Time (p95)** (Graph panel)
   - Source: Prometheus histogram
   - Show: 95th percentile latency

### Queue Dashboard

1. **Queue Depths** (Graph panel)
   - Source: `/health/queues` endpoint
   - Show: WHMCS and SF queue sizes
2. **Failed Jobs** (Stat panel)
   - Source: Redis BullMQ metrics
   - Show: Failed job count
3. **Salesforce API Usage** (Gauge panel)
   - Source: `/health/queues/salesforce`
   - Show: Daily usage vs limit

### Database Dashboard

1. **Connection Pool** (Gauge panel)
   - Source: PostgreSQL `pg_stat_activity`
   - Show: Active connections
2. **Query Performance** (Table panel)
   - Source: PostgreSQL `pg_stat_statements`
   - Show: Slowest queries

### Sample Prometheus Scrape Config

```yaml
scrape_configs:
  - job_name: "portal-bff"
    static_configs:
      - targets: ["bff:4000"]
    metrics_path: "/health"
    scrape_interval: 30s
```
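Note that the health endpoints return JSON rather than the Prometheus exposition format, so a small exporter (or the blackbox exporter) typically sits between Prometheus and the BFF. A minimal sketch of translating the `/health/queues` payload into exposition text (metric naming is an assumption; field names come from the sample response earlier in this document):

```python
import re


def _snake(name: str) -> str:
    """queueSize -> queue_size"""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_prometheus(queue_health: dict) -> str:
    """Render the /health/queues JSON payload as Prometheus exposition text."""
    lines = []
    for service in ("whmcs", "salesforce"):
        svc = queue_health.get(service) or {}
        for name, value in svc.get("metrics", {}).items():
            lines.append(f"portal_{service}_{_snake(name)} {value}")
        usage = svc.get("dailyUsage")
        if usage:
            lines.append(f"portal_{service}_daily_api_used {usage['used']}")
            lines.append(f"portal_{service}_daily_api_limit {usage['limit']}")
    return "\n".join(lines) + "\n"
```

An exporter would serve this text on its own `/metrics` path and point the scrape config at that instead of `/health`.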

## CloudWatch Setup (AWS)

### Custom Metrics

Push metrics from health endpoints to CloudWatch:

```bash
# Example: Push queue depth metric
aws cloudwatch put-metric-data \
  --namespace "CustomerPortal" \
  --metric-name "WhmcsQueueDepth" \
  --value "$(curl -s http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize')" \
  --dimensions Environment=production
```
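In a scheduled job, the same metric can be built programmatically. A sketch of the `MetricData` entry that boto3's `put_metric_data(Namespace="CustomerPortal", MetricData=[...])` expects (the helper function name is illustrative):

```python
def whmcs_queue_depth_metric(queue_size: int, environment: str = "production") -> dict:
    """Build one MetricData entry mirroring the CLI example."""
    return {
        "MetricName": "WhmcsQueueDepth",
        "Value": float(queue_size),
        "Unit": "Count",
        "Dimensions": [{"Name": "Environment", "Value": environment}],
    }
```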
### Recommended Alarms

| Alarm | Metric | Threshold | Period | Action |
|---|---|---|---|---|
| HighErrorRate | ErrorCount | >10 | 5 min | SNS notification |
| HighLatency | p95 ResponseTime | >2000ms | 5 min | SNS notification |
| QueueBacklog | WhmcsQueueDepth | >50 | 5 min | SNS notification |
| DatabaseDown | HealthStatus | != ok | 1 min | PagerDuty |
| CacheDown | HealthStatus | != ok | 1 min | PagerDuty |

### Log Insights Queries

**Error Summary:**

```
fields @timestamp, @message
| filter level >= 50
| stats count() by bin(5m)
```

**Slow Requests:**

```
fields @timestamp, path, duration
| filter duration > 2000
| sort duration desc
| limit 20
```

## DataDog Setup

### Agent Configuration

```yaml
# datadog.yaml
logs_enabled: true
```

```yaml
# conf.d/nodejs.d/conf.yaml (per-service log collection lives in conf.d, not datadog.yaml)
logs:
  - type: file
    path: /var/log/bff/combined.log
    service: customer-portal-bff
    source: nodejs
```

### Custom Metrics

```typescript
// Example: Report queue metrics to DataDog via DogStatsD
import { StatsD } from "hot-shots";

const dogstatsd = new StatsD({ host: "localhost", port: 8125 });

// `metrics` is the whmcs.metrics object from GET /health/queues
dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize);
dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests);
```
### Recommended Monitors

1. **Health Check Monitor**
   - Check: HTTP check on `/health`
   - Alert: When status != ok for 2 minutes
2. **Error Rate Monitor**
   - Metric: `portal.errors.count`
   - Alert: When >5% for 5 minutes
3. **Queue Depth Monitor**
   - Metric: `portal.whmcs.queue_depth`
   - Alert: When >50 for 5 minutes

## Alerting Best Practices

### Alert Priority Levels

| Priority | Response Time | Examples |
|---|---|---|
| P1 Critical | 15 minutes | Portal down, database unreachable |
| P2 High | 1 hour | Provisioning failing, payment processing down |
| P3 Medium | 4 hours | Degraded performance, high error rate |
| P4 Low | 24 hours | Minor issues, informational alerts |

### Alert Routing

```yaml
# Example PagerDuty routing
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-ops
  - match:
      severity: info
    receiver: email-team
```
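The first-match semantics of a routing tree like this can be sketched as a small function (receiver names taken from the YAML; the fallback receiver is an assumption):

```python
ROUTES = [
    ({"severity": "critical"}, "pagerduty-oncall"),
    ({"severity": "warning"}, "slack-ops"),
    ({"severity": "info"}, "email-team"),
]


def route(labels: dict, default: str = "email-team") -> str:
    """Return the receiver for the first route whose match labels all apply."""
    for match, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return default
```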

Include a runbook link in every alert so responders can go straight from the notification to remediation steps.


## Monitoring Checklist

### Initial Setup

- [ ] Configure health endpoint scraping (every 30s)
- [ ] Set up log aggregation (Loki, CloudWatch, or DataDog)
- [ ] Create overview dashboard with key metrics
- [ ] Configure P1/P2 alerts for critical failures
- [ ] Test alert routing to on-call

### Ongoing Maintenance

- [ ] Review alert thresholds quarterly
- [ ] Check for alert fatigue (too many false positives)
- [ ] Update dashboards when new features are deployed
- [ ] Validate runbook links are current


---

*Last Updated: December 2025*