
# Monitoring Dashboard Setup

This document provides guidance for setting up monitoring infrastructure for the Customer Portal.


## Health Endpoints

The BFF exposes several health check endpoints for monitoring:

| Endpoint | Purpose | Authentication |
|---|---|---|
| `GET /health` | Core system health (database, cache) | Public |
| `GET /health/queues` | Request queue metrics (WHMCS, Salesforce) | Public |
| `GET /health/queues/whmcs` | WHMCS queue details | Public |
| `GET /health/queues/salesforce` | Salesforce queue details | Public |
| `GET /health/catalog/cache` | Catalog cache metrics | Public |
| `GET /auth/health-check` | Integration health (DB, WHMCS, Salesforce) | Public |

### Core Health Response

```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "cache": "ok"
  }
}
```

**Status Values:**

- `ok` - All systems healthy
- `degraded` - One or more systems failing
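A monitor consuming this endpoint can derive the overall status from the individual checks. A minimal sketch in Python, using the field names from the response above (the aggregation rule is an assumption based on the documented status values):

```python
def overall_status(checks: dict) -> str:
    """Aggregate per-check results into the top-level status.

    "ok" when every check passes; "degraded" when one or more fail.
    """
    return "ok" if all(v == "ok" for v in checks.values()) else "degraded"
```

For example, `overall_status({"database": "ok", "cache": "error"})` returns `"degraded"`.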

### Queue Health Response

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "totalRequests": 1500,
      "completedRequests": 1495,
      "failedRequests": 5,
      "queueSize": 0,
      "pendingRequests": 2,
      "averageWaitTime": 50,
      "averageExecutionTime": 250
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ... },
    "dailyUsage": { "used": 5000, "limit": 15000 }
  }
}
```
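When polling this endpoint, the raw counters are usually converted to rates before alerting. A small sketch using the field names from the sample payload (the functions themselves are illustrative, not part of the BFF):

```python
def failure_rate(metrics: dict) -> float:
    """Fraction of requests that failed, from a queue metrics block."""
    total = metrics.get("totalRequests", 0)
    return metrics["failedRequests"] / total if total else 0.0


def sf_quota_used(daily_usage: dict) -> float:
    """Salesforce daily API usage as a fraction of the limit."""
    return daily_usage["used"] / daily_usage["limit"]
```

With the sample payload, the WHMCS failure rate is 5/1500 (about 0.33%) and the Salesforce quota is 5000/15000 (about 33%) used.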

## Key Metrics to Monitor

### Application Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Health status | `/health` | `degraded` | Any check fail | Core system health |
| Response time (p95) | Logs/APM | >2s | >5s | API response latency |
| Error rate | Logs/APM | >1% | >5% | HTTP 5xx responses |
| Active connections | Node.js metrics | >80% capacity | >95% capacity | Connection pool usage |
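The warning/critical pairs in these tables map naturally onto a small severity classifier. A sketch, assuming higher values are worse (metrics like cache hit rate, where lower is worse, would invert the comparisons):

```python
def classify(value: float, warning: float, critical: float) -> str:
    """Map a metric sample to a severity, given its thresholds."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"
```

For example, a 3.1s p95 against the 2s/5s latency thresholds classifies as `"warning"`.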

### Database Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Connection pool usage | PostgreSQL | >80% | >95% | Active connections vs limit |
| Query duration | PostgreSQL logs | >500ms | >2s | Slow query detection |
| Database size | PostgreSQL | >80% disk | >90% disk | Storage capacity |
| Dead tuples | `pg_stat_user_tables` | >10% | >25% | Vacuum needed |

### Cache Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Redis memory | Redis INFO | >80% maxmemory | >95% maxmemory | Memory pressure |
| Cache hit rate | Application logs | <80% | <60% | Cache effectiveness |
| Redis latency | Redis CLI | >10ms | >50ms | Command latency |
| Evictions | Redis INFO | Any | High rate | Memory pressure indicator |
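Cache hit rate is a derived metric; if the application logs hit and miss counters (the counter names here are assumptions), the computation is simply:

```python
def hit_rate(hits: int, misses: int) -> float:
    """Cache hit rate as a fraction; 0.0 when the cache is unused."""
    total = hits + misses
    return hits / total if total else 0.0
```

For example, `hit_rate(900, 100)` is 0.9, comfortably above the 80% warning threshold.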

### Queue Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| WHMCS queue size | `/health/queues` | >10 | >50 | Pending WHMCS requests |
| WHMCS failed requests | `/health/queues` | >5 | >20 | Failed API calls |
| SF daily API usage | `/health/queues` | >80% limit | >95% limit | Salesforce API quota |
| BullMQ wait queue | Redis | >10 | >50 | Job backlog |
| BullMQ failed jobs | Redis | >5 | >20 | Processing failures |

### External Dependency Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Salesforce response time | Logs | >2s | >5s | SF API latency |
| WHMCS response time | Logs | >2s | >5s | WHMCS API latency |
| Freebit response time | Logs | >3s | >10s | Freebit API latency |
| External error rate | Logs | >1% | >5% | Integration failures |

## Structured Logging for Metrics

The BFF uses Pino for structured JSON logging. Key fields for metrics extraction:

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "level": "info",
  "service": "customer-portal-bff",
  "correlationId": "req-123",
  "message": "API call completed",
  "duration": 250,
  "path": "/api/invoices",
  "method": "GET",
  "statusCode": 200
}
```
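Because each entry is one JSON object per line, latency percentiles can be computed directly from the log stream. A sketch using the field names above (the nearest-rank percentile method is one common choice, not necessarily what an APM tool uses):

```python
import json


def p95_duration(log_lines):
    """Nearest-rank 95th percentile of "duration" across log entries."""
    durations = sorted(
        entry["duration"]
        for entry in map(json.loads, log_lines)
        if "duration" in entry
    )
    if not durations:
        return None
    rank = -(-95 * len(durations) // 100)  # ceil(0.95 * n)
    return durations[rank - 1]
```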

### Log Queries for Metrics

**Error Rate:**

```bash
# Pino logs errors at level 50; this counts every error entry in the file,
# so pair it with hourly log rotation or a timestamp filter for an hourly rate
grep '"level":50' /var/log/bff/combined.log | wc -l
```

**Slow Requests (≥1s):**

```bash
# matches any 4+ digit duration value, i.e. 1000 ms or more
grep '"duration":[0-9]\{4,\}' /var/log/bff/combined.log | tail -20
```

**External API Errors:**

```bash
grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20
```

## Grafana Dashboard Setup

### Data Sources

1. **Prometheus** - For application metrics
2. **Loki** - For log aggregation
3. **PostgreSQL** - For database metrics

### Overview Dashboard

1. **System Health** (Stat panel)
   - Query: `/health` endpoint status
   - Show: ok/degraded indicator
2. **Request Rate** (Graph panel)
   - Source: Prometheus/Loki
   - Show: Requests per second
3. **Error Rate** (Graph panel)
   - Source: Loki log count
   - Filter: `level >= 50`
4. **Response Time (p95)** (Graph panel)
   - Source: Prometheus histogram
   - Show: 95th percentile latency

### Queue Dashboard

1. **Queue Depths** (Graph panel)
   - Source: `/health/queues` endpoint
   - Show: WHMCS and SF queue sizes
2. **Failed Jobs** (Stat panel)
   - Source: Redis BullMQ metrics
   - Show: Failed job count
3. **Salesforce API Usage** (Gauge panel)
   - Source: `/health/queues/salesforce`
   - Show: Daily usage vs limit

### Database Dashboard

1. **Connection Pool** (Gauge panel)
   - Source: PostgreSQL `pg_stat_activity`
   - Show: Active connections
2. **Query Performance** (Table panel)
   - Source: PostgreSQL `pg_stat_statements`
   - Show: Slowest queries

### Sample Prometheus Scrape Config

```yaml
scrape_configs:
  - job_name: "portal-bff"
    static_configs:
      - targets: ["bff:4000"]
    metrics_path: "/health"
    scrape_interval: 30s
```
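Note that the health endpoints return JSON rather than the Prometheus exposition format, so a small exporter (or the blackbox exporter) typically sits between Prometheus and the BFF. A minimal sketch of translating the `/health/queues` payload into exposition text (metric naming is an assumption; field names come from the sample response earlier in this document):

```python
import re


def _snake(name: str) -> str:
    """queueSize -> queue_size"""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_prometheus(queue_health: dict) -> str:
    """Render the /health/queues JSON payload as Prometheus exposition text."""
    lines = []
    for service in ("whmcs", "salesforce"):
        svc = queue_health.get(service) or {}
        for name, value in svc.get("metrics", {}).items():
            lines.append(f"portal_{service}_{_snake(name)} {value}")
        usage = svc.get("dailyUsage")
        if usage:
            lines.append(f"portal_{service}_daily_api_used {usage['used']}")
            lines.append(f"portal_{service}_daily_api_limit {usage['limit']}")
    return "\n".join(lines) + "\n"
```

An exporter would serve this text on its own `/metrics` path and point the scrape config at that instead of `/health`.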

## CloudWatch Setup (AWS)

### Custom Metrics

Push metrics from health endpoints to CloudWatch:

```bash
# Example: Push queue depth metric
aws cloudwatch put-metric-data \
  --namespace "CustomerPortal" \
  --metric-name "WhmcsQueueDepth" \
  --value "$(curl -s http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize')" \
  --dimensions Environment=production
```
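In a scheduled job, the same metric can be built programmatically. A sketch of the `MetricData` entry that boto3's `put_metric_data(Namespace="CustomerPortal", MetricData=[...])` expects (the helper function name is illustrative):

```python
def whmcs_queue_depth_metric(queue_size: int, environment: str = "production") -> dict:
    """Build one MetricData entry mirroring the CLI example."""
    return {
        "MetricName": "WhmcsQueueDepth",
        "Value": float(queue_size),
        "Unit": "Count",
        "Dimensions": [{"Name": "Environment", "Value": environment}],
    }
```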
### Recommended Alarms

| Alarm | Metric | Threshold | Period | Action |
|---|---|---|---|---|
| HighErrorRate | ErrorCount | >10 | 5 min | SNS notification |
| HighLatency | p95 ResponseTime | >2000ms | 5 min | SNS notification |
| QueueBacklog | WhmcsQueueDepth | >50 | 5 min | SNS notification |
| DatabaseDown | HealthStatus | != ok | 1 min | PagerDuty |
| CacheDown | HealthStatus | != ok | 1 min | PagerDuty |

### Log Insights Queries

**Error Summary:**

```
fields @timestamp, @message
| filter level >= 50
| stats count() by bin(5m)
```

**Slow Requests:**

```
fields @timestamp, path, duration
| filter duration > 2000
| sort duration desc
| limit 20
```

## DataDog Setup

### Agent Configuration

```yaml
# datadog.yaml
logs_enabled: true
```

```yaml
# conf.d/nodejs.d/conf.yaml (per-service log collection lives in conf.d, not datadog.yaml)
logs:
  - type: file
    path: /var/log/bff/combined.log
    service: customer-portal-bff
    source: nodejs
```

### Custom Metrics

```typescript
// Example: Report queue metrics to DataDog via DogStatsD
import { StatsD } from "hot-shots";

const dogstatsd = new StatsD({ host: "localhost", port: 8125 });

// `metrics` is the whmcs.metrics object from GET /health/queues
dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize);
dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests);
```
### Recommended Monitors

1. **Health Check Monitor**
   - Check: HTTP check on `/health`
   - Alert: When status != ok for 2 minutes
2. **Error Rate Monitor**
   - Metric: `portal.errors.count`
   - Alert: When >5% for 5 minutes
3. **Queue Depth Monitor**
   - Metric: `portal.whmcs.queue_depth`
   - Alert: When >50 for 5 minutes

## Alerting Best Practices

### Alert Priority Levels

| Priority | Response Time | Examples |
|---|---|---|
| P1 Critical | 15 minutes | Portal down, database unreachable |
| P2 High | 1 hour | Provisioning failing, payment processing down |
| P3 Medium | 4 hours | Degraded performance, high error rate |
| P4 Low | 24 hours | Minor issues, informational alerts |

### Alert Routing

```yaml
# Example PagerDuty routing
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-ops
  - match:
      severity: info
    receiver: email-team
```
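The first-match semantics of a routing tree like this can be sketched as a small function (receiver names taken from the YAML; the fallback receiver is an assumption):

```python
ROUTES = [
    ({"severity": "critical"}, "pagerduty-oncall"),
    ({"severity": "warning"}, "slack-ops"),
    ({"severity": "info"}, "email-team"),
]


def route(labels: dict, default: str = "email-team") -> str:
    """Return the receiver for the first route whose match labels all apply."""
    for match, receiver in ROUTES:
        if all(labels.get(k) == v for k, v in match.items()):
            return receiver
    return default
```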

Include a runbook link in every alert so responders can go straight from the notification to remediation steps.


## Monitoring Checklist

### Initial Setup

- [ ] Configure health endpoint scraping (every 30s)
- [ ] Set up log aggregation (Loki, CloudWatch, or DataDog)
- [ ] Create overview dashboard with key metrics
- [ ] Configure P1/P2 alerts for critical failures
- [ ] Test alert routing to on-call

### Ongoing Maintenance

- [ ] Review alert thresholds quarterly
- [ ] Check for alert fatigue (too many false positives)
- [ ] Update dashboards when new features are deployed
- [ ] Validate runbook links are current


---

*Last Updated: December 2025*