# Monitoring Dashboard Setup
This document provides guidance for setting up monitoring infrastructure for the Customer Portal.
## Health Endpoints
The BFF exposes several health check endpoints for monitoring:
| Endpoint | Purpose | Authentication |
|---|---|---|
| `GET /health` | Core system health (database, cache) | Public |
| `GET /health/queues` | Request queue metrics (WHMCS, Salesforce) | Public |
| `GET /health/queues/whmcs` | WHMCS queue details | Public |
| `GET /health/queues/salesforce` | Salesforce queue details | Public |
| `GET /health/catalog/cache` | Catalog cache metrics | Public |
| `GET /auth/health-check` | Integration health (DB, WHMCS, Salesforce) | Public |
### Core Health Response

```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "cache": "ok"
  }
}
```

**Status Values:**

- `ok` - All systems healthy
- `degraded` - One or more systems failing
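A cron job or uptime script can consume this endpoint by extracting the `status` field and flagging anything other than `ok`. A minimal sketch (the response is inlined here as a sample; in practice it would come from `curl -s http://localhost:4000/health`, and the port is an assumption):

```shell
# Extract the "status" field and report unhealthy unless it is "ok".
# The inlined response is a sample; normally:
#   response=$(curl -s http://localhost:4000/health)
response='{"status":"degraded","checks":{"database":"ok","cache":"fail"}}'
status=$(printf '%s' "$response" | sed -n 's/.*"status": *"\([a-z]*\)".*/\1/p')
if [ "$status" = "ok" ]; then
  echo "healthy"
else
  echo "unhealthy: $status"
fi
```

Using `jq -r '.status'` is more robust than `sed` when `jq` is available on the monitoring host.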
### Queue Health Response

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "totalRequests": 1500,
      "completedRequests": 1495,
      "failedRequests": 5,
      "queueSize": 0,
      "pendingRequests": 2,
      "averageWaitTime": 50,
      "averageExecutionTime": 250
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ... },
    "dailyUsage": { "used": 5000, "limit": 15000 }
  }
}
```
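The `failedRequests`/`totalRequests` pair is what failure-rate alerting keys on. As a sketch, the rate for the sample WHMCS metrics above works out as follows (the two values are taken from the sample payload; in practice they would be parsed out of `/health/queues`):

```shell
# WHMCS failure rate from the sample metrics: 5 failed of 1500 total.
failed=5
total=1500
rate=$(awk -v f="$failed" -v t="$total" 'BEGIN { printf "%.2f", f / t * 100 }')
echo "WHMCS failure rate: ${rate}%"   # 5/1500 -> 0.33%
```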
## Key Metrics to Monitor

### Application Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Health status | `/health` | `degraded` | Any check fail | Core system health |
| Response time (p95) | Logs/APM | >2s | >5s | API response latency |
| Error rate | Logs/APM | >1% | >5% | HTTP 5xx responses |
| Active connections | Node.js metrics | >80% capacity | >95% capacity | Connection pool usage |
### Database Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Connection pool usage | PostgreSQL | >80% | >95% | Active connections vs limit |
| Query duration | PostgreSQL logs | >500ms | >2s | Slow query detection |
| Database size | PostgreSQL | >80% disk | >90% disk | Storage capacity |
| Dead tuples | `pg_stat_user_tables` | >10% | >25% | Vacuum needed |
### Cache Metrics
| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Redis memory | Redis INFO | >80% maxmemory | >95% maxmemory | Memory pressure |
| Cache hit rate | Application logs | <80% | <60% | Cache effectiveness |
| Redis latency | Redis CLI | >10ms | >50ms | Command latency |
| Evictions | Redis INFO | Any | High rate | Memory pressure indicator |
### Queue Metrics

| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| WHMCS queue size | `/health/queues` | >10 | >50 | Pending WHMCS requests |
| WHMCS failed requests | `/health/queues` | >5 | >20 | Failed API calls |
| SF daily API usage | `/health/queues` | >80% limit | >95% limit | Salesforce API quota |
| BullMQ wait queue | Redis | >10 | >50 | Job backlog |
| BullMQ failed jobs | Redis | >5 | >20 | Processing failures |
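The Salesforce quota thresholds are percentages of the `dailyUsage` figures reported by `/health/queues`. For the sample payload earlier (5000 used of a 15000 limit), a sketch of the percentage calculation (sample values inlined; normally parsed from the endpoint response):

```shell
# Percentage of the Salesforce daily API quota consumed.
used=5000
limit=15000
pct=$(awk -v u="$used" -v l="$limit" 'BEGIN { printf "%.0f", u / l * 100 }')
echo "SF daily usage: ${pct}% of quota"   # 5000/15000 -> 33%
```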
### External Dependency Metrics
| Metric | Source | Warning | Critical | Description |
|---|---|---|---|---|
| Salesforce response time | Logs | >2s | >5s | SF API latency |
| WHMCS response time | Logs | >2s | >5s | WHMCS API latency |
| Freebit response time | Logs | >3s | >10s | Freebit API latency |
| External error rate | Logs | >1% | >5% | Integration failures |
## Structured Logging for Metrics

The BFF uses Pino for structured JSON logging. Pino emits numeric levels (30 = info, 50 = error), which is what the log queries below filter on. Key fields for metrics extraction:

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "level": 30,
  "service": "customer-portal-bff",
  "correlationId": "req-123",
  "message": "API call completed",
  "duration": 250,
  "path": "/api/invoices",
  "method": "GET",
  "statusCode": 200
}
```
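The `duration` field is what the latency metrics above are computed from. A percentile can be derived offline with `sort` and `awk`; a self-contained sketch over sample durations (in practice the values would be extracted from the log with `grep` or `jq`):

```shell
# p95 latency (ms) over a sample of "duration" values.
# Index = floor(0.95 * N): a simple floor approximation of the
# nearest-rank percentile method.
durations='120 250 90 400 2200 180 310 150 220 260'
p95=$(printf '%s\n' $durations | sort -n |
  awk '{ v[NR] = $1 } END { idx = int(NR * 0.95); if (idx < 1) idx = 1; print v[idx] }')
echo "p95: ${p95}ms"   # 9th of 10 sorted values -> 400
```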
### Log Queries for Metrics

**Error Count:**

```shell
grep '"level":50' /var/log/bff/combined.log | wc -l
```

**Slow Requests (≥1s, i.e. a 4-digit duration):**

```shell
grep '"duration":[0-9]\{4,\}' /var/log/bff/combined.log | tail -20
```

**External API Errors:**

```shell
grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20
```
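The error count above is only half of an error *rate*; dividing by the total line count gives the percentage that the alert thresholds refer to. A self-contained sketch on four sample Pino lines (in practice both `grep` calls would run against `/var/log/bff/combined.log`):

```shell
# Error rate from Pino JSON logs: level 50 = error, level 30 = info.
log='{"level":30,"msg":"ok"}
{"level":50,"msg":"boom"}
{"level":30,"msg":"ok"}
{"level":30,"msg":"ok"}'
errors=$(printf '%s\n' "$log" | grep -c '"level":50')
total=$(printf '%s\n' "$log" | grep -c '"level"')
rate=$(awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.0f", e / t * 100 }')
echo "error rate: ${rate}%"   # 1 error in 4 lines -> 25%
```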
## Grafana Dashboard Setup

### Data Sources
- Prometheus - For application metrics
- Loki - For log aggregation
- PostgreSQL - For database metrics
### Recommended Panels
#### Overview Dashboard

- **System Health** (Stat panel)
  - Query: `/health` endpoint status
  - Show: ok/degraded indicator
- **Request Rate** (Graph panel)
  - Source: Prometheus/Loki
  - Show: Requests per second
- **Error Rate** (Graph panel)
  - Source: Loki log count
  - Filter: `level >= 50`
- **Response Time (p95)** (Graph panel)
  - Source: Prometheus histogram
  - Show: 95th percentile latency
#### Queue Dashboard

- **Queue Depths** (Graph panel)
  - Source: `/health/queues` endpoint
  - Show: WHMCS and SF queue sizes
- **Failed Jobs** (Stat panel)
  - Source: Redis BullMQ metrics
  - Show: Failed job count
- **Salesforce API Usage** (Gauge panel)
  - Source: `/health/queues/salesforce`
  - Show: Daily usage vs limit
#### Database Dashboard

- **Connection Pool** (Gauge panel)
  - Source: PostgreSQL `pg_stat_activity`
  - Show: Active connections
- **Query Performance** (Table panel)
  - Source: PostgreSQL `pg_stat_statements`
  - Show: Slowest queries
### Sample Prometheus Scrape Config

```yaml
scrape_configs:
  - job_name: "portal-bff"
    static_configs:
      - targets: ["bff:4000"]
    metrics_path: "/health"
    scrape_interval: 30s
```
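Note that `/health` returns JSON rather than the Prometheus exposition format, so a plain scrape of it will be recorded as a failed target even when the service is healthy. One common workaround (a sketch, assuming a `blackbox_exporter` instance is deployed and reachable at `blackbox:9115`) is to probe the endpoint through the HTTP prober instead:

```yaml
scrape_configs:
  - job_name: "portal-bff-health"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ["http://bff:4000/health"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115
```

The resulting `probe_success` metric can then drive the System Health panel and alerts.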
## CloudWatch Setup (AWS)

### Custom Metrics

Push metrics from health endpoints to CloudWatch:

```shell
# Example: Push queue depth metric
aws cloudwatch put-metric-data \
  --namespace "CustomerPortal" \
  --metric-name "WhmcsQueueDepth" \
  --value "$(curl -s http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize')" \
  --dimensions Environment=production
```
### Recommended CloudWatch Alarms
| Alarm | Metric | Threshold | Period | Action |
|---|---|---|---|---|
| HighErrorRate | ErrorCount | >10 | 5 min | SNS notification |
| HighLatency | p95 ResponseTime | >2000ms | 5 min | SNS notification |
| QueueBacklog | WhmcsQueueDepth | >50 | 5 min | SNS notification |
| DatabaseDown | HealthStatus | !=ok | 1 min | PagerDuty |
| CacheDown | HealthStatus | !=ok | 1 min | PagerDuty |
### Log Insights Queries

**Error Summary:**

```
fields @timestamp, @message
| filter level >= 50
| stats count() by bin(5m)
```

**Slow Requests:**

```
fields @timestamp, path, duration
| filter duration > 2000
| sort duration desc
| limit 20
```
## DataDog Setup

### Agent Configuration

Enable log collection in the main agent config:

```yaml
# datadog.yaml
logs_enabled: true
```

Then point the agent at the BFF log file (log sources live under `conf.d/`, not in `datadog.yaml` itself):

```yaml
# conf.d/nodejs.d/conf.yaml
logs:
  - type: file
    path: /var/log/bff/combined.log
    service: customer-portal-bff
    source: nodejs
```
### Custom Metrics

```typescript
// Example: Report queue metrics to DataDog
import { StatsD } from "hot-shots";

const dogstatsd = new StatsD({ host: "localhost", port: 8125 });

// Report queue depth
dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize);
dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests);
```
### Recommended Monitors

- **Health Check Monitor**
  - Check: HTTP check on `/health`
  - Alert: When status != ok for 2 minutes
- **Error Rate Monitor**
  - Metric: `portal.errors.count`
  - Alert: When >5% for 5 minutes
- **Queue Depth Monitor**
  - Metric: `portal.whmcs.queue_depth`
  - Alert: When >50 for 5 minutes
## Alerting Best Practices

### Alert Priority Levels
| Priority | Response Time | Examples |
|---|---|---|
| P1 Critical | 15 minutes | Portal down, database unreachable |
| P2 High | 1 hour | Provisioning failing, payment processing down |
| P3 Medium | 4 hours | Degraded performance, high error rate |
| P4 Low | 24 hours | Minor issues, informational alerts |
### Alert Routing

```yaml
# Example PagerDuty routing
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-ops
  - match:
      severity: info
    receiver: email-team
```
### Runbook Links
Include runbook links in all alerts:
- Health check failures → Incident Response
- Database issues → Database Operations
- Queue problems → Queue Management
- External API failures → External Dependencies
## Monitoring Checklist

### Initial Setup
- Configure health endpoint scraping (every 30s)
- Set up log aggregation (Loki, CloudWatch, or DataDog)
- Create overview dashboard with key metrics
- Configure P1/P2 alerts for critical failures
- Test alert routing to on-call
### Ongoing Maintenance
- Review alert thresholds quarterly
- Check for alert fatigue (too many false positives)
- Update dashboards when new features are deployed
- Validate runbook links are current
## Related Documents
Last Updated: December 2025