# Monitoring Dashboard Setup

This document provides guidance for setting up monitoring infrastructure for the Customer Portal.

---

## Health Endpoints

The BFF exposes several health check endpoints for monitoring:

| Endpoint                        | Purpose                                    | Authentication |
| ------------------------------- | ------------------------------------------ | -------------- |
| `GET /health`                   | Core system health (database, cache)       | Public         |
| `GET /health/queues`            | Request queue metrics (WHMCS, Salesforce)  | Public         |
| `GET /health/queues/whmcs`      | WHMCS queue details                        | Public         |
| `GET /health/queues/salesforce` | Salesforce queue details                   | Public         |
| `GET /health/catalog/cache`     | Catalog cache metrics                      | Public         |
| `GET /auth/health-check`        | Integration health (DB, WHMCS, Salesforce) | Public         |

### Core Health Response

```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "cache": "ok"
  }
}
```

**Status Values:**

- `ok` - All systems healthy
- `degraded` - One or more systems failing
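
A monitoring probe can reduce this payload to the list of failing checks. The sketch below is illustrative rather than the BFF's actual code: the `HealthResponse` type mirrors the sample payload above, `failingChecks` is a hypothetical helper, and it assumes that any check value other than `ok` counts as a failure.

```typescript
// Shape of the /health payload, inferred from the sample response above.
interface HealthResponse {
  status: string; // "ok" or "degraded"
  checks: Record<string, string>; // e.g. { database: "ok", cache: "ok" }
}

// Hypothetical helper: names of checks whose value is anything other than "ok".
function failingChecks(health: HealthResponse): string[] {
  return Object.entries(health.checks)
    .filter(([, value]) => value !== "ok")
    .map(([name]) => name);
}

const sample: HealthResponse = {
  status: "degraded",
  checks: { database: "ok", cache: "fail" },
};
console.log(failingChecks(sample)); // [ 'cache' ]
```

Reporting the failing check names, rather than only the top-level `status`, lets an alert say which dependency is affected.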

### Queue Health Response

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "totalRequests": 1500,
      "completedRequests": 1495,
      "failedRequests": 5,
      "queueSize": 0,
      "pendingRequests": 2,
      "averageWaitTime": 50,
      "averageExecutionTime": 250
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ... },
    "dailyUsage": { "used": 5000, "limit": 15000 }
  }
}
```
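
Two derived values are often more useful on a dashboard than the raw counters: the share of requests that failed and the share of the Salesforce daily quota already consumed. A minimal sketch, assuming the field names in the sample response above; `failureRate` and `quotaUsedPercent` are hypothetical helpers, not part of the BFF.

```typescript
interface QueueMetrics {
  totalRequests: number;
  failedRequests: number;
}

interface DailyUsage {
  used: number;
  limit: number;
}

// Fraction of requests that failed (0 when nothing has been processed yet).
function failureRate(m: QueueMetrics): number {
  return m.totalRequests === 0 ? 0 : m.failedRequests / m.totalRequests;
}

// Percentage of the daily API quota consumed so far.
function quotaUsedPercent(u: DailyUsage): number {
  return u.limit === 0 ? 0 : (u.used / u.limit) * 100;
}

// Values taken from the sample response above.
console.log(failureRate({ totalRequests: 1500, failedRequests: 5 }).toFixed(4)); // 0.0033
console.log(quotaUsedPercent({ used: 5000, limit: 15000 }).toFixed(1)); // 33.3
```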

---

## Key Metrics to Monitor

### Application Metrics

| Metric              | Source          | Warning       | Critical         | Description           |
| ------------------- | --------------- | ------------- | ---------------- | --------------------- |
| Health status       | `/health`       | `degraded`    | Any check `fail` | Core system health    |
| Response time (p95) | Logs/APM        | >2s           | >5s              | API response latency  |
| Error rate          | Logs/APM        | >1%           | >5%              | HTTP 5xx responses    |
| Active connections  | Node.js metrics | >80% capacity | >95% capacity    | Connection pool usage |
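
The warning and critical columns translate directly into a small classification function. The sketch below is one possible shape, not an existing utility: `Threshold` and `classify` are hypothetical names, and the `direction` field distinguishes metrics where higher is worse (latency, error rate) from metrics where lower is worse (for example, a hit rate).

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
  // "above": higher values are worse; "below": lower values are worse.
  direction: "above" | "below";
}

function classify(value: number, t: Threshold): Severity {
  const breached = (limit: number): boolean =>
    t.direction === "above" ? value > limit : value < limit;
  if (breached(t.critical)) return "critical";
  if (breached(t.warning)) return "warning";
  return "ok";
}

// p95 response time in milliseconds: warning above 2s, critical above 5s.
const p95Latency: Threshold = { warning: 2000, critical: 5000, direction: "above" };
console.log(classify(2500, p95Latency)); // warning
console.log(classify(6000, p95Latency)); // critical
```

Checking the critical limit first means a value that breaches both thresholds is reported once, at the higher severity.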

### Database Metrics

| Metric                | Source                | Warning   | Critical  | Description                 |
| --------------------- | --------------------- | --------- | --------- | --------------------------- |
| Connection pool usage | PostgreSQL            | >80%      | >95%      | Active connections vs limit |
| Query duration        | PostgreSQL logs       | >500ms    | >2s       | Slow query detection        |
| Database size         | PostgreSQL            | >80% disk | >90% disk | Storage capacity            |
| Dead tuples           | `pg_stat_user_tables` | >10%      | >25%      | Vacuum needed               |

### Cache Metrics

| Metric         | Source           | Warning        | Critical       | Description               |
| -------------- | ---------------- | -------------- | -------------- | ------------------------- |
| Redis memory   | Redis INFO       | >80% maxmemory | >95% maxmemory | Memory pressure           |
| Cache hit rate | Application logs | <80%           | <60%           | Cache effectiveness       |
| Redis latency  | Redis CLI        | >10ms          | >50ms          | Command latency           |
| Evictions      | Redis INFO       | Any            | High rate      | Memory pressure indicator |

### Queue Metrics

| Metric                | Source           | Warning    | Critical   | Description            |
| --------------------- | ---------------- | ---------- | ---------- | ---------------------- |
| WHMCS queue size      | `/health/queues` | >10        | >50        | Pending WHMCS requests |
| WHMCS failed requests | `/health/queues` | >5         | >20        | Failed API calls       |
| SF daily API usage    | `/health/queues` | >80% limit | >95% limit | Salesforce API quota   |
| BullMQ wait queue     | Redis            | >10        | >50        | Job backlog            |
| BullMQ failed jobs    | Redis            | >5         | >20        | Processing failures    |

### External Dependency Metrics

| Metric                   | Source | Warning | Critical | Description          |
| ------------------------ | ------ | ------- | -------- | -------------------- |
| Salesforce response time | Logs   | >2s     | >5s      | SF API latency       |
| WHMCS response time      | Logs   | >2s     | >5s      | WHMCS API latency    |
| Freebit response time    | Logs   | >3s     | >10s     | Freebit API latency  |
| External error rate      | Logs   | >1%     | >5%      | Integration failures |

---

## Structured Logging for Metrics

The BFF uses Pino for structured JSON logging. Key fields for metrics extraction:

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "level": "info",
  "service": "customer-portal-bff",
  "correlationId": "req-123",
  "message": "API call completed",
  "duration": 250,
  "path": "/api/invoices",
  "method": "GET",
  "statusCode": 200
}
```
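
Because each log line is a self-contained JSON object, request metrics can be derived by parsing lines and aggregating the `statusCode` and `duration` fields. The following is a sketch under assumptions: `parseLine` and `summarize` are hypothetical helpers, lines that are not valid JSON or lack these two fields are skipped, and the p95 uses the nearest-rank method.

```typescript
interface RequestLog {
  statusCode: number;
  duration: number; // milliseconds
}

interface Summary {
  requests: number;
  errorRate: number; // fraction of requests with a 5xx status
  p95Duration: number; // milliseconds, nearest-rank percentile
}

// Parse one log line; returns null for non-JSON lines or non-request entries.
function parseLine(line: string): RequestLog | null {
  try {
    const entry = JSON.parse(line);
    if (typeof entry?.statusCode === "number" && typeof entry?.duration === "number") {
      return { statusCode: entry.statusCode, duration: entry.duration };
    }
  } catch {
    // Ignore malformed lines.
  }
  return null;
}

function summarize(lines: string[]): Summary {
  const logs = lines.map(parseLine).filter((l): l is RequestLog => l !== null);
  if (logs.length === 0) return { requests: 0, errorRate: 0, p95Duration: 0 };
  const errors = logs.filter((l) => l.statusCode >= 500).length;
  const sorted = logs.map((l) => l.duration).sort((a, b) => a - b);
  const rank = Math.ceil(0.95 * sorted.length); // 1-based nearest rank
  return {
    requests: logs.length,
    errorRate: errors / logs.length,
    p95Duration: sorted[rank - 1] ?? 0,
  };
}

const lines = [
  '{"statusCode":200,"duration":250}',
  '{"statusCode":503,"duration":4000}',
  "not json",
];
console.log(summarize(lines)); // { requests: 2, errorRate: 0.5, p95Duration: 4000 }
```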
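
### Log Queries for Metrics

The queries below accept both numeric Pino levels (`50` is `error`) and label-style levels such as `"error"`, because the serialized form depends on the logger configuration.

**Error Count (whole log file):**

```bash
# Count error-level entries (numeric level 50 or the label "error")
grep -cE '"level": ?(50|"error")' /var/log/bff/combined.log
```

**Slow Requests (>2s):**

```bash
# Requires jq; assumes one JSON object per line with "duration" in milliseconds
jq -c 'select(.duration > 2000)' /var/log/bff/combined.log | tail -20
```

**External API Errors:**

```bash
grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20
```

---
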
## Grafana Dashboard Setup

### Data Sources

1. **Prometheus** - For application metrics
2. **Loki** - For log aggregation
3. **PostgreSQL** - For database metrics

### Recommended Panels

#### Overview Dashboard

1. **System Health** (Stat panel)
   - Query: `/health` endpoint status
   - Show: ok/degraded indicator

2. **Request Rate** (Graph panel)
   - Source: Prometheus/Loki
   - Show: Requests per second

3. **Error Rate** (Graph panel)
   - Source: Loki log count
   - Filter: `level >= 50`

4. **Response Time (p95)** (Graph panel)
   - Source: Prometheus histogram
   - Show: 95th percentile latency

#### Queue Dashboard

1. **Queue Depths** (Graph panel)
   - Source: `/health/queues` endpoint
   - Show: WHMCS and SF queue sizes

2. **Failed Jobs** (Stat panel)
   - Source: Redis BullMQ metrics
   - Show: Failed job count

3. **Salesforce API Usage** (Gauge panel)
   - Source: `/health/queues/salesforce`
   - Show: Daily usage vs limit

#### Database Dashboard

1. **Connection Pool** (Gauge panel)
   - Source: PostgreSQL `pg_stat_activity`
   - Show: Active connections

2. **Query Performance** (Table panel)
   - Source: PostgreSQL `pg_stat_statements`
   - Show: Slowest queries

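### Sample Prometheus Scrape Config

```yaml
# Note: Prometheus expects the text exposition format, while /health returns JSON.
# Point metrics_path at an endpoint or exporter that serves that format.
scrape_configs:
  - job_name: "portal-bff"
    static_configs:
      - targets: ["bff:4000"]
    metrics_path: "/health"
    scrape_interval: 30s
```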
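
To expose the health checks in a form Prometheus can ingest, the JSON payload can be rendered as gauge samples in the text exposition format. The sketch below shows one way to do that translation; the function name `toPrometheusText` and the metric name `portal_health_check` are assumptions for illustration, with `1` meaning a check reported `ok` and `0` anything else.

```typescript
interface HealthResponse {
  status: string;
  checks: Record<string, string>;
}

// Render health checks as Prometheus gauge samples: 1 = ok, 0 = not ok.
// Assumes check names contain no quotes or backslashes (no label escaping is done).
function toPrometheusText(health: HealthResponse): string {
  const lines = [
    "# HELP portal_health_check Whether a health check reports ok (1) or not (0).",
    "# TYPE portal_health_check gauge",
  ];
  for (const [name, value] of Object.entries(health.checks)) {
    lines.push(`portal_health_check{check="${name}"} ${value === "ok" ? 1 : 0}`);
  }
  return lines.join("\n") + "\n";
}

console.log(toPrometheusText({ status: "degraded", checks: { database: "ok", cache: "fail" } }));
```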

---

## CloudWatch Setup (AWS)

### Custom Metrics

Push metrics from health endpoints to CloudWatch:

```bash
# Example: Push queue depth metric
aws cloudwatch put-metric-data \
  --namespace "CustomerPortal" \
  --metric-name "WhmcsQueueDepth" \
  --value $(curl -s http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize') \
  --dimensions Environment=production
```
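
The same idea extends to publishing several queue metrics in one call. The sketch below only builds the payload: the `MetricDatum` interface is a local, simplified stand-in for the entry shape accepted by CloudWatch's `PutMetricData` API, and `buildQueueMetrics` and the `WhmcsFailedRequests` metric name are hypothetical. Sending the result (for example with the AWS SDK) is left out.

```typescript
// Simplified local stand-in for a CloudWatch PutMetricData entry.
interface MetricDatum {
  MetricName: string;
  Value: number;
  Unit: "Count";
  Dimensions: { Name: string; Value: string }[];
}

interface WhmcsQueueMetrics {
  queueSize: number;
  failedRequests: number;
}

// Build one datum per queue metric, tagged with the deployment environment.
function buildQueueMetrics(metrics: WhmcsQueueMetrics, environment: string): MetricDatum[] {
  const dimensions = [{ Name: "Environment", Value: environment }];
  return [
    { MetricName: "WhmcsQueueDepth", Value: metrics.queueSize, Unit: "Count", Dimensions: dimensions },
    { MetricName: "WhmcsFailedRequests", Value: metrics.failedRequests, Unit: "Count", Dimensions: dimensions },
  ];
}

const data = buildQueueMetrics({ queueSize: 0, failedRequests: 5 }, "production");
console.log(data.map((d) => `${d.MetricName}=${d.Value}`).join(", "));
// WhmcsQueueDepth=0, WhmcsFailedRequests=5
```

Keeping payload construction separate from the network call makes the mapping easy to unit test without AWS credentials.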

### Recommended CloudWatch Alarms

| Alarm         | Metric           | Threshold | Period | Action           |
| ------------- | ---------------- | --------- | ------ | ---------------- |
| HighErrorRate | ErrorCount       | >10       | 5 min  | SNS notification |
| HighLatency   | p95 ResponseTime | >2000ms   | 5 min  | SNS notification |
| QueueBacklog  | WhmcsQueueDepth  | >50       | 5 min  | SNS notification |
| DatabaseDown  | HealthStatus     | !=ok      | 1 min  | PagerDuty        |
| CacheDown     | HealthStatus     | !=ok      | 1 min  | PagerDuty        |

### Log Insights Queries

**Error Summary:**

```sql
fields @timestamp, @message
| filter level >= 50
| stats count() by bin(5m)
```

**Slow Requests:**

```sql
fields @timestamp, path, duration
| filter duration > 2000
| sort duration desc
| limit 20
```

---

## DataDog Setup

### Agent Configuration

```yaml
# datadog.yaml
logs_enabled: true

# conf.d/<integration>.d/conf.yaml (log collection is configured per integration)
logs:
  - type: file
    path: /var/log/bff/combined.log
    service: customer-portal-bff
    source: nodejs
```

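### Custom Metrics

```typescript
// Example: Report queue metrics to DataDog
import { StatsD } from "hot-shots";

const dogstatsd = new StatsD({ host: "localhost", port: 8125 });

async function reportQueueMetrics(): Promise<void> {
  // Read the WHMCS queue metrics from the BFF health endpoint (Node 18+ global fetch)
  const response = await fetch("http://localhost:4000/health/queues");
  const body: any = await response.json();
  const { metrics } = body.whmcs;
  // Report queue depth and failed requests as gauges
  dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize);
  dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests);
}

reportQueueMetrics().catch(console.error);
```
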
### Recommended Monitors

1. **Health Check Monitor**
   - Check: HTTP check on `/health`
   - Alert: When status != ok for 2 minutes

2. **Error Rate Monitor**
   - Metric: `portal.errors.count`
   - Alert: When >5% for 5 minutes

3. **Queue Depth Monitor**
   - Metric: `portal.whmcs.queue_depth`
   - Alert: When >50 for 5 minutes

---

## Alerting Best Practices

### Alert Priority Levels

| Priority    | Response Time | Examples                                      |
| ----------- | ------------- | --------------------------------------------- |
| P1 Critical | 15 minutes    | Portal down, database unreachable             |
| P2 High     | 1 hour        | Provisioning failing, payment processing down |
| P3 Medium   | 4 hours       | Degraded performance, high error rate         |
| P4 Low      | 24 hours      | Minor issues, informational alerts            |

### Alert Routing

```yaml
# Example Alertmanager-style routing (critical alerts go to PagerDuty)
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-ops
  - match:
      severity: info
    receiver: email-team
```
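
In application code the same routing table is a simple lookup. This is a sketch only: `receiverFor` is a hypothetical helper, the receiver names are copied from the example above, and falling back to `email-team` for unknown severities is an assumption rather than documented behavior.

```typescript
// Severity label -> receiver, mirroring the routing example above.
const receivers = new Map<string, string>([
  ["critical", "pagerduty-oncall"],
  ["warning", "slack-ops"],
  ["info", "email-team"],
]);

// Resolve the receiver for a severity label; unknown labels fall back to email.
function receiverFor(severity: string): string {
  return receivers.get(severity) ?? "email-team";
}

console.log(receiverFor("critical")); // pagerduty-oncall
console.log(receiverFor("debug")); // email-team
```

An explicit fallback keeps alerts with a mistyped or new severity label from being dropped silently.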

### Runbook Links

Include runbook links in all alerts:

- Health check failures → [Incident Response](./incident-response.md)
- Database issues → [Database Operations](./database-operations.md)
- Queue problems → [Queue Management](./queue-management.md)
- External API failures → [External Dependencies](./external-dependencies.md)

---

## Monitoring Checklist

### Initial Setup

- [ ] Configure health endpoint scraping (every 30s)
- [ ] Set up log aggregation (Loki, CloudWatch, or DataDog)
- [ ] Create overview dashboard with key metrics
- [ ] Configure P1/P2 alerts for critical failures
- [ ] Test alert routing to on-call

### Ongoing Maintenance

- [ ] Review alert thresholds quarterly
- [ ] Check for alert fatigue (too many false positives)
- [ ] Update dashboards when new features are deployed
- [ ] Validate runbook links are current

---

## Related Documents

- [Incident Response](./incident-response.md)
- [Logging Guide](./logging.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)

---

**Last Updated:** December 2025