# Monitoring Dashboard Setup

This document describes how to set up monitoring for the Customer Portal: health endpoints, key metrics and thresholds, dashboards, and alerting.

---

## Health Endpoints

The BFF exposes several health check endpoints for monitoring:

| Endpoint                        | Purpose                                    | Authentication |
| ------------------------------- | ------------------------------------------ | -------------- |
| `GET /health`                   | Core system health (database, cache)       | Public         |
| `GET /health/queues`            | Request queue metrics (WHMCS, Salesforce)  | Public         |
| `GET /health/queues/whmcs`      | WHMCS queue details                        | Public         |
| `GET /health/queues/salesforce` | Salesforce queue details                   | Public         |
| `GET /health/catalog/cache`     | Catalog cache metrics                      | Public         |
| `GET /auth/health-check`        | Integration health (DB, WHMCS, Salesforce) | Public         |

### Core Health Response

```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "cache": "ok"
  }
}
```

**Status Values:**

- `ok` - All systems healthy
- `degraded` - One or more systems failing
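
A monitoring probe usually needs to know which check failed, not only that the status is `degraded`. The following is a minimal sketch of that evaluation, assuming each entry in `checks` is the string `"ok"` when healthy and some other value otherwise; the names `CoreHealth`, `failingChecks`, and `isHealthy` are illustrative and not part of the BFF.

```typescript
// Shape of the core health payload shown above.
interface CoreHealth {
  status: string;
  checks: Record<string, string>;
}

// Names of the checks that are not "ok"; an empty list means every check passed.
// Assumption: any value other than "ok" counts as a failure.
function failingChecks(health: CoreHealth): string[] {
  return Object.keys(health.checks).filter(
    (checkName) => health.checks[checkName] !== "ok",
  );
}

// Healthy only when the top-level status and every individual check agree.
function isHealthy(health: CoreHealth): boolean {
  return health.status === "ok" && failingChecks(health).length === 0;
}
```

The list returned by `failingChecks` can go straight into an alert message so the on-call engineer sees whether the database or the cache is the problem.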

### Queue Health Response

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "totalRequests": 1500,
      "completedRequests": 1495,
      "failedRequests": 5,
      "queueSize": 0,
      "pendingRequests": 2,
      "averageWaitTime": 50,
      "averageExecutionTime": 250
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ... },
    "dailyUsage": { "used": 5000, "limit": 15000 }
  }
}
```
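
The raw counters in this payload are easier to alert on once converted to percentages. The sketch below derives a failure rate and a Salesforce quota percentage from the fields shown above; the function names are illustrative, and the zero-division handling is an assumption about what makes sense for an idle queue.

```typescript
// Fields of the `metrics` object in the queue health payload.
interface QueueMetrics {
  totalRequests: number;
  completedRequests: number;
  failedRequests: number;
  queueSize: number;
  pendingRequests: number;
  averageWaitTime: number;
  averageExecutionTime: number;
}

// Fields of the Salesforce `dailyUsage` object.
interface DailyUsage {
  used: number;
  limit: number;
}

// Failed requests as a percentage of all requests; 0 when nothing has run yet.
function failureRatePercent(metrics: QueueMetrics): number {
  if (metrics.totalRequests === 0) return 0;
  return (metrics.failedRequests / metrics.totalRequests) * 100;
}

// Share of the daily Salesforce API quota already consumed, as a percentage.
function quotaUsedPercent(usage: DailyUsage): number {
  if (usage.limit === 0) return 0;
  return (usage.used / usage.limit) * 100;
}
```

With the sample payload above, the WHMCS failure rate works out to roughly 0.33% (5 of 1500) and Salesforce quota usage to roughly 33% (5000 of 15000).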

---

## Key Metrics to Monitor

### Application Metrics

| Metric              | Source          | Warning       | Critical         | Description           |
| ------------------- | --------------- | ------------- | ---------------- | --------------------- |
| Health status       | `/health`       | `degraded`    | Any check `fail` | Core system health    |
| Response time (p95) | Logs/APM        | >2s           | >5s              | API response latency  |
| Error rate          | Logs/APM        | >1%           | >5%              | HTTP 5xx responses    |
| Active connections  | Node.js metrics | >80% capacity | >95% capacity    | Connection pool usage |

### Database Metrics

| Metric                | Source                | Warning   | Critical  | Description                 |
| --------------------- | --------------------- | --------- | --------- | --------------------------- |
| Connection pool usage | PostgreSQL            | >80%      | >95%      | Active connections vs limit |
| Query duration        | PostgreSQL logs       | >500ms    | >2s       | Slow query detection        |
| Database size         | PostgreSQL            | >80% disk | >90% disk | Storage capacity            |
| Dead tuples           | `pg_stat_user_tables` | >10%      | >25%      | Vacuum needed               |

### Cache Metrics

| Metric         | Source           | Warning        | Critical       | Description               |
| -------------- | ---------------- | -------------- | -------------- | ------------------------- |
| Redis memory   | Redis INFO       | >80% maxmemory | >95% maxmemory | Memory pressure           |
| Cache hit rate | Application logs | <80%           | <60%           | Cache effectiveness       |
| Redis latency  | Redis CLI        | >10ms          | >50ms          | Command latency           |
| Evictions      | Redis INFO       | Any            | High rate      | Memory pressure indicator |

### Queue Metrics

| Metric                | Source           | Warning    | Critical   | Description            |
| --------------------- | ---------------- | ---------- | ---------- | ---------------------- |
| WHMCS queue size      | `/health/queues` | >10        | >50        | Pending WHMCS requests |
| WHMCS failed requests | `/health/queues` | >5         | >20        | Failed API calls       |
| SF daily API usage    | `/health/queues` | >80% limit | >95% limit | Salesforce API quota   |
| BullMQ wait queue     | Redis            | >10        | >50        | Job backlog            |
| BullMQ failed jobs    | Redis            | >5         | >20        | Processing failures    |
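
The warning and critical columns in these tables translate directly into a small classifier. This sketch covers metrics where higher values are worse, using the WHMCS queue size thresholds from the table above; `classifyHigh` and the type names are illustrative, and metrics where lower is worse (such as cache hit rate) would need the comparisons inverted.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Classify a "higher is worse" metric. Comparisons are strict, matching the
// ">10" / ">50" notation used in the tables.
function classifyHigh(value: number, threshold: Threshold): Severity {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}

// Thresholds copied from the "WHMCS queue size" row.
const whmcsQueueSizeThreshold: Threshold = { warning: 10, critical: 50 };
```

For example, `classifyHigh(12, whmcsQueueSizeThreshold)` yields `"warning"`, while a queue size of exactly 10 is still `"ok"` because the threshold is exclusive.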

### External Dependency Metrics

| Metric                   | Source | Warning | Critical | Description          |
| ------------------------ | ------ | ------- | -------- | -------------------- |
| Salesforce response time | Logs   | >2s     | >5s      | SF API latency       |
| WHMCS response time      | Logs   | >2s     | >5s      | WHMCS API latency    |
| Freebit response time    | Logs   | >3s     | >10s     | Freebit API latency  |
| External error rate      | Logs   | >1%     | >5%      | Integration failures |

---

## Structured Logging for Metrics

The BFF uses Pino for structured JSON logging. Key fields for metrics extraction:

```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "level": "info",
  "service": "customer-portal-bff",
  "correlationId": "req-123",
  "message": "API call completed",
  "duration": 250,
  "path": "/api/invoices",
  "method": "GET",
  "statusCode": 200
}
```

### Log Queries for Metrics

**Error Count (entire log file):**

```bash
# Counts error-level lines; assumes Pino's numeric levels (50 = error)
grep -c '"level":50' /var/log/bff/combined.log
```

**Slow Requests (>2s):**

```bash
# Compare the duration numerically; a digit-count regex would also match 1000-1999ms
jq -c 'select(.duration > 2000)' /var/log/bff/combined.log | tail -20
```

**External API Errors:**

```bash
grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20
```
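
Shell one-liners give raw counts, but an error rate over a time window needs the timestamps and status codes parsed. The sketch below does that for JSON log lines with the fields shown in the structured logging example; the function names are illustrative, and it assumes `statusCode` is present on request log lines and that HTTP 5xx counts as an error.

```typescript
// Subset of the structured log fields used for metrics.
interface PortalLogEntry {
  timestamp: string;
  duration?: number;
  path?: string;
  statusCode?: number;
}

// Parse newline-delimited JSON, skipping blank and non-JSON lines.
function parseLogLines(text: string): PortalLogEntry[] {
  const entries: PortalLogEntry[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (parsed !== null && typeof parsed === "object") {
        entries.push(parsed as PortalLogEntry);
      }
    } catch {
      // Not JSON (for example a stack trace line); skip it.
    }
  }
  return entries;
}

// Percentage of requests inside [sinceMs, untilMs] that returned HTTP 5xx.
function errorRatePercent(entries: PortalLogEntry[], sinceMs: number, untilMs: number): number {
  const inWindow = entries.filter((entry) => {
    const at = Date.parse(entry.timestamp);
    return typeof entry.statusCode === "number" && at >= sinceMs && at <= untilMs;
  });
  if (inWindow.length === 0) return 0;
  const errors = inWindow.filter((entry) => (entry.statusCode ?? 0) >= 500).length;
  return (errors / inWindow.length) * 100;
}

// Entries whose duration exceeds the threshold (2s by default).
function slowRequests(entries: PortalLogEntry[], thresholdMs: number = 2000): PortalLogEntry[] {
  return entries.filter(
    (entry) => typeof entry.duration === "number" && entry.duration > thresholdMs,
  );
}
```

For a last-hour error rate, pass `Date.now() - 60 * 60 * 1000` and `Date.now()` as the window bounds.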

---

## Grafana Dashboard Setup

### Data Sources

1. **Prometheus** - For application metrics
2. **Loki** - For log aggregation
3. **PostgreSQL** - For database metrics

### Recommended Panels

#### Overview Dashboard

1. **System Health** (Stat panel)
   - Query: `/health` endpoint status
   - Show: ok/degraded indicator

2. **Request Rate** (Graph panel)
   - Source: Prometheus/Loki
   - Show: Requests per second

3. **Error Rate** (Graph panel)
   - Source: Loki log count
   - Filter: `level >= 50`

4. **Response Time (p95)** (Graph panel)
   - Source: Prometheus histogram
   - Show: 95th percentile latency

#### Queue Dashboard

1. **Queue Depths** (Graph panel)
   - Source: `/health/queues` endpoint
   - Show: WHMCS and SF queue sizes

2. **Failed Jobs** (Stat panel)
   - Source: Redis BullMQ metrics
   - Show: Failed job count

3. **Salesforce API Usage** (Gauge panel)
   - Source: `/health/queues/salesforce`
   - Show: Daily usage vs limit

#### Database Dashboard

1. **Connection Pool** (Gauge panel)
   - Source: PostgreSQL `pg_stat_activity`
   - Show: Active connections

2. **Query Performance** (Table panel)
   - Source: PostgreSQL `pg_stat_statements`
   - Show: Slowest queries

### Sample Prometheus Scrape Config

```yaml
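# Prometheus scrapes the text exposition format, not the JSON returned by /health.
# This assumes the BFF (or an exporter in front of it) serves metrics at /metrics;
# that path is an assumption, so adjust it to your deployment.
scrape_configs:
  - job_name: "portal-bff"
    metrics_path: "/metrics"
    scrape_interval: 30s
    static_configs:
      - targets: ["bff:4000"]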
```
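
Prometheus expects the text exposition format rather than JSON, so the values from `/health/queues` have to be rendered as metric lines before they can be scraped. The following sketch shows one way to do that rendering; the metric names `portal_queue_size` and `portal_queue_failed_requests` and the `QueueSample` shape are hypothetical, and how the resulting text is served is left to the deployment.

```typescript
// One row per queue, taken from the `metrics` objects in /health/queues.
interface QueueSample {
  queue: string;
  queueSize: number;
  failedRequests: number;
}

// Render queue samples as Prometheus text exposition format.
// Metric names here are illustrative, not an existing BFF convention.
function toPrometheusText(samples: QueueSample[]): string {
  const lines: string[] = [];
  lines.push("# HELP portal_queue_size Requests currently waiting in the queue.");
  lines.push("# TYPE portal_queue_size gauge");
  for (const sample of samples) {
    lines.push(`portal_queue_size{queue="${sample.queue}"} ${sample.queueSize}`);
  }
  lines.push("# HELP portal_queue_failed_requests Failed requests reported by the queue.");
  lines.push("# TYPE portal_queue_failed_requests gauge");
  for (const sample of samples) {
    lines.push(`portal_queue_failed_requests{queue="${sample.queue}"} ${sample.failedRequests}`);
  }
  // The exposition format requires a trailing newline after the last line.
  return lines.join("\n") + "\n";
}
```

An off-the-shelf alternative is to put a JSON-to-Prometheus exporter in front of the health endpoints instead of rendering the text in application code.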

---

## CloudWatch Setup (AWS)

### Custom Metrics

Push metrics from health endpoints to CloudWatch:

```bash
# Example: Push queue depth metric
aws cloudwatch put-metric-data \
  --namespace "CustomerPortal" \
  --metric-name "WhmcsQueueDepth" \
  --value "$(curl -sf http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize')" \
  --dimensions Environment=production
```

### Recommended CloudWatch Alarms

| Alarm         | Metric           | Threshold | Period | Action           |
| ------------- | ---------------- | --------- | ------ | ---------------- |
| HighErrorRate | ErrorCount       | >10       | 5 min  | SNS notification |
| HighLatency   | p95 ResponseTime | >2000ms   | 5 min  | SNS notification |
| QueueBacklog  | WhmcsQueueDepth  | >50       | 5 min  | SNS notification |
| DatabaseDown  | HealthStatus     | !=ok      | 1 min  | PagerDuty        |
| CacheDown     | HealthStatus     | !=ok      | 1 min  | PagerDuty        |
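
CloudWatch alarms compare numeric values, so a status such as `ok` has to be encoded as a number before the `HealthStatus` alarms can evaluate it. The sketch below builds a request body in the shape of the CloudWatch `PutMetricData` API; the 1/0 encoding and the function name are assumptions. With this encoding, the `!=ok` threshold in the table becomes `HealthStatus < 1`.

```typescript
// Minimal subset of the PutMetricData request shape.
interface CloudWatchDatum {
  MetricName: string;
  Value: number;
  Unit: "Count" | "None";
  Dimensions: { Name: string; Value: string }[];
}

interface CloudWatchPutInput {
  Namespace: string;
  MetricData: CloudWatchDatum[];
}

// Build one request carrying the queue depth and an encoded health status.
function buildPortalMetrics(
  environment: string,
  healthStatus: string,
  whmcsQueueSize: number,
): CloudWatchPutInput {
  const dimensions = [{ Name: "Environment", Value: environment }];
  return {
    Namespace: "CustomerPortal",
    MetricData: [
      { MetricName: "WhmcsQueueDepth", Value: whmcsQueueSize, Unit: "Count", Dimensions: dimensions },
      // Assumed encoding: 1 = ok, 0 = anything else.
      { MetricName: "HealthStatus", Value: healthStatus === "ok" ? 1 : 0, Unit: "None", Dimensions: dimensions },
    ],
  };
}
```

The returned object can be passed to `PutMetricDataCommand` from `@aws-sdk/client-cloudwatch`, or serialized for the CLI's `--cli-input-json` option.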

### Log Insights Queries

**Error Summary:**

```sql
fields @timestamp, @message
| filter level >= 50
| stats count() by bin(5m)
```

**Slow Requests:**

```sql
fields @timestamp, path, duration
| filter duration > 2000
| sort duration desc
| limit 20
```

---

## Datadog Setup

### Agent Configuration

```yaml
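# datadog.yaml (main Agent configuration)
logs_enabled: true

# conf.d/customer-portal-bff.d/conf.yaml (log collection; the directory name is an example)
logs:
  - type: file
    path: /var/log/bff/combined.log
    service: customer-portal-bff
    source: nodejs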
```

### Custom Metrics

```typescript
// Example: Report queue metrics to Datadog
import { StatsD } from "hot-shots";

const dogstatsd = new StatsD({ host: "localhost", port: 8125 });

// `metrics` is the `whmcs.metrics` object returned by GET /health/queues
function reportWhmcsQueue(metrics: { queueSize: number; failedRequests: number }): void {
  dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize);
  dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests);
}
```

### Recommended Monitors

1. **Health Check Monitor**
   - Check: HTTP check on `/health`
   - Alert: When status != ok for 2 minutes

2. **Error Rate Monitor**
   - Metric: `portal.errors.count` as a share of total requests
   - Alert: When >5% for 5 minutes

3. **Queue Depth Monitor**
   - Metric: `portal.whmcs.queue_depth`
   - Alert: When >50 for 5 minutes

---

## Alerting Best Practices

### Alert Priority Levels

| Priority    | Response Time | Examples                                      |
| ----------- | ------------- | --------------------------------------------- |
| P1 Critical | 15 minutes    | Portal down, database unreachable             |
| P2 High     | 1 hour        | Provisioning failing, payment processing down |
| P3 Medium   | 4 hours       | Degraded performance, high error rate         |
| P4 Low      | 24 hours      | Minor issues, informational alerts            |
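
The response times in this table can be turned into concrete deadlines for escalation tooling. A minimal sketch, with the minute values copied from the table and the names `Priority` and `responseDeadline` chosen for illustration:

```typescript
type Priority = "P1" | "P2" | "P3" | "P4";

// Response times from the priority table, in minutes.
const responseMinutes: Record<Priority, number> = {
  P1: 15,
  P2: 60,
  P3: 4 * 60,
  P4: 24 * 60,
};

// Latest time by which someone should have responded to an alert.
function responseDeadline(priority: Priority, firedAt: Date): Date {
  return new Date(firedAt.getTime() + responseMinutes[priority] * 60 * 1000);
}
```

For instance, a P1 alert fired at 10:30 UTC has a response deadline of 10:45 UTC.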

### Alert Routing

```yaml
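# Example Alertmanager-style routing (PagerDuty for critical alerts);
# in Alertmanager these entries nest under the top-level `route:` key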
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-ops
  - match:
      severity: info
    receiver: email-team
```
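
The routing rules above are evaluated in order, with the first match deciding the receiver. The sketch below mirrors that behaviour so routing can be unit-tested before an alert ever fires; the fallback receiver for unmatched severities is an assumption, since the configuration above does not define one.

```typescript
interface RouteRule {
  severity: string;
  receiver: string;
}

// Same order and receivers as the routing configuration above.
const routeRules: RouteRule[] = [
  { severity: "critical", receiver: "pagerduty-oncall" },
  { severity: "warning", receiver: "slack-ops" },
  { severity: "info", receiver: "email-team" },
];

// First matching rule wins; unmatched severities go to the fallback receiver.
// The default fallback is an assumption, not part of the original config.
function receiverFor(severity: string, fallback: string = "email-team"): string {
  for (const rule of routeRules) {
    if (rule.severity === severity) return rule.receiver;
  }
  return fallback;
}
```

A test asserting that `receiverFor("critical")` returns `pagerduty-oncall` is a cheap way to cover the "Test alert routing to on-call" item in the checklist.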

### Runbook Links

Include runbook links in all alerts:

- Health check failures → [Incident Response](./incident-response.md)
- Database issues → [Database Operations](./database-operations.md)
- Queue problems → [Queue Management](./queue-management.md)
- External API failures → [External Dependencies](./external-dependencies.md)

---

## Monitoring Checklist

### Initial Setup

- [ ] Configure health endpoint scraping (every 30s)
- [ ] Set up log aggregation (Loki, CloudWatch, or Datadog)
- [ ] Create overview dashboard with key metrics
- [ ] Configure P1/P2 alerts for critical failures
- [ ] Test alert routing to on-call

### Ongoing Maintenance

- [ ] Review alert thresholds quarterly
- [ ] Check for alert fatigue (too many false positives)
- [ ] Update dashboards when new features are deployed
- [ ] Validate runbook links are current

---

## Related Documents

- [Incident Response](./incident-response.md)
- [Logging Guide](./logging.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)

---

**Last Updated:** December 2025