# Monitoring Dashboard Setup
This document provides guidance for setting up monitoring infrastructure for the Customer Portal.

---
## Health Endpoints
The BFF exposes several health check endpoints for monitoring:
| Endpoint | Purpose | Authentication |
| -------------------------------- | ------------------------------------------ | -------------- |
| `GET /health` | Core system health (database, cache) | Public |
| `GET /health/queues` | Request queue metrics (WHMCS, Salesforce) | Public |
| `GET /health/queues/whmcs` | WHMCS queue details | Public |
| `GET /health/queues/salesforce` | Salesforce queue details | Public |
| `GET /api/health/services/cache` | Services cache metrics | Public |
| `GET /auth/health-check` | Integration health (DB, WHMCS, Salesforce) | Public |
### Core Health Response
```json
{
  "status": "ok",
  "checks": {
    "database": "ok",
    "cache": "ok"
  }
}
```
**Status Values:**
- `ok` - All systems healthy
- `degraded` - One or more systems failing
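
A monitoring probe usually needs to know which check failed, not only that the overall status is `degraded`. The sketch below assumes the response shape shown above and assumes that any check value other than `ok` (for example `fail`) counts as failing; `failingChecks` and `summarizeHealth` are hypothetical helper names, not part of the BFF.

```typescript
// Shape of the `GET /health` response shown above.
interface CoreHealth {
  status: "ok" | "degraded";
  checks: Record<string, string>;
}

// Returns the names of all checks whose value is not "ok" (assumption).
function failingChecks(health: CoreHealth): string[] {
  return Object.entries(health.checks)
    .filter(([, state]) => state !== "ok")
    .map(([checkName]) => checkName);
}

// Hypothetical helper: one-line summary suitable for an alert message.
function summarizeHealth(health: CoreHealth): string {
  const failing = failingChecks(health);
  return failing.length === 0
    ? `status=${health.status}`
    : `status=${health.status} failing=${failing.join(",")}`;
}

const degradedSample: CoreHealth = {
  status: "degraded",
  checks: { database: "ok", cache: "fail" },
};
console.log(summarizeHealth(degradedSample)); // status=degraded failing=cache
```

Including the failing check names in the alert text lets the on-call engineer pick the right runbook without opening the endpoint first.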
### Queue Health Response
```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "totalRequests": 1500,
      "completedRequests": 1495,
      "failedRequests": 5,
      "queueSize": 0,
      "pendingRequests": 2,
      "averageWaitTime": 50,
      "averageExecutionTime": 250
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ... },
    "dailyUsage": { "used": 5000, "limit": 15000 }
  }
}
```
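The raw counters are easier to alert on once they are turned into ratios. The following sketch derives a failure rate and a Salesforce quota percentage from the fields shown above; the interface and function names are illustrative assumptions, and only the field names come from the sample response.

```typescript
// Fields of `whmcs.metrics` as shown in the sample response.
interface QueueMetrics {
  totalRequests: number;
  completedRequests: number;
  failedRequests: number;
  queueSize: number;
  pendingRequests: number;
  averageWaitTime: number;
  averageExecutionTime: number;
}

// Fields of `salesforce.dailyUsage` as shown in the sample response.
interface DailyUsage {
  used: number;
  limit: number;
}

// Percentage of requests that failed; 0 when nothing has been processed yet.
function failureRatePercent(m: QueueMetrics): number {
  return m.totalRequests === 0 ? 0 : (m.failedRequests / m.totalRequests) * 100;
}

// Percentage of the daily API quota already consumed; 0 when no limit is reported.
function quotaUsedPercent(u: DailyUsage): number {
  return u.limit === 0 ? 0 : (u.used / u.limit) * 100;
}

const whmcsSample: QueueMetrics = {
  totalRequests: 1500,
  completedRequests: 1495,
  failedRequests: 5,
  queueSize: 0,
  pendingRequests: 2,
  averageWaitTime: 50,
  averageExecutionTime: 250,
};
console.log(failureRatePercent(whmcsSample).toFixed(2)); // 0.33
console.log(quotaUsedPercent({ used: 5000, limit: 15000 }).toFixed(1)); // 33.3
```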
---
## Key Metrics to Monitor
### Application Metrics
| Metric | Source | Warning | Critical | Description |
| ------------------- | --------------- | ------------- | ---------------- | --------------------- |
| Health status | `/health` | `degraded` | Any check `fail` | Core system health |
| Response time (p95) | Logs/APM | >2s | >5s | API response latency |
| Error rate | Logs/APM | >1% | >5% | HTTP 5xx responses |
| Active connections | Node.js metrics | >80% capacity | >95% capacity | Connection pool usage |
### Database Metrics
| Metric | Source | Warning | Critical | Description |
| --------------------- | --------------------- | --------- | --------- | --------------------------- |
| Connection pool usage | PostgreSQL | >80% | >95% | Active connections vs limit |
| Query duration | PostgreSQL logs | >500ms | >2s | Slow query detection |
| Database size | PostgreSQL | >80% disk | >90% disk | Storage capacity |
| Dead tuples | `pg_stat_user_tables` | >10% | >25% | Vacuum needed |
### Cache Metrics
| Metric | Source | Warning | Critical | Description |
| -------------- | ---------------- | -------------- | -------------- | ------------------------- |
| Redis memory | Redis INFO | >80% maxmemory | >95% maxmemory | Memory pressure |
| Cache hit rate | Application logs | <80% | <60% | Cache effectiveness |
| Redis latency | Redis CLI | >10ms | >50ms | Command latency |
| Evictions | Redis INFO | Any | High rate | Memory pressure indicator |
### Queue Metrics
| Metric | Source | Warning | Critical | Description |
| --------------------- | ---------------- | ---------- | ---------- | ---------------------- |
| WHMCS queue size | `/health/queues` | >10 | >50 | Pending WHMCS requests |
| WHMCS failed requests | `/health/queues` | >5 | >20 | Failed API calls |
| SF daily API usage | `/health/queues` | >80% limit | >95% limit | Salesforce API quota |
| BullMQ wait queue | Redis | >10 | >50 | Job backlog |
| BullMQ failed jobs | Redis | >5 | >20 | Processing failures |
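
The warning and critical columns can be applied mechanically. As a minimal sketch, assuming "greater than" semantics as written in the table, the function below classifies a reading against a threshold pair; `classify` and the `queueThresholds` keys are hypothetical names chosen for illustration.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Thresholds copied from the Queue Metrics table (strictly greater than).
const queueThresholds = {
  whmcsQueueSize: { warning: 10, critical: 50 },
  whmcsFailedRequests: { warning: 5, critical: 20 },
  sfDailyUsagePercent: { warning: 80, critical: 95 },
};

// Checks the critical bound first so a value above both bounds is "critical".
function classify(value: number, t: Threshold): Severity {
  if (value > t.critical) return "critical";
  if (value > t.warning) return "warning";
  return "ok";
}

console.log(classify(12, queueThresholds.whmcsQueueSize)); // warning
console.log(classify(96, queueThresholds.sfDailyUsagePercent)); // critical
```

Metrics where a lower value is worse, such as cache hit rate, need the comparison inverted.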
### External Dependency Metrics
| Metric | Source | Warning | Critical | Description |
| ------------------------ | ------ | ------- | -------- | -------------------- |
| Salesforce response time | Logs | >2s | >5s | SF API latency |
| WHMCS response time | Logs | >2s | >5s | WHMCS API latency |
| Freebit response time | Logs | >3s | >10s | Freebit API latency |
| External error rate | Logs | >1% | >5% | Integration failures |
---
## Structured Logging for Metrics
The BFF uses Pino for structured JSON logging. Key fields for metrics extraction:
```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "level": "info",
  "service": "customer-portal-bff",
  "correlationId": "req-123",
  "message": "API call completed",
  "duration": 250,
  "path": "/api/invoices",
  "method": "GET",
  "statusCode": 200
}
```
### Log Queries for Metrics
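**Error Count** (counts every `error`/`fatal` entry in the file; rotate the log hourly or filter on `timestamp` to get a last-hour figure. The pattern assumes Pino's default numeric levels, where 50 is `error` and 60 is `fatal`; if levels are written as labels, match `"level":"error"` instead):
```bash
grep -cE '"level":(50|60)' /var/log/bff/combined.log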
```
**Slow Requests (>2s):**
```bash
# Compare numerically with jq; a "4+ digits" regex would also match 1000-2000 ms
jq -c 'select(.duration > 2000)' /var/log/bff/combined.log | tail -20
```
**External API Errors:**
```bash
grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20
```
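The shell one-liners give counts rather than rates. A small script can compute an actual error percentage from the same newline-delimited JSON. This is a sketch only: it assumes each request log entry carries `statusCode` and `duration` as in the sample entry above, treats HTTP 5xx as errors, and uses made-up log lines as input.

```typescript
interface LogEntry {
  level?: number | string;
  statusCode?: number;
  duration?: number;
}

// Parses newline-delimited JSON, skipping blank or malformed lines.
function parseLogLines(text: string): LogEntry[] {
  const parsed: LogEntry[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      parsed.push(JSON.parse(line) as LogEntry);
    } catch {
      // Ignore lines that are not valid JSON (for example stack traces).
    }
  }
  return parsed;
}

// Percentage of request entries that have an HTTP 5xx status code.
function errorRatePercent(logEntries: LogEntry[]): number {
  const requests = logEntries.filter((e) => typeof e.statusCode === "number");
  if (requests.length === 0) return 0;
  const errors = requests.filter((e) => (e.statusCode ?? 0) >= 500);
  return (errors.length / requests.length) * 100;
}

// Entries slower than the given threshold in milliseconds.
function slowRequests(logEntries: LogEntry[], thresholdMs: number): LogEntry[] {
  return logEntries.filter(
    (e) => typeof e.duration === "number" && e.duration > thresholdMs,
  );
}

// Made-up sample lines in the compact form a JSON logger writes to disk.
const sampleLog = [
  '{"level":"info","path":"/api/invoices","statusCode":200,"duration":250}',
  '{"level":"error","path":"/api/invoices","statusCode":502,"duration":2400}',
  "not json",
  '{"level":"info","path":"/api/invoices","statusCode":200,"duration":1200}',
  '{"level":"info","path":"/api/invoices","statusCode":200,"duration":90}',
].join("\n");

const sampleEntries = parseLogLines(sampleLog);
console.log(errorRatePercent(sampleEntries)); // 25
console.log(slowRequests(sampleEntries, 2000).length); // 1
```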
---
## Grafana Dashboard Setup
### Data Sources
1. **Prometheus** - For application metrics
2. **Loki** - For log aggregation
3. **PostgreSQL** - For database metrics
### Recommended Panels
#### Overview Dashboard
1. **System Health** (Stat panel)
   - Query: `/health` endpoint status
   - Show: ok/degraded indicator
2. **Request Rate** (Graph panel)
   - Source: Prometheus/Loki
   - Show: Requests per second
3. **Error Rate** (Graph panel)
   - Source: Loki log count
   - Filter: `level >= 50`
4. **Response Time (p95)** (Graph panel)
   - Source: Prometheus histogram
   - Show: 95th percentile latency
#### Queue Dashboard
1. **Queue Depths** (Graph panel)
   - Source: `/health/queues` endpoint
   - Show: WHMCS and SF queue sizes
2. **Failed Jobs** (Stat panel)
   - Source: Redis BullMQ metrics
   - Show: Failed job count
3. **Salesforce API Usage** (Gauge panel)
   - Source: `/health/queues/salesforce`
   - Show: Daily usage vs limit
#### Database Dashboard
1. **Connection Pool** (Gauge panel)
   - Source: PostgreSQL `pg_stat_activity`
   - Show: Active connections
2. **Query Performance** (Table panel)
   - Source: PostgreSQL `pg_stat_statements`
   - Show: Slowest queries
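### Sample Prometheus Scrape Config
Prometheus scrapes the text exposition format, while `/health` returns JSON. The config below therefore assumes that the BFF, or an exporter in front of it, serves metrics at `/metrics`; adjust the path to match your deployment.
```yaml
scrape_configs:
  - job_name: "portal-bff"
    # Assumed path: `/health` returns JSON and cannot be scraped directly
    metrics_path: "/metrics"
    scrape_interval: 30s
    static_configs:
      - targets: ["bff:4000"]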
```
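Because the health endpoints return JSON and Prometheus expects its text exposition format, something has to translate between the two. One possible approach is sketched below: a pure function that renders queue values as gauge samples. The metric names (`portal_queue_*`) and the `queue` label are hypothetical, and serving the resulting text over HTTP is left out.

```typescript
// Subset of the per-queue `metrics` object returned by /health/queues.
interface QueueSnapshot {
  queueSize: number;
  pendingRequests: number;
  failedRequests: number;
}

// Renders queue snapshots in the Prometheus text exposition format.
// Metric names are illustrative assumptions, not names used by the BFF.
function toPrometheusText(queues: Record<string, QueueSnapshot>): string {
  const gauges: Array<[string, keyof QueueSnapshot, string]> = [
    ["portal_queue_size", "queueSize", "queueSize reported by /health/queues"],
    ["portal_queue_pending_requests", "pendingRequests", "pendingRequests reported by /health/queues"],
    ["portal_queue_failed_requests", "failedRequests", "failedRequests reported by /health/queues"],
  ];
  const lines: string[] = [];
  for (const [metric, field, help] of gauges) {
    lines.push(`# HELP ${metric} ${help}`);
    lines.push(`# TYPE ${metric} gauge`);
    for (const [queue, snapshot] of Object.entries(queues)) {
      lines.push(`${metric}{queue="${queue}"} ${snapshot[field]}`);
    }
  }
  // The exposition format requires a trailing newline.
  return lines.join("\n") + "\n";
}

console.log(
  toPrometheusText({
    whmcs: { queueSize: 0, pendingRequests: 2, failedRequests: 5 },
  }),
);
```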
---
## CloudWatch Setup (AWS)
### Custom Metrics
Push metrics from health endpoints to CloudWatch:
```bash
# Example: Push queue depth metric
# curl -f makes HTTP errors fail; quoting guards against an empty value
aws cloudwatch put-metric-data \
  --namespace "CustomerPortal" \
  --metric-name "WhmcsQueueDepth" \
  --value "$(curl -sf http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize')" \
  --dimensions Environment=production
```
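When more than one value is pushed per run, it can be simpler to build the whole batch in code. The sketch below only constructs objects in the shape of CloudWatch's `MetricDatum` structure and leaves the actual `PutMetricData` call to whichever AWS client you use; `buildQueueMetricData` and the `WhmcsFailedRequests` metric name are assumptions for illustration.

```typescript
// Subset of the CloudWatch PutMetricData `MetricDatum` structure.
interface MetricDatum {
  MetricName: string;
  Value: number;
  Unit: "Count" | "Percent" | "Milliseconds";
  Dimensions: { Name: string; Value: string }[];
}

// Minimal view of the /health/queues response needed here.
interface QueueHealthSubset {
  whmcs: { metrics: { queueSize: number; failedRequests: number } };
}

// Maps the health response to a batch of metric data for one environment.
function buildQueueMetricData(
  health: QueueHealthSubset,
  environment: string,
): MetricDatum[] {
  const dimensions = [{ Name: "Environment", Value: environment }];
  return [
    {
      MetricName: "WhmcsQueueDepth",
      Value: health.whmcs.metrics.queueSize,
      Unit: "Count",
      Dimensions: dimensions,
    },
    {
      MetricName: "WhmcsFailedRequests", // hypothetical metric name
      Value: health.whmcs.metrics.failedRequests,
      Unit: "Count",
      Dimensions: dimensions,
    },
  ];
}

console.log(
  JSON.stringify(
    buildQueueMetricData(
      { whmcs: { metrics: { queueSize: 3, failedRequests: 1 } } },
      "production",
    ),
    null,
    2,
  ),
);
```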
### Recommended CloudWatch Alarms
| Alarm | Metric | Threshold | Period | Action |
| ------------- | ---------------- | --------- | ------ | ---------------- |
| HighErrorRate | ErrorCount | >10 | 5 min | SNS notification |
| HighLatency | p95 ResponseTime | >2000ms | 5 min | SNS notification |
| QueueBacklog | WhmcsQueueDepth | >50 | 5 min | SNS notification |
| DatabaseDown | HealthStatus | !=ok | 1 min | PagerDuty |
| CacheDown | HealthStatus | !=ok | 1 min | PagerDuty |
### Log Insights Queries
**Error Summary:**
```sql
fields @timestamp, @message
| filter level >= 50
| stats count() by bin(5m)
```
**Slow Requests:**
```sql
fields @timestamp, path, duration
| filter duration > 2000
| sort duration desc
| limit 20
```
---
## DataDog Setup
### Agent Configuration
```yaml
# datadog.yaml
logs_enabled: true

# conf.d/nodejs.d/conf.yaml (log sources are configured per integration)
logs:
  - type: file
    path: /var/log/bff/combined.log
    service: customer-portal-bff
    source: nodejs
```
### Custom Metrics
```typescript
// Example: Report queue metrics to DataDog
import { StatsD } from "hot-shots";

const dogstatsd = new StatsD({ host: "localhost", port: 8125 });

// `metrics` is the `whmcs.metrics` object returned by GET /health/queues
export function reportWhmcsQueue(metrics: { queueSize: number; failedRequests: number }): void {
  dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize);
  dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests);
}
```
### Recommended Monitors
1. **Health Check Monitor**
   - Check: HTTP check on `/health`
   - Alert: When status != ok for 2 minutes
2. **Error Rate Monitor**
   - Metric: `portal.errors.count`
   - Alert: When >5% for 5 minutes
3. **Queue Depth Monitor**
   - Metric: `portal.whmcs.queue_depth`
   - Alert: When >50 for 5 minutes
---
## Alerting Best Practices
### Alert Priority Levels
| Priority | Response Time | Examples |
| ----------- | ------------- | --------------------------------------------- |
| P1 Critical | 15 minutes | Portal down, database unreachable |
| P2 High | 1 hour | Provisioning failing, payment processing down |
| P3 Medium | 4 hours | Degraded performance, high error rate |
| P4 Low | 24 hours | Minor issues, informational alerts |
### Alert Routing
```yaml
# Example PagerDuty routing
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-ops
  - match:
      severity: info
    receiver: email-team
```
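The same severity-to-receiver mapping can be expressed in application code, for instance when alerts are raised by a script rather than by an alert manager. This is a minimal sketch: the receiver names are taken from the routing example above, while `routeAlert`, the `PortalAlert` shape, and the alert title used in the usage lines are hypothetical.

```typescript
type AlertSeverity = "critical" | "warning" | "info";

// Receiver names mirror the routing example above.
const receivers: Record<AlertSeverity, string> = {
  critical: "pagerduty-oncall",
  warning: "slack-ops",
  info: "email-team",
};

interface PortalAlert {
  title: string;
  severity: AlertSeverity;
  runbook?: string; // optional link to the relevant runbook
}

// Picks the receiver for an alert and formats the notification text.
function routeAlert(alert: PortalAlert): { receiver: string; text: string } {
  const runbook = alert.runbook ? ` (runbook: ${alert.runbook})` : "";
  return {
    receiver: receivers[alert.severity],
    text: `[${alert.severity}] ${alert.title}${runbook}`,
  };
}

const routed = routeAlert({
  title: "WHMCS queue backlog",
  severity: "warning",
  runbook: "./queue-management.md",
});
console.log(routed.receiver); // slack-ops
console.log(routed.text); // [warning] WHMCS queue backlog (runbook: ./queue-management.md)
```

Typing the map as `Record<AlertSeverity, string>` makes the compiler flag any severity that has no receiver.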
### Runbook Links
Include runbook links in all alerts:
- Health check failures → [Incident Response](./incident-response.md)
- Database issues → [Database Operations](./database-operations.md)
- Queue problems → [Queue Management](./queue-management.md)
- External API failures → [External Dependencies](./external-dependencies.md)
---
## Monitoring Checklist
### Initial Setup
- [ ] Configure health endpoint scraping (every 30s)
- [ ] Set up log aggregation (Loki, CloudWatch, or DataDog)
- [ ] Create overview dashboard with key metrics
- [ ] Configure P1/P2 alerts for critical failures
- [ ] Test alert routing to on-call
### Ongoing Maintenance
- [ ] Review alert thresholds quarterly
- [ ] Check for alert fatigue (too many false positives)
- [ ] Update dashboards when new features are deployed
- [ ] Validate runbook links are current
---
## Related Documents
- [Incident Response](./incident-response.md)
- [Logging Guide](./logging.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)
---
**Last Updated:** December 2025