Update README.md to Enhance Documentation Clarity and Add New Sections
- Added a new section for Release Procedures, detailing deployment and rollback processes.
- Updated the System Operations section to include Monitoring Setup, Rate Limit Tuning, and Customer Data Management for improved operational guidance.
- Reformatted the table structure for better readability and consistency across documentation.
parent 12eb9fd763
commit 90ab71b94d
@@ -148,14 +148,18 @@ Feature guides explaining how the portal functions:
 | [External Dependencies](./operations/external-dependencies.md) | Integration health checks |
 | [Queue Management](./operations/queue-management.md) | BullMQ job monitoring |
 | [External Processes](./operations/external-processes.md) | Team handoffs and workflows |
+| [Release Procedures](./operations/release-procedures.md) | Deployment and rollback |

 ### System Operations

 | Document | Description |
-| ------------------------------------------------------------------ | -------------------------- |
+| -------------------------------------------------------------------- | -------------------------- |
 | [Logging](./operations/logging.md) | Centralized logging system |
 | [Security Monitoring](./operations/security-monitoring.md) | Security monitoring setup |
 | [Subscription Management](./operations/subscription-management.md) | Service management |
+| [Monitoring Setup](./operations/monitoring-setup.md) | Metrics and dashboards |
+| [Rate Limit Tuning](./operations/rate-limit-tuning.md) | Rate limit configuration |
+| [Customer Data Management](./operations/customer-data-management.md) | GDPR and data procedures |

 ---

@@ -192,10 +196,12 @@ Historical documents kept for reference:
 ### DevOps / Operations

 1. [Deployment](./getting-started/deployment.md)
-2. [Incident Response](./operations/incident-response.md)
-3. [Provisioning Runbook](./operations/provisioning-runbook.md)
-4. [Database Operations](./operations/database-operations.md)
-5. [External Dependencies](./operations/external-dependencies.md)
+2. [Release Procedures](./operations/release-procedures.md)
+3. [Incident Response](./operations/incident-response.md)
+4. [Monitoring Setup](./operations/monitoring-setup.md)
+5. [Database Operations](./operations/database-operations.md)
+6. [External Dependencies](./operations/external-dependencies.md)
+7. [Rate Limit Tuning](./operations/rate-limit-tuning.md)

 ---

415
docs/operations/customer-data-management.md
Normal file
@ -0,0 +1,415 @@
|
||||
# Customer Data Management (GDPR)
|
||||
|
||||
This document covers procedures for handling customer data in compliance with GDPR and data protection regulations.
|
||||
|
||||
---
|
||||
|
||||
## Data Storage Overview
|
||||
|
||||
Customer data is stored across multiple systems:
|
||||
|
||||
| System | Data Stored | Retention | Notes |
|
||||
| ----------------------- | ----------------------------------------------------- | --------------------------- | ---------------------------- |
|
||||
| **Portal (PostgreSQL)** | User accounts, ID mappings, audit logs, notifications | Active account lifetime | Auth data only |
|
||||
| **WHMCS** | Billing, invoices, payment methods, addresses | Legal requirement (7 years) | System of record for billing |
|
||||
| **Salesforce** | CRM data, orders, cases, contacts | Business records | System of record for CRM |
|
||||
| **Redis** | Sessions, cache, rate limits | TTL-based (minutes to days) | Temporary data |
|
||||
|
||||
### Portal Database Tables with PII
|
||||
|
||||
| Table | PII Fields | Purpose |
|
||||
| ---------------------------- | ------------------------------------ | -------------------- |
|
||||
| `users` | `email`, `passwordHash`, `mfaSecret` | Authentication |
|
||||
| `id_mappings` | Links to WHMCS/Salesforce IDs | Identity federation |
|
||||
| `audit_logs` | `ipAddress`, `userAgent`, `userId` | Security audit trail |
|
||||
| `residence_card_submissions` | Document images | ID verification |
|
||||
| `notifications` | User notifications | In-app messaging |
|
||||
| `sim_call_history_*` | Phone numbers, call details | Usage records |
|
||||
| `sim_sms_history` | Phone numbers, SMS details | Usage records |
|
||||
|
||||
---
|
||||
|
||||
## Data Subject Rights
|
||||
|
||||
Under GDPR, customers have the following rights:
|
||||
|
||||
| Right | Portal Support | Notes |
|
||||
| ---------------------- | ------------------ | ------------------------- |
|
||||
| Right of Access | Manual export | See Data Export section |
|
||||
| Right to Rectification | WHMCS self-service | Customer updates in WHMCS |
|
||||
| Right to Erasure | Manual process | See Data Deletion section |
|
||||
| Right to Portability | Manual export | See Data Export section |
|
||||
| Right to Object | Manual process | Opt-out of processing |
|
||||
|
||||
---
|
||||
|
||||
## Data Deletion Procedures
|
||||
|
||||
### Overview
|
||||
|
||||
Complete customer data deletion requires coordination across all systems:
|
||||
|
||||
1. Portal database deletion
|
||||
2. WHMCS account handling
|
||||
3. Salesforce record handling
|
||||
4. Redis cache clearing
|
||||
5. Audit trail retention
|
||||
|
||||
### Pre-Deletion Checklist
|
||||
|
||||
- [ ] Verify customer identity (authentication or CS verification)
|
||||
- [ ] Check for active subscriptions (must be cancelled first)
|
||||
- [ ] Check for unpaid invoices (must be settled first)
|
||||
- [ ] Check legal retention requirements (invoices, tax records)
|
||||
- [ ] Document the deletion request with timestamp
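
The checklist above can also be expressed as a small guard that an operator script runs before any deletion step. This is a minimal sketch only: the `DeletionRequestState` shape and its field names are assumptions for illustration, not part of the portal's code.

```typescript
// Hypothetical snapshot of a deletion request; field names are illustrative.
interface DeletionRequestState {
  identityVerified: boolean;
  activeSubscriptions: number;
  unpaidInvoices: number;
  legalRetentionReviewed: boolean;
  requestLoggedAt: Date | null;
}

// Returns the checklist items that still block deletion (empty array = ready).
function findDeletionBlockers(state: DeletionRequestState): string[] {
  const blockers: string[] = [];
  if (!state.identityVerified) {
    blockers.push("Customer identity not verified");
  }
  if (state.activeSubscriptions > 0) {
    blockers.push("Active subscriptions must be cancelled first");
  }
  if (state.unpaidInvoices > 0) {
    blockers.push("Unpaid invoices must be settled first");
  }
  if (!state.legalRetentionReviewed) {
    blockers.push("Legal retention requirements not reviewed");
  }
  if (state.requestLoggedAt === null) {
    blockers.push("Deletion request not documented with a timestamp");
  }
  return blockers;
}

// Example: one open subscription blocks the request.
const exampleBlockers = findDeletionBlockers({
  identityVerified: true,
  activeSubscriptions: 1,
  unpaidInvoices: 0,
  legalRetentionReviewed: true,
  requestLoggedAt: new Date(),
});
console.log(exampleBlockers);
```

Returning the list of blockers, rather than a single boolean, makes it easier to record in the request log why a deletion was postponed.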
|
||||
|
||||
### Step 1: Portal Database Deletion
|
||||
|
||||
```sql
|
||||
-- 1. Get user information
|
||||
SELECT u.id, u.email, im.whmcs_client_id, im.sf_account_id
|
||||
FROM users u
|
||||
LEFT JOIN id_mappings im ON u.id = im.user_id
|
||||
WHERE u.email = 'customer@example.com';
|
||||
|
||||
-- 2. Delete notifications
|
||||
DELETE FROM notifications WHERE user_id = '<user_id>';
|
||||
|
||||
-- 3. Delete residence card submissions
|
||||
DELETE FROM residence_card_submissions WHERE user_id = '<user_id>';
|
||||
|
||||
-- 4. Delete SIM usage data (if applicable)
|
||||
-- Note: Check if SIM account is linked to this user first
|
||||
DELETE FROM sim_usage_daily WHERE account IN (
|
||||
SELECT account FROM sim_voice_options WHERE account = '<sim_account>'
|
||||
);
|
||||
DELETE FROM sim_call_history_domestic WHERE account = '<sim_account>';
|
||||
DELETE FROM sim_call_history_international WHERE account = '<sim_account>';
|
||||
DELETE FROM sim_sms_history WHERE account = '<sim_account>';
|
||||
DELETE FROM sim_voice_options WHERE account = '<sim_account>';
|
||||
|
||||
-- 5. Delete ID mapping (cascades from user deletion)
|
||||
-- The id_mappings table has onDelete: Cascade
|
||||
|
||||
-- 6. Delete user (cascades audit_logs user reference to NULL, deletes id_mapping)
|
||||
DELETE FROM users WHERE id = '<user_id>';
|
||||
```
|
||||
|
||||
**Using the Mappings Service:**
|
||||
|
||||
```typescript
|
||||
// Delete mapping programmatically (clears cache too)
|
||||
await mappingsService.deleteMapping(userId);
|
||||
```
|
||||
|
||||
### Step 2: Audit Log Handling
|
||||
|
||||
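Audit logs may need to be retained for security compliance. Because deleting the user sets `audit_logs.user_id` to NULL (see the cascade note in Step 1), apply the chosen option before running the final `DELETE FROM users` statement; afterwards the `WHERE user_id = '<user_id>'` filter no longer matches these rows. Options: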
|
||||
|
||||
**Option A: Anonymize (Recommended)**
|
||||
|
||||
```sql
|
||||
-- Anonymize audit logs (keeps security trail, removes PII)
|
||||
UPDATE audit_logs
|
||||
SET user_id = NULL,
|
||||
ip_address = 'ANONYMIZED',
|
||||
user_agent = 'ANONYMIZED',
|
||||
details = jsonb_set(
|
||||
COALESCE(details, '{}'::jsonb),
|
||||
'{anonymized}',
|
||||
'true'::jsonb
|
||||
)
|
||||
WHERE user_id = '<user_id>';
|
||||
```
|
||||
|
||||
**Option B: Delete (If Legally Permitted)**
|
||||
|
||||
```sql
|
||||
DELETE FROM audit_logs WHERE user_id = '<user_id>';
|
||||
```
|
||||
|
||||
### Step 3: Redis Cache Clearing
|
||||
|
||||
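```bash
# Clear user-specific cache keys
# (--scan iterates incrementally; KEYS blocks Redis while it walks the keyspace.
#  xargs -r skips DEL when nothing matches.)
redis-cli --scan --pattern "user:*:<user_id>*" | xargs -r redis-cli DEL
redis-cli --scan --pattern "session:*:<user_id>*" | xargs -r redis-cli DEL
redis-cli --scan --pattern "mapping:*:<user_id>*" | xargs -r redis-cli DEL

# Clear refresh token families
redis-cli --scan --pattern "refresh:user:<user_id>*" | xargs -r redis-cli DEL
# WARNING: "refresh:family:*" matches every user's token families.
# List the keys and delete only those that belong to this user.
redis-cli --scan --pattern "refresh:family:*"

# Rate limit records are keyed by IP, not user. Deleting "auth-login:*" resets
# the login counters for all clients, so list first and remove only what is needed.
redis-cli --scan --pattern "auth-login:*"
```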
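
To check which keys a deletion would touch before running it, the same glob patterns can be applied to a list of key names. The sketch below is an assumption-laden illustration: it supports only the `*` wildcard, and the sample key names in the usage are hypothetical, since the portal's real key layout is not shown in this document.

```typescript
// Glob patterns used in the commands above, for one user.
function userKeyPatterns(userId: string): string[] {
  return [
    `user:*:${userId}*`,
    `session:*:${userId}*`,
    `mapping:*:${userId}*`,
    `refresh:user:${userId}*`,
  ];
}

// Converts a glob that uses only "*" into an anchored regular expression.
function globToRegExp(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, ".*") + "$");
}

// Returns only the keys that match one of the user's patterns.
// Shared prefixes such as "refresh:family:*" and "auth-login:*" are
// deliberately excluded because they are not scoped to a single user.
function selectKeysForUser(allKeys: string[], userId: string): string[] {
  const matchers = userKeyPatterns(userId).map(globToRegExp);
  return allKeys.filter((key) => matchers.some((re) => re.test(key)));
}

// Hypothetical key names, for illustration only.
const sampleKeys = [
  "user:profile:abc",
  "user:profile:xyz",
  "session:web:abc:1",
  "refresh:user:abc",
  "refresh:family:f1",
  "auth-login:somehash",
];
console.log(selectKeysForUser(sampleKeys, "abc"));
```

Running a dry selection like this first makes it easier to confirm that no other customer's keys are caught by a pattern.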
|
||||
|
||||
### Step 4: WHMCS Account Handling
|
||||
|
||||
WHMCS does not support full account deletion. Options:
|
||||
|
||||
**Option A: Close Account (Recommended)**
|
||||
|
||||
1. Cancel all active services
|
||||
2. Set account status to "Closed"
|
||||
3. Anonymize personal fields via WHMCS Admin
|
||||
4. Document closure date
|
||||
|
||||
**Option B: Anonymize via API**
|
||||
|
||||
```bash
|
||||
# Update client to anonymized data
|
||||
curl -X POST "$WHMCS_API_URL" \
|
||||
-d "identifier=$WHMCS_API_IDENTIFIER" \
|
||||
-d "secret=$WHMCS_API_SECRET" \
|
||||
-d "action=UpdateClient" \
|
||||
-d "clientid=<whmcs_client_id>" \
|
||||
-d "firstname=Deleted" \
|
||||
-d "lastname=User" \
|
||||
-d "email=deleted_<whmcs_client_id>@deleted.local" \
|
||||
-d "address1=Deleted" \
|
||||
-d "city=Deleted" \
|
||||
-d "state=Deleted" \
|
||||
-d "postcode=000-0000" \
|
||||
-d "phonenumber=000-0000-0000" \
|
||||
-d "status=Closed" \
|
||||
-d "responsetype=json"
|
||||
```
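
The same request body can be built programmatically, which avoids hand-editing the placeholder values. This is a minimal sketch: `buildAnonymizeClientParams` is a hypothetical helper name, and the field values simply mirror the curl example above.

```typescript
// Builds the form body for the UpdateClient call shown in the curl example.
function buildAnonymizeClientParams(
  clientId: number,
  identifier: string,
  secret: string,
): URLSearchParams {
  return new URLSearchParams({
    identifier,
    secret,
    action: "UpdateClient",
    clientid: String(clientId),
    firstname: "Deleted",
    lastname: "User",
    email: `deleted_${clientId}@deleted.local`,
    address1: "Deleted",
    city: "Deleted",
    state: "Deleted",
    postcode: "000-0000",
    phonenumber: "000-0000-0000",
    status: "Closed",
    responsetype: "json",
  });
}

// Example with placeholder credentials; real values come from the environment.
const anonymizeParams = buildAnonymizeClientParams(42, "api-identifier", "api-secret");
console.log(anonymizeParams.get("email"));
```

The resulting `URLSearchParams` object can be passed as the body of a POST request to the WHMCS API URL.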
|
||||
|
||||
### Step 5: Salesforce Record Handling
|
||||
|
||||
Salesforce records often have legal retention requirements:
|
||||
|
||||
**For Personal Data:**
|
||||
|
||||
1. Work with Salesforce Admin
|
||||
2. Consider anonymization vs deletion
|
||||
3. Check integration impact (linked Orders, Cases)
|
||||
|
||||
**Anonymization Approach:**
|
||||
|
||||
- Update Account name to "Deleted Account - [ID]"
|
||||
- Clear personal fields (phone, address if not needed)
|
||||
- Keep transactional records with anonymized references
|
||||
|
||||
---
|
||||
|
||||
## Data Export Procedures
|
||||
|
||||
### Customer Data Export Request
|
||||
|
||||
When a customer requests their data:
|
||||
|
||||
#### 1. Portal Data Export
|
||||
|
||||
```sql
|
||||
-- Export user data
|
||||
SELECT
|
||||
u.id,
|
||||
u.email,
|
||||
u.email_verified,
|
||||
u.created_at,
|
||||
u.last_login_at,
|
||||
im.whmcs_client_id,
|
||||
im.sf_account_id
|
||||
FROM users u
|
||||
LEFT JOIN id_mappings im ON u.id = im.user_id
|
||||
WHERE u.email = 'customer@example.com';
|
||||
|
||||
-- Export audit log (security events)
|
||||
SELECT
|
||||
action,
|
||||
resource,
|
||||
success,
|
||||
created_at
|
||||
FROM audit_logs
|
||||
WHERE user_id = '<user_id>'
|
||||
ORDER BY created_at DESC;
|
||||
|
||||
-- Export notifications
|
||||
SELECT
|
||||
type,
|
||||
title,
|
||||
message,
|
||||
read,
|
||||
created_at
|
||||
FROM notifications
|
||||
WHERE user_id = '<user_id>'
|
||||
ORDER BY created_at DESC;
|
||||
|
||||
-- Export SIM usage history (if applicable)
|
||||
SELECT
|
||||
call_date,
|
||||
call_time,
|
||||
called_to,
|
||||
duration_sec,
|
||||
charge_yen
|
||||
FROM sim_call_history_domestic
|
||||
WHERE account = '<sim_account>'
|
||||
ORDER BY call_date DESC;
|
||||
```
|
||||
|
||||
#### 2. WHMCS Data Export
|
||||
|
||||
Request via WHMCS Admin:
|
||||
|
||||
- Client Details
|
||||
- Invoices
|
||||
- Services/Subscriptions
|
||||
- Tickets/Support History
|
||||
- Transaction History
|
||||
|
||||
#### 3. Salesforce Data Export
|
||||
|
||||
Request via Salesforce Admin:
|
||||
|
||||
- Account record
|
||||
- Contact record
|
||||
- Order history
|
||||
- Case history
|
||||
- Opportunities
|
||||
|
||||
### Export Format
|
||||
|
||||
Provide data in machine-readable format:
|
||||
|
||||
- JSON for structured data
|
||||
- CSV for tabular data
|
||||
- PDF for documents (invoices)
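
For the tabular part of an export, rows returned by the queries above can be serialized to CSV with proper quoting. The sketch below is illustrative and assumes flat rows of primitive values; `toCsv` and `escapeCsvField` are hypothetical helper names.

```typescript
type CsvValue = string | number | boolean | null;
type CsvRow = Record<string, CsvValue>;

// Quotes a field when it contains a comma, quote, or line break.
function escapeCsvField(value: CsvValue): string {
  const text = value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serializes rows to CSV, using the keys of the first row as the header.
function toCsv(rows: CsvRow[]): string {
  const first = rows[0];
  if (first === undefined) {
    return "";
  }
  const headers = Object.keys(first);
  const lines = [headers.map(escapeCsvField).join(",")];
  for (const row of rows) {
    lines.push(headers.map((h) => escapeCsvField(row[h] ?? null)).join(","));
  }
  return lines.join("\r\n");
}

// Example row shaped like the call history export query.
const exportCsv = toCsv([
  { call_date: "2025-01-15", called_to: "Tokyo, JP", duration_sec: 60 },
]);
console.log(exportCsv);
```

For structured data, `JSON.stringify(rows, null, 2)` on the same rows produces the JSON variant.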
|
||||
|
||||
---
|
||||
|
||||
## PII Handling During Debugging
|
||||
|
||||
### Safe Logging Practices
|
||||
|
||||
The BFF uses Pino with automatic PII redaction. Sensitive fields are sanitized:
|
||||
|
||||
```json
|
||||
{
|
||||
"email": "cust***@example.com",
|
||||
"password": "[REDACTED]",
|
||||
"token": "[REDACTED]",
|
||||
"authorization": "[REDACTED]"
|
||||
}
|
||||
```
|
||||
|
||||
### What NOT to Log
|
||||
|
||||
- Full email addresses (use masked version)
|
||||
- Passwords or password hashes
|
||||
- JWT tokens
|
||||
- API keys or secrets
|
||||
- Credit card numbers
|
||||
- Full phone numbers
|
||||
- Full addresses
|
||||
- ID document contents
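
When log fields are assembled by hand (for example in a one-off script that bypasses the BFF logger), the same rules can be applied manually. This is a minimal sketch, not the BFF's actual redaction code: the key list and the four-character email prefix are assumptions chosen to match the sample output shown earlier.

```typescript
// Keys whose values are always replaced (assumed list, based on the sample output).
const REDACTED_KEYS = new Set(["password", "token", "authorization"]);

// Keeps up to four characters of the local part, then masks the rest.
function maskEmail(email: string): string {
  const at = email.indexOf("@");
  if (at < 0) {
    return "[REDACTED]";
  }
  return `${email.slice(0, Math.min(4, at))}***${email.slice(at)}`;
}

// Returns a shallow copy of the fields that is safe to log.
function sanitizeForLog(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    const lowered = key.toLowerCase();
    if (REDACTED_KEYS.has(lowered)) {
      safe[key] = "[REDACTED]";
    } else if (lowered === "email" && typeof value === "string") {
      safe[key] = maskEmail(value);
    } else {
      safe[key] = value;
    }
  }
  return safe;
}

console.log(sanitizeForLog({ email: "customer@example.com", password: "hunter2" }));
```

The copy is shallow, so nested objects would need the same treatment before they are logged.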
|
||||
|
||||
### Safe Debug Queries
|
||||
|
||||
```sql
|
||||
-- Use ID instead of email for lookups
|
||||
SELECT * FROM users WHERE id = '<uuid>';
|
||||
|
||||
-- Mask PII in query results
|
||||
SELECT
|
||||
id,
|
||||
CONCAT(LEFT(email, 3), '***', SUBSTRING(email FROM POSITION('@' IN email))) as masked_email,
|
||||
created_at
|
||||
FROM users
|
||||
WHERE id = '<uuid>';
|
||||
```
|
||||
|
||||
### Production Debugging
|
||||
|
||||
When investigating production issues:
|
||||
|
||||
1. **Use correlation IDs** - Search logs by request ID, not user email
|
||||
2. **Access minimal data** - Only query what's needed
|
||||
3. **Document access** - Note why you accessed customer data
|
||||
4. **Use anonymized exports** - When sharing data for analysis
|
||||
|
||||
---
|
||||
|
||||
## Data Retention Policies
|
||||
|
||||
### Recommended Retention Periods
|
||||
|
||||
| Data Type | Retention | Justification |
|
||||
| ------------------------ | ---------- | ---------------------- |
|
||||
| Active user accounts | Indefinite | Active service |
|
||||
| Closed accounts (portal) | 30 days | Grace period |
|
||||
| Audit logs | 2 years | Security compliance |
|
||||
| Session data (Redis) | 24 hours | Active sessions |
|
||||
| Rate limit data | 15 minutes | Operational |
|
||||
| Invoices | 7 years | Tax/legal requirement |
|
||||
| Support cases | 5 years | Service history |
|
||||
| Call/SMS history | 6 months | Billing reconciliation |
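
The fixed periods in this table can be encoded so that cleanup jobs and reviews share one definition. The sketch below is illustrative: the key names are hypothetical, and years and months are approximated in days (365 per year, 183 for six months), which is an assumption rather than a stated policy.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Retention periods from the table, expressed in days (approximate).
const RETENTION_DAYS = {
  closedAccount: 30,
  auditLog: 2 * 365,
  invoice: 7 * 365,
  supportCase: 5 * 365,
  callSmsHistory: 183,
} as const;

type RetainedDataType = keyof typeof RETENTION_DAYS;

// True when a record's retention period, counted from `since`, has elapsed.
function isPastRetention(
  dataType: RetainedDataType,
  since: Date,
  now: Date = new Date(),
): boolean {
  const ageDays = (now.getTime() - since.getTime()) / DAY_MS;
  return ageDays > RETENTION_DAYS[dataType];
}

// Example: an account closed 45 days before the reference date.
console.log(
  isPastRetention(
    "closedAccount",
    new Date("2025-01-01T00:00:00Z"),
    new Date("2025-02-15T00:00:00Z"),
  ),
);
```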
|
||||
|
||||
### Automated Cleanup
|
||||
|
||||
```sql
|
||||
-- Delete expired notifications (30 days after expiry)
|
||||
DELETE FROM notifications
|
||||
WHERE expires_at < NOW() - INTERVAL '30 days';
|
||||
|
||||
-- Anonymize old audit logs (over 2 years)
|
||||
UPDATE audit_logs
|
||||
SET ip_address = 'EXPIRED',
|
||||
user_agent = 'EXPIRED'
|
||||
WHERE created_at < NOW() - INTERVAL '2 years'
|
||||
AND ip_address != 'EXPIRED';
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Compliance Checklist
|
||||
|
||||
### Monthly Review
|
||||
|
||||
- [ ] Review data access logs for unusual patterns
|
||||
- [ ] Verify automated cleanup jobs are running
|
||||
- [ ] Check for pending deletion requests
|
||||
- [ ] Review new data collection points
|
||||
|
||||
### Quarterly Review
|
||||
|
||||
- [ ] Audit third-party data sharing
|
||||
- [ ] Review retention policies
|
||||
- [ ] Update data inventory if schema changed
|
||||
- [ ] Staff training on data handling
|
||||
|
||||
### Annual Review
|
||||
|
||||
- [ ] Full data protection impact assessment
|
||||
- [ ] Policy review and updates
|
||||
- [ ] Vendor compliance verification
|
||||
- [ ] Documentation updates
|
||||
|
||||
---
|
||||
|
||||
## Emergency Data Breach Response
|
||||
|
||||
If a data breach is suspected:
|
||||
|
||||
1. **Contain** - Isolate affected systems
|
||||
2. **Assess** - Determine scope and data exposed
|
||||
3. **Notify** - Inform DPO/legal within 24 hours
|
||||
4. **Report** - GDPR requires notifying the supervisory authority within 72 hours of becoming aware of the breach
|
||||
5. **Remediate** - Fix vulnerability and prevent recurrence
|
||||
6. **Document** - Full incident report
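
The 24-hour and 72-hour windows in the steps above are easy to miscalculate under pressure, so it can help to compute them once when the incident is opened. A minimal sketch, assuming both windows start at the moment the breach is detected; `breachDeadlines` is a hypothetical helper name.

```typescript
const HOUR_MS = 60 * 60 * 1000;

interface BreachDeadlines {
  notifyDpoBy: Date; // internal DPO/legal notification (24 hours)
  reportBy: Date; // GDPR notification deadline (72 hours)
}

// Computes both deadlines from the detection time.
function breachDeadlines(detectedAt: Date): BreachDeadlines {
  return {
    notifyDpoBy: new Date(detectedAt.getTime() + 24 * HOUR_MS),
    reportBy: new Date(detectedAt.getTime() + 72 * HOUR_MS),
  };
}

const exampleDeadlines = breachDeadlines(new Date("2025-12-01T09:00:00Z"));
console.log(exampleDeadlines.notifyDpoBy.toISOString());
console.log(exampleDeadlines.reportBy.toISOString());
```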
|
||||
|
||||
See [Incident Response](./incident-response.md) for general incident procedures.
|
||||
|
||||
---
|
||||
|
||||
## Related Documents
|
||||
|
||||
- [Incident Response](./incident-response.md)
|
||||
- [Database Operations](./database-operations.md)
|
||||
- [Logging Guide](./logging.md)
|
||||
- [Security Monitoring](./security-monitoring.md)
|
||||
|
||||
---
|
||||
|
||||
**Last Updated:** December 2025
|
||||
375
docs/operations/monitoring-setup.md
Normal file
@ -0,0 +1,375 @@
|
||||
# Monitoring Dashboard Setup
|
||||
|
||||
This document provides guidance for setting up monitoring infrastructure for the Customer Portal.
|
||||
|
||||
---
|
||||
|
||||
## Health Endpoints
|
||||
|
||||
The BFF exposes several health check endpoints for monitoring:
|
||||
|
||||
| Endpoint | Purpose | Authentication |
|
||||
| ------------------------------- | ------------------------------------------ | -------------- |
|
||||
| `GET /health` | Core system health (database, cache) | Public |
|
||||
| `GET /health/queues` | Request queue metrics (WHMCS, Salesforce) | Public |
|
||||
| `GET /health/queues/whmcs` | WHMCS queue details | Public |
|
||||
| `GET /health/queues/salesforce` | Salesforce queue details | Public |
|
||||
| `GET /health/catalog/cache` | Catalog cache metrics | Public |
|
||||
| `GET /auth/health-check` | Integration health (DB, WHMCS, Salesforce) | Public |
|
||||
|
||||
### Core Health Response
|
||||
|
||||
```json
|
||||
{
|
||||
"status": "ok",
|
||||
"checks": {
|
||||
"database": "ok",
|
||||
"cache": "ok"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Status Values:**
|
||||
|
||||
- `ok` - All systems healthy
|
||||
- `degraded` - One or more systems failing
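
A monitoring client can model this response with a small type and derive which checks are failing. This is a sketch under assumptions: the per-check failure value `fail` is inferred from the thresholds table later in this document, and the helper names are hypothetical.

```typescript
type CheckStatus = "ok" | "fail";

interface HealthResponse {
  status: "ok" | "degraded";
  checks: Record<string, CheckStatus>;
}

// Names of the checks that are not "ok".
function failingChecks(health: HealthResponse): string[] {
  return Object.entries(health.checks)
    .filter(([, status]) => status !== "ok")
    .map(([name]) => name);
}

// Derives the top-level status from the individual checks.
function deriveStatus(checks: Record<string, CheckStatus>): "ok" | "degraded" {
  return Object.values(checks).every((status) => status === "ok") ? "ok" : "degraded";
}

const degradedExample: HealthResponse = {
  status: "degraded",
  checks: { database: "ok", cache: "fail" },
};
console.log(failingChecks(degradedExample));
```

Alert text that names the failing check (for example `cache`) is usually more actionable than one that only reports `degraded`.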
|
||||
|
||||
### Queue Health Response
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "2025-01-15T10:30:00.000Z",
|
||||
"whmcs": {
|
||||
"health": "healthy",
|
||||
"metrics": {
|
||||
"totalRequests": 1500,
|
||||
"completedRequests": 1495,
|
||||
"failedRequests": 5,
|
||||
"queueSize": 0,
|
||||
"pendingRequests": 2,
|
||||
"averageWaitTime": 50,
|
||||
"averageExecutionTime": 250
|
||||
}
|
||||
},
|
||||
"salesforce": {
|
||||
"health": "healthy",
|
||||
"metrics": { ... },
|
||||
"dailyUsage": { "used": 5000, "limit": 15000 }
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Key Metrics to Monitor
|
||||
|
||||
### Application Metrics
|
||||
|
||||
| Metric | Source | Warning | Critical | Description |
|
||||
| ------------------- | --------------- | ------------- | ---------------- | --------------------- |
|
||||
| Health status | `/health` | `degraded` | Any check `fail` | Core system health |
|
||||
| Response time (p95) | Logs/APM | >2s | >5s | API response latency |
|
||||
| Error rate | Logs/APM | >1% | >5% | HTTP 5xx responses |
|
||||
| Active connections | Node.js metrics | >80% capacity | >95% capacity | Connection pool usage |
|
||||
|
||||
### Database Metrics
|
||||
|
||||
| Metric | Source | Warning | Critical | Description |
|
||||
| --------------------- | --------------------- | --------- | --------- | --------------------------- |
|
||||
| Connection pool usage | PostgreSQL | >80% | >95% | Active connections vs limit |
|
||||
| Query duration | PostgreSQL logs | >500ms | >2s | Slow query detection |
|
||||
| Database size | PostgreSQL | >80% disk | >90% disk | Storage capacity |
|
||||
| Dead tuples | `pg_stat_user_tables` | >10% | >25% | Vacuum needed |
|
||||
|
||||
### Cache Metrics
|
||||
|
||||
| Metric | Source | Warning | Critical | Description |
|
||||
| -------------- | ---------------- | -------------- | -------------- | ------------------------- |
|
||||
| Redis memory | Redis INFO | >80% maxmemory | >95% maxmemory | Memory pressure |
|
||||
| Cache hit rate | Application logs | <80% | <60% | Cache effectiveness |
|
||||
| Redis latency | Redis CLI | >10ms | >50ms | Command latency |
|
||||
| Evictions | Redis INFO | Any | High rate | Memory pressure indicator |
|
||||
|
||||
### Queue Metrics
|
||||
|
||||
| Metric | Source | Warning | Critical | Description |
|
||||
| --------------------- | ---------------- | ---------- | ---------- | ---------------------- |
|
||||
| WHMCS queue size | `/health/queues` | >10 | >50 | Pending WHMCS requests |
|
||||
| WHMCS failed requests | `/health/queues` | >5 | >20 | Failed API calls |
|
||||
| SF daily API usage | `/health/queues` | >80% limit | >95% limit | Salesforce API quota |
|
||||
| BullMQ wait queue | Redis | >10 | >50 | Job backlog |
|
||||
| BullMQ failed jobs | Redis | >5 | >20 | Processing failures |
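
The endpoint-based thresholds in this table can be evaluated directly against a `/health/queues` response. The sketch below is illustrative only: the type and function names are hypothetical, and it covers just the three metrics sourced from the endpoint, not the BullMQ metrics read from Redis.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Thresholds copied from the table above.
const QUEUE_THRESHOLDS = {
  whmcsQueueSize: { warning: 10, critical: 50 },
  whmcsFailedRequests: { warning: 5, critical: 20 },
  sfDailyUsageRatio: { warning: 0.8, critical: 0.95 },
} as const;

// Strictly-greater comparison, matching the ">" notation in the table.
function classify(value: number, threshold: Threshold): Severity {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}

// Only the fields needed from the /health/queues response.
interface QueueHealth {
  whmcs: { metrics: { queueSize: number; failedRequests: number } };
  salesforce: { dailyUsage: { used: number; limit: number } };
}

function evaluateQueueHealth(
  health: QueueHealth,
): Record<keyof typeof QUEUE_THRESHOLDS, Severity> {
  const usage = health.salesforce.dailyUsage;
  return {
    whmcsQueueSize: classify(health.whmcs.metrics.queueSize, QUEUE_THRESHOLDS.whmcsQueueSize),
    whmcsFailedRequests: classify(
      health.whmcs.metrics.failedRequests,
      QUEUE_THRESHOLDS.whmcsFailedRequests,
    ),
    sfDailyUsageRatio: classify(usage.used / usage.limit, QUEUE_THRESHOLDS.sfDailyUsageRatio),
  };
}

// Values taken from the sample queue health response.
const sampleEvaluation = evaluateQueueHealth({
  whmcs: { metrics: { queueSize: 0, failedRequests: 5 } },
  salesforce: { dailyUsage: { used: 5000, limit: 15000 } },
});
console.log(sampleEvaluation);
```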
|
||||
|
||||
### External Dependency Metrics
|
||||
|
||||
| Metric | Source | Warning | Critical | Description |
|
||||
| ------------------------ | ------ | ------- | -------- | -------------------- |
|
||||
| Salesforce response time | Logs | >2s | >5s | SF API latency |
|
||||
| WHMCS response time | Logs | >2s | >5s | WHMCS API latency |
|
||||
| Freebit response time | Logs | >3s | >10s | Freebit API latency |
|
||||
| External error rate | Logs | >1% | >5% | Integration failures |
|
||||
|
||||
---
|
||||
|
||||
## Structured Logging for Metrics
|
||||
|
||||
The BFF uses Pino for structured JSON logging. Key fields for metrics extraction:
|
||||
|
||||
```json
|
||||
{
|
||||
"timestamp": "2025-01-15T10:30:00.000Z",
|
||||
"level": "info",
|
||||
"service": "customer-portal-bff",
|
||||
"correlationId": "req-123",
|
||||
"message": "API call completed",
|
||||
"duration": 250,
|
||||
"path": "/api/invoices",
|
||||
"method": "GET",
|
||||
"statusCode": 200
|
||||
}
|
||||
```
|
||||
|
||||
### Log Queries for Metrics
|
||||
|
||||
**Error Count (all entries in the log file):**
|
||||
|
||||
```bash
|
||||
grep '"level":50' /var/log/bff/combined.log | wc -l
|
||||
```
|
||||
|
||||
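**Slow Requests (>2s):**

```bash
# Select entries whose duration exceeds 2000 ms, assuming `duration` is in milliseconds.
# (Matching four or more digits with grep would also include 1000-1999 ms.)
jq -c 'select(.duration > 2000)' /var/log/bff/combined.log | tail -20
```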
|
||||
|
||||
**External API Errors:**
|
||||
|
||||
```bash
|
||||
grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20
|
||||
```
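
When grep is not precise enough, the same figures can be computed from the JSON lines directly. This is a minimal sketch, assuming one JSON object per line with the `statusCode`, `duration` (milliseconds), and `path` fields from the sample entry above; the function names are hypothetical.

```typescript
interface LogEntry {
  statusCode?: number;
  duration?: number;
  path?: string;
}

// Parses newline-delimited JSON, skipping blank and malformed lines.
function parseLogLines(text: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (parsed !== null && typeof parsed === "object") {
        entries.push(parsed as LogEntry);
      }
    } catch {
      // Not JSON (for example a stack trace line); ignore it.
    }
  }
  return entries;
}

// Error rate is the share of HTTP 5xx responses among entries with a status code.
function summarize(entries: LogEntry[], slowMs: number = 2000) {
  const requests = entries.filter((e) => typeof e.statusCode === "number");
  const errors = requests.filter((e) => (e.statusCode as number) >= 500).length;
  const slowPaths = entries
    .filter((e) => typeof e.duration === "number" && e.duration > slowMs)
    .map((e) => e.path ?? "(unknown)");
  return {
    requests: requests.length,
    errors,
    errorRate: requests.length === 0 ? 0 : errors / requests.length,
    slowPaths,
  };
}

const logSummary = summarize(
  parseLogLines('{"statusCode":200,"duration":250,"path":"/api/invoices"}\n'),
);
console.log(logSummary);
```

Computing the rate as errors divided by requests matches the percentage thresholds in the metrics tables, which a raw line count does not.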
|
||||
|
||||
---
|
||||
|
||||
## Grafana Dashboard Setup
|
||||
|
||||
### Data Sources
|
||||
|
||||
1. **Prometheus** - For application metrics
|
||||
2. **Loki** - For log aggregation
|
||||
3. **PostgreSQL** - For database metrics
|
||||
|
||||
### Recommended Panels
|
||||
|
||||
#### Overview Dashboard
|
||||
|
||||
1. **System Health** (Stat panel)
|
||||
- Query: `/health` endpoint status
|
||||
- Show: ok/degraded indicator
|
||||
|
||||
2. **Request Rate** (Graph panel)
|
||||
- Source: Prometheus/Loki
|
||||
- Show: Requests per second
|
||||
|
||||
3. **Error Rate** (Graph panel)
|
||||
- Source: Loki log count
|
||||
- Filter: `level >= 50`
|
||||
|
||||
4. **Response Time (p95)** (Graph panel)
|
||||
- Source: Prometheus histogram
|
||||
- Show: 95th percentile latency
|
||||
|
||||
#### Queue Dashboard
|
||||
|
||||
1. **Queue Depths** (Graph panel)
|
||||
- Source: `/health/queues` endpoint
|
||||
- Show: WHMCS and SF queue sizes
|
||||
|
||||
2. **Failed Jobs** (Stat panel)
|
||||
- Source: Redis BullMQ metrics
|
||||
- Show: Failed job count
|
||||
|
||||
3. **Salesforce API Usage** (Gauge panel)
|
||||
- Source: `/health/queues/salesforce`
|
||||
- Show: Daily usage vs limit
|
||||
|
||||
#### Database Dashboard
|
||||
|
||||
1. **Connection Pool** (Gauge panel)
|
||||
- Source: PostgreSQL `pg_stat_activity`
|
||||
- Show: Active connections
|
||||
|
||||
2. **Query Performance** (Table panel)
|
||||
- Source: PostgreSQL `pg_stat_statements`
|
||||
- Show: Slowest queries
|
||||
|
||||
### Sample Prometheus Scrape Config
|
||||
|
||||
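```yaml
# NOTE: Prometheus scrapes the text exposition format. The /health response shown
# earlier is JSON, so this job only works if that path serves Prometheus-format
# metrics or an exporter translates the JSON response.
scrape_configs:
  - job_name: "portal-bff"
    static_configs:
      - targets: ["bff:4000"]
    metrics_path: "/health"
    scrape_interval: 30s
```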
|
||||
|
||||
---
|
||||
|
||||
## CloudWatch Setup (AWS)
|
||||
|
||||
### Custom Metrics
|
||||
|
||||
Push metrics from health endpoints to CloudWatch:
|
||||
|
||||
```bash
|
||||
# Example: Push queue depth metric
|
||||
aws cloudwatch put-metric-data \
|
||||
--namespace "CustomerPortal" \
|
||||
--metric-name "WhmcsQueueDepth" \
|
||||
--value $(curl -s http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize') \
|
||||
--dimensions Environment=production
|
||||
```
|
||||
|
||||
### Recommended CloudWatch Alarms
|
||||
|
||||
| Alarm | Metric | Threshold | Period | Action |
|
||||
| ------------- | ---------------- | --------- | ------ | ---------------- |
|
||||
| HighErrorRate | ErrorCount | >10 | 5 min | SNS notification |
|
||||
| HighLatency | p95 ResponseTime | >2000ms | 5 min | SNS notification |
|
||||
| QueueBacklog | WhmcsQueueDepth | >50 | 5 min | SNS notification |
|
||||
| DatabaseDown | HealthStatus | !=ok | 1 min | PagerDuty |
|
||||
| CacheDown | HealthStatus | !=ok | 1 min | PagerDuty |
|
||||
|
||||
### Log Insights Queries
|
||||
|
||||
**Error Summary:**
|
||||
|
||||
```sql
|
||||
fields @timestamp, @message
|
||||
| filter level >= 50
|
||||
| stats count() by bin(5m)
|
||||
```
|
||||
|
||||
**Slow Requests:**
|
||||
|
||||
```sql
|
||||
fields @timestamp, path, duration
|
||||
| filter duration > 2000
|
||||
| sort duration desc
|
||||
| limit 20
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## DataDog Setup
|
||||
|
||||
### Agent Configuration
|
||||
|
||||
```yaml
# datadog.yaml (main Agent configuration)
logs_enabled: true

# conf.d/<integration>.d/conf.yaml
# (log sources are configured per integration, not in datadog.yaml)
logs:
  - type: file
    path: /var/log/bff/combined.log
    service: customer-portal-bff
    source: nodejs
```
|
||||
|
||||
### Custom Metrics
|
||||
|
||||
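```typescript
// Example: Report queue metrics to DataDog
import { StatsD } from "hot-shots";

const dogstatsd = new StatsD({ host: "localhost", port: 8125 });

async function reportWhmcsQueueMetrics(): Promise<void> {
  // Read the current metrics from the queue health endpoint
  const response = await fetch("http://localhost:4000/health/queues");
  const body = (await response.json()) as {
    whmcs: { metrics: { queueSize: number; failedRequests: number } };
  };
  const { metrics } = body.whmcs;

  // Report queue depth
  dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize);
  dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests);
}
```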
|
||||
|
||||
### Recommended Monitors
|
||||
|
||||
1. **Health Check Monitor**
|
||||
- Check: HTTP check on `/health`
|
||||
- Alert: When status != ok for 2 minutes
|
||||
|
||||
2. **Error Rate Monitor**
|
||||
- Metric: `portal.errors.count`
|
||||
- Alert: When >5% for 5 minutes
|
||||
|
||||
3. **Queue Depth Monitor**
|
||||
- Metric: `portal.whmcs.queue_depth`
|
||||
- Alert: When >50 for 5 minutes
|
||||
|
||||
---
|
||||
|
||||
## Alerting Best Practices
|
||||
|
||||
### Alert Priority Levels
|
||||
|
||||
| Priority | Response Time | Examples |
|
||||
| ----------- | ------------- | --------------------------------------------- |
|
||||
| P1 Critical | 15 minutes | Portal down, database unreachable |
|
||||
| P2 High | 1 hour | Provisioning failing, payment processing down |
|
||||
| P3 Medium | 4 hours | Degraded performance, high error rate |
|
||||
| P4 Low | 24 hours | Minor issues, informational alerts |
|
||||
|
||||
### Alert Routing
|
||||
|
||||
```yaml
# Example PagerDuty routing
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-ops
  - match:
      severity: info
    receiver: email-team
```
|
||||
|
||||
### Runbook Links
|
||||
|
||||
Include runbook links in all alerts:
|
||||
|
||||
- Health check failures → [Incident Response](./incident-response.md)
|
||||
- Database issues → [Database Operations](./database-operations.md)
|
||||
- Queue problems → [Queue Management](./queue-management.md)
|
||||
- External API failures → [External Dependencies](./external-dependencies.md)
|
||||
|
||||
---
|
||||
|
||||
## Monitoring Checklist
|
||||
|
||||
### Initial Setup
|
||||
|
||||
- [ ] Configure health endpoint scraping (every 30s)
|
||||
- [ ] Set up log aggregation (Loki, CloudWatch, or DataDog)
|
||||
- [ ] Create overview dashboard with key metrics
|
||||
- [ ] Configure P1/P2 alerts for critical failures
|
||||
- [ ] Test alert routing to on-call
|
||||
|
||||
### Ongoing Maintenance
|
||||
|
||||
- [ ] Review alert thresholds quarterly
|
||||
- [ ] Check for alert fatigue (too many false positives)
|
||||
- [ ] Update dashboards when new features are deployed
|
||||
- [ ] Validate runbook links are current
|
||||
|
||||
---
|
||||
|
||||
## Related Documents
|
||||
|
||||
- [Incident Response](./incident-response.md)
|
||||
- [Logging Guide](./logging.md)
|
||||
- [External Dependencies](./external-dependencies.md)
|
||||
- [Queue Management](./queue-management.md)
|
||||
|
||||
---
|
||||
|
||||
**Last Updated:** December 2025
|
||||
395
docs/operations/rate-limit-tuning.md
Normal file
@ -0,0 +1,395 @@
|
||||
# Rate Limit Tuning Guide
|
||||
|
||||
This document covers rate limiting configuration, adjustment procedures, and troubleshooting for the Customer Portal.
|
||||
|
||||
---
|
||||
|
||||
## Rate Limiting Overview
|
||||
|
||||
The portal uses multiple rate limiting mechanisms:
|
||||
|
||||
| Type | Scope | Backend | Purpose |
|
||||
| ------------------------- | ---------------------------------- | ------------------- | --------------------------- |
|
||||
| **Auth Rate Limiting** | Per endpoint (login, signup, etc.) | Redis | Prevent brute force attacks |
|
||||
| **Global Rate Limiting** | Per route/controller | Redis | API abuse prevention |
|
||||
| **Request Queues** | Per external API | In-memory (p-queue) | External API protection |
|
||||
| **SSE Connection Limits** | Per user | In-memory | Resource protection |
|
||||
|
||||
---
|
||||
|
||||
## Authentication Rate Limits
|
||||
|
||||
### Configuration
|
||||
|
||||
| Endpoint | Env Variable | Default | Window |
|
||||
| -------------------- | --------------------------------- | ----------- | ------ |
|
||||
| Login | `LOGIN_RATE_LIMIT_LIMIT` | 5 attempts | 15 min |
|
||||
| Login (TTL) | `LOGIN_RATE_LIMIT_TTL` | 900000 ms | - |
|
||||
| Signup | `SIGNUP_RATE_LIMIT_LIMIT` | 5 attempts | 15 min |
|
||||
| Signup (TTL) | `SIGNUP_RATE_LIMIT_TTL` | 900000 ms | - |
|
||||
| Password Reset | `PASSWORD_RESET_RATE_LIMIT_LIMIT` | 5 attempts | 15 min |
|
||||
| Password Reset (TTL) | `PASSWORD_RESET_RATE_LIMIT_TTL` | 900000 ms | - |
|
||||
| Token Refresh | `AUTH_REFRESH_RATE_LIMIT_LIMIT` | 10 attempts | 5 min |
|
||||
| Token Refresh (TTL) | `AUTH_REFRESH_RATE_LIMIT_TTL` | 300000 ms | - |
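
The variables and defaults in this table could be read into a typed configuration as follows. This is a sketch, not the portal's actual config loader: `loadAuthRateLimits` and `positiveIntFromEnv` are hypothetical names, and invalid values fall back to the documented defaults by assumption.

```typescript
interface RateLimitConfig {
  limit: number;
  ttlMs: number;
}

type Env = Record<string, string | undefined>;

// Reads a positive integer, falling back to the default when unset or invalid.
function positiveIntFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Defaults match the table: 5 attempts per 15 min, 10 per 5 min for refresh.
function loadAuthRateLimits(env: Env) {
  const read = (prefix: string, limit: number, ttlMs: number): RateLimitConfig => ({
    limit: positiveIntFromEnv(env, `${prefix}_LIMIT`, limit),
    ttlMs: positiveIntFromEnv(env, `${prefix}_TTL`, ttlMs),
  });
  return {
    login: read("LOGIN_RATE_LIMIT", 5, 900000),
    signup: read("SIGNUP_RATE_LIMIT", 5, 900000),
    passwordReset: read("PASSWORD_RESET_RATE_LIMIT", 5, 900000),
    refresh: read("AUTH_REFRESH_RATE_LIMIT", 10, 300000),
  };
}

// Example: override only the login limit; everything else keeps its default.
const authLimits = loadAuthRateLimits({ LOGIN_RATE_LIMIT_LIMIT: "10" });
console.log(authLimits.login, authLimits.refresh);
```

In a Node.js process, `process.env` can be passed as the `env` argument.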
|
||||
|
||||
### CAPTCHA Configuration
|
||||
|
||||
| Setting | Env Variable | Default | Description |
|
||||
| ----------------- | ------------------------------ | ------- | ------------------------------------ |
|
||||
| CAPTCHA Threshold | `LOGIN_CAPTCHA_AFTER_ATTEMPTS` | 3 | Show CAPTCHA after N failed attempts |
|
||||
| CAPTCHA Always On | `AUTH_CAPTCHA_ALWAYS_ON` | false | Require CAPTCHA for all logins |
|
||||
|
||||
### Adjusting Auth Rate Limits
|
||||
|
||||
**In Production (requires restart):**
|
||||
|
||||
```bash
# Edit .env file
LOGIN_RATE_LIMIT_LIMIT=10     # Increase to 10 attempts
LOGIN_RATE_LIMIT_TTL=1800000  # Extend window to 30 minutes

# Recreate the backend container so it picks up the new values
# (`docker compose restart` keeps the existing container and its old environment)
docker compose up -d backend
```
|
||||
|
||||
**Reset a Client's Counter via Redis (immediate, no restart):**
|
||||
|
||||
```bash
|
||||
# Check current rate limit for a key
|
||||
redis-cli GET "auth-login:<ip-hash>"
|
||||
|
||||
# Delete a rate limit record to allow immediate retry
|
||||
redis-cli DEL "auth-login:<ip-hash>"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Global API Rate Limits
|
||||
|
||||
### Configuration
|
||||
|
||||
Global rate limits are applied via the `@RateLimit` decorator:
|
||||
|
||||
```typescript
|
||||
@RateLimit({ limit: 100, ttl: 60 }) // 100 requests per minute
|
||||
@Controller('invoices')
|
||||
export class InvoicesController { ... }
|
||||
```
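
To make the `limit` and `ttl` semantics concrete, the sketch below implements a fixed-window counter in memory. It is an illustration only: the portal's limiter is Redis-backed and its algorithm is not described here, the class name is hypothetical, and `ttl` is assumed to be in seconds as in the settings table.

```typescript
interface WindowState {
  count: number;
  resetAt: number; // epoch milliseconds when the window ends
}

// In-memory fixed-window limiter: at most `limit` calls per `ttlSeconds` per key.
class FixedWindowLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly ttlSeconds: number;

  constructor(limit: number, ttlSeconds: number) {
    this.limit = limit;
    this.ttlSeconds = ttlSeconds;
  }

  // Returns true if the request is allowed and records it.
  allow(key: string, nowMs: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (current === undefined || nowMs >= current.resetAt) {
      // First request, or the previous window has expired: start a new window.
      this.windows.set(key, { count: 1, resetAt: nowMs + this.ttlSeconds * 1000 });
      return true;
    }
    if (current.count >= this.limit) {
      return false;
    }
    current.count += 1;
    return true;
  }
}

// Example: 100 requests per 60 seconds, keyed by client.
const invoicesLimiter = new FixedWindowLimiter(100, 60);
console.log(invoicesLimiter.allow("client-1"));
```

Doubling `limit` while keeping `ttl` fixed doubles the sustained request rate, which is why the two values should be reviewed together.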

### Common Rate Limit Settings

| Endpoint | Limit | TTL | Notes |
| ------------- | ----- | --- | --------------------- |
| Invoices | 100 | 60s | High-traffic endpoint |
| Subscriptions | 100 | 60s | High-traffic endpoint |
| Catalog | 200 | 60s | Cached, higher limit |
| Orders | 50 | 60s | Write operations |
| Profile | 60 | 60s | Standard limit |

### Adjusting Global Rate Limits

Global rate limits are defined in code. To adjust:

1. Modify the `@RateLimit` decorator in the controller
2. Deploy the change

```typescript
// Before
@RateLimit({ limit: 50, ttl: 60 })

// After (double the limit)
@RateLimit({ limit: 100, ttl: 60 })
```

---

## External API Request Queues

### WHMCS Queue Configuration

| Setting | Env Variable | Default | Description |
| ------------ | -------------------------- | ------- | ----------------------- |
| Concurrency | `WHMCS_QUEUE_CONCURRENCY` | 15 | Max parallel requests |
| Interval Cap | `WHMCS_QUEUE_INTERVAL_CAP` | 300 | Max requests per minute |
| Timeout | `WHMCS_QUEUE_TIMEOUT_MS` | 30000 | Request timeout (ms) |

### Salesforce Queue Configuration

| Setting | Env Variable | Default | Description |
| ------------------------ | ----------------------------- | ------- | ----------------------- |
| Standard Concurrency | `SF_QUEUE_CONCURRENCY` | 10 | Standard operations |
| Long-Running Concurrency | `SF_LONG_RUNNING_CONCURRENCY` | 5 | Bulk operations |
| Interval Cap | `SF_QUEUE_INTERVAL_CAP` | 200 | Max requests per minute |
| Timeout | `SF_QUEUE_TIMEOUT_MS` | 30000 | Request timeout (ms) |

### Adjusting Queue Limits

**Production Adjustment:**

```bash
# Edit .env file
WHMCS_QUEUE_CONCURRENCY=20 # Increase concurrent requests
WHMCS_QUEUE_INTERVAL_CAP=500 # Increase requests per minute

# Restart backend
docker compose restart backend
```
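
When tuning these values it helps to estimate which setting is the actual bottleneck. The sketch below is back-of-the-envelope arithmetic rather than the queue's real scheduling logic: `estimateMaxRequestsPerMinute` is a hypothetical helper, and the average request latency is an input you would need to measure yourself.

```typescript
interface QueueSettings {
  concurrency: number; // e.g. WHMCS_QUEUE_CONCURRENCY
  intervalCapPerMinute: number; // e.g. WHMCS_QUEUE_INTERVAL_CAP
}

// Upper bound on sustained throughput: each of the `concurrency` slots can finish
// 60000 / avgLatencyMs requests per minute, and the interval cap limits the total.
function estimateMaxRequestsPerMinute(settings: QueueSettings, avgLatencyMs: number): number {
  if (avgLatencyMs <= 0) throw new Error("avgLatencyMs must be positive");
  const concurrencyBound = settings.concurrency * (60000 / avgLatencyMs);
  return Math.min(concurrencyBound, settings.intervalCapPerMinute);
}

// Documented WHMCS defaults with an assumed 2-second average response time:
// concurrency allows 15 * 30 = 450/min, so the interval cap of 300 is the limit.
console.log(estimateMaxRequestsPerMinute({ concurrency: 15, intervalCapPerMinute: 300 }, 2000)); // 300
```

If the concurrency bound is the smaller of the two numbers, raising the interval cap alone will not increase throughput.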

### Queue Health Monitoring

```bash
# Check queue metrics
curl http://localhost:4000/health/queues | jq '.'

# Expected output:
{
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "queueSize": 0,
      "pendingRequests": 2,
      "failedRequests": 0
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ... },
    "dailyUsage": { "used": 5000, "limit": 15000 }
  }
}
```
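
A response of this shape can also be checked programmatically. The sketch below assumes the JSON layout shown above (the `QueueHealth` type is inferred from that sample, not taken from the codebase) and flags a queue when it is not healthy or when its daily usage passes a chosen fraction of the limit. The 80% default is an arbitrary example threshold.

```typescript
interface QueueHealth {
  health: string;
  metrics?: { queueSize: number; pendingRequests: number; failedRequests: number };
  dailyUsage?: { used: number; limit: number };
}

// Returns human-readable warnings for a parsed /health/queues response.
function findQueueWarnings(
  response: Record<string, QueueHealth>,
  usageThreshold = 0.8, // assumed threshold, tune as needed
): string[] {
  const warnings: string[] = [];
  for (const [name, queue] of Object.entries(response)) {
    if (queue.health !== "healthy") {
      warnings.push(`${name}: health is ${queue.health}`);
    }
    if (queue.dailyUsage && queue.dailyUsage.limit > 0) {
      const ratio = queue.dailyUsage.used / queue.dailyUsage.limit;
      if (ratio >= usageThreshold) {
        warnings.push(`${name}: daily usage at ${Math.round(ratio * 100)}%`);
      }
    }
  }
  return warnings;
}
```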

---

## SSE Connection Limits

### Configuration

```typescript
// Per-user SSE connection limit (in-memory)
private readonly maxPerUser = 3;
```

This prevents a single user from opening unlimited SSE connections.
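
For orientation, a per-user limiter of this kind can be as small as a counter map. The class below is an illustrative sketch only: `ConnectionLimiter`, `acquire`, and `release` are hypothetical names, and the real `realtime-connection-limiter.service.ts` may be structured differently.

```typescript
class ConnectionLimiter {
  private readonly counts = new Map<string, number>();
  private readonly maxPerUser: number;

  constructor(maxPerUser = 3) {
    this.maxPerUser = maxPerUser;
  }

  // Reserve a slot for a new SSE connection; returns false when the user is at the limit.
  acquire(userId: string): boolean {
    const current = this.counts.get(userId) ?? 0;
    if (current >= this.maxPerUser) return false;
    this.counts.set(userId, current + 1);
    return true;
  }

  // Free a slot when a connection closes; the entry is removed once it reaches zero.
  release(userId: string): void {
    const current = this.counts.get(userId) ?? 0;
    if (current <= 1) this.counts.delete(userId);
    else this.counts.set(userId, current - 1);
  }
}
```

Because the counts live in process memory, a limiter like this enforces the cap per backend instance, not across instances.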

### Adjusting SSE Limits

This requires a code change in `realtime-connection-limiter.service.ts`:

```typescript
// Change from
private readonly maxPerUser = 3;

// To
private readonly maxPerUser = 5;
```

---

## Bypassing Rate Limits for Testing

### Temporary Bypass via Redis

```bash
# Clear all rate limit keys for testing.
# --scan iterates incrementally, whereas KEYS blocks Redis while it walks the whole keyspace.
redis-cli --scan --pattern "auth-*" | xargs redis-cli DEL
redis-cli --scan --pattern "rate-limit:*" | xargs redis-cli DEL

# Clear specific user's rate limit
redis-cli --scan --pattern "*<ip-or-user-identifier>*" | xargs redis-cli DEL
```

### Using SkipRateLimit Decorator

For development/testing routes:

```typescript
@SkipRateLimit()
@Get('test-endpoint')
async testEndpoint() { ... }
```

### Environment-Based Bypass

Add a development bypass in configuration:

```bash
# In .env (development only!)
RATE_LIMIT_BYPASS_ENABLED=true
```

```typescript
// In guard
if (this.configService.get("RATE_LIMIT_BYPASS_ENABLED") === "true") {
  return true;
}
```
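
Since the bypass flag should never take effect in production, the check can also consult the runtime environment. The helper below is a hedged sketch: `isRateLimitBypassAllowed` is a hypothetical function, and relying on `NODE_ENV` to identify production is an assumption about how the portal is deployed.

```typescript
// The bypass is honoured only when explicitly enabled AND the process is not in production.
function isRateLimitBypassAllowed(env: Record<string, string | undefined>): boolean {
  const requested = env["RATE_LIMIT_BYPASS_ENABLED"] === "true";
  const isProduction = env["NODE_ENV"] === "production";
  return requested && !isProduction;
}
```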

> **Warning**: Never enable bypass in production!

---

## Signs of Rate Limit Issues

### User-Facing Symptoms

| Symptom | Possible Cause | Investigation |
| -------------------------- | ------------------- | ------------------------- |
| "Too many requests" errors | Rate limit exceeded | Check Redis keys, logs |
| Login failures | Auth rate limit | Check `auth-login:*` keys |
| Slow API responses | Queue backlog | Check `/health/queues` |
| 429 errors in logs | Any rate limit | Check logs for specifics |

### Monitoring Indicators

| Metric | Warning | Critical | Action |
| ----------------- | ------------- | -------- | ------------------------ |
| 429 error rate | >1% | >5% | Review rate limits |
| Queue size | >10 | >50 | Increase concurrency |
| Average wait time | >1s | >5s | Scale or increase limits |
| CAPTCHA triggers | Unusual spike | - | Possible attack |
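
These thresholds translate directly into a small classification function. The sketch below encodes the three numeric rows of the table; the `classify` helper and the metric names are hypothetical, and CAPTCHA triggers are left out because the table gives no numeric threshold for them.

```typescript
type Severity = "ok" | "warning" | "critical";

// Thresholds copied from the table above (strictly greater-than comparisons).
const thresholds = {
  errorRate429Percent: { warning: 1, critical: 5 },
  queueSize: { warning: 10, critical: 50 },
  averageWaitSeconds: { warning: 1, critical: 5 },
};

type MetricName = keyof typeof thresholds;

function classify(metric: MetricName, value: number): Severity {
  const { warning, critical } = thresholds[metric];
  if (value > critical) return "critical";
  if (value > warning) return "warning";
  return "ok";
}

console.log(classify("queueSize", 12)); // warning
```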

### Log Analysis

```bash
# Find rate limit exceeded events
grep "Rate limit exceeded" /var/log/bff/combined.log | tail -20

# Find 429 responses
grep '"statusCode":429' /var/log/bff/combined.log | tail -20

# Count rate limit events by path
grep "Rate limit exceeded" /var/log/bff/combined.log | \
  jq -r '.path' | sort | uniq -c | sort -rn
```
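
The same per-path count can be produced in code, for example inside a diagnostics script. This sketch assumes, like the `jq` pipeline above, that each log line is a JSON object with a `path` field. `countRateLimitEventsByPath` is a hypothetical helper, and lines that fail to parse are skipped.

```typescript
// Counts "Rate limit exceeded" events per request path, most frequent first.
function countRateLimitEventsByPath(logLines: string[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const line of logLines) {
    if (!line.includes("Rate limit exceeded")) continue;
    try {
      const entry = JSON.parse(line) as { path?: unknown };
      if (typeof entry.path === "string") {
        counts.set(entry.path, (counts.get(entry.path) ?? 0) + 1);
      }
    } catch {
      // Not valid JSON (for example a stack trace line); ignore it.
    }
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```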

---

## Troubleshooting

### Too Many 429 Errors

**Diagnosis:**

```bash
# Check which endpoints are rate limited
grep "Rate limit exceeded" /var/log/bff/combined.log | \
  jq '{path: .path, key: .key}' | head -20

# Check queue health
curl http://localhost:4000/health/queues
```

**Resolution:**

1. Identify the affected endpoint
2. Check whether the limit is appropriate for the traffic
3. Increase the limit if the traffic is legitimate
4. Add caching if requests are repetitive

### Legitimate Users Being Blocked

**Diagnosis:**

```bash
# Check rate limit state for specific key
redis-cli KEYS "*<identifier>*"
redis-cli GET "auth-login:<hash>"
```

**Resolution:**

```bash
# Clear the user's rate limit record
redis-cli DEL "auth-login:<hash>"
```

### External API Rate Limit Violations

**WHMCS Rate Limiting:**

```bash
# Check queue metrics
curl http://localhost:4000/health/queues/whmcs

# Reduce concurrency if WHMCS is overloaded (set in .env, then restart backend)
WHMCS_QUEUE_CONCURRENCY=5
WHMCS_QUEUE_INTERVAL_CAP=100
```

**Salesforce API Limits:**

```bash
# Check daily API usage
curl http://localhost:4000/health/queues/salesforce | jq '.dailyUsage'

# If approaching limit, reduce requests
# Consider caching more data
```

### Redis Connection Issues

If rate limiting fails due to Redis:

```bash
# Check Redis connectivity
redis-cli PING

# The guard fails open on Redis errors (allows request)
# Check logs for "Rate limiter error - failing open"
```

---

## Best Practices

### Setting Rate Limits

1. **Start Conservative** - Begin with lower limits, increase as needed
2. **Monitor Before Adjusting** - Understand traffic patterns first
3. **Consider User Experience** - Limits should rarely impact normal use
4. **Document Changes** - Track why limits were adjusted

### Rate Limit Strategies

| Strategy | Use Case | Implementation |
| ---------- | ----------------------- | ---------------------- |
| IP-based | Anonymous endpoints | Default behavior |
| User-based | Authenticated endpoints | Include user ID in key |
| Combined | Sensitive endpoints | IP + User-Agent hash |
| Tiered | Different user classes | Custom logic |

### Performance Considerations

- **Redis Latency** - Keep Redis co-located with BFF
- **Key Expiration** - Use TTL to prevent Redis bloat
- **Fail Open** - Rate limiter allows requests if Redis fails
- **Logging** - Log blocked requests for analysis

---

## Rate Limit Response Headers

The BFF includes the widely used `X-RateLimit-*` headers along with `Retry-After`:

```http
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704110400
Retry-After: 60
```

Clients can use these to implement backoff.
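
As an illustration, a client might derive its wait time as follows. This is a sketch under assumptions: `computeBackoffMs` is a hypothetical helper, header names are expected in lower case, `Retry-After` is read as a number of seconds, and `X-RateLimit-Reset` is interpreted as a Unix timestamp in seconds, which matches the sample value above but is not confirmed by the BFF source.

```typescript
// Parses a numeric header value; returns undefined when it is absent or not a number.
function parseHeaderNumber(value: string | undefined): number | undefined {
  if (value === undefined || value.trim() === "") return undefined;
  const parsed = Number(value);
  return Number.isFinite(parsed) ? parsed : undefined;
}

// Returns how long to wait (in ms) before retrying a rate-limited request.
function computeBackoffMs(
  headers: Record<string, string | undefined>,
  nowMs: number = Date.now(),
  fallbackMs = 1000, // used when neither header is usable
): number {
  const retryAfterSeconds = parseHeaderNumber(headers["retry-after"]);
  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0) {
    return retryAfterSeconds * 1000;
  }
  const resetEpochSeconds = parseHeaderNumber(headers["x-ratelimit-reset"]);
  if (resetEpochSeconds !== undefined) {
    return Math.max(0, resetEpochSeconds * 1000 - nowMs);
  }
  return fallbackMs;
}
```

`Retry-After` is preferred here because it is relative and therefore unaffected by clock differences between client and server.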

---

## Related Documents

- [Incident Response](./incident-response.md)
- [Monitoring Setup](./monitoring-setup.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)

---

**Last Updated:** December 2025

402
docs/operations/release-procedures.md
Normal file
@ -0,0 +1,402 @@
# Release and Deployment Procedures

This document covers pre-deployment checklists, deployment procedures, post-deployment verification, and rollback procedures for the Customer Portal.

---

## Deployment Overview

| Environment | Method | Script | Notes |
| ----------- | -------------- | ------------------ | ------------------------------------ |
| Development | Local | `pnpm dev` | Apps run locally, services in Docker |
| Production | Docker Compose | `pnpm prod:deploy` | Full containerized deployment |
| Updates | Docker Compose | `pnpm prod:update` | Zero-downtime application updates |

### Available Commands

```bash
pnpm prod:deploy   # Full deployment (build + start + migrate)
pnpm prod:start    # Start all production services
pnpm prod:stop     # Stop all production services
pnpm prod:update   # Zero-downtime update (rebuild and recreate apps)
pnpm prod:status   # Show service status and health
pnpm prod:logs     # Show service logs
pnpm prod:backup   # Create database backup
pnpm prod:cleanup  # Clean up old containers and images
```

---

## Pre-Deployment Checklist

### Code Review

- [ ] All changes have been reviewed and approved
- [ ] No `console.log`/`console.error` statements in production code
- [ ] No hardcoded secrets or credentials
- [ ] TypeScript compilation passes (`pnpm type-check`)
- [ ] Linting passes (`pnpm lint`)
- [ ] Tests pass (`pnpm test`)

### Environment Configuration

- [ ] All required environment variables are set in `.env`
- [ ] Database URL is correct for production
- [ ] Redis URL is correct for production
- [ ] External API credentials are valid (Salesforce, WHMCS, Freebit)
- [ ] `CORS_ORIGIN` matches production domain
- [ ] `JWT_SECRET` is secure and unique

**Required Environment Variables:**

```bash
DATABASE_URL          # PostgreSQL connection string
REDIS_URL             # Redis connection string
JWT_SECRET            # Secure secret (min 32 chars)
POSTGRES_PASSWORD     # Database password
CORS_ORIGIN           # Frontend domain
NEXT_PUBLIC_API_BASE  # BFF API URL
BFF_PORT              # Backend port (usually 4000)
```
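
A start-up check along these lines can catch a missing variable before the deployment gets as far as the health checks. It is a minimal sketch: `findEnvProblems` is a hypothetical helper, and the only value rule it enforces is the 32-character minimum for `JWT_SECRET` noted above.

```typescript
const REQUIRED_VARS = [
  "DATABASE_URL",
  "REDIS_URL",
  "JWT_SECRET",
  "POSTGRES_PASSWORD",
  "CORS_ORIGIN",
  "NEXT_PUBLIC_API_BASE",
  "BFF_PORT",
];

// Returns a list of problems; an empty list means the environment looks complete.
function findEnvProblems(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  for (const name of REQUIRED_VARS) {
    const value = env[name];
    if (value === undefined || value.trim() === "") {
      problems.push(`${name} is not set`);
    }
  }
  const secret = env["JWT_SECRET"];
  if (secret !== undefined && secret.trim() !== "" && secret.length < 32) {
    problems.push("JWT_SECRET is shorter than 32 characters");
  }
  return problems;
}
```

In a Node.js process this could be called as `findEnvProblems(process.env)` during bootstrap, aborting the start if the list is not empty.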

### Database Migration Check

- [ ] Review pending migrations (`npx prisma migrate status`)
- [ ] Test migrations on staging/local first
- [ ] Create database backup before applying migrations
- [ ] Prepare rollback SQL if migration is destructive
- [ ] Estimate migration duration for large tables

### Dependency Check

- [ ] Run security audit (`pnpm security:check`)
- [ ] No high/critical vulnerabilities
- [ ] All dependencies are at expected versions
- [ ] Lock file is up to date (`pnpm-lock.yaml`)

### Communication

- [ ] Notify team of deployment schedule
- [ ] Schedule during low-traffic window if possible
- [ ] Prepare customer communication if downtime expected
- [ ] Ensure on-call engineer is available

---

## Deployment Procedure

### Standard Deployment (First Time)

```bash
# 1. Create database backup (if updating existing system)
pnpm prod:backup

# 2. Full deployment
pnpm prod:deploy
```

`pnpm prod:deploy` performs the following steps:

1. Validates environment configuration
2. Builds production Docker images
3. Starts database and cache services
4. Waits for database readiness
5. Runs Prisma migrations
6. Starts frontend and backend services
7. Performs health checks

### Application Update (Zero-Downtime)

For updates that don't require database migrations:

```bash
# 1. Create database backup
pnpm prod:backup

# 2. Update applications
pnpm prod:update
```

This rebuilds and recreates frontend and backend containers without stopping the database.

### Database Migration Deployment

For deployments with schema changes:

```bash
# 1. Create database backup
pnpm prod:backup

# 2. Stop application to prevent writes during migration
pnpm prod:stop

# 3. Start only database
docker compose -f docker/prod/docker-compose.yml up -d database

# 4. Run migrations
docker compose -f docker/prod/docker-compose.yml run --rm backend pnpm db:migrate

# 5. Verify migration success
docker compose -f docker/prod/docker-compose.yml exec database psql -U portal -d portal_prod -c "SELECT * FROM _prisma_migrations ORDER BY finished_at DESC LIMIT 5;"

# 6. Start all services
pnpm prod:start

# 7. Verify application health
pnpm prod:status
```

---

## Post-Deployment Verification

### Immediate Checks (0-5 minutes)

- [ ] Health endpoints return `ok`
  ```bash
  curl http://localhost:4000/health
  curl http://localhost:3000/_health
  ```
- [ ] No error spikes in logs
  ```bash
  pnpm prod:logs backend | grep -i error | tail -20
  ```
- [ ] Database migrations applied successfully
- [ ] Redis connectivity verified
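
Rather than probing the health endpoints by hand, the immediate checks can be wrapped in a small polling loop. The sketch below is illustrative only: `waitForHealthy` is a hypothetical helper, the probe is passed in as a function so the example stays self-contained, and the attempt count and delay are arbitrary defaults.

```typescript
// Calls `probe` until it resolves to true or the attempts are exhausted.
async function waitForHealthy(
  probe: () => Promise<boolean>,
  attempts = 10,
  delayMs = 3000,
): Promise<boolean> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      if (await probe()) return true;
    } catch {
      // Treat errors (for example a refused connection) as "not ready yet".
    }
    if (attempt < attempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```

A probe for the backend could be `async () => (await fetch("http://localhost:4000/health")).ok`, assuming a runtime with a global `fetch` such as Node.js 18 or later.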

### Functional Checks (5-15 minutes)

- [ ] User can log in to portal
- [ ] Dashboard loads correctly
- [ ] Invoice list displays
- [ ] Subscription list displays
- [ ] Catalog products load

### Integration Checks (15-30 minutes)

- [ ] Salesforce connectivity verified
  ```bash
  curl http://localhost:4000/auth/health-check | jq '.services.salesforce'
  ```
- [ ] WHMCS connectivity verified
  ```bash
  curl http://localhost:4000/auth/health-check | jq '.services.whmcs'
  ```
- [ ] Queue health verified
  ```bash
  curl http://localhost:4000/health/queues
  ```

### Monitoring Checks

- [ ] Metrics are being collected
- [ ] No alert triggers from deployment
- [ ] Log aggregation is working
- [ ] Error rates are normal

---

## Rollback Procedures

### Application Rollback (No DB Changes)

If deployment fails without database changes:

```bash
# 1. Stop current deployment
pnpm prod:stop

# 2. Checkout previous version
git checkout <previous-tag-or-commit>

# 3. Rebuild and deploy
pnpm prod:deploy
```

### Application Rollback with Docker Images

If previous images are available:

```bash
# 1. Stop current services
pnpm prod:stop

# 2. Start with previous image tags.
#    `docker compose up` has no -e flag, so the image variables are set in the shell;
#    this assumes the compose file reads BACKEND_IMAGE and FRONTEND_IMAGE.
BACKEND_IMAGE=portal-backend:previous \
FRONTEND_IMAGE=portal-frontend:previous \
docker compose -f docker/prod/docker-compose.yml up -d --no-build
```

### Database Rollback

If database migration needs to be reverted:

**Option 1: Restore from Backup**

```bash
# 1. Stop application
pnpm prod:stop

# 2. Restore database. The database container must be running (see step 3 of
#    Database Migration Deployment); -T disables the pseudo-TTY so psql reads the file.
docker compose exec -T database psql -U portal -d portal_prod < backup_YYYYMMDD_HHMMSS.sql

# 3. Checkout previous code version
git checkout <previous-tag>

# 4. Rebuild and restart
pnpm prod:deploy
```

**Option 2: Manual Rollback SQL**

```bash
# 1. Stop application
pnpm prod:stop

# 2. Apply rollback script (if prepared); the database container must be running
docker compose exec -T database psql -U portal -d portal_prod < rollback_migration_YYYYMMDD.sql

# 3. Manually remove migration record
docker compose exec database psql -U portal -d portal_prod -c "DELETE FROM _prisma_migrations WHERE migration_name = '20240115_migration_name';"

# 4. Restart with previous code
git checkout <previous-tag>
pnpm prod:deploy
```

### Emergency Rollback

For critical failures requiring immediate action:

```bash
# 1. Immediately stop all services
pnpm prod:stop

# 2. Restore from most recent backup (database container must be running)
docker compose exec -T database psql -U portal -d portal_prod < /path/to/latest_backup.sql

# 3. Deploy last known good version
git checkout <last-known-good-tag>
pnpm prod:deploy

# 4. Notify team
# Send incident notification
```

---

## Feature Flags

The portal does not currently use a formal feature flag system. Feature availability is controlled through:

1. **Environment Variables** - Toggle features via configuration
2. **Conditional Rendering** - Frontend checks for feature availability
3. **Backend Feature Checks** - API endpoints check configuration

### Adding a Feature Toggle

```typescript
// Backend: Check environment variable
const featureEnabled = this.configService.get("FEATURE_NEW_CHECKOUT", "false") === "true";

// Frontend: Check feature availability
if (process.env.NEXT_PUBLIC_FEATURE_NEW_CHECKOUT === "true") {
  // Render new feature
}
```

### Emergency Feature Disable

To disable a feature without redeployment:

1. Update environment variable in `.env`
2. Restart affected services:
   ```bash
   docker compose restart backend frontend
   ```

Two caveats apply. Next.js inlines `NEXT_PUBLIC_*` variables into the client bundle at build time, so a frontend toggle read this way only changes after the frontend is rebuilt. And if Compose injects the variables into the containers, `docker compose restart` keeps the old values; recreate the services with `docker compose up -d backend frontend` instead.

---

## Deployment Timeline Template

| Time | Action | Owner | Notes |
| ----- | ------------------------------- | ---------- | ------------------------- |
| T-24h | Announce deployment window | Tech Lead | Notify all stakeholders |
| T-2h | Final code review | Developers | Verify all changes merged |
| T-1h | Pre-deployment checklist | DevOps | Complete all checks |
| T-30m | Create backup | DevOps | Verify backup integrity |
| T-15m | Notify team deployment starting | DevOps | Slack/Teams message |
| T-0 | Execute deployment | DevOps | Run deployment commands |
| T+5m | Immediate verification | DevOps | Health checks |
| T+15m | Functional verification | QA/DevOps | Test key flows |
| T+30m | All-clear or rollback decision | Tech Lead | Confirm success |
| T+1h | Post-deployment monitoring | DevOps | Watch metrics |
| T+24h | Close deployment | Tech Lead | Final verification |

---

## Troubleshooting

### Build Failures

```bash
# Check Docker daemon
docker info

# Check disk space
df -h

# Clean Docker resources
# (-a also deletes every unused image, including older tags kept for an image-based rollback)
docker system prune -a
```

### Migration Failures

```bash
# Check migration status
npx prisma migrate status

# View migration history
docker compose exec database psql -U portal -d portal_prod -c "SELECT * FROM _prisma_migrations;"

# Reset migration (development only!)
npx prisma migrate reset
```

### Service Startup Failures

```bash
# Check service logs
pnpm prod:logs backend
pnpm prod:logs frontend

# Check container status
docker compose ps -a

# Check resource usage
docker stats
```

### Database Connection Issues

```bash
# Test database connectivity
docker compose exec database pg_isready -U portal -d portal_prod

# Check connection count
docker compose exec database psql -U portal -d portal_prod -c "SELECT count(*) FROM pg_stat_activity;"
```

---

## Related Documents

- [Deployment Guide](../getting-started/deployment.md)
- [Database Operations](./database-operations.md)
- [Incident Response](./incident-response.md)
- [Monitoring Setup](./monitoring-setup.md)

---

**Last Updated:** December 2025