From 90ab71b94dd837bc614e8e3b32ec9b717050a165 Mon Sep 17 00:00:00 2001 From: barsa Date: Tue, 23 Dec 2025 16:08:15 +0900 Subject: [PATCH] Update README.md to Enhance Documentation Clarity and Add New Sections - Added a new section for Release Procedures, detailing deployment and rollback processes. - Updated the System Operations section to include Monitoring Setup, Rate Limit Tuning, and Customer Data Management for improved operational guidance. - Reformatted the table structure for better readability and consistency across documentation. --- docs/README.md | 24 +- docs/operations/customer-data-management.md | 415 ++++++++++++++++++++ docs/operations/monitoring-setup.md | 375 ++++++++++++++++++ docs/operations/rate-limit-tuning.md | 395 +++++++++++++++++++ docs/operations/release-procedures.md | 402 +++++++++++++++++++ 5 files changed, 1602 insertions(+), 9 deletions(-) create mode 100644 docs/operations/customer-data-management.md create mode 100644 docs/operations/monitoring-setup.md create mode 100644 docs/operations/rate-limit-tuning.md create mode 100644 docs/operations/release-procedures.md diff --git a/docs/README.md b/docs/README.md index ecf8f8e3..c1ce2fcc 100644 --- a/docs/README.md +++ b/docs/README.md @@ -148,14 +148,18 @@ Feature guides explaining how the portal functions: | [External Dependencies](./operations/external-dependencies.md) | Integration health checks | | [Queue Management](./operations/queue-management.md) | BullMQ job monitoring | | [External Processes](./operations/external-processes.md) | Team handoffs and workflows | +| [Release Procedures](./operations/release-procedures.md) | Deployment and rollback | ### System Operations -| Document | Description | -| ------------------------------------------------------------------ | -------------------------- | -| [Logging](./operations/logging.md) | Centralized logging system | -| [Security Monitoring](./operations/security-monitoring.md) | Security monitoring setup | -| [Subscription 
Management](./operations/subscription-management.md) | Service management | +| Document | Description | +| -------------------------------------------------------------------- | -------------------------- | +| [Logging](./operations/logging.md) | Centralized logging system | +| [Security Monitoring](./operations/security-monitoring.md) | Security monitoring setup | +| [Subscription Management](./operations/subscription-management.md) | Service management | +| [Monitoring Setup](./operations/monitoring-setup.md) | Metrics and dashboards | +| [Rate Limit Tuning](./operations/rate-limit-tuning.md) | Rate limit configuration | +| [Customer Data Management](./operations/customer-data-management.md) | GDPR and data procedures | --- @@ -192,10 +196,12 @@ Historical documents kept for reference: ### DevOps / Operations 1. [Deployment](./getting-started/deployment.md) -2. [Incident Response](./operations/incident-response.md) -3. [Provisioning Runbook](./operations/provisioning-runbook.md) -4. [Database Operations](./operations/database-operations.md) -5. [External Dependencies](./operations/external-dependencies.md) +2. [Release Procedures](./operations/release-procedures.md) +3. [Incident Response](./operations/incident-response.md) +4. [Monitoring Setup](./operations/monitoring-setup.md) +5. [Database Operations](./operations/database-operations.md) +6. [External Dependencies](./operations/external-dependencies.md) +7. [Rate Limit Tuning](./operations/rate-limit-tuning.md) --- diff --git a/docs/operations/customer-data-management.md b/docs/operations/customer-data-management.md new file mode 100644 index 00000000..d7419bd3 --- /dev/null +++ b/docs/operations/customer-data-management.md @@ -0,0 +1,415 @@ +# Customer Data Management (GDPR) + +This document covers procedures for handling customer data in compliance with GDPR and data protection regulations. 
+ +--- + +## Data Storage Overview + +Customer data is stored across multiple systems: + +| System | Data Stored | Retention | Notes | +| ----------------------- | ----------------------------------------------------- | --------------------------- | ---------------------------- | +| **Portal (PostgreSQL)** | User accounts, ID mappings, audit logs, notifications | Active account lifetime | Auth data only | +| **WHMCS** | Billing, invoices, payment methods, addresses | Legal requirement (7 years) | System of record for billing | +| **Salesforce** | CRM data, orders, cases, contacts | Business records | System of record for CRM | +| **Redis** | Sessions, cache, rate limits | TTL-based (minutes to days) | Temporary data | + +### Portal Database Tables with PII + +| Table | PII Fields | Purpose | +| ---------------------------- | ------------------------------------ | -------------------- | +| `users` | `email`, `passwordHash`, `mfaSecret` | Authentication | +| `id_mappings` | Links to WHMCS/Salesforce IDs | Identity federation | +| `audit_logs` | `ipAddress`, `userAgent`, `userId` | Security audit trail | +| `residence_card_submissions` | Document images | ID verification | +| `notifications` | User notifications | In-app messaging | +| `sim_call_history_*` | Phone numbers, call details | Usage records | +| `sim_sms_history` | Phone numbers, SMS details | Usage records | + +--- + +## Data Subject Rights + +Under GDPR, customers have the following rights: + +| Right | Portal Support | Notes | +| ---------------------- | ------------------ | ------------------------- | +| Right of Access | Manual export | See Data Export section | +| Right to Rectification | WHMCS self-service | Customer updates in WHMCS | +| Right to Erasure | Manual process | See Data Deletion section | +| Right to Portability | Manual export | See Data Export section | +| Right to Object | Manual process | Opt-out of processing | + +--- + +## Data Deletion Procedures + +### Overview + +Complete 
customer data deletion requires coordination across all systems: + +1. Portal database deletion +2. WHMCS account handling +3. Salesforce record handling +4. Redis cache clearing +5. Audit trail retention + +### Pre-Deletion Checklist + +- [ ] Verify customer identity (authentication or CS verification) +- [ ] Check for active subscriptions (must be cancelled first) +- [ ] Check for unpaid invoices (must be settled first) +- [ ] Check legal retention requirements (invoices, tax records) +- [ ] Document the deletion request with timestamp + +### Step 1: Portal Database Deletion + +```sql +-- 1. Get user information +SELECT u.id, u.email, im.whmcs_client_id, im.sf_account_id +FROM users u +LEFT JOIN id_mappings im ON u.id = im.user_id +WHERE u.email = 'customer@example.com'; + +-- 2. Delete notifications +DELETE FROM notifications WHERE user_id = ''; + +-- 3. Delete residence card submissions +DELETE FROM residence_card_submissions WHERE user_id = ''; + +-- 4. Delete SIM usage data (if applicable) +-- Note: Check if SIM account is linked to this user first +DELETE FROM sim_usage_daily WHERE account IN ( + SELECT account FROM sim_voice_options WHERE account = '' +); +DELETE FROM sim_call_history_domestic WHERE account = ''; +DELETE FROM sim_call_history_international WHERE account = ''; +DELETE FROM sim_sms_history WHERE account = ''; +DELETE FROM sim_voice_options WHERE account = ''; + +-- 5. Delete ID mapping (cascades from user deletion) +-- The id_mappings table has onDelete: Cascade + +-- 6. Delete user (cascades audit_logs user reference to NULL, deletes id_mapping) +DELETE FROM users WHERE id = ''; +``` + +**Using the Mappings Service:** + +```typescript +// Delete mapping programmatically (clears cache too) +await mappingsService.deleteMapping(userId); +``` + +### Step 2: Audit Log Handling + +Audit logs may need to be retained for security compliance. 
Options: + +**Option A: Anonymize (Recommended)** + +```sql +-- Anonymize audit logs (keeps security trail, removes PII) +UPDATE audit_logs +SET user_id = NULL, + ip_address = 'ANONYMIZED', + user_agent = 'ANONYMIZED', + details = jsonb_set( + COALESCE(details, '{}'::jsonb), + '{anonymized}', + 'true'::jsonb + ) +WHERE user_id = ''; +``` + +**Option B: Delete (If Legally Permitted)** + +```sql +DELETE FROM audit_logs WHERE user_id = ''; +``` + +### Step 3: Redis Cache Clearing + +```bash +# Clear user-specific cache keys +redis-cli KEYS "user:*:*" | xargs redis-cli DEL +redis-cli KEYS "session:*:*" | xargs redis-cli DEL +redis-cli KEYS "mapping:*:*" | xargs redis-cli DEL + +# Clear refresh token families +redis-cli KEYS "refresh:user:*" | xargs redis-cli DEL +redis-cli KEYS "refresh:family:*" | xargs redis-cli DEL # May need filtering + +# Clear rate limit records +redis-cli KEYS "auth-login:*" | xargs redis-cli DEL # Clears by IP, not user +``` + +### Step 4: WHMCS Account Handling + +WHMCS does not support full account deletion. Options: + +**Option A: Close Account (Recommended)** + +1. Cancel all active services +2. Set account status to "Closed" +3. Anonymize personal fields via WHMCS Admin +4. Document closure date + +**Option B: Anonymize via API** + +```bash +# Update client to anonymized data +curl -X POST "$WHMCS_API_URL" \ + -d "identifier=$WHMCS_API_IDENTIFIER" \ + -d "secret=$WHMCS_API_SECRET" \ + -d "action=UpdateClient" \ + -d "clientid=" \ + -d "firstname=Deleted" \ + -d "lastname=User" \ + -d "email=deleted_@deleted.local" \ + -d "address1=Deleted" \ + -d "city=Deleted" \ + -d "state=Deleted" \ + -d "postcode=000-0000" \ + -d "phonenumber=000-0000-0000" \ + -d "status=Closed" \ + -d "responsetype=json" +``` + +### Step 5: Salesforce Record Handling + +Salesforce records often have legal retention requirements: + +**For Personal Data:** + +1. Work with Salesforce Admin +2. Consider anonymization vs deletion +3. 
Check integration impact (linked Orders, Cases) + +**Anonymization Approach:** + +- Update Account name to "Deleted Account - [ID]" +- Clear personal fields (phone, address if not needed) +- Keep transactional records with anonymized references + +--- + +## Data Export Procedures + +### Customer Data Export Request + +When a customer requests their data: + +#### 1. Portal Data Export + +```sql +-- Export user data +SELECT + u.id, + u.email, + u.email_verified, + u.created_at, + u.last_login_at, + im.whmcs_client_id, + im.sf_account_id +FROM users u +LEFT JOIN id_mappings im ON u.id = im.user_id +WHERE u.email = 'customer@example.com'; + +-- Export audit log (security events) +SELECT + action, + resource, + success, + created_at +FROM audit_logs +WHERE user_id = '' +ORDER BY created_at DESC; + +-- Export notifications +SELECT + type, + title, + message, + read, + created_at +FROM notifications +WHERE user_id = '' +ORDER BY created_at DESC; + +-- Export SIM usage history (if applicable) +SELECT + call_date, + call_time, + called_to, + duration_sec, + charge_yen +FROM sim_call_history_domestic +WHERE account = '' +ORDER BY call_date DESC; +``` + +#### 2. WHMCS Data Export + +Request via WHMCS Admin: + +- Client Details +- Invoices +- Services/Subscriptions +- Tickets/Support History +- Transaction History + +#### 3. Salesforce Data Export + +Request via Salesforce Admin: + +- Account record +- Contact record +- Order history +- Case history +- Opportunities + +### Export Format + +Provide data in machine-readable format: + +- JSON for structured data +- CSV for tabular data +- PDF for documents (invoices) + +--- + +## PII Handling During Debugging + +### Safe Logging Practices + +The BFF uses Pino with automatic PII redaction. 
Sensitive fields are sanitized: + +```json +{ + "email": "cust***@example.com", + "password": "[REDACTED]", + "token": "[REDACTED]", + "authorization": "[REDACTED]" +} +``` + +### What NOT to Log + +- Full email addresses (use masked version) +- Passwords or password hashes +- JWT tokens +- API keys or secrets +- Credit card numbers +- Full phone numbers +- Full addresses +- ID document contents + +### Safe Debug Queries + +```sql +-- Use ID instead of email for lookups +SELECT * FROM users WHERE id = ''; + +-- Mask PII in query results +SELECT + id, + CONCAT(LEFT(email, 3), '***', SUBSTRING(email FROM POSITION('@' IN email))) as masked_email, + created_at +FROM users +WHERE id = ''; +``` + +### Production Debugging + +When investigating production issues: + +1. **Use correlation IDs** - Search logs by request ID, not user email +2. **Access minimal data** - Only query what's needed +3. **Document access** - Note why you accessed customer data +4. **Use anonymized exports** - When sharing data for analysis + +--- + +## Data Retention Policies + +### Recommended Retention Periods + +| Data Type | Retention | Justification | +| ------------------------ | ---------- | ---------------------- | +| Active user accounts | Indefinite | Active service | +| Closed accounts (portal) | 30 days | Grace period | +| Audit logs | 2 years | Security compliance | +| Session data (Redis) | 24 hours | Active sessions | +| Rate limit data | 15 minutes | Operational | +| Invoices | 7 years | Tax/legal requirement | +| Support cases | 5 years | Service history | +| Call/SMS history | 6 months | Billing reconciliation | + +### Automated Cleanup + +```sql +-- Delete expired notifications (30 days after expiry) +DELETE FROM notifications +WHERE expires_at < NOW() - INTERVAL '30 days'; + +-- Anonymize old audit logs (over 2 years) +UPDATE audit_logs +SET ip_address = 'EXPIRED', + user_agent = 'EXPIRED' +WHERE created_at < NOW() - INTERVAL '2 years' + AND ip_address != 'EXPIRED'; +``` + +--- 
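The retention periods in the Data Retention Policies section above could also be encoded in one place so that cleanup jobs and manual checks share the same numbers. The following is a minimal sketch under stated assumptions: `RETENTION_DAYS` and `isPastRetention` are hypothetical names (not existing portal code), and years and months are approximated as 365 and 30 days.

```typescript
// Hypothetical helper: names and structure are illustrative, not portal code.
// Periods mirror the "Recommended Retention Periods" table.
const DAY_MS = 24 * 60 * 60 * 1000;

const RETENTION_DAYS: Record<string, number> = {
  closedAccount: 30, // 30-day grace period
  auditLog: 2 * 365, // 2 years (assumption: 365-day years)
  invoice: 7 * 365, // 7 years
  supportCase: 5 * 365, // 5 years
  callSmsHistory: 6 * 30, // 6 months (assumption: 30-day months)
};

// Returns true when a record created at `createdAt` is older than the
// retention period configured for `dataType`, measured against `now`.
function isPastRetention(dataType: string, createdAt: Date, now: Date): boolean {
  const days = RETENTION_DAYS[dataType];
  if (days === undefined) {
    // Fail loudly rather than silently treating unknown data as expired.
    throw new Error(`No retention period defined for "${dataType}"`);
  }
  return now.getTime() - createdAt.getTime() > days * DAY_MS;
}
```

Passing `now` explicitly keeps the check deterministic in tests; a scheduled job would pass `new Date()`.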
+ +## Compliance Checklist + +### Monthly Review + +- [ ] Review data access logs for unusual patterns +- [ ] Verify automated cleanup jobs are running +- [ ] Check for pending deletion requests +- [ ] Review new data collection points + +### Quarterly Review + +- [ ] Audit third-party data sharing +- [ ] Review retention policies +- [ ] Update data inventory if schema changed +- [ ] Staff training on data handling + +### Annual Review + +- [ ] Full data protection impact assessment +- [ ] Policy review and updates +- [ ] Vendor compliance verification +- [ ] Documentation updates + +--- + +## Emergency Data Breach Response + +If a data breach is suspected: + +1. **Contain** - Isolate affected systems +2. **Assess** - Determine scope and data exposed +3. **Notify** - Inform DPO/legal within 24 hours +4. **Report** - GDPR requires notifying the supervisory authority within 72 hours of becoming aware of the breach +5. **Remediate** - Fix vulnerability and prevent recurrence +6. **Document** - Full incident report + +See [Incident Response](./incident-response.md) for general incident procedures. + +--- + +## Related Documents + +- [Incident Response](./incident-response.md) +- [Database Operations](./database-operations.md) +- [Logging Guide](./logging.md) +- [Security Monitoring](./security-monitoring.md) + +--- + +**Last Updated:** December 2025 diff --git a/docs/operations/monitoring-setup.md b/docs/operations/monitoring-setup.md new file mode 100644 index 00000000..b40808cf --- /dev/null +++ b/docs/operations/monitoring-setup.md @@ -0,0 +1,375 @@ +# Monitoring Dashboard Setup + +This document provides guidance for setting up monitoring infrastructure for the Customer Portal.
+ +--- + +## Health Endpoints + +The BFF exposes several health check endpoints for monitoring: + +| Endpoint | Purpose | Authentication | +| ------------------------------- | ------------------------------------------ | -------------- | +| `GET /health` | Core system health (database, cache) | Public | +| `GET /health/queues` | Request queue metrics (WHMCS, Salesforce) | Public | +| `GET /health/queues/whmcs` | WHMCS queue details | Public | +| `GET /health/queues/salesforce` | Salesforce queue details | Public | +| `GET /health/catalog/cache` | Catalog cache metrics | Public | +| `GET /auth/health-check` | Integration health (DB, WHMCS, Salesforce) | Public | + +### Core Health Response + +```json +{ + "status": "ok", + "checks": { + "database": "ok", + "cache": "ok" + } +} +``` + +**Status Values:** + +- `ok` - All systems healthy +- `degraded` - One or more systems failing + +### Queue Health Response + +```json +{ + "timestamp": "2025-01-15T10:30:00.000Z", + "whmcs": { + "health": "healthy", + "metrics": { + "totalRequests": 1500, + "completedRequests": 1495, + "failedRequests": 5, + "queueSize": 0, + "pendingRequests": 2, + "averageWaitTime": 50, + "averageExecutionTime": 250 + } + }, + "salesforce": { + "health": "healthy", + "metrics": { ... 
}, + "dailyUsage": { "used": 5000, "limit": 15000 } + } +} +``` + +--- + +## Key Metrics to Monitor + +### Application Metrics + +| Metric | Source | Warning | Critical | Description | +| ------------------- | --------------- | ------------- | ---------------- | --------------------- | +| Health status | `/health` | `degraded` | Any check `fail` | Core system health | +| Response time (p95) | Logs/APM | >2s | >5s | API response latency | +| Error rate | Logs/APM | >1% | >5% | HTTP 5xx responses | +| Active connections | Node.js metrics | >80% capacity | >95% capacity | Connection pool usage | + +### Database Metrics + +| Metric | Source | Warning | Critical | Description | +| --------------------- | --------------------- | --------- | --------- | --------------------------- | +| Connection pool usage | PostgreSQL | >80% | >95% | Active connections vs limit | +| Query duration | PostgreSQL logs | >500ms | >2s | Slow query detection | +| Database size | PostgreSQL | >80% disk | >90% disk | Storage capacity | +| Dead tuples | `pg_stat_user_tables` | >10% | >25% | Vacuum needed | + +### Cache Metrics + +| Metric | Source | Warning | Critical | Description | +| -------------- | ---------------- | -------------- | -------------- | ------------------------- | +| Redis memory | Redis INFO | >80% maxmemory | >95% maxmemory | Memory pressure | +| Cache hit rate | Application logs | <80% | <60% | Cache effectiveness | +| Redis latency | Redis CLI | >10ms | >50ms | Command latency | +| Evictions | Redis INFO | Any | High rate | Memory pressure indicator | + +### Queue Metrics + +| Metric | Source | Warning | Critical | Description | +| --------------------- | ---------------- | ---------- | ---------- | ---------------------- | +| WHMCS queue size | `/health/queues` | >10 | >50 | Pending WHMCS requests | +| WHMCS failed requests | `/health/queues` | >5 | >20 | Failed API calls | +| SF daily API usage | `/health/queues` | >80% limit | >95% limit | Salesforce API quota | +| 
BullMQ wait queue | Redis | >10 | >50 | Job backlog | +| BullMQ failed jobs | Redis | >5 | >20 | Processing failures | + +### External Dependency Metrics + +| Metric | Source | Warning | Critical | Description | +| ------------------------ | ------ | ------- | -------- | -------------------- | +| Salesforce response time | Logs | >2s | >5s | SF API latency | +| WHMCS response time | Logs | >2s | >5s | WHMCS API latency | +| Freebit response time | Logs | >3s | >10s | Freebit API latency | +| External error rate | Logs | >1% | >5% | Integration failures | + +--- + +## Structured Logging for Metrics + +The BFF uses Pino for structured JSON logging. Key fields for metrics extraction: + +```json +{ + "timestamp": "2025-01-15T10:30:00.000Z", + "level": "info", + "service": "customer-portal-bff", + "correlationId": "req-123", + "message": "API call completed", + "duration": 250, + "path": "/api/invoices", + "method": "GET", + "statusCode": 200 +} +``` + +### Log Queries for Metrics + +**Error Count (entire log file):** + +```bash +grep '"level":50' /var/log/bff/combined.log | wc -l +``` + +**Slow Requests (>=2s):** + +```bash +# Matches durations of 2000-9999 ms or any value with five or more digits +grep -E '"duration":([2-9][0-9]{3}|[0-9]{5,})' /var/log/bff/combined.log | tail -20 +``` + +**External API Errors:** + +```bash +grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20 +``` + +--- + +## Grafana Dashboard Setup + +### Data Sources + +1. **Prometheus** - For application metrics +2. **Loki** - For log aggregation +3. **PostgreSQL** - For database metrics + +### Recommended Panels + +#### Overview Dashboard + +1. **System Health** (Stat panel) + - Query: `/health` endpoint status + - Show: ok/degraded indicator + +2. **Request Rate** (Graph panel) + - Source: Prometheus/Loki + - Show: Requests per second + +3. **Error Rate** (Graph panel) + - Source: Loki log count + - Filter: `level >= 50` + +4. **Response Time (p95)** (Graph panel) + - Source: Prometheus histogram + - Show: 95th percentile latency + +#### Queue Dashboard + +1.
**Queue Depths** (Graph panel) + - Source: `/health/queues` endpoint + - Show: WHMCS and SF queue sizes + +2. **Failed Jobs** (Stat panel) + - Source: Redis BullMQ metrics + - Show: Failed job count + +3. **Salesforce API Usage** (Gauge panel) + - Source: `/health/queues/salesforce` + - Show: Daily usage vs limit + +#### Database Dashboard + +1. **Connection Pool** (Gauge panel) + - Source: PostgreSQL `pg_stat_activity` + - Show: Active connections + +2. **Query Performance** (Table panel) + - Source: PostgreSQL `pg_stat_statements` + - Show: Slowest queries + +### Sample Prometheus Scrape Config + +```yaml +# Note: Prometheus scrapes the text exposition format. If this path returns +# JSON (as in the health responses above), an exporter such as json_exporter +# is needed between Prometheus and the endpoint. +scrape_configs: + - job_name: "portal-bff" + static_configs: + - targets: ["bff:4000"] + metrics_path: "/health" + scrape_interval: 30s +``` + +--- + +## CloudWatch Setup (AWS) + +### Custom Metrics + +Push metrics from health endpoints to CloudWatch: + +```bash +# Example: Push queue depth metric +aws cloudwatch put-metric-data \ + --namespace "CustomerPortal" \ + --metric-name "WhmcsQueueDepth" \ + --value "$(curl -s http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize')" \ + --dimensions Environment=production +``` + +### Recommended CloudWatch Alarms + +| Alarm | Metric | Threshold | Period | Action | +| ------------- | ---------------- | --------- | ------ | ---------------- | +| HighErrorRate | ErrorCount | >10 | 5 min | SNS notification | +| HighLatency | p95 ResponseTime | >2000ms | 5 min | SNS notification | +| QueueBacklog | WhmcsQueueDepth | >50 | 5 min | SNS notification | +| DatabaseDown | HealthStatus | !=ok | 1 min | PagerDuty | +| CacheDown | HealthStatus | !=ok | 1 min | PagerDuty | + +### Log Insights Queries + +**Error Summary:** + +```sql +fields @timestamp, @message +| filter level >= 50 +| stats count() by bin(5m) +``` + +**Slow Requests:** + +```sql +fields @timestamp, path, duration +| filter duration > 2000 +| sort duration desc +| limit 20 +``` + +--- + +## DataDog Setup + +### Agent Configuration + +```yaml +# datadog.yaml
+logs_enabled: true + +logs: + - type: file + path: /var/log/bff/combined.log + service: customer-portal-bff + source: nodejs +``` + +### Custom Metrics + +```typescript +// Example: Report queue metrics to DataDog +import { StatsD } from "hot-shots"; + +const dogstatsd = new StatsD({ host: "localhost", port: 8125 }); + +// Report queue depth +dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize); +dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests); +``` + +### Recommended Monitors + +1. **Health Check Monitor** + - Check: HTTP check on `/health` + - Alert: When status != ok for 2 minutes + +2. **Error Rate Monitor** + - Metric: `portal.errors.count` + - Alert: When >5% for 5 minutes + +3. **Queue Depth Monitor** + - Metric: `portal.whmcs.queue_depth` + - Alert: When >50 for 5 minutes + +--- + +## Alerting Best Practices + +### Alert Priority Levels + +| Priority | Response Time | Examples | +| ----------- | ------------- | --------------------------------------------- | +| P1 Critical | 15 minutes | Portal down, database unreachable | +| P2 High | 1 hour | Provisioning failing, payment processing down | +| P3 Medium | 4 hours | Degraded performance, high error rate | +| P4 Low | 24 hours | Minor issues, informational alerts | + +### Alert Routing + +```yaml +# Example PagerDuty routing +routes: + - match: + severity: critical + receiver: pagerduty-oncall + - match: + severity: warning + receiver: slack-ops + - match: + severity: info + receiver: email-team +``` + +### Runbook Links + +Include runbook links in all alerts: + +- Health check failures → [Incident Response](./incident-response.md) +- Database issues → [Database Operations](./database-operations.md) +- Queue problems → [Queue Management](./queue-management.md) +- External API failures → [External Dependencies](./external-dependencies.md) + +--- + +## Monitoring Checklist + +### Initial Setup + +- [ ] Configure health endpoint scraping (every 30s) +- [ ] Set up log 
aggregation (Loki, CloudWatch, or DataDog) +- [ ] Create overview dashboard with key metrics +- [ ] Configure P1/P2 alerts for critical failures +- [ ] Test alert routing to on-call + +### Ongoing Maintenance + +- [ ] Review alert thresholds quarterly +- [ ] Check for alert fatigue (too many false positives) +- [ ] Update dashboards when new features are deployed +- [ ] Validate runbook links are current + +--- + +## Related Documents + +- [Incident Response](./incident-response.md) +- [Logging Guide](./logging.md) +- [External Dependencies](./external-dependencies.md) +- [Queue Management](./queue-management.md) + +--- + +**Last Updated:** December 2025 diff --git a/docs/operations/rate-limit-tuning.md b/docs/operations/rate-limit-tuning.md new file mode 100644 index 00000000..88499898 --- /dev/null +++ b/docs/operations/rate-limit-tuning.md @@ -0,0 +1,395 @@ +# Rate Limit Tuning Guide + +This document covers rate limiting configuration, adjustment procedures, and troubleshooting for the Customer Portal. + +--- + +## Rate Limiting Overview + +The portal uses multiple rate limiting mechanisms: + +| Type | Scope | Backend | Purpose | +| ------------------------- | ---------------------------------- | ------------------- | --------------------------- | +| **Auth Rate Limiting** | Per endpoint (login, signup, etc.) 
| Redis | Prevent brute force attacks | +| **Global Rate Limiting** | Per route/controller | Redis | API abuse prevention | +| **Request Queues** | Per external API | In-memory (p-queue) | External API protection | +| **SSE Connection Limits** | Per user | In-memory | Resource protection | + +--- + +## Authentication Rate Limits + +### Configuration + +| Endpoint | Env Variable | Default | Window | +| -------------------- | --------------------------------- | ----------- | ------ | +| Login | `LOGIN_RATE_LIMIT_LIMIT` | 5 attempts | 15 min | +| Login (TTL) | `LOGIN_RATE_LIMIT_TTL` | 900000 ms | - | +| Signup | `SIGNUP_RATE_LIMIT_LIMIT` | 5 attempts | 15 min | +| Signup (TTL) | `SIGNUP_RATE_LIMIT_TTL` | 900000 ms | - | +| Password Reset | `PASSWORD_RESET_RATE_LIMIT_LIMIT` | 5 attempts | 15 min | +| Password Reset (TTL) | `PASSWORD_RESET_RATE_LIMIT_TTL` | 900000 ms | - | +| Token Refresh | `AUTH_REFRESH_RATE_LIMIT_LIMIT` | 10 attempts | 5 min | +| Token Refresh (TTL) | `AUTH_REFRESH_RATE_LIMIT_TTL` | 300000 ms | - | + +### CAPTCHA Configuration + +| Setting | Env Variable | Default | Description | +| ----------------- | ------------------------------ | ------- | ------------------------------------ | +| CAPTCHA Threshold | `LOGIN_CAPTCHA_AFTER_ATTEMPTS` | 3 | Show CAPTCHA after N failed attempts | +| CAPTCHA Always On | `AUTH_CAPTCHA_ALWAYS_ON` | false | Require CAPTCHA for all logins | + +### Adjusting Auth Rate Limits + +**In Production (requires restart):** + +```bash +# Edit .env file +LOGIN_RATE_LIMIT_LIMIT=10 # Increase to 10 attempts +LOGIN_RATE_LIMIT_TTL=1800000 # Extend window to 30 minutes + +# Restart backend +docker compose restart backend +``` + +**Temporary Reset via Redis (immediate, no restart):** + +```bash +# Check current rate limit for a key +redis-cli GET "auth-login:" + +# Delete a rate limit record to allow immediate retry +redis-cli DEL "auth-login:" +``` + +--- + +## Global API Rate Limits + +### Configuration + +Global rate limits are
applied via the `@RateLimit` decorator: + +```typescript +@RateLimit({ limit: 100, ttl: 60 }) // 100 requests per minute +@Controller('invoices') +export class InvoicesController { ... } +``` + +### Common Rate Limit Settings + +| Endpoint | Limit | TTL | Notes | +| ------------- | ----- | --- | --------------------- | +| Invoices | 100 | 60s | High-traffic endpoint | +| Subscriptions | 100 | 60s | High-traffic endpoint | +| Catalog | 200 | 60s | Cached, higher limit | +| Orders | 50 | 60s | Write operations | +| Profile | 60 | 60s | Standard limit | + +### Adjusting Global Rate Limits + +Global rate limits are defined in code. To adjust: + +1. Modify the `@RateLimit` decorator in the controller +2. Deploy the change + +```typescript +// Before +@RateLimit({ limit: 50, ttl: 60 }) + +// After (double the limit) +@RateLimit({ limit: 100, ttl: 60 }) +``` + +--- + +## External API Request Queues + +### WHMCS Queue Configuration + +| Setting | Env Variable | Default | Description | +| ------------ | -------------------------- | ------- | ----------------------- | +| Concurrency | `WHMCS_QUEUE_CONCURRENCY` | 15 | Max parallel requests | +| Interval Cap | `WHMCS_QUEUE_INTERVAL_CAP` | 300 | Max requests per minute | +| Timeout | `WHMCS_QUEUE_TIMEOUT_MS` | 30000 | Request timeout (ms) | + +### Salesforce Queue Configuration + +| Setting | Env Variable | Default | Description | +| ------------------------ | ----------------------------- | ------- | ----------------------- | +| Standard Concurrency | `SF_QUEUE_CONCURRENCY` | 10 | Standard operations | +| Long-Running Concurrency | `SF_LONG_RUNNING_CONCURRENCY` | 5 | Bulk operations | +| Interval Cap | `SF_QUEUE_INTERVAL_CAP` | 200 | Max requests per minute | +| Timeout | `SF_QUEUE_TIMEOUT_MS` | 30000 | Request timeout (ms) | + +### Adjusting Queue Limits + +**Production Adjustment:** + +```bash +# Edit .env file +WHMCS_QUEUE_CONCURRENCY=20 # Increase concurrent requests +WHMCS_QUEUE_INTERVAL_CAP=500 # Increase requests per 
minute + +# Restart backend +docker compose restart backend +``` + +### Queue Health Monitoring + +```bash +# Check queue metrics +curl http://localhost:4000/health/queues | jq '.' + +# Expected output: +{ + "whmcs": { + "health": "healthy", + "metrics": { + "queueSize": 0, + "pendingRequests": 2, + "failedRequests": 0 + } + }, + "salesforce": { + "health": "healthy", + "metrics": { ... }, + "dailyUsage": { "used": 5000, "limit": 15000 } + } +} +``` + +--- + +## SSE Connection Limits + +### Configuration + +```typescript +// Per-user SSE connection limit (in-memory) +private readonly maxPerUser = 3; +``` + +This prevents a single user from opening unlimited SSE connections. + +### Adjusting SSE Limits + +This requires a code change in `realtime-connection-limiter.service.ts`: + +```typescript +// Change from +private readonly maxPerUser = 3; + +// To +private readonly maxPerUser = 5; +``` + +--- + +## Bypassing Rate Limits for Testing + +### Temporary Bypass via Redis + +```bash +# Clear all rate limit keys for testing +redis-cli KEYS "auth-*" | xargs redis-cli DEL +redis-cli KEYS "rate-limit:*" | xargs redis-cli DEL + +# Clear specific user's rate limit +redis-cli KEYS "**" | xargs redis-cli DEL +``` + +### Using SkipRateLimit Decorator + +For development/testing routes: + +```typescript +@SkipRateLimit() +@Get('test-endpoint') +async testEndpoint() { ... } +``` + +### Environment-Based Bypass + +Add a development bypass in configuration: + +```bash +# In .env (development only!) +RATE_LIMIT_BYPASS_ENABLED=true +``` + +```typescript +// In guard +if (this.configService.get("RATE_LIMIT_BYPASS_ENABLED") === "true") { + return true; +} +``` + +> **Warning**: Never enable bypass in production! 
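To make the warning above harder to violate by accident, the bypass check could also refuse to activate when the process is running in production. This is a minimal sketch, assuming `NODE_ENV` is set to `production` in production deployments; the function name `isRateLimitBypassAllowed` and the plain-object `env` parameter are hypothetical, chosen so the logic can be tested outside the guard.

```typescript
// Hypothetical helper, not existing portal code.
type Env = Record<string, string | undefined>;

// Allows the bypass only when it is explicitly requested AND the process
// is not running in production (assumption: NODE_ENV marks production).
function isRateLimitBypassAllowed(env: Env): boolean {
  const requested = env.RATE_LIMIT_BYPASS_ENABLED === "true";
  const isProduction = env.NODE_ENV === "production";
  return requested && !isProduction;
}
```

Inside the guard, the existing `configService` check could be replaced by a call to this helper with the process environment, so a stray `RATE_LIMIT_BYPASS_ENABLED=true` in a production `.env` file has no effect.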
+
+---
+
+## Signs of Rate Limit Issues
+
+### User-Facing Symptoms
+
+| Symptom                    | Possible Cause      | Investigation             |
+| -------------------------- | ------------------- | ------------------------- |
+| "Too many requests" errors | Rate limit exceeded | Check Redis keys, logs    |
+| Login failures             | Auth rate limit     | Check `auth-login:*` keys |
+| Slow API responses         | Queue backlog       | Check `/health/queues`    |
+| 429 errors in logs         | Any rate limit      | Check logs for specifics  |
+
+### Monitoring Indicators
+
+| Metric            | Warning       | Critical | Action                   |
+| ----------------- | ------------- | -------- | ------------------------ |
+| 429 error rate    | >1%           | >5%      | Review rate limits       |
+| Queue size        | >10           | >50      | Increase concurrency     |
+| Average wait time | >1s           | >5s      | Scale or increase limits |
+| CAPTCHA triggers  | Unusual spike | -        | Possible attack          |
+
+### Log Analysis
+
+```bash
+# Find rate limit exceeded events
+grep "Rate limit exceeded" /var/log/bff/combined.log | tail -20
+
+# Find 429 responses
+grep '"statusCode":429' /var/log/bff/combined.log | tail -20
+
+# Count rate limit events by path
+grep "Rate limit exceeded" /var/log/bff/combined.log | \
+  jq -r '.path' | sort | uniq -c | sort -rn
+```
+
+---
+
+## Troubleshooting
+
+### Too Many 429 Errors
+
+**Diagnosis:**
+
+```bash
+# Check which endpoints are rate limited
+grep "Rate limit exceeded" /var/log/bff/combined.log | \
+  jq '{path: .path, key: .key}' | head -20
+
+# Check queue health
+curl http://localhost:4000/health/queues
+```
+
+**Resolution:**
+
+1. Identify the affected endpoint
+2. Check if limit is appropriate for traffic
+3. Increase limit if legitimate traffic
+4. Add caching if requests are repetitive
+
+### Legitimate Users Being Blocked
+
+**Diagnosis:**
+
+```bash
+# Check rate limit state for specific key
+redis-cli KEYS "**"
+redis-cli GET "auth-login:"
+```
+
+**Resolution:**
+
+```bash
+# Clear the user's rate limit record
+redis-cli DEL "auth-login:"
+```
+
+### External API Rate Limit Violations
+
+**WHMCS Rate Limiting:**
+
+```bash
+# Check queue metrics
+curl http://localhost:4000/health/queues/whmcs
+
+# Reduce concurrency if WHMCS is overloaded
+WHMCS_QUEUE_CONCURRENCY=5
+WHMCS_QUEUE_INTERVAL_CAP=100
+```
+
+**Salesforce API Limits:**
+
+```bash
+# Check daily API usage
+curl http://localhost:4000/health/queues/salesforce | jq '.dailyUsage'
+
+# If approaching limit, reduce requests
+# Consider caching more data
+```
+
+### Redis Connection Issues
+
+If rate limiting fails due to Redis:
+
+```bash
+# Check Redis connectivity
+redis-cli PING
+
+# The guard fails open on Redis errors (allows request)
+# Check logs for "Rate limiter error - failing open"
+```
+
+---
+
+## Best Practices
+
+### Setting Rate Limits
+
+1. **Start Conservative** - Begin with lower limits, increase as needed
+2. **Monitor Before Adjusting** - Understand traffic patterns first
+3. **Consider User Experience** - Limits should rarely impact normal use
+4. **Document Changes** - Track why limits were adjusted
+
+### Rate Limit Strategies
+
+| Strategy   | Use Case                | Implementation         |
+| ---------- | ----------------------- | ---------------------- |
+| IP-based   | Anonymous endpoints     | Default behavior       |
+| User-based | Authenticated endpoints | Include user ID in key |
+| Combined   | Sensitive endpoints     | IP + User-Agent hash   |
+| Tiered     | Different user classes  | Custom logic           |
+
+### Performance Considerations
+
+- **Redis Latency** - Keep Redis co-located with BFF
+- **Key Expiration** - Use TTL to prevent Redis bloat
+- **Fail Open** - Rate limiter allows requests if Redis fails
+- **Logging** - Log blocked requests for analysis
+
+---
+
+## Rate Limit Response Headers
+
+The BFF includes standard rate limit headers:
+
+```http
+X-RateLimit-Limit: 100
+X-RateLimit-Remaining: 95
+X-RateLimit-Reset: 1704110400
+Retry-After: 60
+```
+
+Clients can use these to implement backoff.
+
+---
+
+## Related Documents
+
+- [Incident Response](./incident-response.md)
+- [Monitoring Setup](./monitoring-setup.md)
+- [External Dependencies](./external-dependencies.md)
+- [Queue Management](./queue-management.md)
+
+---
+
+**Last Updated:** December 2025
diff --git a/docs/operations/release-procedures.md b/docs/operations/release-procedures.md
new file mode 100644
index 00000000..45c2522b
--- /dev/null
+++ b/docs/operations/release-procedures.md
@@ -0,0 +1,402 @@
+# Release and Deployment Procedures
+
+This document covers pre-deployment checklists, deployment procedures, post-deployment verification, and rollback procedures for the Customer Portal.
+
+---
+
+## Deployment Overview
+
+| Environment | Method         | Script             | Notes                                |
+| ----------- | -------------- | ------------------ | ------------------------------------ |
+| Development | Local          | `pnpm dev`         | Apps run locally, services in Docker |
+| Production  | Docker Compose | `pnpm prod:deploy` | Full containerized deployment        |
+| Updates     | Docker Compose | `pnpm prod:update` | Zero-downtime application updates    |
+
+### Available Commands
+
+```bash
+pnpm prod:deploy   # Full deployment (build + start + migrate)
+pnpm prod:start    # Start all production services
+pnpm prod:stop     # Stop all production services
+pnpm prod:update   # Zero-downtime update (rebuild and recreate apps)
+pnpm prod:status   # Show service status and health
+pnpm prod:logs     # Show service logs
+pnpm prod:backup   # Create database backup
+pnpm prod:cleanup  # Clean up old containers and images
+```
+
+---
+
+## Pre-Deployment Checklist
+
+### Code Review
+
+- [ ] All changes have been reviewed and approved
+- [ ] No console.log/console.error statements in production code
+- [ ] No hardcoded secrets or credentials
+- [ ] TypeScript compilation passes (`pnpm type-check`)
+- [ ] Linting passes (`pnpm lint`)
+- [ ] Tests pass (`pnpm test`)
+
+### Environment Configuration
+
+- [ ] All required environment variables are set in `.env`
+- [ ] Database URL is correct for production
+- [ ] Redis URL is correct for production
+- [ ] External API credentials are valid (Salesforce, WHMCS, Freebit)
+- [ ] CORS_ORIGIN matches production domain
+- [ ] JWT_SECRET is secure and unique
+
+**Required Environment Variables:**
+
+```bash
+DATABASE_URL          # PostgreSQL connection string
+REDIS_URL             # Redis connection string
+JWT_SECRET            # Secure secret (min 32 chars)
+POSTGRES_PASSWORD     # Database password
+CORS_ORIGIN           # Frontend domain
+NEXT_PUBLIC_API_BASE  # BFF API URL
+BFF_PORT              # Backend port (usually 4000)
+```
+
+### Database Migration Check
+
+- [ ] Review pending migrations (`npx prisma migrate status`)
+- [ ] Test migrations on staging/local first
+- [ ] Create database backup before applying migrations
+- [ ] Prepare rollback SQL if migration is destructive
+- [ ] Estimate migration duration for large tables
+
+### Dependency Check
+
+- [ ] Run security audit (`pnpm security:check`)
+- [ ] No high/critical vulnerabilities
+- [ ] All dependencies are at expected versions
+- [ ] Lock file is up to date (`pnpm-lock.yaml`)
+
+### Communication
+
+- [ ] Notify team of deployment schedule
+- [ ] Schedule during low-traffic window if possible
+- [ ] Prepare customer communication if downtime expected
+- [ ] Ensure on-call engineer is available
+
+---
+
+## Deployment Procedure
+
+### Standard Deployment (First Time)
+
+```bash
+# 1. Create database backup (if updating existing system)
+pnpm prod:backup
+
+# 2. Full deployment
+pnpm prod:deploy
+```
+
+This command:
+
+1. Validates environment configuration
+2. Builds production Docker images
+3. Starts database and cache services
+4. Waits for database readiness
+5. Runs Prisma migrations
+6. Starts frontend and backend services
+7. Performs health checks
+
+### Application Update (Zero-Downtime)
+
+For updates that don't require database migrations:
+
+```bash
+# 1. Create database backup
+pnpm prod:backup
+
+# 2. Update applications
+pnpm prod:update
+```
+
+This rebuilds and recreates frontend and backend containers without stopping the database.
+
+### Database Migration Deployment
+
+For deployments with schema changes:
+
+```bash
+# 1. Create database backup
+pnpm prod:backup
+
+# 2. Stop application to prevent writes during migration
+pnpm prod:stop
+
+# 3. Start only database
+docker compose -f docker/prod/docker-compose.yml up -d database
+
+# 4. Run migrations
+docker compose -f docker/prod/docker-compose.yml run --rm backend pnpm db:migrate
+
+# 5. Verify migration success
+docker compose -f docker/prod/docker-compose.yml exec database psql -U portal -d portal_prod -c "SELECT * FROM _prisma_migrations ORDER BY finished_at DESC LIMIT 5;"
+
+# 6. Start all services
+pnpm prod:start
+
+# 7. Verify application health
+pnpm prod:status
+```
+
+---
+
+## Post-Deployment Verification
+
+### Immediate Checks (0-5 minutes)
+
+- [ ] Health endpoints return `ok`
+  ```bash
+  curl http://localhost:4000/health
+  curl http://localhost:3000/_health
+  ```
+- [ ] No error spikes in logs
+  ```bash
+  pnpm prod:logs backend | grep -i error | tail -20
+  ```
+- [ ] Database migrations applied successfully
+- [ ] Redis connectivity verified
+
+### Functional Checks (5-15 minutes)
+
+- [ ] User can log in to portal
+- [ ] Dashboard loads correctly
+- [ ] Invoice list displays
+- [ ] Subscription list displays
+- [ ] Catalog products load
+
+### Integration Checks (15-30 minutes)
+
+- [ ] Salesforce connectivity verified
+  ```bash
+  curl http://localhost:4000/auth/health-check | jq '.services.salesforce'
+  ```
+- [ ] WHMCS connectivity verified
+  ```bash
+  curl http://localhost:4000/auth/health-check | jq '.services.whmcs'
+  ```
+- [ ] Queue health verified
+  ```bash
+  curl http://localhost:4000/health/queues
+  ```
+
+### Monitoring Checks
+
+- [ ] Metrics are being collected
+- [ ] No alert triggers from deployment
+- [ ] Log aggregation is working
+- [ ] Error rates are normal
+
+---
+
+## Rollback Procedures
+
+### Application Rollback (No DB Changes)
+
+If deployment fails without database changes:
+
+```bash
+# 1. Stop current deployment
+pnpm prod:stop
+
+# 2. Checkout previous version
+git checkout
+
+# 3. Rebuild and deploy
+pnpm prod:deploy
+```
+
+### Application Rollback with Docker Images
+
+If previous images are available:
+
+```bash
+# 1. Stop current services
+pnpm prod:stop
+
+# 2. Start with previous image tags
+# (`docker compose up` has no -e flag; pass the tags as shell environment
+# variables, assuming the compose file reads BACKEND_IMAGE and FRONTEND_IMAGE)
+BACKEND_IMAGE=portal-backend:previous \
+FRONTEND_IMAGE=portal-frontend:previous \
+docker compose -f docker/prod/docker-compose.yml up -d --no-build
+```
+
+### Database Rollback
+
+If database migration needs to be reverted:
+
+**Option 1: Restore from Backup**
+
+```bash
+# 1. Stop application
+pnpm prod:stop
+
+# 2. Restore database (-T disables TTY allocation so stdin redirection works)
+docker compose exec -T database psql -U portal -d portal_prod < backup_YYYYMMDD_HHMMSS.sql
+
+# 3. Checkout previous code version
+git checkout
+
+# 4. Rebuild and restart
+pnpm prod:deploy
+```
+
+**Option 2: Manual Rollback SQL**
+
+```bash
+# 1. Stop application
+pnpm prod:stop
+
+# 2. Apply rollback script (if prepared)
+docker compose exec -T database psql -U portal -d portal_prod < rollback_migration_YYYYMMDD.sql
+
+# 3. Manually remove migration record
+docker compose exec database psql -U portal -d portal_prod -c "DELETE FROM _prisma_migrations WHERE migration_name = '20240115_migration_name';"
+
+# 4. Restart with previous code
+git checkout
+pnpm prod:deploy
+```
+
+### Emergency Rollback
+
+For critical failures requiring immediate action:
+
+```bash
+# 1. Immediately stop all services
+pnpm prod:stop
+
+# 2. Restore from most recent backup
+docker compose exec -T database psql -U portal -d portal_prod < /path/to/latest_backup.sql
+
+# 3. Deploy last known good version
+git checkout
+pnpm prod:deploy
+
+# 4. Notify team
+# Send incident notification
+```
+
+---
+
+## Feature Flags
+
+The portal does not currently use a formal feature flag system. Feature availability is controlled through:
+
+1. **Environment Variables** - Toggle features via configuration
+2. **Conditional Rendering** - Frontend checks for feature availability
+3. **Backend Feature Checks** - API endpoints check configuration
+
+### Adding a Feature Toggle
+
+```typescript
+// Backend: Check environment variable
+const featureEnabled = this.configService.get("FEATURE_NEW_CHECKOUT", "false") === "true";
+
+// Frontend: Check feature availability
+if (process.env.NEXT_PUBLIC_FEATURE_NEW_CHECKOUT === "true") {
+  // Render new feature
+}
+```
+
+### Emergency Feature Disable
+
+To disable a feature without redeployment:
+
+1. Update environment variable in `.env`
+2. Recreate affected services so they pick up the new value (`docker compose restart` keeps the existing container environment):
+   ```bash
+   docker compose up -d --force-recreate backend frontend
+   ```
+
+`NEXT_PUBLIC_*` variables are inlined into the frontend bundle at build time, so frontend toggles take effect only after a rebuild.
+
+---
+
+## Deployment Timeline Template
+
+| Time  | Action                          | Owner      | Notes                     |
+| ----- | ------------------------------- | ---------- | ------------------------- |
+| T-24h | Announce deployment window      | Tech Lead  | Notify all stakeholders   |
+| T-2h  | Final code review               | Developers | Verify all changes merged |
+| T-1h  | Pre-deployment checklist        | DevOps     | Complete all checks       |
+| T-30m | Create backup                   | DevOps     | Verify backup integrity   |
+| T-15m | Notify team deployment starting | DevOps     | Slack/Teams message       |
+| T-0   | Execute deployment              | DevOps     | Run deployment commands   |
+| T+5m  | Immediate verification          | DevOps     | Health checks             |
+| T+15m | Functional verification         | QA/DevOps  | Test key flows            |
+| T+30m | All-clear or rollback decision  | Tech Lead  | Confirm success           |
+| T+1h  | Post-deployment monitoring      | DevOps     | Watch metrics             |
+| T+24h | Close deployment                | Tech Lead  | Final verification        |
+
+---
+
+## Troubleshooting
+
+### Build Failures
+
+```bash
+# Check Docker daemon
+docker info
+
+# Check disk space
+df -h
+
+# Clean Docker resources
+docker system prune -a
+```
+
+### Migration Failures
+
+```bash
+# Check migration status
+npx prisma migrate status
+
+# View migration history
+docker compose exec database psql -U portal -d portal_prod -c "SELECT * FROM _prisma_migrations;"
+
+# Reset migration (development only!)
+npx prisma migrate reset
+```
+
+### Service Startup Failures
+
+```bash
+# Check service logs
+pnpm prod:logs backend
+pnpm prod:logs frontend
+
+# Check container status
+docker compose ps -a
+
+# Check resource usage
+docker stats
+```
+
+### Database Connection Issues
+
+```bash
+# Test database connectivity
+docker compose exec database pg_isready -U portal -d portal_prod
+
+# Check connection count
+docker compose exec database psql -U portal -d portal_prod -c "SELECT count(*) FROM pg_stat_activity;"
+```
+
+---
+
+## Related Documents
+
+- [Deployment Guide](../getting-started/deployment.md)
+- [Database Operations](./database-operations.md)
+- [Incident Response](./incident-response.md)
+- [Monitoring Setup](./monitoring-setup.md)
+
+---
+
+**Last Updated:** December 2025