Update README.md to Enhance Documentation Clarity and Add New Sections

- Added a new section for Release Procedures, detailing deployment and rollback processes.
- Updated the System Operations section to include Monitoring Setup, Rate Limit Tuning, and Customer Data Management for improved operational guidance.
- Reformatted the table structure for better readability and consistency across documentation.
2025-12-23 16:08:15 +09:00 · 90ab71b94d
commit 90ab71b94d
parent 12eb9fd763
5 changed files with 1602 additions and 9 deletions
--- a/docs/README.md
+++ b/docs/README.md
@@ -148,14 +148,18 @@ Feature guides explaining how the portal functions:
 | [External Dependencies](./operations/external-dependencies.md) | Integration health checks     |
 | [Queue Management](./operations/queue-management.md)           | BullMQ job monitoring         |
 | [External Processes](./operations/external-processes.md)       | Team handoffs and workflows   |
+| [Release Procedures](./operations/release-procedures.md)       | Deployment and rollback       |

 ### System Operations

 | Document                                                             | Description                |
-| ------------------------------------------------------------------ | -------------------------- |
+| -------------------------------------------------------------------- | -------------------------- |
 | [Logging](./operations/logging.md)                                   | Centralized logging system |
 | [Security Monitoring](./operations/security-monitoring.md)           | Security monitoring setup  |
 | [Subscription Management](./operations/subscription-management.md)   | Service management         |
+| [Monitoring Setup](./operations/monitoring-setup.md)                 | Metrics and dashboards     |
+| [Rate Limit Tuning](./operations/rate-limit-tuning.md)               | Rate limit configuration   |
+| [Customer Data Management](./operations/customer-data-management.md) | GDPR and data procedures   |

 ---

@@ -192,10 +196,12 @@ Historical documents kept for reference:
 ### DevOps / Operations

 1. [Deployment](./getting-started/deployment.md)
-2. [Incident Response](./operations/incident-response.md)
-3. [Provisioning Runbook](./operations/provisioning-runbook.md)
-4. [Database Operations](./operations/database-operations.md)
-5. [External Dependencies](./operations/external-dependencies.md)
+2. [Release Procedures](./operations/release-procedures.md)
+3. [Incident Response](./operations/incident-response.md)
+4. [Monitoring Setup](./operations/monitoring-setup.md)
+5. [Database Operations](./operations/database-operations.md)
+6. [External Dependencies](./operations/external-dependencies.md)
+7. [Rate Limit Tuning](./operations/rate-limit-tuning.md)

 ---

--- a/docs/operations/customer-data-management.md
+++ b/docs/operations/customer-data-management.md
@@ -0,0 +1,415 @@
+# Customer Data Management (GDPR)
+
+This document covers procedures for handling customer data in compliance with GDPR and data protection regulations.
+
+---
+
+## Data Storage Overview
+
+Customer data is stored across multiple systems:
+
+| System                  | Data Stored                                           | Retention                   | Notes                        |
+| ----------------------- | ----------------------------------------------------- | --------------------------- | ---------------------------- |
+| **Portal (PostgreSQL)** | User accounts, ID mappings, audit logs, notifications | Active account lifetime     | Auth data only               |
+| **WHMCS**               | Billing, invoices, payment methods, addresses         | Legal requirement (7 years) | System of record for billing |
+| **Salesforce**          | CRM data, orders, cases, contacts                     | Business records            | System of record for CRM     |
+| **Redis**               | Sessions, cache, rate limits                          | TTL-based (minutes to days) | Temporary data               |
+
+### Portal Database Tables with PII
+
+| Table                        | PII Fields                           | Purpose              |
+| ---------------------------- | ------------------------------------ | -------------------- |
+| `users`                      | `email`, `passwordHash`, `mfaSecret` | Authentication       |
+| `id_mappings`                | Links to WHMCS/Salesforce IDs        | Identity federation  |
+| `audit_logs`                 | `ipAddress`, `userAgent`, `userId`   | Security audit trail |
+| `residence_card_submissions` | Document images                      | ID verification      |
+| `notifications`              | User notifications                   | In-app messaging     |
+| `sim_call_history_*`         | Phone numbers, call details          | Usage records        |
+| `sim_sms_history`            | Phone numbers, SMS details           | Usage records        |
+
+---
+
+## Data Subject Rights
+
+Under GDPR, customers have the following rights:
+
+| Right                  | Portal Support     | Notes                     |
+| ---------------------- | ------------------ | ------------------------- |
+| Right of Access        | Manual export      | See Data Export section   |
+| Right to Rectification | WHMCS self-service | Customer updates in WHMCS |
+| Right to Erasure       | Manual process     | See Data Deletion section |
+| Right to Portability   | Manual export      | See Data Export section   |
+| Right to Object        | Manual process     | Opt-out of processing     |
+
+---
+
+## Data Deletion Procedures
+
+### Overview
+
+Complete customer data deletion requires coordination across all systems:
+
+1. Portal database deletion
+2. WHMCS account handling
+3. Salesforce record handling
+4. Redis cache clearing
+5. Audit trail retention
+
+### Pre-Deletion Checklist
+
+- [ ] Verify customer identity (authentication or CS verification)
+- [ ] Check for active subscriptions (must be cancelled first)
+- [ ] Check for unpaid invoices (must be settled first)
+- [ ] Check legal retention requirements (invoices, tax records)
+- [ ] Document the deletion request with timestamp
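
The checklist above can be expressed as a small guard that lists what still blocks a deletion. This is a minimal sketch: the `DeletionRequest` shape and its field names are hypothetical and are not taken from the portal schema.

```typescript
// Hypothetical request shape; these field names are illustrative only.
interface DeletionRequest {
  identityVerified: boolean;
  activeSubscriptions: number;
  unpaidInvoices: number;
  legalRetentionReviewed: boolean;
  requestLoggedAt: Date | null;
}

// Returns the unmet checklist items; an empty list means deletion may proceed.
function deletionBlockers(req: DeletionRequest): string[] {
  const blockers: string[] = [];
  if (!req.identityVerified) blockers.push("identity not verified");
  if (req.activeSubscriptions > 0) blockers.push("active subscriptions must be cancelled");
  if (req.unpaidInvoices > 0) blockers.push("unpaid invoices must be settled");
  if (!req.legalRetentionReviewed) blockers.push("legal retention not reviewed");
  if (req.requestLoggedAt === null) blockers.push("request not documented");
  return blockers;
}
```

Returning the full list rather than a boolean lets an operator see every outstanding item in one pass.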
+
+### Step 1: Portal Database Deletion
+
+```sql
+-- 1. Get user information
+SELECT u.id, u.email, im.whmcs_client_id, im.sf_account_id
+FROM users u
+LEFT JOIN id_mappings im ON u.id = im.user_id
+WHERE u.email = 'customer@example.com';
+
+-- 2. Delete notifications
+DELETE FROM notifications WHERE user_id = '<user_id>';
+
+-- 3. Delete residence card submissions
+DELETE FROM residence_card_submissions WHERE user_id = '<user_id>';
+
+-- 4. Delete SIM usage data (if applicable)
+-- Note: Check if SIM account is linked to this user first
+DELETE FROM sim_usage_daily WHERE account IN (
+  SELECT account FROM sim_voice_options WHERE account = '<sim_account>'
+);
+DELETE FROM sim_call_history_domestic WHERE account = '<sim_account>';
+DELETE FROM sim_call_history_international WHERE account = '<sim_account>';
+DELETE FROM sim_sms_history WHERE account = '<sim_account>';
+DELETE FROM sim_voice_options WHERE account = '<sim_account>';
+
+-- 5. Delete ID mapping (cascades from user deletion)
+-- The id_mappings table has onDelete: Cascade
+
+-- 6. Delete user (cascades audit_logs user reference to NULL, deletes id_mapping)
+DELETE FROM users WHERE id = '<user_id>';
+```
+
+**Using the Mappings Service:**
+
+```typescript
+// Delete mapping programmatically (clears cache too)
+await mappingsService.deleteMapping(userId);
+```
+
+### Step 2: Audit Log Handling
+
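+Audit logs may need to be retained for security compliance. Run this step before the final `DELETE FROM users` in Step 1: per the comment there, that delete sets the `audit_logs` user reference to NULL, after which the `WHERE user_id = '<user_id>'` filters below match no rows. Options: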
+
+**Option A: Anonymize (Recommended)**
+
+```sql
+-- Anonymize audit logs (keeps security trail, removes PII)
+UPDATE audit_logs
+SET user_id = NULL,
+    ip_address = 'ANONYMIZED',
+    user_agent = 'ANONYMIZED',
+    details = jsonb_set(
+      COALESCE(details, '{}'::jsonb),
+      '{anonymized}',
+      'true'::jsonb
+    )
+WHERE user_id = '<user_id>';
+```
+
+**Option B: Delete (If Legally Permitted)**
+
+```sql
+DELETE FROM audit_logs WHERE user_id = '<user_id>';
+```
+
+### Step 3: Redis Cache Clearing
+
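+```bash
+# Clear user-specific cache keys.
+# --scan iterates incrementally; KEYS blocks Redis while it walks the whole keyspace.
+# xargs -r (GNU) skips DEL when nothing matches.
+redis-cli --scan --pattern "user:*:<user_id>*" | xargs -r redis-cli DEL
+redis-cli --scan --pattern "session:*:<user_id>*" | xargs -r redis-cli DEL
+redis-cli --scan --pattern "mapping:*:<user_id>*" | xargs -r redis-cli DEL
+
+# Clear refresh token families
+redis-cli --scan --pattern "refresh:user:<user_id>*" | xargs -r redis-cli DEL
+# WARNING: this pattern matches every user's token families. List them, then
+# delete only the families that belong to this user.
+redis-cli --scan --pattern "refresh:family:*"
+
+# Rate limit records are keyed by IP, not user, and expire on their own TTL.
+# List them; delete only a specific IP hash if it must be cleared.
+redis-cli --scan --pattern "auth-login:*"
+```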
+
+### Step 4: WHMCS Account Handling
+
+WHMCS does not support full account deletion. Options:
+
+**Option A: Close Account (Recommended)**
+
+1. Cancel all active services
+2. Set account status to "Closed"
+3. Anonymize personal fields via WHMCS Admin
+4. Document closure date
+
+**Option B: Anonymize via API**
+
+```bash
+# Update client to anonymized data
+curl -X POST "$WHMCS_API_URL" \
+  -d "identifier=$WHMCS_API_IDENTIFIER" \
+  -d "secret=$WHMCS_API_SECRET" \
+  -d "action=UpdateClient" \
+  -d "clientid=<whmcs_client_id>" \
+  -d "firstname=Deleted" \
+  -d "lastname=User" \
+  -d "email=deleted_<whmcs_client_id>@deleted.local" \
+  -d "address1=Deleted" \
+  -d "city=Deleted" \
+  -d "state=Deleted" \
+  -d "postcode=000-0000" \
+  -d "phonenumber=000-0000-0000" \
+  -d "status=Closed" \
+  -d "responsetype=json"
+```
+
+### Step 5: Salesforce Record Handling
+
+Salesforce records often have legal retention requirements:
+
+**For Personal Data:**
+
+1. Work with Salesforce Admin
+2. Consider anonymization vs deletion
+3. Check integration impact (linked Orders, Cases)
+
+**Anonymization Approach:**
+
+- Update Account name to "Deleted Account - [ID]"
+- Clear personal fields (phone, address if not needed)
+- Keep transactional records with anonymized references
+
+---
+
+## Data Export Procedures
+
+### Customer Data Export Request
+
+When a customer requests their data:
+
+#### 1. Portal Data Export
+
+```sql
+-- Export user data
+SELECT
+  u.id,
+  u.email,
+  u.email_verified,
+  u.created_at,
+  u.last_login_at,
+  im.whmcs_client_id,
+  im.sf_account_id
+FROM users u
+LEFT JOIN id_mappings im ON u.id = im.user_id
+WHERE u.email = 'customer@example.com';
+
+-- Export audit log (security events)
+SELECT
+  action,
+  resource,
+  success,
+  created_at
+FROM audit_logs
+WHERE user_id = '<user_id>'
+ORDER BY created_at DESC;
+
+-- Export notifications
+SELECT
+  type,
+  title,
+  message,
+  read,
+  created_at
+FROM notifications
+WHERE user_id = '<user_id>'
+ORDER BY created_at DESC;
+
+-- Export SIM usage history (if applicable)
+SELECT
+  call_date,
+  call_time,
+  called_to,
+  duration_sec,
+  charge_yen
+FROM sim_call_history_domestic
+WHERE account = '<sim_account>'
+ORDER BY call_date DESC;
+```
+
+#### 2. WHMCS Data Export
+
+Request via WHMCS Admin:
+
+- Client Details
+- Invoices
+- Services/Subscriptions
+- Tickets/Support History
+- Transaction History
+
+#### 3. Salesforce Data Export
+
+Request via Salesforce Admin:
+
+- Account record
+- Contact record
+- Order history
+- Case history
+- Opportunities
+
+### Export Format
+
+Provide data in machine-readable format:
+
+- JSON for structured data
+- CSV for tabular data
+- PDF for documents (invoices)
+
+---
+
+## PII Handling During Debugging
+
+### Safe Logging Practices
+
+The BFF uses Pino with automatic PII redaction. Sensitive fields are sanitized:
+
+```json
+{
+  "email": "cust***@example.com",
+  "password": "[REDACTED]",
+  "token": "[REDACTED]",
+  "authorization": "[REDACTED]"
+}
+```
+
+### What NOT to Log
+
+- Full email addresses (use masked version)
+- Passwords or password hashes
+- JWT tokens
+- API keys or secrets
+- Credit card numbers
+- Full phone numbers
+- Full addresses
+- ID document contents
+
+### Safe Debug Queries
+
+```sql
+-- Use ID instead of email for lookups
+SELECT * FROM users WHERE id = '<uuid>';
+
+-- Mask PII in query results
+SELECT
+  id,
+  CONCAT(LEFT(email, 3), '***', SUBSTRING(email FROM POSITION('@' IN email))) as masked_email,
+  created_at
+FROM users
+WHERE id = '<uuid>';
+```
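
When the masking has to happen in application code rather than in SQL, the same rule can be applied to a string. This is a minimal sketch; `maskEmail` is a hypothetical helper and is not the redaction logic the BFF's Pino setup uses. Unlike the SQL expression, it never reveals more than the local part when that part is shorter than three characters.

```typescript
// Keeps at most the first three characters of the local part, then "***",
// then the domain starting at "@".
function maskEmail(email: string): string {
  const at = email.indexOf("@");
  if (at < 0) return "***"; // not an email address; reveal nothing
  const visible = email.slice(0, Math.min(3, at));
  return visible + "***" + email.slice(at);
}
```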
+
+### Production Debugging
+
+When investigating production issues:
+
+1. **Use correlation IDs** - Search logs by request ID, not user email
+2. **Access minimal data** - Only query what's needed
+3. **Document access** - Note why you accessed customer data
+4. **Use anonymized exports** - When sharing data for analysis
+
+---
+
+## Data Retention Policies
+
+### Recommended Retention Periods
+
+| Data Type                | Retention  | Justification          |
+| ------------------------ | ---------- | ---------------------- |
+| Active user accounts     | Indefinite | Active service         |
+| Closed accounts (portal) | 30 days    | Grace period           |
+| Audit logs               | 2 years    | Security compliance    |
+| Session data (Redis)     | 24 hours   | Active sessions        |
+| Rate limit data          | 15 minutes | Operational            |
+| Invoices                 | 7 years    | Tax/legal requirement  |
+| Support cases            | 5 years    | Service history        |
+| Call/SMS history         | 6 months   | Billing reconciliation |
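
A cleanup job needs to turn these periods into a cutoff check. The sketch below assumes a year is 365 days and six months is 180 days; both are approximations, and the `RETENTION_DAYS` keys are hypothetical names for the rows in the table above.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Retention windows in days, copied from the table (approximated as noted above).
const RETENTION_DAYS = {
  closedAccount: 30,
  auditLog: 2 * 365,
  invoice: 7 * 365,
  supportCase: 5 * 365,
  callSmsHistory: 180,
};

type RetainedType = keyof typeof RETENTION_DAYS;

// True when a record created at `createdAt` is older than its retention window.
function isPastRetention(type: RetainedType, createdAt: Date, now: Date): boolean {
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs > RETENTION_DAYS[type] * DAY_MS;
}
```

Passing `now` explicitly keeps the check deterministic in tests and in backfill runs.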
+
+### Automated Cleanup
+
+```sql
+-- Delete expired notifications (30 days after expiry)
+DELETE FROM notifications
+WHERE expires_at < NOW() - INTERVAL '30 days';
+
+-- Anonymize old audit logs (over 2 years)
+UPDATE audit_logs
+SET ip_address = 'EXPIRED',
+    user_agent = 'EXPIRED'
+WHERE created_at < NOW() - INTERVAL '2 years'
+  AND ip_address != 'EXPIRED';
+```
+
+---
+
+## Compliance Checklist
+
+### Monthly Review
+
+- [ ] Review data access logs for unusual patterns
+- [ ] Verify automated cleanup jobs are running
+- [ ] Check for pending deletion requests
+- [ ] Review new data collection points
+
+### Quarterly Review
+
+- [ ] Audit third-party data sharing
+- [ ] Review retention policies
+- [ ] Update data inventory if schema changed
+- [ ] Staff training on data handling
+
+### Annual Review
+
+- [ ] Full data protection impact assessment
+- [ ] Policy review and updates
+- [ ] Vendor compliance verification
+- [ ] Documentation updates
+
+---
+
+## Emergency Data Breach Response
+
+If a data breach is suspected:
+
+1. **Contain** - Isolate affected systems
+2. **Assess** - Determine scope and data exposed
+3. **Notify** - Inform DPO/legal within 24 hours
+4. **Report** - Notify the supervisory authority within 72 hours of becoming aware of the breach (GDPR Article 33)
+5. **Remediate** - Fix vulnerability and prevent recurrence
+6. **Document** - Full incident report
+
+See [Incident Response](./incident-response.md) for general incident procedures.
+
+---
+
+## Related Documents
+
+- [Incident Response](./incident-response.md)
+- [Database Operations](./database-operations.md)
+- [Logging Guide](./logging.md)
+- [Security Monitoring](./security-monitoring.md)
+
+---
+
+**Last Updated:** December 2025
--- a/docs/operations/monitoring-setup.md
+++ b/docs/operations/monitoring-setup.md
@@ -0,0 +1,375 @@
+# Monitoring Dashboard Setup
+
+This document provides guidance for setting up monitoring infrastructure for the Customer Portal.
+
+---
+
+## Health Endpoints
+
+The BFF exposes several health check endpoints for monitoring:
+
+| Endpoint                        | Purpose                                    | Authentication |
+| ------------------------------- | ------------------------------------------ | -------------- |
+| `GET /health`                   | Core system health (database, cache)       | Public         |
+| `GET /health/queues`            | Request queue metrics (WHMCS, Salesforce)  | Public         |
+| `GET /health/queues/whmcs`      | WHMCS queue details                        | Public         |
+| `GET /health/queues/salesforce` | Salesforce queue details                   | Public         |
+| `GET /health/catalog/cache`     | Catalog cache metrics                      | Public         |
+| `GET /auth/health-check`        | Integration health (DB, WHMCS, Salesforce) | Public         |
+
+### Core Health Response
+
+```json
+{
+  "status": "ok",
+  "checks": {
+    "database": "ok",
+    "cache": "ok"
+  }
+}
+```
+
+**Status Values:**
+
+- `ok` - All systems healthy
+- `degraded` - One or more systems failing
+
+### Queue Health Response
+
+```json
+{
+  "timestamp": "2025-01-15T10:30:00.000Z",
+  "whmcs": {
+    "health": "healthy",
+    "metrics": {
+      "totalRequests": 1500,
+      "completedRequests": 1495,
+      "failedRequests": 5,
+      "queueSize": 0,
+      "pendingRequests": 2,
+      "averageWaitTime": 50,
+      "averageExecutionTime": 250
+    }
+  },
+  "salesforce": {
+    "health": "healthy",
+    "metrics": { ... },
+    "dailyUsage": { "used": 5000, "limit": 15000 }
+  }
+}
+```
+
+---
+
+## Key Metrics to Monitor
+
+### Application Metrics
+
+| Metric              | Source          | Warning       | Critical         | Description           |
+| ------------------- | --------------- | ------------- | ---------------- | --------------------- |
+| Health status       | `/health`       | `degraded`    | Any check `fail` | Core system health    |
+| Response time (p95) | Logs/APM        | >2s           | >5s              | API response latency  |
+| Error rate          | Logs/APM        | >1%           | >5%              | HTTP 5xx responses    |
+| Active connections  | Node.js metrics | >80% capacity | >95% capacity    | Connection pool usage |
+
+### Database Metrics
+
+| Metric                | Source                | Warning   | Critical  | Description                 |
+| --------------------- | --------------------- | --------- | --------- | --------------------------- |
+| Connection pool usage | PostgreSQL            | >80%      | >95%      | Active connections vs limit |
+| Query duration        | PostgreSQL logs       | >500ms    | >2s       | Slow query detection        |
+| Database size         | PostgreSQL            | >80% disk | >90% disk | Storage capacity            |
+| Dead tuples           | `pg_stat_user_tables` | >10%      | >25%      | Vacuum needed               |
+
+### Cache Metrics
+
+| Metric         | Source           | Warning        | Critical       | Description               |
+| -------------- | ---------------- | -------------- | -------------- | ------------------------- |
+| Redis memory   | Redis INFO       | >80% maxmemory | >95% maxmemory | Memory pressure           |
+| Cache hit rate | Application logs | <80%           | <60%           | Cache effectiveness       |
+| Redis latency  | Redis CLI        | >10ms          | >50ms          | Command latency           |
+| Evictions      | Redis INFO       | Any            | High rate      | Memory pressure indicator |
+
+### Queue Metrics
+
+| Metric                | Source           | Warning    | Critical   | Description            |
+| --------------------- | ---------------- | ---------- | ---------- | ---------------------- |
+| WHMCS queue size      | `/health/queues` | >10        | >50        | Pending WHMCS requests |
+| WHMCS failed requests | `/health/queues` | >5         | >20        | Failed API calls       |
+| SF daily API usage    | `/health/queues` | >80% limit | >95% limit | Salesforce API quota   |
+| BullMQ wait queue     | Redis            | >10        | >50        | Job backlog            |
+| BullMQ failed jobs    | Redis            | >5         | >20        | Processing failures    |
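
The WHMCS and Salesforce thresholds in this table can be applied to a `/health/queues` payload with a few comparisons. This is a sketch under assumptions: `QueueSnapshot` is a hypothetical flattening of the response, and reading "warning" as strictly greater than the threshold is one interpretation of the table.

```typescript
type Severity = "ok" | "warning" | "critical";

// Hypothetical flattened view of one queue from /health/queues.
interface QueueSnapshot {
  queueSize: number;
  failedRequests: number;
  dailyUsed?: number;
  dailyLimit?: number;
}

function classify(value: number, warn: number, crit: number): Severity {
  if (value > crit) return "critical";
  if (value > warn) return "warning";
  return "ok";
}

// Returns the most severe level across all checks that apply to the snapshot.
function classifyQueue(s: QueueSnapshot): Severity {
  const levels: Severity[] = [
    classify(s.queueSize, 10, 50),
    classify(s.failedRequests, 5, 20),
  ];
  if (s.dailyUsed !== undefined && s.dailyLimit !== undefined && s.dailyLimit > 0) {
    levels.push(classify(s.dailyUsed / s.dailyLimit, 0.8, 0.95));
  }
  if (levels.indexOf("critical") !== -1) return "critical";
  if (levels.indexOf("warning") !== -1) return "warning";
  return "ok";
}
```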
+
+### External Dependency Metrics
+
+| Metric                   | Source | Warning | Critical | Description          |
+| ------------------------ | ------ | ------- | -------- | -------------------- |
+| Salesforce response time | Logs   | >2s     | >5s      | SF API latency       |
+| WHMCS response time      | Logs   | >2s     | >5s      | WHMCS API latency    |
+| Freebit response time    | Logs   | >3s     | >10s     | Freebit API latency  |
+| External error rate      | Logs   | >1%     | >5%      | Integration failures |
+
+---
+
+## Structured Logging for Metrics
+
+The BFF uses Pino for structured JSON logging. Key fields for metrics extraction:
+
+```json
+{
+  "timestamp": "2025-01-15T10:30:00.000Z",
+  "level": "info",
+  "service": "customer-portal-bff",
+  "correlationId": "req-123",
+  "message": "API call completed",
+  "duration": 250,
+  "path": "/api/invoices",
+  "method": "GET",
+  "statusCode": 200
+}
+```
+
+### Log Queries for Metrics
+
+**Error Count (whole log file, level 50 only):**
+
+```bash
+grep -c '"level":50' /var/log/bff/combined.log
+```
+
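+**Slow Requests (2s or longer):**
+
+```bash
+# Matches durations of 2000-9999 ms or any value with five or more digits
+grep -E '"duration":([2-9][0-9]{3}|[0-9]{5,})' /var/log/bff/combined.log | tail -20
+```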
+
+**External API Errors:**
+
+```bash
+grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20
+```
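
The `grep` commands above scan the whole file and do not apply a time window, so an actual error rate over the last hour needs the JSON parsed. The sketch below assumes one JSON object per line with the `timestamp`, `level`, and `duration` fields shown earlier. Because the sample log line uses a string level while the `grep` pattern uses Pino's numeric level, the hypothetical `isError` helper accepts both.

```typescript
interface LogEntry {
  timestamp: string;
  level: number | string;
  duration?: number;
  path?: string;
}

// Parses newline-delimited JSON, skipping blank and non-JSON lines.
function parseLines(raw: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of raw.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed = JSON.parse(line);
      if (typeof parsed === "object" && parsed !== null) entries.push(parsed as LogEntry);
    } catch (_err) {
      // Not JSON (for example a stack trace line); ignore it.
    }
  }
  return entries;
}

function isError(level: number | string): boolean {
  return typeof level === "number" ? level >= 50 : level === "error" || level === "fatal";
}

// Summarizes entries whose timestamp falls within `windowMs` before `now`.
function summarize(entries: LogEntry[], now: Date, windowMs: number, slowMs: number) {
  const since = now.getTime() - windowMs;
  const recent = entries.filter((e) => Date.parse(e.timestamp) >= since);
  const errors = recent.filter((e) => isError(e.level)).length;
  const slow = recent.filter((e) => (e.duration ?? 0) > slowMs);
  const errorRate = recent.length === 0 ? 0 : errors / recent.length;
  return { total: recent.length, errors, errorRate, slow };
}
```

Calling `summarize(parseLines(fileContents), new Date(), 3600000, 2000)` would give the last-hour error rate and the requests slower than two seconds.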
+
+---
+
+## Grafana Dashboard Setup
+
+### Data Sources
+
+1. **Prometheus** - For application metrics
+2. **Loki** - For log aggregation
+3. **PostgreSQL** - For database metrics
+
+### Recommended Panels
+
+#### Overview Dashboard
+
+1. **System Health** (Stat panel)
+   - Query: `/health` endpoint status
+   - Show: ok/degraded indicator
+
+2. **Request Rate** (Graph panel)
+   - Source: Prometheus/Loki
+   - Show: Requests per second
+
+3. **Error Rate** (Graph panel)
+   - Source: Loki log count
+   - Filter: `level >= 50`
+
+4. **Response Time (p95)** (Graph panel)
+   - Source: Prometheus histogram
+   - Show: 95th percentile latency
+
+#### Queue Dashboard
+
+1. **Queue Depths** (Graph panel)
+   - Source: `/health/queues` endpoint
+   - Show: WHMCS and SF queue sizes
+
+2. **Failed Jobs** (Stat panel)
+   - Source: Redis BullMQ metrics
+   - Show: Failed job count
+
+3. **Salesforce API Usage** (Gauge panel)
+   - Source: `/health/queues/salesforce`
+   - Show: Daily usage vs limit
+
+#### Database Dashboard
+
+1. **Connection Pool** (Gauge panel)
+   - Source: PostgreSQL `pg_stat_activity`
+   - Show: Active connections
+
+2. **Query Performance** (Table panel)
+   - Source: PostgreSQL `pg_stat_statements`
+   - Show: Slowest queries
+
+### Sample Prometheus Scrape Config
+
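+```yaml
+# Prometheus scrapes the text exposition format. The JSON body returned by
+# /health is not in that format, so this target only works if the path serves
+# Prometheus metrics (commonly /metrics) or an exporter such as
+# blackbox_exporter or json_exporter sits in front of it.
+scrape_configs:
+  - job_name: "portal-bff"
+    static_configs:
+      - targets: ["bff:4000"]
+    metrics_path: "/health"
+    scrape_interval: 30s
+```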
+
+---
+
+## CloudWatch Setup (AWS)
+
+### Custom Metrics
+
+Push metrics from health endpoints to CloudWatch:
+
+```bash
+# Example: Push queue depth metric
+aws cloudwatch put-metric-data \
+  --namespace "CustomerPortal" \
+  --metric-name "WhmcsQueueDepth" \
+  --value $(curl -s http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize') \
+  --dimensions Environment=production
+```
+
+### Recommended CloudWatch Alarms
+
+| Alarm         | Metric           | Threshold | Period | Action           |
+| ------------- | ---------------- | --------- | ------ | ---------------- |
+| HighErrorRate | ErrorCount       | >10       | 5 min  | SNS notification |
+| HighLatency   | p95 ResponseTime | >2000ms   | 5 min  | SNS notification |
+| QueueBacklog  | WhmcsQueueDepth  | >50       | 5 min  | SNS notification |
+| DatabaseDown  | HealthStatus     | !=ok      | 1 min  | PagerDuty        |
+| CacheDown     | HealthStatus     | !=ok      | 1 min  | PagerDuty        |
+
+### Log Insights Queries
+
+**Error Summary:**
+
+```sql
+fields @timestamp, @message
+| filter level >= 50
+| stats count() by bin(5m)
+```
+
+**Slow Requests:**
+
+```sql
+fields @timestamp, path, duration
+| filter duration > 2000
+| sort duration desc
+| limit 20
+```
+
+---
+
+## DataDog Setup
+
+### Agent Configuration
+
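+```yaml
+# datadog.yaml (main Agent config): enable log collection
+logs_enabled: true
+
+# conf.d/<integration>.d/conf.yaml: per-source log configuration
+logs:
+  - type: file
+    path: /var/log/bff/combined.log
+    service: customer-portal-bff
+    source: nodejs
+```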
+
+### Custom Metrics
+
+```typescript
+// Example: Report queue metrics to DataDog
+import { StatsD } from "hot-shots";
+
+const dogstatsd = new StatsD({ host: "localhost", port: 8125 });
+
+// Report queue depth
+dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize);
+dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests);
+```
+
+### Recommended Monitors
+
+1. **Health Check Monitor**
+   - Check: HTTP check on `/health`
+   - Alert: When status != ok for 2 minutes
+
+2. **Error Rate Monitor**
+   - Metric: `portal.errors.count`
+   - Alert: When >5% for 5 minutes
+
+3. **Queue Depth Monitor**
+   - Metric: `portal.whmcs.queue_depth`
+   - Alert: When >50 for 5 minutes
+
+---
+
+## Alerting Best Practices
+
+### Alert Priority Levels
+
+| Priority    | Response Time | Examples                                      |
+| ----------- | ------------- | --------------------------------------------- |
+| P1 Critical | 15 minutes    | Portal down, database unreachable             |
+| P2 High     | 1 hour        | Provisioning failing, payment processing down |
+| P3 Medium   | 4 hours       | Degraded performance, high error rate         |
+| P4 Low      | 24 hours      | Minor issues, informational alerts            |
+
+### Alert Routing
+
+```yaml
+# Example Alertmanager-style routing (critical alerts page on-call via PagerDuty)
+routes:
+  - match:
+      severity: critical
+    receiver: pagerduty-oncall
+  - match:
+      severity: warning
+    receiver: slack-ops
+  - match:
+      severity: info
+    receiver: email-team
+```
+
+### Runbook Links
+
+Include runbook links in all alerts:
+
+- Health check failures → [Incident Response](./incident-response.md)
+- Database issues → [Database Operations](./database-operations.md)
+- Queue problems → [Queue Management](./queue-management.md)
+- External API failures → [External Dependencies](./external-dependencies.md)
+
+---
+
+## Monitoring Checklist
+
+### Initial Setup
+
+- [ ] Configure health endpoint scraping (every 30s)
+- [ ] Set up log aggregation (Loki, CloudWatch, or DataDog)
+- [ ] Create overview dashboard with key metrics
+- [ ] Configure P1/P2 alerts for critical failures
+- [ ] Test alert routing to on-call
+
+### Ongoing Maintenance
+
+- [ ] Review alert thresholds quarterly
+- [ ] Check for alert fatigue (too many false positives)
+- [ ] Update dashboards when new features are deployed
+- [ ] Validate runbook links are current
+
+---
+
+## Related Documents
+
+- [Incident Response](./incident-response.md)
+- [Logging Guide](./logging.md)
+- [External Dependencies](./external-dependencies.md)
+- [Queue Management](./queue-management.md)
+
+---
+
+**Last Updated:** December 2025
--- a/docs/operations/rate-limit-tuning.md
+++ b/docs/operations/rate-limit-tuning.md
@@ -0,0 +1,395 @@
+# Rate Limit Tuning Guide
+
+This document covers rate limiting configuration, adjustment procedures, and troubleshooting for the Customer Portal.
+
+---
+
+## Rate Limiting Overview
+
+The portal uses multiple rate limiting mechanisms:
+
+| Type                      | Scope                              | Backend             | Purpose                     |
+| ------------------------- | ---------------------------------- | ------------------- | --------------------------- |
+| **Auth Rate Limiting**    | Per endpoint (login, signup, etc.) | Redis               | Prevent brute force attacks |
+| **Global Rate Limiting**  | Per route/controller               | Redis               | API abuse prevention        |
+| **Request Queues**        | Per external API                   | In-memory (p-queue) | External API protection     |
+| **SSE Connection Limits** | Per user                           | In-memory           | Resource protection         |
+
+---
+
+## Authentication Rate Limits
+
+### Configuration
+
+| Endpoint             | Env Variable                      | Default     | Window |
+| -------------------- | --------------------------------- | ----------- | ------ |
+| Login                | `LOGIN_RATE_LIMIT_LIMIT`          | 5 attempts  | 15 min |
+| Login (TTL)          | `LOGIN_RATE_LIMIT_TTL`            | 900000 ms   | -      |
+| Signup               | `SIGNUP_RATE_LIMIT_LIMIT`         | 5 attempts  | 15 min |
+| Signup (TTL)         | `SIGNUP_RATE_LIMIT_TTL`           | 900000 ms   | -      |
+| Password Reset       | `PASSWORD_RESET_RATE_LIMIT_LIMIT` | 5 attempts  | 15 min |
+| Password Reset (TTL) | `PASSWORD_RESET_RATE_LIMIT_TTL`   | 900000 ms   | -      |
+| Token Refresh        | `AUTH_REFRESH_RATE_LIMIT_LIMIT`   | 10 attempts | 5 min  |
+| Token Refresh (TTL)  | `AUTH_REFRESH_RATE_LIMIT_TTL`     | 300000 ms   | -      |
+
+### CAPTCHA Configuration
+
+| Setting           | Env Variable                   | Default | Description                          |
+| ----------------- | ------------------------------ | ------- | ------------------------------------ |
+| CAPTCHA Threshold | `LOGIN_CAPTCHA_AFTER_ATTEMPTS` | 3       | Show CAPTCHA after N failed attempts |
+| CAPTCHA Always On | `AUTH_CAPTCHA_ALWAYS_ON`       | false   | Require CAPTCHA for all logins       |
+
+### Adjusting Auth Rate Limits
+
+**In Production (requires restart):**
+
+```bash
+# Edit .env file
+LOGIN_RATE_LIMIT_LIMIT=10        # Increase to 10 attempts
+LOGIN_RATE_LIMIT_TTL=1800000     # Extend window to 30 minutes
+
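+# Recreate the backend so it picks up the new values. If Compose injects the
+# variables, `docker compose restart` keeps the old container environment.
+docker compose up -d backend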
+```
+
+**Immediate Reset via Redis (no restart):**
+
+```bash
+# Check current rate limit for a key
+redis-cli GET "auth-login:<ip-hash>"
+
+# Delete a rate limit record to allow immediate retry
+redis-cli DEL "auth-login:<ip-hash>"
+```
+
+---
+
+## Global API Rate Limits
+
+### Configuration
+
+Global rate limits are applied via the `@RateLimit` decorator:
+
+```typescript
+@RateLimit({ limit: 100, ttl: 60 })  // 100 requests per minute
+@Controller('invoices')
+export class InvoicesController { ... }
+```
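
To make the `limit` and `ttl` semantics concrete, the sketch below implements a fixed-window counter in memory. It is an illustration only: the portal's `@RateLimit` guard is Redis-backed and may use a different algorithm, and `FixedWindowLimiter` is a hypothetical name. It takes `ttl` in seconds, matching the decorator comment.

```typescript
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly ttlMs: number;
  private readonly windows: Record<string, { count: number; resetAt: number }> = {};

  constructor(limit: number, ttlSeconds: number) {
    this.limit = limit;
    this.ttlMs = ttlSeconds * 1000;
  }

  // Returns true when the request is allowed. `nowMs` is passed in so the
  // behavior is deterministic in tests.
  allow(key: string, nowMs: number): boolean {
    const slot = "k:" + key; // prefix avoids clashes with Object.prototype names
    const current = this.windows[slot];
    if (current === undefined || nowMs >= current.resetAt) {
      this.windows[slot] = { count: 1, resetAt: nowMs + this.ttlMs };
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

With `new FixedWindowLimiter(100, 60)`, the 101st call for the same key inside one 60-second window is rejected, and the counter starts over once the window expires.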
+
+### Common Rate Limit Settings
+
+| Endpoint      | Limit | TTL | Notes                 |
+| ------------- | ----- | --- | --------------------- |
+| Invoices      | 100   | 60s | High-traffic endpoint |
+| Subscriptions | 100   | 60s | High-traffic endpoint |
+| Catalog       | 200   | 60s | Cached, higher limit  |
+| Orders        | 50    | 60s | Write operations      |
+| Profile       | 60    | 60s | Standard limit        |
+
+### Adjusting Global Rate Limits
+
+Global rate limits are defined in code. To adjust:
+
+1. Modify the `@RateLimit` decorator in the controller
+2. Deploy the change
+
+```typescript
+// Before
+@RateLimit({ limit: 50, ttl: 60 })
+
+// After (double the limit)
+@RateLimit({ limit: 100, ttl: 60 })
+```
+
+---
+
+## External API Request Queues
+
+### WHMCS Queue Configuration
+
+| Setting      | Env Variable               | Default | Description             |
+| ------------ | -------------------------- | ------- | ----------------------- |
+| Concurrency  | `WHMCS_QUEUE_CONCURRENCY`  | 15      | Max parallel requests   |
+| Interval Cap | `WHMCS_QUEUE_INTERVAL_CAP` | 300     | Max requests per minute |
+| Timeout      | `WHMCS_QUEUE_TIMEOUT_MS`   | 30000   | Request timeout (ms)    |
+
+### Salesforce Queue Configuration
+
+| Setting                  | Env Variable                  | Default | Description             |
+| ------------------------ | ----------------------------- | ------- | ----------------------- |
+| Standard Concurrency     | `SF_QUEUE_CONCURRENCY`        | 10      | Standard operations     |
+| Long-Running Concurrency | `SF_LONG_RUNNING_CONCURRENCY` | 5       | Bulk operations         |
+| Interval Cap             | `SF_QUEUE_INTERVAL_CAP`       | 200     | Max requests per minute |
+| Timeout                  | `SF_QUEUE_TIMEOUT_MS`         | 30000   | Request timeout (ms)    |
+
+### Adjusting Queue Limits
+
+**Production Adjustment:**
+
+```bash
+# Edit .env file
+WHMCS_QUEUE_CONCURRENCY=20      # Increase concurrent requests
+WHMCS_QUEUE_INTERVAL_CAP=500    # Increase requests per minute
+
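+# Recreate the backend so it picks up the new values. If Compose injects the
+# variables, `docker compose restart` keeps the old container environment.
+docker compose up -d backend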
+```
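
Before raising either value, it helps to check which one is actually the bottleneck. The sketch below estimates the ceiling on requests per minute from the two settings and an observed average execution time. It is a rough model under assumptions: `maxThroughputPerMinute` is a hypothetical helper, and the model ignores queue wait time and retries.

```typescript
interface QueueSettings {
  concurrency: number;
  intervalCapPerMinute: number;
}

// Upper bound on requests per minute: whichever limit binds first.
function maxThroughputPerMinute(s: QueueSettings, avgExecutionMs: number): number {
  if (avgExecutionMs <= 0) return s.intervalCapPerMinute;
  const byConcurrency = s.concurrency * (60000 / avgExecutionMs);
  return Math.min(s.intervalCapPerMinute, byConcurrency);
}
```

With the WHMCS defaults (concurrency 15, cap 300) and a 250 ms average call, the interval cap binds at 300 requests per minute. If calls slow to 5 s, concurrency binds at 180, and raising `WHMCS_QUEUE_INTERVAL_CAP` alone would not help.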
+
+### Queue Health Monitoring
+
+```bash
+# Check queue metrics
+curl http://localhost:4000/health/queues | jq '.'
+
+# Expected output:
+{
+  "whmcs": {
+    "health": "healthy",
+    "metrics": {
+      "queueSize": 0,
+      "pendingRequests": 2,
+      "failedRequests": 0
+    }
+  },
+  "salesforce": {
+    "health": "healthy",
+    "metrics": { ... },
+    "dailyUsage": { "used": 5000, "limit": 15000 }
+  }
+}
+```
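
A monitoring script could reduce that response to the two facts an operator usually wants: which queues are not healthy, and which are close to a daily quota. The sketch below assumes the response shape shown above; the type names, the function name, and the `usageWarningRatio` parameter are hypothetical.

```typescript
// Sketch: summarize a /health/queues response (shape assumed from the sample output).
interface QueueHealth {
  health: string;
  dailyUsage?: { used: number; limit: number };
}

type QueueHealthResponse = Record<string, QueueHealth>;

interface QueueHealthSummary {
  unhealthy: string[]; // queues whose health is anything other than "healthy"
  nearDailyLimit: string[]; // queues at or above the usage warning ratio
}

function summarizeQueueHealth(
  response: QueueHealthResponse,
  usageWarningRatio: number,
): QueueHealthSummary {
  const summary: QueueHealthSummary = { unhealthy: [], nearDailyLimit: [] };
  for (const queueName of Object.keys(response)) {
    const queue = response[queueName];
    if (queue.health !== "healthy") {
      summary.unhealthy.push(queueName);
    }
    const usage = queue.dailyUsage;
    if (usage !== undefined && usage.limit > 0 && usage.used / usage.limit >= usageWarningRatio) {
      summary.nearDailyLimit.push(queueName);
    }
  }
  return summary;
}

// The sample response above: both queues healthy, Salesforce at one third of its daily quota.
const sampleResponse: QueueHealthResponse = {
  whmcs: { health: "healthy" },
  salesforce: { health: "healthy", dailyUsage: { used: 5000, limit: 15000 } },
};
const queueSummary = summarizeQueueHealth(sampleResponse, 0.8);
```

With a warning ratio of 0.8, neither list contains an entry for the sample response.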
+
+---
+
+## SSE Connection Limits
+
+### Configuration
+
+```typescript
+// Per-user SSE connection limit (in-memory)
+private readonly maxPerUser = 3;
+```
+
+This prevents a single user from opening unlimited SSE connections.
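
The idea can be sketched as a small counter keyed by user ID. This is a hypothetical illustration, not the contents of the portal's limiter service; the class and method names are assumptions.

```typescript
// Hypothetical sketch of a per-user connection counter held in process memory.
class ConnectionLimiter {
  private readonly maxPerUser: number;
  private readonly counts = new Map<string, number>();

  constructor(maxPerUser: number) {
    this.maxPerUser = maxPerUser;
  }

  /** Reserves a slot for `userId`; returns false when the user is already at the limit. */
  tryAcquire(userId: string): boolean {
    const current = this.counts.get(userId) ?? 0;
    if (current >= this.maxPerUser) return false;
    this.counts.set(userId, current + 1);
    return true;
  }

  /** Frees a slot when a stream closes and drops the entry once it reaches zero. */
  release(userId: string): void {
    const current = this.counts.get(userId) ?? 0;
    if (current <= 1) {
      this.counts.delete(userId);
    } else {
      this.counts.set(userId, current - 1);
    }
  }
}

// The fourth simultaneous connection is refused until an earlier one is released.
const sseLimiter = new ConnectionLimiter(3);
const attempts = [1, 2, 3, 4].map(() => sseLimiter.tryAcquire("user-42"));
sseLimiter.release("user-42");
const afterRelease = sseLimiter.tryAcquire("user-42");
```

Because a counter like this lives in one process, each backend instance would enforce the limit separately; with several instances a user could hold up to `maxPerUser` connections on each.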
+
+### Adjusting SSE Limits
+
+This requires a code change in `realtime-connection-limiter.service.ts`:
+
+```typescript
+// Change from
+private readonly maxPerUser = 3;
+
+// To
+private readonly maxPerUser = 5;
+```
+
+---
+
+## Bypassing Rate Limits for Testing
+
+### Temporary Bypass via Redis
+
+```bash
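+# Reset all rate limit counters (testing only).
+# --scan iterates incrementally instead of blocking Redis the way KEYS does.
+redis-cli --scan --pattern "auth-*" | xargs -r redis-cli DEL
+redis-cli --scan --pattern "rate-limit:*" | xargs -r redis-cli DEL
+
+# Reset the counters for a single client
+redis-cli --scan --pattern "*<ip-or-user-identifier>*" | xargs -r redis-cli DEL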
+```
+
+### Using SkipRateLimit Decorator
+
+For development/testing routes:
+
+```typescript
+@SkipRateLimit()
+@Get('test-endpoint')
+async testEndpoint() { ... }
+```
+
+### Environment-Based Bypass
+
+Add a development bypass in configuration:
+
+```bash
+# In .env (development only!)
+RATE_LIMIT_BYPASS_ENABLED=true
+```
+
+```typescript
+// In guard
+if (this.configService.get("RATE_LIMIT_BYPASS_ENABLED") === "true") {
+  return true;
+}
+```
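
One way to make an accidental production bypass harder is to require a second condition before honoring the flag. The sketch below is an assumption layered on the guard snippet above: it presumes production deployments set `NODE_ENV=production`, and the function name is hypothetical.

```typescript
// Sketch: honor the bypass flag only outside production (assumes NODE_ENV is set in production).
type BypassEnv = Record<string, string | undefined>;

function isRateLimitBypassed(env: BypassEnv): boolean {
  const requested = env["RATE_LIMIT_BYPASS_ENABLED"] === "true";
  const isProduction = env["NODE_ENV"] === "production";
  return requested && !isProduction;
}

// The same flag is honored in development and ignored in production.
const bypassInDevelopment = isRateLimitBypassed({
  RATE_LIMIT_BYPASS_ENABLED: "true",
  NODE_ENV: "development",
});
const bypassInProduction = isRateLimitBypassed({
  RATE_LIMIT_BYPASS_ENABLED: "true",
  NODE_ENV: "production",
});
```

A check like this does not replace keeping the flag out of production configuration; it only limits the damage if the flag is copied there by mistake.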
+
+> **Warning**: Never enable bypass in production!
+
+---
+
+## Signs of Rate Limit Issues
+
+### User-Facing Symptoms
+
+| Symptom                    | Possible Cause      | Investigation             |
+| -------------------------- | ------------------- | ------------------------- |
+| "Too many requests" errors | Rate limit exceeded | Check Redis keys, logs    |
+| Login failures             | Auth rate limit     | Check `auth-login:*` keys |
+| Slow API responses         | Queue backlog       | Check `/health/queues`    |
+| 429 errors in logs         | Any rate limit      | Check logs for specifics  |
+
+### Monitoring Indicators
+
+| Metric            | Warning       | Critical | Action                   |
+| ----------------- | ------------- | -------- | ------------------------ |
+| 429 error rate    | >1%           | >5%      | Review rate limits       |
+| Queue size        | >10           | >50      | Increase concurrency     |
+| Average wait time | >1s           | >5s      | Scale or increase limits |
+| CAPTCHA triggers  | Unusual spike | -        | Possible attack          |
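
The numeric rows of the table can be encoded directly, which is handy when wiring them into an alerting script. This is a sketch only: the names are hypothetical, the 429 rate is expressed as a fraction of responses, and the comparisons are strict because the table uses `>`.

```typescript
// Sketch: classify a metric against the warning/critical thresholds from the table.
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

const rateLimitThresholds = {
  errorRate429: { warning: 0.01, critical: 0.05 }, // fraction of all responses
  queueSize: { warning: 10, critical: 50 },
  averageWaitSeconds: { warning: 1, critical: 5 },
};

function classifyMetric(value: number, threshold: Threshold): Severity {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}

// A queue of 12 pending requests is a warning; a 6% share of 429 responses is critical.
const queueSeverity = classifyMetric(12, rateLimitThresholds.queueSize);
const errorSeverity = classifyMetric(0.06, rateLimitThresholds.errorRate429);
```

The CAPTCHA row is left out because "unusual spike" has no fixed number; it needs a baseline comparison rather than a static threshold.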
+
+### Log Analysis
+
+```bash
+# Find rate limit exceeded events
+grep "Rate limit exceeded" /var/log/bff/combined.log | tail -20
+
+# Find 429 responses
+grep '"statusCode":429' /var/log/bff/combined.log | tail -20
+
+# Count rate limit events by path
+grep "Rate limit exceeded" /var/log/bff/combined.log | \
+  jq -r '.path' | sort | uniq -c | sort -rn
+```
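
The same per-path count can be produced in TypeScript when the analysis needs to live in a script rather than a shell pipeline. The sketch assumes one JSON object per log line with a string `path` field, as the `jq` command above implies; the function name and the sample paths are hypothetical.

```typescript
// Sketch: count "Rate limit exceeded" log lines per request path, most frequent first.
function countRateLimitEventsByPath(logLines: string[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const line of logLines) {
    if (!line.includes("Rate limit exceeded")) continue;
    let entry: unknown;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    const path = (entry as { path?: unknown } | null)?.path;
    if (typeof path !== "string") continue;
    counts.set(path, (counts.get(path) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

// Hypothetical log lines; only the JSON entries that carry the message are counted.
const eventsByPath = countRateLimitEventsByPath([
  '{"message":"Rate limit exceeded","path":"/api/orders"}',
  '{"message":"Rate limit exceeded","path":"/api/catalog"}',
  '{"message":"Rate limit exceeded","path":"/api/orders"}',
  "plain text: Rate limit exceeded",
  '{"message":"request completed","path":"/api/profile"}',
]);
```

Unlike the shell pipeline, this version skips malformed lines instead of stopping on them.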
+
+---
+
+## Troubleshooting
+
+### Too Many 429 Errors
+
+**Diagnosis:**
+
+```bash
+# Check which endpoints are rate limited
+grep "Rate limit exceeded" /var/log/bff/combined.log | \
+  jq '{path: .path, key: .key}' | head -20
+
+# Check queue health
+curl http://localhost:4000/health/queues
+```
+
+**Resolution:**
+
+1. Identify the affected endpoint
+2. Check whether the limit is appropriate for the observed traffic
+3. Increase the limit if the traffic is legitimate
+4. Add caching if the requests are repetitive
+
+### Legitimate Users Being Blocked
+
+**Diagnosis:**
+
+```bash
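+# Check rate limit state for a specific key
+# (--scan avoids blocking Redis the way KEYS does on a large keyspace)
+redis-cli --scan --pattern "*<identifier>*"
+redis-cli GET "auth-login:<hash>"
+redis-cli TTL "auth-login:<hash>"   # seconds until the record expires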
+```
+
+**Resolution:**
+
+```bash
+# Clear the user's rate limit record
+redis-cli DEL "auth-login:<hash>"
+```
+
+### External API Rate Limit Violations
+
+**WHMCS Rate Limiting:**
+
+```bash
+# Check queue metrics
+curl http://localhost:4000/health/queues/whmcs
+
+# If WHMCS is overloaded, lower these values in .env and recreate the backend container
+WHMCS_QUEUE_CONCURRENCY=5
+WHMCS_QUEUE_INTERVAL_CAP=100
+```
+
+**Salesforce API Limits:**
+
+```bash
+# Check daily API usage
+curl http://localhost:4000/health/queues/salesforce | jq '.dailyUsage'
+
+# If approaching limit, reduce requests
+# Consider caching more data
+```
+
+### Redis Connection Issues
+
+If rate limiting fails due to Redis:
+
+```bash
+# Check Redis connectivity
+redis-cli PING
+
+# The guard fails open on Redis errors (allows request)
+# Check logs for "Rate limiter error - failing open"
+```
+
+---
+
+## Best Practices
+
+### Setting Rate Limits
+
+1. **Start Conservative** - Begin with lower limits, increase as needed
+2. **Monitor Before Adjusting** - Understand traffic patterns first
+3. **Consider User Experience** - Limits should rarely impact normal use
+4. **Document Changes** - Track why limits were adjusted
+
+### Rate Limit Strategies
+
+| Strategy   | Use Case                | Implementation         |
+| ---------- | ----------------------- | ---------------------- |
+| IP-based   | Anonymous endpoints     | Default behavior       |
+| User-based | Authenticated endpoints | Include user ID in key |
+| Combined   | Sensitive endpoints     | IP + User-Agent hash   |
+| Tiered     | Different user classes  | Custom logic           |
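
The first three strategies differ only in how the Redis key is built. The sketch below illustrates that idea; the key layout, the type and function names, and the FNV-1a hash are all assumptions made for the example. The hash is not cryptographic, and a real guard would more likely use something like SHA-256.

```typescript
// Hypothetical key builders for the IP-based, user-based, and combined strategies.
type Strategy = "ip" | "user" | "combined";

interface LimitedRequest {
  ip: string;
  userAgent: string;
  userId?: string; // present only on authenticated requests
}

/** 32-bit FNV-1a hash rendered as 8 hex characters (illustrative, not cryptographic). */
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

function rateLimitKey(strategy: Strategy, route: string, req: LimitedRequest): string {
  switch (strategy) {
    case "ip":
      return `rate-limit:${route}:ip:${req.ip}`;
    case "user":
      // Fall back to the IP when the request is not authenticated.
      return req.userId !== undefined
        ? `rate-limit:${route}:user:${req.userId}`
        : `rate-limit:${route}:ip:${req.ip}`;
    case "combined":
      return `rate-limit:${route}:ip-ua:${req.ip}:${fnv1a(req.userAgent)}`;
  }
}

const exampleRequest: LimitedRequest = { ip: "203.0.113.7", userAgent: "curl/8.0", userId: "u-1" };
const ipKey = rateLimitKey("ip", "orders", exampleRequest);
const userKey = rateLimitKey("user", "orders", exampleRequest);
const combinedKey = rateLimitKey("combined", "orders", exampleRequest);
```

A tiered strategy would typically keep one of these key shapes and vary the `limit` per user class instead.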
+
+### Performance Considerations
+
+- **Redis Latency** - Keep Redis co-located with BFF
+- **Key Expiration** - Use TTL to prevent Redis bloat
+- **Fail Open** - Rate limiter allows requests if Redis fails
+- **Logging** - Log blocked requests for analysis
+
+---
+
+## Rate Limit Response Headers
+
+The BFF sends the conventional `X-RateLimit-*` headers along with `Retry-After`:
+
+```http
+X-RateLimit-Limit: 100
+X-RateLimit-Remaining: 95
+X-RateLimit-Reset: 1704110400
+Retry-After: 60
+```
+
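+Clients can use these headers to back off: wait `Retry-After` seconds after a 429, or hold requests until the `X-RateLimit-Reset` time once `X-RateLimit-Remaining` reaches 0.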
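
A client-side helper for this might look like the sketch below. It assumes header names arrive lowercased, that `Retry-After` carries a number of seconds (as in the example above, although HTTP also allows a date), and that `X-RateLimit-Reset` is a Unix timestamp in seconds. The function and type names are hypothetical.

```typescript
// Sketch: derive a retry delay from rate limit response headers.
type HeaderMap = Record<string, string | undefined>;

/**
 * Returns how long to wait in milliseconds before retrying, or 0 if no wait is indicated.
 * Prefers Retry-After; otherwise waits until the reset time when no requests remain.
 */
function retryDelayMs(headers: HeaderMap, nowMs: number): number {
  const retryAfterRaw = headers["retry-after"];
  if (retryAfterRaw !== undefined && retryAfterRaw.trim() !== "") {
    const retryAfterSeconds = Number(retryAfterRaw);
    if (Number.isFinite(retryAfterSeconds) && retryAfterSeconds >= 0) {
      return retryAfterSeconds * 1000;
    }
  }
  const remaining = Number(headers["x-ratelimit-remaining"]);
  const resetEpochSeconds = Number(headers["x-ratelimit-reset"]);
  if (remaining === 0 && Number.isFinite(resetEpochSeconds)) {
    return Math.max(0, resetEpochSeconds * 1000 - nowMs);
  }
  return 0;
}

// Ten seconds before the reset time with no quota left: wait the remaining ten seconds.
const delayUntilReset = retryDelayMs(
  { "x-ratelimit-remaining": "0", "x-ratelimit-reset": "1704110400" },
  1704110390000,
);
// An explicit Retry-After takes precedence over the reset time.
const delayFromRetryAfter = retryDelayMs({ "retry-after": "60" }, 1704110390000);
```

Taking the current time as a parameter keeps the function deterministic; a caller would normally pass `Date.now()`.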
+
+---
+
+## Related Documents
+
+- [Incident Response](./incident-response.md)
+- [Monitoring Setup](./monitoring-setup.md)
+- [External Dependencies](./external-dependencies.md)
+- [Queue Management](./queue-management.md)
+
+---
+
+**Last Updated:** December 2025
--- a/docs/operations/release-procedures.md
+++ b/docs/operations/release-procedures.md
@ -0,0 +1,402 @@
+# Release and Deployment Procedures
+
+This document covers pre-deployment checklists, deployment procedures, post-deployment verification, and rollback procedures for the Customer Portal.
+
+---
+
+## Deployment Overview
+
+| Environment | Method         | Script             | Notes                                |
+| ----------- | -------------- | ------------------ | ------------------------------------ |
+| Development | Local          | `pnpm dev`         | Apps run locally, services in Docker |
+| Production  | Docker Compose | `pnpm prod:deploy` | Full containerized deployment        |
+| Updates     | Docker Compose | `pnpm prod:update` | Zero-downtime application updates    |
+
+### Available Commands
+
+```bash
+pnpm prod:deploy    # Full deployment (build + start + migrate)
+pnpm prod:start     # Start all production services
+pnpm prod:stop      # Stop all production services
+pnpm prod:update    # Zero-downtime update (rebuild and recreate apps)
+pnpm prod:status    # Show service status and health
+pnpm prod:logs      # Show service logs
+pnpm prod:backup    # Create database backup
+pnpm prod:cleanup   # Clean up old containers and images
+```
+
+---
+
+## Pre-Deployment Checklist
+
+### Code Review
+
+- [ ] All changes have been reviewed and approved
+- [ ] No console.log/console.error statements in production code
+- [ ] No hardcoded secrets or credentials
+- [ ] TypeScript compilation passes (`pnpm type-check`)
+- [ ] Linting passes (`pnpm lint`)
+- [ ] Tests pass (`pnpm test`)
+
+### Environment Configuration
+
+- [ ] All required environment variables are set in `.env`
+- [ ] Database URL is correct for production
+- [ ] Redis URL is correct for production
+- [ ] External API credentials are valid (Salesforce, WHMCS, Freebit)
+- [ ] CORS_ORIGIN matches production domain
+- [ ] JWT_SECRET is secure and unique
+
+**Required Environment Variables:**
+
+```bash
+DATABASE_URL        # PostgreSQL connection string
+REDIS_URL           # Redis connection string
+JWT_SECRET          # Secure secret (min 32 chars)
+POSTGRES_PASSWORD   # Database password
+CORS_ORIGIN         # Frontend domain
+NEXT_PUBLIC_API_BASE # BFF API URL
+BFF_PORT            # Backend port (usually 4000)
+```
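
Part of this checklist can be automated. The sketch below checks that each listed variable is present and that `JWT_SECRET` meets the 32-character minimum noted above. It is a hypothetical helper, not the validation that `pnpm prod:deploy` performs, and it does not verify that URLs or credentials actually work.

```typescript
// Sketch: pre-deployment check for the required environment variables listed above.
type DeployEnv = Record<string, string | undefined>;

const requiredVariables = [
  "DATABASE_URL",
  "REDIS_URL",
  "JWT_SECRET",
  "POSTGRES_PASSWORD",
  "CORS_ORIGIN",
  "NEXT_PUBLIC_API_BASE",
  "BFF_PORT",
];

/** Returns a list of human-readable problems; an empty list means the checks passed. */
function validateDeployEnv(env: DeployEnv): string[] {
  const problems: string[] = [];
  for (const variable of requiredVariables) {
    const value = env[variable];
    if (value === undefined || value.trim() === "") {
      problems.push(`${variable} is not set`);
    }
  }
  const secret = env["JWT_SECRET"];
  if (secret !== undefined && secret.trim() !== "" && secret.length < 32) {
    problems.push("JWT_SECRET is shorter than 32 characters");
  }
  return problems;
}

// Placeholder values for illustration: REDIS_URL is missing and the secret is too short.
const envProblems = validateDeployEnv({
  DATABASE_URL: "postgresql://portal:example@database:5432/portal_prod",
  JWT_SECRET: "too-short",
  POSTGRES_PASSWORD: "example",
  CORS_ORIGIN: "https://portal.example.com",
  NEXT_PUBLIC_API_BASE: "https://portal.example.com/api",
  BFF_PORT: "4000",
});
```

Returning every problem at once, rather than stopping at the first, lets the operator fix the whole `.env` file in one pass.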
+
+### Database Migration Check
+
+- [ ] Review pending migrations (`npx prisma migrate status`)
+- [ ] Test migrations on staging/local first
+- [ ] Create database backup before applying migrations
+- [ ] Prepare rollback SQL if migration is destructive
+- [ ] Estimate migration duration for large tables
+
+### Dependency Check
+
+- [ ] Run security audit (`pnpm security:check`)
+- [ ] No high/critical vulnerabilities
+- [ ] All dependencies are at expected versions
+- [ ] Lock file is up to date (`pnpm-lock.yaml`)
+
+### Communication
+
+- [ ] Notify team of deployment schedule
+- [ ] Schedule during low-traffic window if possible
+- [ ] Prepare customer communication if downtime expected
+- [ ] Ensure on-call engineer is available
+
+---
+
+## Deployment Procedure
+
+### Standard Deployment (First Time)
+
+```bash
+# 1. Create database backup (if updating existing system)
+pnpm prod:backup
+
+# 2. Full deployment
+pnpm prod:deploy
+```
+
+This command:
+
+1. Validates environment configuration
+2. Builds production Docker images
+3. Starts database and cache services
+4. Waits for database readiness
+5. Runs Prisma migrations
+6. Starts frontend and backend services
+7. Performs health checks
+
+### Application Update (Zero-Downtime)
+
+For updates that don't require database migrations:
+
+```bash
+# 1. Create database backup
+pnpm prod:backup
+
+# 2. Update applications
+pnpm prod:update
+```
+
+This rebuilds and recreates frontend and backend containers without stopping the database.
+
+### Database Migration Deployment
+
+For deployments with schema changes:
+
+```bash
+# 1. Create database backup
+pnpm prod:backup
+
+# 2. Stop application to prevent writes during migration
+pnpm prod:stop
+
+# 3. Start only database
+docker compose -f docker/prod/docker-compose.yml up -d database
+
+# 4. Run migrations
+docker compose -f docker/prod/docker-compose.yml run --rm backend pnpm db:migrate
+
+# 5. Verify migration success
+docker compose -f docker/prod/docker-compose.yml exec database psql -U portal -d portal_prod -c "SELECT * FROM _prisma_migrations ORDER BY finished_at DESC LIMIT 5;"
+
+# 6. Start all services
+pnpm prod:start
+
+# 7. Verify application health
+pnpm prod:status
+```
+
+---
+
+## Post-Deployment Verification
+
+### Immediate Checks (0-5 minutes)
+
+- [ ] Health endpoints return `ok`
+  ```bash
+  curl http://localhost:4000/health
+  curl http://localhost:3000/_health
+  ```
+- [ ] No error spikes in logs
+  ```bash
+  pnpm prod:logs backend | grep -i error | tail -20
+  ```
+- [ ] Database migrations applied successfully
+- [ ] Redis connectivity verified
+
+### Functional Checks (5-15 minutes)
+
+- [ ] User can log in to portal
+- [ ] Dashboard loads correctly
+- [ ] Invoice list displays
+- [ ] Subscription list displays
+- [ ] Catalog products load
+
+### Integration Checks (15-30 minutes)
+
+- [ ] Salesforce connectivity verified
+  ```bash
+  curl http://localhost:4000/auth/health-check | jq '.services.salesforce'
+  ```
+- [ ] WHMCS connectivity verified
+  ```bash
+  curl http://localhost:4000/auth/health-check | jq '.services.whmcs'
+  ```
+- [ ] Queue health verified
+  ```bash
+  curl http://localhost:4000/health/queues
+  ```
+
+### Monitoring Checks
+
+- [ ] Metrics are being collected
+- [ ] No alert triggers from deployment
+- [ ] Log aggregation is working
+- [ ] Error rates are normal
+
+---
+
+## Rollback Procedures
+
+### Application Rollback (No DB Changes)
+
+If deployment fails without database changes:
+
+```bash
+# 1. Stop current deployment
+pnpm prod:stop
+
+# 2. Checkout previous version
+git checkout <previous-tag-or-commit>
+
+# 3. Rebuild and deploy
+pnpm prod:deploy
+```
+
+### Application Rollback with Docker Images
+
+If previous images are available:
+
+```bash
+# 1. Stop current services
+pnpm prod:stop
+
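+# 2. Start with previous image tags
+#    (`docker compose up` has no -e flag, so pass the tags as shell variables;
+#    this assumes the compose file reads BACKEND_IMAGE and FRONTEND_IMAGE)
+BACKEND_IMAGE=portal-backend:previous \
+FRONTEND_IMAGE=portal-frontend:previous \
+docker compose -f docker/prod/docker-compose.yml up -d --no-build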
+```
+
+### Database Rollback
+
+If database migration needs to be reverted:
+
+**Option 1: Restore from Backup**
+
+```bash
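+# 1. Stop application
+pnpm prod:stop
+
+# 2. Start only the database so the restore has a running container
+docker compose -f docker/prod/docker-compose.yml up -d database
+
+# 3. Restore database (-T disables the pseudo-TTY so stdin redirection works)
+docker compose -f docker/prod/docker-compose.yml exec -T database psql -U portal -d portal_prod < backup_YYYYMMDD_HHMMSS.sql
+
+# 4. Checkout previous code version
+git checkout <previous-tag>
+
+# 5. Rebuild and restart
+pnpm prod:deploy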
+```
+
+**Option 2: Manual Rollback SQL**
+
+```bash
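+# 1. Stop application
+pnpm prod:stop
+
+# 2. Start only the database so the rollback has a running container
+docker compose -f docker/prod/docker-compose.yml up -d database
+
+# 3. Apply rollback script (if prepared); -T lets psql read the file from stdin
+docker compose -f docker/prod/docker-compose.yml exec -T database psql -U portal -d portal_prod < rollback_migration_YYYYMMDD.sql
+
+# 4. Manually remove migration record
+docker compose -f docker/prod/docker-compose.yml exec -T database psql -U portal -d portal_prod -c "DELETE FROM _prisma_migrations WHERE migration_name = '20240115_migration_name';"
+
+# 5. Restart with previous code
+git checkout <previous-tag>
+pnpm prod:deploy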
+```
+
+### Emergency Rollback
+
+For critical failures requiring immediate action:
+
+```bash
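+# 1. Immediately stop all services
+pnpm prod:stop
+
+# 2. Start only the database, wait until it accepts connections,
+#    then restore from the most recent backup
+docker compose -f docker/prod/docker-compose.yml up -d database
+docker compose -f docker/prod/docker-compose.yml exec -T database psql -U portal -d portal_prod < /path/to/latest_backup.sql
+
+# 3. Deploy last known good version
+git checkout <last-known-good-tag>
+pnpm prod:deploy
+
+# 4. Notify team
+# Send incident notification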
+```
+
+---
+
+## Feature Flags
+
+The portal does not currently use a formal feature flag system. Feature availability is controlled through:
+
+1. **Environment Variables** - Toggle features via configuration
+2. **Conditional Rendering** - Frontend checks for feature availability
+3. **Backend Feature Checks** - API endpoints check configuration
+
+### Adding a Feature Toggle
+
+```typescript
+// Backend: Check environment variable
+const featureEnabled = this.configService.get("FEATURE_NEW_CHECKOUT", "false") === "true";
+
+// Frontend: Check feature availability
+if (process.env.NEXT_PUBLIC_FEATURE_NEW_CHECKOUT === "true") {
+  // Render new feature
+}
+```
+
+### Emergency Feature Disable
+
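+To disable a backend feature without rebuilding:
+
+1. Update the environment variable in `.env`
+2. Recreate the affected services so they pick up the new value (`docker compose restart` keeps the old environment):
+   ```bash
+   docker compose up -d --force-recreate backend frontend
+   ```
+
+Next.js inlines `NEXT_PUBLIC_*` variables at build time, so a frontend toggle such as `NEXT_PUBLIC_FEATURE_NEW_CHECKOUT` only changes after the frontend image is rebuilt (for example with `pnpm prod:update`).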
+
+---
+
+## Deployment Timeline Template
+
+| Time  | Action                          | Owner      | Notes                     |
+| ----- | ------------------------------- | ---------- | ------------------------- |
+| T-24h | Announce deployment window      | Tech Lead  | Notify all stakeholders   |
+| T-2h  | Final code review               | Developers | Verify all changes merged |
+| T-1h  | Pre-deployment checklist        | DevOps     | Complete all checks       |
+| T-30m | Create backup                   | DevOps     | Verify backup integrity   |
+| T-15m | Notify team deployment starting | DevOps     | Slack/Teams message       |
+| T-0   | Execute deployment              | DevOps     | Run deployment commands   |
+| T+5m  | Immediate verification          | DevOps     | Health checks             |
+| T+15m | Functional verification         | QA/DevOps  | Test key flows            |
+| T+30m | All-clear or rollback decision  | Tech Lead  | Confirm success           |
+| T+1h  | Post-deployment monitoring      | DevOps     | Watch metrics             |
+| T+24h | Close deployment                | Tech Lead  | Final verification        |
+
+---
+
+## Troubleshooting
+
+### Build Failures
+
+```bash
+# Check Docker daemon
+docker info
+
+# Check disk space
+df -h
+
+# Clean Docker resources
+# (-a also deletes unused images, including older tags you may need for a rollback)
+docker system prune -a
+```
+
+### Migration Failures
+
+```bash
+# Check migration status
+npx prisma migrate status
+
+# View migration history
+docker compose exec database psql -U portal -d portal_prod -c "SELECT * FROM _prisma_migrations;"
+
+# Reset the database and reapply all migrations (development only: this deletes all data!)
+npx prisma migrate reset
+```
+
+### Service Startup Failures
+
+```bash
+# Check service logs
+pnpm prod:logs backend
+pnpm prod:logs frontend
+
+# Check container status
+docker compose ps -a
+
+# Check resource usage
+docker stats
+```
+
+### Database Connection Issues
+
+```bash
+# Test database connectivity
+docker compose exec database pg_isready -U portal -d portal_prod
+
+# Check connection count
+docker compose exec database psql -U portal -d portal_prod -c "SELECT count(*) FROM pg_stat_activity;"
+```
+
+---
+
+## Related Documents
+
+- [Deployment Guide](../getting-started/deployment.md)
+- [Database Operations](./database-operations.md)
+- [Incident Response](./incident-response.md)
+- [Monitoring Setup](./monitoring-setup.md)
+
+---
+
+**Last Updated:** December 2025