Update README.md to Enhance Documentation Clarity and Add New Sections

- Added a new section for Release Procedures, detailing deployment and rollback processes.
- Updated the System Operations section to include Monitoring Setup, Rate Limit Tuning, and Customer Data Management for improved operational guidance.
- Reformatted the table structure for better readability and consistency across documentation.
barsa 2025-12-23 16:08:15 +09:00
parent 12eb9fd763
commit 90ab71b94d
5 changed files with 1602 additions and 9 deletions


@ -148,14 +148,18 @@ Feature guides explaining how the portal functions:
| [External Dependencies](./operations/external-dependencies.md) | Integration health checks |
| [Queue Management](./operations/queue-management.md) | BullMQ job monitoring |
| [External Processes](./operations/external-processes.md) | Team handoffs and workflows |
| [Release Procedures](./operations/release-procedures.md) | Deployment and rollback |
### System Operations
| Document | Description |
| ------------------------------------------------------------------ | -------------------------- |
| -------------------------------------------------------------------- | -------------------------- |
| [Logging](./operations/logging.md) | Centralized logging system |
| [Security Monitoring](./operations/security-monitoring.md) | Security monitoring setup |
| [Subscription Management](./operations/subscription-management.md) | Service management |
| [Monitoring Setup](./operations/monitoring-setup.md) | Metrics and dashboards |
| [Rate Limit Tuning](./operations/rate-limit-tuning.md) | Rate limit configuration |
| [Customer Data Management](./operations/customer-data-management.md) | GDPR and data procedures |
---
@ -192,10 +196,12 @@ Historical documents kept for reference:
### DevOps / Operations
1. [Deployment](./getting-started/deployment.md)
2. [Incident Response](./operations/incident-response.md)
3. [Provisioning Runbook](./operations/provisioning-runbook.md)
4. [Database Operations](./operations/database-operations.md)
5. [External Dependencies](./operations/external-dependencies.md)
2. [Release Procedures](./operations/release-procedures.md)
3. [Incident Response](./operations/incident-response.md)
4. [Monitoring Setup](./operations/monitoring-setup.md)
5. [Database Operations](./operations/database-operations.md)
6. [External Dependencies](./operations/external-dependencies.md)
7. [Rate Limit Tuning](./operations/rate-limit-tuning.md)
---


@ -0,0 +1,415 @@
# Customer Data Management (GDPR)
This document covers procedures for handling customer data in compliance with GDPR and data protection regulations.
---
## Data Storage Overview
Customer data is stored across multiple systems:
| System | Data Stored | Retention | Notes |
| ----------------------- | ----------------------------------------------------- | --------------------------- | ---------------------------- |
| **Portal (PostgreSQL)** | User accounts, ID mappings, audit logs, notifications | Active account lifetime | Auth data only |
| **WHMCS** | Billing, invoices, payment methods, addresses | Legal requirement (7 years) | System of record for billing |
| **Salesforce** | CRM data, orders, cases, contacts | Business records | System of record for CRM |
| **Redis** | Sessions, cache, rate limits | TTL-based (minutes to days) | Temporary data |
### Portal Database Tables with PII
| Table | PII Fields | Purpose |
| ---------------------------- | ------------------------------------ | -------------------- |
| `users` | `email`, `passwordHash`, `mfaSecret` | Authentication |
| `id_mappings` | Links to WHMCS/Salesforce IDs | Identity federation |
| `audit_logs` | `ipAddress`, `userAgent`, `userId` | Security audit trail |
| `residence_card_submissions` | Document images | ID verification |
| `notifications` | User notifications | In-app messaging |
| `sim_call_history_*` | Phone numbers, call details | Usage records |
| `sim_sms_history` | Phone numbers, SMS details | Usage records |
---
## Data Subject Rights
Under GDPR, customers have the following rights:
| Right | Portal Support | Notes |
| ---------------------- | ------------------ | ------------------------- |
| Right of Access | Manual export | See Data Export section |
| Right to Rectification | WHMCS self-service | Customer updates in WHMCS |
| Right to Erasure | Manual process | See Data Deletion section |
| Right to Portability | Manual export | See Data Export section |
| Right to Object | Manual process | Opt-out of processing |
---
## Data Deletion Procedures
### Overview
Complete customer data deletion requires coordination across all systems:
1. Portal database deletion
2. Audit log handling (anonymize or delete)
3. Redis cache clearing
4. WHMCS account handling
5. Salesforce record handling
### Pre-Deletion Checklist
- [ ] Verify customer identity (authentication or CS verification)
- [ ] Check for active subscriptions (must be cancelled first)
- [ ] Check for unpaid invoices (must be settled first)
- [ ] Check legal retention requirements (invoices, tax records)
- [ ] Document the deletion request with timestamp
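
The checklist above can be turned into a guard that runs before any deletion query. The following is a minimal sketch; the `DeletionRequest` shape and its field names are hypothetical and not part of the portal codebase.

```typescript
// Hypothetical request shape; field names are illustrative only.
interface DeletionRequest {
  identityVerified: boolean;
  activeSubscriptions: number;
  unpaidInvoices: number;
  legalRetentionChecked: boolean;
  requestedAt: Date | null; // documented timestamp of the request
}

// Returns the checklist items that still block deletion (empty = safe to proceed).
function deletionBlockers(req: DeletionRequest): string[] {
  const blockers: string[] = [];
  if (!req.identityVerified) blockers.push("Customer identity not verified");
  if (req.activeSubscriptions > 0) blockers.push("Active subscriptions must be cancelled first");
  if (req.unpaidInvoices > 0) blockers.push("Unpaid invoices must be settled first");
  if (!req.legalRetentionChecked) blockers.push("Legal retention requirements not checked");
  if (req.requestedAt === null) blockers.push("Deletion request not documented with a timestamp");
  return blockers;
}
```

Returning the full list of blockers, rather than failing on the first one, lets the operator resolve everything in a single pass.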
### Step 1: Portal Database Deletion
```sql
-- 1. Get user information
SELECT u.id, u.email, im.whmcs_client_id, im.sf_account_id
FROM users u
LEFT JOIN id_mappings im ON u.id = im.user_id
WHERE u.email = 'customer@example.com';
-- 2. Delete notifications
DELETE FROM notifications WHERE user_id = '<user_id>';
-- 3. Delete residence card submissions
DELETE FROM residence_card_submissions WHERE user_id = '<user_id>';
-- 4. Delete SIM usage data (if applicable)
-- Note: Check if SIM account is linked to this user first
DELETE FROM sim_usage_daily WHERE account IN (
SELECT account FROM sim_voice_options WHERE account = '<sim_account>'
);
DELETE FROM sim_call_history_domestic WHERE account = '<sim_account>';
DELETE FROM sim_call_history_international WHERE account = '<sim_account>';
DELETE FROM sim_sms_history WHERE account = '<sim_account>';
DELETE FROM sim_voice_options WHERE account = '<sim_account>';
-- 5. Delete ID mapping (cascades from user deletion)
-- The id_mappings table has onDelete: Cascade
-- 6. Delete user (cascades audit_logs user reference to NULL, deletes id_mapping)
DELETE FROM users WHERE id = '<user_id>';
```
**Using the Mappings Service:**
```typescript
// Delete mapping programmatically (clears cache too)
await mappingsService.deleteMapping(userId);
```
### Step 2: Audit Log Handling
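Audit logs may need to be retained for security compliance. Run the chosen option before the `DELETE FROM users` statement in Step 1: that delete sets `audit_logs.user_id` to NULL, after which the rows can no longer be matched by user ID. Options: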
**Option A: Anonymize (Recommended)**
```sql
-- Anonymize audit logs (keeps security trail, removes PII)
UPDATE audit_logs
SET user_id = NULL,
ip_address = 'ANONYMIZED',
user_agent = 'ANONYMIZED',
details = jsonb_set(
COALESCE(details, '{}'::jsonb),
'{anonymized}',
'true'::jsonb
)
WHERE user_id = '<user_id>';
```
**Option B: Delete (If Legally Permitted)**
```sql
DELETE FROM audit_logs WHERE user_id = '<user_id>';
```
### Step 3: Redis Cache Clearing
```bash
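# Use --scan (incremental SCAN) instead of KEYS, which blocks Redis while it walks the keyspace.
# xargs -r (GNU xargs) skips DEL when nothing matches.
# Clear user-specific cache keys
redis-cli --scan --pattern "user:*:<user_id>*" | xargs -r redis-cli DEL
redis-cli --scan --pattern "session:*:<user_id>*" | xargs -r redis-cli DEL
redis-cli --scan --pattern "mapping:*:<user_id>*" | xargs -r redis-cli DEL
# Clear refresh token families
redis-cli --scan --pattern "refresh:user:<user_id>*" | xargs -r redis-cli DEL
# Caution: this pattern matches every user's token families; filter to this user's families before deleting
redis-cli --scan --pattern "refresh:family:*" | xargs -r redis-cli DEL
# Clear rate limit records
# Caution: keys are per IP, not per user, so this clears login rate limits for all clients
redis-cli --scan --pattern "auth-login:*" | xargs -r redis-cli DEL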
```
### Step 4: WHMCS Account Handling
WHMCS does not support full account deletion. Options:
**Option A: Close Account (Recommended)**
1. Cancel all active services
2. Set account status to "Closed"
3. Anonymize personal fields via WHMCS Admin
4. Document closure date
**Option B: Anonymize via API**
```bash
# Update client to anonymized data
curl -X POST "$WHMCS_API_URL" \
-d "identifier=$WHMCS_API_IDENTIFIER" \
-d "secret=$WHMCS_API_SECRET" \
-d "action=UpdateClient" \
-d "clientid=<whmcs_client_id>" \
-d "firstname=Deleted" \
-d "lastname=User" \
-d "email=deleted_<whmcs_client_id>@deleted.local" \
-d "address1=Deleted" \
-d "city=Deleted" \
-d "state=Deleted" \
-d "postcode=000-0000" \
-d "phonenumber=000-0000-0000" \
-d "status=Closed" \
-d "responsetype=json"
```
### Step 5: Salesforce Record Handling
Salesforce records often have legal retention requirements:
**For Personal Data:**
1. Work with Salesforce Admin
2. Consider anonymization vs deletion
3. Check integration impact (linked Orders, Cases)
**Anonymization Approach:**
- Update Account name to "Deleted Account - [ID]"
- Clear personal fields (phone, address if not needed)
- Keep transactional records with anonymized references
---
## Data Export Procedures
### Customer Data Export Request
When a customer requests their data:
#### 1. Portal Data Export
```sql
-- Export user data
SELECT
u.id,
u.email,
u.email_verified,
u.created_at,
u.last_login_at,
im.whmcs_client_id,
im.sf_account_id
FROM users u
LEFT JOIN id_mappings im ON u.id = im.user_id
WHERE u.email = 'customer@example.com';
-- Export audit log (security events)
SELECT
action,
resource,
success,
created_at
FROM audit_logs
WHERE user_id = '<user_id>'
ORDER BY created_at DESC;
-- Export notifications
SELECT
type,
title,
message,
read,
created_at
FROM notifications
WHERE user_id = '<user_id>'
ORDER BY created_at DESC;
-- Export SIM usage history (if applicable)
SELECT
call_date,
call_time,
called_to,
duration_sec,
charge_yen
FROM sim_call_history_domestic
WHERE account = '<sim_account>'
ORDER BY call_date DESC;
```
#### 2. WHMCS Data Export
Request via WHMCS Admin:
- Client Details
- Invoices
- Services/Subscriptions
- Tickets/Support History
- Transaction History
#### 3. Salesforce Data Export
Request via Salesforce Admin:
- Account record
- Contact record
- Order history
- Case history
- Opportunities
### Export Format
Provide data in machine-readable format:
- JSON for structured data
- CSV for tabular data
- PDF for documents (invoices)
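
For the CSV part of an export, field quoting is the usual pitfall: notification messages and similar free text can contain commas, quotes, or line breaks. The sketch below shows one way to serialize query results; `Row` and `toCsv` are hypothetical names, and the quoting follows the common RFC 4180 convention.

```typescript
type Cell = string | number | boolean | null;
type Row = Record<string, Cell>;

// Quote a value when it contains a comma, double quote, or line break.
function csvField(value: Cell): string {
  const text = value === null ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize rows to CSV, using the keys of the first row as the header line.
function toCsv(rows: Row[]): string {
  if (rows.length === 0) return "";
  const columns = Object.keys(rows[0]);
  const lines = [columns.map(csvField).join(",")];
  for (const row of rows) {
    lines.push(columns.map((column) => csvField(row[column] ?? null)).join(","));
  }
  return lines.join("\r\n");
}
```

NULL database values are written as empty fields here; if the recipient needs to distinguish NULL from an empty string, JSON is the safer format.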
---
## PII Handling During Debugging
### Safe Logging Practices
The BFF uses Pino with automatic PII redaction. Sensitive fields are sanitized:
```json
{
"email": "cust***@example.com",
"password": "[REDACTED]",
"token": "[REDACTED]",
"authorization": "[REDACTED]"
}
```
### What NOT to Log
- Full email addresses (use masked version)
- Passwords or password hashes
- JWT tokens
- API keys or secrets
- Credit card numbers
- Full phone numbers
- Full addresses
- ID document contents
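
Where an email address has to appear in a log line or a support note, it can be masked first. This is a small sketch of such a helper; `maskEmail` is a hypothetical name and is not the redaction logic that Pino applies in the BFF.

```typescript
// Keep at most `visible` leading characters of the local part, then mask the rest.
function maskEmail(email: string, visible = 3): string {
  const at = email.indexOf("@");
  if (at < 0) return "***"; // not an email address: reveal nothing
  const prefix = email.slice(0, Math.min(visible, at));
  return `${prefix}***${email.slice(at)}`;
}
```

Capping the prefix at the length of the local part keeps short addresses from leaking the `@` boundary into the visible portion.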
### Safe Debug Queries
```sql
-- Use ID instead of email for lookups
SELECT * FROM users WHERE id = '<uuid>';
-- Mask PII in query results
SELECT
id,
CONCAT(LEFT(email, 3), '***', SUBSTRING(email FROM POSITION('@' IN email))) as masked_email,
created_at
FROM users
WHERE id = '<uuid>';
```
### Production Debugging
When investigating production issues:
1. **Use correlation IDs** - Search logs by request ID, not user email
2. **Access minimal data** - Only query what's needed
3. **Document access** - Note why you accessed customer data
4. **Use anonymized exports** - When sharing data for analysis
---
## Data Retention Policies
### Recommended Retention Periods
| Data Type | Retention | Justification |
| ------------------------ | ---------- | ---------------------- |
| Active user accounts | Indefinite | Active service |
| Closed accounts (portal) | 30 days | Grace period |
| Audit logs | 2 years | Security compliance |
| Session data (Redis) | 24 hours | Active sessions |
| Rate limit data | 15 minutes | Operational |
| Invoices | 7 years | Tax/legal requirement |
| Support cases | 5 years | Service history |
| Call/SMS history | 6 months | Billing reconciliation |
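
The retention periods in this table can be encoded as data so that a cleanup job and a review script agree on the same numbers. The sketch below is illustrative only: the type names are hypothetical, and the periods are approximated in days (1 year as 365 days, 6 months as 183 days).

```typescript
type RetainedData = "closedAccount" | "auditLog" | "invoice" | "supportCase" | "callSmsHistory";

// Retention periods from the table above, approximated in days.
const RETENTION_DAYS: Record<RetainedData, number> = {
  closedAccount: 30,
  auditLog: 2 * 365,
  invoice: 7 * 365,
  supportCase: 5 * 365,
  callSmsHistory: 183,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// `referenceDate` is the date retention is counted from (for example, account closure).
function isPastRetention(type: RetainedData, referenceDate: Date, now: Date): boolean {
  const ageDays = (now.getTime() - referenceDate.getTime()) / MS_PER_DAY;
  return ageDays > RETENTION_DAYS[type];
}
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to test against fixed dates.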
### Automated Cleanup
```sql
-- Delete expired notifications (30 days after expiry)
DELETE FROM notifications
WHERE expires_at < NOW() - INTERVAL '30 days';
-- Anonymize old audit logs (over 2 years)
UPDATE audit_logs
SET ip_address = 'EXPIRED',
user_agent = 'EXPIRED'
WHERE created_at < NOW() - INTERVAL '2 years'
AND ip_address != 'EXPIRED';
```
---
## Compliance Checklist
### Monthly Review
- [ ] Review data access logs for unusual patterns
- [ ] Verify automated cleanup jobs are running
- [ ] Check for pending deletion requests
- [ ] Review new data collection points
### Quarterly Review
- [ ] Audit third-party data sharing
- [ ] Review retention policies
- [ ] Update data inventory if schema changed
- [ ] Staff training on data handling
### Annual Review
- [ ] Full data protection impact assessment
- [ ] Policy review and updates
- [ ] Vendor compliance verification
- [ ] Documentation updates
---
## Emergency Data Breach Response
If a data breach is suspected:
1. **Contain** - Isolate affected systems
2. **Assess** - Determine scope and data exposed
3. **Notify** - Inform DPO/legal within 24 hours
4. **Report** - GDPR (Article 33) requires notifying the supervisory authority within 72 hours of becoming aware of the breach
5. **Remediate** - Fix vulnerability and prevent recurrence
6. **Document** - Full incident report
See [Incident Response](./incident-response.md) for general incident procedures.
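
Both notification windows run from the moment the breach is detected, so it helps to compute the deadlines once and record them in the incident ticket. A minimal sketch, with hypothetical names:

```typescript
const HOUR_MS = 60 * 60 * 1000;

interface BreachDeadlines {
  internalNotifyBy: Date; // DPO/legal, 24 hours
  authorityReportBy: Date; // GDPR notification, 72 hours
}

// Both windows are counted from the time the team became aware of the breach.
function breachDeadlines(awareAt: Date): BreachDeadlines {
  return {
    internalNotifyBy: new Date(awareAt.getTime() + 24 * HOUR_MS),
    authorityReportBy: new Date(awareAt.getTime() + 72 * HOUR_MS),
  };
}
```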
---
## Related Documents
- [Incident Response](./incident-response.md)
- [Database Operations](./database-operations.md)
- [Logging Guide](./logging.md)
- [Security Monitoring](./security-monitoring.md)
---
**Last Updated:** December 2025


@ -0,0 +1,375 @@
# Monitoring Dashboard Setup
This document provides guidance for setting up monitoring infrastructure for the Customer Portal.
---
## Health Endpoints
The BFF exposes several health check endpoints for monitoring:
| Endpoint | Purpose | Authentication |
| ------------------------------- | ------------------------------------------ | -------------- |
| `GET /health` | Core system health (database, cache) | Public |
| `GET /health/queues` | Request queue metrics (WHMCS, Salesforce) | Public |
| `GET /health/queues/whmcs` | WHMCS queue details | Public |
| `GET /health/queues/salesforce` | Salesforce queue details | Public |
| `GET /health/catalog/cache` | Catalog cache metrics | Public |
| `GET /auth/health-check` | Integration health (DB, WHMCS, Salesforce) | Public |
### Core Health Response
```json
{
"status": "ok",
"checks": {
"database": "ok",
"cache": "ok"
}
}
```
**Status Values:**
- `ok` - All systems healthy
- `degraded` - One or more systems failing
### Queue Health Response
```json
{
  "timestamp": "2025-01-15T10:30:00.000Z",
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "totalRequests": 1500,
      "completedRequests": 1495,
      "failedRequests": 5,
      "queueSize": 0,
      "pendingRequests": 2,
      "averageWaitTime": 50,
      "averageExecutionTime": 250
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ... },
    "dailyUsage": { "used": 5000, "limit": 15000 }
  }
}
```
---
## Key Metrics to Monitor
### Application Metrics
| Metric | Source | Warning | Critical | Description |
| ------------------- | --------------- | ------------- | ---------------- | --------------------- |
| Health status | `/health` | `degraded` | Any check `fail` | Core system health |
| Response time (p95) | Logs/APM | >2s | >5s | API response latency |
| Error rate | Logs/APM | >1% | >5% | HTTP 5xx responses |
| Active connections | Node.js metrics | >80% capacity | >95% capacity | Connection pool usage |
### Database Metrics
| Metric | Source | Warning | Critical | Description |
| --------------------- | --------------------- | --------- | --------- | --------------------------- |
| Connection pool usage | PostgreSQL | >80% | >95% | Active connections vs limit |
| Query duration | PostgreSQL logs | >500ms | >2s | Slow query detection |
| Database size | PostgreSQL | >80% disk | >90% disk | Storage capacity |
| Dead tuples | `pg_stat_user_tables` | >10% | >25% | Vacuum needed |
### Cache Metrics
| Metric | Source | Warning | Critical | Description |
| -------------- | ---------------- | -------------- | -------------- | ------------------------- |
| Redis memory | Redis INFO | >80% maxmemory | >95% maxmemory | Memory pressure |
| Cache hit rate | Application logs | <80% | <60% | Cache effectiveness |
| Redis latency | Redis CLI | >10ms | >50ms | Command latency |
| Evictions | Redis INFO | Any | High rate | Memory pressure indicator |
### Queue Metrics
| Metric | Source | Warning | Critical | Description |
| --------------------- | ---------------- | ---------- | ---------- | ---------------------- |
| WHMCS queue size | `/health/queues` | >10 | >50 | Pending WHMCS requests |
| WHMCS failed requests | `/health/queues` | >5 | >20 | Failed API calls |
| SF daily API usage | `/health/queues` | >80% limit | >95% limit | Salesforce API quota |
| BullMQ wait queue | Redis | >10 | >50 | Job backlog |
| BullMQ failed jobs | Redis | >5 | >20 | Processing failures |
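
The warning and critical columns above map directly onto a small classification function, which a poller of `/health/queues` could use before pushing alerts. The sketch below copies three thresholds from the table; the names `classify` and `QUEUE_THRESHOLDS` are hypothetical.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Thresholds copied from the queue metrics table; usage is expressed as a ratio of the daily limit.
const QUEUE_THRESHOLDS = {
  whmcsQueueSize: { warning: 10, critical: 50 },
  whmcsFailedRequests: { warning: 5, critical: 20 },
  sfDailyUsageRatio: { warning: 0.8, critical: 0.95 },
};

// Values strictly above a threshold trigger that level, matching the ">" in the table.
function classify(value: number, threshold: Threshold): Severity {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}
```

With the sample response shown earlier (`used: 5000`, `limit: 15000`), `classify(5000 / 15000, QUEUE_THRESHOLDS.sfDailyUsageRatio)` evaluates to `"ok"`.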
### External Dependency Metrics
| Metric | Source | Warning | Critical | Description |
| ------------------------ | ------ | ------- | -------- | -------------------- |
| Salesforce response time | Logs | >2s | >5s | SF API latency |
| WHMCS response time | Logs | >2s | >5s | WHMCS API latency |
| Freebit response time | Logs | >3s | >10s | Freebit API latency |
| External error rate | Logs | >1% | >5% | Integration failures |
---
## Structured Logging for Metrics
The BFF uses Pino for structured JSON logging. Key fields for metrics extraction:
```json
{
"timestamp": "2025-01-15T10:30:00.000Z",
"level": "info",
"service": "customer-portal-bff",
"correlationId": "req-123",
"message": "API call completed",
"duration": 250,
"path": "/api/invoices",
"method": "GET",
"statusCode": 200
}
```
### Log Queries for Metrics
**Error Count (entire log file):**
```bash
# Counts error-level lines (Pino level 50); filter by timestamp first for a per-hour figure
grep -c '"level":50' /var/log/bff/combined.log
```
**Slow Requests (>2s):**
```bash
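# Matches durations of 2000 ms or more (2000-9999, or any value with five or more digits)
grep -E '"duration":([2-9][0-9]{3}|[0-9]{5,})' /var/log/bff/combined.log | tail -20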
```
**External API Errors:**
```bash
grep -E '(WHMCS|Salesforce|Freebit).*error' /var/log/bff/error.log | tail -20
```
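
The grep commands give quick counts, but ratios and percentiles need the log lines parsed. The following is a rough sketch of that step, assuming one JSON object per line with a numeric Pino `level` and an optional `duration` in milliseconds; the function names are hypothetical.

```typescript
interface LogEntry {
  level: number;
  duration?: number;
}

// Parse newline-delimited JSON, skipping blank or malformed lines.
function parseLogLines(text: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed = JSON.parse(line);
      if (parsed !== null && typeof parsed === "object" && typeof parsed.level === "number") {
        entries.push(parsed as LogEntry);
      }
    } catch {
      // Not JSON (for example a stack trace line): ignore it.
    }
  }
  return entries;
}

// Share of log entries at error level (50) or above.
function errorRate(entries: LogEntry[]): number {
  if (entries.length === 0) return 0;
  return entries.filter((entry) => entry.level >= 50).length / entries.length;
}

// Nearest-rank 95th percentile of logged durations; undefined when none were logged.
function p95Duration(entries: LogEntry[]): number | undefined {
  const durations = entries
    .map((entry) => entry.duration)
    .filter((d): d is number => typeof d === "number")
    .sort((a, b) => a - b);
  if (durations.length === 0) return undefined;
  return durations[Math.ceil(0.95 * durations.length) - 1];
}
```

Note that this error rate is the share of error-level log lines, which only approximates the HTTP 5xx rate used in the thresholds above.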
---
## Grafana Dashboard Setup
### Data Sources
1. **Prometheus** - For application metrics
2. **Loki** - For log aggregation
3. **PostgreSQL** - For database metrics
### Recommended Panels
#### Overview Dashboard
1. **System Health** (Stat panel)
- Query: `/health` endpoint status
- Show: ok/degraded indicator
2. **Request Rate** (Graph panel)
- Source: Prometheus/Loki
- Show: Requests per second
3. **Error Rate** (Graph panel)
- Source: Loki log count
- Filter: `level >= 50`
4. **Response Time (p95)** (Graph panel)
- Source: Prometheus histogram
- Show: 95th percentile latency
#### Queue Dashboard
1. **Queue Depths** (Graph panel)
- Source: `/health/queues` endpoint
- Show: WHMCS and SF queue sizes
2. **Failed Jobs** (Stat panel)
- Source: Redis BullMQ metrics
- Show: Failed job count
3. **Salesforce API Usage** (Gauge panel)
- Source: `/health/queues/salesforce`
- Show: Daily usage vs limit
#### Database Dashboard
1. **Connection Pool** (Gauge panel)
- Source: PostgreSQL `pg_stat_activity`
- Show: Active connections
2. **Query Performance** (Table panel)
- Source: PostgreSQL `pg_stat_statements`
- Show: Slowest queries
### Sample Prometheus Scrape Config
```yaml
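# Note: /health returns JSON, which Prometheus cannot parse directly.
# Point metrics_path at an endpoint that serves the Prometheus text format,
# or put an exporter in front of /health.
scrape_configs:
  - job_name: "portal-bff"
    static_configs:
      - targets: ["bff:4000"]
    metrics_path: "/health"
    scrape_interval: 30s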
```
---
## CloudWatch Setup (AWS)
### Custom Metrics
Push metrics from health endpoints to CloudWatch:
```bash
# Example: Push queue depth metric
aws cloudwatch put-metric-data \
--namespace "CustomerPortal" \
--metric-name "WhmcsQueueDepth" \
--value $(curl -s http://localhost:4000/health/queues | jq '.whmcs.metrics.queueSize') \
--dimensions Environment=production
```
### Recommended CloudWatch Alarms
| Alarm | Metric | Threshold | Period | Action |
| ------------- | ---------------- | --------- | ------ | ---------------- |
| HighErrorRate | ErrorCount | >10 | 5 min | SNS notification |
| HighLatency | p95 ResponseTime | >2000ms | 5 min | SNS notification |
| QueueBacklog | WhmcsQueueDepth | >50 | 5 min | SNS notification |
| DatabaseDown | HealthStatus | !=ok | 1 min | PagerDuty |
| CacheDown | HealthStatus | !=ok | 1 min | PagerDuty |
### Log Insights Queries
**Error Summary:**
```sql
fields @timestamp, @message
| filter level >= 50
| stats count() by bin(5m)
```
**Slow Requests:**
```sql
fields @timestamp, path, duration
| filter duration > 2000
| sort duration desc
| limit 20
```
---
## DataDog Setup
### Agent Configuration
```yaml
# datadog.yaml
logs_enabled: true

# conf.d/nodejs.d/conf.yaml (log sources are configured per integration, not in datadog.yaml)
logs:
  - type: file
    path: /var/log/bff/combined.log
    service: customer-portal-bff
    source: nodejs
```
### Custom Metrics
```typescript
// Example: Report queue metrics to DataDog
import { StatsD } from "hot-shots";
const dogstatsd = new StatsD({ host: "localhost", port: 8125 });
// Report queue depth
dogstatsd.gauge("portal.whmcs.queue_depth", metrics.queueSize);
dogstatsd.gauge("portal.whmcs.failed_requests", metrics.failedRequests);
```
### Recommended Monitors
1. **Health Check Monitor**
- Check: HTTP check on `/health`
- Alert: When status != ok for 2 minutes
2. **Error Rate Monitor**
- Metric: `portal.errors.count`
- Alert: When >5% for 5 minutes
3. **Queue Depth Monitor**
- Metric: `portal.whmcs.queue_depth`
- Alert: When >50 for 5 minutes
---
## Alerting Best Practices
### Alert Priority Levels
| Priority | Response Time | Examples |
| ----------- | ------------- | --------------------------------------------- |
| P1 Critical | 15 minutes | Portal down, database unreachable |
| P2 High | 1 hour | Provisioning failing, payment processing down |
| P3 Medium | 4 hours | Degraded performance, high error rate |
| P4 Low | 24 hours | Minor issues, informational alerts |
### Alert Routing
```yaml
# Example PagerDuty routing
routes:
  - match:
      severity: critical
    receiver: pagerduty-oncall
  - match:
      severity: warning
    receiver: slack-ops
  - match:
      severity: info
    receiver: email-team
```
### Runbook Links
Include runbook links in all alerts:
- Health check failures → [Incident Response](./incident-response.md)
- Database issues → [Database Operations](./database-operations.md)
- Queue problems → [Queue Management](./queue-management.md)
- External API failures → [External Dependencies](./external-dependencies.md)
---
## Monitoring Checklist
### Initial Setup
- [ ] Configure health endpoint scraping (every 30s)
- [ ] Set up log aggregation (Loki, CloudWatch, or DataDog)
- [ ] Create overview dashboard with key metrics
- [ ] Configure P1/P2 alerts for critical failures
- [ ] Test alert routing to on-call
### Ongoing Maintenance
- [ ] Review alert thresholds quarterly
- [ ] Check for alert fatigue (too many false positives)
- [ ] Update dashboards when new features are deployed
- [ ] Validate runbook links are current
---
## Related Documents
- [Incident Response](./incident-response.md)
- [Logging Guide](./logging.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)
---
**Last Updated:** December 2025


@ -0,0 +1,395 @@
# Rate Limit Tuning Guide
This document covers rate limiting configuration, adjustment procedures, and troubleshooting for the Customer Portal.
---
## Rate Limiting Overview
The portal uses multiple rate limiting mechanisms:
| Type | Scope | Backend | Purpose |
| ------------------------- | ---------------------------------- | ------------------- | --------------------------- |
| **Auth Rate Limiting** | Per endpoint (login, signup, etc.) | Redis | Prevent brute force attacks |
| **Global Rate Limiting** | Per route/controller | Redis | API abuse prevention |
| **Request Queues** | Per external API | In-memory (p-queue) | External API protection |
| **SSE Connection Limits** | Per user | In-memory | Resource protection |
---
## Authentication Rate Limits
### Configuration
| Endpoint | Env Variable | Default | Window |
| -------------------- | --------------------------------- | ----------- | ------ |
| Login | `LOGIN_RATE_LIMIT_LIMIT` | 5 attempts | 15 min |
| Login (TTL) | `LOGIN_RATE_LIMIT_TTL` | 900000 ms | - |
| Signup | `SIGNUP_RATE_LIMIT_LIMIT` | 5 attempts | 15 min |
| Signup (TTL) | `SIGNUP_RATE_LIMIT_TTL` | 900000 ms | - |
| Password Reset | `PASSWORD_RESET_RATE_LIMIT_LIMIT` | 5 attempts | 15 min |
| Password Reset (TTL) | `PASSWORD_RESET_RATE_LIMIT_TTL` | 900000 ms | - |
| Token Refresh | `AUTH_REFRESH_RATE_LIMIT_LIMIT` | 10 attempts | 5 min |
| Token Refresh (TTL) | `AUTH_REFRESH_RATE_LIMIT_TTL` | 300000 ms | - |
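
To make the relationship between a limit and its TTL concrete, here is an in-memory fixed-window limiter. It is an illustration only: the portal stores its counters in Redis and may use a different algorithm, and `FixedWindowLimiter` is a hypothetical name.

```typescript
interface WindowState {
  count: number;
  resetAt: number; // epoch milliseconds when the window expires
}

class FixedWindowLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly ttlMs: number;

  constructor(limit: number, ttlMs: number) {
    this.limit = limit;
    this.ttlMs = ttlMs;
  }

  // Returns true when the attempt is allowed; `now` is injectable for testing.
  attempt(key: string, now: number = Date.now()): boolean {
    const state = this.windows.get(key);
    if (state === undefined || now >= state.resetAt) {
      // First attempt, or the previous window expired: start a new window.
      this.windows.set(key, { count: 1, resetAt: now + this.ttlMs });
      return true;
    }
    if (state.count >= this.limit) return false;
    state.count += 1;
    return true;
  }
}
```

With the login defaults (`new FixedWindowLimiter(5, 900000)`), the sixth attempt inside 15 minutes is rejected, and the counter starts over once the window has elapsed.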
### CAPTCHA Configuration
| Setting | Env Variable | Default | Description |
| ----------------- | ------------------------------ | ------- | ------------------------------------ |
| CAPTCHA Threshold | `LOGIN_CAPTCHA_AFTER_ATTEMPTS` | 3 | Show CAPTCHA after N failed attempts |
| CAPTCHA Always On | `AUTH_CAPTCHA_ALWAYS_ON` | false | Require CAPTCHA for all logins |
### Adjusting Auth Rate Limits
**In Production (requires restart):**
```bash
# Edit .env file
LOGIN_RATE_LIMIT_LIMIT=10 # Increase to 10 attempts
LOGIN_RATE_LIMIT_TTL=1800000 # Extend window to 30 minutes
# Restart backend
docker compose restart backend
```
**Reset a Rate Limit via Redis (immediate, no restart):**
```bash
# Check current rate limit for a key
redis-cli GET "auth-login:<ip-hash>"
# Delete a rate limit record to allow immediate retry
redis-cli DEL "auth-login:<ip-hash>"
```
---
## Global API Rate Limits
### Configuration
Global rate limits are applied via the `@RateLimit` decorator:
```typescript
@RateLimit({ limit: 100, ttl: 60 }) // 100 requests per minute
@Controller('invoices')
export class InvoicesController { ... }
```
### Common Rate Limit Settings
| Endpoint | Limit | TTL | Notes |
| ------------- | ----- | --- | --------------------- |
| Invoices | 100 | 60s | High-traffic endpoint |
| Subscriptions | 100 | 60s | High-traffic endpoint |
| Catalog | 200 | 60s | Cached, higher limit |
| Orders | 50 | 60s | Write operations |
| Profile | 60 | 60s | Standard limit |
### Adjusting Global Rate Limits
Global rate limits are defined in code. To adjust:
1. Modify the `@RateLimit` decorator in the controller
2. Deploy the change
```typescript
// Before
@RateLimit({ limit: 50, ttl: 60 })
// After (double the limit)
@RateLimit({ limit: 100, ttl: 60 })
```
---
## External API Request Queues
### WHMCS Queue Configuration
| Setting | Env Variable | Default | Description |
| ------------ | -------------------------- | ------- | ----------------------- |
| Concurrency | `WHMCS_QUEUE_CONCURRENCY` | 15 | Max parallel requests |
| Interval Cap | `WHMCS_QUEUE_INTERVAL_CAP` | 300 | Max requests per minute |
| Timeout | `WHMCS_QUEUE_TIMEOUT_MS` | 30000 | Request timeout (ms) |
### Salesforce Queue Configuration
| Setting | Env Variable | Default | Description |
| ------------------------ | ----------------------------- | ------- | ----------------------- |
| Standard Concurrency | `SF_QUEUE_CONCURRENCY` | 10 | Standard operations |
| Long-Running Concurrency | `SF_LONG_RUNNING_CONCURRENCY` | 5 | Bulk operations |
| Interval Cap | `SF_QUEUE_INTERVAL_CAP` | 200 | Max requests per minute |
| Timeout | `SF_QUEUE_TIMEOUT_MS` | 30000 | Request timeout (ms) |
### Adjusting Queue Limits
**Production Adjustment:**
```bash
# Edit .env file
WHMCS_QUEUE_CONCURRENCY=20 # Increase concurrent requests
WHMCS_QUEUE_INTERVAL_CAP=500 # Increase requests per minute
# Restart backend
docker compose restart backend
```
### Queue Health Monitoring
```bash
# Check queue metrics
curl http://localhost:4000/health/queues | jq '.'
# Expected output:
{
"whmcs": {
"health": "healthy",
"metrics": {
"queueSize": 0,
"pendingRequests": 2,
"failedRequests": 0
}
},
"salesforce": {
"health": "healthy",
"metrics": { ... },
"dailyUsage": { "used": 5000, "limit": 15000 }
}
}
```
---
## SSE Connection Limits
### Configuration
```typescript
// Per-user SSE connection limit (in-memory)
private readonly maxPerUser = 3;
```
This prevents a single user from opening unlimited SSE connections.
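
A per-user limit of this kind typically comes down to a counter that is incremented when a stream opens and decremented when it closes. The sketch below shows that idea; it is not the code in `realtime-connection-limiter.service.ts`, and the class and method names are hypothetical.

```typescript
class ConnectionLimiter {
  private readonly counts = new Map<string, number>();
  private readonly maxPerUser: number;

  constructor(maxPerUser = 3) {
    this.maxPerUser = maxPerUser;
  }

  // Reserve a slot; returns false when the user is already at the limit.
  acquire(userId: string): boolean {
    const current = this.counts.get(userId) ?? 0;
    if (current >= this.maxPerUser) return false;
    this.counts.set(userId, current + 1);
    return true;
  }

  // Free a slot when the SSE connection closes.
  release(userId: string): void {
    const current = this.counts.get(userId) ?? 0;
    if (current <= 1) this.counts.delete(userId);
    else this.counts.set(userId, current - 1);
  }
}
```

Because the counter lives in process memory, the limit applies per BFF instance; running several instances raises the effective per-user ceiling accordingly.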
### Adjusting SSE Limits
This requires a code change in `realtime-connection-limiter.service.ts`:
```typescript
// Change from
private readonly maxPerUser = 3;
// To
private readonly maxPerUser = 5;
```
---
## Bypassing Rate Limits for Testing
### Temporary Bypass via Redis
```bash
# Clear all rate limit keys for testing
redis-cli KEYS "auth-*" | xargs redis-cli DEL
redis-cli KEYS "rate-limit:*" | xargs redis-cli DEL
# Clear specific user's rate limit
redis-cli KEYS "*<ip-or-user-identifier>*" | xargs redis-cli DEL
```
### Using SkipRateLimit Decorator
For development/testing routes:
```typescript
@SkipRateLimit()
@Get('test-endpoint')
async testEndpoint() { ... }
```
### Environment-Based Bypass
Add a development bypass in configuration:
```bash
# In .env (development only!)
RATE_LIMIT_BYPASS_ENABLED=true
```
```typescript
// In guard
if (this.configService.get("RATE_LIMIT_BYPASS_ENABLED") === "true") {
return true;
}
```
> **Warning**: Never enable bypass in production!
---
## Signs of Rate Limit Issues
### User-Facing Symptoms
| Symptom | Possible Cause | Investigation |
| -------------------------- | ------------------- | ------------------------- |
| "Too many requests" errors | Rate limit exceeded | Check Redis keys, logs |
| Login failures | Auth rate limit | Check `auth-login:*` keys |
| Slow API responses | Queue backlog | Check `/health/queues` |
| 429 errors in logs | Any rate limit | Check logs for specifics |
### Monitoring Indicators
| Metric | Warning | Critical | Action |
| ----------------- | ------------- | -------- | ------------------------ |
| 429 error rate | >1% | >5% | Review rate limits |
| Queue size | >10 | >50 | Increase concurrency |
| Average wait time | >1s | >5s | Scale or increase limits |
| CAPTCHA triggers | Unusual spike | - | Possible attack |
### Log Analysis
```bash
# Find rate limit exceeded events
grep "Rate limit exceeded" /var/log/bff/combined.log | tail -20
# Find 429 responses
grep '"statusCode":429' /var/log/bff/combined.log | tail -20
# Count rate limit events by path
grep "Rate limit exceeded" /var/log/bff/combined.log | \
jq -r '.path' | sort | uniq -c | sort -rn
```
---
## Troubleshooting
### Too Many 429 Errors
**Diagnosis:**
```bash
# Check which endpoints are rate limited
grep "Rate limit exceeded" /var/log/bff/combined.log | \
jq '{path: .path, key: .key}' | head -20
# Check queue health
curl http://localhost:4000/health/queues
```
**Resolution:**
1. Identify the affected endpoint
2. Check if limit is appropriate for traffic
3. Increase limit if legitimate traffic
4. Add caching if requests are repetitive
### Legitimate Users Being Blocked
**Diagnosis:**
```bash
# Check rate limit state for specific key
redis-cli KEYS "*<identifier>*"
redis-cli GET "auth-login:<hash>"
```
**Resolution:**
```bash
# Clear the user's rate limit record
redis-cli DEL "auth-login:<hash>"
```
### External API Rate Limit Violations
**WHMCS Rate Limiting:**
```bash
# Check queue metrics
curl http://localhost:4000/health/queues/whmcs
# Reduce concurrency if WHMCS is overloaded (set in .env, then restart the backend)
WHMCS_QUEUE_CONCURRENCY=5
WHMCS_QUEUE_INTERVAL_CAP=100
```
**Salesforce API Limits:**
```bash
# Check daily API usage
curl http://localhost:4000/health/queues/salesforce | jq '.dailyUsage'
# If approaching limit, reduce requests
# Consider caching more data
```
### Redis Connection Issues
If rate limiting fails due to Redis:
```bash
# Check Redis connectivity
redis-cli PING
# The guard fails open on Redis errors (allows request)
# Check logs for "Rate limiter error - failing open"
```
---
## Best Practices
### Setting Rate Limits
1. **Start Conservative** - Begin with lower limits, increase as needed
2. **Monitor Before Adjusting** - Understand traffic patterns first
3. **Consider User Experience** - Limits should rarely impact normal use
4. **Document Changes** - Track why limits were adjusted
### Rate Limit Strategies
| Strategy | Use Case | Implementation |
| ---------- | ----------------------- | ---------------------- |
| IP-based | Anonymous endpoints | Default behavior |
| User-based | Authenticated endpoints | Include user ID in key |
| Combined | Sensitive endpoints | IP + User-Agent hash |
| Tiered | Different user classes | Custom logic |
### Performance Considerations
- **Redis Latency** - Keep Redis co-located with BFF
- **Key Expiration** - Use TTL to prevent Redis bloat
- **Fail Open** - Rate limiter allows requests if Redis fails
- **Logging** - Log blocked requests for analysis
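The fail-open and key-expiration points can be sketched together. This is a minimal illustration under stated assumptions: `CounterStore` is a hypothetical interface standing in for a Redis client, and `checkRateLimit` is not the BFF's guard. Only the log message is taken from this document.

```typescript
// Hypothetical store interface; a Redis client would sit behind it.
interface CounterStore {
  // Increments the counter for `key`, applying a TTL when the key is created,
  // and resolves to the new count.
  increment(key: string, ttlSeconds: number): Promise<number>;
}

interface RateLimitDecision {
  allowed: boolean;
  failedOpen: boolean;
}

async function checkRateLimit(
  store: CounterStore,
  key: string,
  limit: number,
  ttlSeconds: number,
  log: (message: string) => void = () => {},
): Promise<RateLimitDecision> {
  try {
    const count = await store.increment(key, ttlSeconds);
    return { allowed: count <= limit, failedOpen: false };
  } catch {
    // Fail open: a store outage must not block legitimate traffic
    log("Rate limiter error - failing open");
    return { allowed: true, failedOpen: true };
  }
}
```

Returning `failedOpen` separately lets callers count how often the limiter was bypassed, which is useful when investigating Redis incidents.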
---
## Rate Limit Response Headers
The BFF includes conventional rate limit headers in its responses:
```http
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704110400
Retry-After: 60
```
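Clients can use these to implement backoff: wait for the number of seconds given in `Retry-After`, or until the Unix timestamp in `X-RateLimit-Reset`, before retrying.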
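A client-side sketch of that backoff calculation follows. It assumes header names are lower-cased by the HTTP client, that `X-RateLimit-Reset` is a Unix timestamp in seconds (as the example value suggests), and that `Retry-After` is given as a number of seconds rather than an HTTP date. The function name and the 1-second fallback are illustrative choices.

```typescript
// Parses a non-negative number of seconds, or returns null if absent or invalid.
function parseSeconds(value: string | undefined): number | null {
  if (value === undefined || value.trim() === "") return null;
  const parsed = Number(value);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : null;
}

// Computes how long to wait (in milliseconds) before retrying a rate-limited request.
function retryDelayMs(headers: Record<string, string | undefined>, nowMs: number): number {
  // Prefer Retry-After (delay in seconds) when present
  const retryAfter = parseSeconds(headers["retry-after"]);
  if (retryAfter !== null) return retryAfter * 1000;

  // Otherwise wait until the window resets (assumed Unix timestamp in seconds)
  const reset = parseSeconds(headers["x-ratelimit-reset"]);
  if (reset !== null) return Math.max(0, reset * 1000 - nowMs);

  // No usable header: fall back to a fixed delay (assumption)
  return 1000;
}
```

A caller would typically pass `Date.now()` as `nowMs` and sleep for the returned duration before retrying.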
---
## Related Documents
- [Incident Response](./incident-response.md)
- [Monitoring Setup](./monitoring-setup.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)
---
**Last Updated:** December 2025


@ -0,0 +1,402 @@
# Release and Deployment Procedures
This document covers pre-deployment checklists, deployment procedures, post-deployment verification, and rollback procedures for the Customer Portal.
---
## Deployment Overview
| Environment | Method | Script | Notes |
| ----------- | -------------- | ------------------ | ------------------------------------ |
| Development | Local | `pnpm dev` | Apps run locally, services in Docker |
| Production | Docker Compose | `pnpm prod:deploy` | Full containerized deployment |
| Updates | Docker Compose | `pnpm prod:update` | Zero-downtime application updates |
### Available Commands
```bash
pnpm prod:deploy # Full deployment (build + start + migrate)
pnpm prod:start # Start all production services
pnpm prod:stop # Stop all production services
pnpm prod:update # Zero-downtime update (rebuild and recreate apps)
pnpm prod:status # Show service status and health
pnpm prod:logs # Show service logs
pnpm prod:backup # Create database backup
pnpm prod:cleanup # Clean up old containers and images
```
---
## Pre-Deployment Checklist
### Code Review
- [ ] All changes have been reviewed and approved
- [ ] No console.log/console.error statements in production code
- [ ] No hardcoded secrets or credentials
- [ ] TypeScript compilation passes (`pnpm type-check`)
- [ ] Linting passes (`pnpm lint`)
- [ ] Tests pass (`pnpm test`)
### Environment Configuration
- [ ] All required environment variables are set in `.env`
- [ ] Database URL is correct for production
- [ ] Redis URL is correct for production
- [ ] External API credentials are valid (Salesforce, WHMCS, Freebit)
- [ ] CORS_ORIGIN matches production domain
- [ ] JWT_SECRET is secure and unique
**Required Environment Variables:**
```bash
DATABASE_URL # PostgreSQL connection string
REDIS_URL # Redis connection string
JWT_SECRET # Secure secret (min 32 chars)
POSTGRES_PASSWORD # Database password
CORS_ORIGIN # Frontend domain
NEXT_PUBLIC_API_BASE # BFF API URL
BFF_PORT # Backend port (usually 4000)
```
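The checklist above can also be checked programmatically before a deployment. The sketch below is an assumption-laden illustration: the variable names and the 32-character minimum come from the list above, while the `validateEnv` function is hypothetical and independent of whatever validation `pnpm prod:deploy` performs.

```typescript
// Variable names are taken from the list above.
const REQUIRED_ENV_VARS = [
  "DATABASE_URL",
  "REDIS_URL",
  "JWT_SECRET",
  "POSTGRES_PASSWORD",
  "CORS_ORIGIN",
  "NEXT_PUBLIC_API_BASE",
  "BFF_PORT",
] as const;

// Returns a list of human-readable problems; an empty list means the checks passed.
function validateEnv(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  for (const name of REQUIRED_ENV_VARS) {
    const value = env[name];
    if (value === undefined || value.trim() === "") {
      problems.push(`${name} is not set`);
    }
  }
  // JWT_SECRET must be at least 32 characters long
  const secret = env["JWT_SECRET"];
  if (secret !== undefined && secret.trim() !== "" && secret.length < 32) {
    problems.push("JWT_SECRET is shorter than 32 characters");
  }
  return problems;
}
```

A pre-deployment script could call `validateEnv(process.env)` and abort when the returned list is non-empty.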
### Database Migration Check
- [ ] Review pending migrations (`npx prisma migrate status`)
- [ ] Test migrations on staging/local first
- [ ] Create database backup before applying migrations
- [ ] Prepare rollback SQL if migration is destructive
- [ ] Estimate migration duration for large tables
### Dependency Check
- [ ] Run security audit (`pnpm security:check`)
- [ ] No high/critical vulnerabilities
- [ ] All dependencies are at expected versions
- [ ] Lock file is up to date (`pnpm-lock.yaml`)
### Communication
- [ ] Notify team of deployment schedule
- [ ] Schedule during low-traffic window if possible
- [ ] Prepare customer communication if downtime expected
- [ ] Ensure on-call engineer is available
---
## Deployment Procedure
### Standard Deployment (First Time)
```bash
# 1. Create database backup (if updating existing system)
pnpm prod:backup
# 2. Full deployment
pnpm prod:deploy
```
This command:
1. Validates environment configuration
2. Builds production Docker images
3. Starts database and cache services
4. Waits for database readiness
5. Runs Prisma migrations
6. Starts frontend and backend services
7. Performs health checks
### Application Update (Zero-Downtime)
For updates that don't require database migrations:
```bash
# 1. Create database backup
pnpm prod:backup
# 2. Update applications
pnpm prod:update
```
This rebuilds and recreates frontend and backend containers without stopping the database.
### Database Migration Deployment
For deployments with schema changes:
```bash
# 1. Create database backup
pnpm prod:backup
# 2. Stop application to prevent writes during migration
pnpm prod:stop
# 3. Start only database
docker compose -f docker/prod/docker-compose.yml up -d database
# 4. Run migrations
docker compose -f docker/prod/docker-compose.yml run --rm backend pnpm db:migrate
# 5. Verify migration success
docker compose -f docker/prod/docker-compose.yml exec database psql -U portal -d portal_prod -c "SELECT * FROM _prisma_migrations ORDER BY finished_at DESC LIMIT 5;"
# 6. Start all services
pnpm prod:start
# 7. Verify application health
pnpm prod:status
```
---
## Post-Deployment Verification
### Immediate Checks (0-5 minutes)
- [ ] Health endpoints return `ok`
```bash
curl http://localhost:4000/health
curl http://localhost:3000/_health
```
- [ ] No error spikes in logs
```bash
pnpm prod:logs backend | grep -i error | tail -20
```
- [ ] Database migrations applied successfully
- [ ] Redis connectivity verified
### Functional Checks (5-15 minutes)
- [ ] User can log in to portal
- [ ] Dashboard loads correctly
- [ ] Invoice list displays
- [ ] Subscription list displays
- [ ] Catalog products load
### Integration Checks (15-30 minutes)
- [ ] Salesforce connectivity verified
```bash
curl http://localhost:4000/auth/health-check | jq '.services.salesforce'
```
- [ ] WHMCS connectivity verified
```bash
curl http://localhost:4000/auth/health-check | jq '.services.whmcs'
```
- [ ] Queue health verified
```bash
curl http://localhost:4000/health/queues
```
### Monitoring Checks
- [ ] Metrics are being collected
- [ ] No alert triggers from deployment
- [ ] Log aggregation is working
- [ ] Error rates are normal
---
## Rollback Procedures
### Application Rollback (No DB Changes)
If deployment fails without database changes:
```bash
# 1. Stop current deployment
pnpm prod:stop
# 2. Checkout previous version
git checkout <previous-tag-or-commit>
# 3. Rebuild and deploy
pnpm prod:deploy
```
### Application Rollback with Docker Images
If previous images are available:
```bash
# 1. Stop current services
pnpm prod:stop
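# 2. Start with previous image tags
#    (docker compose up has no -e flag; this assumes the compose file
#    reads BACKEND_IMAGE and FRONTEND_IMAGE from the environment)
BACKEND_IMAGE=portal-backend:previous \
FRONTEND_IMAGE=portal-frontend:previous \
docker compose -f docker/prod/docker-compose.yml up -d --no-build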
```
### Database Rollback
If database migration needs to be reverted:
**Option 1: Restore from Backup**
```bash
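# 1. Stop application
pnpm prod:stop
# 2. Start only the database (prod:stop stops all services, including the database)
docker compose -f docker/prod/docker-compose.yml up -d database
# 3. Restore database (-T disables TTY allocation so stdin redirection works)
docker compose -f docker/prod/docker-compose.yml exec -T database psql -U portal -d portal_prod < backup_YYYYMMDD_HHMMSS.sql
# 4. Checkout previous code version
git checkout <previous-tag>
# 5. Rebuild and restart
pnpm prod:deploy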
**Option 2: Manual Rollback SQL**
```bash
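# 1. Stop application
pnpm prod:stop
# 2. Start only the database (prod:stop stops all services, including the database)
docker compose -f docker/prod/docker-compose.yml up -d database
# 3. Apply rollback script (if prepared; -T disables TTY allocation for stdin redirection)
docker compose -f docker/prod/docker-compose.yml exec -T database psql -U portal -d portal_prod < rollback_migration_YYYYMMDD.sql
# 4. Manually remove migration record
docker compose -f docker/prod/docker-compose.yml exec database psql -U portal -d portal_prod -c "DELETE FROM _prisma_migrations WHERE migration_name = '20240115_migration_name';"
# 5. Restart with previous code
git checkout <previous-tag>
pnpm prod:deploy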
### Emergency Rollback
For critical failures requiring immediate action:
```bash
# 1. Immediately stop all services
pnpm prod:stop
# 2. Start only the database, then restore from most recent backup
docker compose -f docker/prod/docker-compose.yml up -d database
docker compose -f docker/prod/docker-compose.yml exec -T database psql -U portal -d portal_prod < /path/to/latest_backup.sql
# 3. Deploy last known good version
git checkout <last-known-good-tag>
pnpm prod:deploy
# 4. Notify team
# Send incident notification
---
## Feature Flags
The portal does not currently use a formal feature flag system. Feature availability is controlled through:
1. **Environment Variables** - Toggle features via configuration
2. **Conditional Rendering** - Frontend checks for feature availability
3. **Backend Feature Checks** - API endpoints check configuration
### Adding a Feature Toggle
```typescript
// Backend: Check environment variable
const featureEnabled = this.configService.get("FEATURE_NEW_CHECKOUT", "false") === "true";
// Frontend: Check feature availability
if (process.env.NEXT_PUBLIC_FEATURE_NEW_CHECKOUT === "true") {
// Render new feature
}
```
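Centralizing the flag check avoids repeating string comparisons across the codebase. The helper below is a hypothetical sketch, not existing portal code. It takes the environment as a parameter so it can be used on both backend and frontend, and it treats only the value `true` (ignoring case and surrounding whitespace) as enabled.

```typescript
// Hypothetical helper: returns true only when the flag is set to "true".
// Unset flags and any other value (for example "1" or "yes") count as disabled.
function isFeatureEnabled(env: Record<string, string | undefined>, flag: string): boolean {
  return (env[flag] ?? "false").trim().toLowerCase() === "true";
}
```

Treating every value other than `true` as disabled keeps a typo in `.env` from accidentally enabling a feature.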
### Emergency Feature Disable
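To disable a backend feature without rebuilding images:
1. Update environment variable in `.env`
2. Recreate the affected services so they pick up the new value (`docker compose restart` reuses the existing container environment and does not apply `.env` changes):
```bash
docker compose -f docker/prod/docker-compose.yml up -d --force-recreate backend frontend
```
`NEXT_PUBLIC_*` variables are inlined into the frontend bundle at build time, so frontend toggles that rely on them take effect only after the frontend image is rebuilt.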
---
## Deployment Timeline Template
| Time | Action | Owner | Notes |
| ----- | ------------------------------- | ---------- | ------------------------- |
| T-24h | Announce deployment window | Tech Lead | Notify all stakeholders |
| T-2h | Final code review | Developers | Verify all changes merged |
| T-1h | Pre-deployment checklist | DevOps | Complete all checks |
| T-30m | Create backup | DevOps | Verify backup integrity |
| T-15m | Notify team deployment starting | DevOps | Slack/Teams message |
| T-0 | Execute deployment | DevOps | Run deployment commands |
| T+5m | Immediate verification | DevOps | Health checks |
| T+15m | Functional verification | QA/DevOps | Test key flows |
| T+30m | All-clear or rollback decision | Tech Lead | Confirm success |
| T+1h | Post-deployment monitoring | DevOps | Watch metrics |
| T+24h | Close deployment | Tech Lead | Final verification |
---
## Troubleshooting
### Build Failures
```bash
# Check Docker daemon
docker info
# Check disk space
df -h
# Clean Docker resources
# (-a removes all unused images, including previous image tags kept for rollback)
docker system prune -a
```
### Migration Failures
```bash
# Check migration status
npx prisma migrate status
# View migration history
docker compose -f docker/prod/docker-compose.yml exec database psql -U portal -d portal_prod -c "SELECT * FROM _prisma_migrations;"
# Reset migration (development only!)
npx prisma migrate reset
```
### Service Startup Failures
```bash
# Check service logs
pnpm prod:logs backend
pnpm prod:logs frontend
# Check container status
docker compose -f docker/prod/docker-compose.yml ps -a
# Check resource usage
docker stats
```
### Database Connection Issues
```bash
# Test database connectivity
docker compose -f docker/prod/docker-compose.yml exec database pg_isready -U portal -d portal_prod
# Check connection count
docker compose -f docker/prod/docker-compose.yml exec database psql -U portal -d portal_prod -c "SELECT count(*) FROM pg_stat_activity;"
```
---
## Related Documents
- [Deployment Guide](../getting-started/deployment.md)
- [Database Operations](./database-operations.md)
- [Incident Response](./incident-response.md)
- [Monitoring Setup](./monitoring-setup.md)
---
**Last Updated:** December 2025