# Incident Response Runbook

This document defines procedures for responding to production incidents affecting the Customer Portal.

---

## Severity Classification

| Severity | Definition | Response Time | Examples |
| --- | --- | --- | --- |
| **P1 - Critical** | Complete service outage or data loss | 15 minutes | Portal unreachable, database corruption, security breach |
| **P2 - High** | Major feature unavailable | 1 hour | Order provisioning failing, payment processing down |
| **P3 - Medium** | Degraded performance or partial outage | 4 hours | Slow response times, intermittent errors, single integration down |
| **P4 - Low** | Minor issue, workaround available | 24 hours | UI glitches, non-critical feature bugs |

---

## Escalation Matrix

| Level | Scope | Contact | When to Escalate |
| --- | --- | --- | --- |
| **L1** | Initial Response | On-call engineer | All incidents |
| **L2** | Technical Lead | Development lead | P1/P2 not resolved in 30 minutes |
| **L3** | Management | Engineering manager | P1 not resolved in 1 hour, customer impact |
| **L4** | External | Vendor support | External system failure (Salesforce, WHMCS, Freebit) |

### On-Call Contacts

> **Note**: Update this section with actual contact information for your team.

| Role | Contact Method | Backup |
| --- | --- | --- |
| Primary On-Call | [Slack/PagerDuty] | [Phone] |
| Secondary On-Call | [Slack/PagerDuty] | [Phone] |
| Engineering Lead | [Slack/Email] | [Phone] |

---

## Common Incident Scenarios
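Each scenario below pairs symptoms with diagnosis commands against the BFF health endpoint at `http://localhost:4000/health`. Before picking a scenario, a quick first pass is to list every failing health check at once. A sketch, assuming the response shape `{ "checks": { "database": ..., "cache": ... } }` implied by the `jq` paths used below, and that healthy checks report `"ok"` or `"pass"`:

```shell
#!/usr/bin/env bash
# list_failing_checks: read a /health JSON body on stdin and print
# every check that is not passing. Treating "ok"/"pass" as the
# healthy values is an assumption about the health check format.
list_failing_checks() {
  jq -r '.checks
         | to_entries[]
         | select(.value != "ok" and .value != "pass")
         | "FAILING: \(.key) -> \(.value)"'
}
```

During triage: `curl -sf http://localhost:4000/health | list_failing_checks` - an empty result means every check passes.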
### 1. Salesforce Platform Events Not Being Received

**Symptoms:**

- Orders stuck in "Pending Review" status
- No provisioning activity in logs
- `sf:pe:replay:*` Redis keys not updating

**Diagnosis:**

```bash
# Check BFF logs for Platform Event subscription
grep "Platform Event" /var/log/bff/combined.log | tail -50

# Check Redis replay ID
redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e"

# Verify Salesforce connectivity
curl http://localhost:4000/health
```

**Resolution:**

1. Verify `SF_EVENTS_ENABLED=true` in the environment
2. Check Salesforce Connected App JWT authentication
3. Verify Platform Event permissions for the integration user
4. Set `SF_EVENTS_REPLAY=ALL` temporarily to replay missed events
5. Restart the BFF to re-establish the subscription

**Escalation:** If unresolved in 30 minutes, contact the Salesforce admin.

---

### 2. WHMCS API Unavailable

**Symptoms:**

- Billing pages showing "service unavailable"
- Provisioning failing with WHMCS errors
- Payment method checks failing

**Diagnosis:**

```bash
# Check WHMCS connectivity from the BFF host
curl -X POST "$WHMCS_API_URL" -d "action=GetClients&responsetype=json"

# Check BFF logs for WHMCS errors
grep "WHMCS" /var/log/bff/error.log | tail -20
```

**Resolution:**

1. Verify the WHMCS server is accessible
2. Check the WHMCS API credentials (`WHMCS_API_IDENTIFIER`, `WHMCS_API_SECRET`)
3. Check WHMCS server load and resource usage
4. Contact the WHMCS hosting provider if the server is down

**Escalation:** If the WHMCS server is down, contact the hosting provider.

---

### 3. Redis Connection Failures

**Symptoms:**

- Authentication failing
- Cache misses on every request
- Rate limiting not working
- SSE connections dropping

**Diagnosis:**

```bash
# Check Redis connectivity
redis-cli ping

# Check Redis memory usage
redis-cli INFO memory

# Check BFF health endpoint
curl http://localhost:4000/health | jq '.checks.cache'
```

**Resolution:**

1. Verify the Redis URL in the environment (`REDIS_URL`)
2. Check Redis server memory usage and eviction policy
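Steps 1 and 2 can be run as a single quick check. A sketch (the `INFO memory` field names are standard Redis output; passing the URL with `redis-cli -u` assumes a reasonably recent redis-cli):

```shell
#!/usr/bin/env bash
# Resolution steps 1-2 as one check: confirm REDIS_URL is set, the
# server answers PING, and report memory usage and eviction policy.
redis_quick_check() {
  # Step 1: REDIS_URL must be set in the environment.
  : "${REDIS_URL:?REDIS_URL is not set}"
  # Reachability, as in the Diagnosis section above.
  redis-cli -u "$REDIS_URL" PING >/dev/null || { echo "Redis unreachable"; return 1; }
  # Step 2: memory usage and eviction policy.
  redis-cli -u "$REDIS_URL" INFO memory | redis_mem_summary
}

# Keep only the INFO memory fields relevant to step 2 (INFO output
# uses CRLF line endings, hence the tr).
redis_mem_summary() {
  tr -d '\r' | grep -E '^(used_memory_human|maxmemory_human|maxmemory_policy):'
}
```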
3. Restart Redis if memory is exhausted
4. Clear stale keys if necessary: `redis-cli FLUSHDB` (caution: clears all cached data)

**Impact Note:** A Redis failure causes:

- Token blacklist checks to fail (a security risk if `AUTH_BLACKLIST_FAIL_CLOSED=false`)
- All cached data to be re-fetched from source systems
- Rate limiting to stop working

---

### 4. Database Connection Issues

**Symptoms:**

- All API requests failing with 500 errors
- Health check shows the database as "fail"
- Prisma connection errors in logs

**Diagnosis:**

```bash
# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1"

# Check connection count
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"

# Check BFF health endpoint
curl http://localhost:4000/health | jq '.checks.database'
```

**Resolution:**

1. Verify the PostgreSQL server is running
2. Check connection pool limits (Prisma `connection_limit`)
3. Look for long-running queries and kill them if necessary
4. Restart the database if it is unresponsive

**Escalation:** If the database is corrupted, see the [Database Operations Runbook](./database-operations.md).

---

### 5. High Error Rate / Performance Degradation

**Symptoms:**

- Increased response times (>2s average)
- Error rate above 1%
- Customer complaints

**Diagnosis:**

```bash
# Check BFF process resource usage
top -p $(pgrep -f "node.*bff")

# Check recent error logs
tail -100 /var/log/bff/error.log

# Check external API response times in logs
grep "duration" /var/log/bff/combined.log | tail -20
```

**Resolution:**

1. Identify which external API is slow (Salesforce, WHMCS, Freebit)
2. Check for traffic spikes or unusual patterns
3. Scale horizontally if CPU/memory constrained
4. Enable circuit breakers or increase timeouts temporarily

---

### 6. Security Incident

**Symptoms:**

- Unusual login patterns
- Suspected unauthorized access
- Data exfiltration alerts

**Immediate Actions:**

1. **DO NOT** modify logs or evidence
2. Notify the security team immediately
3. Consider isolating affected systems
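Between isolating a system (step 3) and documenting observations (step 4), it helps to snapshot volatile state with read-only commands so evidence is not modified. A sketch; the output directory and the exact command set are assumptions to adapt to your environment:

```shell
#!/usr/bin/env bash
# Snapshot volatile host state with read-only commands before
# isolation. The output directory and command set are illustrative;
# extend them to match your environment.
capture_volatile_state() {
  outdir="${1:-/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)}"
  mkdir -p "$outdir"
  date -u +"%Y-%m-%dT%H:%M:%SZ" > "$outdir/captured_at.txt"
  who     > "$outdir/logged_in_users.txt" 2>&1 || true
  ps aux  > "$outdir/processes.txt"       2>&1 || true
  ss -tnp > "$outdir/tcp_connections.txt" 2>&1 || true
  echo "$outdir"   # record this path in the incident notes (step 4)
}
```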
4. Document all observations with timestamps

**Escalation:** This is a P1; immediately escalate to the engineering lead and management.

---

## Incident Response Workflow

```
1. DETECT
   ├── Automated alert received
   ├── Customer report
   └── Internal discovery

2. ASSESS
   ├── Determine severity (P1-P4)
   ├── Identify affected systems
   └── Estimate customer impact

3. RESPOND
   ├── Follow relevant scenario playbook
   ├── Communicate status
   └── Escalate if needed

4. RESOLVE
   ├── Implement fix
   ├── Verify resolution
   └── Monitor for recurrence

5. REVIEW
   ├── Document timeline
   ├── Identify root cause
   └── Create action items
```

---

## Communication Templates

### Internal Status Update

```
INCIDENT UPDATE - [P1/P2/P3/P4] - [Brief Description]

Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of customer impact]
Started: [Time in UTC]
Last Update: [Time in UTC]

Current Actions:
- [Action 1]
- [Action 2]

Next Update: [Time]
```

### Customer Communication (P1/P2 only)

```
We are currently experiencing issues with [service/feature].

What's happening: [Brief, non-technical description]
Impact: [What customers may experience]

Status: Our team is actively working to resolve this issue.
We will provide updates every [30 minutes/1 hour].

We apologize for any inconvenience.
```

---

## Post-Incident Review

After every P1 or P2 incident, conduct a post-incident review within 3 business days.

### Review Template

1. **Incident Summary**
   - What happened?
   - When did it start/end?
   - Who was affected?

2. **Timeline**
   - Detection time
   - Response time
   - Resolution time
   - Key milestones

3. **Root Cause Analysis**
   - What was the direct cause?
   - What were contributing factors?
   - Why wasn't this prevented?
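For the Timeline and Root Cause Analysis sections, a time-ordered slice of the error log covering the incident window is a common starting point. A sketch, assuming the `/var/log/bff/error.log` path used in the scenarios above and that each line begins with an ISO-8601 timestamp:

```shell
#!/usr/bin/env bash
# Print error-log lines whose leading timestamp falls inside the
# incident window. Lexicographic comparison is used, which is
# correct for ISO-8601 timestamps in the same timezone.
incident_window() {
  start="$1"; end="$2"; log="${3:-/var/log/bff/error.log}"
  awk -v s="$start" -v e="$end" '$1 >= s && $1 <= e' "$log"
}

# Example: incident_window 2025-12-01T10:00 2025-12-01T11:30
```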
4. **Action Items**
   - Immediate fixes applied
   - Preventive measures needed
   - Monitoring improvements
   - Documentation updates

---

## Related Documents

- [Provisioning Runbook](./provisioning-runbook.md)
- [Database Operations](./database-operations.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)
- [Logging Guide](./logging.md)

---

**Last Updated:** December 2025