# Incident Response Runbook

This document defines procedures for responding to production incidents affecting the Customer Portal.

---

## Severity Classification

| Severity          | Definition                             | Response Time | Examples                                                          |
| ----------------- | -------------------------------------- | ------------- | ----------------------------------------------------------------- |
| **P1 - Critical** | Complete service outage or data loss   | 15 minutes    | Portal unreachable, database corruption, security breach          |
| **P2 - High**     | Major feature unavailable              | 1 hour        | Order provisioning failing, payment processing down               |
| **P3 - Medium**   | Degraded performance or partial outage | 4 hours       | Slow response times, intermittent errors, single integration down |
| **P4 - Low**      | Minor issue, workaround available      | 24 hours      | UI glitches, non-critical feature bugs                            |
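
For alerting scripts or chat tooling, the table above can be encoded as a small lookup. This is a minimal sketch only: the function names `response_minutes` and `response_deadline_epoch` are hypothetical and do not refer to existing tooling.

```bash
# Hypothetical helpers; the values mirror the Severity Classification table.

# Print the response-time target for a severity, in minutes.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo 1440 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Print the response deadline as a Unix timestamp, given the severity and
# the detection time (also a Unix timestamp, e.g. from `date +%s`).
response_deadline_epoch() {
  local minutes
  minutes=$(response_minutes "$1") || return 1
  echo $(( $2 + minutes * 60 ))
}
```

For example, `response_deadline_epoch P1 "$(date +%s)"` prints the timestamp 15 minutes from now.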

---

## Escalation Matrix

| Level  | Scope            | Contact             | When to Escalate                                     |
| ------ | ---------------- | ------------------- | ---------------------------------------------------- |
| **L1** | Initial Response | On-call engineer    | All incidents                                        |
| **L2** | Technical Lead   | Development lead    | P1/P2 not resolved in 30 minutes                     |
| **L3** | Management       | Engineering manager | P1 not resolved in 1 hour, customer impact           |
| **L4** | External         | Vendor support      | External system failure (Salesforce, WHMCS, Freebit) |
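
The time-based rows of the matrix can be expressed as a simple decision function. The sketch below is illustrative and makes two simplifying assumptions: it models only the elapsed-time triggers (30 minutes for P1/P2, 1 hour for P1), and it leaves the "customer impact" and external-vendor (L4) criteria to human judgment. The name `escalation_level` is hypothetical.

```bash
# Hypothetical helper: print the escalation level that the time-based rules
# in the Escalation Matrix call for.
#   $1 = severity (P1..P4), $2 = minutes elapsed without resolution
escalation_level() {
  local severity=$1 elapsed=$2
  if [ "$severity" = "P1" ] && [ "$elapsed" -ge 60 ]; then
    echo L3   # P1 not resolved in 1 hour
  elif { [ "$severity" = "P1" ] || [ "$severity" = "P2" ]; } && [ "$elapsed" -ge 30 ]; then
    echo L2   # P1/P2 not resolved in 30 minutes
  else
    echo L1   # on-call engineer handles all incidents initially
  fi
}
```

L4 is deliberately absent: whether an external system is at fault depends on diagnosis, not on elapsed time.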

### On-Call Contacts

> **Note**: Update this section with actual contact information for your team.

| Role              | Contact Method    | Backup  |
| ----------------- | ----------------- | ------- |
| Primary On-Call   | [Slack/PagerDuty] | [Phone] |
| Secondary On-Call | [Slack/PagerDuty] | [Phone] |
| Engineering Lead  | [Slack/Email]     | [Phone] |

---

## Common Incident Scenarios

### 1. Salesforce Platform Events Not Being Received

**Symptoms:**

- Orders stuck in "Pending Review" status
- No provisioning activity in logs
- `sf:pe:replay:*` Redis keys not updating

**Diagnosis:**

```bash
# Check BFF logs for Platform Event subscription
grep "Platform Event" /var/log/bff/combined.log | tail -50

# Check Redis replay ID
redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e"

# Verify Salesforce connectivity
curl -X GET http://localhost:4000/health
```
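
A single read of the replay ID does not show whether it is moving. One way to check is to sample it twice and compare the values. The helper below is a sketch; `replay_status` is a hypothetical name, and it assumes that `redis-cli GET` prints an empty line for a missing key when run non-interactively.

```bash
# Hypothetical helper: classify two samples of the replay ID.
#   $1 = value read earlier, $2 = value read later
replay_status() {
  local before=$1 after=$2
  if [ -z "$after" ]; then
    echo "missing"     # key absent: no replay ID has been stored
  elif [ "$before" = "$after" ]; then
    echo "stalled"     # no change between the two samples
  else
    echo "advancing"   # events are being received and recorded
  fi
}
```

To use it, capture `redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e"` into a variable, wait a minute or so, capture it again, and pass both values to `replay_status`. A "stalled" result is only meaningful if Salesforce actually published events during the interval.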

**Resolution:**

1. Verify `SF_EVENTS_ENABLED=true` in environment
2. Check Salesforce Connected App JWT authentication
3. Verify Platform Event permissions for integration user
4. Set `SF_EVENTS_REPLAY=ALL` temporarily to replay missed events, and restore the previous value once they have been processed
5. Restart BFF to re-establish subscription

**Escalation:** If unresolved in 30 minutes, contact Salesforce admin.

---

### 2. WHMCS API Unavailable

**Symptoms:**

- Billing pages showing "service unavailable"
- Provisioning failing with WHMCS errors
- Payment method checks failing

**Diagnosis:**

```bash
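# Check WHMCS connectivity from BFF (the API rejects unauthenticated requests)
curl -s -X POST "$WHMCS_API_URL" \
  -d "identifier=$WHMCS_API_IDENTIFIER" \
  -d "secret=$WHMCS_API_SECRET" \
  -d "action=GetClients" \
  -d "responsetype=json"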

# Check BFF logs for WHMCS errors
grep "WHMCS" /var/log/bff/error.log | tail -20
```
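
WHMCS JSON responses carry a `result` field that is `success` or `error`, so the connectivity check can be reduced to a pass/fail test. This is a minimal sketch; `whmcs_ok` is a hypothetical name, and the pattern match is deliberately crude rather than a full JSON parse.

```bash
# Hypothetical helper: read a WHMCS API response on stdin and succeed only
# if it reports "result": "success".
whmcs_ok() {
  grep -q '"result"[[:space:]]*:[[:space:]]*"success"'
}
```

Pipe the `curl` response into it, for example `... | whmcs_ok && echo "WHMCS API reachable"`. Both an unreachable server and rejected credentials show up as a failure, so check the raw response to tell them apart.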

**Resolution:**

1. Verify WHMCS server is accessible
2. Check WHMCS API credentials (`WHMCS_API_IDENTIFIER`, `WHMCS_API_SECRET`)
3. Check WHMCS server load and resource usage
4. Contact WHMCS hosting provider if server is down

**Escalation:** If WHMCS server is down, contact hosting provider.

---

### 3. Redis Connection Failures

**Symptoms:**

- Authentication failing
- Cache misses on every request
- Rate limiting not working
- SSE connections dropping

**Diagnosis:**

```bash
# Check Redis connectivity
redis-cli ping

# Check Redis memory usage
redis-cli INFO memory

# Check BFF health endpoint
curl http://localhost:4000/health | jq '.checks.cache'
```
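
`INFO memory` reports raw byte counts, which are hard to judge at a glance. The sketch below turns that output into a percentage of the configured limit. `redis_mem_pct` is a hypothetical helper name; it relies only on the `used_memory` and `maxmemory` fields of the `INFO memory` output.

```bash
# Hypothetical helper: read `redis-cli INFO memory` output on stdin and print
# used memory as a whole-number percentage of maxmemory, or "unlimited" when
# maxmemory is 0 or absent (no limit configured).
redis_mem_pct() {
  # INFO output uses CRLF line endings, so strip carriage returns first.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max + 0 == 0) { print "unlimited" }
      else              { printf "%d\n", used * 100 / max }
    }'
}
```

Run it as `redis-cli INFO memory | redis_mem_pct`. A value close to 100 suggests that memory is nearly exhausted and the eviction policy (`maxmemory_policy` in the same output) is about to matter.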

**Resolution:**

1. Verify Redis URL in environment (`REDIS_URL`)
2. Check Redis server memory usage and eviction policy
3. Restart Redis if memory is exhausted
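4. Clear stale keys if necessary: `redis-cli FLUSHDB` (caution: this deletes every key in the selected database, not only stale cache entries; any `sf:pe:replay:*` replay IDs or token blacklist entries stored in the same database are lost too)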

**Impact Note:** Redis failure causes:

- Token blacklist checks to fail (security risk if `AUTH_BLACKLIST_FAIL_CLOSED=false`)
- All cached data to be re-fetched from source systems
- Rate limiting to stop working

---

### 4. Database Connection Issues

**Symptoms:**

- All API requests failing with 500 errors
- Health check shows database as "fail"
- Prisma connection errors in logs

**Diagnosis:**

```bash
# Check database connectivity (quote the URL so special characters survive the shell)
psql "$DATABASE_URL" -c "SELECT 1"

# Check connection count
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"

# Check BFF health endpoint
curl http://localhost:4000/health | jq '.checks.database'
```
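
If `DATABASE_URL` carries Prisma-specific query parameters such as `connection_limit` or `schema`, `psql` rejects the URL, because libpq does not recognize those parameters. The sketch below strips the query string before calling `psql`. It assumes the URL is in the usual `postgresql://...` form, and `psql_url` is a hypothetical helper name.

```bash
# Hypothetical helper: print a connection URL with everything from the first
# "?" onward removed. Note that this also drops parameters libpq would accept
# (such as sslmode); re-add those by hand if the server requires them.
psql_url() {
  printf '%s\n' "${1%%\?*}"
}
```

The `psql` checks above can then be run as `psql "$(psql_url "$DATABASE_URL")" -c "SELECT 1"`.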

**Resolution:**

1. Verify PostgreSQL server is running
2. Check connection pool limits (`connection_limit` in the Prisma connection URL)
3. Look for long-running queries and terminate them if necessary
4. Restart database if unresponsive

**Escalation:** If database is corrupted, see [Database Operations Runbook](./database-operations.md).

---

### 5. High Error Rate / Performance Degradation

**Symptoms:**

- Increased response times (>2s average)
- Error rate above 1%
- Customer complaints
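
The two numeric thresholds above can be checked mechanically once request counts and latency are available. The following is a sketch under stated assumptions: the counts and the average latency come from whatever metrics or log aggregation is in place, and `degraded` is a hypothetical function name.

```bash
# Hypothetical helper: succeed (exit 0) when a sample breaches either symptom
# threshold: error rate above 1% or average response time above 2000 ms.
#   $1 = total requests, $2 = failed requests, $3 = average response time (ms)
degraded() {
  local requests=$1 errors=$2 avg_ms=$3
  [ "$requests" -gt 0 ] || return 1   # no traffic, nothing to judge
  # errors/requests > 1% is equivalent to errors*100 > requests,
  # which keeps the comparison in integer arithmetic.
  if [ $(( errors * 100 )) -gt "$requests" ] || [ "$avg_ms" -gt 2000 ]; then
    return 0
  fi
  return 1
}
```

For example, `degraded 1000 11 500` succeeds because 11 errors in 1000 requests is above 1%, while `degraded 1000 10 500` does not.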

**Diagnosis:**

```bash
# Check BFF process resource usage (-d, joins multiple PIDs with commas, as top expects)
top -p "$(pgrep -d, -f 'node.*bff')"

# Check recent error logs
tail -100 /var/log/bff/error.log

# Check external API response times in logs
grep "duration" /var/log/bff/combined.log | tail -20
```

**Resolution:**

1. Identify which external API is slow (Salesforce, WHMCS, Freebit)
2. Check for traffic spikes or unusual patterns
3. Scale horizontally if CPU/memory constrained
4. Enable circuit breakers or increase timeouts temporarily

---

### 6. Security Incident

**Symptoms:**

- Unusual login patterns
- Suspected unauthorized access
- Data exfiltration alerts

**Immediate Actions:**

1. **DO NOT** modify logs or evidence
2. Notify security team immediately
3. Consider isolating affected systems
4. Document all observations with timestamps
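
For step 4, a small helper keeps timestamps consistent while leaving system logs untouched. This is a sketch only: `note` and the `INCIDENT_NOTES` variable are hypothetical names, and the notes file should live outside the affected systems.

```bash
# Hypothetical helper: append a UTC-timestamped observation to a notes file.
# INCIDENT_NOTES is an assumed variable; point it at a location outside the
# affected systems so that evidence on those systems is not modified.
INCIDENT_NOTES="${INCIDENT_NOTES:-./incident-notes.txt}"

note() {
  printf '%s  %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*" >> "$INCIDENT_NOTES"
}
```

Each call, such as `note "what was observed"`, appends one line and never rewrites earlier entries.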

**Escalation:** Treat as P1 and escalate immediately to the engineering lead and management.

---

## Incident Response Workflow

```
1. DETECT
   ├── Automated alert received
   ├── Customer report
   └── Internal discovery

2. ASSESS
   ├── Determine severity (P1-P4)
   ├── Identify affected systems
   └── Estimate customer impact

3. RESPOND
   ├── Follow relevant scenario playbook
   ├── Communicate status
   └── Escalate if needed

4. RESOLVE
   ├── Implement fix
   ├── Verify resolution
   └── Monitor for recurrence

5. REVIEW
   ├── Document timeline
   ├── Identify root cause
   └── Create action items
```

---

## Communication Templates

### Internal Status Update

```
INCIDENT UPDATE - [P1/P2/P3/P4] - [Brief Description]

Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of customer impact]
Started: [Time in UTC]
Last Update: [Time in UTC]

Current Actions:
- [Action 1]
- [Action 2]

Next Update: [Time]
```
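
To avoid retyping the template under pressure, it can be filled in by a small function. The sketch below is illustrative; `status_update` is a hypothetical name, the argument order is an arbitrary choice, and at least one action must be supplied.

```bash
# Hypothetical helper: render the Internal Status Update template.
#   $1 severity, $2 brief description, $3 status, $4 impact,
#   $5 start time (UTC), $6 next update time, $7... current actions
status_update() {
  local severity=$1 description=$2 status=$3 impact=$4 started=$5 next=$6
  shift 6
  printf 'INCIDENT UPDATE - %s - %s\n\n' "$severity" "$description"
  printf 'Status: %s\n' "$status"
  printf 'Impact: %s\n' "$impact"
  printf 'Started: %s\n' "$started"
  printf 'Last Update: %s\n\n' "$(date -u '+%Y-%m-%d %H:%M UTC')"
  printf 'Current Actions:\n'
  printf -- '- %s\n' "$@"   # one bullet per remaining argument
  printf '\nNext Update: %s\n' "$next"
}
```

The "Last Update" line is stamped with the current UTC time on every call, so the output can be pasted straight into the incident channel.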

### Customer Communication (P1/P2 only)

```
We are currently experiencing issues with [service/feature].

What's happening: [Brief, non-technical description]
Impact: [What customers may experience]
Status: Our team is actively working to resolve this issue.

We will provide updates every [30 minutes/1 hour].

We apologize for any inconvenience.
```

---

## Post-Incident Review

After every P1 or P2 incident, conduct a post-incident review within 3 business days.
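
The review deadline can be computed rather than counted by hand. The sketch below assumes GNU `date` (the `-d` option), counts Monday to Friday as business days, and ignores public holidays. `review_due` is a hypothetical name.

```bash
# Hypothetical helper: print the date three business days after the given
# date (YYYY-MM-DD). Requires GNU date; weekends are skipped, holidays are not.
review_due() {
  local day=$1 remaining=3 dow
  while [ "$remaining" -gt 0 ]; do
    day=$(date -u -d "$day + 1 day" '+%Y-%m-%d')
    dow=$(date -u -d "$day" '+%u')   # 1 = Monday ... 7 = Sunday
    if [ "$dow" -le 5 ]; then
      remaining=$(( remaining - 1 ))
    fi
  done
  echo "$day"
}
```

For an incident resolved on a Thursday, the function returns the following Tuesday.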

### Review Template

1. **Incident Summary**
   - What happened?
   - When did it start/end?
   - Who was affected?

2. **Timeline**
   - Detection time
   - Response time
   - Resolution time
   - Key milestones

3. **Root Cause Analysis**
   - What was the direct cause?
   - What were contributing factors?
   - Why wasn't this prevented?

4. **Action Items**
   - Immediate fixes applied
   - Preventive measures needed
   - Monitoring improvements
   - Documentation updates

---

## Related Documents

- [Provisioning Runbook](./provisioning-runbook.md)
- [Database Operations](./database-operations.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)
- [Logging Guide](./logging.md)

---

**Last Updated:** December 2025