
# Incident Response Runbook

This document defines procedures for responding to production incidents affecting the Customer Portal.

---

## Severity Classification

| Severity          | Definition                             | Response Time | Examples                                                          |
| ----------------- | -------------------------------------- | ------------- | ----------------------------------------------------------------- |
| **P1 - Critical** | Complete service outage or data loss   | 15 minutes    | Portal unreachable, database corruption, security breach          |
| **P2 - High**     | Major feature unavailable              | 1 hour        | Order provisioning failing, payment processing down               |
| **P3 - Medium**   | Degraded performance or partial outage | 4 hours       | Slow response times, intermittent errors, single integration down |
| **P4 - Low**      | Minor issue, workaround available      | 24 hours      | UI glitches, non-critical feature bugs                            |

---

## Escalation Matrix

| Level  | Scope            | Contact             | When to Escalate                                     |
| ------ | ---------------- | ------------------- | ---------------------------------------------------- |
| **L1** | Initial Response | On-call engineer    | All incidents                                        |
| **L2** | Technical Lead   | Development lead    | P1/P2 not resolved in 30 minutes                     |
| **L3** | Management       | Engineering manager | P1 not resolved in 1 hour, customer impact           |
| **L4** | External         | Vendor support      | External system failure (Salesforce, WHMCS, Freebit) |
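
The time-based rows of the matrix can be expressed as a small helper for alerting scripts or chat bots. This is a minimal sketch: the function name `escalation_level` is hypothetical, and it models only the elapsed-time triggers for L1 through L3. The "customer impact" trigger and the L4 vendor path still need human judgment.

```bash
#!/usr/bin/env bash
# Sketch: map severity and minutes elapsed since detection to the
# escalation level from the matrix above. Names are illustrative only.
escalation_level() {
  local severity="$1" elapsed_min="$2"
  case "$severity" in
    P1)
      # P1: L2 after 30 minutes unresolved, L3 after 1 hour unresolved.
      if [ "$elapsed_min" -ge 60 ]; then
        echo "L3"
      elif [ "$elapsed_min" -ge 30 ]; then
        echo "L2"
      else
        echo "L1"
      fi
      ;;
    P2)
      # P2: L2 after 30 minutes unresolved; the matrix defines no L3 timer.
      if [ "$elapsed_min" -ge 30 ]; then
        echo "L2"
      else
        echo "L1"
      fi
      ;;
    P3|P4)
      # Lower severities stay with the on-call engineer.
      echo "L1"
      ;;
    *)
      echo "unknown severity: $severity" >&2
      return 1
      ;;
  esac
}
```

For example, `escalation_level P1 45` prints `L2`.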

### On-Call Contacts

> **Note**: Update this section with actual contact information for your team.

| Role              | Contact Method    | Backup  |
| ----------------- | ----------------- | ------- |
| Primary On-Call   | [Slack/PagerDuty] | [Phone] |
| Secondary On-Call | [Slack/PagerDuty] | [Phone] |
| Engineering Lead  | [Slack/Email]     | [Phone] |

---

## Common Incident Scenarios

### 1. Salesforce Platform Events Not Being Received

**Symptoms:**

- Orders stuck in "Pending Review" status
- No provisioning activity in logs
- `sf:pe:replay:*` Redis keys not updating

**Diagnosis:**

```bash
# Check BFF logs for Platform Event subscription
grep "Platform Event" /var/log/bff/combined.log | tail -50

# Check Redis replay ID
redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e"

# Verify Salesforce connectivity
curl -X GET http://localhost:4000/health
```

**Resolution:**

1. Verify `SF_EVENTS_ENABLED=true` in environment
2. Check Salesforce Connected App JWT authentication
3. Verify Platform Event permissions for integration user
4. Set `SF_EVENTS_REPLAY=ALL` temporarily to replay missed events
5. Restart BFF to re-establish subscription

**Escalation:** If unresolved in 30 minutes, contact Salesforce admin.
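
To check whether the replay ID is actually moving, the key from the diagnosis step can be sampled twice. This is a minimal sketch, assuming `redis-cli` can reach the same Redis instance the BFF uses. `REDIS_CMD` and `replay_id_status` are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Sketch: sample the stored replay ID twice and report whether it changed.
# REDIS_CMD is a hypothetical override (e.g. to add -u "$REDIS_URL").
REDIS_CMD="${REDIS_CMD:-redis-cli}"
REPLAY_KEY="sf:pe:replay:/event/OrderProvisionRequested__e"

replay_id_status() {
  local interval="${1:-60}" before after
  before="$($REDIS_CMD GET "$REPLAY_KEY")"
  sleep "$interval"
  after="$($REDIS_CMD GET "$REPLAY_KEY")"
  if [ -z "$after" ]; then
    # Key absent or empty: no replay ID has been stored.
    echo "missing"
  elif [ "$before" = "$after" ]; then
    # Same value in both samples: no event processed in the window.
    echo "stalled"
  else
    echo "advancing"
  fi
}
```

Run it as `replay_id_status 120` to sample two minutes apart. A `stalled` result only shows that no event was recorded during the window; it may simply mean no orders were submitted, so read it alongside the log check.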

---

### 2. WHMCS API Unavailable

**Symptoms:**

- Billing pages showing "service unavailable"
- Provisioning failing with WHMCS errors
- Payment method checks failing

**Diagnosis:**

```bash
# Check WHMCS connectivity from BFF
curl -X POST "$WHMCS_API_URL" \
  -d "identifier=$WHMCS_API_IDENTIFIER" \
  -d "secret=$WHMCS_API_SECRET" \
  -d "action=GetClients&responsetype=json"

# Check BFF logs for WHMCS errors
grep "WHMCS" /var/log/bff/error.log | tail -20
```

**Resolution:**

1. Verify WHMCS server is accessible
2. Check WHMCS API credentials (`WHMCS_API_IDENTIFIER`, `WHMCS_API_SECRET`)
3. Check WHMCS server load and resource usage
4. Contact WHMCS hosting provider if server is down

**Escalation:** If WHMCS server is down, contact hosting provider.
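
WHMCS JSON responses carry a `result` field set to `success` or `error`, which helps separate a credentials problem from an unreachable server. This is a minimal sketch of classifying a response body. `whmcs_result` is a hypothetical helper name, and the pattern match is a rough check rather than a full JSON parse.

```bash
#!/usr/bin/env bash
# Sketch: classify a WHMCS API response body by its "result" field.
# Uses grep instead of a JSON parser, so treat it as a rough check.
whmcs_result() {
  local response="$1"
  if printf '%s' "$response" | grep -Eq '"result"[[:space:]]*:[[:space:]]*"success"'; then
    echo "success"
  elif printf '%s' "$response" | grep -Eq '"result"[[:space:]]*:[[:space:]]*"error"'; then
    # Server answered but rejected the call (often credentials or IP rules).
    echo "error"
  else
    # Empty body, HTML error page, or timeout output.
    echo "unreachable-or-invalid"
  fi
}
```

An `error` result means the server is up and steps 2 and 3 are the likely suspects. `unreachable-or-invalid` points to steps 1 and 4.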

---

### 3. Redis Connection Failures

**Symptoms:**

- Authentication failing
- Cache misses on every request
- Rate limiting not working
- SSE connections dropping

**Diagnosis:**

```bash
# Check Redis connectivity
redis-cli ping

# Check Redis memory usage
redis-cli INFO memory

# Check BFF health endpoint
curl http://localhost:4000/health | jq '.checks.cache'
```

**Resolution:**

1. Verify Redis URL in environment (`REDIS_URL`)
2. Check Redis server memory usage and eviction policy
3. Restart Redis if memory is exhausted
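4. Clear stale keys if necessary: `redis-cli FLUSHDB` (caution: deletes every key in the selected database, not only stale cache entries; this includes non-cache data such as the `sf:pe:replay:*` replay IDs)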

**Impact Note:** Redis failure causes:

- Token blacklist checks to fail (security risk if `AUTH_BLACKLIST_FAIL_CLOSED=false`)
- All cached data to be re-fetched from source systems
- Rate limiting to stop working
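
A narrower alternative to the `FLUSHDB` in Resolution step 4 is to delete only keys that match a pattern. This is a minimal sketch using `redis-cli --scan`. The names `REDIS_CMD` and `delete_keys_matching` are hypothetical, and the `cache:*` pattern in the usage note is an assumed prefix. Confirm the real key prefixes before running it.

```bash
#!/usr/bin/env bash
# Sketch: delete only keys matching a pattern instead of flushing the DB.
# REDIS_CMD is a hypothetical override (e.g. to add -u "$REDIS_URL").
REDIS_CMD="${REDIS_CMD:-redis-cli}"

delete_keys_matching() {
  local pattern="$1" key issued=0
  # Refuse patterns that would match the whole keyspace.
  case "$pattern" in
    ''|'*')
      echo "refusing to delete pattern '$pattern'" >&2
      return 1
      ;;
  esac
  # --scan iterates with SCAN, which avoids blocking Redis the way KEYS can.
  while IFS= read -r key; do
    [ -n "$key" ] || continue
    $REDIS_CMD DEL "$key" >/dev/null </dev/null && issued=$((issued + 1))
  done <<EOF
$($REDIS_CMD --scan --pattern "$pattern")
EOF
  # Prints the number of DEL commands that completed without error.
  echo "$issued"
}
```

Usage: `delete_keys_matching 'cache:*'`. Running `redis-cli --scan --pattern 'cache:*' | head` first shows what would be removed.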

---

### 4. Database Connection Issues

**Symptoms:**

- All API requests failing with 500 errors
- Health check shows database as "fail"
- Prisma connection errors in logs

**Diagnosis:**

```bash
# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1"

# Check connection count
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"

# Check BFF health endpoint
curl http://localhost:4000/health | jq '.checks.database'
```

**Resolution:**

1. Verify PostgreSQL server is running
2. Check connection pool limits (the `connection_limit` parameter in the Prisma connection URL)
3. Look for long-running queries and kill if necessary
4. Restart database if unresponsive

**Escalation:** If database is corrupted, see [Database Operations Runbook](./database-operations.md).
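
Resolution step 3 can be carried out with a query against `pg_stat_activity`. This is a minimal sketch that generates the SQL for a given age threshold. `long_running_queries_sql` is a hypothetical helper name, and the 5-minute default is an assumption rather than a documented policy.

```bash
#!/usr/bin/env bash
# Sketch: emit SQL listing non-idle sessions whose current query has run
# longer than N minutes (default 5, an assumed threshold).
long_running_queries_sql() {
  local min_minutes="${1:-5}"
  # Accept digits only, since the value is interpolated into SQL.
  case "$min_minutes" in
    ''|*[!0-9]*)
      echo "minutes must be a non-negative integer" >&2
      return 1
      ;;
  esac
  cat <<SQL
SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '${min_minutes} minutes'
ORDER BY runtime DESC;
SQL
}
```

Usage: `long_running_queries_sql 10 | psql "$DATABASE_URL"`. To stop a query, try `SELECT pg_cancel_backend(<pid>);` first and fall back to `SELECT pg_terminate_backend(<pid>);` only if the cancel has no effect.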

---

### 5. High Error Rate / Performance Degradation

**Symptoms:**

- Increased response times (>2s average)
- Error rate above 1%
- Customer complaints

**Diagnosis:**

```bash
# Check BFF process resource usage
top -p "$(pgrep -d, -f "node.*bff")"

# Check recent error logs
tail -100 /var/log/bff/error.log

# Check external API response times in logs
grep "duration" /var/log/bff/combined.log | tail -20
```

**Resolution:**

1. Identify which external API is slow (Salesforce, WHMCS, Freebit)
2. Check for traffic spikes or unusual patterns
3. Scale horizontally if CPU/memory constrained
4. Enable circuit breakers or increase timeouts temporarily
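
The 1% error-rate symptom can be checked from raw request counts without floating-point math. This is a minimal sketch. `error_rate_status` is a hypothetical helper, and how the counts are obtained (log lines, metrics, proxy stats) depends on your logging setup.

```bash
#!/usr/bin/env bash
# Sketch: compare an error count against a percentage threshold using
# integer arithmetic. Threshold defaults to the 1% from the symptoms list.
error_rate_status() {
  local errors="$1" total="$2" threshold_pct="${3:-1}"
  if [ "$total" -le 0 ]; then
    # No traffic in the window, so no rate can be computed.
    echo "no-data"
  elif [ $((errors * 100)) -gt $((total * threshold_pct)) ]; then
    # errors / total > threshold_pct / 100, rearranged to avoid division.
    echo "above"
  else
    echo "within"
  fi
}
```

For example, `error_rate_status 30 2000` prints `above` (1.5%), while `error_rate_status 20 2000` prints `within` (exactly 1%).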

---

### 6. Security Incident

**Symptoms:**

- Unusual login patterns
- Suspected unauthorized access
- Data exfiltration alerts

**Immediate Actions:**

1. **DO NOT** modify logs or evidence
2. Notify security team immediately
3. Consider isolating affected systems
4. Document all observations with timestamps

**Escalation:** P1 - Immediately escalate to engineering lead and management.

---

## Incident Response Workflow

```
1. DETECT
   ├── Automated alert received
   ├── Customer report
   └── Internal discovery

2. ASSESS
   ├── Determine severity (P1-P4)
   ├── Identify affected systems
   └── Estimate customer impact

3. RESPOND
   ├── Follow relevant scenario playbook
   ├── Communicate status
   └── Escalate if needed

4. RESOLVE
   ├── Implement fix
   ├── Verify resolution
   └── Monitor for recurrence

5. REVIEW
   ├── Document timeline
   ├── Identify root cause
   └── Create action items
```

---

## Communication Templates

### Internal Status Update

```
INCIDENT UPDATE - [P1/P2/P3/P4] - [Brief Description]

Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of customer impact]
Started: [Time in UTC]
Last Update: [Time in UTC]

Current Actions:
- [Action 1]
- [Action 2]

Next Update: [Time]
```

### Customer Communication (P1/P2 only)

```
We are currently experiencing issues with [service/feature].

What's happening: [Brief, non-technical description]
Impact: [What customers may experience]
Status: Our team is actively working to resolve this issue.

We will provide updates every [30 minutes/1 hour].

We apologize for any inconvenience.
```

---

## Post-Incident Review

After every P1 or P2 incident, conduct a post-incident review within 3 business days.

### Review Template

1. **Incident Summary**
   - What happened?
   - When did it start/end?
   - Who was affected?

2. **Timeline**
   - Detection time
   - Response time
   - Resolution time
   - Key milestones

3. **Root Cause Analysis**
   - What was the direct cause?
   - What were contributing factors?
   - Why wasn't this prevented?

4. **Action Items**
   - Immediate fixes applied
   - Preventive measures needed
   - Monitoring improvements
   - Documentation updates

---

## Related Documents

- [Provisioning Runbook](./provisioning-runbook.md)
- [Database Operations](./database-operations.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)
- [Logging Guide](./logging.md)

---

**Last Updated:** December 2025