# Incident Response Runbook

This document defines procedures for responding to production incidents affecting the Customer Portal.

---

## Severity Classification

| Severity          | Definition                             | Response Time | Examples                                                          |
| ----------------- | -------------------------------------- | ------------- | ----------------------------------------------------------------- |
| **P1 - Critical** | Complete service outage or data loss   | 15 minutes    | Portal unreachable, database corruption, security breach          |
| **P2 - High**     | Major feature unavailable              | 1 hour        | Order provisioning failing, payment processing down               |
| **P3 - Medium**   | Degraded performance or partial outage | 4 hours       | Slow response times, intermittent errors, single integration down |
| **P4 - Low**      | Minor issue, workaround available      | 24 hours      | UI glitches, non-critical feature bugs                            |
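
For alerting scripts or chat-bot helpers, the response times in the table can be encoded as a small lookup. The following is a minimal sketch only; the function name `response_time_minutes` is hypothetical and not part of any existing tooling.

```bash
# Print the target response time in minutes for a severity level.
# Values mirror the Severity Classification table; the function name is illustrative.
response_time_minutes() {
  case "$1" in
    P1) echo 15 ;;     # 15 minutes
    P2) echo 60 ;;     # 1 hour
    P3) echo 240 ;;    # 4 hours
    P4) echo 1440 ;;   # 24 hours
    *)
      echo "unknown severity: $1" >&2
      return 1
      ;;
  esac
}

# Example: response_time_minutes P2
```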

---

## Escalation Matrix

| Level  | Scope            | Contact             | When to Escalate                                     |
| ------ | ---------------- | ------------------- | ---------------------------------------------------- |
| **L1** | Initial Response | On-call engineer    | All incidents                                        |
| **L2** | Technical Lead   | Development lead    | P1/P2 not resolved in 30 minutes                     |
| **L3** | Management       | Engineering manager | P1 not resolved in 1 hour, customer impact           |
| **L4** | External         | Vendor support      | External system failure (Salesforce, WHMCS, Freebit) |
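
The time-based rows of the matrix can be sketched as a helper that reports which level should currently be engaged. This is an illustration under assumptions: the function name `escalation_level` is hypothetical, and it models only the 30-minute and 1-hour thresholds. The "customer impact" and "external system failure" triggers remain judgment calls for the on-call engineer.

```bash
# Print the escalation level implied by severity and elapsed time.
# Usage: escalation_level <P1|P2|P3|P4> <minutes_since_incident_start>
# Only the time thresholds from the matrix are modeled (assumption).
escalation_level() {
  severity=$1
  elapsed=$2
  level=L1   # L1 (on-call engineer) handles all incidents

  case "$severity" in
    P1|P2)
      # L2: P1/P2 not resolved in 30 minutes
      if [ "$elapsed" -ge 30 ]; then level=L2; fi
      ;;
  esac

  # L3: P1 not resolved in 1 hour
  if [ "$severity" = "P1" ] && [ "$elapsed" -ge 60 ]; then level=L3; fi

  echo "$level"
}
```

For example, `escalation_level P1 45` prints `L2`, and `escalation_level P1 60` prints `L3`.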

### On-Call Contacts

> **Note**: Update this section with actual contact information for your team.

| Role              | Contact Method    | Backup  |
| ----------------- | ----------------- | ------- |
| Primary On-Call   | [Slack/PagerDuty] | [Phone] |
| Secondary On-Call | [Slack/PagerDuty] | [Phone] |
| Engineering Lead  | [Slack/Email]     | [Phone] |

---

## Common Incident Scenarios

### 1. Salesforce Platform Events Not Being Received

**Symptoms:**

- Orders stuck in "Pending Review" status
- No provisioning activity in logs
- `sf:pe:replay:*` Redis keys not updating

**Diagnosis:**

```bash
# Check BFF logs for Platform Event subscription activity
grep "Platform Event" /var/log/bff/combined.log | tail -n 50

# Check the stored replay ID for the provisioning event channel
redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e"

# Verify Salesforce connectivity via the BFF health endpoint
curl -s http://localhost:4000/health
```

**Resolution:**

1. Verify that `SF_EVENTS_ENABLED=true` is set in the BFF environment
2. Check Salesforce Connected App JWT authentication
3. Verify Platform Event permissions for the integration user
4. To replay missed events, temporarily set `SF_EVENTS_REPLAY=ALL`, and revert it once the backlog has been processed
5. Restart the BFF to re-establish the subscription

**Escalation:** If unresolved in 30 minutes, contact Salesforce admin.
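
A single `GET` shows the current replay ID but not whether it is moving. The sketch below samples the key twice and compares the values. The helper names are hypothetical, and the 60-second default interval is an assumption. On a quiet system with no new orders, an unchanged replay ID can be normal, so treat the result as a hint rather than proof.

```bash
REPLAY_KEY="sf:pe:replay:/event/OrderProvisionRequested__e"

# Read the current replay ID from Redis (empty output if the key is missing).
sample_replay_id() {
  redis-cli GET "$REPLAY_KEY"
}

# Succeed if the second sample is non-empty and differs from the first.
replay_id_advanced() {
  [ -n "$2" ] && [ "$1" != "$2" ]
}

# Take two samples N seconds apart (default 60) and report the result.
check_replay_progress() {
  before=$(sample_replay_id)
  sleep "${1:-60}"
  after=$(sample_replay_id)
  if replay_id_advanced "$before" "$after"; then
    echo "replay ID advanced: $before -> $after"
  else
    echo "replay ID unchanged: ${after:-<missing>}"
    return 1
  fi
}

# Example: check_replay_progress 120
```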

---

### 2. WHMCS API Unavailable

**Symptoms:**

- Billing pages showing "service unavailable"
- Provisioning failing with WHMCS errors
- Payment method checks failing

**Diagnosis:**

```bash
# Check WHMCS connectivity from the BFF host (the API rejects calls without credentials)
curl -s -X POST "$WHMCS_API_URL" \
  --data-urlencode "identifier=$WHMCS_API_IDENTIFIER" \
  --data-urlencode "secret=$WHMCS_API_SECRET" \
  -d "action=GetClients&limitnum=1&responsetype=json"

# Check BFF logs for WHMCS errors
grep "WHMCS" /var/log/bff/error.log | tail -n 20
```

**Resolution:**

1. Verify the WHMCS server is accessible
2. Check WHMCS API credentials (`WHMCS_API_IDENTIFIER`, `WHMCS_API_SECRET`)
3. Check WHMCS server load and resource usage
4. Contact the WHMCS hosting provider if the server is down

**Escalation:** If WHMCS server is down, contact hosting provider.
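
To separate "server unreachable" from "server reachable but the API call fails", curl's exit code is often enough. The sketch below maps a few standard curl exit codes to likely causes. The function names are hypothetical, and the 10-second timeout is an assumed value.

```bash
# Translate a curl exit code into a likely cause.
explain_curl_exit() {
  case "$1" in
    0)     echo "server reachable; check credentials and the response body" ;;
    6)     echo "could not resolve host; check DNS and WHMCS_API_URL" ;;
    7)     echo "failed to connect; WHMCS server may be down" ;;
    28)    echo "request timed out; WHMCS server may be overloaded" ;;
    35|60) echo "TLS problem; check the server certificate" ;;
    *)     echo "curl failed with exit code $1" ;;
  esac
}

# Probe the WHMCS endpoint with a 10-second cap and explain the outcome.
probe_whmcs() {
  curl -s -o /dev/null --max-time 10 "$WHMCS_API_URL"
  explain_curl_exit "$?"
}

# Example: probe_whmcs
```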

---

### 3. Redis Connection Failures

**Symptoms:**

- Authentication failing
- Cache misses on every request
- Rate limiting not working
- SSE connections dropping

**Diagnosis:**

```bash
# Check Redis connectivity (expects PONG)
redis-cli ping

# Check Redis memory usage
redis-cli INFO memory

# Check BFF health endpoint
curl -s http://localhost:4000/health | jq '.checks.cache'
```

**Resolution:**

1. Verify the Redis URL in the environment (`REDIS_URL`)
2. Check Redis server memory usage and eviction policy
3. Restart Redis if memory is exhausted
4. As a last resort, clear the database with `redis-cli FLUSHDB`. Caution: this deletes every key in the selected database, not only cached data, so any other state stored there (for example the `sf:pe:replay:*` replay IDs, if they share the database) is lost as well

**Impact Note:** Redis failure causes:

- Token blacklist checks to fail (security risk if `AUTH_BLACKLIST_FAIL_CLOSED=false`)
- All cached data to be re-fetched from source systems
- Rate limiting to stop working
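
Two of the resolution steps can be made more precise with small helpers, sketched below under assumptions. `redis_memory_pct` parses the output of `redis-cli INFO memory` and reports used memory as a percentage of `maxmemory`. `delete_keys_matching` removes only keys that match a pattern, which is a narrower alternative to `FLUSHDB`. Both function names are hypothetical, and the `cache:*` prefix in the example is an assumed naming scheme that should be replaced with the real cache key prefix.

```bash
# Read `redis-cli INFO memory` output on stdin and print used memory as a
# percentage of maxmemory, or "unlimited" when maxmemory is 0 (no limit set).
redis_memory_pct() {
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { limit = $2 }
    END {
      if (limit > 0) { printf "%d\n", used * 100 / limit }
      else           { print "unlimited" }
    }'
}

# Delete only the keys matching a glob pattern.
# --scan iterates with SCAN, so it does not block the server the way KEYS can.
delete_keys_matching() {
  redis-cli --scan --pattern "$1" | while IFS= read -r key; do
    redis-cli DEL "$key" > /dev/null
  done
}

# Examples:
#   redis-cli INFO memory | redis_memory_pct
#   delete_keys_matching 'cache:*'   # hypothetical prefix
```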

---

### 4. Database Connection Issues

**Symptoms:**

- All API requests failing with 500 errors
- Health check shows database as "fail"
- Prisma connection errors in logs

**Diagnosis:**

```bash
# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1"

# Check connection count
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"

# Check BFF health endpoint
curl -s http://localhost:4000/health | jq '.checks.database'
```

**Resolution:**

1. Verify the PostgreSQL server is running
2. Check connection pool limits (Prisma `connection_limit`)
3. Look for long-running queries and terminate them if necessary
4. Restart the database if it is unresponsive

**Escalation:** If database is corrupted, see [Database Operations Runbook](./database-operations.md).
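
Finding long-running queries (resolution step 3) usually means filtering `pg_stat_activity` by runtime. The helper below is a sketch: the function name and the 5-minute default threshold are assumptions. It only prints the SQL, so the query can be reviewed before it is run.

```bash
# Print SQL that lists non-idle queries running longer than N minutes (default 5).
long_running_queries_sql() {
  minutes=${1:-5}
  case "$minutes" in
    *[!0-9]*)
      echo "minutes must be a whole number" >&2
      return 1
      ;;
  esac
  cat <<SQL
SELECT pid, usename, state, now() - query_start AS runtime, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND pid <> pg_backend_pid()
  AND now() - query_start > interval '${minutes} minutes'
ORDER BY runtime DESC;
SQL
}

# Examples:
#   psql "$DATABASE_URL" -c "$(long_running_queries_sql 10)"
#   psql "$DATABASE_URL" -c "SELECT pg_cancel_backend(<pid>)"     # cancel the query
#   psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(<pid>)"  # drop the connection
```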

---

### 5. High Error Rate / Performance Degradation

**Symptoms:**

- Increased response times (>2s average)
- Error rate above 1%
- Customer complaints

**Diagnosis:**

```bash
# Check BFF process resource usage (-d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, -f 'node.*bff')"

# Check recent error logs
tail -n 100 /var/log/bff/error.log

# Check external API response times in logs
grep "duration" /var/log/bff/combined.log | tail -n 20
```

**Resolution:**

1. Identify which external API is slow (Salesforce, WHMCS, Freebit)
2. Check for traffic spikes or unusual patterns
3. Scale horizontally if CPU/memory constrained
4. Enable circuit breakers or increase timeouts temporarily
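
The symptom thresholds (error rate above 1%, average response time above 2s) can be checked with two small helpers, sketched below. Both function names are hypothetical. The log extraction in the example assumes lines containing `duration=<milliseconds>`, which is a guess at the format; adjust the pattern to the actual BFF log layout.

```bash
# Succeed when errors/total is strictly above 1% (integer arithmetic only).
# Usage: error_rate_above_1pct <total_requests> <error_count>
error_rate_above_1pct() {
  total=$1
  errors=$2
  [ "$total" -gt 0 ] && [ $((errors * 100)) -gt "$total" ]
}

# Read one duration in milliseconds per line on stdin and print the mean.
mean_duration_ms() {
  awk '{ sum += $1; n++ }
       END { if (n > 0) { printf "%d\n", sum / n } else { print 0 } }'
}

# Example (hypothetical log format "duration=1234"):
#   grep -o 'duration=[0-9]*' /var/log/bff/combined.log \
#     | cut -d= -f2 | tail -n 200 | mean_duration_ms
```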

---

### 6. Security Incident

**Symptoms:**

- Unusual login patterns
- Suspected unauthorized access
- Data exfiltration alerts

**Immediate Actions:**

1. **DO NOT** modify logs or evidence
2. Notify the security team immediately
3. Consider isolating affected systems
4. Document all observations with timestamps

**Escalation:** Treat as P1. Escalate immediately to the engineering lead and management.
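
For action 4, an append-only notes file with UTC timestamps keeps observations in order without touching the original logs. The helper below is a minimal sketch; the function name and the notes file location are hypothetical.

```bash
# Append a UTC-timestamped observation to a notes file.
# Usage: log_observation <notes_file> <free text...>
log_observation() {
  notes_file=$1
  shift
  printf '%s  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$notes_file"
}

# Example (hypothetical path):
#   log_observation ~/incident-notes.txt "Unusual login burst seen in auth logs"
```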

---

## Incident Response Workflow

```
1. DETECT
   ├── Automated alert received
   ├── Customer report
   └── Internal discovery

2. ASSESS
   ├── Determine severity (P1-P4)
   ├── Identify affected systems
   └── Estimate customer impact

3. RESPOND
   ├── Follow relevant scenario playbook
   ├── Communicate status
   └── Escalate if needed

4. RESOLVE
   ├── Implement fix
   ├── Verify resolution
   └── Monitor for recurrence

5. REVIEW
   ├── Document timeline
   ├── Identify root cause
   └── Create action items
```

---

## Communication Templates

### Internal Status Update

```
INCIDENT UPDATE - [P1/P2/P3/P4] - [Brief Description]

Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of customer impact]
Started: [Time in UTC]
Last Update: [Time in UTC]

Current Actions:
- [Action 1]
- [Action 2]

Next Update: [Time]
```

### Customer Communication (P1/P2 only)

```
We are currently experiencing issues with [service/feature].

What's happening: [Brief, non-technical description]
Impact: [What customers may experience]
Status: Our team is actively working to resolve this issue.

We will provide updates every [30 minutes/1 hour].

We apologize for any inconvenience.
```

---

## Post-Incident Review

After every P1 or P2 incident, conduct a post-incident review within 3 business days.

### Review Template

1. **Incident Summary**
   - What happened?
   - When did it start/end?
   - Who was affected?

2. **Timeline**
   - Detection time
   - Response time
   - Resolution time
   - Key milestones

3. **Root Cause Analysis**
   - What was the direct cause?
   - What were contributing factors?
   - Why wasn't this prevented?

4. **Action Items**
   - Immediate fixes applied
   - Preventive measures needed
   - Monitoring improvements
   - Documentation updates

---

## Related Documents

- [Provisioning Runbook](./provisioning-runbook.md)
- [Database Operations](./database-operations.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)
- [Logging Guide](./logging.md)

---

**Last Updated:** December 2025