Assist_Design/docs/operations/incident-response.md
barsa 72d0b66be7 Enhance Documentation Structure and Update Operational Runbooks
- Added a new section for operational runbooks in README.md, detailing procedures for incident response, database operations, and queue management.
- Updated the documentation structure in STRUCTURE.md to reflect the new organization of guides and resources.
- Removed the deprecated disabled-modules.md file to streamline documentation.
- Enhanced the _archive/README.md with historical notes on documentation alignment and corrections made in December 2025.
- Updated various references in the documentation to reflect the new paths and services in the integrations directory.
2025-12-23 15:55:58 +09:00


# Incident Response Runbook
This document defines procedures for responding to production incidents affecting the Customer Portal.

---
## Severity Classification
| Severity | Definition | Response Time | Examples |
| ----------------- | -------------------------------------- | ------------- | ----------------------------------------------------------------- |
| **P1 - Critical** | Complete service outage or data loss | 15 minutes | Portal unreachable, database corruption, security breach |
| **P2 - High** | Major feature unavailable | 1 hour | Order provisioning failing, payment processing down |
| **P3 - Medium** | Degraded performance or partial outage | 4 hours | Slow response times, intermittent errors, single integration down |
| **P4 - Low** | Minor issue, workaround available | 24 hours | UI glitches, non-critical feature bugs |
---
## Escalation Matrix
| Level | Scope | Contact | When to Escalate |
| ------ | ---------------- | ------------------- | ---------------------------------------------------- |
| **L1** | Initial Response | On-call engineer | All incidents |
| **L2** | Technical Lead | Development lead | P1/P2 not resolved in 30 minutes |
| **L3** | Management | Engineering manager | P1 not resolved in 1 hour, customer impact |
| **L4** | External | Vendor support | External system failure (Salesforce, WHMCS, Freebit) |
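
The time-based rows of the matrix can be expressed as a small helper for paging scripts or chat bots. This is a minimal sketch: the function name `escalation_level` is hypothetical, it covers only the 30-minute and 1-hour thresholds from the table, and it does not model the "customer impact" or external-vendor (L4) conditions, which still need human judgment.

```bash
#!/usr/bin/env bash
# Sketch: map severity and elapsed minutes to the highest escalation level due.
# Thresholds mirror the Escalation Matrix; the function name is hypothetical.
escalation_level() {
  local severity="$1" elapsed_min="$2"
  case "$severity" in
    P1)
      if   [ "$elapsed_min" -ge 60 ]; then echo "L3"
      elif [ "$elapsed_min" -ge 30 ]; then echo "L2"
      else echo "L1"; fi ;;
    P2)
      if [ "$elapsed_min" -ge 30 ]; then echo "L2"; else echo "L1"; fi ;;
    P3|P4)
      echo "L1" ;;
    *)
      echo "unknown severity: $severity" >&2
      return 1 ;;
  esac
}
```

For example, `escalation_level P1 45` prints `L2`, because a P1 that is still open after 30 minutes goes to the development lead.
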
### On-Call Contacts
> **Note**: Update this section with actual contact information for your team.

| Role | Contact Method | Backup |
| ----------------- | ----------------- | ------- |
| Primary On-Call | [Slack/PagerDuty] | [Phone] |
| Secondary On-Call | [Slack/PagerDuty] | [Phone] |
| Engineering Lead | [Slack/Email] | [Phone] |
---
## Common Incident Scenarios
### 1. Salesforce Platform Events Not Being Received
**Symptoms:**
- Orders stuck in "Pending Review" status
- No provisioning activity in logs
- `sf:pe:replay:*` Redis keys not updating
**Diagnosis:**
```bash
# Check BFF logs for Platform Event subscription
grep "Platform Event" /var/log/bff/combined.log | tail -50
# Check Redis replay ID
redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e"
# Check the BFF health endpoint (inspect the output for Salesforce-related checks, if any)
curl -s http://localhost:4000/health
```
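
A single `GET` shows the stored replay ID but not whether it is moving. The sketch below samples the key twice and compares the values. The function names are hypothetical, and an "unchanged" result only indicates a stall if Platform Events were actually published during the interval.

```bash
#!/usr/bin/env bash
# Sketch: detect a stalled Platform Event subscription by sampling the replay ID twice.
# REPLAY_KEY is the key used in the diagnosis step; function names are hypothetical.
REPLAY_KEY="sf:pe:replay:/event/OrderProvisionRequested__e"

# Pure comparison so the logic can be checked without Redis.
replay_status() {
  local before="$1" after="$2"
  if [ -z "$after" ]; then
    echo "missing"      # key absent: no replay ID has been stored
  elif [ "$before" = "$after" ]; then
    echo "unchanged"    # no events processed during the interval
  else
    echo "advancing"
  fi
}

# Sample the key, wait, sample again (default interval: 60 seconds).
check_replay_progress() {
  local interval="${1:-60}" before after
  before="$(redis-cli GET "$REPLAY_KEY")"
  sleep "$interval"
  after="$(redis-cli GET "$REPLAY_KEY")"
  replay_status "$before" "$after"
}
```

Run `check_replay_progress 120` while a test order is submitted to see whether the subscription picks it up.
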
**Resolution:**
1. Verify `SF_EVENTS_ENABLED=true` in environment
2. Check Salesforce Connected App JWT authentication
3. Verify Platform Event permissions for integration user
4. Set `SF_EVENTS_REPLAY=ALL` temporarily to replay missed events
5. Restart BFF to re-establish subscription
**Escalation:** If unresolved in 30 minutes, contact Salesforce admin.

---
### 2. WHMCS API Unavailable
**Symptoms:**
- Billing pages showing "service unavailable"
- Provisioning failing with WHMCS errors
- Payment method checks failing
**Diagnosis:**
```bash
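# Check WHMCS connectivity from BFF (the API expects credentials on every call)
curl -s -X POST "$WHMCS_API_URL" \
  -d "identifier=$WHMCS_API_IDENTIFIER" -d "secret=$WHMCS_API_SECRET" \
  -d "action=GetClients&limitnum=1&responsetype=json"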
# Check BFF logs for WHMCS errors
grep "WHMCS" /var/log/bff/error.log | tail -20
```
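
The HTTP status of the probe can narrow down which resolution step to start with. The sketch below is illustrative only: the function names are hypothetical, and the status-to-cause mapping (in particular 403 for rejected credentials or a blocked IP) is an assumption to verify against your WHMCS version. `000` is what curl reports when no HTTP response was received at all.

```bash
#!/usr/bin/env bash
# Sketch: turn the HTTP status of a WHMCS API probe into a suggested next step.
# The mapping is an assumption; verify it against your WHMCS installation.
classify_whmcs_status() {
  case "$1" in
    000)     echo "unreachable: DNS, network, or TLS failure (Resolution step 1)" ;;
    2??)     echo "reachable: inspect the JSON 'result' field for API-level errors" ;;
    401|403) echo "rejected: check API credentials and IP allowlist (Resolution step 2)" ;;
    5??)     echo "server error: check WHMCS load and hosting (Resolution steps 3-4)" ;;
    *)       echo "unexpected status $1: inspect the response body" ;;
  esac
}

# Probe the API and print only the HTTP status code (000 if no response).
probe_whmcs() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -d "action=GetClients" -d "responsetype=json" -d "limitnum=1" \
    -d "identifier=${WHMCS_API_IDENTIFIER}" -d "secret=${WHMCS_API_SECRET}" \
    "${WHMCS_API_URL}"
}
```

Usage: `classify_whmcs_status "$(probe_whmcs)"`.
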
**Resolution:**
1. Verify WHMCS server is accessible
2. Check WHMCS API credentials (`WHMCS_API_IDENTIFIER`, `WHMCS_API_SECRET`)
3. Check WHMCS server load and resource usage
4. Contact WHMCS hosting provider if server is down
**Escalation:** If WHMCS server is down, contact hosting provider.

---
### 3. Redis Connection Failures
**Symptoms:**
- Authentication failing
- Cache misses on every request
- Rate limiting not working
- SSE connections dropping
**Diagnosis:**
```bash
# Check Redis connectivity
redis-cli ping
# Check Redis memory usage
redis-cli INFO memory
# Check BFF health endpoint
curl -s http://localhost:4000/health | jq '.checks.cache'
```
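
The raw `INFO memory` output is easier to act on as a percentage of the configured limit. The sketch below reads that output on stdin and relies on the standard `used_memory` and `maxmemory` fields. The function name `redis_memory_pct` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: report Redis memory usage as a percentage of maxmemory.
# Reads `redis-cli INFO memory` output on stdin; the function name is hypothetical.
redis_memory_pct() {
  # INFO lines end in CRLF, so strip carriage returns before parsing.
  tr -d '\r' | awk -F: '
    $1 == "used_memory" { used = $2 }
    $1 == "maxmemory"   { max = $2 }
    END {
      if (max > 0) printf "%.1f\n", used * 100 / max
      else print "unlimited"   # maxmemory 0 means no configured limit
    }'
}
```

Usage: `redis-cli INFO memory | redis_memory_pct`, followed by `redis-cli CONFIG GET maxmemory-policy` to see how Redis behaves once the limit is reached.
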
**Resolution:**
1. Verify Redis URL in environment (`REDIS_URL`)
2. Check Redis server memory usage and eviction policy
3. Restart Redis if memory is exhausted
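4. Clear stale keys only as a last resort: `redis-cli FLUSHDB` (caution: this deletes every key in the selected database, not just cache entries, including the `sf:pe:replay:*` replay IDs and any token blacklist entries stored there)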
**Impact Note:** Redis failure causes:
- Token blacklist checks to fail (security risk if `AUTH_BLACKLIST_FAIL_CLOSED=false`)
- All cached data to be re-fetched from source systems
- Rate limiting to stop working
---
### 4. Database Connection Issues
**Symptoms:**
- All API requests failing with 500 errors
- Health check shows database as "fail"
- Prisma connection errors in logs
**Diagnosis:**
```bash
# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1"
# Check connection count
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"
# Check BFF health endpoint
curl -s http://localhost:4000/health | jq '.checks.database'
```
**Resolution:**
1. Verify PostgreSQL server is running
2. Check connection pool limits (Prisma connection_limit)
3. Look for long-running queries and kill if necessary
4. Restart database if unresponsive
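
Step 3 can be sketched with PostgreSQL's standard `pg_stat_activity` view and `pg_terminate_backend` function. The helper names and the 5-minute default threshold are assumptions for illustration. Review the listed queries before terminating anything.

```bash
#!/usr/bin/env bash
# Sketch: find and terminate long-running PostgreSQL queries.
# Function names and the default threshold are hypothetical.

# Print SQL that lists non-idle queries running longer than N minutes.
long_running_sql() {
  local minutes="${1:-5}"
  case "$minutes" in
    ''|*[!0-9]*) echo "minutes must be a non-negative integer" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND pid <> pg_backend_pid()
  AND now() - query_start > interval '${minutes} minutes'
ORDER BY runtime DESC;
SQL
}

# List long-running queries (read-only).
list_long_running() {
  psql "$DATABASE_URL" -c "$(long_running_sql "${1:-5}")"
}

# Terminate a single backend by PID after reviewing the list.
terminate_backend() {
  case "$1" in
    ''|*[!0-9]*) echo "pid must be an integer" >&2; return 1 ;;
  esac
  psql "$DATABASE_URL" -c "SELECT pg_terminate_backend($1);"
}
```

Usage: `list_long_running 10`, then `terminate_backend <pid>` for each query that should be stopped.
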
**Escalation:** If database is corrupted, see [Database Operations Runbook](./database-operations.md).

---
### 5. High Error Rate / Performance Degradation
**Symptoms:**
- Increased response times (>2s average)
- Error rate above 1%
- Customer complaints
**Diagnosis:**
```bash
# Check BFF process resource usage
top -p "$(pgrep -d, -f 'node.*bff')"  # -d, joins multiple PIDs with commas, as top -p expects
# Check recent error logs
tail -100 /var/log/bff/error.log
# Check external API response times in logs
grep "duration" /var/log/bff/combined.log | tail -20
```
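
To compare current latency with the 2-second symptom threshold, response times can be sampled directly with curl's `time_total` timing variable. This is a rough sketch: the function names are hypothetical, and latency of the health endpoint is only a proxy for real API traffic.

```bash
#!/usr/bin/env bash
# Sketch: sample response times and compare the average with the 2s threshold.
# HEALTH_URL defaults to the health endpoint used elsewhere in this runbook.
HEALTH_URL="${HEALTH_URL:-http://localhost:4000/health}"

# Average a list of numbers (one per line) read from stdin.
average() {
  awk '{ sum += $1; n++ } END { if (n > 0) printf "%.3f\n", sum / n; else exit 1 }'
}

# Succeed (exit 0) when the given seconds value exceeds the limit (default 2).
is_slow() {
  awk -v t="$1" -v limit="${2:-2}" 'BEGIN { exit !(t > limit) }'
}

# Take N samples (default 5) of total request time in seconds.
sample_latency() {
  local n="${1:-5}" i
  for i in $(seq 1 "$n"); do
    curl -s -o /dev/null -w '%{time_total}\n' --max-time 10 "$HEALTH_URL"
  done
}
```

Usage: `avg="$(sample_latency 5 | average)"; is_slow "$avg" && echo "average ${avg}s exceeds 2s"`.
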
**Resolution:**
1. Identify which external API is slow (Salesforce, WHMCS, Freebit)
2. Check for traffic spikes or unusual patterns
3. Scale horizontally if CPU/memory constrained
4. Enable circuit breakers or increase timeouts temporarily
---
### 6. Security Incident
**Symptoms:**
- Unusual login patterns
- Suspected unauthorized access
- Data exfiltration alerts
**Immediate Actions:**
1. **DO NOT** modify logs or evidence
2. Notify security team immediately
3. Consider isolating affected systems
4. Document all observations with timestamps
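
One way to follow steps 1 and 4 together is to work from checksummed copies of the logs rather than the originals. The sketch below is an assumption about how a team might do this, not an established procedure: the function name and destination layout are hypothetical, and it relies on GNU coreutils `sha256sum`.

```bash
#!/usr/bin/env bash
# Sketch: copy log files to a timestamped evidence directory and record SHA-256 checksums.
# Paths and the function name are hypothetical; originals are only read, never modified.
preserve_evidence() {
  local src_dir="$1" dest_root="$2" stamp dest
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  dest="${dest_root}/evidence-${stamp}"
  mkdir -p "$dest" || return 1
  # -p preserves modification times and permissions on the copies.
  cp -p "$src_dir"/*.log "$dest"/ || return 1
  # Checksums let reviewers confirm later that the copies were not altered.
  (cd "$dest" && sha256sum -- *.log > SHA256SUMS) || return 1
  echo "$dest"
}
```

Usage: `preserve_evidence /var/log/bff <evidence-root>` prints the directory it created. The copies can be checked later with `sha256sum -c SHA256SUMS` from inside that directory.
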
**Escalation:** P1 - Immediately escalate to engineering lead and management.

---
## Incident Response Workflow
```
1. DETECT
├── Automated alert received
├── Customer report
└── Internal discovery
2. ASSESS
├── Determine severity (P1-P4)
├── Identify affected systems
└── Estimate customer impact
3. RESPOND
├── Follow relevant scenario playbook
├── Communicate status
└── Escalate if needed
4. RESOLVE
├── Implement fix
├── Verify resolution
└── Monitor for recurrence
5. REVIEW
├── Document timeline
├── Identify root cause
└── Create action items
```
---
## Communication Templates
### Internal Status Update
```
INCIDENT UPDATE - [P1/P2/P3/P4] - [Brief Description]
Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of customer impact]
Started: [Time in UTC]
Last Update: [Time in UTC]
Current Actions:
- [Action 1]
- [Action 2]
Next Update: [Time]
```
### Customer Communication (P1/P2 only)
```
We are currently experiencing issues with [service/feature].
What's happening: [Brief, non-technical description]
Impact: [What customers may experience]
Status: Our team is actively working to resolve this issue.
We will provide updates every [30 minutes/1 hour].
We apologize for any inconvenience.
```
---
## Post-Incident Review
After every P1 or P2 incident, conduct a post-incident review within 3 business days.
### Review Template
1. **Incident Summary**
- What happened?
- When did it start/end?
- Who was affected?
2. **Timeline**
- Detection time
- Response time
- Resolution time
- Key milestones
3. **Root Cause Analysis**
- What was the direct cause?
- What were contributing factors?
- Why wasn't this prevented?
4. **Action Items**
- Immediate fixes applied
- Preventive measures needed
- Monitoring improvements
- Documentation updates
---
## Related Documents
- [Provisioning Runbook](./provisioning-runbook.md)
- [Database Operations](./database-operations.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)
- [Logging Guide](./logging.md)
---
**Last Updated:** December 2025