# Incident Response Runbook

This document defines procedures for responding to production incidents affecting the Customer Portal.
## Severity Classification

| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Complete service outage or data loss | 15 minutes | Portal unreachable, database corruption, security breach |
| P2 - High | Major feature unavailable | 1 hour | Order provisioning failing, payment processing down |
| P3 - Medium | Degraded performance or partial outage | 4 hours | Slow response times, intermittent errors, single integration down |
| P4 - Low | Minor issue, workaround available | 24 hours | UI glitches, non-critical feature bugs |
## Escalation Matrix

| Level | Scope | Contact | When to Escalate |
|---|---|---|---|
| L1 | Initial Response | On-call engineer | All incidents |
| L2 | Technical Lead | Development lead | P1/P2 not resolved in 30 minutes |
| L3 | Management | Engineering manager | P1 not resolved in 1 hour, customer impact |
| L4 | External | Vendor support | External system failure (Salesforce, WHMCS, Freebit) |
## On-Call Contacts

> **Note**: Update this section with actual contact information for your team.

| Role | Contact Method | Backup |
|---|---|---|
| Primary On-Call | [Slack/PagerDuty] | [Phone] |
| Secondary On-Call | [Slack/PagerDuty] | [Phone] |
| Engineering Lead | [Slack/Email] | [Phone] |
## Common Incident Scenarios

### 1. Salesforce Platform Events Not Receiving

**Symptoms:**

- Orders stuck in "Pending Review" status
- No provisioning activity in logs
- `sf:pe:replay:*` Redis keys not updating
**Diagnosis:**

```bash
# Check BFF logs for Platform Event subscription
grep "Platform Event" /var/log/bff/combined.log | tail -50

# Check the stored Redis replay ID
redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e"

# Verify Salesforce connectivity via the BFF health endpoint
curl -X GET http://localhost:4000/health
```
**Resolution:**

- Verify `SF_EVENTS_ENABLED=true` in the environment
- Check Salesforce Connected App JWT authentication
- Verify Platform Event permissions for the integration user
- Set `SF_EVENTS_REPLAY=ALL` temporarily to replay missed events
- Restart the BFF to re-establish the subscription
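The replay toggle can be scripted so it is done the same way every time. A minimal sketch, assuming a systemd-managed BFF and an env file at `/etc/bff/bff.env` (both the path and the service name are assumptions; adjust for your deployment):

```shell
#!/usr/bin/env bash
# Sketch: toggle SF_EVENTS_REPLAY in the BFF env file, then restart the
# service so it re-subscribes and replays missed events.
ENV_FILE="${ENV_FILE:-/etc/bff/bff.env}"   # assumed path

# Upsert KEY=VALUE in $ENV_FILE (replaces an existing line or appends one).
set_env_var() {
  key="$1"; value="$2"
  if grep -q "^${key}=" "$ENV_FILE" 2>/dev/null; then
    sed -i "s|^${key}=.*|${key}=${value}|" "$ENV_FILE"
  else
    printf '%s=%s\n' "$key" "$value" >> "$ENV_FILE"
  fi
}

# set_env_var SF_EVENTS_REPLAY ALL   # replay all retained events
# systemctl restart bff              # assumed service name
# Once the backlog drains, restore the previous SF_EVENTS_REPLAY value
# and restart again, or every restart will replay the full retention window.
```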
**Escalation:** If unresolved in 30 minutes, contact the Salesforce admin.
### 2. WHMCS API Unavailable

**Symptoms:**

- Billing pages showing "service unavailable"
- Provisioning failing with WHMCS errors
- Payment method checks failing
**Diagnosis:**

```bash
# Check WHMCS connectivity from the BFF host
# (a production call also requires the identifier/secret credentials)
curl -X POST $WHMCS_API_URL -d "action=GetClients&responsetype=json"

# Check BFF logs for WHMCS errors
grep "WHMCS" /var/log/bff/error.log | tail -20
```
**Resolution:**

- Verify the WHMCS server is accessible
- Check WHMCS API credentials (`WHMCS_API_IDENTIFIER`, `WHMCS_API_SECRET`)
- Check WHMCS server load and resource usage
- Contact the WHMCS hosting provider if the server is down
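For a repeatable check, the probe below adds the credential parameters that an authenticated WHMCS API call requires. A sketch, assuming `WHMCS_API_URL`, `WHMCS_API_IDENTIFIER`, and `WHMCS_API_SECRET` are exported and the credential is permitted to call `GetClients`:

```shell
# Sketch: authenticated WHMCS probe. Succeeds only if the API answers
# with "result":"success"; limitnum=1 keeps the response small.
whmcs_health() {
  response=$(curl -sS --max-time 10 -X POST "$WHMCS_API_URL" \
    -d "action=GetClients" \
    -d "identifier=${WHMCS_API_IDENTIFIER}" \
    -d "secret=${WHMCS_API_SECRET}" \
    -d "responsetype=json" \
    -d "limitnum=1")
  case "$response" in
    *'"result":"success"'*) echo "WHMCS OK" ;;
    *) echo "WHMCS FAIL: $response" >&2; return 1 ;;
  esac
}

# whmcs_health && echo "billing path healthy"
```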
**Escalation:** If the WHMCS server is down, contact the hosting provider.
### 3. Redis Connection Failures

**Symptoms:**

- Authentication failing
- Cache misses on every request
- Rate limiting not working
- SSE connections dropping
**Diagnosis:**

```bash
# Check Redis connectivity
redis-cli ping

# Check Redis memory usage
redis-cli INFO memory

# Check the BFF health endpoint
curl http://localhost:4000/health | jq '.checks.cache'
```
**Resolution:**

- Verify the Redis URL in the environment (`REDIS_URL`)
- Check Redis server memory usage and eviction policy
- Restart Redis if memory is exhausted
- Clear stale keys if necessary: `redis-cli FLUSHDB` (caution: clears the entire cache)
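When only a subset of keys is stale, a pattern-scoped delete is safer than `FLUSHDB` because it leaves the token blacklist and rate-limit keys intact. A minimal sketch using `redis-cli --scan` (non-blocking, unlike `KEYS`); the `cache:*` namespace in the example is an assumption about your key layout:

```shell
# Sketch: delete only keys matching a pattern instead of flushing the DB.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

redis_del_pattern() {
  pattern="$1"
  # --scan iterates with SCAN under the hood, so it will not block Redis.
  "$REDIS_CLI" --scan --pattern "$pattern" | while IFS= read -r key; do
    "$REDIS_CLI" DEL "$key" > /dev/null
  done
}

# redis_del_pattern 'cache:*'   # example namespace; review before deleting
```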
**Impact Note:** A Redis failure causes:

- Token blacklist checks to fail (a security risk if `AUTH_BLACKLIST_FAIL_CLOSED=false`)
- All cached data to be re-fetched from source systems
- Rate limiting to stop working
### 4. Database Connection Issues

**Symptoms:**

- All API requests failing with 500 errors
- Health check shows database as "fail"
- Prisma connection errors in logs
**Diagnosis:**

```bash
# Check database connectivity
psql $DATABASE_URL -c "SELECT 1"

# Check connection count
psql $DATABASE_URL -c "SELECT count(*) FROM pg_stat_activity"

# Check the BFF health endpoint
curl http://localhost:4000/health | jq '.checks.database'
```
**Resolution:**

- Verify the PostgreSQL server is running
- Check connection pool limits (Prisma `connection_limit`)
- Look for long-running queries and terminate them if necessary
- Restart the database if it is unresponsive
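The long-running-query step can be sketched with `pg_stat_activity` and `pg_terminate_backend`. The 5-minute threshold below is an assumption; tune it for your workload, and always review the list before killing anything:

```shell
# Sketch: list queries running longer than 5 minutes, then terminate a
# specific backend by pid.
long_queries() {
  psql "$DATABASE_URL" -c "
    SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
    FROM pg_stat_activity
    WHERE state <> 'idle'
      AND now() - query_start > interval '5 minutes'
    ORDER BY runtime DESC;"
}

kill_backend() {
  psql "$DATABASE_URL" -c "SELECT pg_terminate_backend($1);"
}

# long_queries        # review the offenders first
# kill_backend 12345  # then terminate one pid at a time
```

`pg_terminate_backend` ends the whole connection; `pg_cancel_backend` is the gentler option when you only want to stop the current query.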
**Escalation:** If the database is corrupted, see the Database Operations Runbook.
### 5. High Error Rate / Performance Degradation

**Symptoms:**

- Increased response times (>2s average)
- Error rate above 1%
- Customer complaints
**Diagnosis:**

```bash
# Check BFF process resource usage
top -p $(pgrep -f "node.*bff")

# Check recent error logs
tail -100 /var/log/bff/error.log

# Check external API response times in logs
grep "duration" /var/log/bff/combined.log | tail -20
```
**Resolution:**

- Identify which external API is slow (Salesforce, WHMCS, Freebit)
- Check for traffic spikes or unusual patterns
- Scale horizontally if CPU/memory constrained
- Enable circuit breakers or increase timeouts temporarily
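To put a number on the error rate during triage, a rough sketch over the most recent log lines. It assumes one JSON log line per request with a numeric `"status"` field (the path matches the diagnosis commands above; adjust if your log shape differs):

```shell
# Sketch: estimate the 5xx rate over the last 1000 BFF log lines.
LOG_FILE="${LOG_FILE:-/var/log/bff/combined.log}"   # assumed path

error_rate() {
  tail -n 1000 "$LOG_FILE" | awk '
    { total++ }
    /"status":5[0-9][0-9]/ { errors++ }   # count 5xx responses
    END {
      if (total > 0)
        printf "errors: %d / %d (%.1f%%)\n", errors, total, 100 * errors / total
    }'
}

# error_rate   # compare against the 1% threshold above
```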
### 6. Security Incident

**Symptoms:**

- Unusual login patterns
- Suspected unauthorized access
- Data exfiltration alerts
**Immediate Actions:**

- DO NOT modify logs or evidence
- Notify security team immediately
- Consider isolating affected systems
- Document all observations with timestamps
**Escalation:** P1 - immediately escalate to the engineering lead and management.
## Incident Response Workflow

```text
1. DETECT
   ├── Automated alert received
   ├── Customer report
   └── Internal discovery

2. ASSESS
   ├── Determine severity (P1-P4)
   ├── Identify affected systems
   └── Estimate customer impact

3. RESPOND
   ├── Follow relevant scenario playbook
   ├── Communicate status
   └── Escalate if needed

4. RESOLVE
   ├── Implement fix
   ├── Verify resolution
   └── Monitor for recurrence

5. REVIEW
   ├── Document timeline
   ├── Identify root cause
   └── Create action items
```
## Communication Templates

### Internal Status Update

```text
INCIDENT UPDATE - [P1/P2/P3/P4] - [Brief Description]

Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of customer impact]
Started: [Time in UTC]
Last Update: [Time in UTC]

Current Actions:
- [Action 1]
- [Action 2]

Next Update: [Time]
```
### Customer Communication (P1/P2 only)

```text
We are currently experiencing issues with [service/feature].

What's happening: [Brief, non-technical description]
Impact: [What customers may experience]
Status: Our team is actively working to resolve this issue.

We will provide updates every [30 minutes/1 hour].
We apologize for any inconvenience.
```
## Post-Incident Review

After every P1 or P2 incident, conduct a post-incident review within 3 business days.

### Review Template

1. **Incident Summary**
   - What happened?
   - When did it start/end?
   - Who was affected?
2. **Timeline**
   - Detection time
   - Response time
   - Resolution time
   - Key milestones
3. **Root Cause Analysis**
   - What was the direct cause?
   - What were contributing factors?
   - Why wasn't this prevented?
4. **Action Items**
   - Immediate fixes applied
   - Preventive measures needed
   - Monitoring improvements
   - Documentation updates
## Related Documents

*Last Updated: December 2025*