
# Incident Response Runbook

This document defines procedures for responding to production incidents affecting the Customer Portal.


## Severity Classification

| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Complete service outage or data loss | 15 minutes | Portal unreachable, database corruption, security breach |
| P2 - High | Major feature unavailable | 1 hour | Order provisioning failing, payment processing down |
| P3 - Medium | Degraded performance or partial outage | 4 hours | Slow response times, intermittent errors, single integration down |
| P4 - Low | Minor issue, workaround available | 24 hours | UI glitches, non-critical feature bugs |
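
The response-time targets above can be encoded directly for alert or ticket tooling; a minimal sketch (the function name and the minutes-as-integers encoding are illustrative choices, not part of any existing tooling):

```bash
# Map a severity level to its response-time target in minutes,
# mirroring the classification table above.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;    # Critical: 15 minutes
    P2) echo 60 ;;    # High: 1 hour
    P3) echo 240 ;;   # Medium: 4 hours
    P4) echo 1440 ;;  # Low: 24 hours
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_minutes P2` prints `60`.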

## Escalation Matrix

| Level | Scope | Contact | When to Escalate |
|---|---|---|---|
| L1 | Initial Response | On-call engineer | All incidents |
| L2 | Technical Lead | Development lead | P1/P2 not resolved in 30 minutes |
| L3 | Management | Engineering manager | P1 not resolved in 1 hour, customer impact |
| L4 | External | Vendor support | External system failure (Salesforce, WHMCS, Freebit) |

## On-Call Contacts

> **Note**: Update this section with actual contact information for your team.

| Role | Contact Method | Backup |
|---|---|---|
| Primary On-Call | [Slack/PagerDuty] | [Phone] |
| Secondary On-Call | [Slack/PagerDuty] | [Phone] |
| Engineering Lead | [Slack/Email] | [Phone] |

## Common Incident Scenarios

### 1. Salesforce Platform Events Not Receiving

**Symptoms:**

- Orders stuck in "Pending Review" status
- No provisioning activity in logs
- `sf:pe:replay:*` Redis keys not updating

**Diagnosis:**

```bash
# Check BFF logs for Platform Event subscription
grep "Platform Event" /var/log/bff/combined.log | tail -50

# Check Redis replay ID
redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e"

# Verify Salesforce connectivity
curl http://localhost:4000/health
```

**Resolution:**

1. Verify `SF_EVENTS_ENABLED=true` in the environment
2. Check Salesforce Connected App JWT authentication
3. Verify Platform Event permissions for the integration user
4. Set `SF_EVENTS_REPLAY=ALL` temporarily to replay missed events
5. Restart the BFF to re-establish the subscription

**Escalation:** If unresolved in 30 minutes, contact the Salesforce admin.
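
The third symptom (replay key not updating) can be confirmed by sampling the key twice; a sketch, assuming the `sf:pe:replay:` key convention shown above and a reachable `redis-cli` (the 30 s window is an arbitrary choice):

```bash
# Build the Redis key that stores the replay ID for a Platform Event channel.
replay_key() {
  printf 'sf:pe:replay:/event/%s' "$1"
}

# Exit 0 (stale) when two samples of the replay ID are identical.
is_stale() {
  [ "$1" = "$2" ]
}

# Usage against a live Redis:
#   key=$(replay_key OrderProvisionRequested__e)
#   a=$(redis-cli GET "$key"); sleep 30; b=$(redis-cli GET "$key")
#   if is_stale "$a" "$b"; then echo "replay ID stale - subscription likely down"; fi
```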


### 2. WHMCS API Unavailable

**Symptoms:**

- Billing pages showing "service unavailable"
- Provisioning failing with WHMCS errors
- Payment method checks failing

**Diagnosis:**

```bash
# Check WHMCS connectivity from the BFF host (the API requires credentials)
curl -X POST "$WHMCS_API_URL" \
  -d "action=GetClients" \
  -d "identifier=$WHMCS_API_IDENTIFIER" \
  -d "secret=$WHMCS_API_SECRET" \
  -d "responsetype=json"

# Check BFF logs for WHMCS errors
grep "WHMCS" /var/log/bff/error.log | tail -20
```

**Resolution:**

1. Verify the WHMCS server is accessible
2. Check the WHMCS API credentials (`WHMCS_API_IDENTIFIER`, `WHMCS_API_SECRET`)
3. Check WHMCS server load and resource usage
4. Contact the WHMCS hosting provider if the server is down

**Escalation:** If the WHMCS server is down, contact the hosting provider.
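
Before declaring WHMCS down, it is worth probing more than once; a sketch that retries the diagnosis call above (the retry count, timeout, and `whmcs_up` name are illustrative choices):

```bash
# Probe the WHMCS API a few times before treating it as unavailable.
# Uses the same env vars referenced elsewhere in this runbook.
whmcs_up() {
  local tries=3
  for _ in $(seq 1 "$tries"); do
    if curl -fsS --max-time 10 -X POST "$WHMCS_API_URL" \
         -d "action=GetClients" \
         -d "identifier=$WHMCS_API_IDENTIFIER" \
         -d "secret=$WHMCS_API_SECRET" \
         -d "responsetype=json" >/dev/null; then
      return 0
    fi
    sleep 5
  done
  return 1
}

# Usage: whmcs_up || echo "WHMCS unreachable after retries"
```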


### 3. Redis Connection Failures

**Symptoms:**

- Authentication failing
- Cache misses on every request
- Rate limiting not working
- SSE connections dropping

**Diagnosis:**

```bash
# Check Redis connectivity
redis-cli ping

# Check Redis memory usage
redis-cli INFO memory

# Check BFF health endpoint
curl http://localhost:4000/health | jq '.checks.cache'
```

**Resolution:**

1. Verify the Redis URL in the environment (`REDIS_URL`)
2. Check Redis server memory usage and eviction policy
3. Restart Redis if memory is exhausted
4. Clear stale keys if necessary: `redis-cli FLUSHDB` (caution: this clears every key in the current database, including replay IDs and the token blacklist, not just cached data)

**Impact Note:** Redis failure causes:

- Token blacklist checks to fail (a security risk if `AUTH_BLACKLIST_FAIL_CLOSED=false`)
- All cached data to be re-fetched from source systems
- Rate limiting to stop working
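
As a safer alternative to `FLUSHDB`, stale entries can be removed by key prefix so that replay IDs and the token blacklist survive; a sketch, assuming your cache keys share a common prefix such as `cache:` (an assumption about the key schema, not confirmed by this runbook):

```bash
# Delete keys matching a prefix via SCAN, one at a time.
# Slower than FLUSHDB, but leaves unrelated keys untouched.
purge_prefix() {
  redis-cli --scan --pattern "${1}*" | while IFS= read -r key; do
    redis-cli DEL "$key" >/dev/null
  done
}

# Usage: purge_prefix "cache:"
```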

### 4. Database Connection Issues

**Symptoms:**

- All API requests failing with 500 errors
- Health check shows database as "fail"
- Prisma connection errors in logs

**Diagnosis:**

```bash
# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1"

# Check connection count
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"

# Check BFF health endpoint
curl http://localhost:4000/health | jq '.checks.database'
```

**Resolution:**

1. Verify the PostgreSQL server is running
2. Check connection pool limits (Prisma `connection_limit`)
3. Look for long-running queries and terminate them if necessary
4. Restart the database if it is unresponsive

**Escalation:** If the database is corrupted, see the Database Operations Runbook.
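
For step 3 of the resolution, long runners can be found and terminated via `pg_stat_activity`; a sketch that builds the SQL for a given threshold (the 5-minute default and the helper name are illustrative; `pg_terminate_backend` and `pg_stat_activity` are standard PostgreSQL):

```bash
# Build SQL that terminates active queries running longer than N minutes,
# excluding the current session.
terminate_long_queries_sql() {
  local minutes="${1:-5}"
  printf "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '%s minutes' AND pid <> pg_backend_pid();" "$minutes"
}

# Usage: psql "$DATABASE_URL" -c "$(terminate_long_queries_sql 5)"
```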


### 5. High Error Rate / Performance Degradation

**Symptoms:**

- Increased response times (>2 s average)
- Error rate above 1%
- Customer complaints

**Diagnosis:**

```bash
# Check BFF process resource usage
top -p "$(pgrep -f 'node.*bff')"

# Check recent error logs
tail -100 /var/log/bff/error.log

# Check external API response times in logs
grep "duration" /var/log/bff/combined.log | tail -20
```

**Resolution:**

1. Identify which external API is slow (Salesforce, WHMCS, Freebit)
2. Check for traffic spikes or unusual patterns
3. Scale horizontally if CPU/memory constrained
4. Enable circuit breakers or increase timeouts temporarily
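
To pin down step 1, the `duration` entries already being grepped in the diagnosis can be aggregated; a sketch, assuming JSON-lines logs with a numeric `duration` field in milliseconds (an assumption about the log format):

```bash
# Print count, average, and max of "duration" values (ms) read from stdin.
duration_stats() {
  grep -o '"duration":[0-9]*' \
    | awk -F: '{ sum += $2; n++; if ($2 > max) max = $2 }
               END { if (n) printf "count=%d avg=%.0f max=%d\n", n, sum / n, max }'
}

# Usage: tail -1000 /var/log/bff/combined.log | duration_stats
```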

### 6. Security Incident

**Symptoms:**

- Unusual login patterns
- Suspected unauthorized access
- Data exfiltration alerts

**Immediate Actions:**

1. **Do not** modify logs or evidence
2. Notify the security team immediately
3. Consider isolating affected systems
4. Document all observations with timestamps

**Escalation:** P1 - immediately escalate to the engineering lead and management.


## Incident Response Workflow

```text
1. DETECT
   ├── Automated alert received
   ├── Customer report
   └── Internal discovery

2. ASSESS
   ├── Determine severity (P1-P4)
   ├── Identify affected systems
   └── Estimate customer impact

3. RESPOND
   ├── Follow relevant scenario playbook
   ├── Communicate status
   └── Escalate if needed

4. RESOLVE
   ├── Implement fix
   ├── Verify resolution
   └── Monitor for recurrence

5. REVIEW
   ├── Document timeline
   ├── Identify root cause
   └── Create action items
```

## Communication Templates

### Internal Status Update

```text
INCIDENT UPDATE - [P1/P2/P3/P4] - [Brief Description]

Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of customer impact]
Started: [Time in UTC]
Last Update: [Time in UTC]

Current Actions:
- [Action 1]
- [Action 2]

Next Update: [Time]
```

### Customer Communication (P1/P2 only)

```text
We are currently experiencing issues with [service/feature].

What's happening: [Brief, non-technical description]
Impact: [What customers may experience]
Status: Our team is actively working to resolve this issue.

We will provide updates every [30 minutes/1 hour].

We apologize for any inconvenience.
```

## Post-Incident Review

After every P1 or P2 incident, conduct a post-incident review within 3 business days.

### Review Template

1. **Incident Summary**
   - What happened?
   - When did it start/end?
   - Who was affected?
2. **Timeline**
   - Detection time
   - Response time
   - Resolution time
   - Key milestones
3. **Root Cause Analysis**
   - What was the direct cause?
   - What were contributing factors?
   - Why wasn't this prevented?
4. **Action Items**
   - Immediate fixes applied
   - Preventive measures needed
   - Monitoring improvements
   - Documentation updates


*Last Updated: December 2025*