
# Incident Response Runbook

This document defines procedures for responding to production incidents affecting the Customer Portal.


## Severity Classification

| Severity | Definition | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Complete service outage or data loss | 15 minutes | Portal unreachable, database corruption, security breach |
| P2 - High | Major feature unavailable | 1 hour | Order provisioning failing, payment processing down |
| P3 - Medium | Degraded performance or partial outage | 4 hours | Slow response times, intermittent errors, single integration down |
| P4 - Low | Minor issue, workaround available | 24 hours | UI glitches, non-critical feature bugs |
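
The response-time targets above can be encoded directly for alert or ticket tooling; a minimal sketch (the function name and the minutes-as-integers encoding are illustrative choices, not part of any existing tooling):

```bash
# Map a severity level to its response-time target in minutes,
# mirroring the classification table above.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;    # Critical: 15 minutes
    P2) echo 60 ;;    # High: 1 hour
    P3) echo 240 ;;   # Medium: 4 hours
    P4) echo 1440 ;;  # Low: 24 hours
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_minutes P2` prints `60`.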

## Escalation Matrix

| Level | Scope | Contact | When to Escalate |
|---|---|---|---|
| L1 | Initial Response | On-call engineer | All incidents |
| L2 | Technical Lead | Development lead | P1/P2 not resolved in 30 minutes |
| L3 | Management | Engineering manager | P1 not resolved in 1 hour, customer impact |
| L4 | External | Vendor support | External system failure (Salesforce, WHMCS, Freebit) |

## On-Call Contacts

> **Note**: Update this section with actual contact information for your team.

| Role | Contact Method | Backup |
|---|---|---|
| Primary On-Call | [Slack/PagerDuty] | [Phone] |
| Secondary On-Call | [Slack/PagerDuty] | [Phone] |
| Engineering Lead | [Slack/Email] | [Phone] |

## Common Incident Scenarios

### 1. Salesforce Platform Events Not Receiving

**Symptoms:**

- Orders stuck in "Pending Review" status
- No provisioning activity in logs
- `sf:pe:replay:*` Redis keys not updating

**Diagnosis:**

```bash
# Check BFF logs for Platform Event subscription
grep "Platform Event" /var/log/bff/combined.log | tail -50

# Check Redis replay ID
redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e"

# Verify Salesforce connectivity
curl http://localhost:4000/health
```

**Resolution:**

1. Verify `SF_EVENTS_ENABLED=true` in the environment
2. Check Salesforce Connected App JWT authentication
3. Verify Platform Event permissions for the integration user
4. Set `SF_EVENTS_REPLAY=ALL` temporarily to replay missed events
5. Restart the BFF to re-establish the subscription

**Escalation:** If unresolved in 30 minutes, contact the Salesforce admin.
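
The third symptom (replay key not updating) can be confirmed by sampling the key twice; a sketch, assuming the `sf:pe:replay:` key convention shown above and a reachable `redis-cli` (the 30 s window is an arbitrary choice):

```bash
# Build the Redis key that stores the replay ID for a Platform Event channel.
replay_key() {
  printf 'sf:pe:replay:/event/%s' "$1"
}

# Exit 0 (stale) when two samples of the replay ID are identical.
is_stale() {
  [ "$1" = "$2" ]
}

# Usage against a live Redis:
#   key=$(replay_key OrderProvisionRequested__e)
#   a=$(redis-cli GET "$key"); sleep 30; b=$(redis-cli GET "$key")
#   if is_stale "$a" "$b"; then echo "replay ID stale - subscription likely down"; fi
```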


### 2. WHMCS API Unavailable

**Symptoms:**

- Billing pages showing "service unavailable"
- Provisioning failing with WHMCS errors
- Payment method checks failing

**Diagnosis:**

```bash
# Check WHMCS connectivity from the BFF host (the API requires credentials)
curl -X POST "$WHMCS_API_URL" \
  -d "action=GetClients" \
  -d "identifier=$WHMCS_API_IDENTIFIER" \
  -d "secret=$WHMCS_API_SECRET" \
  -d "responsetype=json"

# Check BFF logs for WHMCS errors
grep "WHMCS" /var/log/bff/error.log | tail -20
```

**Resolution:**

1. Verify the WHMCS server is accessible
2. Check the WHMCS API credentials (`WHMCS_API_IDENTIFIER`, `WHMCS_API_SECRET`)
3. Check WHMCS server load and resource usage
4. Contact the WHMCS hosting provider if the server is down

**Escalation:** If the WHMCS server is down, contact the hosting provider.
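
Before declaring WHMCS down, it is worth probing more than once; a sketch that retries the diagnosis call above (the retry count, timeout, and `whmcs_up` name are illustrative choices):

```bash
# Probe the WHMCS API a few times before treating it as unavailable.
# Uses the same env vars referenced elsewhere in this runbook.
whmcs_up() {
  local tries=3
  for _ in $(seq 1 "$tries"); do
    if curl -fsS --max-time 10 -X POST "$WHMCS_API_URL" \
         -d "action=GetClients" \
         -d "identifier=$WHMCS_API_IDENTIFIER" \
         -d "secret=$WHMCS_API_SECRET" \
         -d "responsetype=json" >/dev/null; then
      return 0
    fi
    sleep 5
  done
  return 1
}

# Usage: whmcs_up || echo "WHMCS unreachable after retries"
```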


### 3. Redis Connection Failures

**Symptoms:**

- Authentication failing
- Cache misses on every request
- Rate limiting not working
- SSE connections dropping

**Diagnosis:**

```bash
# Check Redis connectivity
redis-cli ping

# Check Redis memory usage
redis-cli INFO memory

# Check BFF health endpoint
curl http://localhost:4000/health | jq '.checks.cache'
```

**Resolution:**

1. Verify the Redis URL in the environment (`REDIS_URL`)
2. Check Redis server memory usage and eviction policy
3. Restart Redis if memory is exhausted
4. Clear stale keys if necessary: `redis-cli FLUSHDB` (caution: this clears every key in the current database, including replay IDs and the token blacklist, not just cached data)

**Impact Note:** Redis failure causes:

- Token blacklist checks to fail (a security risk if `AUTH_BLACKLIST_FAIL_CLOSED=false`)
- All cached data to be re-fetched from source systems
- Rate limiting to stop working
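
As a safer alternative to `FLUSHDB`, stale entries can be removed by key prefix so that replay IDs and the token blacklist survive; a sketch, assuming your cache keys share a common prefix such as `cache:` (an assumption about the key schema, not confirmed by this runbook):

```bash
# Delete keys matching a prefix via SCAN, one at a time.
# Slower than FLUSHDB, but leaves unrelated keys untouched.
purge_prefix() {
  redis-cli --scan --pattern "${1}*" | while IFS= read -r key; do
    redis-cli DEL "$key" >/dev/null
  done
}

# Usage: purge_prefix "cache:"
```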

### 4. Database Connection Issues

**Symptoms:**

- All API requests failing with 500 errors
- Health check shows database as "fail"
- Prisma connection errors in logs

**Diagnosis:**

```bash
# Check database connectivity
psql "$DATABASE_URL" -c "SELECT 1"

# Check connection count
psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity"

# Check BFF health endpoint
curl http://localhost:4000/health | jq '.checks.database'
```

**Resolution:**

1. Verify the PostgreSQL server is running
2. Check connection pool limits (Prisma `connection_limit`)
3. Look for long-running queries and terminate them if necessary
4. Restart the database if it is unresponsive

**Escalation:** If the database is corrupted, see the Database Operations Runbook.
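
For step 3 of the resolution, long runners can be found and terminated via `pg_stat_activity`; a sketch that builds the SQL for a given threshold (the 5-minute default and the helper name are illustrative; `pg_terminate_backend` and `pg_stat_activity` are standard PostgreSQL):

```bash
# Build SQL that terminates active queries running longer than N minutes,
# excluding the current session.
terminate_long_queries_sql() {
  local minutes="${1:-5}"
  printf "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '%s minutes' AND pid <> pg_backend_pid();" "$minutes"
}

# Usage: psql "$DATABASE_URL" -c "$(terminate_long_queries_sql 5)"
```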


### 5. High Error Rate / Performance Degradation

**Symptoms:**

- Increased response times (>2 s average)
- Error rate above 1%
- Customer complaints

**Diagnosis:**

```bash
# Check BFF process resource usage
top -p "$(pgrep -f 'node.*bff')"

# Check recent error logs
tail -100 /var/log/bff/error.log

# Check external API response times in logs
grep "duration" /var/log/bff/combined.log | tail -20
```

**Resolution:**

1. Identify which external API is slow (Salesforce, WHMCS, Freebit)
2. Check for traffic spikes or unusual patterns
3. Scale horizontally if CPU/memory constrained
4. Enable circuit breakers or increase timeouts temporarily
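
To pin down step 1, the `duration` entries already being grepped in the diagnosis can be aggregated; a sketch, assuming JSON-lines logs with a numeric `duration` field in milliseconds (an assumption about the log format):

```bash
# Print count, average, and max of "duration" values (ms) read from stdin.
duration_stats() {
  grep -o '"duration":[0-9]*' \
    | awk -F: '{ sum += $2; n++; if ($2 > max) max = $2 }
               END { if (n) printf "count=%d avg=%.0f max=%d\n", n, sum / n, max }'
}

# Usage: tail -1000 /var/log/bff/combined.log | duration_stats
```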

### 6. Security Incident

**Symptoms:**

- Unusual login patterns
- Suspected unauthorized access
- Data exfiltration alerts

**Immediate Actions:**

1. **Do not** modify logs or evidence
2. Notify the security team immediately
3. Consider isolating affected systems
4. Document all observations with timestamps

**Escalation:** P1 - immediately escalate to the engineering lead and management.


## Incident Response Workflow

```text
1. DETECT
   ├── Automated alert received
   ├── Customer report
   └── Internal discovery

2. ASSESS
   ├── Determine severity (P1-P4)
   ├── Identify affected systems
   └── Estimate customer impact

3. RESPOND
   ├── Follow relevant scenario playbook
   ├── Communicate status
   └── Escalate if needed

4. RESOLVE
   ├── Implement fix
   ├── Verify resolution
   └── Monitor for recurrence

5. REVIEW
   ├── Document timeline
   ├── Identify root cause
   └── Create action items
```

## Communication Templates

### Internal Status Update

```text
INCIDENT UPDATE - [P1/P2/P3/P4] - [Brief Description]

Status: [Investigating/Identified/Monitoring/Resolved]
Impact: [Description of customer impact]
Started: [Time in UTC]
Last Update: [Time in UTC]

Current Actions:
- [Action 1]
- [Action 2]

Next Update: [Time]
```

### Customer Communication (P1/P2 only)

```text
We are currently experiencing issues with [service/feature].

What's happening: [Brief, non-technical description]
Impact: [What customers may experience]
Status: Our team is actively working to resolve this issue.

We will provide updates every [30 minutes/1 hour].

We apologize for any inconvenience.
```

## Post-Incident Review

After every P1 or P2 incident, conduct a post-incident review within 3 business days.

### Review Template

1. **Incident Summary**
   - What happened?
   - When did it start/end?
   - Who was affected?
2. **Timeline**
   - Detection time
   - Response time
   - Resolution time
   - Key milestones
3. **Root Cause Analysis**
   - What was the direct cause?
   - What were contributing factors?
   - Why wasn't this prevented?
4. **Action Items**
   - Immediate fixes applied
   - Preventive measures needed
   - Monitoring improvements
   - Documentation updates


*Last Updated: December 2025*