barsa 72d0b66be7 Enhance Documentation Structure and Update Operational Runbooks

- Added a new section for operational runbooks in README.md, detailing procedures for incident response, database operations, and queue management.
- Updated the documentation structure in STRUCTURE.md to reflect the new organization of guides and resources.
- Removed the deprecated disabled-modules.md file to streamline documentation.
- Enhanced the _archive/README.md with historical notes on documentation alignment and corrections made in December 2025.
- Updated various references in the documentation to reflect the new paths and services in the integrations directory.

2025-12-23 15:55:58 +09:00


Queue Management Runbook

This document covers monitoring and management of BullMQ job queues used by the Customer Portal BFF.

Overview

The BFF uses BullMQ (backed by Redis) for asynchronous job processing:

Queue	Purpose	Processor Location
`order-provisioning`	Order fulfillment after CS approval	`apps/bff/src/modules/orders/queue/`
`sim-management`	Delayed SIM operations (network type changes)	`apps/bff/src/modules/subscriptions/sim-management/`

Queue Configuration

Environment Variables

Variable	Description	Default
`REDIS_URL`	Redis connection for queues	Required
`QUEUE_DEFAULT_ATTEMPTS`	Default retry attempts	3
`QUEUE_BACKOFF_DELAY`	Base backoff delay before the first retry (ms); doubles on each later retry	5000

Queue Options

// Default queue configuration
{
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 5000,
    },
    removeOnComplete: 100,  // Keep last 100 completed jobs
    removeOnFail: 500,      // Keep last 500 failed jobs
  }
}
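As a rough illustration of how the environment variables could feed the default job options, here is a minimal sketch. The helper name `buildDefaultJobOptions` and the `RetryJobOptions` type are hypothetical and not taken from the BFF source; only the variable names and default values come from this runbook.

```typescript
// Hypothetical shape mirroring the default job options shown in this runbook.
interface RetryJobOptions {
  attempts: number;
  backoff: { type: "exponential"; delay: number };
  removeOnComplete: number;
  removeOnFail: number;
}

// Parse a positive integer, falling back to the documented default when the
// variable is unset or not a usable number.
function parsePositiveInt(value: string | undefined, fallback: number): number {
  const parsed = parseInt(value || "", 10);
  return isNaN(parsed) || parsed <= 0 ? fallback : parsed;
}

// Assumption: the BFF reads these two variables roughly like this.
function buildDefaultJobOptions(env: { [name: string]: string | undefined }): RetryJobOptions {
  return {
    attempts: parsePositiveInt(env["QUEUE_DEFAULT_ATTEMPTS"], 3),
    backoff: {
      type: "exponential",
      delay: parsePositiveInt(env["QUEUE_BACKOFF_DELAY"], 5000),
    },
    removeOnComplete: 100, // keep last 100 completed jobs
    removeOnFail: 500, // keep last 500 failed jobs
  };
}
```

Falling back to the defaults on a malformed value is one possible design choice; failing fast at startup would be equally reasonable.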

Monitoring

Check Queue Status

# Connect to Redis and list queue keys (SCAN iterates incrementally; KEYS can block a busy Redis)
redis-cli --scan --pattern "bull:*"

# Check specific queue length
redis-cli LLEN "bull:order-provisioning:wait"
redis-cli LLEN "bull:order-provisioning:active"
redis-cli ZCARD "bull:order-provisioning:delayed"
redis-cli ZCARD "bull:order-provisioning:failed"

Queue Key Structure

Key Pattern	Description
`bull:{queue}:wait`	Jobs waiting to be processed
`bull:{queue}:active`	Jobs currently being processed
`bull:{queue}:delayed`	Jobs scheduled for future execution
`bull:{queue}:completed`	Recently completed jobs
`bull:{queue}:failed`	Failed jobs

Health Metrics

Metric	Warning	Critical	Action
Wait queue depth	>10	>50	Check processor status
Failed job count	>5	>20	Investigate failures
Processing time	>30s	>60s	Check external dependencies
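
The thresholds in the table can be expressed as a small classifier for use in a monitoring script. This is only a sketch: the names `HEALTH_THRESHOLDS` and `classifyMetric` are made up for illustration, and it treats the table values as strict "greater than" limits.

```typescript
type HealthLevel = "ok" | "warning" | "critical";

interface MetricThreshold {
  warning: number;
  critical: number;
}

// Values copied from the Health Metrics table in this runbook.
const HEALTH_THRESHOLDS = {
  waitDepth: { warning: 10, critical: 50 },
  failedCount: { warning: 5, critical: 20 },
  processingSeconds: { warning: 30, critical: 60 },
};

// Check the critical limit first so a value above both limits reports critical.
function classifyMetric(value: number, threshold: MetricThreshold): HealthLevel {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}
```

For example, `classifyMetric(12, HEALTH_THRESHOLDS.waitDepth)` falls in the warning band, which maps to the "Check processor status" action.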

Order Provisioning Queue

Purpose

Processes orders after CS approval via Salesforce Platform Events.

Flow

Salesforce Platform Event (OrderProvisionRequested__e)
    ↓
Event Subscriber receives event
    ↓
Job enqueued to 'order-provisioning' queue
    ↓
Processor executes fulfillment workflow
    ↓
Order created in WHMCS + Salesforce updated

Job Data Structure

{
  sfOrderId: "8014x000000ABCDXYZ",  // Salesforce Order ID
  idempotencyKey: "8014x...-1703123456789",
  eventPayload: { ... }  // Original Platform Event data
}

Common Failure Reasons

Error	Cause	Resolution
`PAYMENT_METHOD_MISSING`	Customer has no payment method	Customer must add payment method in WHMCS
`ORDER_NOT_FOUND`	Salesforce Order doesn't exist	Check Order ID, verify not deleted
`MAPPING_ERROR`	Product mapping missing	Add `WH_Product_ID__c` to Product2 in Salesforce
`WHMCS_ERROR`	WHMCS API failure	Check WHMCS connectivity and logs

Retry Behavior

Attempts: 3 total (1 initial + 2 retries)
Backoff: Exponential (5s before the first retry, 10s before the second)
On Final Failure: Salesforce Order updated with error details
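
To make the retry timing concrete, the sketch below computes the wait before each retry, assuming BullMQ's built-in exponential strategy (base delay multiplied by 2 for each retry already made). The function name `retryDelaysMs` is illustrative only. With three attempts there are two retries, so the result has two entries.

```typescript
// Delay before retry n (1-based) is baseDelayMs * 2^(n - 1).
// A job with `attempts` total attempts gets `attempts - 1` retries.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry - 1));
  }
  return delays;
}

// Documented defaults: 3 attempts with a 5000 ms base delay.
const defaultRetryDelays = retryDelaysMs(3, 5000); // [5000, 10000]
```

In the worst case a job therefore spends about 15 seconds waiting between attempts before it reaches final failure, not counting processing time.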

SIM Management Queue

Purpose

Handles delayed SIM operations, particularly network type changes that require a 30-minute gap.

Job Types

Job Type	Delay	Description
`networkTypeChange`	30 minutes	Change between 4G/5G networks

Job Data Structure

{
  subscriptionId: 29951,
  simAccount: "08077052946",
  operation: "networkTypeChange",
  params: {
    networkType: "5G"
  },
  scheduledAt: "2024-01-15T10:30:00Z"
}
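
The payload above can be described with a type, and the queue delay derived from it. This sketch assumes that `scheduledAt` is the time the operation should run (the runbook does not say whether it is the run time or the enqueue time). `SimManagementJob` and `delayUntilScheduled` are hypothetical names.

```typescript
// Shape inferred from the example payload; field meanings are assumptions.
interface SimManagementJob {
  subscriptionId: number;
  simAccount: string;
  operation: "networkTypeChange";
  params: { networkType: "4G" | "5G" };
  scheduledAt: string; // ISO 8601 timestamp
}

// Milliseconds to wait before the job should run. Never negative, so a job
// whose time has already passed would run immediately.
function delayUntilScheduled(job: SimManagementJob, nowMs: number): number {
  const scheduledMs = Date.parse(job.scheduledAt);
  if (isNaN(scheduledMs)) {
    throw new Error("Invalid scheduledAt: " + job.scheduledAt);
  }
  return Math.max(0, scheduledMs - nowMs);
}
```

A value like this could be passed as the `delay` job option when enqueueing, which is how BullMQ places a job in the delayed set.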

Common Failure Reasons

Error	Cause	Resolution
`FREEBIT_AUTH_FAILED`	Freebit authentication error	Check OEM credentials
`ACCOUNT_NOT_FOUND`	SIM account not found in Freebit	Verify account identifier
`OPERATION_CONFLICT`	Another operation pending	Wait for previous operation to complete

Failed Job Investigation

View Failed Jobs

# List failed jobs (using Redis CLI)
redis-cli ZRANGE "bull:order-provisioning:failed" 0 -1

# Get job details (replace <job-id> with an ID returned above)
redis-cli HGETALL "bull:order-provisioning:<job-id>"

Common Investigation Steps

Check job data: Identify the order/subscription involved
Check error message: Look for specific failure reason
Check external system: Verify Salesforce/WHMCS/Freebit status
Check logs: Search BFF logs for job ID or order ID
Determine if retryable: Some errors are permanent (missing mapping), others are transient (network timeout)
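
The last step can be sketched as a lookup from error code to a retry decision. The grouping below is an assumption made for illustration: the runbook only states that a missing mapping is permanent and that network timeouts are transient. The other codes are placed according to the resolutions listed in the failure tables, and `classifyFailure` is a hypothetical helper.

```typescript
type RetryDecision = "needs_fix" | "retryable" | "unknown";

// Codes whose documented resolution requires someone to change data or
// credentials before a retry can succeed (assumed grouping).
const NEEDS_FIX_ERRORS = [
  "PAYMENT_METHOD_MISSING",
  "ORDER_NOT_FOUND",
  "MAPPING_ERROR",
  "FREEBIT_AUTH_FAILED",
  "ACCOUNT_NOT_FOUND",
];

// Codes that may clear on their own once the external system recovers or a
// pending operation completes (assumed grouping).
const RETRYABLE_ERRORS = ["WHMCS_ERROR", "OPERATION_CONFLICT"];

function classifyFailure(errorCode: string): RetryDecision {
  if (NEEDS_FIX_ERRORS.indexOf(errorCode) !== -1) return "needs_fix";
  if (RETRYABLE_ERRORS.indexOf(errorCode) !== -1) return "retryable";
  return "unknown"; // investigate manually before retrying
}
```

Returning `unknown` for unlisted codes keeps the default conservative: anything not explicitly classified is investigated by hand.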

Log Search

# Search logs for specific order
grep "8014x000000ABCDXYZ" /var/log/bff/combined.log

# Search for queue processing errors
grep "provisioning" /var/log/bff/error.log | tail -50

Manual Retry Procedures

Retry a Single Failed Job

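// Using the BullMQ API in Node.js
import { Queue } from "bullmq";
import IORedis from "ioredis";

// Connect to the same Redis instance the BFF uses (REDIS_URL)
const redisConnection = new IORedis(process.env.REDIS_URL);
const queue = new Queue("order-provisioning", { connection: redisConnection });

const job = await queue.getJob("job-id"); // replace with the real job ID
if (!job) throw new Error("Job not found");
await job.retry(); // moves the job from failed back to waiting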

Retry All Failed Jobs

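# Move all failed jobs back to waiting.
# Caution: this edits BullMQ's Redis keys directly and skips its own bookkeeping
# (attempt counters, failure details, events). Prefer the BullMQ API where possible.
redis-cli ZRANGEBYSCORE "bull:order-provisioning:failed" -inf +inf | while read -r jobId; do
  redis-cli LPUSH "bull:order-provisioning:wait" "$jobId"
  redis-cli ZREM "bull:order-provisioning:failed" "$jobId"
done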

Warning: Only retry jobs after fixing the root cause. Retrying without fixing will cause the same failure.

Retry via Salesforce (Recommended for Provisioning)

For order provisioning, the recommended retry method is through Salesforce:

Open the Order in Salesforce
Clear error fields (Activation_Error__c, Activation_Error_DateTime__c)
Set Activation_Status__c back to "Activating"
The Record-Triggered Flow will publish a new Platform Event

This approach ensures proper idempotency tracking and audit trail.

Clearing Stuck Jobs

Clear All Jobs from a Queue

Warning: This removes all jobs including pending work. Use only in emergencies.

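# Clear all queue state
# Note: this deletes only the state lists and sets; the per-job hashes
# (bull:order-provisioning:<job-id>) remain in Redis until removed separately.
redis-cli DEL \
  "bull:order-provisioning:wait" \
  "bull:order-provisioning:active" \
  "bull:order-provisioning:delayed" \
  "bull:order-provisioning:completed" \
  "bull:order-provisioning:failed"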

Clear Old Completed/Failed Jobs

# Remove jobs older than 7 days from completed (date -d requires GNU date;
# scores are finish times in milliseconds, hence "000" appended to epoch seconds)
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:completed" -inf "$(date -d '7 days ago' +%s000)"

# Remove jobs older than 30 days from failed
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:failed" -inf "$(date -d '30 days ago' +%s000)"
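
The cutoff passed to `ZREMRANGEBYSCORE` is a Unix timestamp in milliseconds. For environments without GNU `date`, the same value can be computed in code. This is a minimal sketch with hypothetical helper names, not part of the BFF.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Millisecond timestamp for "daysAgo days before nowMs".
function cutoffTimestampMs(daysAgo: number, nowMs: number): number {
  return nowMs - daysAgo * MS_PER_DAY;
}

// True when a job that finished at finishedOnMs falls at or before the cutoff
// and would therefore be removed (the -inf..cutoff range is inclusive).
function isOlderThan(finishedOnMs: number, daysAgo: number, nowMs: number): boolean {
  return finishedOnMs <= cutoffTimestampMs(daysAgo, nowMs);
}
```

Passing `Date.now()` as `nowMs` gives the live cutoff; keeping it a parameter makes the calculation easy to check with fixed values.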

Queue Backlog Handling

Symptoms of Backlog

Wait queue depth increasing
Jobs not being processed
Customer orders stuck in "Activating" status

Diagnosis

Check processor is running

grep "BullMQ" /var/log/bff/combined.log | tail -20

Check Redis connectivity
```
redis-cli PING
```

Check for blocked jobs

redis-cli LLEN "bull:order-provisioning:active"
# If active > 0 for extended time, jobs may be stuck

Check external dependencies
- Salesforce API
- WHMCS API

Resolution

Restart BFF to reconnect queue workers
Clear stuck active jobs if processor crashed mid-job
Scale horizontally if queue depth is due to high volume
Fix root cause if jobs are failing repeatedly

Alerting Configuration

Recommended Alerts

Alert	Condition	Severity
Queue Backlog	Wait queue > 10 for > 5 minutes	Warning
Queue Backlog Critical	Wait queue > 50	Critical
Failed Jobs Spike	> 5 failures in 15 minutes	Warning
Processor Down	No job processed in 10 minutes with jobs waiting	Critical
Job Timeout	Job active for > 5 minutes	Warning

Monitoring Queries

# Check queue depths (for monitoring script)
WAIT=$(redis-cli LLEN "bull:order-provisioning:wait")
ACTIVE=$(redis-cli LLEN "bull:order-provisioning:active")
FAILED=$(redis-cli ZCARD "bull:order-provisioning:failed")

echo "Wait: $WAIT, Active: $ACTIVE, Failed: $FAILED"
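
The "Wait queue > 10 for > 5 minutes" alert needs more than a single reading. One way to evaluate it, sketched here under the assumption that the monitoring script records periodic samples, is to check that the most recent run of readings above the threshold spans longer than the window. `DepthSample` and `backlogSustained` are illustrative names.

```typescript
interface DepthSample {
  timestampMs: number;
  waitDepth: number;
}

// Samples are assumed to be sorted oldest first. Walk back from the newest
// sample while the depth stays above the threshold, then compare the length
// of that run with the required window.
function backlogSustained(samples: DepthSample[], threshold: number, windowMs: number): boolean {
  let runStartMs: number | null = null;
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].waitDepth <= threshold) break;
    runStartMs = samples[i].timestampMs;
  }
  if (runStartMs === null) return false;
  const newestMs = samples[samples.length - 1].timestampMs;
  return newestMs - runStartMs > windowMs;
}
```

A single dip to or below the threshold resets the run, which avoids alerting on a backlog that is already draining.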

Best Practices

Job Design

Include sufficient context in job data for debugging
Use idempotency keys to prevent duplicate processing
Keep job payloads small (< 10KB)
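
As a simplified illustration of the idempotency point, the sketch below rejects a key it has already seen. It is in-memory only and the class name `IdempotencyGuard` is hypothetical; a real deployment would need shared storage (for example Redis or the database) so that every BFF instance sees the same keys.

```typescript
// In-memory guard for illustration only; state is lost on restart and is not
// shared between processes.
class IdempotencyGuard {
  // Object.create(null) avoids collisions with inherited property names.
  private seen: { [key: string]: true } = Object.create(null);

  // Returns true the first time a key is offered and false for any repeat.
  tryAcquire(key: string): boolean {
    if (this.seen[key]) return false;
    this.seen[key] = true;
    return true;
  }
}

// Runs the handler only for keys that have not been processed before.
function processOnce(guard: IdempotencyGuard, idempotencyKey: string, handler: () => void): boolean {
  if (!guard.tryAcquire(idempotencyKey)) return false;
  handler();
  return true;
}
```

A processor could call `processOnce` with the job's `idempotencyKey` so that a redelivered event does not trigger a second fulfillment.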

Error Handling

Distinguish between retryable and non-retryable errors
Log sufficient context before throwing
Update external systems with error status on final failure

Monitoring

Set up alerts for queue depth and failure rate
Monitor job processing duration
Track success/failure ratios over time

Last Updated: December 2025
