Queue Management Runbook

This document covers monitoring and management of BullMQ job queues used by the Customer Portal BFF.


Overview

The BFF uses BullMQ (backed by Redis) for asynchronous job processing:

| Queue | Purpose | Processor Location |
| --- | --- | --- |
| order-provisioning | Order fulfillment after CS approval | apps/bff/src/modules/orders/queue/ |
| sim-management | Delayed SIM operations (network type changes) | apps/bff/src/modules/subscriptions/sim-management/ |

Queue Configuration

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| REDIS_URL | Redis connection for queues | Required |
| QUEUE_DEFAULT_ATTEMPTS | Default number of attempts per job (initial run plus retries) | 3 |
| QUEUE_BACKOFF_DELAY | Base backoff delay between retries (ms) | 5000 |

Queue Options

// Default queue configuration
{
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 5000,
    },
    removeOnComplete: 100,  // Keep last 100 completed jobs
    removeOnFail: 500,      // Keep last 500 failed jobs
  }
}
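
How the environment variables feed these defaults is not shown in this runbook. The following is a minimal sketch, assuming both variables are parsed as positive integers and fall back to the documented defaults when unset or invalid. The names `buildDefaultJobOptions` and `positiveInt` are hypothetical, not taken from the BFF code.

```typescript
// Hypothetical helper: the BFF's real config wiring is not shown in this runbook.
type Env = Record<string, string | undefined>;

interface DefaultJobOptions {
  attempts: number;
  backoff: { type: "exponential"; delay: number };
  removeOnComplete: number;
  removeOnFail: number;
}

// Parse a positive integer, falling back when the variable is unset or invalid.
function positiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function buildDefaultJobOptions(env: Env): DefaultJobOptions {
  return {
    attempts: positiveInt(env.QUEUE_DEFAULT_ATTEMPTS, 3),
    backoff: {
      type: "exponential",
      delay: positiveInt(env.QUEUE_BACKOFF_DELAY, 5000),
    },
    removeOnComplete: 100, // keep last 100 completed jobs
    removeOnFail: 500, // keep last 500 failed jobs
  };
}
```

Called with an empty environment, the sketch yields the same values as the default configuration shown above.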

Monitoring

Check Queue Status

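# Connect to Redis and list queue keys
# (--scan iterates with SCAN, so it does not block Redis the way KEYS can on a large keyspace)
redis-cli --scan --pattern "bull:*"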

# Check specific queue length
redis-cli LLEN "bull:order-provisioning:wait"
redis-cli LLEN "bull:order-provisioning:active"
redis-cli ZCARD "bull:order-provisioning:delayed"
redis-cli ZCARD "bull:order-provisioning:failed"

Queue Key Structure

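| Key Pattern | Redis Type | Description |
| --- | --- | --- |
| bull:{queue}:wait | List (use LLEN) | Jobs waiting to be processed |
| bull:{queue}:active | List (use LLEN) | Jobs currently being processed |
| bull:{queue}:delayed | Sorted set (use ZCARD) | Jobs scheduled for future execution |
| bull:{queue}:completed | Sorted set (use ZCARD) | Recently completed jobs |
| bull:{queue}:failed | Sorted set (use ZCARD) | Failed jobs |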

Health Metrics

| Metric | Warning | Critical | Action |
| --- | --- | --- | --- |
| Wait queue depth | >10 | >50 | Check processor status |
| Failed job count | >5 | >20 | Investigate failures |
| Processing time | >30s | >60s | Check external dependencies |
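
The thresholds above can be applied mechanically in a monitoring script. The following is a minimal sketch, not an existing BFF utility. The metric names and the `classify` function are hypothetical; only the numeric thresholds come from the table.

```typescript
type Severity = "ok" | "warning" | "critical";
type MetricName = "waitDepth" | "failedCount" | "processingSeconds";

interface Threshold {
  warning: number;
  critical: number;
}

// Thresholds copied from the Health Metrics table.
const THRESHOLDS: Record<MetricName, Threshold> = {
  waitDepth: { warning: 10, critical: 50 },
  failedCount: { warning: 5, critical: 20 },
  processingSeconds: { warning: 30, critical: 60 },
};

// Both limits are strict ("greater than"), as written in the table.
function classify(metric: MetricName, value: number): Severity {
  const { warning, critical } = THRESHOLDS[metric];
  if (value > critical) return "critical";
  if (value > warning) return "warning";
  return "ok";
}
```

For example, a wait queue depth of 10 is still "ok", 11 is "warning", and 51 is "critical".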

Order Provisioning Queue

Purpose

Processes orders after CS approval via Salesforce Platform Events.

Flow

Salesforce Platform Event (OrderProvisionRequested__e)
    ↓
Event Subscriber receives event
    ↓
Job enqueued to 'order-provisioning' queue
    ↓
Processor executes fulfillment workflow
    ↓
Order created in WHMCS + Salesforce updated

Job Data Structure

{
  sfOrderId: "8014x000000ABCDXYZ",  // Salesforce Order ID
  idempotencyKey: "8014x...-1703123456789",
  eventPayload: { ... }  // Original Platform Event data
}

Common Failure Reasons

| Error | Cause | Resolution |
| --- | --- | --- |
| PAYMENT_METHOD_MISSING | Customer has no payment method | Customer must add payment method in WHMCS |
| ORDER_NOT_FOUND | Salesforce Order doesn't exist | Check Order ID, verify not deleted |
| MAPPING_ERROR | Product mapping missing | Add WH_Product_ID__c to Product2 in Salesforce |
| WHMCS_ERROR | WHMCS API failure | Check WHMCS connectivity and logs |

Retry Behavior

  • Attempts: 3 total (1 initial + 2 retries)
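  • Backoff: Exponential (5s before the first retry, 10s before the second; a 20s delay would apply only to a third retry, which the default of 3 attempts never reaches)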
  • On Final Failure: Salesforce Order updated with error details
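
The retry timing can be worked out from the configuration. The sketch below assumes BullMQ's built-in exponential strategy with no jitter and no custom backoff function, where the wait before retry n is the base delay times 2^(n-1). The function names are illustrative.

```typescript
// Delay before retry number `retry` (1-based): baseDelay * 2^(retry - 1).
function exponentialBackoffMs(retry: number, baseDelayMs: number): number {
  return Math.round(baseDelayMs * Math.pow(2, retry - 1));
}

// All waits for a job that may run `attempts` times in total.
// A job with N attempts has at most N - 1 retries, hence N - 1 waits.
function retrySchedule(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(exponentialBackoffMs(retry, baseDelayMs));
  }
  return delays;
}
```

With the defaults, `retrySchedule(3, 5000)` returns `[5000, 10000]`: two waits for two retries. A 20000 ms wait would occur only if `attempts` were raised to 4.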

SIM Management Queue

Purpose

Handles delayed SIM operations, particularly network type changes that require a 30-minute gap.

Job Types

| Job Type | Delay | Description |
| --- | --- | --- |
| networkTypeChange | 30 minutes | Change between 4G/5G networks |

Job Data Structure

{
  subscriptionId: 29951,
  simAccount: "08077052946",
  operation: "networkTypeChange",
  params: {
    networkType: "5G"
  },
  scheduledAt: "2024-01-15T10:30:00Z"
}
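
The 30-minute gap can be expressed as a BullMQ `delay` value. The following is a minimal sketch, assuming the gap is measured from the time of the previous SIM operation; the runbook does not state the reference point, so treat that as an assumption. `remainingGapMs` and its parameters are hypothetical names.

```typescript
const NETWORK_TYPE_CHANGE_GAP_MS = 30 * 60 * 1000; // 30 minutes, per the Job Types table

// Milliseconds still to wait before the operation may run.
// Returns 0 when the gap has already elapsed.
function remainingGapMs(
  previousOperationAt: Date,
  now: Date,
  gapMs: number = NETWORK_TYPE_CHANGE_GAP_MS,
): number {
  const elapsedMs = now.getTime() - previousOperationAt.getTime();
  return Math.max(0, gapMs - elapsedMs);
}
```

The result would typically be passed as the `delay` job option when the job is added, for example `queue.add("networkTypeChange", data, { delay })`.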

Common Failure Reasons

| Error | Cause | Resolution |
| --- | --- | --- |
| FREEBIT_AUTH_FAILED | Freebit authentication error | Check OEM credentials |
| ACCOUNT_NOT_FOUND | SIM account not found in Freebit | Verify account identifier |
| OPERATION_CONFLICT | Another operation pending | Wait for previous operation to complete |

Failed Job Investigation

View Failed Jobs

# List failed jobs (using Redis CLI)
redis-cli ZRANGE "bull:order-provisioning:failed" 0 -1

# Get job details
redis-cli HGETALL "bull:order-provisioning:{job-id}"

Common Investigation Steps

  1. Check job data: Identify the order/subscription involved
  2. Check error message: Look for specific failure reason
  3. Check external system: Verify Salesforce/WHMCS/Freebit status
  4. Check logs: Search BFF logs for job ID or order ID
  5. Determine if retryable: Some errors are permanent (missing mapping), others are transient (network timeout)

Example log searches:

# Search logs for a specific order
grep "8014x000000ABCDXYZ" /var/log/bff/combined.log

# Search for queue processing errors
grep "provisioning" /var/log/bff/error.log | tail -50
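
Step 5 can be partly automated by mapping the error codes listed in this runbook to a retry decision. The sketch below assumes the job's failure message contains one of those codes, and the classification itself is mostly a judgment call. Only "missing mapping is permanent" and "network timeout is transient" come from the text; `classifyFailure` and the rule table are hypothetical.

```typescript
type Retryability = "transient" | "permanent" | "unknown";

// Assumed classification; adjust to match how the processors actually behave.
// "permanent" here means "will fail again until someone fixes the cause".
const RULES: Array<[string, Retryability]> = [
  ["MAPPING_ERROR", "permanent"], // product mapping must be added first
  ["PAYMENT_METHOD_MISSING", "permanent"], // customer must act first
  ["ORDER_NOT_FOUND", "permanent"],
  ["ACCOUNT_NOT_FOUND", "permanent"],
  ["FREEBIT_AUTH_FAILED", "permanent"], // credentials must be corrected first
  ["WHMCS_ERROR", "transient"], // often connectivity related
  ["OPERATION_CONFLICT", "transient"], // clears once the pending operation ends
];

// Look for a known error code inside the failure message.
function classifyFailure(failedReason: string): Retryability {
  for (const [code, kind] of RULES) {
    if (failedReason.includes(code)) return kind;
  }
  return "unknown";
}
```

A result of "unknown" means the failure needs manual investigation before any retry.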

Manual Retry Procedures

Retry a Single Failed Job

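// Using the BullMQ API in Node.js (run as an ES module so top-level await works)
import { Queue } from "bullmq";
import IORedis from "ioredis";

// Connect to the same Redis instance the BFF uses (assumes REDIS_URL is set)
const redisConnection = new IORedis(process.env.REDIS_URL);
const queue = new Queue("order-provisioning", { connection: redisConnection });

const job = await queue.getJob("job-id"); // resolves to undefined if the job no longer exists
if (!job) throw new Error("Job not found (it may already have been removed)");
await job.retry(); // moves the job from failed back to wait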

Retry All Failed Jobs

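# Move all failed jobs back to waiting
# Caution: this edits BullMQ's Redis keys directly. The job hash keeps its old
# failure fields and no queue events are emitted. Prefer the BullMQ API when possible.
redis-cli ZRANGEBYSCORE "bull:order-provisioning:failed" -inf +inf | while read -r jobId; do
  redis-cli LPUSH "bull:order-provisioning:wait" "$jobId"
  redis-cli ZREM "bull:order-provisioning:failed" "$jobId"
done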

Warning: Only retry jobs after fixing the root cause. Retrying without a fix will reproduce the same failure.
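
As an alternative to editing Redis keys directly, failed jobs can be retried through the BullMQ API, which keeps the job's own state consistent. The following is a minimal sketch. The two interfaces are local stand-ins declared so the example is self-contained; they are intended to match BullMQ's `Queue#getFailed` and `Job#retry`, and `retryAllFailed` is a hypothetical helper.

```typescript
// Minimal shapes of the BullMQ objects used here.
interface RetryableJob {
  id?: string;
  retry(): Promise<void>;
}

interface FailedJobSource {
  getFailed(start?: number, end?: number): Promise<RetryableJob[]>;
}

// Retry every job currently in the failed set, one at a time.
// Returns the IDs of the jobs that were retried.
async function retryAllFailed(queue: FailedJobSource): Promise<string[]> {
  const failedJobs = await queue.getFailed(0, -1); // 0..-1 selects the whole set
  const retried: string[] = [];
  for (const job of failedJobs) {
    await job.retry(); // moves the job from failed back to wait
    retried.push(job.id ?? "(no id)");
  }
  return retried;
}
```

With a real queue this would be called as `await retryAllFailed(queue)`, where `queue` is a BullMQ `Queue` for order-provisioning.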

For order provisioning, the recommended retry method is through Salesforce:

  1. Open the Order in Salesforce
  2. Clear error fields (Activation_Error__c, Activation_Error_DateTime__c)
  3. Set Activation_Status__c back to "Activating"
  4. The Record-Triggered Flow will publish a new Platform Event

This approach ensures proper idempotency tracking and audit trail.


Clearing Stuck Jobs

Clear All Jobs from a Queue

Warning: This removes all jobs including pending work. Use only in emergencies.

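# Clear all queue state keys
# Note: DEL removes only these lists/sets; per-job hashes (bull:order-provisioning:<job-id>) stay in Redis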
redis-cli DEL \
  "bull:order-provisioning:wait" \
  "bull:order-provisioning:active" \
  "bull:order-provisioning:delayed" \
  "bull:order-provisioning:completed" \
  "bull:order-provisioning:failed"

Clear Old Completed/Failed Jobs

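# Remove jobs older than 7 days from completed
# (date -d is GNU date syntax; the sorted-set scores are millisecond timestamps, hence the appended 000)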
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:completed" -inf $(date -d '7 days ago' +%s000)

# Remove jobs older than 30 days from failed
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:failed" -inf $(date -d '30 days ago' +%s000)
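
The same cleanup can be done through BullMQ's `Queue#clean(grace, limit, type)`, which removes the job data along with the set entries. The following is a minimal sketch. `CleanableQueue` is a local stand-in for the BullMQ queue, and `cleanOldJobs` and the batch limit of 1000 are assumptions.

```typescript
// Minimal shape of the BullMQ Queue method used here.
interface CleanableQueue {
  clean(graceMs: number, limit: number, type: "completed" | "failed"): Promise<string[]>;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Remove completed jobs older than 7 days and failed jobs older than 30 days.
// Returns how many jobs were removed from each state in this pass.
async function cleanOldJobs(
  queue: CleanableQueue,
  batchLimit: number = 1000, // assumed cap per call; repeat the call if it is reached
): Promise<{ completed: number; failed: number }> {
  const completedIds = await queue.clean(7 * DAY_MS, batchLimit, "completed");
  const failedIds = await queue.clean(30 * DAY_MS, batchLimit, "failed");
  return { completed: completedIds.length, failed: failedIds.length };
}
```

Because each call is capped by the limit, a large backlog may need several passes.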

Queue Backlog Handling

Symptoms of Backlog

  • Wait queue depth increasing
  • Jobs not being processed
  • Customer orders stuck in "Activating" status

Diagnosis

  1. Check processor is running

    grep "BullMQ" /var/log/bff/combined.log | tail -20
    
  2. Check Redis connectivity

    redis-cli PING
    
  3. Check for blocked jobs

    redis-cli LLEN "bull:order-provisioning:active"
    # If active > 0 for extended time, jobs may be stuck
    
  4. Check external dependencies

    • Salesforce API
    • WHMCS API
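
For step 3, "stuck" can be made concrete by comparing each active job's start time with the current time. The following is a minimal sketch, assuming the 5-minute "Job Timeout" threshold from the alerting table is the right cutoff. `ActiveJobInfo` mirrors the `id` and `processedOn` fields of a BullMQ job, and `findStuckJobs` is a hypothetical helper.

```typescript
// Subset of BullMQ job fields used here; processedOn is the epoch-millisecond
// time at which a worker picked the job up.
interface ActiveJobInfo {
  id?: string;
  processedOn?: number;
}

const STUCK_AFTER_MS = 5 * 60 * 1000; // assumed cutoff: 5 minutes

// Return the IDs of active jobs that have been processing longer than the threshold.
// Jobs without a processedOn value are skipped because their age is unknown.
function findStuckJobs(
  activeJobs: ActiveJobInfo[],
  nowMs: number,
  thresholdMs: number = STUCK_AFTER_MS,
): string[] {
  return activeJobs
    .filter((job) => job.processedOn !== undefined && nowMs - job.processedOn > thresholdMs)
    .map((job) => job.id ?? "(no id)");
}
```

The input could come from `queue.getActive()` on a BullMQ `Queue`, with `Date.now()` as `nowMs`.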

Resolution

  1. Restart BFF to reconnect queue workers
  2. Clear stuck active jobs if processor crashed mid-job
  3. Scale horizontally if queue depth is due to high volume
  4. Fix root cause if jobs are failing repeatedly

Alerting Configuration

| Alert | Condition | Severity |
| --- | --- | --- |
| Queue Backlog | Wait queue > 10 for > 5 minutes | Warning |
| Queue Backlog Critical | Wait queue > 50 | Critical |
| Failed Jobs Spike | > 5 failures in 15 minutes | Warning |
| Processor Down | No job processed in 10 minutes with jobs waiting | Critical |
| Job Timeout | Job active for > 5 minutes | Warning |

Monitoring Queries

# Check queue depths (for monitoring script)
WAIT=$(redis-cli LLEN "bull:order-provisioning:wait")
ACTIVE=$(redis-cli LLEN "bull:order-provisioning:active")
FAILED=$(redis-cli ZCARD "bull:order-provisioning:failed")

echo "Wait: $WAIT, Active: $ACTIVE, Failed: $FAILED"
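
The same numbers can be read from inside Node.js through BullMQ's `Queue#getJobCounts`, which avoids depending on the Redis key layout. The following is a minimal sketch. `CountableQueue` is a local stand-in for the BullMQ queue, and `queueDepths` is a hypothetical helper.

```typescript
// Minimal shape of the BullMQ Queue method used here.
interface CountableQueue {
  getJobCounts(...types: string[]): Promise<{ [state: string]: number }>;
}

// Build a one-line summary of queue depths; missing states are reported as 0.
async function queueDepths(queue: CountableQueue): Promise<string> {
  const counts = await queue.getJobCounts("wait", "active", "delayed", "failed");
  const wait = counts.wait ?? 0;
  const active = counts.active ?? 0;
  const delayed = counts.delayed ?? 0;
  const failed = counts.failed ?? 0;
  return `Wait: ${wait}, Active: ${active}, Delayed: ${delayed}, Failed: ${failed}`;
}
```

The returned string uses the same layout as the shell script's output, with the delayed count added.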

Best Practices

Job Design

  • Include sufficient context in job data for debugging
  • Use idempotency keys to prevent duplicate processing
  • Keep job payloads small (< 10KB)
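
The payload size guideline can be checked before a job is enqueued. The following is a minimal sketch, assuming the size that matters is the UTF-8 length of the JSON-serialized job data and reading "10KB" as 10 × 1024 bytes. Both function names are hypothetical.

```typescript
const MAX_PAYLOAD_BYTES = 10 * 1024; // the "< 10KB" guideline, read here as 10 KiB

// Size of the job data in its JSON form, measured in UTF-8 bytes.
function payloadBytes(data: unknown): number {
  return new TextEncoder().encode(JSON.stringify(data)).length;
}

// True when the payload is within the guideline.
function isPayloadSmallEnough(data: unknown): boolean {
  return payloadBytes(data) < MAX_PAYLOAD_BYTES;
}
```

A producer could log a warning or reject the job when `isPayloadSmallEnough` returns false, and store large data elsewhere with only a reference in the job.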

Error Handling

  • Distinguish between retryable and non-retryable errors
  • Log sufficient context before throwing
  • Update external systems with error status on final failure

Monitoring

  • Set up alerts for queue depth and failure rate
  • Monitor job processing duration
  • Track success/failure ratios over time


Last Updated: December 2025