Queue Management Runbook
This document covers monitoring and management of BullMQ job queues used by the Customer Portal BFF.
Overview
The BFF uses BullMQ (backed by Redis) for asynchronous job processing:
| Queue | Purpose | Processor Location |
|---|---|---|
| order-provisioning | Order fulfillment after CS approval | apps/bff/src/modules/orders/queue/ |
| sim-management | Delayed SIM operations (network type changes) | apps/bff/src/modules/subscriptions/sim-management/ |
Queue Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| REDIS_URL | Redis connection for queues | Required |
| QUEUE_DEFAULT_ATTEMPTS | Default retry attempts | 3 |
| QUEUE_BACKOFF_DELAY | Backoff delay between retries (ms) | 5000 |
Queue Options
// Default queue configuration
{
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 5000,
    },
    removeOnComplete: 100, // Keep last 100 completed jobs
    removeOnFail: 500, // Keep last 500 failed jobs
  }
}
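The defaults above line up with the environment variables in the previous table. A minimal sketch of how the two could be tied together follows; the helper names `parsePositiveInt` and `buildDefaultJobOptions` are hypothetical and not taken from the BFF codebase, and the fallback values are simply the documented defaults.

```typescript
// Shape of the default job options documented in this runbook.
interface DefaultJobOptions {
  attempts: number;
  backoff: { type: "exponential"; delay: number };
  removeOnComplete: number;
  removeOnFail: number;
}

// Hypothetical helper: parse a positive integer, or fall back to a default
// when the variable is unset, empty, or not a positive whole number.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = parseInt(raw ?? "", 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Hypothetical builder: reads QUEUE_DEFAULT_ATTEMPTS and QUEUE_BACKOFF_DELAY
// from an env-like object and fills in the documented defaults.
function buildDefaultJobOptions(
  env: Record<string, string | undefined>
): DefaultJobOptions {
  return {
    attempts: parsePositiveInt(env.QUEUE_DEFAULT_ATTEMPTS, 3),
    backoff: {
      type: "exponential",
      delay: parsePositiveInt(env.QUEUE_BACKOFF_DELAY, 5000),
    },
    removeOnComplete: 100, // keep last 100 completed jobs
    removeOnFail: 500, // keep last 500 failed jobs
  };
}
```

Passing the environment in as a plain object, rather than reading `process.env` inside the function, keeps the sketch easy to exercise in isolation.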
Monitoring
Check Queue Status
# Connect to Redis and list queue keys (SCAN-based, so it does not block Redis the way KEYS can)
redis-cli --scan --pattern "bull:*"
# Check specific queue length
redis-cli LLEN "bull:order-provisioning:wait"
redis-cli LLEN "bull:order-provisioning:active"
redis-cli ZCARD "bull:order-provisioning:delayed"
redis-cli ZCARD "bull:order-provisioning:failed"
Queue Key Structure
| Key Pattern | Description |
|---|---|
| bull:{queue}:wait | Jobs waiting to be processed |
| bull:{queue}:active | Jobs currently being processed |
| bull:{queue}:delayed | Jobs scheduled for future execution |
| bull:{queue}:completed | Recently completed jobs |
| bull:{queue}:failed | Failed jobs |
Health Metrics
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Wait queue depth | >10 | >50 | Check processor status |
| Failed job count | >5 | >20 | Investigate failures |
| Processing time | >30s | >60s | Check external dependencies |
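The thresholds in this table can be encoded directly so that a monitoring script reports a level instead of raw numbers. The following is a rough sketch only; the names `THRESHOLDS` and `classify` are illustrative and do not refer to existing BFF code.

```typescript
type Level = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Thresholds copied from the Health Metrics table.
const THRESHOLDS = {
  waitDepth: { warning: 10, critical: 50 },
  failedCount: { warning: 5, critical: 20 },
  processingSeconds: { warning: 30, critical: 60 },
};

// A value is critical or warning only when it strictly exceeds the threshold,
// matching the ">" comparisons in the table.
function classify(value: number, threshold: Threshold): Level {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}
```

For example, `classify(12, THRESHOLDS.waitDepth)` yields `"warning"`, which maps to the "Check processor status" action.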
Order Provisioning Queue
Purpose
Processes orders after CS approval via Salesforce Platform Events.
Flow
Salesforce Platform Event (OrderProvisionRequested__e)
↓
Event Subscriber receives event
↓
Job enqueued to 'order-provisioning' queue
↓
Processor executes fulfillment workflow
↓
Order created in WHMCS + Salesforce updated
Job Data Structure
{
  sfOrderId: "8014x000000ABCDXYZ", // Salesforce Order ID
  idempotencyKey: "8014x...-1703123456789",
  eventPayload: { ... } // Original Platform Event data
}
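When inspecting a failed job by hand, it helps to confirm first that its payload has the shape shown above. The sketch below is one possible check; the interface and guard names are hypothetical, and the only outside assumption is the general Salesforce convention that record IDs are 15 or 18 alphanumeric characters.

```typescript
// Payload shape as documented in this runbook.
interface OrderProvisioningJobData {
  sfOrderId: string;
  idempotencyKey: string;
  eventPayload: Record<string, unknown>;
}

// Assumption: Salesforce record IDs are 15 or 18 alphanumeric characters.
const SALESFORCE_ID = /^[A-Za-z0-9]{15}(?:[A-Za-z0-9]{3})?$/;

// Hypothetical type guard: true only when every documented field is present
// and plausibly formed. It does not validate the contents of eventPayload.
function isOrderProvisioningJobData(
  value: unknown
): value is OrderProvisioningJobData {
  if (typeof value !== "object" || value === null) return false;
  const data = value as Record<string, unknown>;
  return (
    typeof data.sfOrderId === "string" &&
    SALESFORCE_ID.test(data.sfOrderId) &&
    typeof data.idempotencyKey === "string" &&
    data.idempotencyKey.length > 0 &&
    typeof data.eventPayload === "object" &&
    data.eventPayload !== null
  );
}
```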
Common Failure Reasons
| Error | Cause | Resolution |
|---|---|---|
| PAYMENT_METHOD_MISSING | Customer has no payment method | Customer must add payment method in WHMCS |
| ORDER_NOT_FOUND | Salesforce Order doesn't exist | Check Order ID, verify not deleted |
| MAPPING_ERROR | Product mapping missing | Add WH_Product_ID__c to Product2 in Salesforce |
| WHMCS_ERROR | WHMCS API failure | Check WHMCS connectivity and logs |
Retry Behavior
- Attempts: 3 total (1 initial + 2 retries)
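- Backoff: Exponential, starting at 5s and doubling per retry (5s before the first retry, 10s before the second; a 20s delay would apply only if a fourth attempt were configured)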
- On Final Failure: Salesforce Order updated with error details
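The retry timing can be worked out from the attempt count and base delay. The sketch below assumes BullMQ's built-in exponential strategy, which to the best of our understanding waits `delay * 2^(attemptsMade - 1)` before each retry; verify this against the BullMQ version in use. The function name `retryDelaysMs` is illustrative.

```typescript
// Returns the wait before each retry, in milliseconds.
// With N total attempts there are N - 1 retries, hence N - 1 delays.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    // Assumed exponential formula: base delay doubled for each prior attempt.
    delays.push(Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs));
  }
  return delays;
}
```

With the documented defaults, `retryDelaysMs(3, 5000)` gives two delays, 5000 ms and 10000 ms, so a job that keeps failing reaches its final failure roughly 15 seconds of waiting (plus processing time) after the first attempt.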
SIM Management Queue
Purpose
Handles delayed SIM operations, particularly network type changes that require a 30-minute gap.
Job Types
| Job Type | Delay | Description |
|---|---|---|
| networkTypeChange | 30 minutes | Change between 4G/5G networks |
Job Data Structure
{
  subscriptionId: 29951,
  simAccount: "08077052946",
  operation: "networkTypeChange",
  params: {
    networkType: "5G"
  },
  scheduledAt: "2024-01-15T10:30:00Z"
}
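BullMQ's `delay` job option is expressed in milliseconds from the time the job is added, so a `scheduledAt` timestamp like the one above has to be converted. The following is a sketch of that conversion; whether the BFF computes the delay this way is an assumption, and the names `NETWORK_TYPE_CHANGE_GAP_MS` and `delayUntil` are hypothetical.

```typescript
// The 30-minute gap required before a network type change, in milliseconds.
const NETWORK_TYPE_CHANGE_GAP_MS = 30 * 60 * 1000;

// Converts an ISO 8601 scheduledAt value into a non-negative delay relative
// to nowMs. A scheduledAt in the past yields 0 (run as soon as possible).
function delayUntil(scheduledAt: string, nowMs: number): number {
  const target = Date.parse(scheduledAt);
  if (Number.isNaN(target)) {
    throw new Error(`Invalid scheduledAt: ${scheduledAt}`);
  }
  return Math.max(0, target - nowMs);
}
```

A job requested at 10:00:00Z with `scheduledAt` set to 10:30:00Z produces a delay equal to `NETWORK_TYPE_CHANGE_GAP_MS`.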
Common Failure Reasons
| Error | Cause | Resolution |
|---|---|---|
| FREEBIT_AUTH_FAILED | Freebit authentication error | Check OEM credentials |
| ACCOUNT_NOT_FOUND | SIM account not found in Freebit | Verify account identifier |
| OPERATION_CONFLICT | Another operation pending | Wait for previous operation to complete |
Failed Job Investigation
View Failed Jobs
# List failed jobs (using Redis CLI)
redis-cli ZRANGE "bull:order-provisioning:failed" 0 -1
# Get job details
redis-cli HGETALL "bull:order-provisioning:{job-id}"
Common Investigation Steps
- Check job data: Identify the order/subscription involved
- Check error message: Look for specific failure reason
- Check external system: Verify Salesforce/WHMCS/Freebit status
- Check logs: Search BFF logs for job ID or order ID
- Determine if retryable: Some errors are permanent (missing mapping), others are transient (network timeout)
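The last step, deciding whether a failure is worth retrying, can be captured as a lookup over the error codes listed in this runbook. The classification below is an assumption made for illustration: only the missing-mapping case is stated as permanent in the text, and the other assignments are inferred from the listed resolutions, so adjust them to match how the processors actually behave.

```typescript
type Retryability = "transient" | "permanent" | "unknown";

// Assumed classification. "permanent" means a plain retry will fail again
// until someone fixes data or configuration; "transient" means a later
// retry may succeed without any change.
const ERROR_CLASSIFICATION = new Map<string, Retryability>([
  ["MAPPING_ERROR", "permanent"], // stated in the runbook
  ["ORDER_NOT_FOUND", "permanent"], // assumption
  ["PAYMENT_METHOD_MISSING", "permanent"], // assumption: needs customer action
  ["ACCOUNT_NOT_FOUND", "permanent"], // assumption
  ["FREEBIT_AUTH_FAILED", "permanent"], // assumption: needs credential fix
  ["WHMCS_ERROR", "transient"], // assumption: often connectivity
  ["OPERATION_CONFLICT", "transient"], // assumption: clears when prior op ends
]);

// Unlisted codes are reported as "unknown" so they get a manual look.
function classifyError(code: string): Retryability {
  return ERROR_CLASSIFICATION.get(code) ?? "unknown";
}
```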
Log Search
# Search logs for specific order
grep "8014x000000ABCDXYZ" /var/log/bff/combined.log
# Search for queue processing errors
grep "provisioning" /var/log/bff/error.log | tail -50
Manual Retry Procedures
Retry a Single Failed Job
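// Using the BullMQ API in Node.js
import { Queue } from "bullmq";
import IORedis from "ioredis";
// Assumption: connect with the same REDIS_URL the BFF uses
const redisConnection = new IORedis(process.env.REDIS_URL);
const queue = new Queue("order-provisioning", { connection: redisConnection });
const job = await queue.getJob("job-id");
if (!job) throw new Error("Job not found");
await job.retry(); // moves the job from failed back to waiting
await queue.close();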
Retry All Failed Jobs
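# Move all failed jobs back to waiting.
# Caution: this edits BullMQ's keys directly and bypasses its own bookkeeping
# (for example, failure details stored on each job are not reset).
redis-cli ZRANGEBYSCORE "bull:order-provisioning:failed" -inf +inf | while read -r jobId; do
  redis-cli LPUSH "bull:order-provisioning:wait" "$jobId"
  redis-cli ZREM "bull:order-provisioning:failed" "$jobId"
done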
Warning: Only retry jobs after fixing the root cause. Retrying without fixing will cause the same failure.
Retry via Salesforce (Recommended for Provisioning)
For order provisioning, the recommended retry method is through Salesforce:
- Open the Order in Salesforce
- Clear error fields (Activation_Error__c, Activation_Error_DateTime__c)
- Set Activation_Status__c back to "Activating"
- The Record-Triggered Flow will publish a new Platform Event
This approach ensures proper idempotency tracking and audit trail.
Clearing Stuck Jobs
Clear All Jobs from a Queue
Warning: This removes all jobs including pending work. Use only in emergencies.
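# Clear the queue's state lists and sets (per-job hashes such as bull:order-provisioning:{job-id} are not deleted by this command)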
redis-cli DEL \
"bull:order-provisioning:wait" \
"bull:order-provisioning:active" \
"bull:order-provisioning:delayed" \
"bull:order-provisioning:completed" \
"bull:order-provisioning:failed"
Clear Old Completed/Failed Jobs
# Remove entries older than 7 days from the completed set (GNU date syntax).
# Note: this removes job IDs from the set only; the per-job hashes remain in Redis.
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:completed" -inf $(date -d '7 days ago' +%s000)
# Remove entries older than 30 days from the failed set
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:failed" -inf $(date -d '30 days ago' +%s000)
Queue Backlog Handling
Symptoms of Backlog
- Wait queue depth increasing
- Jobs not being processed
- Customer orders stuck in "Activating" status
Diagnosis
- Check processor is running
  grep "BullMQ" /var/log/bff/combined.log | tail -20
- Check Redis connectivity
  redis-cli PING
- Check for blocked jobs
  redis-cli LLEN "bull:order-provisioning:active"
  # If active > 0 for extended time, jobs may be stuck
- Check external dependencies
  - Salesforce API
  - WHMCS API
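The "blocked jobs" check above only shows how many jobs are active, not how long each has been running. A rough sketch of the age check follows, assuming each active job's `processedOn` timestamp (the field BullMQ sets when processing starts) has already been fetched; the `ActiveJobInfo` shape and `findStuckJobs` name are hypothetical, and the 5-minute default mirrors the Job Timeout alert in this runbook.

```typescript
// Minimal view of an active job: its ID and when processing started (ms).
interface ActiveJobInfo {
  id: string;
  processedOn?: number;
}

// Returns the IDs of jobs that have been active longer than maxActiveMs.
// Jobs without a processedOn value are skipped rather than guessed at.
function findStuckJobs(
  active: ActiveJobInfo[],
  nowMs: number,
  maxActiveMs: number = 5 * 60 * 1000
): string[] {
  return active
    .filter(
      (job) =>
        job.processedOn !== undefined && nowMs - job.processedOn > maxActiveMs
    )
    .map((job) => job.id);
}
```

Any ID this returns is a candidate for the "Clear stuck active jobs" resolution step, after confirming in the logs that the processor is no longer working on it.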
Resolution
- Restart BFF to reconnect queue workers
- Clear stuck active jobs if processor crashed mid-job
- Scale horizontally if queue depth is due to high volume
- Fix root cause if jobs are failing repeatedly
Alerting Configuration
Recommended Alerts
| Alert | Condition | Severity |
|---|---|---|
| Queue Backlog | Wait queue > 10 for > 5 minutes | Warning |
| Queue Backlog Critical | Wait queue > 50 | Critical |
| Failed Jobs Spike | > 5 failures in 15 minutes | Warning |
| Processor Down | No job processed in 10 minutes with jobs waiting | Critical |
| Job Timeout | Job active for > 5 minutes | Warning |
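Conditions such as "Wait queue > 10 for > 5 minutes" depend on a run of samples rather than a single reading. A minimal sketch of that check is shown here, assuming the monitoring script records periodic depth samples in chronological order; the `DepthSample` shape and `backlogSustained` name are illustrative only.

```typescript
// One reading of the wait queue depth at a point in time.
interface DepthSample {
  timestampMs: number;
  waitDepth: number;
}

// True when the most recent unbroken run of samples above the threshold
// spans more than windowMs. Samples must be ordered oldest first.
function backlogSustained(
  samples: DepthSample[],
  threshold: number,
  windowMs: number
): boolean {
  if (samples.length === 0) return false;
  const latest = samples[samples.length - 1];
  let breachStartMs: number | null = null;
  // Walk backwards from the newest sample until the depth drops to or
  // below the threshold; that marks the start of the current breach.
  for (let i = samples.length - 1; i >= 0; i--) {
    if (samples[i].waitDepth <= threshold) break;
    breachStartMs = samples[i].timestampMs;
  }
  return breachStartMs !== null && latest.timestampMs - breachStartMs > windowMs;
}
```

For the Queue Backlog alert this would be called as `backlogSustained(samples, 10, 5 * 60 * 1000)`; a brief dip below the threshold resets the run, which avoids alerting on short spikes.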
Monitoring Queries
# Check queue depths (for monitoring script)
WAIT=$(redis-cli LLEN "bull:order-provisioning:wait")
ACTIVE=$(redis-cli LLEN "bull:order-provisioning:active")
FAILED=$(redis-cli ZCARD "bull:order-provisioning:failed")
echo "Wait: $WAIT, Active: $ACTIVE, Failed: $FAILED"
Best Practices
Job Design
- Include sufficient context in job data for debugging
- Use idempotency keys to prevent duplicate processing
- Keep job payloads small (< 10KB)
Error Handling
- Distinguish between retryable and non-retryable errors
- Log sufficient context before throwing
- Update external systems with error status on final failure
Monitoring
- Set up alerts for queue depth and failure rate
- Monitor job processing duration
- Track success/failure ratios over time
Last Updated: December 2025