# Queue Management Runbook

This document covers monitoring and management of the BullMQ job queues used by the Customer Portal BFF.

---

## Overview

The BFF uses BullMQ (backed by Redis) for asynchronous job processing:

| Queue                | Purpose                                       | Processor Location                                   |
| -------------------- | --------------------------------------------- | ---------------------------------------------------- |
| `order-provisioning` | Order fulfillment after CS approval           | `apps/bff/src/modules/orders/queue/`                 |
| `sim-management`     | Delayed SIM operations (network type changes) | `apps/bff/src/modules/subscriptions/sim-management/` |

---

## Queue Configuration

### Environment Variables

| Variable                 | Description                        | Default  |
| ------------------------ | ---------------------------------- | -------- |
| `REDIS_URL`              | Redis connection for queues        | Required |
| `QUEUE_DEFAULT_ATTEMPTS` | Default retry attempts             | 3        |
| `QUEUE_BACKOFF_DELAY`    | Backoff delay between retries (ms) | 5000     |

### Queue Options

```typescript
// Default queue configuration
{
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 5000,
    },
    removeOnComplete: 100, // Keep last 100 completed jobs
    removeOnFail: 500,     // Keep last 500 failed jobs
  }
}
```

---

## Monitoring

### Check Queue Status

```bash
# Connect to Redis and list queue keys.
# --scan iterates incrementally; KEYS "bull:*" would block Redis while it walks the whole keyspace.
redis-cli --scan --pattern "bull:*"

# Check specific queue length (wait and active are lists; delayed and failed are sorted sets)
redis-cli LLEN "bull:order-provisioning:wait"
redis-cli LLEN "bull:order-provisioning:active"
redis-cli ZCARD "bull:order-provisioning:delayed"
redis-cli ZCARD "bull:order-provisioning:failed"
```

### Queue Key Structure

| Key Pattern              | Redis Type | Description                         |
| ------------------------ | ---------- | ----------------------------------- |
| `bull:{queue}:wait`      | List       | Jobs waiting to be processed        |
| `bull:{queue}:active`    | List       | Jobs currently being processed      |
| `bull:{queue}:delayed`   | Sorted set | Jobs scheduled for future execution |
| `bull:{queue}:completed` | Sorted set | Recently completed jobs             |
| `bull:{queue}:failed`    | Sorted set | Failed jobs                         |

### Health Metrics

| Metric           | Warning | Critical | Action                      |
| ---------------- | ------- | -------- | --------------------------- |
| Wait queue depth | >10     | >50      | Check processor status      |
| Failed job count | >5      | >20      | Investigate failures        |
| Processing time  | >30s    | >60s     | Check external dependencies |

---

## Order Provisioning Queue

### Purpose

Processes orders after CS approval via Salesforce Platform Events.

### Flow

```
Salesforce Platform Event (OrderProvisionRequested__e)
    ↓
Event Subscriber receives event
    ↓
Job enqueued to 'order-provisioning' queue
    ↓
Processor executes fulfillment workflow
    ↓
Order created in WHMCS + Salesforce updated
```

### Job Data Structure

```typescript
{
  sfOrderId: "8014x000000ABCDXYZ", // Salesforce Order ID
  idempotencyKey: "8014x...-1703123456789",
  eventPayload: { ... } // Original Platform Event data
}
```

### Common Failure Reasons

| Error                    | Cause                          | Resolution                                       |
| ------------------------ | ------------------------------ | ------------------------------------------------ |
| `PAYMENT_METHOD_MISSING` | Customer has no payment method | Customer must add payment method in WHMCS        |
| `ORDER_NOT_FOUND`        | Salesforce Order doesn't exist | Check Order ID, verify not deleted               |
| `MAPPING_ERROR`          | Product mapping missing        | Add `WH_Product_ID__c` to Product2 in Salesforce |
| `WHMCS_ERROR`            | WHMCS API failure              | Check WHMCS connectivity and logs                |

### Retry Behavior

- **Attempts**: 3 total (1 initial + 2 retries)
- **Backoff**: Exponential, starting from the 5000 ms base delay (5s before the first retry, 10s before the second)
- **On Final Failure**: Salesforce Order updated with error details

---

## SIM Management Queue

### Purpose

Handles delayed SIM operations, particularly network type changes that require a 30-minute gap.
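As a rough illustration of how the 30-minute gap could be expressed, the sketch below builds the name, payload, and options for a delayed job. The names `buildNetworkTypeChangeJob` and `NETWORK_CHANGE_GAP_MS` are hypothetical and not taken from the BFF codebase, and treating `scheduledAt` as the time the job should run is an assumption.

```typescript
// Sketch only: helper and constant names are hypothetical, not from the BFF source.
const NETWORK_CHANGE_GAP_MS = 30 * 60 * 1000; // 30-minute gap, in milliseconds

interface SimJobData {
  subscriptionId: number;
  simAccount: string;
  operation: "networkTypeChange";
  params: { networkType: "4G" | "5G" };
  scheduledAt: string; // Assumption: ISO 8601 time at which the job should run
}

// Builds the job name, payload, and options without touching Redis,
// so the delay calculation can be checked in isolation.
function buildNetworkTypeChangeJob(
  subscriptionId: number,
  simAccount: string,
  networkType: "4G" | "5G",
  now: Date = new Date(),
): { name: string; data: SimJobData; opts: { delay: number } } {
  const runAt = new Date(now.getTime() + NETWORK_CHANGE_GAP_MS);
  return {
    name: "networkTypeChange",
    data: {
      subscriptionId,
      simAccount,
      operation: "networkTypeChange",
      params: { networkType },
      scheduledAt: runAt.toISOString(),
    },
    opts: { delay: NETWORK_CHANGE_GAP_MS },
  };
}

const job = buildNetworkTypeChangeJob(
  29951,
  "08077052946",
  "5G",
  new Date("2024-01-15T10:00:00Z"),
);
console.log(job.opts.delay, job.data.scheduledAt);
```

The result could then be handed to BullMQ as `queue.add(job.name, job.data, job.opts)`; `delay` is the BullMQ job option that defers execution by the given number of milliseconds.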

### Job Types

| Job Type            | Delay      | Description                   |
| ------------------- | ---------- | ----------------------------- |
| `networkTypeChange` | 30 minutes | Change between 4G/5G networks |

### Job Data Structure

```typescript
{
  subscriptionId: 29951,
  simAccount: "08077052946",
  operation: "networkTypeChange",
  params: { networkType: "5G" },
  scheduledAt: "2024-01-15T10:30:00Z"
}
```

### Common Failure Reasons

| Error                 | Cause                            | Resolution                              |
| --------------------- | -------------------------------- | --------------------------------------- |
| `FREEBIT_AUTH_FAILED` | Freebit authentication error     | Check OEM credentials                   |
| `ACCOUNT_NOT_FOUND`   | SIM account not found in Freebit | Verify account identifier               |
| `OPERATION_CONFLICT`  | Another operation pending        | Wait for previous operation to complete |

---

## Failed Job Investigation

### View Failed Jobs

```bash
# List failed job IDs (using Redis CLI)
redis-cli ZRANGE "bull:order-provisioning:failed" 0 -1

# Get job details (replace {job-id} with an ID from the list above)
redis-cli HGETALL "bull:order-provisioning:{job-id}"
```

### Common Investigation Steps

1. **Check job data**: Identify the order/subscription involved
2. **Check error message**: Look for specific failure reason
3. **Check external system**: Verify Salesforce/WHMCS/Freebit status
4. **Check logs**: Search BFF logs for job ID or order ID
5. **Determine if retryable**: Some errors are permanent (missing mapping), others are transient (network timeout)

### Log Search

```bash
# Search logs for specific order
grep "8014x000000ABCDXYZ" /var/log/bff/combined.log

# Search for queue processing errors
grep "provisioning" /var/log/bff/error.log | tail -50
```

---

## Manual Retry Procedures

### Retry a Single Failed Job

```typescript
// Using the BullMQ API in Node.js
import { Queue } from "bullmq";
import IORedis from "ioredis";

// Assumption: connect with REDIS_URL, as the BFF does
const redisConnection = new IORedis(process.env.REDIS_URL!);
const queue = new Queue("order-provisioning", { connection: redisConnection });

const job = await queue.getJob("job-id"); // replace with the real job ID
if (!job) throw new Error("Job not found");
await job.retry(); // moves the job from failed back to wait
```

### Retry All Failed Jobs

```bash
# Move all failed jobs back to waiting.
# This edits Redis directly and bypasses BullMQ's own bookkeeping (job hash fields, events),
# so prefer the BullMQ API (job.retry(), or queue.retryJobs() if the installed version provides it).
redis-cli ZRANGEBYSCORE "bull:order-provisioning:failed" -inf +inf | while read -r jobId; do
  redis-cli LPUSH "bull:order-provisioning:wait" "$jobId"
  redis-cli ZREM "bull:order-provisioning:failed" "$jobId"
done
```

> **Warning**: Only retry jobs after fixing the root cause. Retrying without a fix will produce the same failure.

### Retry via Salesforce (Recommended for Provisioning)

For order provisioning, the recommended retry method is through Salesforce:

1. Open the Order in Salesforce
2. Clear error fields (`Activation_Error__c`, `Activation_Error_DateTime__c`)
3. Set `Activation_Status__c` back to "Activating"
4. The Record-Triggered Flow will publish a new Platform Event

This approach ensures proper idempotency tracking and an audit trail.

---

## Clearing Stuck Jobs

### Clear All Jobs from a Queue

> **Warning**: This removes all jobs including pending work. Use only in emergencies.

```bash
# Clear all queue data.
# Note: this deletes only the state lists/sets; per-job hashes (bull:order-provisioning:{job-id}) remain in Redis.
redis-cli DEL \
  "bull:order-provisioning:wait" \
  "bull:order-provisioning:active" \
  "bull:order-provisioning:delayed" \
  "bull:order-provisioning:completed" \
  "bull:order-provisioning:failed"
```

### Clear Old Completed/Failed Jobs

```bash
# Scores are millisecond timestamps; `date -d` is GNU date syntax.

# Remove jobs older than 7 days from completed
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:completed" -inf "$(date -d '7 days ago' +%s000)"

# Remove jobs older than 30 days from failed
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:failed" -inf "$(date -d '30 days ago' +%s000)"
```

---

## Queue Backlog Handling

### Symptoms of Backlog

- Wait queue depth increasing
- Jobs not being processed
- Customer orders stuck in "Activating" status

### Diagnosis

1. **Check processor is running**

   ```bash
   grep "BullMQ" /var/log/bff/combined.log | tail -20
   ```

2. **Check Redis connectivity**

   ```bash
   redis-cli PING
   ```

3. **Check for blocked jobs**

   ```bash
   redis-cli LLEN "bull:order-provisioning:active"
   # If active > 0 for an extended time, jobs may be stuck
   ```

4. **Check external dependencies**
   - Salesforce API
   - WHMCS API

### Resolution

1. **Restart BFF** to reconnect queue workers
2. **Clear stuck active jobs** if processor crashed mid-job
3. **Scale horizontally** if queue depth is due to high volume
4. **Fix root cause** if jobs are failing repeatedly

---

## Alerting Configuration

### Recommended Alerts

| Alert                  | Condition                                        | Severity |
| ---------------------- | ------------------------------------------------ | -------- |
| Queue Backlog          | Wait queue > 10 for > 5 minutes                  | Warning  |
| Queue Backlog Critical | Wait queue > 50                                  | Critical |
| Failed Jobs Spike      | > 5 failures in 15 minutes                       | Warning  |
| Processor Down         | No job processed in 10 minutes with jobs waiting | Critical |
| Job Timeout            | Job active for > 5 minutes                       | Warning  |

### Monitoring Queries

```bash
# Check queue depths (for monitoring script)
WAIT=$(redis-cli LLEN "bull:order-provisioning:wait")
ACTIVE=$(redis-cli LLEN "bull:order-provisioning:active")
FAILED=$(redis-cli ZCARD "bull:order-provisioning:failed")

echo "Wait: $WAIT, Active: $ACTIVE, Failed: $FAILED"
```

---

## Best Practices

### Job Design

- Include sufficient context in job data for debugging
- Use idempotency keys to prevent duplicate processing
- Keep job payloads small (< 10KB)

### Error Handling

- Distinguish between retryable and non-retryable errors
- Log sufficient context before throwing
- Update external systems with error status on final failure

### Monitoring

- Set up alerts for queue depth and failure rate
- Monitor job processing duration
- Track success/failure ratios over time

---

## Related Documents

- [Incident Response](./incident-response.md)
- [Provisioning Runbook](./provisioning-runbook.md)
- [External Dependencies](./external-dependencies.md)
- [SIM State Machine](../integrations/sim/state-machine.md)

---

**Last Updated:** December 2025