# Queue Management Runbook

This document covers monitoring and management of BullMQ job queues used by the Customer Portal BFF.

---

## Overview

The BFF uses BullMQ (backed by Redis) for asynchronous job processing:

| Queue                | Purpose                                       | Processor Location                                   |
| -------------------- | --------------------------------------------- | ---------------------------------------------------- |
| `order-provisioning` | Order fulfillment after CS approval           | `apps/bff/src/modules/orders/queue/`                 |
| `sim-management`     | Delayed SIM operations (network type changes) | `apps/bff/src/modules/subscriptions/sim-management/` |

---

## Queue Configuration

### Environment Variables

| Variable                 | Description                        | Default  |
| ------------------------ | ---------------------------------- | -------- |
| `REDIS_URL`              | Redis connection for queues        | Required |
| `QUEUE_DEFAULT_ATTEMPTS` | Default retry attempts             | 3        |
| `QUEUE_BACKOFF_DELAY`    | Backoff delay between retries (ms) | 5000     |

### Queue Options

```typescript
// Default queue configuration
{
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 5000,
    },
    removeOnComplete: 100,  // Keep last 100 completed jobs
    removeOnFail: 500,      // Keep last 500 failed jobs
  }
}
```
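
The environment variables and the default options above could be tied together roughly as follows. This is a minimal sketch, not the BFF's actual code: `buildDefaultJobOptions` and `positiveInt` are hypothetical helpers, and the fallback-on-invalid-value behaviour is an assumption.

```typescript
// Hypothetical helper: names and parsing rules are assumptions, not the BFF's real implementation.
interface DefaultJobOptions {
  attempts: number;
  backoff: { type: "exponential"; delay: number };
  removeOnComplete: number;
  removeOnFail: number;
}

// Parse a positive integer from an env value; fall back when it is unset or invalid.
function positiveInt(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw === undefined ? "" : raw, 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Build the default job options from the documented variables and defaults.
function buildDefaultJobOptions(env: Record<string, string | undefined>): DefaultJobOptions {
  return {
    attempts: positiveInt(env.QUEUE_DEFAULT_ATTEMPTS, 3),
    backoff: { type: "exponential", delay: positiveInt(env.QUEUE_BACKOFF_DELAY, 5000) },
    removeOnComplete: 100, // Keep last 100 completed jobs
    removeOnFail: 500, // Keep last 500 failed jobs
  };
}
```

In a Node.js process this would typically be called as `buildDefaultJobOptions(process.env)` and the result passed as `defaultJobOptions` when constructing the queue.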

---

## Monitoring

### Check Queue Status

```bash
# List queue keys (uses SCAN, which does not block Redis the way KEYS does)
redis-cli --scan --pattern "bull:*"

# Check specific queue length
redis-cli LLEN "bull:order-provisioning:wait"
redis-cli LLEN "bull:order-provisioning:active"
redis-cli ZCARD "bull:order-provisioning:delayed"
redis-cli ZCARD "bull:order-provisioning:failed"
```

### Queue Key Structure

| Key Pattern              | Description                         |
| ------------------------ | ----------------------------------- |
| `bull:{queue}:wait`      | Jobs waiting to be processed        |
| `bull:{queue}:active`    | Jobs currently being processed      |
| `bull:{queue}:delayed`   | Jobs scheduled for future execution |
| `bull:{queue}:completed` | Recently completed jobs             |
| `bull:{queue}:failed`    | Failed jobs                         |

### Health Metrics

| Metric           | Warning | Critical | Action                      |
| ---------------- | ------- | -------- | --------------------------- |
| Wait queue depth | >10     | >50      | Check processor status      |
| Failed job count | >5      | >20      | Investigate failures        |
| Processing time  | >30s    | >60s     | Check external dependencies |
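
A monitoring script could map a measured value onto these thresholds as sketched below. The threshold numbers come from the table; the `THRESHOLDS` object, its key names, and the `classify` function are illustrative assumptions.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Values copied from the Health Metrics table; the object and key names are illustrative.
const THRESHOLDS = {
  waitDepth: { warning: 10, critical: 50 },
  failedCount: { warning: 5, critical: 20 },
  processingSeconds: { warning: 30, critical: 60 },
};

// Both bounds are strict ("greater than"), matching the ">" notation in the table.
function classify(value: number, threshold: Threshold): Severity {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}
```

For example, `classify(12, THRESHOLDS.waitDepth)` falls in the warning band, which the table maps to checking processor status.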

---

## Order Provisioning Queue

### Purpose

Processes orders after CS approval via Salesforce Platform Events.

### Flow

```
Salesforce Platform Event (OrderProvisionRequested__e)
        ↓
Event Subscriber receives event
        ↓
Job enqueued to 'order-provisioning' queue
        ↓
Processor executes fulfillment workflow
        ↓
Order created in WHMCS + Salesforce updated
```

### Job Data Structure

```typescript
{
  sfOrderId: "8014x000000ABCDXYZ",  // Salesforce Order ID
  idempotencyKey: "8014x...-1703123456789",
  eventPayload: { ... }  // Original Platform Event data
}
```

### Common Failure Reasons

| Error                    | Cause                          | Resolution                                       |
| ------------------------ | ------------------------------ | ------------------------------------------------ |
| `PAYMENT_METHOD_MISSING` | Customer has no payment method | Customer must add payment method in WHMCS        |
| `ORDER_NOT_FOUND`        | Salesforce Order doesn't exist | Check Order ID, verify not deleted               |
| `MAPPING_ERROR`          | Product mapping missing        | Add `WH_Product_ID__c` to Product2 in Salesforce |
| `WHMCS_ERROR`            | WHMCS API failure              | Check WHMCS connectivity and logs                |

### Retry Behavior

- **Attempts**: 3 total (1 initial + 2 retries)
- **Backoff**: Exponential from the 5s base delay (5s before the first retry, 10s before the second)
- **On Final Failure**: Salesforce Order updated with error details
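
The retry schedule can be worked out with a small helper. This sketch assumes the doubling formula `baseDelay * 2^(failedAttempts - 1)` that BullMQ's built-in exponential backoff uses without jitter; `retryDelaysMs` is a hypothetical name.

```typescript
// Delay before each retry, given the total number of attempts and the base delay.
// Assumes plain doubling with no jitter.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  // A job with N attempts is retried at most N - 1 times.
  for (let failedAttempts = 1; failedAttempts < attempts; failedAttempts++) {
    delays.push(baseDelayMs * Math.pow(2, failedAttempts - 1));
  }
  return delays;
}
```

With the defaults, `retryDelaysMs(3, 5000)` yields two delays, 5000 ms and 10000 ms, so a job that keeps failing reaches its final failure roughly 15 seconds after the first attempt fails, plus processing time.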

---

## SIM Management Queue

### Purpose

Handles delayed SIM operations, particularly network type changes that require a 30-minute gap.

### Job Types

| Job Type            | Delay      | Description                   |
| ------------------- | ---------- | ----------------------------- |
| `networkTypeChange` | 30 minutes | Change between 4G/5G networks |
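
One way to derive the delay for such a job is sketched below. It assumes the 30-minute gap is measured from the previous operation on the same SIM, which this runbook does not state explicitly; `delayUntilGapElapsed` and the constant name are hypothetical.

```typescript
// Assumption: the gap is counted from the previous SIM operation's timestamp.
const NETWORK_TYPE_CHANGE_GAP_MS = 30 * 60 * 1000;

// Milliseconds to wait so the job runs no earlier than 30 minutes after the previous operation.
function delayUntilGapElapsed(previousOperationAt: Date, now: Date): number {
  const earliestRunAt = previousOperationAt.getTime() + NETWORK_TYPE_CHANGE_GAP_MS;
  // Never return a negative delay: if the gap has already passed, run immediately.
  return Math.max(0, earliestRunAt - now.getTime());
}
```

The result could be passed as the `delay` job option when adding the job, which places it in the delayed set until the time has elapsed.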

### Job Data Structure

```typescript
{
  subscriptionId: 29951,
  simAccount: "08077052946",
  operation: "networkTypeChange",
  params: {
    networkType: "5G"
  },
  scheduledAt: "2024-01-15T10:30:00Z"
}
```

### Common Failure Reasons

| Error                 | Cause                            | Resolution                              |
| --------------------- | -------------------------------- | --------------------------------------- |
| `FREEBIT_AUTH_FAILED` | Freebit authentication error     | Check OEM credentials                   |
| `ACCOUNT_NOT_FOUND`   | SIM account not found in Freebit | Verify account identifier               |
| `OPERATION_CONFLICT`  | Another operation pending        | Wait for previous operation to complete |

---

## Failed Job Investigation

### View Failed Jobs

```bash
# List failed jobs (using Redis CLI)
redis-cli ZRANGE "bull:order-provisioning:failed" 0 -1

# Get job details (replace <job-id> with an ID from the list above)
redis-cli HGETALL "bull:order-provisioning:<job-id>"
```

### Common Investigation Steps

1. **Check job data**: Identify the order/subscription involved
2. **Check error message**: Look for specific failure reason
3. **Check external system**: Verify Salesforce/WHMCS/Freebit status
4. **Check logs**: Search BFF logs for job ID or order ID
5. **Determine if retryable**: Some errors are permanent (missing mapping); others are transient (network timeout)
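
The last step can be sketched as a lookup over the error codes listed in this runbook. Which codes count as permanent or transient here is an assumption made for illustration, as is the idea that the job's failure reason contains the error code; `classifyFailure` is a hypothetical helper.

```typescript
type FailureKind = "permanent" | "transient" | "unknown";

// Assumed split: "permanent" means a manual fix is needed before a retry can succeed.
const PERMANENT_CODES = [
  "PAYMENT_METHOD_MISSING",
  "ORDER_NOT_FOUND",
  "MAPPING_ERROR",
  "FREEBIT_AUTH_FAILED",
  "ACCOUNT_NOT_FOUND",
];
// Assumed split: "transient" means a later retry may succeed without changes.
const TRANSIENT_CODES = ["WHMCS_ERROR", "OPERATION_CONFLICT"];

// Assumes the failure reason text contains one of the error codes.
function classifyFailure(failedReason: string): FailureKind {
  const mentions = (code: string): boolean => failedReason.indexOf(code) !== -1;
  if (PERMANENT_CODES.some(mentions)) return "permanent";
  if (TRANSIENT_CODES.some(mentions)) return "transient";
  return "unknown";
}
```

Anything classified as `"unknown"` still needs a manual look at the job data and logs before deciding whether to retry.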

### Log Search

```bash
# Search logs for specific order
grep "8014x000000ABCDXYZ" /var/log/bff/combined.log

# Search for queue processing errors
grep "provisioning" /var/log/bff/error.log | tail -50
```

---

## Manual Retry Procedures

### Retry a Single Failed Job

```typescript
// Using BullMQ API in Node.js
import { Queue } from "bullmq";

// redisConnection: the Redis connection options the BFF already uses (see REDIS_URL)
const queue = new Queue("order-provisioning", { connection: redisConnection });
const job = await queue.getJob("job-id");
if (!job) {
  throw new Error("Job not found; it may already have been removed");
}
await job.retry(); // Moves the failed job back to the wait list
await queue.close();
```

### Retry All Failed Jobs

```bash
# Move all failed jobs back to waiting
# Note: this edits the queue keys directly and leaves each job's stored failure details untouched
redis-cli ZRANGEBYSCORE "bull:order-provisioning:failed" -inf +inf | while read -r jobId; do
  redis-cli LPUSH "bull:order-provisioning:wait" "$jobId"
  redis-cli ZREM "bull:order-provisioning:failed" "$jobId"
done
```

> **Warning**: Only retry jobs after fixing the root cause. Retrying without fixing it will reproduce the same failure.
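
As an alternative to moving IDs between Redis keys by hand, recent BullMQ versions expose a bulk retry on the queue itself. The sketch below assumes `Queue.retryJobs` is available in the installed version; `RetryableQueue` is a hypothetical structural type that declares only the method used, so the example compiles without the library, and `retryAllFailedJobs` is a hypothetical wrapper.

```typescript
// Structural subset of a BullMQ Queue; declared here only so the sketch is self-contained.
interface RetryableQueue {
  retryJobs(opts?: { count?: number; state?: "failed" | "completed"; timestamp?: number }): Promise<void>;
}

// Ask the queue to move every failed job back to the wait list.
async function retryAllFailedJobs(queue: RetryableQueue): Promise<void> {
  await queue.retryJobs({ state: "failed" });
}
```

A real `Queue` instance for `order-provisioning` would be passed in as `queue`. Check the installed BullMQ version before relying on this method.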

### Retry via Salesforce (Recommended for Provisioning)

For order provisioning, the recommended retry method is through Salesforce:

1. Open the Order in Salesforce
2. Clear error fields (`Activation_Error__c`, `Activation_Error_DateTime__c`)
3. Set `Activation_Status__c` back to "Activating"
4. The Record-Triggered Flow will publish a new Platform Event

This approach ensures proper idempotency tracking and audit trail.

---

## Clearing Stuck Jobs

### Clear All Jobs from a Queue

> **Warning**: This removes all jobs including pending work. Use only in emergencies.

```bash
# Clear all queue data
# Note: per-job hashes (bull:order-provisioning:<job-id>) are not removed by this command
redis-cli DEL \
  "bull:order-provisioning:wait" \
  "bull:order-provisioning:active" \
  "bull:order-provisioning:delayed" \
  "bull:order-provisioning:completed" \
  "bull:order-provisioning:failed"
```

### Clear Old Completed/Failed Jobs

```bash
# Timestamps are in milliseconds; `date -d` is GNU date syntax

# Remove jobs older than 7 days from completed
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:completed" -inf "$(date -d '7 days ago' +%s000)"

# Remove jobs older than 30 days from failed
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:failed" -inf "$(date -d '30 days ago' +%s000)"
```
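
The same retention windows can be applied through BullMQ's `Queue.clean(grace, limit, type)`, which also removes the job data itself. In this sketch `CleanableQueue` is a hypothetical structural type declaring only that method, `daysToMs` and `cleanOldJobs` are hypothetical helpers, and the `limit` of 1000 jobs per call is an arbitrary choice.

```typescript
// Structural subset of a BullMQ Queue; declared here only so the sketch is self-contained.
interface CleanableQueue {
  clean(graceMs: number, limit: number, type: "completed" | "failed"): Promise<string[]>;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Convert a retention period in days to the millisecond grace period clean() expects.
function daysToMs(days: number): number {
  return days * DAY_MS;
}

// Remove completed jobs older than 7 days and failed jobs older than 30 days.
// Returns how many job IDs were removed in this pass.
async function cleanOldJobs(queue: CleanableQueue, limit = 1000): Promise<number> {
  const completed = await queue.clean(daysToMs(7), limit, "completed");
  const failed = await queue.clean(daysToMs(30), limit, "failed");
  return completed.length + failed.length;
}
```

Because each call removes at most `limit` jobs, a large backlog may need several passes.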

---

## Queue Backlog Handling

### Symptoms of Backlog

- Wait queue depth increasing
- Jobs not being processed
- Customer orders stuck in "Activating" status

### Diagnosis

1. **Check processor is running**

   ```bash
   grep "BullMQ" /var/log/bff/combined.log | tail -20
   ```

2. **Check Redis connectivity**

   ```bash
   redis-cli PING
   ```

3. **Check for blocked jobs**

   ```bash
   redis-cli LLEN "bull:order-provisioning:active"
   # If active > 0 for extended time, jobs may be stuck
   ```

4. **Check external dependencies**
   - Salesforce API
   - WHMCS API

### Resolution

1. **Restart BFF** to reconnect queue workers
2. **Clear stuck active jobs** if processor crashed mid-job
3. **Scale horizontally** if queue depth is due to high volume
4. **Fix root cause** if jobs are failing repeatedly

---

## Alerting Configuration

### Recommended Alerts

| Alert                  | Condition                                        | Severity |
| ---------------------- | ------------------------------------------------ | -------- |
| Queue Backlog          | Wait queue > 10 for > 5 minutes                  | Warning  |
| Queue Backlog Critical | Wait queue > 50                                  | Critical |
| Failed Jobs Spike      | > 5 failures in 15 minutes                       | Warning  |
| Processor Down         | No job processed in 10 minutes with jobs waiting | Critical |
| Job Timeout            | Job active for > 5 minutes                       | Warning  |
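
The "Processor Down" condition combines two signals, which the sketch below makes explicit. The `QueueSnapshot` shape and `isProcessorDown` are hypothetical, and treating "no job ever completed" as down when jobs are waiting is an assumption.

```typescript
// Hypothetical snapshot a monitoring script might collect.
interface QueueSnapshot {
  waitDepth: number; // Current length of the wait list
  lastCompletedAt: Date | null; // When a job last finished, or null if none has
}

const PROCESSOR_DOWN_AFTER_MS = 10 * 60 * 1000;

// True when jobs are waiting and nothing has been processed for more than 10 minutes.
function isProcessorDown(snapshot: QueueSnapshot, now: Date): boolean {
  if (snapshot.waitDepth === 0) return false; // An idle queue is not an outage
  if (snapshot.lastCompletedAt === null) return true; // Assumption: never processed counts as down
  return now.getTime() - snapshot.lastCompletedAt.getTime() > PROCESSOR_DOWN_AFTER_MS;
}
```

Requiring a non-empty wait list avoids false alarms during quiet periods when no orders arrive.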

### Monitoring Queries

```bash
# Check queue depths (for monitoring script)
WAIT=$(redis-cli LLEN "bull:order-provisioning:wait")
ACTIVE=$(redis-cli LLEN "bull:order-provisioning:active")
FAILED=$(redis-cli ZCARD "bull:order-provisioning:failed")

echo "Wait: $WAIT, Active: $ACTIVE, Failed: $FAILED"
```

---

## Best Practices

### Job Design

- Include sufficient context in job data for debugging
- Use idempotency keys to prevent duplicate processing
- Keep job payloads small (< 10KB)

### Error Handling

- Distinguish between retryable and non-retryable errors
- Log sufficient context before throwing
- Update external systems with error status on final failure

### Monitoring

- Set up alerts for queue depth and failure rate
- Monitor job processing duration
- Track success/failure ratios over time

---

## Related Documents

- [Incident Response](./incident-response.md)
- [Provisioning Runbook](./provisioning-runbook.md)
- [External Dependencies](./external-dependencies.md)
- [SIM State Machine](../integrations/sim/state-machine.md)

---

**Last Updated:** December 2025