# Queue Management Runbook
This document covers monitoring and management of BullMQ job queues used by the Customer Portal BFF.
---
## Overview
The BFF uses BullMQ (backed by Redis) for asynchronous job processing:
| Queue | Purpose | Processor Location |
| -------------------- | --------------------------------------------- | ---------------------------------------------------- |
| `order-provisioning` | Order fulfillment after CS approval | `apps/bff/src/modules/orders/queue/` |
| `sim-management` | Delayed SIM operations (network type changes) | `apps/bff/src/modules/subscriptions/sim-management/` |
---
## Queue Configuration
### Environment Variables
| Variable | Description | Default |
| ------------------------ | ---------------------------------- | -------- |
| `REDIS_URL` | Redis connection for queues | Required |
| `QUEUE_DEFAULT_ATTEMPTS` | Default retry attempts | 3 |
| `QUEUE_BACKOFF_DELAY` | Backoff delay between retries (ms) | 5000 |
### Queue Options
```typescript
// Default queue configuration
{
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 5000,
    },
    removeOnComplete: 100, // Keep last 100 completed jobs
    removeOnFail: 500, // Keep last 500 failed jobs
  }
}
```
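As a rough illustration of how the environment variables could feed these options, the sketch below builds the default job options from an environment map. The helper names (`parsePositiveInt`, `buildDefaultJobOptions`) are hypothetical and not taken from the BFF source; the fallback values mirror the defaults in the table above.

```typescript
// Hypothetical helper, shown for illustration only; the BFF may wire this differently.
interface DefaultJobOptions {
  attempts: number;
  backoff: { type: "exponential"; delay: number };
  removeOnComplete: number;
  removeOnFail: number;
}

// Parse a positive integer, falling back when the value is missing or invalid.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = raw === undefined ? NaN : parseInt(raw, 10);
  return !isNaN(parsed) && parsed > 0 ? parsed : fallback;
}

function buildDefaultJobOptions(
  env: { [name: string]: string | undefined }
): DefaultJobOptions {
  return {
    attempts: parsePositiveInt(env["QUEUE_DEFAULT_ATTEMPTS"], 3),
    backoff: {
      type: "exponential",
      delay: parsePositiveInt(env["QUEUE_BACKOFF_DELAY"], 5000),
    },
    removeOnComplete: 100, // Keep last 100 completed jobs
    removeOnFail: 500, // Keep last 500 failed jobs
  };
}
```

Calling `buildDefaultJobOptions(process.env)` at startup would then pick up any overrides, and an unparseable value falls back to the documented default instead of producing `NaN` attempts.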
---
## Monitoring
### Check Queue Status
```bash
# Connect to Redis and list queue keys (SCAN-based, so it does not block Redis the way KEYS can on a large keyspace)
redis-cli --scan --pattern "bull:*"
# Check specific queue length
redis-cli LLEN "bull:order-provisioning:wait"
redis-cli LLEN "bull:order-provisioning:active"
redis-cli ZCARD "bull:order-provisioning:delayed"
redis-cli ZCARD "bull:order-provisioning:failed"
```
### Queue Key Structure
| Key Pattern | Description |
| ------------------------ | ----------------------------------- |
| `bull:{queue}:wait` | Jobs waiting to be processed |
| `bull:{queue}:active` | Jobs currently being processed |
| `bull:{queue}:delayed` | Jobs scheduled for future execution |
| `bull:{queue}:completed` | Recently completed jobs |
| `bull:{queue}:failed` | Failed jobs |
### Health Metrics
| Metric | Warning | Critical | Action |
| ---------------- | ------- | -------- | --------------------------- |
| Wait queue depth | >10 | >50 | Check processor status |
| Failed job count | >5 | >20 | Investigate failures |
| Processing time | >30s | >60s | Check external dependencies |
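The thresholds in this table can be expressed as a small classification function, which may be handy in a monitoring script. This is a minimal sketch; the names `THRESHOLDS` and `classify` are illustrative and do not come from the BFF codebase.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Values copied from the Health Metrics table.
const THRESHOLDS = {
  waitDepth: { warning: 10, critical: 50 },
  failedCount: { warning: 5, critical: 20 },
  processingSeconds: { warning: 30, critical: 60 },
};

// A value is critical above the critical threshold, warning above the
// warning threshold, and ok otherwise (thresholds are exclusive, as in ">10").
function classify(value: number, threshold: Threshold): Severity {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}
```

For example, `classify(12, THRESHOLDS.waitDepth)` yields `"warning"`, which maps to the "Check processor status" action.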
---
## Order Provisioning Queue
### Purpose
Processes orders after CS approval via Salesforce Platform Events.
### Flow
```
Salesforce Platform Event (OrderProvisionRequested__e)
        ↓
Event Subscriber receives event
        ↓
Job enqueued to 'order-provisioning' queue
        ↓
Processor executes fulfillment workflow
        ↓
Order created in WHMCS + Salesforce updated
```
### Job Data Structure
```typescript
{
  sfOrderId: "8014x000000ABCDXYZ", // Salesforce Order ID
  idempotencyKey: "8014x...-1703123456789",
  eventPayload: { ... } // Original Platform Event data
}
```
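When inspecting or replaying jobs by hand, it can help to validate the payload shape before trusting it. The sketch below is one possible type guard for the structure above; the interface and function names are hypothetical, and the type of `eventPayload` is assumed to be a plain object because its real schema is not documented here.

```typescript
// Assumed shape, derived only from the example payload above.
interface OrderProvisioningJobData {
  sfOrderId: string;
  idempotencyKey: string;
  eventPayload: { [field: string]: unknown };
}

// Returns true only when all three documented fields are present and well-typed.
function isOrderProvisioningJobData(
  value: unknown
): value is OrderProvisioningJobData {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { [field: string]: unknown };
  const sfOrderId = candidate["sfOrderId"];
  const idempotencyKey = candidate["idempotencyKey"];
  const eventPayload = candidate["eventPayload"];
  return (
    typeof sfOrderId === "string" &&
    sfOrderId.length > 0 &&
    typeof idempotencyKey === "string" &&
    idempotencyKey.length > 0 &&
    typeof eventPayload === "object" &&
    eventPayload !== null
  );
}
```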
### Common Failure Reasons
| Error | Cause | Resolution |
| ------------------------ | ------------------------------ | ------------------------------------------------ |
| `PAYMENT_METHOD_MISSING` | Customer has no payment method | Customer must add payment method in WHMCS |
| `ORDER_NOT_FOUND` | Salesforce Order doesn't exist | Check Order ID, verify not deleted |
| `MAPPING_ERROR` | Product mapping missing | Add `WH_Product_ID__c` to Product2 in Salesforce |
| `WHMCS_ERROR` | WHMCS API failure | Check WHMCS connectivity and logs |
### Retry Behavior
- **Attempts**: 3 total (1 initial + 2 retries)
- **Backoff**: Exponential (5s before the first retry, 10s before the second; a 20s delay would only apply to a fourth attempt)
- **On Final Failure**: Salesforce Order updated with error details
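
To make the retry timing concrete, the sketch below computes the wait before each retry, assuming the exponential strategy doubles the base delay on every retry (base × 2^(retry − 1)), which is how BullMQ's built-in `exponential` backoff is understood to behave. The function name is illustrative.

```typescript
// Returns the delay (ms) before each retry. With N total attempts there are
// N - 1 retries, so the result has N - 1 entries.
function exponentialBackoffDelays(
  attempts: number,
  baseDelayMs: number
): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry - 1));
  }
  return delays;
}

// Default configuration: 3 attempts, 5000 ms base delay.
const defaultDelays = exponentialBackoffDelays(3, 5000); // [5000, 10000]
```

With the defaults, a job that keeps failing reaches its final failure roughly 15 seconds of backoff (plus processing time) after the first attempt.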
---
## SIM Management Queue
### Purpose
Handles delayed SIM operations, particularly network type changes that require a 30-minute gap.
### Job Types
| Job Type | Delay | Description |
| ------------------- | ---------- | ----------------------------- |
| `networkTypeChange` | 30 minutes | Change between 4G/5G networks |
### Job Data Structure
```typescript
{
  subscriptionId: 29951,
  simAccount: "08077052946",
  operation: "networkTypeChange",
  params: {
    networkType: "5G"
  },
  scheduledAt: "2024-01-15T10:30:00Z"
}
```
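BullMQ schedules delayed jobs through a `delay` job option expressed in milliseconds. The sketch below shows one way the delay could be derived from `scheduledAt`, assuming that field holds the intended execution time (the source does not state this explicitly). The function and constant names are hypothetical.

```typescript
// 30 minutes, matching the documented gap for network type changes.
const NETWORK_TYPE_CHANGE_DELAY_MS = 30 * 60 * 1000;

// Milliseconds from `nowMs` until `scheduledAt`, clamped at zero so that an
// overdue job would run immediately. Throws on an unparseable timestamp.
function delayUntil(scheduledAt: string, nowMs: number): number {
  const targetMs = Date.parse(scheduledAt);
  if (isNaN(targetMs)) {
    throw new Error("Invalid scheduledAt timestamp: " + scheduledAt);
  }
  return Math.max(0, targetMs - nowMs);
}
```

A job requested at 10:00:00Z with `scheduledAt: "2024-01-15T10:30:00Z"` would get a delay equal to `NETWORK_TYPE_CHANGE_DELAY_MS`.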
### Common Failure Reasons
| Error | Cause | Resolution |
| --------------------- | -------------------------------- | --------------------------------------- |
| `FREEBIT_AUTH_FAILED` | Freebit authentication error | Check OEM credentials |
| `ACCOUNT_NOT_FOUND` | SIM account not found in Freebit | Verify account identifier |
| `OPERATION_CONFLICT` | Another operation pending | Wait for previous operation to complete |
---
## Failed Job Investigation
### View Failed Jobs
```bash
# List failed jobs (using Redis CLI)
redis-cli ZRANGE "bull:order-provisioning:failed" 0 -1
# Get job details
redis-cli HGETALL "bull:order-provisioning:{job-id}"
```
### Common Investigation Steps
1. **Check job data**: Identify the order/subscription involved
2. **Check error message**: Look for specific failure reason
3. **Check external system**: Verify Salesforce/WHMCS/Freebit status
4. **Check logs**: Search BFF logs for job ID or order ID
5. **Determine if retryable**: Some errors are permanent (missing mapping), others are transient (network timeout)
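
Step 5 can be sketched as a lookup from error code to retryability. The classification below is an assumption made for illustration, based on the resolution columns in the failure tables: codes that need a data or configuration fix are treated as permanent (retrying alone will not help), and codes that depend on an external system recovering are treated as transient. Adjust it to match actual processor behavior.

```typescript
type Retryability = "transient" | "permanent" | "unknown";

// Assumed mapping; not taken from the BFF source.
const ERROR_CLASSIFICATION: { [code: string]: Retryability } = {
  MAPPING_ERROR: "permanent", // needs WH_Product_ID__c added in Salesforce
  ORDER_NOT_FOUND: "permanent", // order missing or deleted
  PAYMENT_METHOD_MISSING: "permanent", // customer must act first
  ACCOUNT_NOT_FOUND: "permanent", // wrong SIM account identifier
  FREEBIT_AUTH_FAILED: "permanent", // credentials must be fixed
  WHMCS_ERROR: "transient", // may recover once WHMCS is reachable
  OPERATION_CONFLICT: "transient", // clears when the previous operation ends
};

// Unrecognized codes are reported as "unknown" so a human reviews them.
function classifyError(code: string): Retryability {
  const known = Object.prototype.hasOwnProperty.call(ERROR_CLASSIFICATION, code);
  const found = ERROR_CLASSIFICATION[code];
  return known && found !== undefined ? found : "unknown";
}
```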
### Log Search
```bash
# Search logs for specific order
grep "8014x000000ABCDXYZ" /var/log/bff/combined.log
# Search for queue processing errors
grep "provisioning" /var/log/bff/error.log | tail -50
```
---
## Manual Retry Procedures
### Retry a Single Failed Job
```typescript
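// Using the BullMQ API in Node.js
import { Queue } from "bullmq";
// Placeholder connection; use the Redis instance that REDIS_URL points to
const redisConnection = { host: "localhost", port: 6379 };
const queue = new Queue("order-provisioning", { connection: redisConnection });
const job = await queue.getJob("job-id"); // resolves to undefined if the job no longer exists
if (!job) throw new Error("Job not found; it may already have been removed");
await job.retry(); // moves the job from failed back to wait
await queue.close();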
```
### Retry All Failed Jobs
```bash
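# Move all failed jobs back to waiting.
# Caution: this edits BullMQ's Redis keys directly and does not update the job hashes
# (for example, failedReason and finishedOn stay set). Prefer the BullMQ API where possible.
redis-cli ZRANGEBYSCORE "bull:order-provisioning:failed" -inf +inf | while read -r jobId; do
  redis-cli LPUSH "bull:order-provisioning:wait" "$jobId"
  redis-cli ZREM "bull:order-provisioning:failed" "$jobId"
done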
```
> **Warning**: Only retry jobs after fixing the root cause. Retrying without fixing will cause the same failure.
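
A bulk retry can also go through the BullMQ API, which keeps the job bookkeeping consistent. The sketch below is written against two small interfaces rather than importing `bullmq` directly; a BullMQ `Queue` (with `getFailed`) and `Job` (with `retry`) should satisfy them, but treat that as an assumption to verify against the installed version. The function name `retryFailedJobs` is hypothetical.

```typescript
// Minimal shapes for the parts of the BullMQ API this sketch relies on.
interface RetryableJob {
  id?: string;
  retry(): Promise<void>;
}

interface FailedJobSource {
  getFailed(start?: number, end?: number): Promise<RetryableJob[]>;
}

// Retries up to `limit` failed jobs one at a time and returns their IDs.
async function retryFailedJobs(
  queue: FailedJobSource,
  limit: number = 100
): Promise<string[]> {
  const failedJobs = await queue.getFailed(0, limit - 1);
  const retriedIds: string[] = [];
  for (const job of failedJobs) {
    await job.retry(); // moves the job from failed back to wait
    retriedIds.push(job.id !== undefined ? job.id : "(no id)");
  }
  return retriedIds;
}
```

Capping the batch with `limit` keeps a large failed set from flooding the downstream systems all at once.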
### Retry via Salesforce (Recommended for Provisioning)
For order provisioning, the recommended retry method is through Salesforce:
1. Open the Order in Salesforce
2. Clear error fields (`Activation_Error__c`, `Activation_Error_DateTime__c`)
3. Set `Activation_Status__c` back to "Activating"
4. The Record-Triggered Flow will publish a new Platform Event
This approach ensures proper idempotency tracking and audit trail.
---
## Clearing Stuck Jobs
### Clear All Jobs from a Queue
> **Warning**: This removes all jobs including pending work. Use only in emergencies.
```bash
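# Clear all queue state lists and sets
# Note: per-job hashes (bull:order-provisioning:{job-id}) are not removed by this command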
redis-cli DEL \
"bull:order-provisioning:wait" \
"bull:order-provisioning:active" \
"bull:order-provisioning:delayed" \
"bull:order-provisioning:completed" \
"bull:order-provisioning:failed"
```
### Clear Old Completed/Failed Jobs
```bash
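# Remove entries older than 7 days from the completed set (GNU date syntax; scores are millisecond timestamps)
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:completed" -inf "$(date -d '7 days ago' +%s000)"
# Remove entries older than 30 days from the failed set
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:failed" -inf "$(date -d '30 days ago' +%s000)"
# Note: this trims only the sorted sets; the job hashes (bull:order-provisioning:{job-id}) are left behind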
```
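The same pruning can be done through BullMQ's `clean(grace, limit, type)` method, which also removes the job data. The sketch below is typed against a minimal interface that a BullMQ `Queue` is expected to satisfy; the function name and the per-call limit of 1000 are illustrative assumptions.

```typescript
type CleanableState = "completed" | "failed";

// Minimal shape of the queue method this sketch depends on.
interface CleanableQueue {
  clean(graceMs: number, limit: number, type: CleanableState): Promise<string[]>;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Removes completed jobs older than 7 days and failed jobs older than 30 days,
// returning how many jobs were removed in each state.
async function pruneOldJobs(
  queue: CleanableQueue,
  limitPerCall: number = 1000
): Promise<{ completed: number; failed: number }> {
  const completedIds = await queue.clean(7 * DAY_MS, limitPerCall, "completed");
  const failedIds = await queue.clean(30 * DAY_MS, limitPerCall, "failed");
  return { completed: completedIds.length, failed: failedIds.length };
}
```

If either count equals `limitPerCall`, more old jobs probably remain and the call can be repeated.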
---
## Queue Backlog Handling
### Symptoms of Backlog
- Wait queue depth increasing
- Jobs not being processed
- Customer orders stuck in "Activating" status
### Diagnosis
1. **Check processor is running**
```bash
grep "BullMQ" /var/log/bff/combined.log | tail -20
```
2. **Check Redis connectivity**
```bash
redis-cli PING
```
3. **Check for blocked jobs**
```bash
redis-cli LLEN "bull:order-provisioning:active"
# If active > 0 for extended time, jobs may be stuck
```
4. **Check external dependencies**
- Salesforce API
- WHMCS API
- Freebit API (for the `sim-management` queue)
### Resolution
1. **Restart BFF** to reconnect queue workers
2. **Clear stuck active jobs** if processor crashed mid-job
3. **Scale horizontally** if queue depth is due to high volume
4. **Fix root cause** if jobs are failing repeatedly
---
## Alerting Configuration
### Recommended Alerts
| Alert | Condition | Severity |
| ---------------------- | ------------------------------------------------ | -------- |
| Queue Backlog | Wait queue > 10 for > 5 minutes | Warning |
| Queue Backlog Critical | Wait queue > 50 | Critical |
| Failed Jobs Spike | > 5 failures in 15 minutes | Warning |
| Processor Down | No job processed in 10 minutes with jobs waiting | Critical |
| Job Timeout | Job active for > 5 minutes | Warning |
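
Some of these conditions can be evaluated from a single snapshot of queue state. The sketch below covers the three that need no history (backlog critical, processor down, job timeout); the windowed conditions ("for > 5 minutes", "in 15 minutes") would need a time series and are left out. The `QueueSnapshot` fields are hypothetical and would have to be populated by whatever monitoring is in place.

```typescript
interface QueueSnapshot {
  waitDepth: number; // jobs currently waiting
  minutesSinceLastCompleted: number; // time since any job finished
  longestActiveMinutes: number; // age of the oldest active job
}

interface Alert {
  name: string;
  severity: "Warning" | "Critical";
}

// Applies the snapshot-based rows of the Recommended Alerts table.
function evaluateAlerts(snapshot: QueueSnapshot): Alert[] {
  const alerts: Alert[] = [];
  if (snapshot.waitDepth > 50) {
    alerts.push({ name: "Queue Backlog Critical", severity: "Critical" });
  }
  if (snapshot.waitDepth > 0 && snapshot.minutesSinceLastCompleted >= 10) {
    alerts.push({ name: "Processor Down", severity: "Critical" });
  }
  if (snapshot.longestActiveMinutes > 5) {
    alerts.push({ name: "Job Timeout", severity: "Warning" });
  }
  return alerts;
}
```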
### Monitoring Queries
```bash
# Check queue depths (for monitoring script)
WAIT=$(redis-cli LLEN "bull:order-provisioning:wait")
ACTIVE=$(redis-cli LLEN "bull:order-provisioning:active")
FAILED=$(redis-cli ZCARD "bull:order-provisioning:failed")
echo "Wait: $WAIT, Active: $ACTIVE, Failed: $FAILED"
```
---
## Best Practices
### Job Design
- Include sufficient context in job data for debugging
- Use idempotency keys to prevent duplicate processing
- Keep job payloads small (< 10KB)
### Error Handling
- Distinguish between retryable and non-retryable errors
- Log sufficient context before throwing
- Update external systems with error status on final failure
### Monitoring
- Set up alerts for queue depth and failure rate
- Monitor job processing duration
- Track success/failure ratios over time
---
## Related Documents
- [Incident Response](./incident-response.md)
- [Provisioning Runbook](./provisioning-runbook.md)
- [External Dependencies](./external-dependencies.md)
- [SIM State Machine](../integrations/sim/state-machine.md)
---
**Last Updated:** December 2025