# Queue Management Runbook

This document covers monitoring and management of BullMQ job queues used by the Customer Portal BFF.

---

## Overview

The BFF uses BullMQ (backed by Redis) for asynchronous job processing:

| Queue                | Purpose                                       | Processor Location                                   |
| -------------------- | --------------------------------------------- | ---------------------------------------------------- |
| `order-provisioning` | Order fulfillment after CS approval           | `apps/bff/src/modules/orders/queue/`                 |
| `sim-management`     | Delayed SIM operations (network type changes) | `apps/bff/src/modules/subscriptions/sim-management/` |

---

## Queue Configuration

### Environment Variables

| Variable                 | Description                                       | Default  |
| ------------------------ | ------------------------------------------------- | -------- |
| `REDIS_URL`              | Redis connection URL for the queues               | Required |
| `QUEUE_DEFAULT_ATTEMPTS` | Total attempts per job (initial run plus retries) | 3        |
| `QUEUE_BACKOFF_DELAY`    | Base delay for exponential retry backoff (ms)     | 5000     |

### Queue Options

```typescript
// Default queue configuration
const queueOptions = {
  defaultJobOptions: {
    attempts: 3,            // Total attempts, including the first run
    backoff: {
      type: 'exponential',
      delay: 5000,          // Base delay in ms; doubles on each retry
    },
    removeOnComplete: 100,  // Keep last 100 completed jobs
    removeOnFail: 500,      // Keep last 500 failed jobs
  },
};
```
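The relationship between the environment variables and the default job options can be sketched as below. This is a minimal illustration, not the BFF's actual wiring: `buildDefaultJobOptions` and `parsePositiveInt` are hypothetical helpers, and the retention values are copied from the configuration above.

```typescript
// Subset of BullMQ job options used in this runbook.
interface DefaultJobOptions {
  attempts: number;
  backoff: { type: "exponential"; delay: number };
  removeOnComplete: number;
  removeOnFail: number;
}

// Parse a positive integer, falling back when the value is unset or invalid.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = parseInt(raw ?? "", 10);
  return !isNaN(parsed) && parsed > 0 ? parsed : fallback;
}

// Hypothetical helper: maps env values onto the documented defaults.
function buildDefaultJobOptions(
  env: Record<string, string | undefined>
): DefaultJobOptions {
  return {
    attempts: parsePositiveInt(env.QUEUE_DEFAULT_ATTEMPTS, 3),
    backoff: {
      type: "exponential",
      delay: parsePositiveInt(env.QUEUE_BACKOFF_DELAY, 5000),
    },
    removeOnComplete: 100, // keep last 100 completed jobs
    removeOnFail: 500, // keep last 500 failed jobs
  };
}
```

Falling back to the default on an invalid value (rather than throwing) is a design choice made for this sketch; a stricter service might prefer to fail at startup.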

---

## Monitoring

### Check Queue Status

```bash
# List queue keys (SCAN-based, so it does not block Redis the way KEYS can)
redis-cli --scan --pattern "bull:*"

# Check specific queue length
redis-cli LLEN "bull:order-provisioning:wait"
redis-cli LLEN "bull:order-provisioning:active"
redis-cli ZCARD "bull:order-provisioning:delayed"
redis-cli ZCARD "bull:order-provisioning:failed"
```

### Queue Key Structure

| Key Pattern              | Redis Type | Description                         |
| ------------------------ | ---------- | ----------------------------------- |
| `bull:{queue}:wait`      | List       | Jobs waiting to be processed        |
| `bull:{queue}:active`    | List       | Jobs currently being processed      |
| `bull:{queue}:delayed`   | Sorted set | Jobs scheduled for future execution |
| `bull:{queue}:completed` | Sorted set | Recently completed jobs             |
| `bull:{queue}:failed`    | Sorted set | Failed jobs                         |
| `bull:{queue}:{job-id}`  | Hash       | Data and state of a single job      |

### Health Metrics

| Metric           | Warning | Critical | Action                      |
| ---------------- | ------- | -------- | --------------------------- |
| Wait queue depth | >10     | >50      | Check processor status      |
| Failed job count | >5      | >20      | Investigate failures        |
| Processing time  | >30s    | >60s     | Check external dependencies |
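A monitoring script could apply the thresholds in the table above along these lines. This is a sketch only: `classify` and `THRESHOLDS` are illustrative names, and the comparison is strict because the table uses ">".

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Thresholds copied from the Health Metrics table.
const THRESHOLDS = {
  waitDepth: { warning: 10, critical: 50 },
  failedCount: { warning: 5, critical: 20 },
  processingSeconds: { warning: 30, critical: 60 },
};

// A value is flagged only when it strictly exceeds a threshold.
function classify(value: number, threshold: Threshold): Severity {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}
```

For example, a wait depth of exactly 10 is still `"ok"`, while 11 is `"warning"` and 51 is `"critical"`.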

---

## Order Provisioning Queue

### Purpose

Processes orders once CS has approved them. Processing is triggered by a Salesforce Platform Event.

### Flow

```
Salesforce Platform Event (OrderProvisionRequested__e)
    ↓
Event Subscriber receives event
    ↓
Job enqueued to 'order-provisioning' queue
    ↓
Processor executes fulfillment workflow
    ↓
Order created in WHMCS + Salesforce updated
```

### Job Data Structure

```typescript
{
  sfOrderId: "8014x000000ABCDXYZ",  // Salesforce Order ID
  idempotencyKey: "8014x...-1703123456789",
  eventPayload: { ... }  // Original Platform Event data
}
```
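When inspecting a job by hand, it can help to confirm that its data has the expected shape before acting on it. The type guard below is a sketch based only on the field names in the example above; the `ProvisioningJobData` interface is an assumption, and the event payload is deliberately left loosely typed because its structure is not specified here.

```typescript
// Field names follow the example job data; the payload shape is left open.
interface ProvisioningJobData {
  sfOrderId: string;
  idempotencyKey: string;
  eventPayload: Record<string, unknown>;
}

// Returns true only when all three documented fields are present and well-typed.
function isProvisioningJobData(value: unknown): value is ProvisioningJobData {
  if (typeof value !== "object" || value === null) return false;
  const { sfOrderId, idempotencyKey, eventPayload } = value as Record<string, unknown>;
  return (
    typeof sfOrderId === "string" &&
    sfOrderId.length > 0 &&
    typeof idempotencyKey === "string" &&
    idempotencyKey.length > 0 &&
    typeof eventPayload === "object" &&
    eventPayload !== null
  );
}
```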

### Common Failure Reasons

| Error                    | Cause                          | Resolution                                       |
| ------------------------ | ------------------------------ | ------------------------------------------------ |
| `PAYMENT_METHOD_MISSING` | Customer has no payment method | Customer must add payment method in WHMCS        |
| `ORDER_NOT_FOUND`        | Salesforce Order doesn't exist | Check Order ID, verify not deleted               |
| `MAPPING_ERROR`          | Product mapping missing        | Add `WH_Product_ID__c` to Product2 in Salesforce |
| `WHMCS_ERROR`            | WHMCS API failure              | Check WHMCS connectivity and logs                |

### Retry Behavior

- **Attempts**: 3 total (1 initial + 2 retries)
- **Backoff**: Exponential (5s before the first retry, 10s before the second)
- **On Final Failure**: Salesforce Order updated with error details
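The retry delays follow from the backoff settings. The sketch below assumes BullMQ's built-in exponential strategy without jitter, where the delay before retry `n` is `baseDelay * 2^(n - 1)`; `backoffSchedule` is an illustrative helper, not part of the BFF.

```typescript
// Delay (ms) before each retry for a given number of total attempts.
// The first attempt runs immediately, so only attempts - 1 delays apply.
function backoffSchedule(attempts: number, baseDelayMs: number): number[] {
  const retries = Math.max(0, attempts - 1);
  const delays: number[] = [];
  for (let retry = 1; retry <= retries; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry - 1));
  }
  return delays;
}

// With the defaults (3 attempts, 5000 ms): [5000, 10000]
const defaultSchedule = backoffSchedule(3, 5000);
```

A 20s delay would only come into play if `attempts` were raised to 4.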

---

## SIM Management Queue

### Purpose

Handles delayed SIM operations, particularly network type changes that require a 30-minute gap.

### Job Types

| Job Type            | Delay      | Description                   |
| ------------------- | ---------- | ----------------------------- |
| `networkTypeChange` | 30 minutes | Change between 4G/5G networks |

### Job Data Structure

```typescript
{
  subscriptionId: 29951,
  simAccount: "08077052946",
  operation: "networkTypeChange",
  params: {
    networkType: "5G"
  },
  scheduledAt: "2024-01-15T10:30:00Z"
}
```
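The 30-minute delay maps naturally onto BullMQ's `delay` job option, which is expressed in milliseconds. The sketch below builds the arguments one might pass to `queue.add(name, data, opts)`. The helper `buildNetworkTypeChangeJob` is hypothetical, using the job type as the job name is an assumption, and `scheduledAt` is assumed to record when the job was enqueued, which this runbook does not specify.

```typescript
const NETWORK_TYPE_CHANGE_DELAY_MS = 30 * 60 * 1000; // 30 minutes

interface SimJobData {
  subscriptionId: number;
  simAccount: string;
  operation: "networkTypeChange";
  params: { networkType: string };
  scheduledAt: string; // ISO 8601 timestamp (assumed: time of enqueue)
}

// Hypothetical helper: returns name, data, and options for a delayed job.
function buildNetworkTypeChangeJob(
  subscriptionId: number,
  simAccount: string,
  networkType: string,
  now: Date
): { name: string; data: SimJobData; opts: { delay: number } } {
  return {
    name: "networkTypeChange",
    data: {
      subscriptionId,
      simAccount,
      operation: "networkTypeChange",
      params: { networkType },
      scheduledAt: now.toISOString(),
    },
    opts: { delay: NETWORK_TYPE_CHANGE_DELAY_MS },
  };
}
```

While the delay is pending, the job sits in `bull:sim-management:delayed` rather than in the wait list.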

### Common Failure Reasons

| Error                 | Cause                            | Resolution                              |
| --------------------- | -------------------------------- | --------------------------------------- |
| `FREEBIT_AUTH_FAILED` | Freebit authentication error     | Check OEM credentials                   |
| `ACCOUNT_NOT_FOUND`   | SIM account not found in Freebit | Verify account identifier               |
| `OPERATION_CONFLICT`  | Another operation pending        | Wait for previous operation to complete |

---

## Failed Job Investigation

### View Failed Jobs

```bash
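# List failed job IDs
redis-cli ZRANGE "bull:order-provisioning:failed" 0 -1

# Get job details (replace {job-id} with an ID from the list above)
redis-cli HGETALL "bull:order-provisioning:{job-id}"

# Show only the failure reason
redis-cli HGET "bull:order-provisioning:{job-id}" failedReason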
```

### Common Investigation Steps

1. **Check job data**: Identify the order/subscription involved
2. **Check error message**: Look for specific failure reason
3. **Check external system**: Verify Salesforce/WHMCS/Freebit status
4. **Check logs**: Search BFF logs for job ID or order ID
5. **Determine if retryable**: Some errors are permanent (missing mapping), others are transient (network timeout)
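The retryable-or-not decision in step 5 can be sketched as a lookup over the error codes listed in this runbook. The split below is an assumption made for illustration (errors that need a manual fix are treated as permanent); confirm it against the actual processor behavior before relying on it.

```typescript
type FailureKind = "permanent" | "transient" | "unknown";

// ASSUMED classification of the error codes documented in this runbook.
const PERMANENT_CODES = [
  "PAYMENT_METHOD_MISSING",
  "ORDER_NOT_FOUND",
  "MAPPING_ERROR",
  "FREEBIT_AUTH_FAILED",
  "ACCOUNT_NOT_FOUND",
];
const TRANSIENT_CODES = ["WHMCS_ERROR", "OPERATION_CONFLICT"];

// Looks for a known error code anywhere in the job's failure reason.
function classifyFailure(failedReason: string): FailureKind {
  if (PERMANENT_CODES.some((code) => failedReason.indexOf(code) !== -1)) {
    return "permanent";
  }
  if (TRANSIENT_CODES.some((code) => failedReason.indexOf(code) !== -1)) {
    return "transient";
  }
  return "unknown";
}
```

Jobs classified as `"permanent"` should be retried only after the underlying data or credentials have been fixed; `"unknown"` results need a manual look at the logs.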

### Log Search

```bash
# Search logs for specific order
grep "8014x000000ABCDXYZ" /var/log/bff/combined.log

# Search for queue processing errors
grep "provisioning" /var/log/bff/error.log | tail -50
```

---

## Manual Retry Procedures

### Retry a Single Failed Job

```typescript
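// Using the BullMQ API in Node.js
import { Queue } from "bullmq";
import IORedis from "ioredis";

// Connect to the same Redis instance the BFF uses
const redisConnection = new IORedis(process.env.REDIS_URL!);
const queue = new Queue("order-provisioning", { connection: redisConnection });

const job = await queue.getJob("job-id"); // replace with the real job ID
if (!job) throw new Error("Job not found; it may already have been removed");
await job.retry(); // moves the job from failed back to wait
await redisConnection.quit();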
```

### Retry All Failed Jobs

```bash
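# Move all failed jobs back to waiting.
# Caution: this edits BullMQ's keys directly and skips its own bookkeeping
# (job state fields, events), so prefer job.retry() where possible.
redis-cli ZRANGEBYSCORE "bull:order-provisioning:failed" -inf +inf | while read -r jobId; do
  redis-cli LPUSH "bull:order-provisioning:wait" "$jobId"
  redis-cli ZREM "bull:order-provisioning:failed" "$jobId"
done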
```

> **Warning**: Retry jobs only after fixing the root cause; otherwise they will fail again for the same reason.
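A bulk retry can also go through the BullMQ API instead of raw Redis commands, which keeps job state consistent. The sketch below uses minimal structural interfaces so it can be exercised without Redis; BullMQ's `Queue#getFailed()` and `Job#retry()` are compatible with these shapes. `retryAllFailed` is a hypothetical helper, not an existing BFF function.

```typescript
// Minimal shapes matching the parts of BullMQ's Job and Queue used here.
interface RetryableJob {
  id?: string;
  retry(): Promise<void>;
}

interface FailedJobSource {
  getFailed(start?: number, end?: number): Promise<RetryableJob[]>;
}

// Retries every failed job and collects the ones that could not be retried.
async function retryAllFailed(
  queue: FailedJobSource
): Promise<{ retried: number; errors: string[] }> {
  const jobs = await queue.getFailed(0, -1);
  let retried = 0;
  const errors: string[] = [];
  for (const job of jobs) {
    try {
      await job.retry();
      retried++;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${job.id ?? "unknown"}: ${message}`);
    }
  }
  return { retried, errors };
}
```

In practice the `queue` argument would be a `Queue` instance created as in the single-job example above. Collecting errors rather than aborting on the first one lets the remaining jobs proceed.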

### Retry via Salesforce (Recommended for Provisioning)

For order provisioning, the recommended retry method is through Salesforce:

1. Open the Order in Salesforce
2. Clear error fields (`Activation_Error__c`, `Activation_Error_DateTime__c`)
3. Set `Activation_Status__c` back to "Activating"
4. The Record-Triggered Flow will publish a new Platform Event

This approach ensures proper idempotency tracking and leaves an audit trail.

---

## Clearing Stuck Jobs

### Clear All Jobs from a Queue

> **Warning**: This removes all jobs including pending work. Use only in emergencies.

```bash
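# Clear all queue state lists and sets.
# Note: per-job hashes (bull:order-provisioning:{job-id}) are not removed by this command.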
redis-cli DEL \
  "bull:order-provisioning:wait" \
  "bull:order-provisioning:active" \
  "bull:order-provisioning:delayed" \
  "bull:order-provisioning:completed" \
  "bull:order-provisioning:failed"
```

### Clear Old Completed/Failed Jobs

```bash
# Scores are timestamps in milliseconds; `date -d` requires GNU date (Linux).
# Remove jobs older than 7 days from completed
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:completed" -inf "$(date -d '7 days ago' +%s000)"

# Remove jobs older than 30 days from failed
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:failed" -inf "$(date -d '30 days ago' +%s000)"
```
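Where GNU `date` is not available (for example on macOS), the same millisecond cutoffs can be computed in a script. This is a small sketch; `cutoffMs` and `retentionCutoffs` are illustrative names, and the 7-day and 30-day windows are taken from the commands above.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Millisecond timestamp for "N days before now", usable as a ZREMRANGEBYSCORE max.
function cutoffMs(days: number, now: number = Date.now()): number {
  return now - days * DAY_MS;
}

// Cutoffs matching the retention windows used in this runbook.
function retentionCutoffs(now: number = Date.now()): { completed: number; failed: number } {
  return { completed: cutoffMs(7, now), failed: cutoffMs(30, now) };
}
```

Passing `now` explicitly keeps the helpers deterministic, which makes them easy to check in isolation.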

---

## Queue Backlog Handling

### Symptoms of Backlog

- Wait queue depth increasing
- Jobs not being processed
- Customer orders stuck in "Activating" status

### Diagnosis

1. **Check processor is running**

   ```bash
   grep "BullMQ" /var/log/bff/combined.log | tail -20
   ```

2. **Check Redis connectivity**

   ```bash
   redis-cli PING
   ```

3. **Check for blocked jobs**

   ```bash
   redis-cli LLEN "bull:order-provisioning:active"
   # If the count stays above 0 for an extended time, jobs may be stuck
   ```

4. **Check external dependencies**
   - Salesforce API
   - WHMCS API

### Resolution

1. **Restart BFF** to reconnect queue workers
2. **Clear stuck active jobs** if processor crashed mid-job
3. **Scale horizontally** if queue depth is due to high volume
4. **Fix root cause** if jobs are failing repeatedly

---

## Alerting Configuration

### Recommended Alerts

| Alert                  | Condition                                        | Severity |
| ---------------------- | ------------------------------------------------ | -------- |
| Queue Backlog          | Wait queue > 10 for > 5 minutes                  | Warning  |
| Queue Backlog Critical | Wait queue > 50                                  | Critical |
| Failed Jobs Spike      | > 5 failures in 15 minutes                       | Warning  |
| Processor Down         | No job processed in 10 minutes with jobs waiting | Critical |
| Job Timeout            | Job active for > 5 minutes                       | Warning  |
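The two time-based alerts in the table above (Processor Down and Job Timeout) can be expressed as simple predicates over a queue snapshot. This is a sketch under assumptions: the `QueueSnapshot` shape is hypothetical, and how the timestamps are collected depends on the monitoring stack in use.

```typescript
const MINUTE_MS = 60 * 1000;

// Hypothetical snapshot gathered by a monitoring script.
interface QueueSnapshot {
  waitDepth: number;
  lastProcessedAt: number; // epoch ms of the most recently processed job
  oldestActiveStartedAt?: number; // epoch ms; undefined when nothing is active
}

// Processor Down: jobs are waiting, but nothing was processed for 10 minutes.
function isProcessorDown(snapshot: QueueSnapshot, now: number): boolean {
  return snapshot.waitDepth > 0 && now - snapshot.lastProcessedAt >= 10 * MINUTE_MS;
}

// Job Timeout: a job has been active for more than 5 minutes.
function hasJobTimeout(snapshot: QueueSnapshot, now: number): boolean {
  return (
    snapshot.oldestActiveStartedAt !== undefined &&
    now - snapshot.oldestActiveStartedAt > 5 * MINUTE_MS
  );
}
```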

### Monitoring Queries

```bash
# Check queue depths (for monitoring script)
WAIT=$(redis-cli LLEN "bull:order-provisioning:wait")
ACTIVE=$(redis-cli LLEN "bull:order-provisioning:active")
FAILED=$(redis-cli ZCARD "bull:order-provisioning:failed")

echo "Wait: $WAIT, Active: $ACTIVE, Failed: $FAILED"
```

---

## Best Practices

### Job Design

- Include sufficient context in job data for debugging
- Use idempotency keys to prevent duplicate processing
- Keep job payloads small (< 10KB)
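The payload size guideline can be checked before enqueueing. The sketch below measures the UTF-8 size of the JSON-serialized data; reading "10KB" as 10 × 1024 bytes is an assumption, and both helper names are illustrative.

```typescript
const MAX_PAYLOAD_BYTES = 10 * 1024; // "10KB" interpreted as 10 KiB (assumption)

// Size in bytes of the job data once serialized as UTF-8 JSON.
function payloadSizeBytes(data: unknown): number {
  return new TextEncoder().encode(JSON.stringify(data)).length;
}

function isPayloadSmallEnough(data: unknown): boolean {
  return payloadSizeBytes(data) < MAX_PAYLOAD_BYTES;
}
```

Measuring bytes rather than string length matters for non-ASCII content, where one character can occupy several bytes.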

### Error Handling

- Distinguish between retryable and non-retryable errors
- Log sufficient context before throwing
- Update external systems with error status on final failure

### Monitoring

- Set up alerts for queue depth and failure rate
- Monitor job processing duration
- Track success/failure ratios over time

---

## Related Documents

- [Incident Response](./incident-response.md)
- [Provisioning Runbook](./provisioning-runbook.md)
- [External Dependencies](./external-dependencies.md)
- [SIM State Machine](../integrations/sim/state-machine.md)

---

**Last Updated:** December 2025