# Queue Management Runbook

This document covers monitoring and management of BullMQ job queues used by the Customer Portal BFF.

---

## Overview

The BFF uses BullMQ (backed by Redis) for asynchronous job processing:

| Queue                | Purpose                                       | Processor Location                                   |
| -------------------- | --------------------------------------------- | ---------------------------------------------------- |
| `order-provisioning` | Order fulfillment after CS approval           | `apps/bff/src/modules/orders/queue/`                 |
| `sim-management`     | Delayed SIM operations (network type changes) | `apps/bff/src/modules/subscriptions/sim-management/` |

---

## Queue Configuration

### Environment Variables

| Variable                 | Description                                 | Default  |
| ------------------------ | ------------------------------------------- | -------- |
| `REDIS_URL`              | Redis connection for queues                 | Required |
| `QUEUE_DEFAULT_ATTEMPTS` | Total attempts per job (first run included) | 3        |
| `QUEUE_BACKOFF_DELAY`    | Base backoff delay between retries (ms)     | 5000     |

### Queue Options

```typescript
// Default queue configuration
{
  defaultJobOptions: {
    attempts: 3,
    backoff: {
      type: 'exponential',
      delay: 5000,
    },
    removeOnComplete: 100,  // Keep last 100 completed jobs
    removeOnFail: 500,      // Keep last 500 failed jobs
  }
}
```
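
This runbook does not show how the two environment variables feed these options. A minimal sketch, assuming the BFF reads them at startup and falls back to the documented defaults; the helper names `buildDefaultJobOptions` and `parsePositiveInt` and the `env` parameter are hypothetical, not taken from the codebase:

```typescript
// Hypothetical helper: maps the documented env vars onto the default job options.
interface DefaultJobOptions {
  attempts: number;
  backoff: { type: "exponential"; delay: number };
  removeOnComplete: number;
  removeOnFail: number;
}

// Returns the parsed value when it is a positive integer, otherwise the fallback.
function parsePositiveInt(raw: string | undefined, fallback: number): number {
  const parsed = parseInt(raw ?? "", 10);
  return parsed > 0 ? parsed : fallback;
}

function buildDefaultJobOptions(
  env: Record<string, string | undefined>
): DefaultJobOptions {
  return {
    attempts: parsePositiveInt(env["QUEUE_DEFAULT_ATTEMPTS"], 3),
    backoff: {
      type: "exponential",
      delay: parsePositiveInt(env["QUEUE_BACKOFF_DELAY"], 5000),
    },
    removeOnComplete: 100, // keep last 100 completed jobs
    removeOnFail: 500, // keep last 500 failed jobs
  };
}
```

Falling back to the default on a missing or malformed value keeps a bad deployment variable from silently disabling retries.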

---

## Monitoring

### Check Queue Status

```bash
# List queue keys (SCAN-based, so it does not block Redis the way KEYS can on a large keyspace)
redis-cli --scan --pattern "bull:*"

# Check specific queue length
redis-cli LLEN "bull:order-provisioning:wait"
redis-cli LLEN "bull:order-provisioning:active"
redis-cli ZCARD "bull:order-provisioning:delayed"
redis-cli ZCARD "bull:order-provisioning:failed"
```

### Queue Key Structure

| Key Pattern              | Description                         |
| ------------------------ | ----------------------------------- |
| `bull:{queue}:wait`      | Jobs waiting to be processed        |
| `bull:{queue}:active`    | Jobs currently being processed      |
| `bull:{queue}:delayed`   | Jobs scheduled for future execution |
| `bull:{queue}:completed` | Recently completed jobs             |
| `bull:{queue}:failed`    | Failed jobs                         |

### Health Metrics

| Metric           | Warning | Critical | Action                      |
| ---------------- | ------- | -------- | --------------------------- |
| Wait queue depth | >10     | >50      | Check processor status      |
| Failed job count | >5      | >20      | Investigate failures        |
| Processing time  | >30s    | >60s     | Check external dependencies |
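
For a monitoring script, the thresholds in the table can be encoded once and reused. The sketch below is illustrative only: the metric keys (`waitDepth`, `failedCount`, `processingSeconds`) and the `classify` function are hypothetical names, and the numbers are copied from the table above.

```typescript
type Severity = "ok" | "warning" | "critical";

// Thresholds copied from the Health Metrics table (processing time in seconds).
const THRESHOLDS = {
  waitDepth: { warning: 10, critical: 50 },
  failedCount: { warning: 5, critical: 20 },
  processingSeconds: { warning: 30, critical: 60 },
} as const;

type Metric = keyof typeof THRESHOLDS;

// A value must be strictly greater than a threshold to trigger it,
// matching the ">10" / ">50" notation in the table.
function classify(metric: Metric, value: number): Severity {
  const t = THRESHOLDS[metric];
  if (value > t.critical) return "critical";
  if (value > t.warning) return "warning";
  return "ok";
}
```

For example, `classify("waitDepth", 12)` yields `"warning"`, while a depth of exactly 10 is still `"ok"`.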

---

## Order Provisioning Queue

### Purpose

Processes orders after CS approval via Salesforce Platform Events.

### Flow

```
Salesforce Platform Event (OrderProvisionRequested__e)
    ↓
Event Subscriber receives event
    ↓
Job enqueued to 'order-provisioning' queue
    ↓
Processor executes fulfillment workflow
    ↓
Order created in WHMCS + Salesforce updated
```

### Job Data Structure

```typescript
{
  sfOrderId: "8014x000000ABCDXYZ",  // Salesforce Order ID
  idempotencyKey: "8014x...-1703123456789",
  eventPayload: { ... }  // Original Platform Event data
}
```
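
When inspecting or replaying jobs by hand, it can help to validate the payload shape first. The following is a sketch under assumptions: the field names come from the example above, but the field types and the names `OrderProvisioningJobData` and `isOrderProvisioningJobData` are inferred for illustration and may not match the BFF's actual types.

```typescript
// Shape inferred from the example payload; the contents of eventPayload are
// not documented here, so it is only checked for being a non-null object.
interface OrderProvisioningJobData {
  sfOrderId: string;
  idempotencyKey: string;
  eventPayload: Record<string, unknown>;
}

function isOrderProvisioningJobData(
  value: unknown
): value is OrderProvisioningJobData {
  if (typeof value !== "object" || value === null) return false;
  const { sfOrderId, idempotencyKey, eventPayload } = value as Record<string, unknown>;
  return (
    typeof sfOrderId === "string" &&
    sfOrderId.length > 0 &&
    typeof idempotencyKey === "string" &&
    idempotencyKey.length > 0 &&
    typeof eventPayload === "object" &&
    eventPayload !== null
  );
}
```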

### Common Failure Reasons

| Error                    | Cause                          | Resolution                                       |
| ------------------------ | ------------------------------ | ------------------------------------------------ |
| `PAYMENT_METHOD_MISSING` | Customer has no payment method | Customer must add payment method in WHMCS        |
| `ORDER_NOT_FOUND`        | Salesforce Order doesn't exist | Check Order ID, verify not deleted               |
| `MAPPING_ERROR`          | Product mapping missing        | Add `WH_Product_ID__c` to Product2 in Salesforce |
| `WHMCS_ERROR`            | WHMCS API failure              | Check WHMCS connectivity and logs                |

### Retry Behavior

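- **Attempts**: 3 total (1 initial + 2 retries)
- **Backoff**: Exponential (5s before the first retry, 10s before the second); a 20s delay would only apply if a fourth attempt were configured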
- **On Final Failure**: Salesforce Order updated with error details
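
The retry schedule can be worked out from the configured values. This sketch assumes BullMQ's built-in exponential strategy without jitter, where the delay before retry `n` is `baseDelay * 2^(n - 1)`; the function name `retryDelays` is hypothetical.

```typescript
// Returns the delay (ms) before each retry. With `attempts` total attempts
// there are `attempts - 1` retries, so `attempts - 1` delays.
function retryDelays(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * Math.pow(2, retry - 1));
  }
  return delays;
}

// With the documented defaults (3 attempts, 5000 ms base delay):
const defaultSchedule = retryDelays(3, 5000); // [5000, 10000]
```

A job that fails every attempt therefore reaches its final failure roughly 15 seconds plus processing time after the first attempt starts.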

---

## SIM Management Queue

### Purpose

Handles delayed SIM operations, particularly network type changes that require a 30-minute gap.

### Job Types

| Job Type            | Delay      | Description                   |
| ------------------- | ---------- | ----------------------------- |
| `networkTypeChange` | 30 minutes | Change between 4G/5G networks |

### Job Data Structure

```typescript
{
  subscriptionId: 29951,
  simAccount: "08077052946",
  operation: "networkTypeChange",
  params: {
    networkType: "5G"
  },
  scheduledAt: "2024-01-15T10:30:00Z"
}
```
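
BullMQ expresses a delayed job as a `delay` in milliseconds from enqueue time. A minimal sketch of deriving that value, assuming `scheduledAt` is the intended execution time (this runbook does not state whether it is the execution time or the time the request was made); `delayUntil` is a hypothetical helper:

```typescript
const NETWORK_TYPE_CHANGE_GAP_MS = 30 * 60 * 1000; // the documented 30-minute gap

// Milliseconds to wait from `nowMs` until `scheduledAt`; never negative,
// so a job whose time has already passed runs immediately.
function delayUntil(scheduledAt: string, nowMs: number): number {
  const target = Date.parse(scheduledAt);
  if (isNaN(target)) {
    throw new Error(`Invalid scheduledAt: ${scheduledAt}`);
  }
  return Math.max(0, target - nowMs);
}
```

For instance, a job enqueued at 10:00:00Z with `scheduledAt` of 10:30:00Z gets a delay equal to `NETWORK_TYPE_CHANGE_GAP_MS`.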

### Common Failure Reasons

| Error                 | Cause                            | Resolution                              |
| --------------------- | -------------------------------- | --------------------------------------- |
| `FREEBIT_AUTH_FAILED` | Freebit authentication error     | Check OEM credentials                   |
| `ACCOUNT_NOT_FOUND`   | SIM account not found in Freebit | Verify account identifier               |
| `OPERATION_CONFLICT`  | Another operation pending        | Wait for previous operation to complete |

---

## Failed Job Investigation

### View Failed Jobs

```bash
# List failed jobs (using Redis CLI)
redis-cli ZRANGE "bull:order-provisioning:failed" 0 -1

# Get job details (replace {job-id} with an ID returned by the command above)
redis-cli HGETALL "bull:order-provisioning:{job-id}"
```

### Common Investigation Steps

1. **Check job data**: Identify the order/subscription involved
2. **Check error message**: Look for specific failure reason
3. **Check external system**: Verify Salesforce/WHMCS/Freebit status
4. **Check logs**: Search BFF logs for job ID or order ID
5. **Determine if retryable**: Some errors are permanent (missing mapping), others are transient (network timeout)
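
Step 5 can be captured as a small lookup. In the sketch below only `MAPPING_ERROR` is described as permanent by this runbook; treating the other listed codes as permanent or transient is an assumption that the team should review, and `isRetryable` is a hypothetical helper.

```typescript
// Error codes that will fail again until someone fixes data or configuration.
// Only MAPPING_ERROR is stated as permanent in this runbook; the rest of this
// list is an assumption based on the "Resolution" columns.
const PERMANENT_ERRORS: string[] = [
  "MAPPING_ERROR",
  "ORDER_NOT_FOUND",
  "PAYMENT_METHOD_MISSING",
  "ACCOUNT_NOT_FOUND",
];

// Unknown codes default to retryable so that transient faults
// (for example a network timeout) are not dropped by mistake.
function isRetryable(errorCode: string): boolean {
  return PERMANENT_ERRORS.indexOf(errorCode) === -1;
}
```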

### Log Search

```bash
# Search logs for specific order
grep "8014x000000ABCDXYZ" /var/log/bff/combined.log

# Search for queue processing errors
grep "provisioning" /var/log/bff/error.log | tail -50
```

---

## Manual Retry Procedures

### Retry a Single Failed Job

```typescript
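// Using the BullMQ API in Node.js
import { Queue } from "bullmq";

// redisConnection: the same Redis connection options the BFF uses (see REDIS_URL)
const queue = new Queue("order-provisioning", { connection: redisConnection });
const job = await queue.getJob("job-id"); // replace with the real job ID
if (!job) {
  throw new Error("Job not found (it may already have been removed)");
}
await job.retry(); // moves the job from failed back to waiting
await queue.close();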
```

### Retry All Failed Jobs

```bash
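# Move all failed jobs back to waiting.
# Last resort: this edits BullMQ's keys directly and does not touch per-job
# state (attempt count, failure reason). Prefer job.retry() where possible.
redis-cli ZRANGEBYSCORE "bull:order-provisioning:failed" -inf +inf | while read -r jobId; do
  redis-cli LPUSH "bull:order-provisioning:wait" "$jobId"
  redis-cli ZREM "bull:order-provisioning:failed" "$jobId"
done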
```

> **Warning**: Only retry jobs after fixing the root cause. Retrying without fixing will cause the same failure.

### Retry via Salesforce (Recommended for Provisioning)

For order provisioning, the recommended retry method is through Salesforce:

1. Open the Order in Salesforce
2. Clear error fields (`Activation_Error__c`, `Activation_Error_DateTime__c`)
3. Set `Activation_Status__c` back to "Activating"
4. The Record-Triggered Flow will publish a new Platform Event

This approach ensures proper idempotency tracking and audit trail.

---

## Clearing Stuck Jobs

### Clear All Jobs from a Queue

> **Warning**: This removes all jobs including pending work. Use only in emergencies.

```bash
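# Clear the queue's state lists and sets.
# Per-job hashes (bull:order-provisioning:<id>) are not deleted by this command.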
redis-cli DEL \
  "bull:order-provisioning:wait" \
  "bull:order-provisioning:active" \
  "bull:order-provisioning:delayed" \
  "bull:order-provisioning:completed" \
  "bull:order-provisioning:failed"
```

### Clear Old Completed/Failed Jobs

```bash
# Remove jobs older than 7 days from completed
# (GNU date syntax; scores in these sets are millisecond timestamps)
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:completed" -inf "$(date -d '7 days ago' +%s000)"

# Remove jobs older than 30 days from failed
redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:failed" -inf "$(date -d '30 days ago' +%s000)"
```

---

## Queue Backlog Handling

### Symptoms of Backlog

- Wait queue depth increasing
- Jobs not being processed
- Customer orders stuck in "Activating" status

### Diagnosis

1. **Check processor is running**

   ```bash
   grep "BullMQ" /var/log/bff/combined.log | tail -20
   ```

2. **Check Redis connectivity**

   ```bash
   redis-cli PING
   ```

3. **Check for blocked jobs**

   ```bash
   redis-cli LLEN "bull:order-provisioning:active"
   # If active > 0 for extended time, jobs may be stuck
   ```

4. **Check external dependencies**
   - Salesforce API
   - WHMCS API
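
To make "active for an extended time" in step 3 concrete, the check below compares each active job's `processedOn` timestamp (the millisecond time BullMQ records when a worker picks the job up) against a limit. It is a sketch: the `ActiveJob` shape and `findStuckJobs` are hypothetical, and the 5-minute limit is borrowed from the Job Timeout alert in this runbook.

```typescript
interface ActiveJob {
  id: string;
  processedOn: number; // epoch ms when a worker started the job
}

const MAX_ACTIVE_MS = 5 * 60 * 1000;

// Returns the IDs of jobs that have been active longer than `maxActiveMs`.
function findStuckJobs(
  jobs: ActiveJob[],
  nowMs: number,
  maxActiveMs: number = MAX_ACTIVE_MS
): string[] {
  return jobs
    .filter((job) => nowMs - job.processedOn > maxActiveMs)
    .map((job) => job.id);
}
```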

### Resolution

1. **Restart BFF** to reconnect queue workers
2. **Clear stuck active jobs** if processor crashed mid-job
3. **Scale horizontally** if queue depth is due to high volume
4. **Fix root cause** if jobs are failing repeatedly

---

## Alerting Configuration

### Recommended Alerts

| Alert                  | Condition                                        | Severity |
| ---------------------- | ------------------------------------------------ | -------- |
| Queue Backlog          | Wait queue > 10 for > 5 minutes                  | Warning  |
| Queue Backlog Critical | Wait queue > 50                                  | Critical |
| Failed Jobs Spike      | > 5 failures in 15 minutes                       | Warning  |
| Processor Down         | No job processed in 10 minutes with jobs waiting | Critical |
| Job Timeout            | Job active for > 5 minutes                       | Warning  |
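
The "for > 5 minutes" qualifier on the Queue Backlog alert means a single high reading should not fire it. One way to evaluate a sustained breach from periodic depth samples is sketched below; the `DepthSample` shape and `backlogSustained` are hypothetical, and the defaults mirror the Warning row of the table.

```typescript
interface DepthSample {
  atMs: number; // epoch ms when the sample was taken
  waitDepth: number; // LLEN of the wait list at that time
}

// True when the most recent unbroken run of samples above `threshold`
// started more than `windowMs` ago.
function backlogSustained(
  samples: DepthSample[],
  nowMs: number,
  threshold: number = 10,
  windowMs: number = 5 * 60 * 1000
): boolean {
  const newestFirst = samples.slice().sort((a, b) => b.atMs - a.atMs);
  let breachStart: number | undefined;
  for (const sample of newestFirst) {
    if (sample.waitDepth <= threshold) break; // run of breaches ends here
    breachStart = sample.atMs;
  }
  return breachStart !== undefined && nowMs - breachStart > windowMs;
}
```

A dip back to the threshold or below resets the run, so the alert only fires while the backlog is continuously high.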

### Monitoring Queries

```bash
# Check queue depths (for monitoring script)
WAIT=$(redis-cli LLEN "bull:order-provisioning:wait")
ACTIVE=$(redis-cli LLEN "bull:order-provisioning:active")
FAILED=$(redis-cli ZCARD "bull:order-provisioning:failed")

echo "Wait: $WAIT, Active: $ACTIVE, Failed: $FAILED"
```

---

## Best Practices

### Job Design

- Include sufficient context in job data for debugging
- Use idempotency keys to prevent duplicate processing
- Keep job payloads small (< 10KB)
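
The payload size guideline can be enforced before enqueueing. A sketch, assuming "10KB" means 10 × 1024 bytes of UTF-8 encoded JSON; `payloadBytes` and `fitsPayloadLimit` are hypothetical helpers:

```typescript
const MAX_PAYLOAD_BYTES = 10 * 1024; // assumption: 10KB = 10 KiB

// UTF-8 byte length of the JSON form of `data`. encodeURIComponent turns each
// byte that needs escaping into one %XX escape and leaves the remaining
// single-byte characters as-is, so collapsing every escape to one character
// gives the byte count.
// (In Node.js, Buffer.byteLength(json, "utf8") gives the same result.)
function payloadBytes(data: object): number {
  const json = JSON.stringify(data);
  return encodeURIComponent(json).replace(/%[0-9A-F]{2}/g, "x").length;
}

function fitsPayloadLimit(data: object): boolean {
  return payloadBytes(data) < MAX_PAYLOAD_BYTES;
}
```

Measuring bytes instead of string length matters for payloads containing non-ASCII text such as customer names.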

### Error Handling

- Distinguish between retryable and non-retryable errors
- Log sufficient context before throwing
- Update external systems with error status on final failure

### Monitoring

- Set up alerts for queue depth and failure rate
- Monitor job processing duration
- Track success/failure ratios over time

---

## Related Documents

- [Incident Response](./incident-response.md)
- [Provisioning Runbook](./provisioning-runbook.md)
- [External Dependencies](./external-dependencies.md)
- [SIM State Machine](../integrations/sim/state-machine.md)

---

**Last Updated:** December 2025