# Rate Limit Tuning Guide

This document covers rate limiting configuration, adjustment procedures, and troubleshooting for the Customer Portal.

---

## Rate Limiting Overview

The portal uses multiple rate limiting mechanisms:

| Type                      | Scope                              | Backend             | Purpose                     |
| ------------------------- | ---------------------------------- | ------------------- | --------------------------- |
| **Auth Rate Limiting**    | Per endpoint (login, signup, etc.) | Redis               | Prevent brute force attacks |
| **Global Rate Limiting**  | Per route/controller               | Redis               | API abuse prevention        |
| **Request Queues**        | Per external API                   | In-memory (p-queue) | External API protection     |
| **SSE Connection Limits** | Per user                           | In-memory           | Resource protection         |

---

## Authentication Rate Limits

### Configuration

| Endpoint             | Env Variable                      | Default     | Window |
| -------------------- | --------------------------------- | ----------- | ------ |
| Login                | `LOGIN_RATE_LIMIT_LIMIT`          | 5 attempts  | 15 min |
| Login (TTL)          | `LOGIN_RATE_LIMIT_TTL`            | 900000 ms   | -      |
| Signup               | `SIGNUP_RATE_LIMIT_LIMIT`         | 5 attempts  | 15 min |
| Signup (TTL)         | `SIGNUP_RATE_LIMIT_TTL`           | 900000 ms   | -      |
| Password Reset       | `PASSWORD_RESET_RATE_LIMIT_LIMIT` | 5 attempts  | 15 min |
| Password Reset (TTL) | `PASSWORD_RESET_RATE_LIMIT_TTL`   | 900000 ms   | -      |
| Token Refresh        | `AUTH_REFRESH_RATE_LIMIT_LIMIT`   | 10 attempts | 5 min  |
| Token Refresh (TTL)  | `AUTH_REFRESH_RATE_LIMIT_TTL`     | 300000 ms   | -      |
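
The table pairs each endpoint with a limit and a TTL, each with a default. A minimal sketch of how such values might be read, assuming a hypothetical `readAuthRateLimit` helper and an environment passed in as a plain object (the portal's actual config loader is not shown in this guide and may validate differently):

```typescript
// Hypothetical helper for illustration; not the portal's real config code.
type EnvMap = Record<string, string | undefined>;

interface AuthRateLimit {
  limit: number; // max attempts per window
  ttlMs: number; // window length in milliseconds
}

function readPositiveInt(env: EnvMap, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  // Anything that is not a positive integer falls back to the default.
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

function readAuthRateLimit(
  env: EnvMap,
  prefix: string, // e.g. "LOGIN", "SIGNUP", "PASSWORD_RESET", "AUTH_REFRESH"
  defaults: AuthRateLimit,
): AuthRateLimit {
  return {
    limit: readPositiveInt(env, `${prefix}_RATE_LIMIT_LIMIT`, defaults.limit),
    ttlMs: readPositiveInt(env, `${prefix}_RATE_LIMIT_TTL`, defaults.ttlMs),
  };
}

const loginLimit = readAuthRateLimit(
  { LOGIN_RATE_LIMIT_LIMIT: "10" },
  "LOGIN",
  { limit: 5, ttlMs: 900000 },
);
console.log(loginLimit); // { limit: 10, ttlMs: 900000 }
```

In this sketch, missing or invalid values fall back to the documented defaults, so a typo in `.env` would not silently disable a limit.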

### CAPTCHA Configuration

| Setting           | Env Variable                   | Default | Description                          |
| ----------------- | ------------------------------ | ------- | ------------------------------------ |
| CAPTCHA Threshold | `LOGIN_CAPTCHA_AFTER_ATTEMPTS` | 3       | Show CAPTCHA after N failed attempts |
| CAPTCHA Always On | `AUTH_CAPTCHA_ALWAYS_ON`       | false   | Require CAPTCHA for all logins       |
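
The two settings combine into a single decision per login attempt. A minimal sketch, assuming a hypothetical `requiresCaptcha` function and assuming the threshold is inclusive (CAPTCHA appears once the failed-attempt count reaches N); the portal's actual boundary behavior is not specified here:

```typescript
interface CaptchaConfig {
  afterAttempts: number; // LOGIN_CAPTCHA_AFTER_ATTEMPTS
  alwaysOn: boolean; // AUTH_CAPTCHA_ALWAYS_ON
}

// Hypothetical decision function; the inclusive threshold is an assumption.
function requiresCaptcha(failedAttempts: number, config: CaptchaConfig): boolean {
  if (config.alwaysOn) return true;
  return failedAttempts >= config.afterAttempts;
}

const captchaDefaults: CaptchaConfig = { afterAttempts: 3, alwaysOn: false };
console.log(requiresCaptcha(2, captchaDefaults)); // false
console.log(requiresCaptcha(3, captchaDefaults)); // true
```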

### Adjusting Auth Rate Limits

**In Production (requires restart):**

```bash
# Edit .env file
LOGIN_RATE_LIMIT_LIMIT=10        # Increase to 10 attempts
LOGIN_RATE_LIMIT_TTL=1800000     # Extend window to 30 minutes

# Restart backend
docker compose restart backend
```

**Temporary Reset via Redis (immediate, no restart):**

```bash
# Check current rate limit for a key
redis-cli GET "auth-login:<ip-hash>"

# Delete a rate limit record to allow immediate retry
redis-cli DEL "auth-login:<ip-hash>"
```

---

## Global API Rate Limits

### Configuration

Global rate limits are applied via the `@RateLimit` decorator:

```typescript
@RateLimit({ limit: 100, ttl: 60 })  // 100 requests per 60-second window (ttl is in seconds here; the auth *_TTL env vars are in milliseconds)
@Controller('invoices')
export class InvoicesController { ... }
```
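
To make the `limit`/`ttl` semantics concrete, here is a minimal in-memory sketch of a fixed-window counter. That the guard behind `@RateLimit` counts this way is an assumption; the real implementation is Redis-backed and may use a different algorithm. `FixedWindowLimiter` and `consume` are hypothetical names.

```typescript
interface WindowState {
  count: number;
  resetAtMs: number;
}

// Illustrative only: an in-memory stand-in for a Redis-backed counter.
class FixedWindowLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly ttlSeconds: number;

  constructor(limit: number, ttlSeconds: number) {
    this.limit = limit;
    this.ttlSeconds = ttlSeconds;
  }

  consume(key: string, nowMs: number): { allowed: boolean; remaining: number } {
    const current = this.windows.get(key);
    // Start a new window if none exists or the previous one has expired.
    if (current === undefined || nowMs >= current.resetAtMs) {
      this.windows.set(key, { count: 1, resetAtMs: nowMs + this.ttlSeconds * 1000 });
      return { allowed: true, remaining: this.limit - 1 };
    }
    current.count += 1;
    return {
      allowed: current.count <= this.limit,
      remaining: Math.max(0, this.limit - current.count),
    };
  }
}

const invoicesLimiter = new FixedWindowLimiter(100, 60);
console.log(invoicesLimiter.consume("invoices:client-a", 0)); // { allowed: true, remaining: 99 }
```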

### Common Rate Limit Settings

| Endpoint      | Limit | TTL | Notes                 |
| ------------- | ----- | --- | --------------------- |
| Invoices      | 100   | 60s | High-traffic endpoint |
| Subscriptions | 100   | 60s | High-traffic endpoint |
| Catalog       | 200   | 60s | Cached, higher limit  |
| Orders        | 50    | 60s | Write operations      |
| Profile       | 60    | 60s | Standard limit        |

### Adjusting Global Rate Limits

Global rate limits are defined in code. To adjust:

1. Modify the `@RateLimit` decorator in the controller
2. Deploy the change

```typescript
// Before
@RateLimit({ limit: 50, ttl: 60 })

// After (double the limit)
@RateLimit({ limit: 100, ttl: 60 })
```

---

## External API Request Queues

### WHMCS Queue Configuration

| Setting      | Env Variable               | Default | Description             |
| ------------ | -------------------------- | ------- | ----------------------- |
| Concurrency  | `WHMCS_QUEUE_CONCURRENCY`  | 15      | Max parallel requests   |
| Interval Cap | `WHMCS_QUEUE_INTERVAL_CAP` | 300     | Max requests per minute |
| Timeout      | `WHMCS_QUEUE_TIMEOUT_MS`   | 30000   | Request timeout (ms)    |

### Salesforce Queue Configuration

| Setting                  | Env Variable                  | Default | Description             |
| ------------------------ | ----------------------------- | ------- | ----------------------- |
| Standard Concurrency     | `SF_QUEUE_CONCURRENCY`        | 10      | Standard operations     |
| Long-Running Concurrency | `SF_LONG_RUNNING_CONCURRENCY` | 5       | Bulk operations         |
| Interval Cap             | `SF_QUEUE_INTERVAL_CAP`       | 200     | Max requests per minute |
| Timeout                  | `SF_QUEUE_TIMEOUT_MS`         | 30000   | Request timeout (ms)    |

### Adjusting Queue Limits

**Production Adjustment:**

```bash
# Edit .env file
WHMCS_QUEUE_CONCURRENCY=20      # Increase concurrent requests
WHMCS_QUEUE_INTERVAL_CAP=500    # Increase requests per minute

# Restart backend
docker compose restart backend
```
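
Before raising either value, it can help to estimate which one actually bounds throughput. A rough sketch, assuming the interval cap applies per minute (as the tables above state) and that `avgLatencyMs` is an average upstream response time you have measured; `maxRequestsPerMinute` is a hypothetical helper:

```typescript
// Upper bound on sustained throughput: the lower of the interval cap and
// what `concurrency` parallel slots can complete at the given latency.
function maxRequestsPerMinute(
  concurrency: number,
  intervalCap: number,
  avgLatencyMs: number,
): number {
  const concurrencyBound = (concurrency * 60000) / avgLatencyMs;
  return Math.min(intervalCap, concurrencyBound);
}

// WHMCS defaults with fast responses: the interval cap is the bottleneck.
console.log(maxRequestsPerMinute(15, 300, 500)); // 300
// WHMCS defaults with 6-second responses: concurrency is the bottleneck.
console.log(maxRequestsPerMinute(15, 300, 6000)); // 150
```

If the interval cap is already the binding constraint, raising concurrency alone would not be expected to increase throughput.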

### Queue Health Monitoring

```bash
# Check queue metrics
curl http://localhost:4000/health/queues | jq '.'

# Example output (JSON; values are illustrative):
{
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "queueSize": 0,
      "pendingRequests": 2,
      "failedRequests": 0
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ... },
    "dailyUsage": { "used": 5000, "limit": 15000 }
  }
}
```

---

## SSE Connection Limits

### Configuration

```typescript
// Per-user SSE connection limit (in-memory)
private readonly maxPerUser = 3;
```

This prevents a single user from opening unlimited SSE connections.

### Adjusting SSE Limits

This requires a code change in `realtime-connection-limiter.service.ts`:

```typescript
// Change from
private readonly maxPerUser = 3;

// To
private readonly maxPerUser = 5;
```
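
For context, a per-user in-memory limit of this kind can be sketched as a counter map with acquire and release steps. This is an illustration under assumptions, not the contents of `realtime-connection-limiter.service.ts`; `PerUserConnectionLimiter`, `tryAcquire`, and `release` are hypothetical names.

```typescript
class PerUserConnectionLimiter {
  private readonly counts = new Map<string, number>();
  private readonly maxPerUser: number;

  constructor(maxPerUser: number) {
    this.maxPerUser = maxPerUser;
  }

  // Returns false when the user already holds the maximum number of connections.
  tryAcquire(userId: string): boolean {
    const current = this.counts.get(userId) ?? 0;
    if (current >= this.maxPerUser) return false;
    this.counts.set(userId, current + 1);
    return true;
  }

  // Call when a connection closes so the slot becomes available again.
  release(userId: string): void {
    const current = this.counts.get(userId) ?? 0;
    if (current <= 1) {
      this.counts.delete(userId);
    } else {
      this.counts.set(userId, current - 1);
    }
  }
}

const sseLimiter = new PerUserConnectionLimiter(3);
console.log(sseLimiter.tryAcquire("user-1")); // true
```

Because the state lives in process memory, a limit like this applies per backend instance and resets on restart.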

---

## Bypassing Rate Limits for Testing

### Temporary Bypass via Redis

```bash
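# Clear all rate limit keys for testing
# (--scan iterates incrementally instead of blocking Redis the way KEYS does;
#  xargs -r is a GNU option that skips DEL when nothing matches)
redis-cli --scan --pattern "auth-*" | xargs -r redis-cli DEL
redis-cli --scan --pattern "rate-limit:*" | xargs -r redis-cli DEL

# Clear specific user's rate limit
redis-cli --scan --pattern "*<ip-or-user-identifier>*" | xargs -r redis-cli DEL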
```

### Using SkipRateLimit Decorator

For development/testing routes:

```typescript
@SkipRateLimit()
@Get('test-endpoint')
async testEndpoint() { ... }
```

### Environment-Based Bypass

Add a development bypass in configuration:

```bash
# In .env (development only!)
RATE_LIMIT_BYPASS_ENABLED=true
```

```typescript
// In guard
if (this.configService.get("RATE_LIMIT_BYPASS_ENABLED") === "true") {
  return true;
}
```

> **Warning**: Never enable bypass in production!
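
One way to make the warning harder to violate is to have the check itself refuse the bypass in production. A minimal sketch, assuming the conventional `NODE_ENV` variable is set to `production` in production deployments (an assumption; this guide does not say how the portal identifies its environment). `isBypassAllowed` is a hypothetical helper.

```typescript
type EnvMap = Record<string, string | undefined>;

// The bypass takes effect only when explicitly enabled AND not in production.
function isBypassAllowed(env: EnvMap): boolean {
  const requested = env["RATE_LIMIT_BYPASS_ENABLED"] === "true";
  const isProduction = env["NODE_ENV"] === "production"; // assumed convention
  return requested && !isProduction;
}

console.log(isBypassAllowed({ RATE_LIMIT_BYPASS_ENABLED: "true", NODE_ENV: "development" })); // true
console.log(isBypassAllowed({ RATE_LIMIT_BYPASS_ENABLED: "true", NODE_ENV: "production" })); // false
```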

---

## Signs of Rate Limit Issues

### User-Facing Symptoms

| Symptom                    | Possible Cause      | Investigation             |
| -------------------------- | ------------------- | ------------------------- |
| "Too many requests" errors | Rate limit exceeded | Check Redis keys, logs    |
| Login failures             | Auth rate limit     | Check `auth-login:*` keys |
| Slow API responses         | Queue backlog       | Check `/health/queues`    |
| 429 errors in logs         | Any rate limit      | Check logs for specifics  |

### Monitoring Indicators

| Metric            | Warning       | Critical | Action                   |
| ----------------- | ------------- | -------- | ------------------------ |
| 429 error rate    | >1%           | >5%      | Review rate limits       |
| Queue size        | >10           | >50      | Increase concurrency     |
| Average wait time | >1s           | >5s      | Scale or increase limits |
| CAPTCHA triggers  | Unusual spike | -        | Possible attack          |
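
The numeric thresholds in this table can be encoded for use in an alerting script. A minimal sketch, assuming a hypothetical `classify` helper and treating each threshold as strictly greater-than, as the `>` signs suggest:

```typescript
type Severity = "ok" | "warning" | "critical";

function classify(value: number, warningAbove: number, criticalAbove: number): Severity {
  if (value > criticalAbove) return "critical";
  if (value > warningAbove) return "warning";
  return "ok";
}

// Thresholds copied from the table above.
const alertThresholds = {
  errorRate429Percent: { warning: 1, critical: 5 },
  queueSize: { warning: 10, critical: 50 },
  avgWaitSeconds: { warning: 1, critical: 5 },
};

const queueSeverity = classify(
  12,
  alertThresholds.queueSize.warning,
  alertThresholds.queueSize.critical,
);
console.log(queueSeverity); // "warning"
```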

### Log Analysis

```bash
# Find rate limit exceeded events
grep "Rate limit exceeded" /var/log/bff/combined.log | tail -20

# Find 429 responses
grep '"statusCode":429' /var/log/bff/combined.log | tail -20

# Count rate limit events by path
grep "Rate limit exceeded" /var/log/bff/combined.log | \
  jq -r '.path' | sort | uniq -c | sort -rn
```
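
The same per-path count can be produced in TypeScript when `jq` is unavailable. This sketch assumes, as the commands above do, that each log line is a JSON object with a `path` field; the `message` field in the sample data is hypothetical, and `countRateLimitEventsByPath` is an illustrative name.

```typescript
function countRateLimitEventsByPath(logLines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of logLines) {
    if (!line.includes("Rate limit exceeded")) continue;
    let path: unknown;
    try {
      path = (JSON.parse(line) as { path?: unknown }).path;
    } catch {
      continue; // skip lines that are not valid JSON objects
    }
    if (typeof path !== "string") continue;
    counts.set(path, (counts.get(path) ?? 0) + 1);
  }
  return counts;
}

const sampleLines = [
  '{"message":"Rate limit exceeded","path":"/api/invoices"}',
  '{"message":"Rate limit exceeded","path":"/api/orders"}',
  '{"message":"Rate limit exceeded","path":"/api/invoices"}',
  '{"message":"Request completed","path":"/api/invoices"}',
];
console.log(countRateLimitEventsByPath(sampleLines).get("/api/invoices")); // 2
```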

---

## Troubleshooting

### Too Many 429 Errors

**Diagnosis:**

```bash
# Check which endpoints are rate limited
grep "Rate limit exceeded" /var/log/bff/combined.log | \
  jq -c '{path: .path, key: .key}' | head -20

# Check queue health
curl http://localhost:4000/health/queues
```

**Resolution:**

1. Identify the affected endpoint
2. Check whether the limit is appropriate for the observed traffic
3. Increase the limit if the traffic is legitimate
4. Add caching if the requests are repetitive

### Legitimate Users Being Blocked

**Diagnosis:**

```bash
# Check rate limit state for a specific key
# (--scan avoids blocking Redis the way KEYS can on large keyspaces)
redis-cli --scan --pattern "*<identifier>*"
redis-cli GET "auth-login:<hash>"
```

**Resolution:**

```bash
# Clear the user's rate limit record
redis-cli DEL "auth-login:<hash>"
```

### External API Rate Limit Violations

**WHMCS Rate Limiting:**

```bash
# Check queue metrics
curl http://localhost:4000/health/queues/whmcs

# Reduce concurrency if WHMCS is overloaded
# (set in .env, then restart the backend for the change to take effect)
WHMCS_QUEUE_CONCURRENCY=5
WHMCS_QUEUE_INTERVAL_CAP=100
```

**Salesforce API Limits:**

```bash
# Check daily API usage
curl http://localhost:4000/health/queues/salesforce | jq '.dailyUsage'

# If approaching limit, reduce requests
# Consider caching more data
```
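
"Approaching the limit" can be made explicit by turning `dailyUsage` into a percentage. A minimal sketch using the `used`/`limit` fields from the health output; the 80% default threshold and the helper names are assumptions for illustration:

```typescript
interface DailyUsage {
  used: number;
  limit: number;
}

function usagePercent(usage: DailyUsage): number {
  return (usage.used / usage.limit) * 100;
}

// The 80% default is an arbitrary illustrative threshold, not a portal setting.
function isApproachingLimit(usage: DailyUsage, thresholdPercent = 80): boolean {
  return usagePercent(usage) >= thresholdPercent;
}

const sfUsage: DailyUsage = { used: 5000, limit: 15000 };
console.log(Math.round(usagePercent(sfUsage))); // 33
console.log(isApproachingLimit(sfUsage)); // false
```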

### Redis Connection Issues

If rate limiting fails due to Redis:

```bash
# Check Redis connectivity
redis-cli PING

# The guard fails open on Redis errors (allows request)
# Check logs for "Rate limiter error - failing open"
```
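
The fail-open behavior described in the comments above can be sketched as a wrapper around the limit check. This version is synchronous for brevity (a real Redis-backed guard would be asynchronous), and `allowRequest` is a hypothetical name; only the log message is taken from this guide.

```typescript
// Runs the limit check; if the check itself throws (for example because
// Redis is unreachable), logs the error and allows the request.
function allowRequest(
  check: () => boolean,
  logError: (message: string, error: unknown) => void,
): boolean {
  try {
    return check();
  } catch (error) {
    logError("Rate limiter error - failing open", error);
    return true;
  }
}

const loggedMessages: string[] = [];
const decision = allowRequest(
  () => {
    throw new Error("simulated Redis outage");
  },
  (message) => loggedMessages.push(message),
);
console.log(decision, loggedMessages); // true [ 'Rate limiter error - failing open' ]
```

Failing open favors availability over protection: during a Redis outage no limits are enforced, which is why the log line is worth alerting on.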

---

## Best Practices

### Setting Rate Limits

1. **Start Conservative** - Begin with lower limits, increase as needed
2. **Monitor Before Adjusting** - Understand traffic patterns first
3. **Consider User Experience** - Limits should rarely impact normal use
4. **Document Changes** - Track why limits were adjusted

### Rate Limit Strategies

| Strategy   | Use Case                | Implementation         |
| ---------- | ----------------------- | ---------------------- |
| IP-based   | Anonymous endpoints     | Default behavior       |
| User-based | Authenticated endpoints | Include user ID in key |
| Combined   | Sensitive endpoints     | IP + User-Agent hash   |
| Tiered     | Different user classes  | Custom logic           |
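
The first three strategies differ only in how the Redis key is built. A minimal sketch, assuming a hypothetical `buildRateLimitKey` helper and a `prefix:<hash>` key shape like the `auth-login:<ip-hash>` keys shown earlier. The hash here is a simple FNV-1a-style function over UTF-16 code units, swapped in to keep the example dependency-free; it is not cryptographic and is not claimed to be what the portal uses.

```typescript
// Non-cryptographic stand-in hash (FNV-1a style); a real deployment would
// more likely use a cryptographic hash such as SHA-256.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

type KeyStrategy = "ip" | "user" | "combined";

interface RequestContext {
  ip: string;
  userAgent: string;
  userId?: string;
}

function buildRateLimitKey(prefix: string, strategy: KeyStrategy, ctx: RequestContext): string {
  switch (strategy) {
    case "ip":
      return `${prefix}:${fnv1a(ctx.ip)}`;
    case "user":
      if (ctx.userId === undefined) {
        throw new Error("user-based keys require an authenticated user");
      }
      return `${prefix}:user:${ctx.userId}`;
    case "combined":
      return `${prefix}:${fnv1a(`${ctx.ip}|${ctx.userAgent}`)}`;
  }
}

const ctx: RequestContext = { ip: "203.0.113.7", userAgent: "ExampleBrowser/1.0", userId: "42" };
console.log(buildRateLimitKey("rate-limit", "user", ctx)); // rate-limit:user:42
```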

### Performance Considerations

- **Redis Latency** - Keep Redis co-located with BFF
- **Key Expiration** - Use TTL to prevent Redis bloat
- **Fail Open** - Rate limiter allows requests if Redis fails
- **Logging** - Log blocked requests for analysis

---

## Rate Limit Response Headers

The BFF includes the widely used `X-RateLimit-*` headers along with the standard `Retry-After` header:

```http
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704110400
Retry-After: 60
```

Clients can use these to implement backoff.
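
A minimal client-side sketch of that backoff calculation follows. It assumes header names have been lower-cased into a plain object, that `Retry-After` carries a number of seconds (the HTTP-date form is not handled), and that `X-RateLimit-Reset` is a Unix timestamp in seconds, which the example value suggests but this guide does not state. `retryDelayMs` is a hypothetical helper.

```typescript
type HeaderMap = Record<string, string | undefined>;

// Returns how long to wait before retrying, in milliseconds,
// or undefined when neither header is usable.
function retryDelayMs(headers: HeaderMap, nowMs: number): number | undefined {
  const retryAfter = headers["retry-after"]?.trim();
  if (retryAfter !== undefined && /^\d+$/.test(retryAfter)) {
    return Number(retryAfter) * 1000;
  }
  const reset = headers["x-ratelimit-reset"]?.trim();
  if (reset !== undefined && /^\d+$/.test(reset)) {
    // Assumed to be Unix seconds; never return a negative delay.
    return Math.max(0, Number(reset) * 1000 - nowMs);
  }
  return undefined;
}

console.log(retryDelayMs({ "retry-after": "60" }, 0)); // 60000
console.log(retryDelayMs({ "x-ratelimit-reset": "1704110400" }, 1704110340000)); // 60000
```

Preferring `Retry-After` when present avoids depending on the client's clock being in sync with the server.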

---

## Related Documents

- [Incident Response](./incident-response.md)
- [Monitoring Setup](./monitoring-setup.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)

---

**Last Updated:** December 2025