# Rate Limit Tuning Guide
This document covers rate limiting configuration, adjustment procedures, and troubleshooting for the Customer Portal.
---
## Rate Limiting Overview
The portal uses four rate limiting mechanisms:
| Type | Scope | Backend | Purpose |
| ------------------------- | ---------------------------------- | ------------------- | --------------------------- |
| **Auth Rate Limiting** | Per endpoint (login, signup, etc.) | Redis | Prevent brute force attacks |
| **Global Rate Limiting** | Per route/controller | Redis | API abuse prevention |
| **Request Queues** | Per external API | In-memory (p-queue) | External API protection |
| **SSE Connection Limits** | Per user | In-memory | Resource protection |
---
## Authentication Rate Limits
### Configuration
| Endpoint | Env Variable | Default | Window |
| -------------------- | --------------------------------- | ----------- | ------ |
| Login | `LOGIN_RATE_LIMIT_LIMIT` | 5 attempts | 15 min |
| Login (TTL) | `LOGIN_RATE_LIMIT_TTL` | 900000 ms | - |
| Signup | `SIGNUP_RATE_LIMIT_LIMIT` | 5 attempts | 15 min |
| Signup (TTL) | `SIGNUP_RATE_LIMIT_TTL` | 900000 ms | - |
| Password Reset | `PASSWORD_RESET_RATE_LIMIT_LIMIT` | 5 attempts | 15 min |
| Password Reset (TTL) | `PASSWORD_RESET_RATE_LIMIT_TTL` | 900000 ms | - |
| Token Refresh | `AUTH_REFRESH_RATE_LIMIT_LIMIT` | 10 attempts | 5 min |
| Token Refresh (TTL) | `AUTH_REFRESH_RATE_LIMIT_TTL` | 300000 ms | - |
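
As an illustration of how these variables and defaults fit together, here is a minimal sketch of a loader. The function and type names (`loadAuthLimits`, `EnvMap`, `WindowLimit`) are hypothetical and not taken from the portal's code; only the env variable names and defaults come from the table above.

```typescript
// Hypothetical loader; names other than the env variables are illustrative.
type EnvMap = Record<string, string | undefined>;

interface WindowLimit {
  limit: number; // max attempts per window
  ttlMs: number; // window length in milliseconds
}

interface AuthLimits {
  login: WindowLimit;
  signup: WindowLimit;
  passwordReset: WindowLimit;
  refresh: WindowLimit;
}

// Parse a positive integer, falling back to the documented default.
function readPositiveInt(env: EnvMap, name: string, fallback: number): number {
  const parsed = Number.parseInt(env[name] ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

function windowLimit(env: EnvMap, prefix: string, limit: number, ttlMs: number): WindowLimit {
  return {
    limit: readPositiveInt(env, `${prefix}_RATE_LIMIT_LIMIT`, limit),
    ttlMs: readPositiveInt(env, `${prefix}_RATE_LIMIT_TTL`, ttlMs),
  };
}

function loadAuthLimits(env: EnvMap): AuthLimits {
  return {
    login: windowLimit(env, "LOGIN", 5, 900_000),
    signup: windowLimit(env, "SIGNUP", 5, 900_000),
    passwordReset: windowLimit(env, "PASSWORD_RESET", 5, 900_000),
    refresh: windowLimit(env, "AUTH_REFRESH", 10, 300_000),
  };
}

// Only the login limit is overridden; everything else keeps its default.
const authLimits = loadAuthLimits({ LOGIN_RATE_LIMIT_LIMIT: "10" });
```

Falling back to the default on a missing or malformed value is a design choice of this sketch; the real configuration layer may instead reject invalid values at startup.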
### CAPTCHA Configuration
| Setting | Env Variable | Default | Description |
| ----------------- | ------------------------------ | ------- | ------------------------------------ |
| CAPTCHA Threshold | `LOGIN_CAPTCHA_AFTER_ATTEMPTS` | 3 | Show CAPTCHA after N failed attempts |
| CAPTCHA Always On | `AUTH_CAPTCHA_ALWAYS_ON` | false | Require CAPTCHA for all logins |
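
The two CAPTCHA settings combine roughly as sketched below. This assumes the threshold is inclusive (the CAPTCHA appears once the failed-attempt count reaches N); `captchaRequired` and `CaptchaConfig` are hypothetical names used only for illustration.

```typescript
interface CaptchaConfig {
  afterAttempts: number; // LOGIN_CAPTCHA_AFTER_ATTEMPTS
  alwaysOn: boolean; // AUTH_CAPTCHA_ALWAYS_ON
}

// Assumption: the CAPTCHA is required once failed attempts reach the threshold.
function captchaRequired(failedAttempts: number, cfg: CaptchaConfig): boolean {
  return cfg.alwaysOn || failedAttempts >= cfg.afterAttempts;
}

const defaults: CaptchaConfig = { afterAttempts: 3, alwaysOn: false };
const afterTwoFailures = captchaRequired(2, defaults); // false
const afterThreeFailures = captchaRequired(3, defaults); // true
```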
### Adjusting Auth Rate Limits
**In Production (requires restart):**
```bash
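# Edit the .env file (comments on their own lines are safest across .env parsers)
# Increase to 10 attempts
LOGIN_RATE_LIMIT_LIMIT=10
# Extend window to 30 minutes (value in ms)
LOGIN_RATE_LIMIT_TTL=1800000
# Restart backend
docker compose restart backend
# If Compose injects these variables (env_file/environment), recreate the container instead:
# docker compose up -d --force-recreate backend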
```
**Reset a Counter via Redis (immediate, no restart):**
```bash
# Check current rate limit for a key
redis-cli GET "auth-login:<ip-hash>"
# Delete a rate limit record to allow immediate retry
redis-cli DEL "auth-login:<ip-hash>"
```
---
## Global API Rate Limits
### Configuration
Global rate limits are applied via the `@RateLimit` decorator. Note that `ttl` here is in seconds, whereas the auth `*_TTL` env variables are in milliseconds:
```typescript
@RateLimit({ limit: 100, ttl: 60 }) // 100 requests per minute
@Controller('invoices')
export class InvoicesController { ... }
```
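
To make the `limit`/`ttl` semantics concrete, here is a minimal in-memory sketch assuming a fixed-window counter. The real guard is Redis-backed and may use a different algorithm; `FixedWindowLimiter` is a hypothetical stand-in, not the portal's implementation.

```typescript
// In-memory stand-in for the Redis-backed counter; illustration only.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly ttlSeconds: number;
  private readonly windows = new Map<string, { count: number; resetAt: number }>();

  constructor(limit: number, ttlSeconds: number) {
    this.limit = limit;
    this.ttlSeconds = ttlSeconds;
  }

  // Returns true if the request is allowed, false if it should receive a 429.
  consume(key: string, nowMs: number): boolean {
    const current = this.windows.get(key);
    if (current === undefined || nowMs >= current.resetAt) {
      // First request in a new window: start counting from 1.
      this.windows.set(key, { count: 1, resetAt: nowMs + this.ttlSeconds * 1000 });
      return true;
    }
    current.count += 1;
    return current.count <= this.limit;
  }
}

// Equivalent of @RateLimit({ limit: 100, ttl: 60 }) for a single key.
const invoicesLimiter = new FixedWindowLimiter(100, 60);
```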
### Common Rate Limit Settings
| Endpoint | Limit | TTL | Notes |
| ------------- | ----- | --- | --------------------- |
| Invoices | 100 | 60s | High-traffic endpoint |
| Subscriptions | 100 | 60s | High-traffic endpoint |
| Catalog | 200 | 60s | Cached, higher limit |
| Orders | 50 | 60s | Write operations |
| Profile | 60 | 60s | Standard limit |
### Adjusting Global Rate Limits
Global rate limits are defined in code. To adjust:
1. Modify the `@RateLimit` decorator in the controller
2. Deploy the change
```typescript
// Before
@RateLimit({ limit: 50, ttl: 60 })
// After (double the limit)
@RateLimit({ limit: 100, ttl: 60 })
```
---
## External API Request Queues
### WHMCS Queue Configuration
| Setting | Env Variable | Default | Description |
| ------------ | -------------------------- | ------- | ----------------------- |
| Concurrency | `WHMCS_QUEUE_CONCURRENCY` | 15 | Max parallel requests |
| Interval Cap | `WHMCS_QUEUE_INTERVAL_CAP` | 300 | Max requests per minute |
| Timeout | `WHMCS_QUEUE_TIMEOUT_MS` | 30000 | Request timeout (ms) |
### Salesforce Queue Configuration
| Setting | Env Variable | Default | Description |
| ------------------------ | ----------------------------- | ------- | ----------------------- |
| Standard Concurrency | `SF_QUEUE_CONCURRENCY` | 10 | Standard operations |
| Long-Running Concurrency | `SF_LONG_RUNNING_CONCURRENCY` | 5 | Bulk operations |
| Interval Cap | `SF_QUEUE_INTERVAL_CAP` | 200 | Max requests per minute |
| Timeout | `SF_QUEUE_TIMEOUT_MS` | 30000 | Request timeout (ms) |
### Adjusting Queue Limits
**Production Adjustment:**
```bash
# Edit the .env file
# Increase concurrent requests
WHMCS_QUEUE_CONCURRENCY=20
# Increase requests per minute
WHMCS_QUEUE_INTERVAL_CAP=500
# Restart backend (use `docker compose up -d --force-recreate backend` if Compose injects the variables)
docker compose restart backend
```
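
Before raising either value, it can help to estimate which setting is actually the bottleneck. The sketch below is a rough back-of-the-envelope calculation, assuming the interval cap is per minute (as in the tables above) and that the average upstream latency is known; `maxRequestsPerMinute` is a hypothetical helper.

```typescript
// Rough capacity estimate for a queue bounded by concurrency and a per-minute cap.
function maxRequestsPerMinute(
  concurrency: number,
  intervalCap: number,
  avgLatencyMs: number,
): number {
  // Each slot can complete 60000 / avgLatencyMs requests per minute.
  const concurrencyBound = (concurrency * 60_000) / avgLatencyMs;
  return Math.min(intervalCap, concurrencyBound);
}

// Defaults with 2 s latency: the interval cap (300) is the binding limit.
const fastUpstream = maxRequestsPerMinute(15, 300, 2000);
// Defaults with 5 s latency: concurrency limits throughput to 180/min,
// so raising only the interval cap would change nothing.
const slowUpstream = maxRequestsPerMinute(15, 300, 5000);
```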
### Queue Health Monitoring
```bash
# Check queue metrics
curl http://localhost:4000/health/queues | jq '.'
# Expected output:
{
  "whmcs": {
    "health": "healthy",
    "metrics": {
      "queueSize": 0,
      "pendingRequests": 2,
      "failedRequests": 0
    }
  },
  "salesforce": {
    "health": "healthy",
    "metrics": { ... },
    "dailyUsage": { "used": 5000, "limit": 15000 }
  }
}
```
---
## SSE Connection Limits
### Configuration
```typescript
// Per-user SSE connection limit (in-memory)
private readonly maxPerUser = 3;
```
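This caps each user at three concurrent SSE connections. Because the counter is held in memory, the cap applies per backend instance rather than across the whole deployment.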
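
A per-user cap of this kind can be sketched as follows. This is a simplified illustration, not the contents of `realtime-connection-limiter.service.ts`; the class and method names are hypothetical.

```typescript
class ConnectionLimiter {
  private readonly maxPerUser: number;
  private readonly counts = new Map<string, number>();

  constructor(maxPerUser = 3) {
    this.maxPerUser = maxPerUser;
  }

  // Try to register a new connection; returns false when the user is at the cap.
  acquire(userId: string): boolean {
    const open = this.counts.get(userId) ?? 0;
    if (open >= this.maxPerUser) return false;
    this.counts.set(userId, open + 1);
    return true;
  }

  // Call when a connection closes so the slot is freed.
  release(userId: string): void {
    const open = this.counts.get(userId) ?? 0;
    if (open <= 1) this.counts.delete(userId);
    else this.counts.set(userId, open - 1);
  }
}
```

Every successful `acquire` needs a matching `release` when the stream closes or errors; otherwise the user stays counted against the cap until the process restarts.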
### Adjusting SSE Limits
This requires a code change in `realtime-connection-limiter.service.ts`:
```typescript
// Change from
private readonly maxPerUser = 3;
// To
private readonly maxPerUser = 5;
```
---
## Bypassing Rate Limits for Testing
### Temporary Bypass via Redis
```bash
# Clear all rate limit keys for testing (--scan avoids blocking Redis the way KEYS can)
redis-cli --scan --pattern "auth-*" | xargs -r redis-cli DEL
redis-cli --scan --pattern "rate-limit:*" | xargs -r redis-cli DEL
# Clear a specific user's rate limit
redis-cli --scan --pattern "*<ip-or-user-identifier>*" | xargs -r redis-cli DEL
```
### Using SkipRateLimit Decorator
For development/testing routes:
```typescript
@SkipRateLimit()
@Get('test-endpoint')
async testEndpoint() { ... }
```
### Environment-Based Bypass
Add a development bypass in configuration:
```bash
# In .env (development only!)
RATE_LIMIT_BYPASS_ENABLED=true
```
```typescript
// In guard
if (this.configService.get("RATE_LIMIT_BYPASS_ENABLED") === "true") {
  return true;
}
```
> **Warning**: Never enable bypass in production!
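
One way to enforce the warning in code is to ignore the flag whenever the process runs in production. The sketch below assumes the deployment sets `NODE_ENV=production`, which this guide does not confirm; `isBypassEnabled` is a hypothetical helper.

```typescript
type BypassEnv = Record<string, string | undefined>;

// The bypass only takes effect when explicitly requested AND not in production.
function isBypassEnabled(env: BypassEnv): boolean {
  const requested = env["RATE_LIMIT_BYPASS_ENABLED"] === "true";
  const isProduction = env["NODE_ENV"] === "production"; // assumption: NODE_ENV is set
  return requested && !isProduction;
}

const devBypass = isBypassEnabled({ RATE_LIMIT_BYPASS_ENABLED: "true", NODE_ENV: "development" });
const prodBypass = isBypassEnabled({ RATE_LIMIT_BYPASS_ENABLED: "true", NODE_ENV: "production" });
```

With this check, a flag that leaks into a production `.env` file has no effect.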
---
## Signs of Rate Limit Issues
### User-Facing Symptoms
| Symptom | Possible Cause | Investigation |
| -------------------------- | ------------------- | ------------------------- |
| "Too many requests" errors | Rate limit exceeded | Check Redis keys, logs |
| Login failures | Auth rate limit | Check `auth-login:*` keys |
| Slow API responses | Queue backlog | Check `/health/queues` |
| 429 errors in logs | Any rate limit | Check logs for specifics |
### Monitoring Indicators
| Metric | Warning | Critical | Action |
| ----------------- | ------------- | -------- | ------------------------ |
| 429 error rate | >1% | >5% | Review rate limits |
| Queue size | >10 | >50 | Increase concurrency |
| Average wait time | >1s | >5s | Scale or increase limits |
| CAPTCHA triggers | Unusual spike | - | Possible attack |
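
The numeric thresholds in this table could be encoded for an alerting script roughly as follows. This is a sketch: the metric names and the `classify` helper are hypothetical, and "greater than" is read strictly, as the `>` signs in the table suggest.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Thresholds copied from the table above (error rate expressed as a fraction).
const THRESHOLDS = {
  errorRate429: { warning: 0.01, critical: 0.05 },
  queueSize: { warning: 10, critical: 50 },
  avgWaitSeconds: { warning: 1, critical: 5 },
};

function classify(value: number, threshold: Threshold): Severity {
  if (value > threshold.critical) return "critical";
  if (value > threshold.warning) return "warning";
  return "ok";
}

const queueStatus = classify(12, THRESHOLDS.queueSize); // "warning"
```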
### Log Analysis
```bash
# Find rate limit exceeded events
grep "Rate limit exceeded" /var/log/bff/combined.log | tail -20
# Find 429 responses
grep '"statusCode":429' /var/log/bff/combined.log | tail -20
# Count rate limit events by path
grep "Rate limit exceeded" /var/log/bff/combined.log | \
  jq -r '.path' | sort | uniq -c | sort -rn
```
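
The same per-path count can be done in a script when `jq` is not available. The sketch below assumes, like the commands above, that each log line is a JSON object with a `path` field; `countByPath` is a hypothetical helper.

```typescript
// Count "Rate limit exceeded" events per path, most frequent first.
function countByPath(lines: string[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    if (!line.includes("Rate limit exceeded")) continue;
    let path: unknown;
    try {
      path = (JSON.parse(line) as { path?: unknown }).path;
    } catch {
      continue; // skip lines that are not valid JSON objects
    }
    if (typeof path !== "string") continue;
    counts.set(path, (counts.get(path) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

The input would typically be the log file split on newlines, for example the output of reading `/var/log/bff/combined.log`.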
---
## Troubleshooting
### Too Many 429 Errors
**Diagnosis:**
```bash
# Check which endpoints are rate limited
grep "Rate limit exceeded" /var/log/bff/combined.log | \
  jq '{path: .path, key: .key}' | head -20
# Check queue health
curl http://localhost:4000/health/queues
```
**Resolution:**
1. Identify the affected endpoint
2. Check if limit is appropriate for traffic
3. Increase limit if legitimate traffic
4. Add caching if requests are repetitive
### Legitimate Users Being Blocked
**Diagnosis:**
```bash
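# Find matching keys (--scan is safer than KEYS on a busy instance)
redis-cli --scan --pattern "*<identifier>*"
# Inspect the counter and the seconds left until it expires
redis-cli GET "auth-login:<hash>"
redis-cli TTL "auth-login:<hash>"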
```
**Resolution:**
```bash
# Clear the user's rate limit record
redis-cli DEL "auth-login:<hash>"
```
### External API Rate Limit Violations
**WHMCS Rate Limiting:**
```bash
# Check queue metrics
curl http://localhost:4000/health/queues/whmcs
# If WHMCS is overloaded, lower these values in .env, then restart the backend
WHMCS_QUEUE_CONCURRENCY=5
WHMCS_QUEUE_INTERVAL_CAP=100
```
**Salesforce API Limits:**
```bash
# Check daily API usage
curl http://localhost:4000/health/queues/salesforce | jq '.dailyUsage'
# If approaching limit, reduce requests
# Consider caching more data
```
### Redis Connection Issues
If rate limiting fails due to Redis:
```bash
# Check Redis connectivity
redis-cli PING
# The guard fails open on Redis errors (allows request)
# Check logs for "Rate limiter error - failing open"
```
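
The fail-open behavior described above follows a simple pattern, sketched here under the assumption that the limiter check is an async call that throws when Redis is unreachable. `allowRequest` and its parameters are hypothetical names, not the guard's actual API.

```typescript
// Runs the limiter check; if the check itself fails, the request is allowed.
async function allowRequest(
  check: () => Promise<boolean>,
  onError: (err: unknown) => void = () => {},
): Promise<boolean> {
  try {
    return await check();
  } catch (err) {
    onError(err); // e.g. log "Rate limiter error - failing open"
    return true;
  }
}
```

Failing open trades protection for availability: during a Redis outage no request is rate limited, so the logged errors are worth alerting on.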
---
## Best Practices
### Setting Rate Limits
1. **Start Conservative** - Begin with lower limits, increase as needed
2. **Monitor Before Adjusting** - Understand traffic patterns first
3. **Consider User Experience** - Limits should rarely impact normal use
4. **Document Changes** - Track why limits were adjusted
### Rate Limit Strategies
| Strategy | Use Case | Implementation |
| ---------- | ----------------------- | ---------------------- |
| IP-based | Anonymous endpoints | Default behavior |
| User-based | Authenticated endpoints | Include user ID in key |
| Combined | Sensitive endpoints | IP + User-Agent hash |
| Tiered | Different user classes | Custom logic |
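
The first three strategies differ only in how the Redis key is built. The sketch below shows one possible key builder; the key layout, the `user:` segment, and all names are illustrative assumptions, and the hash function is passed in because this guide does not specify which one the portal uses.

```typescript
interface ClientContext {
  ip: string;
  userAgent: string;
  userId?: string; // present only on authenticated requests
}

type KeyStrategy = "ip" | "user" | "combined";

function buildRateLimitKey(
  prefix: string,
  strategy: KeyStrategy,
  ctx: ClientContext,
  hash: (input: string) => string,
): string {
  switch (strategy) {
    case "ip":
      return `${prefix}:${hash(ctx.ip)}`;
    case "user":
      // Anonymous requests fall back to the IP-based key.
      return ctx.userId !== undefined
        ? `${prefix}:user:${ctx.userId}`
        : `${prefix}:${hash(ctx.ip)}`;
    case "combined":
      return `${prefix}:${hash(`${ctx.ip}|${ctx.userAgent}`)}`;
  }
}
```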
### Performance Considerations
- **Redis Latency** - Keep Redis co-located with BFF
- **Key Expiration** - Use TTL to prevent Redis bloat
- **Fail Open** - Rate limiter allows requests if Redis fails
- **Logging** - Log blocked requests for analysis
---
## Rate Limit Response Headers
The BFF includes the conventional rate limit headers in its responses:
```http
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704110400
Retry-After: 60
```
Clients can use these to implement backoff.
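
A client-side backoff based on these headers might look like the sketch below. It assumes `Retry-After` carries a number of seconds, that `X-RateLimit-Reset` is a Unix timestamp in seconds (as the sample value suggests), and that header names have been lowercased; `retryDelayMs` and the 1-second fallback are hypothetical.

```typescript
type HeaderMap = Record<string, string | undefined>;

// Parse a non-negative number of seconds; undefined if absent or malformed.
function parseSeconds(value: string | undefined): number | undefined {
  if (value === undefined || value.trim() === "") return undefined;
  const n = Number(value);
  return Number.isFinite(n) && n >= 0 ? n : undefined;
}

// How long to wait before retrying after a 429, in milliseconds.
function retryDelayMs(headers: HeaderMap, nowMs: number, fallbackMs = 1000): number {
  const retryAfter = parseSeconds(headers["retry-after"]);
  if (retryAfter !== undefined) return retryAfter * 1000;

  const resetAt = parseSeconds(headers["x-ratelimit-reset"]);
  if (resetAt !== undefined) return Math.max(0, resetAt * 1000 - nowMs);

  return fallbackMs;
}

const delay = retryDelayMs({ "retry-after": "60" }, 0); // 60000
```

`Retry-After` may also be sent as an HTTP date; this sketch does not parse that form and falls through to the reset header or the fallback instead.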
---
## Related Documents
- [Incident Response](./incident-response.md)
- [Monitoring Setup](./monitoring-setup.md)
- [External Dependencies](./external-dependencies.md)
- [Queue Management](./queue-management.md)
---
**Last Updated:** December 2025