From 72d0b66be72a84d0b9e1cbc600db1a07d6bce5ff Mon Sep 17 00:00:00 2001 From: barsa Date: Tue, 23 Dec 2025 15:55:58 +0900 Subject: [PATCH] Enhance Documentation Structure and Update Operational Runbooks - Added a new section for operational runbooks in README.md, detailing procedures for incident response, database operations, and queue management. - Updated the documentation structure in STRUCTURE.md to reflect the new organization of guides and resources. - Removed the deprecated disabled-modules.md file to streamline documentation. - Enhanced the _archive/README.md with historical notes on documentation alignment and corrections made in December 2025. - Updated various references in the documentation to reflect the new paths and services in the integrations directory. --- docs/README.md | 31 +- docs/STRUCTURE.md | 47 +- docs/_archive/README.md | 16 + .../refactoring/CLEAN-ARCHITECTURE-SUMMARY.md | 2 +- docs/architecture/modular-provisioning.md | 50 +-- docs/architecture/system-overview.md | 107 +++-- .../portal/integration-overview.md | 26 +- docs/getting-started/setup.md | 20 +- docs/integrations/sim/freebit.md | 8 +- docs/integrations/whmcs/troubleshooting.md | 4 +- docs/operations/database-operations.md | 407 ++++++++++++++++++ docs/operations/disabled-modules.md | 28 -- docs/operations/external-dependencies.md | 325 ++++++++++++++ docs/operations/external-processes.md | 325 ++++++++++++++ docs/operations/incident-response.md | 327 ++++++++++++++ docs/operations/provisioning-runbook.md | 76 ++++ docs/operations/queue-management.md | 361 ++++++++++++++++ 17 files changed, 2007 insertions(+), 153 deletions(-) create mode 100644 docs/operations/database-operations.md delete mode 100644 docs/operations/disabled-modules.md create mode 100644 docs/operations/external-dependencies.md create mode 100644 docs/operations/external-processes.md create mode 100644 docs/operations/incident-response.md create mode 100644 docs/operations/queue-management.md diff --git 
a/docs/README.md b/docs/README.md index a0b76bde..ecf8f8e3 100644 --- a/docs/README.md +++ b/docs/README.md @@ -138,13 +138,24 @@ Feature guides explaining how the portal functions: ## 🛠️ Operations -| Document | Description | -| ------------------------------------------------------------------ | ----------------------------- | -| [Logging](./operations/logging.md) | Centralized logging system | -| [Security Monitoring](./operations/security-monitoring.md) | Security monitoring setup | -| [Provisioning Runbook](./operations/provisioning-runbook.md) | Provisioning procedures | -| [Subscription Management](./operations/subscription-management.md) | Service management | -| [Disabled Modules](./operations/disabled-modules.md) | Temporarily disabled features | +### Runbooks + +| Document | Description | +| -------------------------------------------------------------- | ----------------------------- | +| [Incident Response](./operations/incident-response.md) | Emergency procedures | +| [Provisioning Runbook](./operations/provisioning-runbook.md) | Order fulfillment procedures | +| [Database Operations](./operations/database-operations.md) | Backup, recovery, maintenance | +| [External Dependencies](./operations/external-dependencies.md) | Integration health checks | +| [Queue Management](./operations/queue-management.md) | BullMQ job monitoring | +| [External Processes](./operations/external-processes.md) | Team handoffs and workflows | + +### System Operations + +| Document | Description | +| ------------------------------------------------------------------ | -------------------------- | +| [Logging](./operations/logging.md) | Centralized logging system | +| [Security Monitoring](./operations/security-monitoring.md) | Security monitoring setup | +| [Subscription Management](./operations/subscription-management.md) | Service management | --- @@ -178,11 +189,13 @@ Historical documents kept for reference: 2. [Domain Types](./development/domain/types.md) 3. 
[Performance](./development/portal/performance.md) -### DevOps +### DevOps / Operations 1. [Deployment](./getting-started/deployment.md) -2. [Logging](./operations/logging.md) +2. [Incident Response](./operations/incident-response.md) 3. [Provisioning Runbook](./operations/provisioning-runbook.md) +4. [Database Operations](./operations/database-operations.md) +5. [External Dependencies](./operations/external-dependencies.md) --- diff --git a/docs/STRUCTURE.md b/docs/STRUCTURE.md index 841838e0..c60c5624 100644 --- a/docs/STRUCTURE.md +++ b/docs/STRUCTURE.md @@ -114,13 +114,18 @@ Coding standards │ └── plesk-deploy.sh # ✅ Plesk deployment script │ ├── 📚 docs/ # Documentation -│ ├── README.md # ✅ Comprehensive guide -│ ├── GETTING_STARTED.md # ✅ Quick start guide -│ ├── RUN.md # ✅ Development workflow -│ ├── DEPLOY.md # ✅ Production deployment -│ ├── LOGGING.md # ✅ Logging configuration -│ ├── SECURITY.md # ✅ Security features and best practices -│ └── STRUCTURE.md # ✅ This file +│ ├── README.md # ✅ Documentation index +│ ├── STRUCTURE.md # ✅ This file +│ ├── getting-started/ # Setup and running guides +│ │ ├── setup.md # Initial project setup +│ │ ├── running.md # Local development +│ │ └── deployment.md # Production deployment +│ ├── architecture/ # System design documents +│ ├── how-it-works/ # Feature guides +│ ├── integrations/ # External system integration +│ ├── development/ # Development guides +│ ├── operations/ # Operational runbooks +│ └── _archive/ # Historical documents │ ├── 📦 packages/ # Shared packages │ └── domain/ # Domain TypeScript utilities @@ -135,11 +140,12 @@ Coding standards ### **Environment Template Approach** -- **`.env.dev.example`** - Development-optimized template -- **`.env.production.example`** - Production-optimized template -- **`.env.example`** - 
Basic template for custom setups -- **`.env`** - Your actual configuration (gitignored) -- **Environment-specific defaults** - Appropriate values per environment +Environment templates are located in the `env/` folder: + +- **`env/dev.env.sample`** - Development environment template +- **`env/portal-backend.env.sample`** - Backend-specific variables +- **`env/portal-frontend.env.sample`** - Frontend-specific variables +- **`.env`** - Your actual configuration (gitignored, at project root) ### **Environment Variables** @@ -206,13 +212,16 @@ pnpm prod:backup # Database backup ### **Essential Guides** -- **`README.md`** - Project overview and architecture -- **`GETTING_STARTED.md`** - Quick setup guide -- **`RUN.md`** - Development workflow -- **`DEPLOY.md`** - Production deployment -- **`LOGGING.md`** - Logging configuration -- **`SECURITY.md`** - Security features and best practices -- **`STRUCTURE.md`** - This file +Documentation is organized in subdirectories: + +- **`docs/README.md`** - Documentation index and navigation +- **`docs/STRUCTURE.md`** - This file (project structure) +- **`docs/getting-started/`** - Setup, running, and deployment guides +- **`docs/architecture/`** - System design and architecture +- **`docs/how-it-works/`** - Feature guides and workflows +- **`docs/integrations/`** - Salesforce, WHMCS, SIM integration +- **`docs/development/`** - BFF, Portal, Auth development guides +- **`docs/operations/`** - Runbooks and operational procedures ### **No Redundancy** diff --git a/docs/_archive/README.md b/docs/_archive/README.md index 7450ea04..976710da 100644 --- a/docs/_archive/README.md +++ b/docs/_archive/README.md @@ -27,4 +27,20 @@ Point-in-time code reviews and analysis documents: --- +## Historical Notes + +### December 2025 Documentation Alignment + +A comprehensive documentation review was performed in December 2025 to align documentation with the actual codebase. The following corrections were made: + +1. 
**Removed fictional package descriptions** from `system-overview.md` that referenced non-existent `packages/contracts`, `packages/schemas`, and `packages/integrations` packages +2. **Deleted `disabled-modules.md`** which referenced non-existent "Cases" and "Jobs" modules +3. **Fixed path references** from `vendors/whmcs` to `integrations/whmcs` throughout documentation +4. **Updated module lists** to reflect actual BFF modules +5. **Created new operational runbooks**: incident-response, database-operations, external-dependencies, queue-management, external-processes + +Documents in this archive folder predate these corrections and may contain outdated references. + +--- + **Note:** These documents may contain outdated information. For current system behavior, refer to the active documentation in the parent `docs/` directory. diff --git a/docs/_archive/refactoring/CLEAN-ARCHITECTURE-SUMMARY.md b/docs/_archive/refactoring/CLEAN-ARCHITECTURE-SUMMARY.md index d6e017c3..84bc74b8 100644 --- a/docs/_archive/refactoring/CLEAN-ARCHITECTURE-SUMMARY.md +++ b/docs/_archive/refactoring/CLEAN-ARCHITECTURE-SUMMARY.md @@ -8,7 +8,7 @@ I've completely restructured the Salesforce-to-Portal order provisioning system ### **1. 
Dedicated WHMCS Order Service** -**File**: `/apps/bff/src/vendors/whmcs/services/whmcs-order.service.ts` +**File**: `/apps/bff/src/integrations/whmcs/services/whmcs-order.service.ts` - **Purpose**: Handles all WHMCS order operations (AddOrder, AcceptOrder) - **Features**: diff --git a/docs/architecture/modular-provisioning.md b/docs/architecture/modular-provisioning.md index b54a5c4f..6fe02c18 100644 --- a/docs/architecture/modular-provisioning.md +++ b/docs/architecture/modular-provisioning.md @@ -8,40 +8,42 @@ I've restructured the provisioning system to **match the exact same clean modula ### **Order Creation (Existing) ↔ Order Provisioning (New)** -| **Order Creation** | **Order Provisioning** | **Purpose** | -| ------------------- | -------------------------- | ----------------------------------- | -| `OrderValidator` | `ProvisioningValidator` | Validates requests & business rules | -| `OrderBuilder` | `WhmcsOrderMapper` | Transforms/maps data structures | -| `OrderItemBuilder` | _(integrated in mapper)_ | Handles item-level processing | -| `OrderOrchestrator` | `ProvisioningOrchestrator` | Coordinates the complete workflow | -| `OrdersController` | `PlatformEventsSubscriber` | Event handling (no inbound HTTP) | +| **Order Creation** | **Order Fulfillment** | **Purpose** | +| ------------------- | ------------------------------ | ----------------------------------- | +| `OrderValidator` | `OrderFulfillmentValidator` | Validates requests & business rules | +| `OrderBuilder` | `OrderBuilder` | Transforms/maps data structures | +| `OrderItemBuilder` | `OrderItemBuilder` | Handles item-level processing | +| `OrderOrchestrator` | `OrderFulfillmentOrchestrator` | Coordinates the complete workflow | +| `OrdersController` | `PlatformEventsSubscriber` | Event handling (no inbound HTTP) | ## 📁 **Clean File Structure** ``` -apps/bff/src/orders/ +apps/bff/src/modules/orders/ ├── controllers/ -│ └── orders.controller.ts # Customer-facing operations +│ └── 
orders.controller.ts # Customer-facing operations ├── queue/ -│ ├── provisioning.queue.ts # Enqueue provisioning jobs -│ └── provisioning.processor.ts # Worker processes jobs +│ ├── provisioning.queue.ts # Enqueue provisioning jobs +│ └── provisioning.processor.ts # Worker processes jobs ├── services/ -│ # Order Creation (existing) -│ ├── order-validator.service.ts # Request & business validation -│ ├── order-builder.service.ts # Order header construction -│ ├── order-item-builder.service.ts # Order items construction -│ ├── order-orchestrator.service.ts # Creation workflow coordination +│ # Order Creation +│ ├── order-validator.service.ts # Request & business validation +│ ├── order-builder.service.ts # Order header construction +│ ├── order-item-builder.service.ts # Order items construction +│ ├── order-orchestrator.service.ts # Creation workflow coordination │ │ -│ # Order Provisioning (new - matching structure) -│ ├── provisioning-validator.service.ts # Provisioning validation -│ ├── whmcs-order-mapper.service.ts # SF → WHMCS mapping -│ ├── provisioning-orchestrator.service.ts # Provisioning workflow coordination -│ └── order-provisioning.service.ts # Main provisioning interface +│ # Order Fulfillment/Provisioning +│ ├── order-fulfillment-validator.service.ts # Provisioning validation +│ ├── order-fulfillment-orchestrator.service.ts # Provisioning workflow coordination +│ ├── order-fulfillment-error.service.ts # Error handling +│ ├── sim-fulfillment.service.ts # SIM-specific fulfillment +│ ├── payment-validator.service.ts # Payment method validation +│ └── checkout.service.ts # Checkout flow coordination ``` ## 🎯 **Modular Provisioning Services** -### **1. ProvisioningValidator** +### **1. 
OrderFulfillmentValidator** **Purpose**: Validates all provisioning prerequisites @@ -51,7 +53,7 @@ apps/bff/src/orders/ - ✅ Idempotency checking - ✅ Request payload validation -### **2. WhmcsOrderMapper** +### **2. OrderBuilder / OrderItemBuilder** **Purpose**: Maps Salesforce OrderItems → WHMCS format @@ -61,7 +63,7 @@ apps/bff/src/orders/ - ✅ Custom fields mapping - ✅ Order notes generation with SF tracking -### **3. ProvisioningOrchestrator** +### **3. OrderFulfillmentOrchestrator** **Purpose**: Coordinates complete provisioning workflow diff --git a/docs/architecture/system-overview.md b/docs/architecture/system-overview.md index 4b44b6d3..33979c37 100644 --- a/docs/architecture/system-overview.md +++ b/docs/architecture/system-overview.md @@ -11,9 +11,7 @@ apps/ portal/ # Next.js frontend bff/ # NestJS Backend-for-Frontend packages/ - domain/ # Pure domain/types/utils (isomorphic) - logging/ # Centralized logging utilities - validation/ # Shared validation schemas + domain/ # Pure domain types, validation schemas, and utilities (isomorphic) ``` ## 🎯 **Architecture Principles** @@ -67,16 +65,26 @@ src/ ``` src/ modules/ # Feature-aligned modules - auth/ # Authentication - billing/ # Invoice and payment management + auth/ # Authentication and authorization + users/ # User management + id-mappings/ # Portal-WHMCS-Salesforce ID mappings catalog/ # Product catalog - orders/ # Order processing - subscriptions/ # Service management + orders/ # Order creation and fulfillment + invoices/ # Invoice management + subscriptions/ # Service and subscription management + currency/ # Currency handling + support/ # Support case management + realtime/ # Server-Sent Events API + verification/ # ID verification + notifications/ # User notifications + health/ # Health check endpoints core/ # Core services and utilities + infra/ # Infrastructure (database, cache, queue, email) integrations/ # External service integrations salesforce/ # Salesforce CRM integration whmcs/ # 
WHMCS billing integration - common/ # Nest providers/interceptors/guards + freebit/ # Freebit SIM provider integration + sftp/ # SFTP file transfer main.ts # Application entry point ``` @@ -89,60 +97,67 @@ src/ ## 📦 **Shared Packages** -### **Layered Type System Architecture** +### **Domain Package (`packages/domain/`)** -The codebase follows a strict layering pattern to ensure single source of truth for all types and prevent drift: +The domain package is the single source of truth for shared types, validation schemas, and utilities across both the BFF and Portal applications. ``` -@customer-portal/contracts (Pure TypeScript types) - ↓ -@customer-portal/schemas (Runtime validation with Zod) - ↓ -@customer-portal/integrations (Mappers for external APIs) - ↓ - Applications (BFF, Portal) +packages/domain/ +├── auth/ # Authentication types and validation +├── billing/ # Invoice and payment types +├── catalog/ # Product catalog types +├── checkout/ # Checkout flow types +├── common/ # Shared utilities and base types +├── customer/ # Customer profile types +├── dashboard/ # Dashboard data types +├── mappings/ # ID mapping types (Portal-WHMCS-SF) +├── notifications/ # Notification types +├── opportunity/ # Salesforce opportunity types +├── orders/ # Order types and Salesforce mappings +├── payments/ # Payment method types +├── providers/ # Provider-specific type definitions +├── realtime/ # SSE event types +├── salesforce/ # Salesforce API types +├── sim/ # SIM lifecycle and Freebit types +├── subscriptions/ # Subscription types +├── support/ # Support case types +├── toolkit/ # Utility functions +└── index.ts # Public exports ``` -#### **1. 
Contracts Package (`packages/contracts/`)** +#### **Key Principles** -- **Purpose**: Pure TypeScript interface definitions - single source of truth -- **Contents**: Cross-layer contracts for billing, subscriptions, payments, SIM, orders -- **Exports**: Organized by domain (e.g., `@customer-portal/contracts/billing`) -- **Rule**: ZERO runtime dependencies, only pure types +- **Framework-agnostic**: No NestJS or React dependencies +- **Isomorphic**: Works in both Node.js and browser environments +- **Zod-first validation**: Schemas defined with Zod for runtime validation +- **Provider mappers**: Transform external API responses to domain types -#### **2. Schemas Package (`packages/schemas/`)** +#### **Usage** -- **Purpose**: Runtime validation schemas using Zod -- **Contents**: Matching Zod validators for each contract + integration-specific payload schemas -- **Exports**: Organized by domain and integration provider -- **Usage**: Validate external API responses, request payloads, and user input +Import via `@customer-portal/domain`: -#### **3. Integration Packages (`packages/integrations/`)** +```typescript +import { Invoice, SIM_LIFECYCLE_STAGE, OrderStatus } from "@customer-portal/domain"; +import { invoiceSchema, orderSchema } from "@customer-portal/domain/validation"; +``` -- **Purpose**: Transform raw provider data into shared contracts -- **Structure**: - - `packages/integrations/whmcs/` - WHMCS billing integration - - `packages/integrations/freebit/` - Freebit SIM provider integration -- **Contents**: Mappers, utilities, and helper functions -- **Rule**: Must use `@customer-portal/schemas` for validation at boundaries +#### **Integration with BFF** -#### **4. 
Application Layers** +The BFF integration layer (`apps/bff/src/integrations/`) uses domain mappers to transform raw provider data: -- **BFF** (`apps/bff/`): Import from contracts/schemas, never define duplicate interfaces -- **Portal** (`apps/portal/`): Import from contracts/schemas, use shared types everywhere -- **Rule**: Applications only consume, never define domain types +``` +External API → Raw Response → Domain Mapper → Domain Type → Use Everywhere +``` -### **Legacy: Domain Package (Deprecated)** +This ensures a single transformation point and consistent types across the application. -- **Status**: Being phased out in favor of contracts + schemas -- **Migration**: Re-exports now point to contracts package for backward compatibility -- **Rule**: New code should import from `@customer-portal/contracts` or `@customer-portal/schemas` +### **Logging** -### **Logging Package** +Centralized logging is implemented in the BFF using `nestjs-pino`: -- **Purpose**: Centralized structured logging -- **Features**: Pino-based logging with correlation IDs -- **Security**: Automatic PII redaction [[memory:6689308]] +- **Structured JSON logging** for production +- **Correlation IDs** for request tracing +- **Automatic PII redaction** for security ## 🔗 **Integration Architecture** diff --git a/docs/development/portal/integration-overview.md b/docs/development/portal/integration-overview.md index 416c8032..fcbca442 100644 --- a/docs/development/portal/integration-overview.md +++ b/docs/development/portal/integration-overview.md @@ -41,10 +41,10 @@ This document explains how the portal integrates Salesforce (catalog, orders, pr - Endpoints: `GET /invoices`, `GET /invoices/:id`, `GET /invoices/:id/subscriptions`, `POST /invoices/:id/sso-link`, `POST /invoices/:id/payment-link` (apps/bff/src/invoices/invoices.controller.ts:1). 
- Service flow: resolve mapping → fetch from WHMCS via `WhmcsService` → transform/cache → return (apps/bff/src/invoices/invoices.service.ts:24). - - List/paginate via WHMCS GetInvoices; details enriched with line items and `serviceId` links (apps/bff/src/vendors/whmcs/services/whmcs-invoice.service.ts:1). - - Subscriptions listed via WHMCS GetClientsProducts; transformed and cached (apps/bff/src/vendors/whmcs/services/whmcs-subscription.service.ts:1). - - Payment methods/gateways via WHMCS; cached in Redis; also used for gating order creation/provisioning (apps/bff/src/vendors/whmcs/services/whmcs-payment.service.ts:1). -- SSO links: invoice view/download/pay and payment-page with preselected method/gateway (apps/bff/src/vendors/whmcs/services/whmcs-payment.service.ts:168). + - List/paginate via WHMCS GetInvoices; details enriched with line items and `serviceId` links (apps/bff/src/integrations/whmcs/services/whmcs-invoice.service.ts:1). + - Subscriptions listed via WHMCS GetClientsProducts; transformed and cached (apps/bff/src/integrations/whmcs/services/whmcs-subscription.service.ts:1). + - Payment methods/gateways via WHMCS; cached in Redis; also used for gating order creation/provisioning (apps/bff/src/integrations/whmcs/services/whmcs-payment.service.ts:1). +- SSO links: invoice view/download/pay and payment-page with preselected method/gateway (apps/bff/src/integrations/whmcs/services/whmcs-payment.service.ts:168). 
## Orders — Creation (Portal ➝ Salesforce) @@ -74,23 +74,23 @@ This document explains how the portal integrates Salesforce (catalog, orders, pr - Validate request: not already provisioned (checks `WHMCS_Order_ID__c`), ensure client has payment method; resolve mapping (apps/bff/src/orders/services/order-fulfillment-validator.service.ts:23) - Set SF activation status to `Activating` (apps/bff/src/orders/services/order-fulfillment-orchestrator.service.ts:98) - Load SF Order details + OrderItems, map each to WHMCS items using the Product2 mapping (`WH_Product_ID__c`) and billing cycle (apps/bff/src/orders/services/order-whmcs-mapper.service.ts:1) - - Create WHMCS order (AddOrder) with Stripe as payment method; optional promo code and tracking notes (apps/bff/src/vendors/whmcs/services/whmcs-order.service.ts:20) - - Accept/provision order (AcceptOrder), capture service IDs and invoice ID returned (apps/bff/src/vendors/whmcs/services/whmcs-order.service.ts:60) + - Create WHMCS order (AddOrder) with Stripe as payment method; optional promo code and tracking notes (apps/bff/src/integrations/whmcs/services/whmcs-order.service.ts:20) + - Accept/provision order (AcceptOrder), capture service IDs and invoice ID returned (apps/bff/src/integrations/whmcs/services/whmcs-order.service.ts:60) - Update SF: `Status=Completed`, `Activation_Status__c=Activated`, and write back `WHMCS_Order_ID__c` (apps/bff/src/orders/services/order-fulfillment-orchestrator.service.ts:117) - Error handling: On failure, set `Status=Pending Review`, `Activation_Status__c=Failed`, and write concise error code/message for operator triage (apps/bff/src/orders/services/order-fulfillment-orchestrator.service.ts:146). ## Subscriptions (Shown in Portal) -- Data comes from WHMCS products/services via `GetClientsProducts` and is transformed into a standard Subscription list (apps/bff/src/vendors/whmcs/services/whmcs-subscription.service.ts:1). 
-- Cached per user; supports status filtering; invoice items link to `serviceId` to show related subscriptions (apps/bff/src/vendors/whmcs/transformers/whmcs-data.transformer.ts:35). +- Data comes from WHMCS products/services via `GetClientsProducts` and is transformed into a standard Subscription list (apps/bff/src/integrations/whmcs/services/whmcs-subscription.service.ts:1). +- Cached per user; supports status filtering; invoice items link to `serviceId` to show related subscriptions (apps/bff/src/integrations/whmcs/transformers/whmcs-data.transformer.ts:35). ## Payments & SSO -- Payment methods summary drives UI gating and provisioning validation (apps/bff/src/vendors/whmcs/services/whmcs-payment.service.ts:44). +- Payment methods summary drives UI gating and provisioning validation (apps/bff/src/integrations/whmcs/services/whmcs-payment.service.ts:44). - SSO flows - General WHMCS SSO (dashboard/settings) via `CreateSsoToken` - - Invoice view/download/pay SSO (apps/bff/src/vendors/whmcs/services/whmcs-payment.service.ts:168) - - Payment link with pre-selected saved method or gateway (apps/bff/src/vendors/whmcs/services/whmcs-payment.service.ts:168) + - Invoice view/download/pay SSO (apps/bff/src/integrations/whmcs/services/whmcs-payment.service.ts:168) + - Payment link with pre-selected saved method or gateway (apps/bff/src/integrations/whmcs/services/whmcs-payment.service.ts:168) ## Caching & Performance @@ -139,8 +139,8 @@ This document explains how the portal integrates Salesforce (catalog, orders, pr - Salesforce events subscriber: apps/bff/src/vendors/salesforce/events/pubsub.subscriber.ts:58 - Provisioning queue processor: apps/bff/src/orders/queue/provisioning.processor.ts:26 - Invoices service: apps/bff/src/invoices/invoices.service.ts:24 -- Subscriptions service: apps/bff/src/vendors/whmcs/services/whmcs-subscription.service.ts:1 -- Payment/SSO service: apps/bff/src/vendors/whmcs/services/whmcs-payment.service.ts:1 +- Subscriptions service: 
apps/bff/src/integrations/whmcs/services/whmcs-subscription.service.ts:1 +- Payment/SSO service: apps/bff/src/integrations/whmcs/services/whmcs-payment.service.ts:1 --- diff --git a/docs/getting-started/setup.md b/docs/getting-started/setup.md index 9c06c878..6c6c3f20 100644 --- a/docs/getting-started/setup.md +++ b/docs/getting-started/setup.md @@ -6,21 +6,27 @@ We provide **environment-specific templates** for easy setup: ### 📁 **Available Templates:** -- 🔸 **`.env.example`** - Standard environment template for all environments -- 🔸 **Environment-specific values** - Adjust settings based on development vs production needs +Located in the `env/` folder: + +- 🔸 **`env/dev.env.sample`** - Development environment template +- 🔸 **`env/portal-backend.env.sample`** - Backend-specific variables reference +- 🔸 **`env/portal-frontend.env.sample`** - Frontend-specific variables reference ### 🎯 **Benefits:** - ✅ **Environment-specific**: Clear separation of dev vs prod - ✅ **Secure defaults**: Production uses strong security settings -- ✅ **Easy setup**: Copy the right template for your needs +- ✅ **Easy setup**: Copy the template for your needs - ✅ **No confusion**: Clear instructions for each environment ## 🔧 **Environment File Structure** ``` 📦 Customer Portal -├── .env.example # 🔸 Environment template +├── env/ +│ ├── dev.env.sample # 🔸 Development template +│ ├── portal-backend.env.sample # Backend variables +│ └── portal-frontend.env.sample # Frontend variables ├── .env # ✅ Your actual config (gitignored) ├── apps/ │ ├── bff/ # 🚀 Backend reads from root .env @@ -42,7 +48,7 @@ We provide **environment-specific templates** for easy setup: ```bash # Copy development environment template -cp .env.dev.example .env +cp env/dev.env.sample .env # Edit with your dev values (most defaults work!) 
nano .env # Configure for local development @@ -51,8 +57,8 @@ nano .env # Configure for local development **🔸 For Production:** ```bash -# Copy production environment template -cp .env.production.example .env +# Start from the development template and adjust for production +cp env/dev.env.sample .env # Edit with your production values (REQUIRED!) nano .env # Replace with secure production values diff --git a/docs/integrations/sim/freebit.md b/docs/integrations/sim/freebit.md index dd447c22..fdfb4e81 100644 --- a/docs/integrations/sim/freebit.md +++ b/docs/integrations/sim/freebit.md @@ -888,10 +888,10 @@ User Action → Cost Calculation → Invoice Creation → Payment Capture → Da ### 📁 **Implementation Files Modified**: -1. `apps/bff/src/vendors/whmcs/types/whmcs-api.types.ts` - Added WHMCS API types -2. `apps/bff/src/vendors/whmcs/services/whmcs-connection.service.ts` - Added API methods -3. `apps/bff/src/vendors/whmcs/services/whmcs-invoice.service.ts` - Added invoice creation -4. `apps/bff/src/vendors/whmcs/whmcs.service.ts` - Exposed new methods +1. `apps/bff/src/integrations/whmcs/types/whmcs-api.types.ts` - Added WHMCS API types +2. `apps/bff/src/integrations/whmcs/connection/whmcs-connection.service.ts` - Added API methods +3. `apps/bff/src/integrations/whmcs/services/whmcs-invoice.service.ts` - Added invoice creation +4. `apps/bff/src/integrations/whmcs/whmcs.service.ts` - Exposed new methods 5. `apps/bff/src/subscriptions/sim-management.service.ts` - Complete payment flow ## 🎯 **Latest Update: Simplified Top-Up Interface (January 2025)** diff --git a/docs/integrations/whmcs/troubleshooting.md b/docs/integrations/whmcs/troubleshooting.md index 4fd799ac..b72bb39d 100644 --- a/docs/integrations/whmcs/troubleshooting.md +++ b/docs/integrations/whmcs/troubleshooting.md @@ -61,7 +61,7 @@ The WHMCS `GetPayMethods` API returns payment method data with different field n ### 1. 
Payment Method Transformer -**File**: `apps/bff/src/vendors/whmcs/transformers/whmcs-data.transformer.ts` +**File**: `apps/bff/src/integrations/whmcs/transformers/whmcs-data.transformer.ts` **Changes Made:** @@ -81,7 +81,7 @@ ccType: whmcsPayMethod.cc_type || whmcsPayMethod.card_type, ### 2. Payment Service Enhancement -**File**: `apps/bff/src/vendors/whmcs/services/whmcs-payment.service.ts` +**File**: `apps/bff/src/integrations/whmcs/services/whmcs-payment.service.ts` **Changes Made:** diff --git a/docs/operations/database-operations.md b/docs/operations/database-operations.md new file mode 100644 index 00000000..d229b21f --- /dev/null +++ b/docs/operations/database-operations.md @@ -0,0 +1,407 @@ +# Database Operations Runbook + +This document covers operational procedures for the PostgreSQL database used by the Customer Portal BFF. + +--- + +## Overview + +| Component | Technology | Location | +| --------------- | ------------------------- | ----------------------------- | +| Database | PostgreSQL 17 | Configured via `DATABASE_URL` | +| ORM | Prisma 6 | `apps/bff/prisma/` | +| Connection Pool | Prisma connection pooling | Default: 10 connections | + +--- + +## Backup Procedures + +### Automated Backups + +> **Note**: Configure automated backups based on your hosting environment. 
+ +**Recommended Schedule:** + +- Full backup: Daily at 02:00 UTC +- Transaction log backup: Every 15 minutes +- Retention: 30 days + +### Manual Backup + +```bash +# Create a full database backup +pg_dump $DATABASE_URL > backup_$(date +%Y%m%d_%H%M%S).sql + +# Create a compressed backup +pg_dump $DATABASE_URL | gzip > backup_$(date +%Y%m%d_%H%M%S).sql.gz + +# Backup specific tables +pg_dump $DATABASE_URL -t users -t id_mappings > user_data_backup.sql +``` + +### Backup Verification + +```bash +# Verify backup integrity (restore to temp database) +createdb portal_backup_test +psql portal_backup_test < backup_YYYYMMDD.sql + +# Run basic integrity checks +psql portal_backup_test -c "SELECT COUNT(*) FROM users" +psql portal_backup_test -c "SELECT COUNT(*) FROM id_mappings" + +# Clean up +dropdb portal_backup_test +``` + +--- + +## Recovery Procedures + +### Point-in-Time Recovery + +**Prerequisites:** + +- WAL archiving enabled +- Continuous backup configured + +```bash +# Stop the application +pnpm prod:stop + +# Restore from backup +pg_restore -d $DATABASE_URL backup_YYYYMMDD.dump + +# Run Prisma migrations to ensure schema is current +pnpm db:migrate + +# Restart the application +pnpm prod:start +``` + +### Restore from SQL Backup + +```bash +# Stop the application to prevent writes +pnpm prod:stop + +# Drop and recreate database (DESTRUCTIVE) +dropdb portal_production +createdb portal_production + +# Restore from backup +psql $DATABASE_URL < backup_YYYYMMDD.sql + +# Verify restoration +psql $DATABASE_URL -c "SELECT COUNT(*) FROM users" + +# Restart application +pnpm prod:start +``` + +--- + +## Migration Management + +### Running Migrations + +```bash +# Development: Apply pending migrations +pnpm db:migrate + +# Production: Deploy migrations +pnpm db:migrate --skip-generate + +# View migration status +npx prisma migrate status +``` + +### Migration Checklist + +Before deploying migrations to production: + +1. [ ] Test migration on staging environment +2. 
[ ] Verify rollback procedure exists +3. [ ] Estimate migration duration +4. [ ] Schedule maintenance window if needed +5. [ ] Create backup before migration +6. [ ] Notify team of deployment + +### Rollback Procedure + +Prisma has no built-in rollback command. Use one of these approaches: + +**Option 1: Restore from Backup** + +```bash +# Restore database to pre-migration state +psql "$DATABASE_URL" < pre_migration_backup.sql + +# Revert migration files in codebase (replace <commit> with the commit that added the migration) +git revert <commit> +``` + +**Option 2: Manual Rollback SQL** + +```bash +# Create rollback SQL for each migration +# Store in: apps/bff/prisma/rollbacks/ + +# Example rollback +psql "$DATABASE_URL" < rollbacks/20240115_rollback.sql +``` + +**Option 3: Reset and Reseed (Development Only)** + +```bash +# WARNING: Destroys all data +pnpm db:reset +``` + +--- + +## ID Mappings Data Integrity + +The `id_mappings` table links portal users to WHMCS and Salesforce accounts. Corruption here causes authentication and data access failures. + +### Verify Mapping Integrity + +```sql +-- Check for orphaned mappings (portal user deleted but mapping exists) +SELECT m.* FROM id_mappings m +LEFT JOIN users u ON m.user_id = u.id +WHERE u.id IS NULL; + +-- Check for duplicate WHMCS mappings +SELECT whmcs_client_id, COUNT(*) as count +FROM id_mappings +WHERE whmcs_client_id IS NOT NULL +GROUP BY whmcs_client_id +HAVING COUNT(*) > 1; + +-- Check for duplicate Salesforce mappings +SELECT sf_account_id, COUNT(*) as count +FROM id_mappings +WHERE sf_account_id IS NOT NULL +GROUP BY sf_account_id +HAVING COUNT(*) > 1; +``` + +### Fix Orphaned Mappings + +```sql +-- Remove mappings for deleted users +DELETE FROM id_mappings +WHERE user_id NOT IN (SELECT id FROM users); +``` + +### Fix Duplicate Mappings + +> **Warning**: Investigate duplicates before deleting. They may indicate data issues.
+ +```sql +-- View duplicate details before fixing +SELECT m.*, u.email FROM id_mappings m +JOIN users u ON m.user_id = u.id +WHERE m.whmcs_client_id IN ( + SELECT whmcs_client_id FROM id_mappings + GROUP BY whmcs_client_id HAVING COUNT(*) > 1 +); +``` + +--- + +## PostgreSQL Maintenance + +### VACUUM and ANALYZE + +```sql +-- Analyze all tables for query optimization +ANALYZE; + +-- Vacuum to reclaim space (non-blocking) +VACUUM; + +-- Full vacuum (blocking, reclaims more space) +VACUUM FULL; + +-- Vacuum specific table +VACUUM ANALYZE id_mappings; +``` + +**Recommended Schedule:** + +- `VACUUM ANALYZE`: Daily during low-traffic hours +- `VACUUM FULL`: Monthly during maintenance window + +### Index Maintenance + +```sql +-- Check index usage (pg_stat_user_indexes exposes relname / indexrelname) +SELECT schemaname, relname AS tablename, indexrelname AS indexname, idx_scan, idx_tup_read +FROM pg_stat_user_indexes +ORDER BY idx_scan DESC; + +-- Find unused indexes (candidates for removal) +SELECT schemaname, relname AS tablename, indexrelname AS indexname +FROM pg_stat_user_indexes +WHERE idx_scan = 0; + +-- Reindex a table +REINDEX TABLE id_mappings; + +-- Reindex entire database (during maintenance window) +REINDEX DATABASE portal_production; +``` + +### Check Table Bloat + +```sql +-- Estimate table bloat (pg_stat_user_tables exposes relname, not tablename) +SELECT + schemaname, relname AS tablename, + pg_size_pretty(pg_relation_size(schemaname || '.'
|| relname)) as size, + n_dead_tup as dead_rows, + n_live_tup as live_rows, + ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) as dead_pct +FROM pg_stat_user_tables +ORDER BY n_dead_tup DESC; +``` + +--- + +## Connection Pool Monitoring + +### Check Active Connections + +```sql +-- Current connection count +SELECT COUNT(*) as connections FROM pg_stat_activity; + +-- Connections by state +SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state; + +-- Connections by application +SELECT application_name, COUNT(*) +FROM pg_stat_activity +GROUP BY application_name; + +-- Long-running queries (>5 minutes) +SELECT pid, now() - pg_stat_activity.query_start AS duration, query +FROM pg_stat_activity +WHERE state = 'active' + AND now() - pg_stat_activity.query_start > interval '5 minutes'; +``` + +### Kill Stuck Connections + +```sql +-- Terminate a specific backend (replace <pid> with the pid from pg_stat_activity) +SELECT pg_terminate_backend(<pid>); + +-- Terminate all connections except current +SELECT pg_terminate_backend(pid) +FROM pg_stat_activity +WHERE pid <> pg_backend_pid() + AND datname = current_database(); +``` + +### Prisma Connection Pool Settings + +Configure in `DATABASE_URL` query parameters: + +``` +postgresql://user:pass@host:5432/db?connection_limit=10&pool_timeout=10 +``` + +| Parameter | Default | Recommended | +| ------------------ | ------- | ------------------ | +| `connection_limit` | 10 | 10-20 per instance | +| `pool_timeout` | 10s | 10-30s | + +--- + +## Monitoring Queries + +### Database Size + +```sql +-- Total database size +SELECT pg_size_pretty(pg_database_size(current_database())); + +-- Size per table +SELECT + tablename, + pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) as total_size +FROM pg_tables +WHERE schemaname = 'public' +ORDER BY pg_total_relation_size(schemaname || '.'
|| tablename) DESC; +``` + +### Query Performance + +```sql +-- Slowest queries (requires pg_stat_statements extension; column names as of PostgreSQL 13+) +SELECT query, calls, mean_exec_time, total_exec_time +FROM pg_stat_statements +ORDER BY mean_exec_time DESC +LIMIT 10; +``` + +### Lock Monitoring + +```sql +-- Check for blocked lock requests (locks not yet granted) +SELECT + pg_locks.pid, + pg_stat_activity.query, + pg_locks.mode, + pg_locks.granted +FROM pg_locks +JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid +WHERE NOT pg_locks.granted; +``` + +--- + +## Emergency Procedures + +### Database Unresponsive + +1. Check PostgreSQL process status +2. Check disk space and memory +3. Kill long-running queries +4. Restart PostgreSQL if necessary +5. Check application connectivity after restart + +### Disk Space Full + +```bash +# Check disk usage +df -h + +# Find large files in PostgreSQL data directory +du -sh /var/lib/postgresql/data/* + +# Clear transaction logs (if WAL archiving is working) +# WARNING: Only if logs are properly archived +``` + +### Corruption Detected + +1. **STOP** the application immediately +2. Do not attempt repairs without backup verification +3. Restore from last known good backup +4. Investigate root cause before resuming service + +--- + +## Related Documents + +- [Incident Response](./incident-response.md) +- [External Dependencies](./external-dependencies.md) +- [Provisioning Runbook](./provisioning-runbook.md) + +--- + +**Last Updated:** December 2025 diff --git a/docs/operations/disabled-modules.md b/docs/operations/disabled-modules.md deleted file mode 100644 index 3d8aedd8..00000000 --- a/docs/operations/disabled-modules.md +++ /dev/null @@ -1,28 +0,0 @@ -# Temporarily Disabled Modules - -The backend currently omits two partially implemented modules from the runtime -NestJS configuration so that the public API surface only exposes completed -features. - -## Cases Module - -- Removed from `AppModule` and `apiRoutes` to ensure the unfinished `/cases` - endpoints are not routable.
-- All existing code remains in `apps/bff/src/modules/cases/` for future - development; re-enable by importing the module in - `apps/bff/src/app.module.ts` and adding it back to the router configuration in - `apps/bff/src/core/config/router.config.ts` once the endpoints are ready. - -## Jobs Module - -- Temporarily excluded from `AppModule` while the reconciliation workflows are - fleshed out. -- The BullMQ processor now logs an explicit warning and acknowledges each job so - queue workers do not hang when the module is re-registered. -- When background processing is ready, restore the `JobsModule` import in - `apps/bff/src/app.module.ts` and replace the placeholder logic in - `ReconcileProcessor.process` with the real reconciliation implementation. - -> **Note**: If additional queues or HTTP routes reference these modules, make -> sure they fail fast with a `501 Not Implemented` response or similar logging so -> that downstream systems have clear telemetry while the modules are disabled. diff --git a/docs/operations/external-dependencies.md b/docs/operations/external-dependencies.md new file mode 100644 index 00000000..cd981adc --- /dev/null +++ b/docs/operations/external-dependencies.md @@ -0,0 +1,325 @@ +# External Dependencies Runbook + +This document covers health checking, monitoring, and troubleshooting for external systems integrated with the Customer Portal. 
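The health checks this runbook covers share one shape: run a probe, report pass or fail, and continue to the next system. The sketch below illustrates that shape under stated assumptions. `run_check`, `redis_ping`, and `check_dependencies` are hypothetical helper names, not part of the BFF, and the probes (a Redis `PING`, a PostgreSQL `SELECT 1`, and the BFF `/health` endpoint on `localhost:4000`) mirror the checks described in this runbook.

```bash
# Run one named probe, print OK or FAIL, and return non-zero on failure.
run_check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'OK    %s\n' "$name"
  else
    printf 'FAIL  %s\n' "$name"
    return 1
  fi
}

# Redis is healthy only if PING answers PONG.
redis_ping() {
  [ "$(redis-cli PING 2>/dev/null)" = "PONG" ]
}

# Run every probe, keep going after a failure, and
# return non-zero if any probe failed.
check_dependencies() {
  local failed=0
  run_check "redis" redis_ping || failed=1
  run_check "postgresql" psql "$DATABASE_URL" -c "SELECT 1" || failed=1
  run_check "bff-health" curl -fsS http://localhost:4000/health || failed=1
  return "$failed"
}
```

Calling `check_dependencies` prints one line per system, and its exit status can drive an alert. Continuing after a failed probe, rather than stopping at the first one, shows the full picture in a single run.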
+ +--- + +## System Overview + +| System | Purpose | Integration | Health Check | +| ---------------------- | -------------------------------- | -------------------------- | --------------- | +| **Salesforce** | CRM, Orders, Catalog | REST API + Platform Events | JWT auth test | +| **WHMCS** | Billing, Payments, Subscriptions | REST API | API action test | +| **Freebit** | SIM Management | REST API | OEM auth test | +| **SFTP (fs.mvno.net)** | Call/SMS Records | SFTP | Connection test | +| **Redis** | Cache, Sessions, Queues | Direct connection | PING command | +| **PostgreSQL** | User data, Mappings | Direct connection | Query test | + +--- + +## Salesforce + +### Configuration + +| Variable | Description | +| ---------------------------- | ------------------------------------------------------- | +| `SF_LOGIN_URL` | Login URL (login.salesforce.com or test.salesforce.com) | +| `SF_CLIENT_ID` | Connected App Consumer Key | +| `SF_USERNAME` | Integration user username | +| `SF_PRIVATE_KEY_PATH` | Path to JWT private key | +| `SF_EVENTS_ENABLED` | Enable Platform Event subscription | +| `SF_PROVISION_EVENT_CHANNEL` | Platform Event channel for provisioning | +| `PORTAL_PRICEBOOK_ID` | Salesforce Pricebook ID for catalog | + +### Health Check + +```bash +# Check Salesforce connectivity via BFF health endpoint +curl http://localhost:4000/health | jq '.' 
+ +# Test JWT authentication manually +# The BFF authenticates automatically; check logs for auth errors +grep "Salesforce" /var/log/bff/combined.log | tail -20 +``` + +### Common Issues + +**JWT Authentication Failure** + +- Verify private key file exists and is readable +- Check Connected App settings in Salesforce +- Ensure integration user is pre-authorized for Connected App +- Verify `SF_USERNAME` matches the user assigned to Connected App + +**Platform Events Not Receiving** + +- Check `SF_EVENTS_ENABLED=true` +- Verify Platform Event permissions for integration user +- Check Redis for replay ID: `redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e"` +- Set `SF_EVENTS_REPLAY=ALL` temporarily to catch up on missed events + +**API Limits** + +- Salesforce has daily API call limits +- Monitor usage in Salesforce Setup > API Usage +- Consider caching frequently accessed data + +### Expected Response Times + +| Operation | Expected | Alert Threshold | +| -------------- | --------- | --------------- | +| Query | <500ms | >2s | +| Update | <1s | >3s | +| Platform Event | Real-time | >5s delay | + +--- + +## WHMCS + +### Configuration + +| Variable | Description | +| -------------------------------- | ----------------------------------- | +| `WHMCS_API_URL` | WHMCS API endpoint URL | +| `WHMCS_API_IDENTIFIER` | API credentials identifier | +| `WHMCS_API_SECRET` | API credentials secret | +| `WHMCS_CUSTOMER_NUMBER_FIELD_ID` | Custom field ID for Customer Number | + +### Health Check + +```bash +# Test WHMCS API directly +curl -X POST "$WHMCS_API_URL" \ + -d "identifier=$WHMCS_API_IDENTIFIER" \ + -d "secret=$WHMCS_API_SECRET" \ + -d "action=GetClients" \ + -d "responsetype=json" \ + -d "limitnum=1" + +# Should return: {"result":"success","totalresults":...} +``` + +### Common Issues + +**Authentication Failure** + +- Verify API credentials in WHMCS Admin > Setup > Staff Management > API Credentials +- Check IP whitelist settings (if configured) +- Ensure API 
credentials have required permissions + +**Rate Limiting** + +- WHMCS may rate limit excessive requests +- Check for 429 responses in logs +- Implement request queuing if needed + +**Field Mapping Issues** + +- Payment method fields may use different names between WHMCS versions +- Check [WHMCS Troubleshooting](../integrations/whmcs/troubleshooting.md) for field mapping + +### Expected Response Times + +| Operation | Expected | Alert Threshold | +| ----------- | -------- | --------------- | +| GetInvoices | <500ms | >2s | +| AddOrder | <1s | >3s | +| AcceptOrder | <1s | >3s | +| SSO Token | <500ms | >2s | + +--- + +## Freebit + +### Configuration + +| Variable | Description | +| ------------------ | ---------------------- | +| `FREEBIT_BASE_URL` | Freebit API base URL | +| `FREEBIT_OEM_ID` | OEM identifier | +| `FREEBIT_OEM_KEY` | OEM authentication key | +| `FREEBIT_TIMEOUT` | Request timeout (ms) | + +### Health Check + +```bash +# Check Freebit OEM authentication +# The BFF handles auth automatically; check logs for auth errors +grep "Freebit" /var/log/bff/combined.log | tail -20 + +# Check for auth token in cache +redis-cli GET "freebit:auth:token" +``` + +### Common Issues + +**OEM Authentication Failure** + +- Verify `FREEBIT_OEM_ID` and `FREEBIT_OEM_KEY` +- Check Freebit API endpoint accessibility +- Auth tokens are cached; clear cache if credentials changed + +**SIM Operations Failing** + +- Verify SIM account identifier (phone number) format +- Check 30-minute operation gap requirements +- See [Freebit SIM Management](../integrations/sim/freebit.md) for operation constraints + +**Network Type Changes Delayed** + +- Network type changes are queued with 30-minute delay +- Check BullMQ queue for pending jobs + +### Expected Response Times + +| Operation | Expected | Alert Threshold | +| ------------- | -------- | --------------- | +| Auth (cached) | <100ms | >500ms | +| GetDetail | <1s | >3s | +| Plan Change | <2s | >5s | +| Top-up | <2s | >5s | + +--- + +## 
SFTP (fs.mvno.net) + +### Configuration + +| Variable | Description | +| ----------------------- | ----------------------- | +| `SFTP_HOST` | SFTP server hostname | +| `SFTP_PORT` | SFTP port (default: 22) | +| `SFTP_USERNAME` | SFTP username | +| `SFTP_PRIVATE_KEY_PATH` | Path to SSH private key | + +### Health Check + +```bash +# Test SFTP connectivity +sftp -i $SFTP_PRIVATE_KEY_PATH $SFTP_USERNAME@$SFTP_HOST << EOF +ls +exit +EOF +``` + +### Common Issues + +**Connection Refused** + +- Verify SFTP server is accessible +- Check firewall rules +- Verify SSH key fingerprint + +**Authentication Failure** + +- Verify SSH private key is correct +- Check key permissions (should be 600) +- Ensure public key is authorized on SFTP server + +**Files Not Found** + +- Call/SMS records are available 2 months behind current date +- File naming: `PASI_talk-detail-YYYYMM.csv`, `PASI_sms-detail-YYYYMM.csv` + +### Data Availability + +| Record Type | Availability | File Pattern | +| ------------ | --------------- | ----------------------------- | +| Call Details | 2 months behind | `PASI_talk-detail-YYYYMM.csv` | +| SMS Details | 2 months behind | `PASI_sms-detail-YYYYMM.csv` | + +--- + +## Credential Rotation + +### Salesforce JWT Key Rotation + +1. Generate new key pair +2. Upload new public key to Connected App +3. Update `SF_PRIVATE_KEY_PATH` or `SF_PRIVATE_KEY_BASE64` +4. Deploy and verify authentication +5. Remove old key from Connected App after verification + +### WHMCS API Credentials Rotation + +1. Create new API credentials in WHMCS Admin +2. Update `WHMCS_API_IDENTIFIER` and `WHMCS_API_SECRET` +3. Deploy and verify API calls work +4. Disable old API credentials + +### Freebit Key Rotation + +1. Request new OEM key from Freebit +2. Update `FREEBIT_OEM_KEY` +3. Clear cached auth token: `redis-cli DEL "freebit:auth:token"` +4. Deploy and verify authentication + +### SSH Key Rotation (SFTP) + +1. Generate new SSH key pair +2. Provide public key to SFTP administrator +3. 
Wait for key to be authorized +4. Update `SFTP_PRIVATE_KEY_PATH` +5. Test connectivity +6. Request old key removal from SFTP server + +--- + +## Monitoring Recommendations + +### Alerting Thresholds + +| System | Metric | Warning | Critical | +| ---------- | ------------- | ------- | -------- | +| Salesforce | Response time | >2s | >5s | +| Salesforce | Error rate | >1% | >5% | +| WHMCS | Response time | >2s | >5s | +| WHMCS | Error rate | >1% | >5% | +| Freebit | Response time | >3s | >10s | +| Redis | Response time | >100ms | >500ms | +| PostgreSQL | Response time | >500ms | >2s | + +### Key Metrics to Monitor + +- External API response times +- Error rates per integration +- Authentication success/failure rates +- Cache hit rates +- Queue depths (for async operations) + +### Health Check Schedule + +| System | Check Frequency | Method | +| ---------- | ---------------- | ------------------ | +| Salesforce | Every 5 minutes | Query test | +| WHMCS | Every 5 minutes | GetClients call | +| Freebit | Every 15 minutes | Auth token refresh | +| Redis | Every 1 minute | PING | +| PostgreSQL | Every 1 minute | SELECT 1 | +| SFTP | Every 1 hour | Connection test | + +--- + +## Fallback Behaviors + +| System Down | User Impact | Fallback | +| ----------- | ----------------------- | ------------------------------------ | +| Salesforce | No orders, no catalog | Show cached catalog, queue orders | +| WHMCS | No billing, no payments | Show cached invoices, block checkout | +| Freebit | No SIM management | Show cached data, disable actions | +| Redis | Slow performance | Direct API calls (no cache) | +| PostgreSQL | Portal unusable | Display maintenance message | + +--- + +## Related Documents + +- [Incident Response](./incident-response.md) +- [Provisioning Runbook](./provisioning-runbook.md) +- [Salesforce Requirements](../integrations/salesforce/requirements.md) +- [WHMCS Troubleshooting](../integrations/whmcs/troubleshooting.md) +- [Freebit SIM 
Management](../integrations/sim/freebit.md) + +--- + +**Last Updated:** December 2025 diff --git a/docs/operations/external-processes.md b/docs/operations/external-processes.md new file mode 100644 index 00000000..682a7972 --- /dev/null +++ b/docs/operations/external-processes.md @@ -0,0 +1,325 @@ +# External Processes and Team Handoffs + +This document describes operational processes that occur outside the Customer Portal but are necessary for system operation and customer service. + +--- + +## Process Ownership Matrix + +| Process | Owner | Trigger | Dependencies | Documentation | +| ----------------------------- | ----------------- | ------------------------- | --------------------------- | ----------------------------------------------- | +| Salesforce Account Creation | Sales Team | Customer inquiry | Salesforce Admin access | Salesforce training docs | +| Customer Number Assignment | Sales Team | New customer onboarding | SF Account created | Sales procedures | +| CS Order Approval | CS Team | Order in "Pending Review" | Salesforce access | CS training docs | +| Internet Eligibility Check | CS Team | Eligibility request Case | Customer address info | CS procedures | +| WHMCS Product Setup | DevOps | New product launch | WHMCS Admin access | This document | +| Salesforce Flow Maintenance | SF Admin | Feature changes | SF Admin + Dev access | SF Flow documentation | +| Freebit Account Configuration | Partner Relations | New SIM products | Freebit partner credentials | Freebit contract docs | +| SSL Certificate Renewal | DevOps | Expiration alerts | Certificate provider access | This document | +| Database Backups | DevOps | Scheduled / On-demand | DB Admin access | [Database Operations](./database-operations.md) | + +--- + +## Customer Onboarding Flow + +### Pre-Portal Setup (Sales Team) + +Before a customer can use the portal, Sales must complete these steps: + +1. 
**Create Salesforce Account** + - Create Account record with customer details + - Assign unique `SF_Account_No__c` (Customer Number) + - Set initial account status + +2. **Verify Customer Information** + - Confirm contact details + - Verify billing address + - Complete KYC requirements if applicable + +3. **Internet Eligibility (if applicable)** + - Submit eligibility check via portal OR + - Manually check eligibility and update Account fields: + - `Internet_Eligibility__c` + - `Internet_Eligibility_Status__c` + +### Handoff to Portal + +Once Sales completes setup, customer can: + +- Sign up using their Customer Number +- Link existing WHMCS account (if migrating) +- Place orders through the portal + +--- + +## Order Approval Flow + +### CS Review Process + +When an order is placed, CS must review and approve: + +**Order Review Checklist:** + +1. [ ] Verify customer identity matches Salesforce Account +2. [ ] Confirm product eligibility (Internet type matches eligibility) +3. [ ] Verify installation address is serviceable +4. [ ] Check for duplicate active services +5. [ ] Review any special instructions or notes + +**Approval Actions:** + +- Approve: Set Order `Status = Approved` + - Triggers provisioning workflow automatically +- Reject: Set Order `Status = Cancelled` + - Add rejection reason to Order notes + - Customer is notified via portal + +**SLA:** + +- Standard orders: Review within 2 business hours +- Priority orders: Review within 30 minutes + +### Escalation Triggers + +Escalate to supervisor if: + +- Customer disputes eligibility result +- Multiple orders from same account in short period +- Order value exceeds threshold +- Address verification fails + +--- + +## Internet Eligibility Process + +### Request Flow + +1. **Customer submits eligibility request** (Portal) + - Creates Salesforce Case (Type: Eligibility Check) + - Updates Account fields to "Pending" + - Creates/updates Opportunity (Stage: Introduction) + +2. 
**CS reviews request** (Salesforce) + - Verify address details + - Check service availability databases + - Determine eligibility type (Apartment 1G, Home 1G, etc.) + +3. **CS updates Salesforce** (Salesforce) + - Set `Internet_Eligibility__c` to result + - Set `Internet_Eligibility_Status__c = Checked` + - Update Opportunity stage (Ready or Void) + - Close the Case + +4. **Customer sees result** (Portal) + - Portal reads updated Account fields + - Catalog shows eligible products + +**SLA:** + +- Standard check: 24-48 business hours +- Express check: 4 business hours + +--- + +## Cancellation Request Process + +### Customer-Initiated Cancellation + +1. **Customer requests cancellation** (Portal) + - Creates Salesforce Case (Type: Cancellation Request) + - Finds linked Opportunity via `WHMCS_Service_ID__c` + - Updates Opportunity stage to "△Cancelling" + - Sets `ScheduledCancellationDateAndTime__c` + +2. **CS reviews request** (Salesforce) + - Verify customer authorization + - Check cancellation terms and fees + - Confirm scheduled date + +3. **CS processes cancellation** (WHMCS + Salesforce) + - Cancel service in WHMCS (if not automatic) + - Update Opportunity stage to "△Cancelled" + - Close the Case + +4. **Final billing** (WHMCS) + - Generate final invoice if applicable + - Process any prorated refunds + +### Cancellation Types + +| Type | Notice Period | Effective Date | +| -------- | ---------------------- | ---------------------- | +| Internet | 30 days | End of notice period | +| SIM | Immediate or scheduled | 1st of following month | +| VPN | Immediate | Same day | + +--- + +## Product Configuration + +### Adding New Products + +When launching new products, coordinate between teams: + +**1.
Salesforce Setup (SF Admin)** + +- Create Product2 record +- Set required fields: + - `Name`, `StockKeepingUnit` + - `WH_Product_ID__c` (WHMCS product ID) + - `Billing_Cycle__c` + - `Item_Class__c` (Service, Activation, Add-on) +- Add to portal Pricebook (`PORTAL_PRICEBOOK_ID`) + +**2. WHMCS Setup (DevOps/Billing)** + +- Create product in WHMCS Products/Services +- Configure pricing and billing cycle +- Set up any required custom fields +- Test product creation via API + +**3. Portal Verification (Development)** + +- Verify product appears in catalog +- Test checkout flow with new product +- Confirm provisioning works correctly + +**4. Documentation (All Teams)** + +- Update product documentation +- Add to [WHMCS Mapping Reference](../integrations/salesforce/whmcs-mapping.md) + +### Product Change Checklist + +- [ ] Salesforce Product2 updated +- [ ] WHMCS product updated +- [ ] Pricing synced between systems +- [ ] Portal cache cleared +- [ ] Tested in staging environment +- [ ] Documentation updated + +--- + +## Salesforce Flow Maintenance + +### Record-Triggered Flows + +The portal depends on these Salesforce Flows: + +| Flow | Trigger | Action | +| ----------------------- | ---------------------------------- | ------------------------------------ | +| Order Approval Flow | Order Status → Approved | Publish `OrderProvisionRequested__e` | +| Eligibility Update Flow | Account eligibility fields changed | (Optional) Notify customer | + +### Flow Change Procedure + +1. **Development** (SF Admin + Dev) + - Clone existing Flow for modification + - Test in Salesforce Sandbox + - Document changes + +2. **Deployment** (SF Admin) + - Schedule deployment during low-traffic period + - Notify development team + - Activate new Flow version + +3. **Verification** (Dev + QA) + - Test affected portal functionality + - Verify Platform Events are received + - Check BFF logs for any errors + +4.
**Rollback Plan** + - Keep previous Flow version available + - Document rollback procedure + - Have SF Admin available during deployment + +--- + +## SSL Certificate Management + +### Certificate Inventory + +| Domain | Provider | Expiration | Renewal Process | +| ------------------ | ------------- | ---------- | --------------- | +| portal.example.com | Let's Encrypt | Auto-renew | Automated | +| api.example.com | Let's Encrypt | Auto-renew | Automated | +| whmcs.example.com | [Provider] | [Date] | Manual | + +### Renewal Procedure + +**Automated (Let's Encrypt):** + +- Certbot runs automatically +- Monitor for renewal failures +- Alert if cert expires within 14 days + +**Manual:** + +1. Generate CSR +2. Submit to certificate provider +3. Complete domain verification +4. Download and install certificate +5. Restart affected services +6. Verify certificate in browser + +### Certificate Expiration Alerts + +- 30 days: Warning notification +- 14 days: Urgent notification +- 7 days: Critical alert + +--- + +## Credential and Access Management + +### Access Request Process + +| System | Request To | Approval By | Access Level Options | +| ---------- | ---------- | ----------- | --------------------- | +| Salesforce | SF Admin | Manager | Read-only, CS, Admin | +| WHMCS | DevOps | Manager | Staff, Admin | +| BFF/Portal | DevOps | Tech Lead | Developer, Operator | +| Database | DevOps | Tech Lead | Read-only, Read-write | + +### Offboarding Checklist + +When a team member leaves: + +- [ ] Revoke Salesforce access +- [ ] Revoke WHMCS access +- [ ] Remove from deployment systems +- [ ] Rotate any shared credentials they had access to +- [ ] Update on-call schedules +- [ ] Transfer ownership of documentation + +--- + +## Communication Channels + +### Team Contacts + +| Team | Channel | Escalation | +| ----------- | --------------------- | ------------- | +| Development | [Slack/Teams channel] | Tech Lead | +| CS Team | [Slack/Teams channel] | CS Manager | +| Sales Team | 
[Slack/Teams channel] | Sales Manager | +| DevOps | [Slack/Teams channel] | Ops Lead | +| SF Admin | [Email/Slack] | IT Manager | + +### Incident Communication + +See [Incident Response Runbook](./incident-response.md) for incident communication procedures. + +--- + +## Related Documents + +- [Incident Response](./incident-response.md) +- [Provisioning Runbook](./provisioning-runbook.md) +- [Salesforce Requirements](../integrations/salesforce/requirements.md) +- [WHMCS Mapping Reference](../integrations/salesforce/whmcs-mapping.md) +- [Complete Operations Guide](../how-it-works/COMPLETE-GUIDE.md) + +--- + +**Last Updated:** December 2025 diff --git a/docs/operations/incident-response.md b/docs/operations/incident-response.md new file mode 100644 index 00000000..80733301 --- /dev/null +++ b/docs/operations/incident-response.md @@ -0,0 +1,327 @@ +# Incident Response Runbook + +This document defines procedures for responding to production incidents affecting the Customer Portal. + +--- + +## Severity Classification + +| Severity | Definition | Response Time | Examples | +| ----------------- | -------------------------------------- | ------------- | ----------------------------------------------------------------- | +| **P1 - Critical** | Complete service outage or data loss | 15 minutes | Portal unreachable, database corruption, security breach | +| **P2 - High** | Major feature unavailable | 1 hour | Order provisioning failing, payment processing down | +| **P3 - Medium** | Degraded performance or partial outage | 4 hours | Slow response times, intermittent errors, single integration down | +| **P4 - Low** | Minor issue, workaround available | 24 hours | UI glitches, non-critical feature bugs | + +--- + +## Escalation Matrix + +| Level | Scope | Contact | When to Escalate | +| ------ | ---------------- | ------------------- | ---------------------------------------------------- | +| **L1** | Initial Response | On-call engineer | All incidents | +| **L2** | Technical 
Lead | Development lead | P1/P2 not resolved in 30 minutes | +| **L3** | Management | Engineering manager | P1 not resolved in 1 hour, customer impact | +| **L4** | External | Vendor support | External system failure (Salesforce, WHMCS, Freebit) | + +### On-Call Contacts + +> **Note**: Update this section with actual contact information for your team. + +| Role | Contact Method | Backup | +| ----------------- | ----------------- | ------- | +| Primary On-Call | [Slack/PagerDuty] | [Phone] | +| Secondary On-Call | [Slack/PagerDuty] | [Phone] | +| Engineering Lead | [Slack/Email] | [Phone] | + +--- + +## Common Incident Scenarios + +### 1. Salesforce Platform Events Not Receiving + +**Symptoms:** + +- Orders stuck in "Pending Review" status +- No provisioning activity in logs +- `sf:pe:replay:*` Redis keys not updating + +**Diagnosis:** + +```bash +# Check BFF logs for Platform Event subscription +grep "Platform Event" /var/log/bff/combined.log | tail -50 + +# Check Redis replay ID +redis-cli GET "sf:pe:replay:/event/OrderProvisionRequested__e" + +# Verify Salesforce connectivity +curl -X GET http://localhost:4000/health +``` + +**Resolution:** + +1. Verify `SF_EVENTS_ENABLED=true` in environment +2. Check Salesforce Connected App JWT authentication +3. Verify Platform Event permissions for integration user +4. Set `SF_EVENTS_REPLAY=ALL` temporarily to replay missed events +5. Restart BFF to re-establish subscription + +**Escalation:** If unresolved in 30 minutes, contact Salesforce admin. + +--- + +### 2. WHMCS API Unavailable + +**Symptoms:** + +- Billing pages showing "service unavailable" +- Provisioning failing with WHMCS errors +- Payment method checks failing + +**Diagnosis:** + +```bash +# Check WHMCS connectivity from BFF +curl -X POST $WHMCS_API_URL -d "action=GetClients&responsetype=json" + +# Check BFF logs for WHMCS errors +grep "WHMCS" /var/log/bff/error.log | tail -20 +``` + +**Resolution:** + +1. Verify WHMCS server is accessible +2. 
Check WHMCS API credentials (`WHMCS_API_IDENTIFIER`, `WHMCS_API_SECRET`) +3. Check WHMCS server load and resource usage +4. Contact WHMCS hosting provider if server is down + +**Escalation:** If WHMCS server is down, contact hosting provider. + +--- + +### 3. Redis Connection Failures + +**Symptoms:** + +- Authentication failing +- Cache misses on every request +- Rate limiting not working +- SSE connections dropping + +**Diagnosis:** + +```bash +# Check Redis connectivity +redis-cli ping + +# Check Redis memory usage +redis-cli INFO memory + +# Check BFF health endpoint +curl http://localhost:4000/health | jq '.checks.cache' +``` + +**Resolution:** + +1. Verify Redis URL in environment (`REDIS_URL`) +2. Check Redis server memory usage and eviction policy +3. Restart Redis if memory is exhausted +4. Clear stale keys if necessary: `redis-cli FLUSHDB` (caution: this deletes every key in the selected Redis database, not only cache entries; any sessions, queue jobs, or Platform Event replay IDs stored there are lost as well) + +**Impact Note:** Redis failure causes: + +- Token blacklist checks to fail (security risk if `AUTH_BLACKLIST_FAIL_CLOSED=false`) +- All cached data to be re-fetched from source systems +- Rate limiting to stop working + +--- + +### 4. Database Connection Issues + +**Symptoms:** + +- All API requests failing with 500 errors +- Health check shows database as "fail" +- Prisma connection errors in logs + +**Diagnosis:** + +```bash +# Check database connectivity +psql "$DATABASE_URL" -c "SELECT 1" + +# Check connection count +psql "$DATABASE_URL" -c "SELECT count(*) FROM pg_stat_activity" + +# Check BFF health endpoint +curl http://localhost:4000/health | jq '.checks.database' +``` + +**Resolution:** + +1. Verify PostgreSQL server is running +2. Check connection pool limits (Prisma `connection_limit`) +3. Look for long-running queries and kill if necessary +4. Restart database if unresponsive + +**Escalation:** If database is corrupted, see [Database Operations Runbook](./database-operations.md). + +--- + +### 5.
High Error Rate / Performance Degradation + +**Symptoms:** + +- Increased response times (>2s average) +- Error rate above 1% +- Customer complaints + +**Diagnosis:** + +```bash +# Check BFF process resource usage +top -p $(pgrep -f "node.*bff") + +# Check recent error logs +tail -100 /var/log/bff/error.log + +# Check external API response times in logs +grep "duration" /var/log/bff/combined.log | tail -20 +``` + +**Resolution:** + +1. Identify which external API is slow (Salesforce, WHMCS, Freebit) +2. Check for traffic spikes or unusual patterns +3. Scale horizontally if CPU/memory constrained +4. Enable circuit breakers or increase timeouts temporarily + +--- + +### 6. Security Incident + +**Symptoms:** + +- Unusual login patterns +- Suspected unauthorized access +- Data exfiltration alerts + +**Immediate Actions:** + +1. **DO NOT** modify logs or evidence +2. Notify security team immediately +3. Consider isolating affected systems +4. Document all observations with timestamps + +**Escalation:** P1 - Immediately escalate to engineering lead and management. + +--- + +## Incident Response Workflow + +``` +1. DETECT + ├── Automated alert received + ├── Customer report + └── Internal discovery + +2. ASSESS + ├── Determine severity (P1-P4) + ├── Identify affected systems + └── Estimate customer impact + +3. RESPOND + ├── Follow relevant scenario playbook + ├── Communicate status + └── Escalate if needed + +4. RESOLVE + ├── Implement fix + ├── Verify resolution + └── Monitor for recurrence + +5. 
REVIEW + ├── Document timeline + ├── Identify root cause + └── Create action items +``` + +--- + +## Communication Templates + +### Internal Status Update + +``` +INCIDENT UPDATE - [P1/P2/P3/P4] - [Brief Description] + +Status: [Investigating/Identified/Monitoring/Resolved] +Impact: [Description of customer impact] +Started: [Time in UTC] +Last Update: [Time in UTC] + +Current Actions: +- [Action 1] +- [Action 2] + +Next Update: [Time] +``` + +### Customer Communication (P1/P2 only) + +``` +We are currently experiencing issues with [service/feature]. + +What's happening: [Brief, non-technical description] +Impact: [What customers may experience] +Status: Our team is actively working to resolve this issue. + +We will provide updates every [30 minutes/1 hour]. + +We apologize for any inconvenience. +``` + +--- + +## Post-Incident Review + +After every P1 or P2 incident, conduct a post-incident review within 3 business days. + +### Review Template + +1. **Incident Summary** + - What happened? + - When did it start/end? + - Who was affected? + +2. **Timeline** + - Detection time + - Response time + - Resolution time + - Key milestones + +3. **Root Cause Analysis** + - What was the direct cause? + - What were contributing factors? + - Why wasn't this prevented? + +4. 
**Action Items** + - Immediate fixes applied + - Preventive measures needed + - Monitoring improvements + - Documentation updates + +--- + +## Related Documents + +- [Provisioning Runbook](./provisioning-runbook.md) +- [Database Operations](./database-operations.md) +- [External Dependencies](./external-dependencies.md) +- [Queue Management](./queue-management.md) +- [Logging Guide](./logging.md) + +--- + +**Last Updated:** December 2025 diff --git a/docs/operations/provisioning-runbook.md b/docs/operations/provisioning-runbook.md index cf9f031c..7dd9fb02 100644 --- a/docs/operations/provisioning-runbook.md +++ b/docs/operations/provisioning-runbook.md @@ -73,3 +73,79 @@ Portal does not auto-retry jobs. Network/5xx/timeouts will mark the Order Failed - `Activation_Error_Code__c` (e.g., 429, 503, ETIMEOUT) - `Activation_Error_Message__c` (short reason) + +--- + +## Escalation Paths + +| Condition | Escalation | Contact | +| ---------------------------- | --------------------- | --------------------------------------------------- | +| Issue persists >30 minutes | Salesforce admin | Check Flow configuration, Platform Event publishing | +| WHMCS returns 5xx >5 times | WHMCS hosting support | Server may be overloaded or down | +| Event replay doesn't recover | Development team | May need code investigation | +| Product mapping errors | Salesforce admin | Add missing `WH_Product_ID__c` values | +| Payment method issues | Customer support | Guide customer to add payment method in WHMCS | + +For general incident response procedures, see [Incident Response Runbook](./incident-response.md). 
+ +--- + +## SLA Expectations + +| Metric | Target | Warning | Critical | +| ----------------------- | ---------- | ----------- | ----------- | +| Provisioning completion | <5 seconds | >10 seconds | >30 seconds | +| Event processing delay | <1 second | >5 seconds | >30 seconds | +| Error rate | <1% | >1% | >5% | + +### Performance Monitoring + +- Monitor provisioning duration in logs (from "Platform Event enqueued" to "Activated") +- Track WHMCS API response times +- Alert on Salesforce update failures + +--- + +## Manual Intervention Checklist + +When automated retry fails, follow these steps: + +1. **Check Salesforce Order** + - Open the Order in Salesforce + - Review `Activation_Status__c`, `Activation_Error_Code__c`, `Activation_Error_Message__c` + - Check if `WHMCS_Order_ID__c` was partially set + +2. **Verify Customer Data** + - Confirm customer has valid WHMCS payment method via `GetPayMethods` + - Check `id_mappings` table for correct portal-WHMCS-SF linkage + +3. **Validate Product Mappings** + - For each OrderItem, verify `Product2.WH_Product_ID__c` is set + - Verify `Product2.Billing_Cycle__c` matches WHMCS expectations + +4. **Check BFF Logs** + - Search for the Salesforce Order ID in logs + - Identify the specific step that failed + - Look for external API errors (WHMCS, Salesforce) + +5. **Manual Recovery** + - If WHMCS order was created but SF not updated: + - Manually update `WHMCS_Order_ID__c` and `Activation_Status__c` in Salesforce + - If WHMCS order was not created: + - Fix the root cause (payment method, mapping) + - Retry via Salesforce (set `Activation_Status__c = Activating`) + +6. 
**Verify Resolution** + - Confirm Salesforce Order shows `Activated` + - Confirm WHMCS has the order and services + - Confirm customer can see their subscription in the portal + +--- + +## Related Documents + +- [Incident Response](./incident-response.md) +- [Queue Management](./queue-management.md) +- [External Dependencies](./external-dependencies.md) +- [Salesforce Requirements](../integrations/salesforce/requirements.md) +- [WHMCS Mapping Reference](../integrations/salesforce/whmcs-mapping.md) diff --git a/docs/operations/queue-management.md b/docs/operations/queue-management.md new file mode 100644 index 00000000..19cb0beb --- /dev/null +++ b/docs/operations/queue-management.md @@ -0,0 +1,361 @@ +# Queue Management Runbook + +This document covers monitoring and management of BullMQ job queues used by the Customer Portal BFF. + +--- + +## Overview + +The BFF uses BullMQ (backed by Redis) for asynchronous job processing: + +| Queue | Purpose | Processor Location | +| -------------------- | --------------------------------------------- | ---------------------------------------------------- | +| `order-provisioning` | Order fulfillment after CS approval | `apps/bff/src/modules/orders/queue/` | +| `sim-management` | Delayed SIM operations (network type changes) | `apps/bff/src/modules/subscriptions/sim-management/` | + +--- + +## Queue Configuration + +### Environment Variables + +| Variable | Description | Default | +| ------------------------ | ---------------------------------- | -------- | +| `REDIS_URL` | Redis connection for queues | Required | +| `QUEUE_DEFAULT_ATTEMPTS` | Default retry attempts | 3 | +| `QUEUE_BACKOFF_DELAY` | Backoff delay between retries (ms) | 5000 | + +### Queue Options + +```typescript +// Default queue configuration +{ + defaultJobOptions: { + attempts: 3, + backoff: { + type: 'exponential', + delay: 5000, + }, + removeOnComplete: 100, // Keep last 100 completed jobs + removeOnFail: 500, // Keep last 500 failed jobs + } +} +``` + +--- 
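As a rough illustration of the retry timing implied by the queue options above, the sketch below computes the wait before each retry. It assumes BullMQ's built-in `exponential` strategy, which multiplies the base delay by 2^(attemptsMade - 1). The `retryDelays` helper is hypothetical and exists only for this example; it is not part of the BFF or of BullMQ.

```typescript
// Hypothetical helper: waits (in ms) before each retry under exponential backoff.
// `attempts` counts the initial run, so there are attempts - 1 retries.
function retryDelays(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    // Assumed formula: 2^(attemptsMade - 1) * base delay
    delays.push(Math.round(Math.pow(2, attemptsMade - 1) * baseDelayMs));
  }
  return delays;
}

// Values taken from the default configuration: attempts = 3, delay = 5000 ms
const schedule = retryDelays(3, 5000);
console.log(schedule); // [ 5000, 10000 ]
```

With three attempts there are only two retries, so a job that keeps failing is marked failed roughly 15 seconds of backoff (plus processing time) after its first run.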
+ +## Monitoring + +### Check Queue Status + +```bash +# Connect to Redis and list queue keys (SCAN-based, so it does not block Redis the way KEYS can) +redis-cli --scan --pattern "bull:*" + +# Check specific queue length +redis-cli LLEN "bull:order-provisioning:wait" +redis-cli LLEN "bull:order-provisioning:active" +redis-cli ZCARD "bull:order-provisioning:delayed" +redis-cli ZCARD "bull:order-provisioning:failed" +``` + +### Queue Key Structure + +| Key Pattern | Description | +| ------------------------ | ----------------------------------- | +| `bull:{queue}:wait` | Jobs waiting to be processed | +| `bull:{queue}:active` | Jobs currently being processed | +| `bull:{queue}:delayed` | Jobs scheduled for future execution | +| `bull:{queue}:completed` | Recently completed jobs | +| `bull:{queue}:failed` | Failed jobs | + +### Health Metrics + +| Metric | Warning | Critical | Action | +| ---------------- | ------- | -------- | --------------------------- | +| Wait queue depth | >10 | >50 | Check processor status | +| Failed job count | >5 | >20 | Investigate failures | +| Processing time | >30s | >60s | Check external dependencies | + +--- + +## Order Provisioning Queue + +### Purpose + +Processes orders after CS approval via Salesforce Platform Events. + +### Flow + +``` +Salesforce Platform Event (OrderProvisionRequested__e) + ↓ +Event Subscriber receives event + ↓ +Job enqueued to 'order-provisioning' queue + ↓ +Processor executes fulfillment workflow + ↓ +Order created in WHMCS + Salesforce updated +``` + +### Job Data Structure + +```typescript +{ + sfOrderId: "8014x000000ABCDXYZ", // Salesforce Order ID + idempotencyKey: "8014x...-1703123456789", + eventPayload: { ... 
} // Original Platform Event data +} +``` + +### Common Failure Reasons + +| Error | Cause | Resolution | +| ------------------------ | ------------------------------ | ------------------------------------------------ | +| `PAYMENT_METHOD_MISSING` | Customer has no payment method | Customer must add payment method in WHMCS | +| `ORDER_NOT_FOUND` | Salesforce Order doesn't exist | Check Order ID, verify not deleted | +| `MAPPING_ERROR` | Product mapping missing | Add `WH_Product_ID__c` to Product2 in Salesforce | +| `WHMCS_ERROR` | WHMCS API failure | Check WHMCS connectivity and logs | + +### Retry Behavior + +- **Attempts**: 3 total (1 initial + 2 retries) +- **Backoff**: Exponential (5s before the first retry, 10s before the second) +- **On Final Failure**: Salesforce Order updated with error details + +--- + +## SIM Management Queue + +### Purpose + +Handles delayed SIM operations, particularly network type changes that require a 30-minute gap. + +### Job Types + +| Job Type | Delay | Description | +| ------------------- | ---------- | ----------------------------- | +| `networkTypeChange` | 30 minutes | Change between 4G/5G networks | + +### Job Data Structure + +```typescript +{ + subscriptionId: 29951, + simAccount: "08077052946", + operation: "networkTypeChange", + params: { + networkType: "5G" + }, + scheduledAt: "2024-01-15T10:30:00Z" +} +``` + +### Common Failure Reasons + +| Error | Cause | Resolution | +| --------------------- | -------------------------------- | --------------------------------------- | +| `FREEBIT_AUTH_FAILED` | Freebit authentication error | Check OEM credentials | +| `ACCOUNT_NOT_FOUND` | SIM account not found in Freebit | Verify account identifier | +| `OPERATION_CONFLICT` | Another operation pending | Wait for previous operation to complete | + +--- + +## Failed Job Investigation + +### View Failed Jobs + +```bash +# List failed jobs (using Redis CLI) +redis-cli ZRANGE "bull:order-provisioning:failed" 0 -1 + +# Get job details +redis-cli HGETALL 
"bull:order-provisioning:{job-id}" +``` + +### Common Investigation Steps + +1. **Check job data**: Identify the order/subscription involved +2. **Check error message**: Look for specific failure reason +3. **Check external system**: Verify Salesforce/WHMCS/Freebit status +4. **Check logs**: Search BFF logs for job ID or order ID +5. **Determine if retryable**: Some errors are permanent (missing mapping), others are transient (network timeout) + +### Log Search + +```bash +# Search logs for specific order +grep "8014x000000ABCDXYZ" /var/log/bff/combined.log + +# Search for queue processing errors +grep "provisioning" /var/log/bff/error.log | tail -50 +``` + +--- + +## Manual Retry Procedures + +### Retry a Single Failed Job + +```typescript +// Using BullMQ API in Node.js (ES module, so top-level await is available) +import { Queue } from "bullmq"; + +// Placeholder connection values: point these at the Redis instance the BFF uses +const redisConnection = { host: "localhost", port: 6379 }; + +const queue = new Queue("order-provisioning", { connection: redisConnection }); +const job = await queue.getJob("job-id"); // replace "job-id" with the failed job's ID +if (!job) throw new Error("Job not found"); +await job.retry(); // moves the job from failed back to waiting +await queue.close(); +``` + +### Retry All Failed Jobs + +```bash +# Move all failed jobs back to waiting +redis-cli ZRANGEBYSCORE "bull:order-provisioning:failed" -inf +inf | while read -r jobId; do + redis-cli LPUSH "bull:order-provisioning:wait" "$jobId" + redis-cli ZREM "bull:order-provisioning:failed" "$jobId" +done +``` + +> **Warning**: Only retry jobs after fixing the root cause. Retrying without fixing will cause the same failure. + +### Retry via Salesforce (Recommended for Provisioning) + +For order provisioning, the recommended retry method is through Salesforce: + +1. Open the Order in Salesforce +2. Clear error fields (`Activation_Error__c`, `Activation_Error_DateTime__c`) +3. Set `Activation_Status__c` back to "Activating" +4. The Record-Triggered Flow will publish a new Platform Event + +This approach ensures proper idempotency tracking and audit trail. + +--- + +## Clearing Stuck Jobs + +### Clear All Jobs from a Queue + +> **Warning**: This removes all jobs including pending work. Use only in emergencies. 
+ +```bash +# Clear all queue data +redis-cli DEL \ + "bull:order-provisioning:wait" \ + "bull:order-provisioning:active" \ + "bull:order-provisioning:delayed" \ + "bull:order-provisioning:completed" \ + "bull:order-provisioning:failed" +``` + +### Clear Old Completed/Failed Jobs + +```bash +# Remove jobs older than 7 days from completed +redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:completed" -inf $(date -d '7 days ago' +%s000) + +# Remove jobs older than 30 days from failed +redis-cli ZREMRANGEBYSCORE "bull:order-provisioning:failed" -inf $(date -d '30 days ago' +%s000) +``` + +--- + +## Queue Backlog Handling + +### Symptoms of Backlog + +- Wait queue depth increasing +- Jobs not being processed +- Customer orders stuck in "Activating" status + +### Diagnosis + +1. **Check processor is running** + + ```bash + grep "BullMQ" /var/log/bff/combined.log | tail -20 + ``` + +2. **Check Redis connectivity** + + ```bash + redis-cli PING + ``` + +3. **Check for blocked jobs** + + ```bash + redis-cli LLEN "bull:order-provisioning:active" + # If active > 0 for extended time, jobs may be stuck + ``` + +4. **Check external dependencies** + - Salesforce API + - WHMCS API + +### Resolution + +1. **Restart BFF** to reconnect queue workers +2. **Clear stuck active jobs** if processor crashed mid-job +3. **Scale horizontally** if queue depth is due to high volume +4. 
**Fix root cause** if jobs are failing repeatedly + +--- + +## Alerting Configuration + +### Recommended Alerts + +| Alert | Condition | Severity | +| ---------------------- | ------------------------------------------------ | -------- | +| Queue Backlog | Wait queue > 10 for > 5 minutes | Warning | +| Queue Backlog Critical | Wait queue > 50 | Critical | +| Failed Jobs Spike | > 5 failures in 15 minutes | Warning | +| Processor Down | No job processed in 10 minutes with jobs waiting | Critical | +| Job Timeout | Job active for > 5 minutes | Warning | + +### Monitoring Queries + +```bash +# Check queue depths (for monitoring script) +WAIT=$(redis-cli LLEN "bull:order-provisioning:wait") +ACTIVE=$(redis-cli LLEN "bull:order-provisioning:active") +FAILED=$(redis-cli ZCARD "bull:order-provisioning:failed") + +echo "Wait: $WAIT, Active: $ACTIVE, Failed: $FAILED" +``` + +--- + +## Best Practices + +### Job Design + +- Include sufficient context in job data for debugging +- Use idempotency keys to prevent duplicate processing +- Keep job payloads small (< 10KB) + +### Error Handling + +- Distinguish between retryable and non-retryable errors +- Log sufficient context before throwing +- Update external systems with error status on final failure + +### Monitoring + +- Set up alerts for queue depth and failure rate +- Monitor job processing duration +- Track success/failure ratios over time + +--- + +## Related Documents + +- [Incident Response](./incident-response.md) +- [Provisioning Runbook](./provisioning-runbook.md) +- [External Dependencies](./external-dependencies.md) +- [SIM State Machine](../integrations/sim/state-machine.md) + +--- + +**Last Updated:** December 2025