Assist_Design/docs/how-it-works/system-overview.md
barsa a688121a16 Update Cache Documentation to Reflect Safety TTL Implementation
- Revised README and system overview documentation to clarify the introduction of a 12-hour safety TTL for cache invalidation alongside CDC events.
- Enhanced explanations regarding the caching strategy for services, emphasizing the importance of the safety TTL in maintaining data freshness and self-healing capabilities.
- Updated comments in the OrdersCacheService to align with the new caching approach, ensuring consistency across documentation and code.
2025-12-25 15:50:41 +09:00

# How the Portal Works (Overview)
Purpose: explain what the portal does, which systems own which data, and how freshness is managed.
## Core Pieces and Responsibilities
- Portal UI (Next.js) + BFF API (NestJS): handles all user traffic and calls external systems.
- Postgres: stores portal users and the cross-system mapping `user_id ↔ whmcs_client_id ↔ sf_account_id`.
- Redis cache: reduces load with a mix of **global** caches (e.g. product catalog) and **account-scoped** caches (e.g. eligibility) to avoid mixing customer data.
- WHMCS: system of record for billing (clients, addresses, invoices, payment methods, subscriptions).
- Salesforce: system of record for CRM (accounts/contacts), product catalog/pricebook, orders, and support cases.
- Freebit: SIM provisioning only, used during mobile/SIM order fulfillment.
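
The split between global and account-scoped caches can be illustrated with a small key builder. This is a minimal sketch, not the portal's actual key format: the `buildCacheKey` helper, the `namespace:scope:resource` layout, and the choice of `sf_account_id` as the scoping ID are all assumptions made for illustration.

```typescript
// Hypothetical key builder; the real Redis key layout is not shown in this document.
type CacheScope =
  | { kind: "global" }
  | { kind: "account"; sfAccountId: string };

function buildCacheKey(namespace: string, resource: string, scope: CacheScope): string {
  if (scope.kind === "global") {
    // Shared by every visitor, e.g. the product catalog.
    return `${namespace}:global:${resource}`;
  }
  if (scope.sfAccountId.trim() === "") {
    // Refuse to build an account key without an ID, so data cannot leak into a shared slot.
    throw new Error("account-scoped cache keys require a non-empty account id");
  }
  return `${namespace}:account:${scope.sfAccountId}:${resource}`;
}

// Example values only; the account ID below is made up.
const catalogKey = buildCacheKey("services", "catalog", { kind: "global" });
const eligibilityKey = buildCacheKey("services", "eligibility", {
  kind: "account",
  sfAccountId: "001XX0000012345",
});
```

Putting the account ID inside the key means two customers can never read each other's cached eligibility, while the catalog entry stays shared.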
## High-Level Data Flows
- Sign-up: portal verifies the customer number in Salesforce → creates a WHMCS client (billing account) → stores the portal user + mapping → updates Salesforce with portal status + WHMCS ID.
- Login/Linking: existing WHMCS users validate their WHMCS credentials; we create the portal user, map IDs, and mark the Salesforce account as portal-active.
- Services & Checkout: products/prices come from the Salesforce portal pricebook; eligibility is checked per account; we require a WHMCS payment method before allowing checkout.
- Orders: created in Salesforce with an address snapshot; Salesforce change events trigger fulfillment, which creates the matching WHMCS order and updates Salesforce statuses.
- Billing: invoices, payment methods, and subscriptions are read from WHMCS; secure SSO links are generated for paying invoices inside WHMCS.
- Support: cases are created/read directly in Salesforce with Origin = “Portal Website.”
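
The Services & Checkout rule above (per-account eligibility plus a WHMCS payment method on file) can be sketched as a pure gate function. The `CheckoutContext` shape, the `decideCheckout` name, and the reason codes are hypothetical; the document only states the two preconditions.

```typescript
// Hypothetical inputs gathered by the BFF before checkout is allowed.
interface CheckoutContext {
  eligibleForProduct: boolean;     // result of the per-account eligibility check
  whmcsPaymentMethodCount: number; // payment methods on file in WHMCS
}

type CheckoutDecision =
  | { allowed: true }
  | { allowed: false; reason: "NOT_ELIGIBLE" | "MISSING_PAYMENT_METHOD" };

function decideCheckout(ctx: CheckoutContext): CheckoutDecision {
  if (!ctx.eligibleForProduct) {
    return { allowed: false, reason: "NOT_ELIGIBLE" };
  }
  if (ctx.whmcsPaymentMethodCount < 1) {
    return { allowed: false, reason: "MISSING_PAYMENT_METHOD" };
  }
  return { allowed: true };
}
```

Returning a reason code rather than a bare boolean lets the UI tell the user what to fix, which matches the error-handling approach described in this document.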
## Data Ownership Cheat Sheet
- Identity & session: Portal DB (hashed passwords, no WHMCS/SF credentials stored).
- Billing profile & addresses: WHMCS (authoritative); the portal writes changes back to WHMCS.
- Orders & order status: Salesforce (source of truth); WHMCS receives the billing/provisioning copy during fulfillment.
- Support cases: Salesforce (the portal only filters to the account's cases).
## Caching & Freshness (Redis)
- Services catalog: event-driven (Salesforce CDC) with a 12h safety TTL; short-lived "volatile" data uses a 60s TTL; per-account eligibility is event-driven with the same 12h safety TTL.
- Orders: event-driven (Salesforce CDC), no TTL; invalidated when Salesforce emits order/order-item changes or when we create/provision an order.
- Invoices: list cached 90s; invoice detail cached 5m; invalidated by WHMCS webhooks and by write operations.
- Subscriptions/services: list cached 5m; single subscription cached 10m; invalidated on WHMCS cache busts (webhooks or profile updates).
- Payment methods: cached 15m; payment gateways list cached 1h.
- WHMCS client profile: cached 30m; cleared after profile/address changes.
- Signup account lookup (Salesforce customer number): cached 30s to keep the form responsive.
- Support cases: read live from Salesforce (no cache).
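
For quick reference, the TTLs listed above can be collected into one table. This is only a sketch: the constant name, the key names, and the `ttlSecondsFor` helper are assumptions, and the real values live in the individual cache services. Support cases are omitted because they are not cached.

```typescript
const SECOND = 1;
const MINUTE = 60 * SECOND;
const HOUR = 60 * MINUTE;

// Values in seconds, taken from the list above. null means "no TTL" (event-driven only).
const CACHE_TTL_SECONDS = {
  servicesCatalog: 12 * HOUR,      // safety TTL; CDC events normally invalidate sooner
  servicesVolatile: 60 * SECOND,
  servicesEligibility: 12 * HOUR,  // same safety TTL, scoped per account
  orders: null,
  invoiceList: 90 * SECOND,
  invoiceDetail: 5 * MINUTE,
  subscriptionList: 5 * MINUTE,
  subscriptionDetail: 10 * MINUTE,
  paymentMethods: 15 * MINUTE,
  paymentGateways: 1 * HOUR,
  whmcsClientProfile: 30 * MINUTE,
  signupAccountLookup: 30 * SECOND,
} as const;

type CacheName = keyof typeof CACHE_TTL_SECONDS;

function ttlSecondsFor(name: CacheName): number | null {
  return CACHE_TTL_SECONDS[name];
}
```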
## What Happens on Errors
- We prefer to fail safely with clear messages: for example, missing Customer Number, duplicate account, or missing payment method stops the action and tells the user what to fix.
- If WHMCS or Salesforce is briefly unavailable, the portal surfaces a friendly “try again later” message rather than partial data.
- Fulfillment writes error codes/messages back to Salesforce (e.g., missing payment method) so the team can see why a provision was paused.
- Caches are cleared on writes and key webhooks so stale data is minimized; when cache access fails, we fall back to live reads.
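
The "fall back to live reads" behavior can be sketched as a read-through helper that treats any cache failure as a miss. The `CacheLike` interface and `readWithFallback` name are hypothetical stand-ins, not the portal's actual Redis wrapper.

```typescript
// Hypothetical minimal cache interface; a Redis-backed implementation would satisfy it.
interface CacheLike<T> {
  get(key: string): Promise<T | undefined>;
  set(key: string, value: T, ttlSeconds: number): Promise<void>;
}

async function readWithFallback<T>(
  cache: CacheLike<T>,
  key: string,
  ttlSeconds: number,
  loadLive: () => Promise<T>,
): Promise<T> {
  try {
    const cached = await cache.get(key);
    if (cached !== undefined) {
      return cached;
    }
  } catch {
    // Cache unavailable: treat it as a miss and continue with the live read.
  }

  // Errors from the live read are not swallowed; the caller decides how to surface them.
  const fresh = await loadLive();

  try {
    await cache.set(key, fresh, ttlSeconds);
  } catch {
    // A failed cache write must not fail the request.
  }
  return fresh;
}
```

Only cache errors are swallowed here. If the upstream system itself is down, the error still propagates so the portal can show its "try again later" message instead of partial data.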
## Public vs Account API Boundary (Security + Caching)
The BFF exposes two “flavors” of service catalog endpoints:
- **Public catalog (never personalized)**: `GET /api/public/services/*`
  - Ignores cookies/tokens (no optional session attach).
  - Safe to cache publicly (subject to TTL) and heavily rate limit.
- **Account catalog (authenticated + personalized)**: `GET /api/account/services/*`
  - Requires auth and can return account-specific catalog variants (e.g. SIM family discount availability).
  - Uses `Cache-Control: private, no-store` at the HTTP layer; server-side caching is handled in Redis.
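
As a rough illustration of the boundary, the header choice can be written as a single function of the endpoint flavor. The `cacheControlFor` helper and `PublicCacheTtl` type are hypothetical, and the public `max-age` / `s-maxage` values are left as parameters because this document does not state them.

```typescript
type CatalogAudience = "public" | "account";

// The actual public TTL values are not given in this document, so they are inputs here.
interface PublicCacheTtl {
  maxAgeSeconds: number;  // browser TTL
  sMaxAgeSeconds: number; // shared cache / CDN TTL
}

function cacheControlFor(audience: CatalogAudience, ttl: PublicCacheTtl): string {
  if (audience === "account") {
    // Personalized responses must never be stored by browsers or shared caches.
    return "private, no-store";
  }
  return `public, max-age=${ttl.maxAgeSeconds}, s-maxage=${ttl.sMaxAgeSeconds}`;
}
```

For example, `cacheControlFor("public", { maxAgeSeconds: 60, sMaxAgeSeconds: 300 })` yields `public, max-age=60, s-maxage=300`; the 60 and 300 are placeholder values, not the portal's configuration.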
### How "public caching" works (and why high traffic usually won't hit Salesforce)
There are **two independent caching layers** involved:
- **Redis (server-side) catalog cache**:
  - Catalog reads are cached in Redis via `ServicesCacheService`.
  - Invalidation is driven by Salesforce **CDC** events (Product2 / PricebookEntry) and an account **Platform Event** for eligibility updates.
  - On top of event-driven invalidation, catalog and eligibility entries carry a **12 hour safety TTL** (configurable via `SERVICES_CACHE_SAFETY_TTL_SECONDS`) so the cache self-heals if events are missed.
  - Result: even if the public catalog is requested millions of times, the BFF typically serves from Redis and only re-queries Salesforce when a relevant change event arrives, the safety TTL expires, or the cache is cold or misses.
- **HTTP cache (browser/CDN)**:
  - Public catalog responses include `Cache-Control: public, max-age=..., s-maxage=...`.
  - This reduces load on the BFF by allowing browsers/shared caches/CDNs to reuse responses for the TTL window.
  - This layer is TTL-based, so **staleness up to the TTL** is expected unless your CDN is configured for explicit purge.
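
The Redis layer's combination of event-driven invalidation and a safety TTL can be sketched with an in-memory stand-in. This is not `ServicesCacheService`: the `SafetyTtlCache` class, the `invalidatePrefix` method, the key names, and the `resolveSafetyTtlSeconds` helper are all assumptions. Only the 12 hour default and the `SERVICES_CACHE_SAFETY_TTL_SECONDS` setting come from this document.

```typescript
const DEFAULT_SAFETY_TTL_SECONDS = 12 * 60 * 60;

// Parses the raw SERVICES_CACHE_SAFETY_TTL_SECONDS value; falls back to 12h if unset or invalid.
function resolveSafetyTtlSeconds(raw: string | undefined): number {
  const parsed = Number(raw);
  return raw !== undefined && Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_SAFETY_TTL_SECONDS;
}

// In-memory stand-in for Redis so the sketch runs on its own.
class SafetyTtlCache<T> {
  private entries = new Map<string, { value: T; expiresAtMs: number }>();
  private safetyTtlSeconds: number;
  private now: () => number;

  constructor(safetyTtlSeconds: number, now: () => number = () => Date.now()) {
    this.safetyTtlSeconds = safetyTtlSeconds;
    this.now = now;
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAtMs) {
      // Safety TTL elapsed: drop the entry so the next read goes back to Salesforce.
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, {
      value,
      expiresAtMs: this.now() + this.safetyTtlSeconds * 1000,
    });
  }

  // Would be called by a (hypothetical) CDC / Platform Event handler. Returns entries removed.
  invalidatePrefix(prefix: string): number {
    let removed = 0;
    this.entries.forEach((_entry, key) => {
      if (key.startsWith(prefix)) {
        this.entries.delete(key);
        removed++;
      }
    });
    return removed;
  }
}
```

In normal operation an event calls `invalidatePrefix` long before the TTL matters. The TTL only decides how long a missed event can leave stale data in place.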
### What to worry about at "million visits" scale
- **CDN cookie forwarding / cache key fragmentation**:
  - Browsers will still send cookies to `/api/public/*` by default; the BFF ignores them, but a CDN might treat cookies as part of the cache key unless configured not to.
  - Make sure your CDN/proxy config does **not** include cookies (and ideally not `Authorization`) in the cache key for `/api/public/services/*`.
- **BFF + Redis load (even if Salesforce is protected)**:
  - Redis caching prevents Salesforce read amplification, but the BFF/Redis still need to handle request volume.
  - Rate limiting on public endpoints is intentional to cap abuse and protect infrastructure.
- **CDC subscription health / fallback behavior**:
  - If Salesforce CDC subscriptions are disabled or unhealthy, invalidations may not arrive, and Redis caches can serve stale data until the safety TTL expires (12 hours by default for services caches) or they are cleared manually.
  - Monitor the CDC subscriber and cache health metrics (`GET /api/health/services/cache`).
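
To make the cache key fragmentation concern concrete, here is a sketch of how a shared cache key for the public catalog could be derived so that cookies and `Authorization` never affect it. In practice this is set in the CDN/proxy configuration rather than in application code; the `IncomingRequest` shape and `publicCatalogCacheKey` function are hypothetical models of that behavior.

```typescript
// Hypothetical, simplified view of a request as a CDN or proxy might see it.
interface IncomingRequest {
  method: string;
  path: string;
  query: Record<string, string>;
  headers: Record<string, string>; // may include Cookie / Authorization
}

function publicCatalogCacheKey(req: IncomingRequest): string {
  // Sort query parameters so equivalent URLs map to the same key.
  const sortedQuery = Object.keys(req.query)
    .sort()
    .map((k) => `${encodeURIComponent(k)}=${encodeURIComponent(req.query[k] ?? "")}`)
    .join("&");

  // Headers are deliberately left out: the public catalog is never personalized,
  // so cookies and Authorization must not split the cache.
  const suffix = sortedQuery ? `?${sortedQuery}` : "";
  return `${req.method.toUpperCase()} ${req.path}${suffix}`;
}
```

With a key like this, a logged-in browser and an anonymous one requesting the same public catalog URL share a single cached response.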
### Future work (monitoring + resilience)
- **CDC subscriber monitoring**: alert on disconnects and sustained lack of events (time since last processed event).
- **Replay cursor persistence**: store/restore a replay position across restarts to reduce missed-event risk.
- **Operational runbook**: document the “flush services caches” procedure for incidents where events were missed for an extended period.
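
The "time since last processed event" alert could be based on a check like the one below. This is a sketch of planned work, not existing code: the `subscriberStatus` function, its status names, and the silence threshold are all assumptions.

```typescript
type SubscriberStatus = "healthy" | "stale" | "never-received";

// lastEventAtMs: timestamp of the last processed CDC event, or null if none was seen
// since startup. maxSilenceMs: how long without events is tolerated before alerting.
function subscriberStatus(
  lastEventAtMs: number | null,
  nowMs: number,
  maxSilenceMs: number,
): SubscriberStatus {
  if (lastEventAtMs === null) {
    return "never-received";
  }
  return nowMs - lastEventAtMs > maxSilenceMs ? "stale" : "healthy";
}
```

A quiet catalog can legitimately produce no events for long stretches, so a threshold-based alert like this would likely need to be paired with disconnect detection to avoid false alarms.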