barsa 72d0b66be7 Enhance Documentation Structure and Update Operational Runbooks

- Added a new section for operational runbooks in README.md, detailing procedures for incident response, database operations, and queue management.
- Updated the documentation structure in STRUCTURE.md to reflect the new organization of guides and resources.
- Removed the deprecated disabled-modules.md file to streamline documentation.
- Enhanced the _archive/README.md with historical notes on documentation alignment and corrections made in December 2025.
- Updated various references in the documentation to reflect the new paths and services in the integrations directory.

2025-12-23 15:55:58 +09:00


Database Operations Runbook

This document covers operational procedures for the PostgreSQL database used by the Customer Portal BFF.

Overview

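| Component | Technology | Location |
| --- | --- | --- |
| Database | PostgreSQL 17 | Configured via `DATABASE_URL` |
| ORM | Prisma 6 | `apps/bff/prisma/` |
| Connection Pool | Prisma connection pooling | Default: `num_physical_cpus * 2 + 1` connections unless `connection_limit` is set |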

Backup Procedures

Automated Backups

Note: Configure automated backups based on your hosting environment.

Recommended Schedule:

Full backup: Daily at 02:00 UTC
Transaction log backup: Every 15 minutes
Retention: 30 days
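
The 30-day retention above can be enforced with a small pruning step after each backup run. The following is a minimal sketch, not a tested production script: the `prune_old_backups` function name and the idea of a single flat backup directory are assumptions, and the `backup_*.sql.gz` pattern is borrowed from the manual backup commands in this runbook.

```shell
#!/usr/bin/env bash
# Sketch: delete compressed backups older than the retention window.
# prune_old_backups is a hypothetical helper; adjust the pattern to your naming.
prune_old_backups() {
  local dir="$1" days="${2:-30}"
  # -mtime +N matches files whose age, rounded down to whole days, exceeds N.
  # -print lists each file as it is deleted so the run leaves an audit trail.
  find "$dir" -maxdepth 1 -type f -name 'backup_*.sql.gz' -mtime +"$days" -print -delete
}
```

A scheduled job could call `prune_old_backups /path/to/backups 30` right after the daily full backup completes.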

Manual Backup

# Create a full database backup
pg_dump "$DATABASE_URL" > "backup_$(date +%Y%m%d_%H%M%S).sql"

# Create a compressed backup
pg_dump "$DATABASE_URL" | gzip > "backup_$(date +%Y%m%d_%H%M%S).sql.gz"

# Back up specific tables
pg_dump "$DATABASE_URL" -t users -t id_mappings > user_data_backup.sql
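
One caveat when reusing the application's `DATABASE_URL` with `pg_dump` or `psql`: libpq-based tools reject URI query parameters they do not recognize, so Prisma-specific parameters such as `connection_limit` or `pool_timeout` can make the commands above fail. A minimal sketch of a workaround follows. The `libpq_url` helper is hypothetical, and it drops the entire query string, so any libpq options you rely on (for example `sslmode`) would need to be re-added.

```shell
#!/usr/bin/env bash
# Sketch: strip the query string from a connection URL before passing it to
# libpq tools. libpq_url is a hypothetical helper, not part of the project.
libpq_url() {
  # Remove everything from the first "?" onward (longest matching suffix).
  printf '%s\n' "${1%%\?*}"
}

# Usage (assumes DATABASE_URL is exported in the environment):
#   pg_dump "$(libpq_url "$DATABASE_URL")" > backup.sql
```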

Backup Verification

# Verify backup integrity (restore to temp database; stop at the first error)
createdb portal_backup_test
psql -v ON_ERROR_STOP=1 portal_backup_test < backup_YYYYMMDD_HHMMSS.sql

# Run basic integrity checks
psql portal_backup_test -c "SELECT COUNT(*) FROM users"
psql portal_backup_test -c "SELECT COUNT(*) FROM id_mappings"

# Clean up
dropdb portal_backup_test
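
For compressed backups, a cheap pre-check can catch truncated or corrupt files before spending time on a test restore. This is a sketch under the assumption that backups are gzip-compressed as in the manual backup examples; `check_backup_file` is a hypothetical helper name. It complements the restore test and does not replace it, since a valid gzip stream can still contain an incomplete dump.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a compressed backup before attempting a restore.
check_backup_file() {
  local file="$1"
  # -s is true only if the file exists and is non-empty.
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  # gzip -t decompresses to nowhere and verifies the stream checksum.
  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip stream: $file" >&2; return 1; }
}
```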

Recovery Procedures

Point-in-Time Recovery

Prerequisites:

WAL archiving enabled
Continuous backup configured

# Stop the application
pnpm prod:stop

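# Restore from backup (pg_restore requires a custom-format dump created with pg_dump -Fc)
# Note: this restores the state at dump time; recovering to an arbitrary point in time
# additionally requires a base backup plus WAL replay with a recovery target
pg_restore -d "$DATABASE_URL" backup_YYYYMMDD.dump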

# Run Prisma migrations to ensure schema is current
pnpm db:migrate

# Restart the application
pnpm prod:start

Restore from SQL Backup

# Stop the application to prevent writes
pnpm prod:stop

# Drop and recreate database (DESTRUCTIVE)
dropdb portal_production
createdb portal_production

# Restore from backup
psql "$DATABASE_URL" < backup_YYYYMMDD.sql

# Verify restoration
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM users"

# Restart application
pnpm prod:start
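
Because the restore above drops the production database, it can help to put a typed confirmation in front of the destructive step. The sketch below is one way to do that; `confirm_destructive` is a hypothetical helper, and the database name is the one used in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: require the operator to type the database name before a destructive step.
confirm_destructive() {
  local expected="$1" answer
  printf 'Type "%s" to confirm: ' "$expected" >&2
  # read fails on end-of-input, which is treated as "not confirmed".
  read -r answer || return 1
  [ "$answer" = "$expected" ]
}

# Usage:
#   confirm_destructive portal_production && dropdb portal_production
```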

Migration Management

Running Migrations

# Development: Apply pending migrations
pnpm db:migrate

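# Production: Deploy migrations
# (if db:migrate wraps `prisma migrate dev`, use `npx prisma migrate deploy` instead;
# `migrate dev` is intended for development databases only)
pnpm db:migrate --skip-generate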

# View migration status
npx prisma migrate status

Migration Checklist

Before deploying migrations to production:

Test migration on staging environment
Verify rollback procedure exists
Estimate migration duration
Schedule maintenance window if needed
Create backup before migration
Notify team of deployment
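
The "Create backup before migration" item can be scripted so that a migration never starts without a non-empty dump on disk. The following is a sketch under stated assumptions: `pre_migration_backup` and the `PG_DUMP_BIN` override are hypothetical names introduced here, and the default output file matches the `pre_migration_backup.sql` name used in the rollback procedure.

```shell
#!/usr/bin/env bash
# Sketch: take a pre-migration backup and fail loudly if it did not work.
pre_migration_backup() {
  local out="${1:-pre_migration_backup.sql}"
  # PG_DUMP_BIN is a hypothetical override, useful for dry runs and tests.
  local dump_cmd="${PG_DUMP_BIN:-pg_dump}"
  if [ -z "${DATABASE_URL:-}" ]; then
    echo "DATABASE_URL is not set" >&2
    return 1
  fi
  # Remove the partial file if the dump command fails.
  "$dump_cmd" "$DATABASE_URL" > "$out" || { rm -f "$out"; return 1; }
  [ -s "$out" ] || { echo "backup is empty: $out" >&2; return 1; }
  echo "$out"
}

# Usage:
#   pre_migration_backup && pnpm db:migrate
```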

Rollback Procedure

Prisma does not have built-in rollback. Use these approaches:

Option 1: Restore from Backup

# Restore database to pre-migration state
# (restore into an empty database; see "Restore from SQL Backup" for the drop/recreate steps)
psql "$DATABASE_URL" < pre_migration_backup.sql

# Revert migration files in codebase
git revert <migration-commit>

Option 2: Manual Rollback SQL

# Create rollback SQL for each migration
# Store in: apps/bff/prisma/rollbacks/

# Example rollback
psql "$DATABASE_URL" < rollbacks/20240115_rollback.sql

Option 3: Reset and Reseed (Development Only)

# WARNING: Destroys all data
pnpm db:reset

ID Mappings Data Integrity

The id_mappings table links portal users to WHMCS and Salesforce accounts. Corruption here causes authentication and data access failures.

Verify Mapping Integrity

-- Check for orphaned mappings (portal user deleted but mapping exists)
SELECT m.* FROM id_mappings m
LEFT JOIN users u ON m.user_id = u.id
WHERE u.id IS NULL;

-- Check for duplicate WHMCS mappings
SELECT whmcs_client_id, COUNT(*) as count
FROM id_mappings
WHERE whmcs_client_id IS NOT NULL
GROUP BY whmcs_client_id
HAVING COUNT(*) > 1;

-- Check for duplicate Salesforce mappings
SELECT sf_account_id, COUNT(*) as count
FROM id_mappings
WHERE sf_account_id IS NOT NULL
GROUP BY sf_account_id
HAVING COUNT(*) > 1;

Fix Orphaned Mappings

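-- Remove mappings for deleted users
-- (NOT EXISTS matches the orphan check above, including rows with a NULL user_id)
DELETE FROM id_mappings m
WHERE NOT EXISTS (SELECT 1 FROM users u WHERE u.id = m.user_id);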

Fix Duplicate Mappings

Warning: Investigate duplicates before deleting. They may indicate data issues.

-- View duplicate details before fixing
SELECT m.*, u.email FROM id_mappings m
JOIN users u ON m.user_id = u.id
WHERE m.whmcs_client_id IN (
  SELECT whmcs_client_id FROM id_mappings
  GROUP BY whmcs_client_id HAVING COUNT(*) > 1
);

PostgreSQL Maintenance

VACUUM and ANALYZE

-- Analyze all tables for query optimization
ANALYZE;

-- Vacuum to mark dead-row space as reusable (does not block reads or writes)
VACUUM;

-- Full vacuum (rewrites each table under an exclusive lock and returns space to the OS)
VACUUM FULL;

-- Vacuum specific table
VACUUM ANALYZE id_mappings;

Recommended Schedule:

VACUUM ANALYZE: Daily during low-traffic hours
VACUUM FULL: Monthly during maintenance window

Index Maintenance

-- Check index usage
SELECT schemaname, relname AS tablename, indexrelname AS indexname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

-- Find unused indexes (candidates for removal)
SELECT schemaname, relname AS tablename, indexrelname AS indexname
FROM pg_stat_user_indexes
WHERE idx_scan = 0;

-- Reindex a table
REINDEX TABLE id_mappings;

-- Reindex entire database (during maintenance window)
REINDEX DATABASE portal_production;

Check Table Bloat

-- Estimate table bloat from dead-row statistics
SELECT
  schemaname, relname AS tablename,
  pg_size_pretty(pg_relation_size(relid)) as size,
  n_dead_tup as dead_rows,
  n_live_tup as live_rows,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) as dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

Connection Pool Monitoring

Check Active Connections

-- Current connection count
SELECT COUNT(*) as connections FROM pg_stat_activity;

-- Connections by state
SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state;

-- Connections by application
SELECT application_name, COUNT(*)
FROM pg_stat_activity
GROUP BY application_name;

-- Long-running queries (>5 minutes)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - pg_stat_activity.query_start > interval '5 minutes';

Kill Stuck Connections

-- Terminate a specific backend (use pg_cancel_backend(<pid>) to cancel only its current query)
SELECT pg_terminate_backend(<pid>);

-- Terminate all connections except current
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
  AND datname = current_database();

Prisma Connection Pool Settings

Configure in DATABASE_URL query parameters:

postgresql://user:pass@host:5432/db?connection_limit=10&pool_timeout=10

| Parameter | Default | Recommended |
| --- | --- | --- |
| `connection_limit` | `num_physical_cpus * 2 + 1` | 10-20 per instance |
| `pool_timeout` | 10s | 10-30s |
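
When the base URL comes from a secret store or already carries other parameters, the pool settings can be appended at deploy time. This is a minimal sketch; `with_pool_params` is a hypothetical helper, and the fallback values are the ones from the example URL above rather than a recommendation.

```shell
#!/usr/bin/env bash
# Sketch: append Prisma pool parameters to a connection URL.
with_pool_params() {
  local url="$1" limit="${2:-10}" timeout="${3:-10}"
  local sep='?'
  # Use "&" if the URL already has a query string.
  case "$url" in *\?*) sep='&' ;; esac
  printf '%s%sconnection_limit=%s&pool_timeout=%s\n' "$url" "$sep" "$limit" "$timeout"
}

# Usage:
#   export DATABASE_URL="$(with_pool_params "$BASE_DATABASE_URL" 20 30)"
```

Keep in mind that the total across all application instances must stay below the PostgreSQL `max_connections` setting.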

Monitoring Queries

Database Size

-- Total database size
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Size per table
SELECT
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) as total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;

Query Performance

-- Slowest queries (requires pg_stat_statements extension)
-- PostgreSQL 13 and later name these columns mean_exec_time and total_exec_time
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

Lock Monitoring

-- Check for locks
SELECT
  pg_locks.pid,
  pg_stat_activity.query,
  pg_locks.mode,
  pg_locks.granted
FROM pg_locks
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted;

Emergency Procedures

Database Unresponsive

Check PostgreSQL process status
Check disk space and memory
Kill long-running queries
Restart PostgreSQL if necessary
Check application connectivity after restart
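
The first and last steps above can be checked quickly with `pg_isready`, which ships with the PostgreSQL client tools and reports server status through its exit code. The sketch below wraps it; `describe_pg_isready` and `check_database` are hypothetical helper names, and the 5-second timeout is an arbitrary choice. If the connection URL carries Prisma-specific query parameters, they may need to be stripped first.

```shell
#!/usr/bin/env bash
# Sketch: translate pg_isready exit codes into readable triage output.
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections" ;;
    2) echo "no response" ;;
    3) echo "no attempt made" ;;
    *) echo "unknown status $1" ;;   # e.g. 127 if pg_isready is not installed
  esac
}

check_database() {
  local status=0
  # -d accepts a connection string; -t sets the timeout in seconds.
  pg_isready -d "$1" -t 5 >/dev/null 2>&1 || status=$?
  describe_pg_isready "$status"
  return "$status"
}

# Usage:
#   check_database "$DATABASE_URL"
```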

Disk Space Full

# Check disk usage
df -h

# Find large files in PostgreSQL data directory
du -sh /var/lib/postgresql/data/*

# Clear transaction logs (if WAL archiving is working)
# WARNING: Only if logs are properly archived
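
To catch a filling disk before it becomes an emergency, the `df` check can be turned into a threshold alert. This is a sketch only: `disk_usage_pct` and `disk_alert` are hypothetical helpers, the 90% default is an assumed threshold, and the data directory path is the one used in the commands above.

```shell
#!/usr/bin/env bash
# Sketch: report disk usage for a path and flag it above a threshold.
disk_usage_pct() {
  # df -P prints the POSIX format; field 5 of line 2 is capacity, e.g. "42%".
  df -P "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

disk_alert() {
  local path="$1" threshold="${2:-90}" pct
  pct="$(disk_usage_pct "$path")"
  if [ "$pct" -ge "$threshold" ]; then
    echo "ALERT: $path is ${pct}% full"
    return 1
  fi
  echo "OK: $path is ${pct}% full"
}

# Usage:
#   disk_alert /var/lib/postgresql/data 90
```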

Corruption Detected

STOP the application immediately
Do not attempt repairs without backup verification
Restore from last known good backup
Investigate root cause before resuming service

Last Updated: December 2025
