Database Operations Runbook
This document covers operational procedures for the PostgreSQL database used by the Customer Portal BFF.
Overview
| Component | Technology | Location |
|---|---|---|
| Database | PostgreSQL 17 | Configured via DATABASE_URL |
| ORM | Prisma 6 | apps/bff/prisma/ |
| Connection Pool | Prisma connection pooling | Sized by connection_limit in DATABASE_URL (Prisma default: num_physical_cpus * 2 + 1) |
Backup Procedures
Automated Backups
Note: Configure automated backups based on your hosting environment.
Recommended Schedule:
- Full backup: Daily at 02:00 UTC
- WAL (transaction log) archiving: Every 15 minutes
- Retention: 30 days
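The 30-day retention can be enforced with a pruning step that runs after each scheduled dump. The following is a minimal sketch, assuming compressed backups land in one directory and follow the backup_*.sql.gz naming used in this runbook; the function name and directory layout are hypothetical.

```shell
# prune_backups DIR [DAYS]
# Hypothetical helper: delete backup_*.sql.gz files in DIR whose last
# modification is more than DAYS days old (default 30) and print each
# removed path.
prune_backups() {
  dir="$1"
  days="${2:-30}"
  find "$dir" -maxdepth 1 -type f -name 'backup_*.sql.gz' \
    -mtime +"$days" -print -delete
}

# Usage (for example, from the same cron job that takes the daily dump):
#   prune_backups /var/backups/portal 30
```

Only files directly inside the directory are considered, so backups kept elsewhere (for example, monthly archives in a subdirectory) are left alone.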
Manual Backup
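# Note: pg_dump and psql reject Prisma-only URL parameters such as
# connection_limit; remove them from DATABASE_URL before running these commands
# Create a full database backup
pg_dump "$DATABASE_URL" > "backup_$(date +%Y%m%d_%H%M%S).sql"
# Create a compressed backup
pg_dump "$DATABASE_URL" | gzip > "backup_$(date +%Y%m%d_%H%M%S).sql.gz"
# Backup specific tables
pg_dump "$DATABASE_URL" -t users -t id_mappings > user_data_backup.sql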
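When DATABASE_URL carries Prisma pool parameters (connection_limit, pool_timeout), the PostgreSQL client tools refuse the URL because they do not recognise those query parameters. One way around this, sketched below with a hypothetical helper name, is to drop the query string before handing the URL to pg_dump or psql.

```shell
# libpq_url URL
# Hypothetical helper: print URL without its query string, so Prisma-only
# parameters never reach pg_dump/psql.
# Caveat: this also drops parameters the client tools do understand, such as
# sslmode; add those back explicitly if the server requires them.
libpq_url() {
  printf '%s\n' "${1%%[?]*}"
}

# Usage:
#   pg_dump "$(libpq_url "$DATABASE_URL")" > backup.sql
```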
Backup Verification
# Verify backup integrity (restore to temp database)
createdb portal_backup_test
psql portal_backup_test < backup_YYYYMMDD.sql
# Run basic integrity checks
psql portal_backup_test -c "SELECT COUNT(*) FROM users"
psql portal_backup_test -c "SELECT COUNT(*) FROM id_mappings"
# Clean up
dropdb portal_backup_test
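Before spending time on a full test restore, it can help to rule out a truncated file. A plain-format pg_dump ends with a "PostgreSQL database dump complete" comment, so a cheap pre-check is to look for that marker near the end of the file. The sketch below assumes plain SQL dumps, optionally gzip-compressed; the function name is hypothetical, and it complements rather than replaces the restore test.

```shell
# dump_is_complete FILE
# Hypothetical helper: succeed only if FILE (plain SQL, or .gz of plain SQL)
# ends with pg_dump's completion marker. Compressed files are also checked
# for gzip-level corruption first.
dump_is_complete() {
  file="$1"
  case "$file" in
    *.gz)
      gzip -t "$file" 2>/dev/null || return 1
      gzip -dc "$file" | tail -n 20 | grep -q 'PostgreSQL database dump complete'
      ;;
    *)
      tail -n 20 "$file" | grep -q 'PostgreSQL database dump complete'
      ;;
  esac
}

# Usage:
#   dump_is_complete backup_YYYYMMDD.sql || echo "backup looks truncated" >&2
```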
Recovery Procedures
Point-in-Time Recovery
Prerequisites:
- WAL archiving enabled
- Continuous backup configured
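# Stop the application
pnpm prod:stop
# Restore from a custom-format dump (created with pg_dump -Fc). This returns the
# database to the moment the dump was taken; recovering to an arbitrary point in
# time additionally requires a base backup plus replay of the archived WAL.
pg_restore --clean --if-exists -d "$DATABASE_URL" backup_YYYYMMDD.dump
# Run Prisma migrations to ensure schema is current
pnpm db:migrate
# Restart the application
pnpm prod:start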
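Recovery to a specific timestamp relies on the WAL archive listed in the prerequisites: PostgreSQL starts from a restored base backup and replays archived WAL until it reaches the target time. The sketch below writes the settings that drive that replay. It assumes PostgreSQL 12 or later, a stopped server whose data directory has already been replaced with a base backup, and a WAL archive that is a plain directory; the function name and paths are hypothetical.

```shell
# write_pitr_config PGDATA ARCHIVE_DIR TARGET_TIME
# Hypothetical helper: append recovery settings to postgresql.conf and create
# recovery.signal so the next server start performs targeted recovery.
write_pitr_config() {
  pgdata="$1"
  archive_dir="$2"
  target_time="$3"
  cat >> "$pgdata/postgresql.conf" <<EOF
# Added for point-in-time recovery
restore_command = 'cp $archive_dir/%f %p'
recovery_target_time = '$target_time'
recovery_target_action = 'promote'
EOF
  # An empty recovery.signal file tells PostgreSQL to enter recovery mode
  : > "$pgdata/recovery.signal"
}

# Usage (paths and timestamp are examples only):
#   write_pitr_config /var/lib/postgresql/data /mnt/wal_archive '2025-12-01 10:30:00+00'
```

After the server has promoted, remove the appended settings so a later restart does not pick up a stale recovery target.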
Restore from SQL Backup
# Stop the application to prevent writes
pnpm prod:stop
# Drop and recreate database (DESTRUCTIVE)
dropdb portal_production
createdb portal_production
# Restore from backup (stop at the first error instead of continuing)
psql -v ON_ERROR_STOP=1 "$DATABASE_URL" < backup_YYYYMMDD.sql
# Verify restoration
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM users"
# Restart application
pnpm prod:start
Migration Management
Running Migrations
# Development: Apply pending migrations
pnpm db:migrate
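# Production: Deploy migrations
# (if db:migrate wraps "prisma migrate dev", run "npx prisma migrate deploy" here
# instead; migrate dev is meant for development databases)
pnpm db:migrate --skip-generate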
# View migration status
npx prisma migrate status
Migration Checklist
Before deploying migrations to production:
- Test migration on staging environment
- Verify rollback procedure exists
- Estimate migration duration
- Schedule maintenance window if needed
- Create backup before migration
- Notify team of deployment
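The "create backup before migration" item is easy to skip under time pressure, so it can be turned into a guard that the deploy step must pass. The sketch below assumes the pre-migration dump is written to a known file; the function name and the 60-minute default are hypothetical choices.

```shell
# require_recent_backup FILE [MAX_AGE_MINUTES]
# Hypothetical guard: succeed only if FILE exists, is non-empty, and was
# modified within the last MAX_AGE_MINUTES minutes (default 60).
require_recent_backup() {
  file="$1"
  max_age="${2:-60}"
  [ -s "$file" ] || {
    echo "backup missing or empty: $file" >&2
    return 1
  }
  # find prints the path only when the file is newer than the age limit
  [ -n "$(find "$file" -mmin -"$max_age" -print)" ] || {
    echo "backup older than ${max_age} minutes: $file" >&2
    return 1
  }
}

# Usage:
#   require_recent_backup pre_migration_backup.sql && pnpm db:migrate
```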
Rollback Procedure
Prisma does not have built-in rollback. Use these approaches:
Option 1: Restore from Backup
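# Restore database to pre-migration state
# (a plain SQL dump expects an empty database: drop and recreate it first,
# as in "Restore from SQL Backup")
psql "$DATABASE_URL" < pre_migration_backup.sql
# Revert migration files in codebase
git revert <migration-commit>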
Option 2: Manual Rollback SQL
# Create rollback SQL for each migration
# Store in: apps/bff/prisma/rollbacks/
# Example rollback
psql $DATABASE_URL < rollbacks/20240115_rollback.sql
Option 3: Reset and Reseed (Development Only)
# WARNING: Destroys all data
pnpm db:reset
ID Mappings Data Integrity
The id_mappings table links portal users to WHMCS and Salesforce accounts. Corruption here causes authentication and data access failures.
Verify Mapping Integrity
-- Check for orphaned mappings (portal user deleted but mapping exists)
SELECT m.* FROM id_mappings m
LEFT JOIN users u ON m.user_id = u.id
WHERE u.id IS NULL;
-- Check for duplicate WHMCS mappings
SELECT whmcs_client_id, COUNT(*) as count
FROM id_mappings
WHERE whmcs_client_id IS NOT NULL
GROUP BY whmcs_client_id
HAVING COUNT(*) > 1;
-- Check for duplicate Salesforce mappings
SELECT sf_account_id, COUNT(*) as count
FROM id_mappings
WHERE sf_account_id IS NOT NULL
GROUP BY sf_account_id
HAVING COUNT(*) > 1;
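These checks are useful as a scheduled job that fails loudly when something is wrong. The sketch below wraps the orphaned-mapping query in a function whose exit status can drive an alert. It assumes DATABASE_URL is accepted by psql as-is; the function names and exit codes are hypothetical.

```shell
# count_orphaned_mappings
# Print the number of id_mappings rows whose portal user no longer exists.
# -A (unaligned) and -t (tuples only) make psql print just the number.
count_orphaned_mappings() {
  psql "$DATABASE_URL" -At -v ON_ERROR_STOP=1 -c \
    "SELECT COUNT(*) FROM id_mappings m LEFT JOIN users u ON m.user_id = u.id WHERE u.id IS NULL"
}

# check_orphaned_mappings
# Exit status: 0 = no orphans, 1 = orphans found, 2 = the query itself failed.
check_orphaned_mappings() {
  orphans="$(count_orphaned_mappings)" || return 2
  if [ "$orphans" -gt 0 ]; then
    echo "found $orphans orphaned id_mappings rows" >&2
    return 1
  fi
}

# Usage:
#   check_orphaned_mappings || echo "id_mappings integrity check failed" >&2
```

The duplicate WHMCS and Salesforce checks can be wrapped the same way by counting the rows each query returns.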
Fix Orphaned Mappings
-- Remove mappings for deleted users
DELETE FROM id_mappings
WHERE user_id NOT IN (SELECT id FROM users);
Fix Duplicate Mappings
Warning: Investigate duplicates before deleting. They may indicate data issues.
-- View duplicate details before fixing
SELECT m.*, u.email FROM id_mappings m
JOIN users u ON m.user_id = u.id
WHERE m.whmcs_client_id IN (
SELECT whmcs_client_id FROM id_mappings
GROUP BY whmcs_client_id HAVING COUNT(*) > 1
);
PostgreSQL Maintenance
VACUUM and ANALYZE
-- Analyze all tables for query optimization
ANALYZE;
-- Vacuum to mark dead rows as reusable (does not block reads or writes)
VACUUM;
-- Full vacuum (rewrites each table under an exclusive lock; returns space to the OS)
VACUUM FULL;
-- Vacuum specific table
VACUUM ANALYZE id_mappings;
Recommended Schedule:
- VACUUM ANALYZE: Daily during low-traffic hours
- VACUUM FULL: Monthly during maintenance window
Index Maintenance
-- Check index usage
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
-- Find unused indexes (candidates for removal)
SELECT schemaname, relname, indexrelname
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
-- Reindex a table (blocks writes to the table; REINDEX TABLE CONCURRENTLY avoids this)
REINDEX TABLE id_mappings;
-- Reindex entire database (during maintenance window)
REINDEX DATABASE portal_production;
Check Table Bloat
-- Estimate table bloat
SELECT
  schemaname, relname,
  pg_size_pretty(pg_relation_size(relid)) as size,
  n_dead_tup as dead_rows,
  n_live_tup as live_rows,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) as dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
Connection Pool Monitoring
Check Active Connections
-- Current connection count
SELECT COUNT(*) as connections FROM pg_stat_activity;
-- Connections by state
SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state;
-- Connections by application
SELECT application_name, COUNT(*)
FROM pg_stat_activity
GROUP BY application_name;
-- Long-running queries (>5 minutes)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - pg_stat_activity.query_start > interval '5 minutes';
Kill Stuck Connections
-- Cancel a running query (the connection stays open)
SELECT pg_cancel_backend(<pid>);
-- Terminate a specific connection
SELECT pg_terminate_backend(<pid>);
-- Terminate all connections except current
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
AND datname = current_database();
Prisma Connection Pool Settings
Configure in DATABASE_URL query parameters:
postgresql://user:pass@host:5432/db?connection_limit=10&pool_timeout=10
| Parameter | Default | Recommended |
|---|---|---|
| connection_limit | num_physical_cpus * 2 + 1 | 10-20 per instance |
| pool_timeout | 10s | 10-30s |
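Because connection_limit applies per application instance, the total demand on PostgreSQL is the number of instances multiplied by the limit, and that product has to stay below the server's max_connections (100 by default) with some headroom for maintenance sessions. A small arithmetic sketch, with a hypothetical function name and a reserve of 10 connections chosen purely for illustration:

```shell
# pool_fits INSTANCES CONNECTION_LIMIT MAX_CONNECTIONS [RESERVED]
# Hypothetical helper: print the connection budget and succeed only if
# INSTANCES * CONNECTION_LIMIT fits within MAX_CONNECTIONS minus RESERVED.
pool_fits() {
  instances="$1"
  limit="$2"
  max_conn="$3"
  reserved="${4:-10}"
  needed=$((instances * limit))
  available=$((max_conn - reserved))
  echo "needed=$needed available=$available"
  [ "$needed" -le "$available" ]
}

# Usage:
#   pool_fits 4 20 100    # 4 instances x 20 connections against max_connections=100
```

With four instances at connection_limit=20 the pool needs 80 of the 90 available connections; a sixth instance at the same limit would exceed the budget and call for a lower per-instance limit or a higher max_connections.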
Monitoring Queries
Database Size
-- Total database size
SELECT pg_size_pretty(pg_database_size(current_database()));
-- Size per table
SELECT
tablename,
pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) as total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;
Query Performance
-- Slowest queries (requires pg_stat_statements extension; column names as of PostgreSQL 13+)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
Lock Monitoring
-- Check for locks
SELECT
pg_locks.pid,
pg_stat_activity.query,
pg_locks.mode,
pg_locks.granted
FROM pg_locks
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted;
Emergency Procedures
Database Unresponsive
- Check PostgreSQL process status
- Check disk space and memory
- Kill long-running queries
- Restart PostgreSQL if necessary
- Check application connectivity after restart
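For the first and last steps, pg_isready gives a quick answer without opening a session, and its exit status distinguishes a server that is down from one that is still starting up. The sketch below translates those documented exit codes into messages; the function name and the host in the usage line are hypothetical.

```shell
# describe_pg_isready CODE
# Hypothetical helper: translate a pg_isready exit status into a message.
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, during startup)" ;;
    2) echo "no response" ;;
    3) echo "no attempt made (invalid parameters)" ;;
    *) echo "unknown status $1" ;;
  esac
}

# Usage (host and port are examples only):
#   pg_isready -h db.internal -p 5432 -t 5; describe_pg_isready "$?"
```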
Disk Space Full
# Check disk usage
df -h
# Find large files in PostgreSQL data directory
du -sh /var/lib/postgresql/data/*
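# Clear transaction logs (if WAL archiving is working)
# WARNING: Only if logs are properly archived. Never delete files under pg_wal
# by hand; fix the failing archive_command or drop stale replication slots so
# PostgreSQL can recycle WAL itself.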
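A full disk is easier to handle when it is caught early. The sketch below reads the usage percentage for the filesystem holding a given path and compares it against a threshold. The function names and the 90% default are hypothetical, and the parsing assumes the filesystem name contains no spaces.

```shell
# disk_usage_pct PATH
# Print the usage percentage (without the % sign) of the filesystem
# containing PATH, using the portable df -P output format.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# disk_is_full PATH [THRESHOLD]
# Succeed when usage is at or above THRESHOLD percent (default 90).
disk_is_full() {
  [ "$(disk_usage_pct "$1")" -ge "${2:-90}" ]
}

# Usage:
#   disk_is_full /var/lib/postgresql/data 90 && echo "PostgreSQL volume is nearly full" >&2
```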
Corruption Detected
- STOP the application immediately
- Do not attempt repairs without backup verification
- Restore from last known good backup
- Investigate root cause before resuming service
Related Documents
Last Updated: December 2025