# Database Operations Runbook

This document covers operational procedures for the PostgreSQL database used by the Customer Portal BFF.

---

## Overview

| Component       | Technology                | Location                      |
| --------------- | ------------------------- | ----------------------------- |
| Database        | PostgreSQL 17             | Configured via `DATABASE_URL` |
| ORM             | Prisma 6                  | `apps/bff/prisma/`            |
| Connection Pool | Prisma connection pooling | Default: 10 connections       |

---

## Backup Procedures

### Automated Backups

> **Note**: Configure automated backups based on your hosting environment.

**Recommended Schedule:**

- Full backup: Daily at 02:00 UTC
- Transaction log (WAL) backup: Every 15 minutes
- Retention: 30 days

### Manual Backup

```bash
# Create a full database backup (plain SQL, restore with psql)
pg_dump "$DATABASE_URL" > "backup_$(date +%Y%m%d_%H%M%S).sql"

# Create a compressed backup
pg_dump "$DATABASE_URL" | gzip > "backup_$(date +%Y%m%d_%H%M%S).sql.gz"

# Create a custom-format backup (restore with pg_restore)
pg_dump -Fc "$DATABASE_URL" > "backup_$(date +%Y%m%d_%H%M%S).dump"

# Back up specific tables
pg_dump -t users -t id_mappings "$DATABASE_URL" > user_data_backup.sql
```

### Backup Verification

```bash
# Verify backup integrity (restore to temp database, stop on the first error)
createdb portal_backup_test
psql -v ON_ERROR_STOP=1 portal_backup_test < backup_YYYYMMDD.sql

# Run basic integrity checks
psql portal_backup_test -c "SELECT COUNT(*) FROM users"
psql portal_backup_test -c "SELECT COUNT(*) FROM id_mappings"

# Clean up
dropdb portal_backup_test
```

---

## Recovery Procedures

### Point-in-Time Recovery

**Prerequisites:**

- WAL archiving enabled
- Continuous backup configured

> **Note**: The commands below restore a custom-format dump (`pg_dump -Fc`) to the state captured at backup time. Recovering to an arbitrary point in time additionally requires restoring a base backup and replaying archived WAL up to a `recovery_target_time`; those steps depend on your hosting environment.

```bash
# Stop the application
pnpm prod:stop

# Restore from a custom-format backup
# (--clean --if-exists drops existing objects before recreating them)
pg_restore --clean --if-exists -d "$DATABASE_URL" backup_YYYYMMDD.dump

# Run Prisma migrations to ensure schema is current
pnpm db:migrate

# Restart the application
pnpm prod:start
```

### Restore from SQL Backup

```bash
# Stop the application to prevent writes
pnpm prod:stop

# Drop and recreate database (DESTRUCTIVE)
dropdb portal_production
createdb portal_production

# Restore from backup
psql "$DATABASE_URL" < backup_YYYYMMDD.sql

# Verify restoration
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM users"

# Restart application
pnpm prod:start
```

---

## Migration Management

### Running Migrations

```bash
# Development: Apply pending migrations
pnpm db:migrate

# Production: Deploy migrations
pnpm db:migrate --skip-generate

# View migration status
pnpm exec prisma migrate status
```

> **Note**: `--skip-generate` is a flag of `prisma migrate dev`, which is intended for development. Prisma's documented command for production is `prisma migrate deploy`; confirm what the `db:migrate` script wraps before running it against production.

### Migration Checklist

Before deploying migrations to production:

1. [ ] Test migration on staging environment
2. [ ] Verify rollback procedure exists
3. [ ] Estimate migration duration
4. [ ] Schedule maintenance window if needed
5. [ ] Create backup before migration
6. [ ] Notify team of deployment

### Rollback Procedure

Prisma Migrate has no built-in command that reverts an applied migration. Use one of these approaches:

**Option 1: Restore from Backup**

```bash
# Restore database to pre-migration state
psql "$DATABASE_URL" < pre_migration_backup.sql

# Revert migration files in codebase
git revert <migration-commit>
```

**Option 2: Manual Rollback SQL**

```bash
# Create rollback SQL for each migration
# Store in: apps/bff/prisma/rollbacks/

# Example rollback (run from apps/bff/prisma/)
psql "$DATABASE_URL" < rollbacks/20240115_rollback.sql
```

**Option 3: Reset and Reseed (Development Only)**

```bash
# WARNING: Destroys all data
pnpm db:reset
```

---

## ID Mappings Data Integrity

The `id_mappings` table links portal users to WHMCS and Salesforce accounts. Corruption here causes authentication and data access failures.
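
One way to catch this kind of corruption early is to count orphaned and duplicate mappings on a schedule and fail loudly when any exist. The following is a minimal sketch, not part of the existing tooling: it assumes `DATABASE_URL` is set, `psql` is on `PATH`, and the columns `user_id`, `whmcs_client_id`, and `sf_account_id` exist as used elsewhere in this runbook. The function names (`report_check`, `count_rows`, `count_duplicates`, `check_id_mappings`) are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: summarize id_mappings integrity as pass/fail lines.
# Assumes DATABASE_URL is set and psql is on PATH; function names are hypothetical.

# Print a verdict for one check; succeed only when the offending-row count is 0.
report_check() {
  local label="$1" count="$2"
  if [ "$count" = "0" ]; then
    echo "OK: $label"
  else
    echo "FAIL: $label (count: ${count:-unknown})"
    return 1
  fi
}

# Run a query that returns one integer and print only that value.
count_rows() {
  psql -At -v ON_ERROR_STOP=1 -c "$1" "$DATABASE_URL"
}

# Count values that appear more than once in a given id_mappings column.
count_duplicates() {
  count_rows "SELECT COUNT(*) FROM (
    SELECT $1 FROM id_mappings WHERE $1 IS NOT NULL
    GROUP BY $1 HAVING COUNT(*) > 1) AS dup"
}

# Run all checks; return non-zero if any of them fails.
check_id_mappings() {
  local failed=0
  report_check "orphaned mappings" "$(count_rows "SELECT COUNT(*) FROM id_mappings m
    LEFT JOIN users u ON m.user_id = u.id WHERE u.id IS NULL")" || failed=1
  report_check "duplicate WHMCS mappings" "$(count_duplicates whmcs_client_id)" || failed=1
  report_check "duplicate Salesforce mappings" "$(count_duplicates sf_account_id)" || failed=1
  return "$failed"
}
```

Calling `check_id_mappings` from a scheduled job would print one line per check and exit non-zero on any failure. A failed `psql` call yields an empty count, which is reported as `FAIL` rather than silently passing.
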

### Verify Mapping Integrity

```sql
-- Check for orphaned mappings (portal user deleted but mapping exists)
SELECT m.*
FROM id_mappings m
LEFT JOIN users u ON m.user_id = u.id
WHERE u.id IS NULL;

-- Check for duplicate WHMCS mappings
SELECT whmcs_client_id, COUNT(*) as count
FROM id_mappings
WHERE whmcs_client_id IS NOT NULL
GROUP BY whmcs_client_id
HAVING COUNT(*) > 1;

-- Check for duplicate Salesforce mappings
SELECT sf_account_id, COUNT(*) as count
FROM id_mappings
WHERE sf_account_id IS NOT NULL
GROUP BY sf_account_id
HAVING COUNT(*) > 1;
```

### Fix Orphaned Mappings

```sql
-- Remove mappings for deleted users
DELETE FROM id_mappings
WHERE user_id NOT IN (SELECT id FROM users);
```

### Fix Duplicate Mappings

> **Warning**: Investigate duplicates before deleting. They may indicate data issues.

```sql
-- View duplicate details before fixing
SELECT m.*, u.email
FROM id_mappings m
JOIN users u ON m.user_id = u.id
WHERE m.whmcs_client_id IN (
  SELECT whmcs_client_id
  FROM id_mappings
  GROUP BY whmcs_client_id
  HAVING COUNT(*) > 1
);
```

---

## PostgreSQL Maintenance

### VACUUM and ANALYZE

```sql
-- Analyze all tables for query optimization
ANALYZE;

-- Vacuum to mark dead rows as reusable (does not block reads or writes)
VACUUM;

-- Full vacuum (rewrites each table under an exclusive lock, returns space to the OS)
VACUUM FULL;

-- Vacuum and analyze a specific table
VACUUM ANALYZE id_mappings;
```

**Recommended Schedule:**

- `VACUUM ANALYZE`: Daily during low-traffic hours
- `VACUUM FULL`: Monthly during a maintenance window, and only for tables with significant bloat, because it takes an `ACCESS EXCLUSIVE` lock

### Index Maintenance

```sql
-- Check index usage
SELECT schemaname, relname AS tablename, indexrelname AS indexname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

-- Find unused indexes (candidates for removal)
SELECT schemaname, relname AS tablename, indexrelname AS indexname
FROM pg_stat_user_indexes
WHERE idx_scan = 0;

-- Reindex a table (blocks writes; REINDEX TABLE CONCURRENTLY avoids that)
REINDEX TABLE id_mappings;

-- Reindex entire database (during maintenance window)
REINDEX DATABASE portal_production;
```

### Check Table Bloat

```sql
-- Estimate table bloat from dead-row statistics
SELECT
  schemaname,
  relname AS tablename,
  pg_size_pretty(pg_relation_size(relid)) AS size,
  n_dead_tup AS dead_rows,
  n_live_tup AS live_rows,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```

---

## Connection Pool Monitoring

### Check Active Connections

```sql
-- Current connection count
SELECT COUNT(*) as connections
FROM pg_stat_activity;

-- Connections by state
SELECT state, COUNT(*)
FROM pg_stat_activity
GROUP BY state;

-- Connections by application
SELECT application_name, COUNT(*)
FROM pg_stat_activity
GROUP BY application_name;

-- Long-running queries (>5 minutes)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - pg_stat_activity.query_start > interval '5 minutes';
```

### Kill Stuck Connections

```sql
-- Terminate a specific backend (replace <pid> with a pid from pg_stat_activity;
-- pg_cancel_backend(<pid>) cancels only the running query)
SELECT pg_terminate_backend(<pid>);

-- Terminate all connections except current
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
  AND datname = current_database();
```

### Prisma Connection Pool Settings

Configure in `DATABASE_URL` query parameters:

```
postgresql://user:pass@host:5432/db?connection_limit=10&pool_timeout=10
```

| Parameter          | Default | Recommended        |
| ------------------ | ------- | ------------------ |
| `connection_limit` | 10      | 10-20 per instance |
| `pool_timeout`     | 10s     | 10-30s             |

> **Note**: The default of 10 for `connection_limit` assumes the parameter is set explicitly, as in the example above. When it is omitted, Prisma sizes the pool as `num_physical_cpus * 2 + 1`. PostgreSQL client tools such as `psql` and `pg_dump` reject URI parameters they do not recognize, so strip these Prisma-specific parameters before passing the URL to them.

---

## Monitoring Queries

### Database Size

```sql
-- Total database size
SELECT pg_size_pretty(pg_database_size(current_database()));

-- Size per table
SELECT
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) as total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname || '.'
  || tablename) DESC;
```

### Query Performance

```sql
-- Slowest queries (requires pg_stat_statements extension)
-- PostgreSQL 13+ names these columns mean_exec_time and total_exec_time
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```

### Lock Monitoring

```sql
-- Sessions waiting on locks that have not been granted
SELECT
  pg_locks.pid,
  pg_stat_activity.query,
  pg_locks.mode,
  pg_locks.granted
FROM pg_locks
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted;
```

---

## Emergency Procedures

### Database Unresponsive

1. Check PostgreSQL process status
2. Check disk space and memory
3. Kill long-running queries
4. Restart PostgreSQL if necessary
5. Check application connectivity after restart

### Disk Space Full

```bash
# Check disk usage
df -h

# Find large files in PostgreSQL data directory
du -sh /var/lib/postgresql/data/*

# Free transaction log (WAL) space (only if WAL archiving is working)
# WARNING: Only if logs are properly archived. Do not delete files under
# pg_wal by hand; fix the archiver or drop stale replication slots so
# PostgreSQL can recycle them.
```

### Corruption Detected

1. **STOP** the application immediately
2. Do not attempt repairs without backup verification
3. Restore from last known good backup
4. Investigate root cause before resuming service

---

## Related Documents

- [Incident Response](./incident-response.md)
- [External Dependencies](./external-dependencies.md)
- [Provisioning Runbook](./provisioning-runbook.md)

---

**Last Updated:** December 2025