# Database Operations Runbook
This document covers operational procedures for the PostgreSQL database used by the Customer Portal BFF.

---
## Overview
| Component | Technology | Location |
| --------------- | ------------------------- | ----------------------------- |
| Database | PostgreSQL 17 | Configured via `DATABASE_URL` |
| ORM | Prisma 6 | `apps/bff/prisma/` |
| Connection Pool | Prisma connection pooling | Default: 10 connections |
---
## Backup Procedures
### Automated Backups
> **Note**: Configure automated backups based on your hosting environment.

**Recommended Schedule:**
- Full backup: Daily at 02:00 UTC
- Transaction log (WAL) backup: Every 15 minutes
- Retention: 30 days
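
The 30-day retention above can be enforced with a small pruning job when the hosting environment does not do it for you. The following is a minimal sketch, assuming compressed backups named `backup_*.sql.gz` are kept in a single directory; the function name `prune_backups` and the directory layout are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete compressed backups older than a retention window.
# Usage: prune_backups <backup_dir> [retention_days]
prune_backups() {
  local dir="$1"
  local days="${2:-30}"   # default mirrors the 30-day retention above
  # -mtime +N matches files last modified more than N full days ago;
  # -print lists each file as it is deleted
  find "$dir" -maxdepth 1 -type f -name 'backup_*.sql.gz' -mtime +"$days" -print -delete
}
```

Run it after the daily full backup, for example `prune_backups /var/backups/portal 30` (the path is a placeholder).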
### Manual Backup
```bash
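# Create a full database backup (plain SQL)
pg_dump "$DATABASE_URL" > "backup_$(date +%Y%m%d_%H%M%S).sql"
# Create a compressed backup
pg_dump "$DATABASE_URL" | gzip > "backup_$(date +%Y%m%d_%H%M%S).sql.gz"
# Create a custom-format backup (the format pg_restore expects)
pg_dump -Fc "$DATABASE_URL" > "backup_$(date +%Y%m%d).dump"
# Backup specific tables
pg_dump "$DATABASE_URL" -t users -t id_mappings > user_data_backup.sql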
```
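Prisma connection strings often carry query parameters such as `connection_limit`, `pool_timeout`, or `schema`. libpq-based tools like `pg_dump` and `psql` typically reject URI parameters they do not recognise, so the URL may need cleaning before it is passed to them. The sketch below uses a hypothetical helper, `strip_url_params`, that drops the whole query string; this is a simplification, since it also removes parameters libpq does understand (for example `sslmode`), which would then have to be supplied another way such as `PGSSLMODE`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: remove everything from the first "?" onward.
# Usage: strip_url_params <connection_url>
strip_url_params() {
  # ${1%%\?*} deletes the longest suffix that starts with a literal "?"
  printf '%s\n' "${1%%\?*}"
}

# Example (not executed here):
#   pg_dump "$(strip_url_params "$DATABASE_URL")" > backup.sql
```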
### Backup Verification
```bash
# Verify backup integrity (restore to temp database)
createdb portal_backup_test
# ON_ERROR_STOP makes psql exit non-zero on the first failed statement
psql -v ON_ERROR_STOP=1 portal_backup_test < backup_YYYYMMDD.sql
# Run basic integrity checks
psql portal_backup_test -c "SELECT COUNT(*) FROM users"
psql portal_backup_test -c "SELECT COUNT(*) FROM id_mappings"
# Clean up
dropdb portal_backup_test
```
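Before spending time on a full test restore, a compressed backup can be screened cheaply. The sketch below assumes a gzip-compressed plain SQL dump; it checks the gzip stream and then looks for the `PostgreSQL database dump complete` comment that `pg_dump` normally writes at the end of a plain-text dump. The function name `check_backup_file` is hypothetical, and passing this check does not replace the restore test above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: quick sanity check for a gzip-compressed SQL dump.
# Usage: check_backup_file <file.sql.gz>   (returns 0 if the file looks complete)
check_backup_file() {
  local file="$1"
  # 1. The gzip stream itself must be intact
  if ! gzip -t "$file" 2>/dev/null; then
    echo "corrupt or non-gzip file: $file" >&2
    return 1
  fi
  # 2. The dump must end with pg_dump's completion marker
  if ! gzip -dc "$file" | tail -n 10 | grep -q 'PostgreSQL database dump complete'; then
    echo "dump appears truncated: $file" >&2
    return 1
  fi
  return 0
}
```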
---
## Recovery Procedures
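### Point-in-Time Recovery
**Prerequisites:**
- WAL archiving enabled
- Continuous backup configured

True point-in-time recovery replays archived WAL on top of a physical base backup, and the exact steps depend on how your hosting environment manages backups. The commands below restore a logical dump instead, which recovers the database only to the moment the dump was taken.
```bash
# Stop the application
pnpm prod:stop
# Restore from a custom-format dump (created with pg_dump -Fc)
# --clean --if-exists drops existing objects before recreating them
pg_restore --clean --if-exists -d "$DATABASE_URL" backup_YYYYMMDD.dump
# Run Prisma migrations to ensure schema is current
pnpm db:migrate
# Restart the application
pnpm prod:start
```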
### Restore from SQL Backup
```bash
# Stop the application to prevent writes
pnpm prod:stop
# Drop and recreate database (DESTRUCTIVE)
dropdb portal_production
createdb portal_production
# Restore from backup
psql -v ON_ERROR_STOP=1 "$DATABASE_URL" < backup_YYYYMMDD.sql
# Verify restoration
psql "$DATABASE_URL" -c "SELECT COUNT(*) FROM users"
# Restart application
pnpm prod:start
```
---
## Migration Management
### Running Migrations
```bash
# Development: Apply pending migrations
pnpm db:migrate
# Production: Apply pending migrations only
# (--skip-generate is a `prisma migrate dev` flag; production should use `migrate deploy`)
npx prisma migrate deploy
# View migration status
npx prisma migrate status
```
### Migration Checklist
Before deploying migrations to production:
1. [ ] Test migration on staging environment
2. [ ] Verify rollback procedure exists
3. [ ] Estimate migration duration
4. [ ] Schedule maintenance window if needed
5. [ ] Create backup before migration
6. [ ] Notify team of deployment
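
Checklist item 5 can be enforced in a deploy script rather than left to memory. This is a minimal sketch, assuming the pre-migration backup is written to a known file path; the function name `backup_is_fresh` and the 60-minute default are hypothetical choices.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if a backup file exists and is recent.
# Usage: backup_is_fresh <file> [max_age_minutes]
backup_is_fresh() {
  local file="$1"
  local max_age_min="${2:-60}"
  # find prints the path only when it was modified less than N minutes ago
  [ -f "$file" ] && [ -n "$(find "$file" -mmin -"$max_age_min")" ]
}

# Example (not executed here):
#   backup_is_fresh pre_migration_backup.sql 60 || { echo "Take a backup first" >&2; exit 1; }
```

Gating the migration command on this check makes a missing or stale backup fail the deployment early.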
### Rollback Procedure
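Prisma does not have a built-in rollback command. Use one of these approaches:

**Option 1: Restore from Backup**
```bash
# Restore database to pre-migration state
# (restore into a freshly recreated database; a plain SQL dump does not drop existing objects)
psql "$DATABASE_URL" < pre_migration_backup.sql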
# Revert migration files in codebase
git revert <migration-commit>
```
**Option 2: Manual Rollback SQL**
```bash
# Create rollback SQL for each migration
# Store in: apps/bff/prisma/rollbacks/
# Example rollback
psql $DATABASE_URL < rollbacks/20240115_rollback.sql
```
**Option 3: Reset and Reseed (Development Only)**
```bash
# WARNING: Destroys all data
pnpm db:reset
```
---
## ID Mappings Data Integrity
The `id_mappings` table links portal users to WHMCS and Salesforce accounts. Corruption here causes authentication and data access failures.
### Verify Mapping Integrity
```sql
-- Check for orphaned mappings (portal user deleted but mapping exists)
SELECT m.* FROM id_mappings m
LEFT JOIN users u ON m.user_id = u.id
WHERE u.id IS NULL;
-- Check for duplicate WHMCS mappings
SELECT whmcs_client_id, COUNT(*) as count
FROM id_mappings
WHERE whmcs_client_id IS NOT NULL
GROUP BY whmcs_client_id
HAVING COUNT(*) > 1;
-- Check for duplicate Salesforce mappings
SELECT sf_account_id, COUNT(*) as count
FROM id_mappings
WHERE sf_account_id IS NOT NULL
GROUP BY sf_account_id
HAVING COUNT(*) > 1;
```
### Fix Orphaned Mappings
```sql
-- Remove mappings for deleted users
DELETE FROM id_mappings
WHERE user_id NOT IN (SELECT id FROM users);
```
### Fix Duplicate Mappings
> **Warning**: Investigate duplicates before deleting. They may indicate data issues.
```sql
-- View duplicate details before fixing
SELECT m.*, u.email FROM id_mappings m
JOIN users u ON m.user_id = u.id
WHERE m.whmcs_client_id IN (
SELECT whmcs_client_id FROM id_mappings
GROUP BY whmcs_client_id HAVING COUNT(*) > 1
);
```
---
## PostgreSQL Maintenance
### VACUUM and ANALYZE
```sql
-- Analyze all tables for query optimization
ANALYZE;
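-- Vacuum to mark dead rows as reusable (does not block reads or writes)
VACUUM;
-- Full vacuum (rewrites tables under an exclusive lock, returns space to the OS)
VACUUM FULL;
-- Vacuum specific table
VACUUM ANALYZE id_mappings;
```
**Recommended Schedule:**
- `VACUUM ANALYZE`: Daily during low-traffic hours
- `VACUUM FULL`: Monthly during maintenance window, only on tables with significant bloat (it blocks all access to the table while it runs)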
### Index Maintenance
```sql
-- Check index usage
SELECT schemaname, relname AS tablename, indexrelname AS indexname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
-- Find unused indexes (candidates for removal)
SELECT schemaname, relname AS tablename, indexrelname AS indexname
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
-- Reindex a table
REINDEX TABLE id_mappings;
-- Reindex entire database (during maintenance window)
REINDEX DATABASE portal_production;
```
### Check Table Bloat
```sql
-- Estimate table bloat
SELECT
  schemaname, relname AS tablename,
  pg_size_pretty(pg_relation_size(relid)) AS size,
  n_dead_tup AS dead_rows,
  n_live_tup AS live_rows,
  ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```
---
## Connection Pool Monitoring
### Check Active Connections
```sql
-- Current connection count
SELECT COUNT(*) as connections FROM pg_stat_activity;
-- Connections by state
SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state;
-- Connections by application
SELECT application_name, COUNT(*)
FROM pg_stat_activity
GROUP BY application_name;
-- Long-running queries (>5 minutes)
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - pg_stat_activity.query_start > interval '5 minutes';
```
### Kill Stuck Connections
```sql
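-- Terminate a specific backend connection
-- (use pg_cancel_backend(<pid>) to cancel only the running query and keep the connection)
SELECT pg_terminate_backend(<pid>);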
-- Terminate all connections except current
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
AND datname = current_database();
```
### Prisma Connection Pool Settings
Configure in `DATABASE_URL` query parameters:
```
postgresql://user:pass@host:5432/db?connection_limit=10&pool_timeout=10
```
| Parameter          | Default                     | Recommended        |
| ------------------ | --------------------------- | ------------------ |
| `connection_limit` | `num_physical_cpus * 2 + 1` | 10-20 per instance |
| `pool_timeout`     | 10s                         | 10-30s             |
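
Because `connection_limit` applies per application instance, the total demand on the server is the limit multiplied by the number of instances, and it has to fit under PostgreSQL's `max_connections`. The sketch below works through that arithmetic. It assumes the PostgreSQL defaults of `max_connections = 100` and `superuser_reserved_connections = 3`; check the real values with `SHOW max_connections;`. The function name `pool_fits` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: check whether the combined Prisma pools fit on the server.
# Usage: pool_fits <connection_limit> <instances> [max_connections] [reserved]
pool_fits() {
  local limit="$1"
  local instances="$2"
  local max_conn="${3:-100}"   # assumed PostgreSQL default
  local reserved="${4:-3}"     # assumed superuser_reserved_connections default
  local needed=$(( limit * instances ))
  local available=$(( max_conn - reserved ))
  echo "needed=$needed available=$available"
  # Succeed only when every instance can open its full pool
  [ "$needed" -le "$available" ]
}

pool_fits 10 4    # four instances at the limit used in the example URL
```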
---
## Monitoring Queries
### Database Size
```sql
-- Total database size
SELECT pg_size_pretty(pg_database_size(current_database()));
-- Size per table
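SELECT
  tablename,
  -- format('%I.%I', ...) quotes identifiers so mixed-case table names resolve correctly
  pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS total_size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC;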
```
### Query Performance
```sql
-- Slowest queries (requires pg_stat_statements extension)
-- Column names are mean_exec_time / total_exec_time since PostgreSQL 13
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```
### Lock Monitoring
```sql
-- Check for locks
SELECT
pg_locks.pid,
pg_stat_activity.query,
pg_locks.mode,
pg_locks.granted
FROM pg_locks
JOIN pg_stat_activity ON pg_locks.pid = pg_stat_activity.pid
WHERE NOT pg_locks.granted;
```
---
## Emergency Procedures
### Database Unresponsive
1. Check PostgreSQL process status
2. Check disk space and memory
3. Kill long-running queries
4. Restart PostgreSQL if necessary
5. Check application connectivity after restart
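
For step 1, `pg_isready` (shipped with the PostgreSQL client tools) reports server status through its exit code. The sketch below maps those documented exit codes to messages; the function name `describe_pg_isready` is hypothetical, and the actual probe is shown only as a comment because it needs a reachable server.

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate a pg_isready exit code into a readable status.
# Usage: describe_pg_isready <exit_code>
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections" ;;   # for example, during startup
    2) echo "no response" ;;
    3) echo "no attempt made" ;;         # for example, invalid parameters
    *) echo "unknown status: $1" ;;
  esac
}

# Example (not executed here):
#   pg_isready -d "$DATABASE_URL"; describe_pg_isready "$?"
```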
### Disk Space Full
```bash
# Check disk usage
df -h
# Find large files in PostgreSQL data directory
du -sh /var/lib/postgresql/data/*
# Clear transaction logs (if WAL archiving is working)
# WARNING: Only if logs are properly archived; never delete files from pg_wal by hand
```
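The `df` check above can be turned into a simple threshold alert so that a filling disk is noticed before the database stops. This is a minimal sketch; the function names and the 90% threshold are hypothetical, and the data directory path should be whatever your installation uses.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the used-space percentage (digits only) for a path.
disk_usage_pct() {
  # df -P keeps each filesystem on one line; column 5 is the capacity, e.g. "42%"
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed when usage is at or above the threshold.
# Usage: disk_is_critical <used_pct> [threshold_pct]
disk_is_critical() {
  local pct="$1"
  local threshold="${2:-90}"
  [ "$pct" -ge "$threshold" ]
}

# Example (not executed here):
#   disk_is_critical "$(disk_usage_pct /var/lib/postgresql/data)" 90 && echo "Disk nearly full" >&2
```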
### Corruption Detected
1. **STOP** the application immediately
2. Do not attempt repairs without backup verification
3. Restore from last known good backup
4. Investigate root cause before resuming service

---
## Related Documents
- [Incident Response](./incident-response.md)
- [External Dependencies](./external-dependencies.md)
- [Provisioning Runbook](./provisioning-runbook.md)

---
**Last Updated:** December 2025