Disaster Recovery
Overview
This document outlines the disaster recovery (DR) procedures for Kredete's infrastructure. Our DR strategy is designed to meet the following objectives:
| Objective | Target | Current |
|---|---|---|
| Recovery Time Objective (RTO) | < 4 hours | 2.5 hours |
| Recovery Point Objective (RPO) | < 1 hour | 15 minutes |
| Availability Target | 99.99% | 99.995% |
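To make the availability figures concrete, the arithmetic below converts a percentage into the downtime it permits over a 365-day year. This is a small sketch for illustration only; the function name is hypothetical and the year length is an assumption.

```shell
# Allowed downtime per 365-day year, in minutes, for an availability percentage.
downtime_budget_minutes() {
  local availability_pct="$1"
  awk -v a="$availability_pct" \
    'BEGIN { printf "%.2f\n", (100 - a) / 100 * 365 * 24 * 60 }'
}

downtime_budget_minutes 99.99    # target:  prints 52.56
downtime_budget_minutes 99.995   # current: prints 26.28
```

On these assumptions the 99.99% target allows roughly 52 minutes of downtime per year, so a single incident that consumes the full 4-hour RTO would exceed the annual budget several times over.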
Backup Strategy
Database Backups
PostgreSQL (Primary Stores)
| Type | Frequency | Retention | Location |
|---|---|---|---|
| Full Backup | Daily (02:00 UTC) | 30 days | S3 (cross-region) |
| Incremental | Every 6 hours | 7 days | S3 (same-region) |
| WAL Archives | Continuous | 7 days | S3 + Azure (DR) |
| Monthly Archive | 1st day of month | 7 years | Glacier Deep Archive |
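One way to monitor the schedule above is to alert when the newest backup of a given type is older than its interval. The following is a minimal sketch, assuming the caller supplies the backup's completion time as Unix epoch seconds (how that timestamp is obtained is deliberately left out); the function name is hypothetical.

```shell
# Return 0 if the newest backup is no older than max_age_hours, 1 otherwise.
# now_epoch is optional and defaults to the current time.
backup_is_fresh() {
  local backup_epoch="$1" max_age_hours="$2" now_epoch="${3:-$(date +%s)}"
  local age_seconds=$(( now_epoch - backup_epoch ))
  [ "$age_seconds" -ge 0 ] && [ "$age_seconds" -le $(( max_age_hours * 3600 )) ]
}

# Example: an incremental backup taken 5 hours ago passes a 6-hour check.
now="$(date +%s)"
if backup_is_fresh "$(( now - 5 * 3600 ))" 6 "$now"; then
  echo "incremental backup is fresh"
fi
```

In practice the threshold would probably include some slack on top of the nominal interval (for example 7 hours for a 6-hour job) to avoid alerting on a backup that is merely running long.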
PCI DSS Compliance
All backups containing cardholder data are encrypted with AES-256-GCM and retained in accordance with PCI DSS data-retention requirements.
MongoDB (Document Stores)
# MongoDB Backup Configuration
backup:
  enabled: true
  type: continuous
  destination:
    type: s3
    bucket: kredete-mongo-backups
    region: eu-west-1
  encryption:
    enabled: true
    algorithm: AES256
  retention:
    daily: 7
    weekly: 4
    monthly: 12
    yearly: 7
Infrastructure Backups
- Kubernetes State: Velero daily backups to S3
- Terraform State: S3 with versioning + DynamoDB locking
- Secrets: Vault snapshots every 4 hours
- Configuration: Git (GitLab) with offline mirror
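As an illustration of how two of these jobs might be invoked, the sketch below wraps the Velero and Vault commands in a dry-run helper. The schedule name `daily`, the 02:00 run time, the 30-day TTL, and the snapshot filename are assumptions rather than values from this document, and `vault operator raft snapshot save` applies only if Vault runs on integrated (Raft) storage.

```shell
# Echo the command in dry-run mode (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Daily Velero backup schedule. Name, cron expression and TTL are assumptions.
schedule_velero_daily() {
  run velero schedule create daily --schedule="0 2 * * *" --ttl 720h
}

# Vault snapshot to a timestamped file (integrated/Raft storage only).
snapshot_vault() {
  local stamp="${1:-$(date -u +%Y%m%dT%H%M%SZ)}"
  run vault operator raft snapshot save "vault-${stamp}.snap"
}

# Dry run: prints the commands for review without touching the cluster.
schedule_velero_daily
snapshot_vault
```

The dry-run output is meant for human review; it drops shell quoting, so it should not be pasted back into a terminal verbatim.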
Failover Procedures
Primary Region Failure
If the primary region (AWS af-south-1) becomes unavailable:
┌─────────────┐   Automatic   ┌─────────────┐
│ af-south-1  │ ────────────→ │  eu-west-1  │
│  (Primary)  │ DNS Failover  │  (Standby)  │
│   FAILED    │               │   ACTIVE    │
└─────────────┘               └─────────────┘
       │                             │
       ▼                             ▼
    Route 53                  RDS Read Replica
  Health Check               Promoted to Primary
    Triggers
Automated Failover Steps
- Detection (0-2 min)
  - Route 53 health checks fail (3 consecutive)
  - PagerDuty alert triggered automatically
- Traffic Redirect (2-5 min)
  - DNS TTL: 60 seconds
  - Route 53 automatically routes to eu-west-1
- Database Promotion (5-15 min)
  - RDS read replica promoted to standalone
  - Connection strings updated via Parameter Store
- Service Activation (15-30 min)
  - Kubernetes workloads scaled up
  - Cache warming initiated
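The detection and redirect windows above follow from simple arithmetic: the health check must fail three times in a row, and then cached DNS answers must expire. The sketch below estimates that delay. The check interval is an assumption (Route 53 offers 30-second standard and 10-second fast intervals; this document does not say which is configured), and the function name is hypothetical.

```shell
# Rough worst-case seconds before clients resolve the standby region:
# failure_threshold consecutive failed checks, then one full DNS TTL.
dns_failover_seconds() {
  local failure_threshold="$1" check_interval_s="$2" dns_ttl_s="$3"
  echo $(( failure_threshold * check_interval_s + dns_ttl_s ))
}

dns_failover_seconds 3 30 60   # standard interval: prints 150
dns_failover_seconds 3 10 60   # fast interval:     prints 90
```

This is an estimate only: resolvers that ignore the TTL or clients that hold long-lived connections can take longer to move.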
Partial Service Failure
| Component | Failure Response | Auto Recovery |
|---|---|---|
| Single AZ | Redistribute to remaining AZs | Yes |
| Kubernetes Node | Pods rescheduled automatically | Yes |
| Database Primary | Automatic failover to standby | Yes |
| Redis Primary | Sentinel promotes replica | Yes |
| Kafka Broker | Remaining in-sync replicas (ISR) continue serving | Yes |
Runbooks
Runbook: Database Failover
Warning
This runbook should only be executed by authorized Database Administrators with approval from the VP of Engineering.
#!/bin/bash
# Runbook: Manual Database Failover
# Author: Database Team
# Last Updated: 2026-04-15
set -euo pipefail  # abort on the first failed step

# 1. Verify current primary status
aws rds describe-db-instances \
  --db-instance-identifier kredete-primary \
  --query 'DBInstances[0].DBInstanceStatus'

# 2. Check replication health on standby
#    (StatusInfos reports replica state, not lag; read lag from the CloudWatch replica-lag metric)
aws rds describe-db-instances \
  --db-instance-identifier kredete-standby \
  --query 'DBInstances[0].StatusInfos'

# 3. Initiate failover (proceed only once replication lag is confirmed < 100 ms)
aws rds failover-db-cluster \
  --db-cluster-identifier kredete-cluster \
  --target-db-instance-identifier kredete-standby

# 4. Update application configuration (this triggers a rolling update)
kubectl set env deployment/credit-origination \
  DATABASE_HOST=kredete-standby.xxxxx.eu-west-1.rds.amazonaws.com
kubectl rollout status deployment/credit-origination

# 5. Verify connectivity (assumes psql connection settings are available in the pod)
kubectl exec deploy/credit-origination -- \
  psql -tAc "SELECT pg_is_in_recovery();"  # Should print f (false): the node is primary

# 6. Clear application caches
kubectl rollout restart deployment --selector=tier=api
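Step 3 of the runbook is conditional on replication lag being under 100 ms, but the script itself does not measure lag. The sketch below shows one way such a gate might look. It assumes the cluster is Aurora, where CloudWatch publishes `AuroraReplicaLag` in milliseconds (non-Aurora RDS publishes `ReplicaLag` in seconds instead); both function names are hypothetical.

```shell
# Return 0 only if lag_ms is a plain number strictly below threshold_ms.
lag_below_threshold() {
  local lag_ms="$1" threshold_ms="${2:-100}"
  awk -v lag="$lag_ms" -v max="$threshold_ms" 'BEGIN {
    if (lag !~ /^[0-9]+(\.[0-9]+)?$/) exit 1   # "None", empty, or malformed
    exit !(lag + 0 < max + 0)
  }'
}

# Most recent one-minute maximum of AuroraReplicaLag for an instance.
# Prints "None" when CloudWatch has no datapoints in the last 5 minutes.
fetch_replica_lag_ms() {
  local instance_id="$1" now
  now="$(date +%s)"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS --metric-name AuroraReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=${instance_id}" \
    --start-time "$(( now - 300 ))" --end-time "$now" \
    --period 60 --statistics Maximum \
    --query 'sort_by(Datapoints, &Timestamp)[-1].Maximum' --output text
}

# Possible gate before step 3 (instance name taken from the runbook):
# lag_ms="$(fetch_replica_lag_ms kredete-standby)"
# lag_below_threshold "$lag_ms" 100 || { echo "lag too high or unknown: $lag_ms" >&2; exit 1; }
```

Treating a missing datapoint as a failure is a deliberate choice here: if lag cannot be measured, the safer default is not to fail over automatically.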
Runbook: Complete System Recovery
#!/bin/bash
# Runbook: Full DR Recovery
# Time Estimate: 2-4 hours
set -euo pipefail  # stop at the first failed stage

echo "=== Stage 1: Infrastructure (30 min) ==="
cd terraform/environments/dr
terraform init
terraform apply -auto-approve

echo "=== Stage 2: Database Restore (60 min) ==="
# Restore from the most recent cluster snapshot
# (sorted by creation time; the API does not guarantee list order)
LATEST_SNAPSHOT=$(aws rds describe-db-cluster-snapshots \
  --db-cluster-identifier kredete-primary \
  --query 'sort_by(DBClusterSnapshots, &SnapshotCreateTime)[-1].DBClusterSnapshotIdentifier' \
  --output text)
# --engine is required by the CLI; aurora-postgresql is an assumption, match the source cluster
aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier kredete-dr \
  --snapshot-identifier "$LATEST_SNAPSHOT" \
  --engine aurora-postgresql

echo "=== Stage 3: Kubernetes (30 min) ==="
# Restore Velero backup
velero restore create --from-backup latest-daily

echo "=== Stage 4: Verification (30 min) ==="
# Run smoke tests
./scripts/dr-smoke-test.sh

echo "=== Stage 5: DNS Cutover (5 min) ==="
# Update Route 53 to point to DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXX \
  --change-batch file://dns-failover.json
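The final stage reads its changes from `dns-failover.json`, whose contents are not shown in this document. As a sketch of what such a file might contain, the helper below prints a Route 53 change batch that upserts a single CNAME. The record type, the 60-second default TTL, and the example hostnames are assumptions; the real file may use alias or failover records instead.

```shell
# Print a Route 53 change batch that UPSERTs one CNAME record.
make_change_batch() {
  local record_name="$1" target="$2" ttl="${3:-60}"
  cat <<EOF
{
  "Comment": "DR cutover",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "${record_name}",
        "Type": "CNAME",
        "TTL": ${ttl},
        "ResourceRecords": [{ "Value": "${target}" }]
      }
    }
  ]
}
EOF
}

# Hypothetical hostnames; redirect the output to create the file the runbook expects:
# make_change_batch api.example.com dr-lb.example.com > dns-failover.json
```

Generating the file from a function rather than editing JSON by hand during an incident reduces the chance of a syntax error at the moment of cutover.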
DR Testing
Testing Schedule
| Test Type | Frequency | Scope | Duration |
|---|---|---|---|
| Backup Verification | Weekly | Restore and validate checksums | 2 hours |
| Failover Test | Monthly | Single component failover | 4 hours |
| Partial DR | Quarterly | Database + API layer | 1 day |
| Full DR Exercise | Annually | Complete region failover | 2 days |
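The weekly backup verification restores a backup and validates its checksum. A minimal sketch of that check follows, assuming GNU coreutils `sha256sum` is available and that a manifest is written next to each backup file when it is taken; the function names and file layout are hypothetical.

```shell
# At backup time: write a SHA-256 manifest next to the backup file.
write_checksum() {
  local file="$1"
  ( cd "$(dirname "$file")" &&
    sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

# After a restore: return 0 only if the file still matches its manifest.
verify_checksum() {
  local file="$1"
  ( cd "$(dirname "$file")" &&
    sha256sum --check --status "$(basename "$file").sha256" )
}

# Example with a throwaway file standing in for a restored backup.
workdir="$(mktemp -d)"
printf 'example backup contents' > "$workdir/backup.dump"
write_checksum "$workdir/backup.dump"
verify_checksum "$workdir/backup.dump" && echo "checksum OK"
rm -rf "$workdir"
```

A matching checksum shows only that the bytes survived storage and transfer. It does not show that the backup restores into a working database, which is why the schedule pairs it with an actual restore.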
Last DR Test Results
| Test | Date | RTO Achieved | RPO Achieved | Status |
|---|---|---|---|---|
| Full DR Exercise | 2026-01-15 | 2h 15m | 12 min | ✅ Pass |
| Database Failover | 2026-03-01 | 8 min | 0 sec | ✅ Pass |
| AZ Evacuation | 2026-03-15 | 15 min | 0 sec | ✅ Pass |
Communication Plan
During Incident
| Audience | Channel | Frequency |
|---|---|---|
| Engineering Team | Slack #incident-response | Real-time |
| Executive Team | Email + SMS | Every 30 min |
| Partners/Clients | Status Page | Every 15 min |
| Regulators | Direct call (if required) | Within 4 hours |
Status Page
Public status: https://status.kredete.com
Components tracked:
- API Platform
- Credit Origination
- Partner Integrations
- Customer Portal
- Admin Dashboard