Disaster Recovery

Overview

This document outlines the disaster recovery (DR) procedures for Kredete's infrastructure. Our DR strategy is designed to meet the following objectives:

Objective	Target	Current
Recovery Time Objective (RTO)	< 4 hours	2.5 hours
Recovery Point Objective (RPO)	< 1 hour	15 minutes
Availability Target	99.99%	99.995%
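
For a sense of scale, the availability figures can be translated into a yearly downtime budget. The following is a minimal sketch, assuming a 365-day year; the function name `downtime_budget_minutes` is illustrative and not part of any existing tooling.

```shell
# Convert an availability percentage into allowed downtime (minutes) over a period.
# Assumption: a 365-day year by default; the function name is illustrative.
downtime_budget_minutes() {
  availability_pct="$1"      # e.g. 99.99
  period_days="${2:-365}"    # defaults to one year
  LC_ALL=C awk -v a="$availability_pct" -v d="$period_days" \
    'BEGIN { printf "%.2f\n", (100 - a) / 100 * d * 24 * 60 }'
}

downtime_budget_minutes 99.99      # target availability
downtime_budget_minutes 99.995     # current availability
```

At the 99.99% target this allows roughly 52.56 minutes of downtime per year; the current 99.995% corresponds to about 26.28 minutes.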

Backup Strategy

Database Backups

PostgreSQL (Primary Stores)

Type	Frequency	Retention	Location
Full Backup	Daily (02:00 UTC)	30 days	S3 (cross-region)
Incremental	Every 6 hours	7 days	S3 (same-region)
WAL Archives	Continuous	7 days	S3 + Azure (DR)
Monthly Archive	1st day of month	7 years	Glacier Deep Archive
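
One way to monitor the schedule above is to compare the age of the newest backup with its expected interval. This is a sketch under stated assumptions: timestamps are Unix epoch seconds supplied by the caller, and `backup_is_fresh` is a hypothetical helper, not an existing script.

```shell
# Succeed (exit 0) when the newest backup is no older than its expected interval.
# Assumption: both timestamps are Unix epoch seconds; the helper name is illustrative.
backup_is_fresh() {
  last_backup_epoch="$1"
  now_epoch="$2"
  max_age_seconds="$3"
  [ $(( now_epoch - last_backup_epoch )) -le "$max_age_seconds" ]
}

# Example: incremental backups run every 6 hours (21600 seconds).
now=$(date +%s)
if backup_is_fresh "$(( now - 3600 ))" "$now" 21600; then
  echo "incremental backup is fresh"
else
  echo "incremental backup is stale" >&2
fi
```

How the last backup time is obtained (for example from object metadata in the backup bucket) is left open here, since the document does not name the PostgreSQL backup buckets.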

PCI DSS Compliance

All backups containing cardholder data are encrypted with AES-256-GCM and retained in accordance with PCI DSS data-retention requirements.

MongoDB (Document Stores)

# MongoDB Backup Configuration
backup:
  enabled: true
  type: continuous
  destination:
    type: s3
    bucket: kredete-mongo-backups
    region: eu-west-1
  encryption:
    enabled: true
    algorithm: AES256
  retention:
    daily: 7
    weekly: 4
    monthly: 12
    yearly: 7

Infrastructure Backups

Kubernetes State: Velero daily backups to S3
Terraform State: S3 with versioning + DynamoDB locking
Secrets: Vault snapshots every 4 hours
Configuration: Git (GitLab) with offline mirror
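
As an illustration of how two of these schedules might be set up, the sketch below prints the commands rather than running them. The schedule name `kredete-daily`, the 03:00 cron time, the 720h TTL, and the snapshot path are all assumptions, and the Vault command applies only if Vault uses integrated (Raft) storage, which this document does not state.

```shell
# Dry-run wrapper: prints each command unless DRY_RUN=0 is exported.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Daily Velero backup schedule (name, cron time, and TTL are assumed values).
schedule_velero_daily() {
  run velero schedule create kredete-daily --schedule="0 3 * * *" --ttl 720h
}

# Vault snapshot, intended to be invoked every 4 hours by cron or a CronJob.
# Assumes Vault integrated (Raft) storage; the output path is hypothetical.
snapshot_vault() {
  run vault operator raft snapshot save "/var/backups/vault/vault-$1.snap"
}

schedule_velero_daily
snapshot_vault "$(date -u +%Y%m%dT%H%M%SZ)"
```

Keeping the commands behind a dry-run wrapper lets the schedule be reviewed before anything is created in the cluster.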

Failover Procedures

Primary Region Failure

If the primary region (AWS af-south-1) becomes unavailable:

┌─────────────┐   Automatic    ┌─────────────┐
│ af-south-1  │ ────────────→  │  eu-west-1  │
│  (Primary)  │  DNS Failover  │  (Standby)  │
│   FAILED    │                │   ACTIVE    │
└─────────────┘                └─────────────┘
       │                              │
       ▼                              ▼
   Route 53                   RDS Read Replica
 Health Check                Promoted to Primary
   Triggers

Automated Failover Steps

1. Detection (0-2 min)
   - Route 53 health checks fail (3 consecutive)
   - PagerDuty alert triggered automatically
2. Traffic Redirect (2-5 min)
   - DNS TTL: 60 seconds
   - Route 53 automatically routes to eu-west-1
3. Database Promotion (5-15 min)
   - RDS read replica promoted to standalone
   - Connection strings updated via Parameter Store
4. Service Activation (15-30 min)
   - Kubernetes workloads scaled up
   - Cache warming initiated
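
The detection and redirect windows above follow from simple arithmetic on the health check settings. The sketch below assumes a 30-second health check request interval, which this document does not state; the failure threshold (3) and DNS TTL (60 seconds) come from the steps above, and the function names are illustrative.

```shell
# Estimate how long clients may keep reaching the failed region.
# Assumption: 30-second health check interval (not stated in this document).
detection_seconds() {
  interval="$1"
  threshold="$2"
  echo $(( interval * threshold ))
}

# Detection time plus one full DNS TTL for cached records to expire.
redirect_seconds() {
  interval="$1"
  threshold="$2"
  ttl="$3"
  echo $(( $(detection_seconds "$interval" "$threshold") + ttl ))
}

detection_seconds 30 3      # seconds until the health check is marked failed
redirect_seconds 30 3 60    # seconds until cached DNS answers have expired
```

Under these assumptions detection takes about 90 seconds and the redirect completes after about 150 seconds, consistent with the 0-2 minute and 2-5 minute windows.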

Partial Service Failure

Component	Failure Response	Auto Recovery
Single AZ	Redistribute to remaining AZs	Yes
Kubernetes Node	Pods rescheduled automatically	Yes
Database Primary	Automatic failover to standby	Yes
Redis Primary	Sentinel promotes replica	Yes
Kafka Broker	ISR continues serving	Yes

Runbooks

Runbook: Database Failover

Warning

This runbook should only be executed by authorized Database Administrators with approval from the VP of Engineering.

#!/bin/bash
# Runbook: Manual Database Failover
# Author: Database Team
# Last Updated: 2026-04-15

set -euo pipefail  # abort at the first failed step instead of continuing

# 1. Verify current primary status
aws rds describe-db-instances \
  --db-instance-identifier kredete-primary \
  --query 'DBInstances[0].DBInstanceStatus'

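# 2. Check replication status on standby
# Note: StatusInfos reports read-replica state (and any replication error), not
# lag; confirm the lag figure from the CloudWatch replica-lag metric before step 3.
aws rds describe-db-instances \
  --db-instance-identifier kredete-standby \
  --query 'DBInstances[0].StatusInfos'

# 3. Initiate failover (only if replication lag is below 100 ms)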
aws rds failover-db-cluster \
  --db-cluster-identifier kredete-cluster \
  --target-db-instance-identifier kredete-standby

# 4. Update application configuration
kubectl set env deployment/credit-origination \
  DATABASE_HOST=kredete-standby.xxxxx.eu-west-1.rds.amazonaws.com

# 5. Verify connectivity
kubectl exec -it deploy/credit-origination -- \
  psql -c "SELECT pg_is_in_recovery();"  # Should return false (primary)

# 6. Clear application caches
kubectl rollout restart deployment --selector=tier=api
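
The runbook states the lag condition for step 3 only as a comment. A minimal sketch of an explicit gate follows, assuming the caller supplies the lag in milliseconds (for example, read from a monitoring metric); `lag_within_threshold` is a hypothetical helper, not part of the existing script.

```shell
# Gate the failover on replication lag (step 3 requires lag < 100 ms).
# Assumption: lag is passed in milliseconds; the helper name is illustrative.
lag_within_threshold() {
  lag_ms="$1"
  threshold_ms="${2:-100}"
  # Reject empty or non-numeric input so a failed metric lookup never passes the gate.
  case "$lag_ms" in
    ''|*[!0-9.]*) return 1 ;;
  esac
  # awk handles fractional values such as 12.5
  awk -v l="$lag_ms" -v t="$threshold_ms" 'BEGIN { exit !(l + 0 < t + 0) }'
}

if lag_within_threshold 42; then
  echo "lag below threshold, failover may proceed"
else
  echo "lag too high or unknown, aborting" >&2
fi
```

Treating an unknown lag as a failure is a deliberate choice here: the gate should block the failover whenever the measurement itself is unavailable.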

Runbook: Complete System Recovery

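#!/bin/bash
# Runbook: Full DR Recovery
# Time Estimate: 2-4 hours

# Abort on the first failed stage so DNS is never cut over to a broken environment
set -euo pipefail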

echo "=== Stage 1: Infrastructure (30 min) ==="
cd terraform/environments/dr
terraform init
terraform apply -auto-approve

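echo "=== Stage 2: Database Restore (60 min) ==="
# Restore from the most recent cluster snapshot (sorted by creation time)
LATEST_SNAPSHOT=$(aws rds describe-db-cluster-snapshots \
  --db-cluster-identifier kredete-primary \
  --query 'sort_by(DBClusterSnapshots, &SnapshotCreateTime)[-1].DBClusterSnapshotIdentifier' \
  --output text)
# --engine is required by this command; aurora-postgresql is an assumption,
# use the engine of the source cluster
aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier kredete-dr \
  --engine aurora-postgresql \
  --snapshot-identifier "$LATEST_SNAPSHOT"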

echo "=== Stage 3: Kubernetes (30 min) ==="
# Restore Velero backup
velero restore create --from-backup latest-daily

echo "=== Stage 4: Verification (30 min) ==="
# Run smoke tests
./scripts/dr-smoke-test.sh

echo "=== Stage 5: DNS Cutover (5 min) ==="
# Update Route 53 to point to DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXX \
  --change-batch file://dns-failover.json
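
The per-stage estimates in the runbook can be totalled and checked against the RTO. The sketch below uses illustrative helper names; the stage minutes come from the echo lines in the script, and 240 minutes is the < 4 hour RTO target from the overview.

```shell
# Sum the per-stage estimates (in minutes) passed as arguments.
total_minutes() {
  total=0
  for m in "$@"; do
    total=$(( total + m ))
  done
  echo "$total"
}

# Succeed when the planned total is strictly below the RTO (both in minutes).
within_rto() {
  [ "$1" -lt "$2" ]
}

planned=$(total_minutes 30 60 30 30 5)
if within_rto "$planned" 240; then
  echo "planned ${planned} min is within the 240 min RTO"
else
  echo "planned ${planned} min exceeds the 240 min RTO" >&2
fi
```

The stages sum to 155 minutes (2 h 35 min), which sits inside the 2-4 hour estimate in the script header and leaves some margin against the RTO.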

DR Testing

Testing Schedule

Test Type	Frequency	Scope	Duration
Backup Verification	Weekly	Restore and validate checksums	2 hours
Failover Test	Monthly	Single component failover	4 hours
Partial DR	Quarterly	Database + API layer	1 day
Full DR Exercise	Annually	Complete region failover	2 days
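
The weekly backup verification comes down to comparing a restored file against a digest recorded at backup time. A minimal sketch, assuming SHA-256 digests and the GNU coreutils `sha256sum` tool; the document does not say which checksum algorithm is used, and `verify_checksum` is a hypothetical helper.

```shell
# Verify a restored file against a recorded SHA-256 digest.
# Assumptions: digests are recorded at backup time; sha256sum is available.
verify_checksum() {
  file="$1"
  expected="$2"
  actual=$(sha256sum "$file" | awk '{ print $1 }')
  [ "$actual" = "$expected" ]
}

# Demonstration with a throwaway file standing in for a restored dump.
tmp=$(mktemp)
printf 'sample backup payload\n' > "$tmp"
recorded=$(sha256sum "$tmp" | awk '{ print $1 }')
if verify_checksum "$tmp" "$recorded"; then
  echo "checksum OK"
else
  echo "checksum MISMATCH" >&2
fi
rm -f "$tmp"
```

A matching digest shows only that the file is intact; validating that the restored database is usable still requires the restore step listed in the table.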

Last DR Test Results

Test	Date	RTO Achieved	RPO Achieved	Status
Full DR Exercise	2026-01-15	2h 15m	12 min	✅ Pass
Database Failover	2026-03-01	8 min	0 sec	✅ Pass
AZ Evacuation	2026-03-15	15 min	0 sec	✅ Pass

Communication Plan

During Incident

Audience	Channel	Frequency
Engineering Team	Slack #incident-response	Real-time
Executive Team	Email + SMS	Every 30 min
Partners/Clients	Status Page	Every 15 min
Regulators	Direct call (if required)	Within 4 hours

Status Page

Public status: https://status.kredete.com

Components tracked:

API Platform
Credit Origination
Partner Integrations
Customer Portal
Admin Dashboard