Disaster Recovery

Overview

This document outlines the disaster recovery (DR) procedures for Kredete's infrastructure. Our DR strategy is designed to meet the following objectives:

Objective                        Target      Current
Recovery Time Objective (RTO)    < 4 hours   2.5 hours
Recovery Point Objective (RPO)   < 1 hour    15 minutes
Availability Target              99.99%      99.995%
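
To make the availability figures concrete, the sketch below converts an availability percentage into an annual downtime budget. It assumes a 365-day year, and the function name `downtime_budget_minutes` is hypothetical rather than part of any existing Kredete tooling.

```shell
# Sketch: convert an availability percentage into allowed downtime per year.
# Assumes a 365-day year; the function name is illustrative only.

# downtime_budget_minutes <availability-percent>
# Prints the allowed downtime per year in minutes, to two decimal places.
downtime_budget_minutes() {
  LC_ALL=C awk -v pct="$1" 'BEGIN {
    minutes_per_year = 365 * 24 * 60   # 525600 minutes
    printf "%.2f\n", minutes_per_year * (100 - pct) / 100
  }'
}

downtime_budget_minutes 99.99    # target availability
downtime_budget_minutes 99.995   # current availability
```

Under these assumptions the 99.99% target allows about 52.56 minutes of downtime per year, and the current 99.995% figure corresponds to about 26.28 minutes.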

Backup Strategy

Database Backups

PostgreSQL (Primary Stores)

Type              Frequency           Retention   Location
Full Backup       Daily (02:00 UTC)   30 days     S3 (cross-region)
Incremental       Every 6 hours       7 days      S3 (same-region)
WAL Archives      Continuous          7 days      S3 + Azure (DR)
Monthly Archive   1st day of month    7 years     Glacier Deep Archive
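
A backup schedule is only useful if something notices when a run is missed. The sketch below checks whether the newest backup of a given type is older than one scheduled interval from the table above. The type keys (`full`, `incremental`) and both function names are assumptions made for illustration. How the timestamp of the last backup is obtained is left out.

```shell
# Sketch: flag a backup type whose newest backup is older than its schedule.
# Intervals come from the table above; names and type keys are hypothetical.

# max_age_seconds <type>
# Prints the longest acceptable age of the newest backup, in seconds.
max_age_seconds() {
  case "$1" in
    full)        echo $((24 * 3600)) ;;   # daily
    incremental) echo $((6 * 3600)) ;;    # every 6 hours
    *)           return 1 ;;
  esac
}

# backup_is_fresh <type> <last-backup-epoch> [now-epoch]
# Returns 0 if the newest backup is within one interval, 1 if it is stale,
# and 2 if the backup type is not recognised.
backup_is_fresh() {
  local kind="$1" last="$2" now="${3:-$(date +%s)}" limit
  limit="$(max_age_seconds "$kind")" || {
    echo "unknown backup type: $kind" >&2
    return 2
  }
  [ $((now - last)) -le "$limit" ]
}
```

A monitoring job could call `backup_is_fresh incremental "$last_epoch"` and raise an alert on a non-zero exit status, where `last_epoch` is a placeholder for the timestamp of the most recent backup.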

PCI DSS Compliance

All backups containing cardholder data are encrypted with AES-256-GCM and retained in accordance with PCI DSS data-retention requirements.

MongoDB (Document Stores)

# MongoDB Backup Configuration
backup:
  enabled: true
  type: continuous
  destination:
    type: s3
    bucket: kredete-mongo-backups
    region: eu-west-1
  encryption:
    enabled: true
    algorithm: AES256
  retention:
    daily: 7
    weekly: 4
    monthly: 12
    yearly: 7

Infrastructure Backups

  • Kubernetes State: Velero daily backups to S3
  • Terraform State: S3 with versioning + DynamoDB locking
  • Secrets: Vault snapshots every 4 hours
  • Configuration: Git (GitLab) with offline mirror

Failover Procedures

Primary Region Failure

If the primary region (AWS af-south-1) becomes unavailable:

┌─────────────┐   Automatic    ┌─────────────┐
│ af-south-1  │ ────────────→  │  eu-west-1  │
│  (Primary)  │  DNS Failover  │  (Standby)  │
│   FAILED    │                │   ACTIVE    │
└─────────────┘                └─────────────┘
       │                              │
       ▼                              ▼
   Route 53                   RDS Read Replica
 Health Check                Promoted to Primary
   Triggers

Automated Failover Steps

  1. Detection (0-2 min)
    • Route 53 health checks fail (3 consecutive)
    • PagerDuty alert triggered automatically
  2. Traffic Redirect (2-5 min)
    • DNS TTL: 60 seconds
    • Route 53 automatically routes to eu-west-1
  3. Database Promotion (5-15 min)
    • RDS read replica promoted to a standalone primary
    • Connection strings updated via Parameter Store
  4. Service Activation (15-30 min)
    • Kubernetes workloads scaled up
    • Cache warming initiated
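
The detection step depends on a run of consecutive failed health checks. The sketch below is a simplified model of that rule, not Route 53's actual implementation: it reports whether the most recent checks form an unbroken run of failures at least as long as the threshold. The function name and the `ok`/`fail` labels are assumptions made for illustration.

```shell
# Sketch: simplified consecutive-failure detection (not Route 53's own logic).
# The function name and the "ok"/"fail" labels are hypothetical.

# should_fail_over <threshold> <result>...
# Results are listed oldest first. Returns 0 when the latest results end in
# at least <threshold> consecutive failures, 1 otherwise.
should_fail_over() {
  local threshold="$1" streak=0 result
  shift
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      streak=$((streak + 1))
    else
      streak=0   # any success resets the run
    fi
  done
  [ "$streak" -ge "$threshold" ]
}
```

With a threshold of 3, the sequence `ok fail fail fail` triggers failover, while `fail fail ok fail` does not, because the successful check resets the count.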

Partial Service Failure

Component          Failure Response                          Auto Recovery
Single AZ          Redistribute to remaining AZs             Yes
Kubernetes Node    Pods rescheduled automatically            Yes
Database Primary   Automatic failover to standby             Yes
Redis Primary      Sentinel promotes replica                 Yes
Kafka Broker       In-sync replicas (ISR) continue serving   Yes

Runbooks

Runbook: Database Failover

Warning

This runbook should only be executed by authorized Database Administrators with approval from the VP of Engineering.

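#!/bin/bash
# Runbook: Manual Database Failover
# Author: Database Team
# Last Updated: 2026-04-15

# Stop at the first failed command so later steps never run against an unknown state.
set -euo pipefail

# 1. Verify current primary status
aws rds describe-db-instances \
  --db-instance-identifier kredete-primary \
  --query 'DBInstances[0].DBInstanceStatus'

# 2. Check replication status on standby
#    Note: StatusInfos reports read-replica status, not lag in milliseconds.
#    Confirm the actual lag figure from monitoring before continuing.
aws rds describe-db-instances \
  --db-instance-identifier kredete-standby \
  --query 'DBInstances[0].StatusInfos'

# 3. Initiate failover (only if lag < 100ms)
#    The operator must confirm the lag check explicitly before the failover runs.
read -r -p "Replication lag confirmed below 100 ms? Type 'yes' to continue: " answer
[ "$answer" = "yes" ] || { echo "Failover aborted." >&2; exit 1; }

aws rds failover-db-cluster \
  --db-cluster-identifier kredete-cluster \
  --target-db-instance-identifier kredete-standby

# 4. Update application configuration
kubectl set env deployment/credit-origination \
  DATABASE_HOST=kredete-standby.xxxxx.eu-west-1.rds.amazonaws.com

# 5. Verify connectivity
kubectl exec -it deploy/credit-origination -- \
  psql -c "SELECT pg_is_in_recovery();"  # Should return false (primary)

# 6. Clear application caches
kubectl rollout restart deployment --selector=tier=api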
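
Step 3 is conditional on replication lag being under 100 ms. One way to make that condition mechanical is a small gate function, sketched below. The name `lag_within_limit` is hypothetical, the function accepts whole milliseconds only, and how the lag value is measured is deliberately left out because the source does not say which metric is used.

```shell
# Sketch: gate a failover on a measured replication lag.
# The function name is hypothetical; obtaining the lag value is out of scope.

# lag_within_limit <lag-ms> [limit-ms]
# Returns 0 if lag is strictly below the limit (default 100 ms), 1 if it is
# not, and 2 if the lag value is empty or not a whole number.
lag_within_limit() {
  local lag_ms="$1" limit_ms="${2:-100}"
  # Reject empty or non-numeric input rather than treating it as zero lag.
  case "$lag_ms" in
    '' | *[!0-9]*)
      echo "invalid lag value: '$lag_ms'" >&2
      return 2
      ;;
  esac
  [ "$lag_ms" -lt "$limit_ms" ]
}

# Example use inside a runbook (lag_ms is a placeholder for a measured value):
#   lag_within_limit "$lag_ms" || { echo "Lag too high, aborting" >&2; exit 1; }
```

Treating missing or malformed input as a failure is a deliberate choice here: a broken measurement should block the failover, not silently pass the check.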

Runbook: Complete System Recovery

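#!/bin/bash
# Runbook: Full DR Recovery
# Time Estimate: 2-4 hours

# Abort on the first failure so a broken stage is never followed by DNS cutover.
set -euo pipefail

echo "=== Stage 1: Infrastructure (30 min) ==="
cd terraform/environments/dr
terraform init
terraform apply -auto-approve

echo "=== Stage 2: Database Restore (60 min) ==="
# Find the most recent cluster snapshot by sorting on creation time
# rather than relying on the order the API returns them in
LATEST_SNAPSHOT="$(aws rds describe-db-cluster-snapshots \
  --db-cluster-identifier kredete-primary \
  --query 'sort_by(DBClusterSnapshots, &SnapshotCreateTime)[-1].DBClusterSnapshotIdentifier' \
  --output text)"

# Restore the cluster from that snapshot.
# --engine is required by this command; aurora-postgresql is an assumption
# based on the PostgreSQL cluster described in this document.
aws rds restore-db-cluster-from-snapshot \
  --db-cluster-identifier kredete-dr \
  --snapshot-identifier "$LATEST_SNAPSHOT" \
  --engine aurora-postgresql

echo "=== Stage 3: Kubernetes (30 min) ==="
# Restore Velero backup
velero restore create --from-backup latest-daily

echo "=== Stage 4: Verification (30 min) ==="
# Run smoke tests
./scripts/dr-smoke-test.sh

echo "=== Stage 5: DNS Cutover (5 min) ==="
# Update Route 53 to point to DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXX \
  --change-batch file://dns-failover.json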
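
The per-stage estimates in this runbook can be checked against the recovery time objective with simple arithmetic. The sketch below sums the stage budgets and tests the total against an RTO given in minutes. Both function names are hypothetical, and the figures are the estimates printed by the runbook, not measured durations.

```shell
# Sketch: check that the runbook's stage estimates fit inside the RTO.
# Function names are hypothetical; inputs are whole minutes.

# total_minutes <minutes>...
# Prints the sum of all stage estimates.
total_minutes() {
  local sum=0 m
  for m in "$@"; do
    sum=$((sum + m))
  done
  echo "$sum"
}

# within_rto <rto-minutes> <stage-minutes>...
# Returns 0 if the summed estimates are strictly below the RTO.
within_rto() {
  local rto="$1"
  shift
  [ "$(total_minutes "$@")" -lt "$rto" ]
}

# Stage estimates from the runbook: 30, 60, 30, 30 and 5 minutes.
total_minutes 30 60 30 30 5
```

The five estimates sum to 155 minutes (2 h 35 min), which leaves 85 minutes of headroom against the 4-hour RTO, so `within_rto 240 30 60 30 30 5` succeeds.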

DR Testing

Testing Schedule

Test Type             Frequency   Scope                            Duration
Backup Verification   Weekly      Restore and validate checksums   2 hours
Failover Test         Monthly     Single component failover        4 hours
Partial DR            Quarterly   Database + API layer             1 day
Full DR Exercise      Annually    Complete region failover         2 days

Last DR Test Results

Test                Date         RTO Achieved   RPO Achieved   Status
Full DR Exercise    2026-01-15   2h 15m         12 min         ✅ Pass
Database Failover   2026-03-01   8 min          0 sec          ✅ Pass
AZ Evacuation       2026-03-15   15 min         0 sec          ✅ Pass

Communication Plan

During Incident

Audience           Channel                     Frequency
Engineering Team   Slack #incident-response    Real-time
Executive Team     Email + SMS                 Every 30 min
Partners/Clients   Status Page                 Every 15 min
Regulators         Direct call (if required)   Within 4 hours

Status Page

Public status: https://status.kredete.com

Components tracked:

  • API Platform
  • Credit Origination
  • Partner Integrations
  • Customer Portal
  • Admin Dashboard