Operations Guide

Overview

This guide covers day-to-day operations, monitoring, incident management, and maintenance procedures for the Kredete platform.


Monitoring & Alerting

Primary Dashboards

| Dashboard | URL | Purpose |
|---|---|---|
| System Health | grafana.kredete.internal/d/system-health | Overall platform status |
| API Metrics | grafana.kredete.internal/d/api-metrics | API performance and errors |
| Business KPIs | grafana.kredete.internal/d/business-kpis | Loan/payment metrics |
| Infrastructure | grafana.kredete.internal/d/infra | K8s, databases, queues |
| Security | grafana.kredete.internal/d/security | Security events and alerts |

Key Metrics

| Metric | Warning | Critical | Action |
|---|---|---|---|
| API Error Rate | > 1% | > 5% | Investigate errors |
| API Latency (P95) | > 500 ms | > 2000 ms | Scale/optimize |
| CPU Usage | > 70% | > 90% | Scale pods |
| Memory Usage | > 80% | > 95% | Scale/investigate leaks |
| Disk Usage | > 75% | > 90% | Clean up/expand storage |
| Database Connections | > 70% of pool | > 90% of pool | Scale DB/optimize queries |
| Kafka Consumer Lag | > 10K | > 100K | Scale consumers |
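
The thresholds above can be applied mechanically in ad-hoc scripts. The following is a minimal sketch, not the platform's actual alerting logic: `classify_metric` is a hypothetical helper name, and it assumes the strict "greater than" comparisons shown in the table.

```shell
# classify_metric VALUE WARNING CRITICAL
# Prints "critical", "warning", or "ok". Comparisons are strict (>), so a
# value exactly equal to a threshold stays in the lower band.
# awk is used because shell arithmetic cannot compare decimals such as 3.2.
classify_metric() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)      print "critical"
    else if (v + 0 > w + 0) print "warning"
    else                    print "ok"
  }'
}

# Example: an API error rate of 3.2% against the 1% / 5% thresholds
#   classify_metric 3.2 1 5
```

Passing the thresholds as arguments keeps one function usable for every row of the table, whatever the unit.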

Quick Health Check

# Check all services status
kubectl get pods -n kredete-production

# Check API health endpoint
curl -s https://api.kredete.com/health | jq .

# Expected output:
{
  "status": "healthy",
  "timestamp": "2026-04-16T10:00:00Z",
  "services": {
    "customer": "up",
    "loan": "up",
    "scoring": "up",
    "ledger": "up",
    "payment": "up"
  }
}
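
To avoid reading the JSON by eye, the response can be reduced to the list of services that are not "up". This is a sketch that assumes the response always has the shape shown in the expected output above; `unhealthy_services` and `health_ok` are hypothetical helper names, and `jq` must be installed.

```shell
# unhealthy_services
# Reads a /health JSON document on stdin and prints the name of every
# service whose state is not "up", one per line. Prints nothing when all
# services are up or when the "services" object is missing.
unhealthy_services() {
  jq -r '.services // {} | to_entries[] | select(.value != "up") | .key'
}

# health_ok
# Reads the same document on stdin. Succeeds only when the top-level status
# is "healthy" and no service is reported as down.
health_ok() {
  local body
  local down
  body=$(cat)
  down=$(printf '%s' "$body" | unhealthy_services) || return 2
  [ "$(printf '%s' "$body" | jq -r '.status')" = "healthy" ] && [ -z "$down" ]
}

# Example:
#   curl -s https://api.kredete.com/health | unhealthy_services
```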

On-Call Procedures

Severity Levels

| Severity | Definition | Response Time | Communication |
|---|---|---|---|
| SEV1 | Complete outage, data loss risk | 5 min acknowledge | All-hands, exec update hourly |
| SEV2 | Major feature unavailable | 15 min acknowledge | Team leads, update every 30 min |
| SEV3 | Minor feature degraded | 30 min acknowledge | Team notification |
| SEV4 | Cosmetic/minor issue | Next business day | Ticket tracking |
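
For scripts that annotate or route alerts, the acknowledgment targets above can be encoded as a lookup. This is only an illustrative sketch; `ack_minutes` is a hypothetical helper and is not part of any Kredete tooling described in this guide.

```shell
# ack_minutes SEVERITY
# Prints the acknowledgment target in minutes for SEV1 to SEV3.
# SEV4 has no minute-level target, so it prints "next-business-day".
# Returns non-zero for anything that is not a known severity.
ack_minutes() {
  local sev
  sev=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$sev" in
    SEV1) echo 5 ;;
    SEV2) echo 15 ;;
    SEV3) echo 30 ;;
    SEV4) echo "next-business-day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```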

Incident Response Flow

┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│  1. DETECT     │────▶│  2. ANALYZE    │────▶│  3. CONTAIN    │
│  - SIEM Alert  │     │  - Triage      │     │  - Isolate     │
│  - User Report │     │  - Classify    │     │  - Block       │
│  - Auto-detect │     │  - Prioritize  │     │  - Preserve    │
└────────────────┘     └────────────────┘     └───────┬────────┘
                                                      │
┌────────────────┐     ┌────────────────┐     ┌───────▼────────┐
│  6. IMPROVE    │◀────│  5. RECOVER    │◀────│  4. ERADICATE  │
│  - Root cause  │     │  - Restore     │     │  - Remove      │
│  - Update      │     │  - Verify      │     │  - Patch       │
│  - Training    │     │  - Monitor     │     │  - Harden      │
└────────────────┘     └────────────────┘     └────────────────┘

Common Operations

Creating Business Accounts

# Via API
curl -X POST https://api.kredete.com/v2/admin/businesses \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "business_name": "ABC Microfinance Ltd",
    "registration_number": "RC123456",
    "business_type": "microfinance",
    "contact": {
      "name": "John Doe",
      "email": "john@abcmfi.com",
      "phone": "+2341234567890"
    }
  }'

Batch Job Management

| Job | Schedule | Duration | Description |
|---|---|---|---|
| daily-interest-accrual | 00:00 UTC | ~30 min | Calculate daily interest |
| loan-maturity-check | 01:00 UTC | ~15 min | Mark matured loans |
| delinquency-update | 02:00 UTC | ~20 min | Update DPD flags |
| collection-assignment | 03:00 UTC | ~10 min | Assign collection tasks |
| statement-generation | 04:00 UTC | ~60 min | Generate monthly statements |

# List all CronJobs
kubectl get cronjobs -n kredete-production

# Manual trigger
kubectl create job manual-interest-$(date +%Y%m%d) \
  --from=cronjob/daily-interest-accrual \
  -n kredete-production

# Check job logs (substitute the actual Job name)
kubectl logs job/daily-interest-accrual-28561234 -n kredete-production
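
A date-only suffix yields the same Job name if a job is triggered twice on one day, and the second `kubectl create job` then fails because the name already exists. One way around this is to include the time in the suffix. The sketch below assumes a `manual-<cronjob>-<timestamp>` naming scheme, which is an illustration rather than an established Kredete convention; `manual_job_name` is a hypothetical helper.

```shell
# manual_job_name CRONJOB [STAMP]
# Prints a Job name of the form manual-<cronjob>-<stamp>.
# STAMP defaults to the current UTC time as YYYYMMDD-HHMMSS.
# The name is rejected when it exceeds 63 characters, the DNS label limit
# that Kubernetes recommends staying within for Job names.
manual_job_name() {
  local cronjob="$1"
  local stamp="${2:-$(date -u +%Y%m%d-%H%M%S)}"
  local name="manual-${cronjob}-${stamp}"
  if [ "${#name}" -gt 63 ]; then
    echo "job name too long (${#name} > 63): $name" >&2
    return 1
  fi
  printf '%s\n' "$name"
}

# Example:
#   kubectl create job "$(manual_job_name daily-interest-accrual)" \
#     --from=cronjob/daily-interest-accrual -n kredete-production
```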

Deployment Procedures

Standard Deployment

# 1. Create release tag
git tag -a v2.5.0 -m "Release 2.5.0"
git push origin v2.5.0

# 2. Deploy via kubectl
kubectl set image deployment/kredete-loan \
  kredete-loan=registry.kredete.com/kredete-loan:v2.5.0 \
  -n kredete-production

# 3. Monitor rollout
kubectl rollout status deployment/kredete-loan -n kredete-production

# 4. Verify deployment
kubectl get pods -l app=kredete-loan -n kredete-production
curl https://api.kredete.com/v2/loan/health
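
A single `curl` right after the rollout can hit the service before it is ready to serve traffic. A small retry wrapper makes the verification step less flaky. This is a generic sketch; `retry` is a hypothetical helper, and the attempt count and delay in the example are arbitrary values to tune.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed, sleeping
# DELAY_SECONDS between attempts. Returns 0 on the first success, 1 otherwise.
retry() {
  local attempts="$1"
  local delay="$2"
  local i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example: poll the loan service health endpoint for up to about a minute.
# curl -f makes HTTP errors (status >= 400) count as failures.
#   retry 10 6 curl -fsS -o /dev/null https://api.kredete.com/v2/loan/health
```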

Rollback Procedure

# Quick rollback to previous revision
kubectl rollout undo deployment/kredete-loan -n kredete-production

# Rollback to specific revision
kubectl rollout history deployment/kredete-loan -n kredete-production
kubectl rollout undo deployment/kredete-loan --to-revision=5 -n kredete-production

# Helm rollback
helm rollback kredete-stack 3 -n kredete-production
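
Before running a quick rollback it can help to see which revision it will land on. The sketch below extracts the second-highest revision number from `kubectl rollout history` output, which is the revision a plain `kubectl rollout undo` returns to. `previous_revision` is a hypothetical helper, and it assumes the usual history layout in which each revision line starts with its number.

```shell
# previous_revision
# Reads `kubectl rollout history deployment/<name>` output on stdin and
# prints the second-highest revision number. Fails when fewer than two
# revisions are listed, because there is then nothing to roll back to.
previous_revision() {
  awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -n | awk '
    { prev = last; last = $1; n++ }
    END { if (n < 2) exit 1; print prev }'
}

# Example: inspect the revision a quick rollback would restore.
#   rev=$(kubectl rollout history deployment/kredete-loan -n kredete-production | previous_revision)
#   kubectl rollout history deployment/kredete-loan --revision="$rev" -n kredete-production
```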

Warning

Always verify the rollback by running health checks and smoke tests before considering the rollback complete.


Maintenance Windows

| Day | Time (UTC) | Type | Impact |
|---|---|---|---|
| Sunday | 02:00-04:00 | Database maintenance | Brief performance impact |
| Sunday | 04:00-06:00 | Infrastructure updates | Potential restarts |
| First Sunday of the month | 02:00-08:00 | Full maintenance | Possible downtime |
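
Deployment or batch scripts may want to know whether they are running inside one of these windows. The sketch below encodes the schedule above under two assumptions: the two weekly Sunday windows are treated as one contiguous 02:00-06:00 window, and end times are exclusive. `in_maintenance_window` is a hypothetical helper.

```shell
# in_maintenance_window DOW DOM HHMM
#   DOW   ISO day of week, 1 = Monday ... 7 = Sunday   (date -u +%u)
#   DOM   day of month, 01-31                          (date -u +%d)
#   HHMM  UTC time as four digits                      (date -u +%H%M)
# Succeeds when the given moment falls inside a maintenance window.
in_maintenance_window() {
  local dow="$1"
  local dom="$2"
  local hhmm="$3"
  local end=600
  # All windows fall on a Sunday.
  [ "$dow" -eq 7 ] || return 1
  # The first Sunday of a month always lands on day 1 to 7;
  # on that day the full maintenance window runs until 08:00.
  if [ "$dom" -le 7 ]; then
    end=800
  fi
  [ "$hhmm" -ge 200 ] && [ "$hhmm" -lt "$end" ]
}

# Example:
#   if in_maintenance_window "$(date -u +%u)" "$(date -u +%d)" "$(date -u +%H%M)"; then
#     echo "inside a maintenance window"
#   fi
```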

Contact Information

| Role | Contact | Hours |
|---|---|---|
| Platform On-Call | platform-oncall@kredete.com | 24/7 |
| DBA On-Call | dba-oncall@kredete.com | 24/7 |
| NOC | noc@kredete.com | 24/7 |
| DevOps Lead | devops-lead@kredete.com | Business hours |