Operations Guide
Overview
This guide covers day-to-day operations, monitoring, incident management, and maintenance procedures for the Kredete platform.
Monitoring & Alerting
Primary Dashboards
| Dashboard | URL | Purpose |
|---|---|---|
| System Health | grafana.kredete.internal/d/system-health | Overall platform status |
| API Metrics | grafana.kredete.internal/d/api-metrics | API performance and errors |
| Business KPIs | grafana.kredete.internal/d/business-kpis | Loan/payment metrics |
| Infrastructure | grafana.kredete.internal/d/infra | K8s, databases, queues |
| Security | grafana.kredete.internal/d/security | Security events and alerts |
Key Metrics
| Metric | Warning | Critical | Action |
|---|---|---|---|
| API Error Rate | > 1% | > 5% | Investigate errors |
| API Latency (P95) | > 500ms | > 2000ms | Scale/optimize |
| CPU Usage | > 70% | > 90% | Scale pods |
| Memory Usage | > 80% | > 95% | Scale/investigate leaks |
| Disk Usage | > 75% | > 90% | Cleanup/expand storage |
| Database Connections | > 70% pool | > 90% pool | Scale DB/optimize queries |
| Kafka Consumer Lag | > 10K | > 100K | Scale consumers |
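The thresholds above are exclusive ("> 1%"), so a value exactly on a threshold stays in the lower band. A minimal sketch of that comparison follows; `classify_metric` is a hypothetical helper, not part of the platform tooling, and it assumes the value and both thresholds are given in the same unit.

```shell
#!/bin/sh
# classify_metric VALUE WARNING CRITICAL
# Hypothetical helper: prints "ok", "warning", or "critical".
# Comparisons are strict (>), matching the table. awk is used so that
# fractional values such as 0.4 compare correctly.
classify_metric() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v + 0 > c + 0)      print "critical"
    else if (v + 0 > w + 0) print "warning"
    else                    print "ok"
  }'
}
```

For example, `classify_metric 3.2 1 5` classifies a 3.2% API error rate against the 1% and 5% thresholds, and `classify_metric 96 80 95` does the same for memory usage.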
Quick Health Check
# Check all services status
kubectl get pods -n kredete-production
# Check API health endpoint
curl -s https://api.kredete.com/health | jq
# Expected output:
{
  "status": "healthy",
  "timestamp": "2026-04-16T10:00:00Z",
  "services": {
    "customer": "up",
    "loan": "up",
    "scoring": "up",
    "ledger": "up",
    "payment": "up"
  }
}
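For scripted checks, the same response can be evaluated with `jq` instead of by eye. The sketch below assumes the response always has the `status` and `services` fields shown above; `health_ok` and `unhealthy_services` are hypothetical helper names.

```shell
#!/bin/sh
# health_ok: reads the health JSON on stdin.
# Exits 0 only if status is "healthy" and every service reports "up".
# jq -e maps a false/null result to a non-zero exit status.
health_ok() {
  jq -e '.status == "healthy" and ([.services[]] | all(. == "up"))' >/dev/null
}

# unhealthy_services: reads the health JSON on stdin and prints the name
# of each service whose state is anything other than "up", one per line.
unhealthy_services() {
  jq -r '.services | to_entries[] | select(.value != "up") | .key'
}
```

Usage: `curl -s https://api.kredete.com/health | health_ok || echo "platform unhealthy"`. Piping the same response into `unhealthy_services` lists which services to look at first.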
On-Call Procedures
Severity Levels
| Severity | Definition | Response Time | Communication |
|---|---|---|---|
| SEV1 | Complete outage, data loss risk | 5 min acknowledge | All-hands, exec update hourly |
| SEV2 | Major feature unavailable | 15 min acknowledge | Team leads, update every 30 min |
| SEV3 | Minor feature degraded | 30 min acknowledge | Team notification |
| SEV4 | Cosmetic/minor issue | Next business day | Ticket tracking |
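The acknowledgement windows above can be encoded for use in paging or escalation scripts. This is a sketch only: `ack_minutes` and `ack_overdue` are hypothetical helpers, and how an overdue page is escalated is not specified in this guide.

```shell
#!/bin/sh
# ack_minutes SEVERITY
# Prints the acknowledgement window from the severity table.
# SEV4 has no minute-based window, so a label is printed instead.
ack_minutes() {
  case "$1" in
    SEV1) echo 5 ;;
    SEV2) echo 15 ;;
    SEV3) echo 30 ;;
    SEV4) echo "next-business-day" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# ack_overdue SEVERITY ELAPSED_MINUTES
# Exits 0 if the incident has gone unacknowledged longer than allowed,
# 1 if it is still within the window (or has no minute-based window),
# 2 if the severity is not recognised.
ack_overdue() {
  _limit=$(ack_minutes "$1") || return 2
  case "$_limit" in
    *[!0-9]*) return 1 ;;
  esac
  [ "$2" -gt "$_limit" ]
}
```

For example, `ack_overdue SEV1 6` succeeds because six minutes exceeds the five-minute SEV1 window.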
Incident Response Flow
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│ 1. DETECT      │────▶│ 2. ANALYZE     │────▶│ 3. CONTAIN     │
│ - SIEM Alert   │     │ - Triage       │     │ - Isolate      │
│ - User Report  │     │ - Classify     │     │ - Block        │
│ - Auto-detect  │     │ - Prioritize   │     │ - Preserve     │
└────────────────┘     └────────────────┘     └───────┬────────┘
                                                      │
┌────────────────┐     ┌────────────────┐     ┌───────▼────────┐
│ 6. IMPROVE     │◀────│ 5. RECOVER     │◀────│ 4. ERADICATE   │
│ - Root cause   │     │ - Restore      │     │ - Remove       │
│ - Update       │     │ - Verify       │     │ - Patch        │
│ - Training     │     │ - Monitor      │     │ - Harden       │
└────────────────┘     └────────────────┘     └────────────────┘
Common Operations
Creating Business Accounts
# Via API
curl -X POST https://api.kredete.com/v2/admin/businesses \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "business_name": "ABC Microfinance Ltd",
    "registration_number": "RC123456",
    "business_type": "microfinance",
    "contact": {
      "name": "John Doe",
      "email": "john@abcmfi.com",
      "phone": "+2341234567890"
    }
  }'
Batch Job Management
| Job | Schedule | Duration | Description |
|---|---|---|---|
| daily-interest-accrual | 00:00 UTC | ~30 min | Calculate daily interest |
| loan-maturity-check | 01:00 UTC | ~15 min | Mark matured loans |
| delinquency-update | 02:00 UTC | ~20 min | Update DPD flags |
| collection-assignment | 03:00 UTC | ~10 min | Assign collection tasks |
| statement-generation | 04:00 UTC | ~60 min | Generate monthly statements |
# List all CronJobs
kubectl get cronjobs -n kredete-production
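# Manual trigger (the job must be created in the same namespace as the CronJob)
kubectl create job --from=cronjob/daily-interest-accrual manual-interest-$(date +%Y%m%d) -n kredete-production
# Check job logs (substitute the job name shown by `kubectl get jobs`)
kubectl logs job/daily-interest-accrual-28561234 -n kredete-production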
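Manual triggers are easier to audit when every job follows one naming pattern. The wrapper below is a sketch: `manual_job_name` and `trigger_cronjob` are hypothetical helpers, the `manual-<cronjob>-<date>` pattern is an assumption rather than an established convention, and `KUBECTL` can be set to `echo kubectl` to print the command instead of running it.

```shell
#!/bin/sh
# Both values can be overridden from the environment.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-kredete-production}"

# manual_job_name CRONJOB [YYYYMMDD]
# Builds a job name such as manual-daily-interest-accrual-20260416.
# The date defaults to today in UTC, matching the UTC schedules above.
manual_job_name() {
  printf 'manual-%s-%s\n' "$1" "${2:-$(date -u +%Y%m%d)}"
}

# trigger_cronjob CRONJOB [YYYYMMDD]
# Creates a one-off Job from the named CronJob in $NAMESPACE.
trigger_cronjob() {
  _job=$(manual_job_name "$1" "${2:-}")
  # $KUBECTL is intentionally unquoted so "echo kubectl" splits into words.
  $KUBECTL create job --from="cronjob/$1" "$_job" -n "$NAMESPACE"
}
```

A dry run looks like `KUBECTL="echo kubectl" trigger_cronjob delinquency-update`. Note that a second trigger on the same day reuses the same name and will be rejected by Kubernetes until the earlier job is deleted.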
Deployment Procedures
Standard Deployment
# 1. Create release tag
git tag -a v2.5.0 -m "Release 2.5.0"
git push origin v2.5.0
# 2. Deploy via kubectl
kubectl set image deployment/kredete-loan \
  kredete-loan=registry.kredete.com/kredete-loan:v2.5.0 \
  -n kredete-production
# 3. Monitor rollout
kubectl rollout status deployment/kredete-loan -n kredete-production
# 4. Verify deployment
kubectl get pods -l app=kredete-loan -n kredete-production
curl https://api.kredete.com/v2/loan/health
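Steps 2 and 3 can be combined so that a failed rollout is undone automatically. The following is a sketch under stated assumptions: `deploy` is a hypothetical helper, the container name is assumed to equal the deployment name (as in the `kredete-loan` example above), and the 300-second timeout is an arbitrary choice.

```shell
#!/bin/sh
# All three values can be overridden from the environment.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-kredete-production}"
REGISTRY="${REGISTRY:-registry.kredete.com}"

# deploy SERVICE TAG
# Sets the new image, waits for the rollout, and runs `rollout undo`
# if the rollout does not complete within the timeout.
deploy() {
  _svc=$1
  _tag=$2
  $KUBECTL set image "deployment/$_svc" "$_svc=$REGISTRY/$_svc:$_tag" \
    -n "$NAMESPACE" || return 1
  if $KUBECTL rollout status "deployment/$_svc" -n "$NAMESPACE" --timeout=300s; then
    echo "rollout of $_svc:$_tag succeeded"
  else
    echo "rollout of $_svc:$_tag failed, rolling back" >&2
    $KUBECTL rollout undo "deployment/$_svc" -n "$NAMESPACE"
    return 1
  fi
}
```

Usage: `deploy kredete-loan v2.5.0`. The function returns non-zero whenever a rollback was triggered, so a CI step calling it fails visibly.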
Rollback Procedure
# Quick rollback to previous revision
kubectl rollout undo deployment/kredete-loan -n kredete-production
# Rollback to specific revision
kubectl rollout history deployment/kredete-loan -n kredete-production
kubectl rollout undo deployment/kredete-loan --to-revision=5 -n kredete-production
# Helm rollback
helm rollback kredete-stack 3 -n kredete-production
Warning
Always verify the rollback by running health checks and smoke tests before considering the rollback complete.
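Pods can take a short time to become ready after a rollback, so a single health check may fail spuriously. A minimal retry sketch follows; `wait_until` is a hypothetical helper, and the attempt count and delay in the usage line are illustrative, not recommended values.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY_SECONDS between tries.
# Exits 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until() {
  _attempts=$1
  _delay=$2
  shift 2
  _i=1
  while [ "$_i" -le "$_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # No sleep after the final attempt.
    if [ "$_i" -lt "$_attempts" ]; then
      sleep "$_delay"
    fi
    _i=$((_i + 1))
  done
  return 1
}
```

Usage: `wait_until 10 6 curl -fsS -o /dev/null https://api.kredete.com/v2/loan/health`. The `-f` flag makes curl exit non-zero on HTTP error responses, so the loop keeps retrying until the endpoint answers successfully or the attempts run out.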
Maintenance Windows
| Day | Time (UTC) | Type | Impact |
|---|---|---|---|
| Sunday | 02:00-04:00 | Database maintenance | Brief performance impact |
| Sunday | 04:00-06:00 | Infrastructure updates | Potential restarts |
| 1st Sunday/month | 02:00-08:00 | Full maintenance | Possible downtime |
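Alert routing and deploy scripts sometimes need to know whether a maintenance window is open. The sketch below encodes the table above; `maintenance_window` is a hypothetical helper, and it assumes that on the first Sunday of the month the full-maintenance window replaces the two regular Sunday windows.

```shell
#!/bin/sh
# maintenance_window DOW DOM HOUR
#   DOW  day of week, 0 = Sunday   (date -u +%w)
#   DOM  day of month, 1-31        (date -u +%d)
#   HOUR hour of day in UTC, 0-23  (date -u +%H)
# Prints the window type and exits 0 if a window is open, else exits 1.
# End times are exclusive: 04:00 belongs to the window that starts then.
maintenance_window() {
  _dow=$1
  _dom=$2
  _hour=$3
  [ "$_dow" -eq 0 ] || return 1
  # The first Sunday of a month always falls on day 1 through 7.
  if [ "$_dom" -le 7 ] && [ "$_hour" -ge 2 ] && [ "$_hour" -lt 8 ]; then
    echo "full-maintenance"
  elif [ "$_hour" -ge 2 ] && [ "$_hour" -lt 4 ]; then
    echo "database-maintenance"
  elif [ "$_hour" -ge 4 ] && [ "$_hour" -lt 6 ]; then
    echo "infrastructure-updates"
  else
    return 1
  fi
}
```

Usage: `maintenance_window "$(date -u +%w)" "$(date -u +%d)" "$(date -u +%H)"` checks the current UTC time.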
Contact Information
| Role | Contact | Hours |
|---|---|---|
| Platform On-Call | platform-oncall@kredete.com | 24/7 |
| DBA On-Call | dba-oncall@kredete.com | 24/7 |
| NOC | noc@kredete.com | 24/7 |
| DevOps Lead | devops-lead@kredete.com | Business hours |