Three Layers of Health Monitoring

Layer 1: Active Health Checks
- Periodic background checks (every 30s by default)
- Tests database connectivity and responsiveness

Layer 2: Passive Health Checks
- Runs on demand during connection pool recycling
- Validates each connection before reuse

Layer 3: Health Monitoring System
- Tracks metric baselines using EMA
- Detects anomalies (error rate, latency, pool saturation)
- Integrates with the circuit breaker for predictive opening
Active Health Checks
A periodic background task actively tests database health every 30 seconds (configurable).
Health Check Query
```sql
SELECT 1
```
Fast (~1ms), low overhead, no side effects, works with any database state.
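As a minimal sketch, a probe along these lines could run the query and enforce a timeout. The `check_database` helper below is illustrative (it assumes a DB-API 2.0 connection), not ScryData's actual implementation:

```python
import time

def check_database(conn, timeout_ms: int = 1000) -> bool:
    """Run the SELECT 1 probe; succeed only if it completes in time.

    Assumes `conn` is any DB-API 2.0 connection. DB-API has no portable
    per-query timeout, so this sketch measures elapsed wall-clock time.
    """
    start = time.monotonic()
    try:
        cur = conn.cursor()
        cur.execute("SELECT 1")
        cur.fetchone()
        cur.close()
    except Exception:
        return False  # any driver error counts as a failed check
    return (time.monotonic() - start) * 1000 <= timeout_ms
```

With sqlite3, for example, `check_database(sqlite3.connect(":memory:"))` returns `True`.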
Failure Handling
Consecutive failures are tracked:
```
Check 1: ✓ Success  (failures: 0)
Check 2: ✓ Success  (failures: 0)
Check 3: ✗ Timeout  (failures: 1)
Check 4: ✗ Timeout  (failures: 2)
Check 5: ✗ Timeout  (failures: 3) → Mark Unhealthy
```
When marked unhealthy, the circuit breaker opens and requests fail fast; health checks continue in the background until the database recovers.
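A sketch of the failure-tracking loop, reusing the `check_database` probe above; the `breaker` object with `open()`/`close()` methods is a hypothetical stand-in for the circuit breaker:

```python
import time

FAILURE_THRESHOLD = 3  # consecutive failures before marking unhealthy
INTERVAL_SECS = 30     # matches interval_secs in the configuration below

def health_check_loop(conn, breaker):
    """Background task: count consecutive failures and trip the breaker."""
    failures = 0
    while True:
        if check_database(conn, timeout_ms=1000):
            failures = 0
            breaker.close()  # recovered: resume normal traffic
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                breaker.open()  # unhealthy: requests now fail fast
        time.sleep(INTERVAL_SECS)  # checks keep running until recovery
```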
Passive Health Checks
Every connection is validated during recycling, before it can be reused:
```
Connection returned to pool
            ↓
   Health check (SELECT 1)
            ↓
  ├─ Healthy → State reset (DISCARD ALL) → Returned to pool
  └─ Failed  → Connection discarded → Pool creates a new connection
```
Benefits:
- Connection quality guaranteed
- State consistency ensured
- No stale connections
- Automatic cleanup
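The recycle path might look like the sketch below. The `ValidatingPool` class is a toy illustration, not ScryData's pool; `DISCARD ALL` is PostgreSQL's session-reset statement, so this assumes a PostgreSQL-style backend and reuses `check_database` from earlier.

```python
from collections import deque

class ValidatingPool:
    """Toy pool showing passive health checks on recycle (illustrative)."""

    def __init__(self, connect):
        self._connect = connect  # factory that opens a fresh connection
        self._idle = deque()

    def recycle(self, conn):
        """Validate a returned connection before it can be reused."""
        if check_database(conn):
            cur = conn.cursor()
            cur.execute("DISCARD ALL")  # reset session state (PostgreSQL)
            cur.close()
            self._idle.append(conn)  # healthy: back into the pool
        else:
            try:
                conn.close()  # failed: discard the stale connection
            finally:
                self._idle.append(self._connect())  # pool creates a new one
```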
Health Monitoring System
An advanced monitoring layer tracks baseline metrics using an Exponential Moving Average (EMA) and detects anomalies.
Exponential Moving Average (EMA)
EMA gives more weight to recent values while maintaining history:
EMA_new = (alpha × current_value) + ((1 - alpha) × EMA_old)
Alpha (default 0.1) controls the weighting: lower values are smoother but slower to adapt; higher values are more responsive but noisier.
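A few lines of Python make the update concrete (the sample values are illustrative):

```python
def ema_update(ema_old: float, current: float, alpha: float = 0.1) -> float:
    """One EMA step: weight alpha on the new sample, 1 - alpha on history."""
    return alpha * current + (1 - alpha) * ema_old

# Baseline P99 latency of 5.0 ms, then a 50 ms spike arrives:
print(ema_update(5.0, 50.0))             # 9.5  -> default alpha adapts slowly
print(ema_update(5.0, 50.0, alpha=0.5))  # 27.5 -> higher alpha reacts faster
```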
Health Status Levels
| Status | Description | Circuit Breaker |
|---|---|---|
| Healthy | No warnings | Closed |
| Degraded | Minor warnings present | Closed |
| Unhealthy | Critical warnings | Opens |
Warning Types
- Error Rate Spike: Current error rate > 3x baseline
- Latency Spike: Current P99 latency > 2x baseline
- Pool Saturation: Pool utilization > 95%
- Pool Starvation (Critical): No available connections + waiting requests
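A sketch of how these checks and the status levels above could fit together; the `Metrics` container and its field names are assumptions made for illustration:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float            # current error rate
    error_rate_baseline: float   # EMA baseline error rate
    latency_p99_ms: float        # current P99 latency
    latency_baseline_ms: float   # EMA baseline P99 latency
    pool_utilization: float      # 0.0 - 1.0
    waiting_requests: int

def evaluate(m: Metrics) -> tuple[str, list[str]]:
    """Map current metrics to (status, warnings) using the default factors."""
    warnings, critical = [], False
    if m.error_rate > 3.0 * m.error_rate_baseline:
        warnings.append("ErrorRateSpike")
    if m.latency_p99_ms > 2.0 * m.latency_baseline_ms:
        warnings.append("LatencySpike")
    if m.pool_utilization > 0.95:
        warnings.append("PoolSaturation")
    if m.pool_utilization >= 1.0 and m.waiting_requests > 0:
        warnings.append("PoolStarvation")
        critical = True  # critical warning: circuit breaker opens
    if critical:
        return "Unhealthy", warnings
    return ("Degraded" if warnings else "Healthy"), warnings
```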
Configuration
Active Health Checks
```toml
[resilience.healthcheck]
active_enabled = true
interval_secs = 30
timeout_ms = 1000
failure_threshold = 3
```
Health Monitoring
```toml
[health]
error_rate_spike_factor = 3.0
latency_spike_factor = 2.0
pool_saturation_threshold = 0.95
ema_alpha = 0.1
```
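For reference, these sections parse as ordinary TOML; a few lines with Python's standard tomllib show the shape they take (the file name is an assumption):

```python
import tomllib

with open("scrydata.toml", "rb") as f:  # hypothetical config file name
    cfg = tomllib.load(f)

hc = cfg["resilience"]["healthcheck"]
assert hc["failure_threshold"] == 3       # consecutive failures before unhealthy
assert cfg["health"]["ema_alpha"] == 0.1  # smoothing factor for EMA baselines
```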
Health Endpoint
```bash
curl http://localhost:9090/health
```
Healthy Response
```json
{
  "status": "Healthy",
  "uptime_secs": 3600,
  "queries_total": 10000,
  "error_rate": 0.001,
  "latency_p99_ms": 5.2,
  "pool_utilization": 0.45,
  "warnings": []
}
```
Degraded Response
```json
{
  "status": "Degraded",
  "uptime_secs": 3605,
  "queries_total": 10100,
  "error_rate": 0.025,
  "pool_utilization": 0.96,
  "warnings": [
    {
      "type": "ErrorRateSpike",
      "message": "Error rate (2.5%) is 3.0x baseline (0.8%)"
    },
    {
      "type": "PoolSaturation",
      "message": "Pool utilization at 96%"
    }
  ]
}
```
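As a usage sketch, a monitor could poll this endpoint and alert on anything other than Healthy; the URL matches the curl example above, and the print call is a placeholder for a real alerting hook:

```python
import json
import urllib.request

def poll_health(url: str = "http://localhost:9090/health") -> dict:
    """Fetch the health endpoint and surface any warnings."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        health = json.load(resp)
    if health["status"] != "Healthy":
        for w in health.get("warnings", []):
            # Placeholder: wire this into your alerting system.
            print(f"{health['status']}: {w['type']} - {w['message']}")
    return health
```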
Catch Issues Before They Impact Production
ScryData's predictive health monitoring detects degradation before failures cascade.
Request Early Access