Health Checks

Three layers of health monitoring: active checks, passive checks, and predictive anomaly detection using EMA baselines.

Three Layers of Health Monitoring

Layer 1: Active Health Checks
   ↓
Periodic background checks (every 30s default)
Tests database connectivity and responsiveness

Layer 2: Passive Health Checks
   ↓
On-demand during connection pool recycling
Validates connection before reuse

Layer 3: Health Monitoring System
   ↓
Tracks metrics baselines using EMA
Detects anomalies (error rate, latency, pool saturation)
Integrates with circuit breaker for predictive opening
                            

Active Health Checks

Periodic background task that actively tests database health every 30 seconds (configurable).

Health Check Query

SELECT 1

Fast (~1ms), low overhead, no side effects, works with any database state.
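A minimal sketch of issuing this check with a bounded timeout, assuming a PostgreSQL pool via the sqlx and tokio crates (the DISCARD ALL step elsewhere on this page implies PostgreSQL; ScryData's actual internals may differ):

use std::time::Duration;
use sqlx::postgres::PgPool;

// Returns true only if SELECT 1 completes within the timeout
// (default timeout_ms = 1000); a query error or a timeout both count as failures.
async fn active_check(pool: &PgPool, timeout: Duration) -> bool {
    match tokio::time::timeout(timeout, sqlx::query("SELECT 1").execute(pool)).await {
        Ok(Ok(_)) => true,   // check query succeeded
        Ok(Err(_)) => false, // query returned an error
        Err(_) => false,     // check timed out
    }
}

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // Illustrative connection string.
    let pool = PgPool::connect("postgres://localhost/app").await?;
    let healthy = active_check(&pool, Duration::from_millis(1000)).await;
    println!("healthy = {healthy}");
    Ok(())
}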

Failure Handling

Consecutive failures are tracked:

Check 1: ✓ Success (failures: 0)
Check 2: ✓ Success (failures: 0)
Check 3: ✗ Timeout (failures: 1)
Check 4: ✗ Timeout (failures: 2)
Check 5: ✗ Timeout (failures: 3) → Mark Unhealthy
                            

When marked unhealthy: the circuit breaker opens, requests fail fast, and health checks continue until recovery.
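A minimal sketch of this consecutive-failure tracking; the type and field names are illustrative, not ScryData's actual API:

// After `failure_threshold` consecutive failed checks the backend is marked
// unhealthy (and the circuit breaker opens); any successful check resets the counter.
struct ActiveChecker {
    consecutive_failures: u32,
    failure_threshold: u32, // default 3
    healthy: bool,
}

impl ActiveChecker {
    fn record(&mut self, check_ok: bool) {
        if check_ok {
            self.consecutive_failures = 0;
            self.healthy = true; // recovery: checks keep running until this happens
        } else {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= self.failure_threshold {
                self.healthy = false; // circuit breaker opens, requests fail fast
            }
        }
    }
}

fn main() {
    let mut checker = ActiveChecker { consecutive_failures: 0, failure_threshold: 3, healthy: true };
    // Same sequence as above: two successes, then three timeouts.
    for ok in [true, true, false, false, false] {
        checker.record(ok);
    }
    assert!(!checker.healthy);
    println!("healthy = {}", checker.healthy);
}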

Passive Health Checks

Every connection is validated when it is returned to the pool, before it can be reused:

Connection returned to pool
         ↓
    Health check (SELECT 1)
         ↓
   ┌─────┴─────┐
   │  Healthy  │ → State reset (DISCARD ALL) → Return to pool
   └───────────┘
         │
   ┌─────┴─────┐
   │  Failed   │ → Connection discarded → Pool creates new
   └───────────┘
                            

Benefits:

  • Connection quality guaranteed
  • State consistency ensured
  • No stale connections
  • Automatic cleanup
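A minimal sketch of the recycling flow shown in the diagram above, using illustrative types rather than ScryData's actual API:

// A connection coming back to the pool is probed with SELECT 1; healthy
// connections have their session state reset with DISCARD ALL and are kept,
// failed ones are dropped so the pool can open a replacement.
enum Recycled<C> {
    Reuse(C), // validated and reset, returned to the pool
    Replace,  // discarded; the pool creates a new connection
}

fn recycle<C>(
    conn: C,
    probe: impl Fn(&C) -> bool, // runs SELECT 1
    reset: impl Fn(&C) -> bool, // runs DISCARD ALL
) -> Recycled<C> {
    if probe(&conn) && reset(&conn) {
        Recycled::Reuse(conn)
    } else {
        Recycled::Replace
    }
}

fn main() {
    // Stub "connection" and stub probes, just to exercise the flow.
    match recycle("conn-1", |_| true, |_| true) {
        Recycled::Reuse(c) => println!("{c} returned to pool"),
        Recycled::Replace => println!("connection discarded; pool creates a new one"),
    }
}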

Health Monitoring System

Advanced monitoring that tracks baseline metrics using Exponential Moving Average (EMA) and detects anomalies.

Exponential Moving Average (EMA)

EMA gives more weight to recent values while maintaining history:

EMA_new = (alpha × current_value) + ((1 - alpha) × EMA_old)

Alpha (default 0.1): Lower = smoother, slower to adapt. Higher = more responsive, noisier.
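A minimal worked sketch of the update formula, using the default alpha of 0.1 (ema_alpha):

fn ema_update(ema_old: f64, current: f64, alpha: f64) -> f64 {
    alpha * current + (1.0 - alpha) * ema_old
}

fn main() {
    let alpha = 0.1;
    let mut baseline_latency_ms = 5.0;
    // A single 50 ms sample moves the 5 ms baseline only 10% of the way toward it...
    baseline_latency_ms = ema_update(baseline_latency_ms, 50.0, alpha);
    println!("{baseline_latency_ms:.1}"); // 9.5
    // ...while a sustained shift moves the baseline steadily over many samples.
    for _ in 0..20 {
        baseline_latency_ms = ema_update(baseline_latency_ms, 50.0, alpha);
    }
    println!("{baseline_latency_ms:.1}"); // ~45.1
}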

Health Status Levels

Status      Description              Circuit Breaker
Healthy     No warnings              Closed
Degraded    Minor warnings present   Closed
Unhealthy   Critical warnings        Opens
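A minimal sketch of how warnings could map to these statuses (the types are illustrative; the real mapping lives inside ScryData):

#[derive(Debug, PartialEq)]
enum HealthStatus { Healthy, Degraded, Unhealthy }

struct Warning { critical: bool }

// Any critical warning -> Unhealthy (circuit breaker opens),
// any other warning -> Degraded, otherwise Healthy.
fn status_from(warnings: &[Warning]) -> HealthStatus {
    if warnings.iter().any(|w| w.critical) {
        HealthStatus::Unhealthy
    } else if !warnings.is_empty() {
        HealthStatus::Degraded
    } else {
        HealthStatus::Healthy
    }
}

fn main() {
    assert_eq!(status_from(&[]), HealthStatus::Healthy);
    assert_eq!(status_from(&[Warning { critical: false }]), HealthStatus::Degraded);
    assert_eq!(status_from(&[Warning { critical: true }]), HealthStatus::Unhealthy);
    println!("status mapping checks pass");
}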

Warning Types

  • Error Rate Spike: Current error rate > 3x baseline
  • Latency Spike: Current P99 latency > 2x baseline
  • Pool Saturation: Pool utilization > 95%
  • Pool Starvation (Critical): No available connections + waiting requests
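A minimal sketch of these checks, comparing current metrics against their EMA baselines with the documented default factors (names and types are illustrative, not ScryData's actual API):

struct Baselines { error_rate: f64, latency_p99_ms: f64 }
struct Current {
    error_rate: f64,
    latency_p99_ms: f64,
    pool_utilization: f64,
    available_connections: u32,
    waiting_requests: u32,
}

fn detect_warnings(b: &Baselines, c: &Current) -> Vec<String> {
    let mut warnings = Vec::new();
    if c.error_rate > 3.0 * b.error_rate {
        // error_rate_spike_factor = 3.0
        warnings.push(format!(
            "ErrorRateSpike: {:.1}% vs baseline {:.1}%",
            c.error_rate * 100.0, b.error_rate * 100.0
        ));
    }
    if c.latency_p99_ms > 2.0 * b.latency_p99_ms {
        // latency_spike_factor = 2.0
        warnings.push(format!(
            "LatencySpike: {:.1} ms vs baseline {:.1} ms",
            c.latency_p99_ms, b.latency_p99_ms
        ));
    }
    if c.pool_utilization > 0.95 {
        // pool_saturation_threshold = 0.95
        warnings.push(format!("PoolSaturation: utilization at {:.0}%", c.pool_utilization * 100.0));
    }
    if c.available_connections == 0 && c.waiting_requests > 0 {
        // critical: no free connections while requests are queued
        warnings.push("PoolStarvation (critical)".to_string());
    }
    warnings
}

fn main() {
    // Values mirroring the Degraded response shown later on this page.
    let b = Baselines { error_rate: 0.008, latency_p99_ms: 5.0 };
    let c = Current {
        error_rate: 0.025,
        latency_p99_ms: 6.0,
        pool_utilization: 0.96,
        available_connections: 3,
        waiting_requests: 0,
    };
    for w in detect_warnings(&b, &c) {
        println!("{w}");
    }
}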

Configuration

Active Health Checks

[resilience.healthcheck]
active_enabled = true
interval_secs = 30
timeout_ms = 1000
failure_threshold = 3

Health Monitoring

[health]
error_rate_spike_factor = 3.0
latency_spike_factor = 2.0
pool_saturation_threshold = 0.95
ema_alpha = 0.1

Health Endpoint

curl http://localhost:9090/health

Healthy Response

{
  "status": "Healthy",
  "uptime_secs": 3600,
  "queries_total": 10000,
  "error_rate": 0.001,
  "latency_p99_ms": 5.2,
  "pool_utilization": 0.45,
  "warnings": []
}

Degraded Response

{
  "status": "Degraded",
  "uptime_secs": 3605,
  "queries_total": 10100,
  "error_rate": 0.025,
  "pool_utilization": 0.96,
  "warnings": [
    {
      "type": "ErrorRateSpike",
      "message": "Error rate (2.5%) is 3.0x baseline (0.8%)"
    },
    {
      "type": "PoolSaturation",
      "message": "Pool utilization at 96%"
    }
  ]
}
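A minimal sketch of a client polling this endpoint and reacting to the reported status, assuming the reqwest (with the "blocking" and "json" features) and serde_json crates; the URL and field names come from the responses above:

use std::{thread, time::Duration};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    loop {
        let body: serde_json::Value =
            reqwest::blocking::get("http://localhost:9090/health")?.json()?;
        let status = body["status"].as_str().unwrap_or("Unknown");
        if status != "Healthy" {
            // e.g. page an operator, shed load, or surface the warnings
            eprintln!("ScryData health is {status}: {}", body["warnings"]);
        }
        thread::sleep(Duration::from_secs(30));
    }
}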

Catch Issues Before They Impact Production

ScryData's predictive health monitoring detects degradation before failures cascade.

Request Early Access