Resilience

Production-grade resilience through integrated circuit breakers, exponential backoff retries, and health monitoring.

Resilience Layer

ScryData's resilience features work together to handle failures gracefully:

┌────────────────────────────────────────────────────┐
│                  Resilience Layer                  │
│                                                    │
│  1. Circuit Breaker                                │
│     Check if requests allowed                      │
│     If circuit open → Fail fast (<1ms)             │
│                                                    │
│  2. Connection Retry                               │
│     Exponential backoff with jitter                │
│     Retry on transient failures                    │
│                                                    │
│  3. Health Monitoring                              │
│     Active checks + Passive checks + EMA tracking  │
│     Predictive circuit opening                     │
│                                                    │
│  Result: Robust, self-healing system               │
└────────────────────────────────────────────────────┘
                            

Key Benefits

  • Fast Failure: The circuit breaker rejects requests in <1ms when the database is down
  • Automatic Recovery: Retry logic handles transient failures
  • Predictive Protection: Health monitoring detects issues before they cascade
  • Database Protection: Fail-fast rejection keeps traffic from overwhelming a struggling database

Connection Retry

Connection failures are retried automatically with exponential backoff.

Backoff Formula

backoff = min(initial_backoff × multiplier^(attempt-1), max_backoff)
jitter = random(0, backoff × jitter_factor)
total_delay = backoff + jitter

Example (Default Settings)

Attempt   Backoff   With Jitter
1         50ms      50-55ms
2         100ms     100-110ms
3         200ms     200-220ms
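
The formula and table above can be reproduced with a short Python sketch. The function name `backoff_delay_ms` is hypothetical and not part of ScryData; the parameter names and defaults mirror the configuration keys shown later on this page.

```python
import random


def backoff_delay_ms(attempt, initial_backoff_ms=50, max_backoff_ms=5000,
                     backoff_multiplier=2.0, jitter_factor=0.1):
    """Return (backoff, total_delay) in milliseconds for a 1-based attempt number.

    Hypothetical helper that follows the documented formula; not ScryData's code.
    """
    # backoff = min(initial_backoff × multiplier^(attempt-1), max_backoff)
    backoff = min(initial_backoff_ms * backoff_multiplier ** (attempt - 1), max_backoff_ms)
    # jitter = random(0, backoff × jitter_factor)
    jitter = random.uniform(0, backoff * jitter_factor)
    return backoff, backoff + jitter


for attempt in (1, 2, 3):
    backoff, total = backoff_delay_ms(attempt)
    print(f"attempt {attempt}: backoff={backoff:.0f}ms, total_delay={total:.1f}ms")
```

With the default settings the backoff values are 50, 100, and 200 ms, and each total delay falls inside the "With Jitter" range in the table. For large attempt numbers the backoff is capped at max_backoff (5000 ms by default).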

Why Jitter?

Without jitter, requests that failed at the same moment all retry at the same moment (the thundering herd problem). With jitter, retries are spread out over time, which distributes database load.

What Gets Retried

Retried: Connection refused, connection timeout, network unreachable, TLS handshake failed

Not Retried: Authentication failed, query syntax error, permission denied, circuit breaker open
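
As an illustration of the split above, here is a minimal retry loop in Python. The exception classes `TransientError` and `PermanentError` and the function `connect_with_retry` are hypothetical stand-ins for the two categories; the sketch also assumes that max_attempts counts the first attempt, which this page does not state.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical: connection refused, connection timeout, network unreachable, TLS handshake failed."""


class PermanentError(Exception):
    """Hypothetical: authentication failed, query syntax error, permission denied, circuit breaker open."""


def connect_with_retry(connect, max_attempts=3, initial_backoff_ms=50,
                       max_backoff_ms=5000, backoff_multiplier=2.0,
                       jitter_factor=0.1, sleep=time.sleep):
    """Call connect(), retrying only on TransientError (assumed semantics)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last transient error
            backoff = min(initial_backoff_ms * backoff_multiplier ** (attempt - 1),
                          max_backoff_ms)
            delay_ms = backoff + random.uniform(0, backoff * jitter_factor)
            sleep(delay_ms / 1000)
        # PermanentError is not caught, so it propagates without any retry.


# Usage: a connection that fails twice, then succeeds.
calls = []

def flaky_connect():
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("connection refused")
    return "connected"

delays = []  # record delays instead of actually sleeping
print(connect_with_retry(flaky_connect, sleep=delays.append), len(calls), delays)
```

Passing `sleep` as a parameter keeps the sketch testable: the usage example records the computed delays rather than waiting for them.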

Feature Integration

Circuit Breaker + Retry

Request → Circuit Breaker
              ↓
         ┌────┴────┐
         │  Open   │ → Reject immediately (no retry)
         └─────────┘
              ↓
         ┌────┴────┐
         │ Closed  │ → Retry Logic → Database
         └─────────┘
              ↓
         If all retries fail → Circuit breaker records failure
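
The ordering in the diagram can be sketched in Python as follows. `CountingBreaker`, `CircuitOpenError`, and `execute` are hypothetical names, the half-open state is left out, and resetting the failure count on success is an assumption. The point of the sketch is only the composition: the breaker is checked before any retry, and one failure is recorded per request after all retries are exhausted.

```python
class CircuitOpenError(Exception):
    """Hypothetical error raised when the circuit is open."""


class CountingBreaker:
    """Minimal stand-in: opens after `failure_threshold` recorded failures."""

    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.failures = 0

    def is_open(self):
        return self.failures >= self.failure_threshold

    def record_failure(self):
        self.failures += 1

    def record_success(self):
        self.failures = 0  # assumption: a success resets the count


def execute(breaker, operation, max_attempts=3):
    # 1. Circuit breaker first: an open circuit rejects immediately, no retry.
    if breaker.is_open():
        raise CircuitOpenError("circuit open: rejected without retry")
    # 2. Closed circuit: run the retry logic (backoff delay omitted for brevity).
    last_error = None
    for _ in range(max_attempts):
        try:
            result = operation()
        except ConnectionError as exc:  # stand-in for a transient failure
            last_error = exc
            continue
        breaker.record_success()
        return result
    # 3. All retries failed: the breaker records a single failure.
    breaker.record_failure()
    raise last_error


# Usage: a database that is down for every attempt.
breaker = CountingBreaker()
attempts = []

def query():
    attempts.append(1)
    raise ConnectionError("connection refused")

for _ in range(6):
    try:
        execute(breaker, query)
    except (ConnectionError, CircuitOpenError) as exc:
        print(type(exc).__name__, "after", len(attempts), "total attempts")
```

In this run the first five requests each make three attempts before failing, and the sixth is rejected by the open circuit without touching the database.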
                            

Health Monitor + Circuit Breaker

The health monitor tracks error rate, latency, and pool utilization. When the status becomes Unhealthy, the circuit breaker opens predictively, before failures cascade.
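
One way to picture the EMA tracking is the Python sketch below. The class `EmaHealthMonitor`, the smoothing factor `alpha`, and all three thresholds are assumptions chosen for illustration; this page does not specify how ScryData derives the Unhealthy status.

```python
class EmaHealthMonitor:
    """Hypothetical monitor that smooths three signals with an exponential moving average."""

    def __init__(self, alpha=0.2, max_error_rate=0.5,
                 max_latency_ms=500.0, max_pool_utilization=0.95):
        self.alpha = alpha  # weight given to the newest sample (assumed value)
        self.max_error_rate = max_error_rate
        self.max_latency_ms = max_latency_ms
        self.max_pool_utilization = max_pool_utilization
        self.error_rate = 0.0
        self.latency_ms = 0.0
        self.pool_utilization = 0.0

    def _ema(self, previous, sample):
        return self.alpha * sample + (1 - self.alpha) * previous

    def observe(self, ok, latency_ms, pool_utilization):
        """Fold one request outcome into the moving averages."""
        self.error_rate = self._ema(self.error_rate, 0.0 if ok else 1.0)
        self.latency_ms = self._ema(self.latency_ms, latency_ms)
        self.pool_utilization = self._ema(self.pool_utilization, pool_utilization)

    def status(self):
        unhealthy = (self.error_rate > self.max_error_rate
                     or self.latency_ms > self.max_latency_ms
                     or self.pool_utilization > self.max_pool_utilization)
        return "Unhealthy" if unhealthy else "Healthy"


# Usage: healthy traffic followed by a run of failures.
monitor = EmaHealthMonitor()
for _ in range(10):
    monitor.observe(ok=True, latency_ms=20.0, pool_utilization=0.3)
print(monitor.status())

failures = 0
while monitor.status() == "Healthy":
    monitor.observe(ok=False, latency_ms=20.0, pool_utilization=0.3)
    failures += 1
print(failures, monitor.status())
```

With these assumed values the smoothed error rate crosses 0.5 on the fourth consecutive failure, so a breaker that consults the monitor could open before a failure threshold of 5 is reached.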

All Features Together: Database Outage

t=0: Outage Starts

Requests fail after retry attempts. Circuit breaker counts failures.

t=5: Circuit Opens

5 failures reached. Health monitor shows Unhealthy. Circuit opens.

t=6+: Fail Fast Mode

All requests rejected in <1ms. No retries. Database protected.

t=66: Circuit Half-Open

60s elapsed. Limited requests test database recovery.

t=67: Recovery Detected

Test requests succeed (2 successes with the default success_threshold). Circuit closes. Normal operation resumes.
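
The timeline above can be replayed against a small state machine. This Python sketch is an assumption-laden illustration, not ScryData's implementation: the class `CircuitBreaker`, its method names, and the fake clock are hypothetical, t is treated as seconds, and one failed request per second is assumed during the outage. The defaults mirror failure_threshold, success_threshold, and open_timeout_secs from the configuration.

```python
import time


class CircuitBreaker:
    """Hypothetical closed -> open -> half_open -> closed state machine."""

    def __init__(self, failure_threshold=5, success_threshold=2,
                 open_timeout_secs=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.open_timeout_secs = open_timeout_secs
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = None

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()

    def allow_request(self):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_timeout_secs:
                return False  # fail fast
            self.state = "half_open"  # timeout elapsed: let test requests through
            self.successes = 0
        return True

    def record_failure(self):
        if self.state == "half_open":
            self._open()  # test request failed: reopen
        elif self.state == "closed":
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0
        elif self.state == "closed":
            self.failures = 0


class FakeClock:
    """Manually advanced clock so the replay runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
breaker = CircuitBreaker(clock=clock)

for t in range(1, 6):           # outage: one failed request per second (assumed)
    clock.now = t
    breaker.record_failure()
print("t=5 ", breaker.state)

clock.now = 6
print("t=6 ", "allowed" if breaker.allow_request() else "rejected")

clock.now = 66
print("t=66", "allowed" if breaker.allow_request() else "rejected", breaker.state)

clock.now = 67
breaker.record_success()
breaker.record_success()
print("t=67", breaker.state)
```

Injecting the clock is a design choice for the sketch: it lets the 60-second open timeout be exercised without waiting.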

Configuration

# Circuit Breaker
[resilience.circuit_breaker]
enabled = true
failure_threshold = 5
success_threshold = 2
open_timeout_secs = 60
use_health_monitor = true

# Connection Retry
[resilience.connection_retry]
enabled = true
max_attempts = 3
initial_backoff_ms = 50
max_backoff_ms = 5000
backoff_multiplier = 2.0
jitter_factor = 0.1

# Active Health Checks
[resilience.healthcheck]
active_enabled = true
interval_secs = 30
timeout_ms = 1000
failure_threshold = 3

Monitoring Resilience

# Circuit breaker state
scry_circuit_breaker_state

# Retry attempts
rate(scry_connection_retry_attempts_total[5m])

# Health status
scry_health_status

# Pool utilization
scry_pool_utilization

Alerting Examples

# Circuit opened
scry_circuit_breaker_state == 1

# Unhealthy status
scry_health_status == 2

# Frequent retries
rate(scry_connection_retry_attempts_total[5m]) > 10

Build a Self-Healing Database Layer

ScryData's integrated resilience features protect your database and recover automatically.
