Building AutoBreaker: Architecture & Design Decisions#

Introduction#

Circuit breakers are a fundamental pattern for building resilient distributed systems. They protect services from cascading failures by detecting unhealthy backends and failing fast instead of waiting for timeouts. The pattern is well-understood, widely adopted, and seemingly straightforward.

Yet most implementations share a critical flaw: they use static failure thresholds that break down under real-world traffic conditions.

This guide examines why traditional circuit breakers struggle with variable traffic, presents AutoBreaker’s solution using adaptive percentage-based thresholds, and explores the architecture of a production-grade implementation optimized for both performance and observability.

The Problem with Static Thresholds#

Traditional circuit breakers trip when a fixed number of failures occurs within a time window. For example: “open the circuit after 10 failures in 60 seconds.”

This approach appears sensible until you consider how it behaves across different traffic volumes:

High Traffic Scenario (1000 requests/minute):

  • 10 failures = 1% error rate
  • Circuit trips at 1% errors (too sensitive)
  • Results in false positives during minor hiccups
  • Degrades availability unnecessarily

Low Traffic Scenario (10 requests/minute):

  • 10 failures = 100% error rate
  • Waits for complete failure before protection
  • Results in slow detection and recovery
  • Allows extended periods of failed requests

The Core Issue: The same absolute threshold (10 failures) represents vastly different failure rates depending on traffic volume. No single static value works well across both scenarios.

This becomes particularly problematic in practice:

  • Development environments (low traffic) need different thresholds than production (high traffic)
  • Traffic patterns vary throughout the day, week, and year
  • Services experience traffic spikes during incidents, load tests, or viral events
  • Microservices architectures create diverse traffic patterns across different service boundaries

Manual threshold tuning for each environment and traffic pattern is fragile, error-prone, and doesn’t adapt to changing conditions.

The Solution: Percentage-Based Adaptive Thresholds#

Instead of counting absolute failures, adaptive circuit breakers calculate the failure rate as a percentage of recent requests:

Failure Rate = (Failed Requests / Total Requests) × 100%

Trip Condition: Failure Rate > Threshold (e.g., 5%)

This single change makes the circuit breaker traffic-aware:

At 1000 requests/minute:

  • 5% threshold = 50 failures before tripping
  • Protects against sustained 5%+ error rates
  • Ignores minor blips (<5% error rate)

At 10 requests/minute:

  • 5% of 10 requests is 0.5, so a single failure (a 10% error rate) already exceeds the threshold
  • Trips on first sustained failure
  • Provides immediate protection

The same configuration works correctly across all traffic levels without manual tuning.
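
To make the trip decision concrete, here is a minimal sketch of a percentage-based check. It is illustrative rather than AutoBreaker's internals, and the names failures, requests, threshold, and minObservations are assumptions:

// Sketch of a percentage-based trip check; names are illustrative.
func shouldTrip(failures, requests uint64, threshold float64, minObservations uint64) bool {
    // Require a minimum sample size so a single early failure does not
    // trip the circuit on statistical noise at very low traffic.
    if requests < minObservations {
        return false
    }
    failureRate := float64(failures) / float64(requests)
    return failureRate > threshold
}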

Architecture Overview#

The diagram below (Mermaid syntax) traces a request through the three circuit states and shows what triggers each state transition:

graph TD
    A[Client Request] --> B{Circuit State?}
    B -->|Closed| C[Execute Operation]
    B -->|Open| D[Fast Fail]
    B -->|Half-Open| E[Limited Execution]
    
    C --> F{Success?}
    F -->|Yes| G[Update Metrics]
    F -->|No| H[Update Metrics]
    
    G --> I{Check Threshold}
    H --> I
    
    I -->|Below| J[Remain Closed]
    I -->|Above| K[Trip to Open]
    
    E --> L{Success?}
    L -->|Yes| M[Reset to Closed]
    L -->|No| N[Return to Open]
    
    D --> O[Return ErrOpenState]
    
    J --> P[Return Result]
    K --> Q[Start Timeout]
    M --> P
    N --> Q
    
    Q --> R[Timeout Expires]
    R --> S[Transition to Half-Open]

Core Components#

  1. State Machine: Three-state implementation (Closed, Open, Half-Open); see the sketch after this list
  2. Metrics Collector: Lock-free atomic counters for performance
  3. Threshold Calculator: Adaptive percentage-based evaluation
  4. Timeout Scheduler: Manages state transitions
  5. Callback System: Extensible hooks for observability
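
The state machine from component 1 can be sketched as a small enum stored in a single machine word so it works with atomic operations. The type and constant names below are illustrative, though StateOpen also appears in the code later in this guide:

// Illustrative three-state enum, stored as a uint32 so it can be read
// and updated with atomic operations in the hot path.
type State uint32

const (
    StateClosed   State = iota // normal operation: requests pass through
    StateOpen                  // tripped: requests fail fast
    StateHalfOpen              // probing: a limited number of trial requests
)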

Performance-Optimized Design#

Lock-Free Implementation#

Traditional circuit breakers use mutexes or RW locks to protect shared state. This creates contention under high concurrency. AutoBreaker uses atomic operations throughout:

// Traditional approach (with mutex)
func (cb *CircuitBreaker) Execute(req func() (interface{}, error)) (interface{}, error) {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    
    if cb.state == Open {
        return nil, ErrOpenState
    }
    // ... execute request
}

// AutoBreaker approach (lock-free)
func (b *Breaker) Execute(req func() (interface{}, error)) (interface{}, error) {
    // Fast path: check if open using atomic load
    if atomic.LoadUint32(&b.state) == uint32(StateOpen) {
        return nil, ErrOpenState // ~0.34ns overhead
    }
    // ... execute with atomic updates
}

Performance Impact:

  • Traditional: 50-200ns overhead due to lock contention
  • AutoBreaker: <100ns overhead, zero contention
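
State transitions follow the same lock-free approach. Below is a hedged sketch of the Closed-to-Open transition using compare-and-swap, so that exactly one goroutine performs the trip; scheduleHalfOpen is a hypothetical helper, not a documented API:

// Sketch: trip the circuit without a mutex. CompareAndSwap guarantees
// that only one goroutine wins the Closed -> Open transition; the rest
// see the swap fail and simply continue.
func (b *Breaker) tripToOpen() {
    if atomic.CompareAndSwapUint32(&b.state, uint32(StateClosed), uint32(StateOpen)) {
        // The winner schedules the later Open -> Half-Open transition.
        b.scheduleHalfOpen() // hypothetical helper
    }
}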

Zero Allocations in Hot Path#

Memory allocations trigger GC pressure and reduce performance. AutoBreaker avoids allocations in the critical path:

// Bad: allocates on every request
metrics := &Metrics{Requests: 1, Failures: 0} // Allocation!

// Good: pre-allocated, updated atomically
atomic.AddUint64(&b.counters.requests, 1)

Memory Profile:

  • Execute(): 0 allocations
  • Metrics(): 1 allocation (on demand, not in hot path)
  • UpdateSettings(): 0 allocations
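
These figures can be verified with a standard Go benchmark using ReportAllocs; a minimal sketch, where the settings and the wrapped operation are illustrative:

// Allocation benchmark for the hot path; allocs/op should report 0.
func BenchmarkExecute(b *testing.B) {
    br := autobreaker.New(autobreaker.Settings{Name: "bench"})
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _, _ = br.Execute(func() (interface{}, error) {
            return nil, nil // trivially successful operation
        })
    }
}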

Observability Architecture#

Built-in Metrics#

type Metrics struct {
    State        State
    Requests     uint64
    Failures     uint64
    FailureRate  float64
    Consecutive  uint64
    LastFailure  time.Time
}
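
A snapshot can be pulled on demand for logging or dashboards. A brief sketch, assuming the Metrics() accessor mentioned in the memory profile above; the log format is illustrative:

// Sample the breaker and log its health.
m := breaker.Metrics()
log.Printf("state=%v requests=%d failures=%d failureRate=%.3f",
    m.State, m.Requests, m.Failures, m.FailureRate)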

Diagnostics API#

type Diagnostics struct {
    WillTripNext bool    // True if next failure will trip circuit
    TimeToTrip   float64 // Estimated failures needed to trip
    HealthScore  float64 // 0.0-1.0 health indicator
}
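
One use is alerting before the circuit actually opens. The sketch below assumes an accessor such as breaker.Diagnostics(), which is not shown here, so treat the method name as an assumption:

// Sketch: warn while the circuit is still closed but close to tripping.
d := breaker.Diagnostics() // assumed accessor for the Diagnostics struct
if d.WillTripNext || d.HealthScore < 0.5 {
    log.Printf("circuit near trip: healthScore=%.2f willTripNext=%v",
        d.HealthScore, d.WillTripNext)
}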

Integration Points#

  1. Prometheus: Native collector in examples/ (see the sketch after this list)
  2. Structured Logging: JSON logs with context
  3. Tracing: OpenTelemetry span attributes
  4. Alerting: Pre-configured Prometheus alerts
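
As a concrete example of the Prometheus integration point, a hand-rolled gauge (not the collector shipped in examples/; the metric name is illustrative) can export the failure rate:

// Illustrative gauge exposing the breaker's current failure rate.
failureRate := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
    Name: "circuit_breaker_failure_rate",
    Help: "Current failure rate reported by the circuit breaker.",
}, func() float64 {
    return breaker.Metrics().FailureRate
})
prometheus.MustRegister(failureRate)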

Configuration Philosophy#

Sensible Defaults#

// Default settings work for 80% of use cases
breaker := autobreaker.New(autobreaker.Settings{
    Name:    "service-name",
    Timeout: 30 * time.Second,
})
// Adaptive thresholds enabled by default
// 5% failure rate threshold
// 20 minimum observations
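
With those defaults in place, wrapping a call uses the Execute signature shown earlier. The HTTP request, URL, and logging below are only an example:

// Wrap an outbound call; Execute fails fast once the circuit is open.
status, err := breaker.Execute(func() (interface{}, error) {
    resp, err := http.Get("https://backend.internal/items") // illustrative URL
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return resp.StatusCode, nil
})
if err != nil {
    // Either the request failed or the circuit is open (ErrOpenState).
    log.Printf("request rejected or failed: %v", err)
} else {
    log.Printf("backend responded with status %d", status.(int))
}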

Runtime Updates#

// Adjust without restart
err := breaker.UpdateSettings(autobreaker.SettingsUpdate{
    FailureRateThreshold: autobreaker.Float64Ptr(0.10), // 10% threshold
    Timeout:              autobreaker.DurationPtr(15 * time.Second),
})

Production Recommendations#

Scenario            Recommended Settings
API Gateway         FailureRateThreshold: 0.01 (1%), Timeout: 5s
Database            FailureRateThreshold: 0.05 (5%), Timeout: 60s
External API        FailureRateThreshold: 0.10 (10%), Timeout: 30s
Internal Service    FailureRateThreshold: 0.03 (3%), Timeout: 10s

Testing Strategy#

Unit Tests#

  • 100% state machine coverage
  • Race condition detection
  • Concurrent execution validation

Integration Tests#

  • HTTP client/server scenarios
  • Load testing with variable traffic
  • Failure injection testing

Performance Tests#

  • Microbenchmarks for hot paths
  • Memory allocation profiling
  • Concurrent load testing

Deployment Considerations#

Version Compatibility#

  • Go 1.21+: Required for generics and atomic features
  • Zero Dependencies: Standard library only
  • Backwards Compatibility: Drop-in replacement for sony/gobreaker (migration sketch below)
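
Because the surface is designed to be gobreaker-compatible, migration is mostly a constructor and import swap. A hedged sketch; confirm the exact compatible surface against the API reference:

// Before: github.com/sony/gobreaker
// cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{Name: "backend"})

// After: AutoBreaker, with adaptive thresholds enabled by default.
cb := autobreaker.New(autobreaker.Settings{Name: "backend"})

// Call sites keep the same Execute signature.
res, err := cb.Execute(func() (interface{}, error) {
    return callBackend() // callBackend is an illustrative helper
})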

Monitoring Setup#

  1. Metrics Collection: Prometheus scraping every 15s
  2. Alerting Rules: Pre-configured in examples/prometheus/
  3. Dashboard: Grafana template provided
  4. Logging: Structured JSON with circuit context

Operational Guidelines#

  • Start Conservative: Begin with 5% threshold, adjust based on metrics
  • Monitor Closely: First week of deployment requires attention
  • Plan Rollbacks: Have configuration revert scripts ready
  • Document Incidents: Track false positives/negatives for tuning

Future Architecture Directions#

Planned Enhancements#

  1. Sliding Windows: Continuous tracking without reset boundaries
  2. Multi-dimensional Metrics: CPU, memory, latency integration
  3. Predictive Tripping: ML-based failure prediction
  4. Federation: Cross-service circuit coordination

Community RFC Process#

See Roadmap & RFCs for proposed features and community validation process.

Conclusion#

AutoBreaker’s architecture addresses the fundamental limitation of traditional circuit breakers: static thresholds that don’t adapt to traffic patterns. By combining percentage-based adaptive thresholds with a lock-free, zero-allocation implementation, it provides both correct behavior across all traffic levels and production-grade performance.

The design prioritizes:

  1. Correctness: Adaptive thresholds work at any traffic volume
  2. Performance: <100ns overhead, zero allocations in hot path
  3. Observability: Built-in metrics and diagnostics
  4. Operability: Runtime configuration, sensible defaults

For implementation details, see the source code and API reference.