Resile
Resile is an ergonomic, type-safe execution resilience and retry library for Go. Inspired by Python’s stamina, it features generic execution wrappers, AWS Full Jitter backoff, native Retry-After header support, and zero-dependency observability for distributed systems.
Install / Use
/learn @cinar/ResileREADME
Resile: Ergonomic Execution Resilience for Go
Resile is a production-grade execution resilience and retry library for Go, inspired by Python's stamina. It provides a type-safe, ergonomic, and highly observable way to handle transient failures in distributed systems.
Table of Contents
-
- Simple Retries
- Value-Yielding Retries (Generics)
- Request Hedging (Speculative Retries)
- Stateful Retries & Endpoint Rotation
- Handling Rate Limits (Retry-After)
- Aborting Retries (Pushback Signal)
- Fallback Strategies
- Bulkhead Pattern
- Rate Limiting Pattern
- Layered Defense with Circuit Breaker
- Macro-Level Protection (Adaptive Retries)
- Adaptive Concurrency (TCP-Vegas)
- Structured Logging & Telemetry
- Panic Recovery ("Let It Crash")
- Fast Unit Testing
- Reusable Clients & Dependency Injection
- Marking Errors as Fatal
- Custom Error Filtering
- Policy Composition & Chaining
- Native Multi-Error Aggregation
- Native Chaos Engineering (Fault & Latency Injection)
- Configuration Reference
Installation
go get github.com/cinar/resile
Why Resile?
In distributed systems, transient failures are a mathematical certainty. Resile simplifies the "Correct Way" to retry:
- AWS Full Jitter: Uses the industry-standard algorithm to prevent "thundering herd" synchronization.
- Adaptive Retries: Built-in token bucket rate limiting to prevent "retry storms" across a cluster.
- Generic-First: No
interface{}or reflection. Full compile-time type safety. - Context-Aware: Strictly respects
context.Contextcancellation and deadlines. - Zero-Dependency Core: The core library only depends on the Go standard library.
- Opinionated Defaults: Sensible production-ready defaults (5 attempts, exponential backoff).
- Chaos-Ready: Built-in support for fault and latency injection to test your resilience policies.
Articles & Tutorials
Want to learn more about the philosophy behind Resile and advanced resilience patterns in Go? Check out these deep dives:
- Stop Writing Manual Retry Loops in Go: Why Your Current Logic is Probably Dangerous
- Python's Stamina for Go: Bringing Ergonomic Resilience to Gophers
- Beating Tail Latency: A Guide to Request Hedging in Go Microservices
- Preventing Microservice Meltdowns: Adaptive Retries and Circuit Breakers in Go
- Self-Healing State Machines: Resilient State Transitions in Go
- Resilience Beyond Counters: Sliding Window Circuit Breakers in Go
- Stop the Domino Effect: Bulkhead Isolation in Go
- Respecting Boundaries: Precise Rate Limiting in Go
- Beyond Static Limits: Adaptive Concurrency with TCP-Vegas in Go
- Debugging the Timeline: Native Multi-Error Aggregation in Go
- Native Chaos Engineering: Testing Resilience with Fault & Latency Injection
Examples
The examples/ directory contains standalone programs showing how to use Resile in various scenarios:
- Basic Retry: Simple
DoandDoErrcalls. - Request Hedging: Reducing tail latency with speculative retries.
- HTTP with Rate Limits: Respecting
Retry-Afterheaders and usingslog. - Fallback Strategies: Returning stale data when all attempts fail.
- Stateful Rotation: Rotating API endpoints using
RetryState. - Circuit Breaker: Layering defensive strategies.
- Adaptive Retries: Preventing retry storms with a token bucket.
- Adaptive Concurrency: Dynamic concurrency limits based on latency (TCP-Vegas).
- Pushback Signal: Aborting retries immediately using
CancelAllRetries. - Panic Recovery: Implementing Erlang's "Let It Crash" philosophy.
- State Machine: Building resilient state machines inspired by Erlang's
gen_statem. - Chaos Injection: Simulating faults and latency to test your policies.
Common Use Cases
1. Simple Retries
Retry a simple operation that only returns an error. If all retries fail, Resile returns an aggregated error containing the failures from every attempt.
err := resile.DoErr(ctx, func(ctx context.Context) error {
return db.PingContext(ctx)
})
2. Value-Yielding Retries (Generics)
Fetch data with full type safety. The return type is inferred from your closure.
// val is automatically inferred as *User
user, err := resile.Do(ctx, func(ctx context.Context) (*User, error) {
return apiClient.GetUser(ctx, userID)
}, resile.WithMaxAttempts(3))
3. Request Hedging (Speculative Retries)
Speculative retries reduce tail latency by starting a second request if the first one doesn't finish within a configured HedgingDelay. The first successful result is used, and the other is cancelled.
// For value-yielding operations
data, err := resile.DoHedged(ctx, action,
resile.WithMaxAttempts(3),
resile.WithHedgingDelay(100 * time.Millisecond),
)
// For error-only operations
err := resile.DoErrHedged(ctx, action,
resile.WithMaxAttempts(2),
resile.WithHedgingDelay(50 * time.Millisecond),
)
4. Stateful Retries & Endpoint Rotation
Use DoState (or DoErrState) to access the RetryState, allowing you to rotate endpoints or fallback logic based on the failure history.
endpoints := []string{"api-v1.example.com", "api-v2.example.com"}
data, err := resile.DoState(ctx, func(ctx context.Context, state resile.RetryState) (string, error) {
// Rotate endpoint based on attempt number
url := endpoints[state.Attempt % uint(len(endpoints))]
return client.Get(ctx, url)
})
5. Handling Rate Limits (Retry-After)
Resile automatically detects if an error implements RetryAfterError. It can override the jittered backoff with a server-dictated duration and can also signal immediate termination (pushback).
type RateLimitError struct {
WaitUntil time.Time
}
func (e *RateLimitError) Error() string { return "too many requests" }
func (e *RateLimitError) RetryAfter() time.Duration {
return time.Until(e.WaitUntil)
}
func (e *RateLimitError) CancelAllRetries() bool {
// Return true to abort the entire retry loop immediately.
return false
}
// Resile will sleep exactly until WaitUntil when this error is encountered.
6. Aborting Retries (Pushback Signal)
If a downstream service returns a terminal error (like "Quota Exceeded") that shouldn't be retried, implement CancelAllRetries() bool to abort the entire retry loop immediately.
type QuotaExceededError struct{}
func (e *QuotaExceededError) Error() string { return "quota exhausted" }
func (e *QuotaExceededError) CancelAllRetries() bool { return true }
// Resile will stop immediately if this error is encountered,
// even if more attempts are remaining.
_, err := resile.Do(ctx, action, resile.WithMaxAttempts(10))
7. Fallback Strategies
Provide a fallback function to handle cases where all retries are exhausted or the circuit breaker is open. This is useful for returning stale data or default values.
data, err := resile.Do(ctx, fetchData,
resile.WithMaxAttempts(3),
resile.WithFallback(func(ctx context.Context, err error) (string, error) {
// Return stale data from cache if the primary fetch fails
return cache.Get(ctx, key), nil
}),
)
8. Bulkhead Pattern
Isolate failures by limiting the number of concurrent executions to a specific resource.
// Shared bulkhead with capacity of 10
bh := resile.NewBulkhead(10)
err := resile.DoErr(ctx,
