Retry Pattern

Paulo Menin

Software Engineer and Architect

Posted on July 04, 2022

design-patterns

resilience

services-communication

In a distributed environment, services call external resources, and those resources can be temporarily unavailable for many reasons. A resilient application must know how to deal with the failures of other services.

One way to handle failures of external calls is to use the retry pattern. A retry mechanism controls the interval between attempts and the number of retries before giving up.

But first, an application must determine whether the failure is transient. A transient failure is caused by a temporary fault condition that takes a service down for a short time, often just a few seconds. Examples of transient failures are:

Network failure: missing connectivity for a short period.
Unavailability: a service has crashed or is down for maintenance.
Overload: an external service that cannot accept new requests.

On the other hand, an authentication failure, for example, will not resolve itself after a few seconds and should not be retried.
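The distinction between transient and permanent failures can be sketched as a small predicate. This is only an illustration: the function name `is_transient` and the set of HTTP status codes treated as transient are assumptions, not a standard, and should be adapted to the services being called.

```python
from typing import Optional

# Assumed set of HTTP status codes treated as transient (timeouts, overload,
# temporary unavailability). This is an illustrative choice, not a standard.
TRANSIENT_STATUS = {408, 429, 502, 503, 504}


def is_transient(error: Optional[BaseException] = None,
                 status_code: Optional[int] = None) -> bool:
    """Return True when a retry has a reasonable chance of succeeding."""
    # Connectivity problems and timeouts usually clear up on their own.
    if isinstance(error, (ConnectionError, TimeoutError)):
        return True
    # The remote service reported overload or temporary unavailability.
    if status_code in TRANSIENT_STATUS:
        return True
    # Anything else (for example a 401 authentication failure) is not retried.
    return False
```

With this sketch, `is_transient(status_code=503)` is true, while `is_transient(status_code=401)` is false, so an authentication failure is reported to the caller immediately instead of being retried.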

Different applications have different requirements and need different fault-handling strategies. Some must retry frequently; others can tolerate a long interval between attempts, for example background jobs where no user is waiting for feedback.

The retry pattern should be used when faults are likely to self-correct after a short delay.

Interval between attempts

Also known as back-off time, this is the time to wait before making another attempt. Some strategies are:

Constant/Fixed interval: the interval is the same between attempts.
Incremental back-off: the interval is calculated from the number of attempts. It can be calculated in many ways:
- Linear: the number of attempts multiplied by a constant.
- Exponential: $t = b^n$, where $t$ is the time delay, $b$ is the multiplicative factor and $n$ is the number of attempts. A common value for $b$ is 2 ($t = 2^n$), which doubles the delay on every attempt.
- Fibonacci: the sum of the last two intervals.
Random: usually a value between a minimum and a maximum range.
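The strategies listed above could be written as small delay functions, as in the sketch below. The function names, the default values and the choice of seconds as the unit are all assumptions made for illustration; `attempt` is taken to start at 1 for the first retry.

```python
import random


def constant_delay(attempt: int, interval: float = 1.0) -> float:
    # Same interval regardless of how many attempts were made.
    return interval


def linear_delay(attempt: int, step: float = 1.0) -> float:
    # Grows by a constant amount per attempt: step, 2*step, 3*step, ...
    return step * attempt


def exponential_delay(attempt: int, base: float = 2.0) -> float:
    # base raised to the number of attempts; base 2 doubles the delay each time.
    return base ** attempt


def fibonacci_delay(attempt: int) -> float:
    # Each interval is the sum of the previous two: 1, 1, 2, 3, 5, ...
    previous, current = 0, 1
    for _ in range(attempt - 1):
        previous, current = current, previous + current
    return float(current)


def random_delay(attempt: int, minimum: float = 0.5, maximum: float = 3.0) -> float:
    # A value drawn uniformly between the minimum and the maximum.
    return random.uniform(minimum, maximum)
```

For the first four attempts, the exponential function with base 2 yields delays of 2, 4, 8 and 16 seconds, while the Fibonacci function yields 1, 1, 2 and 3 seconds, which shows how much faster the exponential strategy backs away from a struggling service.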

A strategy can also combine approaches, such as a constant interval plus jitter (a short random delay), so that many parallel clients do not retry at the same moment. Another combination is a truncated back-off: define a maximum interval so that attempts don’t end up waiting longer and longer.
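Both combinations can be sketched in a few lines. The names `truncated_exponential_delay` and `with_jitter`, the 30-second cap and the half-second jitter range are assumptions chosen for the example, not recommended values.

```python
import random


def truncated_exponential_delay(attempt: int,
                                base: float = 2.0,
                                max_delay: float = 30.0) -> float:
    # Exponential growth, capped so the wait never exceeds max_delay.
    return min(base ** attempt, max_delay)


def with_jitter(delay: float, max_jitter: float = 0.5) -> float:
    # Add a short random delay so parallel clients spread out their retries.
    return delay + random.uniform(0.0, max_jitter)
```

A caller might combine them as `with_jitter(truncated_exponential_delay(attempt))`. With the defaults above, the tenth attempt waits a little over 30 seconds rather than the 1024 seconds that an uncapped base-2 back-off would give.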

When to give up

A common strategy for giving up and aborting the operation is to allow a fixed number of attempts.
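A minimal sketch of a retry loop with a fixed number of attempts might look like the following. The helper name `retry`, its parameters and the decision to treat only `ConnectionError` as retryable are assumptions for the example; the `sleep` parameter exists only so the waiting can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T],
          max_attempts: int = 3,
          delay_seconds: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            # Assumed here to be the only transient failure worth retrying.
            if attempt == max_attempts:
                raise  # Give up: the last attempt also failed.
            sleep(delay_seconds)
    raise ValueError("max_attempts must be at least 1")
```

Any other exception propagates immediately, so non-transient failures are never retried. When the last attempt fails, the original exception is re-raised so the caller can see why the operation was abandoned.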

Another possibility is to use a timeout. In this case, the number of attempts depends on the strategy used to determine the back-off time.
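A timeout-based variant could be sketched as below, here paired with an exponential back-off. The name `retry_until_timeout`, its defaults and the use of `ConnectionError` as the retryable failure are assumptions; `clock` and `sleep` are injectable only to make the behaviour easy to test.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until_timeout(operation: Callable[[], T],
                        timeout_seconds: float = 10.0,
                        base: float = 2.0,
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry with exponential back-off until the time budget is used up."""
    deadline = clock() + timeout_seconds
    attempt = 0
    while True:
        attempt += 1
        try:
            return operation()
        except ConnectionError:
            delay = base ** attempt
            # Give up if the next wait would go past the deadline.
            if clock() + delay > deadline:
                raise
            sleep(delay)
```

Assuming each call fails almost instantly, a 10-second timeout with base 2 allows three attempts (after waits of 2 and 4 seconds, the next 8-second wait would exceed the budget). A constant interval under the same timeout would allow a different number of attempts, which is why the count depends on the back-off strategy.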