Retries and Exponential Backoff

Master the logic of persistence. Learn how to handle transient network errors without overwhelming failing services.

🔄 Retries & Exponential Backoff

In a distributed system, network failures are inevitable. Retries allow a system to recover from "transient" (temporary) errors.

💡 The Logic (ELI5)

The Retry (Try Again)

Think of a Phone Call:

  1. You call your friend. It says "Busy."
  2. You hang up and call again immediately.
  3. This is a Retry.

Exponential Backoff (Wait Longer)

If you keep calling every second while the whole network is down, you are just making the problem worse! Exponential Backoff is the rule of doubling the wait after every failure:

  • Wait 1 second. Try again.
  • Still fails? Wait 2 seconds. Try again.
  • Still fails? Wait 4 seconds... then 8... then 16.
  • This gives the system time to recover.
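The doubling schedule above can be sketched in a few lines of Python. This is a minimal illustration, not a production client: the names `backoff_delays` and `call_with_backoff` are hypothetical, `ConnectionError` stands in for whatever transient error your client raises, and the `sleep` parameter exists only so the waiting can be swapped out in tests.

```python
import time


def backoff_delays(base=1.0, factor=2.0, max_retries=5):
    """Return the wait in seconds before each retry: base, base*factor, ..."""
    return [base * factor ** attempt for attempt in range(max_retries)]


def call_with_backoff(operation, max_retries=5, base=1.0, sleep=time.sleep):
    """Call operation(); on ConnectionError, wait and retry.

    Makes at most max_retries + 1 attempts in total. ConnectionError is an
    assumed stand-in for a transient failure.
    """
    for delay in backoff_delays(base=base, max_retries=max_retries):
        try:
            return operation()
        except ConnectionError:
            sleep(delay)  # wait longer after each failure
    # Final attempt: if it still fails, let the error reach the caller.
    return operation()
```

With the defaults, `backoff_delays()` returns `[1.0, 2.0, 4.0, 8.0, 16.0]`, matching the waits listed above.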

πŸ” The Deep Dive

When to Retry?

Retry only when the error is "Transient" (e.g., 503 Service Unavailable, 504 Gateway Timeout) and the operation is "Idempotent", meaning it is safe to do twice, like checking a balance or searching for a cat. Never retry if the server says 400 Bad Request or 401 Unauthorized: the request itself is the problem, so sending it again will fail the same way.
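One conservative way to encode this decision is a small predicate that retries only when the request is idempotent and the status code is transient. The sketch below is an assumption-laden illustration: the function name `should_retry` is hypothetical, the status set contains only the two codes named above, and the method set follows the HTTP convention that GET, HEAD, PUT, DELETE, and OPTIONS are idempotent.

```python
# HTTP methods conventionally treated as idempotent (assumed set for this sketch).
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}

# Transient server-side errors named in the text above.
TRANSIENT_STATUSES = {503, 504}


def should_retry(method, status):
    """Return True only for an idempotent request that hit a transient error."""
    return method.upper() in IDEMPOTENT_METHODS and status in TRANSIENT_STATUSES


print(should_retry("GET", 503))   # idempotent + transient
print(should_retry("POST", 503))  # not idempotent by default
print(should_retry("GET", 400))   # client error: retrying cannot help
```

A POST can still be made safe to retry if the server deduplicates on a unique request ID, which is a separate design choice from this predicate.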

Jitter: The Secret Ingredient

If 1,000 servers all start their exponential backoff at the exact same time, they will all retry in lockstep: after exactly 1s, then 2s, then 4s. This creates "Waves" of traffic that can crash the database (The Thundering Herd). Jitter adds a small random amount of time (e.g., 1.2s instead of 1.0s) to every retry, so the retries are spread out instead of arriving together.
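As a rough sketch of additive jitter, the function below takes the exponential delay and adds a random extra of up to `max_jitter` seconds. The name `jittered_delay`, the 0.5-second cap, and the `rng` parameter (used only to make the randomness controllable in tests) are all assumptions for illustration.

```python
import random


def jittered_delay(attempt, base=1.0, max_jitter=0.5, rng=random):
    """Exponential delay for this attempt plus a small random extra.

    attempt 0 waits about base seconds, attempt 1 about 2*base, and so on.
    """
    exponential = base * 2 ** attempt
    return exponential + rng.uniform(0, max_jitter)


# Each client draws its own random extra, so retries no longer line up.
for attempt in range(4):
    print(f"attempt {attempt}: wait {jittered_delay(attempt):.2f}s")
```

Other jitter strategies exist, such as picking the whole delay at random between zero and the exponential value; the additive form is shown here because it matches the "1.2s instead of 1.0s" example.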


🎯 Interview Pulse

Use Case

Always mention "Retries with Exponential Backoff and Jitter" when asked how to handle Network Timeouts.

The "Retry Storm"

Interviewers will ask: "What if your DB is already slow, and all your clients start retrying?" Answer: This is a Retry Storm. Use a Circuit Breaker to stop the clients from sending requests at all until the DB is healthy again.
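To make the Circuit Breaker idea concrete, here is a deliberately small sketch. It assumes a simple policy: open the circuit after a fixed number of consecutive failures, then allow a trial request once a timeout has passed. The class and method names, the thresholds, and the injectable `clock` are hypothetical choices for illustration, not any particular library's API.

```python
import time


class CircuitBreaker:
    """Blocks calls after repeated failures until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Circuit is open: allow a trial request only after the cool-down.
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open) the circuit
```

A client would check `allow_request()` before each call and skip the call, retries included, while the circuit is open, which is what keeps a slow DB from being buried under repeated attempts.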

Key Terms

  • Maximum Retries: Stop trying after X attempts so the user isn't waiting forever.
  • Idempotency: Ensuring that multiple identical requests have the same effect as a single request (e.g., by attaching a unique ID to every order so the server can recognize and ignore a repeated request).
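The unique-ID idea can be sketched with an in-memory service that remembers which request IDs it has already processed. Everything here is hypothetical and simplified for illustration: `OrderService`, `create_order`, and the `idempotency_key` parameter are made-up names, and a real system would keep the keys in durable storage rather than a dictionary.

```python
import uuid


class OrderService:
    """Toy order service that deduplicates requests by idempotency key."""

    def __init__(self):
        self._orders_by_key = {}

    def create_order(self, idempotency_key, item):
        # A retried request carries the same key, so return the stored order
        # instead of creating a second one.
        if idempotency_key in self._orders_by_key:
            return self._orders_by_key[idempotency_key]
        order = {"order_id": len(self._orders_by_key) + 1, "item": item}
        self._orders_by_key[idempotency_key] = order
        return order

    def order_count(self):
        return len(self._orders_by_key)


service = OrderService()
key = str(uuid.uuid4())  # the client generates one key per logical order
first = service.create_order(key, "coffee")
retry = service.create_order(key, "coffee")  # e.g. sent again after a timeout
print(first == retry, service.order_count())
```

Because the retry reuses the key, the customer ends up with one order even though the request was sent twice.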