One of the most common mistakes in software architecture is designing systems that assume everything will work.

APIs will respond.

Jobs will run once.

Data will arrive in the expected format.

In reality, production systems behave very differently.

Requests time out. Jobs run twice. External systems send incomplete data.

Reliable systems are built around this reality.

They assume things will fail.

Idempotency is the foundation of reliability

Many operations in distributed systems can be triggered multiple times.

A network retry might send the same request twice. A queue might deliver a job again after a timeout. A webhook might be retried because the receiving service responded slowly.

If the operation is not idempotent, duplicate processing can corrupt data.

Good systems design operations so that repeating the same request produces the same result as performing it once.

This makes retries safe and greatly simplifies recovery.

Retries are normal behavior

Retries are often treated as an edge case.

In practice they are a core part of how distributed systems function.

Transient failures happen constantly: network latency spikes, services restart, upstream systems rate limit requests.

Systems that expect these failures and retry intelligently are dramatically more resilient.

Systems that assume success tend to fail catastrophically when the first unexpected timeout occurs.
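Retrying intelligently usually means a bounded number of attempts with exponential backoff and jitter, so a transient failure is absorbed without hammering an already struggling dependency. A sketch under those assumptions (the `TransientError` class and parameters are illustrative, and this is only safe if the wrapped operation is idempotent):

```python
import random
import time


class TransientError(Exception):
    """Raised for failures worth retrying (timeouts, 5xx, rate limits)."""


def retry(fn, attempts: int = 5, base_delay: float = 0.1):
    """Call `fn`, retrying transient failures with exponential
    backoff plus full jitter. Non-transient errors propagate."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise                       # budget exhausted: surface it
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))  # full jitter
```

The jitter matters: if many clients retry on the same schedule after an outage, synchronized retries can knock the dependency over again.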

Partial failures are inevitable

Many workflows span multiple systems.

An order might trigger inventory updates, payment processing, fulfillment requests, and customer notifications.

If one step fails while others succeed, the system enters an inconsistent state.

Designing systems to detect and recover from these partial failures is one of the most important parts of production architecture.

This is where reconciliation jobs, state machines, and retry mechanisms become essential.
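One way to make such a workflow recoverable is to persist it as an explicit state machine: each order records how far it got, so a reconciliation job can find orders stuck mid-flight and resume them. A rough sketch, with hypothetical states and step names (the side effects are elided and would need to be idempotent so a resumed order can safely re-run its next step):

```python
from enum import Enum


class OrderState(Enum):
    CREATED = "created"
    INVENTORY_RESERVED = "inventory_reserved"
    PAYMENT_CAPTURED = "payment_captured"
    FULFILLED = "fulfilled"


# Each non-terminal state maps to the step that advances it.
STEPS = {
    OrderState.CREATED: ("reserve_inventory", OrderState.INVENTORY_RESERVED),
    OrderState.INVENTORY_RESERVED: ("capture_payment", OrderState.PAYMENT_CAPTURED),
    OrderState.PAYMENT_CAPTURED: ("request_fulfillment", OrderState.FULFILLED),
}


def advance(order: dict) -> dict:
    """Run remaining steps for an order. Called by the normal flow
    and by a reconciliation job that sweeps up stuck orders."""
    while order["state"] in STEPS:
        step_name, next_state = STEPS[order["state"]]
        # ... perform the (idempotent) side effect for step_name ...
        order["state"] = next_state   # persist after each step in a real system
    return order
```

Because the current state is durable, a crash between steps leaves the order in a known state rather than an unknown one, and recovery is just "call `advance` again".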

Observability matters more than perfection

No system avoids every failure.

The goal is not to eliminate problems. The goal is to detect and recover from them quickly.

Good systems make it easy to answer questions like:

  • What failed?
  • When did it fail?
  • Has it been retried?
  • Is the system in a recoverable state?

Without this visibility, small issues become large operational problems.


Reliable software isn’t built by assuming everything will work.

It’s built by assuming that things will fail and designing systems that can recover when they do.