Designing APIs That Survive Real Production Traffic

Hook

Most APIs look fine in documentation and still create production pain. The problem is usually not missing endpoints. It is that the contract becomes ambiguous the moment users retry, multiple actors edit the same entity, or one dependency slows down enough to distort the whole request path.

The Real Problem

An API is where business intent meets infrastructure reality. It has to translate a user action into a durable system action while telling clients what actually happened and what they should do next. That means it must survive duplicate requests, conflicting writes, long-running work, partial downstream failure, and clients that will inevitably assume a simpler backend model than the one that actually exists.

Weak API design usually starts with convenience. Teams expose generic create and update surfaces because they are faster to ship. Over time, those surfaces absorb more business rules, more partner-specific conditions, and more implicit frontend behavior. The contract becomes wide, hard to reason about, and expensive to change. Clients then fill the gaps with local logic, which is how the API slowly stops being the source of truth.

What Breaks in Practice

A mobile or frontend client times out, retries the same write, and creates a duplicate business action.
Two actors edit the same record and the later write silently overwrites the earlier business decision.
A single PATCH endpoint accepts combinations of fields that were never meant to coexist.
Clients cannot tell whether an error means "retry later," "fix your data," or "this state no longer allows that action."
One unstable integration leaks response quirks into the public contract and every client starts coding around them.
Version drift during deployment means some clients temporarily speak a contract the backend only partially supports.

Key Decisions

1. Shape writes around business actions

I prefer APIs that make intent explicit. Actions such as approve, cancel, submit, and reopen are more valuable than a broad status update endpoint because they tell the client what the operation means and give the backend a clear place to enforce legal transitions.
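
To make "a clear place to enforce legal transitions" concrete, here is a minimal sketch of an action table. The statuses, the TRANSITIONS table, and the apply_action helper are all hypothetical names chosen for illustration; a real domain would have its own states and rules.

```python
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    SUBMITTED = "submitted"
    APPROVED = "approved"
    CANCELLED = "cancelled"


class IllegalTransition(Exception):
    """Raised when an action is not allowed from the current status."""


# Hypothetical transition table: action -> {allowed source status: resulting status}.
TRANSITIONS = {
    "submit": {Status.DRAFT: Status.SUBMITTED},
    "approve": {Status.SUBMITTED: Status.APPROVED},
    "cancel": {Status.DRAFT: Status.CANCELLED, Status.SUBMITTED: Status.CANCELLED},
    "reopen": {Status.CANCELLED: Status.DRAFT},
}


def apply_action(current: Status, action: str) -> Status:
    """Return the next status, or raise if the action is not legal right now."""
    allowed = TRANSITIONS.get(action)
    if allowed is None:
        raise ValueError(f"unknown action: {action}")
    if current not in allowed:
        raise IllegalTransition(f"cannot {action} from {current.value}")
    return allowed[current]
```

Each action endpoint would call apply_action with its own action name, so an illegal request fails in one well-defined place instead of inside a generic field update.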

2. Make retries safe on important writes

If an action matters enough to trigger financial, operational, or user-visible side effects, it deserves idempotency. I want clients to be able to retry uncertain requests without gambling on whether the backend already acted.
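
One way this could look is sketched below, assuming the client sends a unique key with each logical write. IdempotentExecutor and IdempotencyKeyReuse are hypothetical names, and the in-memory dictionary stands in for a durable store. A production version would also need an atomic insert to handle concurrent retries and an expiry policy for old keys, neither of which is shown here.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


class IdempotencyKeyReuse(Exception):
    """The same key was sent with a different payload."""


class IdempotentExecutor:
    """In-memory sketch: remembers the result of each key's first execution."""

    def __init__(self) -> None:
        # key -> (payload fingerprint, stored result)
        self._records: Dict[str, Tuple[str, Any]] = {}

    @staticmethod
    def _fingerprint(payload: dict) -> str:
        canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()

    def execute(self, key: str, payload: dict, handler: Callable[[dict], Any]) -> Any:
        fingerprint = self._fingerprint(payload)
        record = self._records.get(key)
        if record is not None:
            stored_fingerprint, stored_result = record
            if stored_fingerprint != fingerprint:
                raise IdempotencyKeyReuse(f"key {key!r} reused with a different payload")
            # Retry of a request we already handled: replay the original result.
            return stored_result
        result = handler(payload)
        self._records[key] = (fingerprint, result)
        return result
```

Fingerprinting the payload is a deliberate choice in this sketch: a key that comes back with different data is treated as a client bug rather than silently replayed.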

3. Distinguish conflict from validation from dependency failure

Clients make better decisions when the API tells the truth about why a request failed. Validation errors should guide correction. State conflicts should guide refresh and re-evaluation. Temporary dependency failures should guide retry or pending behavior.
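
The three failure classes above could map to problem-detail style responses roughly as follows. This is a sketch under assumptions: the exception names, the "category" extension field, the example.com type URIs, and the specific status code choices (422, 409, 503) are illustrative, not a prescribed mapping.

```python
class ValidationFailed(Exception):
    """The request data is wrong; the client should correct it."""


class StateConflict(Exception):
    """The resource state no longer allows this action; the client should refresh."""


class DependencyUnavailable(Exception):
    """A downstream dependency failed temporarily; the client may retry."""


def to_problem(error: Exception) -> dict:
    """Translate an internal error into a problem-detail style body."""
    if isinstance(error, ValidationFailed):
        status, category, title = 422, "fix_request", "Request data is invalid"
        slug = "validation"
    elif isinstance(error, StateConflict):
        status, category, title = 409, "refresh_state", "Action not allowed in current state"
        slug = "conflict"
    elif isinstance(error, DependencyUnavailable):
        status, category, title = 503, "retry_later", "A dependency is temporarily unavailable"
        slug = "dependency"
    else:
        # Unclassified errors stay generic so internals do not leak to clients.
        return {
            "type": "https://example.com/problems/internal",
            "title": "Unexpected error",
            "status": 500,
            "category": "unknown",
        }
    return {
        "type": f"https://example.com/problems/{slug}",  # hypothetical URI
        "title": title,
        "status": status,
        "category": category,  # stable, machine-readable hint for clients
        "detail": str(error),
    }
```

Clients can then branch on the stable category rather than parsing message text, which keeps the wording free to change.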

4. Separate request acceptance from asynchronous completion

Not all work belongs in one synchronous response. For expensive or unstable downstream paths, I would rather expose an operation that the client can track than force the API to pretend it completed something it only queued.
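
A trackable operation might look like the sketch below. The Operation shape, the OperationStore class, the /operations/{id} path, and the two handler functions are hypothetical; the only fixed idea is that acceptance returns 202 with something the client can poll, and completion is recorded separately by whatever worker does the real work.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Operation:
    id: str
    status: str = "pending"  # pending -> succeeded | failed
    result: Optional[dict] = None
    error: Optional[str] = None


class OperationStore:
    """In-memory stand-in for a durable operations table."""

    def __init__(self) -> None:
        self._ops: Dict[str, Operation] = {}

    def create(self) -> Operation:
        op = Operation(id=str(uuid.uuid4()))
        self._ops[op.id] = op
        return op

    def get(self, op_id: str) -> Operation:
        return self._ops[op_id]

    def succeed(self, op_id: str, result: dict) -> None:
        op = self._ops[op_id]
        op.status, op.result = "succeeded", result

    def fail(self, op_id: str, error: str) -> None:
        op = self._ops[op_id]
        op.status, op.error = "failed", error


def accept_request(store: OperationStore) -> Tuple[int, dict]:
    """Acknowledge the request without claiming the work is done."""
    op = store.create()
    body = {"operation_id": op.id, "status": op.status,
            "status_url": f"/operations/{op.id}"}
    return 202, body


def get_operation(store: OperationStore, op_id: str) -> Tuple[int, dict]:
    """Let the client ask what actually happened."""
    op = store.get(op_id)
    return 200, {"operation_id": op.id, "status": op.status,
                 "result": op.result, "error": op.error}
```

A background worker would call succeed or fail once the downstream path finishes, and the client polls status_url until the status leaves pending.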

5. Keep partner instability behind internal boundaries

External APIs and callbacks should not dictate the shape of your public contract. Adapter layers are worth it because they protect the rest of the product from one partner's inconsistency.
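
As an illustration of such an adapter, the sketch below normalizes an inconsistent partner payload into one internal shape. The partner field names (txn_id, state, amt), the status vocabulary, and the PaymentResult type are all invented for this example and do not describe any real partner API.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PaymentResult:
    """Internal shape that the rest of the product depends on."""
    reference: str
    status: str        # "succeeded" | "failed" | "pending"
    amount_minor: int  # amount in minor units, e.g. cents


# Hypothetical partner vocabulary mapped onto our stable statuses.
_PARTNER_STATUS = {
    "OK": "succeeded",
    "SUCCESS": "succeeded",
    "ERR": "failed",
    "DECLINED": "failed",
    "PROCESSING": "pending",
}


def from_partner_payload(raw: dict) -> PaymentResult:
    """Absorb partner quirks here so they never reach the public contract."""
    raw_status = str(raw.get("state", "")).strip().upper()
    # Assumption: an unrecognized state is treated as pending, not as success.
    status = _PARTNER_STATUS.get(raw_status, "pending")
    # The partner may send the amount as a string or a number; go through Decimal.
    amount_minor = int(Decimal(str(raw["amt"])) * 100)
    return PaymentResult(reference=str(raw["txn_id"]), status=status,
                         amount_minor=amount_minor)
```

When the partner adds a new state or changes casing, only the mapping table changes; the public contract and every client stay untouched.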

Tradeoffs

Action-oriented APIs are clearer but create more explicit endpoints and policy logic.
Idempotency improves safety but adds storage, deduplication, and lifecycle concerns.
Better error semantics improve UX while forcing more careful backend classification.
Operation-based writes make long-running flows safer, though they require more product state and support tooling.
Internal adapters keep contracts clean but move more complexity into service orchestration.

Production Patterns

Idempotency keys on business-significant writes.
Version checks or optimistic concurrency on shared resources.
Explicit operation resources for long-running actions.
Problem-detail style error responses with stable categories.
Contract tests that cover client expectations, not only server logic.
Expand-contract compatibility during API evolution and rollout.
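
Of the patterns above, optimistic concurrency is the one that most directly prevents the silent overwrite described earlier. A minimal sketch, assuming each record carries an integer version that the client echoes back on update; RecordStore and VersionConflict are hypothetical names, and the dictionary stands in for a database row with a conditional update.

```python
from typing import Dict, Tuple


class VersionConflict(Exception):
    """The record changed since the client last read it."""

    def __init__(self, expected: int, actual: int) -> None:
        super().__init__(f"expected version {expected}, found {actual}")
        self.expected = expected
        self.actual = actual


class RecordStore:
    """In-memory sketch: record id -> (version, data)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[int, dict]] = {}

    def create(self, record_id: str, data: dict) -> int:
        self._rows[record_id] = (1, dict(data))
        return 1

    def read(self, record_id: str) -> Tuple[int, dict]:
        version, data = self._rows[record_id]
        return version, dict(data)

    def update(self, record_id: str, expected_version: int, changes: dict) -> int:
        version, data = self._rows[record_id]
        if version != expected_version:
            # Someone else wrote first; refuse instead of overwriting their decision.
            raise VersionConflict(expected_version, version)
        new_version = version + 1
        self._rows[record_id] = (new_version, {**data, **changes})
        return new_version
```

Over HTTP the version would typically travel as an ETag with an If-Match precondition, and a VersionConflict would surface as a conflict-class error that tells the client to refresh and re-evaluate.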

What I'd Improve

I would involve frontend and support earlier in API review. APIs become much stronger when engineering validates not only whether the contract is implementable, but whether it is explainable under conflict, retry, and degraded system behavior.