
Iheb Chatti

Full-stack product engineering, scalable APIs, async workflows, and cloud delivery

System design

How real systems execute, fail, and recover in production.

An NDA-safe view of real production systems across frontend, API, workers, asynchronous jobs, and delivery infrastructure. The goal is not to show neat diagrams. It is to show where state is validated, where side effects leave the request path, and how the system stays reliable when dependencies slow down or fail.

How to read this page

Start with the core system patterns. They show where state becomes durable, where work leaves the request path, and how the product stays understandable once several components participate in the same business action.

Core

Booking workflow and partner boundary design

How a user action becomes durable booking state before partner synchronization and recovery continue outside the request path.

Core

Medical workflow orchestration

Where workflow state is committed before document, email, and reporting side effects continue in the background.

Core

Retry and idempotency strategy

The guardrail that lets clients, workers, and integrations retry without turning uncertainty into duplicate business actions.

The rest are supporting patterns: delivery, caching, notifications, consistency, and operational recovery around those core execution paths.

Core system patterns

The highest-signal diagrams here. These are the patterns I reach for when system behavior, API boundaries, async execution, and recovery have to stay clear in production.

Booking workflow and partner boundary design

Context

Used when product-critical flows depend on partner APIs that are slower or less reliable than the user-facing booking experience. It solves the problem of keeping frontend progress and backend truth aligned even when external systems respond late or inconsistently.

Key idea

Treat booking state as the source of truth, and push partner variability behind explicit async recovery boundaries.

Tradeoff

This makes the backend stricter in ways product teams feel. Booking can no longer be treated as one clean success state, and this is where teams get stuck: API validation, queue retries, partner callbacks, and recovery tooling can all disagree for a while.

Failure reality

In production, a partner API times out but still creates the reservation on its side. The retry creates a second reservation, then the original callback arrives late and overwrites the local status again. The user sees one spinner and one success screen, support sees conflicting states across screens, and engineers check logs that do not line up cleanly with partner timestamps. Without a clear local source of truth, teams end up guessing whether to retry, compensate, or tell the user to wait.

Production note

This pattern is common in marketplace and integration-heavy systems where conversion depends on not exposing partner instability directly to the user journey.

flowchart LR
    A[User Booking Flow] --> B[Frontend Client]
    B --> C[Booking API]
    C --> D[Validation & Pricing Rules]
    D --> E[(Booking Data)]
    C -. retry jobs .-> F[Background Workers]
    F --> G[Partner API Connectors]
    G --> H[Partner Systems]
    F --> I[Status Sync & Recovery]

Legend

Solid arrows: primary booking path

Dashed arrows: partner sync and retries

Rounded nodes: user-facing steps

Booking API

Owns the booking contract. Without it, frontend code and partner adapters each invent their own rules for what 'confirmed', 'pending', or 'failed' means.

Background workers

Keep partner latency and retry behavior out of the user request. Without them, the booking path either blocks on third parties or exposes transient partner failure directly to the customer.

Status sync

Reconciles local and partner state after retries, callbacks, or delayed confirmations. Without it, duplicate charges, stale booking status, and support ambiguity become normal.
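The local-truth-plus-reconciliation idea above can be sketched in a few lines. This is a minimal in-memory illustration, not production code: `Booking`, `BookingStore`, and the `reconcile` method are hypothetical names, and a real system would back the store with a database and run reconciliation from a worker.

```python
import uuid
from dataclasses import dataclass
from typing import Optional

@dataclass
class Booking:
    id: str
    status: str = "pending_partner"   # durable local state, committed first
    partner_ref: Optional[str] = None

class BookingStore:
    def __init__(self):
        self._bookings = {}

    def create(self):
        # Request path: commit durable local state before any partner call.
        booking = Booking(id=str(uuid.uuid4()))
        self._bookings[booking.id] = booking
        return booking

    def reconcile(self, booking_id, partner_ref):
        # Recovery path: a sync job or late callback updates local truth.
        booking = self._bookings[booking_id]
        if booking.partner_ref is None:
            booking.partner_ref = partner_ref
            booking.status = "confirmed"
        # A duplicate or late callback with a different ref is ignored:
        # local state has already decided which reservation is real.
        return booking

store = BookingStore()
b = store.create()
store.reconcile(b.id, "partner-123")
store.reconcile(b.id, "partner-456")  # late duplicate: no overwrite
```

The key property is that the second callback cannot flip the status back, which is exactly the overwrite described in the failure reality above.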

Medical workflow orchestration

Context

Used in workflow-heavy admin systems where operators trigger actions that depend on document generation, email handling, or reporting work that may complete later. It solves the gap between a fast admin action and a slower backend workflow that still needs to remain explainable.

Key idea

Commit the workflow transition first, then move document and notification side effects into observable background execution.

Tradeoff

This introduces more than queueing overhead. The product now has intermediate states that admin users will see before the system agrees with itself, and debugging slows down because controller, queue, worker, and status screens stop lining up cleanly under failure.

Failure reality

This is where systems break. A report job gets queued, the worker crashes during PDF generation, and the admin still sees the visit as processed because the request path already returned cleanly. Minutes later the PDF job retries, but the operator has already refreshed into a stale status view and assumes it is done. Or the email send times out after the state change was saved, then a retry sends the message twice. The database says processed, the document store is empty, and support cannot tell whether replay will fix the issue or duplicate more work.

Production note

This pattern is common in compliance-sensitive systems where admin users need reliable status visibility without waiting on PDFs, email, or reporting pipelines.

flowchart LR
    A[Admin UI] --> B[Symfony Controllers]
    B --> C[Workflow Services]
    C --> D[(Scheduling & Reporting Data)]
    C -. queue .-> E[Async Worker]
    E --> F[PDF Processing]
    E --> G[Email Notifications]
    E --> H[Audit Trail]
    C --> I[Operational Status Views]

Legend

Solid arrows: synchronous requests

Dashed arrows: async jobs

Panels: admin-facing system surfaces

Workflow services

Own the transition rules. Without this boundary, scheduling and reporting logic leaks into controllers and screens, and the same action starts behaving differently depending on where it was triggered.

Async worker

Takes document and notification work off the request path. Without it, admin actions block on slow dependencies or fail for reasons unrelated to the state change the user just made.

Operational status views

Explain what already happened and what is still stuck. Without them, incidents become log archaeology and operators cannot safely decide whether a workflow should be replayed or left alone.
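The commit-first ordering can be shown with a toy version of the flow. Everything here is an illustrative stand-in: `state_db` for the real database, `job_queue` for the message transport, and the job names are hypothetical.

```python
from collections import deque

state_db = {}          # committed workflow state (the source of truth)
job_queue = deque()    # stands in for a message broker

def process_visit(visit_id):
    # 1. Commit the state transition synchronously -- this is the truth.
    state_db[visit_id] = "processed"
    # 2. Document and email side effects leave the request path as jobs.
    job_queue.append(("generate_pdf", visit_id))
    job_queue.append(("send_email", visit_id))
    return state_db[visit_id]

def run_worker():
    # Background worker drains the queue; a crash here never undoes step 1,
    # which is why status views must show jobs separately from state.
    completed = []
    while job_queue:
        job, visit_id = job_queue.popleft()
        completed.append(job)
    return completed

process_visit("visit-42")
done = run_worker()
```

Note the gap this makes explicit: between `process_visit` returning and `run_worker` finishing, the database says processed while the PDF does not exist yet, which is the intermediate state admin users will see.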

Retry and idempotency strategy

Context

Used anywhere the same operation may be replayed by a browser refresh, mobile timeout, queue retry, or partner callback. It solves the problem of distinguishing a legitimate retry from a second business action.

Key idea

Make retries cheap by treating deduplication and execution history as part of the system contract, not as an afterthought.

Tradeoff

The cost is real: more persistent state, slower debugging, and stricter semantics around when a side effect is considered committed. This is where teams get stuck when a retry fires after partial success and nobody can tell whether they are recovering work or duplicating it.

Failure reality

Timeouts are rarely clean failures. A partner API can return nothing, still create the booking, and then get called again by a retry. Or a worker can update the external system, crash before storing its execution result, and run again while the previous attempt actually succeeded. The external system says yes, the app log is missing the success entry, and support is left with the worst question in these incidents: did we do it once or twice?

Production note

This is a standard reliability pattern in payment, booking, and integration-heavy systems where at-least-once delivery is normal and duplicates are expensive.

flowchart LR
    A[Client Request] --> B[API Gateway]
    B --> C[Idempotency Key Check]
    C --> D[(Idempotency Store)]
    C --> E[Domain Service]
    E -. retry-safe job .-> F[Queue]
    F --> G[Worker]
    G --> H[External API]
    G --> I[(Execution Log)]

Legend

Key store: idempotency record

Dashed arrows: retries

Worker: side-effect execution

Idempotency store

Remembers whether this business action was already accepted. Without it, every retry risks becoming a second side effect instead of a safe replay.

Retry-safe queue

Lets background work retry without redefining the original intent. Without it, workers either fail permanently on transient issues or repeat actions that were already committed.

Execution log

Shows what each attempt did and what came back. Without it, partial failure is impossible to reason about and operators are left guessing whether replay is safe.
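A minimal idempotency-key sketch makes the contract concrete. The store is an in-memory dict here and the names are illustrative; in production it would be a durable table with the same semantics, written in the same transaction as the side effect where possible.

```python
idempotency_store = {}
charges = []  # stands in for the real side effect

def create_charge(idempotency_key, amount):
    if idempotency_key in idempotency_store:
        # Legitimate retry: return the recorded outcome, no new side effect.
        return idempotency_store[idempotency_key]
    charges.append(amount)                       # the side effect runs once
    result = {"charge_id": len(charges), "amount": amount}
    idempotency_store[idempotency_key] = result  # record before acknowledging
    return result

first = create_charge("key-abc", 100)
retry = create_charge("key-abc", 100)  # browser refresh or queue retry
```

The retry returns the same recorded result instead of creating a second charge, which is the difference between a safe replay and a duplicate business action.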

Async job processing architecture

Context

Used when requests trigger work that depends on external services, document processing, or long-running computation. It solves the tension between fast user-facing responses and the need to keep downstream execution durable and observable.

Key idea

Accept and persist the business action synchronously, then let workers execute the expensive or failure-prone part with traceable status.

Tradeoff

This looks fine until the product has to explain it. 'Accepted', 'processing', 'failed', and 'completed' become separate user-visible states, and debugging gets slower because one action now spans API, queue, worker, and follow-up reads.

Failure reality

This is where systems break when teams stop at the 202 response. Jobs back up, workers crash after writing half the result, downstream APIs slow down just enough to delay completion without clearly failing, and users click again because the UI cannot tell queued work from finished work. Support sees 'request succeeded' in one place and 'job still failing' in another, while logs only show a partial trail. Without explicit state transitions, the system feels random even when each component is doing exactly what it was coded to do.

Production note

This pattern appears in most systems with external integrations, background documents, or any path where completion matters more than the first response time.

flowchart LR
    A[API Request or Event] --> B[Domain Service]
    B --> C[(Primary Database)]
    B -. enqueue .-> D[Job Queue]
    D --> E[Worker Pool]
    E --> F[External Services]
    E --> G[(Audit Log)]
    E --> H[Notification Service]

Legend

Input: API or workflow trigger

Queue: deferred work

Workers: processing units

Audit trail: observability

Domain service

Defines when the action becomes durable. Without this boundary, workers and APIs disagree about whether the system actually accepted the request.

Job queue

Buffers slow or unstable work. Without it, one external timeout or CPU-heavy step leaks directly into request latency and user-visible failures.

Audit log

Keeps the execution path explainable after the request returns. Without it, teams know a job exists but cannot explain what it already touched or what is safe to retry.
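The accept-then-execute path can be sketched as explicit state transitions. This is a hypothetical in-memory version: `jobs` stands in for a job table, `queue` for the broker, and the 202 return value marks the point where many teams wrongly stop reasoning.

```python
from collections import deque

jobs = {}
queue = deque()

def submit(job_id):
    jobs[job_id] = "accepted"   # made durable before the response goes out
    queue.append(job_id)
    return 202                  # the response is a milestone, not completion

def worker_step():
    job_id = queue.popleft()
    jobs[job_id] = "processing"
    # ... external call or document generation would happen here ...
    jobs[job_id] = "completed"  # a crash before this line leaves 'processing'

code = submit("job-1")
worker_step()
```

Because every transition is written somewhere readable, the UI can distinguish queued work from finished work instead of guessing from the original response.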

Data consistency model

Context

Used when a system needs one authoritative write model but several downstream consumers, read views, or integrations. It solves the problem of making side effects and read models catch up safely after the source-of-truth transaction commits.

Key idea

Keep the primary write atomic, then publish outward from durable state instead of coupling business truth to immediate downstream success.

Tradeoff

The cost is not abstract eventual consistency. It is operational disagreement: the source of truth says one thing, the dashboard says another, and support has to pick which screen to trust before telling a user to retry or wait.

Failure reality

In production, stale reads look like bugs even when the write succeeded. The database has the new state, the dashboard still shows the old one, and the external integration has not processed the event yet. A user retries an action that already worked, support sees conflicting screens, and engineering has to answer the ugliest question in these systems: which truth is the real one right now?

Production note

This pattern is common once systems add projections, callbacks, analytics feeds, or several services that should react to one committed business change.

flowchart LR
    A[API Write] --> B[Transactional Service]
    B --> C[(Primary DB)]
    B --> D[(Outbox / Event Table)]
    D -. publish .-> E[Worker]
    E --> F[Read Model]
    E --> G[External Integrations]
    F --> H[Status Views]

Legend

Solid arrows: transactional writes

Dashed arrows: eventual propagation

Status views: reconciled read models

Transactional service

Owns the authoritative write. Without it, the system cannot say exactly when the business action became real, which is where incident timelines start to fall apart.

Outbox table

Keeps outbound work attached to the committed state change. Without it, the database can say 'saved' while the integration path silently misses the event.

Read model

Serves converged views without hitting the write path for everything. Without it, either the primary store carries too much read pressure or product screens end up stitching together half-fresh data from the wrong place.
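The outbox idea reduces to one rule: the business write and its outbound event commit together. A toy sketch, with `db` standing in for a transactional store and the single Python statement block standing in for a real database transaction:

```python
db = {"orders": {}, "outbox": []}
published = []

def place_order(order_id):
    # One atomic commit: the state change and its event can never diverge.
    db["orders"][order_id] = "placed"
    db["outbox"].append({"type": "order_placed", "order_id": order_id})

def publish_outbox():
    # Runs after commit, typically from a worker. Retrying it is safe
    # because events stay in the outbox until delivery is confirmed.
    while db["outbox"]:
        event = db["outbox"].pop(0)
        published.append(event)

place_order("o-7")
publish_outbox()
```

If the publisher crashes, the event is still in the outbox next run; if the transaction rolls back, no event was ever recorded. Either way the database can never say 'saved' while the integration path silently misses the change.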

Sequence diagram: user to API to worker to database

Context

Used to make the real execution path explicit when a single user action spans request-time validation and asynchronous follow-up work. It solves the problem of teams reasoning only about the API response while the meaningful work continues afterward.

Key idea

The response is only the first milestone; the full system path includes persistence, queuing, worker execution, and a later visible outcome.

Tradeoff

Once execution spans time, the product needs honest status language and tighter coordination across frontend, API, and worker code. Otherwise one team says 'success' while another means 'accepted but not finished.'

Failure reality

A user clicks once, gets a success response, and the real work still fails later. Maybe the API persisted the transition, the worker hit a timeout, and the refreshed screen still shows stale data, so the user clicks again before the callback from the first attempt arrives. This is where systems break if the response is treated as completion. Support then has to answer whether the first click worked, whether the second click duplicated it, and why the UI, database, and worker logs each tell a slightly different story.

Production note

This is a useful framing for any system where users see an immediate acknowledgment but actual completion depends on background work.

sequenceDiagram
    participant U as User
    participant A as API
    participant D as Database
    participant Q as Queue
    participant W as Worker
    U->>A: Submit workflow action
    A->>D: Persist state transition
    A-->>Q: Enqueue background job
    A-->>U: 202 Accepted / updated state
    Q-->>W: Deliver job
    W->>D: Store processing result
    W-->>A: Emit status update

Legend

Sequence participants: user, API, database, queue, worker

Dashed message: queued async execution

API boundary

Defines what the response means. Without it, users and client code interpret 'success' differently from what the backend actually guaranteed.

Queue

Separates accepted work from immediate execution. Without it, request time and execution time collapse into one fragile path.

Worker

Carries the long-running step to completion. Without it, slow integrations and heavy processing either block the API or disappear into ad hoc background logic.

Supporting patterns

Secondary patterns that make the main flows safer to operate and easier to ship once the core execution path is well defined.

Email notification workflow

Context

Used when product state changes need user or operator notifications, but delivery timing and provider behavior are not reliable enough for synchronous handling. It solves the problem of making notifications observable and recoverable instead of best-effort side effects.

Key idea

Convert domain events into durable notification work, then deliver and audit that work asynchronously.

Tradeoff

This makes messaging operationally heavier than it looks. Teams now have to reason about delivery state, template changes, and whether a failed send is safe to replay when provider status, app logs, and user reports do not agree.

Failure reality

This is usually where teams struggle. A worker sends the email, crashes before writing the notification log, and the retry sends the same email again. Or the provider times out, but the message was actually accepted. Without a durable log and retry boundary, support cannot tell whether the user was notified, engineering cannot tell whether replay is safe, and a template bug can quietly affect live traffic for hours.

Production note

This pattern is common in systems with external providers, compliance-sensitive messaging, or support teams that need to explain whether a message really went out.

flowchart LR
    A[Domain Event] --> B[Notification Service]
    B --> C[Template Resolver]
    B -. enqueue .-> D[Delivery Queue]
    D --> E[Worker]
    E --> F[Email Provider]
    E --> G[(Notification Log)]
    G --> H[Admin Visibility]

Legend

Solid arrows: domain events

Dashed arrows: queued delivery

Database icon: persisted history

Notification service

Translates a business event into a concrete outbound message. Without it, templating, recipient logic, and delivery policy scatter across the app and become hard to audit.

Delivery queue

Moves provider latency and retries off the request path. Without it, a slow email vendor starts breaking user actions that should have completed even if the message sends later.

Notification log

Records what the system attempted and what happened next. Without it, support cannot answer whether a user missed a message, whether a retry duplicated one, or whether delivery failed upstream.
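The log-before-acknowledge behavior can be illustrated with a small sketch. Names are hypothetical and the store is in-memory; the comment marks the crash window the failure reality above describes.

```python
notification_log = {}
provider_sends = []  # stands in for the email provider

def deliver(notification_id, recipient):
    if notification_log.get(notification_id) == "sent":
        return "skipped"  # replay after a crash or queue retry: safe no-op
    provider_sends.append(recipient)
    # Crash window: a failure between the send above and the write below is
    # exactly the duplicate-email scenario; closing it fully needs
    # provider-side dedup or an idempotent send API.
    notification_log[notification_id] = "sent"
    return "sent"

first = deliver("n-1", "user@example.com")
second = deliver("n-1", "user@example.com")  # retry of the same notification
```

With the log in place, support can answer whether the user was notified, and a replay of an already-sent notification becomes a recorded no-op instead of a second email.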

Caching layers and read optimization

Context

Used once read-heavy endpoints or dashboards begin to pressure the primary system path. It solves the problem of reducing repeated read load without letting cached data become an invisible source of business inconsistency.

Key idea

Cache the reads that can tolerate lag, but keep business-critical writes and freshness ownership tied to the source of truth.

Tradeoff

Every cache creates a second truth with a different freshness window. This is where debugging slows down because engineers have to figure out whether the wrong screen came from stale cache, stale client state, or a real write-path bug.

Failure reality

Caches fail quietly. A cache serves stale eligibility, the user performs an action that should have been blocked, and now support has to explain why the UI allowed something the backend later rejected. Or one screen invalidates correctly while another keeps serving yesterday's state. This is usually where teams struggle because nothing is obviously down.

Production note

This matters most in systems with dashboards, frequently queried views, or partner lookups where performance gains are real but stale eligibility can cause real user-facing mistakes.

flowchart TD
    A[Users] --> B[Web App]
    B --> C[API Layer]
    C --> D[Cache]
    C --> E[(Primary DB)]
    E -. invalidate .-> D
    D --> F[Frequently Read Views]

Legend

Primary DB: source of truth

Cache: read optimization

Dashed arrows: cache invalidation

API layer

Decides which reads may be stale and which cannot. Without it, caching rules drift into clients and nobody can explain freshness guarantees.

Cache

Reduces repeated load on stable reads. Without it, primary stores absorb unnecessary traffic and performance degrades on endpoints that should be cheap.

Invalidation path

Clears or refreshes cached state when authoritative data changes. Without it, stale values linger long enough to cause incorrect user actions and difficult-to-reproduce bugs.
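The cache-aside-plus-invalidation shape can be sketched directly. This is an illustrative in-memory version: `db` stands in for the primary store, `cache` for Redis or similar, and `db_reads` just counts load on the source of truth.

```python
db = {"user:1": {"eligible": True}}
cache = {}
db_reads = 0

def read(key):
    global db_reads
    if key in cache:
        return cache[key]       # cheap read, possibly stale
    db_reads += 1
    cache[key] = db[key]        # populate on miss from the source of truth
    return cache[key]

def write(key, value):
    db[key] = value
    cache.pop(key, None)        # invalidate so the next read sees fresh state

read("user:1")
read("user:1")                       # served from cache, no second DB read
write("user:1", {"eligible": False})
fresh = read("user:1")               # miss again, fresh from the source
```

Dropping the `cache.pop` line reproduces the stale-eligibility failure above: the write succeeds, but reads keep serving yesterday's state until the entry happens to expire.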

Cloud delivery topology

Context

Used to explain the runtime shape of a full-stack system once web traffic, API work, background jobs, and integrations all matter operationally. It solves the problem of showing where each execution path actually lives after code leaves the repo.

Key idea

Design the runtime so synchronous product flows, async work, persistence, and operational feedback reinforce each other instead of competing for ownership.

Tradeoff

The tradeoff is operational spread. Incidents no longer live in one runtime, and this is where debugging slows down: web looks healthy, workers are degraded, the database is fine, and the external dependency is timing out just enough to keep the whole system ambiguous.

Failure reality

The app can look healthy while the system is failing somewhere else. Web requests are green, workers are stuck retrying, queue lag is climbing, and an external integration is slow enough to hurt completion but not slow enough to trip every alert. Support sees that 'the site works' while users report actions stuck for hours. Without a topology like this, teams debug one runtime at a time and miss the path that actually broke.

Production note

This is the baseline shape of many backend-heavy product systems once background processing and external integrations become part of normal operation.

flowchart TD
    A[Users] --> B[CDN / Edge]
    B --> C[Web Application]
    C --> D[API Services]
    D --> E[(PostgreSQL / MySQL)]
    D -. publish jobs .-> F[Worker Services]
    F --> G[Integrations]
    C --> H[Monitoring]
    F --> H

Legend

Users enter through the web edge

App services handle product logic

Workers process asynchronous jobs

Monitoring supports operational feedback

Web application

Owns the user-facing contract. Without it, backend timing and infrastructure quirks leak straight into the UI, and different entry points start telling different stories about the same action.

Worker services

Handle the work that continues after the request returns. Without them, slow integrations either block the web tier or disappear into background behavior nobody can observe properly.

Monitoring

Connects the health of web, workers, queues, and dependencies. Without it, teams see green service checks while the actual completion path is already broken and support only has user complaints to go on.

CI/CD pipeline visualization

Context

Used when APIs, workers, and background consumers need to move safely together through production. It solves the problem of treating deployment as an operational change with compatibility windows, not just an artifact upload.

Key idea

A safe release includes verification before deploy and runtime confirmation after deploy, especially when more than one component participates in execution.

Tradeoff

Safer releases are slower and more constrained. Teams have to think about mixed-version windows, rollback limits, and whether queued work produced by new code can still be consumed by old workers while support is already seeing strange half-broken behavior.

Failure reality

Most release failures are mixed-version failures. New code publishes a message old workers cannot read, or the web tier expects a field the API has not fully rolled out yet. This looks fine in CI and breaks only after deploy. The API returns success, the worker fails later, and the UI never reflects the failure cleanly. Without runtime feedback, engineering finds out from users, and support cannot tell whether the issue is a bad release, stale worker, or half-completed migration.

Production note

This becomes especially important in systems with workers, schema changes, or queued messages where old and new code coexist during rollout.

flowchart LR
    A[Pull Request] --> B[Lint and Tests]
    B --> C[Build]
    C --> D[Deploy]
    D --> E[Application Services]
    D --> F[Workers]
    E -. telemetry .-> G[Monitoring]
    F -. telemetry .-> G
    G -. release feedback .-> H[Team]

Legend

Solid arrows: release path

Dashed arrows: feedback loop

Monitoring: post-deploy validation

Verification stage

Catches obvious regressions before rollout. Without it, production becomes the first place API, build, and worker assumptions are tested together.

Deploy stage

Controls how new code reaches web and worker runtimes. Without it, incompatible versions overlap in ways nobody modeled.

Monitoring loop

Checks whether the release actually behaves under live traffic and queued work. Without it, a green deployment hides a broken completion path.
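One concrete mixed-version defense is a consumer that checks a schema version instead of assuming its own. A hedged sketch, with the `schema_version` field and parking behavior as illustrative conventions rather than any specific broker's feature:

```python
parked = []

def handle(message):
    # Old-worker logic during a rollout: it understands schema version 1.
    version = message.get("schema_version", 1)
    if version > 1:
        # A new-code message reached an old worker. Park it for the new
        # workers (or a dead-letter queue) instead of guessing or crashing.
        parked.append(message)
        return "parked"
    return "processed:" + message["booking_id"]

old = handle({"schema_version": 1, "booking_id": "b-1"})
new = handle({"schema_version": 2, "booking_id": "b-2", "extra": "field"})
```

This is the compatibility-window thinking in code form: during rollout, old and new workers coexist, and the safe failure mode is to defer, not to misread.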

Background job lifecycle

Context

Used once background jobs are important enough that teams need to answer whether a failed job should retry, stop, or be replayed manually. It solves the problem of making async recovery predictable instead of improvising during incidents.

Key idea

Treat job states as an explicit lifecycle with bounded retries and a deliberate operator handoff for non-recoverable failures.

Tradeoff

This adds operator-facing state and recovery tooling that teams now have to maintain. It also makes incidents more procedural, because someone has to decide which failures should retry automatically and which ones should stop before causing more damage.

Failure reality

Jobs do not fail in one neat way. One job times out after creating a remote record but before saving success locally. Another keeps retrying because the dependency is down. A third should never retry because replay would send a second email or create a second booking. Operators check the logs, but one attempt is missing its completion entry and the callback has not arrived yet. Without an explicit lifecycle, replay becomes guesswork and incidents drag on far longer than they should.

Production note

This pattern is common in systems where retries are helpful but unsafe to run forever, especially around external APIs or expensive document workflows.

stateDiagram-v2
    [*] --> Queued
    Queued --> Running
    Running --> Succeeded
    Running --> Failed
    Failed --> Queued: retry
    Failed --> DeadLetter: max retries exceeded
    DeadLetter --> Queued: manual replay
    Succeeded --> [*]

Legend

State nodes: job lifecycle

Dashed arrows: retry transitions

Dead-letter: manual recovery path

Queued

Holds accepted work until capacity is available. Without this state, the system cannot absorb spikes or reason about backlog separately from failure.

Failed and retried

Contains transient failure in a bounded loop. Without it, every blip becomes a manual incident or every failure retries forever with no business awareness.

Dead-letter recovery

Stops unsafe work and forces human review. Without it, the system keeps replaying broken jobs or silently drops the ones that mattered most.
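The lifecycle above can be condensed into a small simulation of the transitions. This mirrors the state diagram, not a real queue implementation; `attempts_that_fail` is an illustrative knob for how many attempts hit a transient error.

```python
MAX_RETRIES = 3

def run_job(attempts_that_fail):
    state, retries = "queued", 0
    while True:
        state = "running"
        if retries < attempts_that_fail:
            state = "failed"
            if retries + 1 >= MAX_RETRIES:
                return "dead_letter"   # stop: manual replay decision needed
            retries += 1
            state = "queued"           # bounded automatic retry
        else:
            return "succeeded"

flaky = run_job(attempts_that_fail=1)    # fails once, then succeeds
broken = run_job(attempts_that_fail=10)  # exhausts retries, stops safely
```

The bound is the point: a transient blip recovers without a human, while a persistent failure stops in dead-letter where an operator decides whether replay is safe, instead of retrying a duplicate-creating job forever.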