
Iheb Chatti

Full-stack product engineering, scalable APIs, asynchronous workflows, and cloud delivery

System design

How real systems execute, fail, and recover in production.

An NDA-safe view of production systems covering frontend, APIs, workers, async jobs, and delivery. The goal is not to show "clean" diagrams, but where state is validated, where side effects leave the system, and how the system stays reliable when a dependency slows down or fails.

How to read this page

Start with the core system patterns. They show where state becomes durable, where work moves out of the request cycle, and how the product stays legible when several components execute the same business action.

Core

Booking workflow and partner boundary

Shows how booking requests move through validation, partner integration, and asynchronous recovery paths.

Core

Medical workflow orchestration

A system view of synchronous actions, business services, asynchronous document processing, and operational visibility.

Core

Retry and idempotency strategy

A guardrail pattern for making retries safe across APIs, queues, and side effects toward partner services.

The rest serve as supporting patterns: delivery, caching, notifications, consistency, and operational recovery.

Core system patterns

The diagrams most representative of how I think about execution, API boundaries, async processing, and recovery in production.

Booking workflow and partner boundary

Context

Used when product-critical flows depend on partner APIs that are slower or less reliable than the user-facing booking experience. It solves the problem of keeping frontend progress and backend truth aligned even when external systems respond late or inconsistently.

Key idea

Treat booking state as the source of truth, and push partner variability behind explicit async recovery boundaries.

Tradeoff

This makes the backend stricter in ways product teams feel. Booking can no longer be treated as one clean success state, and this is where teams get stuck: API validation, queue retries, partner callbacks, and recovery tooling can all disagree for a while.

Production reality

In production, a partner API times out but still creates the reservation on its side. The retry creates a second reservation, then the original callback arrives late and overwrites the local status again. The user saw one spinner and one success screen, support sees conflicting states across screens, and engineers check logs that do not line up cleanly with partner timestamps. Without a clear local source of truth, teams end up guessing whether to retry, compensate, or tell the user to wait.

Production signal

This pattern is common in marketplace and integration-heavy systems where conversion depends on not exposing partner instability directly to the user journey.

flowchart LR
    A[User Booking Flow] --> B[Frontend Client]
    B --> C[Booking API]
    C --> D[Validation & Pricing Rules]
    D --> E[(Booking Data)]
    C -. retry jobs .-> F[Background Workers]
    F --> G[Partner API Connectors]
    G --> H[Partner Systems]
    F --> I[Status Sync & Recovery]

Legend

Solid arrows: main path

Dotted arrows: partner sync and retries

Rounded nodes: user-facing steps

Booking API

Defines reliable contracts between the product UI, backend rules, and partners.

Background workers

Handle partner synchronization outside the request cycle for resilience.

Status synchronization

Keeps recovery paths legible when partners are unstable.
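The late-callback overwrite described in this pattern comes down to comparing versions before accepting partner state. A minimal sketch, assuming a hypothetical `Booking` record and `apply_partner_callback` helper (these names are illustrative, not from any real codebase):

```python
from dataclasses import dataclass

@dataclass
class Booking:
    id: str
    status: str          # "pending" | "confirmed" | "failed"
    sync_version: int    # bumped on every accepted update

def apply_partner_callback(booking: Booking, callback_version: int, new_status: str) -> bool:
    """Accept a partner callback only if it is newer than local truth."""
    if callback_version <= booking.sync_version:
        # Stale callback: local state already moved on, so drop it
        # instead of letting it overwrite the newer status.
        return False
    booking.status = new_status
    booking.sync_version = callback_version
    return True
```

Any callback carrying a version at or below the local record is rejected, which is what keeps the booking row authoritative when retries and late partner responses arrive out of order.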

Medical workflow orchestration

Context

Used in workflow-heavy admin systems where operators trigger actions that depend on document generation, email handling, or reporting work that may complete later. It solves the gap between a fast admin action and a slower backend workflow that still needs to remain explainable.

Key idea

Commit the workflow transition first, then move document and notification side effects into observable background execution.

Tradeoff

This introduces more than queueing overhead. The product now has intermediate states that admin users will see before the system agrees with itself, and debugging slows down because controller, queue, worker, and status screens stop lining up cleanly under failure.

Production reality

This is where systems break. A report job gets queued, the worker crashes during PDF generation, and the admin still sees the visit as processed because the request path already returned cleanly. Minutes later the PDF job retries, but the operator has already refreshed into a stale status view and assumes it is done. Or the email send times out after the state change was saved, then a retry sends the message twice. The database says processed, the document store is empty, and support cannot tell whether replay will fix the issue or duplicate more work.

Production signal

This pattern is common in compliance-sensitive systems where admin users need reliable status visibility without waiting on PDFs, email, or reporting pipelines.

flowchart LR
    A[Admin UI] --> B[Symfony Controllers]
    B --> C[Workflow Services]
    C --> D[(Scheduling & Reporting Data)]
    C -. queue .-> E[Async Worker]
    E --> F[PDF Processing]
    E --> G[Email Notifications]
    E --> H[Audit Trail]
    C --> I[Operational Status Views]

Legend

Solid arrows: synchronous requests

Dotted arrows: async jobs

Panels: operational surfaces

Workflow services

Centralize business rules for scheduling, reporting, and eligibility checks.

Async worker

Handles document parsing, notifications, and tasks that must not block the admin interface.

Operational status views

Make system states legible for administrative teams.
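The ordering this pattern relies on is commit-first, enqueue-after: the admin-visible transition is durable before any side effect is scheduled. A minimal sketch, with `db` and `queue` as stand-in objects rather than a real Symfony/Messenger API:

```python
def process_visit(db: dict, queue: list, visit_id: str) -> None:
    # 1. Durable state change first: this is the truth the admin UI reads.
    db[visit_id] = "processed"
    # 2. Side effects (PDF generation, email) become background jobs,
    #    never synchronous work on the request path.
    queue.append({"job": "generate_pdf", "visit_id": visit_id})
    queue.append({"job": "send_email", "visit_id": visit_id})
```

The point of the ordering is that a crash between the two steps leaves a committed transition with missing side effects, which is recoverable by replay, rather than side effects with no committed transition, which is not.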

Retry and idempotency strategy

Context

Used anywhere the same operation may be replayed by a browser refresh, mobile timeout, queue retry, or partner callback. It solves the problem of distinguishing a legitimate retry from a second business action.

Key idea

Make retries cheap by treating deduplication and execution history as part of the system contract, not as an afterthought.

Tradeoff

The cost is real: more persistent state, slower debugging, and stricter semantics around when a side effect is considered committed. This is where teams get stuck when a retry fires after partial success and nobody can tell whether they are recovering work or duplicating it.

Production reality

Timeouts are rarely clean failures. A partner API can return nothing, still create the booking, and then get called again by a retry. Or a worker can update the external system, crash before storing its execution result, and run again while the previous attempt actually succeeded. The external system says yes, the app log is missing the success entry, and support is left with the worst question in these incidents: did we do it once or twice?

Production signal

This is a standard reliability pattern in payment, booking, and integration-heavy systems where at-least-once delivery is normal and duplicates are expensive.

flowchart LR
    A[Client Request] --> B[API Gateway]
    B --> C[Idempotency Key Check]
    C --> D[(Idempotency Store)]
    C --> E[Domain Service]
    E -. retry-safe job .-> F[Queue]
    F --> G[Worker]
    G --> H[External API]
    G --> I[(Execution Log)]

Legend

Key store: idempotency record

Dotted arrows: retries

Worker: side-effect execution

Idempotency store

Prevents duplicate side effects when clients or workers replay an operation.

Retry-safe queue

Allows background work to be replayed without repeating state that is already committed.

Execution log

Keeps a durable record of attempts, outcomes, and retry decisions.
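The guardrail itself is small: look up the key, replay the stored result, or execute once and record. A minimal sketch, with an in-memory dict standing in for the persistent idempotency store:

```python
def execute_once(store: dict, key: str, operation) -> dict:
    """Return the recorded result if the key was seen; otherwise run and record."""
    if key in store:
        # Replay: same response as the first attempt, no second side effect.
        return store[key]
    result = operation()
    # Commit the result under the key so future retries short-circuit here.
    store[key] = result
    return result
```

In a real system the store lookup and write would need to be atomic (e.g. a unique-constraint insert) so two concurrent retries cannot both pass the check; that race is elided here.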

Asynchronous processing architecture

Context

Used when requests trigger work that depends on external services, document processing, or long-running computation. It solves the tension between fast user-facing responses and the need to keep downstream execution durable and observable.

Key idea

Accept and persist the business action synchronously, then let workers execute the expensive or failure-prone part with traceable status.

Tradeoff

This looks fine until the product has to explain it. `Accepted`, `processing`, `failed`, and `completed` become separate user-visible states, and debugging gets slower because one action now spans API, queue, worker, and follow-up reads.

Production reality

This is where systems break when teams stop at the 202 response. Jobs back up, workers crash after writing half the result, downstream APIs slow down just enough to delay completion without clearly failing, and users click again because the UI cannot tell queued work from finished work. Support sees 'request succeeded' in one place and 'job still failing' in another, while logs only show a partial trail. Without explicit state transitions, the system feels random even when each component is doing exactly what it was coded to do.

Production signal

This pattern appears in most systems with external integrations, background documents, or any path where completion matters more than the first response time.

flowchart LR
    A[API Request or Event] --> B[Domain Service]
    B --> C[(Primary Database)]
    B -. enqueue .-> D[Job Queue]
    D --> E[Worker Pool]
    E --> F[External Services]
    E --> G[(Audit Log)]
    E --> H[Notification Service]

Legend

Entry point: API or business trigger

Queue: deferred work

Workers: processing units

Audit trail: observability

Domain service

Validates and records state before delegating side effects to workers.

Job queue

Absorbs expensive or integration-dependent processing to protect latency.

Audit log

Creates a legible operational trail for support and debugging.
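The explicit state transitions this pattern calls for can be encoded as a small table of legal moves, so "accepted" can never silently become "completed" without passing through "processing". The state names here are illustrative:

```python
# Legal transitions for one background action. Anything not listed
# is a bug, not an alternate path.
VALID = {
    "accepted": {"processing"},
    "processing": {"completed", "failed"},
    "failed": {"processing"},      # retry re-enters processing
    "completed": set(),            # terminal
}

def transition(current: str, target: str) -> str:
    if target not in VALID[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Rejecting illegal moves loudly is what keeps the UI, the queue, and the worker telling the same story about one action.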

Data consistency model

Context

Used when a system needs one authoritative write model but several downstream consumers, read views, or integrations. It solves the problem of making side effects and read models catch up safely after the source-of-truth transaction commits.

Key idea

Keep the primary write atomic, then publish outward from durable state instead of coupling business truth to immediate downstream success.

Tradeoff

The cost is not abstract eventual consistency. It is operational disagreement: the source of truth says one thing, the dashboard says another, and support has to pick which screen to trust before telling a user to retry or wait.

Production reality

In production, stale reads look like bugs even when the write succeeded. The database has the new state, the dashboard still shows the old one, and the external integration has not processed the event yet. A user retries an action that already worked, support sees conflicting screens, and engineering has to answer the ugliest question in these systems: which truth is the real one right now?

Production signal

This pattern is common once systems add projections, callbacks, analytics feeds, or several services that should react to one committed business change.

flowchart LR
    A[API Write] --> B[Transactional Service]
    B --> C[(Primary DB)]
    B --> D[(Outbox / Event Table)]
    D -. publish .-> E[Worker]
    E --> F[Read Model]
    E --> G[External Integrations]
    F --> H[Status Views]

Legend

Solid arrows: transactional writes

Dotted arrows: eventual propagation

Status views: reconciled read models

Transactional service

Commits the authoritative state and records outgoing work atomically.

Outbox table

Separates durable state changes from asynchronous side-effect delivery.

Read model

Feeds user and operator views once background processing has been applied.
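The outbox idea can be sketched with SQLite: the business row and the event row commit in one transaction, and a worker publishes from the outbox afterwards. Table and column names here are hypothetical:

```python
import sqlite3

def write_with_outbox(conn: sqlite3.Connection, order_id: str, status: str) -> None:
    # One transaction: the state change and its outbox event
    # commit together or roll back together.
    with conn:
        conn.execute("UPDATE orders SET status=? WHERE id=?", (status, order_id))
        conn.execute("INSERT INTO outbox(payload) VALUES (?)",
                     (f"{order_id}:{status}",))

def drain_outbox(conn: sqlite3.Connection, publish) -> int:
    # A worker publishes from durable state, after the business commit.
    rows = conn.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(payload)
        conn.execute("DELETE FROM outbox WHERE id=?", (row_id,))
    conn.commit()
    return len(rows)
```

Note the delivery guarantee this gives is at-least-once: a crash between `publish` and the delete replays the event, which is why consumers downstream still need the idempotency guardrails from the retry pattern.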

Sequence diagram: user to API, worker, and database

Context

Used to make the real execution path explicit when a single user action spans request-time validation and asynchronous follow-up work. It solves the problem of teams reasoning only about the API response while the meaningful work continues afterward.

Key idea

The response is only the first milestone; the full system path includes persistence, queuing, worker execution, and a later visible outcome.

Tradeoff

Once execution spans time, the product needs honest status language and tighter coordination across frontend, API, and worker code. Otherwise one team says 'success' while another means 'accepted but not finished.'

Production reality

A user clicks once, gets a success response, and the real work still fails later. Maybe the API persisted the transition, the worker hit a timeout, and the refreshed screen still shows stale data, so the user clicks again before the callback from the first attempt arrives. This is where systems break if the response is treated as completion. Support then has to answer whether the first click worked, whether the second click duplicated it, and why the UI, database, and worker logs each tell a slightly different story.

Production signal

This is a useful framing for any system where users see an immediate acknowledgment but actual completion depends on background work.

sequenceDiagram
    participant U as User
    participant A as API
    participant D as Database
    participant Q as Queue
    participant W as Worker
    U->>A: Submit workflow action
    A->>D: Persist state transition
    A-->>Q: Enqueue background job
    A-->>U: 202 Accepted / updated state
    Q-->>W: Deliver job
    W->>D: Store processing result
    W-->>A: Emit status update

Legend

Participants: user, API, database, queue, worker

Dotted messages: async execution

API boundary

Confirms the state change before delegating long-running work.

Queue

Buffers the asynchronous step to keep responses fast and stable.

Worker

Executes the heavy work and persists the final processing state.
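On the client side, the consequence of "the response is only the first milestone" is polling for a terminal state rather than treating the 202 as completion. A minimal sketch, with `fetch_status` standing in for whatever status endpoint the product exposes:

```python
def wait_for_completion(fetch_status, attempts: int = 5) -> str:
    """Poll until the action reaches a terminal state or attempts run out."""
    for _ in range(attempts):
        status = fetch_status()
        if status in ("completed", "failed"):
            return status  # terminal: safe to tell the user the real outcome
    # Still not terminal: surface honest "pending" language instead of
    # guessing, and let the UI keep the action disabled.
    return "pending"
```

Real clients would add backoff between polls or switch to a push channel; the sketch only shows the contract that "accepted" and "finished" are different answers.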

Supporting patterns

Secondary patterns that reinforce reliability, operability, and delivery once the main execution paths are well defined.

Email notification workflow

Context

Used when product state changes need user or operator notifications, but delivery timing and provider behavior are not reliable enough for synchronous handling. It solves the problem of making notifications observable and recoverable instead of best-effort side effects.

Key idea

Convert domain events into durable notification work, then deliver and audit that work asynchronously.

Tradeoff

This makes messaging operationally heavier than it looks. Teams now have to reason about delivery state, template changes, and whether a failed send is safe to replay when provider status, app logs, and user reports do not agree.

Production reality

This is usually where teams struggle. A worker sends the email, crashes before writing the notification log, and the retry sends the same email again. Or the provider times out, but the message was actually accepted. Without a durable log and retry boundary, support cannot tell whether the user was notified, engineering cannot tell whether replay is safe, and a template bug can quietly affect live traffic for hours.

Production signal

This pattern is common in systems with external providers, compliance-sensitive messaging, or support teams that need to explain whether a message really went out.

flowchart LR
    A[Domain Event] --> B[Notification Service]
    B --> C[Template Resolver]
    B -. enqueue .-> D[Delivery Queue]
    D --> E[Worker]
    E --> F[Email Provider]
    E --> G[(Notification Log)]
    G --> H[Admin Visibility]

Legend

Solid arrows: business events

Dotted arrows: deferred sending

Database: persistent history

Notification service

Turns business events into messages ready for delivery.

Delivery queue

Moves email sending out of the request cycle and makes retries easier.

Notification log

Keeps a usable record for support, audit, and status visibility.
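Replay safety hinges on writing intent before the send, so a retry can ask "was this notification already delivered?" instead of guessing. A minimal sketch, with `log` as a dict standing in for the durable notification log:

```python
def deliver(log: dict, notif_id: str, send) -> str:
    state = log.get(notif_id)
    if state == "sent":
        # Replay-safe: this notification already went out, do not email twice.
        return "skipped"
    # Durable intent written before the side effect, so a crash mid-send
    # leaves a "sending" record that an operator can investigate.
    log[notif_id] = "sending"
    send()
    log[notif_id] = "sent"
    return "sent"
```

A provider timeout can still leave a message accepted but logged as "sending"; the log does not remove that ambiguity, it just makes it visible and bounded instead of silent.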

Cache layers and read optimization

Context

Used once read-heavy endpoints or dashboards begin to pressure the primary system path. It solves the problem of reducing repeated read load without letting cached data become an invisible source of business inconsistency.

Key idea

Cache the reads that can tolerate lag, but keep business-critical writes and freshness ownership tied to the source of truth.

Tradeoff

Every cache creates a second truth with a different freshness window. This is where debugging slows down because engineers have to figure out whether the wrong screen came from stale cache, stale client state, or a real write-path bug.

Production reality

Caches fail quietly. A cache serves stale eligibility, the user performs an action that should have been blocked, and now support has to explain why the UI allowed something the backend later rejected. Or one screen invalidates correctly while another keeps serving yesterday's state. This is usually where teams struggle because nothing is obviously down.

Production signal

This matters most in systems with dashboards, frequently queried views, or partner lookups where performance gains are real but stale eligibility can cause real user-facing mistakes.

flowchart TD
    A[Users] --> B[Web App]
    B --> C[API Layer]
    C --> D[Cache]
    C --> E[(Primary DB)]
    E -. invalidate .-> D
    D --> F[Frequently Read Views]

Legend

Primary database: source of truth

Cache: read optimization

Dotted arrows: invalidation

API layer

Decides which reads deserve a cache and which writes must stay strictly authoritative.

Cache

Reduces load on stable views and frequently read paths.

Invalidation

Ties cache freshness to changes in the authoritative state.
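A read-through cache with explicit invalidation is enough to show the freshness contract this pattern describes. A minimal sketch; `load` stands in for the authoritative database read:

```python
class ReadThroughCache:
    def __init__(self, load):
        self._load = load   # authoritative read from the primary DB
        self._data = {}

    def get(self, key):
        if key not in self._data:
            # Miss: go to the source of truth and remember the answer.
            self._data[key] = self._load(key)
        return self._data[key]

    def invalidate(self, key):
        # Called by the write path: freshness is tied to authoritative
        # changes, never to the cache's own guess about staleness.
        self._data.pop(key, None)
```

The failure mode in the prose above is exactly a write path that forgets to call `invalidate`, which is why eligibility-style reads that gate actions often belong on the uncached path entirely.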

Cloud deployment topology

Context

Used to explain the runtime shape of a full-stack system once web traffic, API work, background jobs, and integrations all matter operationally. It solves the problem of showing where each execution path actually lives after code leaves the repo.

Key idea

Design the runtime so synchronous product flows, async work, persistence, and operational feedback reinforce each other instead of competing for ownership.

Tradeoff

The tradeoff is operational spread. Incidents no longer live in one runtime, and this is where debugging slows down: web looks healthy, workers are degraded, the database is fine, and the external dependency is timing out just enough to keep the whole system ambiguous.

Production reality

The app can look healthy while the system is failing somewhere else. Web requests are green, workers are stuck retrying, queue lag is climbing, and an external integration is slow enough to hurt completion but not slow enough to trip every alert. Support sees that 'the site works' while users report actions stuck for hours. Without a topology like this, teams debug one runtime at a time and miss the path that actually broke.

Production signal

This is the baseline shape of many backend-heavy product systems once background processing and external integrations become part of normal operation.

flowchart TD
    A[Users] --> B[CDN / Edge]
    B --> C[Web Application]
    C --> D[API Services]
    D --> E[(PostgreSQL / MySQL)]
    D -. publish jobs .-> F[Worker Services]
    F --> G[Integrations]
    C --> H[Monitoring]
    F --> H

Legend

Users go through the edge layer

Application services carry the product logic

Workers process async jobs

Monitoring supports operational feedback

Web application

Serves product journeys and delegates business work to backend services.

Worker services

Handle jobs, retries, document processing, and integrations.

Monitoring

Provides visibility into production health and debugging.

CI/CD pipeline visualization

Context

Used when APIs, workers, and background consumers need to move safely together through production. It solves the problem of treating deployment as an operational change with compatibility windows, not just an artifact upload.

Key idea

A safe release includes verification before deploy and runtime confirmation after deploy, especially when more than one component participates in execution.

Tradeoff

Safer releases are slower and more constrained. Teams have to think about mixed-version windows, rollback limits, and whether queued work produced by new code can still be consumed by old workers while support is already seeing strange half-broken behavior.

Production reality

Most release failures are mixed-version failures. New code publishes a message old workers cannot read, or the web tier expects a field the API has not fully rolled out yet. This looks fine in CI and breaks only after deploy. The API returns success, the worker fails later, and the UI never reflects the failure cleanly. Without runtime feedback, engineering finds out from users, and support cannot tell whether the issue is a bad release, stale worker, or half-completed migration.

Production signal

This becomes especially important in systems with workers, schema changes, or queued messages where old and new code coexist during rollout.

flowchart LR
    A[Pull Request] --> B[Lint and Tests]
    B --> C[Build]
    C --> D[Deploy]
    D --> E[Application Services]
    D --> F[Workers]
    E -. telemetry .-> G[Monitoring]
    F -. telemetry .-> G
    G -. release feedback .-> H[Team]

Legend

Solid arrows: release path

Dotted arrows: feedback loop

Monitoring: post-deploy validation

Verification stage

Catches code, test, and build failures before deployment.

Deploy stage

Ships the application and workers together when workflows depend on both.

Monitoring loop

Provides the operational signal that closes the release cycle.
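One concrete guard for the mixed-version window is a schema-version check on queued messages, so an old worker requeues what it cannot read instead of half-processing it. A hypothetical sketch (the field name `schema_version` and the requeue convention are assumptions, not a real framework API):

```python
# Versions this worker build knows how to process.
SUPPORTED_VERSIONS = {1, 2}

def handle(message: dict) -> str:
    version = message.get("schema_version", 1)
    if version not in SUPPORTED_VERSIONS:
        # New code published a message this worker predates: leave it
        # on the queue for an updated worker rather than guessing.
        return "requeue"
    return "processed"
```

The same idea applied in reverse (new workers still accepting the old version during rollout) is what keeps the compatibility window survivable in both directions.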

Background job lifecycle

Context

Used once background jobs are important enough that teams need to answer whether a failed job should retry, stop, or be replayed manually. It solves the problem of making async recovery predictable instead of improvising during incidents.

Key idea

Treat job states as an explicit lifecycle with bounded retries and a deliberate operator handoff for non-recoverable failures.

Tradeoff

This adds operator-facing state and recovery tooling that teams now have to maintain. It also makes incidents more procedural, because someone has to decide which failures should retry automatically and which ones should stop before causing more damage.

Production reality

Jobs do not fail in one neat way. One job times out after creating a remote record but before saving success locally. Another keeps retrying because the dependency is down. A third should never retry because replay would send a second email or create a second booking. Operators check the logs, but one attempt is missing its completion entry and the callback has not arrived yet. Without an explicit lifecycle, replay becomes guesswork and incidents drag on far longer than they should.

Production signal

This pattern is common in systems where retries are helpful but unsafe to run forever, especially around external APIs or expensive document workflows.

stateDiagram-v2
    [*] --> Queued
    Queued --> Running
    Running --> Succeeded
    Running --> Failed
    Failed --> Queued: retry
    Failed --> DeadLetter: max retries exceeded
    DeadLetter --> Queued: manual replay
    Succeeded --> [*]

Legend

State nodes: job lifecycle

Dotted arrows: retry transitions

Dead-letter: manual recovery path

Queued

The job is accepted and waiting for worker capacity.

Failure then retry

Transient failures go back through bounded retries.

Dead-letter recovery

Non-recoverable jobs are isolated for explicit operator review.
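The state diagram above maps to a small transition function with a bounded retry count. A minimal sketch; `MAX_RETRIES` is an illustrative bound, and the manual-replay path is the operator decision the pattern deliberately leaves outside the code:

```python
MAX_RETRIES = 3

def next_state(current: str, outcome: str, attempts: int) -> str:
    """Decide the next lifecycle state after a worker attempt."""
    if current == "running" and outcome == "ok":
        return "succeeded"
    if current == "running" and outcome == "error":
        # Bounded retries: transient failures go back to the queue,
        # exhausted ones stop in dead_letter for operator review.
        return "queued" if attempts < MAX_RETRIES else "dead_letter"
    raise ValueError(f"unexpected combination: {current}/{outcome}")
```

Keeping the retry bound in one place is what turns "should this job run again?" from incident-time guesswork into a rule the whole team can read.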