System design
An NDA-safe view of production systems spanning frontend, APIs, workers, async jobs, and delivery. The goal is not to show “clean” diagrams, but to show where state is validated, where side effects are triggered, and how the system stays reliable when a dependency slows down or fails.
Start with the core system patterns. They show where state becomes durable, where work moves out of the request, and how the product stays readable when several components execute the same business action.
Core
Shows how booking requests move through validation, partner integration, and asynchronous recovery paths.
Core
System view of synchronous actions, business services, asynchronous document processing, and operational visibility.
Core
A guardrail pattern for making retries safe across APIs, queues, and side effects toward partner services.
The rest serves as supporting patterns: delivery, caching, notifications, consistency, and operational recovery.
Core system patterns
The diagrams most representative of how I think about execution, API boundaries, async processing, and production recovery.
Context
Used when product-critical flows depend on partner APIs that are slower or less reliable than the user-facing booking experience. It solves the problem of keeping frontend progress and backend truth aligned even when external systems respond late or inconsistently.
Key idea
Treat booking state as the source of truth, and push partner variability behind explicit async recovery boundaries.
Tradeoff
This makes the backend stricter in ways product teams feel. Booking can no longer be treated as one clean success state, and this is where teams get stuck: API validation, queue retries, partner callbacks, and recovery tooling can all disagree for a while.
Production reality
In production, a partner API times out but still creates the reservation on its side. The retry creates a second reservation, then the original callback arrives late and overwrites the local status again. The user saw one spinner and one success screen, support sees conflicting states across screens, and engineers check logs that do not line up cleanly with partner timestamps. Without a clear local source of truth, teams end up guessing whether to retry, compensate, or tell the user to wait.
Production signal
This pattern is common in marketplace and integration-heavy systems where conversion depends on not exposing partner instability directly to the user journey.
flowchart LR
A[User Booking Flow] --> B[Frontend Client]
B --> C[Booking API]
C --> D[Validation & Pricing Rules]
D --> E[(Booking Data)]
C -. retry jobs .-> F[Background Workers]
F --> G[Partner API Connectors]
G --> H[Partner Systems]
F --> I[Status Sync & Recovery]
Legend
Solid arrows: main flow
Dashed arrows: partner sync and retries
Rounded nodes: user-facing steps
Defines reliable contracts between the product UI, backend rules, and partners.
Handle partner synchronization outside the request cycle for better resilience.
Keeps recovery paths readable when partners are unstable.
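As a rough illustration of keeping local booking state authoritative, the sketch below applies a partner callback only when it belongs to the current attempt and the transition is legal. The state names, the `Booking` class, and the attempt counter are hypothetical choices made for this example, not details of a specific system.

```python
from dataclasses import dataclass, field

# Hypothetical booking states and the transitions each one allows.
ALLOWED = {
    "pending_partner": {"confirmed", "failed", "needs_review"},
    "failed": {"pending_partner"},            # explicit retry only
    "needs_review": {"confirmed", "failed"},
    "confirmed": set(),                       # terminal: late callbacks cannot overwrite it
}


@dataclass
class Booking:
    booking_id: str
    state: str = "pending_partner"
    current_attempt: int = 1
    history: list = field(default_factory=list)

    def apply_callback(self, attempt: int, new_state: str) -> bool:
        """Apply a partner callback only if it is current and the transition is legal."""
        if attempt != self.current_attempt or new_state not in ALLOWED[self.state]:
            # Keep the stale or illegal callback for reconciliation instead of applying it.
            self.history.append(("ignored", attempt, new_state))
            return False
        self.history.append(("applied", attempt, new_state))
        self.state = new_state
        return True

    def retry(self) -> None:
        """Start a new partner attempt; callbacks from older attempts become stale."""
        if "pending_partner" not in ALLOWED[self.state]:
            raise ValueError(f"cannot retry from {self.state}")
        self.current_attempt += 1
        self.state = "pending_partner"
```

With this shape, a late callback from the first attempt is recorded in `history` but cannot overwrite a booking that a later attempt already confirmed. A recovery job could then inspect the ignored entries to decide whether the partner holds a duplicate reservation.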
Context
Used in workflow-heavy admin systems where operators trigger actions that depend on document generation, email handling, or reporting work that may complete later. It solves the gap between a fast admin action and a slower backend workflow that still needs to remain explainable.
Key idea
Commit the workflow transition first, then move document and notification side effects into observable background execution.
Tradeoff
This introduces more than queueing overhead. The product now has intermediate states that admin users will see before the system agrees with itself, and debugging slows down because controller, queue, worker, and status screens stop lining up cleanly under failure.
Production reality
This is where systems break. A report job gets queued, the worker crashes during PDF generation, and the admin still sees the visit as processed because the request path already returned cleanly. Minutes later the PDF job retries, but the operator has already refreshed into a stale status view and assumes it is done. Or the email send times out after the state change was saved, then a retry sends the message twice. The database says processed, the document store is empty, and support cannot tell whether replay will fix the issue or duplicate more work.
Production signal
This pattern is common in compliance-sensitive systems where admin users need reliable status visibility without waiting on PDFs, email, or reporting pipelines.
flowchart LR
A[Admin UI] --> B[Symfony Controllers]
B --> C[Workflow Services]
C --> D[(Scheduling & Reporting Data)]
C -. queue .-> E[Async Worker]
E --> F[PDF Processing]
E --> G[Email Notifications]
E --> H[Audit Trail]
C --> I[Operational Status Views]
Legend
Solid arrows: synchronous requests
Dashed arrows: async jobs
Panels: operational surfaces
Centralize the business rules for scheduling, reporting, and eligibility checks.
Handles document parsing, notifications, and tasks that must not block the admin interface.
Make system states readable for administrative teams.
Context
Used anywhere the same operation may be replayed by a browser refresh, mobile timeout, queue retry, or partner callback. It solves the problem of distinguishing a legitimate retry from a second business action.
Key idea
Make retries cheap by treating deduplication and execution history as part of the system contract, not as an afterthought.
Tradeoff
The cost is real: more persistent state, slower debugging, and stricter semantics around when a side effect is considered committed. This is where teams get stuck when a retry fires after partial success and nobody can tell whether they are recovering work or duplicating it.
Production reality
Timeouts are rarely clean failures. A partner API can return nothing, still create the booking, and then get called again by a retry. Or a worker can update the external system, crash before storing its execution result, and run again while the previous attempt actually succeeded. The external system says yes, the app log is missing the success entry, and support is left with the worst question in these incidents: did we do it once or twice?
Production signal
This is a standard reliability pattern in payment, booking, and integration-heavy systems where at-least-once delivery is normal and duplicates are expensive.
flowchart LR
A[Client Request] --> B[API Gateway]
B --> C[Idempotency Key Check]
C --> D[(Idempotency Store)]
C --> E[Domain Service]
E -. retry-safe job .-> F[Queue]
F --> G[Worker]
G --> H[External API]
G --> I[(Execution Log)]
Legend
Key store: idempotency record
Dashed arrows: retries
Worker: side-effect execution
Prevents duplicate side effects when clients or workers replay an operation.
Allows background work to be replayed without repeating already-committed state.
Keeps a durable record of attempts, results, and retry decisions.
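One possible shape for the idempotency key check is sketched below. It assumes an in-memory dictionary standing in for the durable idempotency store, and the class and method names are invented for illustration. The payload fingerprint distinguishes a legitimate retry from a reused key that carries a different request.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """The same key was reused with a different payload."""


class IdempotencyStore:
    """In-memory stand-in for a durable idempotency store (assumption)."""

    def __init__(self):
        self._records = {}

    def execute(self, key, payload, action):
        # Fingerprint the payload so a reused key with new content is detected.
        fingerprint = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        record = self._records.get(key)
        if record is not None:
            if record["fingerprint"] != fingerprint:
                raise IdempotencyConflict(key)
            return record["result"]  # legitimate retry: return the stored result
        result = action(payload)
        self._records[key] = {"fingerprint": fingerprint, "result": result}
        return result
```

This sketch does not close the gap described above: a crash between `action(payload)` and the write to the store still loses the result. A real implementation would likely reserve the key before executing, or pass the same key to the external system so it can deduplicate on its side.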
Context
Used when requests trigger work that depends on external services, document processing, or long-running computation. It solves the tension between fast user-facing responses and the need to keep downstream execution durable and observable.
Key idea
Accept and persist the business action synchronously, then let workers execute the expensive or failure-prone part with traceable status.
Tradeoff
This looks fine until the product has to explain it. `Accepted`, `processing`, `failed`, and `completed` become separate user-visible states, and debugging gets slower because one action now spans API, queue, worker, and follow-up reads.
Production reality
This is where systems break when teams stop at the 202 response. Jobs back up, workers crash after writing half the result, downstream APIs slow down just enough to delay completion without clearly failing, and users click again because the UI cannot tell queued work from finished work. Support sees 'request succeeded' in one place and 'job still failing' in another, while logs only show a partial trail. Without explicit state transitions, the system feels random even when each component is doing exactly what it was coded to do.
Production signal
This pattern appears in most systems with external integrations, background documents, or any path where completion matters more than the first response time.
flowchart LR
A[API Request or Event] --> B[Domain Service]
B --> C[(Primary Database)]
B -. enqueue .-> D[Job Queue]
D --> E[Worker Pool]
E --> F[External Services]
E --> G[(Audit Log)]
E --> H[Notification Service]
Legend
Input: API or business trigger
Queue: deferred work
Workers: processing units
Audit trail: observability
Validates and records state before delegating side effects to workers.
Absorbs expensive or integration-dependent processing to protect latency.
Creates a readable operational trail for support and debugging.
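The explicit state transitions can be sketched as follows. This is a minimal example that assumes a dictionary in place of the primary database and `queue.Queue` in place of a real job queue; the function names are hypothetical.

```python
import queue
import uuid

jobs = {}                  # stand-in for the primary database (assumption)
job_queue = queue.Queue()  # stand-in for a real job queue (assumption)


def accept(payload):
    """Request path: persist the action, enqueue the work, answer immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "accepted", "payload": payload, "error": None}
    job_queue.put(job_id)
    return {"http_status": 202, "job_id": job_id}


def run_next(handler):
    """Worker path: record every transition so reads can tell queued from finished."""
    job_id = job_queue.get_nowait()
    job = jobs[job_id]
    job["status"] = "processing"
    try:
        handler(job["payload"])
    except Exception as exc:
        job["status"] = "failed"
        job["error"] = str(exc)
    else:
        job["status"] = "completed"
    return job_id
```

Because every status lives on the job record, a status endpoint or admin screen can read `jobs[job_id]["status"]` instead of inferring completion from the 202 response.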
Context
Used when a system needs one authoritative write model but several downstream consumers, read views, or integrations. It solves the problem of making side effects and read models catch up safely after the source-of-truth transaction commits.
Key idea
Keep the primary write atomic, then publish outward from durable state instead of coupling business truth to immediate downstream success.
Tradeoff
The cost is not abstract eventual consistency. It is operational disagreement: the source of truth says one thing, the dashboard says another, and support has to pick which screen to trust before telling a user to retry or wait.
Production reality
In production, stale reads look like bugs even when the write succeeded. The database has the new state, the dashboard still shows the old one, and the external integration has not processed the event yet. A user retries an action that already worked, support sees conflicting screens, and engineering has to answer the ugliest question in these systems: which truth is the real one right now?
Production signal
This pattern is common once systems add projections, callbacks, analytics feeds, or several services that should react to one committed business change.
flowchart LR
A[API Write] --> B[Transactional Service]
B --> C[(Primary DB)]
B --> D[(Outbox / Event Table)]
D -. publish .-> E[Worker]
E --> F[Read Model]
E --> G[External Integrations]
F --> H[Status Views]
Legend
Solid arrows: transactional writes
Dashed arrows: eventual propagation
Status views: reconciled read models
Commits the authoritative state and records outbound work atomically.
Separates durable state changes from asynchronous delivery of side effects.
Feeds user and operator views once background processing has been applied.
Context
Used to make the real execution path explicit when a single user action spans request-time validation and asynchronous follow-up work. It solves the problem of teams reasoning only about the API response while the meaningful work continues afterward.
Key idea
The response is only the first milestone; the full system path includes persistence, queuing, worker execution, and a later visible outcome.
Tradeoff
Once execution spans time, the product needs honest status language and tighter coordination across frontend, API, and worker code. Otherwise one team says 'success' while another means 'accepted but not finished.'
Production reality
A user clicks once, gets a success response, and the real work still fails later. Maybe the API persisted the transition, the worker hit a timeout, and the refreshed screen still shows stale data, so the user clicks again before the callback from the first attempt arrives. This is where systems break if the response is treated as completion. Support then has to answer whether the first click worked, whether the second click duplicated it, and why the UI, database, and worker logs each tell a slightly different story.
Production signal
This is a useful framing for any system where users see an immediate acknowledgment but actual completion depends on background work.
sequenceDiagram
participant U as User
participant A as API
participant D as Database
participant Q as Queue
participant W as Worker
U->>A: Submit workflow action
A->>D: Persist state transition
A-->>Q: Enqueue background job
A-->>U: 202 Accepted / updated state
Q-->>W: Deliver job
W->>D: Store processing result
W-->>A: Emit status update
Legend
Participants: user, API, database, queue, worker
Dashed message: async execution
Confirms the state change before delegating long-running work.
Buffers the asynchronous step to keep the response fast and stable.
Executes the heavy work and persists the final processing state.
Supporting patterns
Secondary patterns that strengthen reliability, operability, and delivery once the main execution paths are well defined.
Context
Used when product state changes need user or operator notifications, but delivery timing and provider behavior are not reliable enough for synchronous handling. It solves the problem of making notifications observable and recoverable instead of best-effort side effects.
Key idea
Convert domain events into durable notification work, then deliver and audit that work asynchronously.
Tradeoff
This makes messaging operationally heavier than it looks. Teams now have to reason about delivery state, template changes, and whether a failed send is safe to replay when provider status, app logs, and user reports do not agree.
Production reality
This is usually where teams struggle. A worker sends the email, crashes before writing the notification log, and the retry sends the same email again. Or the provider times out, but the message was actually accepted. Without a durable log and retry boundary, support cannot tell whether the user was notified, engineering cannot tell whether replay is safe, and a template bug can quietly affect live traffic for hours.
Production signal
This pattern is common in systems with external providers, compliance-sensitive messaging, or support teams that need to explain whether a message really went out.
flowchart LR
A[Domain Event] --> B[Notification Service]
B --> C[Template Resolver]
B -. enqueue .-> D[Delivery Queue]
D --> E[Worker]
E --> F[Email Provider]
E --> G[(Notification Log)]
G --> H[Admin Visibility]
Legend
Solid arrows: business events
Dashed arrows: deferred delivery
Database: persistent history
Turns business events into messages ready for delivery.
Moves email sending out of the request cycle and makes retries easier.
Keeps a usable record for support, audit, and status visibility.
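The retry boundary around a send could look roughly like the sketch below. It assumes an in-memory log in place of a durable notification table, and the status names and the `send_once` helper are hypothetical. A timeout is treated as ambiguous, because the provider may have accepted the message, so it goes to review instead of a blind resend.

```python
class NotificationLog:
    """In-memory stand-in for a durable notification log (assumption)."""

    def __init__(self):
        self.entries = {}

    def send_once(self, dedupe_key, send):
        entry = self.entries.setdefault(
            dedupe_key, {"status": "pending", "attempts": 0}
        )
        if entry["status"] == "sent":
            return "skipped"  # already delivered: a retry must not resend
        if entry["status"] == "sending":
            # A previous attempt started and never recorded an outcome.
            entry["status"] = "needs_review"
        if entry["status"] == "needs_review":
            return "needs_review"
        entry["status"] = "sending"  # recorded before calling the provider
        entry["attempts"] += 1
        try:
            send()
        except TimeoutError:
            entry["status"] = "needs_review"  # provider may have accepted it
            return "needs_review"
        except Exception:
            entry["status"] = "pending"  # assumed definite failure: safe to retry
            raise
        entry["status"] = "sent"
        return "sent"
```

The dedupe key would typically combine the domain event and the recipient, for example `"visit-1:report-ready"`, so that a replayed job maps to the same log entry.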
Context
Used once read-heavy endpoints or dashboards begin to pressure the primary system path. It solves the problem of reducing repeated read load without letting cached data become an invisible source of business inconsistency.
Key idea
Cache the reads that can tolerate lag, but keep business-critical writes and freshness ownership tied to the source of truth.
Tradeoff
Every cache creates a second truth with a different freshness window. This is where debugging slows down because engineers have to figure out whether the wrong screen came from stale cache, stale client state, or a real write-path bug.
Production reality
Caches fail quietly. A cache serves stale eligibility, the user performs an action that should have been blocked, and now support has to explain why the UI allowed something the backend later rejected. Or one screen invalidates correctly while another keeps serving yesterday's state. This is usually where teams struggle because nothing is obviously down.
Production signal
This matters most in systems with dashboards, frequently queried views, or partner lookups where performance gains are real but stale eligibility can cause real user-facing mistakes.
flowchart TD
A[Users] --> B[Web App]
B --> C[API Layer]
C --> D[Cache]
C --> E[(Primary DB)]
E -. invalidate .-> D
D --> F[Frequently Read Views]
Legend
Primary database: source of truth
Cache: read optimization
Dashed arrows: invalidation
Decides which reads deserve a cache and which writes must stay strictly authoritative.
Reduces load on stable views and frequent read paths.
Ties cache freshness to changes in the reference state.
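To make the split between lag-tolerant reads and authoritative checks concrete, here is a small sketch. It assumes a tiny in-process TTL cache and a dictionary standing in for the primary database; `dashboard_view`, `set_eligibility`, and `can_book` are hypothetical names.

```python
import time


class ReadCache:
    """Tiny TTL cache; a stand-in for a shared cache service (assumption)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}

    def get(self, key, load):
        hit = self._data.get(key)
        if hit is not None and self.clock() - hit[1] < self.ttl:
            return hit[0]
        value = load(key)
        self._data[key] = (value, self.clock())
        return value

    def invalidate(self, key):
        self._data.pop(key, None)


db = {"user-1": {"eligible": True}}  # stand-in for the primary database
cache = ReadCache(ttl_seconds=60)


def dashboard_view(user_id):
    """Lag-tolerant read: served from the cache."""
    return cache.get(user_id, lambda k: dict(db[k]))


def set_eligibility(user_id, eligible):
    """Write the source of truth first, then invalidate the cached read."""
    db[user_id]["eligible"] = eligible
    cache.invalidate(user_id)


def can_book(user_id):
    """Business-critical check: always reads the source of truth, never the cache."""
    return db[user_id]["eligible"]
```

The dashboard can lag by up to the TTL if a write skips invalidation, while `can_book` stays correct because it never consults the cache.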
Context
Used to explain the runtime shape of a full-stack system once web traffic, API work, background jobs, and integrations all matter operationally. It solves the problem of showing where each execution path actually lives after code leaves the repo.
Key idea
Design the runtime so synchronous product flows, async work, persistence, and operational feedback reinforce each other instead of competing for ownership.
Tradeoff
The tradeoff is operational spread. Incidents no longer live in one runtime, and this is where debugging slows down: web looks healthy, workers are degraded, the database is fine, and the external dependency is timing out just enough to keep the whole system ambiguous.
Production reality
The app can look healthy while the system is failing somewhere else. Web requests are green, workers are stuck retrying, queue lag is climbing, and an external integration is slow enough to hurt completion but not slow enough to trip every alert. Support sees that 'the site works' while users report actions stuck for hours. Without a topology like this, teams debug one runtime at a time and miss the path that actually broke.
Production signal
This is the baseline shape of many backend-heavy product systems once background processing and external integrations become part of normal operation.
flowchart TD
A[Users] --> B[CDN / Edge]
B --> C[Web Application]
C --> D[API Services]
D --> E[(PostgreSQL / MySQL)]
D -. publish jobs .-> F[Worker Services]
F --> G[Integrations]
C --> H[Monitoring]
F --> H
Legend
Users go through the edge layer
Application services carry the product logic
Workers process async jobs
Monitoring supports operational feedback
Serves product flows and delegates business work to backend services.
Handle jobs, retries, document processing, and integrations.
Provides visibility into production health and debugging.
Context
Used when APIs, workers, and background consumers need to move safely together through production. It solves the problem of treating deployment as an operational change with compatibility windows, not just an artifact upload.
Key idea
A safe release includes verification before deploy and runtime confirmation after deploy, especially when more than one component participates in execution.
Tradeoff
Safer releases are slower and more constrained. Teams have to think about mixed-version windows, rollback limits, and whether queued work produced by new code can still be consumed by old workers while support is already seeing strange half-broken behavior.
Production reality
Most release failures are mixed-version failures. New code publishes a message old workers cannot read, or the web tier expects a field the API has not fully rolled out yet. This looks fine in CI and breaks only after deploy. The API returns success, the worker fails later, and the UI never reflects the failure cleanly. Without runtime feedback, engineering finds out from users, and support cannot tell whether the issue is a bad release, stale worker, or half-completed migration.
Production signal
This becomes especially important in systems with workers, schema changes, or queued messages where old and new code coexist during rollout.
flowchart LR
A[Pull Request] --> B[Lint and Tests]
B --> C[Build]
C --> D[Deploy]
D --> E[Application Services]
D --> F[Workers]
E -. telemetry .-> G[Monitoring]
F -. telemetry .-> G
G -. release feedback .-> H[Team]
Legend
Solid arrows: release path
Dashed arrows: feedback loop
Monitoring: post-deployment validation
Catches code, test, and build failures before deployment.
Ships application and workers together when workflows depend on both.
Provides the operational signal that closes the release loop.
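One way to survive a mixed-version window is for workers to read messages tolerantly and reject unknown versions explicitly instead of crashing partway through. The sketch below assumes a `version` field on each message and a hypothetical `locale` field added in version 2; neither comes from a real schema.

```python
SUPPORTED_VERSIONS = {1, 2}  # versions this worker build can read (assumption)


class UnsupportedMessage(Exception):
    """Raised so the queue can park the message instead of dropping it."""


def handle(message):
    # Messages without a version field are treated as version 1.
    version = message.get("version", 1)
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedMessage(f"version {version}")
    booking_id = message["booking_id"]
    # Version 2 added an optional field; default it so version 1 still works.
    locale = message.get("locale", "en")
    return {"booking_id": booking_id, "locale": locale}
```

Under this approach, producers would start emitting a new version only after every running worker lists it in `SUPPORTED_VERSIONS`, which usually means shipping the reader before the writer.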
Context
Used once background jobs are important enough that teams need to answer whether a failed job should retry, stop, or be replayed manually. It solves the problem of making async recovery predictable instead of improvising during incidents.
Key idea
Treat job states as an explicit lifecycle with bounded retries and a deliberate operator handoff for non-recoverable failures.
Tradeoff
This adds operator-facing state and recovery tooling that teams now have to maintain. It also makes incidents more procedural, because someone has to decide which failures should retry automatically and which ones should stop before causing more damage.
Production reality
Jobs do not fail in one neat way. One job times out after creating a remote record but before saving success locally. Another keeps retrying because the dependency is down. A third should never retry because replay would send a second email or create a second booking. Operators check the logs, but one attempt is missing its completion entry and the callback has not arrived yet. Without an explicit lifecycle, replay becomes guesswork and incidents drag on far longer than they should.
Production signal
This pattern is common in systems where retries are helpful but unsafe to run forever, especially around external APIs or expensive document workflows.
stateDiagram-v2
[*] --> Queued
Queued --> Running
Running --> Succeeded
Running --> Failed
Failed --> Queued: retry
Failed --> DeadLetter: max retries exceeded
DeadLetter --> Queued: manual replay
Succeeded --> [*]
Legend
State nodes: job lifecycle
Dashed arrows: retry transitions
Dead-letter: manual recovery path
The job is accepted and waits for worker capacity.
Transient failures go back through bounded retries.
Unrecoverable jobs are isolated for explicit operator review.
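The lifecycle in the state diagram can be sketched as a small state machine. The retry budget of three attempts and the `retryable` flag are assumptions added for this example; the flag covers failures that should never retry automatically, such as a second email or a second booking.

```python
MAX_ATTEMPTS = 3  # hypothetical retry budget


class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.state = "Queued"
        self.attempts = 0

    def start(self):
        self._require("Queued")
        self.state = "Running"
        self.attempts += 1

    def succeed(self):
        self._require("Running")
        self.state = "Succeeded"

    def fail(self, retryable=True):
        """Running -> Failed, then either back to Queued or on to DeadLetter."""
        self._require("Running")
        self.state = "Failed"
        if retryable and self.attempts < MAX_ATTEMPTS:
            self.state = "Queued"      # bounded automatic retry
        else:
            self.state = "DeadLetter"  # stop and hand off to an operator

    def manual_replay(self):
        """Operator decision: re-queue a dead-lettered job with a fresh budget."""
        self._require("DeadLetter")
        self.attempts = 0
        self.state = "Queued"

    def _require(self, expected):
        if self.state != expected:
            raise ValueError(
                f"{self.job_id}: expected {expected}, got {self.state}"
            )
```

Rejecting illegal transitions in `_require` is a deliberate choice here: a worker that tries to complete a job it never started fails loudly instead of silently corrupting the lifecycle.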