Durable Delivery Retry

Status

Implemented.

  • SRS references: reliable outbound integration and evidence/audit traceability.
  • Client response references: Phase 1 needs a basic outbound webhook; external ticketing systems are deferred to Phase 2.
  • ADR references: infrastructure seams and delivery state machine docs.
  • Task board references: OP-D030, OP-021.

Problem Statement

Outbound integration delivery must survive worker restarts, process crashes, and transient network failures without relying on in-memory retry timers.

Architectural Intent

Delivery retry state is persisted in Postgres. Workers poll for retry work that is due, claim it transactionally, execute the delivery through the adapter seam, record each attempt, and transition the delivery to delivered, retry scheduled, failed, or dead-lettered.
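The outcome step of that loop can be sketched as a small state machine. This is a minimal illustration, not the service's code: the state names match the explicit delivery states this page lists, but the transition table, the `next_state` and `transition` helpers, the `max_attempts` parameter, and the rule that separates failed from dead-lettered are all assumptions. The authoritative rules are in the delivery state machine docs.

```python
from enum import Enum


class DeliveryState(str, Enum):
    PENDING = "pending"
    DELIVERING = "delivering"
    DELIVERED = "delivered"
    RETRY_SCHEDULED = "retry_scheduled"
    FAILED = "failed"
    DEAD_LETTERED = "dead_lettered"


# Assumed transition table: only a claimed (delivering) row can reach an outcome.
ALLOWED = {
    DeliveryState.PENDING: {DeliveryState.DELIVERING},
    DeliveryState.RETRY_SCHEDULED: {DeliveryState.DELIVERING},
    DeliveryState.DELIVERING: {
        DeliveryState.DELIVERED,
        DeliveryState.RETRY_SCHEDULED,
        DeliveryState.FAILED,
        DeliveryState.DEAD_LETTERED,
    },
    DeliveryState.DELIVERED: set(),
    DeliveryState.FAILED: set(),
    DeliveryState.DEAD_LETTERED: set(),
}


def next_state(*, succeeded: bool, retryable: bool, attempt: int, max_attempts: int) -> DeliveryState:
    """Pick the outcome state after one delivery attempt (assumed policy)."""
    if succeeded:
        return DeliveryState.DELIVERED
    if not retryable:
        # Assumption: non-retryable errors fail immediately.
        return DeliveryState.FAILED
    if attempt >= max_attempts:
        # Assumption: retryable errors dead-letter once attempts are exhausted.
        return DeliveryState.DEAD_LETTERED
    return DeliveryState.RETRY_SCHEDULED


def transition(current: DeliveryState, target: DeliveryState) -> DeliveryState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the allowed transitions in one table makes it cheap to reject an illegal move, such as re-delivering a row that is already delivered, before anything is written back to the database.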

What Was Implemented

  • Explicit delivery states: pending, delivering, delivered, retry_scheduled, failed, dead_lettered.
  • DB-driven retry scheduling.
  • Transactional claiming with lock ownership and expiry.
  • Persisted attempts with timing, duration, HTTP status, retryability, event references, and redacted error summaries.
  • Dead-letter transition and audit path.
  • Retry/dead-letter metrics.
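Transactional claiming with lock ownership and expiry can be illustrated with a conditional update. The sketch below uses an in-memory SQLite database from the Python standard library purely as a stand-in so that it runs anywhere; the table layout, column names, `claim_due` function, and lease length are hypothetical and are not taken from `005_delivery_retry_state.sql`. In Postgres the same pattern is commonly written with `SELECT ... FOR UPDATE SKIP LOCKED`, though this page does not say which form the implementation uses.

```python
import sqlite3

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE deliveries (
    id INTEGER PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    state TEXT NOT NULL,
    next_attempt_at REAL NOT NULL,
    locked_by TEXT,
    lock_expires_at REAL
)
"""

# A row is claimable if it is waiting, or if a previous claim's lease expired.
CLAIMABLE = """
    (state IN ('pending', 'retry_scheduled')
     OR (state = 'delivering' AND lock_expires_at <= ?))
"""


def claim_due(conn, tenant_id, worker_id, now, lease_seconds=30.0):
    """Claim one due delivery for worker_id; return its id, or None."""
    with conn:  # commits on success, rolls back on exception
        row = conn.execute(
            "SELECT id FROM deliveries"
            " WHERE tenant_id = ? AND next_attempt_at <= ? AND" + CLAIMABLE +
            " ORDER BY next_attempt_at LIMIT 1",
            (tenant_id, now, now),
        ).fetchone()
        if row is None:
            return None
        # Re-check the claimable condition in the UPDATE so that a row taken
        # by another worker in the meantime is not claimed twice.
        cur = conn.execute(
            "UPDATE deliveries"
            " SET state = 'delivering', locked_by = ?, lock_expires_at = ?"
            " WHERE id = ? AND" + CLAIMABLE,
            (worker_id, now + lease_seconds, row[0], now),
        )
        return row[0] if cur.rowcount == 1 else None
```

The lease expiry is what lets a delivery recover after a worker crash: a row left in `delivering` becomes claimable again once `lock_expires_at` has passed, without any in-memory timer.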

Components Involved

  • db/postgres/005_delivery_retry_state.sql
  • poc/ingest_api/integration_delivery_service.py
  • services/worker_service/runner.py
  • Integrations console timeline components.

APIs / Events / Schemas

  • Delivery read fields are exposed through existing integrations APIs.
  • Delivery events and AsyncAPI docs include delivery lifecycle behavior.
  • No replay API is implemented yet.

Deployment Notes

Retry polling is owned by the worker and configured through worker environment variables. A production-shaped retry deployment requires both Postgres and the worker service.
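As an illustration of environment-driven polling configuration, the sketch below loads settings into a frozen dataclass and refuses to enable polling without an explicit tenant list. Every variable name here (`WORKER_RETRY_POLL_ENABLED`, `WORKER_RETRY_TENANT_IDS`, `WORKER_RETRY_POLL_INTERVAL_SECONDS`, `WORKER_RETRY_BATCH_SIZE`) and every default is hypothetical; check the worker service for the real names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RetryPollConfig:
    enabled: bool
    poll_interval_seconds: float
    batch_size: int
    tenant_ids: tuple[str, ...]


def load_retry_poll_config(env: Mapping[str, str] = os.environ) -> RetryPollConfig:
    """Build the polling config from environment variables (hypothetical names)."""
    enabled = env.get("WORKER_RETRY_POLL_ENABLED", "false").strip().lower() in {"1", "true", "yes"}
    tenant_ids = tuple(
        part.strip()
        for part in env.get("WORKER_RETRY_TENANT_IDS", "").split(",")
        if part.strip()
    )
    if enabled and not tenant_ids:
        # Fail at startup rather than polling across tenants by accident.
        raise ValueError("retry polling is enabled but no tenant IDs are configured")
    return RetryPollConfig(
        enabled=enabled,
        poll_interval_seconds=float(env.get("WORKER_RETRY_POLL_INTERVAL_SECONDS", "5")),
        batch_size=int(env.get("WORKER_RETRY_BATCH_SIZE", "10")),
        tenant_ids=tenant_ids,
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.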

Security / Tenant Isolation

  • Delivery rows and attempts are tenant-scoped.
  • Worker retry polling requires explicit tenant configuration.
  • Disabled or missing destinations fail safely.
  • Error summaries are bounded and redacted.
  • No secret values are persisted in attempts.
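A bounded, redacted error summary might be produced along these lines. This is a sketch under stated assumptions: the 200-character bound, the regular expressions, and the `summarize_error` name are invented for illustration, and the patterns cover only a few obvious credential shapes rather than whatever the delivery service actually redacts.

```python
import re

MAX_SUMMARY_CHARS = 200  # assumed bound, not the service's real limit

# Assumed patterns: keep the key name, replace the value.
_SECRET_PATTERNS = [
    re.compile(r"(?i)(authorization\s*[:=]\s*)(bearer\s+)?\S+"),
    re.compile(r"(?i)((?:api[_-]?key|token|secret|password)\s*[:=]\s*)\S+"),
]


def summarize_error(message: str, limit: int = MAX_SUMMARY_CHARS) -> str:
    """Return a single-line, redacted summary no longer than limit characters."""
    redacted = message
    for pattern in _SECRET_PATTERNS:
        redacted = pattern.sub(lambda m: m.group(1) + "[REDACTED]", redacted)
    # Collapse newlines and repeated whitespace so the summary stays on one line.
    redacted = " ".join(redacted.split())
    if len(redacted) > limit:
        redacted = redacted[: limit - 3] + "..."
    return redacted
```

Redacting before truncating matters: truncating first could cut a secret in half and leave a fragment that no longer matches any pattern.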

Validation Steps

UI Validation

  1. Open /console/integrations.
  2. Select a destination with retry or dead-letter history.
  3. Confirm the attempt timeline shows status, attempt number, retryability, and redacted error summary.
  4. Confirm dead-lettered deliveries show an explanation and no replay controls.

API Validation

Use the integration delivery read endpoints documented in OpenAPI and confirm delivery state and attempts are tenant-scoped.

Smoke Validation

make test-delivery-retry
make smoke-delivery-retry

Known Limitations

  • Dead-letter queue (DLQ) browser and replay tooling are not implemented.
  • Alerting dashboards for retry queues are future work.
  • External ticketing adapters are not implemented.

Follow-Up Work

  • Add DLQ browser and replay operations once safe API contracts are defined.
  • Add receiver replay-window fixtures.
  • Add production dashboards and alerting.

Acceptance Criteria Mapping

| Acceptance criterion | Evidence |
| --- | --- |
| Retry is durable | DB retry state migration and tests |
| Duplicate retry execution is controlled | Transactional claiming tests |
| Dead-letter is auditable | Delivery service audit path |
| UI shows retry/dead-letter state | /console/integrations |