Durable Delivery Retry
Status
Implemented.
Related Requirements
- SRS references: reliable outbound integration and evidence/audit traceability.
- Client response references: Phase 1 needs basic outbound webhook; external ticketing systems are Phase 2.
- ADR references: infrastructure seams and delivery state machine docs.
- Task board references: OP-D030, OP-021.
Problem Statement
Outbound integration delivery must survive worker restarts, process crashes, and transient network failure without relying on in-memory retry timers.
Architectural Intent
Delivery retry state is persisted in Postgres. Workers poll for due retry work, claim it transactionally, execute the delivery through the adapter seam, record the attempt, and transition the delivery to delivered, retry_scheduled, failed, or dead_lettered.
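The transition step can be illustrated with a small decision function. This is a minimal sketch, not the service's actual logic. The retryable status set, the attempt limit, the exponential backoff, and the split between failed (non-retryable response) and dead_lettered (retries exhausted) are all assumptions made for illustration, as are the names `Transition` and `next_transition`.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: statuses treated as transient. The real service may differ.
RETRYABLE_STATUS = {408, 425, 429, 500, 502, 503, 504}


@dataclass(frozen=True)
class Transition:
    state: str
    next_attempt_at: Optional[float]  # epoch seconds, set only for retry_scheduled


def is_retryable(http_status: Optional[int]) -> bool:
    # None means no HTTP response was received (timeout, connection error).
    return http_status is None or http_status in RETRYABLE_STATUS


def next_transition(
    http_status: Optional[int],
    attempt_number: int,
    now: float,
    max_attempts: int = 5,     # hypothetical limit
    base_delay: float = 30.0,  # hypothetical backoff base, in seconds
    max_delay: float = 3600.0,
) -> Transition:
    """Decide the state a delivery moves to after one attempt."""
    if http_status is not None and 200 <= http_status < 300:
        return Transition("delivered", None)
    if not is_retryable(http_status):
        # Assumption: a non-retryable response ends the delivery as failed.
        return Transition("failed", None)
    if attempt_number >= max_attempts:
        # Assumption: exhausting retries moves the delivery to the dead letter state.
        return Transition("dead_lettered", None)
    delay = min(max_delay, base_delay * 2 ** (attempt_number - 1))
    return Transition("retry_scheduled", now + delay)
```

Because the result carries an absolute `next_attempt_at`, it can be written to the delivery row and picked up by any worker after a restart, which is the point of avoiding in-memory timers.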
What Was Implemented
- Explicit delivery states: pending, delivering, delivered, retry_scheduled, failed, dead_lettered.
- DB-driven retry scheduling.
- Transactional claiming with lock ownership and expiry.
- Persisted attempts with timing, duration, HTTP status, retryability, event references, and redacted error summaries.
- Dead-letter transition and audit path.
- Retry/dead-letter metrics.
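Transactional claiming with lock ownership and expiry can be sketched as follows. To stay self-contained and runnable, the example uses SQLite from the Python standard library as a stand-in for Postgres. The table and column names are hypothetical and are not taken from the 005_delivery_retry_state.sql migration. On Postgres the selection step would more typically use SELECT ... FOR UPDATE SKIP LOCKED so that concurrent workers skip rows another transaction has locked.

```python
import sqlite3

# Hypothetical schema for illustration only.
SCHEMA = """
CREATE TABLE deliveries (
    id INTEGER PRIMARY KEY,
    tenant_id TEXT NOT NULL,
    state TEXT NOT NULL,
    next_attempt_at REAL,
    lock_owner TEXT,
    lock_expires_at REAL
)
"""

SELECT_DUE = """
SELECT id FROM deliveries
 WHERE tenant_id = ?
   AND ((state IN ('pending', 'retry_scheduled') AND next_attempt_at <= ?)
        OR (state = 'delivering' AND lock_expires_at <= ?))
 ORDER BY next_attempt_at, id
 LIMIT ?
"""

CLAIM = """
UPDATE deliveries
   SET state = 'delivering', lock_owner = ?, lock_expires_at = ?
 WHERE id = ?
"""


def claim_due(conn, tenant_id, worker_id, now, lease_seconds=30.0, limit=10):
    """Claim due deliveries for one tenant and return their ids.

    Rows stuck in 'delivering' with an expired lock are reclaimed, which is
    how work survives a worker crash.
    """
    with conn:  # commits on success, rolls back on error
        # Take the write lock up front so select-then-update is atomic.
        conn.execute("BEGIN IMMEDIATE")
        ids = [row[0] for row in conn.execute(SELECT_DUE, (tenant_id, now, now, limit))]
        conn.executemany(CLAIM, [(worker_id, now + lease_seconds, i) for i in ids])
    return ids
```

In this sketch a second worker polling while the lease is still valid gets nothing, and it picks the rows up only after `lock_expires_at` has passed. The lease length therefore bounds how long a crashed worker can hold work.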
Components Involved
- db/postgres/005_delivery_retry_state.sql
- poc/ingest_api/integration_delivery_service.py
- services/worker_service/runner.py
- Integrations console timeline components.
APIs / Events / Schemas
- Delivery read fields are exposed through existing integrations APIs.
- Delivery events and AsyncAPI docs include delivery lifecycle behavior.
- No replay API is implemented yet.
Deployment Notes
Retry polling is worker-owned and configurable through worker environment variables. Production-shaped retry requires Postgres and the worker service.
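As an illustration of environment-driven configuration, the sketch below loads polling settings and refuses to start without an explicit tenant list. The variable names (`RETRY_POLL_TENANT_IDS`, `RETRY_POLL_INTERVAL_SECONDS`, `RETRY_POLL_BATCH_SIZE`) and the defaults are hypothetical. Check the worker service for the names it actually reads.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class RetryPollConfig:
    tenant_ids: Tuple[str, ...]
    poll_interval_seconds: float
    batch_size: int


def load_retry_poll_config(env: Optional[Mapping[str, str]] = None) -> RetryPollConfig:
    """Build the polling config from environment variables (names are assumptions)."""
    env = os.environ if env is None else env

    raw_tenants = env.get("RETRY_POLL_TENANT_IDS", "")
    tenant_ids = tuple(t.strip() for t in raw_tenants.split(",") if t.strip())
    if not tenant_ids:
        # Fail closed: never poll across all tenants by default.
        raise ValueError("RETRY_POLL_TENANT_IDS must list at least one tenant")

    interval = float(env.get("RETRY_POLL_INTERVAL_SECONDS", "5"))
    batch_size = int(env.get("RETRY_POLL_BATCH_SIZE", "10"))
    if interval <= 0 or batch_size <= 0:
        raise ValueError("poll interval and batch size must be positive")

    return RetryPollConfig(tenant_ids, interval, batch_size)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test, for example `load_retry_poll_config({"RETRY_POLL_TENANT_IDS": "tenant-a"})`.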
Security / Tenant Isolation
- Delivery rows and attempts are tenant-scoped.
- Worker retry polling requires explicit tenant configuration.
- Disabled or missing destinations fail safely.
- Error summaries are bounded and redacted.
- No secret values are persisted in attempts.
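One way to keep error summaries bounded and redacted is sketched below. The patterns and the 200-character limit are assumptions chosen for illustration, and the function name `summarize_error` is hypothetical. The delivery service's real redaction rules may be stricter.

```python
import re

MAX_SUMMARY_CHARS = 200  # hypothetical bound

# Assumption: a small set of patterns that commonly carry secrets.
_REDACTIONS = [
    # "Authorization: Bearer abc", "api_key=abc", "token: abc", ...
    (
        re.compile(
            r"\b(authorization|api[_-]?key|token|secret|password)\b(\s*[:=]\s*)(bearer\s+)?\S+",
            re.IGNORECASE,
        ),
        r"\1\2[REDACTED]",
    ),
    # Credentials embedded in a URL: scheme://user:pass@host
    (re.compile(r"(?<=://)[^/\s:@]+:[^/\s@]+(?=@)"), "[REDACTED]"),
]


def summarize_error(message: object, limit: int = MAX_SUMMARY_CHARS) -> str:
    """Return a single-line, redacted, length-bounded summary of an error."""
    text = " ".join(str(message).split())  # collapse newlines and repeated spaces
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    # Truncate after redaction so a cut never leaves half a secret behind.
    if len(text) > limit:
        text = text[: max(0, limit - 3)] + "..."
    return text
```

Redacting before truncating matters: cutting first could split a token so that it no longer matches a pattern and a fragment of it gets stored.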
Validation Steps
UI Validation
- Open /console/integrations.
- Select a destination with retry or dead-letter history.
- Confirm the attempt timeline shows status, attempt number, retryability, and redacted error summary.
- Confirm dead-lettered deliveries show an explanation and do not offer replay controls.
API Validation
Use the integration delivery read endpoints documented in OpenAPI and confirm delivery state and attempts are tenant-scoped.
Smoke Validation
make test-delivery-retry
make smoke-delivery-retry
Known Limitations
- DLQ browser and replay tooling are not implemented.
- Alerting dashboards for retry queues are future work.
- External ticketing adapters are not implemented.
Follow-Up Work
- Add DLQ browser/replay operations once safe API contracts are defined.
- Add receiver replay-window fixtures.
- Add production dashboards and alerting.
Acceptance Criteria Mapping
| Acceptance criterion | Evidence |
|---|---|
| Retry is durable | DB retry state migration and tests |
| Duplicate retry execution is controlled | Transactional claiming tests |
| Dead-letter is auditable | Delivery service audit path |
| UI shows retry/dead-letter state | /console/integrations |