Observability
Current state:
- Structured JSON logs from API and worker.
- Correlation IDs link event, audit, evidence, and delivery records.
- Worker logs include lifecycle, event received/processed/failed, retry claimed/completed, dead-letter, and tenant context failure events.
- Worker exposes `/metrics`, `/health`, and `/ready` on `ONEPROTECT_METRICS_PORT` when `ONEPROTECT_METRICS_ENABLED=true`.
- Implemented metrics include EventBus publish/consume counters, event processing failures/duration, delivery attempts/failures/dead letters, retry queue due count, retry claim failures, and worker health.
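To make the worker endpoint behaviour above concrete, here is a minimal sketch in Python using only the standard library. It is not the worker's actual implementation: the metric names, the default port `9090`, the case-insensitive flag check, and the `route`/`serve` helpers are all assumptions for illustration. Only the three paths and the two environment variable names come from this document.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical metric names; the real names are not given in this document.
COUNTERS = {"worker_events_processed_total": 0, "worker_events_failed_total": 0}


def metrics_enabled(env):
    """True only when ONEPROTECT_METRICS_ENABLED is 'true' (case-insensitive, an assumption)."""
    return env.get("ONEPROTECT_METRICS_ENABLED", "").strip().lower() == "true"


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name in sorted(counters):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {counters[name]}")
    return "\n".join(lines) + "\n"


def route(path, ready, counters):
    """Map a request path to (status, body) without touching the network."""
    if path == "/health":
        return 200, "ok\n"
    if path == "/ready":
        return (200, "ready\n") if ready else (503, "not ready\n")
    if path == "/metrics":
        return 200, render_metrics(counters)
    return 404, "not found\n"


class Handler(BaseHTTPRequestHandler):
    ready = True  # a real worker would set this once its dependencies are reachable

    def do_GET(self):
        status, body = route(self.path, self.ready, COUNTERS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(env=os.environ):
    """Start the listener only when enabled; 9090 is an assumed default port."""
    if not metrics_enabled(env):
        return None
    port = int(env.get("ONEPROTECT_METRICS_PORT", "9090"))
    ThreadingHTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Keeping `route` free of network code means the liveness/readiness split (`/health` always 200, `/ready` 503 until dependencies are up) can be unit-tested without binding a port. Calling `serve()` with the flag unset returns immediately, which mirrors the "only when enabled" condition.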
Planned:
- OpenTelemetry Collector.
- Prometheus scraping and alert rules.
- Loki logs.
- Grafana dashboards.
- Future ClickHouse analytics.
- Exact NATS consumer lag metric from JetStream consumer metadata.
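As an illustration of the planned consumer lag metric, the sketch below derives lag from a JetStream consumer info payload. JetStream reports `num_pending` (messages not yet delivered to the consumer) and `num_ack_pending` (delivered but not yet acknowledged) in consumer info. Which of these, or their sum, the project will treat as "lag" is not stated here, so the function returns all three. The `consumer_lag` helper and the sample values are hypothetical.

```python
def consumer_lag(info):
    """Return (pending, in_flight, total) from a JetStream consumer info dict.

    `info` is assumed to be the decoded JSON of a consumer info response.
    Missing fields are treated as 0.
    """
    pending = int(info.get("num_pending", 0))        # not yet delivered
    in_flight = int(info.get("num_ack_pending", 0))  # delivered, awaiting ack
    return pending, in_flight, pending + in_flight


# Sample payload with made-up values, trimmed to the fields used above.
sample = {"num_pending": 120, "num_ack_pending": 8, "num_redelivered": 2}
pending, in_flight, total = consumer_lag(sample)
print(f"pending={pending} in_flight={in_flight} total={total}")
```

Reporting the undelivered and unacknowledged counts separately helps distinguish a backlog building up in the stream from a worker that is receiving messages but failing to acknowledge them.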
First dashboards should cover API health, worker health, event processing, delivery failures, audit volume, tenant/RLS errors, NATS consumer lag, and Postgres health.