Observability

Current state:

  • Structured JSON logs from API and worker.
  • Correlation IDs link events, audit records, evidence, and delivery attempts.
  • Worker logs cover lifecycle events, event received/processed/failed, retry claimed/completed, dead-letter, and tenant-context failures.
  • Worker exposes /metrics, /health, and /ready on ONEPROTECT_METRICS_PORT when ONEPROTECT_METRICS_ENABLED=true.
  • Implemented metrics include EventBus publish/consume counters, event processing failures/duration, delivery attempts/failures/dead letters, retry queue due count, retry claim failures, and worker health.
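
As a rough illustration of the structured-logging item above, the following Python sketch emits one JSON object per log line and carries a correlation ID through related entries. It is not the project's logger: the field names (ts, level, logger, event, correlation_id, tenant_id) and the helper names JsonFormatter and build_logger are assumptions made for the example.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
            # Hypothetical field names; supplied via `extra=` at the call site.
            "correlation_id": getattr(record, "correlation_id", None),
            "tenant_id": getattr(record, "tenant_id", None),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("worker")
    cid = str(uuid.uuid4())  # one ID shared by every entry for the same event
    log.info("event.received", extra={"correlation_id": cid, "tenant_id": "t-1"})
    log.info("event.processed", extra={"correlation_id": cid, "tenant_id": "t-1"})
```

Because every entry for one event shares the same correlation_id value, a log query on that field can reconstruct the path of the event across components.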

Planned:

  • OpenTelemetry Collector.
  • Prometheus dashboards and alerts.
  • Loki logs.
  • Grafana dashboards.
  • ClickHouse analytics (later phase).
  • Exact NATS consumer lag metric from JetStream consumer metadata.
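
One possible shape for the consumer lag item above is sketched here in Python. It assumes the JetStream consumer-info payload is available as a dict and reads its num_pending (not yet delivered) and num_ack_pending (delivered but not yet acknowledged) fields. Treating the sum of the two as "lag" is an assumption of this sketch, and ConsumerLag and lag_from_consumer_info are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerLag:
    undelivered: int  # messages not yet delivered to the consumer
    unacked: int      # messages delivered but not yet acknowledged

    @property
    def total(self) -> int:
        # Assumed definition of lag: everything not yet fully processed.
        return self.undelivered + self.unacked


def lag_from_consumer_info(info: dict) -> ConsumerLag:
    """Derive lag from a JetStream consumer-info payload given as a dict."""
    return ConsumerLag(
        undelivered=int(info.get("num_pending", 0)),
        unacked=int(info.get("num_ack_pending", 0)),
    )


if __name__ == "__main__":
    sample = {"num_pending": 120, "num_ack_pending": 8}  # made-up values
    lag = lag_from_consumer_info(sample)
    print(lag.undelivered, lag.unacked, lag.total)
```

Keeping the two components separate lets a dashboard distinguish a backlog that has not been fetched from work that was fetched but is stuck before acknowledgement.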

The first dashboards should cover API health, worker health, event processing, delivery failures, audit volume, tenant/RLS errors, NATS consumer lag, and Postgres health.