Architectural Hardening Catalogue
Living catalogue of architectural/security findings discovered by read-only audit
across the validator-note journeys (vn-01 … vn-12), plus their remediation status.
We fix one issue per commit on fix/flow-architectural-hardening, mark the row
Fixed with its commit, and update the relevant validator note in parallel.
Bug classes: (1) async/sync misuse, (2) contract/enum drift, (3) tenant-isolation, (4) fail-open authz, (5) idempotency/concurrency, (6) resource safety, (7) secret/crypto handling, (8) input validation, (RBAC) persona enforcement, (JFP) journey failure-point coverage.
Status legend
- Open — confirmed, not yet fixed.
- Fixed (<sha>) — remediated with code + tests + docs in the named commit.
- By design — intentional; documented, no code change.
Findings
| # | Pri | Journey | Location | Class | Issue | Status |
|---|---|---|---|---|---|---|
| 1 | P0 | vn-11 | services/common/webhook_adapter.py _target_url; integration_config_service.py:96 | 8 SSRF | Any http(s):// target accepted; no deny-list for RFC-1918 / link-local / loopback / 169.254.169.254. Tenant admin → cloud-metadata & internal-service access | Fixed — services/common/url_safety.py guard at destination create/update + send-time resolution check; opt-out ONEPROTECT_WEBHOOK_ALLOW_PRIVATE_TARGETS |
| 2 | P1 | vn-06 | scim_service.py:40-77 | 7 secret | Configured bearer_token validated then discarded; secret_ref resolved deterministically by provider. Re-classified By design for the POC secret model (no store API); revisit when a real vault provider lands | By design |
| 3 | P1 | vn-06/10 | scim_service.py _recompute_user_roles | RBAC/5 | DELETE FROM tenant_user_roles WHERE tenant_id,user_id (no source filter) then reinserts only SCIM roles → silently wipes manually-granted roles | Fixed — added source column (migration 020 + SQLite mirror, default manual); SCIM recompute/deprovision scope DELETE to source='scim' and insert with source='scim'; regression test_manual_role_survives_scim_recompute |
| 4 | P2 | vn-09 | intune_service.py _managed_devices | 6 | Graph pagination while url: … nextLink has per-call timeout but no page/item cap → cyclic/poisoned nextLink = unbounded loop+memory | Fixed — bounded by _MAX_SYNC_PAGES (200) and _MAX_SYNC_DEVICES (20000); test test_managed_devices_pagination_is_bounded |
| 5 | P2 | vn-02 | ssh_broker_service.py:256 | 5 | sequence = command_count+1 read-then-increment races UNIQUE(session,seq) → unhandled IntegrityError 500 | Open |
| 6 | P2 | vn-11 | integration_config_service.py:290; integration_delivery_service.py:609 | 6 | max_attempts/timeout/backoff have no upper bound; exponential backoff can reach 60 s × 2^19 ≈ 31.5 million s, about a year | Open |
| 7 | P2 | vn-01 | agent_enrollment_service.py:176 | 5 | Token uses+1 check-then-act without WHERE uses=? → token-use count race | Open |
| 8 | P2 | vn-05 | compliance_export_service.py:74-153 | 6 | In-memory PDF/CSV build, no row/size cap or timeout; artifacts stored inline in a DB row | Open |
| 9 | P3 | vn-01 | enrollment-exchange route | 6/4 | Token-as-credential endpoint has no rate-limit/throttle → token brute-force | Open |
| 10 | P3 | vn-06 | scim_service.py; api_models.py:360,379 | 6 | members/mappings arrays no max_length; recompute loop unbatched | Open |
| 11 | P3 | vn-09 | intune_service.py:313 | 7 | _safe_failure_summary redacts via substring replace → bypassable | Open |
| 12 | P3 | vn-04 | compliance_policy_service.py:103 | 5 | Fork check-then-insert TOCTOU; no UNIQUE(tenant,baseline) | Open |
| 13 | P3 | vn-05 | auditor_session_service.py:69 | 5 | No idempotency guard → duplicate active auditor sessions | Open |
| 14 | P3 | vn-01/03 | database.py / migrations | 2 | No DB CHECK on safety_profile, enrollment-token status (code constants guard, no DB parity) | Open |
| 15 | P3 | vn-10/05 | platform_admin_service.py:68; evidence_service.py:11 | 6/8 | list_tenants unpaginated; evidence summary no length cap | Open |
| 16 | P3 | vn-02 | http_routes.py:1033 | RBAC | 409 if str(exc) in {…} else 403 — operational preconditions raised as PermissionError; status chosen by matching the message string (fragile) | Open |
| 17 | P1 | vn-06/09 | scim_service.py create+update; intune_service.py create+update | 2 | int(<bool>) bound into a Postgres boolean column (fail_closed_on_conflict, sync_enabled) → works on SQLite, but Postgres raises "… is of type boolean but expression is of type smallint". Blocked SCIM + Intune connection create in the deployed env | Fixed — bind native bool; guard test test_boolean_param_types (asserts value is True, which fails for int(1)) |
| 18 | P1 | portal/auth | frontend/src/lib/console-auth.ts requireConsoleSession; no middleware.ts | 4 | Console pages only read the access-token cookie (lifetime = Keycloak access lifespan ~5 min) and redirect to /login on expiry without using the still-valid refresh token → forced re-login every ~5 min | Fixed — added frontend/src/middleware.ts that refreshes via the refresh-token cookie (persisting rotation) before render; tests in middleware.test.ts |
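
The deny-list check described in finding 1 can be sketched roughly as follows. This is a minimal illustration, not the contents of services/common/url_safety.py: the function names `resolve_host`, `is_internal_address`, and `is_safe_target_url`, the injectable `resolver` parameter, and the `allow_private` flag (standing in for however ONEPROTECT_WEBHOOK_ALLOW_PRIVATE_TARGETS is read) are all assumptions made for the example.

```python
import ipaddress
import socket
from typing import Callable, Iterable, Union
from urllib.parse import urlsplit

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def resolve_host(host: str) -> list[IPAddress]:
    """Resolve a hostname to every address it maps to (A and AAAA records)."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # sockaddr[0] is the address string; drop any IPv6 zone suffix ("%eth0").
    return [ipaddress.ip_address(info[4][0].split("%")[0]) for info in infos]


def is_internal_address(ip: IPAddress) -> bool:
    """True for loopback, RFC 1918, link-local (incl. 169.254.169.254), etc."""
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        # Unwrap ::ffff:a.b.c.d so an IPv6 literal cannot hide an IPv4 target.
        ip = mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_safe_target_url(
    url: str,
    resolver: Callable[[str], Iterable[IPAddress]] = resolve_host,
    allow_private: bool = False,  # hypothetical stand-in for the env opt-out
) -> bool:
    """Accept only http(s) URLs whose host resolves exclusively to public IPs."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    if allow_private:
        return True
    try:
        # IP-literal hosts are checked directly, without a DNS lookup.
        addresses = [ipaddress.ip_address(parts.hostname)]
    except ValueError:
        try:
            addresses = list(resolver(parts.hostname))
        except OSError:
            return False  # unresolvable hosts fail closed
    # Every resolved address must be public; one internal address rejects the URL.
    return bool(addresses) and not any(is_internal_address(ip) for ip in addresses)
```

A check like this would run both when a destination is created or updated and again at send time, since a hostname that resolved to a public address at creation can later point at an internal one. A send-time check only narrows that window; closing it fully would mean connecting to the address that was actually checked.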
By-design / out-of-scope (no code change)
- vn-05 POST /compliance/exports allows ROLE_AUDITOR to create exports (operator excluded) — intentional per VN-05 (auditors generate evidence). Documented carve-out from auditor-read-only.
- vn-11 webhook request signing off by default in local/dev/test — prod enforces; demo gap only.
- vn-12 disabled-key runtime fail-closed enforcement explicitly deferred per VN-12 scope.
Confirmed clean
- async/sync misuse (only the already-fixed SSH gateway)
- tenant isolation (every audited query filters tenant_id; enrollment exchange derives tenant from the token)
- per-tenant encryption keys (refs only, no key bytes in events/audit)
- ClickHouse log search (parameterized, tenant-scoped)
- persona/RBAC on all ~45 mutating routes (auditor strictly read-only except the export carve-out; operator cannot reach admin config; no swallowed 403s)
- journey failure-points vn-01…12 enforced fail-closed with correct status
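
For reference, the "parameterized, tenant-scoped" property the audit looked for has roughly the following shape. This is a generic sketch using the standard-library sqlite3 module rather than the project's Postgres or ClickHouse clients; the `sessions` table, its columns, and the helper names are hypothetical and exist only for the example.

```python
import sqlite3


def list_session_ids(conn: sqlite3.Connection, tenant_id: str, status: str) -> list[str]:
    """Return session ids for one tenant, filtered by status."""
    rows = conn.execute(
        # tenant_id is always part of the WHERE clause, and both values are
        # bound as parameters rather than interpolated into the SQL text.
        "SELECT id FROM sessions WHERE tenant_id = ? AND status = ? ORDER BY id",
        (tenant_id, status),
    ).fetchall()
    return [row[0] for row in rows]


def demo_connection() -> sqlite3.Connection:
    """Build an in-memory database with rows for two tenants (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sessions ("
        "id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO sessions VALUES (?, ?, ?)",
        [
            ("s1", "tenant-a", "active"),
            ("s2", "tenant-a", "closed"),
            ("s3", "tenant-b", "active"),
        ],
    )
    return conn
```

With the demo data, a query for tenant-a never returns tenant-b's row, and a status value containing SQL metacharacters is treated as a literal string that matches nothing.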