Architectural Hardening Catalogue

Living catalogue of architectural/security findings discovered by read-only audit across the validator-note journeys (vn-01 … vn-12), plus their remediation status. We fix one issue per commit on fix/flow-architectural-hardening, mark the row Fixed with its commit, and update the relevant validator note in parallel.

Bug classes: (1) async/sync misuse, (2) contract/enum drift, (3) tenant-isolation, (4) fail-open authz, (5) idempotency/concurrency, (6) resource safety, (7) secret/crypto handling, (8) input validation, (RBAC) persona enforcement, (JFP) journey failure-point coverage.

Status legend

  • Open — confirmed, not yet fixed.
  • Fixed (<sha>) — remediated with code + tests + docs in the named commit.
  • By design — intentional; documented, no code change.

Findings

| # | Pri | Journey | Location | Class | Issue | Status |
|---|---|---|---|---|---|---|
| 1 | P0 | vn-11 | services/common/webhook_adapter.py _target_url; integration_config_service.py:96 | 8 SSRF | Any http(s):// target accepted; no deny-list for RFC-1918 / link-local / loopback / 169.254.169.254. Tenant admin → cloud-metadata & internal-service access | Fixed: services/common/url_safety.py guard at destination create/update + send-time resolution check; opt-out ONEPROTECT_WEBHOOK_ALLOW_PRIVATE_TARGETS |
| 2 | P1 | vn-06 | scim_service.py:40-77 | 7 secret | Configured bearer_token validated then discarded; secret_ref resolved deterministically by provider. Re-classified By design for the POC secret model (no store API); revisit when a real vault provider lands | By design |
| 3 | P1 | vn-06/10 | scim_service.py _recompute_user_roles | RBAC/5 | DELETE FROM tenant_user_roles WHERE tenant_id,user_id (no source filter) then reinserts only SCIM roles → silently wipes manually-granted roles | Fixed — added source column (migration 020 + SQLite mirror, default manual); SCIM recompute/deprovision scope DELETE to source='scim' and insert with source='scim'; regression test_manual_role_survives_scim_recompute |
| 4 | P2 | vn-09 | intune_service.py _managed_devices | 6 | Graph pagination while url: … nextLink has per-call timeout but no page/item cap → cyclic/poisoned nextLink = unbounded loop+memory | Fixed — bounded by _MAX_SYNC_PAGES (200) and _MAX_SYNC_DEVICES (20000); test test_managed_devices_pagination_is_bounded |
| 5 | P2 | vn-02 | ssh_broker_service.py:256 | 5 | sequence = command_count+1 read-then-increment races UNIQUE(session,seq) → unhandled IntegrityError 500 | Open |
| 6 | P2 | vn-11 | integration_config_service.py:290; integration_delivery_service.py:609 | 6 | max_attempts/timeout/backoff have no upper bound; exp backoff 60×2^19 ≈ a year | Open |
| 7 | P2 | vn-01 | agent_enrollment_service.py:176 | 5 | Token uses+1 check-then-act without WHERE uses=? → token-use count race | Open |
| 8 | P2 | vn-05 | compliance_export_service.py:74-153 | 6 | In-memory PDF/CSV build, no row/size cap or timeout; artifacts stored inline in a DB row | Open |
| 9 | P3 | vn-01 | enrollment-exchange route | 6/4 | Token-as-credential endpoint has no rate-limit/throttle → token brute-force | Open |
| 10 | P3 | vn-06 | scim_service.py; api_models.py:360,379 | 6 | members/mappings arrays no max_length; recompute loop unbatched | Open |
| 11 | P3 | vn-09 | intune_service.py:313 | 7 | _safe_failure_summary redacts via substring replace → bypassable | Open |
| 12 | P3 | vn-04 | compliance_policy_service.py:103 | 5 | Fork check-then-insert TOCTOU; no UNIQUE(tenant,baseline) | Open |
| 13 | P3 | vn-05 | auditor_session_service.py:69 | 5 | No idempotency guard → duplicate active auditor sessions | Open |
| 14 | P3 | vn-01/03 | database.py / migrations | 2 | No DB CHECK on safety_profile, enrollment-token status (code constants guard, no DB parity) | Open |
| 15 | P3 | vn-10/05 | platform_admin_service.py:68; evidence_service.py:11 | 6/8 | list_tenants unpaginated; evidence summary no length cap | Open |
| 16 | P3 | vn-02 | http_routes.py:1033 | RBAC | 409 if str(exc) in {…} else 403 — operational preconditions raised as PermissionError; status chosen by matching the message string (fragile) | Open |
| 17 | P1 | vn-06/09 | scim_service.py create+update; intune_service.py create+update | 2 | int(<bool>) bound into a Postgres boolean column (fail_closed_on_conflict, sync_enabled) → works on SQLite, raises "is of type boolean but expression is of type smallint" on Postgres. Blocked SCIM + Intune connection create in the deployed env | Fixed — bind native bool; guard test test_boolean_param_types (asserts value is True, which fails for int(1)) |
| 18 | P1 | portal/auth | frontend/src/lib/console-auth.ts requireConsoleSession; no middleware.ts | 4 | Console pages only read the access-token cookie (lifetime = Keycloak access lifespan ~5 min) and redirect to /login on expiry without using the still-valid refresh token → forced re-login every ~5 min | Fixed — added frontend/src/middleware.ts that refreshes via the refresh-token cookie (persisting rotation) before render; tests in middleware.test.ts |
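
Finding 1 describes a guard that rejects webhook targets resolving to private, loopback, or link-local addresses, both at destination create/update and again at send time. The following is a minimal sketch of that kind of check, not the contents of services/common/url_safety.py. The names is_blocked_address and validate_webhook_target, the allow_private parameter (a stand-in for the ONEPROTECT_WEBHOOK_ALLOW_PRIVATE_TARGETS opt-out), and the injectable resolve argument are all assumptions made for illustration.

```python
import ipaddress
import socket
from typing import Callable, Union
from urllib.parse import urlsplit

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def is_blocked_address(ip: IPAddress) -> bool:
    """True for loopback, private (RFC 1918), link-local and other non-public ranges."""
    # 169.254.169.254 (cloud metadata) falls inside the link-local range.
    return (
        ip.is_loopback
        or ip.is_private
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def validate_webhook_target(
    url: str,
    allow_private: bool = False,  # hypothetical stand-in for the documented opt-out
    resolve: Callable = socket.getaddrinfo,  # injectable so tests avoid real DNS
) -> None:
    """Raise ValueError unless url is http(s) and resolves only to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("webhook target must be an absolute http(s) URL")
    if allow_private:
        return
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # Check every resolved address: one blocked record is enough to reject.
    for info in resolve(parts.hostname, port, type=socket.SOCK_STREAM):
        ip = ipaddress.ip_address(info[4][0])
        if is_blocked_address(ip):
            raise ValueError(f"webhook target resolves to blocked address {ip}")
```

Running the same function at create/update time and again immediately before delivery covers a hostname whose DNS record changes after it was saved. A stricter variant would also connect to the exact address that was checked rather than resolving a second time.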
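
The fix for finding 3 scopes the SCIM recompute to rows it owns, so manually granted roles survive. A rough sketch of that pattern against SQLite follows. The table name tenant_user_roles, the source column, and its manual default come from the finding; the role column, the rest of the schema, and the function name recompute_scim_roles are assumptions for illustration, not the code in scim_service.py.

```python
import sqlite3
from typing import Iterable

# Assumed minimal schema; only tenant_user_roles and source are named in the finding.
SCHEMA = """
CREATE TABLE tenant_user_roles (
    tenant_id TEXT NOT NULL,
    user_id   TEXT NOT NULL,
    role      TEXT NOT NULL,
    source    TEXT NOT NULL DEFAULT 'manual'
)
"""


def recompute_scim_roles(
    conn: sqlite3.Connection, tenant_id: str, user_id: str, scim_roles: Iterable[str]
) -> None:
    """Replace only the SCIM-sourced roles for one user; leave manual grants alone."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "DELETE FROM tenant_user_roles "
            "WHERE tenant_id = ? AND user_id = ? AND source = 'scim'",
            (tenant_id, user_id),
        )
        conn.executemany(
            "INSERT INTO tenant_user_roles (tenant_id, user_id, role, source) "
            "VALUES (?, ?, ?, 'scim')",
            [(tenant_id, user_id, role) for role in scim_roles],
        )
```

Calling the function with an empty role list behaves like a deprovision: all SCIM-sourced rows for that user are removed and nothing is reinserted.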
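
Finding 4 bounds the Graph pagination loop by page count and item count. The sketch below shows one way such a bound can work, assuming the loop raises when a limit is hit (the real _managed_devices may truncate or log instead). The limit values mirror _MAX_SYNC_PAGES and _MAX_SYNC_DEVICES from the finding; fetch_all_devices, SyncLimitExceeded, and the fetch_page callback are hypothetical names.

```python
from typing import Any, Callable, Dict, List

MAX_SYNC_PAGES = 200       # mirrors _MAX_SYNC_PAGES in finding 4
MAX_SYNC_DEVICES = 20000   # mirrors _MAX_SYNC_DEVICES in finding 4


class SyncLimitExceeded(RuntimeError):
    """Raised when pagination exceeds the configured page or item cap."""


def fetch_all_devices(
    first_url: str,
    fetch_page: Callable[[str], Dict[str, Any]],  # hypothetical: performs one GET with its own timeout
    max_pages: int = MAX_SYNC_PAGES,
    max_devices: int = MAX_SYNC_DEVICES,
) -> List[Dict[str, Any]]:
    devices: List[Dict[str, Any]] = []
    url = first_url
    pages = 0
    while url:
        # A cyclic or attacker-controlled nextLink stops here instead of looping forever.
        if pages >= max_pages:
            raise SyncLimitExceeded(f"more than {max_pages} pages")
        page = fetch_page(url)
        pages += 1
        devices.extend(page.get("value", []))
        if len(devices) > max_devices:
            raise SyncLimitExceeded(f"more than {max_devices} devices")
        url = page.get("@odata.nextLink")
    return devices
```

The per-call timeout limits how long one request can hang, while the two caps limit how many requests are made and how much memory the accumulated result can take.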
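
The "≈ a year" figure in finding 6 can be checked directly: 60 × 2^19 = 31,457,280 seconds, which is about 364 days. The sketch below reproduces that arithmetic and shows how a ceiling changes it. It assumes the delay doubles per attempt starting from 60 seconds; the function name backoff_delay and the one-hour cap are illustrative only and are not values taken from the codebase.

```python
BASE_DELAY_SECONDS = 60
EXAMPLE_MAX_DELAY_SECONDS = 3600  # hypothetical ceiling, for illustration only


def backoff_delay(attempt: int, base: int = BASE_DELAY_SECONDS, cap=None) -> int:
    """Delay before retry number `attempt` (1-based), doubling each time."""
    delay = base * 2 ** (attempt - 1)
    return delay if cap is None else min(delay, cap)


uncapped = backoff_delay(20)  # 60 * 2**19 seconds
capped = backoff_delay(20, cap=EXAMPLE_MAX_DELAY_SECONDS)
print(uncapped, round(uncapped / 86400, 1), capped)
```

A cap on the delay only helps if max_attempts is bounded as well; otherwise retries continue indefinitely at the capped interval.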

By-design / out-of-scope (no code change)

  • vn-05 POST /compliance/exports allows ROLE_AUDITOR to create exports (operator excluded) — intentional per VN-05 (auditors generate evidence). Documented carve-out from auditor-read-only.
  • vn-11 webhook request signing off by default in local/dev/test — prod enforces; demo gap only.
  • vn-12 disabled-key runtime fail-closed enforcement explicitly deferred per VN-12 scope.

Confirmed clean

  • async/sync misuse (only the already-fixed SSH gateway)
  • tenant isolation (every audited query filters tenant_id; enrollment exchange derives tenant from the token)
  • per-tenant encryption keys (refs only, no key bytes in events/audit)
  • ClickHouse log search (parameterized, tenant-scoped)
  • persona/RBAC on all ~45 mutating routes (auditor strictly read-only except the export carve-out; operator cannot reach admin config; no swallowed 403s)
  • journey failure-points vn-01…12 enforced fail-closed with correct status