AWS Platform Foundation

Status

Partial: the AWS production platform foundation, bootstrap/state strategy, CI/CD readiness, docs-site hosting scaffold, and AWS dev runtime deploy readiness are in place. Application workloads are still deployed only through an explicit, protected pipeline run.

  • SRS references: scalable SaaS platform deployment.
  • Client response references: AWS is Phase 1 primary; Azure failover is Phase 2 evaluation; operational simplicity remains a constraint.
  • ADR references: AWS deployment ADR, AWS bootstrap/state ADR, service mesh note.
  • Task board references: OP-D039, OP-D040, OP-D041, OP-D042, OP-043, OP-044.

Problem Statement

OneProtect needs a safe, isolated AWS path that does not disturb existing workloads, does not use long-lived CI keys, and gates real infrastructure/app deployments behind reviewed IaC and CI/CD.

Architectural Intent

AWS dev is the first target. Terraform/OpenTofu owns infrastructure, Helm owns application deployment, and GitLab/GitHub CI wrappers call shared scripts. The platform starts with bootstrap/state and CI trust before VPC/EKS/RDS/app workloads.

What Was Implemented

  • AWS Phase 1 deployment ADR.
  • Service mesh decision note: no service mesh for first bootstrap.
  • infra/aws/ modules for network, EKS, RDS/Postgres, KMS, S3 evidence, ECR, IAM, observability, and docs-site hosting.
  • infra/aws/bootstrap/ for state bucket, lock table, optional KMS, and CI OIDC trust roles.
  • AWS dev Helm values.
  • GitLab-first CI/CD scripts and GitHub Actions parity skeletons.
  • AWS pre-apply checklist, first-apply runbook, CI variable matrix, drift docs, and cost/safety guardrails.
  • Docs-site S3 + CloudFront hosting scaffold with private S3 and Origin Access Control.
  • A narrow AWS dev ECR-only stack for api-service and frontend image repositories so CI image publishing can be unblocked before VPC/EKS/RDS.
  • On-demand AWS deploy gates so normal feature/develop pipelines do not publish images or deploy unless DEPLOY_AWS_DEV=true.
  • Mutable ECR BuildKit cache repositories for faster on-demand image publish.
  • Helm runtime readiness for AWS dev:
    • in-cluster NATS JetStream dev StatefulSet,
    • Postgres migration pre-install/pre-upgrade Job,
    • protected Helm values materialization,
    • runtime Kubernetes Secret creation helper.
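
The on-demand deploy gate listed above can be illustrated with a minimal sketch. The function name and the strict string comparison are assumptions made for illustration; the source only states that pipelines do not publish or deploy unless DEPLOY_AWS_DEV=true.

```python
import os
from typing import Mapping


def aws_dev_deploy_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when DEPLOY_AWS_DEV is exactly the string "true".

    Assumption: any other value ("1", "TRUE", unset) keeps the gate closed,
    so a typo never triggers an image publish or a deploy.
    """
    return env.get("DEPLOY_AWS_DEV", "") == "true"
```

A CI wrapper could call this before the publish and deploy steps and skip both when it returns False, which keeps normal feature/develop pipelines free of AWS side effects.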

Components Involved

  • infra/aws/
  • deploy/helm/oneprotect/values-aws-dev.yaml
  • .gitlab-ci.yml
  • .github/workflows/
  • scripts/ci/
  • scripts/aws/preflight-check.sh
  • scripts/aws/create-dev-k8s-secrets.sh
  • scripts/ci/materialize-helm-values.sh

APIs / Events / Schemas

No product APIs or event contracts were added by the AWS foundation work.

Deployment Notes

Current intended real-world order:

  1. Fill local untracked bootstrap tfvars.
  2. Run AWS preflight.
  3. Plan bootstrap only.
  4. Manually review the plan.
  5. Apply bootstrap only: state bucket, lock table, optional KMS, CI OIDC roles.
  6. Plan/apply the ECR-only dev stack if publish_ecr is blocked.
  7. Publish immutable images to ECR with registry-backed BuildKit cache.
  8. Prepare broader dev infra in a later reviewed plan.
  9. Create runtime Kubernetes Secrets from approved secret sources.
  10. Deploy app workloads only after protected pipeline gates are ready.

The ECR-only stack must create only:

  • oneprotect/dev/api-service
  • oneprotect/dev/frontend
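
One way to back the manual plan review is to check the planned repository names against this allowlist. The sketch below assumes the plan has been exported in Terraform/OpenTofu JSON form (for example with `terraform show -json`) and loaded into a dict; the function name is hypothetical, and the sketch inspects only `aws_ecr_repository` resources, so other resource types in the plan are out of its scope.

```python
from typing import Any

# Repository names taken from the list above.
ALLOWED_REPOS = {"oneprotect/dev/api-service", "oneprotect/dev/frontend"}


def unexpected_ecr_repos(plan: dict[str, Any]) -> list[str]:
    """List planned ECR repository creations whose name is not allowlisted."""
    unexpected = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_ecr_repository":
            continue  # this sketch only inspects ECR repositories
        change = rc.get("change") or {}
        if "create" not in change.get("actions", []):
            continue  # ignore no-op, update, and delete entries
        # "after" holds the planned attribute values; it is null on delete.
        name = (change.get("after") or {}).get("name")
        if name not in ALLOWED_REPOS:
            unexpected.append(f"{rc.get('address')}: {name}")
    return unexpected
```

An empty result means every repository the plan would create is on the allowlist; any entry is a reason to stop before apply.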

AWS dev infra plan readiness now defines the non-app apply order: ECR-only if image publishing is blocked, then VPC/foundational IAM, private RDS/Postgres, EKS, kubectl access, runtime Kubernetes Secrets, Helm render, then gated workload deploy.

AWS dev now prefers Graviton/arm64 EKS nodes for cost, keeps an x86 fallback path, and requires multi-arch OneProtect images before arm64 workload scheduling.
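
The multi-arch requirement can be checked against an image's manifest index before workloads are scheduled on arm64 nodes. This is a sketch under assumptions: it expects the OCI image index (or Docker manifest list) JSON already parsed into a dict, for instance from `docker buildx imagetools inspect --raw <image>`, and the function name and default platform pair are illustrative.

```python
from typing import Any, Iterable, Tuple

# Assumed requirement: both the Graviton target and the x86 fallback.
REQUIRED_PLATFORMS = (("linux", "amd64"), ("linux", "arm64"))


def missing_platforms(
    index: dict[str, Any],
    required: Iterable[Tuple[str, str]] = REQUIRED_PLATFORMS,
) -> list[str]:
    """Return required os/arch pairs that the image index does not provide."""
    present = set()
    for manifest in index.get("manifests", []):
        platform = manifest.get("platform") or {}
        present.add((platform.get("os"), platform.get("architecture")))
    return [f"{os_name}/{arch}" for os_name, arch in required
            if (os_name, arch) not in present]
```

A single-platform image manifest has no `manifests` array, so it reports both platforms as missing, which is the conservative outcome for this check.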

AWS dev runtime readiness now adds an in-cluster NATS JetStream dev instance and a Helm migration Job. These are dev bootstrap choices, not final production decisions for the event backbone or database operations model.

Security / Tenant Isolation

  • AWS resources are isolated under OneProtect naming and tags.
  • No real account IDs, domains, secrets, kubeconfigs, or tfvars are committed.
  • CI uses OIDC role assumption, not static AWS keys.
  • Manual console drift is prohibited except documented break-glass.
  • Docs-site bucket is private and served through CloudFront OAC when enabled.

Validation Steps

UI Validation

No application UI validation exists until AWS dev app workloads are deployed. For docs-site hosting, validate the CloudFront URL only after publishing is approved and enabled.

API Validation

No product API validation exists until app workloads are deployed to AWS dev. For bootstrap, validate Terraform/OpenTofu outputs and AWS resource names in the approved account.

Smoke Validation

make aws-preflight-check
make aws-bootstrap-plan
make aws-dev-ecr-plan-dryrun
make aws-dev-ecr-plan
make aws-dev-plan-dryrun
make aws-dev-helm-render
make aws-dev-k8s-secrets-dryrun
make docker-buildx-check
make aws-terraform-validate
make aws-iac-check
make aws-helm-template

make aws-preflight-check, make aws-bootstrap-plan, and make aws-dev-plan are expected to fail closed without real local AWS inputs and untracked tfvars.
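
The fail-closed behaviour can be sketched as follows: collect every missing input and return a non-zero exit code if any is absent, rather than continuing with defaults. The required variable (`AWS_REGION`) and the helper names are illustrative assumptions; the real preflight script may check different inputs.

```python
from pathlib import Path
from typing import Mapping

# Hypothetical list of required inputs, for illustration only.
REQUIRED_ENV = ("AWS_REGION",)


def preflight_problems(env: Mapping[str, str], tfvars: Path) -> list[str]:
    """Collect every missing input instead of stopping at the first one."""
    problems = [f"missing environment variable: {name}"
                for name in REQUIRED_ENV if not env.get(name)]
    if not tfvars.is_file():
        problems.append(f"missing untracked tfvars file: {tfvars}")
    return problems


def exit_code(problems: list[str]) -> int:
    """Fail closed: any problem yields a non-zero exit code."""
    return 1 if problems else 0
```

Reporting all problems at once keeps the first local run short, since the operator sees the full list of inputs to supply.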

Known Limitations

  • App workloads are not deployed to AWS dev by this note.
  • The ECR-only stack does not prove runtime deployment; it only prepares image repositories.
  • Runtime Kubernetes Secrets still require approved values from local/CI secret sources.
  • AWS Load Balancer Controller must be installed separately through a protected GitLab Agent job before ALB Ingress can produce a stable AWS URL.
  • Public app DNS is manual in Namecheap for watchtower-app.mergematter.io.
  • Public docs DNS is manual in Namecheap for docs.watchtower-app.mergematter.io.
  • arm64 image support must stay proven with buildx before workloads are pinned to Graviton nodes.
  • EKS vs ECS, NATS vs MSK/Kinesis, Aurora vs RDS, production IdP, OpenSearch, time-series store, and runtime KMS/S3 enforcement remain open or queued decisions. The logical tenant key model is accepted in ADR-0018.
  • Docs-site publishing is scaffolded but disabled until audience/access are approved.

Follow-Up Work

  • Execute approved bootstrap apply only.
  • Execute approved ECR-only apply if image publishing is blocked.
  • Prepare broader AWS dev infra plan readiness.
  • Configure protected GitLab OIDC variables.
  • Configure protected AWS dev Helm values file.
  • Create runtime Kubernetes Secrets.
  • Deploy app workloads only through gated Helm pipeline.

Acceptance Criteria Mapping

Acceptance criterion | Evidence
AWS primary path is scaffolded | AWS ADR and infra/aws/
CI uses OIDC, not static keys | CI/CD docs and workflow skeletons
Bootstrap is gated | First-apply runbook and preflight script
ECR-only apply is narrow | infra/aws/envs/dev-ecr/ and deployment runbooks
Runtime deploy path is gated | DEPLOY_AWS_DEV, AWS_DEV_HELM_VALUES_FILE, Helm migration/NATS templates
No app workloads deployed by docs branch | Deployment docs and task board status