Fixing Velero 'Failed to Get Backup Storage Location' S3 IAM Permission Error

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 10–20 mins


TL;DR

  • What broke: Velero's BackupStorageLocation controller cannot call s3:GetBucketLocation, s3:ListBucket, or related actions because the IAM principal Velero authenticates as (an IRSA role or an IAM user) is missing required permissions, or its policy is scoped to the wrong resource ARN.
  • How to fix it: Attach a least-privilege IAM policy granting the exact S3 actions Velero requires, scoped to your specific bucket ARN — not *.

The Incident (What Does the Error Mean?)

You will see one or more of the following in velero pod logs or in kubectl get backupstoragelocations -n velero:

time="2024-01-15T03:12:44Z" level=error msg="failed to get backup storage location" 
error="rpc error: code = Unknown desc = RequestError: send request failed\ncaused by: AccessDenied: Access Denied\n\tstatus code: 403, request id: A1B2C3D4E5F6"

BackupStorageLocation "default" — Phase: Unavailable

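Immediate consequence: Every scheduled and on-demand backup that targets this location fails, and nothing alerts you unless you are already watching for it. velero backup get shows Failed, and the BSL shows Unavailable. No new recovery points are being written, and Velero cannot read the existing ones from the bucket either. If a node failure or namespace wipe happens right now, you have no usable restore target.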
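For quick triage, the log signature above can be matched mechanically. The following is a minimal sketch, assuming the message text and the AccessDenied / 403 markers appear on one log line exactly as in the sample; the function names are hypothetical and not part of Velero.

```python
import re

# Markers taken from the sample log line above.
_BSL_ERROR = re.compile(r"failed to get backup storage location", re.IGNORECASE)
_ACCESS_DENIED = re.compile(r"AccessDenied|status code: 403")


def is_bsl_access_denied(line: str) -> bool:
    """Return True when a log line reports a BSL lookup failing with an S3 403."""
    return bool(_BSL_ERROR.search(line)) and bool(_ACCESS_DENIED.search(line))


def count_bsl_denials(lines) -> int:
    """Count matching lines in any iterable of log lines."""
    return sum(1 for line in lines if is_bsl_access_denied(line))
```

Feeding it the lines of kubectl logs deployment/velero -n velero gives a rough count of how often the controller has hit the 403. A count of zero does not prove the BSL is healthy, only that this exact signature is absent.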


The Attack Vector / Blast Radius

This is a misconfiguration that creates two simultaneous risks:

1. Operational: Zero backup coverage. Velero validates each BSL on a reconciliation loop, and a single 403 marks the whole location Unavailable. Every backup that targets it fails, including those for stateful workloads such as Postgres, Kafka, and Elasticsearch PVCs. No alert fires unless you have explicit BSL phase monitoring.

2. Security: Overly permissive fixes introduce lateral movement risk. The knee-jerk fix is attaching s3:* on arn:aws:s3:::*. This is catastrophic. A compromised Velero pod (via a malicious container image, a CVE in the Velero binary, or a stolen IRSA token) now has full read/write/delete access to every S3 bucket in the account — including CloudTrail logs, billing exports, and other backup buckets. An attacker can:

  • Exfiltrate all backup archives (they contain Kubernetes Secret manifests, kubeconfig fragments, and PVC data)
  • Delete all backup objects, destroying your DR capability
  • Overwrite backups with corrupted archives to poison future restores

Blast radius of s3:* on *: Total account-wide S3 compromise from a single Velero pod escape.
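To make the blast radius concrete, here is a rough sketch that compares what the wildcard policy and a bucket-scoped policy would allow. It is deliberately simplified: it ignores Deny statements, conditions, NotAction/NotResource, and every non-identity policy, so it illustrates wildcard matching rather than reproducing IAM evaluation. The CloudTrail bucket name is a hypothetical example.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows Action/Resource to be a string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def allows(policy: dict, action: str, resource: str) -> bool:
    """Simplified check: does any Allow statement match the action and resource?"""
    for stmt in _as_list(policy["Statement"]):
        if stmt.get("Effect") != "Allow":
            continue
        # IAM action names are case-insensitive; resource ARNs are not.
        action_ok = any(
            fnmatchcase(action.lower(), pattern.lower())
            for pattern in _as_list(stmt.get("Action", []))
        )
        resource_ok = any(
            fnmatchcase(resource, pattern)
            for pattern in _as_list(stmt.get("Resource", []))
        )
        if action_ok and resource_ok:
            return True
    return False


broad = {"Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}]}
scoped = {
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::YOUR-VELERO-BUCKET/*",
        }
    ]
}

# Hypothetical object in an unrelated bucket, used only for illustration.
other = "arn:aws:s3:::example-cloudtrail-logs/AWSLogs/2024/01/15/log.json.gz"

print(allows(broad, "s3:DeleteObject", other))   # True
print(allows(scoped, "s3:DeleteObject", other))  # False
```

The same credentials that can delete a Velero archive under the broad policy can delete the unrelated object too; under the scoped policy they cannot.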


How to Fix It

Basic Fix — Attach the Correct Minimum IAM Policy

The following actions are required by Velero for S3 BSL operation. Scope them to your specific bucket.

- {
-   "Effect": "Allow",
-   "Action": "s3:*",
-   "Resource": "*"
- }

+ {
+   "Version": "2012-10-17",
+   "Statement": [
+     {
+       "Effect": "Allow",
+       "Action": [
+         "s3:GetBucketLocation",
+         "s3:ListBucket",
+         "s3:ListBucketMultipartUploads"
+       ],
+       "Resource": "arn:aws:s3:::YOUR-VELERO-BUCKET"
+     },
+     {
+       "Effect": "Allow",
+       "Action": [
+         "s3:AbortMultipartUpload",
+         "s3:DeleteObject",
+         "s3:GetObject",
+         "s3:ListMultipartUploadParts",
+         "s3:PutObject"
+       ],
+       "Resource": "arn:aws:s3:::YOUR-VELERO-BUCKET/*"
+     }
+   ]
+ }
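If you manage several clusters, generating the policy from the bucket name avoids copy-paste mistakes in the two Resource ARNs. This is a sketch that reproduces the policy above for one bucket; the function name is hypothetical, and it assumes the standard aws partition (not aws-cn or aws-us-gov).

```python
import json


def build_velero_s3_policy(bucket: str) -> dict:
    """Return the two-statement policy shown above for a single bucket."""
    if not bucket or "*" in bucket:
        raise ValueError("bucket must be a concrete name, not a wildcard")
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions take the bucket ARN itself.
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads",
                ],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions take the bucket ARN plus "/*".
                "Effect": "Allow",
                "Action": [
                    "s3:AbortMultipartUpload",
                    "s3:DeleteObject",
                    "s3:GetObject",
                    "s3:ListMultipartUploadParts",
                    "s3:PutObject",
                ],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


print(json.dumps(build_velero_s3_policy("YOUR-VELERO-BUCKET"), indent=2))
```

The split matters: bucket-level actions attached to the /* resource, or object-level actions attached to the bare bucket ARN, are a common cause of the same 403.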

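After applying, make sure the BSL is re-validated promptly. The patch below sets validationFrequency to one minute, so the controller re-checks the location within about a minute. The setting persists until you change it: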

kubectl patch backupstoragelocation default \
  -n velero \
  --type merge \
  -p '{"spec":{"validationFrequency":"1m"}}'

# Watch until phase flips to Available
kubectl get backupstoragelocation -n velero -w
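In a script or CI job, watching interactively is not practical. The sketch below reads the phase from JSON instead; it assumes the input is the output of kubectl get backupstoragelocation -n velero -o json, and the helper names are hypothetical.

```python
import json


def bsl_phases(kubectl_json: str) -> dict:
    """Map each BackupStorageLocation name to its status.phase.

    A location that has no status yet is reported as "Unknown".
    """
    doc = json.loads(kubectl_json)
    # A "kubectl get <kind>" call returns a List with "items";
    # a call that names one object returns that object directly.
    items = doc.get("items", [doc])
    return {
        item["metadata"]["name"]: item.get("status", {}).get("phase", "Unknown")
        for item in items
    }


def unavailable(kubectl_json: str) -> list:
    """Return the sorted names of locations whose phase is not Available."""
    return sorted(
        name for name, phase in bsl_phases(kubectl_json).items() if phase != "Available"
    )
```

An empty list from unavailable() means every location reported Available at the moment the JSON was captured.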

Enterprise Best Practice — IRSA with Condition Keys (EKS)

Do not use long-lived IAM user credentials (velero-credentials secret with aws_access_key_id). Use IAM Roles for Service Accounts (IRSA).

# IAM Trust Policy — WRONG: overly broad trust
- {
-   "Effect": "Allow",
-   "Principal": {"Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.REGION.amazonaws.com/id/OIDCID"},
-   "Action": "sts:AssumeRoleWithWebIdentity"
- }

# IAM Trust Policy — CORRECT: locked to Velero service account
+ {
+   "Effect": "Allow",
+   "Principal": {
+     "Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.REGION.amazonaws.com/id/OIDCID"
+   },
+   "Action": "sts:AssumeRoleWithWebIdentity",
+   "Condition": {
+     "StringEquals": {
+       "oidc.eks.REGION.amazonaws.com/id/OIDCID:sub": "system:serviceaccount:velero:velero",
+       "oidc.eks.REGION.amazonaws.com/id/OIDCID:aud": "sts.amazonaws.com"
+     }
+   }
+ }
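The two condition keys are easy to mistype because the OIDC provider path appears three times. As a sketch, the locked-down trust statement above can be generated from its parts and wrapped in a full policy document; the function name and parameters are hypothetical, and the aws partition is assumed.

```python
def build_irsa_trust_policy(account_id, oidc_provider,
                            namespace="velero", service_account="velero"):
    """Return a trust policy that only the given service account can assume.

    oidc_provider is the issuer host and path, for example
    "oidc.eks.REGION.amazonaws.com/id/OIDCID". A leading "https://" is stripped.
    """
    provider = oidc_provider
    if provider.startswith("https://"):
        provider = provider[len("https://"):]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Pin the role to one namespace and one service account.
                        f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        # Only accept tokens issued for STS.
                        f"{provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }
```

Without the sub condition, any service account in the cluster that can obtain a token from the same OIDC provider could assume the role.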

Annotate the Velero service account:

kubectl annotate serviceaccount velero \
  -n velero \
  eks.amazonaws.com/role-arn=arn:aws:iam::ACCOUNT:role/velero-irsa-role

Add --no-secret to your Velero install so it does not mount static credentials:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket YOUR-VELERO-BUCKET \
  --backup-location-config region=us-east-1 \
  --no-secret \
  --sa-annotations eks.amazonaws.com/role-arn=arn:aws:iam::ACCOUNT:role/velero-irsa-role
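A missing annotation, or one still holding the ACCOUNT placeholder, produces the same AccessDenied symptom. The sketch below checks for that; it assumes the input is the output of kubectl get serviceaccount velero -n velero -o json, that roles live in the aws partition, and the function name is hypothetical.

```python
import json
import re

ROLE_ANNOTATION = "eks.amazonaws.com/role-arn"
# Role ARNs in the standard "aws" partition, optionally with a path.
_ROLE_ARN = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@/-]+$")


def irsa_role_arn(service_account_json: str):
    """Return the annotated role ARN, or None if it is missing or malformed."""
    sa = json.loads(service_account_json)
    annotations = sa.get("metadata", {}).get("annotations") or {}
    arn = annotations.get(ROLE_ANNOTATION)
    if arn and _ROLE_ARN.match(arn):
        return arn
    return None
```

A None result means the pod will not receive the intended role through IRSA, so fix the annotation before revisiting the IAM policy.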


Prevention in CI/CD

1. Checkov — Scan IAM policies before terraform apply:

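checkov -d ./terraform --check CKV_AWS_63,CKV_AWS_355
# CKV_AWS_63: no IAM policy document may allow "*" as a statement's action
# CKV_AWS_355: no IAM policy document may allow "*" as a resource for restrictable actions
# Check IDs can change between Checkov releases; confirm them with: checkov --list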

2. OPA/Conftest — Enforce no s3:* in Velero namespace policies:

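package velero.iam

# Action and Resource may each be a string or an array; normalize to an array
# so the rules below match both forms.
as_array(x) = x { is_array(x) }
as_array(x) = [x] { not is_array(x) }

deny[msg] {
  actions := as_array(input.Statement[_].Action)
  actions[_] == "s3:*"
  msg := "Velero IAM policy must not use wildcard s3:* action"
}

deny[msg] {
  resources := as_array(input.Statement[_].Resource)
  resources[_] == "*"
  msg := "Velero IAM policy must scope Resource to specific bucket ARN"
}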
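The Conftest rules above compare against the exact strings s3:* and *, so a bare "*" action or an any-bucket ARN such as arn:aws:s3:::* passes them. For pipelines that already run Python tests, a slightly broader lint can be sketched as follows; the function name and the set of flagged patterns are assumptions for illustration, not a complete wildcard analysis.

```python
def _as_list(value):
    """Action and Resource may be a string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def lint_policy(policy: dict) -> list:
    """Return human-readable violations for wildcard S3 actions or resources."""
    problems = []
    for i, stmt in enumerate(_as_list(policy.get("Statement", []))):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            # Flags "*" and "s3:*"; narrower patterns such as "s3:Get*" pass.
            if action.lower() in ("*", "s3:*"):
                problems.append(f"statement {i}: wildcard action {action!r}")
        for resource in _as_list(stmt.get("Resource", [])):
            # Flags "*" and any-bucket ARNs such as "arn:aws:s3:::*".
            if resource == "*" or resource.startswith("arn:aws:s3:::*"):
                problems.append(f"statement {i}: unscoped resource {resource!r}")
    return problems
```

An empty list means none of these specific patterns were found; it does not certify the policy as least-privilege.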

3. Terraform: build the policy from an aws_iam_policy_document data source scoped to the bucket, instead of an inline wildcard:

- resource "aws_iam_policy" "velero" {
-   policy = jsonencode({
-     Statement = [{ Effect = "Allow", Action = "s3:*", Resource = "*" }]
-   })
- }

+ resource "aws_iam_policy" "velero" {
+   policy = data.aws_iam_policy_document.velero_s3.json
+ }
+
+ data "aws_iam_policy_document" "velero_s3" {
+   statement {
+     actions   = ["s3:GetBucketLocation", "s3:ListBucket", "s3:ListBucketMultipartUploads"]
+     resources = ["arn:aws:s3:::${var.velero_bucket_name}"]
+   }
+   statement {
+     actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts"]
+     resources = ["arn:aws:s3:::${var.velero_bucket_name}/*"]
+   }
+ }

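4. Prometheus alerting rule (routed through Alertmanager): fire once a BSL has been in any phase other than Available for 5 minutes. Confirm that your Velero version exposes the metric used below before relying on it: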

- alert: VeleroBSLUnavailable
  expr: velero_backup_storage_location_info{phase!="Available"} == 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Velero BSL {{ $labels.backup_storage_location }} is {{ $labels.phase }}"
    runbook: "https://your-wiki/velero-bsl-iam-fix"

With the for: 5m hold, this usually fires well before the next scheduled backup window, giving you time to fix IAM before you lose a recovery point.
