Fixing ImagePullBackOff TLS Handshake Timeout on Cross-Account AWS ECR Pulls in Kubernetes
Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on whether VPC endpoints exist
TL;DR
- What broke: The Kubernetes node cannot complete the TCP and TLS handshake to xxxx.dkr.ecr.us-east-1.amazonaws.com on port 443. The traffic is dropped before ECR can answer the request, so this is a network-layer block, not an auth failure.
- How to fix it: Create or fix the com.amazonaws.us-east-1.ecr.dkr and com.amazonaws.us-east-1.ecr.api VPC Interface Endpoints, attach a security group that allows port 443 inbound from the node CIDR, grant ecr:GetAuthorizationToken on the worker node's IAM role, and ensure the cross-account ECR repository policy grants the pull actions to that role's ARN.
The Incident — What Does This Error Mean?
Raw error from kubectl describe pod:
Failed to pull image "xxxx.dkr.ecr.us-east-1.amazonaws.com/my-app:latest":
rpc error: code = Unknown desc = Error response from daemon:
Get https://xxxx.dkr.ecr.us-east-1.amazonaws.com/v2/: net/http: TLS handshake timeout
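This error fires from the container runtime (containerd or dockerd) on the worker node. The runtime tried to reach port 443 of the ECR registry endpoint, and the TLS handshake did not complete before the client's handshake timeout expired (Go's default HTTP transport allows 10s; the runtime may configure its own value). Strictly, this message means the TLS exchange stalled; a SYN that is never answered usually surfaces as dial tcp ... i/o timeout instead. In both cases no HTTP response came back from ECR. This is categorically different from an auth error (401 Unauthorized) or an IAM permission error (403 Forbidden). Those errors mean the request reached AWS. This one means it did not.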
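As a rough triage aid, the distinction above can be sketched as a small Python classifier over the runtime's error text. The function name and the marker strings are illustrative assumptions, not an exhaustive list of containerd or dockerd messages.

```python
# Hypothetical triage helper: the marker strings are illustrative, not exhaustive.
NETWORK_MARKERS = ("tls handshake timeout", "i/o timeout", "no such host", "connection refused")
AUTHN_MARKERS = ("401 unauthorized", "no basic auth credentials")
AUTHZ_MARKERS = ("403 forbidden", "not authorized to perform")


def classify_pull_error(message: str) -> str:
    """Return 'network', 'authentication', 'authorization', or 'unknown'."""
    text = message.lower()
    # Network markers are checked first: if no response came back,
    # credentials and policies were never evaluated.
    if any(marker in text for marker in NETWORK_MARKERS):
        return "network"
    if any(marker in text for marker in AUTHN_MARKERS):
        return "authentication"
    if any(marker in text for marker in AUTHZ_MARKERS):
        return "authorization"
    return "unknown"
```

Anything classified as "network" points at endpoints, security groups, or routing; only the other two categories justify looking at IAM or the repository policy first.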
Immediate consequence: Every pod on every node in that subnet that references this ECR registry goes into ImagePullBackOff. If this is a rollout, your deployment is dead. If nodes are cycling (spot interruption, ASG scale-out), new nodes cannot pull any image and the cluster degrades progressively.
The Attack Vector / Blast Radius
This is not a security exploit in the traditional sense — but the blast radius is a full cluster outage for any workload using this registry. The cascading failure path:
- New node joins the cluster (spot replacement, scale-out event).
- Node cannot pull the pause/infra container or any app image.
- All pods scheduled to that node stay in Pending → ImagePullBackOff.
- Kubernetes reschedules to other nodes; those nodes hit the same network block.
- Horizontal Pod Autoscaler fires more replicas due to load; none start. You are now in a death spiral.
The secondary security risk: teams under pressure during an outage will often open port 443 outbound to 0.0.0.0/0 as a quick fix, or worse, make the ECR repository public. Both are serious security regressions. The correct fix is surgical.
Root causes ranked by frequency in cross-account setups:
| # | Root Cause | Signal |
|---|---|---|
| 1 | Missing or misconfigured ecr.dkr VPC Interface Endpoint | Nodes in private subnet with no NAT or broken endpoint |
| 2 | Endpoint security group blocks port 443 from node CIDR | Endpoint exists but SG is too restrictive |
| 3 | Endpoint is in wrong subnets / AZs | Intermittent failures, AZ-specific |
| 4 | Missing ecr.api endpoint (required alongside ecr.dkr) | ecr.dkr exists but pulls still fail |
| 5 | Route table not associated with the S3 gateway endpoint | Layer traffic bypasses the endpoint, hits NAT or nothing |
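Causes 1 and 4 in the table can be checked mechanically against the output of aws ec2 describe-vpc-endpoints. The following Python sketch assumes the endpoint list has already been loaded from that command's VpcEndpoints array; the function name audit_ecr_endpoints and the wording of the findings are hypothetical.

```python
def audit_ecr_endpoints(endpoints: list[dict], region: str) -> list[str]:
    """Report missing, unavailable, or private-DNS-disabled ECR interface endpoints.

    `endpoints` is assumed to be the "VpcEndpoints" list from
    `aws ec2 describe-vpc-endpoints --output json` for a single VPC.
    """
    findings = []
    for suffix in ("ecr.dkr", "ecr.api"):
        service = f"com.amazonaws.{region}.{suffix}"
        matches = [e for e in endpoints if e.get("ServiceName") == service]
        if not matches:
            findings.append(f"{service}: no endpoint found")
            continue
        for endpoint in matches:
            # Compare case-insensitively so the check does not depend on casing.
            state = str(endpoint.get("State", "")).lower()
            if state != "available":
                findings.append(f"{service}: state is '{state}', expected 'available'")
            if not endpoint.get("PrivateDnsEnabled", False):
                findings.append(f"{service}: private DNS is disabled")
    return findings
```

An empty result only rules out these two causes; security groups, subnets, and route tables still need to be checked separately.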
How to Fix It
Step 1 — Verify the Network Block (Do This First)
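SSH onto the affected node (or open a shell with SSM Session Manager or kubectl debug node/<node-name>; plain kubectl exec runs inside a pod, not on the node) and run: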
# Replace with your account ID and region
curl -v --max-time 10 https://xxxx.dkr.ecr.us-east-1.amazonaws.com/v2/
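If this hangs and times out: network block confirmed. Proceed below. If it returns an HTTP 401 instead, the network path works and the problem is credentials or policy.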
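If the curl test hangs, it can also help to see where the registry hostname resolves. With private DNS enabled on an interface endpoint, the name is expected to resolve to private addresses inside the VPC; public addresses suggest the traffic is heading for a NAT or internet path instead. A minimal Python sketch of that check, with hypothetical function names:

```python
import ipaddress
import socket


def resolve_ipv4(hostname: str, port: int = 443) -> list[str]:
    """Resolve a hostname to its unique IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def describe_resolution(addresses: list[str]) -> str:
    """Return 'private', 'public', 'mixed', or 'unresolved' for a list of IPs."""
    if not addresses:
        return "unresolved"
    flags = [ipaddress.ip_address(addr).is_private for addr in addresses]
    if all(flags):
        return "private"
    if any(flags):
        return "mixed"
    return "public"


# Example (run on the node; replace xxxx with the registry's account ID):
# print(describe_resolution(resolve_ipv4("xxxx.dkr.ecr.us-east-1.amazonaws.com")))
```

A "public" result on a node with no NAT route is consistent with a missing endpoint or disabled private DNS; a "private" result shifts suspicion to the endpoint's security group.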
Basic Fix — Create the Required VPC Interface Endpoints
You need three endpoints for ECR to function in a private subnet:
| Endpoint Service | Purpose |
|---|---|
| com.amazonaws.us-east-1.ecr.dkr | Image layer pulls |
| com.amazonaws.us-east-1.ecr.api | GetAuthorizationToken, BatchGetImage |
| com.amazonaws.us-east-1.s3 (Gateway) | Image layer data (S3-backed) |
# Create ECR DKR endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc123 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.ecr.dkr \
--subnet-ids subnet-0aaa subnet-0bbb \
--security-group-ids sg-0endpoint \
--private-dns-enabled
# Create ECR API endpoint
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc123 \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.ecr.api \
--subnet-ids subnet-0aaa subnet-0bbb \
--security-group-ids sg-0endpoint \
--private-dns-enabled
# Create S3 Gateway endpoint (no SG needed)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc123 \
--vpc-endpoint-type Gateway \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-0abc123
Security group on the endpoint must allow inbound 443 from the node security group or CIDR:
aws ec2 authorize-security-group-ingress \
--group-id sg-0endpoint \
--protocol tcp \
--port 443 \
--source-group sg-0nodes # node instance security group
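To confirm the rule took effect, the endpoint security group's ingress rules can be checked offline. The Python sketch below assumes the rules come from the IpPermissions array of aws ec2 describe-security-groups for sg-0endpoint; the function name is hypothetical and only IPv4 CIDR ranges are considered.

```python
import ipaddress


def allows_https_from(ip_permissions: list[dict], node_sg_id: str = "", node_cidr: str = "") -> bool:
    """Return True if any ingress rule admits TCP 443 from the node SG or node CIDR."""
    for rule in ip_permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            covers_port = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            covers_port = rule.get("FromPort", 0) <= 443 <= rule.get("ToPort", 65535)
        else:
            continue
        if not covers_port:
            continue
        # Source given as a security group reference.
        if node_sg_id and any(
            pair.get("GroupId") == node_sg_id for pair in rule.get("UserIdGroupPairs", [])
        ):
            return True
        # Source given as a CIDR: the node range must fit inside the rule's range.
        if node_cidr:
            node_net = ipaddress.ip_network(node_cidr, strict=False)
            for ip_range in rule.get("IpRanges", []):
                rule_net = ipaddress.ip_network(ip_range["CidrIp"], strict=False)
                if node_net.subnet_of(rule_net):
                    return True
    return False
```

Passing both node_sg_id and node_cidr accepts either form of source, which matches the "node security group or CIDR" wording above.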
Enterprise Best Practice — Cross-Account ECR Repository Policy + Least-Privilege Node IAM
The cross-account pull requires two grants: the node's IAM role must have permission to call ECR, AND the ECR repository in the source account must explicitly allow the node role from the target account.
ECR Repository Policy (source account 111111111111) — diff:
{
"Version": "2012-10-17",
"Statement": [
- {
- "Sid": "AllowAll",
- "Effect": "Allow",
- "Principal": "*",
- "Action": "ecr:*"
- }
+ {
+ "Sid": "CrossAccountPull",
+ "Effect": "Allow",
+ "Principal": {
+ "AWS": "arn:aws:iam::222222222222:role/eks-node-role"
+ },
+ "Action": [
+ "ecr:GetDownloadUrlForLayer",
+ "ecr:BatchGetImage",
+ "ecr:BatchCheckLayerAvailability"
+ ]
+ }
]
}
Node IAM Role Policy (target account 222222222222) — diff:
{
"Version": "2012-10-17",
"Statement": [
- {
- "Effect": "Allow",
- "Action": "ecr:*",
- "Resource": "*"
- }
+ {
+ "Effect": "Allow",
+ "Action": "ecr:GetAuthorizationToken",
+ "Resource": "*"
+ },
+ {
+ "Effect": "Allow",
+ "Action": [
+ "ecr:GetDownloadUrlForLayer",
+ "ecr:BatchGetImage",
+ "ecr:BatchCheckLayerAvailability"
+ ],
+ "Resource": "arn:aws:ecr:us-east-1:111111111111:repository/my-app"
+ }
]
}
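Critical: ecr:GetAuthorizationToken does not support resource-level permissions, so it always requires Resource: "*". If you scope it to a repository ARN, IAM accepts the policy but the statement never matches, and the token request is denied.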
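Because a repository-scoped ecr:GetAuthorizationToken statement is accepted by IAM without complaint, it is worth linting for before deployment. A minimal Python sketch, assuming the node role policy has been loaded as a dict; the function names are hypothetical:

```python
def _as_list(value):
    """IAM allows a single value or a list for Statement, Action, and Resource."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def scoped_auth_token_statements(policy: dict) -> list[int]:
    """Return indexes of Allow statements that name ecr:GetAuthorizationToken
    explicitly but restrict Resource to something other than "*"."""
    flagged = []
    for index, stmt in enumerate(_as_list(policy.get("Statement"))):
        if stmt.get("Effect") != "Allow":
            continue
        # IAM action names are case-insensitive.
        actions = [action.lower() for action in _as_list(stmt.get("Action"))]
        if "ecr:getauthorizationtoken" not in actions:
            continue
        if "*" not in _as_list(stmt.get("Resource")):
            flagged.append(index)
    return flagged
```

The check only looks at statements that name the action explicitly; wildcard actions such as ecr:* are left to other policy rules.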
Terraform — Enterprise Endpoint Module (Best Practice)
- # No VPC endpoints defined — nodes rely on NAT Gateway or public routing
- # This breaks in private subnets and incurs NAT data transfer costs
+ resource "aws_vpc_endpoint" "ecr_dkr" {
+ vpc_id = var.vpc_id
+ service_name = "com.amazonaws.${var.region}.ecr.dkr"
+ vpc_endpoint_type = "Interface"
+ subnet_ids = var.private_subnet_ids
+ security_group_ids = [aws_security_group.endpoints.id]
+ private_dns_enabled = true
+ tags = { Name = "ecr-dkr-endpoint" }
+ }
+
+ resource "aws_vpc_endpoint" "ecr_api" {
+ vpc_id = var.vpc_id
+ service_name = "com.amazonaws.${var.region}.ecr.api"
+ vpc_endpoint_type = "Interface"
+ subnet_ids = var.private_subnet_ids
+ security_group_ids = [aws_security_group.endpoints.id]
+ private_dns_enabled = true
+ tags = { Name = "ecr-api-endpoint" }
+ }
+
+ resource "aws_vpc_endpoint" "s3" {
+ vpc_id = var.vpc_id
+ service_name = "com.amazonaws.${var.region}.s3"
+ vpc_endpoint_type = "Gateway"
+ route_table_ids = var.private_route_table_ids
+ tags = { Name = "s3-gateway-endpoint" }
+ }
+
+ resource "aws_security_group" "endpoints" {
+ name = "vpc-endpoints-sg"
+ vpc_id = var.vpc_id
+
+ ingress {
+ from_port = 443
+ to_port = 443
+ protocol = "tcp"
+ security_groups = [var.node_security_group_id]
+ }
+ }
Prevention in CI/CD
1. Checkov — Catch missing VPC endpoints pre-merge:
# .checkov.yml
# Check IDs can change between releases; confirm them with `checkov --list`.
check:
- CKV_AWS_32 # Ensure ECR policy is not set to public
- CKV_AWS_163 # Ensure ECR image scanning on push is enabled
The built-in checks above cover repository settings only. To catch missing endpoints, add a custom Checkov policy that asserts ECR VPC endpoints exist in any VPC that has private subnets.
2. OPA/Conftest — Enforce cross-account ECR policy structure:
# policy/ecr_repo_policy.rego
package ecr
deny[msg] {
stmt := input.Statement[_]
stmt.Principal == "*"
msg := "ECR repository policy must not use wildcard Principal"
}
deny[msg] {
stmt := input.Statement[_]
stmt.Action == "ecr:*"
msg := "ECR repository policy must not grant ecr:* — use explicit pull actions only"
}
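The two rules above compare Principal and Action against single strings, so a statement that uses a list for Action, or the object form {"AWS": "*"} for Principal, would not be matched. The sketch below shows the same intent with those shapes normalized. It is written in Python purely for illustration (in practice the logic would be ported to Rego), the function name is hypothetical, and unlike the Rego it skips Deny statements, which is an assumption about the desired behavior.

```python
def _as_list(value):
    """Policy fields may hold a single value or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def repo_policy_violations(policy: dict) -> list[str]:
    """Return one message per wildcard Principal or ecr:* grant in Allow statements."""
    messages = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue  # assumption: a Deny with a wildcard Principal is acceptable
        principal = stmt.get("Principal")
        # Principal may be "*" or an object such as {"AWS": "*"} or {"AWS": ["*", ...]}.
        if isinstance(principal, dict):
            principals = [p for value in principal.values() for p in _as_list(value)]
        else:
            principals = _as_list(principal)
        if "*" in principals:
            messages.append("ECR repository policy must not use wildcard Principal")
        actions = [str(action).lower() for action in _as_list(stmt.get("Action"))]
        if "ecr:*" in actions or "*" in actions:
            messages.append("ECR repository policy must not grant ecr:*; use explicit pull actions only")
    return messages
```

Run against the "before" repository policy from the diff earlier, this reports both violations; the CrossAccountPull statement produces none.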
3. AWS Config Rule — Continuous compliance:
aws configservice put-config-rule --config-rule '{
"ConfigRuleName": "ecr-no-wildcard-principal",
"Source": {
"Owner": "CUSTOM_LAMBDA",
"SourceIdentifier": "arn:aws:lambda:...:function:ecr-policy-auditor"
}
}'
4. Terraform Sentinel (Terraform Cloud/Enterprise) - Block plans without endpoints:
# sentinel/require-ecr-endpoints.sentinel
import "tfplan/v2" as tfplan
endpoints = filter tfplan.resource_changes as _, rc {
rc.type is "aws_vpc_endpoint" and
rc.change.after.service_name contains "ecr"
}
main = rule { length(endpoints) >= 2 }
5. Add to your runbook: Any new AWS account or VPC provisioning checklist must include VPC endpoint creation for ECR before any EKS node group is deployed. The TLS timeout is always the symptom; missing endpoints are the disease.