Why does the ECR TLS handshake timeout instead of returning a 403 or auth error?

A TLS handshake timeout means the connection to the ECR endpoint stalled before the TLS handshake finished: packets were dropped or blackholed somewhere on the path, typically by a security group, a NACL, a missing route, or an absent VPC endpoint. No HTTP request was ever sent, so there is no HTTP response to return. A 401 or 403 would only appear if the request reached AWS and was processed by the ECR/IAM auth layer. If you see a timeout, debug the network path first, not IAM.

Do I need both the ecr.dkr and ecr.api VPC endpoints for cross-account pulls?

Yes, both are required. The ecr.dkr endpoint serves the Docker Registry HTTP API v2 that the container runtime uses for manifest and layer requests. The ecr.api endpoint serves the ECR API itself, including GetAuthorizationToken, which the node calls to obtain registry credentials. They are two separate endpoint services, and one does not cover the other. With only ecr.dkr in place, token retrieval fails or times out, so pulls fail even though the TLS handshake to the registry succeeds.

Can I use a NAT Gateway instead of VPC Interface Endpoints for ECR in private subnets?

Technically yes, but it is discouraged for production EKS clusters. ECR traffic through a NAT Gateway incurs per-GB data processing charges that grow with image size and pull frequency across nodes. It also requires an outbound path from the private subnets to ECR's public endpoints, which is a larger attack surface than private VPC endpoints. VPC Interface Endpoints keep ECR traffic on private addresses inside the VPC and are the documented best practice for private EKS clusters.

Fixing ImagePullBackOff TLS Handshake Timeout on Cross-Account AWS ECR Pulls in Kubernetes

Threat/Impact Level: HIGH | Downtime Risk: HIGH | Time to Fix: 15–45 mins depending on whether VPC endpoints exist

TL;DR

What broke: The Kubernetes node cannot complete a TCP+TLS handshake to xxxx.dkr.ecr.us-east-1.amazonaws.com on port 443. The packet is being dropped before AWS even sees the request — this is a network-layer block, not an auth failure.
How to fix it: Create or fix the com.amazonaws.us-east-1.ecr.dkr and com.amazonaws.us-east-1.ecr.api VPC Interface Endpoints, attach a security group that allows port 443 inbound from the node CIDR, and ensure the cross-account ECR repository policy grants the pull actions to the worker node's IAM role ARN. ecr:GetAuthorizationToken is an account-level action and belongs in the node role's own IAM policy, not in the repository policy.

The Incident — What Does This Error Mean?

Raw error from kubectl describe pod:

Failed to pull image "xxxx.dkr.ecr.us-east-1.amazonaws.com/my-app:latest":
rpc error: code = Unknown desc = Error response from daemon:
Get https://xxxx.dkr.ecr.us-east-1.amazonaws.com/v2/: net/http: TLS handshake timeout

This error fires from the container runtime (containerd or dockerd) on the worker node. The runtime opened a connection to port 443 of the ECR registry endpoint, and the TLS handshake did not complete within the client's handshake timeout. No HTTP request was ever sent. When the initial TCP SYN itself is dropped, Go-based runtimes usually report a dial tcp i/o timeout instead; both errors point at the network path. This is categorically different from an auth error (401 Unauthorized) or an IAM permission error (403 Forbidden). Those errors mean the request reached AWS. This one means it did not.
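
The distinction between a timeout and a 401/403 can be turned into a rough triage helper. The following is a minimal sketch, assuming the error text is copied from kubectl describe pod. The function name classify_pull_error and the substring patterns are illustrative assumptions, not an exhaustive list of runtime messages.

```python
import re


def classify_pull_error(message: str) -> str:
    """Map an image-pull error string to the layer most likely at fault."""
    text = message.lower()
    # A timeout means no HTTP response ever came back: suspect the network path.
    if "tls handshake timeout" in text or "i/o timeout" in text:
        return "network"
    # 401: the registry answered, but the auth token was missing or stale.
    if re.search(r"\b401\b", text) or "no basic auth credentials" in text:
        return "auth-token"
    # 403: the registry answered, and IAM or the repository policy refused.
    if re.search(r"\b403\b", text) or "not authorized to perform" in text:
        return "iam-or-repo-policy"
    return "unknown"
```

The timeout check runs first on purpose: if the handshake never finished, any status-code-like digits elsewhere in the message are incidental.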

Immediate consequence: Every pod on every node in that subnet that references this ECR registry goes into ImagePullBackOff. If this is a rollout, your deployment is dead. If nodes are cycling (spot interruption, ASG scale-out), new nodes cannot pull any image and the cluster degrades progressively.

The Attack Vector / Blast Radius

This is not a security exploit in the traditional sense — but the blast radius is a full cluster outage for any workload using this registry. The cascading failure path:

1. New node joins the cluster (spot replacement, scale-out event).
2. Node cannot pull the pause/infra container or any app image.
3. All pods scheduled to that node stay in Pending → ImagePullBackOff.
4. Kubernetes reschedules to other nodes; those nodes hit the same network block.
5. Horizontal Pod Autoscaler fires more replicas due to load; none start. You are now in a death spiral.

The secondary security risk: teams under pressure during an outage will often open port 443 outbound to 0.0.0.0/0 as a quick fix, or worse, make the ECR repository public. Both are serious security regressions. The correct fix is surgical.

Root causes ranked by frequency in cross-account setups:

#	Root Cause	Signal
1	Missing or misconfigured `ecr.dkr` VPC Interface Endpoint	Nodes in private subnet with no NAT or broken endpoint
2	Endpoint security group blocks port 443 from node CIDR	Endpoint exists but SG is too restrictive
3	Endpoint is in wrong subnets / AZs	Intermittent failures, AZ-specific
4	Missing `ecr.api` endpoint	`ecr.dkr` exists but pulls still fail
5	Private route table not associated with the S3 gateway endpoint	Layer downloads bypass the endpoint, hit NAT or nothing

How to Fix It

Step 1 — Verify the Network Block (Do This First)

SSH into the affected node, or open an SSM Session Manager shell on it (kubectl exec targets pods, not nodes), and run:

# Replace with your account ID and region
curl -v --max-time 10 https://xxxx.dkr.ecr.us-east-1.amazonaws.com/v2/

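If this hangs and times out: network block confirmed. Proceed below. If it returns HTTP 401 instead, the network path is fine and the problem is in the auth or IAM layer.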
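
When curl is not installed on the node image, the same check can be approximated with the Python standard library. This is a minimal sketch, not a replacement for curl -v. The function name probe_tls and the stage labels are assumptions made for illustration. It separates DNS, TCP, and TLS failures so you can see which stage stalls.

```python
import socket
import ssl


def probe_tls(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the first stage that fails: 'dns', 'tcp', 'tls', or 'ok'."""
    try:
        # Resolve first so a DNS failure is not mistaken for a TCP failure.
        addr = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4]
    except socket.gaierror:
        return "dns"
    try:
        sock = socket.create_connection(addr[:2], timeout=timeout)
    except OSError:
        return "tcp"
    try:
        ctx = ssl.create_default_context()
        # The socket timeout set above also bounds the TLS handshake.
        with ctx.wrap_socket(sock, server_hostname=host):
            return "ok"
    except OSError:  # includes ssl.SSLError and handshake timeouts
        return "tls"
    finally:
        sock.close()
```

Run it on the node as probe_tls("xxxx.dkr.ecr.us-east-1.amazonaws.com"). A result of "tcp" or "tls" points at the network path, while "ok" means the endpoint is reachable and the investigation should move to auth.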

Basic Fix — Create the Required VPC Interface Endpoints

You need three endpoints for ECR to function in a private subnet:

Endpoint Service	Purpose
`com.amazonaws.us-east-1.ecr.dkr`	Docker registry traffic (manifest and layer requests)
`com.amazonaws.us-east-1.ecr.api`	ECR API calls such as `GetAuthorizationToken`
`com.amazonaws.us-east-1.s3` (Gateway)	Image layer data (S3-backed)

# Create ECR DKR endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids subnet-0aaa subnet-0bbb \
  --security-group-ids sg-0endpoint \
  --private-dns-enabled

# Create ECR API endpoint
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-0aaa subnet-0bbb \
  --security-group-ids sg-0endpoint \
  --private-dns-enabled

# Create S3 Gateway endpoint (no SG needed)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc123
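
After creating the endpoints, it can help to confirm that all three exist and are usable. The sketch below assumes its input is the parsed JSON from aws ec2 describe-vpc-endpoints filtered to your VPC. The function name missing_ecr_endpoints is hypothetical, and the check is deliberately simple: it ignores subnet and AZ placement.

```python
def missing_ecr_endpoints(describe_output: dict, region: str) -> list[str]:
    """Return required endpoint services that are absent or not usable."""
    required = {
        f"com.amazonaws.{region}.ecr.dkr": "Interface",
        f"com.amazonaws.{region}.ecr.api": "Interface",
        f"com.amazonaws.{region}.s3": "Gateway",
    }
    found = set()
    for ep in describe_output.get("VpcEndpoints", []):
        name = ep.get("ServiceName")
        if name not in required or required[name] != ep.get("VpcEndpointType"):
            continue
        # Compare case-insensitively; only an available endpoint counts.
        if str(ep.get("State", "")).lower() != "available":
            continue
        # Without private DNS, the public registry hostname will not
        # resolve to the interface endpoint.
        if required[name] == "Interface" and not ep.get("PrivateDnsEnabled"):
            continue
        found.add(name)
    return sorted(set(required) - found)
```

An empty list means the three services are present. Any returned name is a candidate root cause from the table above.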

Security group on the endpoint must allow inbound 443 from the node security group or CIDR:

aws ec2 authorize-security-group-ingress \
  --group-id sg-0endpoint \
  --protocol tcp \
  --port 443 \
  --source-group sg-0nodes  # node instance security group
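
To confirm the rule actually took effect, you can inspect the endpoint security group. This sketch assumes its input is one entry from the SecurityGroups list returned by aws ec2 describe-security-groups. The function name allows_https_from is hypothetical, and only IPv4 ranges and security-group references are considered.

```python
import ipaddress


def allows_https_from(sg: dict, node_sg_id: str = "", node_cidr: str = "") -> bool:
    """True if the group allows inbound TCP 443 from the node SG or CIDR."""
    for perm in sg.get("IpPermissions", []):
        proto = perm.get("IpProtocol")
        if proto == "tcp":
            if not perm.get("FromPort", 0) <= 443 <= perm.get("ToPort", 65535):
                continue
        elif proto != "-1":  # "-1" means all protocols and ports
            continue
        # Match by referenced security group.
        pairs = perm.get("UserIdGroupPairs", [])
        if node_sg_id and any(p.get("GroupId") == node_sg_id for p in pairs):
            return True
        # Match by CIDR: the node range must sit inside an allowed range.
        if node_cidr:
            node_net = ipaddress.ip_network(node_cidr, strict=False)
            for rng in perm.get("IpRanges", []):
                if node_net.subnet_of(ipaddress.ip_network(rng["CidrIp"])):
                    return True
    return False
```

Referencing the node security group is usually the tighter option, because it keeps working when subnets are added or resized.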

Enterprise Best Practice — Cross-Account ECR Repository Policy + Least-Privilege Node IAM

The cross-account pull requires two grants: the node's IAM role must have permission to call ECR, AND the ECR repository in the source account must explicitly allow the node role from the target account.

ECR Repository Policy (source account 111111111111) — diff:

{
  "Version": "2012-10-17",
  "Statement": [
-   {
-     "Sid": "AllowAll",
-     "Effect": "Allow",
-     "Principal": "*",
-     "Action": "ecr:*"
-   }
+   {
+     "Sid": "CrossAccountPull",
+     "Effect": "Allow",
+     "Principal": {
+       "AWS": "arn:aws:iam::222222222222:role/eks-node-role"
+     },
+     "Action": [
+       "ecr:GetDownloadUrlForLayer",
+       "ecr:BatchGetImage",
+       "ecr:BatchCheckLayerAvailability"
+     ]
+   }
  ]
}
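
If several consumer accounts pull from the same repository, generating the policy is less error-prone than hand-editing it. This is a minimal sketch that produces the same shape as the corrected policy above. The function name build_repo_policy is hypothetical, and the role ARNs are whatever you pass in.

```python
import json

# Pull-only actions, matching the corrected repository policy.
PULL_ACTIONS = [
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
    "ecr:BatchCheckLayerAvailability",
]


def build_repo_policy(role_arns: list[str]) -> str:
    """Return a pull-only repository policy for explicit role ARNs."""
    if not role_arns:
        raise ValueError("at least one role ARN is required")
    for arn in role_arns:
        # Refuse wildcards so the policy can never become public by accident.
        if "*" in arn:
            raise ValueError(f"wildcard principal not allowed: {arn}")
    statement = {
        "Sid": "CrossAccountPull",
        "Effect": "Allow",
        "Principal": {"AWS": sorted(role_arns)},
        "Action": PULL_ACTIONS,
    }
    return json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2)
```

The output can be passed to aws ecr set-repository-policy as the policy text. ecr:GetAuthorizationToken is deliberately absent because it belongs in the node role's identity policy.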

Node IAM Role Policy (target account 222222222222) — diff:

{
  "Version": "2012-10-17",
  "Statement": [
-   {
-     "Effect": "Allow",
-     "Action": "ecr:*",
-     "Resource": "*"
-   }
+   {
+     "Effect": "Allow",
+     "Action": "ecr:GetAuthorizationToken",
+     "Resource": "*"
+   },
+   {
+     "Effect": "Allow",
+     "Action": [
+       "ecr:GetDownloadUrlForLayer",
+       "ecr:BatchGetImage",
+       "ecr:BatchCheckLayerAvailability"
+     ],
+     "Resource": "arn:aws:ecr:us-east-1:111111111111:repository/my-app"
+   }
  ]
}

Critical: ecr:GetAuthorizationToken does not support resource-level permissions, so it always requires Resource: "*". If you scope it to a repository ARN, the statement never matches and the token request is denied.
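
The Resource rule for ecr:GetAuthorizationToken is easy to get wrong in review, so a small lint can help. This sketch assumes the node role policy is available as a parsed JSON document. The function name check_node_policy and the message wording are illustrative.

```python
def check_node_policy(policy: dict) -> list[str]:
    """Return human-readable problems found in a node role policy."""
    problems = []
    token_on_star = False
    token_scoped = False
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        # IAM accepts a string or a list for both fields; normalize to lists.
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        resources = stmt.get("Resource", [])
        resources = [resources] if isinstance(resources, str) else resources
        if "ecr:*" in actions:
            problems.append("ecr:* is broader than a pull-only node needs")
            if "*" in resources:
                token_on_star = True
        if "ecr:GetAuthorizationToken" in actions:
            if "*" in resources:
                token_on_star = True
            else:
                token_scoped = True
    if token_scoped and not token_on_star:
        problems.append('ecr:GetAuthorizationToken must use Resource "*"')
    elif not token_on_star:
        problems.append("no statement allows ecr:GetAuthorizationToken")
    return problems
```

An empty list means the policy matches the least-privilege shape shown in the diff above.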


Terraform — Enterprise Endpoint Module (Best Practice)

- # No VPC endpoints defined — nodes rely on NAT Gateway or public routing
- # This breaks in private subnets and incurs NAT data transfer costs

+ resource "aws_vpc_endpoint" "ecr_dkr" {
+   vpc_id              = var.vpc_id
+   service_name        = "com.amazonaws.${var.region}.ecr.dkr"
+   vpc_endpoint_type   = "Interface"
+   subnet_ids          = var.private_subnet_ids
+   security_group_ids  = [aws_security_group.endpoints.id]
+   private_dns_enabled = true
+   tags = { Name = "ecr-dkr-endpoint" }
+ }
+
+ resource "aws_vpc_endpoint" "ecr_api" {
+   vpc_id              = var.vpc_id
+   service_name        = "com.amazonaws.${var.region}.ecr.api"
+   vpc_endpoint_type   = "Interface"
+   subnet_ids          = var.private_subnet_ids
+   security_group_ids  = [aws_security_group.endpoints.id]
+   private_dns_enabled = true
+   tags = { Name = "ecr-api-endpoint" }
+ }
+
+ resource "aws_vpc_endpoint" "s3" {
+   vpc_id            = var.vpc_id
+   service_name      = "com.amazonaws.${var.region}.s3"
+   vpc_endpoint_type = "Gateway"
+   route_table_ids   = var.private_route_table_ids
+   tags = { Name = "s3-gateway-endpoint" }
+ }
+
+ resource "aws_security_group" "endpoints" {
+   name   = "vpc-endpoints-sg"
+   vpc_id = var.vpc_id
+
+   ingress {
+     from_port       = 443
+     to_port         = 443
+     protocol        = "tcp"
+     security_groups = [var.node_security_group_id]
+   }
+ }

Prevention in CI/CD

1. Checkov — Catch missing VPC endpoints pre-merge:

# .checkov.yml
# Confirm check IDs against `checkov --list` for your Checkov version.
check:
  - CKV_AWS_32   # ECR repository policy is not set to public
  - CKV_AWS_163  # ECR image scanning on push

Add a custom Checkov policy to assert VPC endpoints exist for ECR in any VPC that has private subnets.

2. OPA/Conftest — Enforce cross-account ECR policy structure:

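# policy/ecr_repo_policy.rego
package ecr

# IAM allows Action to be a string or an array; normalize to an array.
actions(stmt) = [stmt.Action] { is_string(stmt.Action) }
actions(stmt) = stmt.Action { is_array(stmt.Action) }

deny[msg] {
  stmt := input.Statement[_]
  stmt.Principal == "*"
  msg := "ECR repository policy must not use wildcard Principal"
}

# Catches the {"AWS": "*"} form of a wildcard principal.
deny[msg] {
  stmt := input.Statement[_]
  stmt.Principal.AWS == "*"
  msg := "ECR repository policy must not use wildcard Principal"
}

deny[msg] {
  stmt := input.Statement[_]
  acts := actions(stmt)
  acts[_] == "ecr:*"
  msg := "ECR repository policy must not grant ecr:*; use explicit pull actions only"
}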

3. AWS Config Rule — Continuous compliance:

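aws configservice put-config-rule --config-rule '{
  "ConfigRuleName": "ecr-no-wildcard-principal",
  "Scope": {"ComplianceResourceTypes": ["AWS::ECR::Repository"]},
  "Source": {
    "Owner": "CUSTOM_LAMBDA",
    "SourceIdentifier": "arn:aws:lambda:...:function:ecr-policy-auditor",
    "SourceDetails": [
      {"EventSource": "aws.config", "MessageType": "ConfigurationItemChangeNotification"}
    ]
  }
}'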
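
The rule above delegates the decision to a Lambda function that this article does not show. The core of such a function might look like the following sketch. The function name evaluate_repository_policy is hypothetical, and fetching the policy text and reporting the result back to AWS Config are left out.

```python
import json


def evaluate_repository_policy(policy_text: str) -> str:
    """Return an AWS Config compliance type for a repository policy."""
    policy = json.loads(policy_text)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        # Wildcard written as a bare string.
        if principal == "*":
            return "NON_COMPLIANT"
        # Wildcard written as {"AWS": "*"} or inside a list of principals.
        if isinstance(principal, dict):
            aws = principal.get("AWS", [])
            aws = [aws] if isinstance(aws, str) else aws
            if "*" in aws:
                return "NON_COMPLIANT"
    return "COMPLIANT"
```

Keeping the decision logic in a pure function like this makes it straightforward to unit test before it is wired into the Lambda handler.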

4. Sentinel (Terraform Cloud/Enterprise) — Block plans without endpoints:

# sentinel/require-ecr-endpoints.sentinel
import "tfplan/v2" as tfplan

endpoints = filter tfplan.resource_changes as _, rc {
  rc.type is "aws_vpc_endpoint" and
  rc.change.after.service_name contains "ecr"
}

main = rule { length(endpoints) >= 2 }
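
Teams without Sentinel can run a similar gate in any CI job. This sketch assumes its input is the parsed output of terraform show -json for a saved plan. The function name plan_has_ecr_endpoints is hypothetical. Unlike the Sentinel rule, it matches the two exact service names instead of counting resources.

```python
def plan_has_ecr_endpoints(plan: dict, region: str) -> bool:
    """True if the plan keeps or creates both ECR interface endpoints."""
    wanted = {
        f"com.amazonaws.{region}.ecr.dkr",
        f"com.amazonaws.{region}.ecr.api",
    }
    present = set()
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_vpc_endpoint":
            continue
        change = rc.get("change", {})
        actions = change.get("actions", [])
        # A pure delete removes the endpoint, so it does not count.
        if "delete" in actions and "create" not in actions:
            continue
        after = change.get("after") or {}
        present.add(after.get("service_name"))
    return wanted <= present
```

One limitation to keep in mind: if service_name is only known after apply, it will not appear in the planned values and the check fails closed.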

5. Add to your runbook: Any new AWS account or VPC provisioning checklist must include VPC endpoint creation for ECR before any EKS node group is deployed. The TLS timeout is always the symptom; missing endpoints are the disease.