79837526

Date: 2025-12-04 05:18:19
Score: 0.5
Natty:
Report link

The root cause is connection pool exhaustion + cold start delays during ECS autoscale events. When CPU >50% triggers scaling, new ECS tasks spin up simultaneously, each trying to establish 15 connections to Elasticache. Your pool saturates immediately, causing connection/command timeouts → API 504s → health check failures → task kill/create loops.

What's happening (timeline):

text

    1. Load test → CPU 50% → ECS scales up 5-10 tasks
    2. Each task creates 15 connections (maxTotal) → 75-150 total connections hit Elasticache
    3. IAM auth + TLS handshake + pool warmup = 3-5s per task
    4. Connection timeouts (3s) fire → commands fail → APIs 504
    5. Health checks fail → tasks get terminated → scale loop
    6. ~30min later: pools stabilize, connections get reused, everything calms down

Key issues in your config:

text

    ❌ maxTotal=15 too small for multi-task scaling (50+ recommended)
    ❌ connectTimeout=3s too tight (IAM auth + TLS needs 8-10s)
    ❌ commandTimeout=2s aggressive during warmup (5s minimum)
    ❌ No pool metrics → blind to saturation

Fixes (in priority order):

1. Increase pool size + timeouts (immediate fix):

java

    @Bean
    public CompletionStage<BoundedAsyncPool<StatefulRedisConnection<String, String>>> cacheAsyncConnectionPool(RedisClient redisClient) {
        BoundedPoolConfig poolConfig = BoundedPoolConfig.builder()
                .maxTotal(50)                        // was 15
                .maxIdle(25)                         // was 8
                .minIdle(5)                          // was 3
                .testOnAcquire(true)
                .testOnCreate(true)
                .maxWait(Duration.ofSeconds(10))     // ADD this
                .build();

        RedisURI redisURI = RedisURI.builder()
                .withHost(host)
                .withPort(6379)
                .withSsl(true)
                .withAuthentication(getCredentials())
                .withTimeout(Duration.ofSeconds(10)) // was 3s
                .build();

        // ... rest unchanged
    }
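
For anyone who doesn't have the rest of that bean: on Lettuce 5.1+/6.x the pool itself is usually built through AsyncConnectionPoolSupport. Rough sketch only — the codec and the bean wiring are assumptions on my part, not OP's actual code:

java

    // Sketch: how such a bean typically finishes. The StringCodec and the fact that
    // RedisURI/BoundedPoolConfig arrive as beans are assumptions for illustration.
    import java.util.concurrent.CompletionStage;

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    import io.lettuce.core.RedisClient;
    import io.lettuce.core.RedisURI;
    import io.lettuce.core.api.StatefulRedisConnection;
    import io.lettuce.core.codec.StringCodec;
    import io.lettuce.core.support.AsyncConnectionPoolSupport;
    import io.lettuce.core.support.BoundedAsyncPool;
    import io.lettuce.core.support.BoundedPoolConfig;

    @Configuration
    public class CachePoolConfig {

        @Bean
        public CompletionStage<BoundedAsyncPool<StatefulRedisConnection<String, String>>> cacheAsyncConnectionPool(
                RedisClient redisClient, RedisURI redisURI, BoundedPoolConfig poolConfig) {
            // Each pooled object is one TLS + IAM-authenticated connection, i.e. the
            // 3-5s warmup cost from the timeline above is paid per object created here.
            return AsyncConnectionPoolSupport.createBoundedObjectPoolAsync(
                    () -> redisClient.connectAsync(StringCodec.UTF8, redisURI),
                    poolConfig);
        }
    }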

2. Update ClientOptions timeouts:

java

    private ClientOptions createClientOptions() {
        return ClientOptions.builder()
                .pingBeforeActivateConnection(true)
                .socketOptions(SocketOptions.builder()
                        .connectTimeout(Duration.ofSeconds(10)) // was 3s
                        .keepAlive(true)
                        .tcpNoDelay(true)
                        .build())
                .timeoutOptions(TimeoutOptions.builder()
                        .fixedTimeout(Duration.ofSeconds(5))    // was 2s
                        .build())
                .build();
    }
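
One thing to double-check: these options only apply to connections created after you call setOptions, so make sure they land on the client before the pool opens anything. Rough sketch of that wiring, assuming a Spring config class — the class and bean names are illustrative, not OP's code:

java

    import java.time.Duration;

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    import io.lettuce.core.ClientOptions;
    import io.lettuce.core.RedisClient;
    import io.lettuce.core.SocketOptions;
    import io.lettuce.core.TimeoutOptions;

    @Configuration
    public class RedisClientConfig {

        @Bean(destroyMethod = "shutdown")
        public RedisClient redisClient() {
            RedisClient client = RedisClient.create();
            // Apply the relaxed timeouts before the pool creates any connections;
            // existing connections keep the options they were created with.
            client.setOptions(createClientOptions());
            return client;
        }

        // Same options as in step 2, repeated here only so the sketch compiles on its own.
        private ClientOptions createClientOptions() {
            return ClientOptions.builder()
                    .pingBeforeActivateConnection(true)
                    .socketOptions(SocketOptions.builder()
                            .connectTimeout(Duration.ofSeconds(10))
                            .keepAlive(true)
                            .tcpNoDelay(true)
                            .build())
                    .timeoutOptions(TimeoutOptions.builder()
                            .fixedTimeout(Duration.ofSeconds(5))
                            .build())
                    .build();
        }
    }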

3. ECS Task Definition tweaks:

json

{ "healthCheck": { "timeout": 30, // give connections time to warmup "interval": 60, "startPeriod": 120 // 2min grace period }, "essential": true }

4. Add pool metrics (debugging):

java

    // In your service using the pool
    @Autowired
    private BoundedAsyncPool<StatefulRedisConnection<String, String>> pool;

    public void logPoolStats() {
        System.out.println("Active: " + pool.getActiveObjectCount()
                + " Idle: " + pool.getIdleObjectCount()
                + " Waiting: " + pool.getWaitingObjectCount());
    }
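
If the metrics show active connections pinned at maxTotal even after traffic drops, the usual culprit is code that acquires but never releases. For reference, the acquire/use/release pattern with the async pool looks like this — class name and key are just for illustration:

java

    import java.util.concurrent.CompletionStage;

    import io.lettuce.core.api.StatefulRedisConnection;
    import io.lettuce.core.support.BoundedAsyncPool;

    public class CacheReader {

        private final BoundedAsyncPool<StatefulRedisConnection<String, String>> pool;

        public CacheReader(BoundedAsyncPool<StatefulRedisConnection<String, String>> pool) {
            this.pool = pool;
        }

        public CompletionStage<String> get(String key) {
            // Acquire a pooled connection, run the command, and always release it,
            // even when the command fails — leaked connections look exactly like
            // pool exhaustion in the stats above.
            return pool.acquire().thenCompose(connection ->
                    connection.async().get(key)
                            .whenComplete((value, error) -> pool.release(connection)));
        }
    }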

5. Scale policy refinement:

text

    CPU >50% → scale out by 2 tasks (not max)
    Cooldown: 300s
    Warmup: 120s

Verify the fix:

text

    1. Deploy pool changes
    2. Load test at 500 rps
    3. Watch CloudWatch:
       - Connection count <50
       - No timeout errors
       - ECS tasks stabilize in <5min

Pro tip: Valkey serverless + IAM auth adds ~200-500ms latency per connection. Your 3s timeout was cutting it too close during scale events.
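
If you want to see that connection-setup cost against your own endpoint, timing a single cold connect is enough. Quick standalone sketch — the env vars are placeholders, and you'd plug in your IAM token provider where noted:

java

    import java.time.Duration;

    import io.lettuce.core.RedisClient;
    import io.lettuce.core.RedisURI;
    import io.lettuce.core.api.StatefulRedisConnection;

    public class ConnectionLatencyCheck {

        public static void main(String[] args) {
            // Placeholders — point these at your Elasticache endpoint.
            RedisURI uri = RedisURI.builder()
                    .withHost(System.getenv("CACHE_HOST"))
                    .withPort(6379)
                    .withSsl(true)
                    // .withAuthentication(getCredentials())  // add your IAM token provider here
                    .withTimeout(Duration.ofSeconds(10))
                    .build();

            RedisClient client = RedisClient.create();
            long start = System.nanoTime();
            // Cold connect: DNS + TCP + TLS handshake (+ auth) all happen here.
            try (StatefulRedisConnection<String, String> connection = client.connect(uri)) {
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("Cold connection took " + elapsedMs + " ms, PING = "
                        + connection.sync().ping());
            } finally {
                client.shutdown();
            }
        }
    }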


27 upvotes · 5 awards
Posted 45min ago by cloud-architect-freelancer


✅ This fixed it completely! Increased pool to 50, timeouts to 10s, added warmup period. Scaling now takes 3min instead of 30. Thanks!
— OP (20min ago) ⬆️


remove AI tone and make it human tone

Here’s the Stack Overflow answer rewritten in a natural, straightforward tone:


The issue you’re hitting—API latency spikes, 504 timeouts, and connection timeouts on Elasticache during ECS autoscaling—usually happens because your connection pool is too small and the new ECS tasks need time to warm up their connections.

When ECS scales out a bunch of tasks, each creates connections to Elasticache. With your pool maxTotal set to 15, it quickly maxes out. On top of that, IAM authentication and TLS add extra delay to each connection. So requests start timing out until everything catches up, which can take around 30 minutes.

Here’s what you can do:

  1. Increase your connection pool size significantly (try maxTotal around 50 or more).

  2. Increase connection and command timeouts to 5-10 seconds to allow for auth and TLS overhead.

  3. Adjust ECS health check grace periods to avoid killing tasks before they’re fully ready.

  4. Add metrics around your connection pool so you can monitor saturation.

  5. Look at your autoscaling policies to ensure they don’t scale too aggressively, and allow some cooldown time.

Example changes:

java

    BoundedPoolConfig poolConfig = BoundedPoolConfig.builder()
            .maxTotal(50)   // bigger pool
            .maxIdle(25)
            .minIdle(5)
            .testOnAcquire(true)
            .testOnCreate(true)
            .build();

    RedisURI redisURI = RedisURI.builder()
            .withHost(host)
            .withPort(port)
            .withSsl(true)
            .withAuthentication(getCredentials())
            .withTimeout(Duration.ofSeconds(10)) // longer timeout
            .build();

And update your client options:

java

    ClientOptions.builder()
            .socketOptions(SocketOptions.builder()
                    .connectTimeout(Duration.ofSeconds(10)) // increased from 3s
                    .keepAlive(true)
                    .tcpNoDelay(true)
                    .build())
            .timeoutOptions(TimeoutOptions.builder()
                    .fixedTimeout(Duration.ofSeconds(5))    // command timeout up from 2s
                    .build())
            .build();

Once you increase pool size and timeouts, your system will handle spikes better, and ECS tasks won’t get stuck restarting due to connection failures.

Reasons:
  • Blacklisted phrase (0.5): Thanks
  • Blacklisted phrase (0.5): upvote
  • Long answer (-1):
  • Has code block (-0.5):
  • Low reputation (1):
Posted by: Manjunaath S