Add preview of changes for standard retry mode behind flag #3400

Open
Madrigal wants to merge 3 commits into main from feat-retries-2026

@Madrigal
Contributor

Add preview of new standard retry behavior behind AWS_NEW_RETRIES_2026 flag.

All changes are gated behind the AWS_NEW_RETRIES_2026 environment variable and have no impact on existing behavior unless the flag is explicitly set to "true". This feature is expected to be enabled by default later in the year.

This is part of a cross-SDK change to standardize how we handle retries, and will be implemented by all AWS SDKs (Java, Python, Rust, etc.).

Changes

Retry token bucket costs

  • Non-throttling retry cost increases from 5 to 14 tokens, reducing retry amplification during service outages.
  • Throttling errors now use a discounted cost of 5 tokens (previously, timeouts had a special cost of 10). This
    allows more retries when a service explicitly signals "try again later."
  • Timeouts are no longer treated differently from other transient errors.
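
The cost changes above can be pictured with a minimal token bucket. This is only a sketch: the costs (14 and 5) come from the list above, while `tokenBucket`, `acquire`, and the constant names are hypothetical and do not mirror the SDK's internal types.

```go
package main

import "sync"

// Costs as listed in the PR description. All identifiers in this sketch are
// hypothetical and are not the SDK's API.
const (
	retryCost         = 14 // non-throttling errors, timeouts included (was 5)
	throttleRetryCost = 5  // discounted cost for throttling errors
)

// tokenBucket is a minimal retry quota shared by all requests of a client.
type tokenBucket struct {
	mu        sync.Mutex
	available int
}

// acquire withdraws the cost of one retry. It returns the amount withdrawn
// and whether the retry may proceed.
func (b *tokenBucket) acquire(throttled bool) (int, bool) {
	cost := retryCost
	if throttled {
		cost = throttleRetryCost
	}
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.available < cost {
		return 0, false
	}
	b.available -= cost
	return cost, true
}
```

With 20 tokens available, one non-throttling retry (14) succeeds, a second one is refused, and a throttling retry (5) still fits, leaving 1 token.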

Backoff timing

  • Base backoff for non-throttling errors reduced from 1,000ms to 50ms, significantly reducing latency for
    transient failures.
  • Throttling errors retain the 1,000ms base backoff.
  • MAX_BACKOFF is now applied before jitter rather than after.
  • DynamoDB and DynamoDB Streams use a 25ms base backoff and 4 max attempts (up from 3).
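
A rough sketch of the new ordering (cap first, jitter second) follows. The base delays come from the list above. The 20-second cap, the doubling per retry, and the uniform jitter over the whole ceiling are assumptions made for illustration, and the function names are hypothetical. The DynamoDB-specific 25ms base is not modeled.

```go
package main

import (
	"math/rand"
	"time"
)

const (
	baseDelay         = 50 * time.Millisecond   // non-throttling errors (was 1,000ms)
	throttleBaseDelay = 1000 * time.Millisecond // throttling errors, unchanged
	maxBackoff        = 20 * time.Second        // assumed cap; the PR does not state it
)

// backoffCeiling returns the capped exponential delay for the given retry
// (0 for the first retry), before jitter is applied.
func backoffCeiling(retry int, throttled bool) time.Duration {
	base := baseDelay
	if throttled {
		base = throttleBaseDelay
	}
	if retry < 0 {
		retry = 0
	}
	if retry > 30 {
		retry = 30 // keeps the shift below from overflowing int64
	}
	d := base << uint(retry)
	if d > maxBackoff {
		d = maxBackoff // the cap is applied here, before jitter
	}
	return d
}

// backoff applies jitter to the already capped ceiling, so the result is a
// random duration between 0 and backoffCeiling.
func backoff(retry int, throttled bool) time.Duration {
	return time.Duration(rand.Float64() * float64(backoffCeiling(retry, throttled)))
}
```

Under these assumptions the third retry of a non-throttling error waits at most 200ms, where the same retry of a throttling error waits at most 4s.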

x-amz-retry-after header

  • The SDK now honors the x-amz-retry-after response header, which specifies a server-recommended backoff in
    milliseconds.
  • The value is clamped between the computed backoff and 5 seconds above it.
  • Invalid header values are silently ignored.
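
The clamping rule can be sketched as a small pure function. `applyRetryAfter` and `retryAfterWindow` are hypothetical names, and treating negative or non-integer values as invalid is an assumption; the PR only says that invalid values are ignored.

```go
package main

import (
	"strconv"
	"time"
)

// retryAfterWindow is how far above the computed backoff the server hint may
// push the delay (5 seconds, per the PR description).
const retryAfterWindow = 5 * time.Second

// applyRetryAfter returns the delay to use given the SDK's computed backoff
// and the raw x-amz-retry-after header value (milliseconds).
func applyRetryAfter(computed time.Duration, header string) time.Duration {
	ms, err := strconv.ParseInt(header, 10, 64)
	if err != nil || ms < 0 {
		return computed // missing or invalid header: ignore it
	}
	upper := computed + retryAfterWindow
	if ms > int64(upper/time.Millisecond) {
		return upper // max-clamped; also avoids overflow on huge values
	}
	hint := time.Duration(ms) * time.Millisecond
	if hint < computed {
		return computed // min-clamped
	}
	return hint
}
```

With a computed backoff of 100ms, a header of `1500` yields 1.5s, `10` yields 100ms, and `60000` yields 5.1s.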

Long-polling operations

  • Operations marked as long-polling now back off even when the retry quota is exhausted, preventing request amplification on polling endpoints. The affected operations are currently added manually (SQS ReceiveMessage, SFN GetActivityTask, SWF PollForActivityTask/PollForDecisionTask); this will later be implemented as a trait on the service.
  • Codegen support added via LongPollingRetryIntegration, which checks for the LongPollTrait Smithy trait with a
    hardcoded fallback for known operations until the trait is widely applied.
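
One plausible reading of the two bullets above, as a sketch: an operation counts as long-polling if it carries the trait or appears in the hardcoded fallback list, and when no retry token can be acquired a long-polling operation still waits out the computed delay, where other operations do not wait. The map keys, function names, and the exact behavior on exhaustion are assumptions and are not taken from the diff.

```go
package main

import "time"

// fallbackLongPolling lists the operations named in the PR description.
// The "Service.Operation" key format is hypothetical.
var fallbackLongPolling = map[string]bool{
	"SQS.ReceiveMessage":      true,
	"SFN.GetActivityTask":     true,
	"SWF.PollForActivityTask": true,
	"SWF.PollForDecisionTask": true,
}

// isLongPolling prefers the modeled trait and falls back to the hardcoded
// list while the trait is not widely applied.
func isLongPolling(service, operation string, hasTrait bool) bool {
	return hasTrait || fallbackLongPolling[service+"."+operation]
}

// delayOnExhaustedQuota returns how long to wait when no retry token could be
// acquired: long-polling operations still back off, others do not wait.
func delayOnExhaustedQuota(longPolling bool, computed time.Duration) time.Duration {
	if longPolling {
		return computed
	}
	return 0
}
```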

Attempt token bookkeeping

  • NoRetryIncrement is now only applied on the first attempt's success, not after retries. This prevents
    double-counting when the retry token is also refunded on success.
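
The bookkeeping rule can be sketched as follows. `refundOnSuccess` is a hypothetical helper, and the increment value of 1 is an assumption made for this example.

```go
package main

// noRetryIncrement is the number of tokens returned after a request that
// succeeds without retrying. The value 1 is an assumption for this sketch.
const noRetryIncrement = 1

// refundOnSuccess returns the tokens to give back after a successful attempt.
// attempt is 1-based; retryCost is what the last retry withdrew.
func refundOnSuccess(attempt, retryCost int) int {
	if attempt <= 1 {
		return noRetryIncrement // first-attempt success: increment only
	}
	return retryCost // success after a retry: refund only, no increment on top
}
```

A request that succeeds on its second attempt after a 14-token retry returns 14 tokens, not 15.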

Backwards compatibility

  • All behavioral changes are gated behind AWS_NEW_RETRIES_2026=true. Without the flag, the SDK behaves
    identically to before this change.
  • Public constants (DefaultRetryCost, DefaultRetryTimeoutCost) retain their original values.
  • New public API surface (ThrottlingRetryCost, Throttles, LongPolling on StandardOptions; AddWithLongPolling
    wrapper; NewExponentialJitterBackoffWithOptions) is additive only.
  • Existing tests explicitly set the flag to false to ensure they validate legacy behavior.
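
The gating can be sketched as a strict comparison against the string "true". The helper names here are hypothetical (the diff itself calls a function named newRetries2026), and the exact, case-sensitive match is an assumption based on the wording "explicitly set to true".

```go
package main

import "os"

// newRetriesEnvVar is the flag named in the PR description.
const newRetriesEnvVar = "AWS_NEW_RETRIES_2026"

// flagEnabled reports whether the given lookup function yields exactly
// "true" for the flag. Taking the lookup as a parameter keeps it testable.
func flagEnabled(getenv func(string) string) bool {
	return getenv(newRetriesEnvVar) == "true"
}

// newRetriesEnabled reads the flag from the process environment.
func newRetriesEnabled() bool {
	return flagEnabled(os.Getenv)
}
```

A test that must validate legacy behavior could pin the flag with `t.Setenv("AWS_NEW_RETRIES_2026", "false")` from the standard `testing` package before exercising the retryer.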

Testing

  • All standard mode test cases from the cross-SDK specification are implemented in retry2026_test.go, covering:
    basic retries, max attempts, quota exhaustion, quota recovery, exponential backoff, max backoff, throttling
    costs, DynamoDB-specific config, long-polling backoff, x-amz-retry-after header (valid, min-clamped, max-clamped,
    invalid), long-polling with throttle errors, and non-retryable errors.
  • Concurrent shared retry quota test validates correct token bucket behavior across goroutines.
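
The shape of a concurrent quota check might look like the sketch below. It is not the test in retry2026_test.go; `quota`, `tryAcquire`, and `drainConcurrently` are hypothetical, and the capacity of 500 used in the example is an assumption.

```go
package main

import "sync"

// quota is a minimal mutex-guarded token bucket.
type quota struct {
	mu        sync.Mutex
	available int
}

// tryAcquire withdraws cost tokens if enough are available.
func (q *quota) tryAcquire(cost int) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.available < cost {
		return false
	}
	q.available -= cost
	return true
}

// drainConcurrently starts the given number of goroutines, each trying to
// acquire one retry, and returns how many were granted plus what is left.
func drainConcurrently(capacity, cost, workers int) (granted, remaining int) {
	q := &quota{available: capacity}
	results := make(chan bool, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			results <- q.tryAcquire(cost)
		}()
	}
	wg.Wait()
	close(results)
	for ok := range results {
		if ok {
			granted++
		}
	}
	return granted, q.available
}
```

With a capacity of 500 and a cost of 14, 100 concurrent callers are granted exactly 35 retries and 10 tokens remain, regardless of scheduling order.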

@Madrigal Madrigal requested a review from a team as a code owner April 24, 2026 17:29
Comment thread aws/retry/middleware.go
// that time. Potentially early exit if the sleep is canceled via the
// context.
- retryDelay, reqErr := r.retryer.RetryDelay(attemptNum, err)
+ retryDelay, reqErr := r.retryer.RetryDelay(attemptNum-1, err)
Contributor

-1 here why?

Comment thread aws/retry/middleware.go
if newRetries2026() {
longPolling := false
if std, ok := r.retryer.(*Standard); ok {
longPolling = std.IsLongPolling()
Contributor

do we need both the option and the wrap? it seems like this is just a fact of certain operations so i'm not sure why we'd make it configurable

// WithBaseDelay sets the base delay for non-throttle errors.
func WithBaseDelay(d time.Duration) ExponentialJitterBackoffOption {
return func(j *ExponentialJitterBackoff) {
j.baseDelay = d
Contributor

  • am i reading correctly that this and WithThrottleCheck won't be used unless they're in the 2026 path
  • this is only used as a vehicle to customize ddb's values, right? could it be internalized? i don't want to add to the retry API surface unless we absolutely have to

2 participants