
[BUG] Potential Memory Leak in orchestrion.(*contextStack).Push at High RPS #782

@akash-pocketfm

Description


Environment Details

  • Orchestrion Version: v1.8.0
  • dd-trace-go Version: v1.73.1 (gopkg.in/DataDog/dd-trace-go.v1)
  • orchestrion/all/v2 Version: v2.6.0
  • Go Version: 1.24.3
  • Application Type: HTTP API server (Gin framework)
  • Traffic Pattern:
    • Main pods: ~6,000 RPS, 5 pods @ 3 CPU / 3Gi memory
    • Pilot pods: ~250 RPS, 2 pods @ 3 CPU / 3Gi memory

Problem Description

We're experiencing continuous memory growth exclusively on the high-traffic pods (~6,000 RPS), while the low-traffic pods (~250 RPS) remain stable. Heap profiling consistently shows that 64-78% of in-use memory is consumed by:

github.com/DataDog/dd-trace-go/v2/internal/orchestrion.(*contextStack).Push
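For reference, profiles like the ones below can be captured with the standard net/http/pprof endpoint. A minimal sketch of that setup (the port is a placeholder, not our actual configuration):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // Serve pprof on a separate, non-public port alongside the Gin server.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... start the Gin HTTP server as usual ...
}

The inuse_space view above corresponds to go tool pprof -inuse_space against /debug/pprof/heap; the alloc_space figures further down correspond to the -alloc_space flag on the same endpoint.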

Observed Behavior

Heap Profile Data (inuse_space) over time:

  • Sample 1: 31.14 MB (64.01%)
  • Sample 2: 28.59 MB (71.17%)
  • Sample 3: 37.69 MB (75.12%)
  • Sample 4: 33.74 MB (67.02%)
  • Sample 5: 36.14 MB (71.18%)
  • Sample 6: 48.08 MB (78.68%)
  • Sample 7: 31.92 MB (72.10%)

Key Observations:

  1. Memory grows continuously and doesn't stabilize
  2. The issue is RPS-dependent: the ~24x traffic difference between pod groups correlates with whether the memory growth appears
  3. GC cannot keep up at high RPS, suggesting contexts are not being released properly
  4. The problem appears to be a leak or unbounded growth in the context stack
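To substantiate observation 3, heap retention can be watched across GC cycles directly. A small sketch using only the standard library (the interval is arbitrary; GODEBUG=gctrace=1 is a zero-code alternative):

import (
    "log"
    "runtime"
    "time"
)

// logHeapStats periodically logs heap-in-use and the GC cycle count, so we can
// see whether retained memory keeps climbing even though GC keeps running.
func logHeapStats(interval time.Duration) {
    var m runtime.MemStats
    for range time.Tick(interval) {
        runtime.ReadMemStats(&m)
        log.Printf("heap_inuse=%d MiB heap_objects=%d num_gc=%d",
            m.HeapInuse>>20, m.HeapObjects, m.NumGC)
    }
}

Started once at boot, e.g. go logHeapStats(30 * time.Second).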

Allocs Profile Data (alloc_space):

  • orchestrion.(*contextStack).Push: 239-346 MB allocated over the profiling period
  • This represents only 2-2.5% of total allocations, yet it dominates heap retention

Configuration

Build Process:

RUN go install github.com/DataDog/orchestrion@v1.8.0
# "orchestrion pin" records the required instrumentation dependencies in go.mod
RUN orchestrion pin
# "orchestrion go build" wraps the Go toolchain and injects instrumentation at compile time
RUN orchestrion go build -ldflags="-s -w" -o ads cmd/main.go

Middleware Stack (registration order sketched after the list):

  1. Sentry middleware (with tracing enabled, 1% sample rate)
  2. Request ID middleware
  3. Prometheus metrics middleware
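A simplified sketch of the registration order; requestIDMiddleware and metricsMiddleware are placeholders for our actual request-ID and Prometheus middleware, and only the tracing-related Sentry options are shown:

import (
    "log"

    "github.com/getsentry/sentry-go"
    sentrygin "github.com/getsentry/sentry-go/gin"
    "github.com/gin-gonic/gin"
)

func newRouter() *gin.Engine {
    if err := sentry.Init(sentry.ClientOptions{
        // DSN and other options omitted in this sketch.
        EnableTracing:    true,
        TracesSampleRate: 0.01, // 1% Sentry traces sample rate
    }); err != nil {
        log.Fatal(err)
    }

    r := gin.New()
    r.Use(sentrygin.New(sentrygin.Options{})) // 1. Sentry middleware
    r.Use(requestIDMiddleware())              // 2. Request ID middleware (placeholder)
    r.Use(metricsMiddleware())                // 3. Prometheus metrics middleware (placeholder)
    return r
}

// Placeholder stubs; the real implementations attach a request ID and record
// Prometheus metrics respectively.
func requestIDMiddleware() gin.HandlerFunc { return func(c *gin.Context) { c.Next() } }
func metricsMiddleware() gin.HandlerFunc   { return func(c *gin.Context) { c.Next() } }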

Concurrent Instrumentation:

  • Sentry tracing: Enabled with 1% traces sample rate
  • DataDog Orchestrion: Auto-instrumentation enabled

Hypothesis

At high RPS, the contextStack.Push operation appears to be accumulating contexts faster than they can be released. Possible causes:

  1. Context leak: Contexts pushed to the stack are not being properly popped after request completion
  2. Interaction with Sentry: The combination of Sentry's context wrapping + Orchestrion's context wrapping may create a multiplicative effect

Questions for DataDog Team

  1. Is there a known issue with contextStack.Push at high RPS or high concurrency?
  2. What is the expected behavior of the context stack? Should it be bounded or can it grow unbounded?
  3. Are there any cleanup operations we should be calling explicitly to release contexts from the stack?
  4. Is there a known interaction issue between Orchestrion and other tracing libraries (like Sentry)?
  5. Are there any configuration options to limit context stack depth or enable more aggressive cleanup?
  6. Is there a debug mode or additional logging we can enable to trace context lifecycle?

What We're Doing

We've instrumented our code to track the following (a simplified sketch of this middleware is shown after the list):

  • Per-request memory allocations
  • Goroutine count deltas
  • Context depth per request
  • Correlation with Sentry header presence
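Roughly, the tracking middleware looks like the sketch below (names and log format are illustrative, not our exact code; context-depth tracking is omitted here since the orchestrion contextStack lives in an internal package):

import (
    "log"
    "runtime"

    "github.com/gin-gonic/gin"
)

// debugStatsMiddleware logs rough per-request allocation and goroutine deltas,
// plus whether the request carried a sentry-trace header. The deltas include
// work from concurrent requests, so they are only indicative, and
// runtime.ReadMemStats briefly stops the world: enable it only on low-traffic
// pods or for sampled requests.
func debugStatsMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        var before, after runtime.MemStats
        runtime.ReadMemStats(&before)
        goroutinesBefore := runtime.NumGoroutine()

        c.Next()

        runtime.ReadMemStats(&after)
        log.Printf("path=%s alloc_delta=%dB goroutine_delta=%d sentry_trace=%t",
            c.FullPath(),
            after.TotalAlloc-before.TotalAlloc,
            runtime.NumGoroutine()-goroutinesBefore,
            c.GetHeader("sentry-trace") != "")
    }
}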

We can provide additional profiling data or enable debug modes if needed.

Workaround Attempts

We're considering:

  1. Disabling Orchestrion temporarily to confirm whether it is the root cause

Request

Could you please:

  1. Confirm if this is a known issue or provide guidance on proper usage
  2. Suggest configuration changes or code patterns to mitigate the leak
  3. Let us know if you need additional profiling data or debug information

We're happy to collaborate on debugging this issue.


Pprof_heap_session.txt

profile (1).pb.gz
profile (2).pb.gz
