Environment Details
- Orchestrion Version: v1.8.0
- dd-trace-go Version: v1.73.1 (gopkg.in/DataDog/dd-trace-go.v1)
- orchestrion/all/v2 Version: v2.6.0
- Go Version: 1.24.3
- Application Type: HTTP API server (Gin framework)
- Traffic Pattern:
  - Main pods: ~6,000 RPS, 5 pods @ 3 CPU / 3Gi memory
  - Pilot pods: ~250 RPS, 2 pods @ 3 CPU / 3Gi memory
Problem Description
We're experiencing continuous memory growth exclusively on the high-traffic pods (~6,000 RPS), while the low-traffic pods (~250 RPS) remain stable. Heap profiling consistently shows that 64-78% of in-use memory is attributed to:
github.com/DataDog/dd-trace-go/v2/internal/orchestrion.(*contextStack).Push
Observed Behavior
Heap Profile Data (inuse_space) over time (a capture sketch follows the samples):
- Sample 1: 31.14 MB (64.01%)
- Sample 2: 28.59 MB (71.17%)
- Sample 3: 37.69 MB (75.12%)
- Sample 4: 33.74 MB (67.02%)
- Sample 5: 36.14 MB (71.18%)
- Sample 6: 48.08 MB (78.68%)
- Sample 7: 31.92 MB (72.10%)
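For reference, comparable inuse_space snapshots can be captured with the standard runtime/pprof API. This is a minimal sketch; the one-minute interval and file names are illustrative, not necessarily how the attached profiles were produced:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	// Take 7 snapshots, one per minute (interval is illustrative).
	for i := 1; i <= 7; i++ {
		time.Sleep(time.Minute)
		runtime.GC() // force a GC so inuse_space reflects live objects only
		f, err := os.Create(fmt.Sprintf("heap_%d.pb.gz", i))
		if err != nil {
			panic(err)
		}
		if err := pprof.WriteHeapProfile(f); err != nil {
			panic(err)
		}
		f.Close()
	}
}
```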
Key Observations:
- Memory grows continuously and doesn't stabilize
- The issue is RPS-dependent: the ~24x traffic difference determines whether the growth appears
- GC cannot keep up at high RPS, suggesting contexts are not being released properly
- The problem appears to be a leak or unbounded growth in the context stack
Allocs Profile Data (alloc_space):
- orchestrion.(*contextStack).Push: 239-346 MB allocated over the profiling period
- This represents only 2-2.5% of total allocations, yet it dominates heap retention
Configuration
Build Process:
```dockerfile
# Install the Orchestrion CLI
RUN go install github.com/DataDog/orchestrion@v1.8.0
# Record the required Orchestrion/dd-trace-go dependencies in go.mod
RUN orchestrion pin
# Build with compile-time auto-instrumentation injected
RUN orchestrion go build -ldflags="-s -w" -o ads cmd/main.go
```
Middleware Stack:
- Sentry middleware (with tracing enabled, 1% sample rate)
- Request ID middleware
- Prometheus metrics middleware
Concurrent Instrumentation (sketched below):
- Sentry tracing: Enabled with 1% traces sample rate
- DataDog Orchestrion: Auto-instrumentation enabled
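For context, the middleware registration looks roughly like the following. This is a simplified sketch: the DSN is a placeholder and the handler is illustrative, but the Sentry options mirror the 1% sample rate described above.

```go
package main

import (
	"github.com/getsentry/sentry-go"
	sentrygin "github.com/getsentry/sentry-go/gin"
	"github.com/gin-gonic/gin"
)

func main() {
	// Sentry tracing at a 1% sample rate, as described above.
	_ = sentry.Init(sentry.ClientOptions{
		Dsn:              "https://examplePublicKey@o0.ingest.sentry.io/0", // placeholder
		EnableTracing:    true,
		TracesSampleRate: 0.01,
	})

	r := gin.New()
	r.Use(sentrygin.New(sentrygin.Options{}))
	// Request-ID and Prometheus middlewares elided; both wrap the request
	// context, so each handler invocation already sees several context
	// layers before Orchestrion's injected instrumentation runs.
	r.GET("/healthz", func(c *gin.Context) { c.Status(200) })
	_ = r.Run(":8080")
}
```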
Hypothesis
At high RPS, the contextStack.Push operation appears to accumulate contexts faster than they are released; a simplified sketch of this pattern follows the list below. Possible causes:
- Context leak: Contexts pushed to the stack are not being properly popped after request completion
- Interaction with Sentry: The combination of Sentry's context wrapping + Orchestrion's context wrapping may create a multiplicative effect
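To make the hypothesis concrete, here is a deliberately simplified model of the push/pop pattern we suspect. This is our illustration only, not dd-trace-go's actual contextStack implementation:

```go
package main

import "fmt"

// contextStack is a simplified stand-in for a goroutine-local stack of
// context values. The real type in dd-trace-go's internal/orchestrion
// package differs in detail.
type contextStack struct{ values []any }

func (s *contextStack) Push(v any) { s.values = append(s.values, v) }

func (s *contextStack) Pop() any {
	if len(s.values) == 0 {
		return nil
	}
	v := s.values[len(s.values)-1]
	s.values = s.values[:len(s.values)-1]
	return v
}

func main() {
	s := &contextStack{}
	for req := 0; req < 3; req++ {
		s.Push(fmt.Sprintf("ctx-%d", req))
		// If this Pop is skipped on some path (early return, panic,
		// goroutine reuse without cleanup), the stack retains one entry
		// per request: rapid growth at 6,000 RPS, while the same
		// imbalance may look stable at 250 RPS over a short window.
		_ = s.Pop()
	}
	fmt.Println("depth after balanced push/pop:", len(s.values)) // 0
}
```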
Questions for DataDog Team
- Is there a known issue with contextStack.Push at high RPS or high concurrency?
- What is the expected behavior of the context stack? Should it be bounded or can it grow unbounded?
- Are there any cleanup operations we should be calling explicitly to release contexts from the stack?
- Is there a known interaction issue between Orchestrion and other tracing libraries (like Sentry)?
- Are there any configuration options to limit context stack depth or enable more aggressive cleanup?
- Is there a debug mode or additional logging we can enable to trace context lifecycle?
What We're Doing
We've instrumented our code to track the following (sketched after the list):
- Per-request memory allocations
- Goroutine count deltas
- Context depth per request
- Correlation with Sentry header presence
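A simplified sketch of that tracking middleware, assuming Gin. Note that runtime.ReadMemStats deltas are noisy under concurrency (TotalAlloc is process-wide), so we treat the numbers as indicative only; context-depth tracking is elided here:

```go
package main

import (
	"log"
	"runtime"

	"github.com/gin-gonic/gin"
)

// memTrackingMiddleware logs per-request allocation and goroutine deltas.
// runtime.ReadMemStats stops the world, so in the real deployment we only
// enable this on a sampled fraction of requests.
func memTrackingMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		var before, after runtime.MemStats
		runtime.ReadMemStats(&before)
		goroutinesBefore := runtime.NumGoroutine()

		c.Next() // run the rest of the middleware chain and the handler

		runtime.ReadMemStats(&after)
		log.Printf(
			"path=%s alloc_delta=%d goroutine_delta=%d sentry_trace=%t",
			c.FullPath(),
			after.TotalAlloc-before.TotalAlloc,
			runtime.NumGoroutine()-goroutinesBefore,
			c.GetHeader("sentry-trace") != "", // correlate with Sentry header presence
		)
	}
}

func main() {
	r := gin.New()
	r.Use(memTrackingMiddleware())
	r.GET("/ping", func(c *gin.Context) { c.String(200, "pong") })
	log.Fatal(r.Run(":8080"))
}
```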
We can provide additional profiling data or enable debug modes if needed.
Workaround Attempts
We're considering:
- Disabling Orchestrion temporarily to confirm it's the root cause (control build below)
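The control build would simply drop the Orchestrion steps from the Dockerfile shown above, keeping the flags identical:

```dockerfile
RUN go build -ldflags="-s -w" -o ads cmd/main.go
```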
Request
Could you please:
- Confirm if this is a known issue or provide guidance on proper usage
- Suggest configuration changes or code patterns to mitigate the leak
- Let us know if you need additional profiling data or debug information
We're happy to collaborate on debugging this issue.
Attachments:
- Pprof_heap_session.txt
- profile (1).pb.gz
- profile (2).pb.gz