Health Status: ⚠️ Degraded (rate-limit burst recurred for the 3rd time)
The primary issue today is the concurrent API rate-limit burst: 3 `create_issue` safe output operations failed with HTTP 403 during a 3-minute window (12:17–12:20 UTC), when multiple daily workflows completed and triggered safe output jobs simultaneously. This is the third occurrence of the pattern (also seen on 2026-04-02 and 2026-04-07).
### Safe Output Job Statistics

| Job Type | Total Executions | Failures | Success Rate |
|---|---|---|---|
| `create_issue` | 19 | 3 | 84.2% |
| `create_discussion` | 21 | 0 | 100% |
| `add_comment` | 25 | 0 | 100% |
| `create_pull_request` | 20 | 0 | 100% |
| `upload_asset` | 11 | 0 | 100% |
| `noop` | 11 | 0 | 100% |
| `add_labels` | 10 | 0 | 100% |
| `upload_artifact` | 3 | 0 | 100% |
| `update_issue` | 3 | 0 | 100% |
| `assign_to_agent` | 4 | 0 | 100% |
| `submit_pull_request_review` | 1 | 0 | 100%* |
| `dispatch_workflow` | 1 | 0 | 100% |
| `create_pull_request_review_comment` | 2 | 0 | 100%** |

*Warning issued but counted as non-failure (see Cluster 2)
**Skipped (not in PR context)
### Error Clusters
#### Cluster 1: API Rate Limit Exceeded — `create_issue` (HIGH severity)
- **Count**: 3 failures across 2 runs
- **Affected Workflows**: Workflow Health Manager - Meta-Orchestrator, Multi-Device Docs Tester
- **Time Window**: 12:17:39–12:20:17 UTC (3-minute burst)
- **HTTP Response**: `POST /repos/github/gh-aw/issues — 403`
<details>
<summary>Error details (run-24342586738 — Workflow Health Manager)</summary>

```
2026-04-13T12:17:39Z POST /repos/github/gh-aw/issues - 403 with id 2430:FD3B5:A5BFD83:2B114625:69DCDEE3 in 96ms
##[error]✗ Failed to create issue "Daily Semgrep Scan: Failure detected Apr 13 2026" in github/gh-aw:
API rate limit exceeded for installation. Request ID: 2430:FD3B5:A5BFD83:2B114625:69DCDEE3
2026-04-13T12:17:39Z POST /repos/github/gh-aw/issues - 403 with id 2430:3E650:11B2B85:47E47C4:69DCDEE3 in 45ms
##[error]✗ Failed to create issue "Workflow Health Dashboard — Apr 13, 2026" in github/gh-aw:
API rate limit exceeded for installation. Request ID: 2430:3E650:11B2B85:47E47C4:69DCDEE3
Processing Summary: Total messages: 2, Successful: 0, Failed: 2
```
</details>
<details>
<summary>Error details (run-24342635836 — Multi-Device Docs Tester)</summary>
```
2026-04-13T12:20:17Z POST /repos/github/gh-aw/issues - 403 with id 1680:DFCE9:9CFC4D1:28D76E42:69DCDF81 in 93ms
##[error]✗ Failed to create issue "🔍 Multi-Device Docs Testing Report - 2026-04-13" in github/gh-aw:
API rate limit exceeded for installation. Request ID: 1680:DFCE9:9CFC4D1:28D76E42:69DCDF81
Processing Summary: Total messages: 1, Successful: 0, Failed: 1
```
</details>
- **Root Cause**: 4 workflows (Workflow Health Manager, Multi-Device Docs Tester, Delight, GitHub MCP Structural Analysis) all started within a 2-minute window (12:10–12:12 UTC) due to the noon daily schedule. Their safe output jobs executed concurrently, exhausting the GitHub App installation rate limit.
- **Impact**: Three issues were not created — daily semgrep scan failure report, workflow health dashboard, and multi-device docs test report for 2026-04-13 are missing from the issue tracker.
#### Cluster 2: PR Review Context Warning — `submit_pull_request_review` (LOW severity)
- **Count**: 1 warning (non-fatal, Failed: 0)
- **Affected Workflow**: Smoke Copilot ([§24344369056](https://github.com/github/gh-aw/actions/runs/24344369056))
- **Event**: `schedule` (not pull_request)
```
##[warning]No review context set - cannot submit review
##[warning]✗ Failed to submit PR review: No review context available
Failed: 0
```
- **Root Cause**: `submit_pull_request_review` was configured, but the run was triggered by `schedule`, not a PR event. The inline `create_pull_request_review_comment` messages were also skipped (not in PR context). The overall run completed successfully (7/10 messages succeeded).
- **Impact**: Non-fatal. This is the same intermittent pattern seen on 2026-04-04 and 2026-04-06.
### Root Cause Analysis
#### API Rate Limit Pattern
The GitHub App installation has a finite hourly request budget. When many workflows complete concurrently (the noon UTC burst), their safe output jobs race to make API calls. `safe_output_handler_manager.cjs` never retries rate-limited responses: it runs with `retries: 0`, and `retry-exempt-status-codes: 400,401,403,404,422` excludes 403 from retries in any case. As a result, every rate-limit hit is an immediate, unrecoverable failure.
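To make the interaction of the two settings concrete, here is a simplified model of the retry decision. It is an illustration only: `wouldRetry` and the config object shape are hypothetical names, not the code of `actions/github-script` or `safe_output_handler_manager.cjs`.

```javascript
// Simplified, hypothetical model of the retry decision described above.
// An exempt status code is never retried; anything else is retried
// until the retry budget is spent.
function wouldRetry(status, attempt, config) {
  if (config.exemptStatusCodes.includes(status)) return false;
  return attempt < config.retries;
}

// Settings quoted in this report.
const current = { retries: 0, exemptStatusCodes: [400, 401, 403, 404, 422] };
// Raising the retry count alone does not help while 403 stays exempt.
const moreRetries = { ...current, retries: 3 };

console.log(wouldRetry(403, 0, current));     // false
console.log(wouldRetry(403, 0, moreRetries)); // false: 403 is still exempt
console.log(wouldRetry(502, 0, moreRetries)); // true
```

Under this model, a fix has to handle rate-limit 403s separately rather than only raising `retries`.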
Historical occurrence frequency:

| Date | Failures | Window |
|---|---|---|
| 2026-04-02 | 7 | 12:13–12:14 UTC |
| 2026-04-07 | ~1 | unknown |
| 2026-04-13 | 3 | 12:17–12:20 UTC |
#### PR Review Context Warning
The `submit_pull_request_review` handler checks for a PR review context at finalization time. When the run is triggered by a schedule rather than a PR event, inline comments are skipped and the review context is never set. The handler emits a warning but does not count this as a failure. This is expected behavior for smoke tests that run on schedules: the agent submits review-related operations that only make sense in a PR context.
### Recommendations
#### Critical Issues (Immediate Action Required)
**Add retry logic for rate-limited safe output operations**
- **Priority**: High
- **Root Cause**: `safe_output_handler_manager.cjs` uses `actions/github-script` with `retries: 0` and includes `403` in `retry-exempt-status-codes`, bypassing all retry logic for rate-limit responses.
- **Recommended Action**: Implement exponential backoff retry specifically for rate-limit 403 responses (distinguishable from permission 403s by the error message body). Allow up to 3 retries with delays of 30s, 60s, and 120s.
- **Affected**: Workflow Health Manager, Multi-Device Docs Tester, and any workflow that creates issues during the noon burst.

**Stagger noon-UTC schedule triggers to reduce burst**
- **Priority**: Medium
- **Root Cause**: Multiple workflows scheduled as "daily" default to noon UTC, creating a burst of concurrent completions.
- **Recommended Action**: Distribute daily schedules across ±30 minutes from noon (e.g., use cron `15 12 * * *`, `30 12 * * *`, etc.) for non-time-critical workflows.
#### Bug Fixes Required
**`safe_output_handler_manager.cjs`: Rate Limit Retry**
- **File**: `actions/safe_output_handler_manager.cjs` (or `actions/github-script` configuration)
- **Problem**: Rate-limit 403 responses cause immediate, unrecoverable failures with no retry.
- **Fix**: Detect a rate-limit 403 by checking the error message for "API rate limit exceeded for installation"; make up to 3 retry attempts with exponential backoff (30s, 60s, 120s); log each retry attempt, with the remaining rate-limit headers if available.
- **Affected Jobs**: `create_issue`, `add_comment`, `create_discussion`, and all GitHub API write operations.
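The fix described above could look roughly like the following sketch. The names `withRateLimitRetry` and `isRateLimit403`, and the injectable `sleep` and `log` options, are hypothetical and exist only for illustration. The sketch assumes the thrown error carries `status` and `message` properties, as Octokit request errors do.

```javascript
// Hypothetical helper, not existing gh-aw code.
const RETRY_DELAYS_MS = [30_000, 60_000, 120_000];

// A rate-limit 403 is told apart from a permission 403 by its message body.
function isRateLimit403(error) {
  return Boolean(error) && error.status === 403 &&
    String(error.message || "").includes("API rate limit exceeded");
}

async function withRateLimitRetry(operation, options = {}) {
  const delays = options.delaysMs || RETRY_DELAYS_MS;
  const sleep = options.sleep ||
    ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  const log = options.log || console.log;

  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Permission 403s and every other error still fail immediately.
      // A rate-limit 403 fails once all configured delays are used up.
      if (!isRateLimit403(error) || attempt >= delays.length) throw error;
      log(`${new Date().toISOString()} rate limited; retry ` +
        `${attempt + 1}/${delays.length} in ${delays[attempt]}ms`);
      await sleep(delays[attempt]);
    }
  }
}
```

Inside a `github-script` step, a call might then be wrapped as `await withRateLimitRetry(() => github.rest.issues.create({ owner, repo, title, body }))`. Making `sleep` injectable keeps the helper testable without real 30-second waits.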
#### Configuration Changes
**Schedule diversification for noon burst workflows**
- **Current**: Multiple workflows at `daily` → noon UTC
- **Recommended**: Assign explicit staggered cron expressions (e.g., `5 12`, `15 12`, `25 12`) to spread safe output calls over a wider window.
- **Reason**: 3 recurrences of the rate-limit burst show that this is a systemic scheduling issue.
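As a rough sketch of how such staggered expressions might be assigned, the helper below gives each workflow its own minute within the noon hour. The function name `staggeredCrons` and its defaults (start at minute 5, step by 10) are assumptions chosen to reproduce the example values above; the workflow names are the four listed in Cluster 1.

```javascript
// Hypothetical helper: one cron slot per workflow inside a single hour (UTC).
function staggeredCrons(workflowNames, startMinute = 5, stepMinutes = 10, hour = 12) {
  return workflowNames.map((name, index) => {
    const minute = startMinute + index * stepMinutes;
    if (minute > 59) {
      throw new RangeError(`no free slot for "${name}" in hour ${hour}`);
    }
    // Five-field cron: minute hour day-of-month month day-of-week.
    return { name, cron: `${minute} ${hour} * * *` };
  });
}

const assignments = staggeredCrons([
  "Workflow Health Manager",
  "Multi-Device Docs Tester",
  "Delight",
  "GitHub MCP Structural Analysis",
]);
for (const { name, cron } of assignments) {
  console.log(`${cron}  ${name}`);
}
```

A fixed, documented step makes it easy to see which slots are taken when a new daily workflow is added.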
#### Process Improvements
**Rate limit monitoring and alerting**
- **Current State**: Rate limit hits are silent failures; there is no proactive alerting.
- **Proposed**: Monitor GitHub API rate limit headers (`x-ratelimit-remaining`) in safe output jobs; emit a warning annotation when the remaining budget drops below a threshold (e.g., 10% of the hourly limit); create a noop or `missing_tool` report when rate limits are the root cause so the issue is surfaced in the health monitor.
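A minimal sketch of the threshold check follows. It assumes the response headers are available as a plain object with lower-case keys and that the limit is read from `x-ratelimit-limit`; the function name `rateLimitWarning` is hypothetical. The returned string uses the GitHub Actions `::warning::` workflow command, so printing it to the job log produces an annotation.

```javascript
// Hypothetical helper: returns a warning annotation when the remaining
// request budget is below the given fraction of the limit, else null.
function rateLimitWarning(headers, thresholdFraction = 0.1) {
  const limit = Number(headers["x-ratelimit-limit"]);
  const remaining = Number(headers["x-ratelimit-remaining"]);
  // Missing or malformed headers: nothing to report.
  if (!Number.isFinite(limit) || !Number.isFinite(remaining) || limit <= 0) {
    return null;
  }
  if (remaining >= limit * thresholdFraction) return null;
  return `::warning::GitHub API rate limit low: ${remaining}/${limit} requests remaining`;
}

// Example header values are illustrative, not taken from the failed runs.
const warning = rateLimitWarning({
  "x-ratelimit-limit": "5000",
  "x-ratelimit-remaining": "400",
});
if (warning) console.log(warning);
```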
**PR review context guard**
- **Current State**: Smoke Copilot emits a non-fatal warning when `submit_pull_request_review` is called in a schedule context.
- **Proposed**: The agent or workflow config should guard PR-review operations with an event-type check. The `create_pull_request_review_comment` config could include `require_pr_context: true` to suppress warnings and cleanly skip when not in PR context.
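One possible shape for the event-type check is sketched below. It assumes a `context` object like the one `actions/github-script` provides (`eventName` and `payload`); the function name `hasPullRequestContext` and the exact set of events are assumptions, not existing gh-aw behavior.

```javascript
// Events whose payload carries a pull_request object.
const PR_EVENTS = new Set([
  "pull_request",
  "pull_request_target",
  "pull_request_review",
  "pull_request_review_comment",
]);

// Hypothetical guard: true only when the run was triggered by a PR event
// and the payload actually contains the pull request.
function hasPullRequestContext(context) {
  return PR_EVENTS.has(context.eventName) &&
    Boolean(context.payload && context.payload.pull_request);
}

// A scheduled run, like the Smoke Copilot run in Cluster 2, is skipped cleanly.
console.log(hasPullRequestContext({ eventName: "schedule", payload: {} })); // false
console.log(hasPullRequestContext({
  eventName: "pull_request",
  payload: { pull_request: { number: 1 } },
})); // true
```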
### Work Item Plans
#### Work Item 1: Retry Logic for Rate-Limited Safe Output Operations
- **Type**: Bug Fix
- **Priority**: High
- **Description**: The safe output handler fails immediately on HTTP 403 rate-limit responses, with no retry. The burst has now caused data loss (issues not created) on 3 calendar dates.
- **Acceptance Criteria**:
  - Rate-limit 403 responses (identified by message body) trigger a retry with exponential backoff
  - Up to 3 retries are attempted before final failure
  - Each retry is logged with timestamp and attempt number
  - Non-rate-limit 403s (permission errors) still fail immediately without retry
  - Rate limit recovery succeeds in a test scenario
- **Technical Approach**: In `safe_output_handler_manager.cjs`, wrap GitHub API calls with a retry helper that checks `error.message.includes('API rate limit exceeded')`. Use `setTimeout` with 30s/60s/120s delays. Consider checking the `x-ratelimit-reset` header for the exact reset time.
- **Estimated Effort**: Small
- **Dependencies**: None
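For the `x-ratelimit-reset` idea in the technical approach, the wait could be derived from the header instead of a fixed delay. The sketch below assumes the header holds the reset time as UTC epoch seconds, as GitHub's REST API documents, and that headers arrive as a plain object with lower-case keys; `delayUntilResetMs` and the one-second cushion are illustrative choices.

```javascript
// Hypothetical helper: milliseconds to wait until the rate limit resets,
// or fallbackMs when the header is missing or malformed.
function delayUntilResetMs(headers, fallbackMs, nowMs = Date.now()) {
  const resetSeconds = Number(headers && headers["x-ratelimit-reset"]);
  if (!Number.isFinite(resetSeconds) || resetSeconds <= 0) return fallbackMs;
  const waitMs = resetSeconds * 1000 - nowMs;
  // Never wait a negative time; add 1s so the retry lands after the reset.
  return Math.max(waitMs, 0) + 1000;
}

// Reset is 10 seconds ahead of "now": wait 10s plus the 1s cushion.
console.log(delayUntilResetMs({ "x-ratelimit-reset": "1000" }, 30000, 990000)); // 11000
// No header: fall back to the fixed backoff delay.
console.log(delayUntilResetMs({}, 30000, 990000)); // 30000
```

Because the budget is hourly, the computed wait can be long; a caller would likely cap it and fail the job rather than hold a runner until the reset.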
#### Work Item 2: Schedule Stagger for Noon-UTC Burst Workflows
- **Type**: Configuration Enhancement
- **Priority**: Medium
- **Description**: Multiple workflows trigger at noon UTC (the "daily" default), creating a burst of concurrent API calls in safe output jobs.
- **Acceptance Criteria**:
  - No more than 3–4 workflows complete safe output jobs in any 5-minute window
  - Noon-UTC workflows are spread across a 30–60 minute window
  - The rate limit burst does not recur on the next audit
- **Technical Approach**: Audit all workflows with `schedule: daily` and assign explicit staggered cron expressions distributed across 11:45–12:45 UTC. Document the assignment convention to prevent future clumping.
- **Estimated Effort**: Small
- **Dependencies**: Work Item 1 (as a backstop even if staggering helps)
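The first acceptance criterion can be checked mechanically once completion times are collected. The sketch below is one way to do it, assuming times are given as minutes past midnight UTC; `maxInWindow` is a hypothetical name and the sample times are illustrative, not measured data.

```javascript
// Hypothetical check: the largest number of completions that fall inside
// any sliding window of windowMinutes (window is half-open: [t, t + window)).
function maxInWindow(timesMinutes, windowMinutes = 5) {
  const sorted = [...timesMinutes].sort((a, b) => a - b);
  let best = 0;
  let start = 0;
  for (let end = 0; end < sorted.length; end++) {
    // Drop completions that are a full window or more before the current one.
    while (sorted[end] - sorted[start] >= windowMinutes) start++;
    best = Math.max(best, end - start + 1);
  }
  return best;
}

// Clumped completions (12:10, 12:11, 12:12, 12:20, 12:30).
console.log(maxInWindow([730, 731, 732, 740, 750])); // 3
// Staggered completions ten minutes apart.
console.log(maxInWindow([725, 735, 745, 755])); // 1
```

A result above the agreed limit (3–4 in the criteria above) would flag the schedule for further staggering.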
### Historical Context
Daily health trend (last 7 audited days)

| Date | Hard Failures | Overall Health | Notes |
|---|---|---|---|
| 2026-04-05 | 0 | Excellent | 4th zero-failure day |
| 2026-04-06 | 0 | Healthy | 1 PR review context warning |
| 2026-04-07 | 1 | Degraded | Rate limit burst (1 failure) |
| 2026-04-08 | 0 | Excellent | 100% success |
| 2026-04-09 | n/a | — | No audit |
| 2026-04-10 | 2 | Degraded | upload_artifact staging + resolve_review_thread |
| 2026-04-11 | 1 | Degraded | upload_artifact staging missing (day 2) |
| 2026-04-12 | n/a | — | No audit |
| 2026-04-13 | 3 | ⚠️ Degraded | Rate limit burst (3rd recurrence) |
- **Error rate trend**: Unstable. Good and bad days alternate, with the recurring rate limit as the dominant systemic issue.
- **Most reliable job types**: `add_comment`, `create_discussion`, `create_pull_request`, `upload_asset` (all 100% today).
- **Most problematic job type**: `create_issue`, which accounts for all 3 failures today and is consistently the job most affected by rate limiting.
- **Recurring issue**: `api_rate_limit_concurrent_burst` (3 dates, 10 total failures). No fix has been implemented despite appearing in recommendations on 2026-04-02.