
fix(admin): keep Cumulative Users scan alive on Cloud Run #6679

Merged: kodjima33 merged 1 commit into main from fix/cumulative-users-oom-prod on Apr 15, 2026

Conversation

@kodjima33 (Collaborator)

Summary

Cumulative Users chart was showing "no data available" on production. Cloud Run logs showed repeated Uncaught signal: 6, pid=1 (SIGABRT) + "HTTP response was malformed" around every /api/omi/stats/daily-new-users hit — the Node process was OOM-killed while iterating ~112K Firebase Auth users on a 512Mi container.

Fixes

  • Persist to Redis (30 min TTL) so only one instance ever pays the full scan cost per window. Subsequent requests (including cold starts on other instances) read the cached series instantly.
  • Yield between pages with setImmediate so V8 can collect the previous batch of UserRecord objects before the next listUsers() call arrives, keeping peak memory flat across the scan.
  • Bump Cloud Run memory to --memory=1Gi --cpu=1 in gcp_admin.yml. The live revision was already hot-patched via gcloud run services update, so production was back up before this PR merged.
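The yield-between-pages pattern from the second fix can be sketched as below. This is an illustrative TypeScript sketch, not the PR's actual route.ts: `fetchPage` stands in for Firebase Admin's `auth.listUsers()` pagination, and `bucketByDay` is a hypothetical helper for folding creation timestamps into the daily series.

```typescript
// Illustrative sketch: fetchPage stands in for Firebase Admin's listUsers();
// bucketByDay is a hypothetical helper, not code from the PR.

// Yield to the event loop so V8 can collect the previous page's UserRecord
// objects before the next page is fetched.
const yieldToEventLoop = (): Promise<void> =>
  new Promise((resolve) => setImmediate(resolve));

// Bucket ISO creation timestamps into per-day (UTC) signup counts.
function bucketByDay(timestamps: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const ts of timestamps) {
    const day = new Date(ts).toISOString().slice(0, 10); // "YYYY-MM-DD"
    counts.set(day, (counts.get(day) ?? 0) + 1);
  }
  return counts;
}

interface Page {
  timestamps: string[];   // creationTime of each user on this page
  nextPageToken?: string; // absent on the final page
}

// Scan every page, folding counts into one series and yielding between
// pages so peak memory stays roughly one page's worth of records.
async function scanAllPages(
  fetchPage: (pageToken?: string) => Promise<Page>,
): Promise<Map<string, number>> {
  const counts = new Map<string, number>();
  let pageToken: string | undefined;
  do {
    const page = await fetchPage(pageToken);
    for (const [day, n] of bucketByDay(page.timestamps)) {
      counts.set(day, (counts.get(day) ?? 0) + n);
    }
    pageToken = page.nextPageToken;
    await yieldToEventLoop(); // give GC a chance before the next fetch
  } while (pageToken);
  return counts;
}
```

Without the yield, an `await`-only loop over ~112 pages can keep every batch reachable long enough that heap growth outpaces collection on a small container; the explicit `setImmediate` break lets pending GC and I/O callbacks run between pages.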

Test plan

  • After merge, confirm gcp_admin.yml deploy uses --memory=1Gi
  • Load /dashboard/analytics in production, Cumulative Users chart renders with all three filter windows
  • gcloud logging read ... severity>=ERROR shows no signal-6 crashes after deploy

🤖 Generated with Claude Code

The Cumulative Users chart was showing "no data available" on
production because the daily-new-users route was OOM-killing the
Cloud Run container (SIGABRT / signal 6) while iterating through
~112K Firebase Auth users. 512Mi wasn't enough headroom for Next.js
plus the listUsers() cursor, and the process died mid-response.

Three fixes:
- Persist the computed daily series to Redis under a 30 minute TTL
  so only one instance ever pays the full scan cost per window —
  subsequent requests (including cold starts on other instances)
  read the cached series instantly.
- Yield to the event loop between listUsers() pages so V8 can
  collect the previous batch of UserRecord objects before the next
  one arrives, keeping peak memory flat across the scan.
- Bump the Cloud Run revision to --memory=1Gi --cpu=1 in the deploy
  workflow as a safety margin. The live service was already
  hot-patched to 1Gi so production stays up before this workflow
  runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kodjima33 kodjima33 merged commit 7be0ac9 into main Apr 15, 2026
2 checks passed
@kodjima33 kodjima33 deleted the fix/cumulative-users-oom-prod branch April 15, 2026 21:42
@greptile-apps (Contributor)

greptile-apps bot commented Apr 15, 2026

Greptile Summary

This PR fixes a production OOM crash (SIGABRT/signal-6) in the /api/omi/stats/daily-new-users route by introducing three-layer caching (in-memory shadow → Redis → fresh build), adding setImmediate yields between Firebase Auth listUsers pages to aid GC, and bumping the Cloud Run service memory from 512Mi to 1Gi. Redis was already wired up in the deployment secrets, so the new cache layer should work without any infrastructure changes beyond the memory bump.

Confidence Score: 5/5

Safe to merge — fixes a confirmed production OOM with no correctness, security, or data-integrity concerns.

All three layers of the fix are correct: the in-memory/Redis/rebuild ordering is intentional and handles cross-instance cache population, the setImmediate yield correctly allows V8 to GC each page batch, and the Redis helpers are fail-open matching the existing codebase pattern. Redis credentials were already present in the deployment secrets. No P0/P1 findings.
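A fail-open helper pair in that spirit might look like the following. The `getJsonCache`/`setJsonCache` names come from the review summary; the client interface and error-handling details here are assumptions for illustration, not the actual web/admin/lib/redis.ts.

```typescript
// Hedged sketch of fail-open JSON cache helpers. The real redis.ts presumably
// wraps a concrete Redis client; here the client is a minimal assumed interface.
interface JsonCacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, opts?: { EX: number }): Promise<unknown>;
}

// Fail-open read: any Redis error is logged and treated as a cache miss,
// so a Redis outage degrades to a fresh build instead of a 500.
async function getJsonCache<T>(client: JsonCacheClient, key: string): Promise<T | null> {
  try {
    const raw = await client.get(key);
    return raw ? (JSON.parse(raw) as T) : null;
  } catch (err) {
    console.error(`redis get failed for ${key}`, err);
    return null;
  }
}

// Fail-open write: a failed set never breaks the request that computed the value.
async function setJsonCache(
  client: JsonCacheClient,
  key: string,
  value: unknown,
  ttlSeconds: number,
): Promise<void> {
  try {
    await client.set(key, JSON.stringify(value), { EX: ttlSeconds });
  } catch (err) {
    console.error(`redis set failed for ${key}`, err);
  }
}
```

Fail-open is the right default for a derived-data cache like this: losing Redis costs one extra Firebase scan per instance per window, whereas fail-closed would take the whole chart down with it.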

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| web/admin/app/api/omi/stats/daily-new-users/route.ts | Redis cache layer added between in-memory shadow and full Firebase scan; setImmediate yield between listUsers pages for GC; TTL constants correctly aligned at 30 min. |
| web/admin/lib/redis.ts | New getJsonCache/setJsonCache helpers added; both are fail-open (errors logged and swallowed), consistent with the existing invalidateEnforcementCache pattern. |
| .github/workflows/gcp_admin.yml | Cloud Run flags bumped to --memory=1Gi --cpu=1 with an explanatory comment; no other deployment logic changed. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Route as route.ts (getSeries)
    participant Mem as In-Memory Cache
    participant Redis as Redis (30 min TTL)
    participant FB as Firebase Auth

    Client->>Route: GET /api/omi/stats/daily-new-users
    Route->>Mem: cachedSeries fresh?
    alt In-memory hit (< 30 min)
        Mem-->>Route: CachedSeries
        Route-->>Client: JSON response
    else In-memory miss/stale
        Route->>Redis: getJsonCache(REDIS_KEY)
        alt Redis hit (generatedAt < 30 min)
            Redis-->>Route: CachedSeries
            Route->>Mem: update cachedSeries
            Route-->>Client: JSON response
        else Redis miss/stale
            alt pendingBuild running?
                Route-->>Route: await existing pendingBuild
            else No pending build
                Route->>FB: listUsers(1000, pageToken) x N pages
                Note over Route,FB: setImmediate yield between pages (GC)
                FB-->>Route: UserRecord batches
                Route->>Redis: setJsonCache(series, 30 min TTL)
                Route->>Mem: update cachedSeries
                Route-->>Client: JSON response
            end
        end
    end
```
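The read path in the diagram can be sketched in TypeScript as follows. The names `cachedSeries`, `pendingBuild`, and the 30-minute TTL come from the review summary; the helper signatures and module-level state are assumptions for illustration, not the actual route.ts.

```typescript
// Hedged sketch of the three-layer read path:
// in-memory shadow -> Redis -> fresh build with single-flight dedupe.
const TTL_MS = 30 * 60 * 1000;

interface CachedSeries {
  generatedAt: number; // epoch ms when the series was built
  days: { date: string; count: number }[];
}

let cachedSeries: CachedSeries | null = null;          // per-instance shadow
let pendingBuild: Promise<CachedSeries> | null = null; // single-flight guard

async function getSeries(
  redisGet: () => Promise<CachedSeries | null>,
  redisSet: (s: CachedSeries) => Promise<void>,
  buildFresh: () => Promise<CachedSeries>,
): Promise<CachedSeries> {
  const now = Date.now();
  // Layer 1: in-memory hit avoids even a Redis round trip.
  if (cachedSeries && now - cachedSeries.generatedAt < TTL_MS) {
    return cachedSeries;
  }
  // Layer 2: Redis hit repopulates the local shadow (covers cold starts).
  const fromRedis = await redisGet();
  if (fromRedis && now - fromRedis.generatedAt < TTL_MS) {
    cachedSeries = fromRedis;
    return fromRedis;
  }
  // Layer 3: build at most once; concurrent requests await the same promise.
  if (!pendingBuild) {
    pendingBuild = buildFresh()
      .then(async (series) => {
        cachedSeries = series;
        await redisSet(series);
        return series;
      })
      .finally(() => {
        pendingBuild = null;
      });
  }
  return pendingBuild;
}
```

The single-flight guard matters on Cloud Run: without it, a burst of requests after a cold start would each kick off its own full listUsers scan, recreating the memory pressure the cache is meant to remove.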

Reviews (1): Last reviewed commit: "fix(admin): keep Cumulative Users scan a..."
