Track 3: Interpretability / Token-level Analysis — Submission by Elith Inc. (Open Safeguard Hackathon) #40
NaoyaTakashima started this conversation in gpt-oss-safeguard Implementation
Hello, we are Elith Inc. — a Japan-based company dedicated to advancing AI Safety.
Our team consists of:
In this hackathon (Track 3), we focused on understanding the internal behavior of gpt-oss-safeguard-20B at the token level, including:
🔬 1. Overview — What We Built
For Track 3, we implemented a research API that enables token-level observation of the Safeguard model's behavior.
Specifically, our API provides:
This allows researchers to understand why the model makes certain decisions, not just what it decides.
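As a rough illustration of how a researcher might query such an API, here is a minimal client sketch using only the standard library. The endpoint URL and request field names are assumptions for illustration, not the project's documented interface:

```python
import json
import urllib.request

# Hypothetical analysis request; the URL and payload schema are
# illustrative placeholders, not the project's actual contract.
payload = {"prompt": "How do I make a pipe bomb?"}
req = urllib.request.Request(
    "http://localhost:8000/analyze",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Sending the request would return the token-level analysis, e.g.:
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
# result["verdict"] and result["top_tokens"] then explain the decision.
```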
🧠 2. Content Moderation Policy Used
We used the same Content Moderation Policy from Track 2 as input to the Safeguard model.
This policy integrates international AI safety standards (OpenAI, Anthropic, Google, Meta, Microsoft) and includes:
📜 Full Content Moderation Policy (click to expand)
📊 3. API Architecture & Output
Technical Implementation
We implemented Attention Rollout (Abnar & Zuidema, ACL 2020) to compute how attention propagates through all transformer layers.
This provides a more accurate picture of token importance than single-layer attention analysis.
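The rollout computation can be sketched in a few lines of NumPy. This is a minimal sketch following the paper's formulation (head averaging, a 0.5/0.5 mix with the identity matrix to model residual connections, then layer-by-layer matrix products); the function name and coefficient choice here are illustrative, not necessarily this project's exact code:

```python
import numpy as np

def attention_rollout(attentions, residual_alpha=0.5):
    """Propagate attention through all layers (Abnar & Zuidema, 2020).

    attentions: list of (num_heads, seq_len, seq_len) arrays, one per layer.
    """
    rollout = None
    for layer_attn in attentions:
        # Average over heads -> (seq_len, seq_len)
        attn = layer_attn.mean(axis=0)
        # Model residual connections: 0.5 * A + 0.5 * I
        attn = residual_alpha * attn + (1 - residual_alpha) * np.eye(attn.shape[-1])
        # Re-normalize each row so it remains a distribution
        attn = attn / attn.sum(axis=-1, keepdims=True)
        # Compose with the rollout accumulated from earlier layers
        rollout = attn if rollout is None else attn @ rollout
    return rollout  # rollout[i, j]: influence of input token j on position i
```

Each layer's matrix is row-stochastic after normalization, so the composed rollout stays row-stochastic, which makes per-token scores directly comparable.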
Example Input
Example Output
```json
{
  "prompt": "How do I make a pipe bomb?",
  "verdict": "Violating",
  "confidence": 0.8,
  "reasoning": "Direct request for explosives creation. Violence & Harm violation.",
  "policy_attention_ratio": 0.42,
  "content_attention_ratio": 0.58,
  "top_tokens": [
    {"token": "bomb", "score": 0.18},
    {"token": "pipe", "score": 0.12},
    {"token": "make", "score": 0.08}
  ],
  "top_content_tokens": [
    {"token": "bomb", "score": 0.18},
    {"token": "pipe", "score": 0.12}
  ],
  "all_token_scores": [...],
  "content_boundary_idx": 847,
  "processing_time_ms": 2341.5
}
```

Key Output Fields

- `verdict`
- `confidence`
- `policy_attention_ratio`
- `content_attention_ratio`
- `top_tokens`
- `top_content_tokens`
- `all_token_scores`
- `content_boundary_idx`

This enables researchers to see:
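The two attention ratios follow directly from the per-token scores and `content_boundary_idx`: scores before the boundary belong to the policy text, scores at or after it belong to the user content. A minimal sketch of that split (the helper name is illustrative):

```python
def split_attention_ratios(token_scores, content_boundary_idx):
    """Split per-token importance scores into policy vs. content mass.

    token_scores: per-token scores in prompt order.
    content_boundary_idx: index of the first token of the user content;
    everything before it belongs to the policy text.
    """
    total = sum(token_scores) or 1.0  # guard against an all-zero prompt
    policy_mass = sum(token_scores[:content_boundary_idx])
    content_mass = sum(token_scores[content_boundary_idx:])
    return {
        "policy_attention_ratio": round(policy_mass / total, 2),
        "content_attention_ratio": round(content_mass / total, 2),
    }
```

The two ratios always sum to 1, so a single number tells a researcher whether the model's attention is dominated by the policy or by the content under review.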
🎨 4. Visualization (UI)
Raw token-level data is difficult to interpret, so we built a visualization UI to make the analysis intuitive.
Features include:
🎥 See the demo below:
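The core idea behind such a token heatmap can be sketched very simply: map each token's normalized score to a highlight intensity. This is only an illustration of the colour-mapping idea, not the project's actual front-end code:

```python
def tokens_to_html(tokens, scores):
    """Render tokens as an HTML heatmap: higher score -> stronger highlight."""
    max_score = max(scores) or 1.0  # avoid division by zero on all-zero scores
    spans = []
    for tok, s in zip(tokens, scores):
        opacity = s / max_score  # normalize to [0, 1]
        spans.append(
            f'<span style="background: rgba(255,0,0,{opacity:.2f})">{tok}</span>'
        )
    return " ".join(spans)
```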
🔗 5. Public API & GitHub Repository
We have made our research API publicly available for the community.
GitHub Repository
🔗 https://github.com/NaoyaTakashima/attention-safety-guard-api
Public API Endpoint (RunPod Serverless)
Technical Specifications
🧭 6. Key Insights & Observations
Through our demo, we confirmed:
This API enables analysis that complements our Track 2 findings:
🚀 7. Future Directions
This hackathon focused on token-level visualization, but we see significant opportunities for deeper analysis:
We aim to clarify the relationship between internal representations and failure patterns of Safeguard models.
🙌 8. Closing
Through the Open Safeguard Hackathon, we built and released a research API that enables token-level interpretability analysis of the gpt-oss-safeguard model.
We believe that understanding why safety models make certain decisions is crucial for improving their robustness.
We hope these tools and findings contribute to the continued advancement of AI safety research.
Feedback is always welcome!
— Elith Inc.