CI Failure Tracker

A comprehensive suite of tools for tracking and analyzing OpenShift CI test failures.

Overview

This repository contains two complementary tools:

Dashboard - Web-based CI test health tracking with pass rate analytics and weekly reports
ReportPortal to Jira Bridge - Automated failure tracking and ticket creation

Both tools are generic and customizable for any QE team tracking OpenShift CI tests.

Quick Start for Your Team

This tool works for any QE team running periodic CI tests. To use it:

Option 1: Use the Dashboard (Recommended)

1. Fork this repository to your GitHub account/org

2. Choose an example configuration:

# See examples for Networking, Storage, or other teams
cp examples/networking-team-config.yaml dashboard/config.yaml
# OR customize from scratch - see dashboard/config.yaml

3. Edit the configuration for your team:
- Update job_patterns to match your periodic CI jobs
- Set versions (e.g., 4.21, 4.22, 4.23)
- Choose platforms (aws, gcp, azure, vsphere, etc.)

4. Deploy:
- OpenShift: See dashboard/openshift/README.md
- Local: cd dashboard && pip install -r requirements.txt && ./dashboard.py serve

Option 2: Use the Jira Bridge

See CONTRIBUTING.md for detailed setup instructions.

Need Help?

Examples: Check examples/ directory for sample configurations
Documentation: See CONTRIBUTING.md for detailed customization guide
Issues: Open an issue if you need assistance

Example job patterns for different teams:

WinC: periodic-ci-*-winc-*
Networking: periodic-ci-*-network-*, periodic-ci-*-ovn-*
Storage: periodic-ci-*-storage-*, periodic-ci-*-csi-*
Your team: Find your jobs at https://prow.ci.openshift.org/
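
To check which team's patterns a given Prow job would fall under before editing job_patterns, a small glob-matching helper can help. This is only a sketch: it assumes the patterns use shell-style wildcards (which Python's fnmatch handles), and the TEAM_PATTERNS mapping and teams_for_job function are hypothetical names, not part of the tool.

```python
from fnmatch import fnmatchcase

# Patterns copied from the examples above; the mapping itself is illustrative.
TEAM_PATTERNS = {
    "winc": ["periodic-ci-*-winc-*"],
    "networking": ["periodic-ci-*-network-*", "periodic-ci-*-ovn-*"],
    "storage": ["periodic-ci-*-storage-*", "periodic-ci-*-csi-*"],
}


def teams_for_job(job_name: str) -> list[str]:
    """Return every team whose patterns match the given job name."""
    return [
        team
        for team, patterns in TEAM_PATTERNS.items()
        if any(fnmatchcase(job_name, pattern) for pattern in patterns)
    ]
```

Calling teams_for_job with a job name copied from Prow shows whether a pattern is too narrow (no match) or too broad (several teams match).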

Architecture

┌─────────────────┐
│  ReportPortal   │ ← Prow periodic jobs report results
│   (Data Source) │
└────────┬────────┘
         │ API queries
         ↓
┌─────────────────┐
│ Failure Tracker │ ← Python script running periodically
│     Script      │
└────────┬────────┘
         │
         ├─→ Query ReportPortal API for recent failures
         ├─→ Filter by: version (4.20, 4.21, 4.22), test type (winc)
         ├─→ Analyze failure patterns
         ├─→ Check for existing Jira tickets
         │
         ↓
┌─────────────────┐
│   Jira (WINC)   │ ← Auto-create tickets for new failures
│   Project       │
└─────────────────┘

Features

One Ticket Per Test Case: Creates a single Jira ticket for each unique test (e.g., OCP-11111)
Multi-Platform Aggregation: If OCP-11111 fails on AWS, GCP, and Azure, all failures appear in ONE ticket
Comprehensive Failure Links: Lists every failure instance with direct links to ReportPortal logs
Platform Breakdown: Shows failure count per platform (AWS: 3 failures, GCP: 2 failures, etc.)
Version Tracking: Track failures across multiple OpenShift versions (4.19, 4.20, 4.21, 4.22)
Team Configuration: YAML-based per-team configuration (jobs, platforms, thresholds)
Server-Side Filtering: Efficient API queries with status and name filtering
Configurable CLI: Adjust page size, max pages, and workers via command-line options
SAML/OAuth Authentication: Optional Red Hat SSO integration for dashboard access control (see SAML Authentication)

Data Flow

1. Query ReportPortal

GET /api/v1/{projectName}/launch
Filter by:
- Launch name pattern: periodic-ci-openshift-openshift-tests-private-release-*-winc-*
- Status: FAILED
- Time range: Last 7 days
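
The query above could be assembled roughly as follows. This is a sketch under assumptions: the filter keys (filter.cnt.name, filter.eq.status, filter.gte.startTime) and paging keys follow ReportPortal's filter conventions as I understand them and should be checked against the API docs for your server version. build_launch_query is a hypothetical helper, and because a "contains" filter does not understand wildcards, full pattern matching would still happen client-side.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode


def build_launch_query(base_url, project, name_fragment, days=7,
                       page=1, page_size=150, now=None):
    """Build the launch-listing URL for failed launches in the lookback window."""
    now = now or datetime.now(timezone.utc)
    since_ms = int((now - timedelta(days=days)).timestamp() * 1000)
    params = {
        "filter.cnt.name": name_fragment,   # assumed: substring match on launch name
        "filter.eq.status": "FAILED",       # assumed: exact match on status
        "filter.gte.startTime": since_ms,   # assumed: lower bound in epoch milliseconds
        "page.page": page,
        "page.size": page_size,
    }
    return f"{base_url.rstrip('/')}/api/v1/{project}/launch?{urlencode(params)}"
```

The function only builds the URL; sending the request (with the API token in an Authorization header) is left to whichever HTTP client the script uses.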

2. Extract Failed Tests

GET /api/v1/{projectName}/item/{launchId}
For each failed launch:
- Get test items with status FAILED
- Extract test name, error message, stack trace
- Identify test file and line number
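
A minimal sketch of the extraction step, assuming each test item arrives as a dictionary with "name", "status", and "error_message" keys. That shape is a simplification for illustration and not the actual ReportPortal response schema; FailedTest and extract_failed_tests are hypothetical names.

```python
import re
from dataclasses import dataclass

# Test case IDs in this document look like "OCP-39451".
TEST_ID_RE = re.compile(r"OCP-\d+")


@dataclass
class FailedTest:
    test_id: str
    name: str
    error_message: str


def extract_failed_tests(items):
    """Keep FAILED items whose name carries an OCP test ID."""
    failed = []
    for item in items:
        if item.get("status") != "FAILED":
            continue
        match = TEST_ID_RE.search(item.get("name", ""))
        if match is None:
            continue  # no recognizable test ID, nothing to key a ticket on
        failed.append(
            FailedTest(match.group(0), item["name"], item.get("error_message", ""))
        )
    return failed
```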

3. Analyze Patterns

Failure Pattern = {
    "test_name": "OCP-39451",
    "test_file": "test/extended/winc/winc.go",
    "error_signature": hash(error_message),
    "versions_affected": ["4.20", "4.21", "4.22"],
    "platforms_affected": ["aws", "azure", "gcp"],
    "failure_count": 15,
    "first_seen": "2026-01-25",
    "last_seen": "2026-02-02"
}
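
One caveat on error_signature: Python's built-in hash() is randomized per process for strings, so it would not give the same value across runs. A sketch of a stable alternative follows, assuming that numbers and hex addresses are the volatile parts of an error message worth masking. The normalization rules and the error_signature function are illustrative choices, not the tool's actual logic.

```python
import hashlib
import re


def error_signature(error_message: str) -> str:
    """Return a short hash that is stable across runs and ignores volatile details."""
    normalized = error_message.lower()
    normalized = re.sub(r"0x[0-9a-f]+", "<hex>", normalized)   # mask hex addresses
    normalized = re.sub(r"\d+", "<n>", normalized)             # mask counts, IPs, durations
    normalized = re.sub(r"\s+", " ", normalized).strip()       # collapse whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
```

With this normalization, two timeouts that differ only in duration or node address produce the same signature, while messages with different wording do not.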

4. Check Existing Tickets

JQL Query:
project = WINC AND
labels = "ci-failure" AND
summary ~ "OCP-39451" AND
status NOT IN (Closed, Resolved)
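
The query above can be generated per test case. In this sketch, build_ticket_jql is a hypothetical helper; it validates the test ID before interpolating it so that unexpected input cannot alter the query.

```python
import re


def build_ticket_jql(project: str, test_id: str, label: str = "ci-failure") -> str:
    """Return JQL matching open CI-failure tickets for one test case."""
    if not re.fullmatch(r"OCP-\d+", test_id):
        raise ValueError(f"unexpected test ID: {test_id!r}")
    return (
        f"project = {project} AND "
        f'labels = "{label}" AND '
        f'summary ~ "{test_id}" AND '
        "status NOT IN (Closed, Resolved)"
    )
```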

5. Create/Update Ticket

If no ticket exists → Create new
If ticket exists but pattern changed → Add comment
If ticket is old but failure recurring → Reopen with new data
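
The three rules above can be read as a small decision function. This sketch assumes "old" means the ticket is already Closed or Resolved, which is an interpretation rather than something the rules state. Note also that the JQL in step 4 excludes closed tickets, so detecting the reopen case would need a separate lookup. Action and decide_action are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CREATE = "create"
    COMMENT = "comment"
    REOPEN = "reopen"
    SKIP = "skip"


def decide_action(ticket_status: Optional[str], pattern_changed: bool) -> Action:
    """Map the state of an existing ticket (None if absent) to the next step."""
    if ticket_status is None:
        return Action.CREATE
    if ticket_status in {"Closed", "Resolved"}:
        return Action.REOPEN  # assumed meaning of "old": closed, but failing again
    if pattern_changed:
        return Action.COMMENT
    return Action.SKIP  # open ticket, nothing new to report
```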

Implementation

Directory Structure

ci-failure-tracker/
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── ci_failure_tracker.py        # Main script (new tool)
├── src/
│   └── core/
│       ├── config_loader.py     # Team configuration loader
│       └── jira_client.py       # Jira API client
├── teams/
│   ├── winc.yaml               # WINC team configuration (example)
│   └── README.md               # Team config documentation
└── venv/                        # Python virtual environment

Configuration (config.yaml)

reportportal:
  url: "https://reportportal-openshift.apps.dno.ocp-hub.prod.psi.redhat.com"
  project: "prow"
  api_token: "${REPORTPORTAL_API_TOKEN}"  # Set in environment

jira:
  url: "https://issues.redhat.com"
  project: "WINC"
  parent_story: "WINC-1552"  # Parent epic for CI failures
  # Authentication via environment variables:
  # - JIRA_USER (username or email)
  # - JIRA_API_TOKEN (personal API token)

tracking:
  versions:
    - "4.20"
    - "4.21"
    - "4.22"

  platforms:
    - "aws"
    - "azure"
    - "gcp"
    - "vsphere"
    - "nutanix"

  job_patterns:
    - "periodic-ci-openshift-openshift-tests-private-release-*-winc-*"
    - "periodic-ci-openshift-windows-machine-config-operator-release-*"

  lookback_days: 7  # How far back to search
  failure_threshold: 3  # Minimum failures before creating ticket

labels:
  - "ci-failure"
  - "automated"
  - "phase-1-stabilization"

ticket_template: |
  h2. Automated CI Failure Report

  *Test*: {test_name}
  *Test File*: {{test_file}}:{line_number}
  *Affected Versions*: {versions}
  *Affected Platforms*: {platforms}

  h2. Failure Summary

  * *First Seen*: {first_seen}
  * *Last Seen*: {last_seen}
  * *Failure Count*: {failure_count} failures in {lookback_days} days
  * *Failure Rate*: {failure_rate}%

  h2. Error Message

  {code}
  {error_message}
  {code}

  h2. Recent Failures

  {failure_table}

  h2. ReportPortal Links

  {reportportal_links}

  h2. Recommended Actions

  # Review test code at {{test_file}}:{line_number}
  # Check for recent changes in affected test
  # Verify if issue is platform-specific or version-specific
  # Investigate error logs in ReportPortal

  ---
  _This ticket was automatically created by CI Failure Tracker_
  _Configuration: versions={versions}, threshold={failure_threshold}, lookback={lookback_days}d_

Usage

Installation

cd openshift-tests-private/tools/ci-failure-tracker  # path inside your local clone
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Set Environment Variables

# Required for all operations
export REPORTPORTAL_API_TOKEN="your-reportportal-token"

# Required only for creating tickets (not needed for --dry-run)
export JIRA_USER="your-jira-username-or-email"
export JIRA_API_TOKEN="your-jira-api-token"

Note:

For dry-run mode, only REPORTPORTAL_API_TOKEN is needed
Jira authentication uses REST API with username/token
Get Jira API token from: https://id.atlassian.com/manage-profile/security/api-tokens

Run Manually

# Dry run (no ticket creation, analyze only)
./ci_failure_tracker.py --team winc --dry-run --days 7

# Analyze with custom lookback period
./ci_failure_tracker.py --team winc --dry-run --days 14

# Fetch more launches (if failures are missing)
./ci_failure_tracker.py --team winc --dry-run --days 7 --max-pages 20

# Use larger page sizes for fewer API calls
./ci_failure_tracker.py --team winc --dry-run --page-size 300 --max-pages 3

# Run and create tickets (remove --dry-run)
./ci_failure_tracker.py --team winc --days 7

# Show help
./ci_failure_tracker.py --help

CLI Options

--team TEXT: Team ID (required) - must match a YAML file in teams/ directory
--dry-run: Analyze failures without creating Jira tickets
--verbose: Enable verbose output for debugging
--days N: Override lookback period from config (default: from team config)
--page-size N: Number of launches per API page (default: 150)
--max-pages N: Maximum API pages to fetch (default: 5)
--max-workers N: Number of parallel workers (default: from team config)
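
For reference, the options above map onto a parser along these lines. This is a sketch using argparse; the real script may be built with a different CLI library, and build_parser is a hypothetical name. The defaults for --page-size and --max-pages come from the list above, while None stands in for "taken from the team config".

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented CLI options."""
    parser = argparse.ArgumentParser(prog="ci_failure_tracker.py")
    parser.add_argument("--team", required=True,
                        help="Team ID; must match a YAML file in teams/")
    parser.add_argument("--dry-run", action="store_true",
                        help="Analyze failures without creating Jira tickets")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable verbose output for debugging")
    parser.add_argument("--days", type=int, default=None,
                        help="Override lookback period from the team config")
    parser.add_argument("--page-size", type=int, default=150,
                        help="Number of launches per API page")
    parser.add_argument("--max-pages", type=int, default=5,
                        help="Maximum API pages to fetch")
    parser.add_argument("--max-workers", type=int, default=None,
                        help="Number of parallel workers")
    return parser
```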

Schedule as Periodic Job

# Add to crontab to run daily at 9 AM
0 9 * * * cd /path/to/ci-failure-tracker && ./venv/bin/python ci_failure_tracker.py --team winc

# Or create a Prow periodic job (recommended for running in CI cluster)

Ticket Creation Strategy

One Ticket Per Test Case

The tool creates one Jira ticket per unique test case, regardless of:

How many platforms it failed on (AWS, GCP, Azure, etc.)
How many times it failed
What the specific error messages were
Which OCP versions were affected

Example:

Test: OCP-11111
Failures:
- AWS: 3 failures across 4.19, 4.20
- GCP: 2 failures on 4.21
- Azure: 1 failure on 4.20
Result: ONE ticket with all 6 failures listed, grouped by platform
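
The per-platform counts in this example can be produced with a nested counter. In the sketch below the failure tuples are made-up sample data arranged to match the totals above (the split of the AWS failures between 4.19 and 4.20 is an assumption), and platform_breakdown is a hypothetical helper.

```python
from collections import Counter, defaultdict

# (test_id, platform, version) tuples; sample data matching the example totals.
failures = [
    ("OCP-11111", "aws", "4.19"),
    ("OCP-11111", "aws", "4.19"),
    ("OCP-11111", "aws", "4.20"),
    ("OCP-11111", "gcp", "4.21"),
    ("OCP-11111", "gcp", "4.21"),
    ("OCP-11111", "azure", "4.20"),
]


def platform_breakdown(failures):
    """Count failures per platform for each test case."""
    breakdown = defaultdict(Counter)
    for test_id, platform, _version in failures:
        breakdown[test_id][platform] += 1
    return breakdown
```

Because the outer key is the test ID alone, all six failures land under a single entry, which is what yields one ticket.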

Deduplication Logic

from collections import defaultdict

# Group by test_name ONLY (not error signature)
grouped_failures = defaultdict(list)
for instance in all_failures:
    ticket_key = instance.test_name  # e.g., "OCP-11111"
    grouped_failures[ticket_key].append(instance)

# Create a ticket only if none exists in Jira for this test
for test_name, instances in grouped_failures.items():
    if jira.find_ticket(summary=f"CI Failure: {test_name}"):
        continue  # Ticket already exists
    create_ticket(test_name, instances)

Output Examples

Console Output

CI Failure Tracker - Run at 2026-02-02 10:00:00
============================================

Querying ReportPortal for versions: 4.20, 4.21, 4.22
Looking back: 7 days

Found 45 failed periodic jobs:
- 4.20: 12 failures
- 4.21: 18 failures
- 4.22: 15 failures

Analyzing failure patterns...

Pattern 1: OCP-39451 - Windows→Linux ClusterIP failure
  Versions: 4.20, 4.21, 4.22
  Platforms: aws, azure, gcp
  Failures: 15 (23% failure rate)
  Status: Existing ticket WINC-1605 ✓
  Action: Skipped (ticket exists)

Pattern 2: OCP-77777 - WMCO metrics timeout
  Versions: 4.21, 4.22
  Platforms: vsphere
  Failures: 8 (45% failure rate)
  Status: No ticket found
  Action: Creating ticket... WINC-1606 ✓

Pattern 3: OCP-43832 - BYOH zone parsing
  Versions: 4.20
  Platforms: nutanix
  Failures: 2 (5% failure rate)
  Status: Below threshold (3)
  Action: Skipped (not enough failures)

Summary:
- Total patterns found: 12
- Tickets created: 3
- Tickets updated: 1
- Skipped (existing): 5
- Skipped (threshold): 3

Created Jira Ticket Example

See example in the ticket template above.

API References

ReportPortal API: https://developers.reportportal.io/api-docs/
Jira REST API: https://developer.atlassian.com/cloud/jira/platform/rest/v3/
How to get ReportPortal API token: https://reportportal.io/docs/log-data-in-reportportal/HowToGetAnAccessTokenInReportPortal/

Related Tools

WINC Test Pass Rate Dashboard

A companion tool for tracking test health over time. Located in /dashboard:

Features:

Track test pass rates over 7/14/30/60/90 days
Compare performance across OpenShift versions (4.21 vs 4.22)
Identify worst-performing tests
Interactive web dashboard with charts and trends
Filter by version and time range
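
As an illustration of the windowed pass rate, here is a minimal sketch. It assumes each run is recorded as a (date, passed) pair, which is a simplification and not the dashboard's actual storage model; pass_rate is a hypothetical helper.

```python
from datetime import date, timedelta


def pass_rate(runs, days, today):
    """Percentage of passing runs in the last `days` days, or None if there are none."""
    cutoff = today - timedelta(days=days)
    recent = [passed for run_date, passed in runs if run_date > cutoff]
    if not recent:
        return None  # avoid dividing by zero when the window is empty
    return 100.0 * sum(recent) / len(recent)
```

Returning None for an empty window keeps "no data" distinct from a genuine 0% pass rate.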

Quick Start:

cd dashboard
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Collect data from ReportPortal
./dashboard.py collect --days 30

# Start web dashboard at http://localhost:8080
./dashboard.py serve

Documentation:

User Guide - Complete walkthrough
Quick Start - One-page reference
Technical README - Architecture and development

Use Case: While this CI Failure Tracker creates Jira tickets for individual failures, the dashboard provides a high-level view of overall test health and trends.

Future Enhancements

Slack Notifications: Send daily summary to #windows-containers channel
Trend Analysis: Track if failures are increasing/decreasing over time
Auto-assignment: Assign tickets based on test ownership (CODEOWNERS)
Integration with TestGrid: Cross-reference with OpenShift TestGrid data
ML-based Grouping: Use machine learning to group similar failures
Auto-close: Close tickets when test passes consistently for N days

Maintenance

Adding New Versions

Edit config.yaml:

tracking:
  versions:
    - "4.23"  # Add new version

Adjusting Thresholds

tracking:
  failure_threshold: 5  # Require 5 failures instead of 3
  lookback_days: 14     # Look back 14 days instead of 7

Filtering Out Flaky Tests

Create a skip list in config.yaml:

skip_tests:
  - "OCP-12345"  # Known flaky, tracked elsewhere
  - "OCP-67890"  # Intentionally failing
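
Taken together, failure_threshold and skip_tests act as a gate in front of ticket creation. A minimal sketch of that gate follows, assuming the threshold is inclusive ("minimum failures before creating ticket"); should_create_ticket is a hypothetical name.

```python
def should_create_ticket(test_id, failure_count, failure_threshold=3, skip_tests=()):
    """Return True when a test is not skipped and has enough failures."""
    if test_id in skip_tests:
        return False  # known flaky or intentionally failing
    return failure_count >= failure_threshold
```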

Troubleshooting

"Authentication failed"

Check ReportPortal API token is set: echo $REPORTPORTAL_API_TOKEN
Check Jira credentials are set (only needed for creating tickets):
- echo $JIRA_USER
- echo $JIRA_API_TOKEN
For dry-run mode, only REPORTPORTAL_API_TOKEN is needed

"No failures found"

Verify job name patterns in config match actual Prow job names
Check date range (expand lookback_days)
Verify ReportPortal project name is correct

"Duplicate tickets created"

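Check that the JQL query works: jira issue list -q "project=WINC AND labels=ci-failure"
Verify that ticket summaries keep the "CI Failure: <test name>" format, since deduplication matches on test name rather than error signature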

Contributing

Test changes locally with --dry-run
Add unit tests for new features
Update this README with configuration changes
Submit PR to openshift-tests-private repo
