Use this when the user wants source material fetched before drafting training records.
- First: try the IDE's native search/browsing tools to collect material.
- Fallback: run `scripts/collect.py` for large collections or when IDE tools are unavailable:

  ```bash
  # Web search
  python3 scripts/collect.py --query "<topic>" --max-results 10 --tool-context codex

  # Explicit URLs
  python3 scripts/collect.py --urls <url1> [url2 ...] --tool-context codex

  # Local files / repos
  python3 scripts/collect.py --paths ./docs ./README.md --tool-context codex
  ```
- The collector writes `workspace/collected_<timestamp>.jsonl` (records with `status: collected`); a sketch of one collected line appears after this list.
- Read the collected JSONL and draft canonical instruction/response records.
- Import the drafts into the pipeline:
  ```bash
  python3 scripts/generate.py --input workspace/drafts.jsonl \
      --source-type url_reference --tool-context codex
  ```

- Continue with the standard verify → dedup → export pipeline.
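A minimal sketch of a single collected line, assuming typical collector output; only `status: collected` is stated above, and the other keys (`source`, `content`) are illustrative names rather than the actual schema:

```json
{"status": "collected", "source": "https://example.com/article", "content": "Raw text captured by the collector..."}
```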
Use this when the user wants a new dataset or wants raw material turned into one.
1. decide request type, `task_type` (`sft` vs `dpo`), `source_type`, and target schema
2. set the target effective example count and coverage minima
3. if the user does not specify a size, default to 500
4. generate or collect records in batches instead of one monolithic pass
5. inspect recent runs in SQLite
6. collect or write canonical draft records with coverage metadata
7. prefer `scripts/build_loop.py` to orchestrate import, verify, incremental dedup, and coverage checks
8. if running manually, import drafts with `scripts/generate.py --dedup-threshold 0.85`
9. if running manually, run `scripts/coverage.py` to measure effective count, bucket gaps, joint-bucket skew, provenance, and response-prefix repetition
10. repeat steps 6–9 until the coverage plan is satisfied
11. augment if needed
12. generate preference pairs using `dpo-pair-generator` if `task_type` is `dpo`
13. verify
14. deduplicate
15. export
Recommended automated loop:
```bash
python3 scripts/build_loop.py \
    --batch workspace/drafts_batch_01.jsonl \
    --batch workspace/drafts_batch_02.jsonl \
    --plan-file workspace/coverage_plan.json \
    --source-type generated \
    --tool-context codex \
    --review-file workspace/review.jsonl \
    --verify-min-response-length 5
```

For label-only classification datasets such as VULNERABLE / NOT_VULNERABLE, lower `--verify-min-response-length` so short labels are not incorrectly rejected by the generic heuristic.
If the plan sets `require_review_file: true`, `build_loop.py` requires `--review-file` so semantic judging happens during the build.
Equivalent manual generation-time coverage loop:
```bash
python3 scripts/generate.py --input workspace/drafts_batch_01.jsonl \
    --source-type generated --tool-context codex --dedup-threshold 0.85

python3 scripts/coverage.py \
    --from-status raw_generated \
    --from-status augmented \
    --from-status verified_pass \
    --threshold 0.85 \
    --plan-file workspace/coverage_plan.json
```

Treat the run as incomplete until both conditions are true:
- effective count meets the planned target
- every required bucket in the coverage plan meets its minimum
For higher-signal datasets, also treat the run as incomplete until:
- required fields are present on every effective record
- provenance rules are satisfied
- joint-bucket mode collapse is below the configured threshold
- repeated response openings stay below the configured prefix cap
If you do not provide a review file, the loop can still steer coverage and apply heuristic filters, but the records remain `judge_pending` rather than fully validated `verified_pass`.
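If you do supply one, a review file might look roughly like the line below; the keys (`record_id`, `verdict`, `notes`) are illustrative assumptions, not the pipeline's actual review schema:

```json
{"record_id": "rec-0042", "verdict": "pass", "notes": "response is grounded and on-topic"}
```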
Generic plan fields now supported by `scripts/coverage.py` and `scripts/verify.py`:

`required_fields`, `group_minimums`, `max_share_per_group`, `joint_group_rules`, `provenance`, `response_length`, `response_structure`, `response_prefix`, `model_visibility`, `require_review_file`
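A sketch of a `workspace/coverage_plan.json` combining several of these fields; the field names come from the list above, but the value shapes shown are assumptions for illustration, not a verified schema:

```json
{
  "required_fields": ["instruction", "response"],
  "group_minimums": {"topic=networking": 50, "topic=crypto": 50},
  "max_share_per_group": 0.4,
  "response_prefix": {"max_share": 0.1, "blocking": false},
  "require_review_file": true
}
```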
Advanced quality sections are warn-only by default. Add `blocking: true` inside `provenance`, `response_length`, `response_structure`, or `response_prefix` only when you want that specific issue to prevent completion.
When the dataset keeps answer-bearing metadata for audit or analytics but the model should not see those values verbatim, define `model_visibility` in the plan. `scripts/export.py` and `scripts/build_loop.py --export-format ...` will apply those rules to exported `instruction` and `context` fields while leaving metadata intact. If `model_visibility` is omitted, export now applies a conservative built-in profile by default; set `"enabled": false` to disable it for a raw export.
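A sketch of what a `model_visibility` section might look like; `enabled` is confirmed above, while `hide_fields` is a hypothetical key name for the values the model should not see verbatim:

```json
{
  "model_visibility": {
    "enabled": true,
    "hide_fields": ["metadata.ground_truth_label"]
  }
}
```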
Use this when the user already has a dataset (or just generated one) and wants a structured quality assessment.
- import the file with `scripts/generate.py --source-type raw_dataset` (if not already generated)
- capture the generated `run_id`
- If running Verify (fast heuristic checks):
  - run `scripts/verify.py --source-run-id <run_id>`
  - run `scripts/dedup.py --source-run-id <run_id>`
  - export audit-ready outputs with `scripts/export.py`
- If running Audit (`dataset audit` via `dataset-auditor.md`):
  - agent runs verification, deduplication, and export to assemble metrics
  - agent loads test/train splits to verify disjointness and scenario uniqueness
  - agent samples records to detect context leakage and synthetic fingerprints
  - agent emits a final structured Markdown report with severity-classed findings
This preserves lineage and keeps the verify-only path resumable.
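End to end, the verify-only path from the steps above looks like this (`workspace/existing.jsonl` is a placeholder input path):

```bash
# Import the existing dataset and capture the generated run_id
python3 scripts/generate.py --input workspace/existing.jsonl \
    --source-type raw_dataset --tool-context codex

# Fast heuristic checks and deduplication, scoped to that run
python3 scripts/verify.py --source-run-id <run_id>
python3 scripts/dedup.py --source-run-id <run_id>
```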
If the dataset intentionally contains prompt injections, jailbreaks, or system-prompt-leak examples, import in injection-tolerant mode.
Red-team, security, pentest, and jailbreak requests now enable this by default from the request text. Use `--allow-injections` to force it explicitly or `--enforce-security-flags` to force strict flagging.
This bypasses prompt-injection regex flagging while still preserving control-character cleanup and normal canonicalization.
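For example, to force injection-tolerant import explicitly (the input path is a placeholder):

```bash
python3 scripts/generate.py --input workspace/redteam_set.jsonl \
    --source-type raw_dataset --tool-context codex --allow-injections
```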
Use this when the verified data already exists in SQLite.
Options:
- preset OpenAI export
- preset HuggingFace export
- flat CSV export
- flat JSONL export
- optional `--plan-file` so exported `instruction` and `context` honor any `model_visibility` rules
- custom flat schema export
Source types:

- `generated`: topic-driven examples created by the agent
- `url_reference`: records derived from user-provided URLs or reference documents
- `raw_dataset`: records imported from existing datasets
- `internet_research`: records built from material gathered via browsing/search
Start from resources/templates/custom_flat_schema.json.
Rules:
- `mode` must be `flat`
- `columns` must be present
- every column needs a unique `name`
- every column needs a non-empty `source`
Example:
```json
{
  "name": "my-export",
  "mode": "flat",
  "columns": [
    {"name": "prompt", "source": "instruction"},
    {"name": "answer", "source": "response.text"},
    {"name": "difficulty", "source": "metadata.difficulty"}
  ]
}
```
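The `source` values behave as dotted paths into each stored record, so with this schema a record like the following (values are illustrative):

```json
{"instruction": "Explain CSRF tokens", "response": {"text": "A CSRF token ties each form submission to a session..."}, "metadata": {"difficulty": "easy"}}
```

would export as the flat row:

```json
{"prompt": "Explain CSRF tokens", "answer": "A CSRF token ties each form submission to a session...", "difficulty": "easy"}
```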