Use this when the user wants source material fetched before drafting training records.
- First: try the IDE's native search/browsing tools to collect material.
- Fallback: run `scripts/collect.py` for large collections or when IDE tools are unavailable:

  ```bash
  # Web search
  python3 scripts/collect.py --query "<topic>" --max-results 10 --tool-context codex

  # Explicit URLs
  python3 scripts/collect.py --urls <url1> [url2 ...] --tool-context codex

  # Local files / repos
  python3 scripts/collect.py --paths ./docs ./README.md --tool-context codex
  ```
- The collector writes `workspace/collected_<timestamp>.jsonl` (records with `status: collected`); a sketch of one collected line appears after this list.
- Read the collected JSONL and draft canonical instruction/response records.
- Import the drafts into the pipeline:
  ```bash
  python3 scripts/generate.py --input workspace/drafts.jsonl \
      --source-type url_reference --tool-context codex
  ```

- Continue with the standard verify → dedup → export pipeline.
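A minimal sketch of a single collected line, assuming typical collector output; only `status: collected` is stated above, and the other keys (`source`, `content`) are illustrative names rather than the actual schema:

```json
{"status": "collected", "source": "https://example.com/article", "content": "Raw text captured by the collector..."}
```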
Use this when the user wants a new dataset or wants raw material turned into one.
1. decide request type, `task_type` (`sft` vs `dpo`), `source_type`, and target schema
2. set the target effective example count and coverage minima
3. if the user does not specify a size, default to 500
4. generate or collect records in batches instead of one monolithic pass
5. inspect recent runs in SQLite
6. collect or write canonical draft records with coverage metadata
7. prefer `scripts/build_loop.py` to orchestrate import, verify, incremental dedup, and coverage checks
8. if running manually, import drafts with `scripts/generate.py --dedup-threshold 0.85`
9. if running manually, run `scripts/coverage.py` to measure effective count, bucket gaps, joint-bucket skew, provenance, and response-prefix repetition
10. repeat steps 6–9 until the coverage plan is satisfied
11. augment if needed
12. generate preference pairs using `dpo-pair-generator` if `task_type` is `dpo`
13. verify
14. deduplicate
15. export
Recommended automated loop:
```bash
python3 scripts/build_loop.py \
    --batch workspace/drafts_batch_01.jsonl \
    --batch workspace/drafts_batch_02.jsonl \
    --plan-file workspace/coverage_plan.json \
    --source-type generated \
    --tool-context codex \
    --review-file workspace/review.jsonl \
    --verify-min-response-length 5
```

For label-only classification datasets such as VULNERABLE / NOT_VULNERABLE, lower `--verify-min-response-length` so short labels are not incorrectly rejected by the generic heuristic.
If the plan sets `require_review_file: true`, `build_loop.py` requires `--review-file` so semantic judging happens during the build.
Equivalent manual generation-time coverage loop:
```bash
python3 scripts/generate.py --input workspace/drafts_batch_01.jsonl \
    --source-type generated --tool-context codex --dedup-threshold 0.85

python3 scripts/coverage.py \
    --from-status raw_generated \
    --from-status augmented \
    --from-status verified_pass \
    --threshold 0.85 \
    --plan-file workspace/coverage_plan.json
```

Treat the run as incomplete until both conditions are true:
- effective count meets the planned target
- every required bucket in the coverage plan meets its minimum
For higher-signal datasets, also treat the run as incomplete until:
- required fields are present on every effective record
- provenance rules are satisfied
- joint-bucket mode collapse is below the configured threshold
- repeated response openings stay below the configured prefix cap
If you do not provide a review file, the loop can still steer coverage and apply heuristic filters, but the records remain `judge_pending` rather than fully validated `verified_pass`.
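If you do supply one, a review file might look roughly like the line below; the keys (`record_id`, `verdict`, `notes`) are illustrative assumptions, not the pipeline's actual review schema:

```json
{"record_id": "rec-0042", "verdict": "pass", "notes": "response is grounded and on-topic"}
```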
Generic plan fields now supported by `scripts/coverage.py` and `scripts/verify.py`:

`required_fields`, `group_minimums`, `max_share_per_group`, `joint_group_rules`, `provenance`, `response_length`, `response_structure`, `response_prefix`, `model_visibility`, `require_review_file`
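A sketch of a `workspace/coverage_plan.json` combining several of these fields; the field names come from the list above, but the value shapes shown are assumptions for illustration, not a verified schema:

```json
{
  "required_fields": ["instruction", "response"],
  "group_minimums": {"topic=networking": 50, "topic=crypto": 50},
  "max_share_per_group": 0.4,
  "response_prefix": {"max_share": 0.1, "blocking": false},
  "require_review_file": true
}
```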
Advanced quality sections are warn-only by default. Add `blocking: true` inside `provenance`, `response_length`, `response_structure`, or `response_prefix` only when you want that specific issue to prevent completion.
When the dataset keeps answer-bearing metadata for audit or analytics but the model should not see those values verbatim, define `model_visibility` in the plan. `scripts/export.py` and `scripts/build_loop.py --export-format ...` will apply those rules to exported `instruction` and `context` fields while leaving metadata intact. If `model_visibility` is omitted, export now applies a conservative built-in profile by default; set `"enabled": false` to disable it for a raw export.
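A sketch of what a `model_visibility` section might look like; `enabled` is confirmed above, while `hide_fields` is a hypothetical key name for the values the model should not see verbatim:

```json
{
  "model_visibility": {
    "enabled": true,
    "hide_fields": ["metadata.ground_truth_label"]
  }
}
```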
Use this when the user already has a dataset (or just generated one) and wants a structured quality assessment.
- import the file with `scripts/generate.py --source-type raw_dataset` (if not already generated)
- capture the generated `run_id`
- If running Verify (fast heuristic checks):
  - run `scripts/verify.py --source-run-id <run_id>`
  - run `scripts/dedup.py --source-run-id <run_id>`
  - export audit-ready outputs with `scripts/export.py`
- If running Audit (`dataset audit` via `dataset-auditor.md`):
  - agent runs verification, deduplication, and export to assemble metrics
  - agent loads test/train splits to verify disjointness and scenario uniqueness
  - agent samples records to detect context leakage and synthetic fingerprints
  - agent emits a final structured Markdown report with severity-classed findings
This preserves lineage and keeps the verify-only path resumable.
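End to end, the verify-only path from the steps above looks like this (`workspace/existing.jsonl` is a placeholder input path):

```bash
# Import the existing dataset and capture the generated run_id
python3 scripts/generate.py --input workspace/existing.jsonl \
    --source-type raw_dataset --tool-context codex

# Fast heuristic checks and deduplication, scoped to that run
python3 scripts/verify.py --source-run-id <run_id>
python3 scripts/dedup.py --source-run-id <run_id>
```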
If the dataset intentionally contains prompt injections, jailbreaks, or system-prompt-leak examples, import in injection-tolerant mode.
Red-team, security, pentest, and jailbreak requests now enable this by default from the request text. Use `--allow-injections` to force it explicitly or `--enforce-security-flags` to force strict flagging.
This bypasses prompt-injection regex flagging while still preserving control-character cleanup and normal canonicalization.
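For example, to force injection-tolerant import explicitly (the input path is a placeholder):

```bash
python3 scripts/generate.py --input workspace/redteam_set.jsonl \
    --source-type raw_dataset --tool-context codex --allow-injections
```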
Use this when the verified data already exists in SQLite.
Options:
- preset OpenAI export
- preset HuggingFace export
- flat CSV export
- flat JSONL export
- optional `--plan-file` so exported `instruction` and `context` honor any `model_visibility` rules
- custom flat schema export
Source types:

- `generated`: topic-driven examples created by the agent
- `url_reference`: records derived from user-provided URLs or reference documents
- `raw_dataset`: records imported from existing datasets
- `internet_research`: records built from material gathered via browsing/search
Start from resources/templates/custom_flat_schema.json.
Rules:
- `mode` must be `flat`
- `columns` must be present
- every column needs a unique `name`
- every column needs a non-empty `source`
Example:
```json
{
  "name": "my-export",
  "mode": "flat",
  "columns": [
    {"name": "prompt", "source": "instruction"},
    {"name": "answer", "source": "response.text"},
    {"name": "difficulty", "source": "metadata.difficulty"}
  ]
}
```
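The `source` values behave as dotted paths into each stored record, so with this schema a record like the following (values are illustrative):

```json
{"instruction": "Explain CSRF tokens", "response": {"text": "A CSRF token ties each form submission to a session..."}, "metadata": {"difficulty": "easy"}}
```

would export as the flat row:

```json
{"prompt": "Explain CSRF tokens", "answer": "A CSRF token ties each form submission to a session...", "difficulty": "easy"}
```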