Craft verifiable procedural data generators targeting specific capabilities. The data should help models learn cognitive primitives for language understanding and processing skills. It is designed for both pre-training (next-token prediction) and post-training. The data should provide high structural variety; lexical/surface variety is not a priority, since this data is meant to be used alongside natural data that already provides surface variation.
Implement tasks that are:
- concise in code and easy to audit,
- preferably solver-backed (use strong external libraries instead of re-implementing),
- distributionally broad (high structural variety),
- verifiable, formal, and robustly scorable (`score_answer(generate().answer) == 1`),
- favouring answer uniqueness where possible (e.g. specify a lexicographic order) to ease next-token prediction training; see the sketch below.
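When a problem admits several valid answers, pinning down one canonical serialization keeps exact-match scoring well-defined. A minimal sketch (the helper is hypothetical, not part of the library):

```python
def canonical_answer(solutions):
    """Serialize a collection of solutions deterministically.

    Deduplicating and sorting lexicographically guarantees a unique
    ground-truth string, which eases next-token prediction training.
    """
    return ", ".join(sorted(str(s) for s in set(solutions)))

# canonical_answer(["b", "a", "a"]) == "a, b"
```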
Every task should provide:
- a `Config` subclass with `update(self, c)`,
- a `Task` subclass implementing:
  - `generate(self) -> Problem`
  - `prompt(self, metadata) -> str`
  - `score_answer(self, answer, entry) -> float | Reward` (or rely on the default exact match).

A `Problem` must include:
- `metadata` (dict/easydict),
- `answer` (the ground-truth string).

`Task.generate_example(...)` automatically adds metadata fields:
`_task`, `_level`, `_config`, `_time`, `_prompt_tokens`, `_cot_tokens`.
Base `Config` protected fields:
- `c`: difficulty step size,
- `level`: current level,
- `seed`: RNG seed (do not use it; do not seed anything explicitly unless requested),
- `size`: optional dataset size.
Important behavior:
- Int-typed fields (except `level`/`size`/`seed`) are tracked internally as floats and stochastically rounded on read.
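For intuition: a fractional knob value like 3.25 yields 3 most of the time and 4 occasionally, so fractional difficulty increments still matter in expectation. One possible implementation of such rounding (illustrative only, not the library's actual code):

```python
import random

def stochastic_round(x):
    """Round x down or up at random, weighted by proximity.

    E.g. 3.25 -> 3 with probability 0.75 and 4 with probability 0.25,
    so repeated reads average out to 3.25.
    """
    lower = int(x // 1)
    return lower + (random.random() < (x - lower))
```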
Design rules for `update(c)`:
- monotonic difficulty increase,
- no mutation of `c`,
- keep generation solvable and diverse,
- `update` should change knobs (problem sizes, reasoning depth, etc.), not hardcode different subtasks (do not use "if level ... then ...").

Rough reference: level 0 should be as simple as possible while still ensuring diversity (for example, in a task that generates graphs for shortest-path prediction, 3 nodes are not enough because the combinatorics run out quickly); level 5 should be tough even for large LLMs.
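For a feel of how this scales, suppose `set_level(k)` applies `update(c)` k times from the level-0 defaults (an assumption about the base `Config`, not confirmed here). A hypothetical graph-task config would then evolve like this:

```python
from dataclasses import dataclass

@dataclass
class GraphConfig:              # hypothetical, for illustration only
    n_nodes: int = 4            # level 0: small, but combinatorially non-trivial
    n_edges: int = 5

    def update(self, c=1):
        # Scale knobs smoothly; no per-level branching.
        self.n_nodes += 2 * c
        self.n_edges += 3 * c

cfg = GraphConfig()
for _ in range(5):              # roughly what set_level(5) would do under the assumption above
    cfg.update()
print(cfg.n_nodes, cfg.n_edges)  # 14 20: noticeably harder than the level-0 instance
```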
- External libraries first:
- Use domain solvers/parsers/symbolic engines (`sympy`, planning engines, grammar libs, etc.).
- Do not hand-roll complex validators/solvers if a stable library exists (see the solver-backed sketch after this list).
- Concise generation logic:
- Keep task code short and auditable.
- Push heavy correctness checks to proven toolchains.
- High generality of distribution:
- Randomize structure, not just surface text.
- Avoid narrow templates that overfit lexical patterns.
- Prefer configurable families of instances over one fixed style.
- Reward quality over strict formatting:
- Reward semantic correctness first, with optional light format penalties.
- Use `Reward(...)` tags when useful for diagnostics.
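As a concrete illustration of "external libraries first" and "randomize structure, not just surface text", here is a sketch of a solver-backed generator; the function is illustrative and not part of `reasoning_core`. It assembles a random linear equation whose term count, coefficients, and side assignment all vary, and delegates the solving to `sympy`:

```python
import random
import sympy

def generate_linear_equation(n_terms=3):
    """Build a random linear equation in x and solve it with sympy.

    Structure (how many terms, their signs, and which side of the
    equation each term lands on) is randomized, so instances differ
    structurally rather than only lexically.
    """
    x = sympy.Symbol("x")
    lhs, rhs = sympy.Integer(0), sympy.Integer(0)
    for _ in range(n_terms):
        term = random.randint(1, 9) * random.choice([x, -x, 1, -1])
        if random.random() < 0.5:
            lhs += term
        else:
            rhs += term
    solutions = sympy.solve(lhs - rhs, x)
    if len(solutions) != 1:  # degenerate instance (no x, or an identity): retry
        return generate_linear_equation(n_terms)
    equation = f"{sympy.sstr(lhs)} = {sympy.sstr(rhs)}"
    return equation, str(solutions[0])

eq, ans = generate_linear_equation()
print(eq, "->", ans)  # e.g. "3*x - 2 = 7 -> 3"
```

A `score_answer` for such a task would compare parsed values semantically (e.g. via `score_scalar` or an `ast.literal_eval`-based comparison) rather than raw strings.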
Task template:

```python
from dataclasses import dataclass
from reasoning_core.template import Task, Problem, Config, edict
from reasoning_core.utils import score_scalar


@dataclass
class MyTaskConfig(Config):
    n_vars: int = 2
    depth: int = 3

    def update(self, c=1):
        # Used to scale difficulty.
        # Values are post-processed (stochastic rounding); no need to cast to int explicitly.
        self.n_vars += c
        self.depth += c


class MyTask(Task):
    # Do not put "Task" in the task name.
    def __init__(self, config=MyTaskConfig()):
        super().__init__(config=config)

    def generate(self):
        # Build the instance, using external libraries when possible.
        metadata = edict({"equation": "...", "cot": "...optional..."})
        answer = "..."
        return Problem(metadata=metadata, answer=answer)

    def prompt(self, metadata):
        # Specify the answer format clearly; refer to it as "the answer" or "answer".
        # Do not use "answer" as a verb.
        # The wording logic should live in the prompt, not be buried in the code.
        return f"Solve for x: {metadata['equation']}\nThe answer is a scalar."

    def score_answer(self, answer, entry):
        # `answer` is the prediction to score (e.g. an LLM output);
        # `entry` is a Problem; `entry.answer` is the ground truth.
        # Use ast.literal_eval (never eval) if the answer needs parsing.
        return score_scalar(answer, entry)  # or a custom semantic checker
```

Checklist:
- `task = MyTask(); x = task.generate_example()` works.
- `task.score_answer(x.answer, x) == 1`.
- Wrong/random answers do not all score 1.
- `task.validate()` passes.
- `config.set_level(1)` changes difficulty, not `c`.
- The prompt is unambiguous about the output format.
- Metadata is ideally sufficient for offline debugging (instance params, optional `cot` entry).
- Metadata is not too large (should not blow up memory).
- Any `Task` subclass in `reasoning_core/tasks/*.py` is auto-discovered by AST and lazy-loaded through `reasoning_core.__init__.py`. `task_name` defaults to the snake_case class name.
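Putting the checklist together, a minimal smoke test could look like the sketch below; the exact attribute names (`task.config`, `metadata._level`) follow the template above but are assumptions about the base classes.

```python
task = MyTask()
x = task.generate_example()                        # Problem with metadata, answer, _task, _level, ...
assert task.score_answer(x.answer, x) == 1         # the ground truth scores perfectly
assert task.score_answer("clearly wrong", x) < 1   # junk answers do not
task.validate()                                    # base-class consistency checks

task.config.set_level(3)                           # raise difficulty; c itself is untouched
harder = task.generate_example()
print(harder.metadata._level, harder.answer)
```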