Task Implementation Guide

Goal

Craft verifiable, procedural data generators targeting specific capabilities. The data should help models learn cognitive primitives for language understanding and processing skills, and is designed for both pre-training (next-token prediction) and post-training. The data should provide high structural variety; lexical/surface variety is not a priority, since this data is meant to be used alongside natural data that already provides surface variation.

Implement tasks that are:

  • concise in code, easy to audit,
  • preferably solver-backed (use strong external libraries instead of re-implementing),
  • distributionally broad (high structural variety),
  • verifiable, formal, and robustly scorable (score_answer(generate().answer) == 1),
  • favouring answer uniqueness where possible (e.g. specify lexicographic order) to ease next-token-prediction training (see the sketch after this list).
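
For the uniqueness point, a minimal sketch (names and values are illustrative only): when a task's answer is naturally set-valued, fix an order in the generator and state that order in the prompt, so there is exactly one correct target string.

# A set-valued answer made unique by fixing an order (the prompt must state this order).
items = {"banana", "apple", "cherry"}
answer = ", ".join(sorted(items))   # "apple, banana, cherry" — one canonical target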

Core Contract (from reasoning_core/template.py)

Every task should provide:

  • Config subclass with update(self, c).
  • Task subclass implementing:
    • generate(self) -> Problem
    • prompt(self, metadata) -> str
    • score_answer(self, answer, entry) -> float | Reward (or rely on default exact match)

Problem must include:

  • metadata (dict/easydict),
  • answer (ground-truth string).

Task.generate_example(...) automatically adds metadata:

  • _task, _level, _config, _time, _prompt_tokens, _cot_tokens.
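
For reference, a hand-built Problem that satisfies this contract might look like the following (a sketch only; the expression field is illustrative):

from reasoning_core.template import Problem, edict

# Sketch: a Problem built by hand, matching the contract above.
p = Problem(
    metadata=edict({"expression": "2 + 3"}),  # instance parameters, useful for prompting/debugging
    answer="5",                               # ground-truth string
)
# After Task.generate_example(...), fields such as _task, _level, _config,
# _time, _prompt_tokens and _cot_tokens appear in p.metadata automatically.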

Config and Difficulty Scaling

Base Config protected fields:

  • c: difficulty step size,
  • level: current level,
  • seed: RNG seed (do not use it; do not seed anything explicitly unless requested),
  • size: optional dataset size.

Important behavior:

  • Int-typed fields (except level/size/seed) are tracked internally as floats and stochastically rounded on read.
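
For instance, using the MyTaskConfig from the Minimal Task Skeleton below (a sketch of the intended behaviour, not the exact mechanism):

cfg = MyTaskConfig()
cfg.update(0.4)      # n_vars is now tracked internally as 2.4
print(cfg.n_vars)    # reads as 2 or 3 (stochastically rounded), so fractional steps take effect in expectation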

Design rules for update(c):

  • monotonic difficulty increase,
  • no mutation of c,
  • keep generation solvable and diverse,
  • update should change knobs (problem sizes, reasoning depth, etc.), not hardcode different subtasks (do not use "if level ... then ...").

Rough reference: Level 0 should be as simple as possible while still ensuring diversity (for example, in a task that generates graphs for shortest-path prediction, 3 nodes are not enough because the combinatorics run out quickly). Level 5 should be tough even for large LLMs.
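
For illustration, a toy config following these rules (the class and knob values are hypothetical; set_level, used in the Quality Checklist below, is assumed to apply update internally with the configured step size):

from dataclasses import dataclass
from reasoning_core.template import Config

@dataclass
class GraphConfig(Config):
    n_nodes: int = 4        # Level 0: small, but enough combinatorics for diverse instances
    edge_prob: float = 0.4

    def update(self, c=1):
        # Monotonic: every call makes instances harder; c itself is never mutated.
        self.n_nodes += 2 * c
        self.edge_prob = min(0.9, self.edge_prob + 0.05 * c)

cfg = GraphConfig()
cfg.set_level(3)            # scales the knobs above; int fields are stochastically rounded on read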

Reasoning-Core Philosophy

  1. External libraries first:
    • Use domain solvers/parsers/symbolic engines (sympy, planning engines, grammar libs, etc.); see the sketch after this list.
    • Do not hand-roll complex validators/solvers if a stable library exists.
  2. Concise generation logic:
    • Keep task code short and auditable.
    • Push heavy correctness checks to proven toolchains.
  3. High generality of distribution:
    • Randomize structure, not just surface text.
    • Avoid narrow templates that overfit lexical patterns.
    • Prefer configurable families of instances over one fixed style.
  4. Reward quality over strict formatting:
    • Reward semantic correctness first, with optional light format penalties.
    • Use Reward(...) tags when useful for diagnostics.
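
As a concrete illustration of the "external libraries first" principle, a generate-style helper can delegate both construction and solving to sympy (a sketch; the function and field names are illustrative, not part of reasoning_core):

import random
import sympy as sp
from reasoning_core.template import Problem, edict

def generate_linear():
    # Build a random linear equation a*x + b = c and let sympy produce the ground truth.
    x = sp.Symbol("x")
    a = random.randint(1, 9)
    b = random.randint(-9, 9)
    c = random.randint(-9, 9)
    eq = sp.Eq(a * x + b, c)
    solution = sp.solve(eq, x)[0]                         # solver-backed answer, unique by construction
    metadata = edict({"equation": f"{sp.sstr(a * x + b)} = {c}"})
    return Problem(metadata=metadata, answer=str(solution))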

Minimal Task Skeleton

from dataclasses import dataclass
from reasoning_core.template import Task, Problem, Config, edict
from reasoning_core.utils import score_scalar

@dataclass
class MyTaskConfig(Config):
    n_vars: int = 2
    depth: int = 3

    def update(self, c=1):
        # Used to scale difficulty
        # Values will be postprocessed, no need to cast as int explicitly
        self.n_vars += c
        self.depth += c

class MyTask(Task):
    # Do not put "Task" in the task name
    def __init__(self, config=MyTaskConfig()):
        # Constructor dunder method
        super().__init__(config=config)

    def generate(self):
        # Build instance using external libs when possible.
        metadata = edict({"equation": "...", "cot": "...optional..."})
        answer = "..."
        return Problem(metadata=metadata, answer=answer)

    def prompt(self, metadata):
        # Specify the answer format clearly, refer to it as "the answer" or "answer".
        # Do not use answer as a verb
        # The "wording logic" should be in the prompt and not buried in the code.
        return f"Solve for x: {metadata['equation']}\n Answer is a scalar."

    def score_answer(self, answer, entry):
        # Answer is the answer to score (e.g. LLM prediction)
        # entry is a problem; entry.answer is the ground truth
        # use ast.literal_eval for safety if evaluation is needed
        return score_scalar(answer, entry)  # or custom semantic checker

Quality Checklist

  • task = MyTask(); x = task.generate_example() works.
  • task.score_answer(x.answer, x) == 1.
  • Wrong/random answers do not all score 1.
  • task.validate() passes.
  • config.set_level(1) changes difficulty, not c.
  • Prompt is unambiguous about output format.
  • Metadata is ideally sufficient for offline debugging (instance params, optional cot entry).
  • Metadata is not too large (should not blow up memory).
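
Most of these items can be spot-checked in a few lines (a sketch, assuming a finished task; how far wrong answers score below 1 depends on the scorer):

task = MyTask()
for _ in range(10):
    x = task.generate_example()
    assert task.score_answer(x.answer, x) == 1     # ground truth round-trips to a full score
    assert task.score_answer("garbage", x) != 1    # spot-check that a wrong answer does not

task.validate()

cfg = MyTaskConfig()
cfg.set_level(1)                                   # changes difficulty knobs, not c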

Registration and Discovery

  • Any Task subclass in reasoning_core/tasks/*.py is auto-discovered by AST and lazy-loaded through reasoning_core.__init__.py.
  • task_name defaults to the snake_case class name.
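
For example, a file dropped into the tasks package needs nothing beyond the class definition (file and class names here are illustrative):

# reasoning_core/tasks/shortest_path.py  (illustrative path)
from reasoning_core.template import Task

class ShortestPath(Task):    # auto-discovered; exposed under task_name "shortest_path"
    ...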