Commit 556dd06

docs: add "What is an Environment?" concepts page and update concepts index
Signed-off-by: Chris Wing <cwing@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>

Parent: b3c550a

3 files changed: 81 additions & 7 deletions


fern/versions/latest/pages/about/concepts/index.mdx

Lines changed: 18 additions & 7 deletions
@@ -1,14 +1,9 @@
 ---
 title: "Concepts"
-description: ""
+description: "Understand what environments are, how they power evaluation, agent optimization, and training, and how the same components serve all three."
 position: 2
 ---
-NeMo Gym concepts explain the mental model behind building RL training environments: when to use RL over SFT, how environment components work together, and how verification signals drive learning. Use this page as a compass to decide which explanation to read next.
-
-<Tip>
-New to RL for LLMs? Start with [training-approaches](/latest/about/concepts/training-approaches) for context on SFT, RL, and RLVR, or refer to [Key Terminology](/latest/about/concepts/key-terminology) for a quick glossary.
-</Tip>
+Understand what environments are, how they power evaluation, agent optimization, and training, and how the same components serve all three.
 
 ---
 
@@ -18,6 +13,22 @@ Each explainer below covers one foundational idea and links to deeper material.
 
 <Cards>
 
+<Card title="What is an Environment?" href="/latest/about/concepts/what-is-an-environment">
+Where the concept comes from, what components make up an environment, and how environments unify the evaluate-improve loop.
+</Card>
+
+<Card title="Environments for Evaluation" href="/latest/about/concepts/environments-for-eval">
+How environments power benchmarks, model eval, and agent eval.
+</Card>
+
+<Card title="Environments for Agent Optimization" href="/latest/about/concepts/environments-for-agent-optimization">
+How environment scores guide iterative harness-level improvements.
+</Card>
+
+<Card title="Environments for Training" href="/latest/about/concepts/environments-for-training">
+How the same environments become training environments for RL.
+</Card>
+
 <Card title="Training Approaches" href="/latest/about/concepts/training-approaches">
 Understand the differences between SFT, DPO, and GRPO, and the rise of RLVR.
 </Card>

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
---
title: "What is an Environment?"
description: "Environments decompose into dataset, agent harness, verifier, and state — unifying evaluation, agent optimization, and training."
position: 1.5
---

**Goal**: Understand what an environment is, where the concept comes from, and how it decomposes into components that serve evaluation, agent optimization, and training.

## Environment Origins

The term *environment* comes from reinforcement learning, where it describes the world an agent interacts with. The agent takes actions, the environment returns observations and a reward, and the cycle repeats. Historically, "Gym" has referred to a collection of RL training environments, a convention that inspired the naming of NeMo Gym.

The concept is the same today, but the use of environments has expanded beyond RL to model evaluation, agent evaluation, agent harness optimization, and synthetic data generation.

Accordingly, an environment contains everything required for an agent to complete a task:
- Dataset
- Agent Harness
- Verifier
- State

The model powering the agent is external to the environment. It generates a response (actions), which the environment processes. The environment updates its internal state, returns observations (tool results, error messages, state changes), and produces a reward (a numerical score of how well the agent performed). This loop is the same for evaluation and training — the only difference is what happens with the reward afterward.
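
To make the loop concrete, here is a minimal sketch. The class, method, and field names are illustrative assumptions, not NeMo Gym APIs:

```python
# Illustrative sketch of the agent-environment loop; not a real NeMo Gym API.
from dataclasses import dataclass, field


@dataclass
class Environment:
    task: dict                                  # one task from the dataset
    state: dict = field(default_factory=dict)   # per-attempt state, starts clean

    def step(self, action: str) -> tuple[str, bool]:
        """Process an action, update internal state, return an observation."""
        self.state.setdefault("actions", []).append(action)
        done = action.startswith("ANSWER:")     # the episode ends on a final answer
        observation = "accepted" if done else f"observation for {action!r}"
        return observation, done

    def reward(self) -> float:
        """Verifier: score the attempt against privileged task metadata."""
        last_action = self.state["actions"][-1]
        return 1.0 if last_action == f"ANSWER: {self.task['expected']}" else 0.0


def toy_model(prompt: str, observation: str) -> str:
    """Stand-in for the externally hosted model that powers the agent."""
    return "ANSWER: 4"


env = Environment(task={"prompt": "What is 2 + 2?", "expected": "4"})
observation, done = env.task["prompt"], False
while not done:                                 # act, observe, repeat
    action = toy_model(env.task["prompt"], observation)
    observation, done = env.step(action)
print(env.reward())                             # 1.0: a metric for eval, a reward for RL
```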

## Components of an Environment

### Dataset

A collection of tasks (prompts for the agent to solve), along with metadata and privileged information necessary for scoring each task attempt. Each task defines a problem for the model: a coding issue to fix, a math problem to solve, a tool-calling scenario to navigate.

Tasks can vary structurally: single-turn question answering, multi-step tool use, multi-turn dialogue, agentic workflows with sandboxed execution, and more.
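
For illustration, a coding task and a math task might be stored as records like these (the field names are hypothetical; real datasets define their own schemas):

```python
# Hypothetical task records; real datasets define their own schemas.
tasks = [
    {   # agentic coding task: scored by running the repo's tests
        "prompt": "Fix the failing date parsing test in the attached repo.",
        "metadata": {"repo": "example/app", "base_commit": "abc123"},
        "verifier_info": {"test_command": ["pytest", "tests/test_dates.py"]},
    },
    {   # single-turn math task: scored by exact match
        "prompt": "What is the derivative of x**2?",
        "verifier_info": {"expected_answer": "2*x"},
    },
]
```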

### Agent Harness

Agent = model + agent harness. The agent harness defines how the model interacts with the environment. Language models perform stateless inference; the harness is what turns a model into an agent — it loops model calls, routes tool use, manages context, and decides when the task is done.

Harnesses exist on a spectrum. A simple harness just loops model calls until the task is complete. Other harnesses, like Claude Code or OpenClaw, include tools, planning, memory, self-correction, skills, and more.
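
The simple end of that spectrum might look like the following sketch (the function and message shapes are hypothetical, not a real harness API):

```python
# Minimal harness sketch: loop stateless model calls until the task is done.
def run_harness(model_call, task_prompt: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": task_prompt}]   # context management
    reply = ""
    for _ in range(max_turns):
        reply = model_call(messages)                        # stateless inference
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("FINAL:"):                      # decide when done
            return reply.removeprefix("FINAL:").strip()
        messages.append({"role": "user", "content": "Continue."})
    return reply                                            # best effort after max_turns
```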

### Verifier

The verifier (sometimes referred to as a scorer or grader) scores a task attempt to calculate a reward, typically between 0 and 1. It defines what "good" means for this environment by scoring the model's output against task-specific metadata — e.g. expected answers, test cases, unit tests, ground truth, rubrics.

Common patterns include exact match, code execution (run tests and check whether they pass), state matching, LLM-as-judge, and reward models.
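
As a sketch, the first two patterns could be implemented like this (the signatures are illustrative assumptions, not NeMo Gym APIs):

```python
# Two common verifier patterns, sketched with hypothetical signatures.
import subprocess


def exact_match_verifier(output: str, expected: str) -> float:
    """Reward 1.0 when the normalized answer matches the ground truth."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def code_execution_verifier(test_command: list[str]) -> float:
    """Run the task's tests; reward 1.0 if they pass (exit code 0)."""
    result = subprocess.run(test_command, capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```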

The verifier is used for both evaluation and training:

- **For evaluation**: verifier scores become benchmark metrics (e.g. accuracy, pass@1, pass@k).
- **For training**: the same scores become the reward signal that drives reinforcement learning.

### State

Per-task state that changes as the agent takes actions. Each task attempt starts from a clean state, which ensures attempts are independent and rewards are attributable to the agent's actions on that task.

Some environments have minimal state (e.g., math — just input and output). Others have rich state that evolves across turns: file systems being modified, databases being updated, code repositories being patched, game boards advancing.

Runtimes and sandboxes host and run the environment. A runtime is the execution infrastructure (e.g. local process, Docker, Apptainer). A sandbox is a runtime with isolated execution — for example, a container per episode — providing security boundaries when the agent can execute arbitrary code.
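
For example, a sandbox might wrap each agent action in a throwaway container, along these lines (the image and flags are illustrative choices, not NeMo Gym defaults):

```python
# Sketch: one isolated, network-less container per action.
import subprocess


def run_sandboxed(command: list[str], image: str = "python:3.11-slim") -> str:
    """Execute a command inside a disposable Docker container."""
    result = subprocess.run(
        ["docker", "run", "--rm", "--network=none", image, *command],
        capture_output=True, text=True, timeout=60,
    )
    return result.stdout
```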

## Environments Unify Agent Improvement

![Evaluate-improve loop](/assets/images/eval-improve-loop.png)

- **Evaluation:** scores become metrics (e.g. accuracy, pass@1, pass@k; see the sketch after this list).
- **Agent optimization:** scores guide harness-level changes: e.g. prompt rewrites, tool changes, context management, orchestration tuning.
- **Training:** scores become the reward signal that drives reinforcement learning. A training framework consumes rewards and updates model weights, improving the capabilities of the underlying model.
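
As an illustration of the evaluation case, pass@k can be computed from per-attempt verifier scores with the standard unbiased estimator (the function name and layout here are hypothetical):

```python
# Sketch: turning per-attempt verifier scores into an eval metric.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for n attempts with c successes (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


rewards = [1.0, 0.0, 1.0, 0.0]                  # verifier scores for four attempts
successes = sum(r == 1.0 for r in rewards)
print(pass_at_k(len(rewards), successes, k=2))  # ~0.833
```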
