This page provides a detailed description of the configuration files for AgentJet.
## Overview

AgentJet uses YAML configuration files to set up data, algorithms, rewards, logging, and other runtime behavior.
## Default Configuration

The default config is located at `ajet/default_config/ajet_default.yaml`.
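If you want to inspect the defaults programmatically, here is a minimal sketch using PyYAML; the nesting accessed below is taken from the full template at the end of this page:

```python
import yaml

# Load the shipped defaults; the path matches the repository layout above.
with open("ajet/default_config/ajet_default.yaml") as f:
    cfg = yaml.safe_load(f)

# All settings live under the single root section `ajet`.
print(cfg["ajet"]["backbone"])                        # e.g. "debug"
print(cfg["ajet"]["trainer_common"]["total_epochs"])  # e.g. 50
```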
At a high level, a typical config contains a single root section `ajet`, which is divided into several logical parts:

- **Basic Metadata**: project name, experiment name, experiment directory, and backbone selection
  - `project_name`, `experiment_name`, `experiment_dir`
  - `backbone`: selects the training backend (`debug`, `trinity`, or `verl`)
- **Data & Reward**: how to load data and evaluate agents
  - `task_reader`: loads training/validation samples
  - `task_judge`: evaluates agents and computes rewards
  - `data`: prompt/response lengths and batch sizes
- **Model & Rollout**: model configuration and agent interaction
  - `model`: the base model to train
  - `rollout`: agent-environment interaction settings
  - `context_tracker`: conversation/history management
## Model Configuration

### Specifying the Model

The `model.path` field accepts either a local path or a HuggingFace repo ID:

| Source Type | Example |
|---|---|
| Local file | `/mnt/data/models/Qwen2.5-14B-Instruct` |
| HuggingFace repo | `Qwen/Qwen2.5-14B-Instruct` (auto-downloaded) |
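For example, pointing at a HuggingFace repo (mirroring the `model.path` entry in the full template at the end of this page):

```yaml
ajet:
  model:
    path: Qwen/Qwen2.5-14B-Instruct  # or a local path, e.g. /mnt/data/models/Qwen2.5-14B-Instruct
```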
### Environment Variables for LLM-as-Judge

If you use LLM-as-a-Judge, configure these environment variables:

```bash
# DashScope API key(s) for remote LLM calls
export DASHSCOPE_API_KEY='sk-xxxxxx|sk-yyyyyy'
export DASHSCOPE_API_KEY_BACKUP='sk-zzzzzz'
```
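The pipe-separated value above suggests that several keys can be supplied at once. Here is a minimal sketch of how such a list could be consumed; the rotation logic is purely illustrative, not AgentJet's documented behavior:

```python
import itertools
import os

# Split the pipe-separated key list; append the backup key if present.
keys = os.environ.get("DASHSCOPE_API_KEY", "").split("|")
backup = os.environ.get("DASHSCOPE_API_KEY_BACKUP")
if backup:
    keys.append(backup)

# Round-robin over the available keys to spread rate limits.
key_cycle = itertools.cycle(k for k in keys if k)
print(next(key_cycle))
```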
## Data Configuration

### Task Reader

`task_reader` defines how to read training and validation data.
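For example, reading tasks from a running environment service (this excerpt comes from the full template at the end of this page):

```yaml
ajet:
  task_reader:
    type: env_service
    env_service:
      env_type: "appworld"
      env_url: "http://127.0.0.1:8080"
      env_action_preference: code
    training_split: train
    validation_split: dev
```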
### Task Judge

`task_judge` evaluates agent performance and calculates rewards.

```yaml
ajet:
  task_judge:
    judge_type: customized_protocol  # or 'rubrics_auto_grader'
    judge_protocol: ajet.task_judge.env_service_as_judge->EnvServiceJudge
    alien_llm_model: qwen3-235b-a22b-instruct-2507
    alien_llm_response_length: 512
```

| Option | Description |
|---|---|
| `customized_protocol` | Use a custom Python class for scoring |
| `rubrics_auto_grader` | Use LLM-based automatic grading |
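With `judge_type: customized_protocol`, `judge_protocol` points at a Python class using the `module->Class` notation shown above. A hypothetical sketch of what such a class might look like follows; the class name, method name, and signature are illustrative assumptions, not AgentJet's actual interface (check `ajet.task_judge` for the real base class):

```python
# my_project/my_judge.py -- would be referenced in YAML as:
#   judge_protocol: my_project.my_judge->KeywordJudge
# NOTE: the method name and signature below are assumptions for illustration.

class KeywordJudge:
    """Toy judge that rewards trajectories mentioning a target keyword."""

    def judge(self, task: dict, trajectory: list[dict]) -> float:
        # Return a scalar reward for the rollout: 1.0 if the final
        # response contains the task's expected keyword, else 0.0.
        final_response = trajectory[-1]["content"] if trajectory else ""
        return 1.0 if task.get("keyword", "") in final_response else 0.0
```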
## Training Configuration

### Backend Selection

AgentJet supports three training backends:

| Backend | Description |
|---|---|
| `trinity` | Default. A flexible and scalable framework for RL fine-tuning |
| `verl` | Volcano Engine Reinforcement Learning for LLMs |
| `debug` | Allows breakpoint debugging in IDEs |
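The backend is selected with the top-level `backbone` key:

```yaml
ajet:
  backbone: trinity  # or 'verl' / 'debug'
```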
### Rollout Configuration

Controls agent behavior during environment interaction:

```yaml
ajet:
  rollout:
    user_workflow: tutorial.example_appworld.appworld->ExampleAgentScopeWorkflow
    max_env_worker: 128
    temperature: 0.9
    top_p: 1.0
    name: vllm
    n_vllm_engine: 2
    num_repeat: 4
```

| Parameter | Description |
|---|---|
| `user_workflow` | Path to the workflow implementation class |
| `temperature` / `top_p` | Sampling parameters |
| `name` | Inference engine (e.g., `vllm`) |
| `n_vllm_engine` | Number of vLLM engines (Trinity only) |
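Validation-time sampling is configured separately under `rollout.val_kwargs`; the excerpt below comes from the full template at the end of this page. Greedy decoding (`temperature: 0.0`, `do_sample: False`) keeps validation scores comparable across runs:

```yaml
ajet:
  rollout:
    val_kwargs:
      temperature: 0.0
      top_k: -1
      top_p: 1.0
      do_sample: False
      num_repeat: 1
```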
### Common Training Parameters

```yaml
ajet:
  trainer_common:
    total_epochs: 50
    save_freq: 20
    test_freq: 20
    val_before_train: False
    val_pass_n: 4
    nnodes: 1
    n_gpus_per_node: 8
    mini_batch_num: 1
    fsdp_config:
      param_offload: True
      optimizer_offload: True
```

| Parameter | Description |
|---|---|
| `total_epochs` | Total training epochs |
| `save_freq` | Checkpoint save frequency (steps) |
| `test_freq` | Validation frequency (steps) |
| `nnodes` / `n_gpus_per_node` | Distributed training setup |
| `fsdp_config` | FSDP memory optimization |
### Optimization Algorithms

```yaml
ajet:
  trainer_common:
    algorithm:
      adv_estimator: grpo
      use_kl_in_reward: False
    optim:
      lr: 1e-6
    use_kl_loss: True
    kl_loss_coef: 0.002
    kl_loss_type: low_var_kl
```

| Parameter | Description |
|---|---|
| `adv_estimator` | Advantage estimator (e.g., `grpo`) |
| `lr` | Learning rate |
| `use_kl_loss` | Include KL divergence in the loss |
| `kl_loss_coef` | KL loss coefficient |
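For reference, `low_var_kl` conventionally refers to the low-variance "k3" KL estimator popularized by verl and Schulman's approximating-KL note; whether AgentJet uses exactly this form is an assumption on our part. A minimal sketch:

```python
import torch

def low_var_kl(log_prob: torch.Tensor, ref_log_prob: torch.Tensor) -> torch.Tensor:
    """k3 estimator: exp(r) - r - 1 with r = log p_ref - log p.

    Non-negative by construction and lower-variance than the naive
    (log p - log p_ref) estimator. This mirrors verl's 'low_var_kl'
    option; AgentJet's exact implementation may differ.
    """
    log_ratio = ref_log_prob - log_prob
    return torch.exp(log_ratio) - log_ratio - 1.0
```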
## Debug Mode

When `backbone: debug`, additional settings are available:

```yaml
ajet:
  debug:
    debug_max_parallel: 16
    debug_first_n_tasks: 2
    debug_vllm_port: 18000
    debug_vllm_seed: 12345
    debug_tensor_parallel_size: 4
```
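Since the debug backend runs a vLLM server on `debug_vllm_port`, you can usually probe it directly while stepping through code. This sketch assumes the server exposes vLLM's standard OpenAI-compatible endpoint; that assumption is ours and is not stated in the config above:

```python
# Assumes a vLLM OpenAI-compatible server on debug_vllm_port (18000).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:18000/v1", api_key="EMPTY")
models = client.models.list()
print([m.id for m in models.data])  # the model currently being debugged
```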
### Debug Mode Use Cases

- **Limiting tasks**: quickly verify the pipeline on a few tasks
- **Fixing randomness**: `debug_vllm_seed` helps reproduce issues
- **Reduced parallelism**: easier to debug with smaller concurrency
## Logging & Monitoring

### Logger Selection

| Logger | Description |
|---|---|
| `console` | Standard output for quick progress checking |
| `wandb` | Weights & Biases experiment tracking |
| `swanlab` | SwanLab logging |
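The logger is chosen under `trainer_common` (the full template at the end of this page uses `swanlab`):

```yaml
ajet:
  trainer_common:
    logger: wandb  # console / wandb / swanlab
```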
### Output Structure

All experiment outputs are saved in `./launcher_record/{experiment_name}`:

| Directory | Contents |
|---|---|
| Logs | Logs and error messages |
| Metrics | Training metrics (depends on the logger) |
| Checkpoint | Model checkpoints |
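An illustrative layout; the directory names follow the table above, but the exact structure and casing may vary by backend:

```text
launcher_record/
└── my_experiment/
    ├── logs/         # runtime logs and error messages
    ├── metrics/      # training metrics (format depends on the logger)
    └── checkpoints/  # model checkpoints
```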
## Full Configuration Example

### Complete Configuration Template

```yaml
ajet:
  project_name: "ajet_default_project"
  experiment_name: "read_yaml_name"
  experiment_dir: "auto"
  backbone: debug

  model:
    path: /path/to/model/Qwen2.5-14B-Instruct

  data:
    max_prompt_length: 3000
    max_response_length: 15000
    train_batch_size: 32

  rollout:
    user_workflow: tutorial.example_appworld.appworld->ExampleAgentScopeWorkflow
    force_disable_toolcalls: False
    max_env_worker: 128
    gamma: 1.0
    compute_madness_checklist:
      - "nonsense"
    agent_madness_termination: True
    agent_madness_reward: -1.0
    max_response_length_in_one_turn: 4096
    max_model_len: 18000
    multi_turn:
      max_sample_per_task: 30
      max_steps: 30
      expected_steps: 1
    tensor_model_parallel_size: 1
    n_vllm_engine: 2
    max_num_seqs: 10
    name: vllm
    num_repeat: 4
    temperature: 0.9
    top_p: 1.0
    val_kwargs:
      temperature: 0.0
      top_k: -1
      top_p: 1.0
      do_sample: False
      num_repeat: 1

  task_reader:
    type: env_service
    env_service:
      env_type: "appworld"
      env_url: "http://127.0.0.1:8080"
      env_action_preference: code
    training_split: train
    validation_split: dev

  task_judge:
    judge_type: customized_protocol
    judge_protocol: ajet.task_judge.env_service_as_judge->EnvServiceJudge
    alien_llm_model: qwen3-235b-a22b-instruct-2507
    alien_llm_response_length: 512

  debug:
    debug_max_parallel: 16
    debug_first_n_tasks: 2
    debug_vllm_port: 18000
    debug_vllm_seed: 12345
    debug_tensor_parallel_size: 4

  trainer_common:
    val_before_train: False
    val_pass_n: 4
    save_freq: 20
    test_freq: 20
    total_epochs: 50
    nnodes: 1
    n_gpus_per_node: 8
    logger: swanlab
    algorithm:
      adv_estimator: grpo
      use_kl_in_reward: False
    mini_batch_num: 1
    fsdp_config:
      param_offload: True
      optimizer_offload: True
    optim:
      lr: 1e-6
    use_kl_loss: True
    kl_loss_coef: 0.002
    kl_loss_type: low_var_kl
    ulysses_sequence_parallel_size: 1
    checkpoint_base_dir: ./saved_checkpoints

  context_tracker:
    context_tracker_type: "linear"
    alien_llm_model: qwen3-235b-a22b-instruct-2507
    alien_llm_response_length: 512
```