Note: the Task Judger is disabled automatically when the user-defined workflow returns an effective reward, i.e. when `WorkflowOutput.reward` is not `None`.

The Task Judger evaluates agent outputs and assigns rewards during training. This page covers the built-in judgers for common scenarios and how to create custom judgers for specific evaluation needs.
Overview
A Task Judger evaluates the agent's execution results and returns two values:
| Return Value | Type | Description |
|---|---|---|
| `raw_reward` | `float` | Numerical score representing output quality (often 0.0 to 1.0) |
| `is_success` | `bool` | Whether the task was successfully completed |
These values guide the RL training process, helping agents learn which behaviors produce better outcomes.
Base Interface
All Task Judgers inherit from `BaseJudge` and implement the `compute_reward` method:

```python
# Interface of BaseJudge, as defined in ajet.task_judge.base_judge
from ajet.workflow import WorkflowOutput, WorkflowTask


class BaseJudge:
    def __init__(self, config):
        self.config = config

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput
    ) -> tuple[float, bool]:
        """
        Args:
            workflow_task: Contains the task data, including metadata with reference answers
            workflow_output: Contains the agent's output, including metadata with generated answers

        Returns:
            tuple: (raw_reward: float, is_success: bool)
        """
        raise NotImplementedError
```
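For example, a minimal judger that marks every rollout as successful (useful only as a placeholder while wiring a pipeline together) could look like the sketch below; `AlwaysPassJudge` is illustrative and not shipped with AgentJet.

```python
class AlwaysPassJudge(BaseJudge):
    """Toy judger for illustration: full reward for every rollout."""

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput
    ) -> tuple[float, bool]:
        # Always report a perfect score and a successful task.
        return 1.0, True
```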
Built-in Task Judgers
AgentJet provides three built-in judgers for common evaluation scenarios:
1. MathAnswerAsJudge
Evaluates mathematical answers by exact string matching; designed for tasks where answers are formatted in LaTeX `\boxed{}` notation.

When to use

- Math problem solving tasks
- Tasks with deterministic, exact answers
- Answers formatted as `\boxed{result}`

How it works

- Extracts the answer from `\boxed{...}` in the agent's output
- Compares it with the reference answer from `workflow_task.task.metadata["answer"]`
- Returns `(1.0, True)` for a correct answer, `(0.0, False)` otherwise (a simplified sketch follows the metadata table below)
Required metadata:
| Field | Source | Description |
|---|---|---|
| `final_answer` | `workflow_output.metadata` | Agent's answer in `\boxed{}` format |
| `answer` | `workflow_task.task.metadata` | Reference answer |
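The extraction-and-comparison logic described above can be approximated in a few lines. The sketch below is illustrative (a simple regex over the last `\boxed{...}` occurrence) and is not the exact implementation shipped with AgentJet.

```python
import re


def extract_boxed(text: str) -> str | None:
    """Return the content of the last \\boxed{...} in the text, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1].strip() if matches else None


def math_exact_match_reward(agent_output: str, reference_answer: str) -> tuple[float, bool]:
    """Exact-match scoring in the spirit of MathAnswerAsJudge (illustrative)."""
    predicted = extract_boxed(agent_output)
    if predicted is not None and predicted == reference_answer.strip():
        return 1.0, True
    return 0.0, False


# math_exact_match_reward("The result is \\boxed{42}.", "42") -> (1.0, True)
```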
2. CountdownAnswerAsJudge
Evaluates mathematical equations, awarding partial credit for proper formatting.

When to use

- Number puzzle tasks (e.g., the Countdown game)
- Tasks where partial credit is appropriate
- Tasks where proper formatting should be rewarded even when the answer is wrong (see the sketch after this list)
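As a rough illustration of this partial-credit idea (the exact scoring used by `CountdownAnswerAsJudge` may differ), a judger of this kind typically awards a small format score whenever the agent emits a well-formed equation over the given numbers, and the full reward only when the equation hits the target:

```python
import re


def countdown_style_reward(
    equation: str,
    target: int,
    allowed_numbers: list[int],
    format_score: float = 0.1,
) -> tuple[float, bool]:
    """Illustrative partial-credit scoring for Countdown-style puzzles."""
    # A well-formed attempt contains only digits, whitespace, parentheses,
    # and + - * /, and uses exactly the allowed numbers.
    well_formed = (
        re.fullmatch(r"[\d\s+\-*/()]+", equation) is not None
        and sorted(int(n) for n in re.findall(r"\d+", equation)) == sorted(allowed_numbers)
    )
    if not well_formed:
        return 0.0, False
    try:
        value = eval(equation, {"__builtins__": {}}, {})  # acceptable here: digits/operators only
    except (SyntaxError, ZeroDivisionError, TypeError):
        return format_score, False
    if abs(value - target) < 1e-6:
        return 1.0, True
    return format_score, False  # well formed but wrong result


# countdown_style_reward("(6 * 5) - 5", 25, [5, 5, 6]) -> (1.0, True)
```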
3. EnvServiceJudge
Delegates evaluation to an external environment service, useful for complex interactive environments.
When to use
- Tasks with external simulators (e.g., AppWorld)
- Complex state-based evaluation
- Interactive environments with built-in evaluators
Configuration:

```yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: ajet.task_judge.env_service_as_judge->EnvServiceJudge
```
How it works

- Calls `workflow_task.gym_env.evaluate()` to get a score from the environment
- Converts the score to a normalized reward (sketched below):
  - Success (score ≥ 1): `1.0 + score * 0.5`
  - Failure (score < 1): `0.0 + score * 0.5`
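The conversion above can be summarized in a few lines. This is a simplified sketch of the behavior described here, assuming `gym_env.evaluate()` returns a plain numeric score; it omits whatever error handling the real judger performs.

```python
def env_service_style_reward(score: float) -> tuple[float, bool]:
    """Score-to-reward conversion as described above: a base of 1.0 on
    success (score >= 1) or 0.0 on failure, plus a score * 0.5 bonus."""
    is_success = score >= 1
    base = 1.0 if is_success else 0.0
    return base + score * 0.5, is_success


# Inside a judger, the score would come from the environment, for example:
#   score = workflow_task.gym_env.evaluate()
#   raw_reward, is_success = env_service_style_reward(score)
```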
Creating Custom Task Judgers
For specialized evaluation needs, create your own judger by inheriting from `BaseJudge`:

1. Implement Your Judger: create a new file with your custom judger class.
2. Configure Your Judger: point to your custom class in the YAML configuration.
3. Pass Data to the Judger: populate `workflow_output.metadata` with the data your judger needs.
Step 1: Implement Your Judger
```python
from ajet.task_judge.base_judge import BaseJudge
from ajet.workflow import WorkflowOutput, WorkflowTask


class MyCustomJudge(BaseJudge):
    def __init__(self, config):
        super().__init__(config)
        self.threshold = 0.8  # minimum similarity counted as success

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput
    ) -> tuple[float, bool]:
        agent_answer = workflow_output.metadata.get("final_answer", "")
        reference_answer = workflow_task.task.metadata.get("answer", "")

        similarity = self._compute_similarity(agent_answer, reference_answer)
        is_success = similarity >= self.threshold
        return similarity, is_success

    def _compute_similarity(self, text1: str, text2: str) -> float:
        # Word-overlap similarity: shared words divided by the longer word count.
        return len(set(text1.split()) & set(text2.split())) / max(
            len(text1.split()), len(text2.split()), 1
        )
```
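To sanity-check the judger outside of a training run, you can call `compute_reward` directly with lightweight stand-ins for `WorkflowTask` and `WorkflowOutput`; the `SimpleNamespace` objects below are hypothetical test doubles that only mimic the attributes the judger reads, not real AgentJet types.

```python
from types import SimpleNamespace

judge = MyCustomJudge(config={})

# Hypothetical stand-ins exposing just the attributes compute_reward accesses.
task = SimpleNamespace(
    task=SimpleNamespace(metadata={"answer": "paris is the capital of france"})
)
output = SimpleNamespace(metadata={"final_answer": "the capital of france is paris"})

reward, is_success = judge.compute_reward(task, output)
print(reward, is_success)  # 1.0 True (identical word sets)
```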
Step 2: Configure Your Judger
```yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: tutorial.my_task.my_judge->MyCustomJudge
```
Step 3: Pass Data to the Judger
```python
class MyWorkflow(Workflow):
    async def execute(self, task: WorkflowTask, tuner: AjetTuner) -> WorkflowOutput:
        # ... agent setup and message construction omitted ...
        final_answer = await self.agent.reply(msg)

        return WorkflowOutput(
            reward=None,  # left as None so the Task Judger fills it in
            metadata={
                "final_answer": final_answer,  # read by the judger via workflow_output.metadata
            },
        )
```
Configuration Summary
```yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: ajet.task_judge.<module>-><ClassName>
```