Note: the Task Judger is disabled automatically when the user-defined workflow returns an effective reward, i.e. when WorkflowOutput.reward is not None.

A Task Judger evaluates agent outputs and assigns rewards during training. This page covers the built-in judgers for common scenarios and how to create custom judgers for specific evaluation needs.

Overview

A Task Judger evaluates the agent's execution results and returns two values:

  • raw_reward (float): numerical score representing output quality (often 0.0 to 1.0)
  • is_success (bool): whether the task was successfully completed

These values guide the RL training process, helping agents learn which behaviors produce better outcomes.

Base Interface

All Task Judgers inherit from BaseJudge and implement the compute_reward method:

base_judge.py
from ajet.workflow import WorkflowOutput, WorkflowTask

class BaseJudge:
    def __init__(self, config):
        self.config = config

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput
    ) -> tuple[float, bool]:
        """
        Args:
            workflow_task: Contains the task data, including metadata with reference answers
            workflow_output: Contains the agent's output, including metadata with generated answers

        Returns:
            tuple: (raw_reward: float, is_success: bool)
        """
        raise NotImplementedError

Built-in Task Judgers

AgentJet provides three built-in judgers for common evaluation scenarios:

1. MathAnswerAsJudge

Evaluates mathematical answers by exact string matching, designed for tasks where answers are formatted in LaTeX \boxed{} notation.

When to use

  • Math problem solving tasks
  • Tasks with deterministic, exact answers
  • Answers formatted as \boxed{result}
config.yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: tutorial.example_math_agent.math_answer_as_judge->MathAnswerAsJudge

How it works

  1. Extracts the answer from \boxed{...} in the agent's output
  2. Compares with the reference answer from workflow_task.task.metadata["answer"]
  3. Returns (1.0, True) for correct answers, (0.0, False) otherwise

Required metadata:

  • final_answer (in workflow_output.metadata): the agent's answer in \boxed{} format
  • answer (in workflow_task.task.metadata): the reference answer
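
The exact-match check can be sketched as follows. This is an illustrative reimplementation rather than the shipped MathAnswerAsJudge: the extract_boxed helper and its regex are assumptions, while the metadata keys match the table above.

import re

from ajet.task_judge.base_judge import BaseJudge
from ajet.workflow import WorkflowOutput, WorkflowTask

def extract_boxed(text: str) -> str:
    """Return the content of the last \\boxed{...} in text, or '' if none is found."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else ""

class BoxedExactMatchJudge(BaseJudge):
    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput,
    ) -> tuple[float, bool]:
        # Compare the boxed answer against the reference answer by exact string match
        predicted = extract_boxed(workflow_output.metadata.get("final_answer", ""))
        reference = str(workflow_task.task.metadata.get("answer", "")).strip()
        correct = predicted != "" and predicted == reference
        return (1.0, True) if correct else (0.0, False)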

2. CountdownAnswerAsJudge

Evaluates mathematical equations with partial credit for proper formatting.

When to use

  • Number puzzle tasks (e.g., Countdown game)
  • Tasks where partial credit is appropriate
  • Need to reward proper formatting even when answer is wrong
config.yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: tutorial.example_countdown.countdown_answer_as_judge->CountdownAnswerAsJudge

Scoring:

  • 0.0: invalid or missing answer
  • 0.1: properly formatted equation but wrong result
  • 1.0: correct equation and result
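
A minimal sketch of this scoring scheme, assuming the task supplies a target value and the list of usable numbers. The function name and its inputs are illustrative, not the actual CountdownAnswerAsJudge.

import re

def score_countdown(equation: str, target: int, numbers: list[int]) -> float:
    # Invalid or missing answer -> 0.0
    if not equation or not re.fullmatch(r"[\d+\-*/() ]+", equation):
        return 0.0
    try:
        result = eval(equation, {"__builtins__": {}}, {})  # arithmetic expression only
    except Exception:
        return 0.0
    # Properly formatted equation: full credit only if it uses exactly the given
    # numbers and hits the target, otherwise the 0.1 formatting credit.
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used == sorted(numbers) and abs(result - target) < 1e-6:
        return 1.0
    return 0.1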

3. EnvServiceJudge

Delegates evaluation to an external environment service, useful for complex interactive environments.

When to use

  • Tasks with external simulators (e.g., AppWorld)
  • Complex state-based evaluation
  • Interactive environments with built-in evaluators
config.yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: ajet.task_judge.env_service_as_judge->EnvServiceJudge

How it works

  1. Calls workflow_task.gym_env.evaluate() to get a score from the environment
  2. Converts the score into a reward:
    • Success (score ≥ 1): 1.0 + score * 0.5
    • Failure (score < 1): 0.0 + score * 0.5
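
Expressed as code, the conversion is roughly the following (a sketch of the rule above, not the actual implementation):

def env_score_to_reward(score: float) -> tuple[float, bool]:
    # Success bonus of 1.0 when the environment score reaches 1, plus half the raw score
    is_success = score >= 1
    base = 1.0 if is_success else 0.0
    return base + score * 0.5, is_success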

Creating Custom Task Judgers

For specialized evaluation needs, create your own judger by inheriting from BaseJudge:

Custom Judger Steps
  1. Implement Your Judger: create a new file with your custom judger class.
  2. Configure Your Judger: point to your custom class in the YAML configuration.
  3. Pass Data to the Judger: populate `workflow_output.metadata` with the data your judger needs.

Step 1: Implement Your Judger

tutorial/my_task/my_judge.py
from ajet.task_judge.base_judge import BaseJudge
from ajet.workflow import WorkflowOutput, WorkflowTask

class MyCustomJudge(BaseJudge):
    def __init__(self, config):
        super().__init__(config)
        self.threshold = 0.8

    def compute_reward(
        self,
        workflow_task: WorkflowTask,
        workflow_output: WorkflowOutput
    ) -> tuple[float, bool]:
        agent_answer = workflow_output.metadata.get("final_answer", "")
        reference_answer = workflow_task.task.metadata.get("answer", "")

        similarity = self._compute_similarity(agent_answer, reference_answer)
        is_success = similarity >= self.threshold
        return similarity, is_success

    def _compute_similarity(self, text1: str, text2: str) -> float:
        # Word-overlap ratio: shared unique words divided by the longer answer's word count
        return len(set(text1.split()) & set(text2.split())) / max(
            len(text1.split()), len(text2.split()), 1
        )
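
As a quick local check of the similarity metric (the inputs here are made up for illustration):

judge = MyCustomJudge(config={})
print(judge._compute_similarity("the cat sat on the mat", "the cat sat"))  # 0.5, below the 0.8 threshold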

Step 2: Configure Your Judger

config.yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: tutorial.my_task.my_judge->MyCustomJudge

Step 3: Pass Data to the Judger

workflow.py
class MyWorkflow(Workflow):
    async def execute(self, task: WorkflowTask, tuner: AjetTuner) -> WorkflowOutput:
        # ... build the prompt `msg` from the task, then run the agent ...
        final_answer = await self.agent.reply(msg)
        return WorkflowOutput(
            reward=None,  # Will be filled by the judger
            metadata={
                "final_answer": final_answer,
            }
        )
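
The keys written into workflow_output.metadata here (e.g. final_answer) must match the keys your judger reads in compute_reward, and reward must stay None so that the judger is not disabled, as noted at the top of this page.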

Configuration Summary

config.yaml
ajet:
  task_judge:
    judge_type: customized_protocol
    judge_protocol: <module.path>-><ClassName>  # e.g. tutorial.my_task.my_judge->MyCustomJudge
