AppWorld Agent

This tutorial demonstrates how to train an agent to interact with AppWorld and solve complex tasks through tool usage.

1. Overview

AppWorld is a high-fidelity execution environment of 9 day-to-day apps, operable via 457 APIs, populated with digital activities of 106 people living in a simulated world. The goal is to tune an agent that can effectively navigate and utilize these apps to complete complex tasks.

This document is organized as follows:

Quick Start: run the example with minimal setup
Understand: workflow loop, configuration, code locations, and reward
Results: training curve and qualitative cases

2. Quick Start

2.1 Preparation

First, download and unpack the AppWorld services. The script below is safe to re-run: it deletes any existing folder and archive, then downloads and unpacks a fresh copy.

base_path="/tmp"
export APPWORLD_PATH="${base_path}/pack_all_in_one"
export APPWORLD_SCRIPT="bash EnvService/env_sandbox/appworld.sh"

rm -rf "${APPWORLD_PATH}"
rm -f ./appworld_pack_v2.tar.gz

wget -q "https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/astuner_archive/appworld_pack_v2.tar.gz" -O appworld_pack_v2.tar.gz
tar -xzf ./appworld_pack_v2.tar.gz -C "${base_path}"

Then export the environment variables (re-run in every new shell):

export BASE_PATH=/tmp
export APPWORLD_PATH="${BASE_PATH}/pack_all_in_one"
export APPWORLD_SCRIPT="bash EnvService/env_sandbox/appworld.sh"
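Before launching training, it can help to confirm that the variables above are visible to the current shell. The following is a minimal sketch of such a check, not part of ajet; the function name check_appworld_env is hypothetical, and it only assumes the two variable names exported above.

```python
import os
from pathlib import Path
from typing import Mapping

# The two variables exported in the preparation step.
REQUIRED_VARS = ("APPWORLD_PATH", "APPWORLD_SCRIPT")


def check_appworld_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return a list of problems; an empty list means the setup looks fine.

    Hypothetical helper for illustration only (not an ajet API).
    """
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    path = env.get("APPWORLD_PATH")
    # The unpacked archive is expected to be a directory on disk.
    if path and not Path(path).is_dir():
        problems.append(f"APPWORLD_PATH does not point to a directory: {path}")
    return problems


if __name__ == "__main__":
    for problem in check_appworld_env():
        print("WARNING:", problem)
```

Running the file in a fresh shell prints one warning per missing variable, which is a quick way to notice that the exports were not re-run.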

2.2 Start Training

Run the training script:

ajet --conf tutorial/example_appworld/appworld.yaml --with-appworld

Quick Debugging (Optional)

If you want to breakpoint-debug the workflow/judge locally:

# (optional) recommended cleanup before debug
# ajet --kill="python|ray"

clear && \
ajet --conf tutorial/example_appworld/appworld.yaml --backbone='debug' --with-logview

With `--backbone=debug`, Ray is disabled. You can then use a VS Code `.vscode/launch.json` like the one below:

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Python Debugger: Launch rollout",
      "type": "debugpy",
      "request": "launch",
      "module": "ajet.launcher",
      "console": "integratedTerminal",
      "args": [
        "--backbone", "debug",
        "--conf", "./path/to/yaml.yaml"
      ],
      "env": {}
    }
  ]
}

3. Understand

This section explains how the AppWorld example is assembled: workflow, reward, configuration, and code locations.

3.1 Core Process

The AgentScope workflow code for the AppWorld example is located at tutorial/example_appworld/appworld.py.

The code first defines the AgentScope agent, with its model set to tuner.as_agentscope_model(), and then runs the interaction loop against the environment:

agent = ReActAgent(
    name="Qwen",
    sys_prompt=first_msg["content"],
    model=tuner.as_agentscope_model(),
    formatter=DashScopeChatFormatter(),
    memory=InMemoryMemory(),
    toolkit=None,
    print_hint_msg=False,
)

env = workflow_task.gym_env

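# `interaction_message` must already hold the first user message here
# (its setup is not shown in this excerpt).
for step in range(tuner.config.ajet.rollout.multi_turn.max_steps):
    # AgentScope handles the interaction message and produces the agent's reply
    reply_message = await agent(interaction_message)
    # env service protocol: send the reply as an assistant action
    obs, _, terminate, _ = env.step(
        action={"content": reply_message.content, "role": "assistant"}
    )
    # wrap the environment output as the next user message
    interaction_message = Msg(name="env", content=obs, role="user")
    # stop when the environment reports termination
    if terminate:
        break
    # stop when the context window is exhausted
    if tuner.get_context_tracker().context_overflow:
        break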

In the above code:

env.step: follows the gym interface. It takes an action as input and returns a four-tuple (observation, reward, terminate_flag, info). The loop above uses only the observation and the terminate flag and discards the other two values.
tuner.get_context_tracker().context_overflow: checks whether the current context has exceeded the token limit. If it has, the loop stops early.
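The control flow of the loop can be tried out without AppWorld or a model. The following is a minimal sketch under stated assumptions: ToyEnv and rollout are hypothetical names, the agent reply is a placeholder string, and a character budget stands in for the token-based context check that ajet performs.

```python
from typing import Any


class ToyEnv:
    """Hypothetical stand-in for the env service; it only mimics the step() return shape."""

    def __init__(self, finish_after: int) -> None:
        self.finish_after = finish_after
        self.turns = 0

    def step(self, action: dict[str, str]) -> tuple[str, float, bool, dict[str, Any]]:
        # Gym-style four-tuple: (observation, reward, terminate_flag, info).
        self.turns += 1
        terminate = self.turns >= self.finish_after
        obs = f"observation for turn {self.turns}"
        return obs, 0.0, terminate, {}


def rollout(env: ToyEnv, max_steps: int, max_context_chars: int) -> tuple[int, str]:
    """Run the loop and report (steps taken, stop reason)."""
    context_chars = 0
    for step in range(max_steps):
        reply = f"action {step}"  # placeholder for the agent's reply
        obs, _, terminate, _ = env.step({"content": reply, "role": "assistant"})
        # Assumption: a character count approximates the token-based tracker.
        context_chars += len(reply) + len(obs)
        if terminate:
            return step + 1, "terminated"
        if context_chars > max_context_chars:
            return step + 1, "context_overflow"
    return max_steps, "max_steps"
```

The three exit paths (environment termination, context overflow, step budget) correspond to the two break statements and the range bound in the workflow code.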

3.2 Reward

In ajet/task_judge/env_service_as_judge.py, we read the reward signal from the environment via env.evaluate(...).

You can also refer to this file to implement your own Judge for your specific task.
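As a rough illustration of the idea, a judge can be as small as a wrapper that asks the environment for a score and decides whether the episode counts as a success. This is a sketch only: ThresholdJudge, compute_reward, and FakeEnv are hypothetical names, and the no-argument evaluate() call is an assumption. The actual base class, method names, and evaluate arguments should be taken from ajet/task_judge/env_service_as_judge.py.

```python
from typing import Protocol


class SupportsEvaluate(Protocol):
    """Anything that can report a score for the finished episode (assumed shape)."""

    def evaluate(self) -> float: ...


class ThresholdJudge:
    """Hypothetical judge: converts the environment score into (reward, success)."""

    def __init__(self, success_threshold: float = 1.0) -> None:
        self.success_threshold = success_threshold

    def compute_reward(self, env: SupportsEvaluate) -> tuple[float, bool]:
        # The reward is the raw environment score; success is a threshold on it.
        score = float(env.evaluate())
        return score, score >= self.success_threshold


class FakeEnv:
    """Test double that returns a fixed score."""

    def __init__(self, score: float) -> None:
        self.score = score

    def evaluate(self) -> float:
        return self.score
```

Keeping the judge independent of the workflow makes it easy to unit-test the scoring rule with a fake environment before plugging it into training.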

3.3 Configuration Details

Copy tutorial/example_appworld/appworld.yaml and modify its key parameters. The parts most relevant to this document are marked with [key] in the YAML file:

- Read tasks (corresponding config field: ajet.task_reader)
- Define the workflow (corresponding config field: ajet.rollout.user_workflow)
  - Example: if the AgentScope workflow is defined in the ExampleAgentScopeWorkflow class in tutorial/example_appworld/appworld.py
  - Then set ajet.rollout.user_workflow = "tutorial.example_appworld.appworld->ExampleAgentScopeWorkflow"
- Define the scoring function (corresponding config field: ajet.task_judge.judge_protocol)
  - Example: ajet.task_judge.judge_protocol = "ajet.task_judge.env_service_as_judge->EnvServiceJudge"
- Specify the model (corresponding config field: ajet.model.path)

ajet:
  project_name: example_appworld
  experiment_name: "read_yaml_name"
  task_judge:
    # [key] Implement and select the evaluation function
    judge_protocol: ajet.task_judge.env_service_as_judge->EnvServiceJudge
  model:
    # [key] Set the model to be trained
    path: YOUR_MODEL_PATH
  rollout:
    # [key] Implement and select the Agent
    user_workflow: tutorial.example_appworld.appworld->ExampleAgentScopeWorkflow
    force_disable_toolcalls: True
  debug:
    debug_max_parallel: 1
    debug_first_n_tasks: 1
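The user_workflow and judge_protocol values use a "module.path->ClassName" string. The sketch below shows one way such a string could be resolved to a Python object with the standard library. It is not necessarily how ajet resolves it internally, and resolve_protocol is a hypothetical name. It can be useful for checking that a value is importable before starting a run.

```python
import importlib


def resolve_protocol(spec: str):
    """Resolve a 'module.path->ClassName' string to the named attribute.

    Hypothetical helper for illustration; ajet's own loader may differ.
    """
    module_path, sep, attr_name = spec.partition("->")
    module_path, attr_name = module_path.strip(), attr_name.strip()
    if not sep or not module_path or not attr_name:
        raise ValueError(f"expected 'module.path->ClassName', got {spec!r}")
    # import_module raises ImportError if the module cannot be found;
    # getattr raises AttributeError if the class name is misspelled.
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


if __name__ == "__main__":
    # Standard-library target, so the example runs without ajet installed.
    print(resolve_protocol("json->JSONDecoder"))
```

With ajet installed and the repository root on the import path, the same call could be pointed at "tutorial.example_appworld.appworld->ExampleAgentScopeWorkflow" to catch typos early.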

4. Results

4.1 Training Curve

Training curve (small batch)

Visualization: Training curves are generated by SwanLab. See Visualization Tools for setup and usage.

As training progresses, the reward increases. This usually means the agent has become more reliable at two things:

Following correct API protocols: it learns to look up API documentation before calling, and uses valid API endpoints instead of hallucinating non-existent ones.
Completing multi-step workflows: it can properly obtain access tokens and chain multiple API calls to accomplish complex tasks.

4.2 Case Study

Before tuning:

Frequently calls non-existent APIs

The agent hallucinates API names without checking whether they exist, leading to repeated failures.

Fails to follow the instructions to obtain an access token

The agent attempts to call protected APIs without first obtaining the required access token, resulting in authentication errors.

After tuning:

Looks up the API documentation first and uses only valid APIs

The agent now checks available APIs before making calls, avoiding hallucinated endpoints.

Obtains an access token correctly

The agent properly handles the authentication step before accessing protected APIs.

Token-level Visualization: These detailed logs are generated by Beast-Logger. See Beast-Logger Usage for more details.