Introduction

Data Generation Reader provides an intelligent data generation method designed to simplify the creation of high-quality training data. The method is flexible and efficient: it generates domain-specific tasks from a handful of few-shot examples and, optionally, reference documents.

Method

Data Generation Reader employs a two-stage task generation process:

Stage 1 (Optional): Document-based Data Generation

This optional stage generates knowledge-based tasks from the provided documents. Users can supply one or more documents (in formats such as PDF, Word, or TXT):

According to the Anti-Money Laundering and Counter-Terrorist Financing Ordinance and related Guideline, banks are required to identify and take reasonable measures to verify the identity of the beneficial owner of corporate customers so that the bank is ...

The generator reads the documents and guides the LLM to batch-generate tasks related to their content:

[
  {
    "main_query": "What are the key requirements of Customer Due Diligence in AML procedures?",
    "related_doc": "Customer Due Diligence measures should include: (a) identifying the customer and verifying the customer's identity..."
  },
  {
    "main_query": "How should financial institutions handle Suspicious Transaction Reports?",
    "related_doc": "When someone knows or suspects that any property represents the proceeds of an indictable offense..."
  }
  ...
]

If documents are provided, the tasks generated in this stage are also added to the validation task set used in the subsequent training process.
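
Before generation, each document is parsed and split into chunks; the chunk_size and split_by options shown in Method 2 of the Quick Start control this behavior. The following is a minimal sketch of what sentence-based chunking looks like, not the reader's actual implementation:

import re

def chunk_by_sentence(text: str, chunk_size: int = 5120) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks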

Stage 2: Few-shot Data Generation

This stage generates the final training tasks. Few-shot Data Generation combines a few user-provided tasks with the knowledge-based tasks generated in the first stage, and uses the documents as references to generate training tasks. First, the user needs to provide a few task examples:

{"main_query": "Can banks ask corporate customers to provide information of its ownership?", "answer": "According to the Anti-Money Laundering and ..."}
{"main_query": "Can a bank close my account?", "answer": "Either a customer or a bank may close an account at any time subject to any specific terms and ..."}
...

These examples are merged with the tasks generated in the first stage to form an example task set. The generator samples from this set for few-shot demonstrations and, combined with the relevant documents, guides the LLM to batch-generate training tasks (a conceptual sketch of this sampling step follows the example below):

[
  {
    "main_query": "Are financial institutions required to verify the source of funds for corporate clients during account opening?"
  },
  {
    "main_query": "What are the requirements for banks to verify customer identities under anti-money laundering regulations?"
  }
  ...
]
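
Conceptually, each generation call samples a few demonstrations from the example task set and pairs them with a document chunk in the prompt. The following is a minimal sketch of that step; the prompt wording is illustrative, not the reader's actual template:

import json
import random

def build_generation_prompt(example_tasks: list[dict], doc_chunk: str, num_shots: int = 3) -> str:
    """Sample few-shot demonstrations and combine them with a document chunk."""
    demos = random.sample(example_tasks, k=min(num_shots, len(example_tasks)))
    demo_lines = "\n".join(json.dumps({"main_query": d["main_query"]}) for d in demos)
    return (
        "Generate new training tasks in the same JSON format as the examples below.\n"
        f"Examples:\n{demo_lines}\n"
        f"Reference document:\n{doc_chunk}\n"
        "New tasks:"
    )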

Quick Start

Data Generation Reader loads a few user-provided tasks and optional documents (in formats such as PDF, Word, and TXT) from a local path, generates new tasks, and loads them as training tasks.

Step 1: Prepare data

Provide a few example tasks:

{"main_query": "What is the capital of France?", "answer": "..."}
{"main_query": "How to cook pasta?", "answer": "..."}

(Optional) Provide documents and place them in the specified directory:

mkdir -p dataset/document
cp your-document.pdf dataset/document/

Step 2: Generate Training Tasks

Method 1: Integrate Data Generation into the Training Pipeline

Copy and modify the key configuration parameters in ajet/default_config/ajet_default.yaml, and set ajet.task_reader.type to data_generation to enable this reader.

ajet:
  task_reader:
    type: data_generation
    # when `type == data_generation`
    data_generation:
      # Document reader configuration
      document_reader:
        document_path:
          - 'dataset/document/your-document1.pdf'
          - 'dataset/document/your-document2.pdf'
        languages:
          - eng
      # Task reader (for existing tasks)
      query_reader:
        type: jsonl_dataset_file
        jsonl_dataset_file:
          training:
            file_path: 'dataset/jsonl/your-queries.jsonl'
      # Number of tasks to generate
      task_num: 10
      # LLM config
      llm_model: qwen-long
      llm_response_length: 8192
      num_workers: 32
      sampling_params:
        temperature: 0
      # Task filtering config
      deduplication_filter:
        enabled: true
        params:
          similarity_threshold: 0.8
          db_path: ./.similarity_db
          model: text-embedding-v4
          api_key: null # if null, loaded from the DASHSCOPE_API_KEY environment variable
          base_url: https://dashscope.aliyuncs.com/compatible-mode/v1
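
The deduplication filter embeds each generated task and discards any whose similarity to an already-accepted task exceeds similarity_threshold. The following simplified sketch illustrates the idea; embed_fn stands in for the configured embedding model, and the actual filter additionally caches embeddings under db_path:

import numpy as np

def deduplicate(queries: list[str], embed_fn, similarity_threshold: float = 0.8) -> list[str]:
    """Keep a query only if it is not too similar to any previously kept query."""
    kept, kept_vecs = [], []
    for query in queries:
        vec = np.asarray(embed_fn(query), dtype=float)
        vec /= np.linalg.norm(vec)  # normalize so the dot product is cosine similarity
        if all(float(vec @ other) < similarity_threshold for other in kept_vecs):
            kept.append(query)
            kept_vecs.append(vec)
    return kept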

Method 2: Run the Generation Script

from ajet.data_generator.config import (
    DataGenerationConfig,
    DatasetFileConfig,
    DeduplicationFilterConfig,
    DeduplicationFilterParamsConfig,
    DocumentReaderConfig,
    QueryReaderConfig,
    SamplingParamsConfig,
    TaskReaderConfig,
    TrainingDatasetConfig,
)
from ajet.task_reader.data_generator_reader import DataGeneratorTaskReader

def run():
    config = TaskReaderConfig(
        data_generation=DataGenerationConfig(
            # Documents used for knowledge-based task generation (Stage 1)
            document_reader=DocumentReaderConfig(
                document_path=['dataset/document/your-document1.pdf', 'dataset/document/your-document2.pdf'],
                languages=["eng"],
                chunk_size=5120,      # maximum characters per document chunk
                split_by="sentence",  # split chunks on sentence boundaries
            ),
            # User-provided example tasks (few-shot seed for Stage 2)
            query_reader=QueryReaderConfig(
                type="jsonl_dataset_file",
                jsonl_dataset_file=DatasetFileConfig(
                    training=TrainingDatasetConfig(file_path='dataset/jsonl/your-queries.jsonl')
                ),
            ),
            task_num=50,            # number of tasks to generate
            llm_model="qwen-long",  # LLM used for generation
            num_workers=16,         # parallel generation workers
            sampling_params=SamplingParamsConfig(temperature=0.0),
            # Drop near-duplicate tasks by embedding similarity
            deduplication_filter=DeduplicationFilterConfig(
                enabled=True,
                params=DeduplicationFilterParamsConfig(
                    similarity_threshold=0.8,
                    model="text-embedding-v4",
                ),
            ),
        )
    )
    # Create the reader; it generates tasks per the config and loads them as training tasks.
    reader = DataGeneratorTaskReader(reader_config=config)

if __name__ == "__main__":
    run()

Generated Task Examples

Based on user-provided documents (optional) and a few task examples, the Data Generation Reader can batch-generate training tasks:

[
  {
    "main_query": "Are financial institutions required to verify the source of funds for corporate clients during account opening?"
  },
  {
    "main_query": "What are the requirements for banks to verify customer identities under anti-money laundering regulations?"
  }
  ...
]

Detailed Config Options

| Parameter Path | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| document_reader.document_path | list[str] | - | No | List of document file paths. Supports PDF, Word, TXT, and more. |
| document_reader.languages | list[str] | ['eng'] | No | List of document languages for OCR and text parsing, e.g., eng (English), chs (Simplified Chinese). |
| query_reader.type | str | jsonl_dataset_file | Yes | Reader type. Options: jsonl_dataset_file, env_service, huggingface_dat_repo. |
| query_reader.jsonl_dataset_file.training.file_path | str | - | Yes | Path to the training tasks JSONL file (when type: jsonl_dataset_file). |
| task_num | int | 10 | Yes | Number of tasks to generate. The actual number may be reduced by filtering. |
| llm_model | str | qwen-long | Yes | LLM model name used for task generation. |
| llm_response_length | int | 8192 | No | Maximum number of tokens in the LLM response. |
| num_workers | int | 32 | No | Number of parallel worker threads for speeding up task generation. |
| sampling_params.temperature | float | 0 | No | Sampling temperature. 0 means greedy decoding (deterministic output); higher values make outputs more random. |
| deduplication_filter.enabled | bool | true | No | Whether to enable the deduplication filter. |
| deduplication_filter.params.similarity_threshold | float | 0.8 | Yes | Similarity threshold (0–1). Tasks above this threshold will be filtered out. |
| deduplication_filter.params.db_path | str | ./.similarity_db | No | Path to the similarity database used to cache embeddings. |
| deduplication_filter.params.model | str | text-embedding-v4 | Yes | Embedding model used to compute similarity. |
| deduplication_filter.params.api_key | str | null | No | API key. If null, it will be loaded from the DASHSCOPE_API_KEY environment variable. |
| deduplication_filter.params.base_url | str | https://dashscope.aliyuncs.com/compatible-mode/v1 | No | Base URL for the embedding API. |