Dataset Generator¶

DatasetGenerator is a powerful tool for generating and transforming datasets using Dria, supporting both prompt-based and singleton-based data generation workflows.

Core Features¶

Parallel execution capabilities
Automatic schema validation
Multiple model support
Search capabilities
Sequential workflow processing

Basic Usage¶

DatasetGenerator requires a DriaDataset to operate.

Using Prompts¶

Prompt-based generation is a simple way to generate data using a single prompt. Dria will apply the prompt to each instruction. Prompts are defined using the Prompt class.

import asyncio
from dria import Prompt, DatasetGenerator, DriaDataset, Model
from pydantic import BaseModel, Field


# Define output schema
class Tweet(BaseModel):
    topic: str = Field(..., title="Topic")
    tweet: str = Field(..., title="tweet")


# Create dataset
dataset = DriaDataset(
    name="tweet_test", description="A dataset of tweets!", schema=Tweet
)

After dataset is created, you can define instructions and prompts to apply to the instructions. Prompts accept variables with double curly braces {{variable}}.

instructions = [{"topic": "BadBadNotGood"}, {"topic": "Decentralized synthetic data"}]

prompter = Prompt(prompt="Write a tweet about {{topic}}", schema=Tweet)
generator = DatasetGenerator(dataset=dataset)

asyncio.run(
    generator.generate(
        instructions=instructions, singletons=prompter, models=Model.GPT4O
    )
)

print(dataset.to_pandas())

Using Singletons¶

Dria provides a factory for pre-built singletons. Singletons are custom classes that define a specific workflow for generating data.

from dria import DriaDataset, DatasetGenerator, Model
from dria.factory import GenerateSubtopics

my_dataset = DriaDataset(
    name="subtopics",
    description="A dataset for subtopics",
    schema=GenerateSubtopics.OutputSchema,
)
generator = DatasetGenerator(dataset=my_dataset)

Model Configuration¶

Single Model¶

models = Model.GPT4O

Multiple Models¶

models = [Model.GPT4O, Model.GEMINI_15_FLASH]

Model Pipeline¶

models = [
    [Model.GPT4O],           # For first singleton
    [Model.GEMINI_15_FLASH], # For second singleton
    [Model.LLAMA3_1_8B_FP16] # For third singleton
]