Dataset Generator¶
DatasetGenerator is a tool for generating and transforming datasets using Dria. It supports both prompt-based and singleton-based data generation workflows.
Core Features¶
- Parallel execution capabilities
- Automatic schema validation
- Multiple model support
- Search capabilities
- Sequential workflow processing
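As a rough illustration of what parallel execution over a list of instructions means, the sketch below runs one coroutine per instruction with asyncio.gather. It does not call Dria: fake_generate and generate_all are hypothetical stand-ins for a model call, and the way DatasetGenerator schedules work internally is not described here.

```python
import asyncio


async def fake_generate(instruction: dict) -> dict:
    # Hypothetical stand-in for a model call; NOT part of the Dria API.
    await asyncio.sleep(0.01)  # simulate network latency
    topic = instruction["topic"]
    return {"topic": topic, "tweet": f"A tweet about {topic}"}


async def generate_all(instructions: list) -> list:
    # Start one task per instruction and wait for all of them.
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(fake_generate(i) for i in instructions))


rows = asyncio.run(
    generate_all([{"topic": "BadBadNotGood"}, {"topic": "Decentralized synthetic data"}])
)
print(rows)
```

Because gather preserves input order, each result can be matched back to the instruction that produced it.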
Basic Usage¶
DatasetGenerator requires a DriaDataset to operate.
Using Prompts¶
Prompt-based generation is a simple way to generate data using a single prompt. Dria will apply the prompt to each instruction.
Prompts are defined using the Prompt class.
import asyncio

from dria import Prompt, DatasetGenerator, DriaDataset, Model
from pydantic import BaseModel, Field


# Define output schema
class Tweet(BaseModel):
    topic: str = Field(..., title="Topic")
    tweet: str = Field(..., title="tweet")


# Create dataset
dataset = DriaDataset(
    name="tweet_test", description="A dataset of tweets!", schema=Tweet
)
After the dataset is created, define the instructions and the prompt to apply to them.
Prompts accept variables written in double curly braces, such as {{variable}}. Each instruction supplies the values for those variables.
instructions = [{"topic": "BadBadNotGood"}, {"topic": "Decentralized synthetic data"}]
prompter = Prompt(prompt="Write a tweet about {{topic}}", schema=Tweet)
generator = DatasetGenerator(dataset=dataset)
asyncio.run(
    generator.generate(
        instructions=instructions, singletons=prompter, models=Model.GPT4O
    )
)
print(dataset.to_pandas())
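To make the {{variable}} substitution concrete, here is a minimal sketch of how a template such as "Write a tweet about {{topic}}" could be filled in once per instruction. This is not Dria's implementation: render and PLACEHOLDER are hypothetical names, and the choice to leave unknown placeholders untouched is an assumption made for the example.

```python
import re

# Hypothetical helper, NOT part of the Dria API.
# Matches {{name}} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} with variables[name]; keep unknown names as-is."""

    def substitute(match):
        key = match.group(1)
        return str(variables.get(key, match.group(0)))

    return PLACEHOLDER.sub(substitute, template)


instructions = [{"topic": "BadBadNotGood"}, {"topic": "Decentralized synthetic data"}]

# One rendered prompt per instruction.
prompts = [render("Write a tweet about {{topic}}", row) for row in instructions]
for prompt in prompts:
    print(prompt)
```

Each dictionary in instructions yields one rendered prompt, which is why the keys of every instruction should match the variable names used in the template.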
Using Singletons¶
Dria provides a factory for pre-built singletons. Singletons are custom classes that define a specific workflow for generating data.
from dria import DriaDataset, DatasetGenerator, Model
from dria.factory import GenerateSubtopics
my_dataset = DriaDataset(
    name="subtopics",
    description="A dataset for subtopics",
    schema=GenerateSubtopics.OutputSchema,
)
generator = DatasetGenerator(dataset=my_dataset)