Exporting Data¶
Data within DriaDataset can export to different formats:
# Export to pandas DataFrame
df = dataset.to_pandas()
# Export to JSONL
dataset.to_jsonl()
# Export to JSON
dataset.to_json()
# Export to custom path
dataset.to_json("export.jsonl")
Format for Training¶
DriaDataset enables TRL ready data exports for multiple training setups.
The Formatter
class is used to convert a dataset into a format that can be used by a specific trainer.
Data generated by Dria Network can be transformed into training-ready data using Formatter
Format Types¶
The Formatter
class supports the following format types:
- Standard
- Conversational
and following subtypes for each format type:
- LANGUAGE_MODELING
- PROMPT_ONLY
- PROMPT_COMPLETION
- PREFERENCE
- UNPAIRED_PREFERENCE
HuggingFace TRL Expected Dataset Formats¶
HuggingFace's TRL is a framework to train transformer language models with Reinforcement Learning, from the Supervised Fine-tuning step (SFT), Reward Modeling step (RM) to the Proximal Policy Optimization (PPO) step.
Dria allows you to convert the generated data into the expected dataset format for each trainer in the TRL framework. Enabling seamless plug-n-play with HuggingFace's TRL.
Here is an example exporting MagPie data in CONVERSATIONAL_PROMPT_COMPLETION
format.
First create dataset and generate data.
Generate Data¶
from dria import DriaDataset, DatasetGenerator, Model
from dria.factory import MagPie
import asyncio
from dria.utils import ConversationMapping, FieldMapping, FormatType
instructions = [
{
"instructor_persona": "A math student",
"responding_persona": "An AI teaching assistant.",
"num_turns": 3,
},
{
"instructor_persona": "A chemistry student",
"responding_persona": "An AI teaching assistant.",
"num_turns": 3,
},
{
"instructor_persona": "A physics student",
"responding_persona": "An AI teaching assistant.",
"num_turns": 3,
},
{
"instructor_persona": "A music student",
"responding_persona": "An AI teaching assistant.",
"num_turns": 5,
},
{
"instructor_persona": "A visual arts student",
"responding_persona": "An AI teaching assistant.",
"num_turns": 2,
},
]
my_dataset = DriaDataset("magpie_test", "a test dataset", MagPie.OutputSchema)
generator = DatasetGenerator(dataset=my_dataset)
asyncio.run(
generator.generate(
instructions,
MagPie,
[
Model.ANTHROPIC_HAIKU_3_5_OR,
Model.QWEN2_5_72B_OR,
Model.LLAMA_3_1_8B_OR,
Model.LLAMA3_1_8B_FP16,
],
)
)
Export Data¶
You can export data by creating a ConversationMapping for CONVERSATIONAL_PROMPT_COMPLETION. MagPie outputs:
class DialogueTurn(BaseModel):
instructor: str = Field(..., description="Instructor's message")
responder: str = Field(..., description="Responder's message")
class DialogueOutput(BaseModel):
dialogue: List[DialogueTurn] = Field(..., description="List of dialogue turns")
model: str = Field(..., description="Model used for generation")
So DriaDataset will read the dialogue
field and map instructor
to user_message
and responder
to the assistant_message
.
DriaDataset will export a jsonl file in suitable format.
cmap = ConversationMapping(
conversation=FieldMapping(user_message="instructor", assistant_message="responder"),
field="dialogue",
)
my_dataset.format_for_training(
FormatType.CONVERSATIONAL_PROMPT_COMPLETION, cmap, output_format="jsonl"
)
Here are the full list of TRL formats mapped to Dria formats.
Trainer | Expected Dataset Type |
---|---|
BCOTrainer | FormatType.STANDARD_UNPAIRED_PREFERENCE |
CPOTrainer | FormatType.STANDARD_PREFERENCE |
DPOTrainer | FormatType.STANDARD_PREFERENCE |
GKDTrainer | FormatType.STANDARD_PROMPT_COMPLETION |
IterativeSFTTrainer | FormatType.STANDARD_UNPAIRED_PREFERENCE |
KTOTrainer | FormatType.STANDARD_UNPAIRED_PREFERENCE or FormatType.STANDARD_PREFERENCE |
NashMDTrainer | FormatType.STANDARD_PROMPT_ONLY |
OnlineDPOTrainer | FormatType.STANDARD_PROMPT_ONLY |
ORPOTrainer | FormatType.STANDARD_PREFERENCE |
PPOTrainer | FormatType.STANDARD_LANGUAGE_MODELING |
RewardTrainer | FormatType.STANDARD_PREFERENCE |
SFTTrainer | FormatType.STANDARD_LANGUAGE_MODELING |
XPOTrainer | FormatType.STANDARD_PROMPT_ONLY |