# fifo-tool-datasets

`fifo-tool-datasets` provides standardized adapters to convert plain `.dat` files into formats compatible with LLM training, including Hugging Face `datasets.Dataset` objects and JSON message arrays.
It supports both:

- a Python SDK for structured loading and conversion
- a CLI to upload/download `.dat` files to/from the Hugging Face Hub
`.dat` files are plain-text datasets designed for LLM fine-tuning. They come in three styles:

- `sqna` (single-turn): prompt-response pairs
- `conversation` (multi-turn): role-tagged chat sessions
- `dsl` (structured): system → input → DSL output triplets

These files are human-editable, diffable, and ideal for version control, especially during dataset development and iteration.
This tool enables a complete round-trip workflow:

- Create and edit a `.dat` file locally
- Convert and upload it as a training-ready Hugging Face `datasets.Dataset`
- Later, download and deserialize it back into `.dat` for further edits
This gives you the best of both worlds:

- easy editing and version control via `.dat`
- compatibility with HF pipelines using `load_dataset()`, as sketched below
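Once uploaded, a dataset can be consumed like any other Hub dataset. A minimal sketch, assuming a dataset was already pushed to the placeholder repo `username/my-dataset`:

```python
from datasets import load_dataset

# Load the previously uploaded dataset straight from the Hub.
# "username/my-dataset" is a placeholder; use your own repo id.
splits = load_dataset("username/my-dataset")

# Uploads are auto-split into train/validation/test (see Dataset
# Formats below), so each partition is addressable by name.
print(splits["train"][0])
```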
See format examples below in each adapter section.
- [Dataset Formats](#dataset-formats)
- [Conversion Matrix](#conversion-matrix)
- [Installation](#installation)
- [CLI Usage](#cli-usage)
- [SDK Usage](#sdk-usage)
- [Available Adapters](#available-adapters)
- [Validation Rules](#validation-rules)
- [Tests](#tests)
- [License](#license)
- [Disclaimer](#disclaimer)
## Dataset Formats

| Format | Description |
|---|---|
| `.dat` | Editable plain-text format with tags (e.g. `>`, `<`, `---`) |
| `Dataset` | Hugging Face `datasets.Dataset` object, used for fine-tuning |
| `wide_dataset` | Flattened `Dataset` with one row per message; format depends on the adapter |
| `json` | A list of `messages` dictionaries |
| `hub` | A `DatasetDict` with `train`, `validation`, and `test` splits |
All datasets uploaded to the Hub, if not already split, are automatically divided into `train`, `validation`, and `test` partitions using the wide format.
## Conversion Matrix

| From \ To | dataset | wide_dataset | dat | hub | json |
|---|---|---|---|---|---|
| dataset | direct | direct | indirect | direct | direct |
| wide_dataset | direct | direct | direct | direct | direct |
| dat | indirect | direct | direct | indirect | direct |
| hub | indirect (dict) | direct (dict) | direct | direct | direct |
| json | direct | direct | direct | direct | direct |

Legend:

- direct: single-step conversion
- indirect: composed of helper conversions
- (dict): the result is a `DatasetDict`
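In the SDK, these conversions surface as `from_<source>_to_<target>` methods on each adapter (the per-cell function names are not listed here). As a sketch of how two direct hops chain together, using only the method names shown in the SDK Usage section below:

```python
from fifo_tool_datasets.sdk.hf_dataset_adapters.dsl import DSLAdapter

adapter = DSLAdapter()

# dat -> dataset, then -> json: two single-step conversions chained.
# "dsl.dat" is a placeholder path.
dataset = adapter.from_dat_to_dataset("dsl.dat")
json_records = adapter.from_wide_dataset_to_json(dataset)
```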
## Installation

Install both the CLI and SDK in one step:

```sh
python3 -m pip install -e .
```

This enables the `fifo-tool-datasets` command.
## CLI Usage

```sh
fifo-tool-datasets <command> [options]
```
### copy

Upload or download datasets between `.dat` files (or directories) and the Hugging Face Hub.

```sh
fifo-tool-datasets copy <src> <dst> --adapter <adapter> [--commit-message <msg>] [--seed <int>] [-y]
```

- `.dat` or directory → hub: requires `--commit-message`
- hub → `.dat` or directory: downloads as a file (datasets are merged) or as a directory (each split is preserved)
### split

Split a single `.dat` file into `train`, `validation`, and `test` files.

```sh
fifo-tool-datasets split <src> --adapter <adapter> [--to <dir>] [--split-ratio <train> <val> <test>] [-y]
```

The default split ratio is `[0.7, 0.15, 0.15]` if `--split-ratio` is omitted.
### merge

Recombine split `.dat` files into a single dataset.

```sh
fifo-tool-datasets merge <dir> --adapter <adapter> [--to <file>] [-y]
```
### sort

Sort the samples of a DSL `.dat` file by their full content: system prompt, user input, and assistant response. Sorting is done in place, meaning the original file is overwritten with the sorted result.

You can provide either a single file or a directory. If a directory is given, all `.dat` files within it are sorted in place.

```sh
fifo-tool-datasets sort <path> [--adapter dsl]
```

Currently, only the `dsl` adapter is supported. If the `--adapter` flag is omitted, it defaults to `dsl`.
### Examples

```sh
# Upload
fifo-tool-datasets copy dsl.dat username/my-dataset --adapter dsl --commit-message "init"

# Download
fifo-tool-datasets copy username/my-dataset dsl.dat --adapter dsl

# Split
fifo-tool-datasets split dsl.dat --adapter dsl --to split_dsl

# Merge
fifo-tool-datasets merge split_dsl --adapter dsl --to full.dsl.dat

# Sort
fifo-tool-datasets sort dsl.dat --adapter dsl
```
## SDK Usage

```python
from fifo_tool_datasets.sdk.hf_dataset_adapters.dsl import DSLAdapter

adapter = DSLAdapter()

# Upload to the Hugging Face Hub
adapter.from_dat_to_hub(
    "dsl.dat",
    "username/my-dataset",
    commit_message="initial upload"
)

# Download from the Hub as a DatasetDict (train/validation/test)
splits = adapter.from_hub_to_dataset_dict("username/my-dataset")

# Access splits for fine-tuning
train_dataset = splits["train"]
test_dataset = splits["test"]

# You can now use train_dataset / test_dataset to fine-tune your LLM,
# e.g. with the Hugging Face Transformers Trainer, SFTTrainer, etc.

# You can also load directly from a local .dat file
dataset = adapter.from_dat_to_dataset("dsl.dat")

# Convert to structured JSON format
json_records = adapter.from_wide_dataset_to_json(dataset)
```
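Before wiring a split into a trainer, it can help to peek at what it contains. A quick sanity-check sketch; the column names depend on the adapter:

```python
# Splits behave like any datasets.Dataset: they support len(),
# integer indexing, and expose their schema via column_names.
print(len(train_dataset))
print(train_dataset.column_names)
print(train_dataset[0])
```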
## Available Adapters

### ConversationAdapter

`.dat` format:

```
---
$
You are a helpful assistant.
>
Hi
<
Hello!
---
```

Wide format:

```json
[
  {"id_conversation": 0, "id_message": 0, "role": "system", "content": "You are a helpful assistant."},
  {"id_conversation": 0, "id_message": 1, "role": "user", "content": "Hi"},
  {"id_conversation": 0, "id_message": 2, "role": "assistant", "content": "Hello!"}
]
```

JSON format:

```json
[
  {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hi"},
      {"role": "assistant", "content": "Hello!"}
    ]
  }
]
```
### SQNAAdapter

`.dat` format:

```
>What is 2+2?
<4
```

Wide format:

```json
[
  {"in": "What is 2+2?", "out": "4"}
]
```

JSON format:

```json
[
  {
    "messages": [
      {"role": "user", "content": "What is 2+2?"},
      {"role": "assistant", "content": "4"}
    ]
  }
]
```
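The adapters expose the same conversion interface. A minimal usage sketch for this one, assuming its import path mirrors the `dsl` module layout (only the `dsl` path is documented in this README):

```python
# The module path below is an assumption modeled on the dsl adapter;
# only fifo_tool_datasets.sdk.hf_dataset_adapters.dsl appears above.
from fifo_tool_datasets.sdk.hf_dataset_adapters.sqna import SQNAAdapter

adapter = SQNAAdapter()

# "sqna.dat" is a placeholder for a file of > / < pairs like the
# example above.
dataset = adapter.from_dat_to_dataset("sqna.dat")
```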
### DSLAdapter

`.dat` format:

```
---
$ You are a precise DSL parser.
> today at 5:30PM
< SET_TIME(TODAY, 17, 30)
---
```

Multi-line entries are also supported and can be freely mixed with single-line ones. A space after the marker on single-line entries is optional:

```
---
$
multi-line system
prompt
> single-line input
<single-line output
---
```
To reuse the previous system prompt across multiple samples, use `...`:

```
---
$ first prompt
> q1
< a1
---
$ ...
> q2
< a2
---
$
...
> q3
< a3
---
```

Any `$` block that contains only `...`, either directly after the `$` or on the following line, inherits the most recent explicitly defined system prompt.
- At least one non-`...` system prompt is required in the file.
- When generating `.dat` files, consecutive identical system prompts are automatically collapsed into `$ ...`.
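The inheritance rule can be emulated in a few lines. A minimal sketch of the resolution logic, not the library's actual parser; the `{"system", "in", "out"}` dicts are an illustrative representation of parsed blocks:

```python
def resolve_system_prompts(blocks: list[dict]) -> list[dict]:
    """Replace '...' placeholders with the last explicit system prompt."""
    last_explicit = None
    resolved = []
    for block in blocks:
        system = block["system"]
        if system.strip() == "...":
            # Inherit from the most recent explicitly defined prompt.
            if last_explicit is None:
                raise ValueError("the first system prompt must be explicit")
            system = last_explicit
        else:
            last_explicit = system
        resolved.append({**block, "system": system})
    return resolved
```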
Wide format:

```json
[
  {"system": "You are a precise DSL parser.", "in": "today at 5:30PM", "out": "SET_TIME(TODAY, 17, 30)"}
]
```

JSON format:

```json
[
  {
    "messages": [
      {"role": "system", "content": "You are a precise DSL parser."},
      {"role": "user", "content": "today at 5:30PM"},
      {"role": "assistant", "content": "SET_TIME(TODAY, 17, 30)"}
    ]
  }
]
```
## Validation Rules

Each adapter enforces its own parsing rules:

- `ConversationAdapter`: tag order, a message required after each tag, conversation structure
- `SQNAAdapter`: strictly `>` then `<`, per pair
- `DSLAdapter`: each block must contain `$`, `>`, `<` in this order. `$ ...` reuses the previous system prompt. Values may span multiple lines; single-line values are written with a space after the tag when generating `.dat` files. When writing `.dat` files, consecutive identical system prompts are replaced by `$ ...` automatically.
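When a file violates these rules, loading it should fail. A hedged sketch of defensive loading; the exact exception type raised by the adapters is not documented in this README, so a broad catch is used:

```python
from fifo_tool_datasets.sdk.hf_dataset_adapters.dsl import DSLAdapter

adapter = DSLAdapter()

try:
    # "broken.dat" is a placeholder for a file that violates the
    # $ / > / < block order described above.
    adapter.from_dat_to_dataset("broken.dat")
except Exception as err:  # exact exception type is an assumption
    print(f"validation failed: {err}")
```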
## Tests

```sh
pytest tests/
```
## License

MIT. See [LICENSE](LICENSE).
## Disclaimer

This project is not affiliated with or endorsed by Hugging Face or the Python Software Foundation. It builds on their open-source technologies under their respective licenses.