fifo-tool-datasets

fifo-tool-datasets provides standardized adapters to convert plain .dat files into formats compatible with LLM training, including Hugging Face datasets.Dataset objects and JSON message arrays.

It supports both:

  • ✅ A Python SDK for structured loading and conversion
  • ✅ A CLI to upload/download .dat files to/from the Hugging Face Hub

.dat files are plain-text datasets designed for LLM fine-tuning. They come in three styles:

  • 💬 sqna (single-turn): prompt-response pairs
  • 🧠 conversation (multi-turn): role-tagged chat sessions
  • ⚙️ dsl (structured): system → input → DSL output triplets

These files are human-editable, diffable, and ideal for version control, especially during dataset development and iteration.

This tool enables a complete round-trip workflow:

  1. Create and edit a .dat file locally
  2. Convert and upload it as a training-ready Hugging Face datasets.Dataset
  3. Later, download and deserialize it back into .dat for further edits

This gives you the best of both worlds:

  • ✍️ Easy editing and version control via .dat
  • 🚀 Compatibility with HF pipelines using load_dataset()

See format examples below in each adapter section.




πŸ“ Dataset Formats

Format         Description
.dat           Editable plain-text format with tags (e.g. >, <, ---)
dataset        Hugging Face datasets.Dataset object, used for fine-tuning
wide_dataset   Flattened Dataset with one row per message; format depends on the adapter
json           A list of message dictionaries
hub            A DatasetDict with train, validation, and test splits

All datasets uploaded to the Hub, if not already split, are automatically divided into train, validation, and test partitions using the wide format.
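
For intuition, here is a minimal sketch of what a seeded 70/15/15 split looks like when built directly with the datasets library (illustrative only; the package's internal split logic may differ):

# Illustrative only: a seeded 70/15/15 split built with the datasets library.
from datasets import Dataset, DatasetDict

wide = Dataset.from_list([{"in": f"q{i}", "out": f"a{i}"} for i in range(100)])

# Carve off 30%, then halve it into validation and test.
first = wide.train_test_split(test_size=0.30, seed=42)
second = first["test"].train_test_split(test_size=0.50, seed=42)

splits = DatasetDict({
    "train": first["train"],        # 70 rows
    "validation": second["train"],  # 15 rows
    "test": second["test"],         # 15 rows
})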


πŸ” Conversion Matrix

From \ To      dataset   wide_dataset   dat   hub   json
dataset        —         ✅             🧩    —     —
wide_dataset   —         —              ✅    —     ✅
dat            🧩        ✅             —     🧩    —
hub            🧩📦      ✅📦           —     —     —
json           —         —              —     —     —

Legend:

  • ✅ direct: single-step conversion
  • 🧩 indirect: composed of helper conversions
  • 📦 returns dict: the result is a DatasetDict

(In the rendered README, hovering over a matrix icon shows the corresponding function name.)

📦 Installation

Install both the CLI and SDK in one step, from the repository root:

python3 -m pip install -e .

This enables the fifo-tool-datasets command.


🚀 CLI Usage

πŸ› οΈ Command Reference

fifo-tool-datasets <command> [options]

copy

Upload or download datasets between .dat files (or directories) and the Hugging Face Hub.

fifo-tool-datasets copy <src> <dst> --adapter <adapter> [--commit-message <msg>] [--seed <int>] [-y]

  • .dat or directory → hub: requires --commit-message
  • hub → .dat or directory: downloads as a single file (splits are merged) or as a directory (each split is preserved)

split

Split a single .dat file into train, validation, and test .dat files.

fifo-tool-datasets split <src> --adapter <adapter> [--to <dir>] [--split-ratio <train> <val> <test>] [-y]

The default split ratio is [0.7, 0.15, 0.15] (70% train, 15% validation, 15% test) if --split-ratio is omitted.

merge

Recombine split .dat files into a single dataset.

fifo-tool-datasets merge <dir> --adapter <adapter> [--to <file>] [-y]

sort

Sort the samples of a DSL .dat file by their full content: system prompt, user input, and assistant response. Sorting is done in place, meaning the original file is overwritten with the sorted result.

You can provide either a single file or a directory. If a directory is given, all .dat files within it will be sorted in place.

fifo-tool-datasets sort <path> [--adapter dsl]

Currently, only the dsl adapter is supported; if --adapter is omitted, it defaults to dsl.
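
The resulting order amounts to comparing each sample by its (system, input, output) triple. A minimal sketch of that comparison, applied to hypothetical already-parsed samples (illustrative only; the CLI parses and rewrites the .dat file itself):

samples = [
    {"system": "s", "in": "tomorrow at noon", "out": "SET_TIME(TOMORROW, 12, 0)"},
    {"system": "s", "in": "today at 5:30PM",  "out": "SET_TIME(TODAY, 17, 30)"},
]
samples.sort(key=lambda s: (s["system"], s["in"], s["out"]))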


💡 Command examples

# Upload
fifo-tool-datasets copy dsl.dat username/my-dataset --adapter dsl --commit-message "init"

# Download
fifo-tool-datasets copy username/my-dataset dsl.dat --adapter dsl

# Split
fifo-tool-datasets split dsl.dat --adapter dsl --to split_dsl

# Merge
fifo-tool-datasets merge split_dsl --adapter dsl --to full.dsl.dat

# Sort
fifo-tool-datasets sort dsl.dat --adapter dsl

📦 SDK Usage

from fifo_tool_datasets.sdk.hf_dataset_adapters.dsl import DSLAdapter

adapter = DSLAdapter()

# Upload to the Hugging Face Hub
adapter.from_dat_to_hub(
    "dsl.dat",
    "username/my-dataset",
    commit_message="initial upload"
)

# Download from the Hub as a DatasetDict (train/validation/test)
splits = adapter.from_hub_to_dataset_dict("username/my-dataset")

# Access splits for fine-tuning
train_dataset = splits["train"]
test_dataset = splits["test"]

# You can now use train_dataset / test_dataset to fine-tune your LLM
# e.g., with Hugging Face Transformers Trainer, SFTTrainer, etc.

# You can also directly load from a local .dat file
dataset = adapter.from_dat_to_dataset("dsl.dat")

# Convert to structured JSON format
json_records = adapter.from_wide_dataset_to_json(dataset)
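
From here, the splits can be fed to any trainer that accepts a datasets.Dataset. Below is a minimal sketch with TRL's SFTTrainer (TRL is a separate library; the model name is a placeholder, and the exact row schema your trainer expects, e.g. a messages column, is an assumption to verify against your adapter's output):

# Fine-tuning sketch using TRL's SFTTrainer (separate library, not part of
# this package). Assumes train_dataset rows are in a shape SFT accepts.
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder causal LM checkpoint
    train_dataset=train_dataset,          # splits["train"] from above
    args=SFTConfig(output_dir="sft-out"),
)
trainer.train()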

🔌 Available Adapters

🧠 ConversationAdapter

.dat

---
$
You are a helpful assistant.
>
Hi
<
Hello!
---

Wide Format

[
  {"id_conversation": 0, "id_message": 0, "role": "system", "content": "You are a helpful assistant."},
  {"id_conversation": 0, "id_message": 1, "role": "user",   "content": "Hi"},
  {"id_conversation": 0, "id_message": 2, "role": "assistant", "content": "Hello!"}
]

JSON Format

[
  {
    "messages": [
      {"role": "system",    "content": "You are a helpful assistant."},
      {"role": "user",      "content": "Hi"},
      {"role": "assistant", "content": "Hello!"}
    ]
  }
]
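
The SDK calls shown in SDK Usage above should carry over to this adapter. The sketch below assumes the analogous import path and a shared method interface across adapters (both are assumptions inferred from the DSL example, not confirmed API):

# Assumed import path, by analogy with the dsl adapter import in SDK Usage;
# the method names likewise assume adapters share one interface.
from fifo_tool_datasets.sdk.hf_dataset_adapters.conversation import ConversationAdapter

adapter = ConversationAdapter()
dataset = adapter.from_dat_to_dataset("conversation.dat")
json_records = adapter.from_wide_dataset_to_json(dataset)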

💬 SQNAAdapter

.dat

>What is 2+2?
<4

Wide Format

[
  {"in": "What is 2+2?", "out": "4"}
]

JSON Format

[
  {
    "messages": [
      {"role": "user", "content": "What is 2+2?"},
      {"role": "assistant", "content": "4"}
    ]
  }
]

βš™οΈ DSLAdapter

.dat

---
$ You are a precise DSL parser.
> today at 5:30PM
< SET_TIME(TODAY, 17, 30)
---

Multi-line entries are also supported and can be freely mixed with single-line ones. A space after the marker on single-line entries is optional:

---
$
multi-line system
prompt
> single-line input
<single-line output
---

To reuse the previous system prompt across multiple samples, use ...:

---
$ first prompt
> q1
< a1
---
$ ...
> q2
< a2
---
$
...
> q3
< a3
---

Any $ block that contains only ... (either directly after the $ or on the following line) will inherit the most recent explicitly defined system prompt.

  • At least one non-... system prompt is required in the file.
  • When generating .dat files, consecutive identical system prompts are automatically collapsed into $ ....

Wide Format

[
  {"system": "You are a precise DSL parser.", "in": "today at 5:30PM", "out": "SET_TIME(TODAY, 17, 30)"}
]

JSON Format

[
  {
    "messages": [
      {"role": "system", "content": "You are a precise DSL parser."},
      {"role": "user", "content": "today at 5:30PM"},
      {"role": "assistant", "content": "SET_TIME(TODAY, 17, 30)"}
    ]
  }
]

✅ Validation Rules

Each adapter enforces its own parsing rules:

  • ConversationAdapter: tags must appear in the expected order, each tag must be followed by message content, and the overall conversation structure must be well-formed
  • SQNAAdapter: strictly > then <, for each pair
  • DSLAdapter: each block must contain $, >, < in that order, and $ ... reuses the previous system prompt. Values may span multiple lines; when generating .dat files, single-line values are written with a space after the tag and consecutive identical system prompts are collapsed into $ .... A stand-alone sketch of these rules follows this list.
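
To make these DSL rules concrete, here is a stand-alone illustrative parser for the block structure described above. It is not the library's implementation; DSLAdapter's actual validation may differ in details such as error handling and edge cases:

# Illustrative parser for the DSL block rules; not the library's implementation.
def parse_dsl(text: str) -> list[dict]:
    samples = []
    previous_system = None
    for block in (b for b in text.split("---") if b.strip()):
        values = {}
        tag = None
        for line in block.splitlines():
            if line[:1] in ("$", ">", "<"):
                tag = line[0]
                values[tag] = line[1:].lstrip()   # space after the tag is optional
            elif tag is not None:
                # Continuation line: the value spans multiple lines.
                values[tag] = (values[tag] + "\n" + line).strip()
        if list(values) != ["$", ">", "<"]:
            raise ValueError("each block must contain $, >, < in this order")
        system = values["$"]
        if system == "...":
            if previous_system is None:
                raise ValueError("'$ ...' used before any explicit system prompt")
            system = previous_system
        previous_system = system
        samples.append({"system": system, "in": values[">"], "out": values["<"]})
    return samples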

🧪 Tests

pytest tests/

✅ License

MIT. See LICENSE.


📄 Disclaimer

This project is not affiliated with or endorsed by Hugging Face or the Python Software Foundation.
It builds on their open-source technologies under their respective licenses.
