fifo-tool-datasets

fifo-tool-datasets provides standardized adapters to convert plain .dat files into formats compatible with LLM training, including Hugging Face datasets.Dataset objects and JSON message arrays.

It supports both:

  • ✅ A Python SDK for structured loading and conversion
  • ✅ A CLI to upload/download .dat files to/from the Hugging Face Hub

.dat files are plain-text datasets designed for LLM fine-tuning. They come in three styles:

  • 💬 sqna (single-turn): prompt-response pairs
  • 🧠 conversation (multi-turn): role-tagged chat sessions
  • ⚙️ dsl (structured): system → input → DSL output triplets

These files are human-editable, diffable, and ideal for version control, especially during dataset development and iteration.

This tool enables a complete round-trip workflow:

  1. Create and edit a .dat file locally
  2. Convert and upload it as a training-ready Hugging Face datasets.Dataset
  3. Later, download and deserialize it back into .dat for further edits

This gives you the best of both worlds:

  • ✍️ Easy editing and version control via .dat
  • 🚀 Compatibility with HF pipelines using load_dataset()

See format examples below in each adapter section.




πŸ“ Dataset Formats

Format         Description
.dat           Editable plain-text format with tags (e.g. >, <, ---)
dataset        Hugging Face datasets.Dataset object, used for fine-tuning
wide_dataset   Flattened Dataset with one row per message; format depends on the adapter
json           A list of message dictionaries
hub            A DatasetDict with train, validation, and test splits

All datasets uploaded to the Hub, if not already split, are automatically divided into train, validation, and test partitions using the wide format.
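
For intuition, here is a minimal sketch of what a seeded 70/15/15 split looks like when built directly with the datasets library (illustrative only; the package's internal split logic may differ):

# Illustrative only: a seeded 70/15/15 split built with the datasets library.
from datasets import Dataset, DatasetDict

wide = Dataset.from_list([{"in": f"q{i}", "out": f"a{i}"} for i in range(100)])

# Carve off 30%, then halve it into validation and test.
first = wide.train_test_split(test_size=0.30, seed=42)
second = first["test"].train_test_split(test_size=0.50, seed=42)

splits = DatasetDict({
    "train": first["train"],        # 70 rows
    "validation": second["train"],  # 15 rows
    "test": second["test"],         # 15 rows
})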


πŸ” Conversion Matrix

From \ To      dataset   wide_dataset   dat   hub   json
dataset        —         ✅             🧩    —     —
wide_dataset   —         —              ✅    —     ✅
dat            🧩        ✅             —     🧩    —
hub            🧩📦      ✅📦           —     —     —
json           —         —              —     —     —

Legend:

  • ✅ direct: single-step conversion
  • 🧩 indirect: composed of helper conversions
  • 📦 returns dict: the result is a DatasetDict

(In the rendered README, hovering over a matrix icon shows the corresponding function name.)

📦 Installation

Install both the CLI and SDK in one step, from the repository root:

python3 -m pip install -e .

This enables the fifo-tool-datasets command.


🚀 CLI Usage

πŸ› οΈ Command Reference

fifo-tool-datasets <command> [options]

copy

Upload or download datasets between .dat files (or directories) and the Hugging Face Hub.

fifo-tool-datasets copy <src> <dst> --adapter <adapter> [--commit-message <msg>] [--seed <int>] [-y]

  • .dat or directory → hub: requires --commit-message
  • hub → .dat or directory: downloads as a single file (splits are merged) or as a directory (each split is preserved)

split

Split a single .dat file into train, validation, and test .dat files.

fifo-tool-datasets split <src> --adapter <adapter> [--to <dir>] [--split-ratio <train> <val> <test>] [-y]

The default split ratio is [0.7, 0.15, 0.15] (70% train, 15% validation, 15% test) if --split-ratio is omitted.

merge

Recombine split .dat files into a single dataset.

fifo-tool-datasets merge <dir> --adapter <adapter> [--to <file>] [-y]

sort

Sort the samples of a DSL .dat file by their full content: system prompt, user input, and assistant response. Sorting is done in place, meaning the original file is overwritten with the sorted result.

You can provide either a single file or a directory. If a directory is given, all .dat files within it will be sorted in place.

fifo-tool-datasets sort <path> [--adapter dsl]

Currently, only the dsl adapter is supported; if --adapter is omitted, it defaults to dsl.
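
The resulting order amounts to comparing each sample by its (system, input, output) triple. A minimal sketch of that comparison, applied to hypothetical already-parsed samples (illustrative only; the CLI parses and rewrites the .dat file itself):

samples = [
    {"system": "s", "in": "tomorrow at noon", "out": "SET_TIME(TOMORROW, 12, 0)"},
    {"system": "s", "in": "today at 5:30PM",  "out": "SET_TIME(TODAY, 17, 30)"},
]
samples.sort(key=lambda s: (s["system"], s["in"], s["out"]))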


💡 Command examples

# Upload
fifo-tool-datasets copy dsl.dat username/my-dataset --adapter dsl --commit-message "init"

# Download
fifo-tool-datasets copy username/my-dataset dsl.dat --adapter dsl

# Split
fifo-tool-datasets split dsl.dat --adapter dsl --to split_dsl

# Merge
fifo-tool-datasets merge split_dsl --adapter dsl --to full.dsl.dat

# Sort
fifo-tool-datasets sort dsl.dat --adapter dsl

📦 SDK Usage

from fifo_tool_datasets.sdk.hf_dataset_adapters.dsl import DSLAdapter

adapter = DSLAdapter()

# Upload to the Hugging Face Hub
adapter.from_dat_to_hub(
    "dsl.dat",
    "username/my-dataset",
    commit_message="initial upload"
)

# Download from the Hub as a DatasetDict (train/validation/test)
splits = adapter.from_hub_to_dataset_dict("username/my-dataset")

# Access splits for fine-tuning
train_dataset = splits["train"]
test_dataset = splits["test"]

# You can now use train_dataset / test_dataset to fine-tune your LLM
# e.g., with Hugging Face Transformers Trainer, SFTTrainer, etc.

# You can also directly load from a local .dat file
dataset = adapter.from_dat_to_dataset("dsl.dat")

# Convert to structured JSON format
json_records = adapter.from_wide_dataset_to_json(dataset)
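
From here, the splits can be fed to any trainer that accepts a datasets.Dataset. Below is a minimal sketch with TRL's SFTTrainer (TRL is a separate library; the model name is a placeholder, and the exact row schema your trainer expects, e.g. a messages column, is an assumption to verify against your adapter's output):

# Fine-tuning sketch using TRL's SFTTrainer (separate library, not part of
# this package). Assumes train_dataset rows are in a shape SFT accepts.
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder causal LM checkpoint
    train_dataset=train_dataset,          # splits["train"] from above
    args=SFTConfig(output_dir="sft-out"),
)
trainer.train()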

🔌 Available Adapters

🧠 ConversationAdapter

.dat

---
$
You are a helpful assistant.
>
Hi
<
Hello!
---

Wide Format

[
  {"id_conversation": 0, "id_message": 0, "role": "system", "content": "You are a helpful assistant."},
  {"id_conversation": 0, "id_message": 1, "role": "user",   "content": "Hi"},
  {"id_conversation": 0, "id_message": 2, "role": "assistant", "content": "Hello!"}
]

JSON Format

[
  {
    "messages": [
      {"role": "system",    "content": "You are a helpful assistant."},
      {"role": "user",      "content": "Hi"},
      {"role": "assistant", "content": "Hello!"}
    ]
  }
]
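
The SDK calls shown in SDK Usage above should carry over to this adapter. The sketch below assumes the analogous import path and a shared method interface across adapters (both are assumptions inferred from the DSL example, not confirmed API):

# Assumed import path, by analogy with the dsl adapter import in SDK Usage;
# the method names likewise assume adapters share one interface.
from fifo_tool_datasets.sdk.hf_dataset_adapters.conversation import ConversationAdapter

adapter = ConversationAdapter()
dataset = adapter.from_dat_to_dataset("conversation.dat")
json_records = adapter.from_wide_dataset_to_json(dataset)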

💬 SQNAAdapter

.dat

>What is 2+2?
<4

Wide Format

[
  {"in": "What is 2+2?", "out": "4"}
]

JSON Format

[
  {
    "messages": [
      {"role": "user", "content": "What is 2+2?"},
      {"role": "assistant", "content": "4"}
    ]
  }
]

βš™οΈ DSLAdapter

.dat

---
$ You are a precise DSL parser.
> today at 5:30PM
< SET_TIME(TODAY, 17, 30)
---

Multi-line entries are also supported and can be freely mixed with single-line ones. A space after the marker on single-line entries is optional:

---
$
multi-line system
prompt
> single-line input
<single-line output
---

To reuse the previous system prompt across multiple samples, use ...:

---
$ first prompt
> q1
< a1
---
$ ...
> q2
< a2
---
$
...
> q3
< a3
---

Any $ block that contains only ... (either directly after the $ or on the following line) will inherit the most recent explicitly defined system prompt.

  • At least one non-... system prompt is required in the file.
  • When generating .dat files, consecutive identical system prompts are automatically collapsed into $ ....

Wide Format

[
  {"system": "You are a precise DSL parser.", "in": "today at 5:30PM", "out": "SET_TIME(TODAY, 17, 30)"}
]

JSON Format

[
  {
    "messages": [
      {"role": "system", "content": "You are a precise DSL parser."},
      {"role": "user", "content": "today at 5:30PM"},
      {"role": "assistant", "content": "SET_TIME(TODAY, 17, 30)"}
    ]
  }
]

✅ Validation Rules

Each adapter enforces its own parsing rules:

  • ConversationAdapter: tags must appear in the expected order, each tag must be followed by message content, and the overall conversation structure must be well-formed
  • SQNAAdapter: strictly > then <, for each pair
  • DSLAdapter: each block must contain $, >, < in that order, and $ ... reuses the previous system prompt. Values may span multiple lines; when generating .dat files, single-line values are written with a space after the tag and consecutive identical system prompts are collapsed into $ .... A stand-alone sketch of these rules follows this list.
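
To make these DSL rules concrete, here is a stand-alone illustrative parser for the block structure described above. It is not the library's implementation; DSLAdapter's actual validation may differ in details such as error handling and edge cases:

# Illustrative parser for the DSL block rules; not the library's implementation.
def parse_dsl(text: str) -> list[dict]:
    samples = []
    previous_system = None
    for block in (b for b in text.split("---") if b.strip()):
        values = {}
        tag = None
        for line in block.splitlines():
            if line[:1] in ("$", ">", "<"):
                tag = line[0]
                values[tag] = line[1:].lstrip()   # space after the tag is optional
            elif tag is not None:
                # Continuation line: the value spans multiple lines.
                values[tag] = (values[tag] + "\n" + line).strip()
        if list(values) != ["$", ">", "<"]:
            raise ValueError("each block must contain $, >, < in this order")
        system = values["$"]
        if system == "...":
            if previous_system is None:
                raise ValueError("'$ ...' used before any explicit system prompt")
            system = previous_system
        previous_system = system
        samples.append({"system": system, "in": values[">"], "out": values["<"]})
    return samples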

🧪 Tests

pytest tests/

✅ License

MIT. See LICENSE.


📄 Disclaimer

This project is not affiliated with or endorsed by Hugging Face or the Python Software Foundation.
It builds on their open-source technologies under their respective licenses.
