Welcome to Clotho data handling repository. This repository has the necessary code for
using the DataLoader class from PyTorch package (torch.utils.data.dataloader.DataLoader)
with the Clotho dataset.
You can use the present data loader of Clotho directly with the examples created by the Clotho baseline dataset repository.
If you are looking at this README file, then I suppose that you already know what is a DataLoader from PyTorch. Nevertheless, the Clotho dataset has sequences as inputs and outputs, and each sequence is of arbitrary length (15 to 30 seconds for the input and 8 to 20 words for the output). For that reason, this data loader already provides a collate function.
This repository is maintained by K. Drossos.
In the data_handling package, there is the clotho_dataset.py, which holds the ClothoDataset
class. This class offers the functionality of a PyTorch dataset object, tuned for the Clotho
dataset.
The ClothoDataset object needs the following arguments:
data_dirwhich is the directory that has the data of the Clotho dataset (i.e. the root directory of the Clotho dataset). This argument should be of typepathlib.Path.spiltwhich is the split that you want to use, and the argument should be of typestr.input_field_namewhich is the field name of thenumpy.recarraythat holds the input data to your audio captioning method. Currently, only single input fields are supported (i.e. you cannot specify multiple fields). The type of this argument should bestr.output_field_nameis the ield name of thenumpy.recarraythat holds the output data to your audio captioning method. Currently, only single output fields are supported (i.e. you cannot specify multiple fields). The type of this argument should bestr.load_into_memorywhich is aboolflag for indicating if the data in the dataset should be loaded into memory or read from the disk when needed.
The data loader is just a function, wrapping the creation of a torch.utils.data.DataLoader class,
that also offers functionality for instantiating the ClothoDataset class and the collate function,
that will be used with the data loader.
The data loader of Clotho needs the following arguments:
data_dirwhich is the directory that has the data of the Clotho dataset (i.e. the root directory of the Clotho dataset). This argument should be of typepathlib.Path.spiltwhich is the split that you want to use, and the argument should be of typestr.input_field_namewhich is the field name of thenumpy.recarraythat holds the input data to your audio captioning method. Currently, only single input fields are supported (i.e. you cannot specify multiple fields). The type of this argument should bestr.output_field_nameis the ield name of thenumpy.recarraythat holds the output data to your audio captioning method. Currently, only single output fields are supported (i.e. you cannot specify multiple fields). The type of this argument should bestr.load_into_memorywhich is aboolflag for indicating if the data in the dataset should be loaded into memory or read from the disk when needed.batch_sizeis the batch size to be used with the data loader. This argument should be anint.nb_t_steps_padis the number of time-steps to pad or truncate the sequences using the collate function. This argument can be anint(i.e. the actual time-steps) but also can be the stringsmaxormin, meaning pad/truncate to maximum/minimum amount of time-steps in the batch. Currently, zeros (input audio) and tokens (output words) are supported for padding.
the padding. Supported values forstraremaxandmin.shuffleflag to indicate the shuffling of the data, exactly as in thetorch.utils.data.DataLoaderclass. This argument should be abool.drop_lastflag to indicate the dropping of the examples that cannot grouped in a batch,
exactly as in thetorch.utils.data.DataLoaderclass. This argument should be abool.input_pad_atwhere to pad the input sequence at, i.e. at thestartor at theend? This argument should be astrand supported strings arestartandend.output_pad_atthe same asinput_pad_at, but for the output sequence.num_workersis the amount of workers to be used for the data loader, exactly as in thetorch.utils.data.DataLoaderclass. This argument should be anint.
To be able to use the sequences of Clotho in a batch, you most likely will need some kind of padding policy. This repository already offers a collate function to be used with the Clotho data.
With the provided collate function, you can choose to either:
- pad the data with zeros (for input audio data) and end-of-sequence symbol (for the output/words), to the length of the longest input (for the inputs) and output (for the outputs) sequence in tha batch
- truncate the input and the output to the minimum length of the input and output in the batch, and
- use a constant length for input and output, and either truncate or pad.
Enjoy and if you have any issues, please let me know in the issue section.