Is it possible to fine-tune a Vision Language Model (VLM)?

Hi there,

Just wondering, does this repo support fine-tuning a Vision Language Model (VLM), e.g. https://huggingface.co/microsoft/Phi-3.5-vision-instruct?

Many thanks for any help, and for this amazing lib!