This codebase runs image-text inference with SOTA vision-language models, locally or via Slurm. It is designed to be easily extensible to new models, datasets, and tasks. We support structured JSON generation via outlines and pydantic, using schema-constrained decoding for HuggingFace models and JSON mode with API-based models wherever applicable.
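To illustrate the structured-output piece, here is a minimal sketch of a pydantic (v2) schema of the kind that can drive schema-constrained decoding or JSON mode; the class and field names are hypothetical, not necessarily the schemas shipped in this repo.

from pydantic import BaseModel, Field

# Hypothetical output schema -- the actual schemas used by this codebase are defined in the repo.
class CulturalCaption(BaseModel):
    caption: str = Field(description="Short description of the image")
    culture: str = Field(description="Culture or region the image is most associated with")

# The JSON schema derived from the model is what constrains decoding (or is sent along with JSON mode requests).
print(CulturalCaption.model_json_schema())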
- Basic install:
conda create -n vizwiz-culture python=3.10
conda activate vizwiz-culture
pip install -e .
- (Optional) Install flash-attention:
pip install flash-attn --no-build-isolation
# Verify import; if output is empty installation was successful
python -c "import torch; import flash_attn_2_cuda"
- Login, set up billing, and create an API key at https://platform.openai.com/
- Run:
export OPENAI_API_KEY=<your_key>
- Login, set up billing, and create a project at https://console.cloud.google.com/
- Go to https://cloud.google.com/sdk/docs/install#linux and follow the instructions to install gcloud
- Run gcloud init and follow the instructions
- Run gcloud auth application-default login and follow the instructions
- Run:
export GOOGLE_CLOUD_PROJECT=<your_project>
- Login, set up billing, and create an API key at https://console.anthropic.com/
- Run:
export ANTHROPIC_API_KEY=<your_key>
- Login, set up billing, and create an API key at https://platform.reka.ai/
- Run:
export REKA_API_KEY=<your_key>
You can register new models, datasets, and callbacks by adding them under src/vlm_inference/configuration/registry.py. We currently support the Google Gemini API, the OpenAI API, the Anthropic API, the Reka API, and HuggingFace models.
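As a purely illustrative sketch of the registration pattern (the names and structure here are hypothetical; check registry.py for the actual entries and API):

# Illustrative only -- not the actual API of src/vlm_inference/configuration/registry.py.
MODEL_REGISTRY: dict[str, type] = {}

def register_model(name: str):
    """Map a config name (e.g. what you would pass as model=<name> on the CLI) to a class."""
    def decorator(cls: type) -> type:
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

@register_model("my-new-model")
class MyNewModel:
    def generate(self, image, prompt: str) -> str:
        raise NotImplementedError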
We currently use callbacks for logging, local saving of outputs, and uploading to Wandb.
You can remove default callbacks via '~_callback_dict.<callback_name>', e.g. remove the Wandb callback via '~_callback_dict.wandb' (mind the quotation marks).
You can also override callback values, e.g. _callback_dict.wandb.project=new-project (see the example below).
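For example, taking the GPT-4o invocation shown further below and disabling the Wandb callback (only the last override is new; a value override such as _callback_dict.wandb.project=new-project can be appended the same way):

python run.py \
model=gpt-4o \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=culture_json \
'~_callback_dict.wandb'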
Note
Currently available OpenAI models:
- gpt-4o (gpt-4o-2024-05-13)
- gpt-4o-mini (gpt-4o-mini-2024-07-18)
- gpt-4-turbo (gpt-4-turbo-2024-04-09)
- gpt-4 (gpt-4-1106-vision-preview)
Currently available Google models:
- gemini-1.0 (gemini-1.0-pro-vision-001)
- gemini-1.5-flash (gemini-1.5-flash-preview-0514)
- gemini-1.5-pro (gemini-1.5-pro-preview-0514)
Currently available Anthropic models:
- claude-haiku (claude-3-haiku-20240307)
- claude-sonnet (claude-3-sonnet-20240229)
- claude-opus (claude-3-opus-20240229)
- claude-3.5-sonnet (claude-3-5-sonnet-20240620)
Currently available Reka models:
- reka-edge (reka-edge-20240208)
- reka-flash (reka-flash-20240226)
- reka-core (reka-core-20240415)
Example:
python run.py \
model=gpt-4o \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=culture_json

Note
Currently available models:
- blip2 (defaults to Salesforce/blip2-opt-6.7b)
- instructblip (defaults to Salesforce/instructblip-vicuna-7b)
- llava (defaults to llava-hf/llava-v1.6-mistral-7b-hf)
- idefics2 (defaults to HuggingFaceM4/idefics2-8b)
- paligemma (defaults to google/paligemma-3b-mix-448)
- phi3-vision (defaults to microsoft/Phi-3-vision-128k-instruct)
- minicpm-llama3-v2.5 (defaults to openbmb/MiniCPM-Llama3-V-2_5)
- glm-4v (defaults to THUDM/glm-4v-9b)
You can also specify the model size, e.g. model.size=13b for InstructBLIP, model.size=34b for LLaVA, or model.size=3b-pt-896 for PaliGemma (see the sketch right after this note).
Make sure to use a prompt template that works for the model (i.e. one that uses the correct special tokens, etc.).
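For instance, a sketch of running the 13B InstructBLIP variant; the template name is a placeholder, substitute one that matches the model and task:

python run.py \
model=instructblip \
model.size=13b \
dataset.path=data/xm3600_images \
dataset.template_name=<your_template>

Further examples: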
python run.py \
model=paligemma \
model.json_mode=false \
dataset.path=data/xm3600_images \
dataset.template_name=paligemma_caption_en

python run.py \
model=llava \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=llava7b_culture_json

python run.py \
model=idefics2 \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=idefics2_culture_json

python run.py \
model=phi3-vision \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=phi3_culture_json

python run.py \
model=minicpm-llama3-v2.5 \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=culture_json

python run.py \
model=glm-4v \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=culture_json

Pass --multirun run=slurm to run on Slurm.
Important
You might need to adjust the Slurm parameters (see defaults in configs/run/slurm.yaml).
To do so, either change them directly in slurm.yaml, create a new YAML file, or pass them as Hydra overrides, e.g. hydra.launcher.partition=gpu or hydra.launcher.gpus_per_node=0.
You can launch different configurations in parallel using comma-separated arguments, e.g. model=gemini-1.5-flash,gpt-4o.
Example:
python run.py --multirun run=slurm \
model=gemini-1.5-flash,gpt-4o \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=culture_json \
hydra.sweep.dir=./closed_models_sweep \
hydra.launcher.gpus_per_node=0 \
hydra.launcher.cpus_per_task=4 \
hydra.launcher.mem_gb=4

Download the images to a folder named xm3600_images like this:
mkdir -p xm3600_images
wget -O - https://open-images-dataset.s3.amazonaws.com/crossmodal-3600/images.tgz | tar -xvzf - -C xm3600_images