This codebase runs image-text inference with SOTA vision-language models, locally or via Slurm. It is designed to be easily extensible to new models, datasets, and tasks. We support structured JSON generation via outlines and pydantic, using schema-constrained decoding for HuggingFace models and JSON mode with API-based models wherever applicable.
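To illustrate the structured-output piece, here is a minimal sketch of a pydantic (v2) schema of the kind that can drive schema-constrained decoding or JSON mode; the class and field names are hypothetical, not necessarily the schemas shipped in this repo.

from pydantic import BaseModel, Field

# Hypothetical output schema -- the actual schemas used by this codebase are defined in the repo.
class CulturalCaption(BaseModel):
    caption: str = Field(description="Short description of the image")
    culture: str = Field(description="Culture or region the image is most associated with")

# The JSON schema derived from the model is what constrains decoding (or is sent along with JSON mode requests).
print(CulturalCaption.model_json_schema())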
- Basic install:
conda create -n vizwiz-culture python=3.10
conda activate vizwiz-culture
pip install -e .
- (Optional) Install flash-attention:
pip install flash-attn --no-build-isolation
# Verify import; if output is empty installation was successful
python -c "import torch; import flash_attn_2_cuda"
- Login, set up billing, and create an API key at https://platform.openai.com/
- Run:
export OPENAI_API_KEY=<your_key>
- Login, set up billing, and create a project at https://console.cloud.google.com/
- Go to https://cloud.google.com/sdk/docs/install#linux and follow the instructions to install gcloud
- Run gcloud init and follow the instructions
- Run gcloud auth application-default login and follow the instructions
- Run:
export GOOGLE_CLOUD_PROJECT=<your_project>
- Login, set up billing, and create an API key at https://console.anthropic.com/
- Run:
export ANTHROPIC_API_KEY=<your_key>
- Login, set up billing, and create an API key at https://platform.reka.ai/
- Run:
export REKA_API_KEY=<your_key>
You can register new models, datasets, and callbacks by adding them under src/vlm_inference/configuration/registry.py. We currently support the Google Gemini API, the OpenAI API, the Anthropic API, the Reka API, and HuggingFace models.
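As a purely illustrative sketch of the registration pattern (the names and structure here are hypothetical; check registry.py for the actual entries and API):

# Illustrative only -- not the actual API of src/vlm_inference/configuration/registry.py.
MODEL_REGISTRY: dict[str, type] = {}

def register_model(name: str):
    """Map a config name (e.g. what you would pass as model=<name> on the CLI) to a class."""
    def decorator(cls: type) -> type:
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

@register_model("my-new-model")
class MyNewModel:
    def generate(self, image, prompt: str) -> str:
        raise NotImplementedError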
We currently use callbacks for logging, local saving of outputs, and uploading to Wandb.
You can remove default callbacks via '~_callback_dict.<callback_name>', e.g. remove the Wandb callback via '~_callback_dict.wandb' (mind the quotation marks).
You can also override callback values, e.g. _callback_dict.wandb.project=new-project (see the example below).
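For example, taking the GPT-4o invocation shown further below and disabling the Wandb callback (only the last override is new; a value override such as _callback_dict.wandb.project=new-project can be appended the same way):

python run.py \
model=gpt-4o \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=culture_json \
'~_callback_dict.wandb'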
Note
Currently available OpenAI models:
- gpt-4o (gpt-4o-2024-05-13)
- gpt-4o-mini (gpt-4o-mini-2024-07-18)
- gpt-4-turbo (gpt-4-turbo-2024-04-09)
- gpt-4 (gpt-4-1106-vision-preview)
Currently available Google models:
- gemini-1.0 (gemini-1.0-pro-vision-001)
- gemini-1.5-flash (gemini-1.5-flash-preview-0514)
- gemini-1.5-pro (gemini-1.5-pro-preview-0514)
Currently available Anthropic models:
- claude-haiku (claude-3-haiku-20240307)
- claude-sonnet (claude-3-sonnet-20240229)
- claude-opus (claude-3-opus-20240229)
- claude-3.5-sonnet (claude-3-5-sonnet-20240620)
Currently available Reka models:
- reka-edge (reka-edge-20240208)
- reka-flash (reka-flash-20240226)
- reka-core (reka-core-20240415)
Example:
python run.py \
model=gpt-4o \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=culture_json

Note
Currently available models:
- blip2 (defaults to Salesforce/blip2-opt-6.7b)
- instructblip (defaults to Salesforce/instructblip-vicuna-7b)
- llava (defaults to llava-hf/llava-v1.6-mistral-7b-hf)
- idefics2 (defaults to HuggingFaceM4/idefics2-8b)
- paligemma (defaults to google/paligemma-3b-mix-448)
- phi3-vision (defaults to microsoft/Phi-3-vision-128k-instruct)
- minicpm-llama3-v2.5 (defaults to openbmb/MiniCPM-Llama3-V-2_5)
- glm-4v (defaults to THUDM/glm-4v-9b)
You can also specify the model size, e.g. model.size=13b for InstructBLIP, model.size=34b for LLaVA, or model.size=3b-pt-896 for PaliGemma (see the sketch right after this note).
Make sure to use a prompt template that works for the model (i.e. one that uses the correct special tokens, etc.).
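For instance, a sketch of running the 13B InstructBLIP variant; the template name is a placeholder, substitute one that matches the model and task:

python run.py \
model=instructblip \
model.size=13b \
dataset.path=data/xm3600_images \
dataset.template_name=<your_template>

Further examples: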
python run.py \
model=paligemma \
model.json_mode=false \
dataset.path=data/xm3600_images \
dataset.template_name=paligemma_caption_en

python run.py \
model=llava \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=llava7b_culture_json

python run.py \
model=idefics2 \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=idefics2_culture_json

python run.py \
model=phi3-vision \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=phi3_culture_json

python run.py \
model=minicpm-llama3-v2.5 \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=culture_json

python run.py \
model=glm-4v \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=culture_json

Pass --multirun run=slurm to run on Slurm.
Important
You might need to adjust the Slurm parameters (see defaults in configs/run/slurm.yaml).
To do so, either change them directly in slurm.yaml, create a new YAML file, or pass them as Hydra overrides, e.g. hydra.launcher.partition=gpu or hydra.launcher.gpus_per_node=0.
You can launch different configurations in parallel using comma-separated arguments, e.g. model=gemini-1.5-flash,gpt-4o.
Example:
python run.py --multirun run=slurm \
model=gemini-1.5-flash,gpt-4o \
model.json_mode=true \
dataset=cultural_captioning \
dataset.path=data/xm3600_images \
dataset.template_name=culture_json \
hydra.sweep.dir=./closed_models_sweep \
hydra.launcher.gpus_per_node=0 \
hydra.launcher.cpus_per_task=4 \
hydra.launcher.mem_gb=4

Download the images to a folder named xm3600_images like this:
mkdir -p xm3600_images
wget -O - https://open-images-dataset.s3.amazonaws.com/crossmodal-3600/images.tgz | tar -xvzf - -C xm3600_images