# SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation
We now have a demo, check it out: https://huggingface.co/spaces/RitaParadaRamos/SmallCapDemo :v:
## Dependencies

The code was developed in Python 3.9.

```bash
conda create -n smallcap python=3.9
conda activate smallcap
pip install -r requirements.txt
```
### Evaluation package

Download the Stanford models for computing SPICE (a slightly modified version of this repo):

```bash
./coco-caption/get_stanford_models.sh
```
## Interacting with SmallCap

Our pretrained model is available on HuggingFace at `Yova/SmallCap7M`.

To use it, you also need the retrieval datastore:

```bash
mkdir datastore
```

Download the COCO index and associated captions and place them in `datastore/`.

See `SmallCap_demo.ipynb` for a demo of our pretrained model.
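For a rough picture of how the retrieved captions enter generation, here is a minimal sketch of SmallCap-style prompting. The template paraphrases the pattern described in the paper; the helper function, example captions, and the final `generate` step are illustrative assumptions — `SmallCap_demo.ipynb` shows the actual usage.

```python
from transformers import AutoTokenizer

def build_prompt(retrieved_captions):
    # Retrieved captions are folded into a natural-language prefix that
    # conditions the decoder; the caption is generated as its continuation.
    prefix = "Similar images show "
    prefix += " ".join(c.strip() + "." for c in retrieved_captions)
    return prefix + " This image shows"

retrieved = [
    "a man riding a wave on top of a surfboard",
    "a surfer in a wetsuit riding a small wave",
]
prompt = build_prompt(retrieved)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # SmallCap's decoder is GPT-2
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# Together with the CLIP-encoded image, input_ids would be passed to
# model.generate(...) so the model completes "This image shows ...".
```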
## Training SmallCap

### Data

Download the COCO Karpathy splits file `dataset_coco.json` from here and place it in `data/`.

Download all COCO images (train, val and test, 2017 version) from here and place them in `data/images`. The expected naming format is twelve digits followed by a `.jpg` extension, e.g. `data/images/000000000001.jpg` for the image with COCO id 1.
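As a quick sanity check on the naming convention, the expected filename for a given COCO id can be derived as follows:

```python
def coco_image_filename(coco_id: int) -> str:
    # Twelve zero-padded digits plus ".jpg", as expected under data/images/.
    return f"{coco_id:012d}.jpg"

assert coco_image_filename(1) == "000000000001.jpg"
```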
### Preprocessing

At the moment, CLIP models based on ResNet are not available through HuggingFace, so it is necessary to also install the original CLIP implementation from here:

```bash
pip install git+https://github.com/openai/CLIP.git
```

Extract train and val features:

```bash
mkdir features
python src/extract_features.py
```
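The script takes care of this, but for orientation, extracting a feature with the original CLIP package looks roughly like the sketch below. The model name and image path are assumptions; `src/extract_features.py` is the source of truth.

```python
import clip  # the original OpenAI implementation installed above
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# The exact CLIP variant is an assumption; check src/extract_features.py.
model, preprocess = clip.load("RN50x64", device=device)

image = preprocess(Image.open("data/images/000000000001.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    features = model.encode_image(image)  # one feature vector per image
```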
### Retrieve captions

```bash
python src/retrieve_captions.py
```
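Conceptually, this step is nearest-neighbour search over the datastore. A minimal sketch with faiss follows (the file names and the number of retrieved captions are assumptions; see `src/retrieve_captions.py` for the actual settings):

```python
import json

import faiss
import numpy as np

# File names are assumptions; use the index and captions placed in datastore/.
index = faiss.read_index("datastore/coco_index")
with open("datastore/coco_captions.json") as f:
    captions = json.load(f)

# Query with an L2-normalized CLIP image feature so that inner-product
# search behaves like cosine similarity.
query = np.random.rand(1, index.d).astype("float32")
faiss.normalize_L2(query)

_, ids = index.search(query, 4)  # fetch a handful of nearest captions
retrieved = [captions[i] for i in ids[0]]
```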
### Model training

```bash
python train.py
```

Models are saved under the name `<rag/norag>_<num params>M`, e.g. `rag_7M` for a model trained with retrieval augmentation and 7M trainable parameters.
## Inference

```bash
python infer.py --model_path <MODEL_PATH>
```

If you also specify `--checkpoint_path`, inference runs with only that checkpoint; otherwise, all checkpoints in `--model_path` are used.

If you specify `--infer_test`, inference uses the test data; otherwise, the validation data is used.

E.g. to run inference on the test split with model `rag_7M`, checkpoint 17712, run:

```bash
python infer.py --model_path experiments/rag_7M --checkpoint_path checkpoint-17712 --infer_test
```

The model predictions are stored as `<val/test>_preds.json` in each respective checkpoint subdirectory.
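To inspect the predictions from the example above, load the file directly. The per-entry fields hinted at in the comment are an assumption; check one file to confirm the exact keys:

```python
import json

# Path follows the rag_7M example above.
with open("experiments/rag_7M/checkpoint-17712/test_preds.json") as f:
    preds = json.load(f)

print(len(preds), "predictions")
print(preds[0])  # e.g. an image id paired with its generated caption
```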
Note: You can safely ignore the warning `Some weights of ThisGPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized...` It occurs because a new model is first built and then the pre-trained parameters are loaded into it.
## Evaluate predictions

```bash
python coco-caption/run_eval.py <GOLD_ANN_PATH> <PREDICTIONS_PATH>
```
## Paper

If you find our code/data/models or ideas useful in your research, please consider citing the paper:

```bibtex
@inproceedings{ramos2022smallcap,
  title={SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation},
  author={Ramos, Rita and Martins, Bruno and Elliott, Desmond and Kementchedjhieva, Yova},
  booktitle={CVPR},
  url={https://openaccess.thecvf.com/content/CVPR2023/papers/Ramos_SmallCap_Lightweight_Image_Captioning_Prompted_With_Retrieval_Augmentation_CVPR_2023_paper.pdf},
  year={2023}
}
```