SkillAgentSearch skills...

Xmodaler

X-modaler is a versatile and high-performance codebase for cross-modal analytics(e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).

Install / Use

/learn @YehLi/Xmodaler

README

X-modaler

X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval). This codebase unifies comprehensive high-quality modules in state-of-the-art vision-language techniques, which are organized in a standardized and user-friendly fashion.

The original paper can be found here.

<p align="center"> <img src="images/task.jpg" width="800"/> </p>

Installation

See installation instructions.

Requiremenets

  • Linux or macOS with Python ≥ 3.6
  • PyTorch ≥ 1.8 and torchvision that matches the PyTorch installation. Install them together at pytorch.org to make sure of this
  • fvcore
  • pytorch_transformers
  • jsonlines
  • pycocotools

Getting Started

See Getting Started with X-modaler

Training & Evaluation in Command Line

We provide a script in "train_net.py", that is made to train all the configs provided in X-modaler. You may want to use it as a reference to write your own training script.

To train a model(e.g., UpDown) with "train_net.py", first setup the corresponding datasets following datasets, then run:

# Teacher Force
python train_net.py --num-gpus 4 \
 	--config-file configs/image_caption/updown.yaml

# Reinforcement Learning
python train_net.py --num-gpus 4 \
 	--config-file configs/image_caption/updown_rl.yaml

Model Zoo and Baselines

A large set of baseline results and trained models are available here.

<table> <tr> <td colspan="4" align="center"><font size=3><b>Image Captioning</b></font></td> </tr> <tr> <td>Attention</td> <td> Show, attend and tell: Neural image caption generation with visual attention </td> <td>ICML</td> <td>2015</td> </tr> <tr> <td>LSTM-A3</td> <td> Boosting image captioning with attributes </td> <td>ICCV</td> <td>2017</td> </tr> <tr> <td>Up-Down</td> <td> Bottom-up and top-down attention for image captioning and visual question answering </td> <td>CVPR</td> <td>2018</td> </tr> <tr> <td>GCN-LSTM</td> <td> Exploring visual relationship for image captioning </td> <td>ECCV</td> <td>2018</td> </tr> <tr> <td>Transformer</td> <td> Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning </td> <td>ACL</td> <td>2018</td> </tr> <tr> <td>Meshed-Memory</td> <td> Meshed-Memory Transformer for Image Captioning </td> <td>CVPR</td> <td>2020</td> </tr> <tr> <td>X-LAN</td> <td> X-Linear Attention Networks for Image Captioning </td> <td>CVPR</td> <td>2020</td> </tr> <tr> <td colspan="4" align="center"><font size=3><b>Video Captioning</b></font></td> </tr> <tr> <td>MP-LSTM</td> <td> Translating Videos to Natural Language Using Deep Recurrent Neural Networks </td> <td>NAACL HLT</td> <td>2015</td> </tr> <tr> <td>TA</td> <td> Describing Videos by Exploiting Temporal Structure </td> <td>ICCV</td> <td>2015</td> </tr> <tr> <td>Transformer</td> <td> Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning </td> <td>ACL</td> <td>2018</td> </tr> <tr> <td>TDConvED</td> <td> Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning </td> <td>AAAI</td> <td>2019</td> </tr> <tr> <td colspan="4" align="center"><font size=3><b>Vision-Language Pretraining</b></font></td> </tr> <tr> <td>Uniter</td> <td> UNITER: UNiversal Image-TExt Representation Learning </td> <td>ECCV</td> <td>2020</td> </tr> <tr> <td>TDEN</td> <td> Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network </td> <td>AAAI</td> <td>2021</td> </tr> </table>

Image Captioning on MSCOCO (Cross-Entropy Loss)

| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | LSTM-A3 | GoogleDrive| 75.3 | 59.0 | 45.4 | 35.0 | 26.7 | 55.6 | 107.7| 19.7 | | Attention | GoogleDrive | 76.4 | 60.6 | 46.9 | 36.1 | 27.6 | 56.6 | 113.0 | 20.4 | | Up-Down | GoogleDrive | 76.3 | 60.3 | 46.6 | 36.0 | 27.6 | 56.6 | 113.1 | 20.7 | | GCN-LSTM | GoogleDrive |76.8 | 61.1 | 47.6 | 36.9 | 28.2 | 57.2 | 116.3 | 21.2 | | Transformer | GoogleDrive | 76.4 | 60.3 | 46.5 | 35.8|28.2|56.7| 116.6| 21.3 | | Meshed-Memory | GoogleDrive | 76.3 | 60.2 | 46.4 | 35.6 | 28.1 | 56.5 | 116.0 | 21.2 | | X-LAN | GoogleDrive | 77.5 | 61.9 | 48.3 | 37.5 | 28.6 | 57.6 | 120.7 | 21.9 | | TDEN | GoogleDrive | 75.5 | 59.4 | 45.7 | 34.9 | 28.7 | 56.7 | 116.3 | 22.0 |

Image Captioning on MSCOCO (CIDEr Score Optimization)

| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | LSTM-A3 | GoogleDrive| 77.9 | 61.5| 46.7| 35.0| 27.1| 56.3| 117.0| 20.5 | | Attention | GoogleDrive | 79.4| 63.5| 48.9| 37.1| 27.9| 57.6| 123.1| 21.3 | | Up-Down | GoogleDrive | 80.1| 64.3| 49.7| 37.7| 28.0| 58.0| 124.7| 21.5 | | GCN-LSTM | GoogleDrive | 80.2| 64.7| 50.3| 38.5| 28.5| 58.4| 127.2| 22.1 | | Transformer | GoogleDrive | 80.5| 65.4| 51.1| 39.2| 29.1| 58.7| 130.0| 23.0 | | Meshed-Memory | GoogleDrive | 80.7| 65.5| 51.4| 39.6| 29.2| 58.9| 131.1| 22.9 | | X-LAN | GoogleDrive | 80.4| 65.2| 51.0| 39.2| 29.4| 59.0| 131.0| 23.2 | | TDEN | GoogleDrive | 81.3| 66.3| 52.0| 40.1| 29.6| 59.8| 132.6| 23.4 |

Video Captioning on MSVD

| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | MP-LSTM | GoogleDrive| 77.0 | 65.6 | 56.9 | 48.1 | 32.4 | 68.1 | 73.1 | 4.8 | | TA | GoogleDrive| 80.4 | 68.9 | 60.1 | 51.0 | 33.5 | 70.0 | 77.2 | 4.9 | | Transformer | GoogleDrive| 79.0 | 67.6 | 58.5 | 49.4 | 33.3 | 68.7 | 80.3 | 4.9 | | TDConvED | GoogleDrive| 81.6 | 70.4 | 61.3 | 51.7 | 34.1 | 70.4 | 77.8 | 5.0 |

Video Captioning on MSR-VTT

| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE | | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | | MP-LSTM | GoogleDrive| 73.6 | 60.8 | 49.0 | 38.6 | 26.0 | 58.3 | 41.1 | 5.6 | | TA | GoogleDrive| 74.3 | 61.8 | 50.3 | 39.9 | 26.4 | 59.4 | 42.9 | 5.8 | | Transformer | GoogleDrive | 75.4 | 62.3 | 50.0 | 39.2 | 26.5 | 58.7 | 44.0 | 5.9 | | TDConvED | GoogleDrive| 76.4 | 62.3 | 49.9 | 38.9 | 26.3 | 59.0 | 40.7 | 5.7 |

Visual Question Answering

| Name | Model | Overall | Yes/No | Number | Other | | :---: | :---: | :---: | :---: | :---: | :---: | | Uniter | GoogleDrive | 70.1 | 86.8 | 53.7 | 59.6 | | TDEN | GoogleDrive | 71.9 | 88.3 | 54.3 | 62.0 |

Caption-based image retrieval on Flickr30k

| Name | Model | R1 | R5 | R10 | | :---: | :---: | :---: | :---: | :---: | | Uniter | GoogleDrive |61.6 | 87.7 |92.8| | TDEN | GoogleDrive | 62.0 | 86.6 | 92.4 |

Visual commonsense reasoning

| Name | Model | Q -> A | QA -> R | Q -> AR | | :---: | :---: | :---: | :---: |

Related Skills

View on GitHub
GitHub Stars970
CategoryData
Updated21d ago
Forks104

Languages

Python

Security Score

85/100

Audited on Mar 4, 2026

No findings