DALR

The implementation of our ACL 2025 paper "DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning"

License: MIT · Python 3.7+ · PRs Welcome

English | 中文

Overview

We propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation Learning).

To achieve fine-grained cross-modal alignment, we propose an alignment method that mitigates the cross-modal misalignment bias (CMB) issue. To alleviate the intra-modal semantic divergence (ISD) issue, we combine ranking distillation with global alignment learning to effectively align intra-modal representations.
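The ranking-distillation component can be illustrated with a minimal ListMLE sketch. This is not the paper's exact loss, just the standard ListMLE formulation it builds on, assuming the teacher's similarity scores define the target ranking:

```python
import torch

def listmle_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    """ListMLE: negative log-likelihood of the teacher's ranking under
    a Plackett-Luce model parameterized by the student's scores."""
    # Target ranking: items sorted by teacher score, best first.
    order = torch.argsort(teacher_scores, dim=-1, descending=True)
    s = torch.gather(student_scores, -1, order)
    # log sum exp over the "remaining" items at each rank position
    # (logcumsumexp over the reversed sequence, then flipped back).
    lse = torch.logcumsumexp(s.flip(-1), dim=-1).flip(-1)
    return (lse - s).sum(-1).mean()
```

A student that reproduces the teacher's ordering incurs a lower loss than one that inverts it, which is the property the distillation objective relies on.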

The figure below illustrates the overall model architecture.

DALR model architecture


Table of Contents

- Overview
- Getting Started
- Quick Start: Use DALR
- Evaluation
- Train Your Own Models
- Project Structure
- Citation
- Contributing
Getting Started

Environment Setup

We recommend creating a virtual environment first:

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

Install PyTorch (CUDA 11.1):

pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 \
    -f https://download.pytorch.org/whl/torch_stable.html

For CUDA < 11 or CPU-only:

pip install torch==1.8.1

Then install the remaining dependencies:

pip install -r requirements.txt

Download Datasets

Download Flickr30k and MS-COCO from their official websites and organize them as follows:

REPO ROOT
├── data
│   ├── Flickr/
│   ├── MS-COCO/
│   └── wiki1m_for_simcse.txt
├── Model/
│   ├── bert-base-uncased/
│   ├── simcse/
│   ├── DiffCSE/
│   └── clip/
│       └── ViT-L-14.pt
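Before training, it can be worth confirming the layout above is in place. A small sanity-check helper (hypothetical, not part of the repo) might look like:

```python
from pathlib import Path

# Paths taken from the directory tree above.
REQUIRED = [
    "data/Flickr",
    "data/MS-COCO",
    "data/wiki1m_for_simcse.txt",
    "Model/bert-base-uncased",
    "Model/clip/ViT-L-14.pt",
]

def check_layout(root: str = ".") -> list:
    """Return the required paths that are missing under `root`."""
    base = Path(root)
    return [p for p in REQUIRED if not (base / p).exists()]

if __name__ == "__main__":
    missing = check_layout()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Layout OK")
```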

Wiki1M (used for text training):

wget https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt \
    -P data/

SentEval evaluation datasets (from SimCSE):

cd SentEval/data/downstream/
bash download_dataset.sh

Pretrained models (SimCSE, DiffCSE, BERT-base, CLIP ViT-L/14) can be downloaded from Hugging Face and placed in the Model/ directory.
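For example, checkpoints can be fetched with git-lfs clones from the Hugging Face Hub. The exact checkpoint variants below are assumptions (substitute whichever SimCSE/DiffCSE checkpoints you intend to use as teachers):

```
git lfs install
git clone https://huggingface.co/bert-base-uncased Model/bert-base-uncased
# Checkpoint names are assumptions -- pick the teacher variants you need:
git clone https://huggingface.co/princeton-nlp/sup-simcse-bert-base-uncased Model/simcse
```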


Quick Start: Use DALR

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Model/DALR")
model = AutoModel.from_pretrained("Model/DALR")

texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house.",
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
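If you prefer to stay in PyTorch and score all sentence pairs at once, an alternative sketch (not from the repo) computes the full cosine-similarity matrix in one matrix multiply:

```python
import torch
import torch.nn.functional as F

def pairwise_cosine(embeddings: torch.Tensor) -> torch.Tensor:
    """All-pairs cosine similarity for a batch of sentence embeddings
    of shape (n, d); returns an (n, n) similarity matrix."""
    normed = F.normalize(embeddings, p=2, dim=-1)  # unit-length rows
    return normed @ normed.T
```

With the `embeddings` tensor from the snippet above, `pairwise_cosine(embeddings)[0, 1]` matches `cosine_sim_0_1`.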

Evaluation

Run evaluation on SentEval benchmarks:

python src/evaluation.py \
    --model_name_or_path Model/DALR \
    --pooler cls_before_pooler \
    --task_set sts \
    --mode test

Additional evaluation scripts are provided in scripts/:

bash scripts/run_eval.sh        # STS evaluation
bash scripts/run_eval_coco.sh   # COCO retrieval evaluation

Train Your Own Models

Wiki + Flickr30k

bash scripts/run_wiki_flickr.sh

Wiki + MS-COCO

bash scripts/run_wiki_coco.sh

You can freely adjust hyperparameters (learning rate, batch size, margins, lambda, etc.) in the respective shell scripts. Key arguments:

| Argument | Description | Default |
|---|---|---|
| --framework | Training framework (simcse / mse) | mse |
| --learning_rate | Learning rate | 2e-5 |
| --per_device_train_batch_size | Batch size per device | 128 |
| --num_train_epochs | Number of training epochs | 4 |
| --lbd | Weight for distillation loss | 0.01 |
| --margin1 / --margin2 | Ranking margins | 0.18 |
| --distillation_loss | Distillation loss type | listmle |
| --alpha_ / --beta_ / --gamma_ | Loss weights | 0.33 / 1.0 / 1.0 |
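Putting a few of these arguments together, a direct training invocation might look like the following. This is illustrative only; the flags are taken from the table above, and the exact set accepted by src/train_mix.py should be checked against the shell scripts:

```
python src/train_mix.py \
    --framework mse \
    --learning_rate 2e-5 \
    --per_device_train_batch_size 128 \
    --num_train_epochs 4 \
    --lbd 0.01 \
    --margin1 0.18 --margin2 0.18 \
    --distillation_loss listmle
```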


Project Structure

DALR/
├── clip/                   # CLIP model utilities
├── data/                   # Data directory (datasets downloaded here)
├── figure/                 # Figures used in the paper / README
├── scripts/                # Training and evaluation shell scripts
│   ├── run_wiki_flickr.sh
│   ├── run_wiki_coco.sh
│   ├── run_eval.sh
│   └── run_eval_coco.sh
├── SentEval/               # SentEval toolkit (evaluation)
├── src/                    # Core source code
│   ├── model_dalr.py       # DALR model definition
│   ├── train_mix.py        # Main training script
│   ├── data.py             # Dataset and data loading
│   ├── evaluation.py       # SentEval evaluation
│   ├── teachers.py         # Teacher model wrappers
│   ├── utils.py            # Utility functions
│   ├── vit.py              # Vision Transformer implementation
│   ├── xbert.py            # Extended BERT utilities
│   ├── tool.py             # Miscellaneous tools
│   └── randaugment.py      # RandAugment data augmentation
├── requirements.txt
├── LICENSE
├── CONTRIBUTING.md
├── README.md
└── README_zh.md

Citation

If you find this work useful in your research, please consider citing:

@inproceedings{he-etal-2025-dalr,
    title = "{DALR}: Dual-level Alignment Learning for Multimodal Sentence Representation Learning",
    author = "He, Kang  and
      Ding, Yuzhe  and
      Wang, Haining  and
      Li, Fei  and
      Teng, Chong  and
      Ji, Donghong",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    year = "2025",
    pages = "3586--3601",   
}

Acknowledgements


Contributing

We welcome contributions! Please read our Contributing Guide to get started. Feel free to open an Issue or submit a Pull Request.
