7 repositories found
CASIA-IVA-Lab / VAST
[NeurIPS 2023] Code and model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
RunpeiDong / ACT
[ICLR 2023] Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?
cclaess / SPECTRE
[CVPR 2026] Code and models for SPECTRE: Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction.
mshukor / VLPCook
Official implementation of VLPCook: Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
mlbio-epfl / STRUCTURE
[NeurIPS 2025] TL;DR: Aligning pretrained unimodal models with the proposed framework using limited paired data yields gains of ~52% in cross-modal zero-shot classification and ~92% in retrieval (see the sketch after this list).
chincharles / u-emo
[TIP 2025] Official code for "UniEmoX: Cross-modal Semantic-Guided Large-Scale Pretraining for Universal Scene Emotion Perception".
Heidelberg-NLP / counting-probe
Counting dataset for Vision & Language models, introduced in the paper "Seeing Past Words: Testing the Cross-Modal Capabilities of Pretrained V&L Models" (https://arxiv.org/abs/2012.12352)
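
For context on the kind of alignment the STRUCTURE entry above describes, here is a minimal sketch, not the paper's actual method: two frozen pretrained unimodal encoders are bridged by small trainable projection heads, fit with a symmetric contrastive (CLIP-style InfoNCE) loss on a limited set of paired examples. The encoders, dimensions, and hyperparameters below are hypothetical stand-ins.

```python
# Minimal sketch (assumed setup, not STRUCTURE's algorithm): align two frozen
# unimodal encoders by training only lightweight projection heads on paired data.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
IMG_DIM, TXT_DIM, SHARED_DIM = 768, 512, 256  # hypothetical dimensions

# Frozen "pretrained" encoders, faked here as fixed random linear maps.
image_encoder = torch.nn.Linear(1024, IMG_DIM).requires_grad_(False)
text_encoder = torch.nn.Linear(300, TXT_DIM).requires_grad_(False)

# Only these projections are trained on the limited paired set.
img_proj = torch.nn.Linear(IMG_DIM, SHARED_DIM)
txt_proj = torch.nn.Linear(TXT_DIM, SHARED_DIM)
opt = torch.optim.Adam(
    list(img_proj.parameters()) + list(txt_proj.parameters()), lr=1e-3
)

# A small batch of (image, text) pairs stands in for the paired data.
images, texts = torch.randn(32, 1024), torch.randn(32, 300)

for step in range(100):
    z_img = F.normalize(img_proj(image_encoder(images)), dim=-1)
    z_txt = F.normalize(txt_proj(text_encoder(texts)), dim=-1)
    logits = z_img @ z_txt.t() / 0.07       # temperature-scaled similarities
    labels = torch.arange(len(images))      # matched pairs lie on the diagonal
    # Symmetric InfoNCE over image->text and text->image directions.
    loss = (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After such alignment, cross-modal zero-shot classification and retrieval reduce to nearest-neighbor search between projected embeddings from the two modalities.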