I2M2
I2M2: Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning (NeurIPS 2024)
TL;DR: We distinguish between different modeling paradigms for multi-modal learning from the perspective of generative models and offer a general recipe for designing models that efficiently leverage multi-modal data, leading to more accurate predictions.
Abstract
Supervised multi-modal learning involves mapping multiple modalities to a target label. Previous studies in this field have concentrated on capturing in isolation either the inter-modality dependencies (the relationships between different modalities and the label) or the intra-modality dependencies (the relationships within a single modality and the label). We argue that these conventional approaches that rely solely on either inter- or intra-modality dependencies may not be optimal in general. We view the multi-modal learning problem from the lens of generative models, where we consider the target as a source of multiple modalities and the interaction between them. Towards that end, we propose the inter- & intra-modality modeling (I2M2) framework, which captures and integrates both the inter- and intra-modality dependencies, leading to more accurate predictions. We evaluate our approach using real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional methods focusing only on one type of modality dependency.
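The combination described above can be sketched as a simple late fusion of classifier outputs. This is a minimal illustration of the idea, not the authors' exact implementation; the random logits, batch size, and class count below are hypothetical stand-ins for trained intra- and inter-modality classifiers:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch, n_classes = 4, 10

# Hypothetical logits from two intra-modality classifiers (one per
# modality) and one inter-modality classifier trained on both jointly.
intra_audio = rng.normal(size=(batch, n_classes))
intra_vision = rng.normal(size=(batch, n_classes))
inter_joint = rng.normal(size=(batch, n_classes))

# Sketch of the I2M2-style integration: sum the logits so the final
# prediction reflects both intra- and inter-modality dependencies.
combined = intra_audio + intra_vision + inter_joint
probs = softmax(combined)
preds = probs.argmax(axis=-1)
```

In practice each set of logits would come from a trained model over its modality (or over all modalities jointly), but the fusion step itself stays this simple.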
Prerequisites
$ pip install -r requirements.txt
📚 Datasets Overview and Instructions
Our project utilizes several datasets, each organized within specific folders. Below is an overview of the datasets and links to detailed instructions in their respective folders:
AVMNIST
- Description: Audio-Vision MNIST (AV-MNIST) combines audio and visual modalities for the MNIST digit (0-9) recognition task.
- Instructions: For detailed instructions on how to use this dataset, refer to the README in the AVMNIST_and_MIMIC folder.
fastMRI
- Description: The fastMRI dataset is a large-scale dataset that consists of raw k-space knee data alongside anonymized clinical magnetic resonance (MR) images and pathology labels.
- Instructions: Detailed steps for using the fastMRI dataset can be found in the README in the fastMRI folder.
MIMIC-III
- Description: The MIMIC-III dataset encompasses ten years of intensive care unit (ICU) patient data from Beth Israel Deaconess Medical Center. The dataset is divided into two modalities: 1) a time-series modality, which includes hourly medical measurements over 24 hours, and 2) a static modality, capturing a patient's medical information. We consider three tasks: a) predicting the mortality of a patient within 1 day, 2 days, 3 days, 1 week, 1 year, and beyond, and b) two binary classification tasks for ICD-9 codes, one to assess if a patient falls under group 1 (codes 140-239; neoplasms) and another for group 7 (codes 460-519; diseases of the respiratory system).
- Instructions: For detailed instructions on how to use this dataset, refer to the README in the AVMNIST_and_MIMIC folder.
VQA
- Description: The objective of VQA is to answer questions about images. The evaluation encompasses the IID test set and nine out-of-distribution (OOD) test sets released with the VQA-VS dataset.
- Instructions: Comprehensive guidelines on these datasets are available in the README in the VQA_and_NLVR2 folder.
NLVR2
- Description: NLVR2 is a binary classification task in which the goal is to determine whether a text description correctly describes a pair of images.
- Instructions: Comprehensive guidelines on these datasets are available in the README in the VQA_and_NLVR2 folder.
Contributing
We'd love to accept your contributions to this project. Please feel free to open an issue, or submit a pull request as necessary. If you have implementations of this repository in other ML frameworks, please reach out so we may highlight them here.
License
This codebase is released under the MIT License.
📌 Citation
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
@inproceedings{
madaan2024jointly,
title={Jointly Modeling Inter- \& Intra-Modality Dependencies for Multi-modal Learning},
author={Divyam Madaan and Taro Makino and Sumit Chopra and Kyunghyun Cho},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=XAKALzI3Gw}
}
