
CrossModalRetrieval


PyTorch implementation of the paper 'See, Hear, and Read: Deep Aligned Representations' ([arXiv](http://arxiv.org/abs/1706.00932))

In this paper, a Teacher-Student-style network is designed to learn deep discriminative representations shared across three major natural modalities: vision, sound, and language. The student network accepts as input an image, a sound, or a text, and produces the corresponding modality-specific representation (gray). The teacher network then maps the modality-specific representations into a common shared representation (blue) that is aligned across modalities.

<img src="./image/LearningDeepRepresentations.jpg" width="900px" />
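The two-stage mapping described above can be sketched as follows. This is a minimal NumPy illustration with made-up layer sizes and random weights, not the repository's actual network: each modality gets its own student projection, and a shared teacher head maps every modality-specific representation into the common aligned space.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Student projections: one per modality, each mapping raw features to a
# modality-specific representation (the "gray" layers). Sizes are illustrative.
W_img = rng.normal(size=(2048, 512)) * 0.01   # image features -> specific
W_snd = rng.normal(size=(1024, 512)) * 0.01   # sound features -> specific
W_txt = rng.normal(size=(300, 512)) * 0.01    # text features  -> specific

# Teacher head: shared weights mapping any modality-specific representation
# into the common aligned space (the "blue" layers).
W_shared = rng.normal(size=(512, 128)) * 0.01

def embed(x, W_specific):
    specific = relu(x @ W_specific)     # modality-specific representation
    shared = relu(specific @ W_shared)  # aligned, modality-agnostic representation
    return shared

img_feat = rng.normal(size=(1, 2048))
txt_feat = rng.normal(size=(1, 300))

z_img = embed(img_feat, W_img)
z_txt = embed(txt_feat, W_txt)
print(z_img.shape, z_txt.shape)  # both land in the same 128-d shared space
```

The key design point is that W_shared is reused across modalities, which is what forces the representations to align.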

Experimental results show that this representation is useful for several tasks, such as cross-modal retrieval and transferring classifiers between modalities.
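Given the aligned space, cross-modal retrieval reduces to nearest-neighbour search: embed the query in one modality and rank items of the other modality by similarity. A hedged sketch with random stand-in embeddings (in practice these would come from the trained network):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in aligned embeddings: 5 images already mapped into the shared space.
image_embs = rng.normal(size=(5, 128))
# A text query whose embedding happens to sit near image 3.
query_txt = image_embs[3] + 0.01 * rng.normal(size=128)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank all images by cosine similarity to the text query.
scores = [cosine(query_txt, img) for img in image_embs]
best = int(np.argmax(scores))
print(best)  # retrieves image 3
```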


Get started

Clone this repository

Run the following commands in a terminal:

```shell
git clone https://github.com/jingliao132/CrossModalRetrieval.git
cd CrossModalRetrieval
```

Download the CUB-200 dataset

```shell
./datasets/download_dataset.sh CUB_200_2011
```

This will download and unzip the CUB-200 data into the folder 'CUB_200_2011' under ./datasets.

Download the CUB-200 caption data (torch format) with a browser or wget (refer to 'Download Google Drive files with WGET'), and extract the file under the folder 'CUB_200_2011'.

Each caption file contains a dict object with the following keys:

- 'char', a character-level one-hot mapping of 10 text descriptions of the image;
- 'img', the file name of the image;
- 'word', a word-level coding of the 10 text descriptions;
- 'txt', a 1024-dimensional text feature from a pretrained GoogLeNet (details in https://github.com/reedscot/icml2016).

Only 'char' and 'img' are used here.
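For orientation, a character-level one-hot encoding such as the 'char' field presumably stores can be built like this. The alphabet and maximum caption length below are illustrative assumptions, not values read from the caption files:

```python
import numpy as np

# Assumed alphabet: lowercase letters, digits, and a little punctuation;
# index 0 is reserved for padding. The real files may use a different alphabet.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789 -,;.!?:'\""
char_to_idx = {c: i + 1 for i, c in enumerate(alphabet)}

def encode_caption(text, max_len=201):
    """Map a caption to a (max_len, len(alphabet)+1) one-hot matrix."""
    onehot = np.zeros((max_len, len(alphabet) + 1), dtype=np.float32)
    for pos, ch in enumerate(text.lower()[:max_len]):
        # unknown characters fall back to the padding index 0 here
        onehot[pos, char_to_idx.get(ch, 0)] = 1.0
    # positions past the end of the caption stay all-zero (padding)
    return onehot

enc = encode_caption("a small bird with a red head")
print(enc.shape)
```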

<img src="./image/caption_data_example.jpg" width="900px" />

Set up training and validation manifest file

Example files train.txt, val.txt, and train_val.txt (used for producing the word-embedding files) are provided in ./datasets. Move them into the folder 'CUB_200_2011'.

Customize your train/val split by editing train.txt and val.txt.
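Assuming each manifest line names one image (check the provided example files for the exact format), reading a split is straightforward; the file name demo_train.txt below is a throwaway stand-in:

```python
from pathlib import Path

def read_manifest(path):
    """Return the non-empty, stripped lines of a train/val manifest file."""
    return [line.strip()
            for line in Path(path).read_text().splitlines()
            if line.strip()]

# demo with a throwaway file; real manifests live in CUB_200_2011
Path("demo_train.txt").write_text(
    "001.Black_footed_Albatross/img1.jpg\n002.Laysan_Albatross/img2.jpg\n")
print(read_manifest("demo_train.txt"))
```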

Create a new folder 'models', and download the pre-trained word-to-vector model (binary format) into 'models'.

Produce the word-embedding files using the pre-trained model:

```shell
python3 ./utils/word_embeddings.py
```
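What ./utils/word_embeddings.py presumably does is look up each caption word in the pre-trained word-to-vector model and aggregate the vectors. A self-contained sketch with a toy vocabulary standing in for the real binary model (which in practice might be loaded with gensim's KeyedVectors.load_word2vec_format):

```python
import numpy as np

# Toy stand-in for the pre-trained word2vec model: word -> 300-d vector.
# The real model would be loaded from the binary file in ./models.
rng = np.random.default_rng(2)
toy_model = {w: rng.normal(size=300)
             for w in ["a", "small", "bird", "red", "head"]}

def caption_embedding(caption, model, dim=300):
    """Average the vectors of in-vocabulary words; zeros if none are found."""
    vecs = [model[w] for w in caption.lower().split() if w in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

emb = caption_embedding("a small bird with a red head", toy_model)
print(emb.shape)  # (300,)
```

Averaging is just one simple aggregation choice; the repository's script may store per-word vectors instead.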

Train the model:

```shell
python3 __init__.py
```
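Training aligns paired embeddings across modalities. One common objective for this is a margin ranking loss that scores matching image/text pairs higher than mismatched ones; the sketch below illustrates that idea in NumPy and is not necessarily the exact loss used in this repository:

```python
import numpy as np

def ranking_loss(img_embs, txt_embs, margin=1.0):
    """Push matching image/text pairs closer than mismatched pairs by a margin."""
    n = len(img_embs)
    sims = img_embs @ txt_embs.T   # (n, n) similarity matrix
    pos = np.diag(sims)            # matching pairs sit on the diagonal
    # hinge on every mismatched pair, in both retrieval directions
    cost_i2t = np.maximum(0.0, margin + sims - pos[:, None])
    cost_t2i = np.maximum(0.0, margin + sims - pos[None, :])
    np.fill_diagonal(cost_i2t, 0.0)
    np.fill_diagonal(cost_t2i, 0.0)
    return (cost_i2t.sum() + cost_t2i.sum()) / n

# Perfectly aligned pairs incur zero loss; misaligned pairs are penalised.
matched = ranking_loss(np.eye(4), np.eye(4))
shuffled = ranking_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
print(matched)             # 0.0
print(shuffled > matched)  # True
```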

Acknowledgement

```
@techreport{WelinderEtal2010,
  author      = {P. Welinder and S. Branson and T. Mita and C. Wah and F. Schroff and S. Belongie and P. Perona},
  institution = {California Institute of Technology},
  number      = {CNS-TR-2010-001},
  title       = {{Caltech-UCSD Birds 200}},
  year        = {2010}
}

@inproceedings{reed2016generative,
  title     = {Generative Adversarial Text-to-Image Synthesis},
  author    = {Scott Reed and Zeynep Akata and Xinchen Yan and Lajanugen Logeswaran and Bernt Schiele and Honglak Lee},
  booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
  year      = {2016}
}
```

Reference

```
@article{DBLP:journals/corr/AytarVT17,
  author        = {Yusuf Aytar and Carl Vondrick and Antonio Torralba},
  title         = {See, Hear, and Read: Deep Aligned Representations},
  journal       = {CoRR},
  volume        = {abs/1706.00932},
  year          = {2017},
  url           = {http://arxiv.org/abs/1706.00932},
  archivePrefix = {arXiv},
  eprint        = {1706.00932},
  timestamp     = {Mon, 13 Aug 2018 16:48:33 +0200},
  biburl        = {https://dblp.org/rec/bib/journals/corr/AytarVT17},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
```
