MECGC
[ACM Multimedia 2024] Observe before Generate: Emotion-Cause aware Video Caption for Multimodal Emotion Cause Generation in Conversations
Fanfan Wang, Heqing Ma, Xiangqing Shen, Jianfei Yu*, Rui Xia*
This repository contains the code for ObG, a multimodal pipeline framework that first generates emotion-cause aware video captions (Observe) and then generates the abstractive emotion causes based on those captions (Generate).
<img src="figures/framework.png" alt="overview" width="800"/>

Task
Multimodal Emotion Cause Generation in Conversations (MECGC) aims to generate the abstractive causes of given emotions based on multimodal context.
<img src="figures/task.png" alt="task" width="500"/>

Dataset
ECGF is constructed by manually annotating the abstractive causes for each emotion labeled in the existing ECF dataset.
<img src="figures/dataset.png" alt="dataset" width="400"/>

Requirements
```shell
conda env create -f environment.yml
conda activate obg
# install nlg-eval for evaluation
pip install git+https://github.com/Maluuba/nlg-eval.git
```
Usage
1. Emotion-cause aware video captioning
Few-shot Data Synthesis
We use Gemini-Pro-Vision to generate emotion-cause aware video captions, which serve as supervised data for training ECCap. For the detailed instruction template, please refer to Figure 3 in our paper.
Data Format
```json
{
  "emo_utt_id": "dia14utt4",
  "input": "question: What visual caption suggests the emotion causes for All's anger in U4? \\ context: U1. <extra_id_1> <extra_id_51> Chandler: xxx | U2. <extra_id_2> <extra_id_52> Phoebe: xxx | U3. <extra_id_3> <extra_id_53> All: xxx | U4. <extra_id_4> <extra_id_54> All: xxx",
  "output": "A man is smoking."
}
```
Note: `xxx` stands for the utterance text, and the context window is [-3, 0] relative to the emotion utterance.
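The input string above can be assembled programmatically. A minimal sketch of this formatting, assuming the special-token scheme and field layout shown in the example (the helper name and the `(speaker, text)` tuple representation are our assumptions, not part of the released code):

```python
# Build the ECCap input string for one emotion utterance.
# Utterances inside the window are renumbered U1, U2, ... and paired
# with <extra_id_k> / <extra_id_{k+50}> placeholders, as in the example.

def build_eccap_input(utterances, emo_idx, speaker, emotion, window=(-3, 0)):
    """utterances: list of (speaker, text) tuples for the whole dialogue;
    emo_idx: 0-based index of the emotion utterance."""
    start = max(0, emo_idx + window[0])
    end = min(len(utterances) - 1, emo_idx + window[1])
    parts = []
    for i in range(start, end + 1):
        u_no = i - start + 1  # position within the window
        spk, text = utterances[i]
        parts.append(f"U{u_no}. <extra_id_{u_no}> <extra_id_{u_no + 50}> {spk}: {text}")
    question = (f"question: What visual caption suggests the emotion causes "
                f"for {speaker}'s {emotion} in U{emo_idx - start + 1}?")
    return question + " \\ context: " + " | ".join(parts)
```

For the `dia14utt4` example above (emotion utterance at index 3, window [-3, 0]), this reproduces the four-utterance context U1–U4.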
Model Training
```shell
# modify the data_dir, output_dir
bash ECCap.sh
```
2. Multimodal emotion cause generation
Data Format
```json
{
  "emo_utt_id": "dia14utt4",
  "input": "question: Why does All feel anger in U4? \\ caption: xxx \\ context: U1. <extra_id_1> <extra_id_51> Chandler: xxx | U2. <extra_id_2> <extra_id_52> Phoebe: xxx | U3. <extra_id_3> <extra_id_53> All: xxx | U4. <extra_id_4> <extra_id_54> All: xxx | U5. <extra_id_5> <extra_id_55> Rachel: xxx | U6. <extra_id_6> <extra_id_56> Chandler: xxx",
  "output": "Chandler is smoking."
}
```
Note: The context window is [-5, 2] relative to the emotion utterance.
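The cause-generation input follows the same context encoding but adds a caption field and uses the wider [-5, 2] window. A minimal sketch under the same assumptions as above (helper name and argument layout are ours):

```python
# Assemble the cause-generation input: question + generated caption +
# windowed context with <extra_id_k> / <extra_id_{k+50}> placeholders.

def build_cgm_input(utterances, emo_idx, speaker, emotion, caption, window=(-5, 2)):
    """utterances: list of (speaker, text) tuples; emo_idx: 0-based index
    of the emotion utterance; caption: the ECCap-generated video caption."""
    start = max(0, emo_idx + window[0])
    end = min(len(utterances) - 1, emo_idx + window[1])
    context = " | ".join(
        f"U{i - start + 1}. <extra_id_{i - start + 1}> <extra_id_{i - start + 51}> {spk}: {text}"
        for i, (spk, text) in enumerate(utterances[start:end + 1], start=start)
    )
    emo_pos = emo_idx - start + 1  # window-relative position of the emotion utterance
    return (f"question: Why does {speaker} feel {emotion} in U{emo_pos}? "
            f"\\ caption: {caption} \\ context: {context}")
```

For the example above, the window around utterance 4 is clipped at the dialogue start, yielding the six-utterance context U1–U6 with the emotion utterance at U4.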
Model Training
```shell
# modify the data_dir, output_dir
bash CGM.sh
```
Citation
```bibtex
@inproceedings{wang2024obg,
  title     = {Observe before Generate: Emotion-Cause aware Video Caption for Multimodal Emotion Cause Generation in Conversations},
  author    = {Wang, Fanfan and Ma, Heqing and Shen, Xiangqing and Yu, Jianfei and Xia, Rui},
  booktitle = {Proceedings of the 32nd ACM International Conference on Multimedia},
  pages     = {5820--5828},
  year      = {2024},
  doi       = {10.1145/3664647.3681601}
}
```
```bibtex
@article{ma2024monica,
  author  = {Ma, Heqing and Yu, Jianfei and Wang, Fanfan and Cao, Hanyu and Xia, Rui},
  title   = {From Extraction to Generation: Multimodal Emotion-Cause Pair Generation in Conversations},
  journal = {IEEE Transactions on Affective Computing},
  year    = {2024},
  pages   = {1--12},
  doi     = {10.1109/TAFFC.2024.3446646}
}
```
Acknowledgements
Our code benefits from VL-T5 and CICERO. We appreciate their valuable contributions.
