简体中文 | English
<font color=E7595C>I</font><font color=F6C446>I</font><font color=00C7EE>A</font><font color=00D465>Net</font>: An <font color=E7595C>I</font>ntra- and <font color=F6C446>I</font>nter-Modality <font color=00C7EE>A</font>ttention <font color=00D465>Net</font>work for Audio-Visual Speech Separation
- Kai Li [1], Runxuan Yang [1], Fuchun Sun [1], Xiaolin Hu [1,2]

[1] Tsinghua University, [2] Chinese Institute for Brain Research
This repository is the official implementation of IIANet, accepted at ICML 2024 (Poster).
✨Key Highlights:

- We propose an attention-based cross-modal speech separation network called IIANet, which extensively uses intra-attention (IntraA) and inter-attention (InterA) mechanisms within and across the speech and video modalities (a generic sketch of such cross-modal attention follows this list).
- Compared with existing CNN- and Transformer-based methods, IIANet achieves significantly better separation quality on three audio-visual speech separation datasets while greatly reducing computational complexity and memory usage.
- A faster variant, IIANet-fast, surpasses CTCNet by 1.1 dB on the challenging LRS2 dataset with only 11% of CTCNet's MACs.
- Qualitative evaluations on real-world YouTube scenarios show that IIANet generates higher-quality separated speech than other separation models.
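As a rough illustration of the cross-modal attention mentioned above, here is a generic PyTorch sketch in which audio frames query video frames. This is not the paper's exact IntraA/InterA design; every shape and dimension below is hypothetical.

```python
import torch
import torch.nn as nn

# Generic cross-modal attention: audio frames query video frames.
# This is an illustrative sketch, NOT the paper's IntraA/InterA blocks;
# every dimension below is hypothetical.
dim, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=n_heads, batch_first=True)

audio = torch.randn(2, 200, dim)  # (batch, audio frames, channels)
video = torch.randn(2, 50, dim)   # (batch, video frames, channels)

# Each audio frame gathers speaker cues from the video stream.
fused, _ = cross_attn(query=audio, key=video, value=video)
print(fused.shape)  # torch.Size([2, 200, 512])
```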
🚀Overall Pipeline

🪢IIANet Architecture

🔧Installation
- Clone the repository:

```bash
git clone https://github.com/JusperLee/IIANet.git
cd IIANet/
```

- Create and activate the conda environment:

```bash
conda create -n iianet python=3.8
conda activate iianet
```

- Install PyTorch and torchvision following the official instructions. The code requires `python>=3.8`, `pytorch>=1.11`, `torchvision>=0.13`.

- Install other dependencies:

```bash
pip install -r requirements.txt
```
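As a quick sanity check, the minimal sketch below prints the installed versions against the minimums listed above. It is a hypothetical helper, not a script shipped with the repository.

```python
# check_env.py -- hypothetical helper, not part of the repository.
# Prints interpreter and library versions against the README's minimums.
import sys

import torch
import torchvision

print(f"python      {sys.version.split()[0]}")
print(f"pytorch     {torch.__version__}")
print(f"torchvision {torchvision.__version__}")
print("CUDA available:", torch.cuda.is_available())

assert sys.version_info >= (3, 8), "python>=3.8 required"
```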
📊Model Performance
We evaluate IIANet and its fast variant IIANet-fast on three datasets: LRS2, LRS3, and VoxCeleb2. The results show that IIANet achieves significantly better speech separation quality than existing methods while maintaining high efficiency.

| Method | Dataset | SI-SNRi (dB) | SDRi (dB) | PESQ | Params (M) | MACs (G) | GPU Infer Time | Download |
|:---:|:-----:|:------:|:----:|:----:|:------:|:-----:|:-----------:|:----:|
| IIANet | LRS2 | 16.0 | 16.2 | 3.23 | 3.1 | 18.6 | 110.11 ms | Config/Model |
| IIANet | LRS3 | 18.3 | 18.5 | 3.28 | 3.1 | 18.6 | 110.11 ms | Config/Model |
| IIANet | VoxCeleb2 | 13.6 | 14.3 | 3.12 | 3.1 | 18.6 | 110.11 ms | Config/Model |
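SI-SNRi and SDRi report the improvement of the separated signal over the unprocessed mixture, and PESQ measures perceptual quality. For reference, here is a minimal sketch of the standard SI-SNR definition; it is not necessarily the repository's exact evaluation code.

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for 1-D signals (standard definition,
    not necessarily the repository's exact evaluation code)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to remove scale differences.
    target = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))

# SI-SNRi = si_snr(separated, clean) - si_snr(mixture, clean)
```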
💥Real-world Evaluation
For single-video inference, please refer to `inference.py`:

```bash
# Inference on a single video
# You can modify the video path in inference.py
python inference.py
```
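Real-world clips (e.g., downloaded from YouTube) often arrive at arbitrary sample rates. Below is a hypothetical preprocessing helper that re-encodes a clip to 16 kHz mono audio, matching the `sample_rate` used in the training config; the 25 fps video rate is our assumption based on LRS-style data, and the file names are placeholders.

```python
# prepare_clip.py -- hypothetical preprocessing helper, not part of the repo.
# Re-encodes a downloaded clip to 16 kHz mono audio; 25 fps is an assumption
# based on LRS-style preprocessing, not a documented requirement.
import subprocess

def prepare(src: str, dst: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-r", "25",      # video frame rate (assumed)
         "-ar", "16000",  # audio sample rate, matches sample_rate: 16000
         "-ac", "1",      # mono audio
         dst],
        check=True,
    )

if __name__ == "__main__":
    prepare("downloaded_clip.mp4", "clip_16k_25fps.mp4")
    # Then point the video path inside inference.py at clip_16k_25fps.mp4
    # and run: python inference.py
```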
📚Training
Before starting training, please modify the parameter configurations in configs.
A simple example of a training configuration:

```yaml
data_config:
  train_dir: DataPreProcess/LRS2/tr
  valid_dir: DataPreProcess/LRS2/cv
  test_dir: DataPreProcess/LRS2/tt
  n_src: 1
  sample_rate: 16000
  segment: 2.0
  normalize_audio: false
  batch_size: 3
  num_workers: 24
  pin_memory: true
  persistent_workers: false
```
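For reference, `segment` is in seconds, so each training example covers `segment * sample_rate` audio samples. A minimal sketch of reading these fields, assuming PyYAML is installed and the config layout shown above:

```python
# Minimal sketch: inspect the data settings of a training config.
# Assumes PyYAML and the data_config layout from the example above.
import yaml

with open("configs/LRS2-IIANet.yml") as f:
    conf = yaml.safe_load(f)

data = conf["data_config"]
samples_per_segment = int(data["segment"] * data["sample_rate"])
print(samples_per_segment)  # 2.0 s * 16000 Hz = 32000 samples per training chunk
```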
Use the following commands to start training:

```bash
python train.py --conf_dir configs/LRS2-IIANet.yml
python train.py --conf_dir configs/LRS3-IIANet.yml
python train.py --conf_dir configs/Vox2-IIANet.yml
```
📈Testing/Inference
To evaluate a model on one or more GPUs, set `CUDA_VISIBLE_DEVICES` and point `--conf_dir` at the configuration file saved alongside the desired checkpoint:

```bash
python test.py --conf_dir checkpoints/lrs2/conf.yml
python test.py --conf_dir checkpoints/lrs3/conf.yml
python test.py --conf_dir checkpoints/vox2/conf.yml
```
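For example, to restrict evaluation to a single GPU, set the standard CUDA environment variable before launching. The sketch below is a hypothetical launcher script; `test.py` and the checkpoint path come from the commands above.

```python
# run_eval.py -- hypothetical launcher; equivalent to the shell command
#   CUDA_VISIBLE_DEVICES=0 python test.py --conf_dir checkpoints/lrs2/conf.yml
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="0")  # expose only GPU 0
subprocess.run(
    ["python", "test.py", "--conf_dir", "checkpoints/lrs2/conf.yml"],
    env=env,
    check=True,
)
```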
💡Future Work
- Validate the effectiveness and robustness of IIANet on larger-scale datasets such as AVSpeech.
- Further optimize the architecture and training strategies of IIANet to improve speech separation quality while reducing computational costs.
- Explore applications of IIANet in other multimodal tasks, such as speech enhancement and speaker recognition.
📜Citation
If you find our work helpful, please consider citing:

```bibtex
@inproceedings{lee2024iianet,
  title={IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation},
  author={Kai Li and Runxuan Yang and Fuchun Sun and Xiaolin Hu},
  booktitle={International Conference on Machine Learning},
  year={2024}
}
```