# [ECCV 2024] EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment
<a href='https://lavreniuk.github.io/EVP'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2312.08548'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/spaces/MykolaL/evp'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue'></a>
by Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, Peter Wonka
This repository contains PyTorch implementation for paper "EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment".
EVP (<ins>E</ins>nhanced <ins>V</ins>isual <ins>P</ins>erception) builds on the previous work VPD, which paved the way for using the Stable Diffusion network for computer vision tasks.

## Installation
Clone this repo, and run

```
git submodule init
git submodule update
```
Download the checkpoint of stable-diffusion (we use v1-5 by default) and put it in the checkpoints folder. Please also follow the instructions in stable-diffusion to install the required packages.
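For concreteness, the expected layout might look like the sketch below. The checkpoint filename and download link are assumptions, not taken from this repo; use the exact link given in the official stable-diffusion instructions.

```shell
# Sketch of the expected layout: the v1-5 checkpoint goes in a top-level
# checkpoints/ folder. Filename and URL below are assumptions; follow the
# official stable-diffusion instructions for the real download link.
mkdir -p checkpoints
# e.g.: wget -P checkpoints \
#   https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.ckpt
ls checkpoints
```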
## Referring Image Segmentation with EVP
EVP achieves 76.35 overall IoU and 77.61 mean IoU on the validation set of RefCOCO.
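The two numbers measure different things: overall IoU pools intersection and union counts across the whole validation set, while mean IoU averages the per-image IoU ratios. A minimal sketch of the distinction (illustrative only, not the repository's evaluation code):

```python
import numpy as np

def overall_and_mean_iou(preds, gts):
    """Overall IoU pools intersection/union counts across all images;
    mean IoU averages the per-image ratios. Masks are boolean arrays."""
    inter_total, union_total, per_image = 0, 0, []
    for pred, gt in zip(preds, gts):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        inter_total += inter
        union_total += union
        per_image.append(inter / union if union > 0 else 1.0)
    return inter_total / union_total, float(np.mean(per_image))

# Toy example: per-image IoUs are 0.5 and 0.75, but pooled IoU is 4/6.
preds = [np.array([[1, 0], [0, 0]], bool), np.array([[1, 1], [1, 0]], bool)]
gts   = [np.array([[1, 1], [0, 0]], bool), np.array([[1, 1], [1, 1]], bool)]
overall, mean = overall_and_mean_iou(preds, gts)
```

Because overall IoU is dominated by large objects while mean IoU weights every image equally, the two scores generally differ, as in the RefCOCO numbers above.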
Please check refer.md for detailed instructions on training and inference.
## Depth Estimation with EVP
EVP obtains 0.224 RMSE on the NYUv2 depth estimation benchmark, establishing a new state of the art.
|     | RMSE  | d1    | d2    | d3    | REL   | log_10 |
|-----|-------|-------|-------|-------|-------|--------|
| EVP | 0.224 | 0.976 | 0.997 | 0.999 | 0.061 | 0.027  |
EVP obtains 0.048 REL and 0.136 SqREL on the KITTI depth estimation benchmark, establishing a new state of the art.
|     | REL   | SqREL | RMSE  | RMSE log | d1    | d2    | d3    |
|-----|-------|-------|-------|----------|-------|-------|-------|
| EVP | 0.048 | 0.136 | 2.015 | 0.073    | 0.980 | 0.998 | 1.000 |
Please check depth.md for detailed instructions on training and inference.
## License
MIT License
## Acknowledgements
This code is based on stable-diffusion, mmsegmentation, LAVT, MIM-Depth-Estimation, and VPD.
## Citation
If you find our work useful in your research, please consider citing:
```
@inproceedings{lavreniuk2024evp,
  title={EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment},
  author={Mykola Lavreniuk and Shariq Farooq Bhat and Matthias M{\"u}ller and Peter Wonka},
  booktitle={European Conference on Computer Vision Workshops (ECCVW)},
  pages={206--225},
  year={2024}
}
```