# [ECCV 2024] EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment
<a href='https://lavreniuk.github.io/EVP'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2312.08548'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/spaces/MykolaL/evp'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue'></a>
by Mykola Lavreniuk, Shariq Farooq Bhat, Matthias Müller, Peter Wonka
This repository contains PyTorch implementation for paper "EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment".
EVP (<ins>E</ins>nhanced <ins>V</ins>isual <ins>P</ins>erception) builds on the previous work VPD, which paved the way for using the Stable Diffusion network for computer vision tasks.

## Installation
Clone this repo, and run

```
git submodule init
git submodule update
```
Download the checkpoint of stable-diffusion (we use v1-5 by default) and put it in the checkpoints folder. Please also follow the instructions in stable-diffusion to install the required packages.
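For concreteness, the expected layout might look like the sketch below. The checkpoint filename and download link are assumptions, not taken from this repo; use the exact link given in the official stable-diffusion instructions.

```shell
# Sketch of the expected layout: the v1-5 checkpoint goes in a top-level
# checkpoints/ folder. Filename and URL below are assumptions; follow the
# official stable-diffusion instructions for the real download link.
mkdir -p checkpoints
# e.g.: wget -P checkpoints \
#   https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.ckpt
ls checkpoints
```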
## Referring Image Segmentation with EVP
EVP achieves 76.35 overall IoU and 77.61 mean IoU on the validation set of RefCOCO.
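The two numbers measure different things: overall IoU pools intersection and union counts across the whole validation set, while mean IoU averages the per-image IoU ratios. A minimal sketch of the distinction (illustrative only, not the repository's evaluation code):

```python
import numpy as np

def overall_and_mean_iou(preds, gts):
    """Overall IoU pools intersection/union counts across all images;
    mean IoU averages the per-image ratios. Masks are boolean arrays."""
    inter_total, union_total, per_image = 0, 0, []
    for pred, gt in zip(preds, gts):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        inter_total += inter
        union_total += union
        per_image.append(inter / union if union > 0 else 1.0)
    return inter_total / union_total, float(np.mean(per_image))

# Toy example: per-image IoUs are 0.5 and 0.75, but pooled IoU is 4/6.
preds = [np.array([[1, 0], [0, 0]], bool), np.array([[1, 1], [1, 0]], bool)]
gts   = [np.array([[1, 1], [0, 0]], bool), np.array([[1, 1], [1, 1]], bool)]
overall, mean = overall_and_mean_iou(preds, gts)
```

Because overall IoU is dominated by large objects while mean IoU weights every image equally, the two scores generally differ, as in the RefCOCO numbers above.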
Please check refer.md for detailed instructions on training and inference.
## Depth Estimation with EVP
EVP obtains 0.224 RMSE on the NYUv2 depth estimation benchmark, establishing a new state of the art.
|     | RMSE  | d1    | d2    | d3    | REL   | log_10 |
|-----|-------|-------|-------|-------|-------|--------|
| EVP | 0.224 | 0.976 | 0.997 | 0.999 | 0.061 | 0.027  |
EVP obtains 0.048 REL and 0.136 SqREL on the KITTI depth estimation benchmark, establishing a new state of the art.
|     | REL   | SqREL | RMSE  | RMSE log | d1    | d2    | d3    |
|-----|-------|-------|-------|----------|-------|-------|-------|
| EVP | 0.048 | 0.136 | 2.015 | 0.073    | 0.980 | 0.998 | 1.000 |
Please check depth.md for detailed instructions on training and inference.
## License
MIT License
## Acknowledgements
This code is based on stable-diffusion, mmsegmentation, LAVT, MIM-Depth-Estimation, and VPD.
## Citation
If you find our work useful in your research, please consider citing:
```
@inproceedings{lavreniuk2024evp,
  title={EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment},
  author={Mykola Lavreniuk and Shariq Farooq Bhat and Matthias M{\"u}ller and Peter Wonka},
  booktitle={European Conference on Computer Vision Workshops (ECCVW)},
  pages={206--225},
  year={2024}
}
```