# Adaptive Vision-Language Detector (AViD) 🦖
A streamlined toolkit for fine-tuning state-of-the-art vision-language detection models with parameter-efficient adaptation. Built on Grounding DINO with LoRA support and EMA stabilization.
<div align="center">
  <img src="results/after_train_0.jpg" width="45%"/>
  <img src="results/after_train_1.jpg" width="45%"/>
</div>

## Key Features 🔑
AViD extends the powerful Grounding DINO framework by adding fine-tuning capabilities for text-to-region grounding. This functionality is critical for applications requiring precise alignment between textual descriptions and image regions.

For example, given a caption like "a cat on the sofa," the model can accurately localize both the "cat" and the "sofa" in the corresponding image.
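The core idea can be illustrated with a toy sketch: each phrase embedding is matched against candidate region embeddings by cosine similarity, and the best-scoring region wins. The embeddings below are hand-made stand-ins, not features from Grounding DINO; the real model jointly learns these representations.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def ground_phrases(phrase_embs, region_embs):
    """Map each phrase to the index of its best-matching region."""
    return {
        phrase: max(range(len(region_embs)),
                    key=lambda i: cosine(emb, region_embs[i]))
        for phrase, emb in phrase_embs.items()
    }

# Hand-made stand-in embeddings: region 0 resembles "cat", region 1 "sofa".
regions = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]
phrases = {"cat": [1.0, 0.0, 0.1], "sofa": [0.0, 1.0, 0.1]}
print(ground_phrases(phrases, regions))  # {'cat': 0, 'sofa': 1}
```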
- **Fine-tuning Pipeline**: Complete workflow for fine-tuning Grounding DINO on custom datasets
- **Parameter-Efficient Training with LoRA**: Train just ~2% of parameters while maintaining performance
  - Uses rank-32 LoRA adapters by default (smaller ranks also available)
  - Significantly reduces storage requirements for fine-tuned models
- **EMA (Exponential Moving Average)**: Retains pre-trained knowledge during fine-tuning
- **Sample Dataset**: Includes a fashion dataset subset for immediate experimentation
- **Optional Phrase-Based NMS**: Removes redundant boxes for the same objects
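Phrase-based NMS differs from ordinary NMS in that it only suppresses overlapping boxes that share the same phrase, so a "shirt" box never suppresses an overlapping "bag" box. A minimal sketch of the idea (the `[x1, y1, x2, y2]` box format and 0.5 IoU threshold are illustrative assumptions, not AViD's exact implementation):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def phrase_nms(dets, iou_thr=0.5):
    """Suppress overlapping boxes only when they share the same phrase.

    dets: list of (box, score, phrase); higher score wins.
    """
    keep = []
    for box, score, phrase in sorted(dets, key=lambda d: -d[1]):
        if all(p != phrase or iou(box, b) < iou_thr for b, _, p in keep):
            keep.append((box, score, phrase))
    return keep

dets = [
    ([0, 0, 10, 10], 0.9, "shirt"),
    ([1, 1, 10, 10], 0.8, "shirt"),   # overlaps the first "shirt" box: dropped
    ([0, 0, 10, 10], 0.7, "bag"),     # different phrase: kept despite overlap
]
print(phrase_nms(dets))
```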
## Installation ⚙️

```bash
# Clone repository
git clone https://github.com/levy-tech-spark/AViD
cd AViD

# Install dependencies
pip install -r requirements.txt

# Install package (add CUDA flags if needed)
pip install -e .
```
**CUDA Configuration Tip:** For custom GPU setups, set architecture compatibility:

```bash
nvidia-smi --query-gpu=gpu_name,compute_cap --format=csv
export TORCH_CUDA_ARCH_LIST="<your-arch>"
export FORCE_CUDA=1
```

If you have an older GPU, or the architecture is not recognized automatically:

```bash
# Check that CUDA_HOME is set correctly,
# e.g. export CUDA_HOME=/usr/local/cuda

# Look up your GPU's compute capability
nvidia-smi --query-gpu=gpu_name,compute_cap --format=csv

# Add the compute capability reported by the previous command
export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5;8.0;8.6"
export FORCE_CUDA=1
```
## Quick Start 🚦

1. **Get the sample dataset**

   ```bash
   gdown https://drive.google.com/uc?id=1D2qphEE98Dloo3fUURRnsxaIRw076ZXX
   unzip fashion_dataset_subset.zip -d multimodal-data
   ```

2. **Start training (LoRA example)**

   ```bash
   python train.py --config configs/train_config.yaml
   ```

3. **Evaluate the model**

   ```bash
   python test.py --config configs/test_config.yaml
   ```
## Configuration Guide ⚙️

Customize training through YAML configs:

```yaml
# Example config snippet
training:
  num_epochs: 200
  learning_rate: 1e-4
  use_lora: true
  lora_rank: 32

data:
  batch_size: 8
  num_workers: 4
```
## Performance Highlights 📈

| Metric          | Baseline | Fine-Tuned |
|-----------------|----------|------------|
| mAP@0.5 (Shirt) | 0.62     | 0.89       |
| mAP@0.5 (Pants) | 0.58     | 0.85       |
| mAP@0.5 (Bag)   | 0.65     | 0.91       |
## Advanced Features 🧪

### Parameter-Efficient LoRA

```yaml
# Enable in config
training:
  use_lora: true
  lora_rank: 16  # Reduce for higher compression
```
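To see why LoRA trains only a small fraction of the weights: a rank-`r` adapter on a `d_in x d_out` linear layer adds `r * (d_in + d_out)` parameters (the two low-rank factors) while the full layer has `d_in * d_out`. The layer count, layer sizes, and total model size below are hypothetical, chosen only to illustrate the arithmetic, not measured from AViD:

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical setup: adapt 48 attention projections of size 256x256
# inside a model with ~170M total parameters.
total_params = 170_000_000
adapted = [(256, 256)] * 48

for rank in (16, 32):
    extra = sum(lora_extra_params(di, do, rank) for di, do in adapted)
    print(f"rank {rank}: {extra:,} trainable params "
          f"({extra / total_params:.2%} of the full model)")
```

The trainable fraction scales linearly with the rank, which is why dropping `lora_rank` from 32 to 16 halves the adapter size.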
### EMA Stabilization

```python
# Automatic in training loop
model = ModelEMA(model, decay=0.999)
```
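The EMA shadow weights follow the standard update `ema = decay * ema + (1 - decay) * param`, so the averaged model drifts slowly toward the fine-tuned weights while retaining pre-trained behavior. A minimal pure-Python sketch of the update rule (`SimpleEMA` is illustrative, not the actual `ModelEMA` class):

```python
class SimpleEMA:
    """Minimal sketch of exponential-moving-average weight tracking."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = list(params)  # copy of the initial weights

    def update(self, params):
        # Standard EMA update applied element-wise to every weight.
        self.shadow = [
            self.decay * s + (1.0 - self.decay) * p
            for s, p in zip(self.shadow, params)
        ]

# With decay=0.9 the shadow weight moves only 10% of the way per step.
ema = SimpleEMA([1.0], decay=0.9)
for step_weight in (2.0, 2.0, 2.0):  # weight after three optimizer steps
    ema.update([step_weight])
print(round(ema.shadow[0], 4))  # 1.271 — still far from the raw 2.0
```

A decay close to 1 (such as the 0.999 used above) makes the averaged weights change very slowly, which is what preserves pre-trained knowledge during fine-tuning.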
### Interactive Demo

```bash
python demo/gradio_app.py --share
```
## Contribution & Roadmap 🤝

**Current Priorities:**
- [x] Add LoRA for efficient fine-tuning
- [x] Add comprehensive model evaluation metrics
- [ ] Implement techniques to prevent catastrophic forgetting
- [ ] Add auxiliary losses as described in the original paper
- [ ] Quantization support
- [ ] Distributed training
- [ ] HuggingFace integration
**How to Contribute:**

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## Model Evaluation 📊
AViD includes a comprehensive evaluation framework for measuring detection performance:
- Metrics: mAP, Precision, Recall, F1 Score with customizable IoU thresholds
- Visualizations: Automated generation of prediction vs. ground truth visualizations
- Integration: Evaluate during training or as a standalone process
- Reporting: Detailed metrics reports with per-class breakdown
For detailed information on using the evaluation tools, see EVALUATION.md.
```bash
# Run standalone evaluation
python evaluate.py --config configs/evaluation_config.yaml
```
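At a single IoU threshold such as 0.5, precision and recall come from greedily matching score-ranked predictions to ground-truth boxes: a prediction counts as a true positive when it overlaps an as-yet-unmatched ground-truth box with IoU at or above the threshold. A self-contained sketch of that matching (not the repo's evaluator):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def precision_recall(preds, gts, iou_thr=0.5):
    """Greedily match score-sorted predictions to ground truth.

    preds: list of (box, score); gts: list of boxes.
    """
    matched, tp = set(), 0
    for box, _ in sorted(preds, key=lambda p: -p[1]):
        # Best unmatched ground-truth box for this prediction.
        best = max((i for i in range(len(gts)) if i not in matched),
                   key=lambda i: iou(box, gts[i]), default=None)
        if best is not None and iou(box, gts[best]) >= iou_thr:
            matched.add(best)
            tp += 1
    return tp / len(preds), tp / len(gts)

gts = [[0, 0, 10, 10], [20, 20, 30, 30]]
preds = [([1, 1, 10, 10], 0.9),    # good match for the first GT box
         ([50, 50, 60, 60], 0.4)]  # false positive
print(precision_recall(preds, gts))  # (0.5, 0.5)
```

mAP then averages the area under the precision-recall curve over classes (and, for stricter variants, over IoU thresholds).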
## License 📜
This project is licensed under the MIT License - see the LICENSE file for details.