# Adaptive Vision-Language Detector (AViD) 🦖
A streamlined toolkit for fine-tuning state-of-the-art vision-language detection models with parameter-efficient adaptation. Built on Grounding DINO with LoRA support and EMA stabilization.
<div align="center">
  <img src="results/after_train_0.jpg" width="45%"/>
  <img src="results/after_train_1.jpg" width="45%"/>
</div>

## Key Features 🔑
AViD extends the powerful Grounding DINO framework by adding fine-tuning capabilities for text-to-region grounding. This functionality is critical for applications requiring precise alignment between textual descriptions and image regions.

For example, given a caption like "a cat on the sofa," the model can accurately localize both the "cat" and the "sofa" in the corresponding image.
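The core idea can be illustrated with a toy sketch: each phrase embedding is matched against candidate region embeddings by cosine similarity, and the best-scoring region wins. The embeddings below are hand-made stand-ins, not features from Grounding DINO; the real model jointly learns these representations.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def ground_phrases(phrase_embs, region_embs):
    """Map each phrase to the index of its best-matching region."""
    return {
        phrase: max(range(len(region_embs)),
                    key=lambda i: cosine(emb, region_embs[i]))
        for phrase, emb in phrase_embs.items()
    }

# Hand-made stand-in embeddings: region 0 resembles "cat", region 1 "sofa".
regions = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]
phrases = {"cat": [1.0, 0.0, 0.1], "sofa": [0.0, 1.0, 0.1]}
print(ground_phrases(phrases, regions))  # {'cat': 0, 'sofa': 1}
```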
- **Fine-tuning Pipeline**: Complete workflow for fine-tuning Grounding DINO on custom datasets
- **Parameter-Efficient Training with LoRA**: Train just ~2% of parameters while maintaining performance
  - Uses rank-32 LoRA adapters by default (smaller ranks also available)
  - Significantly reduces storage requirements for fine-tuned models
- **EMA (Exponential Moving Average)**: Retains pre-trained knowledge during fine-tuning
- **Sample Dataset**: Includes a fashion dataset subset for immediate experimentation
- **Optional Phrase-Based NMS**: Removes redundant boxes for the same objects
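Phrase-based NMS differs from ordinary NMS in that it only suppresses overlapping boxes that share the same phrase, so a "shirt" box never suppresses an overlapping "bag" box. A minimal sketch of the idea (the `[x1, y1, x2, y2]` box format and 0.5 IoU threshold are illustrative assumptions, not AViD's exact implementation):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def phrase_nms(dets, iou_thr=0.5):
    """Suppress overlapping boxes only when they share the same phrase.

    dets: list of (box, score, phrase); higher score wins.
    """
    keep = []
    for box, score, phrase in sorted(dets, key=lambda d: -d[1]):
        if all(p != phrase or iou(box, b) < iou_thr for b, _, p in keep):
            keep.append((box, score, phrase))
    return keep

dets = [
    ([0, 0, 10, 10], 0.9, "shirt"),
    ([1, 1, 10, 10], 0.8, "shirt"),   # overlaps the first "shirt" box: dropped
    ([0, 0, 10, 10], 0.7, "bag"),     # different phrase: kept despite overlap
]
print(phrase_nms(dets))
```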
## Installation ⚙️

```bash
# Clone repository
git clone https://github.com/levy-tech-spark/AViD
cd AViD

# Install dependencies
pip install -r requirements.txt

# Install package (add CUDA flags if needed)
pip install -e .
```
**CUDA Configuration Tip:** For custom GPU setups, set architecture compatibility:

```bash
nvidia-smi --query-gpu=gpu_name,compute_cap --format=csv
export TORCH_CUDA_ARCH_LIST="<your-arch>"
export FORCE_CUDA=1
```

If you have an older GPU, or the architecture is not recognized automatically:

```bash
# Check that CUDA_HOME is set correctly,
# e.g. export CUDA_HOME=/usr/local/cuda

# Look up your GPU's compute capability
nvidia-smi --query-gpu=gpu_name,compute_cap --format=csv

# Add the compute capability reported by the previous command
export TORCH_CUDA_ARCH_LIST="6.0;6.1;7.0;7.5;8.0;8.6"
export FORCE_CUDA=1
```
## Quick Start 🚦

1. **Get the sample dataset**

   ```bash
   gdown https://drive.google.com/uc?id=1D2qphEE98Dloo3fUURRnsxaIRw076ZXX
   unzip fashion_dataset_subset.zip -d multimodal-data
   ```

2. **Start training (LoRA example)**

   ```bash
   python train.py --config configs/train_config.yaml
   ```

3. **Evaluate the model**

   ```bash
   python test.py --config configs/test_config.yaml
   ```
## Configuration Guide ⚙️

Customize training through YAML configs:

```yaml
# Example config snippet
training:
  num_epochs: 200
  learning_rate: 1e-4
  use_lora: true
  lora_rank: 32

data:
  batch_size: 8
  num_workers: 4
```
## Performance Highlights 📈

| Metric          | Baseline | Fine-Tuned |
|-----------------|----------|------------|
| mAP@0.5 (Shirt) | 0.62     | 0.89       |
| mAP@0.5 (Pants) | 0.58     | 0.85       |
| mAP@0.5 (Bag)   | 0.65     | 0.91       |
## Advanced Features 🧪

### Parameter-Efficient LoRA

```yaml
# Enable in config
training:
  use_lora: true
  lora_rank: 16  # Reduce for higher compression
```
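To see why LoRA trains only a small fraction of the weights: a rank-`r` adapter on a `d_in x d_out` linear layer adds `r * (d_in + d_out)` parameters (the two low-rank factors) while the full layer has `d_in * d_out`. The layer count, layer sizes, and total model size below are hypothetical, chosen only to illustrate the arithmetic, not measured from AViD:

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical setup: adapt 48 attention projections of size 256x256
# inside a model with ~170M total parameters.
total_params = 170_000_000
adapted = [(256, 256)] * 48

for rank in (16, 32):
    extra = sum(lora_extra_params(di, do, rank) for di, do in adapted)
    print(f"rank {rank}: {extra:,} trainable params "
          f"({extra / total_params:.2%} of the full model)")
```

The trainable fraction scales linearly with the rank, which is why dropping `lora_rank` from 32 to 16 halves the adapter size.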
### EMA Stabilization

```python
# Automatic in training loop
model = ModelEMA(model, decay=0.999)
```
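The EMA shadow weights follow the standard update `ema = decay * ema + (1 - decay) * param`, so the averaged model drifts slowly toward the fine-tuned weights while retaining pre-trained behavior. A minimal pure-Python sketch of the update rule (`SimpleEMA` is illustrative, not the actual `ModelEMA` class):

```python
class SimpleEMA:
    """Minimal sketch of exponential-moving-average weight tracking."""

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = list(params)  # copy of the initial weights

    def update(self, params):
        # Standard EMA update applied element-wise to every weight.
        self.shadow = [
            self.decay * s + (1.0 - self.decay) * p
            for s, p in zip(self.shadow, params)
        ]

# With decay=0.9 the shadow weight moves only 10% of the way per step.
ema = SimpleEMA([1.0], decay=0.9)
for step_weight in (2.0, 2.0, 2.0):  # weight after three optimizer steps
    ema.update([step_weight])
print(round(ema.shadow[0], 4))  # 1.271 — still far from the raw 2.0
```

A decay close to 1 (such as the 0.999 used above) makes the averaged weights change very slowly, which is what preserves pre-trained knowledge during fine-tuning.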
### Interactive Demo

```bash
python demo/gradio_app.py --share
```
## Contribution & Roadmap 🤝

**Current Priorities:**
- [x] Add LoRA for efficient fine-tuning
- [x] Add comprehensive model evaluation metrics
- [ ] Implement techniques to prevent catastrophic forgetting
- [ ] Add auxiliary losses as described in the original paper
- [ ] Quantization support
- [ ] Distributed training
- [ ] HuggingFace integration
**How to Contribute:**

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## Model Evaluation 📊
AViD includes a comprehensive evaluation framework for measuring detection performance:
- Metrics: mAP, Precision, Recall, F1 Score with customizable IoU thresholds
- Visualizations: Automated generation of prediction vs. ground truth visualizations
- Integration: Evaluate during training or as a standalone process
- Reporting: Detailed metrics reports with per-class breakdown
For detailed information on using the evaluation tools, see EVALUATION.md.
```bash
# Run standalone evaluation
python evaluate.py --config configs/evaluation_config.yaml
```
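At a single IoU threshold such as 0.5, precision and recall come from greedily matching score-ranked predictions to ground-truth boxes: a prediction counts as a true positive when it overlaps an as-yet-unmatched ground-truth box with IoU at or above the threshold. A self-contained sketch of that matching (not the repo's evaluator):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def precision_recall(preds, gts, iou_thr=0.5):
    """Greedily match score-sorted predictions to ground truth.

    preds: list of (box, score); gts: list of boxes.
    """
    matched, tp = set(), 0
    for box, _ in sorted(preds, key=lambda p: -p[1]):
        # Best unmatched ground-truth box for this prediction.
        best = max((i for i in range(len(gts)) if i not in matched),
                   key=lambda i: iou(box, gts[i]), default=None)
        if best is not None and iou(box, gts[best]) >= iou_thr:
            matched.add(best)
            tp += 1
    return tp / len(preds), tp / len(gts)

gts = [[0, 0, 10, 10], [20, 20, 30, 30]]
preds = [([1, 1, 10, 10], 0.9),    # good match for the first GT box
         ([50, 50, 60, 60], 0.4)]  # false positive
print(precision_recall(preds, gts))  # (0.5, 0.5)
```

mAP then averages the area under the precision-recall curve over classes (and, for stricter variants, over IoU thresholds).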
## License 📜
This project is licensed under the MIT License - see the LICENSE file for details.