330 skills found · Page 1 of 11
salesforce / BLIP - PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
stepfun-ai / Step1X Edit - A SOTA open-source image editing model that aims to provide performance comparable to closed-source models such as GPT-4o and Gemini 2 Flash.
PKU-YuanGroup / LLaVA CoT - [ICCV 2025] LLaVA-CoT, a visual language model capable of spontaneous, systematic reasoning
cvlab-columbia / Viper - Code for the paper "ViperGPT: Visual Inference via Python Execution for Reasoning"
zhaochen0110 / Awesome Think With Images - Resources and paper list for "Thinking with Images for LVLMs". This repository accompanies our survey on how LVLMs can leverage visual information for complex reasoning, planning, and generation.
YehLi / Xmodaler - X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).
facebookresearch / Clevr Iep - Inferring and Executing Programs for Visual Reasoning
jokieleung / Awesome Visual Question Answering - A curated list of Visual Question Answering (VQA, including image/video question answering), Visual Question Generation, Visual Dialog, Visual Commonsense Reasoning, and related areas.
Alibaba-NLP / ViDoRAG - [EMNLP 2025] ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
facebookresearch / Clevr Dataset Gen - A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
Fancy-MLLM / R1 Onevision - R1-onevision, a visual language model capable of deep CoT reasoning.
jqtangust / Robust R1 - [AAAI 2026 Oral] Official Implementation of Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
rowanz / R2c - Recognition to Cognition Networks (code for the model in "From Recognition to Cognition: Visual Commonsense Reasoning", CVPR 2019)
MILVLG / Mcan Vqa - Deep Modular Co-Attention Networks for Visual Question Answering
deepcs233 / Visual CoT - [NeurIPS'24 Spotlight] Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
groundlight / R1 Vlm - Build your own visual reasoning model
Mini-o3 / Mini O3 - Official Code for "Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search"
Atomic-man007 / Awesome Multimodel LLM - Awesome_Multimodel is a curated GitHub repository that provides a comprehensive collection of resources for Multimodal Large Language Models (MLLMs). It covers datasets, tuning techniques, in-context learning, visual reasoning, foundational models, and more. Stay updated with the latest advancements.
lupantech / MathVista - MathVista: data, code, and evaluation for Mathematical Reasoning in Visual Contexts
davidmascharka / Tbd Nets - PyTorch implementation of "Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning"