11 skills found
inclusionAI / UI Venus: UI-Venus is a native UI agent designed to perform precise GUI element grounding and effective navigation using only screenshots as input.
FlagOpen / RoboBrain2.5: Advanced version of RoboBrain. Depth in Sight, Time in Mind.
JIA-Lab-research / Seg Zero: Project page for "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement".
jqtangust / Robust R1: [AAAI 2026 Oral] Official implementation of Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding.
Atomic-man007 / Awesome Multimodel LLM: A curated repository providing a comprehensive collection of resources for Multimodal Large Language Models (MLLMs), covering datasets, tuning techniques, in-context learning, visual reasoning, foundational models, and more. Stay updated with the latest advancements.
sun-hailong / TVC: [ACL 2025] Code repository for "Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning", in PyTorch.
theboringhumane / EchoOLlama: 🦙 A real-time voice AI platform powered by local LLMs. Features WebSocket streaming, voice interactions, and OpenAI API compatibility. Built with FastAPI, Redis, and PostgreSQL. Suited to private AI conversations and custom voice assistants.
BIGBALLON / UME Search: Toward Universal Multimodal Embedding.
xinyanghuang7 / Basic Visual Language Model: Build a simple, basic multimodal large model from scratch.
zhangguanghao523 / CMMCoT: [AAAI'26] Official implementation of CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation.
SufyanDanish / VLM Survey: A comprehensive survey of Vision–Language Models: pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets.