om-ai-lab / VLM R1: Solve Visual Understanding with Reinforced VLMs
luigifreda / Pyslam: pySLAM is a hybrid Python/C++ Visual SLAM pipeline supporting monocular, stereo, and RGB-D cameras. It provides a broad set of modern local and global feature extractors, multiple loop-closure strategies, a volumetric reconstruction module, integrated depth-prediction models, and semantic segmentation capabilities for enhanced scene understanding.
DAMO-NLP-SG / Video LLaMA: [EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
eslint / Config Inspector: A visual tool for inspecting and understanding your ESLint flat configs.
PKU-YuanGroup / Chat UniVi: [CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
PKU-YuanGroup / UniWorld: UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
CircleRadon / Osprey: [CVPR 2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning"
photonlines / Intuitive Guide To Maxwells Equations: An intuitive and visual guide to understanding Maxwell's equations.
VUE / VUE: Visual Understanding Environment
FoundationVision / UniTok: [NeurIPS 2025 Spotlight] A Unified Tokenizer for Visual Generation and Understanding
OpenGVLab / All Seeing: [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World
jqtangust / Robust R1: 🔥🔥🔥 [AAAI 2026 Oral] Official Implementation of Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
CoreyGinnivan / Whocanuse: WhoCanUse is a tool that brings attention and understanding to how color contrast can affect different people with visual impairments.
mit-han-lab / Vila U: [ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
google-deepmind / Videoprism: Official repository for "VideoPrism: A Foundational Visual Encoder for Video Understanding" (ICML 2024)
VARGPT-family / VARGPT: VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
Open3DA / LL3DA: [CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning"; an interactive Large Language 3D Assistant.
shabie / Docformer: Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU)
SALT-NLP / LLaVAR: Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding"
jiweil / Visualizing And Understanding Neural Models In NLP: No description available