2,688 skills found
huggingface / transformers
  🤗 Transformers: the model-definition framework for state-of-the-art text, vision, audio, and multimodal models, for both inference and training. (A minimal usage sketch follows this list.)
mudler / LocalAI
  LocalAI is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.
pytorch / vision
  Datasets, transforms and models specific to computer vision.
getomni-ai / zerox
  OCR & document extraction using vision models.
vikhyat / moondream
  Tiny vision language model.
roboflow / notebooks
  A collection of tutorials on state-of-the-art computer vision models and techniques. Explore everything from foundational architectures like ResNet to cutting-edge models like RF-DETR, YOLO11, SAM 3, and Qwen3-VL.
rednote-hilab / dots.ocr
  Multilingual document layout parsing in a single vision-language model.
GetStream / Vision-Agents
  Open Vision Agents by Stream. Build vision agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.
apple / ml-fastvlm
  This repository contains the official implementation of "FastVLM: Efficient Vision Encoding for Vision Language Models" (CVPR 2025).
QwenLM / Qwen-VL
  The official repo of Qwen-VL (通义千问-VL), the chat & pretrained large vision language model proposed by Alibaba Cloud.
deepseek-ai / DeepSeek-VL2
  DeepSeek-VL2: Mixture-of-Experts vision-language models for advanced multimodal understanding.
unslothai / notebooks
  250+ fine-tuning & RL notebooks for text, vision, audio, embedding, and TTS models.
Deci-AI / super-gradients
  Easily train or fine-tune SOTA computer vision models with one open-source training library. The home of YOLO-NAS.
joanrod / star-vector
  StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.
QwenLM / Qwen2.5-Omni
  Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and of performing real-time speech generation.
hustvl / Vim
  [ICML 2024] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model.
NVlabs / VILA
  VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
MiniMax-AI / MiniMax-01
  The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model based on linear attention.
JIA-Lab-research / MGM
  Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models".
VainF / Torch-Pruning
  [CVPR 2023] DepGraph: Towards Any Structural Pruning; LLMs, vision foundation models, etc.
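Most of the entries above ship a Python API. As a quick orientation, here is a minimal sketch of running a vision model for image captioning through the Hugging Face Transformers pipeline API (the first entry in this list). The checkpoint id and the input filename are example placeholders, not part of any listing above; any image-to-text checkpoint from the Hub should slot in the same way.

    # Minimal sketch: image captioning via the Transformers pipeline API.
    # Assumes `pip install transformers pillow torch`.
    from transformers import pipeline

    captioner = pipeline(
        "image-to-text",
        model="Salesforce/blip-image-captioning-base",  # example checkpoint, an assumption
    )

    result = captioner("photo.jpg")  # placeholder: local path or URL to an image
    print(result[0]["generated_text"])  # the pipeline returns a list of dicts

The same pipeline call pattern (task string plus checkpoint id) covers many of the model families listed here, though the larger vision-language models typically document their own loading recipes in their repos.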