14 skills found
alephpi / Texo: A minimalist SOTA LaTeX OCR model with only 20M parameters that runs in the browser. The full training pipeline is open-sourced for self-reproduction.
lucidrains / MaMMUT Pytorch: Implementation of MaMMUT, Google's simple vision-encoder text-decoder architecture for multimodal tasks, in PyTorch.
microsoft / Encoder Decoder Slm: Efficient encoder-decoder architecture for small language models (≤1B parameters) with cross-architecture knowledge distillation and vision-language capabilities.
arturxe2 / T DEED: PyTorch implementation of "T-DEED: Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in Sports Videos", 10th International Workshop on Computer Vision in Sports (CVsports) at CVPR 2024.
ShmuelRonen / ComfyUI Pixtral Large: A ComfyUI custom node that integrates Mistral AI's Pixtral Large vision model, enabling powerful multimodal AI capabilities within ComfyUI. Pixtral Large is a 124B-parameter model (123B decoder + 1B vision encoder) that can analyze up to 30 high-resolution images simultaneously.
Hamtech-ai / Persian Image Captioning: A Persian image captioning model based on the Vision Encoder Decoder models of 🤗 Transformers.
d-gurgurov / Im2latex: A formula recognition model (im2latex) based on the Vision Encoder Decoder Model.
Quanato607 / XLSTM HVED: [ISBI 2025] XLSTM-HVED: Cross-Modal Brain Tumor Segmentation and MRI Reconstruction Method Using Vision XLSTM and Heteromodal Variational Encoder-Decoder.
JiaqiLi404 / Super Resolution DINO: Applies Meta AI's large pre-trained vision model DINOv2 to feature-point matching, with a ViT decoder used as an autoencoder.
mzbac / Image2dsl: Implementation of an image-to-DSL (Domain-Specific Language) model. A pre-trained Vision Transformer (ViT) encoder extracts image features, and a custom Transformer decoder generates DSL code from them.
inuwamobarak / Depth Estimation DPT: Implementation of Depth Prediction Transformers (DPT), a deep learning model for accurate depth estimation in computer vision tasks. DPT combines the transformer architecture with an encoder-decoder framework to capture fine-grained detail, model long-range dependencies, and generate precise depth predictions.
donahowe / VE MLD: Vision Transformer encoder with a multilabel decoder for multilabel image classification.
IshanTiwari-030800 / Social Media Caption Generation: Generates social media captions from an image. A Vision Encoder-Decoder model produces descriptive captions, which are then refined by the LLaMA 3 3B Instruct model with carefully designed prompts into rich, creative final captions.
dhavalthakur / Image Caption Generation Using Deep Learning 14k Flickr Dataset: In the 2020 setting of quarantine and work-from-home, internet usage surged; groceries, medicines, and more can be ordered at the touch of a screen. Not everyone can use these services seamlessly, however: people with impaired vision may find it frustrating to distinguish, say, blueberries from grapes. This project builds a neural network that helps this demographic by going beyond object detection to describe an image in a short English sentence. The model combines two networks: a CNN (Convolutional Neural Network) for image processing and an LSTM (Long Short-Term Memory), a type of recurrent network, for text generation. A subset of 14,000 images and their captions was selected from the Flickr_30K dataset. Generated captions are evaluated by human judgement and the BLEU-1 score. The model was trained and tested with several variations: pre-trained GloVe embeddings, different dropout and regularizer rates, and two image feature extractors, Xception and VGG16. The most relevant and fitting captions came from Xception features with an encoder-decoder architecture, while the highest BLEU-1 scores (above 0.5 on a 0-to-1 scale) came from VGG16 with GloVe embeddings.
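Several of the items above (Persian Image Captioning, Im2latex, Social Media Caption Generation, the Flickr captioning project) share one pattern: an image encoder pools pixels into a feature vector, and an autoregressive decoder emits tokens conditioned on it. A minimal NumPy sketch of that pattern follows; the vocabulary, weights, and function names are illustrative stand-ins, not taken from any listed repo, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<bos>", "a", "cat", "on", "the", "mat", "<eos>"]

def encode(image):
    # Stand-in for a ViT/CNN encoder: pool an H x W x C image
    # down to a C-dimensional feature vector.
    return image.mean(axis=(0, 1))

def decode_step(features, prev_token_id, W, E):
    # Stand-in for one autoregressive decoder step: logits come from
    # the image features plus an embedding of the previous token.
    logits = W @ features + E[prev_token_id]
    return int(np.argmax(logits))

def caption(image, W, E, max_len=8):
    # Greedy decoding: feed each predicted token back in until
    # <eos> or the length limit.
    feats = encode(image)
    token, words = VOCAB.index("<bos>"), []
    for _ in range(max_len):
        token = decode_step(feats, token, W, E)
        if VOCAB[token] == "<eos>":
            break
        words.append(VOCAB[token])
    return " ".join(words)

# Untrained random weights; a real model learns these end to end.
C = 3
W = rng.normal(size=(len(VOCAB), C))      # features -> logits
E = rng.normal(size=(len(VOCAB), len(VOCAB)))  # token embeddings
image = rng.random((4, 4, C))
print(caption(image, W, E))
```

The real systems above swap in a ViT or CNN for `encode`, a Transformer or LSTM for `decode_step`, and beam search or sampling for the greedy loop, but the control flow is the same.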