14 skills found
alephpi / Texo: A minimalist SOTA LaTeX OCR model with only 20M parameters that runs in the browser. The full training pipeline is open-sourced for self-reproduction.
lucidrains / MaMMUT Pytorch: Implementation of MaMMUT, Google's simple vision-encoder text-decoder architecture for multimodal tasks, in PyTorch.
microsoft / Encoder Decoder Slm: Efficient encoder-decoder architecture for small language models (≤1B parameters) with cross-architecture knowledge distillation and vision-language capabilities.
arturxe2 / T DEED: PyTorch implementation of "T-DEED: Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in Sports Videos", 10th International Workshop on Computer Vision in Sports (CVsports) at CVPR 2024.
ShmuelRonen / ComfyUI Pixtral Large: A ComfyUI custom node that integrates Mistral AI's Pixtral Large vision model, enabling powerful multimodal AI capabilities within ComfyUI. Pixtral Large is a 124B-parameter model (123B decoder + 1B vision encoder) that can analyze up to 30 high-resolution images simultaneously.
Hamtech-ai / Persian Image Captioning: A Persian image captioning model based on the Vision Encoder Decoder models of 🤗 Transformers.
d-gurgurov / Im2latex: A formula recognition model (im2latex) based on the Vision Encoder Decoder Model.
Quanato607 / XLSTM HVED: [ISBI 2025] XLSTM-HVED: Cross-Modal Brain Tumor Segmentation and MRI Reconstruction Method Using Vision XLSTM and Heteromodal Variational Encoder-Decoder.
JiaqiLi404 / Super Resolution DINO: Applies Meta AI's large pre-trained vision model DINOv2 to feature-point matching, with a ViT decoder used as an autoencoder.
mzbac / Image2dsl: Implementation of an image-to-DSL (Domain-Specific Language) model. A pre-trained Vision Transformer (ViT) encoder extracts image features, and a custom Transformer decoder generates DSL code from them.
inuwamobarak / Depth Estimation DPT: Implementation of Depth Prediction Transformers (DPT), a deep learning model for accurate depth estimation in computer vision tasks. DPT combines the transformer architecture with an encoder-decoder framework to capture fine-grained detail, model long-range dependencies, and generate precise depth predictions.
donahowe / VE MLD: Vision Transformer encoder with a multilabel decoder for multilabel image classification.
IshanTiwari-030800 / Social Media Caption Generation: Generates social media captions from an image. A Vision Encoder-Decoder model produces descriptive captions, which are then refined by the LLaMA 3 3B Instruct model with carefully designed prompts into rich, creative final captions.
dhavalthakur / Image Caption Generation Using Deep Learning 14k Flickr Dataset: In the 2020 setting of quarantine and work-from-home, internet usage surged; groceries, medicines, and more can be ordered at the touch of a screen. Not everyone can use these services seamlessly, however: people with impaired vision may find it frustrating to distinguish, say, blueberries from grapes. This project builds a neural network that helps this demographic by going beyond object detection to describe an image in a short English sentence. The model combines two networks: a CNN (Convolutional Neural Network) for image processing and an LSTM (Long Short-Term Memory), a type of recurrent network, for text generation. A subset of 14,000 images and their captions was selected from the Flickr_30K dataset. Generated captions are evaluated by human judgement and the BLEU-1 score. The model was trained and tested with several variations: pre-trained GloVe embeddings, different dropout and regularizer rates, and two image feature extractors, Xception and VGG16. The most relevant and fitting captions came from Xception features with an encoder-decoder architecture, while the highest BLEU-1 scores (above 0.5 on a 0-to-1 scale) came from VGG16 with GloVe embeddings.
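Several of the items above (Persian Image Captioning, Im2latex, Social Media Caption Generation, the Flickr captioning project) share one pattern: an image encoder pools pixels into a feature vector, and an autoregressive decoder emits tokens conditioned on it. A minimal NumPy sketch of that pattern follows; the vocabulary, weights, and function names are illustrative stand-ins, not taken from any listed repo, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<bos>", "a", "cat", "on", "the", "mat", "<eos>"]

def encode(image):
    # Stand-in for a ViT/CNN encoder: pool an H x W x C image
    # down to a C-dimensional feature vector.
    return image.mean(axis=(0, 1))

def decode_step(features, prev_token_id, W, E):
    # Stand-in for one autoregressive decoder step: logits come from
    # the image features plus an embedding of the previous token.
    logits = W @ features + E[prev_token_id]
    return int(np.argmax(logits))

def caption(image, W, E, max_len=8):
    # Greedy decoding: feed each predicted token back in until
    # <eos> or the length limit.
    feats = encode(image)
    token, words = VOCAB.index("<bos>"), []
    for _ in range(max_len):
        token = decode_step(feats, token, W, E)
        if VOCAB[token] == "<eos>":
            break
        words.append(VOCAB[token])
    return " ".join(words)

# Untrained random weights; a real model learns these end to end.
C = 3
W = rng.normal(size=(len(VOCAB), C))      # features -> logits
E = rng.normal(size=(len(VOCAB), len(VOCAB)))  # token embeddings
image = rng.random((4, 4, C))
print(caption(image, W, E))
```

The real systems above swap in a ViT or CNN for `encode`, a Transformer or LSTM for `decode_step`, and beam search or sampling for the greedy loop, but the control flow is the same.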