SkillAgentSearch skills...

Otter

๐Ÿฆฆ Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

Install / Use

/learn @EvolvingLMMs-Lab/Otter

README

<p align="center" width="100%"> <img src="https://i.postimg.cc/mksBCbV9/brand-title.png" width="80%" height="80%"> </p>

Twitter Hits litellm

Project Credits | Otter Paper | OtterHD Paper | MIMIC-IT Paper

Checkpoints:

For who in the mainland China: Open in OpenXLab | Open in OpenXLab

Disclaimer: The code may not be perfectly polished and refactored, but all opensourced codes are tested and runnable as we also use the code to support our research. If you have any questions, please feel free to open an issue. We are eagerly looking forward to suggestions and PRs to improve the code quality.

๐Ÿฆพ Update

[2023-11]: Supporting GPT4V's Evaluation on 8 Benchmarks; Anouncing OtterHD-8B, improved from Fuyu-8B. Checkout OtterHD for details.

<div style="text-align:center"> <img src="https://i.postimg.cc/dtxQQzt6/demo0.png" width="100%" height="100%"> </div>
  1. ๐Ÿฆฆ Added OtterHD, a multimodal fine-tuned from Fuyu-8B to facilitate fine-grained interpretations of high-resolution visual input without a explicit vision encoder module. All image patches are linear transformed and processed together with text tokens. This is a very innovative and elegant exploration. We are fascinated and paved in this way, we opensourced the finetune script for Fuyu-8B and improve training throughput by 4-5 times faster with Flash-Attention-2. Try our finetune script at OtterHD.
  2. ๐Ÿ” Added MagnifierBench, an evaluation benchmark tailored to assess whether the model can identify the tiny objects' information (1% image size) and spatial relationships.
  3. Improved pipeline for Pretrain | SFT | RLHF with (part of) current leading LMMs.
    1. Models: Otter | OpenFlamingo | Idefics | Fuyu
    2. Training Datasets Interface: (Pretrain) MMC4 | LAION2B | CC3M | CC12M, (SFT) MIMIC-IT | M3IT | LLAVAR | LRV | SVIT...
      • We tested above datasets for both pretraining and instruction tuning with OpenFlamingo and Otter. We also tested the datasets with Idefics and Fuyu for instruction tuning. We will opensource the training scripts gradually.
    3. Benchmark Interface: MagnifierBench/MMBench/MM-VET/MathVista/POPE/MME/SicenceQA/SeedBench. Run them can be in one-click, please see Benchmark for details.
        datasets:
        - name: magnifierbench
            split: test
            prompt: Answer with the option's letter from the given choices directly.
            api_key: [Your API Key] # GPT4 or GPT3.5 to evaluate the answers and ground truth.
            debug: true # put debug=true will save the model response in log file.
        - name: mme
            split: test
            debug: true
        - name: mmbench
            split: test
            debug: true
    
        models:
        - name: gpt4v
            api_key: [Your API Key] # to call GPT4V model.
    
    1. Code refactorization for organizing multiple groups of datasets with integrated yaml file, see details at managing datasets in MIMIC-IT format. For example,
        IMAGE_TEXT: # Group name should be in [IMAGE_TEXT, TEXT_ONLY, IMAGE_TEXT_IN_CONTEXT]
            LADD: # Dataset name can be assigned at any name you want
                mimicit_path: azure_storage/json/LA/LADD_instructions.json # Path of the instruction json file
                images_path: azure_storage/Parquets/LA.parquet # Path of the image parquet file
                num_samples: -1 # Number of samples you want to use, -1 means use all samples, if not set, default is -1.
            M3IT_CAPTIONING:
                mimicit_path: azure_storage/json/M3IT/captioning/coco/coco_instructions.json
                images_path: azure_storage/Parquets/coco.parquet
                num_samples: 20000
    
    This is a major change and would result previous code not runnable, please check the details.

[2023-08]

  1. Added Support for using Azure, Anthropic, Palm, Cohere models for Self-Instruct with Syphus pipeline, for information on usage modify this line with your selected model and set your API keys in the environment. For more information see LiteLLM

[2023-07]: Anouncing MIMIC-IT dataset for multiple interleaved image-text/video instruction tuning.

  1. ๐Ÿค— Checkout MIMIC-IT on Huggingface datasets.
  2. ๐Ÿฅš Update Eggs section for downloading MIMIC-IT dataset.
  3. ๐Ÿฅƒ Contact us if you wish to develop Otter for your scenarios (for satellite images or funny videos?). We aim to support and assist with Otter's diverse use cases. OpenFlamingo and Otter are strong models with the Flamingo's excellently designed architecture that accepts multiple images/videos or other modality inputs. Let's build more interesting models together.

[2023-06]

  1. ๐Ÿงจ Download MIMIC-IT Dataset. For more details on navigating the dataset, please refer to MIMIC-IT Dataset README.
  2. ๐ŸŽ๏ธ Run Otter Locally. You can run our model locally with at least 16G GPU mem for tasks like image/video tagging and captioning and identifying harmful content. We fix a bug related to video inference where frame tensors were mistakenly unsqueezed to a wrong vision_x.

    Make sure to adjust the sys.path.append("../..") correctly to access otter.modeling_otter in order to launch the model.

  3. ๐Ÿค— Check our paper introducing MIMIC-IT in details. Meet MIMIC-IT, the first multimodal in-context instruction tuning dataset with 2.8M instructions! From general scene understanding to spotting subtle differences and enhancing egocentric view comprehension for AR headsets, our MIMIC-IT dataset has it all.

๐Ÿฆฆ Why In-Context Instruction Tuning?

Large Language Models (LLMs) have demonstrated exceptional universal aptitude as few/zero-shot learners for numerous tasks, owing to their pre-training on extensive text data. Among these LLMs, GPT-3 stands out as a prominent model with significant capabilities. Additionally, variants of GPT-3, namely InstructGPT and ChatGPT, have proven effective in interpreting natural language instructions to perform complex real-world tasks, thanks to instruction tuning.

Motivated by the upstream interleaved format pretraining of the Flamingo model, we present ๐Ÿฆฆ Otter, a multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo). We train our Otter in an in-context instruction tuning way on our proposed MI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. Otter showcases improved instruction-following and in-context learning ability in both images and videos.

๐Ÿ—„ MIMIC-IT Dataset Details

<p align="center" width="100%"> <img src="https://i.postimg.cc/yYMm1G5X/mimicit-logo.png" width="80%" height="80%"> </p>

MIMIC-IT enables the application of egocentric visual assistant model that can serve that can answer your questions like Hey, Do you think I left my keys on the table?. Harness the power of MIMIC-IT to unlock the full potential of your AI-driven visual assistant and elevate your interactive vision-language tasks to new heights.

<p align="center" width="100%"> <img src="https://i.postimg.cc/RCGp0vQ1/syphus.png" width="80%" height="80%"> </p>

We also introduce Syphus, an automated pipeline for generating high-quality instruction-response pairs in multiple languages. Building upon the framework proposed by LLaVA, we utilize ChatGPT to generate instruction-response pairs based on visual content. To ensure the quality of the generated instruction-response pairs, our pipeline incorporates system messages, visual annotations, and in-context examples as prompts for ChatGPT.

For more details, please check the MIMIC-IT dataset.

๐Ÿค–

Related Skills

View on GitHub
GitHub Stars3.4k
CategoryEducation
Updated16h ago
Forks209

Languages

Python

Security Score

100/100

Audited on Mar 26, 2026

No findings