<p align="center">
  <img src="docs/NEO_logo.jpg" style="vertical-align: -10px;" height="40px" width="40px"> <b>NEO Series: Native Vision-Language Models from First Principles</b>
</p>
<p align="center"> <img src="docs/NEO_architecture.jpg" width="80%"> </p>- 2025/09: From Pixels to Words -- Towards Native Vision-Language Primitives at Scale (ICLR 2026)
## 📜 News

- [2026/01] 🔥🔥🔥 The training code of NEO is released!
- [2025/10] The paper, weights, and evaluation code of NEO are released!
- [2025/09] 💥💥💥 The NEO project is complete!
## 📋 Todo List
- [x] Evaluation guide
- [x] Training guide
## 💡 Motivation
<p align="center"> <img src="docs/NEOv1_motivation.jpg"> </p>-
What constraints set native VLMs apart from modular ones, and to what extent can they be overcome?
-
How to make native VLMs more accessible and democratized, thereby accelerating their progress?
## 💡 Highlights
- 🔥 **Native Architecture:** NEO introduces a native VLM primitive that unifies pixel-word encoding, alignment, and reasoning within a dense, monolithic model architecture (see the conceptual sketch below).
- 🔥 **Superior Efficiency:** With merely 390M image-text examples, NEO develops strong visual perception from scratch, rivaling top-tier modular VLMs and outperforming prior native ones.
- 🔥 **Promising Roadmap:** NEO charts a promising route toward scalable and powerful native VLMs, paired with diverse reusable components that foster a cost-effective and extensible ecosystem.
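For intuition, the snippet below sketches what "unifying pixel-word encoding, alignment, and reasoning in one dense model" means in code, in contrast to a modular design that bolts a frozen vision encoder onto an LLM. It is a deliberately minimal, hypothetical toy (all module names, sizes, and the patchify-then-concatenate scheme are illustrative assumptions), not NEO's architecture; see the paper and the released training code for the real design.

```python
import torch
import torch.nn as nn

class MonolithicVLMSketch(nn.Module):
    """Illustrative only: one dense transformer stack handles both modalities.

    Unlike modular VLMs (vision encoder + adapter + LLM), a native VLM
    embeds raw patches and word tokens directly into a single trunk.
    """

    def __init__(self, vocab_size=32000, dim=1024, patch=14, layers=2):
        super().__init__()
        # Pixels become "visual words" via a simple patch embedding.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.text_embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True)
        self.trunk = nn.TransformerEncoder(block, num_layers=layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, pixels, token_ids):
        img = self.patch_embed(pixels).flatten(2).transpose(1, 2)  # (B, N_img, dim)
        txt = self.text_embed(token_ids)                           # (B, N_txt, dim)
        seq = torch.cat([img, txt], dim=1)   # one interleaved pixel-word sequence
        return self.lm_head(self.trunk(seq))  # next-token logits over the sequence

x = torch.randn(1, 3, 224, 224)
ids = torch.randint(0, 32000, (1, 16))
print(MonolithicVLMSketch()(x, ids).shape)  # torch.Size([1, 272, 32000])
```

Encoding, alignment, and reasoning all happen inside the shared trunk here, which is the property the "native primitive" claim refers to; the real model replaces each toy module with the components described in the paper.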
## 🤖 Model Zoo

We release the 2B and 9B NEO models at three training stages: Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT).
| Model Name | Model Weight |
| ---------- | ------------ |
| NEO-2B-PT  | 🤗 NEO-2B-PT HF link |
| NEO-2B-MT  | 🤗 NEO-2B-MT HF link |
| NEO-2B-SFT | 🤗 NEO-2B-SFT HF link |
| NEO-9B-PT  | 🤗 NEO-9B-PT HF link |
| NEO-9B-MT  | 🤗 NEO-9B-MT HF link |
| NEO-9B-SFT | 🤗 NEO-9B-SFT HF link |
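The sketch below shows one plausible way to load a released checkpoint with Hugging Face `transformers`. The repo ID and the need for `trust_remote_code` are assumptions, not confirmed details; follow the model cards linked above for the authoritative usage.

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical repo ID -- confirm the exact name on the model card above.
repo_id = "EvolvingLMMs-Lab/NEO-2B-SFT"

# Native VLMs often ship custom modeling/processing code on the Hub,
# so trust_remote_code=True is assumed here.
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype="auto",   # pick the dtype stored in the checkpoint
    device_map="auto",    # requires `accelerate` for multi-device placement
)
print(model.config.architectures)
```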
## 📊 Benchmark Results

Table notes:
- “# Data” = data scale for pre-training (PT) / mid-training (MT) / supervised fine-tuning (SFT).
- “†” = vision-language models trained with Reinforcement Learning (RL).
- “Any Res.” = any resolution; “Tile-wise” = image split into tiles; “Any Rat.” = any aspect ratio; “Fix Res.” = fixed resolution.
- “MoE” = Mixture-of-Experts; “DaC” = Divide-and-Conquer.
- Bold = best score in each column.
| Model | Base LLM | # Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | POPE | HallB | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
|:--|:--|:--|:--|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 🔻 *Modular VLMs (2B)* | | | | | | | | | | | | | | | | | |
| Qwen2-VL | Qwen2-1.5B | --·--·-- | Any Res. | M-RoPE | 41.1 | 74.9 | 49.5 | 48.0 | -- | -- | 41.7 | 74.7 | 90.1 | 73.5 | 65.5 | 79.7 | 80.9 |
| InternVL2.5 | InternLM2.5-1.8B | >6B·100M·16M | Tile-wise | 1D-RoPE | 43.6 | 74.7 | 60.8 | 53.7 | -- | 90.6 | 42.6 | 74.9 | 88.7 | 79.2 | 60.9 | 74.3 | 80.4 |
| InternVL3† | Qwen2.5-1.5B | >6B·100M·22M | Tile-wise | 1D-RoPE | 48.6 | 81.1 | 62.2 | 60.7 | -- | 89.6 | 42.5 | 78.7 | 88.3 | 80.2 | 66.1 | 77.0 | 83.5 |
| Qwen2.5-VL† | Qwen2.5-3B | --·--·-- | Any Res. | M-RoPE | 51.2 | 79.1 | 61.8 | 55.9 | -- | -- | 46.3 | 81.6 | 93.9 | 84.0 | 77.1 | 79.3 | 79.7 |
| Encoder-Based | Qwen3-1.7B | >6B·40M·4M | Tile-wise | 1D-RoPE | 47.1 | 75.8 | 37.4 | 52.7 | 73.6 | 87.0 | 44.4 | 77.4 | 89.9 | 78.4 | 65.9 | 73.3 | 83.5 |
| 🔻 *Native VLMs (2B)* | | | | | | | | | | | | | | | | | |
| Mono-InternVL | InternLM2-1.8B | 1.2B·143M·7M | Tile-wise | 1D-RoPE | 33.7 | 65.5 | 40.1 | -- | 67.4 | -- | 34.8 | 68.6 | 80.0 | 73.7 | 43.0 | 72.6 | 76.7 |
| Mono-InternVL-1.5 | InternLM2-1.8B | 400M·150M·7M | Tile-wise | 1D-RoPE | 39.1 | 64.0 | 54.0 | -- | 66.9 | -- | 32.5 | 67.4 | 81.7 | 72.2 | 47.9 | 73.7 | 80.1 |
| HoVLE | InternLM2-1.8B | 550M·50M·7M | Tile-wise | 1D-RoPE | 32.2 | 73.3 | 43.8 | -- | 70.9 | 87.4 | 38.4 | 73.0 | 86.1 | 78.6 | 55.7 | 70.9 | 74.0 |
| OneCAT | Qwen2.5-1.5B | 436M·70M·13M | Any Res. | M-RoPE | 39.0 | 72.4 | 42.4 | -- | 70.9 | -- | -- | 72.4 | 87.1 | 76.2 | 56.3 | 67.0 | -- |
| NEO | Qwen3-1.7B | 345M·40M·4M | Any Res. | Native-RoPE | 48.6 | 76.0 | 49.6 | 54.2 | 74.2 | 87.5 | 43.1 | 80.1 | 89.9 | 81.2 | 63.2 | 74.0 | 77.1 |
| Model | Base LLM | # Data (PT·MT·SFT) | Input Type | RoPE Type | MMMU | MMB | MMVet | MMStar | SEED-I | POPE | HallB | AI2D | DocVQA | ChartQA | InfoVQA | TextVQA | OCRBench |
|:--|:--|:--|:--|:--|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 🔻 *Modular VLMs (8B)* | | | | | | | | | | | | | | | | | |
| Qwen2-VL | Qwen2-7B | --·--·-- | Any Res. | M-RoPE | 54.1 | 83.0 | 62.0 | 60.7 | -- | 88.1 | 50.6 | 83.0 | 94.5 | 83.0 | 76.5 | 84.3 | 86.6 |
| InternVL2.5 | InternLM2.5-7B | >6B·50M·4M | Tile-wise | 1D-RoPE | 56.0 | 84.6 | 62.8 | 64.4 | -- | 90.6 | 50.1 | 84.5 | 93.0 | 84.8 | 77.6 | 79.1 | 82.2 |
| Qwen2.5-VL† | Qwen2.5-7B | --·--·-- | Any Res. | M-RoPE | 55.0 | 83.5 | 67.1 | 63.9 | -- | 86.4 | 52.9 | 83.9 | 95.7 | 87.3 | 82.6 | 84.9 | 86.4 |
| InternVL3† | Qwen2.5-7B | >6B·100M·22M | Tile-wise | 1D-RoPE | 62.7 | 83.4 | 81.3 | 68.2 | -- | 91.1 | 49.9 | 85.2 | 92.7 | 86.6 | 76.8 | 80.2 | 88.0 |
| Encoder-Based | Qwen3-8B | >6B·40M·4M | Tile-wise | 1D-RoPE | 54.1 | 84.0 | 60.0 | 63.5 | 76.2 | 87.8 | 51.4 | 82.9 | 92.1 | 83.5 | 75.0 | 77.1 | 85.3 |
| 🔻 *Native VLMs (8B)* | | | | | | | | | | | | | | | | | |
| Fuyu | Persimmon-8B | --·--·-- | Any Res. | 1D-RoPE | 27.9 | 10.7 | 21.4 | -- | 59.3 | 84.0 | -- | 64.5 | -- | -- | -- | -- | 36.6 |
| Chameleon | from scratch | 1.4B·0M·1.8M | Fix Res. | 1D-RoPE | 25.4 | 31.1 | 8.3 | -- | 30.6 | 19.4 | 17.1 | 46.0 | 1.5 | 2.9 | 5.0 | 4.8 | 0.7 |
| EVE | Vicuna-7B | 33M·0M·1.8M | Any Rat. | 1D-RoPE | 32.6 | 52.3 | 25.7 | -- | 64.6 | 85.0 | 26.4 | 61.0 | 53.0 | 59.1 | 25.0 | 56.8 | 39.8 |
| SOLO | Mistral-7B | 44M·0M·2M | Any Res. | 1D-RoPE | -- | 67.7 | 30.4 | -- | 64.4 | 78.6 | -- | 61.4 | -- | -- | -- | -- | 12.6 |
| Emu3 | from scratch | --·--·-- | Fix Res. | 1D-RoPE | 31.6 | 58.5 | 37.2 | -- | 68.2 | 85.2 | -- | 70.0 | 76.3 | 68.6 | 43.8 | 64.7 | 68.7 |
| EVEv2 | Qwen2.5-7B | 77M·15M·7M | Any Rat. | 1D-RoPE | 39.3 | 66.3 | 45.0 | -- | 71.4 | 87.6 | -- | 74.8 | -- | 73.9 | -- | 71.1 | 70.2 |
| BREEN | Qwen2.5-7B | 13M·0M·4M | Any Res. | 1D-RoPE | 42.7 | 71.4 | 38.9 | 51.2 | -- | -- | 37.0 | 76.4 | -- | -- | -- | 65.7 | -- |
| VoRA | Qwen2.5-7B | 30M·0M·0.6M | Any Res. | 1D-RoPE | 32.0 | 61.3 | 33.7 | -- | 68.9 | 85.5 | -- | 61.1 | -- | -- | -- | 58.7 | -- |
| SAIL | Mistral-7B | 512M·86M·6M | Any Res. | M-RoPE | -- | 70.1 | 46.3 | 53.1 | 72.9 | 85.8 | 54.2 | 76.7 | -- | -- | -- | 77.1 | 78.3 |
| NEO | Qwen3-8B | 345M·40M·4M | Any Res. | Native-RoPE | 54.6 | 82.1 | 53.6 | 62.4 | 76.3 | 88.4 | 46.4 | 83.1 | 88.6 | 82.1 | 60.9 | 75.0 | 77.7 |
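A note on the “RoPE Type” column: 1D-RoPE flattens image patches into the same sequential index stream as text, while M-RoPE and NEO's Native-RoPE keep separate coordinates for spatial axes. The toy sketch below illustrates only the difference in position indexing; it is a hypothetical simplification, not NEO's actual Native-RoPE formulation (see the paper for that).

```python
import numpy as np

def rope_1d_positions(n_text, grid_h, grid_w):
    """1D-RoPE-style: image patches are flattened and indexed like extra text."""
    return np.arange(n_text + grid_h * grid_w)

def rope_multiaxis_positions(n_text, grid_h, grid_w):
    """Multi-axis RoPE-style: each token gets (t, h, w) coordinates.

    Text advances along t; every image patch keeps its 2D grid location,
    so rotary phases can encode height and width directly instead of a
    flattened scan order.
    """
    text = [(t, 0, 0) for t in range(n_text)]
    image = [(n_text, h, w) for h in range(grid_h) for w in range(grid_w)]
    return np.array(text + image)

print(rope_1d_positions(4, 2, 2))         # [0 1 2 3 4 5 6 7]
print(rope_multiaxis_positions(4, 2, 2))  # text rows, then (4, h, w) patch rows
```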
## ✒️ Citation

If the NEO series is helpful for your research, please consider giving us a star ⭐ and a citation 📝:
```bibtex
@article{Diao2025NEO,
  title   = {From Pixels to Words -- Towards Native Vision-Language Primitives at Scale},
  author  = {Diao, Haiwen and Li, Mingxuan and Wu, Silei and Dai, Linjun and Wang, Xiaohua and Deng, Hanming and Lu, Lewei and Lin, Dahua and Liu, Ziwei},
  journal = {arXiv preprint arXiv:2510.14979},
  year    = {2025}
}
```
## 📄 License

The content of this project is licensed under the terms of the LICENSE file.