EmbodiedBench / EmbodiedBench: [ICML 2025 Oral] Official repo of EmbodiedBench, a comprehensive benchmark designed to evaluate MLLMs as embodied agents.
yipoh / AesBench: An expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs.
Xuchen-Li / Llm Arxiv Daily: Automatically updates arXiv papers about LLM Reasoning, LLM Evaluation, LLM & MLLM, and Video Understanding using GitHub Actions.
OpenGVLab / MM-NIAH: [NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): a comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
EvolvingLMMs-Lab / EASI: Holistic Evaluation of Multimodal LLMs on Spatial Intelligence.
SaFo-Lab / JailBreakV-28K: [COLM 2024] JailBreakV-28K: a comprehensive benchmark designed to evaluate the transferability of LLM jailbreak attacks to MLLMs, and to further assess the robustness and safety of MLLMs against a variety of jailbreak attacks.
thunxxx / MLLM Jailbreak Evaluation MMJ Bench: No description available.
InternRobotics / OST-Bench: [NeurIPS 2025] OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding.
FreedomIntelligence / MLLM-Bench: MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria.
Chenyu-Wang567 / All Angles Bench: Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs.
luo-junyu / FinMME: [ACL 2025] FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation.
AdaCheng / EgoThink: [CVPR'24 Highlight] Official code and data for the paper "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models".
jiayuww / SpatialEval: [NeurIPS'24] SpatialEval: a benchmark to evaluate the spatial reasoning abilities of MLLMs and LLMs.
MetaAgentX / OpenCaptchaWorld: [NeurIPS 2025] The first web-based benchmark and platform to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles.
zhousheng97 / EgoTextVQA: [CVPR'25] EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering.
HiThink-Research / GAGE: General AI evaluation and Gauge Engine. A unified evaluation engine for LLMs, MLLMs, audio models, and diffusion models.
RaptorMai / MLLM-CompBench: [NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
MileBench / MileBench: Evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context".
Candice-yu / GeoLaux: A benchmark for evaluating MLLMs' geometry performance on long-step problems requiring auxiliary lines.
Jeffjeno / MLLM Reasoning Enhancement Guide: A curated guide to reasoning-enhancement methods for Multimodal Large Language Models (MLLMs), including dataset construction, training strategies, architectural designs, and evaluation benchmarks.