EmbodiedBench / EmbodiedBench: [ICML 2025 Oral] Official repo of EmbodiedBench, a comprehensive benchmark designed to evaluate MLLMs as embodied agents.
yipoh / AesBench: An expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs.
Xuchen-Li / Llm Arxiv Daily: Automatically updates arXiv papers about LLM Reasoning, LLM Evaluation, LLM & MLLM, and Video Understanding using GitHub Actions.
OpenGVLab / MM-NIAH: [NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): a comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
EvolvingLMMs-Lab / EASI: Holistic Evaluation of Multimodal LLMs on Spatial Intelligence.
SaFo-Lab / JailBreakV-28K: [COLM 2024] JailBreakV-28K: a comprehensive benchmark designed to evaluate the transferability of LLM jailbreak attacks to MLLMs, and to further assess the robustness and safety of MLLMs against a variety of jailbreak attacks.
thunxxx / MLLM Jailbreak Evaluation MMJ Bench: No description available.
InternRobotics / OST-Bench: [NeurIPS 2025] OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding.
FreedomIntelligence / MLLM-Bench: MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria.
Chenyu-Wang567 / All Angles Bench: Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs.
luo-junyu / FinMME: [ACL 2025] FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation.
AdaCheng / EgoThink: [CVPR'24 Highlight] Official code and data for the paper "EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models".
jiayuww / SpatialEval: [NeurIPS'24] SpatialEval: a benchmark to evaluate the spatial reasoning abilities of MLLMs and LLMs.
MetaAgentX / OpenCaptchaWorld: [NeurIPS 2025] The first web-based benchmark and platform to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles.
zhousheng97 / EgoTextVQA: [CVPR'25] EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering.
HiThink-Research / GAGE: General AI evaluation and Gauge Engine. A unified evaluation engine for LLMs, MLLMs, audio models, and diffusion models.
RaptorMai / MLLM-CompBench: [NeurIPS'25] MLLM-CompBench evaluates the comparative reasoning of MLLMs with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
MileBench / MileBench: Evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context".
Candice-yu / GeoLaux: A benchmark for evaluating MLLMs' geometry performance on long-step problems requiring auxiliary lines.
Jeffjeno / MLLM Reasoning Enhancement Guide: A curated guide to reasoning-enhancement methods for Multimodal Large Language Models (MLLMs), including dataset construction, training strategies, architectural designs, and evaluation benchmarks.