1,160 skills found · Page 3 of 39
evalplus / Evalplus - Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
huggingface / Aisheets - Build, enrich, and transform datasets using AI models with no code
mbzuai-oryx / Video ChatGPT - [ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.
OpenGenerativeAI / Llm Colosseum - Benchmark LLMs by fighting in Street Fighter 3! The new way to evaluate the quality of an LLM
mattpocock / Evalite - Evaluate your LLM-powered apps with TypeScript
Barca0412 / Introduction To Quantitative Finance - A curated set of introductory materials: 1. an open-source tutorial on multi-factor equity quant frameworks; 2. a collection of classic academic and industry resources; 3. work on AI + finance, including LLMs, agents, benchmarks (evaluation), etc.
cyberark / FuzzyAI - A powerful tool for automated LLM fuzzing, designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.
yueliu1999 / Awesome Jailbreak On LLMs - A collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, code, datasets, evaluations, and analyses.
Scale3-Labs / Langtrace - Langtrace 🔍 is an open-source, OpenTelemetry-based, end-to-end observability tool for LLM applications, providing real-time tracing, evaluations, and metrics for popular LLMs, LLM frameworks, vector DBs, and more. Integrate using TypeScript or Python. 🚀💻📊
microsoft / Prompty - Prompty makes it easy to create, manage, debug, and evaluate LLM prompts for your AI applications. Prompty is an asset class and format for LLM prompts designed to enhance observability, understandability, and portability for developers.
thu-coai / Safety Prompts - Chinese safety prompts for evaluating and improving the safety of LLMs.
cvs-health / Uqlm - UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
rlancemartin / Auto Evaluator - Evaluation tool for LLM QA chains
EmbeddedLLM / JamAIBase - The collaborative spreadsheet for AI. Chain cells into powerful pipelines, experiment with prompts and models, and evaluate LLM responses in real time. Work together seamlessly to build and iterate on AI applications.
prometheus-eval / Prometheus Eval - Evaluate your LLM's response with Prometheus and GPT-4 💯
JudgmentLabs / Judgeval - The open-source post-building layer for agents. Our environment data and evals power agent post-training (RL, SFT) and monitoring.
langchain-ai / Openevals - Ready-made evaluators for your LLM apps
JackHopkins / Factorio Learning Environment - A non-saturating, open-ended environment for evaluating LLMs in Factorio
vllm-project / Guidellm - Evaluate and enhance your LLM deployments for real-world inference needs
dezoito / Ollama Grid Search - A multi-platform desktop application to evaluate and compare LLMs, written in Rust and React.