Evals

A benchmark suite for evaluating how coding models solve real React Native tasks.


Available Evals

Groups map to top-level folders under evals/.

| Group | Path | Status |
| --- | --- | --- |
| animation | evals/animation | Active |
| async-state | evals/async-state | Active |
| navigation | evals/navigation | Active |
| react-native-apis | evals/react-native-apis | Active |
| expo-sdk | evals/expo-sdk | WIP |
| brownfield | evals/brownfield | WIP |
| nitro-modules | evals/nitro-modules | WIP |
| lists | evals/lists | Active |

Want a group that is not listed here? Open an issue to request it. Contributions are also welcome.

Getting Started

bun install
bun runner/run.ts --model openai/gpt-4.1-mini --output generated/my-generated
bun runner/judge.ts --model openai/gpt-5.3-codex --input generated/my-generated

For full command reference and workflows, see docs and CONTRIBUTING.md.

Whitepaper

Methodology and scoring details are documented in the benchmark methodology whitepaper.

The benchmark evaluates model-generated React Native implementations using requirement-based assessment. Each eval specifies a fixed task context and a set of explicit, judgeable requirements. Model outputs are judged against these requirements using file-level evidence, and per-eval scores are computed from requirement outcomes with optional weighting. Aggregate run metrics summarize performance across evals under a consistent evaluation protocol.
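As a rough illustration of the scoring described above, the sketch below computes a per-eval score as the weighted fraction of satisfied requirements and an aggregate as the plain mean of per-eval scores. The type `RequirementOutcome` and the functions `scoreEval` and `aggregateRun` are hypothetical names chosen for this example, and the default weight of 1 and the unweighted mean are assumptions; the whitepaper defines the actual formulas.

```typescript
// Hypothetical shape for one judged requirement; the benchmark's real types may differ.
type RequirementOutcome = {
  id: string;
  passed: boolean;
  weight?: number; // optional weighting; this sketch assumes a default of 1
};

// Per-eval score: weighted fraction of requirements that passed, in [0, 1].
function scoreEval(outcomes: RequirementOutcome[]): number {
  let total = 0;
  let earned = 0;
  for (const outcome of outcomes) {
    const weight = outcome.weight ?? 1;
    total += weight;
    if (outcome.passed) earned += weight;
  }
  // An eval with no requirements scores 0 in this sketch.
  return total === 0 ? 0 : earned / total;
}

// Aggregate run metric: unweighted mean of per-eval scores (an assumption).
function aggregateRun(evalScores: number[]): number {
  if (evalScores.length === 0) return 0;
  return evalScores.reduce((sum, s) => sum + s, 0) / evalScores.length;
}

const example = scoreEval([
  { id: "r1", passed: true, weight: 2 },
  { id: "r2", passed: false },
  { id: "r3", passed: true },
]);
console.log(example); // 0.75 (3 of 4 weight units satisfied)
```

Keeping the per-eval score normalized to [0, 1] means evals with different numbers of requirements contribute on the same scale when averaged.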

Requests And Contributions

To request evaluation coverage for a new feature or library, open an issue. We are open to covering the most popular ecosystem libraries and will continue expanding coverage.

Contributions are welcome. Start with CONTRIBUTING.md and AGENTS.md.

License

MIT (LICENSE)