LM Evaluation Harness
Unified framework for evaluating language models on 200+ benchmarks.
8k stars · 2k forks · Python
About
The LM Evaluation Harness by EleutherAI provides standardized evaluation of language models across academic benchmarks including MMLU, HellaSwag, and ARC. It is the de facto evaluation backend for open model leaderboards such as the Hugging Face Open LLM Leaderboard.
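A minimal sketch of running such an evaluation through the harness's Python API, assuming a v0.4+ release; the model name, task selection, and batch size here are illustrative, not prescriptive:

```python
# Sketch: evaluate a Hugging Face model on two benchmark tasks with
# 5-shot prompting, using lm-evaluation-harness (v0.4+ assumed).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # illustrative model choice
    tasks=["hellaswag", "arc_easy"],                 # task names as registered in the harness
    num_fewshot=5,                                   # number of in-context examples
    batch_size=8,
)

# Per-task metrics (e.g. accuracy) are reported under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```

The `lm_eval` command-line entry point exposes the same options (`--model`, `--model_args`, `--tasks`, `--num_fewshot`) and prints a results table.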
Key Features
- 200+ benchmark tasks, including MMLU, HellaSwag, and ARC
- Standardized prompting and metrics for fair cross-model comparison
- Configurable few-shot prompting
- Versioned tasks for reproducible results
Tags
Evaluation · Benchmarks · MMLU · LLM