LM Evaluation Harness
Evaluation
Unified framework for evaluating language models on benchmarks.
7k stars · 1.8k forks · Python
About
The LM Evaluation Harness by EleutherAI provides standardized evaluation of language models across 200+ academic benchmarks, including MMLU, HellaSwag, and ARC.
Key Features
- 200+ academic benchmarks
- Standardized evaluation across model backends
- Configurable few-shot prompting (see the sketch below)
- Reproducible, versioned task definitions
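A minimal usage sketch via the harness's Python API and its `simple_evaluate` entry point (installed with `pip install lm-eval`); the model checkpoint and task chosen here are illustrative:

```python
import lm_eval

# Sketch: 5-shot evaluation of a small Hugging Face model on HellaSwag.
# Any registered task name can be substituted in `tasks`.
results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=5,
)

# Per-task metrics (e.g. accuracy) are keyed by task name.
print(results["results"]["hellaswag"])
```

The same evaluation can be run from the command line with the `lm_eval` CLI using equivalent arguments.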
Tags
Evaluation · Benchmarks · MMLU · LLM