LM Evaluation Harness

Evaluation

Unified framework for evaluating language models on 200+ benchmarks.

8k stars · 2k forks · Python

About

The LM Evaluation Harness by EleutherAI provides standardized evaluation of language models across academic benchmarks including MMLU, HellaSwag, and ARC. It is the de facto standard for open model leaderboards, including Hugging Face's Open LLM Leaderboard.
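The harness can be driven from the command line (the `lm_eval` entry point) or from Python. Below is a minimal sketch using the Python API, assuming a recent release (0.4+) where `simple_evaluate` is exposed at the package root; the checkpoint and task names are illustrative, not prescribed.

```python
# Minimal sketch: evaluate a Hugging Face checkpoint on two benchmarks.
# Assumes lm-eval >= 0.4 (pip install lm-eval); model and tasks are illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal LM checkpoint
    tasks=["hellaswag", "arc_easy"],                 # names from the task registry
    batch_size=8,
)
print(results["results"])  # per-task metrics, e.g. acc and acc_norm
```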

Key Features

  • 200+ supported benchmarks
  • Standardized, versioned task implementations
  • Configurable few-shot prompting
  • Reproducible results (see the sketch after this list)
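
The few-shot and reproducibility features show up together in the run output. Here is a hedged sketch, again assuming lm-eval 0.4+, of a 5-shot MMLU run and the version and configuration records that make it repeatable; the checkpoint is illustrative.

```python
# Sketch: 5-shot MMLU evaluation plus the metadata recorded for reproducibility.
# Assumes lm-eval >= 0.4; the checkpoint is illustrative.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["mmlu"],      # group task covering all 57 MMLU subjects
    num_fewshot=5,       # MMLU is conventionally reported 5-shot
)
print(results["versions"])  # task implementation versions
print(results["config"])    # run configuration: model args, shots, etc.
```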

Tags

Evaluation · Benchmarks · MMLU · LLM