LM Evaluation Harness

Evaluation

Unified framework for evaluating language models on benchmarks.

7k stars · 1.8k forks · Python

About

The LM Evaluation Harness by EleutherAI provides standardized evaluation of language models across 200+ academic benchmarks including MMLU, HellaSwag, and ARC.
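
As a sketch of typical usage (assuming the pip-installed lm-eval package and a Hugging Face model; the model name and task list here are illustrative, so check the repository README for the exact flags in your version), a run might look like:

    # Install the harness, then evaluate a Hugging Face model
    # on HellaSwag and ARC-Easy with 5-shot prompting.
    pip install lm-eval

    lm_eval --model hf \
        --model_args pretrained=EleutherAI/pythia-160m \
        --tasks hellaswag,arc_easy \
        --num_fewshot 5 \
        --batch_size 8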

Key Features

  • 200+ benchmarks
  • Standardized evaluation
  • Configurable few-shot prompting (see the Python sketch after this list)
  • Reproducibility
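
For programmatic use, the harness also exposes a Python API. The following is a minimal sketch, assuming a recent release of the package where simple_evaluate is the top-level entry point; the model and tasks are illustrative placeholders:

    import lm_eval

    # Run a 5-shot evaluation through the library's Python entry point.
    # Backend "hf" loads the model via Hugging Face transformers.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=EleutherAI/pythia-160m",
        tasks=["hellaswag", "arc_easy"],
        num_fewshot=5,
    )

    # Per-task accuracy and related metrics live under results["results"].
    print(results["results"])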

Tags

Evaluation · Benchmarks · MMLU · LLM
